
HDF5 index corruption #8265


Closed
rockg opened this issue Sep 14, 2014 · 14 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Timezones Timezone data dtype

Comments

@rockg
Contributor

rockg commented Sep 14, 2014

I generated a MultiIndexed DataFrame and wrote it to HDF5 using to_hdf with zlib level 5 compression. The file was written all at once and is located here: https://www.dropbox.com/s/122q55g5ubcf4fl/indexIssue.h5?dl=0

The two methods below should return identical results, but the first (select with a where clause) returns 2892 records, while selecting all values and then subselecting on Path returns 2972 (values are missing for path 6 between 3-5-2015 20:00 and 3-6-2015 9:00). I tried using reindex on the table but that didn't fix anything. I don't really know what's going on.

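from pandas import HDFStore, Term

store = HDFStore(path_to_file, mode='r')

# Select via a where clause on the indexed 'Path' column: 2892 rows
p1 = store.select('ts', where=Term('Path', '=', 6), auto_close=False)
print(len(p1))

# Load everything, then filter in memory: 2972 rows
p2 = store.select('ts', auto_close=False)
p2s = p2[p2.index.get_level_values('Path') == 6]
print(len(p2s))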
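
To pin down exactly which rows the where-clause query drops, the two results can be compared by index. The following is only a minimal sketch: it uses small dummy data in place of the real file, and the helper name `rows_missing_from_query` is hypothetical.

```python
import pandas as pd


def rows_missing_from_query(full, queried):
    """Return the index entries present in `full` but absent from `queried`."""
    return full.index.difference(queried.index)


# Dummy stand-in for the real data: two paths, four hourly timestamps each.
idx = pd.MultiIndex.from_product(
    [[6, 7], pd.date_range("2015-03-05 18:00", periods=4, freq="60min")],
    names=["Path", "Time"],
)
full = pd.DataFrame({"value": range(len(idx))}, index=idx)

# In-memory filter (plays the role of p2s above).
path6 = full[full.index.get_level_values("Path") == 6]

# Simulate a query result that silently lost the two middle rows (plays the role of p1).
queried = path6.iloc[[0, 3]]

missing = rows_missing_from_query(path6, queried)
print(missing)
```

With the real file, `path6` and `queried` would be `p2s` and `p1` respectively, and the result should list the timestamps in the gap described above.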
@jreback
Contributor

jreback commented Sep 14, 2014

Can you show the generating/writing code as well, and pd.show_versions()?

@rockg
Contributor Author

rockg commented Sep 14, 2014

I can't show how the values were generated, but I can show the to_hdf call. The data in the file is fairly simple. Note that many other files are fine and this only shows up every once in a while, but it's a real issue if we can't rely on the index and have to load the entire file to ensure we are getting the full set of data.

The writing code is essentially:

data.to_hdf(filePath, 'ts', append=False, format='table', complib='zlib', complevel=5)

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0-295-g08ae4f7
nose: 1.3.3
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 1.8.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.6
pymysql: None
psycopg2: None

@rockg
Contributor Author

rockg commented Sep 14, 2014

I just tested and one can load the frame and save it out to a new hdf5 file and the problem will exist on that new file. So at least it seems to be reproducible.

@jreback
Contributor

jreback commented Sep 14, 2014

@rockg I think this is the issue (I can't entirely debug this as it's on the writing side, and I need an example):

Your index level 1 consists of timestamps which cross transition times. They are an object column.

The following is used to convert them, since we store the values as UTC (along with the tz).

# values is an object array of Timestamps
values = DatetimeIndex(values)
values.tz_convert('UTC').values.view('i8')

I think info is being lost here (in other words, this should NOT be done when the times are ambiguous). So the data is not in the file.
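
One way to sanity-check this suspicion is to look at what happens to two readings that share the same wall-clock time on either side of a fall-back transition. The sketch below uses only the standard library and does not exercise the pandas code path itself. It assumes fixed UTC offsets (`EDT`, `EST`) as stand-ins for a real time zone, and `to_utc_ns` is a hypothetical helper mimicking the `i8` nanosecond view.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for a real zone around a fall-back transition.
EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_ns(ts):
    """Convert a tz-aware datetime to integer nanoseconds since the epoch (UTC)."""
    delta = ts.astimezone(timezone.utc) - EPOCH
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000


# The same wall-clock reading occurs twice when clocks fall back.
first = datetime(2014, 11, 2, 1, 30, tzinfo=EDT)
second = datetime(2014, 11, 2, 1, 30, tzinfo=EST)

same_wall_clock = first.replace(tzinfo=None) == second.replace(tzinfo=None)
gap_ns = to_utc_ns(second) - to_utc_ns(first)

print(same_wall_clock)  # the naive readings are indistinguishable
print(gap_ns)           # the UTC integers differ by one hour
```

As long as each timestamp carries its offset, the UTC integers stay distinct. Information would be lost only if the offset were dropped before conversion.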

So what I need is a sample frame (it can be all dummy data) that is generated from code (so I can make a test), is pretty short, and demonstrates this.

@jreback jreback added Timezones Timezone data dtype IO HDF5 read_hdf, HDFStore Bug labels Sep 14, 2014
@jreback jreback added this to the 0.15.0 milestone Sep 14, 2014
@jreback
Contributor

jreback commented Sep 14, 2014

hmm, though the data actually appears to be in the file...

@rockg
Contributor Author

rockg commented Sep 14, 2014

Yes, that's the weird part (if you select the whole file and then select using get_level_values you get the right thing). And only Path=6 appears to have this problem, Path=7 for example works fine.

@jreback
Contributor

jreback commented Sep 14, 2014

I was just wondering whether something else is going on. When I select with Path=6 BEFORE doing anything else, I still only get 2892 rows. That is really odd (as Path=7 does work).

@rockg
Contributor Author

rockg commented Sep 14, 2014

When I spent some time looking yesterday, I found a gap in the coordinates that are returned (I think one coordinate for that path is 46424, but the coordinates it finds in PyTables skip from 46027 to something like 52000). When you say "before", do you mean the top two statements above or something else?
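
A gap like this can be located mechanically by comparing the row coordinates the indexed query returns against the coordinates expected from an in-memory filter. A minimal sketch follows. The helper name `missing_runs` is hypothetical, and the numbers are dummy values rather than the coordinates from the real file.

```python
def missing_runs(expected, found):
    """Return (first, last) pairs for each contiguous run of `expected`
    coordinates that does not appear in `found`."""
    found = set(found)
    runs = []
    run = []
    for coord in sorted(expected):
        if coord not in found:
            run.append(coord)
        elif run:
            # A found coordinate ends the current run of missing ones.
            runs.append((run[0], run[-1]))
            run = []
    if run:
        runs.append((run[0], run[-1]))
    return runs


# Dummy coordinates: rows 13-15 and row 18 are absent from the query result.
expected = range(10, 20)
found = [10, 11, 12, 16, 17, 19]
print(missing_runs(expected, found))  # [(13, 15), (18, 18)]
```

For the real file, `found` could presumably come from something like `store.select_as_coordinates('ts', where='Path=6')`, and `expected` from the positions where `p2.index.get_level_values('Path') == 6`.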

@jreback
Contributor

jreback commented Sep 14, 2014

I think this is a bug in the index creation.

I think it is this bug. Let me put up a commit to NOT set expected rows unless it's explicitly set, to see what it does.
PyTables/PyTables#319

In [96]: df.to_hdf('test2.h5','df',format='table',index=False)

In [97]: len(pd.read_hdf('test2.h5','df',where='Path=7'))
Out[97]: 2972

In [98]: len(pd.read_hdf('test2.h5','df',where='Path=6'))
Out[98]: 2972

In [99]: len(pd.read_hdf('test2.h5','df',where='Path=5'))
Out[99]: 2972

In [100]: df.to_hdf('test2.h5','df',format='table',index=True)

In [101]: len(pd.read_hdf('test2.h5','df',where='Path=7'))
Out[101]: 2972

In [102]: len(pd.read_hdf('test2.h5','df',where='Path=6'))
Out[102]: 2892

In [103]: len(pd.read_hdf('test2.h5','df',where='Path=5'))
Out[103]: 2972

@jreback
Contributor

jreback commented Sep 14, 2014

@rockg maybe prod them a bit on that link to see what the actual issue is. It's some kind of interaction between the selection mechanism and how the indexes are laid out.

@rockg
Contributor Author

rockg commented Sep 14, 2014

@jreback Will do

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014
@rockg
Contributor Author

rockg commented Sep 29, 2014

@jreback There hasn't been a response on pytables. What should we do?

@jreback
Contributor

jreback commented Sep 29, 2014

@rockg I pinged them too. I really don't know enough about PyTables to see where to fix this, unfortunately.

@FrancescAlted

This has been addressed in PyTables: PyTables/PyTables@035dbd5 and the fix will be part of the forthcoming 3.2.0 release.
