HDF5 index corruption #8265

rockg · 2014-09-14T02:27:16Z

I generated a MultiIndexed DataFrame and wrote it to HDF5 using to_hdf with zlib level 5 compression. The file was written all at once. The file is located here: https://www.dropbox.com/s/122q55g5ubcf4fl/indexIssue.h5?dl=0

The two methods below should give identical results, but the first (select with a where clause) returns 2892 records, while loading all values and subselecting on the path returns 2972 (the missing values are for path 6 between 3-5-2015 20:00 and 3-6-2015 9:00). I tried reindexing the table but that didn't fix anything. I don't really know what's going on.

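from pandas.io.pytables import HDFStore, Term

store   =   HDFStore(path_to_file, mode='r')

# Method 1: select with a where clause
p1      =   store.select('ts', where=Term('Path', '=', 6), auto_close=False)
print(len(p1))
# Method 2: select everything, then filter on the Path level in memory
p2      =   store.select('ts', auto_close=False)
p2s     =   p2[p2.index.get_level_values('Path') == 6]
print(len(p2s))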
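
To see exactly which rows the where-based selection drops, the two results can be compared by index. The following is a minimal sketch on a small synthetic frame, not the data from the file: the level name `Time`, the column `value`, the helper `missing_rows`, and the simulated "lost" rows are all assumptions made for illustration.

```python
import pandas as pd

def missing_rows(partial, full):
    """Return the rows of `full` whose index entries are absent from `partial`."""
    return full[~full.index.isin(partial.index)]

# Synthetic stand-in for the stored frame (hypothetical names and values).
times = pd.date_range("2015-03-05", periods=6, freq=pd.Timedelta(hours=1))
index = pd.MultiIndex.from_product([[6, 7], times], names=["Path", "Time"])
full = pd.DataFrame({"value": range(len(index))}, index=index)

# Equivalent of p2s: everything loaded, then filtered in memory.
p2s = full[full.index.get_level_values("Path") == 6]
# Stand-in for p1: pretend the where-based query lost two rows.
p1 = p2s.iloc[[0, 1, 4, 5]]

lost = missing_rows(p1, p2s)
print(len(p2s) - len(p1), "rows missing")
print(lost.index.get_level_values("Time"))
```

Applied to the real `p1` and `p2s`, the same helper would list the timestamps that the where clause fails to return.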


jreback · 2014-09-14T02:29:05Z

Can you show the generating/writing code as well, and the output of pd.show_versions()?

rockg · 2014-09-14T02:36:18Z

I can't show how the values were generated, but I can show the to_hdf call. The data in the file is fairly simple. Note that many other files are fine and this only shows up every once in a while, but it's a real issue if we can't rely on the index and have to load the entire file to be sure we are getting the full set of data.

The writing code is essentially:

data.to_hdf(filePath, 'ts', append=False, format='table', complib='zlib', complevel=5)

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0-295-g08ae4f7
nose: 1.3.3
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 1.8.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.6
pymysql: None
psycopg2: None

rockg · 2014-09-14T13:36:10Z

I just tested: if you load the frame and save it out to a new HDF5 file, the problem also exists in that new file. So at least it seems to be reproducible.

jreback · 2014-09-14T13:36:18Z

@rockg I think this is the issue (I can't entirely debug this, as it's on the writing side and I need an example):

Your index level 1 consists of timestamps which cross transition times. They are an object column.

This is what is used to convert them, as we store them as UTC (along with the tz):

# values is an object array of Timestamps
values = DatetimeIndex(values)
values.tz_convert('UTC').values.view('i8')

I think info is being lost here (in other words, this should NOT be done when the times are ambiguous), so the data is not in the file.

So what I need is a short sample frame (it can be all dummy data) that demonstrates this and is generated from code, so that I can make a test out of it.
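
A frame of the kind requested here could be generated along these lines. This is only a sketch under stated assumptions: the time zone `America/New_York`, the level name `Time`, the column `value`, the helper name `make_sample_frame`, and the chosen date range are all hypothetical, and nothing in the thread confirms that such a frame reproduces the problem.

```python
import numpy as np
import pandas as pd

def make_sample_frame(paths=(5, 6, 7), periods=48):
    """Build a dummy frame indexed by (Path, Time) with tz-aware timestamps."""
    # Hourly timestamps spanning the 2015-03-08 US DST transition
    # (time zone and dates are assumptions for illustration).
    times = pd.date_range("2015-03-07", periods=periods,
                          freq=pd.Timedelta(hours=1), tz="America/New_York")
    index = pd.MultiIndex.from_product([list(paths), times],
                                       names=["Path", "Time"])
    values = np.arange(len(index), dtype="float64")
    return pd.DataFrame({"value": values}, index=index)

df = make_sample_frame()
print(df.index.names, len(df))
```

Writing `df` with the same to_hdf arguments used for the original file and then comparing the row count of a `where` query for each path against an in-memory filter would show whether the dummy data triggers the discrepancy.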

jreback · 2014-09-14T13:40:20Z

hmm, though the data actually appears to be in the file...

rockg · 2014-09-14T13:42:56Z

Yes, that's the weird part (if you select the whole file and then select using get_level_values you get the right thing). And only Path=6 appears to have this problem, Path=7 for example works fine.

jreback · 2014-09-14T13:44:14Z

I was just wondering whether something else is going on. When I select with Path=6 BEFORE doing anything else, I still only get 2892 rows. That is really odd (as Path=7 does work).

rockg · 2014-09-14T13:51:43Z

When I spent some time looking yesterday, I found a gap in the coordinates that are returned (I think one coordinate for that path is 46424, but the coordinates found in PyTables skip from 46027 to something like 52000). When you say before, do you mean the top two statements above, or something else?
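
A gap like this can be pinned down by comparing the row coordinates returned by the where-based query with those found by a full scan. A minimal sketch, assuming both coordinate lists have already been obtained (for example, `HDFStore.select_as_coordinates` for the where-based side). The helper names and the sample numbers are made up for illustration and are not taken from the file.

```python
def missing_coordinates(indexed, full):
    """Return coordinates present in the full scan but absent from the indexed query."""
    found = set(indexed)
    return [c for c in full if c not in found]

def as_ranges(coords):
    """Collapse sorted integer coordinates into inclusive (start, stop) runs."""
    ranges = []
    for c in coords:
        if ranges and c == ranges[-1][1] + 1:
            # Extend the current run.
            ranges[-1] = (ranges[-1][0], c)
        else:
            # Start a new run.
            ranges.append((c, c))
    return ranges

# Made-up coordinates for illustration only.
full_scan = list(range(100, 120))
indexed_query = [c for c in full_scan if not 105 <= c <= 109]

gaps = as_ranges(missing_coordinates(indexed_query, full_scan))
print(gaps)  # [(105, 109)]
```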

jreback · 2014-09-14T13:53:47Z

I think this is a bug in the index creation.

I think it is this bug: PyTables/PyTables#319. Let me put up a commit to NOT set expected rows unless it's explicitly set, to see what that does.

In [96]: df.to_hdf('test2.h5','df',format='table',index=False)

In [97]: len(pd.read_hdf('test2.h5','df',where='Path=7'))
Out[97]: 2972

In [98]: len(pd.read_hdf('test2.h5','df',where='Path=6'))
Out[98]: 2972

In [99]: len(pd.read_hdf('test2.h5','df',where='Path=5'))
Out[99]: 2972

In [100]: df.to_hdf('test2.h5','df',format='table',index=True)

In [101]: len(pd.read_hdf('test2.h5','df',where='Path=7'))
Out[101]: 2972

In [102]: len(pd.read_hdf('test2.h5','df',where='Path=6'))
Out[102]: 2892

In [103]: len(pd.read_hdf('test2.h5','df',where='Path=5'))
Out[103]: 2972

jreback · 2014-09-14T14:02:54Z

@rockg maybe prod them a bit on that link to see what the actual issue is. It's some kind of interaction between the selection mechanism and how the indexes are laid out.

rockg · 2014-09-14T14:05:01Z

@jreback Will do

rockg · 2014-09-29T14:13:19Z

@jreback There hasn't been a response on pytables. What should we do?

jreback · 2014-09-29T14:16:02Z

@rockg I pinged too. I really don't know enough about PyTables to see where to fix this, unfortunately.

FrancescAlted · 2015-04-18T19:00:14Z

This has been addressed in PyTables: PyTables/PyTables@035dbd5 and the fix will be part of the forthcoming 3.2.0 release.
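
For code that depends on where-based selection, one option is to check the installed PyTables version against the release named above. A minimal sketch, assuming version strings are plain dotted numbers. The helper names are hypothetical, and pre-release suffixes such as `rc1` are simply ignored.

```python
import re

FIXED_IN = (3, 2, 0)  # release named in the comment above

def version_tuple(version):
    """Parse the leading dotted numbers of a version string, e.g. '3.1.1' -> (3, 1, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part) for part in match.group(0).split("."))

def has_index_fix(version):
    """True if `version` is at least the release said to contain the fix."""
    parts = version_tuple(version)
    # Pad so that '3.2' compares equal to (3, 2, 0).
    parts = parts + (0,) * (len(FIXED_IN) - len(parts))
    return parts >= FIXED_IN

print(has_index_fix("3.1.1"))  # False
print(has_index_fix("3.2.0"))  # True
```

In practice the argument would be `tables.__version__`. The 3.1.1 in the first call is the version reported in the INSTALLED VERSIONS output earlier in this thread.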

jreback added the Timezones (Timezone data dtype), IO HDF5 (read_hdf, HDFStore), and Bug labels Sep 14, 2014

jreback added this to the 0.15.0 milestone Sep 14, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

rockg mentioned this issue Mar 18, 2015

BUG: entries missing when reading from pytables hdf store using "where" statement #9676

Closed

alexfields mentioned this issue Mar 18, 2015

table.where query does not seem to be able to find rows... PyTables/PyTables#409

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 19, 2015

jreback mentioned this issue Mar 19, 2015

BUG: workaround PyTables 319, but not setting expected rows (GH8265, GH9676) #9681

Closed

jreback removed this from the 0.16.0 milestone Mar 19, 2015

FrancescAlted closed this as completed Apr 18, 2015
