HDF5 index corruption #8265
Comments
Can you show the generating / writing code as well, and the output of pd.show_versions()? |
I can't show how the values were generated, but I can show the to_hdf call. The data in the file is fairly simple. Note that many other files are fine; this only shows up every once in a while. It's still a real issue, though, if we can't rely on the index and have to load the entire file to make sure we are getting the full set of data. Writing code is essentially: .
INSTALLED VERSIONS
commit: None
pandas: 0.14.0-295-g08ae4f7 |
I just tested and one can load the frame and save it out to a new hdf5 file and the problem will exist on that new file. So at least it seems to be reproducible. |
@rockg I think this is the issue (I can't entirely debug this as it's on the writing side, and I need an example): your index level 1 consists of timestamps which cross transition times. They are an object column. This is used to convert them, as we store them as UTC (along with the tz).
I think info is being lost here (iow this should NOT be done when the times are ambiguous), so the data is not in the file. What I need is a short sample frame (can be all dummy data), generated from code (so I can make a test), that demonstrates this. |
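To make "ambiguous" concrete, here is a minimal sketch with dummy data. It is not the frame from the report: the level names "Path" and "Time", the time zone, and the dates are all assumptions chosen for illustration. It builds a MultiIndexed frame whose timestamps cross a transition and shows that the wall-clock times repeat while the UTC times do not.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy data: six hourly UTC timestamps spanning the
# 2014-11-02 fall-back transition in America/New_York (an assumed zone).
utc_times = pd.date_range(
    "2014-11-02 04:00", periods=6, freq=pd.Timedelta(hours=1), tz="UTC"
)
local_times = utc_times.tz_convert("America/New_York")

# Level names "Path" and "Time" are assumptions, not taken from the report.
index = pd.MultiIndex.from_product([[6, 7], local_times], names=["Path", "Time"])
df = pd.DataFrame({"value": np.arange(len(index), dtype="float64")}, index=index)

# Dropping the tz keeps the wall-clock time; 01:00 now occurs twice.
wall_clock = local_times.tz_localize(None)
print("wall-clock unique:", wall_clock.is_unique)

# Converting to UTC keeps every timestamp distinct.
print("UTC unique:", local_times.tz_convert("UTC").is_unique)
```

If a conversion step ever works from the wall-clock values alone, the two 01:00 entries cannot be told apart, which is the kind of information loss suspected above.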
hmm, though the data actually appears to be in the file... |
Yes, that's the weird part (if you select the whole file and then select using get_level_values you get the right thing). And only Path=6 appears to have this problem, Path=7 for example works fine. |
I was just wondering whether something else is going on. When I select with Path=6 BEFORE doing anything else, I still only get 2892 rows. That is really odd (as Path=7 does work). |
When I spent some time looking yesterday, I found a gap in the coordinates that are returned (I think one coordinate for that path is 46424, but the coordinates PyTables finds skip from 46027 to something like 52000). When you say before, do you mean the top two statements above, or something else? |
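One way to look for such a gap is to compare the row coordinates the indexed query returns against the row positions found by scanning the fully loaded frame. The following is only a sketch on dummy data; the key "data", the level names, and the frame itself are assumptions, and on a healthy file nothing is reported as missing.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical dummy frame; level names "Path" and "Time" are assumptions.
times = pd.date_range("2015-03-05", periods=48, freq=pd.Timedelta(hours=1))
index = pd.MultiIndex.from_product([[6, 7], times], names=["Path", "Time"])
df = pd.DataFrame({"value": np.arange(len(index), dtype="float64")}, index=index)

h5_path = os.path.join(tempfile.mkdtemp(), "coords_check.h5")
df.to_hdf(h5_path, key="data", format="table", complib="zlib", complevel=5)

with pd.HDFStore(h5_path, mode="r") as store:
    # Row numbers that the where clause selects.
    coords = np.asarray(store.select_as_coordinates("data", where="Path == 6"))
    everything = store.select("data")

# Row numbers that should match, found without using the where clause.
expected = np.flatnonzero(everything.index.get_level_values("Path") == 6)

# Rows present in the file but not returned by the where clause.
missing = np.setdiff1d(expected, coords)
print("rows missed by the where clause:", missing)
```

Running the same comparison against the affected file (with its real key and level name) would list exactly which rows the where clause skips.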
I think this is a bug in the index creation. I think it is this bug. Let me put up a commit to NOT set expected rows unless it's explicitly set, to see what it does.
|
@rockg maybe prod them a bit on that link to see what the actual issue is. It's some kind of interaction between the selection mechanism and how the indexes are laid out. |
@jreback Will do |
@jreback There hasn't been a response on pytables. What should we do? |
@rockg I pinged too. Unfortunately, I really don't know enough about PyTables to see where to fix this. |
This has been addressed in PyTables: PyTables/PyTables@035dbd5 and the fix will be part of the forthcoming 3.2.0 release. |
I generated a MultiIndexed DataFrame and wrote it to HDF5 using to_hdf, with zlib level 5 compression. The file was written all at once. The file is located here: https://www.dropbox.com/s/122q55g5ubcf4fl/indexIssue.h5?dl=0
The two methods below should be identical, but the former (a select with a where clause) returns 2892 records, while getting all values and subselecting on the path returns 2972 (values are missing for path 6 between 3-5-2015 20:00 and 3-6-2015 9:00). I tried using reindex on the table, but that didn't fix anything. I don't really know what's going on.
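For readers without the original snippets, the comparison can be sketched roughly as follows. This runs on dummy data, not on the file above; the key "data", the level names "Path" and "Time", and the frame contents are assumptions. On a correctly indexed file both counts agree.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical dummy frame standing in for the real data.
times = pd.date_range("2015-03-05", periods=48, freq=pd.Timedelta(hours=1))
index = pd.MultiIndex.from_product([[6, 7], times], names=["Path", "Time"])
df = pd.DataFrame({"value": np.arange(len(index), dtype="float64")}, index=index)

# Write everything at once as a table with zlib level 5 compression.
h5_path = os.path.join(tempfile.mkdtemp(), "index_check.h5")
df.to_hdf(h5_path, key="data", format="table", complib="zlib", complevel=5)

with pd.HDFStore(h5_path, mode="r") as store:
    # Method 1: let PyTables use its index through a where clause.
    via_where = store.select("data", where="Path == 6")
    # Method 2: load the whole table, then subselect in memory.
    everything = store.select("data")

via_filter = everything[everything.index.get_level_values("Path") == 6]

print(len(via_where), len(via_filter))
```

A mismatch between the two lengths, like 2892 versus 2972 in the report, indicates that the indexed query is skipping rows that are actually present in the file.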