read_csv and numeric columns larger than 2^53 #976
Comments
I think it is too expensive for readr to check for loss of precision when parsing; if you know you have integers that are this big, you just need to convert them to characters manually.
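A minimal sketch of that manual workaround, assuming the `gscholar_id` column from the example in this issue and a made-up file name:

```r
library(readr)

# Declare the ID column as character up front so its digits are never
# coerced to double ("scholar.csv" is a hypothetical file name).
ids <- read_csv("scholar.csv",
                col_types = cols(gscholar_id = col_character()))
```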
While I agree that I wouldn't want readr to be slowed down by expensive precision tests, is it possible to emit a warning for very large numbers? I didn't initially realise I had such large integers when working on my problem, in no small part because of the tibble printing methods from pillar.
I had the same problem and consider myself lucky to have found this thread. I think there might be a lot of data where this becomes an issue; in my case it is the IDs of Twitter handles and tweets. I think a warning would be helpful.
I believe that pandas read_csv reads too-large numbers as strings. I have no idea whether that's expensive, and I'm very aware that I'm not qualified to make a call on whether readr's default of losing precision is useful in a wider sense, but I'm drawn here because I've encountered data loss, also in IDs, due to defaults and floating-point precision not being understood. My underqualified opinion is actually that @charliejhadley's option B would be most sensible: default to reading columns with numbers larger than [the largest number before loss of precision] as character. IMO that's safer, as anyone wanting to perform a mathematical operation on such a column would likely hit an error immediately and fix it accordingly, whereas bugs involving loss of precision (which may not affect all values in the column) might not be encountered for a long time. I'd imagine changing the default behaviour of readr could be quite disruptive, though, so I agree with @charliejhadley and @paulcbauer that at least a warning would be helpful.
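For concreteness, 2^53 is the threshold being discussed here: it is the largest magnitude up to which every integer has an exact double representation. A quick check in R illustrates why larger integers in a double column cannot be trusted:

```r
# Below 2^53 adjacent integers are still distinguishable as doubles;
# immediately above it, 2^53 + 1 rounds back down to 2^53.
2^53 - 1 == 2^53   # FALSE
2^53 + 1 == 2^53   # TRUE: precision is already gone
```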
Should `col_spec_standardise` handle numeric columns that are larger than 2^53 specially? I've been working with Google Scholar IDs, which are integers between 19 and 22 digits long; here's an example dataset:
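A minimal stand-in for such a dataset (the ID value below is made up, and it is stored as character so the full digits survive at this point):

```r
library(tibble)

# Hypothetical stand-in for the Google Scholar data: one 19-digit ID,
# kept as character so nothing is lost before the CSV round trip.
scholar <- tibble(gscholar_id = "1234567890123456789")
```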
`write_csv` is specifically tested on its ability to export large numbers (tests/testthat/test-write.R), and so this works as expected. However, when `read_csv` is used to import this dataset, the `gscholar_id` column is parsed as a double, which means the data does not survive a round trip.
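A minimal sketch of that round trip, using the hypothetical `scholar` tibble above (the file name is also made up):

```r
library(readr)

# write_csv preserves the full digits in the file on disk...
write_csv(scholar, "scholar.csv")

# ...but read_csv guesses a double for gscholar_id, and doubles only
# represent integers exactly up to 2^53, so the trailing digits change.
roundtrip <- read_csv("scholar.csv")
sprintf("%.0f", roundtrip$gscholar_id)  # no longer the original ID
```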
Should `read_csv` silently do this? There are two possible solutions I can think of:

A) Emit a warning about the specific columns (as per the tidyverse error style guide); a rough sketch follows below.
B) Automatically convert the columns to character and emit a warning.
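A rough sketch of what an option A check might look like if run after parsing; this is entirely hypothetical and not readr's implementation or wording:

```r
# Hypothetical post-parse check: warn about double columns containing
# values at or above 2^53, without changing their type (option A).
warn_large_doubles <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.double(x) && any(abs(x) >= 2^53, na.rm = TRUE)) {
      warning("Column `", nm, "` contains values larger than 2^53 and may ",
              "have lost precision; consider col_character().", call. = FALSE)
    }
  }
  invisible(df)
}
```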
Option A is probably the most sensible.