read_csv and numeric columns larger than 2^53 #976

Closed
charliejhadley opened this issue Mar 7, 2019 · 4 comments

@charliejhadley

Should col_spec_standardise handle numeric columns that are larger than 2^53 specially?

I've been working with Google Scholar IDs, which are integers between 19 and 22 digits long. Here's an example dataset:

library("readr")
library("dplyr")

original_data <- tibble(
  doi = c(
    "10.1136/bmj.i2021", "10.1016/j.injury.2018.06.032", "10.1002/hbm.23739",
    "10.3389/fnagi.2017.00155", "10.1136/bmj.j2353", "10.1016/j.nicl.2017.04.011"
  ),
  gscholar_ids = c(
    "11898508762506083287", "3266119658877827603", "9104209908849226329",
    "7359588353404810659", "14160410518916347259", "16899778266587446382"
  )
)

write_csv is specifically tested on its ability to export large numbers (tests/testthat/test-write.R), and so this works as expected:

original_data %>%
  write_csv("exported-original-data.csv")

However, when read_csv is used to import this dataset, the gscholar_ids column is parsed as a double, which means the data does not survive a round trip.

imported_data <- read_csv("exported-original-data.csv")
#> Parsed with column specification:
#> cols(
#>   doi = col_character(),
#>   gscholar_ids = col_double()
#> )
imported_data$gscholar_ids == original_data$gscholar_ids
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
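
Doubles carry a 53-bit significand, so consecutive integers above 2^53 stop being distinguishable; a minimal base-R sketch of the boundary:

2^53 == 2^53 + 1
#> [1] TRUE
2^53 - 1 == 2^53
#> [1] FALSE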

Should read_csv silently do this? There are two possible solutions I can think of:

A) Emit a warning about the specific columns (as per the tidyverse error style guide):

#> Error: Columns have numeric values that cannot be represented exactly in R
#> * Column `gscholar_ids`

B) Automatically convert columns into character and emit a warning:

#> Error: Columns have numeric values that cannot be represented exactly in R and have been converted to col_character()
#> * Column `gscholar_ids` converted to `col_character()`

Option A is probably the more sensible of the two.

@jimhester
Collaborator

I think it is too expensive for readr to check for loss of precision when parsing. If you know you have integers that are this big, you just need to convert them to characters manually.
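
For example, a sketch of that manual approach using readr's col_types argument (column name taken from the example above):

library("readr")

# Read the ID column as character so no precision is ever lost
imported_data <- read_csv(
  "exported-original-data.csv",
  col_types = cols(gscholar_ids = col_character())
)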

@charliejhadley
Author

I agree that I wouldn't want readr to be slowed down by expensive precision tests, but is it possible to emit a warning for very large numbers?

I didn't initially realise I had such large integers when working with my problem, in no small part because of the tibble printing methods from pillar.
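
For illustration, a sketch of how the printing hides the problem (the exact output depends on the pillar version and the pillar.sigfig option):

library("tibble")

# pillar abbreviates large doubles to a few significant digits,
# so the silent rounding is invisible in the printed tibble
tibble(id = 11898508762506083287)
#> # A tibble: 1 x 1
#>        id
#>     <dbl>
#> 1 1.19e19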

@paulcbauer

I had the same problem and consider myself lucky to have found this thread. I think there is a lot of data where this becomes an issue; in my case, the IDs of Twitter handles and tweets. I think a warning would be helpful.

@pstils

pstils commented Nov 8, 2022

I believe that pandas read_csv reads too-large numbers as strings:

https://stackoverflow.com/questions/54440567/large-numbers-are-inferred-as-strings-by-pandas-when-read-from-a-csv-file

I have no idea whether that's expensive, and I'm aware that I'm not qualified to make a call on whether readr defaulting to a loss of precision is useful in a wider sense, but I'm drawn here because I've encountered data loss (also in IDs) due to defaults and floating-point precision not being understood.

My underqualified opinion is that @charliejhadley's option B would be most sensible: default to reading columns with numbers larger than [the largest number before loss of precision] as character. IMO that's safer, as anyone wanting to perform a mathematical operation on such a column would likely encounter an error immediately and fix it, whereas bugs involving loss of precision (which may not affect all values in the column) might not be noticed for a long time.

I'd imagine changing the default behaviour of readr could be quite damaging, though, so I agree with @charliejhadley and @paulcbauer that at least a warning would be helpful.
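
In the meantime, a defensive pattern that doesn't require changing any defaults is to read everything as character and opt columns in explicitly (a sketch using readr's .default column spec):

library("readr")

# Read every column as character first; convert only the columns
# that are known to be safe numerics afterwards
raw <- read_csv(
  "exported-original-data.csv",
  col_types = cols(.default = col_character())
)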
