read_csv and numeric columns larger than 2^53 #976

Closed
charliejhadley opened this issue Mar 7, 2019 · 4 comments

@charliejhadley

Should col_spec_standardise handle numeric columns that are larger than 2^53 specially?

I've been working with Google Scholar IDs, which are integers between 19 and 22 digits long. Here's an example dataset:

library("readr")
library("dplyr")

original_data <- tibble(
  doi = c(
    "10.1136/bmj.i2021", "10.1016/j.injury.2018.06.032", "10.1002/hbm.23739",
    "10.3389/fnagi.2017.00155", "10.1136/bmj.j2353", "10.1016/j.nicl.2017.04.011"
  ),
  gscholar_ids = c(
    "11898508762506083287", "3266119658877827603", "9104209908849226329",
    "7359588353404810659", "14160410518916347259", "16899778266587446382"
  )
)

write_csv is specifically tested on its ability to export large numbers (tests/testthat/test-write.R), and so this works as expected:

original_data %>%
  write_csv("exported-original-data.csv")

However, when read_csv is used to import this dataset, the gscholar_ids column is parsed as a double, which means the data does not survive a round trip.

imported_data <- read_csv("exported-original-data.csv")
#> Parsed with column specification:
#> cols(
#>   doi = col_character(),
#>   gscholar_ids = col_double()
#> )
imported_data$gscholar_ids == original_data$gscholar_ids
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
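
Doubles carry a 53-bit significand, so consecutive integers above 2^53 stop being distinguishable; a minimal base-R sketch of the boundary:

2^53 == 2^53 + 1
#> [1] TRUE
2^53 - 1 == 2^53
#> [1] FALSE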

Should read_csv silently do this? There are two possible solutions I can think of:

A) Emit a warning about the specific columns (as per the tidyverse error style guide):

#> Error: Columns have numeric values that cannot be represented exactly in R
#> * Column `gscholar_ids`

B) Automatically convert columns into character and emit a warning:

#> Error: Columns have numeric values that cannot be represented exactly in R and have been converted to col_character()
#> * Column `gscholar_ids` converted to `col_character()`

Option A is probably the more sensible of the two.

@jimhester
Collaborator

I think it is too expensive for readr to check for loss of precision when parsing. If you know you have integers that are this big, you just need to convert them to characters manually.
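
For example, a sketch of that manual approach using readr's col_types argument (column name taken from the example above):

library("readr")

# Read the ID column as character so no precision is ever lost
imported_data <- read_csv(
  "exported-original-data.csv",
  col_types = cols(gscholar_ids = col_character())
)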

@charliejhadley
Author

I agree that I wouldn't want readr to be slowed down by expensive precision tests, but is it possible to emit a warning for very large numbers?

I didn't initially realise I had such large integers when working with my problem, in no small part because of the tibble printing methods from pillar.
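
For illustration, a sketch of how the printing hides the problem (the exact output depends on the pillar version and the pillar.sigfig option):

library("tibble")

# pillar abbreviates large doubles to a few significant digits,
# so the silent rounding is invisible in the printed tibble
tibble(id = 11898508762506083287)
#> # A tibble: 1 x 1
#>        id
#>     <dbl>
#> 1 1.19e19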

@paulcbauer

I had the same problem and consider myself lucky to have found this thread. I think there is a lot of data where this becomes an issue; in my case, the IDs of Twitter handles and tweets. I think a warning would be helpful.

@pstils

pstils commented Nov 8, 2022

I believe that pandas read_csv reads too-large numbers as strings:

https://stackoverflow.com/questions/54440567/large-numbers-are-inferred-as-strings-by-pandas-when-read-from-a-csv-file

I have no idea whether that's expensive, and I'm aware that I'm not qualified to make a call on whether readr defaulting to a loss of precision is useful in a wider sense, but I'm drawn here because I've encountered data loss (also in IDs) due to defaults and floating-point precision not being understood.

My underqualified opinion is that @charliejhadley's option B would be most sensible: default to reading columns with numbers larger than [the largest number before loss of precision] as character. IMO that's safer, as anyone wanting to perform a mathematical operation on such a column would likely encounter an error immediately and fix it, whereas bugs involving loss of precision (which may not affect all values in the column) might not be noticed for a long time.

I'd imagine changing the default behaviour of readr could be quite damaging, though, so I agree with @charliejhadley and @paulcbauer that at least a warning would be helpful.
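
In the meantime, a defensive pattern that doesn't require changing any defaults is to read everything as character and opt columns in explicitly (a sketch using readr's .default column spec):

library("readr")

# Read every column as character first; convert only the columns
# that are known to be safe numerics afterwards
raw <- read_csv(
  "exported-original-data.csv",
  col_types = cols(.default = col_character())
)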
