Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Convenience features of fread

Matt Dowle edited this page Sep 8, 2017 · 33 revisions

The fread function in the data.table R package is not just for speed on large files (although it is good for that too). Here we highlight its convenience features for small data. Full details of all arguments and abilities are on the online manual page. A brief introduction at useR!2014 is on youtube here.

1. Using command line tools directly

For example, from this answer :

fread("grep -v TRIAL sum_data.txt")
#      V1   V2 V3      V4      V5      V6      V7
#  1:   2  0.1  0 -0.0047 -0.0168 -0.9938 -0.0087
#  2:   2  0.2  0 -0.0121  0.0002 -0.9898 -0.0364
#  3:   2  0.3  0  0.0193 -0.0068 -0.9884  0.0040
   ...

The -v makes grep return all lines except lines containing the string TRIAL. Given the number of high quality engineers that have looked at the command tool grep over the years, it is most likely that it is as fast as you can get, as well as being correct, convenient, well documented online, easy to learn and search for solutions for specific tasks. If you need to do more complicated string filters (e.g. strings at the beginning or the end of the lines, etc) then grep syntax is very powerful. Learning its syntax is a transferable skill to other languages and environments.

Another example, from this answer :

fread('unzip -cq mtcars.csv.zip')
#                      V1  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#  1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#  2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#  ...

An example using chained commands from this answer and tweet :

fread('cat *dat.gz | gunzip | grep -v "^Day"')

The manual page ?fread contains :

input ..., a shell command that preprocesses the file (e.g. fread("grep blah filename")), ...

On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft announced they will include these tools in Windows 10 natively. On Linux and Mac they have always been included in the operating system. You can find many examples and tutorials online about command line tools. We recommend Data Science at the Command Line. Some advantages of this approach were highlighted in slides 10-12 here.

2. Automatic separator and skip detection

Given a tab, comma, pipe, space, colon or semi-colon delimited file, you don't need to find or remember the function name or argument name to use. fread is a single command that will look at its input and observe the separator for you. All you as the user has to do is pass the file name to the function. The rest is done for you. Of course, if you do wish to override the separator, or use a very unusual one, you can pass it to the sep= parameter.

The manual page ?fread contains :

sep The separator between columns. Defaults to the character in the set [,\t |;:] that separates the sample of rows into the most number of lines with the same number of fields. The first row with this number of fields is taken as column names if every field on this row is type character, otherwise as data and default column names (V1, V2, etc) are assigned.

3. High quality automatic column type detection

Some files start off with blanks or zeros in some columns which only become populated later in the file. Others looks like integer at the top of the file but later contain characters. For this reason fread doesn't just look at the top of the file but it jumps to 100 equally spaced points in the file including the very end and reads 100 contiguous rows from each point. It is possible to jump directly to these points efficiently because fread uses the operating system's mmap which does not need to read from the beginning of the file to get to a particular point. In v1.9.8 (Nov 2016) the sample was increased from 15 rows (5 rows at 3 points) to 1,000 rows (100 rows at 10 points) and then in v1.10.5 (current dev) to 10,000 rows (100 rows at 100 points). The time taken for type detection is reported with verbose=TRUE and is almost always insignificant because even 10,000 rows is small. Sampling can take a bit more time when there's a very large number of columns (e.g. over 10,000 columns) but even then it is worth that time to obtain a better guess to save the time and inconvenience of a (much longer) reread (even though any reread, if required, is automatic).

The manual page ?fread contains :

A sample of 10,000 rows is used for a very good estimate of column types. 100 contiguous rows are read from 100 equally spaced points throughout the file including the beginning, middle and the very end. This results in a better guess when a column changes type later in the file (e.g. blank at the beginning/only populated near the end, or 001 at the start but 0A0 later on). This very good type guess enables a single allocation of the correct type up front once for speed, memory efficiency and convenience of avoiding the need to set colClasses after an error because there is no error. Even though the sample is large and jumping over the file, it is almost instant regardless of the size of the file because a lazy on-demand memory map is used. If a jump lands inside a quoted field containing newlines, each newline is tested until 5 lines are found following it with the expected number of fields. The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character.

From v1.10.5 (current dev), ?fread continues:

Rarely, the file may contain data of a higher type in rows outside the sample (referred to as an out-of-sample type exception). In this event fread will automatically reread just those columns so that you don't have the inconvenience of having to set colClasses yourself. Such columns are reread from the beginning of the file to correctly distinguish, for example, ,00, from ,000, when those values were both interpreted as integer 0, but then 00A occurs out-of-sample in that column. Set verbose=TRUE to see a detailed report of the logic deployed to read the file.

4. Early return if there's a problem at the end of the file

Since the large sample (described above) includes the very end of the file, errors at the end (e.g., an inconsistent number of columns, a mangled footer or the last field having an opening quote but no closing quote) are detected and reported almost instantly. It isn't necessary to wait a long time for the whole file to be read or allocate a large amount of memory which might swap out other processes, only to reach an error at the end.

5. Reading SQL insert scripts

There is no built-in feature for reading SQL insert scripts but they are easily handled with command line pre-processing. For example, given SQL insert script insert_script.sql:

INSERT INTO tbl VALUES (1, 'asd', 923123123, 'zx');
INSERT INTO tbl VALUES (1, NULL, 923123123, 'zxz');
INSERT INTO tbl VALUES (3, 'asd3', 923123123, NULL);

you can fread it using the following command (from this question):

fread('awk -F\' *[(),]+ *\' -v OFS=, \'{for (i=2;i<NF;i++) printf "%s%s", ($i=="NULL"?"":$i), (i<(NF-1)?OFS:ORS)}\' insert_script.sql')
#    V1     V2        V3    V4
# 1:  1  'asd' 923123123  'zx'
# 2:  1        923123123 'zxz'
# 3:  3 'asd3' 923123123      

6. integer64 support

fread automatically detects large integers (> 2^31) and reads them as type integer64 from the bit64 package. This retains full precision by storing the integers as true and accurate integers. This default behaviour can be overridden on a per-column basis using the colClasses argument, for all columns in the call to fread using its integer64 argument, or always by setting options(datatable.integer64="double") or options(datatable.integer64="character").

The manual page ?fread contains:

integer64 "integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as base::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".

To know that the default can be changed globally (e.g. in your personal .Rprofile file), you can look at the Usage section of ?fread which contains the default arguments :

integer64=getOption("datatable.integer64") # default: "integer64"

7. drop columns by name or number, as well as select by name or number

Rather than read all columns in to R (using up memory) only to use a subset of columns in your analysis, you can use these arguments to tell fread to skip over a set of columns. If you only have a few to discard, it's easier to just drop those. Easier because you don't have to type out the columns to keep or think how to write an R expression that correctly results in the columns to keep. fread will warn you if you specify a column that isn't present in the file. You use either select or drop but not both at the same time.

The manual page ?fread contains:

select Vector of column names or numbers to keep, drop the rest.

drop Vector of column names or numbers to drop, keep the rest.

8. Footer information is automatically discarded with warning

The known common case is the "Rowcount: <n>" trailing line that SQL select statements tend to emit. Consider the following input.

A,B
1,3
2,4
Rowcount: 2

data.table::fread returns the correct result, with warning about the footer being disarded.

> fread("A,B\n1,3\n2,4\nRowcount: 2\n")
   A B
1: 1 3
2: 2 4
Warning message:
In fread("A,B\n1,3\n2,4\nRowcount: 2\n") :
  Stopped reading at line 4 but text exists afterwards (discarded): Rowcount: 2

readr::read_csv does not appear to detect the footer. It treats the first column as character to fit the footer into the first column.

> read_csv("A,B\n1,3\n2,4\nRowcount: 2\n")
Warning: 1 parsing failure.
row col  expected    actual
  3  -- 2 columns 1 columns

# A tibble: 3 × 2
            A     B
        <chr> <int>
1           1     3
2           2     4
3 Rowcount: 2    NA

9. Skip to a sub-table's header row by searching for a substring of its column names

The manual page ?fread contains:

skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).

10. Automatic quote escape method detection including no-escape

You do not have to state whether quotes have been doubled or escaped. The large sample (see point 3 above) is used to automatically detect this. If the large sample cannot determine this and an escaped quote occurs outside the sample, then by then the number of columns is known for sure (from the large sample) and if that determines which escape was used unambiguously, then that determines it. This saves the user from experiencing a failed load and then needing to rerun with the correct manual control arguments at best, or having to edit the file by hand at worst. This can be especially time consuming and frustrating if the quoted field occurs near the end of a large file which takes time to load, or in a file at the end of a list of files which takes time to load.

If fields have been quoted because the source system realized that the separator or newline is present within the field but any quotes present as well were not escaped at all, then fread is able to cope with this. From v1.10.6, provided the whole file is readable with the same rule and the same number of columns unambiguously.

If fields were not quoted because no separators or newlines are present in the field but a quote is present in the field at the beginning, then fread is able to realize this is not really a quoted field. From v1.10.6, provided the whole file is readable with the same rule and the same number of columns unambiguously.

fread aims to mimic what we humans do when eyeballing the file: we look at a sample and look at the pattern. Almost all input files we have seen that are not RFC4180 compliant, are actually unambiguously readable.

This ability does not apply when fill=TRUE. When fill=TRUE, the file must be RFC4180 compliant (in particular with regard to quotes and escapes) other than each row does not have to contain the same number of columns since that's the point of fill=TRUE. It is the fact that every row has the same number of columns that allows fread to disambiguate non-RFC4180 files when fill=FALSE; the most common and default usage of fread. From v1.10.6 it does this in a more robust way consistently across the whole file (since the source system must have used one rule) and warns if the whole file could not be unambiguously read.