Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First version of the fwrite function #580 #1613

Merged
merged 7 commits into from
Apr 7, 2016

Conversation

oseiskar
Copy link
Contributor

This implementation of fwrite (#580) aims to be faster, or at least as fast as write.csv, but a few things have been left out or simplified:

  1. When quote=TRUE, all column names are quoted
  2. When quote=FALSE, nothing is quoted, even if this would break the CSV
  3. There is no option for row.names. They only make sense for data.frames with named rows. For data.tables, they would just reduce to row numbers.

The speedup compared to write.csv depends on column types and parameters but speedup factors from 2 to 4 are possible.

@@ -0,0 +1,62 @@
fwrite <- function(dt, file.path, append = FALSE, quote = TRUE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. should definitely use DT or x, given the dt function in the stats package.

  2. consistency with write.csv would have file.path named file and take default value "" (according to ?write.csv, the default is to print to console)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ok, I'll change this

  2. This is semi-intentional. Supporting file="" (stdout) and connections would require passing file handles (instead of just file name strings) from R to C and I don't currently know how to do this. R's C interface is poorly documented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Gotcha. Not sure how important the consistency is, but maybe just follow write.table to this end:
if (file == "") 
  file <- stdout()
else if (is.character(file)) {
  file <- if (nzchar(fileEncoding)) 
    file(file, ifelse(append, "a", "w"), encoding = fileEncoding)
  else file(file, ifelse(append, "a", "w"))
  on.exit(close(file))
}
else if (!isOpen(file, "w")) {
  open(file, "w")
  on.exit(close(file))
}
if (!inherits(file, "connection")) 
  stop("'file' must be a character string or connection")

And here is the C side of the code, to the extent that it helps

…f unique column names, replaced %in% -> %chin%
repeat {
block_end <- min(block_begin+(block.size-1), nrow(x))

dt_block <- x[c(block_begin:block_end),]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reply to MichaelChirico's comment if it would be faster to create an extra column and use it like x[.(block_no)]: could be a little bit faster and not difficult to implement in C, but I dislike the idea of modifying the input data table. Is there a convenient way to generate the name of such column so that it would not conflict with existing column names?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oseiskar none I'm aware of, once #633 solved it will be easy, you can use cryptic name (prefix it with the dot), and check if it doesn't exist in a data.table. Not sure what you are referring by modifying input, but adding column without modifying input is as simple as x = shallow(x)[, "col" := new], it won't copy the data, and it will add new column only to locally processed data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to reinforce Jan's point, this sort of thing (restricting acceptable column names) is done under the hood in several places of data.table code already, see e.g. here. As a suggestion, there's b__, which has an analogue to f__, l__, o__, zo__, jn__, and jl__, all found in [.data.table. Agreed block_no is too likely to be in users' tables (in fact I have several cases like that myself).

@oseiskar
Copy link
Contributor Author

I moved number and NA formatting to C code, which resulted in less column copying and a significant perfomance boost. See this gist. Now this version seems to be between 2 and 4 times faster than write.csv with character and numeric columns.

@jangorecki thanks for pointing out the shortcomings of Sys.time. Any ideas how it should be replaced in benchmarking?

@MichaelChirico
Copy link
Member

My preference is to use microbenchmark::get_nanotime.

By the way, thanks for working on this! The data.table community will be overjoyed to see a working version of this, I'm sure.

@MichaelChirico
Copy link
Member

Two more things:

  • I get an error when trying to write outside current wd; modifying slightly the example in ?fwrite:
fwrite(data.table(first=c(1,2), second=c(NA, 'foo"bar')), "~/Desktop/table.csv")

Error in fwrite(data.table(first = c(1, 2), second = c(NA, "foo\"bar")), :
No such file or directory

  • Handling for list columns is probably incorrect. Not sure whether we should support... seems like it will be a good counterpart to sep2 in fread. As of now:
DT <- data.table(a = 1:3, l = list(list(1:6), list(3:5), list("a","b")))

fwrite(DT, "test.csv")

Produces:

"a","l"
1,list(1:6)
2,list(3:5)
3,list("a", "b")

I don't know the proper way to handle this, perhaps surround a list in angle brackets and comma-separate the elements?

"a","l"
1,<<1,2,3,4,5,6>>
2,<<3,4,5>>
3,<<"a">,<"b">>

?

Or perhaps take the cue from format.data.table:

format.item <- function(x) {
  if (is.atomic(x) || is.formula(x)) 
    paste(c(format(head(x, 6), justify = justify, ...), 
            if (length(x) > 6) ""), collapse = ",")
  else paste("<", class(x)[1L], ">", sep = "")
}

if (is.list(col)) 
  col = sapply(col, format.item)
else col = format(char.trunc(col), justify = justify, ...)

Otherwise just error on list columns

@oseiskar
Copy link
Contributor Author

@MichaelChirico The problem is not the working directory but the ~, which is not automatically expanded in C. I'll fix this using path.expand...

@MichaelChirico
Copy link
Member

inre: lists, just noticed that write.csv pops an error in this case, so it should be fine to just error for now:

Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type list in EncodeElement

@MichaelChirico
Copy link
Member

inre: numeric columns, for consistency with write.csv more digits should be printed; from ?write.csv:

In almost all cases the conversion of numeric quantities is governed by the option "scipen" (see options), but with the internal equivalent of digits = 15. For finer control, use format to make a character matrix/data frame, and call write.table on that.

@jangorecki
Copy link
Member

@oseiskar

elapsed.secs <- system.time(ans <- fwrite(...))[[3L]]`

is the easiest way
for precise timing you can use

pt = microbenchmark::get_nanotime()
ans = fwrite(...)
microbenchmark::get_nanotime() - pt

or from the devel version of microbenchmarkCore a drop-in replacement for system.time called system.nanotime

No need to have list columns supported now, this can be easily and flexibly handled within data.table before passing it to fwrite.

Not sure about as.character() for non-(numeric/integer/characters), it might be better to use toString instead (then by=1:nrow(.) probably needed), but it generally depends on how it is processed further in terms of edge cases detection. toString always returns non-NA, length 1L character. Not sure about performance of that.

@oseiskar
Copy link
Contributor Author

As a reply to this earlier comment about supporting write.csv-like file argument that could also be an R connection and would default to writing to the console: not possible, as far as I know. Adding support to writing in STDOUT is possible but it seems that writing to an existing R connection is not.

@MichaelChirico thanks for looking up the C side of write.csv. Unfortunately, it does not seem to be possible do replicate this because the code is calling functions that are not part of the public C API of R. It would seem that all file/connection handling is part of the private / internal R code and thus unavailable.

In particular, one would need to call getConnection to transform a parameter SEXP, representing a connection number, to an Rconnection and trying to replicate the implementation this function would not help here.

@dselivanov
Copy link

Not sure, if this will help, but leave it here: official C level API for connection handles in R 3.3.0.

@oseiskar
Copy link
Contributor Author

oseiskar commented Apr 2, 2016

Actually, it is possible to bypass the private API limitation and call getConnection at least on some platforms: see this branch for a proof of concept. However, this brings in a whole new bunch of problems:

  1. It is not guaranteed to work or remain supported (However, as @dselivanov pointed out, official support will be available in R version 3.3.0)
  2. The first attempted implementation absolutely kills performance. The code linked above is about 4 times slower that write.csv with character columns.
  3. For regular files, it is possible to "fix" 2. using lower-level functions like this (speedup drops from about 2x to about 1.5x) but this will not work for all R connections, such as stdout. It is possible to handle both cases separately, which would significantly increase the complexity of the code.

In my opinion, supporting R connection arguments is not worth the trouble at this point.

@dselivanov
Copy link

Personally I think, connections is "nice to have" feature (which can be implemented in future), but even without connections this is a great PR. Thank you, @oseiskar.

@jangorecki
Copy link
Member

+1 dselivanov
FYI @mattdowle :)

@mattdowle mattdowle merged commit 6be2ed1 into Rdatatable:master Apr 7, 2016
@mattdowle
Copy link
Member

Awesome! Looks great @oseiskar :-)

@arunsrinivasan
Copy link
Member

Great PR, @oseiskar. Using the data.dable from the extensive benchmark Michael did on SO here, I profiled it using Instruments -> time profiler. Here's a snapshot.

fwrite-time-profiler

I've not looked at the code, but IIUC writefile() takes only about 50% of the time? Also perhaps there are trivial places that could be improved on that you could spot? Hope it's of some use.

@jangorecki
Copy link
Member

@oseiskar
Could you substitute R_xlen_t with something which is available in R 2.15.0?
While installing package on R 2.15.0 (data.table stated dependency) I get the following error:

fwrite.c: In functionwritefile:
fwrite.c:31:3: error: unknown type nameR_xlen_tR_xlen_t ncols = LENGTH(list_of_columns);
   ^

You can reproduce it with:

docker run -it docker.io/jangorecki/r-2.15.0 /bin/bash
curl -O https://Rdatatable.github.io/data.table/src/contrib/data.table_1.9.7.tar.gz
R2 CMD INSTALL data.table_1.9.7.tar.gz

Alternatively, if there is no substitution for that or it would decrease performance, then @mattdowle would need to decide about stated dependency upgrade. R 2.15.0 is from March 2012, in my opinion 4 years old isn't yet old enough to deprecate it without strong reason.

@arunsrinivasan
Copy link
Member

@jangorecki here's the solution: https://github.com/Rdatatable/data.table/blob/9d2d71098d849c99e6eebb0e0b539eb58d723b05/src/rbindlist.c#L13-17

@oseiskar oseiskar deleted the fwrite branch April 25, 2016 16:25
@mattdowle
Copy link
Member

Work continued here: #1664

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants