
Cheat Sheet - Dplyr PDF


Data Transformation with dplyr :: CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data: each variable is in its own column & each observation, or case, is in its own row.

Pipes: x %>% f(y) becomes f(x, y)

Summarise Cases

These apply summary functions to columns to create a new table. Summary functions take vectors as input and return one value (see back).

summarise(.data, …) Compute table of summaries. Also summarise_().
  summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally().
  count(iris, Species)

VARIATIONS
summarise_all() - Apply funs to every column.
summarise_at() - Apply funs to specific columns.
summarise_if() - Apply funs to all cols of one type.

Group Cases

Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.
  mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg))

group_by(.data, ..., add = FALSE) Returns copy of table grouped by …
  g_iris <- group_by(iris, Species)
ungroup(x, …) Returns ungrouped copy of table.
  ungroup(g_iris)

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

filter(.data, …) Extract rows that meet logical criteria. Also filter_().
  filter(iris, Sepal.Length > 7)
distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values. Also distinct_().
  distinct(iris, Species)
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select fraction of rows.
  sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select size rows.
  sample_n(iris, 10, replace = TRUE)
slice(.data, …) Select rows by position. Also slice_().
  slice(iris, 10:15)
top_n(x, n, wt) Select and order top n entries (by group if grouped data).
  top_n(iris, 5, Sepal.Width)

Logical and boolean operators to use with filter()
  <   <=   >   >=   is.na()   !is.na()   %in%   !   |   &   xor()
See ?base::Logic and ?Comparison for help.

ARRANGE CASES
arrange(.data, …) Order rows by values of a column (low to high), use with desc() to order from high to low.
  arrange(mtcars, mpg)
  arrange(mtcars, desc(mpg))

ADD CASES
add_row(.data, ..., .before = NULL, .after = NULL) Add one or more rows to a table.
  add_row(faithful, eruptions = 1, waiting = 1)

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

select(.data, …) Extract columns by name. Also select_if()
  select(iris, Sepal.Length, Species)

Use these helpers with select(), e.g. select(iris, starts_with("Sepal"))
  contains(match)   ends_with(match)   matches(match)   starts_with(match)
  num_range(prefix, range)   one_of(…)
  :, e.g. mpg:cyl
  -, e.g. -Species

MAKE NEW VARIABLES
These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …) Compute new column(s).
  mutate(mtcars, gpm = 1/mpg)
transmute(.data, …) Compute new column(s), drop others.
  transmute(mtcars, gpm = 1/mpg)
mutate_all(.tbl, .funs, …) Apply funs to every column. Use with funs().
  mutate_all(faithful, funs(log(.), log2(.)))
mutate_at(.tbl, .cols, .funs, …) Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
  mutate_at(iris, vars(-Species), funs(log(.)))
mutate_if(.tbl, .predicate, .funs, …) Apply funs to all columns of one type. Use with funs().
  mutate_if(iris, is.numeric, funs(log(.)))
add_column(.data, ..., .before = NULL, .after = NULL) Add new column(s).
  add_column(mtcars, new = 1:32)
rename(.data, …) Rename columns.
  rename(iris, Length = Sepal.Length)
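
The verbs on this page are designed to be chained with pipes. The following is a minimal sketch of one such chain on the built-in mtcars data, not an example from the sheet itself: the wt < 5 cutoff and the names by_cyl, avg_mpg, avg_gpm and n_cars are illustrative assumptions.

```r
library(dplyr)

# Per-cylinder summary of mtcars, built by piping one verb into the next.
# The wt < 5 cutoff and all result names are illustrative choices.
by_cyl <- mtcars %>%
  filter(wt < 5) %>%                 # EXTRACT CASES: keep rows meeting a criterion
  mutate(gpm = 1 / mpg) %>%          # MAKE NEW VARIABLES: gallons per mile
  group_by(cyl) %>%                  # Group Cases: later verbs work per group
  summarise(avg_mpg = mean(mpg),     # Summarise Cases: one row per group
            avg_gpm = mean(gpm),
            n_cars  = n()) %>%
  arrange(desc(avg_mpg))             # ARRANGE CASES: order from high to low

print(by_cyl)
```

Because x %>% f(y) becomes f(x, y), each step receives the table produced by the step before it as its first argument, so the chain reads top to bottom in the order the operations run.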

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.5.0 • tibble 1.2.0 • Updated: 2017-01
Vectorized Functions

TO USE WITH MUTATE()
mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSETS
dplyr::lag() - Offset elements by 1
dplyr::lead() - Offset elements by -1

CUMULATIVE AGGREGATES
dplyr::cumall() - Cumulative all()
dplyr::cumany() - Cumulative any()
cummax() - Cumulative max()
dplyr::cummean() - Cumulative mean()
cummin() - Cumulative min()
cumprod() - Cumulative prod()
cumsum() - Cumulative sum()

RANKINGS
dplyr::cume_dist() - Proportion of all values <=
dplyr::dense_rank() - rank with ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, -, *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons

MISC
dplyr::between() - x >= left & x <= right
dplyr::case_when() - multi-case if_else()
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions

TO USE WITH SUMMARISE()
summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNTS
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NA's

LOCATION
mean() - mean, also mean(!is.na())
median() - median

LOGICALS
mean() - Proportion of TRUE's
sum() - # of TRUE's

POSITION/ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column() Move row names into col.
  a <- rownames_to_column(iris, var = "C")
column_to_rownames() Move col into row names.
  column_to_rownames(a, var = "C")
Also has_rownames(), remove_rownames()

Combine Tables

Example tables used in this section:

  x            y            z
  A B C        A B D        A B C
  a t 1        a t 3        c v 3
  b u 2        b u 2        d w 4
  c v 3        d w 1

COMBINE VARIABLES
Use bind_cols() to paste tables beside each other as they are.

bind_cols(…) Returns tables placed side by side as a single table. BE SURE THAT ROWS ALIGN.
  A B C A B D
  a t 1 a t 3
  b u 2 b u 2
  c v 3 d w 1

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from y to x.
  A B C D
  a t 1 3
  b u 2 2
  c v 3 NA
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from x to y.
  A B C  D
  a t 1  3
  b u 2  2
  d w NA 1
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain only rows with matches.
  A B C D
  a t 1 3
  b u 2 2
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain all values, all rows.
  A B C  D
  a t 1  3
  b u 2  2
  c v 3  NA
  d w NA 1

Use by = c("col1", "col2") to specify the column(s) to match on.
  left_join(x, y, by = "A")
  A B.x C B.y D
  a t   1 t   3
  b u   2 u   2
  c v   3 NA  NA
Use a named vector, by = c("col1" = "col2"), to match on columns with different names in each data set.
  left_join(x, y, by = c("C" = "D"))
  A.x B.x C A.y B.y
  a   t   1 d   w
  b   u   2 b   u
  c   v   3 a   t
Use suffix to specify suffix to give to duplicate column names.
  left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
  A1 B1 C A2 B2
  a  t  1 d  w
  b  u  2 b  u
  c  v  3 a  t

COMBINE CASES
Use bind_rows() to paste tables below each other as they are.

bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names, as in this result for x and z:
  DF A B C
  x  a t 1
  x  b u 2
  x  c v 3
  z  c v 3
  z  d w 4
intersect(x, y, …) Rows that appear in both x and z.
  A B C
  c v 3
setdiff(x, y, …) Rows that appear in x but not z.
  A B C
  a t 1
  b u 2
union(x, y, …) Rows that appear in x or z. (Duplicates removed). union_all() retains duplicates.
  A B C
  a t 1
  b u 2
  c v 3
  d w 4

Use setequal() to test whether two data sets contain the exact same rows (in any order).

EXTRACT ROWS
Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, …) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.
  A B C
  a t 1
  b u 2
anti_join(x, y, by = NULL, …) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
  A B C
  c v 3
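
The mutating joins, filtering joins and bind_rows() described under Combine Tables can be tried on small tables. The sketch below rebuilds x, y and z as plain data frames from the values pictured on the sheet; the result names (left, inner, full, semi, anti, stacked) and the explicit by = c("A", "B") are illustrative assumptions, not part of the sheet.

```r
library(dplyr)

# Small example tables, rebuilt from the values pictured on the sheet.
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1,
                stringsAsFactors = FALSE)
z <- data.frame(A = c("c", "d"), B = c("v", "w"), C = 3:4,
                stringsAsFactors = FALSE)

# Mutating joins: add columns from y to x, matching on A and B.
left  <- left_join(x, y, by = c("A", "B"))   # all rows of x, NA where y has no match
inner <- inner_join(x, y, by = c("A", "B"))  # only rows with a match in both
full  <- full_join(x, y, by = c("A", "B"))   # all rows of both tables

# Filtering joins: keep or drop rows of x, never adding columns.
semi <- semi_join(x, y, by = c("A", "B"))    # rows of x that would be joined
anti <- anti_join(x, y, by = c("A", "B"))    # rows of x that would not be joined

# Stack x on top of z and record each row's source table in column DF.
stacked <- bind_rows(x = x, z = z, .id = "DF")

print(left)
print(anti)
print(stacked)
```

Running semi_join() and anti_join() before a mutating join is a cheap way to see which rows will and will not find a match.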
