Tidyr and Missing Values, Stringr, Forcats. R Markdown, Text Formatting With R Markdown, Code Chunks, YAML Header


Dr Diana Abdul Wahab

Sem II 2021/2022

tidyr
In this chapter, you will learn a consistent way to organize your data in R, an organization
called tidy data. Getting your data into this format requires some up-front work, but that
work pays off in the long term. Once you have tidy data and the tidy tools provided by
packages in the tidyverse, you will spend much less time munging data from one
representation to another, allowing you to spend more time on the analytic questions at
hand.
This chapter will give you a practical introduction to tidy data and the accompanying tools
in the tidyr package. If you’d like to learn more about the underlying theory, you might
enjoy the Tidy Data paper published in the Journal of Statistical Software.
In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up
your messy datasets. tidyr is a member of the core tidyverse.
Tidy Data: You can represent the same underlying data in multiple ways. The following
example shows the same data organized in four different ways. Each dataset shows the
same values of four variables, country, year, population, and cases, but each dataset
organizes the values in a different way:
library(tidyr)

## Warning: package 'tidyr' was built under R version 4.1.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v dplyr   1.0.8
## v tibble  3.1.6     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## v purrr   0.3.4

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3


## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

table1

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

table2

## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

table3

## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# Spread across two tibbles
table4a # cases

## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

table4b # population

## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583

These are all representations of the same underlying data, but they are not equally easy to
use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
These three rules are interrelated because it’s impossible to only satisfy two of the three.
That interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
2. Put each variable in a column.
In this example, only table1 is tidy. It’s the only representation where each column is a
variable.
Why ensure that your data is tidy? There are two main advantages:
• There’s a general advantage to picking one consistent way of storing data. If you
have a consistent data structure, it’s easier to learn the tools that work with it
because they have an underlying uniformity.
• There’s a specific advantage to placing variables in columns because it allows R’s
vectorized nature to shine. Most built-in R functions work with vectors of values.
That makes transforming tidy data feel particularly natural. dplyr, ggplot2, and all
the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with table1:
# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)

## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67

# Compute cases per year
table1 %>%
  count(year, wt = cases)

## # A tibble: 2 x 2
## year n
## <int> <int>
## 1 1999 250740
## 2 2000 296920

# Visualize changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

Spreading and Gathering
The principles of tidy data seem so obvious that you might wonder if you’ll ever encounter
a dataset that isn’t tidy. Unfortunately, however, most data that you will encounter will be
untidy. There are two main reasons:
• Most people aren’t familiar with the principles of tidy data, and it’s hard to derive
them yourself unless you spend a lot of time working with data.
• Data is often organized to facilitate some use other than analysis. For example, data
is often organized to make entry as easy as possible.
This means for most real analyses, you’ll need to do some tidying. The first step is always to
figure out what the variables and observations are. Sometimes this is easy; other times
you’ll need to consult with the people who originally generated the data.
The second step is to resolve one of two common problems:
• One variable might be spread across multiple columns.
• One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if
you’re really unlucky! To fix these problems, you’ll need the two most important functions
in tidyr: gather() and spread().
Gathering: A common problem is a dataset where some of the column names are not
names of variables, but values of a variable. Take table4a; the column names 1999 and
2000 represent values of the year variable, and each row represents two observations, not
one:
table4a

## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

To tidy a dataset like this, we need to gather those columns into a new pair of variables. To
describe that operation we need three parameters:
• The set of columns that represent values, not variables. In this example, those are
the columns 1999 and 2000.
• The name of the variable whose values form the column names. I call that the key,
and here it is year.
• The name of the variable whose values are spread over the cells. I call that value,
and here it’s the number of cases.
Together those parameters generate the call to gather():
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766

The columns to gather are specified with dplyr::select() style notation. Here there are
only two columns, so we list them individually. Note that “1999” and “2000” are
nonsyntactic names so we have to surround them in backticks.
In the final result, the gathered columns are dropped, and we get new key and value
columns. Otherwise, the relationships between the original variables are preserved. We can
use gather() to tidy table4b in a similar fashion. The only difference is the variable stored
in the cell values:
table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")

## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583

To combine the tidied versions of table4a and table4b into a single tibble, we need to use
dplyr::left_join():
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)

## Joining, by = c("country", "year")

## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
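
The join above prints a message because the join columns are guessed, and year is left as a character column. The following is a minimal sketch of one way to make both explicit, assuming the table4a and table4b datasets that ship with tidyr. It passes convert = TRUE to gather() so that the key column is type-converted, and it names the join columns with by. The object name tidy4 is only illustrative.

```r
library(tidyr)
library(dplyr)

# convert = TRUE asks gather() to type-convert the new key column,
# so "1999" and "2000" become integers instead of strings
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases", convert = TRUE)
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population", convert = TRUE)

# Naming the join columns avoids the "Joining, by = ..." message
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year"))
tidy4
```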

Spreading: Spreading is the opposite of gathering. You use it when an observation is
scattered across multiple rows. For example, take table2: an observation is a country in a
year, but each observation is spread across two rows:
table2

## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

To tidy this up, we first analyze the representation in a similar way to gather(). This time,
however, we only need two parameters:
• The column that contains variable names, the key column. Here, it’s type.
• The column that contains values from multiple variables, the value column. Here,
it’s count.
Once we’ve figured that out, we can use spread(), as shown programmatically here:
spread(table2, key = type, value = count)

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

As you might have guessed from the common key and value arguments, spread() and
gather() are complements. gather() makes wide tables narrower and longer; spread()
makes long tables shorter and wider.
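
In tidyr 1.0.0 and later, gather() and spread() still work, but the package documentation describes them as superseded by pivot_longer() and pivot_wider(). The sketch below, which assumes such a tidyr version is installed, repeats the table4a and table2 examples with the newer functions. The object names long4a and wide2 are illustrative only.

```r
library(tidyr)
library(dplyr)

# Counterpart of gather(): the column names become values of `year`
long4a <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )

# Counterpart of spread(): the values of `type` become column names
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

long4a
wide2
```

Note that pivot_longer() returns the rows grouped by country rather than by year, so the row order differs from the gather() output shown earlier.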

Separating and Uniting
So far you’ve learned how to tidy table2 and table4, but not table3. table3 has a different
problem: we have one column (rate) that contains two variables (cases and population). To
fix this problem, we’ll need the separate() function. You’ll also learn about the
complement of separate(): unite(), which you use if a single variable is spread across
multiple columns.
Separate: separate() pulls apart one column into multiple columns, by splitting wherever
a separator character appears. Take table3:
table3

## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583

The rate column contains both cases and population variables, and we need to split it into
two variables. separate() takes the name of the column to separate, and the names of the
columns to separate into, as shown in the following code:
table3 %>%
  separate(rate, into = c("cases", "population"))

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

By default, separate() will split values wherever it sees a non-alphanumeric character
(i.e., a character that isn’t a number or letter). For example, in the preceding code,
separate() split the values of rate at the forward slash characters. If you wish to use a
specific character to separate a column, you can pass the character to the sep argument of
separate(). For example, we could rewrite the preceding code as:
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Look carefully at the column types: you’ll notice that cases and population are character
columns. This is the default behavior in separate(): it leaves the type of the column as is.
Here, however, it’s not very useful as those really are numbers. We can ask separate() to
try and convert to better types using convert = TRUE:
table3 %>%
  separate(
    rate,
    into = c("cases", "population"),
    convert = TRUE
  )

## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

You can also pass a vector of integers to sep. separate() will interpret the integers as
positions to split at. Positive values start at 1 on the far left of the strings; negative values
start at –1 on the far right of the strings. When using integers to separate strings, the length
of sep should be one less than the number of names in into.
You can use this arrangement to separate the last two digits of each year. This makes this
data less tidy, but is useful in other cases, as you’ll see in a little bit:
table3 %>%
  separate(year, into = c("century", "year"), sep = 2)

## # A tibble: 6 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583

Unite: unite() is the inverse of separate(): it combines multiple columns into a single
column. You’ll need it much less frequently than separate(), but it’s still a useful tool to
have in your back pocket. We can use unite() to rejoin the century and year columns that
we created in the last example. That data is saved as tidyr::table5.
unite() takes a data frame, the name of the new variable to create, and a set of columns to
combine, again specified in dplyr::select() style. The result is shown in the following code:
table5 %>%
  unite(new, century, year)

## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583

In this case we also need to use the sep argument. The default will place an underscore (_)
between the values from different columns. Here we don’t want any separator so we use "":
table5 %>%
  unite(new, century, year, sep = "")

## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
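
Because unite() is the inverse of separate(), the two can be chained as a round trip. The sketch below is an illustration under stated assumptions: it splits year in tidyr's table3 by position, rejoins the pieces with sep = "", and converts the result back to an integer, since unite() always returns a character column. The names split_year, yr, and rejoined are hypothetical.

```r
library(tidyr)
library(dplyr)

# Split the four-digit year after the second character
split_year <- table3 %>%
  separate(year, into = c("century", "yr"), sep = 2)

# Paste the pieces back together with no separator, then restore the type
rejoined <- split_year %>%
  unite(year, century, yr, sep = "") %>%
  mutate(year = as.integer(year))

rejoined
```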

Missing Values
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:
• Explicitly, i.e., flagged with NA.
• Implicitly, i.e., simply not present in the data.
Let’s illustrate this idea with a very simple dataset:
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

There are two missing values in this dataset:
• The return for the fourth quarter of 2015 is explicitly missing, because the cell
where its value should be instead contains NA.
• The return for the first quarter of 2016 is implicitly missing, because it simply does
not appear in the dataset.
One way to think about the difference is with this Zen-like koan: an explicit missing
value is the presence of an absence; an implicit missing value is the absence of a
presence.
The way that a dataset is represented can make implicit values explicit. For example, we
can make the implicit missing value explicit by putting years in the columns:
stocks %>%
  spread(year, return)

## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66

Because these explicit missing values may not be important in other representations of the
data, you can set na.rm = TRUE in gather() to turn explicit missing values implicit:
stocks %>%
  spread(year, return) %>%
  gather(year, return, `2015`:`2016`, na.rm = TRUE)

## # A tibble: 6 x 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 3 2015 0.35
## 4 2 2016 0.92
## 5 3 2016 0.17
## 6 4 2016 2.66
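
For readers using the newer pivoting functions, the same idea can be sketched with pivot_wider() and pivot_longer(), where the values_drop_na argument plays the role of na.rm. This assumes tidyr 1.0.0 or later; the stocks data is repeated so the block runs on its own, and the name stocks_long is illustrative.

```r
library(tidyr)
library(dplyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Widening makes the implicit 2016 Q1 gap explicit; lengthening with
# values_drop_na = TRUE then drops every NA, explicit or not
stocks_long <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )

stocks_long
```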

Another important tool for making missing values explicit in tidy data is complete():
stocks %>%
  complete(year, qtr)

## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66

complete() takes a set of columns, and finds all unique combinations. It then ensures the
original dataset contains all those values, filling in explicit NAs where necessary.
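
To see which rows complete() added, one option is to compare its result with the original data. The sketch below is a hedged illustration: it uses dplyr::anti_join() to keep only the year and quarter combinations that were absent before completion. The names completed and implicit are hypothetical, and the stocks data is repeated so the block is self-contained.

```r
library(tidyr)
library(dplyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# All year/quarter combinations, with NA where no row existed
completed <- stocks %>%
  complete(year, qtr)

# Rows of `completed` with no matching year/qtr in the original data,
# i.e. the values that were implicitly missing
implicit <- anti_join(completed, stocks, by = c("year", "qtr"))
implicit
```

Here only the first quarter of 2016 is returned; the fourth quarter of 2015 was already present as an explicit NA, so it does not appear.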
There’s one other important tool that you should know for working with missing values.
Sometimes when a data source has primarily been used for data entry, missing values
indicate that the previous value should be carried forward:
treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          9,
  "Katherine Burke",  1,          4
)

You can fill in these missing values with fill(). It takes a set of columns where you want
missing values to be replaced by the most recent nonmissing value (sometimes called last
observation carried forward):
treatment %>%
  fill(person)

## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4

R Markdown
R Markdown provides a unified authoring framework for data science, combining your
code, its results, and your prose commentary. R Markdown documents are fully
reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and
more. R Markdown files are designed to be used in three ways:
• For communicating to decision makers, who want to focus on the conclusions, not
the code behind the analysis.
• For collaborating with other data scientists (including future you!), who are
interested in both your conclusions, and how you reached them (i.e., the code).
• As an environment in which to do data science, as a modern day lab notebook where
you can capture not only what you did, but also what you were thinking.
R Markdown integrates a number of R packages and external tools. This means that help is,
by and large, not available through ?. Instead, as you work through this chapter, and use R
Markdown in the future, keep these resources close to hand:
• R Markdown Cheat Sheet: available in the RStudio IDE under Help > Cheatsheets > R
Markdown Cheat Sheet
• R Markdown Reference Guide: available in the RStudio IDE under Help > Cheatsheets >
R Markdown Reference Guide
Both cheatsheets are also available at http://rstudio.com/cheatsheets.
R Markdown Basics
This is an R Markdown file, a plain-text file that has the extension .Rmd:
---
title: "Diamond sizes"
date: 2016-08-25
output: html_document
---
```{r setup, include = FALSE}
library(ggplot2)
library(dplyr)
smaller <- diamonds %>%
  filter(carat <= 2.5)
```
We have data about `r nrow(diamonds)` diamonds. Only
`r nrow(diamonds) - nrow(smaller)` are larger than
2.5 carats. The distribution of the remainder is shown
below:
```{r, echo = FALSE}
smaller %>%
  ggplot(aes(carat)) +
  geom_freqpoly(binwidth = 0.01)
```

It contains three important types of content:
1. An (optional) YAML header surrounded by ---s.
2. Chunks of R code surrounded by ```.
3. Text mixed with simple text formatting like # heading and _italics_.

When you open an .Rmd, you get a notebook interface where code and output are
interleaved. You can run each code chunk by clicking the Run icon (it looks like a play
button at the top of the chunk), or by pressing Cmd/Ctrl-Shift-Enter. RStudio executes the
code and displays the results inline with the code.
To produce a complete report containing all text, code, and results, click “Knit” or press
Cmd/Ctrl-Shift-K. You can also do this programmatically with
rmarkdown::render("1-example.Rmd"). This will display the report in the viewer pane, and
create a self-contained HTML file that you can share with others.
When you knit the document R Markdown sends the .Rmd file to knitr, which executes all
of the code chunks and creates a new Markdown (.md) document that includes the code
and its output. The Markdown file generated by knitr is then processed by pandoc, which
is responsible for creating the finished file. The advantage of this two-step workflow is that
you can create a very wide range of output formats.
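
The first of the two steps can be tried in isolation. The sketch below is a minimal illustration, not part of the original example: it writes a tiny .Rmd file to a temporary location (the file contents and the names rmd_file and md_file are made up for the demonstration) and calls knitr::knit() to produce the intermediate Markdown file. The second step, conversion by pandoc, is what rmarkdown::render() adds on top and is only mentioned in a comment here.

```r
library(knitr)

# Build the chunk delimiter from single backticks so that this example
# can itself be embedded in a Markdown document
fence <- strrep("`", 3)

rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- tempfile(fileext = ".md")

writeLines(c(
  "---",
  "title: \"Two-step demo\"",
  "output: html_document",
  "---",
  "",
  "The sum is computed below.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd_file)

# Step 1: knitr runs the chunks and writes a plain Markdown file
knit(rmd_file, output = md_file, quiet = TRUE)
md_lines <- readLines(md_file)
md_lines

# Step 2 (not run here): rmarkdown::render(rmd_file) would perform the
# knit step and then hand the Markdown to pandoc to build the final file
```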
To get started with your own .Rmd file, select File > New File > R Markdown in the menu
bar. RStudio will launch a wizard that you can use to pre-populate your file with useful
content that reminds you how the key features of R Markdown work.
The following sections dive into the three components of an R Markdown document in
more detail: the Markdown text, the code chunks, and the YAML header.
Exercises
1. Create a new notebook using File > New File > R Notebook. Read the instructions.
Practice running the chunks. Verify that you can modify the code, rerun it, and see
modified output.
2. Create a new R Markdown document with File > New File > R Markdown. Knit it by
clicking the appropriate button. Knit it by using the appropriate keyboard shortcut.
Verify that you can modify the input and see the output update.
3. Compare and contrast the R Notebook and R Markdown files you created earlier.
How are the outputs similar? How are they different? How are the inputs similar?
How are they different? What happens if you copy the YAML header from one to the
other?
4. Create one new R Markdown document for each of the three built-in formats: HTML,
PDF, and Word. Knit each of the three documents. How does the output differ? How
does the input differ? (You may need to install LaTeX in order to build the PDF
output—RStudio will prompt you if this is necessary.)
Text Formatting with Markdown
Prose in .Rmd files is written in Markdown, a lightweight set of conventions for formatting
plain-text files. Markdown is designed to be easy to read and easy to write. It is also very
easy to learn. The following guide shows how to use Pandoc’s Markdown, a slightly
extended version of Markdown that R Markdown understands:
Text formatting
------------------------------------------------------------
*italic* or _italic_
**bold** __bold__
`code`
superscript^2^ and subscript~2~
Headings
------------------------------------------------------------
# 1st Level Header
## 2nd Level Header
### 3rd Level Header
Lists
------------------------------------------------------------
* Bulleted list item 1
* Item 2
* Item 2a
* Item 2b
1. Numbered list item 1
1. Item 2. The numbers are incremented automatically in
the output.
Links and images
------------------------------------------------------------
<http://example.com>
[linked phrase](http://example.com)
![optional caption text](path/to/img.png)
Tables
------------------------------------------------------------
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
The best way to learn these is simply to try them out. It will take a few days, but soon they
will become second nature, and you won’t need to think about them. If you forget, you can
get to a handy reference sheet with Help > Markdown Quick Reference.
Code Chunks
To run code inside an R Markdown document, you need to insert a chunk. There are three
ways to do so:
1. The keyboard shortcut Cmd/Ctrl-Alt-I
2. The “Insert” button icon in the editor toolbar
3. By manually typing the chunk delimiters ```{r} and ```
Obviously, I’d recommend you learn the keyboard shortcut. It will save you a lot of time in
the long run!
You can continue to run the code using the keyboard shortcut that by now (I hope!) you
know and love: Cmd/Ctrl-Enter. However, chunks get a new keyboard shortcut:
Cmd/Ctrl-Shift-Enter, which runs all the code in the chunk. Think of a chunk like a
function. A chunk should be relatively self-contained, and focused around a single task.
Chunk Options
Chunk output can be customized with options, arguments supplied to the chunk header.
knitr provides almost 60 options that you can use to customize your code chunks. Here
we’ll cover the most important chunk options that you’ll use frequently. You can see the full
list at http://yihui.name/knitr/options/.
The most important set of options controls if your code block is executed and what results
are inserted in the finished report:
• eval = FALSE prevents code from being evaluated. (And obviously if the code is not
run, no results will be generated.) This is useful for displaying example code, or for
disabling a large block of code without commenting each line.
• include = FALSE runs the code, but doesn’t show the code or results in the final
document. Use this for setup code that you don’t want cluttering your report.
• echo = FALSE prevents code, but not the results, from appearing in the finished file.
Use this when writing reports aimed at people who don’t want to see the underlying
R code.
• message = FALSE or warning = FALSE prevents messages or warnings from
appearing in the finished file.
• results = 'hide' hides printed output; fig.show = 'hide' hides plots.
• error = TRUE causes the render to continue even if code returns an error. This is
rarely something you’ll want to include in the final version of your report, but can be
very useful if you need to debug exactly what is going on inside your .Rmd. It’s also
useful if you’re teaching R and want to deliberately include an error. The default,
error = FALSE, causes knitting to fail if there is a single error in the document.
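
The options above are set per chunk in the chunk header. If the same settings are wanted for every chunk, knitr also lets you change the defaults from R code with knitr::opts_chunk$set(), typically called once in a setup chunk near the top of the document. The following is a small sketch of that approach; the particular combination of options is just an example.

```r
library(knitr)

# Change the defaults for all chunks that follow:
# hide the code, and suppress messages and warnings
opts_chunk$set(
  echo = FALSE,
  message = FALSE,
  warning = FALSE
)

# Inspect the current default for a single option
opts_chunk$get("echo")
```

An individual chunk can still override a default by stating the option in its own header, for example echo = TRUE.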
Table
By default, R Markdown prints data frames and matrices as you’d see them in the console:
mtcars[1:5, 1:10]

##                    mpg cyl disp  hp drat    wt  qsec vs am gear
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3

If you prefer that data be displayed with additional formatting you can use the
knitr::kable function. The following code generates the table:
knitr::kable(
  mtcars[1:5, ],
  caption = "A `knitr` kable."
)

A knitr kable.
                    mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4          21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag      21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
Datsun 710         22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive     21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
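
kable() accepts further arguments for controlling the printed table. As a hedged sketch, the example below uses digits to round numeric columns and col.names to supply more readable headers. The chosen columns, header labels, and the name tbl are illustrative.

```r
library(knitr)

tbl <- kable(
  mtcars[1:5, c("mpg", "cyl", "wt")],
  digits = 1,                                   # round numeric columns
  col.names = c("MPG", "Cylinders", "Weight"),  # one label per data column
  caption = "First five cars, rounded."
)

tbl
```

When run in the console, kable() prints the Markdown source of the table; inside a knitted document the same call yields a formatted table.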
