Bayes CPH - Tutorial R
Johannes Karreth
Applied Introduction to Bayesian Data Analysis
1 Getting started
The purpose of this tutorial is to show the very basics of the R language so that participants who
have not used R before can complete the first assignment in this workshop. For information on the
thousands of other features of R, see the suggested resources below.
In this tutorial, R code that you would enter in your script file or in the command line is preceded by
the > character, and by + if the current line of code continues from a previous line. You do not need
to type these characters in your own code.
To operate R, you should rely on writing R scripts. We will write these scripts in RStudio. Download
RStudio from http://www.rstudio.org. Then, install it on your computer. Some text editors
also offer integration with R, so that you can send code directly to R. RStudio is generally the best
solution for running R and maintaining a reproducible workflow.
Lastly, install LaTeX in order to compile PDF files from within RStudio. To do this, follow the
instructions under http://www.jkarreth.net/latex.html, “Installation”. You won’t have to
use LaTeX directly or learn how to write LaTeX code in this workshop.
1.3 R packages
Many useful and important functions in R are provided via packages that need to be installed
separately. You can do this by using the Package Installer in the menu (Packages & Data > Package
Installer in R or Tools > Install Packages... in RStudio), or by typing
> install.packages("foreign")
in the R command line. Next, every time you use R, you need to load the packages you want to use:
type
> library(foreign)
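One common pattern, sketched here as an optional convenience rather than a requirement of this workshop, is to install a package only if it is not yet available and then load it. requireNamespace() is part of base R; the package name is the one used above.

```r
# Install the package only if it is missing, then load it.
pkg <- "foreign"
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)

# Confirm that the package is now attached to the search path.
"package:foreign" %in% search()
```

Because the check runs first, the script can be re-run without re-installing the package each time.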
Figure 1: RStudio.
RStudio offers a very useful function to set up a whole project (File > New Project...). Projects
automatically create a working directory for you.
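To see which working directory R is currently using, and to refer to files relative to it, a short sketch follows. The folder name data and the file name mydata.csv are hypothetical.

```r
# Show the current working directory; inside an RStudio project this is
# the project folder.
getwd()

# Build a path relative to the working directory without hard-coding
# separators. "data" and "mydata.csv" are hypothetical names.
data_path <- file.path("data", "mydata.csv")
data_path

# Check whether that file exists before trying to read it.
file.exists(data_path)
```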
1.5 R help
Within R, you can access the help files for any command that exists by typing ?commandname or,
for a list of the commands within a package, by typing help(package = packagename). So, for
instance:
> ?rnorm
> help(package = "foreign")
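A few related base R helpers can be useful when you do not remember the exact name or arguments of a command. This is a small sketch of functions that ship with R, not a complete list:

```r
# Show the arguments of a function without opening its help page.
args(rnorm)

# List all objects whose names start with "rnorm".
apropos("^rnorm")

# Run the examples from the bottom of a help page.
example(mean)
```

In an interactive session, ??"normal distribution" additionally searches all installed help files for a phrase.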
• Never type commands into the R command line. Always use a script file, from which you can
send (via RStudio, Emacs, ...) commands to R, or at least copy and paste them into R.
• Comment your script files! Comments are indicated by the # sign:
> # This is a comment
• Use a consistent style when writing code. A good place to start is Google’s style guide:
http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html.
• Do not use the attach() command.
As your code becomes more complex, you may forget to complete a particular command. For
example, here I want to add 1 and the product of 2 and 4, but I forget to close the parentheses around
the product:
> 1 + (2 * 4
+ )
## [1] 9
You will notice that the little > on the left changes into a +. This means that R is offering you a new
line to finish the original command. If I type a right parenthesis, R returns the result of my operation.
• Fox and Weisberg, An R and S-Plus Companion to Applied Regression (2011, print).
• statmethods.net. This website offers well-explained computer code to complete most of
the data analysis tasks we use in this workshop.
• Maindonald and Braun, Data Analysis and Graphics Using R (2006, print).
• Verzani, simpleR - Using R for Introductory Statistics
(http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf).
> x <- 1
> x
## [1] 1
> y <- 2
> x + y
## [1] 3
> x * y
## [1] 2
> x / y
## [1] 0.5
> y^2
## [1] 4
> log(x)
## [1] 0
> exp(x)
## [1] 2.718282
• Vectors:
> xvec <- c(1, 2, 3, 4, 5)
> xvec
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
## [1] 1 1 1 1 1
## [1] 2 3 4 5 6
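The three vectors printed above could be produced in several ways. As a sketch, assuming they come from a sequence, a repeated value, and element-wise addition (the exact commands are not shown here, and the names xvec2, xvec3, and xvec4 are illustrative):

```r
xvec2 <- seq(from = 1, to = 5, by = 1)  # sequence 1, 2, ..., 5
xvec3 <- rep(1, times = 5)              # the value 1 repeated five times
xvec4 <- xvec2 + xvec3                  # element-wise addition

xvec2
xvec3
xvec4
```

Arithmetic on vectors of the same length works element by element, which is why no loop is needed.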
• Matrices:
> mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
> mat1
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
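A few further operations on the same matrix, as a brief sketch. The object mat2 is introduced only to illustrate what the byrow argument changes.

```r
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)

dim(mat1)        # number of rows and columns
t(mat1)          # transpose: a 2 x 3 matrix
mat1 * 2         # element-wise multiplication by a scalar

# With byrow = FALSE (the default), the same data fill column by column.
mat2 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3)
mat2[1, ]        # the first row now holds 1 and 4
```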
• Data frames (equivalent to data sets):
##        name y x1 x2
## 1 Student 1 1  2 -3
## 2 Student 2 1  4  4
## 3 Student 3 3  1 -2
## 4 Student 4 4  8  0
## 5 Student 5 7 19  4
## 6 Student 6 2 11 20
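The command that created this data frame is not shown above. One way to build an identical data frame, assuming the object is called mydata (the name used later in this tutorial) and entering the values by hand:

```r
mydata <- data.frame(
  name = paste("Student", 1:6),
  y    = c(1, 1, 3, 4, 7, 2),
  x1   = c(2, 4, 1, 8, 19, 11),
  x2   = c(-3, 4, -2, 0, 4, 20)
)
mydata
str(mydata)   # variable types and the first few values
```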
You can then use a variety of plotting commands (see below for more) to visualize your draws:
• Density plots:
[Figure: density plot of the draws.]
• Histograms:
[Figure: “Histogram of draws”; x-axis: draws, y-axis: Frequency.]
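The code that generated the draws is not shown above. As a sketch, assuming draws holds 1,000 random draws from a normal distribution (the mean of 10 and standard deviation of 10 are guesses chosen to roughly match the range of the plots, not values from the original):

```r
set.seed(123)                                # make the draws reproducible
draws <- rnorm(n = 1000, mean = 10, sd = 10)

plot(density(draws))   # density plot
hist(draws)            # histogram; the default title is "Histogram of draws"
```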
## [1] 5
> mydata$x1
## [1] 2 4 1 8 19 11
> mydata$names
## NULL
## [1] 1 3 5
> mat1[1, ]
## [1] 1 2
## [1] 2 4 1 8 19 11
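The indexing commands above can be collected in one place. This sketch re-creates mat1 and, as an assumption, a data frame mydata with the values printed earlier, so that it runs on its own:

```r
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
mydata <- data.frame(name = paste("Student", 1:6),
                     y  = c(1, 1, 3, 4, 7, 2),
                     x1 = c(2, 4, 1, 8, 19, 11),
                     x2 = c(-3, 4, -2, 0, 4, 20))

mat1[1, ]        # first row of the matrix
mat1[, 1]        # first column of the matrix
mat1[2, 2]       # single element: row 2, column 2

mydata$x1        # a variable, selected by name with $
mydata[, "x1"]   # the same variable, selected by column name
mydata[2, ]      # second row (observation) of the data frame
mydata$names     # NULL: the variable is called name, not names
```

Square brackets always take the row index first and the column index second; leaving one of them empty selects everything along that dimension.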
> library(foreign)
Note that for each command, many options (in R language: arguments) are available; you will most
likely need to work with these options at some time, for instance when your source dataset (e.g., in
Stata) has value labels. Check the help files for the respective command in that case.
• Tables: If you have a text file with a simple tab-delimited table, where the first line designates
variable names:
> mydata.table <- read.table("http://www.jkarreth.net/files/data.txt",
+ header = TRUE)
> head(mydata.table)
##            y         x1         x2
## 1 -0.1629267  1.6535472  0.3001316
## 2  1.3985720  1.4152763 -0.9544489
## 3  0.8983962  0.4199516 -0.4580181
## 4 -1.6484948  0.7212208  0.9356037
## 5  0.2285570 -1.1969352 -1.1368931
• CSV files: If you have a text file with a simple comma-separated table, where the first line
designates variable names:
> mydata.csv <- read.csv("http://www.jkarreth.net/files/data.csv",
+ header = TRUE)
> head(mydata.csv)
##            y         x1         x2
## 1 -0.1629267  1.6535472  0.3001316
## 2  1.3985720  1.4152763 -0.9544489
## 3  0.8983962  0.4199516 -0.4580181
## 4 -1.6484948  0.7212208  0.9356037
## 5  0.2285570 -1.1969352 -1.1368931
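To try read.csv() without an internet connection, you can write a small data frame to a temporary file and read it back. This is a sketch; the data frame toy and its values are made up for illustration.

```r
# A small illustrative data frame (values are made up).
toy <- data.frame(y = c(1.5, 2.0, 3.5), x1 = c(0.5, 1.5, 2.5))

# Write it to a temporary CSV file without row names.
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_file, row.names = FALSE)

# Read it back; the first line is used for the variable names.
toy_csv <- read.csv(csv_file, header = TRUE)
head(toy_csv)

# Compare the original and the re-imported data.
all.equal(toy, toy_csv)
```

Setting row.names = FALSE keeps write.csv() from adding an extra, unnamed column of row numbers to the file.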
• SPSS files: If you have an SPSS data file, you can do this:
> # mydata.spss <- read.spss("http://www.jkarreth.net/files/data.sav",
> # use.value.labels = TRUE)
• Stata files: If you have a Stata data file, you can do this:
> mydata.dta <- read.dta("http://www.jkarreth.net/files/data.dta",
+ convert.dates = TRUE, convert.factors = TRUE)
> head(mydata.dta)
##            y         x1         x2
## 1 -0.1629267  1.6535472  0.3001316
## 2  1.3985720  1.4152763 -0.9544489
## 3  0.8983962  0.4199516 -0.4580181
## 4 -1.6484948  0.7212208  0.9356037
## 5  0.2285570 -1.1969352 -1.1368931
Describing data
To obtain descriptive statistics of a dataset, or a variable, use the summary command:
> summary(mydata.dta)
##        y                 x1                x2
##  Min.   :-1.6485   Min.   :-1.1969   Min.   :-1.1369
##  1st Qu.:-0.1629   1st Qu.: 0.4200   1st Qu.:-0.9544
##  Median : 0.2286   Median : 0.7212   Median :-0.4580
##  Mean   : 0.1428   Mean   : 0.6026   Mean   :-0.2627
##  3rd Qu.: 0.8984   3rd Qu.: 1.4153   3rd Qu.: 0.3001
##  Max.   : 1.3986   Max.   : 1.6535   Max.   : 0.9356
> summary(mydata$y)
You can access particular quantities, such as standard deviations and quantiles (in this case the 5th
and 95th percentiles), with the respective functions:
> sd(mydata$y)
## [1] 2.280351
> quantile(mydata$y, probs = c(0.05, 0.95))
##   5%  95%
## 1.00 6.25
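As a check on what sd() and quantile() return, the following sketch recomputes the standard deviation by hand. It assumes y holds the six values of mydata$y printed earlier; the name sd_by_hand is illustrative.

```r
y <- c(1, 1, 3, 4, 7, 2)

# Sample standard deviation from its definition (denominator n - 1).
sd_by_hand <- sqrt(sum((y - mean(y))^2) / (length(y) - 1))
sd_by_hand
sd(y)            # same value: 2.280351

# 5th and 95th percentiles.
quantile(y, probs = c(0.05, 0.95))
```

Note that sd() divides by n - 1 rather than n, so it returns the sample standard deviation.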
4 Creating figures
R offers several options to create figures. We will work with the so-called “base graphics”, mostly
using the plot() function, and the ggplot2 package.
[Figure: density plot titled “density.default(x = dist1)”; y-axis: Density.]
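The base-graphics code behind a density plot like the one above is not shown at this point. A minimal sketch, assuming dist1 holds 1,000 draws from a standard normal distribution as defined later in this section; the title, axis label, and reference line are illustrative choices:

```r
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)

# Density plot with a custom title and axis label.
plot(density(dist1),
     main = "Density of dist1",
     xlab = "Value")

# Add a dashed vertical line at the sample mean.
abline(v = mean(dist1), lty = 2)
```

Without the main argument, plot() uses the call itself, density.default(x = dist1), as the title.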
> library(ggplot2)
> library(reshape2)  # assumed source of melt()
> dist2 <- rnorm(1000, mean = 0, sd = 2)
> dist.df <- data.frame(dist1, dist2)
> dist.df <- melt(dist.df)  # reshape to long format: one row per draw
> normal.plot <- ggplot(data = dist.df,
+                       aes(x = value, colour = variable, fill = variable))
> normal.plot <- normal.plot + geom_density(alpha = 0.5)
> normal.plot
[Figure: overlaid density plots of dist1 and dist2; x-axis: value, y-axis: density, legend: variable.]
ggplot2 offers plenty of opportunities for customizing plots; we will also encounter these later on
in the workshop. You can also have a look at Winston Chang’s R Graphics Cookbook for plenty of
examples of ggplot2 customization: http://www.cookbook-r.com/Graphs.
> set.seed(123)
> dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
> set.seed(123)
> dist2 <- rnorm(1000, mean = 0, sd = 2)
> pdf("normal_plot.pdf", width = 5, height = 5)
> plot(density(dist1))
> lines(density(dist2), col = "red")
> dev.off()
## pdf
## 2
The code above saves a plot named normal_plot.pdf, 5 × 5 inches in size, to your working directory.
Plots created with ggplot2 are best saved using the ggsave() command:
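As a sketch of that approach, assuming ggplot2 is installed: the data frame is rebuilt here without melt() so that the example is self-contained, and the file name normal_ggplot.pdf is hypothetical.

```r
library(ggplot2)

set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(n = 1000, mean = 0, sd = 2)

# Long-format data frame: one row per draw, labelled by distribution.
dist.df <- data.frame(
  variable = rep(c("dist1", "dist2"), each = 1000),
  value    = c(dist1, dist2)
)

normal.plot <- ggplot(data = dist.df,
                      aes(x = value, colour = variable, fill = variable)) +
  geom_density(alpha = 0.5)

# Save the plot as a 5 x 5 inch PDF in the working directory.
ggsave(filename = "normal_ggplot.pdf", plot = normal.plot,
       width = 5, height = 5, units = "in")
```

Unlike pdf() and dev.off(), ggsave() needs no separate call to close the graphics device, and it infers the file type from the extension of the file name.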