Introduction To R
General Introduction
R is a software package for statistical computing and data analysis. It is also a programming
language, based on the now ancient S language. What makes R easier to use is that it is an
interpreted language, and not a compiled language like C++. This is important for interactive data
analysis, as you do not have to compile a fully functional “program” in one go, and can run small
lines of code at a time.
R is freely distributed software (www.r-project.org) with contributions from developers from
around the world. It is one of the main software packages for statistical computing. We say that R is also an
“environment” because it contains a lot of pre-programmed functions (packages) that the user can
call and use for various analyses.
Getting Started
As mentioned previously, you can download the software free from www.r-project.org. This will
give you the basic R package with an interactive console. It is highly recommended that you
download RStudio (https://posit.co/downloads/) as well, which gives you a much cleaner and
easier interface to work with R.
If you would like to run RStudio on a tablet or other device, you may want to try the Posit Cloud
(https://posit.cloud/) service. This gives you access to RStudio remotely, which may be useful for
using RStudio in class if you don’t have or don’t want to bring a Windows/Mac laptop to class with
you. The course’s lecture notes will frequently have R code integrated into the notes. You can
download R script files from Canvas, which have all of the R code from the lecture in one file.
This allows you to quickly run the code from the lectures.
There are some minor things we will do in this class that you won’t be able to do in the cloud
version, so I still recommend installing the local, non-cloud version if you have a Windows/Mac
computer. RStudio is also available on all campus computers if you would prefer to work on class
assignments that way too!
For more instructions on downloading and setting up R/RStudio, check Canvas for an
announcement with a tutorial video.
RStudio Layout
When you first open RStudio, I would recommend creating an
R file to type your code into. You can do this by clicking on the
new document icon in the upper left corner and then choosing
an “R script” as shown to the right. Once you do this, your
RStudio will be divided up into four windows: your script
editor, console, environment, and other utilities, identified in
the image on the next page.
Here is an explanation of each of these sections:
• Script editor: This is where you can freely work on typing up your own R code, loading up
someone else’s script file to try their R code, or save your R code to share with others. It’s a
good sandbox to try out typing out R code without needing to run it right away.
• Console: This is where the magic happens. You run all of your R code in the console, and the
text output of your code will display here. If you want to run code in the console, you
typically would find the single line of code in the script editor that you want to run, and click
the “Run” button (shortcut: Ctrl/Cmd+Enter) at the top of your script editor. You can also
type R code directly into the console, but it is difficult to save your code this way.
o Important: when you run code from your script editor, it will only run one line at a
time, based on where your cursor is. You can run multiple lines at once by
highlighting the code you want to run and then clicking run.
• Environment: Any objects or variables you work with will be listed and displayed here.
This helps keep track of any variables you’ve defined, any data sets you’ve loaded into R, or
any statistical models you have built.
• Other tools: The last window is used for a variety of things, including:
o Files: You can use this window to pick a folder to view data files and load them in R.
o Plots: If you run R code that creates a plot, it displays here.
o Packages: Some R functionality is not included in the “base” installation of R, and so
we often install extra packages for more functionality. This can be done here.
o Help: R has pretty good help documentation for most built-in functions, and if you
load help files, they will display here.
Try out some code!
Let’s try out using R by loading up the R script for this section and running some R code!
4+7
13*8
hist(mtcars$mpg)
In RStudio, move your cursor to the first line of code that calculates 4 + 7. After you click on that
line, click the run button at the top of the script editor window, and you’ll see the output in the
console. Your cursor will automatically move to the next line so that you can keep clicking run to
see the output of each successive line of code. The final line of code doesn’t produce output in the
console; instead, it creates a plot (a histogram) of some data built into R.
Variables
An important concept in computer programming is the variable. A variable in computer
science is a name given to some storage location for later reference. In more practical terms, it is a
binding between a symbol and a value. Some helpful tips for using variables:
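As a minimal sketch of what binding a name to a value looks like in R (the variable names total, price, and combined are made up for illustration):

```r
# Bind the value of 4 + 7 to the name `total` (a hypothetical name).
total = 4 + 7

# `<-` is the more common assignment operator in R; here it behaves the same as `=`.
price <- 13 * 8

# Running a variable's name on its own prints the value bound to it.
total
price

# Variables can be reused in later calculations.
combined = total + price
combined
```

After running these lines, total, price, and combined should all appear in the Environment window.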
Once you do that, you can simply load up files into R by clicking on the file in this window,
previewing to make sure it looks good, and then you’re all set!
What kind of files can we load into R? While many types of data can be loaded in, the easiest format
is CSV, or “comma-separated values.” This is a popular format for storing data in a simple,
spreadsheet-like layout. If you already have data in a spreadsheet (Google Sheets, Excel), you can
save it as a CSV file too. While it is possible to load Excel files into R, we typically
avoid this because they are structured and encoded inefficiently for this purpose and thus take
more time to load into R.
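CSV files can also be loaded with code rather than by clicking. The sketch below uses the base R function read.csv; the file contents, column names, and the variable names csv_path and books are made up so that the example runs on its own.

```r
# Write a tiny stand-in CSV to a temporary file so the example is self-contained.
# The column names and values are invented for illustration only.
csv_path = tempfile(fileext = ".csv")
writeLines(c("Title,Price,Binding",
             "Book A,12.50,P",
             "Book B,30.00,H",
             "Book C,8.99,P"),
           csv_path)

# read.csv() reads a comma-separated file into a data frame.
books = read.csv(csv_path)

# Inspect what was loaded: column types and the first few rows.
str(books)
head(books)
```

For a file downloaded from Canvas, the same call would take that file's name in quotes, assuming the file has been saved in R's current working directory.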
Working with Data
Let’s try loading the amazonbooks.csv file from Canvas into RStudio now! In RStudio, clicking on a
data set in the Environment window in the upper right will open a preview window of the data. This preview
window also typically opens when you load the data set. This allows us to easily see the different
variables listed by their names. To access just one column of a data set, use a dollar sign ($)
followed by the name of the column. We can then use this column with many different functions:
amazonbooks$Price #Price vector from books data set
mean(amazonbooks$Price)
sd(amazonbooks$Price)
median(amazonbooks$Price)
summary(amazonbooks$Price)
For the non-numerical variables in our data set, we might be interested in finding proportions. To
do this, it would be helpful to identify how many items there are of each outcome. One way we can
do this is with the table function.
table(amazonbooks$Binding)
We can see that this gives us the total number of hardcover (H) and paperback (P) books
respectively. Because this table output works like a vector, we could directly calculate the
proportion of paperback books using the R code below:
table(amazonbooks$Binding)[2]/length(amazonbooks$Binding)
The “[2]” makes sure we are taking the second count from the above output for the paperback
books, and then the length function will just tell us the total number of books, regardless of binding
type. Dividing these two numbers gets us the proportion we desire!
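The same proportion can be reached in a few other ways. The sketch below uses a short made-up vector called binding in place of amazonbooks$Binding, and shows some alternatives that do not depend on knowing the position of the paperback count.

```r
# Made-up binding codes standing in for amazonbooks$Binding.
binding = c("P", "H", "P", "P", "H")

# Counts per category, as in the text.
counts = table(binding)

# Look up the paperback count by its name instead of by position.
counts["P"] / length(binding)

# prop.table() converts every count to a proportion at once.
prop.table(counts)

# A comparison gives TRUE/FALSE values, and the mean of those is the share of TRUEs.
mean(binding == "P")
```

Looking up the count by name ("P") still gives the right answer if the order of the categories changes, which the positional [2] does not guarantee.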
We might also be interested in the mean price of paperback books only. This
is easily done by creating a new data frame in R that is a subset of the original data frame.
Using the subset function in R, we’ll call this new subsetted data paperbackbooks.
paperbackbooks = subset(amazonbooks, Binding == "P")
This function takes the original data set as the first argument and a condition for
subsetting as the second argument. There are two things to note about the condition. First, two equals
signs are used to check for equality, as one equals sign is already used for defining a variable. Second,
non-numeric outcomes need to be put in quotes (“P”) to distinguish them from the names of
variables. From here, we can compute the mean just as we did before, but with the new data frame.
mean(paperbackbooks$Price)
There are many different types of conditions you can use for subsetting! The condition can be
based on other variables as well. Maybe you want to make an expensive books data set (Price
> 40)? Or maybe you’re interested in big books (Pages > 500)? Old books (Year < 1990)?
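These conditions could be sketched as follows. The data frame books and its values are made up for illustration, and the column names Price, Pages, Year, and Binding are assumed to match those in the course data set.

```r
# Made-up data frame using the column names mentioned in the text.
books = data.frame(
  Price   = c(12.5, 55.0, 8.99, 42.0),
  Pages   = c(320, 780, 150, 510),
  Year    = c(2005, 1987, 2011, 1979),
  Binding = c("P", "H", "P", "H")
)

# One subset per condition from the text.
expensivebooks = subset(books, Price > 40)
bigbooks = subset(books, Pages > 500)
oldbooks = subset(books, Year < 1990)

# Conditions can be combined with & (and) or | (or).
oldhardcovers = subset(books, Year < 1990 & Binding == "H")

# Each subset is a data frame, so the earlier functions work on it too.
mean(expensivebooks$Price)
nrow(oldbooks)
```

Giving each subset its own name keeps the original data frame unchanged, so several subsets can be compared side by side.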
Getting Help
If you don’t remember how to use a particular function, R has extensive documentation for
its built-in functions! To access help on how to use a particular function, put a
question mark before the function’s name and run that in the console. This will open the Help tab in
the bottom right window of RStudio with the appropriate help file for that function.
?subset
?summary
?mean
Additional Practice
Example: A study on bears collected many measurements on their size and other attributes.
This data can be found in the bears.csv file on Canvas.
• Sex
• Age (years)
• Head length (in): Head.L
• Head width (in): Head.W
• Neck girth (in): Neck.G
• Chest girth (in): Chest.G
• Weight (lbs)
Use RStudio to find the following:
• The mean weight of bears with a neck girth of more than 20 inches.