Getting Started With R

The document provides instructions for starting an R project, importing data into R from different formats like Excel and Stata, and describing data in R. It recommends writing a preamble, clearing the workspace, setting the working directory, and using comments. It also explains how to access data from R packages, import data from files like CSV and DTA, and view the data structure using commands like View, head and str.



Department of Economics Econ 117

Yale University Data Analysis & Econometrics

Getting Started

1 Starting a Project
Whenever you start a new project in R, it is recommended that you open a new script
file, name it, and write a brief preamble that describes what your code is doing. This is
not only helpful for others reading your code, but also allows you to organize your work
and thoughts and helps you remember the state of your work when you are coming back
to a project that you have not worked on for a while. For example, a preamble could
look like below:

#-----------------------------------------------------------------------#
# Project: Data Task #
# Date: 03/08/16 #
# File: Task.R #
# Author: md #
# #
# Tasks: a) Graph log mortality rates #
# b) Fit linear models to the data by income group #
# c) Plot fitted lines for the age range 35-76 #
#-----------------------------------------------------------------------#

In a second step, you should clear the workspace using the remove command rm() applied
to the list of user-defined objects ls(), and set up your working directory. For example,
if your project directory is a folder named “Task” on the desktop of a Windows machine
or a Mac, you probably want to use one of the two setwd() lines below in your script:

> rm(list = ls()) # clear workspace of user-defined objects

> setwd("C:/WINDOWS/Desktop/Task") # set working directory (Windows); use "/" or "\\" inside R strings
> setwd("/Users/name/Desktop/Task") # set working directory (Mac)

Note that Windows uses “\” to separate folders in file paths, while on Macs file paths
use “/”. Inside an R string, however, a single backslash starts an escape sequence, so a
Windows path has to be written either with forward slashes or with doubled backslashes
(for example "C:\\WINDOWS\\Desktop\\Task"). To get the file path of a document on a Mac,
navigate to the file or folder in the Finder and right-click (or control+click) it. While
the right-click menu is open, hold down the OPTION (alt) key to reveal the “Copy (item
name) as Pathname” option, which replaces the standard Copy option. On a Windows device,
hold down the Shift key, right-click a folder on the right side of the window, and choose
Copy as Path. That puts the full pathname for the folder you right-clicked in the Windows
Clipboard.

Whenever you are telling R to load a file (e.g. a dataset) without specifying the whole
file path, it will assume that the file is located in the working directory. You can view
the current working directory by using the getwd() command.
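
How relative file names are resolved against the working directory can be sketched as follows. The folder is a hypothetical stand-in: a temporary directory named "Task" replaces the desktop folder from the text so that the example leaves nothing behind on your machine.

```r
# Remember the current working directory so it can be restored at the end.
old_wd <- getwd()

# Hypothetical project folder: a temporary directory stands in for "Task".
project_dir <- file.path(tempdir(), "Task")
dir.create(project_dir, showWarnings = FALSE)
setwd(project_dir)

# The working directory is now the project folder.
wd_name <- basename(getwd())
print(wd_name)

# A file name without a path is resolved against the working directory.
writeLines("hello", "note.txt")
found <- file.exists(file.path(project_dir, "note.txt"))
print(found)

# Restore the original working directory.
setwd(old_wd)
```

Building paths with file.path() avoids typing separators by hand, which sidesteps the backslash issue on Windows.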

As you can see from the example, it is possible to write comments in R scripts using the
# sign. R understands that, in a given line of code, everything that follows the # sign is
a comment. Unfortunately, there is no elegant way of writing multi-line comments in R.
You will have to either start every comment line with a # or enclose the text in an
if (FALSE) { } block. The latter only works if the enclosed text is itself valid R
syntax, because R still parses it. Similar to a good preamble, thoughtfully
written comments make it much easier to understand code and allow you to effectively
collaborate. While most of the code that you are required to produce will not be too
difficult to understand, we encourage you to start carefully commenting your code as
early as possible. Once you have set up your script file, the next step is to import your
data into R.
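
The comment styles described above can be illustrated with a short sketch. The variable x is a hypothetical name used only for this example.

```r
# A full-line comment: R ignores everything after the # sign.
x <- 2 + 3   # a trailing comment on a line of code

# Several comment lines in a row
# each need their own # sign.

if (FALSE) {
  # Code inside this block is parsed but never executed,
  # so it must still be valid R.
  x <- 100
}

print(x)
```

Because the condition is FALSE, the assignment inside the block never runs and x keeps its first value.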

2 Importing Data into R

Importing data into R is fairly simple. Besides the many data sets that are available
directly through R packages, you can import data into R from most commonly used data
formats. All import commands create a dataframe. Since dataframe objects require
unique column names, for some formats it is required to specify where these names
should be imported from.

2.1 From an R package

Many datasets are available through R packages. For example, the Ecdat package, which
we will use for illustration, contains more than a hundred data sets for econometric
applications. One of these data sets is Caschool, the California Test Score Data Set, a
cross section of 420 Californian schools that contains information on equipment, teachers
and students. To be able to access this data set we simply have to install the Ecdat
package and load it:

> install.packages("Ecdat",dep=TRUE)
> library(Ecdat)

You can use the data() command to see the set of datasets that are available through
the packages that you have currently loaded. Once you have decided which data set to
use, you can either create a dataframe manually using the assignment operator, or by
using the data() command applied to the data set. For example the following two lines
of code produce the same dataframe object, once called Caschool and once df.Cs.

> data(Caschool)
> df.Cs <- Caschool
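
That both routes lead to the same object can be checked with a data set that ships with R. The sketch below assumes the built-in mtcars data (from the datasets package, not from Ecdat) so that it runs without installing anything; df.cars is a hypothetical name.

```r
# The 'datasets' package is loaded by default in a standard R session.
data(mtcars)          # makes the data frame available under its own name
df.cars <- mtcars     # copy it under a name of your choice

# Both names now refer to identical data frames.
same <- identical(df.cars, mtcars)
print(same)
print(class(df.cars))
```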

2.2 From Stata
The foreign and readstata13 packages allow you to directly import data sets from Stata
using the read.dta() and read.dta13() commands. Since Stata data sets by default save
data using unique variable names, no additional information is required. For Stata 17 or
older .dta files type

> install.packages("readstata13",dep=TRUE)
> library(readstata13)
> data <- read.dta13("C:/WINDOWS/Desktop/Task/data.dta")

For Stata 12 or earlier versions you can also type:

> library(foreign)
> data <- read.dta("C:/WINDOWS/Desktop/Task/data.dta")

Note that you do not have to install the foreign package because it is part of the default
R library.
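
If you do not have a .dta file at hand, a round trip through the foreign package gives a quick way to try the import. This is only a sketch: the data frame toy and the file name toy.dta are hypothetical, and the file is written to a temporary folder.

```r
library(foreign)  # recommended package shipped with standard R installations

# Hypothetical example data with unique variable names.
toy <- data.frame(id = c(1, 2, 3), wage = c(10.5, 12.0, 9.75))
dta_path <- file.path(tempdir(), "toy.dta")

write.dta(toy, dta_path)         # save the data frame in Stata format
toy_back <- read.dta(dta_path)   # import it again

print(names(toy_back))
print(toy_back$wage)
```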

2.3 From a Comma Delimited Text File

If you have a .csv file that uses comma as a separator and has variable names in the first
row, the data can be imported into R using the read.table() command.

> data <- read.table("C:/WINDOWS/Desktop/Task/data.csv", header=TRUE, sep=",")

Note that the command requires you to specify the separator and whether the .csv file
contains variable names in the first row. If the header parameter is not specified, R will
automatically detect variable names only if the first row has one fewer entry than the
number of columns.
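
The import can be tried without an existing file by writing a small .csv first. In this sketch the file contents and the name toy.csv are made up for illustration. It also shows read.csv(), a wrapper around read.table() whose defaults already assume a header row and a comma separator.

```r
# Write a small, hypothetical comma-delimited file to a temporary folder.
csv_path <- file.path(tempdir(), "toy.csv")
writeLines(c("district,score,ratio",
             "A,690.8,17.9",
             "B,661.2,21.5"), csv_path)

# Explicit separator and header, as in the text.
d1 <- read.table(csv_path, header = TRUE, sep = ",")

# read.csv() uses header = TRUE and sep = "," by default.
d2 <- read.csv(csv_path)

print(d1)
print(all.equal(d1, d2))
```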

2.4 From Excel

If your data is contained in an Excel file with variable names in the first row, and you
do not want to export it to a comma delimited file first, you can use the xlsx package.

> install.packages("xlsx",dep=TRUE)
> library(xlsx)
> data <- read.xlsx("C:/WINDOWS/Desktop/Task/data.xlsx", 1)

The second argument, 1, tells R the number of the worksheet (in this example the
first worksheet) that should be imported from the .xlsx document.

2.5 From other formats

If your data is in another format, such as systat, SPSS or SAS, there are also data
import packages available. A simple web search will tell you what package to load and
the syntax of the import command.

3 Describing Data with R
After setting up your work environment and importing your data into R, the first step in
every empirical project (assuming you already know your research question) is to make
efforts to understand your data. This is essential as without a thorough understanding of
your data, it is not possible to fully understand the results that you achieve, irrespective
of the methods that you are using. For example, you need to understand what your
variables measure, how these measurements are coded, what unit of observation they
refer to, and what population the data was sampled from.

Oftentimes most of these questions are answered in code books that are provided with the
data. However, even then taking the time to study your data is highly recommended, as
it improves your intuition for the data and also helps in less obvious ways. For example,
you may notice coding errors that would go undetected otherwise, or detect that certain
entries are missing systematically. Moreover, looking at the empirical distributions of
your variables can give you ideas on how to scale your data in helpful ways, or guide you
towards certain methods in describing and analyzing them.

3.1 Loading your data

As a first step, you should always look at the raw data. This is done easily if your data
is available as a dataframe. To demonstrate how you can use R to learn about new data,
we will look at the California Test Score Data Set, Caschool.
> library(Ecdat)
> data(Caschool)

3.2 Looking at your data

There are different ways to look at raw data in R. In principle, for small dataframes, one
option is to simply print the dataframe by typing
> Caschool
However, this is not recommended for regular sized or large data sets, as the command
tries to print out the whole data frame at once, which is typically not helpful. The
recommended alternative is to make use of R’s integrated data viewer by typing
> View(Caschool)
While the default R viewer is not very beautiful, the RStudio viewer looks quite decent.
If you prefer to work with the command line interface, you can use the head() command
to look at a small subset (the first n rows) of your data set. For example,
> head(Caschool,n=3)
tells R to display the first three rows of the Caschool dataframe. This is often helpful to
get a first impression of the data. A slightly modified version of the output the command
produces is displayed below:

distcod county ... enrltot teachers calwpct mealpct testscr str avginc...
1 75119 Alameda ... 195 10.90 0.5102 2.0408 690.8 17.88991 22.690...
2 61499 Butte ... 240 11.15 15.4167 47.9167 661.2 21.52466 9.824...
3 61549 Butte ... 1550 82.90 55.0323 76.3226 643.6 18.69723 8.978...
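
The same kind of first look can be reproduced with any data frame. The sketch below assumes the built-in mtcars data as a stand-in for Caschool, so it runs without the Ecdat package; tail() and str() are shown as companions to head().

```r
# First and last three rows of the data frame.
first_rows <- head(mtcars, n = 3)
last_rows  <- tail(mtcars, n = 3)
print(first_rows)
print(last_rows)

# Compact overview: one line per column with its type and first values.
str(mtcars)
```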

While looking at the raw data you should consult the code book to understand what
each of the variables means and how it is coded. For datasets that were loaded from an
R package it is often the case that the codebook is available through the documentation
function. For example,

> ?Caschool

gives you access to the codebook for the California Test Score Data Set.

Just like in the matrix object case, it is also possible to access subsets of a dataframe
using square brackets. For example,

> Caschool[1:3,1]
[1] 75119 61499 61549

gives us the first three elements of the first column of the data. Analogously, we can
access components of individual variables/vectors of the dataframe using square brackets
and the $ symbol. For example,

> Caschool$testscr[3]
[1] 643.6

outputs the third element of the testscr variable, that is, the third row of the testscr column
in the dataframe. Most importantly, we can use logical expressions to access subsets of
a dataframe. For example,

> Caschool[Caschool$testscr>705,]
distcod county ... enrltot teachers calwpct mealpct ... testscr
417 69518 Santa Clara ... 3724 208.48 1.0741 1.5038 ... 706.75

tells us that there is only one school with average test scores above 705, and that the
school is listed in row 417 of the dataframe. Take a second to understand the command
above: It asks R to output all rows of the dataframe Caschool that satisfy our logical
expression Caschool$testscr>705.

Moreover, note that we are not restricted to a single logical expression. In fact, we can
combine as many logical expressions as we want using the & (and) and | (or) operators. For
example,

> Caschool[Caschool$testscr>690 & Caschool$testscr<700,1]

asks R to give us all district codes of schools with average test scores in the open interval
(690, 700).
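
The subsetting patterns from this section can be collected in one sketch. It assumes the built-in mtcars data in place of Caschool, with the mpg column playing the role of testscr; all object names are hypothetical.

```r
# Rows 1 to 3 of the first column (mpg), selected by position.
first_three <- mtcars[1:3, 1]

# Third element of a single column, selected with $ and square brackets.
third_mpg <- mtcars$mpg[3]

# All rows that satisfy one logical condition.
efficient <- mtcars[mtcars$mpg > 30, ]

# Two conditions joined with & (and); only the mpg column is returned.
mid_range <- mtcars[mtcars$mpg > 20 & mtcars$mpg < 25, "mpg"]

# Two conditions joined with | (or).
extremes <- mtcars[mtcars$mpg < 11 | mtcars$mpg > 33, ]

print(first_three)
print(rownames(efficient))
print(mid_range)
```

Leaving the column position empty after the comma, as in the third and fifth statements, keeps all columns of the selected rows.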

3.3 Data Types
In a next step, it is important to learn the data types that are contained in the individual
columns of your dataframe. This can be done combining the sapply() and class()
functions. The sapply() command is extremely useful in the context of dataframes,
as it allows us to apply functions to each column vector of our data individually. For
example,

> sapply(Caschool,class)
distcod county district grspan enrltot teachers calwpct mealpct
"integer" "factor" "factor" "factor" "integer" "numeric" "numeric" "numeric"
computer testscr compstu expnstu str avginc elpct readscr
"integer" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
mathscr
"numeric"

tells us the data type of each column of the Caschool data. As you can see,
most variables are numeric (integer is a subclass of numeric). At the same time, the
variables county, district and grspan are categorical variables and therefore stored
as factors.
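
The same check works on any data frame. The sketch below assumes the built-in iris data, which mixes numeric columns with one factor, and also illustrates the remark about integers.

```r
# Class of every column, returned as a named character vector.
classes <- sapply(iris, class)
print(classes)

# Integers report the class "integer" but still count as numeric.
int_class  <- class(1L)
int_is_num <- is.numeric(1L)
print(int_class)
print(int_is_num)
```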

3.4 Dimensions
In a final step before starting to compute descriptive statistics, it is useful to know the
dimensions of your data. This can be done using the dim() command that you already
know from the matrix context. For example,

> dim(Caschool)
[1] 420 17

tells us that the Caschool dataframe has 420 rows (observations) and 17 columns
(variables).
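
The two dimensions can also be queried separately with nrow() and ncol(). A brief sketch, assuming the built-in mtcars data as a stand-in for Caschool:

```r
# Rows and columns in one call.
d <- dim(mtcars)
print(d)

# The same information, one dimension at a time.
n_obs  <- nrow(mtcars)
n_vars <- ncol(mtcars)
print(c(rows = n_obs, columns = n_vars))
```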

3.5 Summary Statistics

The summary() command applied to a dataframe provides descriptive statistics of all
variables included in that dataframe. For numerical variables it computes the minimum
and maximum, the 25th percentile, the median, the 75th percentile and the mean. For
categorical variables it computes the counts in each category. For example,

> summary(Caschool)
distcod county .... mathscr
Min. :61382 Sonoma : 29 Min. :605.4
1st Qu.:64308 Kern : 27 1st Qu.:639.4
Median :67760 Los Angeles: 27 Median :652.5
Mean :67473 Tulare : 24 Mean :653.3
3rd Qu.:70419 San Diego : 21 3rd Qu.:665.9
Max. :75440 Santa Clara: 20 Max. :709.5
(Other) :272

R has many statistical functions. The functions mean(), median(), min(),
max(), sd(), var(), cov() and cor() operate as expected. For example,

> cor(Caschool$testscr,Caschool$avginc)
[1] 0.7124308
> cor(Caschool$testscr,Caschool$str)
[1] -0.2263628

computes the correlation between average test scores and average income in a school
district, and then the correlation between test scores and the student-teacher ratio.
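
A few of these functions applied to single columns can be sketched as follows. The example assumes the built-in mtcars data rather than Caschool; mpg, wt and hp are columns of that data set and the object names are hypothetical.

```r
# Descriptive statistics for one numeric column.
print(summary(mtcars$mpg))
mpg_mean <- mean(mtcars$mpg)
mpg_sd   <- sd(mtcars$mpg)
print(round(c(mean = mpg_mean, sd = mpg_sd), 3))

# Pairwise correlations between two columns.
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt)
r_mpg_hp <- cor(mtcars$mpg, mtcars$hp)
print(round(c(mpg_wt = r_mpg_wt, mpg_hp = r_mpg_hp), 3))
```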

A useful command in the context of descriptive statistics is the sapply() command. As
we have seen in section 3.3 it allows us to apply functions to each column of our data
set. For example,

> round(sapply(Caschool,mean),3)
distcod county district grspan enrltot teachers calwpct ...
67472.810 NA NA NA 2628.793 129.067 13.246 ...
compstu expnstu str avginc elpct readscr mathscr
0.136 5312.408 19.640 15.317 15.768 654.970 653.343

computes the mean of each variable (Can you explain why R outputs missing values for
county, district and grspan?). The round() function also works as expected. The argument 3
tells R to output three decimals.
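
One way to avoid the NA entries is to apply the function only to numeric columns. The sketch below is an illustration under assumptions rather than part of the original exercise: it uses the built-in iris data, whose Species column is a factor, and the names is_num and col_means are hypothetical.

```r
# Logical vector marking which columns are numeric.
is_num <- sapply(iris, is.numeric)
print(is_num)

# Column means for the numeric columns only, rounded to three decimals.
col_means <- round(sapply(iris[, is_num], mean), 3)
print(col_means)
```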
