Introduction To R Programming
Summary of Various Statistical Software Packages
Source: https://sites.google.com/a/nyu.edu/statistical-software-guide/summary
Goals of Today’s Talk
• Provide an overview of the use of R for database
management
– By doing so, we can hopefully flatten R’s learning
curve, allowing us to take advantage of its “very
strong” data manipulation capabilities
• Provide an overview of the use of R for statistical
analysis
– This includes descriptive analysis (means, standard
deviations, frequencies, etc.) as well as regression
analysis
– R contains a large number of ready-made routines that
we can use to implement the methods we need
Part I
R Basics
[Screenshot: Command Window and Syntax Window]
Programming Language
• R is a largely object-oriented programming
language
– Roughly speaking, this means that data, variables,
vectors, matrices, characters, arrays, etc. are treated
as “objects” of a certain “class” that are created
throughout the analysis and stored by name.
– We then apply “methods” for certain “generic
functions” to these objects
• Case sensitive (like most statistical software
packages), so be careful
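As a rough illustration of objects, classes, generic functions, and case sensitivity, here is a minimal sketch. The object names ages and region are hypothetical and are not taken from the talk.

```r
# Hypothetical example objects (names are illustrative only)
ages   <- c(55, 63, 65, 69, 52)
region <- factor(c("east", "west", "east", "west", "east"))

# Every object is stored by name and has a class
class(ages)
class(region)

# summary() is a generic function: the method used depends on the class
summary(ages)    # numeric method: minimum, quartiles, mean, maximum
summary(region)  # factor method: counts per level

# R is case sensitive: 'ages' exists, but 'Ages' does not
exists("ages")
exists("Ages")
```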
Classes in R
• In R, every object has a class
– For example, character variables are given the
class of factor or character, whereas numeric
variables are numeric or integer
• Classes determine how objects are handled by
generic functions. For example:
– the mean(x) function works for numeric and integer
objects, but returns NA with a warning for factors
or characters, which generally makes sense for
these types of variables
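The behaviour of mean() across classes can be checked with a short sketch. The vectors below are made up for illustration, and suppressWarnings() is used only to keep the output quiet.

```r
# Illustrative vectors of different classes (values are made up)
x_num <- c(1.5, 2.5, 3.5)         # class "numeric"
x_int <- c(1L, 2L, 3L)            # class "integer"
x_chr <- c("low", "mid", "high")  # class "character"
x_fac <- factor(x_chr)            # class "factor"

class(x_num)
class(x_int)
class(x_chr)
class(x_fac)

# mean() works for numeric and integer objects
mean(x_num)
mean(x_int)

# For characters and factors, mean() warns and returns NA
mean_chr <- suppressWarnings(mean(x_chr))
mean_fac <- suppressWarnings(mean(x_fac))
mean_chr
mean_fac
```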
Packages available in R by default
(only some are loaded at startup)
Package    Description
base       Base R functions (and datasets before R 2.0.0).
compiler   R byte code compiler (added in R 2.13.0).
datasets   Base R datasets (added in R 2.0.0).
grDevices  Graphics devices for base and grid graphics (added in R 2.0.0).
graphics   R functions for base graphics.
grid       A rewrite of the graphics layout capabilities, plus some support for interaction.
methods    Formally defined methods and classes for R objects, plus other programming tools, as described in the Green Book.
parallel   Support for parallel computation, including by forking and by sockets, and random-number generation (added in R 2.14.0).
splines    Regression spline functions and classes.
stats      R statistical functions.
stats4     Statistical functions using S4 classes.
tcltk      Interface and language bindings to Tcl/Tk GUI elements.
tools      Tools for package development and administration.
utils      R utility functions.
Source: https://cran.r-project.org/doc/FAQ/R-FAQ.html
For database management, we usually won’t need to load or install any additional packages,
although we might need the "foreign" package (available in R by default, but not initially loaded)
if we’re working with a dataset from another statistical program (SPSS, SAS, Stata, etc.)
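As a hedged sketch of how the foreign package might be used with a Stata file, the example below writes a small made-up data frame to a temporary .dta file and reads it back. The data values and file location are assumptions for illustration only. In practice you would call read.dta() on an existing file.

```r
library(foreign)  # ships with R, but is not loaded at startup

# Made-up data frame standing in for a real dataset
Data1 <- data.frame(Id = 1:3, Age = c(55, 63, 65))

# Write a Stata file to a temporary location, then read it back
dta_file <- tempfile(fileext = ".dta")
write.dta(Data1, dta_file)
Data1_back <- read.dta(dta_file)

str(Data1_back)
```

The foreign package also provides read.spss() for SPSS files.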
Packages in R
• Functions in R are stored in packages
– For example, the function for OLS (lm) is accessed via
the “stats” package, which is available in R by default
– Only when a package is loaded will its contents be
available. The full list of packages is not loaded by
default for computational efficiency
– Some packages in R are not installed (and thus not
loaded) by default, meaning that we have to
install the packages we need beforehand, and
then load them later on
Packages in R (Continued)
• To load a package, type library(packagename)
– Ex: To load the foreign package, I would type library(foreign) before
running any routines that require this package
• To install a package in R:
– Type install.packages("packagename") in the command window
– For example, the package for panel data econometrics in R is plm. So, to
install the plm package, I would type install.packages("plm").
• Note that, although installed, a package will not be loaded by default
(i.e. when opening R). So, you’ll need library(packagename) at the top of
your code (or at least sometime before the package is invoked).
– Some packages draw upon functions in other packages, so those
packages need to be installed as well. install.packages("packagename")
automatically installs these dependencies
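The difference between installed and loaded packages can be inspected directly. The following is a minimal sketch. The install step is left commented out because it needs an internet connection, and the install-if-missing pattern is one common convention rather than something prescribed in the talk.

```r
# Packages currently attached to the session
search()

# stats is attached at startup, so lm() is available immediately
stats_attached <- "package:stats" %in% search()
stats_attached

# Check whether a package is installed without attaching it
foreign_installed <- requireNamespace("foreign", quietly = TRUE)
foreign_installed

# Install only if missing, then load (requires internet, so not run here)
# if (!requireNamespace("plm", quietly = TRUE)) install.packages("plm")
# library(plm)
```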
Some Basic Operations in R
• Q: If x = 5, and y = 10, and z = x + y, what is the value of z?
• Let’s get R to do this for us:
• Note that ‘+’, ‘-’, ‘/’, ‘*’, ‘^’ perform element-wise
calculations when applied to vectors. So, vectors should be
the same length; otherwise R recycles the shorter vector.
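The question above, and the element-wise behaviour of the arithmetic operators, can be sketched as follows. The vectors a and b are made-up examples.

```r
# Scalars: assign values and add them
x <- 5
y <- 10
z <- x + y
z        # 15

# Vectors of equal length: operators work element by element
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b    # 11 22 33
b - a    # 9 18 27
a * b    # 10 40 90
b / a    # 10 10 10
a ^ 2    # 1 4 9
```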
Working with Matrices in R
• A matrix with typical element (i,j) takes the following form:
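As a small illustration of element (i,j) indexing, the sketch below builds a 2 x 3 matrix with made-up values and extracts single elements, rows, and columns.

```r
# A 2 x 3 matrix; matrix() fills column by column by default
A <- matrix(1:6, nrow = 2, ncol = 3)
A

dim(A)    # 2 3

# Element (i, j) is accessed as A[i, j]
A[2, 3]   # 6

# Leaving an index blank selects the whole row or column
A[1, ]    # 1 3 5
A[, 2]    # 3 4
```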
Merging Datasets
Our first dataset contains some data on age and income, but not health care
costs to the public system. Dataset 2 contains this data, but was not initially
available to us. It also doesn’t have age or income.
The common element between the two datasets is “Id”, which uniquely identifies
the same individuals across the two datasets.
Note that, for some reason, individual 5 does not have a reported health care
cost
Merging Datasets (Continued)
• Command: merge
– Package: base
• For our example:
– Datam = merge(Data1, Data2, by="Id", all=TRUE)
– by="Id": unique identifier across datasets
– all=TRUE: optional, but default is FALSE, meaning those who can’t be
matched will be excluded
– Resulting Dataset: Datam
Id  Age  Income    Health Care Cost
1   55   49841.65  188.1965
2   63   46884.78  172.2420
3   65   45550.87  102.8355
4   69   26254.15  150.2247
5   52   22044.73  NA
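The merge can be reproduced with a short sketch using the values from the table above. Two things are assumptions here: the column name HealthCareCost, and that individual 5 is simply absent from Data2 rather than present with a missing value.

```r
# Dataset 1: age and income for all five individuals
Data1 <- data.frame(
  Id     = 1:5,
  Age    = c(55, 63, 65, 69, 52),
  Income = c(49841.65, 46884.78, 45550.87, 26254.15, 22044.73)
)

# Dataset 2: health care costs (assumed to have no row for Id 5)
Data2 <- data.frame(
  Id             = 1:4,
  HealthCareCost = c(188.1965, 172.2420, 102.8355, 150.2247)
)

# all = TRUE keeps unmatched individuals and fills the gaps with NA
Datam <- merge(Data1, Data2, by = "Id", all = TRUE)
Datam

# Default (all = FALSE) drops individuals who cannot be matched
Datam_matched <- merge(Data1, Data2, by = "Id")
nrow(Datam_matched)
```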
Part II
• Command: lm
– Package: stats
– Formula: lm(formula, data, subset, weights, na.action, method = "qr", model =
TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)
Examples:
• summary(ols.costdata)$coefficients is a matrix, so we can get the information
we need through column extractions:
– Beta coefficients: summary(ols.costdata)$coefficients[,1]
– Standard errors: summary(ols.costdata)$coefficients[,2]
– T-value: summary(ols.costdata)$coefficients[,3]
– P-value: summary(ols.costdata)$coefficients[,4]
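The column extractions can be tried on simulated data. In this sketch the data frame costdata, its variables, and the coefficient values used to generate it are all made up. Only the object name ols.costdata follows the slides.

```r
set.seed(123)

# Simulated stand-in for the cost data (all values are made up)
n <- 200
costdata <- data.frame(
  age    = round(runif(n, 20, 80)),
  income = rnorm(n, mean = 40000, sd = 10000)
)
costdata$cost <- 50 + 1.5 * costdata$age + 0.001 * costdata$income +
  rnorm(n, sd = 20)

# Fit OLS and pull out the coefficient matrix
ols.costdata <- lm(cost ~ age + income, data = costdata)
coef_table <- summary(ols.costdata)$coefficients
colnames(coef_table)

coef_table[, 1]  # beta coefficients
coef_table[, 2]  # standard errors
coef_table[, 3]  # t-values
coef_table[, 4]  # p-values
```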
Residuals vs Fitted Values
• For a Residuals vs Fitted Values (RVFV) plot, use the generic plot() function on
the regression object. The first plot is the RVFV plot
• Formula: plot(ols.costdata, 1)
– The other 5 plots are: Normal Q-Q, Scale-Location, Cook’s distance, Residuals vs
Leverage, and Cook’s distance vs Leverage
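A minimal sketch of producing the RVFV plot and saving it to a file is shown below. The data are simulated, and the pdf() device with a temporary file is an assumption chosen for portability. Other devices such as jpeg() or png() could be substituted.

```r
set.seed(1)

# Simulated data and a simple regression (values are made up)
sim <- data.frame(x = rnorm(100))
sim$y <- 1 + 2 * sim$x + rnorm(100)
fit <- lm(y ~ x, data = sim)

# Open a graphics device, draw plot number 1 (RVFV), then close the device
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(fit, which = 1)
invisible(dev.off())

file.exists(plot_file)
```

The which argument accepts values from 1 to 6, so plot(fit, which = 2) would draw the Q-Q plot instead.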
Models for Binary Outcomes
• R does not come with separate functions for each binary outcome model. Instead, it
uses a unifying framework of generalized linear models (GLMs) and a
single fitting function, glm() (Kleiber & Zeileis, 2008)
Package: stats
Formula: glm(formula, family = gaussian, data, weights, subset, na.action,
start = NULL, etastart, mustart, offset, control = list(...), model = TRUE,
method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)
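For a binary outcome, the family argument selects the model. The sketch below fits a logit and a probit to simulated data. The variable names and the data-generating values are made up for illustration.

```r
set.seed(42)

# Simulated binary outcome (all values are made up)
n <- 500
d <- data.frame(age = runif(n, 20, 80))
d$highcost <- rbinom(n, size = 1, prob = plogis(-3 + 0.05 * d$age))

# Same function, different link: logit and probit
logit_fit  <- glm(highcost ~ age, family = binomial(link = "logit"),  data = d)
probit_fit <- glm(highcost ~ age, family = binomial(link = "probit"), data = d)

summary(logit_fit)$coefficients
summary(probit_fit)$coefficients

# Predicted probabilities from the logit model
p_hat <- predict(logit_fit, type = "response")
head(p_hat)
```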
Package: AER
Formula: ivreg(formula, instruments, data, subset, na.action,
weights, offset, contrasts = NULL, model = TRUE, y = TRUE, x =
FALSE, ...)
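Because AER is not installed by default, the sketch below does not call ivreg(). Instead it illustrates the underlying idea with a manual two-stage least squares using lm() on simulated data. All variable names and values are made up. The manual second stage gives the 2SLS point estimate but not the correct standard errors, which is one reason to prefer ivreg() in practice.

```r
set.seed(2024)

# Simulated data with an endogenous regressor x and an instrument z
n <- 5000
z <- rnorm(n)                   # instrument
u <- rnorm(n)                   # unobserved confounder
x <- 0.8 * z + u + rnorm(n)     # x is correlated with u
y <- 1 + 2 * x + u + rnorm(n)   # true coefficient on x is 2

# Naive OLS is biased because x is correlated with the error term
ols_fit <- lm(y ~ x)
b_ols <- coef(ols_fit)[["x"]]

# Manual 2SLS: regress x on z, then y on the fitted values of x
first_stage  <- lm(x ~ z)
x_hat        <- fitted(first_stage)
second_stage <- lm(y ~ x_hat)
b_2sls <- coef(second_stage)[["x_hat"]]

c(OLS = b_ols, TSLS = b_2sls)

# With AER installed, the equivalent call would be:
# library(AER)
# ivreg(y ~ x | z)
```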
Source: https://www.tidyverse.org
• doParallel
– “parallel backend” for the “foreach” package
– provides a mechanism needed to execute foreach loops in parallel
– https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
Example: Monte Carlo Experiment
Example: Monte Carlo Experiment (Continued)
[Screenshot: Syntax Window and Command/Results Window]
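The code from the original slides is not reproduced here, so the following is only a sketch of what a simple Monte Carlo experiment might look like: it repeatedly simulates data and records the OLS slope estimate. The sample sizes, true parameter, and function name one_rep are all assumptions. It runs serially in base R, with the foreach/doParallel version shown as a comment.

```r
set.seed(99)

n_reps    <- 500   # number of Monte Carlo replications (assumed)
n_obs     <- 100   # observations per replication (assumed)
true_beta <- 2     # true slope used to generate the data (assumed)

# One replication: simulate data, fit OLS, return the slope estimate
one_rep <- function(i) {
  x <- rnorm(n_obs)
  y <- 1 + true_beta * x + rnorm(n_obs)
  coef(lm(y ~ x))[["x"]]
}

# Serial version using base R
beta_hats <- vapply(seq_len(n_reps), one_rep, numeric(1))

mean(beta_hats)  # should be close to true_beta
sd(beta_hats)    # Monte Carlo standard deviation of the estimator

# Parallel version with doParallel (requires the package to be installed):
# library(doParallel)
# registerDoParallel(cores = 2)
# beta_hats <- foreach(i = seq_len(n_reps), .combine = c) %dopar% one_rep(i)
```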
R Markdown
(In RStudio)
What is R Markdown?
From R Markdown website:
“R Markdown provides an authoring framework for data science. You
can use a single R Markdown file to both
• save and execute code
• generate high quality reports that can be shared with an audience”
Source: https://rmarkdown.rstudio.com/lesson-1.html
# Regression Results
```{r regresults}
load("cost.data.results.RData")
knitr::kable(cost.data.results)
```
# Plot
```{r plot}
knitr::include_graphics("RVFV.jpg")
```
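The chunks above assume that cost.data.results.RData already exists. One way such a file might be created is sketched below with simulated data. The regression, the variable names, and the temporary directory are assumptions. Only the object and file names follow the slides.

```r
set.seed(7)

# Simulated regression (values are made up)
sim <- data.frame(age = runif(100, 20, 80))
sim$cost <- 50 + 1.5 * sim$age + rnorm(100, sd = 20)
fit <- lm(cost ~ age, data = sim)

# Store the coefficient table under the name the R Markdown chunk expects
cost.data.results <- summary(fit)$coefficients

# Save it to an .RData file (a temporary directory is used here)
results_file <- file.path(tempdir(), "cost.data.results.RData")
save(cost.data.results, file = results_file)

# load() restores the object under its original name
rm(cost.data.results)
loaded_names <- load(results_file)
loaded_names
```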
Conclusions
• R has extremely powerful database management
capabilities
– Is fully capable of performing the same sort of tasks as
commercial software programs
– Can be enhanced through the tidyverse collection of packages
for a more user-friendly experience
• R is very capable of statistical analysis
– Is fully capable of calculating summary statistics and performing
regression analysis right out of the box
– Can install additional packages to perform other sorts of
analysis, depending on the research question of the user
– Performance can be improved by the use of parallel processing
• R, and the additional packages available to enhance the use
of R, are available free of charge
R Resources
R Online Resources
• A list of R packages is contained here:
https://cran.r-project.org/web/packages/available_packages_by_date.html
• By clicking on a particular package, you’ll be
taken to a page with more details, as well as a link
to download the documentation
• Typing help(topic) in R pulls up a brief help file
with syntax and examples, but the online manuals
contain more detail
R Online Resources (Continued)
• UCLA Institute for Digital Research and
Education
– List of topics and R resources (getting started, data
examples, etc.) can be found here:
http://www.ats.ucla.edu/stat/r/
• RStudio Cheatsheets
– https://www.rstudio.com/resources/cheatsheets/
Other R Resources
1. Kleiber, C., & Zeileis, A. (2008). Applied econometrics with
R. Springer Science & Business Media.
• Great reference for the applied researcher wanting to use R for
econometric analysis. Includes R basics, linear regression
model, panel data models, binary outcomes, etc.
2. Jones, A. M., Rice, N., Bago d'Uva, T., & Balia, S.
(2013). Applied health economics. Routledge.
• Excellent reference for applied health economics. Examples
are all performed using Stata, but the foreign package can
read Stata datasets into R.
3. CRAN Task View: Econometrics
• A listing of the statistical models used in econometrics, as well
as the R package(s) needed to perform them. Available at:
https://cran.r-project.org/view=Econometrics
Thanks for Listening
Good luck with R!