Introduction To R
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses
an extensive catalog of statistical and graphical methods. It includes machine learning algorithms,
linear regression, time series, and statistical inference, to name a few. Most R libraries are written in
R, but for heavy computational tasks, C, C++ and Fortran code is preferred.
R is trusted not only by academics; many large companies also use the R programming language,
including Uber, Google, Airbnb, Facebook and so on.
Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling,
and communicating the results.
Use Of R As A Calculator
Although the major purpose of R is to work with statistical data, it can also serve as
a calculator. It is not as powerful as a full computer algebra system, having little provision
for purely algebraic computations (complex numbers, however, are supported). It does know basic constants like
π, the usual trigonometric functions and inverses (with angles in radians), as well as the
> pi
[1] 3.141593
> sin(pi/2)
[1] 1
[1] 5
Note. The [1] that occurs at the left of the output is the index of the first item on that
line; technically, the output of each calculation is a vector (an ordered list of values).
That does not mean much in our simple examples, because each output has only one item.
If the output had enough items to wrap to the next line, subsequent lines would begin with
(higher) numbers indicating how far through the vector we had gone. That is, a line that
starts with the sixth item would be preceded by [6], and so on.
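A quick way to see the bracketed index at work is to print a vector long enough to wrap. The sketch below uses the sequence 101:130 purely for illustration; where the lines break depends on the width of the console, so no output is shown.

```r
# A vector of 30 values; printing it wraps across several lines,
# and each line starts with the index of its first item.
x <- 101:130
print(x)

# Narrowing the console width forces more line breaks (and more indices).
old <- options(width = 40)
print(x)
options(old)  # restore the previous width
```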
R has an extensive help system containing information (with examples) about all the
Example. We learn from the help system how to do logarithms in other bases.
> help(log)
[1] 3
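As a minimal sketch of what the help page describes, log() accepts an optional base argument. The particular numbers below are chosen only for illustration.

```r
# log() takes an optional base argument; the default is the natural logarithm.
log(8, base = 2)     # logarithm of 8 in base 2, which is 3
log(100, base = 10)  # logarithm of 100 in base 10
log(exp(1))          # natural logarithm of e
```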
Variables in R are created with the assignment operator, which historically has been the
left arrow (<-). Modern versions of R also accept equals (=) for assignment, and you may
> a = 5
> a^2
[1] 25
> b <- a
> b^2
[1] 25
A list of currently defined variables can be generated with ls(), and variables may be
removed with the rm() function.
> a = 5
> ls()
[1] "a"
> rm( a )
> a
Error: object 'a' not found
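The same steps can be checked without triggering an error by asking R whether a name is defined. This is a small sketch; the variable names are only illustrative.

```r
a <- 5        # assign with the arrow
b = a^2       # "=" also works for assignment at the top level
exists("a")   # TRUE: a is defined
rm(a)         # remove a
exists("a")   # FALSE: a is gone, so typing a would now raise an error
b             # b keeps its value, 25
```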
Whenever we use one of these functions, R calculates the natural logarithm if we don’t
specify any base.
For the logarithms with bases 2 and 10, we can use the convenience
functions log2() and log10().
We carry out the inverse operation of log() by using exp(). This last function raises e to
the power given in parentheses, like this:
> x <- log(1:3)
> exp(x)
[1] 1 2 3
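The convenience functions and the round trip through exp() can be tried together. This is a sketch with illustrative inputs; all.equal() is used for the comparison because floating-point results may differ from whole numbers by a tiny rounding error.

```r
x <- c(1, 10, 100, 1000)
log10(x)               # base-10 logarithms of powers of ten
log2(c(1, 2, 4, 8))    # base-2 logarithms of powers of two

# exp() undoes log(), up to floating-point rounding
y <- exp(log(1:3))
all.equal(y, c(1, 2, 3))
```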
SCIENTIFIC NOTATION IN R
Scientific notation allows us to represent a very large or very small number in a
convenient way. The number is presented as a decimal and an exponent, separated
by e. We get the number by multiplying the decimal by 10 to the power of the exponent.
The number 13,300, for example, also can be written as 1.33 × 10^4, which is 1.33e4 in
R:
> 1.33e4
[1] 13300
> 4.12e-2
[1] 0.0412
R doesn’t use scientific notation just to represent very large or very small numbers; it
also understands scientific notation when we write it. We can use numbers written in
scientific notation as though they were regular numbers, like so:
> 1.2e6 / 2e3
[1] 600
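To see both directions of the conversion, the sketch below compares a number written in scientific notation with its ordinary form and then asks format() to display it in scientific notation. The variable name big is only illustrative.

```r
big <- 1.33e4
big == 13300                       # TRUE: same number, different notation
format(big, scientific = TRUE)     # "1.33e+04"
format(0.0412, scientific = TRUE)  # "4.12e-02"
1.2e6 / 2e3                        # 600
```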
the parentheses, R will return the definition of the function rather than evaluating it.
R works naturally with array variables, since data commonly occur in lists. The two
most typical ways to create arrays in R are via the c() concatenation function and the scan() function.
> a = c( 2,3,5,7,11 )
> a
[1] 2 3 5 7 11
The scan() function is usually more convenient for longer sets of data that can be
cumbersome to enter using the concatenation function. Data may be scanned from either
the keyboard or from a file. Individual data values should be separated by white space (by
default), either on the same line or on adjacent lines. It is possible to specify a different
delimiter for data that exists in other formats; check the help. The end of data is indicated
by a blank line (when reading from the keyboard).
Note. R will prompt with the position of the next item to read. If five items have already
been typed, the prompt is changed to 6: to indicate that the sixth item is next.
Example We read in the first ten prime numbers, as they are typed from the key-
1: 2 3 5 7 11
6: 13 17 19 23 29
11:
Read 10 items
Read 10 items
> primes1
[1] 2 3 5 7 11 13 17 19 23 29
> primes2
[1] 2 3 5 7 11 13 17 19 23 29
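For a version that can be run without typing at the prompt, scan() also accepts a character string through its text argument. The sketch below reads the same ten primes that way and shows a different delimiter; the variable name primes is only illustrative.

```r
# scan() can read from a character string via its text argument;
# values are separated by white space by default.
primes <- scan(text = "2 3 5 7 11 13 17 19 23 29")
primes
length(primes)  # 10

# A different delimiter can be given with sep.
scan(text = "2,3,5,7", sep = ",")
```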
subtraction this is exactly the way mathematical arrays work. Unlike mathematical arrays,
R arrays can sometimes be combined even when they are not the same size. For example,
an array of length 3 can be added to an array of length 6 (and the answer is an array of
length 6). To make the process work, R will expand the shorter array by reusing entries
that start from the left; the array [1, 2, 3] would be expanded to [1, 2, 3, 1, 2, 3], for example.
> a = c(1,2,3)
> b = c(5,5,5,5,5,5)
> a^2
[1] 1 4 9
> 4+a
[1] 5 6 7
> a+b
[1] 6 7 8 6 7 8
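A further sketch of the reuse rule, with values chosen so that each recycled entry is easy to spot. The last case, where the longer length is not a multiple of the shorter one, still works but makes R issue a warning (suppressed here so the example runs quietly).

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30, 40, 50, 60)
a + b    # a is reused as 1 2 3 1 2 3, giving 11 22 33 41 52 63
a * 2    # a single number is recycled too: 2 4 6

# Lengths 5 and 2: the shorter array is reused as 10 20 10 20 10.
res <- suppressWarnings(c(1, 2, 3, 4, 5) + c(10, 20))
res      # 11 22 13 24 15
```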
1: 81 81 96 77
5: 95 98 73 83
9: 92 79 82 93
13: 80 86 89 60
17: 79 62 74 60
21:
Read 20 items
> range(scores)
[1] 60 98
> median(scores)
[1] 81
> mean(scores)
[1] 81
> sd(scores)
[1] 11.3555
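The same summary can be reproduced without scan() by entering the twenty scores with c(). This is a sketch of the calculation above; summary() is added as one more built-in that reports several of these values at once.

```r
scores <- c(81, 81, 96, 77, 95, 98, 73, 83, 92, 79,
            82, 93, 80, 86, 89, 60, 79, 62, 74, 60)
range(scores)    # smallest and largest values: 60 98
median(scores)   # 81
mean(scores)     # 81
sd(scores)       # about 11.3555
summary(scores)  # minimum, quartiles, mean and maximum together
```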
There are a number of plots and charts in R for presenting or exploring data. For
example, we might wonder if a set of exam scores is normally distributed. A stem and leaf
Example. We explore a group of exam scores to learn about the shape of the distribution.
1: 81 81 96 77
5: 95 98 73 83
9: 92 79 82 93
13: 80 86 89 60
17: 79 62 74 60
21:
Read 20 items
> stem(scores)
6 | 002
7 | 34799
8 | 0112369
9 | 23568
6 | 002
6 |
7 | 34
7 | 799
8 | 01123
8 | 69
9 | 23
9 | 568
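The second display splits each stem into two lines. A minimal sketch of how such a display can be requested, assuming it comes from the scale argument of stem() (the exact command is not shown in the text):

```r
scores <- c(81, 81, 96, 77, 95, 98, 73, 83, 92, 79,
            82, 93, 80, 86, 89, 60, 79, 62, 74, 60)
stem(scores)             # default display
stem(scores, scale = 2)  # scale = 2 makes the plot roughly twice as long
```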
or remove() commands. To remove objects we can simply list them in the parentheses of the
command:
rm(list)
remove(list)
We can type the names of the objects separated by commas. For example:
ls() command to produce a list, which will then be deleted. We need to include the instruction
Here the ls() command is used to search for objects beginning with “b” and remove them.
WARNING: Use the rm() command with caution; R doesn’t give you a warning
before it removes the data you indicate, it simply removes it after receiving the
command.
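A minimal sketch of pattern-based removal, using the list = instruction of rm() together with the pattern = instruction of ls(). The object names bolt, bead and apple are hypothetical and exist only for this illustration.

```r
bolt <- 1; bead <- 2; apple <- 3   # hypothetical objects for illustration
ls(pattern = "^b")                 # names beginning with "b"
rm(list = ls(pattern = "^b"))      # remove exactly those objects
exists("bolt")                     # FALSE: removed
exists("apple")                    # TRUE: untouched
```

Running ls(pattern = "^b") on its own first, as done here, is a cheap way to check what is about to be deleted.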
Pie chart
If we have data that represents how something is divided up between various categories, the pie
chart is a common graphic choice to illustrate our data. For example, we might have data that shows
sales for various items for a whole year. The pie chart enables us to show how each item contributed
to total sales. Each item is represented by a slice of pie; the bigger the slice, the bigger the
contribution to the total sales. In simple terms, the pie chart takes a series of data, determines the
proportion of each item toward the total, and then represents these as different slices of the pie.
NOTE: The human eye is not really that good at converting angular measurements
(slices of pie) into “real” values, and in many disciplines the pie chart is
falling out of favor. However, the pie chart is still an attractive proposition for
plenty of occasions.
The pie chart is commonly used to display proportional data. We can create pie charts using the
pie() command. In its simplest form, we can use a vector of numeric values to create our plot like
so:
> data11 = c(3,5,7,5,3,2,6,8,5,6,9,8)
> data11
[1] 3 5 7 5 3 2 6 8 5 6 9 8
When we use the pie() command, these values are converted to proportions of the total and then
the angle of the pie slices is determined. If possible, the slices are labeled with the names of the
data.
In the current example we have a simple vector of values with no names, so we must supply them
separately. We can do this in a variety of ways; in this instance we have a vector of character labels:
> data8 = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> data8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
To create a pie chart with labels we use the pie() command in the following manner:
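A minimal sketch of such a call, reusing the data11 and data8 vectors defined earlier and passing the labels through the labels = instruction. The final line only illustrates the proportions that pie() works out internally.

```r
data11 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
data8 <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# One slice per value, labeled with the matching month
pie(data11, labels = data8)

# Share of the total that each slice represents
round(data11 / sum(data11), 3)
```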
We can alter the direction and starting point of the slices using the clockwise = and init.angle =
instructions. By default clockwise = FALSE, which draws the slices counterclockwise; we
can set this to TRUE to produce clockwise slices. The starting angle is set to 0° (this is 3 o’clock) by
default when we have clockwise = FALSE. The starting angle is set to 90° (12 o’clock) when we
have clockwise = TRUE. To start the slices from a different point, we simply give the starting
angle in degrees; these may also be negative, with -90 being equivalent to 270°.
The default colors used are a range of six pastel colors; these are recycled as necessary. We can
specify a range of colors to use with the col = instruction. One way to do this is to make a list of
color names. In the following example we make a list of gray colors and then use these for our
charted colors:
> pie(data11, labels = data8, col = pc, clockwise = TRUE, init.angle = 180)
We can also set the slices to be drawn clockwise and set the starting point to 180°, which is
the 9 o’clock position, as in the pie() call above.
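The pie() call above relies on a color vector named pc. As a minimal sketch, one way to build such a vector is with gray(); the six gray levels chosen here are an assumption for illustration, not values taken from the text.

```r
data11 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
data8 <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Hypothetical palette: six grays from dark (0.3) to light (0.9).
# With 12 slices and 6 colors, the colors are recycled.
pc <- gray(seq(0.3, 0.9, length.out = 6))

pie(data11, labels = data8, col = pc, clockwise = TRUE, init.angle = 180)
```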
Dot chart
An alternative to the pie chart is a Cleveland dot plot. All data that might be presented as a pie
chart could also be presented as a bar chart or a dot plot. We can create Cleveland dot plots
using the dotchart() command. If our data are a simple vector of values then like the pie()
command, we simply give the vector name. To create labels we need to specify them. In the
following example we have a vector of numeric values and a vector of character labels; we met
[1] 3 5 7 5 3 2 6 8 5 6 9 8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
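A minimal sketch of the dot plot for these two vectors, assuming they are the data11 and data8 vectors from the pie chart section. The labels are supplied through the labels = instruction of dotchart().

```r
data11 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
data8 <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# One row per month, with a dot placed at the corresponding value
dotchart(data11, labels = data8)
```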
Bar Charts
The bar chart is suitable for showing data that fall into discrete categories. In “Starting
Out: Working with Objects,” we met the histogram, which is a form of bar chart. In that example
each bar of the graph showed the number of items in a certain range of data values. Bar charts are
widely used because they convey information in a readily understood fashion. They are also flexible
We use the barplot() command to produce bar charts. In this section we see how to create a range
of bar charts, and also have a go at making some for ourselves by following the activity at the end.
> rain
[1] 3 5 7 5 3 2 6 8 5 6 9 8
To make a bar chart we use the barplot() command and specify the vector name in the instruction
like so:
> barplot(rain)
The chart has no axis labels of any kind, but we can add them quite simply. To start with, we can
make names for the bars; we can use the names = instruction to point to a vector of names. The
> rain
[1] 3 5 7 5 3 2 6 8 5 6 9 8
> month
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
In this case we already had a vector of names; if we do not, we could make one or simply specify
> barplot(rain, names = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
If our vector has a names attribute, the barplot() command can read the names directly. In the
following example you set the names() of the rain vector and then use the barplot() command:
[1] 3 5 7 5 3 2 6 8 5 6 9 8
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> rain
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  3   5   7   5   3   2   6   8   5   6   9   8
> barplot(rain)
Now the bars are neatly labeled with the names taken from the data itself.
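The steps above can be sketched end to end as follows. The rain and month vectors are the ones shown in the text; the axis titles in the last call are hypothetical, since the text does not say what units the rain values are in.

```r
rain <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 8)
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Attach the names to the data, then plot; barplot() reads them directly.
names(rain) <- month
barplot(rain)

# Alternatively, pass the names at plotting time (full argument name: names.arg).
# The axis titles here are placeholders.
barplot(unname(rain), names.arg = month, xlab = "Month", ylab = "Rain")
```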
First, let’s make some data: a vector of months, a vector of the number of chickens, and a vector of the number of eggs. That’s
random enough for this purpose.
# make some data
months <- rep(c("jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"), 2)
chickens <- c(1, 2, 3, 3, 3, 4, 5, 4, 3, 4, 2, 2)
eggs <- c(0, 8, 10, 13, 16, 20, 25, 20, 18, 16, 10, 8)
values <- c(chickens, eggs)
type <- c(rep("chickens", 12), rep("eggs", 12))
# stringsAsFactors = TRUE stores months as a factor (the default before R 4.0)
mydata <- data.frame(months, values, stringsAsFactors = TRUE)
If parts of the above code don’t make sense, take a look at my post on using the R functions seq (sequence), rep (repeat), and
cbind (column bind) HERE.
Now let’s load the ggplot2 package.
library(ggplot2)
We want to make a plot with the months as the x-axis and the number of chickens and eggs as the height of the bar. To do this,
we need to make sure we specify stat = “identity”. Here’s the basic code for this plot.
p <- ggplot(mydata, aes(months, values))
p + geom_bar()
Notice that you will get the error shown above, “stat_count() must not be used with a y aesthetic.” We forgot to specify that we
want the height of the column to equal the value for that month. So let’s do it again.
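A minimal sketch of the corrected call, assuming ggplot2 is installed. The requireNamespace() guard is only there so the snippet still runs where the package is missing, and geom_col() is shown as an equivalent shorthand.

```r
months <- rep(c("jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"), 2)
values <- c(1, 2, 3, 3, 3, 4, 5, 4, 3, 4, 2, 2,            # chickens
            0, 8, 10, 13, 16, 20, 25, 20, 18, 16, 10, 8)   # eggs
mydata <- data.frame(months, values)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mydata, aes(months, values))
  # stat = "identity": bar height is the value itself, not a count of rows.
  # Each month has two rows, so their values are stacked in one bar.
  print(p + geom_bar(stat = "identity"))
  # geom_col() is shorthand for geom_bar(stat = "identity")
  print(p + geom_col())
}
```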
This time we get a plot, but it looks fairly ugly, and the months are out of order. In fact the months are in alphabetical order so
let’s fix that first. If we investigate the months, we will see they have ordered levels.
mydata$months
#[1] jan feb mar apr may jun jul aug sep oct nov dec jan feb mar apr may
#[18] jun jul aug sep oct nov dec
#Levels: apr aug dec feb jan jul jun mar may nov oct sep
We can fix the order of this category by changing the factor. Here’s some code that will fix our problem.
mydata$months <- factor(mydata$months,
                        levels = c("jan", "feb", "mar", "apr", "may", "jun",
                                   "jul", "aug", "sep", "oct", "nov", "dec"))
Now if we look at the levels again, we will see that they’re rearranged in the order that we want.
mydata$months
#[1] jan feb mar apr may jun jul aug sep oct nov dec jan feb mar apr may
#[18] jun jul aug sep oct nov dec
#Levels: jan feb mar apr may jun jul aug sep oct nov dec
Okay, let’s make our plot again, this time with the months in the correct order.
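Putting the pieces together, the sketch below rebuilds the data, fixes the order of the month levels, and draws the plot again. It assumes ggplot2 is installed (the guard only keeps the snippet runnable where it is not); month_order is an illustrative helper name.

```r
month_order <- c("jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec")
months <- rep(month_order, 2)
values <- c(1, 2, 3, 3, 3, 4, 5, 4, 3, 4, 2, 2,            # chickens
            0, 8, 10, 13, 16, 20, 25, 20, 18, 16, 10, 8)   # eggs
mydata <- data.frame(months, values)

# Set the factor levels in calendar order instead of alphabetical order
mydata$months <- factor(mydata$months, levels = month_order)
levels(mydata$months)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # The x-axis now follows the order of the factor levels
  print(ggplot(mydata, aes(months, values)) + geom_bar(stat = "identity"))
}
```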
that it is a bit difficult to translate mathematical writing directly to the computer. Humans
instinctively adapt to ambiguity, but software is less flexible. For example, mathematicians
use parentheses for grouping, as in (x−1)(x+ 2). But they also use parentheses to indicate
Another ambiguity that is perhaps a bit more subtle occurs with “equals.” Humans
have little trouble understanding that sometimes we intend equals to assign values, as with
“let x = 2.” At other times, we mean to assert equality; a circle is the set of points (x, y)
such that x^2 + y^2 = 1.
Each computer algebra system addresses the job of translating mathematical syntax into
unambiguous “computer syntax” in its own way. To a first-time software user this can feel
somewhat unintuitive, even quirky, but mastering the language of your favorite CAS is an
immediately notice one idiosyncrasy when you try to use Mathematica as a calculator. The
ENTER key does not run a calculation (it ends a paragraph or makes it possible to enter
multi-line computations). To use Mathematica as a calculator, type the expression you wish
to evaluate and press SHIFT+ENTER. (The special ENTER key on the lower-right corner
In[1]:= 2 + 2
Out[1]= 4
Note: Mathematica assigns line numbers to the input and output, e.g., the “In[1]:=” and
To multiply two numbers, type the numbers with a space between them. Use a caret (^) for
exponentiation. Notice that Mathematica can handle very large numbers easily, even numbers with
hundreds of digits.
Out[2]= 60466176
In[3]:= 3^100
Out[3]= 515377520732011331036461129765621272702107522001
Mathematica. To distinguish them from variables you might create yourself, Mathematica’s
The built-in constants are handled algebraically, but we can request the numerical value
Mathematica also performs matrix calculations. Matrices are entered with braces and
4 5 6
7 8 9
and N =
0 1 0
0 0 1
1 0 0
Note. The ; (semicolon) symbol is used to separate commands, allowing you to perform
more than one calculation on a line. If you end a command with a semicolon, the output
In[15]:= m + n
In[16]:= m . n
Functions in Mathematica
The operator N, which we saw earlier, is actually a function. Mathematica contains
many such built-in functions, and you can usually guess the names of common functions.
Note. In Mathematica, every built-in function name begins with a capital letter. Arguments
In[1]:= Sin[Pi/2]
Out[1]= 1
In[2]:= Binomial[7,2]
Out[2]= 21
In[3]:= FactorInteger[60466176]
Example: The function Prime[n] gives the nth prime number. Using this function,
Out[4]= {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
> 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107,
> 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173,
> 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239,
> 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311,
> 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383,
> 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457,
> 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541}
Example:
Out[5]= 385
You can define your own functions. To create a function f(x), write f[x_] := followed
by the definition of f.
In[8]:= f[Pi/2]
In[9]:= D[f[x],x]
In[10]:= Integrate[f[x],x]
Note. Mathematica does not supply an additive constant (+C) for indefinite integrals.
You can define functions recursively (in terms of previous values), as with the function
below. Notice the use of = for the assignment of initial values in contrast with := for the
definition of the iteration.
In[1]:= f[1] = 1;
In[2]:= f[2] = 1;
To explore this idea, consider the first time we ask Mathematica for the value f[3]. Since
we used :=, the value of 3 will be substituted for n on the right side of the definition, which
will evaluate to the expression f[3] = f[1] + f[2] (which is actually an assignment itself).
Mathematica already knows the values of f[1] and f[2] and consequently sets f[3] = 2
1. f[3] = f[3-2] + f[3-1] (this is the right-hand side of the :=, with 3 substituted
In[5]:= f[3]
The consequence of doing the computation this way is that Mathematica now knows
permanently that f[3] has the value 2 and will never have to evaluate it again (say, when
we ask for f[4] or any other value). This becomes important for larger values, like f[100],
which would evaluate too slowly if we created the function less carefully.
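The same remember-as-you-go idea can be sketched in R, the language used in the first part of these notes. This is an analogue, not the Mathematica code: it assumes base values f(1) = f(2) = 1, which matches the value f[3] = 2 quoted above, and it stores computed values in an environment named memo (a hypothetical name).

```r
# Cache of already computed values, keyed by n as a string
memo <- new.env()
assign("1", 1, envir = memo)   # assumed initial value f(1) = 1
assign("2", 1, envir = memo)   # assumed initial value f(2) = 1

# Defined for whole numbers n >= 1 only
f <- function(n) {
  key <- as.character(n)
  cached <- memo[[key]]
  if (!is.null(cached)) return(cached)   # already known: no recomputation
  value <- f(n - 2) + f(n - 1)           # the recursive definition
  assign(key, value, envir = memo)       # remember it permanently
  value
}

f(3)     # 2, and the cache now holds an entry for 3
f(100)   # fast, because each value is computed only once
```

Without the cache, each call would recompute the same smaller values over and over, and the number of calls would grow exponentially with n.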
Graphs in Mathematica
Mathematica offers many graphing options. We show a few examples here. You can
0 ≤ x ≤ 2.
In[1]:= f[x_] := 4 x + 1;
In[2]:= g[x_] := -x + 4;
In[3]:= h[x_] := 9 x - 8;
, for −2 ≤ x, y ≤ 2.