R Programming
19 February, 2015
Hints on programming
1 Save all your commands in a SCRIPT FILE; they will be useful in the future... you never know...
2 Save your script file as often as you can! You sweat a lot writing those instructions; you don’t
want to lose them!
3 Try to give meaningful names to variables and functions (avoid “pippo”, “pluto”, “a”, “b”, etc.)
4 Use comments to define sections in your script and describe what each section does
If you read the code after 2 months you won’t remember what it does unless you read all the instructions again... it’s
not worth spending time rereading code; use COMMENTS instead
5 If you use a value in more than one instruction, avoid code repetition and hard-coded values.
BAD:
sum(a[a>0])
GOOD:
thr <- 0
sum(a[a>thr])
Programming with R
## [1] 8
Testing conditions using combinations of expressions (& |)
a<-2
b<-3
d<-4
# Using & to test two conditions, both true
if(a<b & b<d)
x<-a+b+d
x
## [1] 9
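The other cases can be sketched with the same a, b and d: | needs only one condition to be true, and && / || are the single-value forms usually preferred inside if(). The names y, elementwise and both are added here for illustration only.

```r
a <- 2
b <- 3
d <- 4

# | is TRUE when at least one condition holds (here only a < b is true)
if (a < b | b > d) {
  y <- a + b
}
y

# & and | work element by element on vectors
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
elementwise

# && and || compare single values and stop as soon as the result is known
both <- (a < b) && (b < d)
both
```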
Looping
Looping II
Nested Loops
mat <- matrix(nrow=2,ncol=4)
for (i in 1:2){
for (j in 1:4){
mat[i,j] <- i + j
}
}
mat
## [,1] [,2] [,3] [,4]
## [1,] 2 3 4 5
## [2,] 3 4 5 6
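For a table like this one, a vectorised call can replace the two loops. The sketch below repeats the nested loop and compares it with outer(); mat2 is a name introduced only for the comparison.

```r
# Nested loops, as above
mat <- matrix(nrow = 2, ncol = 4)
for (i in 1:2) {
  for (j in 1:4) {
    mat[i, j] <- i + j
  }
}

# Vectorised alternative: outer() applies "+" to every pair (i, j)
mat2 <- outer(1:2, 1:4, "+")

# Both approaches fill the matrix with the same values
all(mat == mat2)
```

The loop version is easier to extend when the body needs several steps; outer() is shorter when each cell depends only on i and j.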
Vectors II
Indexing
## [1] 8 9
Subsetting using logical operators II
Getting indexes
which(mymat > 7)
## [1] 6 7
which(mymat>7, arr.ind=TRUE)
## row col
## [1,] 3 2
## [2,] 1 3
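The definition of mymat is not shown here, so the sketch below uses a made-up 3 x 3 matrix chosen so that the positions match the output above. It shows the three related forms: the logical matrix, the positions, and the values.

```r
# Hypothetical 3 x 3 matrix; the values are invented for illustration
mymat <- matrix(c(1, 2, 3, 4, 5, 8, 9, 6, 7), nrow = 3)

mymat > 7                                 # logical matrix, TRUE where the condition holds
idx <- which(mymat > 7)                   # positions, counted column by column
pos <- which(mymat > 7, arr.ind = TRUE)   # the same positions as (row, col) pairs
vals <- mymat[mymat > 7]                  # the values themselves

idx
pos
vals
```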
Exercises I
Functions I
## [1] 100
Functions II
Variables defined inside a function will be valid only inside the function
res
## Error in eval(expr, envir, enclos): object 'res' not found
debug(mypow)
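The body of mypow() is not shown, so the version below is only a guess at a function that would behave this way. It illustrates the scoping rule: res is created inside the function and is gone once the call returns.

```r
# Hypothetical definition of mypow(); the original version is not shown
mypow <- function(x, p = 2) {
  res <- x^p   # res exists only while the function runs
  res          # the last evaluated expression is the return value
}

out <- mypow(10)
out

# FALSE, provided no variable called res was created outside the function
res_is_visible <- exists("res", inherits = FALSE)
res_is_visible
```

debug(mypow) flags the function so that the next call runs step by step in the browser; undebug(mypow) removes the flag.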
Data Exploration and summary statistics
The aim is to reduce the amount of information and focus only on the key aspects of the data
Working with data objects
Working with data objects I
## [1] TRUE
is.numeric(bt$Age) ## Check if the mode of the column is numeric
## [1] TRUE
is.character(bt$Gender) ## Check if the mode of the variable Gender is character
## [1] TRUE
Working with data objects II
Exercise II
Probability Distributions in R
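Probability functions:
Every probability distribution in R has 4 functions, each named by a root (e.g. norm for the normal
distribution) and a prefix:
p for “probability”, the cumulative distribution function (c.d.f.)
F(x) = P(X <= x)
q for “quantile”, the inverse of the c.d.f.
d for “density”, the probability density function (p.d.f.)
r for “random”, a random sample drawn from the distribution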
Example:
For the normal distribution we have the functions: pnorm, qnorm, dnorm, rnorm
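As a quick orientation, the four functions can be tried on the standard normal N(0, 1). The value 1.96 and the seed are arbitrary choices made for this sketch.

```r
# The four functions for the standard normal distribution N(0, 1)
p <- pnorm(1.96)     # c.d.f.: P(X <= 1.96)
q <- qnorm(p)        # quantile: inverse of the c.d.f., gives back 1.96
dens <- dnorm(0)     # density at 0, equal to 1 / sqrt(2 * pi)
set.seed(1)          # arbitrary seed, only to make the draw repeatable
smp <- rnorm(5)      # five random draws

p
q
dens
smp
```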
Probability distribution in R
Available functions
Distributions Functions
Student t pt qt dt rt
Check the help (?<function>) for further information on the parameters and the usage of each
function.
The Normal Distribution in R
Cumulative Distribution Function
pnorm(2)
## [1] 0.97725
## P(X<=12), X=N(10,4)
pnorm(12, mean=10, sd=2)
## [1] 0.84134
What is P(X > 19) where X = N(17.4, 375.67)?
[Figure: the standard normal c.d.f. (pnorm) for x from −4 to 4]
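One way to approach the question about P(X > 19) is sketched below. It assumes that the second parameter in N(17.4, 375.67) is the variance, as in the N(10, 4) example where sd=2; the variable names are illustrative.

```r
# Assumption: N(17.4, 375.67) gives the variance, so sd = sqrt(375.67)
sigma <- sqrt(375.67)

# Upper tail as the complement of the c.d.f.
p_upper <- 1 - pnorm(19, mean = 17.4, sd = sigma)

# The same quantity, asking pnorm() for the upper tail directly
p_upper2 <- pnorm(19, mean = 17.4, sd = sigma, lower.tail = FALSE)

p_upper
p_upper2
```

Passing the variance as sd is the most common mistake here, since pnorm() always expects the standard deviation.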
The Normal Distribution in R
The quantiles
qnorm: computes the inverse of the c.d.f. Given a number 0 ≤ p ≤ 1 it returns the p-th quantile
of the distribution.
$p = F(x)$
$x = F^{-1}(p)$
qnorm(0.95)
## [1] 1.6449
## X = F^-1(0.95), N(100,625)
qnorm(0.95, mean=100, sd=25)
## [1] 141.12
[Figure: the pnorm curve for x from −3 to 3, marking p = 0.95 on the vertical axis and qnorm(p) = 1.645 on the horizontal axis]
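The inverse relation between qnorm and pnorm can be checked numerically. The sketch below also shows that a quantile of N(100, 625) is the standard normal quantile shifted by the mean and scaled by the standard deviation; the variable names are illustrative.

```r
p <- 0.95
x <- qnorm(p, mean = 100, sd = 25)

# Applying the c.d.f. to the quantile returns p again
p_back <- pnorm(x, mean = 100, sd = 25)

# Same quantile obtained from the standard normal: mean + sd * z
x_std <- 100 + 25 * qnorm(p)

p_back
x_std
```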
The Normal Distribution in R
The Density Function
dnorm: computes the Probability Density Function (p.d.f.) of the normal distribution.
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
dnorm(0.5)
## [1] 0.35207
## f(-2.5), X = N(-1.5,2)
dnorm(-2.5, mean=-1.5, sd=sqrt(2))
## [1] 0.2197
[Figure: the standard normal density (dnorm) for x from −4 to 4]
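To connect the density formula with dnorm, the sketch below writes the normal density by hand, 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2)), and compares it with the built-in function. normal_density is a hypothetical helper name, not part of R.

```r
# Hand-written normal density (hypothetical helper, for comparison only)
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

by_hand  <- normal_density(-2.5, mu = -1.5, sigma = sqrt(2))
built_in <- dnorm(-2.5, mean = -1.5, sd = sqrt(2))

# TRUE when the two values agree up to numerical tolerance
all.equal(by_hand, built_in)
```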
The Normal Distribution in R
The Random Function
x <- rnorm(1000)
## Extract 1000 samples X = N(100,225)
x <- rnorm(1000, mean=100, sd=15)
xx <- seq(min(x), max(x), length.out=100)
hist(x, probability=TRUE)
lines(xx, dnorm(xx, mean=100, sd=15))
[Figure: density-scale histogram of x with the N(100, 225) density curve overlaid]
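Because rnorm draws random numbers, every run gives a different sample. A minimal sketch of how to make a draw repeatable with set.seed() follows; the seed value 123 is arbitrary and the names x1 and x2 are illustrative.

```r
set.seed(123)                         # arbitrary seed; any integer works
x1 <- rnorm(1000, mean = 100, sd = 15)

set.seed(123)                         # resetting the seed repeats the draw
x2 <- rnorm(1000, mean = 100, sd = 15)

identical(x1, x2)                     # same seed, same sample

# Sample mean and standard deviation are close to 100 and 15, not exactly equal
mean(x1)
sd(x1)
```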
Exercise III
1 Compute the quantiles for p = [0.01, 0.05, 0.1, 0.2, 0.25] given X = N(−2, 8)
2 What is P(X = 1) when X = Bin(25, 0.005)?
3 What is P(13 ≤ X ≤ 22) where X = N (17.46, 375.67)?
Plotting in R
Simple visualization on numeric variables
[Figure: scatter plot of y against x; both axes run from 2 to 10]
Simple visualization on numeric variables
Visualizing two vectors, adding axis labels and changing the plot type
plot(x,y, xlab="X values", ylab="Y values", main="X vs Y", type="b")
[Figure: "X vs Y", points joined by lines (type="b"); x axis "X values" and y axis "Y values", both from 2 to 10]
Additional parameters to graphical functions
plot(x,y)
abline(0,1)
points(2,3, pch=19)
lines(x,y)
text(4,6, labels="Slope=1")
[Figure: scatter plot of y against x with a line of slope 1, an extra filled point at (2,3), the points joined by lines, and the label "Slope=1"]
Barplot
[Figure: bar plot with the y axis running from 0 to 10]
Visualization on Categorical variables
Summarize the count for factors
table(bt$Gender) ## Collect the factor levels and count the occurrences of each
##
## F M
## 51 49
Look at the summarization in a bar plot
barplot(table(bt$Gender),
xlab="Gender", ylab="Frequency", main="Summarize Gender variable")
[Figure: bar plot "Summarize Gender variable"; x axis "Gender" (F, M), y axis "Frequency"]
Histograms
Look at the distribution of the data
[Figure: "Histogram of bt$HeartRate"; x axis bt$HeartRate (60 to 90), y axis Frequency]
[Figure: "Histogram of bt$HeartRate" on the density scale; x axis bt$HeartRate, y axis Density]
[Figure: "Histogram of bt$HeartRate" with a different binning; x axis bt$HeartRate, y axis Frequency]
[Figure: histogram of bt$HeartRate with vertical lines marking the Mean and the Median]
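Histograms of this kind can be produced along the following lines. The bt data are not reproduced here, so the sketch builds a simulated stand-in with a HeartRate column; the numbers are invented and breaks=30 is only one possible choice of binning.

```r
# Stand-in data: simulated values, not the real bt data set
set.seed(1)
bt <- data.frame(HeartRate = round(rnorm(100, mean = 74, sd = 7)))

hist(bt$HeartRate)                       # counts on the y axis
hist(bt$HeartRate, probability = TRUE)   # density scale instead of counts
hist(bt$HeartRate, breaks = 30)          # ask for more, narrower bins

# hist() also returns the bins and counts it has drawn
h <- hist(bt$HeartRate, col = "grey80")
abline(v = mean(bt$HeartRate), lwd = 3)              # solid line at the mean
abline(v = median(bt$HeartRate), lty = 3, lwd = 3)   # dotted line at the median
legend("right", legend = c("Mean", "Median"), lty = c(1, 3))
```

The breaks argument is a suggestion: R picks "pretty" cut points close to the requested number.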
Boxplots
[Figure: horizontal boxplot; axis from 60 to 85]
[Figure: the same horizontal boxplot with the individual observations overlaid as points]
Using factors and formula objects
[Figure: horizontal boxplots by gender (F, M); axis from 60 to 85]
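A formula of the form response ~ group tells boxplot() to draw one box per level of the grouping variable. The sketch below uses simulated stand-in data with Gender and HeartRate columns (invented values, not the real bt data) and overlays the observations with stripchart().

```r
# Stand-in data: simulated values, not the real bt data set
set.seed(2)
bt <- data.frame(
  Gender    = rep(c("F", "M"), each = 50),
  HeartRate = round(rnorm(100, mean = 74, sd = 7))
)

# One horizontal box per gender; the data argument lets us use bare column names
b <- boxplot(HeartRate ~ Gender, data = bt, horizontal = TRUE)
b$names                                   # group labels, in plotting order

# Overlay the individual observations on the existing boxes
stripchart(HeartRate ~ Gender, data = bt, add = TRUE, pch = 19)
```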
Pairs
The pairs() function
It plots all the possible pairwise comparisons in a data.frame
It allows a fast visual data exploration
[Figure: pairs() scatter-plot matrix of Gender, Age, HeartRate and Temperature]
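A call of this kind might look as follows. The data frame is a simulated stand-in with the four column names seen in the figure; all values are invented, and the colours are an arbitrary choice.

```r
# Stand-in data: simulated values, not the real bt data set
set.seed(3)
bt <- data.frame(
  Gender      = factor(rep(c("F", "M"), each = 50)),
  Age         = sample(20:50, 100, replace = TRUE),
  HeartRate   = round(rnorm(100, mean = 74, sd = 7)),
  Temperature = round(rnorm(100, mean = 98.2, sd = 0.7), 1)
)

# All pairwise scatter plots; the factor is shown through its integer codes
pairs(bt)

# Numeric columns only, with points coloured by gender
num_cols <- c("Age", "HeartRate", "Temperature")
pairs(bt[, num_cols], col = ifelse(bt$Gender == "F", "pink", "blue"))
```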
Normal plot
[Figure: scatter plot of bt$HeartRate (60 to 85) against bt$Temperature (96 to 101)]
Multiple plots in the same window
Put more information together on the same plot
par(mfrow=c(2,1)) ## Note mfrow defining 2 rows and 1 column for allowing 2 plots
hist(bt$HeartRate, col="grey80", main="HeartRate histogram")
abline(v=mean(bt$HeartRate), lwd=3)
abline(v=median(bt$HeartRate), lty=3, lwd=3)
legend("right", legend=c("Mean", "Median"), lty=c(1,3))
boxplot(bt$HeartRate~bt$Gender, horizontal=TRUE, col=c( "pink", "blue"))
title("Boxplot for different gender")
points(bt$HeartRate[bt$Gender=="F"], rep(1,length(bt$HeartRate[bt$Gender=="F"])), pch=19)
points(bt$HeartRate[bt$Gender=="M"], rep(2,length(bt$HeartRate[bt$Gender=="M"])), pch=19)
[Figure: two stacked panels. Top: "HeartRate histogram" with Mean and Median lines and a legend. Bottom: horizontal boxplots by gender (F, M) with the observations overlaid as points]
Exporting graphs
It is possible to export graphs in different formats
Png, Jpg, Pdf, Eps, Tiff
Look at the help for the functions pdf, png
pdf("myfirstgraph.pdf") ## Start the pdf device
par(mfrow=c(2,1))
hist(bt$HeartRate, col="grey80", main="HeartRate histogram")
boxplot(bt$HeartRate, horizontal=TRUE, col="grey80", main="Boxplot")
dev.off() ## switch off the device
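The other formats follow the same open, plot, close pattern. As one hedged example, the sketch below writes an EPS file with postscript(); the file name, the temporary directory and the simulated data are choices made for this illustration.

```r
# Hypothetical output path in the session's temporary directory
out_file <- file.path(tempdir(), "myfirstgraph.eps")

# onefile = FALSE, horizontal = FALSE and paper = "special" give a single-page EPS
postscript(out_file, onefile = FALSE, horizontal = FALSE,
           paper = "special", width = 6, height = 4)   # size in inches
hist(rnorm(100), col = "grey80", main = "Example histogram")
dev.off()                      # close the device so the file is written

file.exists(out_file)          # TRUE once the device has been closed
```

Nothing is written to disk until dev.off() is called, so a missing dev.off() is the usual reason for an empty or unreadable file.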
Look at probability distributions in a plot
[Figure: density curve; x axis from −3 to 3]
Data in R
Exercise I
4 Two integers are “friends” if the ratio between the sum of the divisors and the number itself is
the same for both. For example, the sum of the divisors of 6 is 1 + 2 + 3 + 6 = 12. The sum of the
divisors of 28 is 1 + 2 + 4 + 7 + 14 + 28 = 56. Then 12 / 6 = 56 / 28 = 2, thus 6 and 28 are
“friends”.
Define a function that, given 2 numbers as input, checks whether they are “friends”.
5 Fix the number of samples to 1000 and draw samples from at least 8 distributions N(m, 1) where m ∈ [−3, 3].
With the same number of samples, draw from at least 8 distributions N(0, s) where s ∈ [0.1, 2].
Plot the results in the same window with 3 different plots: one for N(m, 1), one for N(0, s) and one for
N(m, 1) and N(0, s) together. Decide the color code for each line.
Suggestion: search for “R color charts” on Google and look at the function colors() in R.
Exercise II
6 Draw an increasing number of samples (10 to 10000) from a normal distribution and look at
how the distribution differs between sample sizes
7 The dataset Pima.tr collects samples from the US National Institute of Diabetes and
Digestive and Kidney Diseases. It includes 200 women of Pima Indian heritage living near
Phoenix, Arizona.
Get the dataset from the MASS package or download it from the website.
Describe the dataset: how many variables, which types of variable, how many samples ...
What do the variables mean?
Get the frequencies of the women affected by diabetes.
Explore the dataset using histograms, barplots and plots. For each plot, describe what you see
and why you made that plot.
Use the categorical variable type to see if there is any difference in the distributions of the age, bmi, and glu
variables