
R - A Practical Course


R Lab: DataCamp

Basics of R
Variable Assignment
A basic concept in R programming is the variable. It allows you to store a value or an object in R. You can then
later use this variable's name to easily access the value or the object that is stored within this variable. You use
<- to assign a variable:
my_variable <- 4

Variables are great for performing arithmetic operations. In this assignment, we have defined a
variable my_apples. You want to define another variable called my_oranges and add these two together:
my_apples + my_oranges
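
As a minimal sketch of the full exercise (the values 5 and 6 are stand-ins, not the exercise's actual data):

my_apples <- 5
my_oranges <- 6

# add the two variables and store the total in my_fruit
my_fruit <- my_apples + my_oranges
my_fruit
[1] 11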

Common sense tells you not to add apples and oranges. The my_apples and my_oranges variables both
contained a number in the previous exercise. The + operator works with numeric variables in R. If you really
tried to add "apples" and "oranges" by assigning a text value to the variable my_oranges while my_apples
holds a number (see the editor), you would be trying to assign the sum of a numeric and a character
variable to the variable my_fruit. This is not possible.
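
To see this for yourself, here is a small sketch (the assigned values are made up):

my_apples <- 5
my_oranges <- "six"  # a character value, not a number

# this addition fails: R cannot add a numeric and a character
my_apples + my_oranges
Error in my_apples + my_oranges : non-numeric argument to binary operator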

Coercion: Taming your Data


It is possible to transform your data from one type to another. Next to the class() function, you can use the
as.*() functions to coerce data to a different type. For example,

var <- "3"

var_num <- as.numeric(var)

converts the character string "3" in var to the numeric 3 and assigns it to var_num. However, keep in mind that
it is not always possible to convert between types without losing information or getting errors.

as.integer("4.5")
as.numeric("three")

The first line will convert the character string "4.5" to the integer 4, truncating the decimal part. The second
one will convert the character string "three" to NA, since R cannot interpret it as a number.

Create a Vector I
On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional
arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to
store data. For example, you can store your daily gains and losses in the casinos. In R, you create a vector with
the combine function c(). You place the vector elements separated by a comma between the brackets. For
example:

numeric_vector <- c(1, 2, 3)


character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE)

Once you have created these vectors in R, you can use them to do calculations.

Sometimes you only want to select a specific element from one of those vectors instead of using the entire
vector. R makes this very easy using indexing. Indexing entails the use of square brackets [] to select elements
from a vector.
For instance, numeric_vector[1] will select the first element of the
vector numeric_vector. numeric_vector[c(1,3)] will select the first and the third element of the
vector numeric_vector.
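For example, using the numeric_vector defined above:

numeric_vector <- c(1, 2, 3)

numeric_vector[1]        # selects the first element: 1
numeric_vector[c(1, 3)]  # selects the first and third elements: 1 3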
Selection by Comparison I
Sometimes you want to select elements from a vector in a more advanced fashion. This is where the use of
logical operators may come in handy.
The (logical) comparison operators known to R are:
 < for less than
 > for greater than
 <= for less than or equal to
 >= for greater than or equal to
 == for equal to each other
 != for not equal to each other
The nice thing about R is that you can use these comparison operators on vectors. For example, the statement
c(4,5,6) > 5 returns: FALSE FALSE TRUE. In other words, you test for every element of the vector if the
condition stated by the comparison operator is TRUE or FALSE.
Behind the scenes, R does an element-wise comparison of each element in the vector c(4,5,6) with the element
5. However, 5 is not a vector of length three. To solve this, R automatically replicates the value 5 to generate a
vector of three elements, c(5, 5, 5) and then carries out the element-wise comparison.

Script.R
# A numeric vector containing 3 elements
numeric_vector <- c(1, 10, 49)

c(1, 10, 49) > 10

# Assign the comparison to larger_than_ten and print the variable
larger_than_ten <- numeric_vector > 10
larger_than_ten

In the last exercise we saw that larger_than_ten consisted of a vector of TRUE and FALSE values. We can make use of
this logical vector to select elements from another vector. For instance, numeric_vector[c(TRUE,
FALSE, TRUE)] will select the first and the third element from the vector numeric_vector.

numeric_vector <- c(1, 10, 49)


larger_than_ten <- numeric_vector > 10
numeric_vector[larger_than_ten]

Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a
fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-
dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example:
matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)

In the matrix() function:

 The first argument is the collection of elements that R will arrange into the rows and columns of the
matrix. Here, we use 1:9 which constructs the vector c(1, 2, 3, 4, 5, 6, 7, 8, 9).
 The argument byrow indicates that the matrix is filled by the rows. This means that the matrix is filled
from left to right and when the first row is completed, the filling continues on the second row. If we
want the matrix to be filled by the columns, we just place byrow = FALSE.

 The third argument nrow indicates that the matrix should have three rows.

 The fourth argument ncol indicates the number of columns that the matrix should have.
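
Putting these arguments together, the call shown above produces the following matrix:

> matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9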

Factors
The term factor refers to a statistical data type used to store categorical variables. The difference between a
categorical variable and a continuous variable is that a categorical variable can belong to a limited number of
categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical
models you will develop in the future treat both types differently.
A good example of a categorical variable is the variable student_status. An individual can either be
"student" or "not student". This means that "student" and "not student" are two values of the categorical variable
student_status and every observation can be assigned one of these values. We can do this using the
factor function.
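
As a short sketch (the observations below are made-up values), creating such a factor could look like this:

# a character vector of observations
student_status <- c("student", "not student", "student", "student")

# turn the character vector into a factor with two levels
student_status_factor <- factor(student_status)
levels(student_status_factor)
[1] "not student" "student"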
Dataframes: What's a Data Frame?
You may remember the matrix, a multi-dimensional object that we discussed earlier. All the elements that you
put in a matrix should be of the same type. However, when performing a market research survey, you often
have questions such as:
 'Are you married?' or other 'yes/no' questions (= boolean data type)
 'How old are you?' (= numeric data type)
 'What is your opinion on this product?' or other 'open-ended' questions (= character data type)
The output, namely the respondents' answers to the questions formulated above, is a data set of different data
types. You will often find yourself working with data sets that contain different data types instead of only one.
A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar
concept for those coming from different statistical software packages such as SAS or SPSS.
Inspecting dataframes
There are several functions you can use to inspect your dataframe. To name a few
 head: this by default prints the first 6 rows of the dataframe
 tail: this by default prints the last 6 rows to the console
 str: this prints the structure of your dataframe
 dim: this by default prints the dimensions, that is, the number of rows and columns of your dataframe
 colnames: this prints the names of the columns of your dataframe.
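
For example, applied to the built-in mtcars data set:

head(mtcars)      # prints the first 6 rows
str(mtcars)       # prints the structure: variable names, types, first values
dim(mtcars)       # prints the dimensions: 32 11
colnames(mtcars)  # prints the column names: "mpg" "cyl" "disp" ...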

Constructing a Dataframe Yourself


Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is
based on your personally developed data set.
As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our
solar system. The main features of a planet are:
 The type of planet (Terrestrial or Gas Giant).
 The planet's diameter relative to the diameter of the Earth.
 The planet's rotation across the sun relative to that of the Earth.
 If the planet has rings or not (TRUE or FALSE).

You construct a data frame with the data.frame() function. As arguments, you should provide the above-mentioned
vectors as input; these become the different columns of that data frame. Therefore, it is
important that each vector used to construct a data frame has an equal length. But do not forget that it is possible
(and likely) that they contain different types of data.
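
A minimal sketch of such a data frame, using only three planets and illustrative values (the numbers here are approximations, not the exercise data):

# one vector per column, all of equal length
name     <- c("Mercury", "Venus", "Earth")
type     <- c("Terrestrial", "Terrestrial", "Terrestrial")
diameter <- c(0.382, 0.949, 1.000)
rings    <- c(FALSE, FALSE, FALSE)

# combine the vectors into a data frame
planet_df <- data.frame(name, type, diameter, rings)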

Indexing and Selecting Columns From a Dataframe


In the same way as you indexed your vectors, you can select elements from your dataframe using square brackets.
Different from vectors, however, dataframes have multiple dimensions: rows and columns. That's why you
use a comma in the middle of the brackets to differentiate between rows and columns. For instance, the
following code planet_df[1,2] would select the element in the first row and the second column from the
dataframe planet_df.
You can also use the $ operator to select an entire column from a dataframe. For
instance, planet_df$planets would select the entire planets column from the dataframe planet_df.
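
Using the sketched planet_df from above, both selection styles look like this:

planet_df[1, 2]   # element in the first row, second column
planet_df$name    # the entire name column, via the $ operator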

Lists
A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in
length, characteristic, and type of activity that has to be done.
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered
way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these
objects are related to each other.
You can easily construct a list using the list() function. In this function you can wrap the different elements
like so: list(item1, item2, item3).

Selecting Elements From a List


Your list will often be built out of numerous elements and components. Therefore, getting a single element,
multiple elements, or a component out of it is not always straightforward. One way to select a component is
using the numbered position of that component. For example, to "grab" the first component of my_list you
type my_list[[1]]
Another way to check is to refer to the names of the components: my_list[["my_vector"]] selects
the my_vector vector.

A last way to grab an element from a list is using the $ sign. The following code would
select my_df from my_list: my_list$my_df.
Besides selecting components, you often need to select specific elements out of these components. For example,
with my_list[[1]][1] you select from the first component of my_list the first element. This would
select the number 1.
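
A brief sketch tying these selection methods together (the components below are made up for illustration):

# build a named list from a vector, a matrix and a data frame
my_vector <- 1:10
my_matrix <- matrix(1:9, nrow = 3)
my_df     <- mtcars[1:5, ]
my_list   <- list(my_vector = my_vector, my_matrix = my_matrix, my_df = my_df)

my_list[[1]]            # first component, by position
my_list[["my_vector"]]  # the same component, by name
my_list$my_df           # the data frame, via the $ sign
my_list[[1]][1]         # first element of the first component: 1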

Introduction to R
Making a Start with Functions: Getting Help
So far we have seen many datatypes in R. The next thing to learn about is functions. We have already seen many
functions when working with vectors, dataframes and lists. For instance, when making a list, we used the
function list() to make one.
In programming, functions are used to incorporate sets of instructions that we want to use repeatedly. A function
is actually a piece of code written to carry out a specified task; it may accept arguments or parameters (or not)
and it may return one or more values (or not!).

Let's look at a pre-programmed function in R: mean. To consult the R documentation on this function, you can
use the following commands:

 help(mean)

 ?mean
Try these commands out in the DataCamp console. If you do so, you'll be redirected to
www.RDocumentation.org. If you type this function into your RStudio console, a help tab will
automatically open in RStudio.
There is another way of getting help on a function. For instance, if you want to know which parameters need to
be provided, you can use the R function args on the specified function. An example of using args on a
function is the following: args(mean)

Functions Continued
In the last exercise we made a start with functions, and we looked at how to get help on using them.
When getting help on the mean function, you saw that it takes an argument x. Here, x is just an arbitrary name
for the object that you want to find the mean of. Usually this object will be an R vector. We also saw the ....
This is called an ellipsis and is used to provide a number of optional arguments to the function.
Remember that R can match arguments both by position and by name. Let's say we want to find the mean of a
vector called temperature. An example of matching by name is the following:
mean(x = temperature)

An example of matching by position is the following:


mean(temperature)
In this exercise, we have provided you with a vector of 5 numbers. These are the grades that you got during the
semester.

Script.R
# a grades vector
grades <- c(8.5, 7, 9, 5.5, 6)

# calculate the mean of grades using matching by name
mean(x = grades)

# calculate the mean of grades using matching by position
mean(grades)

When we looked at the documentation of mean, it showed us the following method:
mean(x, trim = 0, na.rm = FALSE, ...)

As you can see, both trim and na.rm have default values, but x doesn't. This makes x a required
argument: the function mean will throw an error if x hasn't been specified. trim and na.rm are,
however, optional arguments with default values that can be changed or specified by the user.
na.rm can be changed by the user if a given vector contains missing values. For instance, if the aforementioned
vector called temperature contained missing values, calling mean on it would return NA. If you
want the mean function to exclude the NA values when calculating the mean, you can specify na.rm = TRUE.
Let's bring this into practice:

Script.R
# a grades vector
grades <- c(8.5, 7, 9, NA, 6)
#mean without removing NA
mean(grades)
#mean with removing NA
mean(x=grades, trim=0, na.rm = TRUE)

Making Your Own Functions


During the last 3 exercises, we have been using existing functions. However, you can also write your own
functions. You can define a function using function(). For instance, look at the code below to
see a function that takes 2 parameters, a and b, and returns their sum.

sum_a_b <- function(a, b){
  return(a + b)
}

You could call this function and assign its result to the variable result using the following code:
result <- sum_a_b(4, 5)

This would put the value 9 into the variable result.


Getting Your Data into R
One important thing before you actually do analyses on your data, is that you will need to get your data into R.
R contains many functions to read in data from different formats. To name only a few:
 read.table: Reads in tabular data such as txt files
 read.csv: Reads in data from a comma-separated file format
 readWorksheetFromFile : Reads in an excel worksheet
 read.spss: Reads in data from .sav SPSS format.
For the current exercise, we have put the R mtcars dataset into a csv file format.

Reading in Your Own Data


In the last exercise, you read in your first dataset. All you needed to specify was the "address" where the
dataset could be found. However, sometimes data isn't stored in the most convenient format. For instance,
sometimes the separator that separates all the different cells is different from what you would expect.
You can specify the separator in your read.csv() call using the sep argument. By default, this argument
for csv files is a comma. You can however easily change this to a tab by using the following code: sep =
'\t'.

Script.R
# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ';')

# print the first 6 rows of the dataset


head(cars)

Working Directories in R
In the previous assignments, we practised reading files into R. So far, all of these files were on the internet.
However, if you work with RStudio on your own computer, you will probably want to read in local
files.
When reading in local files, it's good to have an idea what your working directory is. Your working directory
is basically the location in your file system where R will look for files. Usually this is something along the lines of
C:/Users/Username/documents. Of course this working directory is not static and can be changed by the user.
In R there are two important functions:

 getwd(): This function will retrieve the current working directory for the user

 setwd(): This function allows the user to set her own working directory

Changing Working Directories in R


In the last exercise we briefly introduced the function setwd(). This function takes a character string and
sets it as the working directory. You can either provide it a relative path, or you provide it an absolute path.
An example of an absolute path is the following:
setwd("C:/Users/Username/Documents/datasets")
An example of a relative path is the following:
setwd("./datasets")
If you use the latter option in your local R session, the . character refers to the current working directory
(for example "C:/Users/Username/Documents"), so the path resolves to the datasets folder inside it. In DataCamp,
it takes the current working directory and combines it with the datasets folder. As such, it saves the user a lot of typing.
Checking Files in Your Working Directory
R has some great convenience functions for checking the files that exist in your current working directory. For
instance, list.files() lists all the files that exist in your working directory.
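
A short sketch of these functions working together (the paths below are examples, not actual course paths):

getwd()              # e.g. "C:/Users/Username/Documents"
setwd("./datasets")  # move into the datasets subfolder via a relative path
list.files()         # list the files in the new working directory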

Importing R Packages
Although base R comes with a lot of useful functions, you will not be able to leverage the full power of R
without importing packages developed by others. Imagine we want to do some great plotting and
we want to use ggplot2 for it. If we want to do so, we need to take 2 steps:
1. Install the package ggplot2 using install.packages("ggplot2")
2. Load the package ggplot2 using library(ggplot2) or require(ggplot2)
In DataCamp however, most packages are already installed and readily available. As such, you won't need to
run install.packages()

Explore Data
Checking The Dimensions of Your Data
We are going to start out using the mtcars data set, which contains measures of design, performance and
consumption for different cars. If we want to know how many cases and variables there are in the data set we
could count them manually, but this could take a very long time. A faster way is to use the function dim().
The first value returned by dim() is the number of cases (rows) and the second value is the number of variables
(columns).
The variables in mtcars are as follows:
 [, 1] mpg Miles/(US) gallon.
 [, 2] cyl Number of cylinders.
 [, 3] disp Displacement (cu.in.)
 [, 4] hp Gross horsepower
 [, 5] drat Rear axle ratio
 [, 6] wt Weight (lb/1000)
 [, 7] qsec 1/4 mile time
 [, 8] vs V/S
 [, 9] am Transmission (0 = automatic, 1 = manual)
 [,10] gear Number of forward gears
 [,11] carb Number of carburettors

Data Structure
Using the str() function we can look at the structure of a dataset. str() takes the name of the data set as its
first argument. The output shows the variable names, their type, and the values of the first observations.
The am variable indicates whether a car has an automatic or manual transmission. Perform
the str() function on mtcars in your console and look at the output. According to R, what type of variable
is am?

str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

Levels
In the last exercise you saw that the am variable of the mtcars data set was labelled by R as a factor. You can
see the levels of a factor variable by using the function levels(). Let's try this out. Remember, you can select
a specific variable using either $ or [,]. If you need to check the variables in the data set, remember that you
can always use the str() function in your console.

# Look at the levels of the variable am
> levels(mtcars$am)
[1] "0" "1"

Recoding Variables
Currently the mpg (miles per gallon) variable of mtcars is a continuous numeric variable, but it may be more
useful if mpg were a categorical variable that immediately told you if the car had low or high miles per gallon.
We can make categories by indexing variables that meet certain criteria.
For instance, if we want to make a new variable that categorises people over age 18 as "adult", we might enter:

yourdata$newvariable[yourdata$age > 18] <- "adult"

This assigns the value "adult" to the variable newvariable, for all cases where age is greater than 18.
Remember, you can select a specific variable using either $ or [,]. If you need to look at your data you can
simply enter mtcars into your console, or if you just want to check the variables you can always
enter str(mtcars) in your console.

Script.R
# Assign the value of mtcars to the new variable mtcars2
mtcars2 <- mtcars

# Assign the label "high" to mpgcategory where mpg is greater than or equal to 20
mtcars2$mpgcategory[mtcars2$mpg >= 20] <- "high"

# Assign the label "low" to mpgcategory where mpg is less than 20
mtcars2$mpgcategory[mtcars2$mpg < 20] <- "low"

Examining Frequencies
Frequency tables show you how often a given value occurs. To look at a frequency table of the data in R, use
the function table(). The top row of the output is the value, and the bottom row is the frequency of the value.
Let's use table() on the am variable of the mtcars data set. Remember that am is a categorical variable that
shows a 0 when a car has an automatic transmission and a 1 when a car has a manual transmission.

> table(mtcars$am)
0 1
19 13

#How many of the cars have a manual transmission?


13

Cumulative Frequency
 In your console, look at how many cars have 3, 4 or 5 gears using table() on the
variable gear from mtcars.
 In your console, calculate how many of the cars have 3 or 5 gears as a percentage of the total number
of cars.
 In your script, report this percentage

Script.R
# What percentage of cars have 3 or 5 gears?
62.5

Rconsole
> table(mtcars$gear)
 3  4  5
15 12  5
> # Percentage of cars that have 3 or 5 gears
> (15/32)*100 + (5/32)*100
[1] 62.5

Making a Bar Graph


We can easily make graphs to visualize our data. Let's visualize the number of manual and automatic
transmissions in our car sample through a bar graph, using the function barplot(). The first argument
of barplot() is a vector containing the heights of each bar. These heights correspond to the
frequencies of a desired measure in your data. You can obtain this information using the table() function.
We are going to make a bar graph of the am (transmission) variable of the mtcars dataset. In this case, the
height of the bars can be the frequency of manual and automatic transmission cars. Therefore, here we are going
to use table() and barplot() to make this plot.

Remember, you can select a specific variable using either $ or [,]. If you need to look at your data you can
simply enter mtcars into your console, or if you just want to check the variables you can always
enter str(mtcars) in your console.
Script.R
# Assign the frequency table of the mtcars variable "am" to a variable called "height"
height <- table(mtcars$am)

#Create a barplot of "height"


barplot(height)

Labelling A Bar Graph


Now we're going to add some labels to the bar graph, still using barplot(). The first argument
of barplot() was a vector of the bar heights. Following this, we can add arguments to format the graph as
necessary. For instance, barplot(height, argument1, argument2). Here we are going to add a
label to the y axis using the argument ylab = "name here", and x axis labels to the bars using the
argument names.arg = "vector of names here".
Script.R
# vector of bar heights
height <- table(mtcars$am)

# Make a vector of the names of the bars called "barnames"


barnames = c("automatic", "manual")

# Label the y axis "number of cars" and label the bars using barnames
barplot(height, ylab = "number of cars", names.arg= barnames)

Histograms
It can be useful to plot frequencies as histograms to visualize the spread of our data.
Let's make a histogram of the number of carburetors in our mtcars dataset using the function hist().
The first argument of hist() is a vector of values for which the histogram is desired. Following this, we can
add arguments to format the graph as necessary. For instance, hist(variable, argument1,
argument2)

Script.R
# Make a histogram of the carb variable from the mtcars data set. Set the title to "Carburetors"
hist(mtcars$carb, main = "Carburetors")

Formatting Your Histogram


Sometimes we have to change things because R's default setting isn't suitable for our graph. In the same way as
we added a title argument to hist(), we can change the scale of the y-axis through adding the
argument ylim followed by the range we want (e.g. for a scale from 0 to 50, we would say ylim =
c(0,50)). We can also label the x-axis using the argument xlab = "title", or change the colour of the
bars to blue with the argument col = "blue".
Script.R
# change the y-axis scale to 0 - 20, label the x-axis and colour the bars red
hist(mtcars$carb, main = "Carburetors", ylim = c(0, 20), col = "red", xlab = "Number of Carburetors")

Mean and Median


We can measure the mean and median of a variable using the functions mean() and median(), using the
variable in question as the first argument between brackets.

Script.R
# Calculate the mean miles per gallon
mean(mtcars$mpg)
[1] 20.09062

# Calculate the median miles per gallon
median(mtcars$mpg)
[1] 19.2

Mode
Sometimes it is useful to look at the most frequent value in a data set, known as the 'mode'. R doesn't have
a standard function for the mode, but we can find it easily using the table() function, which you
might be familiar with by now.
When you have a large data set, the output of table() might be too long to manually identify which value is
the mode. In this case it can be useful to use the sort() function, which arranges a vector or factor into
ascending order. (You can add the argument decreasing = TRUE to sort() if you want to arrange it in
descending order.)

Script.R
# Produce a sorted frequency table of `carb` from `mtcars`
sort(table(mtcars$carb), decreasing = TRUE)

Range
The range of a variable is the difference between the highest and lowest value. We can find these values
using max() and min() on the variables of our choice. These functions return the maximum and minimum
values of the variable; subtracting the minimum from the maximum gives the range.
Script.R
# Minimum value
x <-min(mtcars$mpg)

# Maximum value
y <-max(mtcars$mpg)

# Calculate the range of mpg using x and y


y-x

Quartiles
You can calculate the quartiles in your data set using the function quantile(). The output
of quantile() gives you the lowest value, first quartile, second quartile, third quartile and highest value.
25% of your data lies below the first quartile value, 50% lies below the second quartile, and 75% lies below the
third quartile value.
> quantile(mtcars$qsec)
0% 25% 50% 75% 100%
14.5000 16.8925 17.7100 18.9000 22.9000

IQR and Boxplot


To better visualise your data's quartiles you can create a boxplot using the function boxplot() (in the same
way as you used hist() and barplot()). Similarly, you can calculate the interquartile range manually by
subtracting the value of the first quartile from the value of the third quartile, or you can use the
function IQR() on your variable of interest.
Script.R
# Make a boxplot of qsec
boxplot(mtcars$qsec)

# Calculate the interquartile range of qsec


IQR(mtcars$qsec)

[1] 2.0075

IQR Outliers
In the boxplot you created you can see a circle above the box. This indicates an outlier. A value counts as an
outlier when it is more than 1.5 * IQR above the third quartile, or more than 1.5 * IQR below the first quartile.

> IQR(mtcars$qsec)
[1] 2.0075
> quantile(mtcars$qsec)
0% 25% 50% 75% 100%
14.5000 16.8925 17.7100 18.9000 22.9000
> #upper threshold
> 18.9000+1.5*2.0075
[1] 21.91125
> #lower threshold
> 16.8925-1.5*2.0075
[1] 13.88125

Standard Deviation
We can also measure the spread of data through the standard deviation. You can calculate these using the
function sd(), which takes a vector of the variable in question as its first argument.

> # Find the IQR of horsepower


> IQR(mtcars$hp)
[1] 83.5
> # Find the standard deviation of horsepower
> sd(mtcars$hp)
[1] 68.56287
> # Find the IQR of miles per gallon
> IQR(mtcars$mpg)
[1] 7.375
> # Find the standard deviation of miles per gallon
> sd(mtcars$mpg)
[1] 6.02694

Mean, median and mode.


Mean, median and mode are all measures of the average. In a
perfect normal distribution, the mean, median and mode values are
identical, but when the data is skewed this changes. In the
graph on the right, which of the following statements is most
accurate?
Possible Answers
A. The mode is higher than the mean. It makes most sense to
use the median to measure central tendency.
B. The mean is higher than the mode. It makes most sense to
use the mean to measure central tendency.
C. The median is higher than the mode. It makes most sense to use the mode to measure central tendency
Ans. A

Calculating Z-scores
We can calculate the z-score for a given value (X) as (X - mean) / standard deviation. In R you can do this for
a whole variable at once by putting the variable name in the place of X.
Script.R
# Calculate the z-scores of mpg
(mtcars$mpg-mean(mtcars$mpg))/sd(mtcars$mpg)

Correlation and Regression
Scatterplots
Saved in your console is a dataset called women, which contains the heights and weights of 15 women (try
typing it into your console and pressing enter to have a look).
Let's have a look at the relationship between height and weight through a scatterplot, using the R
function plot(). The first argument of plot() is the x-axis coordinates, and the second argument is the y-
axis coordinates.

Script.R
plot(women$weight, women$height, main = "Heights
and Weights")
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164

Making a Contingency Table


Saved in your console is a dataset called smoking, which contains data about the amount of tobacco smoked per
day in a sample of 88 students. The student variable says whether a student is in high school or university,
and the tobacco variable indicates how many grams of tobacco are smoked per day. We expected that there
would be more tobacco use (the dependent variable) in university (the independent variable).
We can make a contingency table of this data using the table() function. While previously you may have
used this with one variable, this time you will use it with two. The first variable used with table() will appear
in the rows, while the second variable will appear in the columns

Script.R
table(smoking$tobacco, smoking$student)

>table(smoking$tobacco, smoking$student)

         high school university
  0-9g            17         14
  10-19g          16         15
  20-29g          11         15

Calculating Percentage from Your Contingency Table
Have a look at the contingency table of tobacco consumption and education you made in the last exercise. It's
saved in your console as st. Let's use it to calculate some percentages!
In this exercise you need to report your answers to one decimal place. You are free to do this manually, but if
you want a quick way to do this through R you can use the round() function. The first argument
of round() is the value that you want to round (this can be in the form of a raw number, or an equation), and
the second argument is digits =, where you specify the number of decimal places you want the number
rounded to. For instance, round(12.6734, digits = 2) would return the value 12.67.

Script.R
# What percentage of high school students smoke 0-9g of tobacco?
38.6

# Of the students who smoke the most, what percentage are in university?
57.7

Console
> (17/(17+16+11))*100
[1] 38.63636
> round(38.63636, digits = 1)
[1] 38.6
> (15/(11+15))*100
[1] 57.69231
> round(57.69231, digits = 1)
[1] 57.7

Calculating Correlation Using R


We can calculate the correlation in R using the function cor(), which takes your two variables as its first
two arguments. Try it out on the variables shown in the graph.

Script.R
# Calculate the correlation
between var1 and var2
cor(var2, var1)

> cor(var2, var1)


[1] -0.2642027

Finding The Line


When we draw a line through our data, we measure error as the sum of the
differences between the observations and the line. We usually square these differences so
that positive and negative residuals don't cancel each other out. The line that
gives us the least error is our regression line.
To do this you should use the sum() function, which returns the sum of all
values provided between brackets. You can also put ^2 inside the brackets
with your vectors in order to square the differences. For
example, sum((vector1 - vector2) ^ 2)

Script.R
# predicted values of y according to line 1
y1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# predicted values of y according to line 2
y2 <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
# actual values of y
y <- c(3, 2, 1, 4, 5, 10, 8, 7, 6, 9)
# calculate the squared error of line 1
sum((y-y1)^2)

>sum((y-y1)^2)
[1] 36
>sum((y-y2)^2)
[1] 46

Finding The Regression Coefficients in R


We can find the regression coefficients for our data using the lm() function, which takes our model as the
first argument: first the y variable, followed by a ~ symbol, then the x variable. For instance: lm(y ~ x). The
output labels the value of the intercept with '(Intercept)', and the value of the slope with the name of the
independent variable. Let's try this out with our study that investigated how money (independent variable)
predicted prosocial behaviour (dependent variable).

Script.R
# Our data
money <- c(1,2,3,4,5,6,7,8,9,10)
prosocial <- c(3,2,1,4,5,10,8,7,6,9)

# Find the regression coefficients
lm(prosocial ~ money)

Console
> lm(prosocial ~ money)

Call:
lm(formula = prosocial ~ money)

Coefficients:
(Intercept)        money
     1.2000       0.7818

Using lm() To Add A Regression Line To Your Plot


In the last exercise you used lm() to obtain the coefficients for your model's regression equation, in the
format lm(y ~ x). We can store this output and use it to add the regression line to your
scatterplot! After you have created your scatterplot, you can add a line using the
function abline(). abline() takes the intercept of the line as its first argument, and the slope of the line
as its second argument. This makes your stored lm() output a pretty good candidate for
putting straight into abline().

Script.R
# Your plot
plot(money, prosocial, xlab = "Money", ylab = "Prosocial
Behavior")
# Store your regression coefficients in a variable called "line"
line <- lm(prosocial~money)
# Use "line" to tell abline() to make a line on your graph
abline(line)

Script.R
# Your plot
plot(money, prosocial, xlab = "Money",
ylab = "Prosocial Behavior")

# Your regression line


line <- lm(prosocial ~ money)
abline(line)

# Add a line that shows the mean of the dependent variable
a <- 5.5
b <- 0
abline(a, b)

R Squared I
These are the two lines you plotted in the last assignment. One line shows the mean, and one shows the
regression line. Clearly, there is less error when we use the regression line compared to the mean line. This
reduction in error from using the regression line compared to the mean line tells us how well the independent
variable (money) predicts the dependent variable (prosocial behaviour).
Conveniently, the R squared is equivalent to squaring the Pearson R correlation coefficient. We're going to
calculate the R squared for prosocial and money.

Script.R
# Calculate the R squared of prosocial and money
r <- cor(money, prosocial)
r*r

> r*r
[1] 0.6112397

Probability Distribution
Probability mass and density functions
From the lectures you may recall the concepts of probability mass and density functions. Probability mass
functions relate to the probability distributions of discrete variables, while probability density functions relate to
probability distributions of continuous variables. Suppose we have the following probability mass function:

Script.R
# the data frame
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))

# make a bar plot of the probability distribution
barplot(names.arg = data$outcome, height = data$probs)

Probability mass and density functions (2)


For continuous variables, the values of a variable are associated with a probability density. To get a probability,
you will need to consider an interval under the curve of the probability density function. Probabilities here are
thus considered surface areas.
In this exercise, we will simulate some random normally distributed data using the rnorm() function. This
data is contained within the data vector. You will then need to visualize the data.
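
A minimal sketch of such a simulation (the sample size of 10000 is an assumption, matching the later exercise):

# simulate 10000 draws from the standard normal distribution
data <- rnorm(10000)

# visualize the simulated data; the histogram should look roughly bell-shaped
hist(data)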
The Normal Distribution
Density, distribution function, quantile function and random generation for the normal distribution with mean
equal to mean and standard deviation equal to sd.

Usage

dnorm(x, mean = 0, sd = 1, log = FALSE)


pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

Arguments
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
mean: vector of means.
sd: vector of standard deviations.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are \(P[X \le x]\), otherwise \(P[X > x]\).
Details
If mean or sd are not specified they assume the default values of 0 and 1, respectively. The normal distribution
has density $$ f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/2\sigma^2}$$ where \(\mu\) is the mean of the
distribution and \(\sigma\) the standard deviation.

Value
dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function,
and rnorm generates random deviates.
The length of the result is determined by n for rnorm, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.

For sd = 0 this gives the limit as sd decreases to 0, a point mass at mu. sd < 0 is an error and returns NaN.

Script.R
# simulating data
set.seed(11225)
data <- rnorm(10000)

# check the documentation of the dnorm function
help(dnorm)

# calculate the density of data and store it in the variable density
density <- dnorm(data)

The cumulative probability distribution


In the last two exercises, we saw the probability distributions of a discrete and a continuous variable. In this
exercise we will jump into cumulative probability distributions. Let's go back to our probability mass function
from the first exercise.

All the probabilities in the table are included in the dataframe probability_distribution, which
contains the variables outcome and probs. We could sum individual probabilities in order to get the cumulative
probability of a given value. However, in some cases the function cumsum() may come in handy.
What cumsum() does is return a vector whose elements are the cumulative sums of the elements of the
argument. For instance, given a vector containing the elements c(1, 2, 3), cumsum() would return c(1, 3, 6).

Script.R
# probability that x is smaller or equal to two
prob <- (0.1 + 0.2 + 0.3)

#' probability that x is 0, smaller or equal to one,


#' smaller or equal to two, and smaller or equal to three
cumsum(c(0.1, 0.2, 0.3, 0.2))

[1] 0.1 0.3 0.6 0.8

Summary statistics: The mean
One of the first things that you would like to know about a probability distribution is a summary statistic
that captures the essence of the distribution. One example of such a summary statistic is the mean. The mean of
a probability distribution is calculated by taking the weighted average of all possible values that a random
variable can take. In the case of a discrete variable, you calculate the sum of each possible value times its
probability. Let's go back to our probability mass function of the first exercise.

Script.R
# calculate the expected value and assign it to the variable expected_score
expected_score <- sum(data$outcome * data$probs)

# print the variable expected_score
expected_score

[1] 2.3

Summary statistics: Variance and the standard deviation


In addition to the mean, sometimes you would also like to know about the spread of the distribution. The
variance is often taken as a measure of the spread of a distribution. It is based on the squared deviations of the
observations from their mean. To calculate it on the basis of a probability distribution, take the sum of the squared
differences between the individual outcomes and their mean, each multiplied by its probability. See the following
formula:
var(X) = ∑ (xᵢ − x̄)² × P(xᵢ)
If we want to turn that variance into the standard deviation, all we need to do is to take its square root. Let's go
back to our probability mass function of the first exercise and see if we can get the variance.

Script.R
# the mean of the probability mass function
expected_score <- sum(data$outcome * data$probs)

# calculate the variance and store it in a variable called variance
variance <- sum((data$outcome - expected_score)^2 * data$probs)

# calculate the standard deviation and store it in a variable called standard_deviation
standard_deviation <- sqrt(variance)

The normal distribution and cumulative probability


In the previous assignment we calculated probabilities according to the normal distribution by looking at an
image. However, it is not always as simple as that. Sometimes we deal with cases where we want to know the
probability that a normally distributed variable falls within a certain interval. Let's work with an example of
female hair length.

Hair length is considered to be normally distributed with a mean of 25 centimeters and a standard deviation of
5. Imagine we wanted to know the probability that a woman's hair length is less than 30. We can do this in R
using the pnorm() function, which calculates the cumulative probability. We can use it the following way:
pnorm(30, mean = 25, sd = 5). If you wanted to calculate the probability of a woman having a hair length
larger than or equal to 30 centimeters, you can set the lower.tail argument to FALSE. For instance,
pnorm(30, mean = 25, sd = 5, lower.tail = FALSE). (In the accompanying figures, the first example is
visualized on the left and the second on the right.)
Calculate the probability of a woman having a hair length less than 20 centimeters, using a mean of 25 and a
standard deviation of 5. Use the pnorm() function and round the value to two decimals.

Script.R
# probability of a woman having a hair length of less than 20 centimeters
round(pnorm(20, mean= 25,sd=5), 2)

[1] 0.16

The normal distribution and quantiles


Sometimes we have a probability that we want to associate with a value. This is basically the opposite of the
situation described in the previous question. Say we want the value of a woman's hair length that corresponds
with the 0.2 quantile (= 20th percentile). Let's consider visually what this means: in the accompanying figure,
we are given a blue area with a probability of 0.2, and we want to know the value that is associated with the
yellow dotted vertical line. This value is the 0.2 quantile (= 20th percentile) and divides the curve into an area
that contains the lower 20% of the scores and an area that contains the rest of the scores. If our variable is
normally distributed, in R we can use the function qnorm() to find it. We specify the probability as the first
parameter, then the mean, and then the standard deviation, for example qnorm(0.2, mean = 25, sd = 5).
Calculate the 85th percentile of the distribution of female hair length and round this value to two decimals. Note
that the mean is 25 and the standard deviation is 5.

Script.R
# 85th percentile of female hair length
round(qnorm(0.85, mean=25, sd=5),2)

[1] 30.18

The normal distribution and Z scores
A special form of the normal probability distribution is the standard normal distribution, also known as the
z-distribution. A z-distribution has a mean of 0 and a standard deviation of 1. Often you can transform variables
to z values. You can transform the values of a variable to z-scores by subtracting the mean and dividing this by
the standard deviation. If you perform this transformation on the values of a data set, your transformed data set
will have a mean of 0 and a standard deviation of 1. The formula to transform a value to a z score is the following:
Zᵢ = (xᵢ − x̄) / sₓ
The Z-score represents how many standard deviations from the mean a value lies.
Script.R
# calculate the z value and store it in the variable z_value
z_value <- (38 - 25) / 5

# print the rounded z value
round(z_value)

[1] 3

The binomial distribution


The binomial distribution is important for discrete variables. There are a few conditions that need to be met
before you can consider a random variable to be binomially distributed:
1. There is a phenomenon or trial with two possible outcomes and a constant probability of success - this
is called a Bernoulli trial
2. All trials are independent
Other ingredients that are essential to a binomial distribution are that we need to observe a certain number of
trials, let's call this n, and we count the number of successes in which we are interested, let's call this x. Useful
summary statistics for a binomial distribution are the same as for the normal distribution: the mean and the
standard deviation.
The mean is calculated by multiplying the number of trials n by the probability of a success denoted by p. The
standard deviation of a binomial distribution is calculated by the following formula:

√𝑛 × 𝑝 × (1 − 𝑝)

Consider an exam consisting of 25 multiple choice questions, where each question has 5 possible answers. This
means that the probability of answering a question correctly by chance is 0.2. Calculate the mean of this
distribution and store it in a variable called mean_chance.
Calculate the standard deviation of this distribution and store it in the variable std_chance.

Script.R
# calculate the mean and store it in the variable mean_chance
mean_chance <- 25 * 0.2

# calculate the standard deviation and store it in the variable std_chance
std_chance <- sqrt(25 * 0.2 * (1 - 0.2))

Calculating probabilities of binomial distributions in R


Just as with the normal distribution, we can also calculate probabilities according to the binomial distribution.
Let's consider the example in the previous question. We had an exam with 25 questions and a 0.2 probability of
guessing a question correctly. In contrast to the normal distribution, when dealing with a binomial
distribution we can calculate the probability of answering exactly, say, 5 questions correctly. This is because a
binomial distribution is a discrete distribution.
When we want to calculate the probability of answering 5 questions correctly, we can use the dbinom function.
This function calculates an exact probability. If we would like to calculate an interval of probabilities, say the
probability of answering 5 or more questions correctly, we can use the pbinom function. We have already seen a
similar function when dealing with the normal distribution: the pnorm() function.
Calculate the exact probability of answering 5 questions correctly and store this in the variable five_correct.
Calculate the cumulative probability of answering at least 5 questions correctly and store this in the
variable atleast_five_correct.

Script.R
# probability of answering 5 questions correctly
five_correct <- dbinom(5, size = 25, prob = 0.2)

# probability of answering at least 5 questions correctly


atleast_five_correct <- pbinom(4, size = 25, prob = 0.2, lower.tail = FALSE)

Quantiles and the binomial distribution


Remember the concept of quantiles? If not, let's briefly recap it. Quantiles are used when you have a
probability and you want to associate this probability with a value. In our last example we had 25 questions and
the probability of guessing a question correctly was 0.2. Also, in our last example we wanted to know the
probability of answering at least 5 questions correctly and used the pbinom() function to do so. With quantiles,
we do the exact opposite: we want to calculate the value that is associated with, for instance, the 0.2 quantile
(= 20th percentile). When working with a binomial distribution, we can use the function qbinom() for this.

Calculate the 60th percentile of the binomial distribution of exam questions. Note that the number of questions
is 25 and the probability of guessing a question correctly is 0.2.
Script.R
# calculate the 60th percentile
qbinom(0.60, size = 25, prob = 0.2)

[1] 5

Sampling Distribution
Sampling from the population
In this lab we have access to the entire population. In real life, this is rarely the case. Often we gather information
by taking a sample from a population. In the lectures, you've become familiar with the male beard length (in
millimetres) of hipsters in Scandinavia. In this lab, we will be working with this example.
If we were interested in estimating the average male beard length of hipsters in Scandinavia, in R we can use
the sample() function to sample from the population. For instance, to sample 50 inhabitants from our
Scandinavian male hipster population which is included in the variable scandinavia_data, we could do
the following: sample(scandinavia_data, size = 50). This command collects a simple random
sample of size 50. If we didn't have access to the entire male hipster Scandinavian population, working with
these 50 inhabitants would be considerably simpler than having to go through the entire Scandinavian male
hipster population.
Make sure not to remove the set.seed(11225) code. This makes sure that you will get the same results as the solution code.
 Sample 100 values from the dataset scandinavia_data and store this in a variable first_sample.
 Calculate the mean of first_sample and print the result.

Script.R
# variable scandinavia_data contains the beard lengths of the scandinavian male population
set.seed(11225)

first_sample <- sample(scandinavia_data, size = 100)
mean(first_sample)
[1] 25.42916

Using a for loop


Not surprisingly, every time we take another random sample, we get a different sample mean. It's useful to get
a sense of just how much variability we should expect when estimating the population mean this way.
The distribution of sample means, or the sampling distribution, can help us understand this variability. However,
before continuing with the sampling distribution, we will firstly introduce the concept of a for loop in R.
Every time some operation has to be repeated a specific number of times, a for loop may come in handy. We
only need to specify how many times or upon which conditions those operations need execution: we assign
initial values to a control loop variable, perform the loop and then, once finished, we typically do something
with the results.
Script.R
# initialize an empty vector
new_number <- NULL

# run an operation 10 times.


# The ith position of new number will be set to i
# at the end of the loop, the vector new_number is printed
for (i in 1:10) {
new_number[i] <- i
}
print(new_number)

[1] 1 2 3 4 5 6 7 8 9 10

Mean of the sampling distribution


The mean of a sample that you take from the population will never be very far away from the population mean
(provided that you randomly sample from the population). Furthermore, the mean of the sampling distribution,
that is, the mean of the means of all the samples that we took from the population, will never be far away from the
population mean. Let's observe this in practice.

 Calculate the mean of the population and print it. Note that the
population is included in the variable scandinavia_data.
 Calculate the mean of the sample means and print it. Note that the
sample means are included in the variable sample_means.
 Note how close the two are

Script.R
# set the seed such that you will get the same sample as in the solution code
set.seed(11225)

# empty vector of sample means
sample_means <- NULL

# take 500 samples of size 200 from scandinavia_data
for (i in 1:500){
  samp <- sample(scandinavia_data, 200)
  sample_means[i] <- mean(samp)
}

# calculate the population mean, that is, the mean of scandinavia_data, and print it
mean(scandinavia_data)
[1] 24.97331

# calculate the mean of the sample means, that is, sample_means
mean(sample_means)
[1] 24.9644

Standard deviation of the sampling distribution


In the previous weeks you have become familiar with the concept of standard deviation. You may recall that
this concept refers to the spread of a distribution. In R you can calculate the standard deviation using the
function sd().
However, the standard deviation of the sampling distribution is called the standard error. The standard error is
calculated slightly differently from the standard deviation. The formula for the standard error can be found
below:
σ_x̄ = σ / √n
In this formula, σ refers to the population standard deviation, while n refers to the size of the sample.
 Calculate the standard deviation of the population and put it in the
variable population_sd. Note that the population can be found in the
variable scandinavia_data. Print population_sd to the console.

 Use population_sd to calculate the standard error of a sample of 200 cases


and put it in the variable sampling_sd. Print sampling_sd to the console.

Script.R
# standard deviation of the population
population_sd <- sd(scandinavia_data)
population_sd

# standard deviation of the sampling distribution (standard error)
sampling_sd <- population_sd / sqrt(200)
sampling_sd

[1] 3.466054
[1] 0.245087

Standard deviation of the sampling distribution (2)


The standard deviation of the sampling distribution, also known as standard error, is affected by the
values of:
1. The population standard deviation
2. The sample size of the samples taken from the population
What happens to the standard error when (i) the standard deviation of the population becomes larger,
and (ii) the size of the sample becomes larger?
Answer: (i) The standard error becomes larger. (ii) The standard error becomes smaller.
The central limit theorem
Earlier we touched on the sampling distribution and its mean and standard deviations. Now, we will look at the
central limit theorem, one of the most important theorems when it comes to inferential statistics. Briefly this
theorem states the following:
"Provided that the sample size is sufficiently large, the sampling distribution of the sample mean is
approximately normally distributed even if the variable of interest is not normally distributed in the
population"
In this exercise we will take a look at a new population of simulated household income of citizens in the United
States. The data is stored in a variable called household_income. This population is right skewed. We will take
1000 samples of n = 200 from this population and calculate the sample mean each time. You will see that the
sampling distribution, just as the central limit theorem states, is normally distributed.
 Make a histogram of the variable household_income and see how the population is skewed to the right.
 Make a histogram of the variable sample_means. Can you see how this variable looks normally distributed?
 You can press the button "Previous plot" to check the previous histogram.

Script.R
# empty vector of sample means
sample_means <- NULL

# take 1000 samples of size 200 from household_income
for (i in 1:1000){
  samp <- sample(household_income, 200)
  sample_means[i] <- mean(samp)
}

# make a histogram of household_income
hist(household_income)
# make a histogram of sample_means
hist(sample_means)

Z-scores
Recall the concept of Z scores from the lectures. Z scores are standardized scores that indicate how far an
observation is removed from its mean. A Z score of 2 means that an observation is 2 standard deviations away
from its population mean. Also recall that the formula for the Z score is the following:
Zᵢ = (xᵢ − μ) / σ
In this formula, $x_i$ refers to the observation for which you want to calculate the Z score, while $\mu$ refers to the
population mean of the phenomenon and $\sigma$ refers to the population standard deviation.
To illustrate the concept of the Z score, let's go back to our scandinavia_data dataset. In this population
of male hipsters from Scandinavia, the average beard length is 25 mm and the standard deviation in the
population is 3.47. Suppose we had a hipster with a beard length of 32 mm; this would be unusual for this
population and thus would have a rather high Z score.
 Calculate the Z score of this hipster with a beard of 32 millimetres and store it in a variable called z_score
 Print the variable z_score to the console

Script.R
# z score of a hipster with a beard of 32 millimetres
z_score <- (32 - 25) / 3.47
# print the variable z_score to the console
z_score
[1] 2.017291

Calculating areas with subjects


A z-score by itself may not always be easy to interpret. Yes, it does indicate the amount of standard deviations
away from the population mean, but this may sound like gibberish to many people. Wouldn't it be great to
translate a z-score to a probability?
Z scores can be easily translated to probabilities. There are multiple ways to do so:
1. Look up the z score in a table
2. Calculate the probability using R
In R we can use the pnorm() function to calculate the probability of obtaining a given score or a more extreme
score in the population. Basically, this calculates an area under the bell curve. The pnorm() function has several
parameters you can include, such as:
parameters you can include such as:
 q: The observation for which you want to calculate the probability
 mean: The population mean
 sd: The population standard deviation
 lower.tail: Indicates whether you want to calculate the area under the curve left of your observation (TRUE) or right of it (FALSE)

Let's look at how to use pnorm() and let's play around with the lower.tail option.
 Recall that the z-score for the Scandinavian hipster in the previous exercise was 2.02. Calculate the area left of this observation by specifying lower.tail = TRUE in pnorm() and print this probability.
 Now calculate the area under the curve right of this observation by specifying lower.tail = FALSE and print this probability.

Script.R
# calculate the area under the curve left of the observation
pnorm(2.02, lower.tail = TRUE)
[1] 0.9783083
# calculate the area under the curve right of the observation
pnorm(2.02, lower.tail = FALSE)
[1] 0.02169169
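Incidentally, standardizing first is optional: pnorm() can also work on the original measurement scale if you pass the mean and sd arguments. A short sketch using the beard-length values from the earlier exercise (mean 25, sd 3.47):

# area right of a 32 mm beard, without computing a z score first
pnorm(32, mean = 25, sd = 3.47, lower.tail = FALSE)   # ~0.022
# equivalent to standardizing and using the default mean 0, sd 1
pnorm((32 - 25) / 3.47, lower.tail = FALSE)           # same value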

Interpreting areas under the curve


In the last exercise, we calculated the area under the curve both left and right of our observation. Let's now
visualize this area under the curve so that we get a better idea of what it means. The red area refers to the area
under the curve. In the topmost visualization, the area under the curve is quite large and covers the largest part
of the distribution. This is because we asked pnorm() to calculate the lower-tail area below the value of 32. In
contrast, in the lower visualization we set lower.tail = FALSE and as such calculated the area under the
curve from the observation 32 onwards.
By now you may have realized that this area under the curve represents a probability. In the top visualization
this represents the probability of finding a hipster with a beard length smaller than or equal to 32. Can you guess
what the area under the curve represents in the bottom visualization?

Answer: This represents the probability of finding a hipster with a beard length larger than or equal to 32.
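The original plots are not reproduced here, but a shaded area like the one described can be recreated with base R graphics. A minimal sketch, assuming the population values from the exercise (mean 25, sd 3.47):

# draw the population distribution of beard length
x <- seq(10, 40, length.out = 200)
plot(x, dnorm(x, mean = 25, sd = 3.47), type = "l",
     xlab = "beard length (mm)", ylab = "density")
# shade the upper-tail area from the observation 32 onwards
region <- x[x >= 32]
polygon(c(32, region, max(x)),
        c(0, dnorm(region, mean = 25, sd = 3.47), 0), col = "red")

Swapping the shaded region to x <= 32 reproduces the lower-tail picture instead.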
Calculating areas with sample means
So far we have calculated the probabilities of observations using mean
and standard deviation values from the population. However, we can
also calculate these observation probabilities using mean and standard
deviation values from the sample. For instance, we could have a
question along the lines of what is the probability that the sample mean
of the beard length of 50 Scandinavian hipsters is larger or equal to 26
millimetres. Because in this example we are talking about a specific
sample from the population, we make use of the sampling distribution
and not the population distribution.
Because we make use of the sampling distribution, we are now using the standard deviation of the sampling
distribution which is calculated using the formula σ/√n.
 Calculate the probability that a sample mean of the beard length of 50 Scandinavian hipsters is larger
or equal to 26 millimetres. Recall that the population is contained in the
variable scandinavia_data. Take all the steps indicated by comments in the editor.

Script.R
# calculate the population mean
population_mean <- mean(scandinavia_data)

# calculate the population standard deviation
population_sd <- sd(scandinavia_data)

# calculate the standard deviation of the sampling distribution and put it in a variable called sampling_sd
sampling_sd <- population_sd / sqrt(50)

# calculate the Z score and put it in a variable called z_score
z_score <- (26 - mean(scandinavia_data)) / sampling_sd

# cumulative probability calculation. Don't forget to set lower.tail to FALSE
pnorm(z_score, lower.tail = FALSE)
[1] 0.01810623

Calculating areas with sample means (2)


In the last exercise, we calculated the area under the curve right of our sample mean of 26. We found that this
area had a probability of about 0.018. Let's visualize this:

You may notice that this distribution of sample means is much narrower than the distribution of observations
of individual hipsters. How would you interpret the red area?

Answer: The red area is the probability of obtaining a sample with a mean beard length equal to or larger than
26 millimetres, based on a sample of 50 hipsters.
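Incidentally, the same probability can be obtained without computing a z score, by handing pnorm() the mean and standard error of the sampling distribution directly. A sketch, assuming scandinavia_data is loaded as in the exercise:

# P(sample mean >= 26), stated on the original scale
pnorm(26,
      mean = mean(scandinavia_data),
      sd = sd(scandinavia_data) / sqrt(50),
      lower.tail = FALSE)   # ~0.018, matching the z-score route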

Sampling distributions and proportions


In all the examples that we have seen, we worked with continuous variables such as beard length. However, in
practice we often work with proportions such as the proportion / percentage of hipsters in the population of
London. Imagine that we took a sample of 200 from the population of London and based on this sample we
concluded that London's population is 10% hipster. The mean of the sampling distribution thus is 10%.
But how do we calculate the standard deviation of the sampling distribution? From the lectures, you may recall
the following formula: $\sqrt{\dfrac{\pi(1-\pi)}{n}}$, or equivalently, with the sample proportion $p$: $\sqrt{\dfrac{p(1-p)}{n}}$. Let's start practicing with sampling distributions and proportions.
 Given that a sample of 200 from the London population has a percentage of hipsters of 10%, calculate the standard deviation of the sampling distribution and store this in a variable called sample_sd

Script.R
# sample proportion
proportion_hipsters <- 0.10

# standard deviation of the sampling distribution
sample_sd <- sqrt((0.1 * 0.9) / 200)

Sampling distributions and proportions (2)


So let's continue working with proportions. Imagine we took a random sample of 200 people from London's
general population, and a proportion of 0.13 of these people were hipsters. We however know that in the
population of London, the proportion of hipsters is 0.10. What is the probability of finding a sample of 200 with
a proportion of 0.13 or more hipsters?

Let's break this problem into steps. Firstly, we can calculate the standard deviation of the sampling distribution.
The second step is using a function that may look familiar: pnorm(). Although we do not have a mean, we
can use our sample and population proportions. Our sample proportion will constitute the q argument here,
while our population proportion will constitute the mean argument. Now let's get going: what is the probability
of finding a sample of 200 with a proportion of 0.13 or more hipsters?
 Calculate the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters using the pnorm() function.

Script.R
# calculate the standard deviation of the sampling distribution and put it in a variable called sample_sd
sample_sd <- sqrt((0.10 * (1 - 0.10)) / 200)

# calculate the probability
pnorm(0.13, mean = 0.10, sd = sample_sd, lower.tail = FALSE)
[1] 0.0786496

Confidence Intervals
Confidence Interval with Known SD I
We know that in a normally distributed phenomenon, 95% of cases will fall within 1.96 standard deviations
above and below the mean. Let's see what that would look like. Imagine we magically know that the world
population mean for happiness has a value of 36.5, with a standard deviation of 7. Let's find out where 95% of
the people in the world lie.
 In your script, calculate the amount above and below 36.5 that 95% of the world falls within
 Remember to format your calculations correctly and use () where appropriate

Script.R
# above
36.5 + (1.96 * 7)
[1] 50.22

# below
36.5 - (1.96 * 7)
[1] 22.78

Confidence Interval with Known SD II


In the last question we demonstrated how 95% of a population fall between 1.96 standard deviations above and
below the population mean. Let's pretend we have psychic knowledge that the standard deviation of sadness in
the world is 8, but we need to find out what the mean is. We take a sample of 300 people. Let's estimate where
the population mean is likely to lie using this sample.
If you remember, the formula for the confidence interval is the sample mean ± 1.96 × standard error. In this
case, the standard error is the population standard deviation divided by the square root of the sample size.
 The sadness level of each participant in your sample is saved in your console as samp
 Complete the steps in your script to calculate the confidence interval for samp
 Remember to format your calculations correctly and use () where appropriate

Script.R
# Assign the sample mean to object "m"
m <- mean(samp)

# Assign the standard error to object "s"
s <- 8 / sqrt(300)

# Calculate the upper confidence interval
m + (1.96 * s)
[1] 31.3083

# Calculate the lower confidence interval
m - (1.96 * s)
[1] 29.4977

Calculating a Confidence Interval Without the Population Standard Deviation


Unfortunately, in reality we usually don't know a population standard deviation, and thus must rely on
sample standard deviations and T-scores. T-scores come from T-distributions, which help us account
for error that occurs when we sample from a population. We use a different T-distribution to calculate
cumulative probabilities depending on our degrees of freedom.
Let’s say we conducted another study on how often people get angry when they're driving (known as
'road rage') using a sample of 200 people chosen at random, saved in your console as rrage. Let's
calculate the 95% confidence interval for where the population mean lies.
This time we must use a slightly different formula: sample mean ± t value × standard error. The standard error
here is the sample standard deviation divided by the square root of the sample size. The t score for a df of 199
is 1.9720 (reproduced with qt() in the sketch below).
 The road rage level of each participant in your sample is saved in your console as rrage
 Complete the steps in your script to calculate the confidence interval for rrage
 Remember to format your calculations correctly and use () where appropriate
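If you would rather not look up the 1.9720 in a table, it can be reproduced with R's qt() function (which this document uses again in the hypothesis-testing section). A quick sketch:

# two-sided 95% critical t value for df = 199
qt(0.975, df = 199)   # 1.971957, the 1.9720 used below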

Script.R
# Assign the sample mean to object "m"
m <- mean(rrage)

# Assign the sample standard error to object "s"
s <- sd(rrage) / sqrt(200)

# Calculate the upper 95% confidence interval
m + (1.9720 * s)
[1] 52.60419

# Calculate the lower 95% confidence interval
m - (1.9720 * s)
[1] 48.72093

Calculating A Confidence Interval for a Proportion I


Instead of measuring road rage as a continuous variable, you ask a sample to simply answer "yes" or "no" to the
question "do you have road rage?". The outcome is saved in your console as roadrage. Let's find what
proportion of your sample do have road rage.

 In your console, check the frequency of "yes" (indicating the presence of road rage) and "no" (indicating no road rage)
 Using your script, use this information to calculate what proportion of your sample have road rage and assign this to the object p

script.R
# Make p the proportion of the sample with road rage
p <- 70 / 200

Calculating A Confidence Interval for a Proportion II


In your study you found that a proportion p of 0.35 of your sample said they have road rage. The standard error
of this proportion is the square root of p multiplied by 1 - p, divided by n. Let's try this!
 Add a line of code that calculates the standard error of p, and stores this as se
 Remember to format your calculations correctly and use () where appropriate
 If you want to take a look at your dataset, it is still saved in your console as roadrage

script.R
# Make p the proportion of the sample with road rage
p <- 70 / 200

# Find the standard error of p
se <- sqrt((p * (1 - p)) / 200)

Calculating A Confidence Interval for a Proportion III


So you've done most of the hard work already, because you have already calculated p and the standard error.
Let's finalise this by calculating the upper and lower ends of the 95% confidence interval for your road rage
study. See if you can manage without the formula in front of you. If you need it you can always look it up, or
press "hint".
 Using the objects you have already defined, add a line of code to calculate the upper level of the confidence interval
 Add a second line of code to calculate the lower level of the confidence interval

Script.R
# Make p the proportion of the sample with road rage
p <- 70 / 200

# Find the standard error of p
se <- sqrt((p * (1 - p)) / 200)

# Calculate the upper level of the confidence interval
p + (1.96 * se)
[1] 0.4161046

# Calculate the lower level of the confidence interval
p - (1.96 * se)
[1] 0.2838954
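For comparison, base R's prop.test() reports a confidence interval for the same proportion in one call. Note that it uses the Wilson score method with continuity correction rather than the Wald formula above, so its bounds differ slightly:

# approximate 95% CI for 70 successes out of 200
prop.test(70, 200)$conf.int   # roughly 0.28 to 0.42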

Other Types of Confidence Intervals I


The last confidence interval you calculated was the 95% confidence interval for the proportion of people who
said they had road rage. Now let's try finding the 99% confidence interval (corresponding to a Z score of 2.58),
and comparing what happens when we use these different intervals.

 In your console, copy the steps from your script, and use this to calculate the range of the 95% confidence
interval
 In your console, copy and adapt the code from your script for finding the 99% confidence interval, and
use this to calculate the range of the 99% confidence interval
 In your script, report the ranges of each interval
 In your script, report which confidence interval is wider as a single number (95 or 99)

Script.R
# Find the standard error of p
se <- sqrt((p * (1 - p)) / 200)
# Calculate the upper level of the 95% confidence interval
p + 1.96 * se
# Calculate the lower level of the 95% confidence interval
p - 1.96 * se
# Report the range of the 95% confidence interval
(p + 1.96 * se) - (p - 1.96 * se)   # 0.1322092, i.e. 2 * 1.96 * se
# Report the range of the 99% confidence interval
(p + 2.58 * se) - (p - 2.58 * se)   # 0.1740303, i.e. 2 * 2.58 * se
# Which has the widest range?
99

Other Types of Confidence Intervals II


Let's do the same thing again with your original study that looked at how often people get angry when they're
driving. The data from this study was from a sample of 200, and the results are saved in your console
as rrage if you need them. We left you the code from where you calculated the 95% confidence interval. Now
let's try finding the 90% confidence interval (corresponding to a T score of 1.6525), and comparing what
happens when we use these different intervals.
 In your console, copy the steps from your script, and use this to calculate the range of the 95% confidence
interval
 In your console, copy and adapt the code from your script to calculate the range of the 90% confidence
interval
 In your script, report the ranges of each interval
 In your script, report which confidence interval is larger as a single number (95 or 90)

Script.R
# Assign the sample mean to object "m"
m <- mean(rrage)
# Assign the sample standard error to object "s"
s <- sd(rrage) / sqrt(200)
# Calculate the upper level of the 95% confidence interval
m + (1.9720 * s)
# Calculate the lower level of the 95% confidence interval
m - (1.9720 * s)
# Calculate the range of the 95% confidence interval
(m + 1.9720 * s) - (m - 1.9720 * s)   # 3.88326
# Calculate the range of the 90% confidence interval
(m + 1.6525 * s) - (m - 1.6525 * s)   # 3.2541
# Which has the widest range?
95

Sample Size I
Which of the following does a large sample size reduce?
 The standard deviation
 The margin of error
 The level of confidence
Answer: The margin of error (m = critical value × standard error)

Sample Size II
You're interested in looking at how many days in the week students drink alcohol, and need to know what kind
of sample size to use. You know that to find this out, you need a Z-score, a margin of error and a standard
deviation. Let's try to establish the standard deviation first. You expect that about 95% of people will consume
an alcoholic drink between 1 and 6 days in the week.
 Assuming that your data is normally distributed, and that 95% of people will report consuming alcohol between 1 and 6 days in the week, report the expected mean number of drinking days.
 Based on the expected distribution and mean, assign the expected standard deviation to a new object called "s". Since 95% of a normal distribution falls within roughly 2 standard deviations of the mean, the range of 5 days spans about 4 standard deviations, so s = 5/4 = 1.25.

Script.R
# What is the expected mean number of drinking days
(1 + 6) / 2
[1] 3.5
# Assign standard deviation to object "s"
s <- 1.25

Sample Size III


You're interested in looking at how many days in the week students drink alcohol, and need to know what kind
of sample size to use. You know that to find this out, you need a Z-score, a margin of error and a standard
deviation. In the last exercise you established an estimated mean of 3.5, and standard deviation of 1.25. You
want to calculate a confidence interval of 95%, with a margin of error of 0.2. Let's input these values into our
equation to estimate our required sample size!
Remember, our formula for sample size is the population standard deviation squared, multiplied by the Z-score
squared, divided by the margin of error squared. You know what Z-score to use for 95% confidence by now,
don't you?
 In your script, calculate the separate components of the sample size equation and assign to stated objects
 In your script, use the separate components to calculate sample size (see formula above)
 Remember to format your calculations correctly and use () where appropriate.

Script.R
# Assign the standard deviation squared to new object "ss"
ss <- 1.25^2
# Assign the value of the Z-score squared to new object "zs"
zs <- 1.96^2
# Assign the value of the margin of error squared to the new object "ms"
ms <- 0.2^2
# Calculate the necessary sample size
(ss * zs) / ms
[1] 150.0625
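Since you cannot recruit 0.0625 of a participant, the required sample size is rounded up, giving n = 151. If you expect to reuse the formula, it can be wrapped in a small helper function (hypothetical, not part of the exercise):

# n = (sd^2 * z^2) / m^2, rounded up to a whole participant
sample_size <- function(sd, z, m) {
  ceiling((sd^2 * z^2) / m^2)
}
sample_size(sd = 1.25, z = 1.96, m = 0.2)
[1] 151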

Sample Size IV
Now you're conducting a study on what proportion of student drink alcohol and want to know what sample size
to use for a confidence interval of 95%, with a margin of error of 0.05.
The sample size will be p multiplied by 1 - p, multiplied by the Z-score squared, divided by the margin of error
squared. Let's try to find this using the 'safe approach' for p. This is the value of p for which the output p*(1-p)
cannot get any larger (you can always go back to the lecture notes or click 'hint' if you don't remember what this
is).
 In your script, calculate the separate components of the sample size equation and assign to stated object
 In your script, use the separate components to calculate sample size
 Remember to format your calculations correctly and use () where appropriate.

Script.R
# Assign the value of p(1-p) to object "p"
p <- 0.5 * 0.5
# Assign the value of the Z-score to new object "z"
z <- 1.96
# Assign the value of the margin of error squared to the new object "ms"
ms <- 0.05^2
# Calculate the necessary sample size
(p * z^2) / ms
[1] 384.16

Hypothesis Testing
Significance testing: one-sided versus two-sided
An important consideration when doing hypothesis testing is whether to do a one-sided or a two-sided test.
Consider the example where we are working with a significance level (α) of 0.05. If we are doing a one-sided
hypothesis test, we would only focus on one side of the distribution (either the right or the left side). Our
entire cumulative probability of 0.05 would then be allocated to only this side. So what does this mean in
practice? In practice this means that our rejection region starts at a probability of 0.95 when our alternative
hypothesis tests whether a given value is greater than a population value.
Alternatively, our rejection region starts at a probability of 0.05 when our
alternative hypothesis tests whether a given value is smaller than a
population value. Let's consider what this means visually:
In the visualization, we have taken as an example the sampling
distribution of the beard length of samples of 40 Scandinavian hipsters.
The mean here is 25 and the standard error is 0.55 (round(3.5 /
sqrt(40), 2)). The red area is considered the rejection region when
we are doing a one-sided hypothesis test where the alternative hypothesis
checks whether the population mean of the beard length of Scandinavian
hipsters is larger than 25 millimetres.
 The visualization mentions the value 25.90. This is the starting value of the rejection region. Consider our example mentioned above with a mean beard length of 25 and a standard error of 0.55. Reproduce the value of 25.90 using the qnorm() function and assign it to the variable cut_off. Make sure to round every value in this exercise to two digits.
 Print the value of cut_off to the console

Script.R
# calculate the value of cut_off
cut_off <- round(qnorm(0.95, mean = 25, sd = round(3.5 / sqrt(40), 2)), 2)
# print the value of cut_off to the console
cut_off
[1] 25.9

Significance testing: one-sided versus two-sided (2)


In the last exercise, we calculated the cut-off value for a one-sided significance test. It is however more common
to do a two-tailed significance test. If we stick to our significance level of 0.05, we would have to chop this
value into two. This means that we get two rejection regions each corresponding to a cumulative probability of
0.025. Let's consider what this means visually:
In the visualization, we have taken as an example the sampling
distribution of the beard length of samples of 40 Scandinavian hipsters. The
mean here is 25 and the standard error is 0.55 (round(3.5 /
sqrt(40), 2)). The red areas are considered the rejection regions when we
are doing a two-tailed hypothesis test. This corresponds to the alternative
hypothesis which checks whether the population mean of the beard length of
Scandinavian hipsters is not equal to 25 millimetres. As you can see, the 0.05
probability is divided into two chunks of 0.025.
 The visualization mentions the values of 23.92 and 26.08. These values indicate the start of the rejection region.
Consider our example mentioned above with a mean beard length of 25 and a standard error of 0.55. Reproduce
the value of 23.92 using the qnorm() function and assign it to the variable lower_cut_off. Make sure
to round every value in this exercise to two digits.
 Reproduce the value of 26.08 using the qnorm() function and assign it to the variable upper_cut_off.
Make sure to round every value in this exercise to two digits.
 Print the values of lower_cut_off and upper_cut_off to the console.

Script.R
# calculate the value of the variable lower_cut_off
lower_cut_off <- round(qnorm(0.025, mean = 25, sd = 0.55), 2)

# calculate the value of the variable upper_cut_off
upper_cut_off <- round(qnorm(0.975, mean = 25, sd = 0.55), 2)

# print lower_cut_off to the console
lower_cut_off
[1] 23.92

# print upper_cut_off to the console
upper_cut_off
[1] 26.08

Significance testing: one-sided versus two-sided (3)


In the last exercises we saw that there are different cut-off values for one-sided and two-tailed hypothesis tests.
You saw that in order to reject the null hypothesis when performing a two-tailed hypothesis, you would need to
pass a higher threshold. In this exercise and the exercise to come, we will calculate probabilities based on sample
means.
Let's go back to our example of Scandinavian hipsters. Here we had a population mean of 25 and a population
standard deviation of 3.5. Because we were taking samples of 40 subjects from this population, we were actually
working with the standard error, which was $3.5 / \sqrt{40}$. Imagine we found a sample mean of exactly 26 and a
corresponding Z score of 1.81. Whether this result is significant depends on the test we do. If we do a
one-sided hypothesis test against a 5% significance level, we only have to test for the effect in one
direction. As such, we could check $P(Z > 1.81)$. In order to do this, we can use
our pnorm() function, which calculates a probability that corresponds to a quantile or z score. We can use it
in the following way: pnorm(1.81, lower.tail = FALSE). We set the lower tail equal
to FALSE because pnorm() calculates the cumulative probability up to the value of 1.81, and we want to know
the probability of finding a value of 1.81 or larger. The function yields a p value of 0.035, which is smaller than
0.05.
 Imagine that we found a sample mean of 25.95 with a sample size of 40. Calculate the corresponding test
statistic, a z score in this case, and assign it to the variable z_value. Look at the hint if you have
forgotten the formula to calculate a z score. Assume that the population mean and standard deviation are
the same as described above. Round all values in this exercise to two digits.
 Use the pnorm() function to find the probability of finding a sample mean as large or more extreme
and store this in the variable p_value. Round all values in this exercise to two digits.
 Print the variable p_value to the console

Script.R
# calculate the z score and assign it to a variable called z_value
z_value <- round((25.95 - 25) / round(3.5 / sqrt(40), 2), 2)

# calculate the corresponding p value and store it in the variable called p_value
p_value <- round(pnorm(z_value, lower.tail = FALSE), 2)

# print p_value to the console
p_value
[1] 0.04

Significance testing: one-sided versus two-sided (4)
In the last exercises we calculated a p value corresponding to a one-sided test. Given the fact that we were
testing against a significance level of 0.05, we have actually found a significant result. But what if we would
have done a two-sided hypothesis test?
In the explanation of the last exercise, we used a sample mean of exactly 26. When doing a one-sided
hypothesis test, we find a corresponding p value of 0.04 for our z score of 1.81. If we instead do a two-sided
hypothesis test, we should not only look at $P(Z > 1.81)$. In this case we should test for
both $P(Z > 1.81)$ and $P(Z < -1.81)$. As such, to get the p value that corresponds to a z score
of 1.81 we have to sum both $P(Z > 1.81)$ and $P(Z < -1.81)$. As the Z distribution we are
working with is symmetric, we can multiply the outcome of round(pnorm(1.81, lower.tail =
FALSE), 2) by 2. This yields a p value of 0.07, in which case we would fail to reject the null hypothesis
as 0.07 is larger than 0.05.
 Imagine that we found a sample mean of 25.95 with a sample size of 40. Calculate the corresponding test
statistic, a z score in this case, and assign it to the variable z_value. Assume that the population mean
and standard deviation are the same as described above. Round all values to two decimals.
 Assume that we are doing a two-sided hypothesis test. Use the function pnorm() to find the
corresponding p value and print this to the console. Round the obtained p value to two decimals.

Script.R
# calculate the z score and assign it to a variable called z_value
z_value <- round((25.95 - 25) / round(3.5 / sqrt(40), 2), 2)

# calculate the corresponding p value and store it in the variable called p_value
p_value <- round(pnorm(z_value, lower.tail = FALSE) * 2, 2)

# print p_value to the console
p_value
[1] 0.08

Hypothesis testing and the binomial distribution (2)


So now we know our hypotheses, let's actually test them. To test hypotheses and calculate p values, we can use
the function pbinom(). Looks familiar, doesn't it? Imagine we want to test the hypothesis that a student who
scored 18 out of 25 questions did better than randomly guessing, we can calculate the area under the curve, that
is, pbinom(17, size = 25, prob = 0.20). While this formula calculates the area under the curve
for values below 17 and equal to 17, we need to know the area ABOVE 17. Because the total probability of all
possible scores occuring is 1, we can subtract the probability of scores less than or equal to 17 from the total
area of 1, and the remaining value will be the probability of a score that is equal to or larger than 18.
 Imagine we have a student who got 12 out of 25 questions correctly and the probability of guessing a
question correctly is 0.20. Calculate the probability of answering 12 or more questions correctly given
that the student is merely guessing and store this in the variable p_value. Round this probability to
two digits. Remember that we are doing a one-sided hypothesis test.
 Print p_value to the console
 Assign your conclusion whether H0 (the student is merely guessing) is accepted or rejected to the
variable conclusion, that is, assign either the value "accepted" or the value "rejected" to the
variable conclusion.

Script.R
# calculate the probability of answering 12 or more questions correctly given
# that the student is merely guessing and store this in the variable p_value
p_value <- round(pbinom(11, size = 25, prob = 0.20, lower.tail = FALSE), 2)

# print the probability calculated above to the console
p_value
[1] 0

# assign either accepted or rejected to the variable conclusion
conclusion <- "rejected"
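As a cross-check, base R's binom.test() performs the same exact one-sided binomial test in a single call; a minimal sketch:

# exact one-sided test: did the student do better than guessing (p = 0.20)?
binom.test(12, 25, p = 0.20, alternative = "greater")
# its reported p-value equals pbinom(11, 25, 0.20, lower.tail = FALSE)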

Hypothesis testing and the binomial distribution (3)


In the last exercise, we did a hypothesis test by calculating the p value by using the pbinom() function. However,
a more widely used way to do so is to calculate the mean (the expected probability) of our distribution and its
standard deviation and to verify how many standard deviations the observed probability is removed from the
expected probability (the z score). Because we usually test our hypothesis using a sample, we work with the
sampling distribution instead of the population distribution. This means that we use the standard error, rather
than the standard deviation. Recall that the formula for the standard error for a binomial distribution was the
following:
$\sqrt{\dfrac{p(1-p)}{n}}$

 Calculate the mean (expected probability) and standard error and store them in the variables average and se. Remember that we worked with an exam of 25 questions and the probability of guessing the correct answer to a question was 0.20. Round these values to 2 digits.
 Assume that a student answered 12 questions correctly. Now calculate the z value and store this in the
variable z_value. Round this value to 2 digits.
 Lastly, calculate the associated p value, round this value to two digits and store it in the variable p_value.
Remember that we are doing a one-sided hypothesis test.
 print p_value to the console

Script.R
# calculate the mean (expected probability) and assign it to a variable called average
average <- 0.20
# calculate the standard error and assign it to a variable called se
se <- round(sqrt((0.20 * 0.80) / 25), 2)
# calculate the z value and assign it to a variable z_value
z_value <- round(((12 / 25) - 0.2) / se, 2)
# calculate the p value and store it in a variable p_value
p_value <- round(pnorm(z_value, lower.tail = FALSE), 2)
# print p_value to the console
p_value
[1] 0

The t distribution
Often when comparing means of continuous variables, we use a t distribution instead of the normal distribution.
The main reason to use the t distribution here is because we often have to deal with small samples.
Now imagine the following example of height: they say that Dutch people are among the tallest in the world,
with an average male height of 185 centimetres and a standard deviation of 5 centimetres. We take a sample of 50
males from this population and find an average height of 186.5 centimetres, which is above the population mean.
Imagine we want to do a one-sided hypothesis test where we check whether the population mean of Dutch male
height is larger than 185, using a significance level of 0.05. There are several things we can do now, and one
thing that we must do.
Firstly, we need to calculate the degrees of freedom, which refers to the number of independent observations in
the data and is equal to the sample size minus 1. Thus, the degrees of freedom here is 50 − 1 = 49.
Secondly, we could either calculate the associated p value or, alternatively, the critical
cut-off value. The critical cut-off value in this case is the 95th percentile, as we are doing a one-sided hypothesis
test.
 Calculate the critical cut-off value using the qt() function given the fact that we perform a one-sided
hypothesis test with a significance level of 0.05. Round this value to two digits and store it in a variable
called cut_off. You can look up the help documentation of this function by typing help(qt) in the
console.
 Print the value of cut_off to the console.

# qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)


Arguments
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
df: degrees of freedom (> 0, maybe non-integer). df = Inf is allowed.
ncp: non-centrality parameter δ; currently except for rt(), only for abs(ncp) <= 37.62. If omitted, use the central t distribution.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].

Script.R
# calculate the critical cut-off value and store it in a variable called cut_off
cut_off <- round(qt(0.95, df = 49), 2)

# print cut_off to the console
cut_off
[1] 1.68

The t distribution (2)


In the last exercise we calculated the critical value using the qt() function. However, we still do not know our t
test statistic and whether it is larger than the cut-off value. Let's calculate the t value in this exercise
and see which p value is associated with it. The formula for the standard error is:
$se = \dfrac{\sigma}{\sqrt{n}}$
The formula for the t value is the same as the formula for the Z value:
$t = \dfrac{\bar{x} - \mu_0}{se}$
 Using our example where we had a sample of 50 males with a mean height of 186.5 and a population
standard deviation of 5 and population mean of 185, calculate the associated standard error, round this
value to two digits and store it in the variable se.
 Calculate the associated t value, round it to two digits and store it in the variable t_value. Remember
to use the same formula as when calculating a z score.
 Using the pt() function with lower.tail = FALSE, calculate the associated p value, round it to
two digits and store it in a variable called p_value. Remember that we are doing a one-sided test.

Script.R
# calculate the standard error and store it in the variable se
se <- round(5 / sqrt(50), 2)
# calculate the t value and store it in a variable called t_value
t_value <- round((186.5 - 185) / se, 2)
# calculate the p value and store it in a variable called p_value
p_value <- round(pt(t_value, df = 49, lower.tail = FALSE), 2)
# print p_value to the console
p_value
[1] 0.02
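Here the test was assembled from summary statistics. Given the raw sample itself, base R's t.test() does the same in one call; a sketch with a hypothetical vector of the 50 measured heights called dutch_sample:

# one-sided t test of H0: mu = 185 against H1: mu > 185
t.test(dutch_sample, mu = 185, alternative = "greater")

Because t.test() estimates the standard deviation from the sample rather than using the population value of 5, its p value will differ slightly from the one computed above.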

Confidence interval and significance testing


Do you remember how to calculate confidence intervals? If not, let's briefly recap. You can calculate a
confidence interval, say a 95% confidence interval, by taking the mean and adding and subtracting its standard
error multiplied by the given t value or z value. Usually confidence intervals are expressed as a two-sided range,
as we will also do in this exercise.
A 95% confidence interval can be interpreted as: we are 95% confident that this interval will contain the
population statistic. Take our last example, where we found a standard error of 0.71, a population mean of 185,
and a sample mean of 186.5. As the sample size was 50, the relevant degrees of freedom were 49.
 Calculate the associated t value for a 95% confidence interval, round it to two digits and store it in the variable t_value. Be aware that this is similar to a two-sided hypothesis test, and you need to consider areas in both tails, so you will need to use the 97.5th (or 2.5th) percentile.
 Calculate the 95% confidence interval, round the lower and upper value of the confidence interval to two
digits and store it in a variable called conf_interval
 Print the variable conf_interval to the console.

Script.R
# calculate the t value and store it in the variable t_value
t_value <- round(qt(0.975, df = 49), 2)
#' calculate a 95% confidence interval as a vector with two values and store it in
#' a variable called conf_interval
conf_interval <- round(186.5 + c(-1, 1) * t_value * 0.71, 2)
# print conf_interval to the console
conf_interval
[1] 185.07 187.93
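The same interval can be wrapped in a small reusable function (hypothetical, for illustration), which also removes the need to round the t value by hand:

# two-sided confidence interval: mean +/- t * standard error
conf_int <- function(m, se, df, level = 0.95) {
  t_value <- qt(1 - (1 - level) / 2, df = df)
  m + c(-1, 1) * t_value * se
}
conf_int(186.5, 0.71, df = 49)
[1] 185.0732 187.9268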
