Teaching Notes of R
Mean
It is calculated by taking the sum of the values and dividing by the number of
values in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
x is the input vector.
trim is used to drop some observations from both end of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result −
[1] 8.22
Applying Trim Option
When trim parameter is supplied, the values in the vector get sorted and then the
required numbers of observations are dropped from calculating the mean.
When trim = 0.3, 3 values from each end will be dropped from the calculations to
find mean.
In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and the values
removed from the vector for calculating mean are (−21,−5,2) from left and
(12,18,54) from right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
# Find mean without the trim option, for comparison.
result.mean <- mean(x)
print(result.mean)
R is a freely distributed software package for statistical analysis and graphics, developed
and managed by the R Development Core Team. R can be downloaded from the Internet
site of the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org). Check that
you download the correct version of R for your operating system (for example, XP for the
PC, Tiger or earlier versions of OSX for Macs). R is related to the S statistical language
which is commercially available as S-PLUS.
R is an object-oriented language. For our basic applications, matrices representing data sets
(where columns represent different variables and rows represent different subjects) and
column vectors representing variables (one value for each subject in a sample) are objects
in R. Functions in R perform calculations on objects. For example, if 'cholesterol' was an
object representing cholesterol levels from a sample, the function 'mean(cholesterol)' would
calculate the mean cholesterol for the sample. For our basic applications, results of an
analysis are displayed on the screen. Results from analyses can also be saved as objects in
R, allowing the user to manipulate results or use the results in further analyses.
Data can be directly entered into R, but we will usually use MS Excel to create a data set.
Data sets are arranged with each column representing a variable, and each row
representing a subject; a data set with 5 variables recorded on 50 subjects would be
represented in an Excel file with 5 columns and 50 rows. Data can be entered and edited
using Excel. Excel can save files in 'comma delimited format', or .csv files; these .csv files
can then be read into R for analysis.
R is an interactive language. When you start R, a blank window appears with a '>', which is
the ready prompt, on the first line of the window. Analyses are performed through a series of
commands; the user enters a command and R responds, the user then enters the next
command and R responds. In this document, commands typed in by the user are given in
red and responses from R are given in blue; R uses this same color scheme.
The 'assign operator' in R is used to assign a name to an object. For example, suppose we
have a sample of 5 infants with ages (in months) of 6, 10, 12, 7, 14. In R, these values can
be represented as a column vector (as a data set, these values would be arranged in one
column for the variable age, with 5 rows). To enter these data into R and give the name
'agemos' to these data, we can use the command:
> agemos <- c(6,10,12,7,14)
The '>' is the ready prompt given by R, indicating that R is ready for our input (R typed
the >, I typed the rest of the line). Here, agemos is the name we are giving to the object that
we will be creating. The '<-' is the assign operator, and the 'c( …)' is a function creating
a column vector from the indicated values. So we are creating the object 'agemos' which is a
data vector (or variable in a data set).
> agemos
[1] 6 10 12 7 14
The '[1]' that R gives at the start of the line is a counter – this line starts with the first value
in the object (this is helpful with larger data sets when the printout extends over several
lines). We can use this object name in later analyses. For example, the mean age of these 5
infants can be calculated using the 'mean( )' function:
> mean(agemos)
[1] 9.8
In R, object names are arbitrary and will generally vary to fit a particular application or study.
Functions always involve parentheses to enclose the relevant arguments, and function
names make up the R language. So, we might calculate mean age using mean(agemos) or
mean cholesterol using mean(cholesterol); the function name is constant, but the object
name varies to fit the particular study.
A copy of the R screen for the above analysis would show the input lines that we typed in
red and the output lines that R provides in blue.
For an analysis of a single variable, with a small number of observations, it is easy to enter a
column vector directly into R as described above. But with larger data sets, it is easier to first
create and save the data set in Excel, and then to bring information from the Excel file into
R. There are several ways to do this. I find it easiest to use the 'read.csv(file.choose())'
command, which is described first and uses a Windows-like file menu to find the data file
and then bring data into R.
1.3.1 Bringing data into R from an Excel file using the
read.csv(file.choose()) command
MS Excel is an excellent tool for entering and managing data from a small statistical study.
Data are arranged with variables as columns and subjects as rows. The first row of the Excel
file (the 'header') can be used to provide variable names (object names for vectors in R). For
example, the following are data from the first 5 subjects in a study to compare age first
walking between two groups of infants:
Here, "Subject" is an id code; "group" is coded 1 or 2 for the two study groups; "sexmale" is
coded 1 for males and 0 for females; and "agewalk" is the age when the infant first walked,
in months. Note that I've used single-word (no spaces) variable names; the underscore '_'
and the period '.' are nice ways to separate words in a variable name (for example,
age_years or age.years are viewed as one-word variable names by R).
To bring an Excel data file into R, it first has to be saved as a comma-delimited file. In Excel,
click on 'Save as', and select '.csv' as the file type. Save the file and exit Excel. The .csv file
can then be brought into R as a 'data frame' using the 'read.csv(file.choose())' command.
Entering
> kidswalk <- read.csv(file.choose())
1.3.2 (Optional) Bringing data into R from an Excel file using the
read.csv() command
If you know the name of the file that you want to bring into R, you can read a .csv file directly
into R. For example, suppose we saved the data for the Age at Walking example as the file
'agewalk4R.csv' in the R default directory. It can be read in as:
> kidswalk <- read.csv('agewalk4R.csv')
The 'read.csv' command creates an object (dataframe) for the entire data set represented by
an Excel file, but it does not create objects for the individual variables. So, while we could
perform some analyses on this entire data set, we cannot yet perform analyses on specific
variables. When variable names are specified as the first row of the imported Excel file, R
creates objects using the 'dataframename$variablename' convention. For example, in the
Age First Walking example, after reading in the data set:
> mean(kidswalk$agewalk)
[1] 11.13
The attach( ) command
For convenience, the individual variables in a data set can also be named without the
dataframename prefix. The 'attach()' function creates individual objects for each
variable, where the data frame name is specified in the parentheses:
> attach(kidswalk)
This function does not give any visible output, but creates objects (column vectors) for each
individual variable in the data set, using the variable names specified in the first row as the
object names. For the Age at Walking example, it creates data objects named Subject,
group, sexmale, and agewalk. We could then use any of these variable objects in analyses:
> mean(agewalk)
[1] 11.13
Note that R is case-sensitive, and so 'Subject' is a different name than 'subject'. Also, two
objects cannot have the same name in R, and so you cannot use the same name for both a
dataframe and a variable.
The 'fix( )' function opens a dataframe in R's built-in data editor:
> fix(kidswalk)
The data set appears in a spreadsheet format. Analyses cannot be performed while the data
editor is open.
For PCs: Before starting R, right click on the R icon and then click on 'Properties'.
In the 'Start In' field, specify the path of the default folder (the path name should be in
quotes, for example "C:\Users\tch\Documents\BS703\Data Sets").
For Macs: Open R and select the 'Misc' option. Then choose 'Set
Working Directory'. Browse to the default folder.
The default folder only needs to be set once, and R will continue to look for files in the
default folder.
The default folder for R can be over-written for a single session. After starting R, click on the
'File' menu in the R screen, then select 'Change dir', and specify the directory to be used for
this session. R will look for files in this directory for the current session, but will go back to
the default directory in future sessions. However, if you 'save the workspace' and then start R
by clicking on the saved workspace, settings can be carried over to future sessions.
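The working directory can also be inspected and changed from the R prompt with the getwd( ) and setwd( ) functions; a minimal sketch (the tempdir( ) call is just so the example runs anywhere — in practice you would supply your own folder path):

```r
# Show the current working directory
oldwd <- getwd()
print(oldwd)
# Change it for this session; tempdir() is a stand-in for a real path such as
# setwd("C:/Users/tch/Documents/BS703/Data Sets")
setwd(tempdir())
print(getwd())
# Restore the original working directory
setwd(oldwd)
```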
Many research studies involve some data management before the data are ready for
statistical analysis. For example, in using R to manage grades for a course, 'total score' for
homework may be calculated by summing scores over 5 homework assignments. Or for a
study examining age of a group of patients, we may have recorded age in years but we may
want to categorize age for analysis as either under 30 years vs. 30 or more years. R can be
used for these data management tasks.
The 'ifelse( )' function can be used to create a two-category variable. The following
example creates an age group variable that takes on the value 1 for those under 30, and the
value 0 for those 30 or over, from an existing 'age' variable:
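A minimal sketch of this recoding, assuming an existing numeric vector named 'age' (the data values here are made up for illustration):

```r
# Hypothetical ages for five subjects
age <- c(25, 34, 29, 41, 18)
# agegroup is 1 for those under 30 years, 0 for those 30 or over
agegroup <- ifelse(age < 30, 1, 0)
agegroup
# [1] 1 0 1 0 1
```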
In logical expressions, two equal signs are needed for 'is equal to'
(e.g., obese <- ifelse(BMIgroup==4,1,0)), and the 'not equal to'
sign in R is '!='.
A series of commands are needed to create a categorical variable that takes on more than
two categories. For example, to create an agecat variable that takes on the values 1, 2, 3, or
4 for those under 20, between 20 and 39, between 40 and 59, and over 60, respectively:
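One way to build such a variable is with nested ifelse( ) calls; a sketch with made-up ages (the text's category boundaries leave age 60 itself ambiguous, so here 60 and over is coded 4):

```r
# Hypothetical ages
age <- c(15, 25, 45, 70)
# 1: under 20; 2: 20-39; 3: 40-59; 4: 60 and over (assumed for age 60 itself)
agecat <- ifelse(age < 20, 1,
          ifelse(age < 40, 2,
          ifelse(age < 60, 3, 4)))
agecat
# [1] 1 2 3 4
```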
The 'write.csv( )' command can be used to save an R data frame as a .csv file.
While variables created in R can be used with existing variables in analyses, the new
variables are not automatically associated with a dataframe. For example, suppose we read
in a .csv file under the dataframe name 'healthstudy', and that 'age' and 'weight.lb' were
variables in this data frame. If we created the 'weight.kg' and 'agecat' variables described
above, these variables would be available for analyses in R but would not be part of the
'healthstudy' dataframe. The 'cbind( )' function can be used to add new variables to a
dataframe (bind new columns to the dataframe). For example,
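a minimal sketch, where the dataframe and its values are made-up stand-ins for the 'healthstudy' example:

```r
# Hypothetical stand-in for the healthstudy dataframe
healthstudy <- data.frame(age = c(25, 40, 62), weight.lb = c(150, 180, 200))
# New variables created in R (2.2046 lb per kg)
weight.kg <- healthstudy$weight.lb / 2.2046
agecat <- ifelse(healthstudy$age < 30, 1, 0)
# Bind the new columns onto the dataframe
healthstudy <- cbind(healthstudy, weight.kg, agecat)
names(healthstudy)
# "age" "weight.lb" "weight.kg" "agecat"
```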
When new variables have been created and added to a dataframe/data set in R, it may be
helpful to save this updated data set as a .csv file (which can then be converted to an Excel
file or other format). To save a dataframe as a .csv file:
1. First, click on the 'File' menu, click on 'Change directory', and select the folder where you
want to save the file.
2. Use the 'write.csv( )' command to save the file:
> write.csv(healthstudy,'healthstudy2.csv')
The first argument (healthstudy) is the name of the dataframe in R, and the second
argument in quotes is the name to be given the .csv file saved on your computer. R will
overwrite a file if the name is already in use.
> help(read.csv)
gives details relating to the read.csv( ) function, while
> help(mean)
gives details for the mean( ) function. Entering
> ?read.csv
> ?mean
gives the same help information as the commands above.
The help( ) function only gives information on R functions. To search more broadly, you can
use the 'help.search( )' function. For example,
> help.search('mean')
lists R functions involving the term 'mean'.
The 'mean( )' function calculates means from an object representing either a data
matrix or a variable vector. For example, for the 'kidswalk' data set described above, we can
calculate the means for all the variables in the data set (a dataframe object):
> mean(kidswalk)
Subject   group sexmale agewalk
  25.50    1.34    0.48   11.13
The mean( ) function can also be used to calculate the mean of a single variable (a
data vector object):
> mean(agewalk)
[1] 11.13
The 'sd( )' function calculates standard deviations, either for all variables in a data set or
for specific variables.
> sd(kidswalk)
> sd(agewalk)
[1] 1.358308
The length() function returns the number of values (n, the sample size) in a data
vector:
> length(agewalk)
[1] 50
The median of a variable, along with the minimum, maximum, 25th percentile and 75th
percentile, are given by the 'summary( )' function:
> summary(Age_walk)
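A self-contained sketch of summary( ) with made-up ages (in months), showing the kind of output it reports:

```r
# Hypothetical ages at first walking, in months
agewalk <- c(9, 10, 11, 12, 12, 13, 14)
# Five-number summary plus the mean
summary(agewalk)
```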
> table(sexmale)
sexmale
0 1
26 24
The proportions of males and females can be calculated from the frequencies, using R as a
calculator:
> 26/(26+24)
0.52
> 24/(26+24)
0.48
> prop.table(table(sexmale))
sexmale
0 1
0.52 0.48
> mean(agewalk[group==1])
[1] 10.72727
finds the mean of the variable 'agewalk' for those subjects with group equal to 1. When
specifying the condition for inclusion in the subset analysis ('Group==1' in this example), two
equal signs '==' are needed to indicate a value for inclusion. Less than (<) and greater than
(>) arguments can also be used. For example, the following command would find the mean
systolic blood pressure for subjects with age over 50:
> mean(sysbp[age>50])
> tapply(agewalk,group,mean)
       1        2
10.72727 11.91176
> tapply(agewalk,group,sd)
       1        2
1.231684 1.277636
> table(group)
group
 1  2
33 17
The subset() function creates a new data frame, restricting observations to those that
meet some criteria. For example, the following creates a new data frame for kids in Group 2
of the kidswalk data frame (named 'group2kids'), and finds the n and mean Age_walk for this
subgroup:
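The subset( ) command itself is not shown above; a small self-contained sketch of the same idea, using a made-up miniature dataframe (so the numbers differ from the study output below):

```r
# Hypothetical stand-in for the kidswalk dataframe
kidswalk <- data.frame(group    = c(1, 1, 2, 2, 2),
                       Age_walk = c(10, 11, 12, 11, 13))
# Keep only the Group 2 subjects
group2kids <- subset(kidswalk, group == 2)
nrow(group2kids)              # n in the subgroup
mean(group2kids$Age_walk)     # mean age at walking in the subgroup
```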
> length(group2kids)
[1] 5
> mean(group2kids$Age_walk)
[1] 11.91176
In this example, there are two data sets open in R (kidswalk for the overall sample and
group2kids for the subsample) that use the same set of variables names. In this situation, it
is helpful to use the 'dataframe$variablename' format to specify a variable name for the
appropriate sample.
When specifying the condition for inclusion in the subsample ('Group==2' in this example),
two equal signs '==' are needed to indicate a value for inclusion. Less than (<), less than or
equal to (<=), greater than (>), greater than or equal to (>=), or not equal to (!=) arguments
can also be used. For example,
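a sketch using '>=' with hypothetical data (the dataframe and variable names are illustrative):

```r
# Hypothetical dataframe of ages and systolic blood pressures
healthstudy <- data.frame(age   = c(45, 52, 61, 38),
                          sysbp = c(120, 135, 150, 118))
# Restrict to subjects aged 50 or over
older <- subset(healthstudy, age >= 50)
nrow(older)
# [1] 2
```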
Many research studies involve missing data – not all study variables are measured on all
study subjects. Most functions in R handle missing data appropriately by default, but a
couple of basic functions require care when missing data are present. And it's always a good
idea to check for missing data in a data set.
When inputting data directly into R, 'NA' is used to designate missing data. For example,
> xvar <- c(2,NA,3,4,5,8)
> xvar
[1] 2 NA 3 4 5 8
When setting up a dataset using Excel, missing data can be represented either by 'NA' or by
just leaving the cell blank in Excel. In either case, data will be treated as missing when
imported into R.
To check for missing data with a measurement variable, we can use the 'summary( )'
command,
> summary(xvar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    2.0     3.0     4.0     4.4     5.0     8.0       1
For a categorical variable, the table( ) command with the useNA='always' option includes
missing values in the frequency table:
> table(currsmoke,useNA='always')
currsmoke
0 1 <NA>
11 6 3
In this example of current smoking status, there are 11 non-smokers, 6 smokers, and 3 with
missing data.
Most R functions appropriately handle missing data, excluding it from analysis. There are a
couple of basic functions where extra care is needed with missing data.
The length( ) command gives the number of observations in a data vector, including missing
data. For example, there were 6 subjects in the data set for the 'xvar' variable in the example
above, although there were only 5 subjects with actual data and one had a missing value.
Using the length( ) function gives
> length(xvar)
[1] 6
which can be misleading, since there are only 5 subjects with valid values for this variable.
To find the number of non-missing observations for a variable, we can combine the length( )
function with the na.omit( ) function. The na.omit( ) function omits missing data from
a calculation. So, listing the values of xvar gives:
> xvar
[1] 2 NA 3 4 5 8
while listing the non-missing values of xvar gives
> na.omit(xvar)
[1] 2 3 4 5 8
To find the number of non-missing observations for xvar,
> length(na.omit(xvar))
[1] 5
Another common function that does not automatically deal with missing data is the mean( )
function. Trying to calculate a mean for a variable with missing data gives the following:
> mean(xvar)
[1] NA
We can calculate the mean for the non-missing values using the 'na.omit( )' function:
> mean(na.omit(xvar))
[1] 4.4
Some functions also have options to deal with missing data. For example, the mean( )
function has the 'na.rm=TRUE' option to remove missing values from the calculation. So
another way to calculate the mean of non-missing values for a variable:
> mean(xvar,na.rm=TRUE)
[1] 4.4
See the help( ) function documents in R for options for missing data for specific analyses.
A histogram can be created with the 'hist( )' function:
> hist(agewalk)
By default, R uses the variable name (agewalk) in the title and x-axis label for the histogram.
The default title can be over-written using the 'main=paste( )' option, and the x-
axis label can be overwritten using the 'xlab=' option. For example,
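a sketch with made-up data, where the title and axis label are illustrative:

```r
# Hypothetical ages at first walking (months)
agewalk <- c(9, 10, 10, 11, 11, 12, 12, 13, 14)
# Over-ride the default title and x-axis label
hist(agewalk,
     main = paste("Histogram of Age at First Walking"),
     xlab = "Age (months)")
```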
Box plots in R give the minimum, 25th percentile, median, 75th percentile, and maximum of
a distribution; observations flagged as outliers (either below Q1-1.5*IQR or above
Q3+1.5*IQR) are shown as circles (no observations are flagged as outliers in the above box
plot). So, for study group 1, the youngest age at walking was 9 months, the median was
about 10 months, and the oldest age at walking was 13 months.
Labels can be added to the x-axis and y-axis using the 'xlab=' and 'ylab=' options:
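A sketch with hypothetical data for two study groups (the axis labels are illustrative):

```r
# Hypothetical ages at first walking for two study groups
agewalk <- c(9, 10, 11, 12, 13, 11, 12, 13, 14, 15)
group   <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
# Side-by-side box plots with labeled axes
boxplot(agewalk ~ group,
        xlab = "Study group",
        ylab = "Age at first walking (months)")
```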
Statistical table functions in R can be used to find p-values for test statistics. See Section 24,
User Defined Functions, for an example of creating a function to directly give a two-tailed p-
value from a t-statistic.
The pnorm( ) function gives the area, or probability, below a z-value. For example, the
area below z=1.96 is
> pnorm(1.96)
[1] 0.9750021
To find a two-tailed area (corresponding to a 2-tailed p-value) for a positive z-value:
> 2*(1-pnorm(1.96))
[1] 0.04999579
The qnorm( ) function gives the z-value corresponding to a given lower-tail area:
> qnorm(.05)
[1] -1.644854
To find a critical value for a two-tailed 95% confidence interval:
> qnorm(1-.05/2)
[1] 1.959964
The t distribution
The pt( ) function gives the area, or probability, below a t-value. For example, the area
below t=2.50 with 25 d.f. is
> pt(2.50,25)
[1] 0.9903284
To find a two-tailed p-value for a positive t-value:
> 2*(1-pt(2.50,25))
[1] 0.01934313
The qt( ) function gives the t-value corresponding to a given lower-tail area; for a
lower-tail area of 0.05 with 25 d.f.:
> qt(.05,25)
[1] -1.708141
To find the critical t-value for a 95% confidence interval with 25 degrees freedom:
> qt(1-.05/2,25)
[1] 2.059539
The chi-square distribution
The pchisq( ) function gives the lower tail area for a chi-square value:
> pchisq(3.84,1)
[1] 0.9499565
For the chi-square test, we are usually interested in upper-tail areas as p-values. To find the
p-value corresponding to a chi-square value of 4.50 with 1 d.f.:
> 1-pchisq(4.50,1)
[1] 0.03389485
> t.test(agewalk)
	One Sample t-test
data: agewalk
95 percent confidence interval:
 10.74397 11.51603
sample estimates:
mean of x
    11.13
The t.test( ) function can be used to conduct several types of t-tests, and it's a
good idea to check the title in the output ('One Sample t-test') and the degrees of
freedom (which for a CI for a mean are n-1) to be sure R is performing a one-sample t-test.
If we are interested in a confidence interval for the mean, we can ignore the t-value and p-
value given by this procedure (which are discussed in Section 2.2), and focus on the 95%
confidence interval. Here, the mean age at walking for the sample of n=50 infants (degrees
of freedom are n-1) was 11.13, with a 95% confidence interval of (10.74 , 11.52).
R calculates a 95% confidence interval by default, but we can request other confidence
levels using the 'conf.level' option. For example, the following requests the 90% confidence
interval for the mean age at walking:
> t.test(agewalk,conf.level=.90)
data: agewalk
90 percent confidence interval:
 10.80795 11.45205
sample estimates:
mean of x
11.13
Note that R changes the label for the confidence interval (90 percent …) to reflect the
specified confidence level.
NOTE: When using the prop.test( ) function, specifying 'correct=TRUE' tells R to use the
small sample correction when calculating the confidence interval (a slightly different
formula), and specifying 'correct=FALSE' tells R to use the usual large sample formula for
the confidence interval. (Since categorical data are not normally distributed, the usual z-
statistic formula for the confidence interval for a proportion is only reliable with large
samples, with at least 5 events and 5 non-events in the sample.)
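To see the effect of this option directly, the interval alone can be extracted from the prop.test( ) result; a sketch using 26 events out of 50:

```r
# The same confidence interval with and without the continuity correction
prop.test(26, 50, correct = FALSE)$conf.int
prop.test(26, 50, correct = TRUE)$conf.int
```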
> table(sexmale)
sexmale
0 1
26 24
> 26/(26+24)
0.52
> prop.test(26,50,correct=FALSE)
	1-sample proportions test without continuity correction
95 percent confidence interval:
 0.3851174 0.6520286
sample estimates:
   p
0.52
The prop.test( ) procedure can be used for several scenarios, so it's a good idea
to check the labeling (1-sample proportions) to make sure we set things up correctly. The
procedure also tests a hypothesis about the proportion (see Section 2.3), but we can focus
on the 'p' of 0.52 (the sample proportion) and the confidence interval (0.385 , 0.652). This
procedure uses a slightly different formula for the CI than presented in class, and the results
of the two versions of the formula may differ slightly. With small samples, it is more
appropriate to use the 'correct=TRUE' option to use the correction factor. There is also a
'binom.exact( )' function which calculates a confidence interval for a proportion using an
exact formula appropriate for small sample sizes.
Confidence Intervals for Comparing Means
The t.test( ) function can also be used to compare means between two samples,
and gives the confidence interval for the difference in the means from two independent
samples as well as performing the independent samples t-test. For the following syntax, the
underlying data set includes the subjects from both samples, with one variable indicating the
dependent variable (the outcome variable) and another variable indicating which group a
subject is in. The outcome variable and grouping variable are identified using the 'outcome ~
group' syntax. For the usual pooled-variance version of the t-test:
> t.test(agewalk~group,var.equal=TRUE)
95 percent confidence interval:
 -1.9331253 -0.4358587
sample estimates:
mean in group 1 mean in group 2
       10.72727        11.91176
The t.test( ) function can be used to conduct several types of t-tests, with several
different data set-ups, and it's a good idea to check the title in the output ('Two Sample
t-test') and the degrees of freedom (n1 + n2 - 2) to be sure R is performing the pooled-
variance version of the two sample t-test.
The t-statistic and p-value are discussed under Section 2.2.2. The 95% confidence interval
that is given is for the difference in the means for the two groups (10.73 – 11.91 gives a
difference in means of -1.18, and the CI that R gives is a CI for this difference in means). By
default, R gives the 95% CI; the 'conf.level' level option can be used to change the
confidence level (see Section 2.1.1). Note that the output gives the means for each of the
two groups being compared, but not the standard deviations or sample sizes. This additional
information can be obtained using the tapply( ) function as described in Section 7 (in
this example, tapply(agewalk,group,sd) will give standard deviations, and
table(group) will give n's).
To calculate the confidence interval for the difference in means using the unequal variance
formula:
> t.test(agewalk~group)
95 percent confidence interval:
 -1.9526304 -0.4163536
sample estimates:
mean in group 1 mean in group 2
       10.72727        11.91176
Again, it's good to check the title (Welch Two Sample t-test) and degrees of freedom (which
often take on decimal values for the unequal variance version of the t-test) to be sure R is
using the unequal variance formula for the confidence interval and t-test.
The t.test( ) function can also be used to calculate the confidence interval for a
mean from a paired (pre-post) sample, and to perform the paired-sample t-test. In this
situation, we need to specify the two data vectors representing the two variables to be
compared. The following example compares the means of a pre-test score (score1) and a
post-test score (score2) from a sample of 5 subjects. The t.test( ) function does not
give the means of the two underlying variables (it does give the mean difference) and so I
used the mean( ) function to get this descriptive information. Generally standard
deviations and sample size would also be reported, which can be obtained from the sd( )
and length( ) functions.
> mean(score1)
[1] 20.2
> mean(score2)
[1] 21
> t.test(score1,score2,paired=TRUE)
Paired t-test
95 percent confidence interval:
 -5.874139  4.274139
sample estimates:
mean of the differences
                   -0.8
The t.test( ) function can be used for several different types of t-tests, and so it's a
good idea to check the title (Paired t-test) and degrees of freedom (n-1, where n is the
number of pairs in the study) to be sure R is performing a paired sample analysis.
The confidence interval here is the confidence interval for the mean difference; the
confidence interval should agree with the p-value in that the CI should not contain 0 when
p<0.05, and the CI should contain 0 when p>0.05.
Note that the t.test( ) procedure gives the mean difference, but does not give the
standard deviations of the difference or the standard deviations of the two variables.
Generally, standard deviations are reported as part of the data summary for a comparison of
means, and these standard deviations can be found using the 'sd( )' command.
> table(by1year,group)
group
by1year 1 2
0 5 9
1 28 8
> 28/33
0.848
> 8/17
0.470
> prop.test(c(28,8),c(33,17),correct=FALSE)
	2-sample test for equality of proportions without continuity correction
95 percent confidence interval:
 0.1109476 0.6448456
sample estimates:
   prop 1    prop 2
0.8484848 0.4705882
Warning message:
The prop.test( ) command does several different analyses, and it's a good idea to
check the title to make sure R is comparing two groups ('2-sample test for equality…'). The
procedure also gives the results of a chi-square test comparing the two proportions (see
Section 2.5), but here we are interested in the confidence interval and the proportions in
each study group. For this example, 84.8% of the exercise group was walking by 1 year, and
47.1% of the control group was walking by 1 year. The difference in these two proportions is
84.8 – 47.1 = 37.7, and the 95% CI for this difference is (11.1% , 64.5%). We are 95%
confident that more infants walk by 1 year in the exercise group (since this interval does not
contain 0); we are 95% confident that the additional percent of kids walking by 1 year is
between 11.1% and 64.5%.
The data layout matters for calculating RRs. For the riskratio( ) function from epitools,
data should be set up in the following format:
            No Disease   Disease
Control
Exposed
This data layout corresponds to the usual 0/1 coding for the exposure and disease variables,
but is slightly different than the layout traditionally used in the Introductory Epidemiology
class (so be careful!). The riskratio( ) command calculates the RR of disease for
those in the exposed group relative to the control group.
> table(NoExercise,LateWalker)
LateWalker
NoExercise 0 1
0 28 5
1 8 9
> riskratio.wald(NoExercise,LateWalker)
$data
Outcome
Predictor 0 1 Total
0 28 5 33
1 8 9 17
Total 36 14 50
$measure
$p.value
two-sided
0 NA NA NA
$correction
[1] FALSE
attr(,"method")
Warning message:
The RR here is 3.49 ( (9/17) / (5/33) ) , with a 95% CI of (1.39 , 8.80). There are several
versions of a CI for a relative risk, and using 'riskratio.wald( )' requests the
standard normal approximation formula; 'riskratio.small( )' uses a correction
to the CI for small samples (and the 'Warning message' that R gave in the above example,
that the 'Chi-squared approximation may be incorrect' is a small sample size warning). R will
choose the appropriate version of the CI if 'riskratio( )' is specified.
              No Side Effect   Side Effect
Traditional             5169           111
Robo-Assist             3355           165
The rate of side effects was 2.1% (111/5280) vs. 4.7% (165/3520) for those undergoing
traditional vs. robot-assisted surgery. Table orientation matters for the RR (see Section
2.1.6.1), and this table is set up to find the RR of a side effect, for those undergoing robot-
assisted compared to traditional surgery.
> sideeffects
     [,1] [,2]
[1,] 5169  111
[2,] 3355  165
> riskratio.wald(sideeffects)
$data
Outcome
$measure
Exposed1 1.00000 NA NA
$p.value
two-sided
Exposed1 NA NA NA
$correction
[1] FALSE
attr(,"method")
Calculating the odds ratio ( (9/8) / (5/28) = 6.3 ) and 95% CI for late walkers (see the
example in 2.1.6 above), for non-exercisers vs. exercisers in the Age at Walking example:
> oddsratio.wald(NoExercise,LateWalker)
$data
Outcome
Predictor 0 1 Total
0 28 5 33
1 8 9 17
Total 36 14 50
$measure
0 1.0 NA NA
$p.value
two-sided
0 NA NA NA
$correction
[1] FALSE
attr(,"method")
Warning message:
> t.test(agewalk,mu=12)
data: agewalk
95 percent confidence interval:
 10.74397 11.51603
sample estimates:
mean of x
    11.13
The t.test( ) function can be used to conduct several types of t-tests, and it's a good
idea to check the title in the output ('One Sample t-test') and the degrees of freedom (n-1
for a one-sample t-test) to be sure R is performing a one-sample t-test.
Note that the t.test( ) function does give the mean, but does not give the standard
deviation or sample size which are usually reported along with a mean (although, for a one
sample test, sample size can be determined from the degrees of freedom which are given).
This information can be obtained using the sd( ) function and the length( ) function
(sd(agewalk) and length(agewalk) for this example), although care is needed with
the length( ) command when there are missing values.
> t.test(agewalk~group,var.equal=TRUE)
95 percent confidence interval:
 -1.9331253 -0.4358587
sample estimates:
mean in group 1 mean in group 2
       10.72727        11.91176
The t.test( ) function can be used to conduct several types of t-tests, with several
different data set-ups, and it's a good idea to check the title in the output ('Two Sample
t-test') and the degrees of freedom (n1 + n2 - 2) to be sure R is performing the pooled-
variance version of the two sample t-test.
R reports a two-tailed p-value, as indicated by the two-tailed phrasing of the alternative
hypothesis. The 95% confidence interval that is given is for the difference in the means for
the two groups (10.73 – 11.91 gives a difference in means of -1.18, and the CI that R gives
is a CI for this difference in means). Note that the output gives the means for each of the two
groups being compared, but not the standard deviations or sample sizes. This additional
information can be obtained using the tapply( ) function as described in Section 7 (in
this example, tapply(agewalk,group,sd) will give standard deviations,
table(group) will give n's).
To perform an independent sample t-test using the unequal variance version of the t-test:
> t.test(agewalk~group)
95 percent confidence interval:
 -1.9526304 -0.4163536
sample estimates:
mean in group 1 mean in group 2
       10.72727        11.91176
Again, it's good to check the title (Welch Two Sample t-test) and degrees of freedom (which
often take on decimal values for the unequal variance version of the t-test) to be sure R is
performing the unequal variance version of the two sample t-test. As discussed above,
standard deviations and sample sizes are also usually given as part of the summary for a
two-sample t-test.
> mean(score1)
[1] 20.2
> mean(score2)
[1] 21
> t.test(score1,score2,paired=TRUE)
Paired t-test
95 percent confidence interval:
 -5.874139  4.274139
sample estimates:
mean of the differences
                   -0.8
The t.test( ) function can be used for several different types of t-tests, and so it's a
good idea to check the title (Paired t-test) and degrees of freedom (n-1, where n is the
number of pairs in the study) to be sure R is performing a paired sample test.
The confidence interval here is the confidence interval for the mean difference; the
confidence interval should agree with the p-value in that the CI should not contain 0 when
p<0.05, and the CI should contain 0 when p>0.05.
Note that the t.test( ) procedure gives the mean difference, but does not give the
standard deviations of the difference or the standard deviations of the two variables.
Generally, standard deviations are reported as part of the data summary for a comparison of
means, and these standard deviations can be found using the 'sd( )' command.
> table(walkby12)
walkby12
0 1
14 36
> prop.test(36,50,p=0.5,correct=FALSE)

        1-sample proportions test without continuity correction

95 percent confidence interval:
 0.5833488 0.8252583
sample estimates:
   p
0.72
The prop.test( ) procedure can be used for several scenarios, so it's a good idea
to check the labeling (1-sample proportions) to make sure we set things up correctly. 72%
of infants began walking before age 12 months. The two-tailed p-value here is p=0.0018,
which is less than the conventional cut-off of 0.05, and so we can conclude that the percent
of infants walking before age 12 months is significantly greater than 50%. The prop.test( )
procedure also gives a confidence interval for this proportion (see Section 2.1.2).
Note that the CI here does not contain the null value of 0.50,
agreeing with the p-value that the percent walking by age 12 is greater than 50%.
There is also a 'binom.exact( )' function which calculates a confidence interval for
a proportion using an exact formula appropriate for small sample sizes.
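binom.exact( ) is part of the epitools add-on package; base R's binom.test( ) computes a comparable exact (Clopper-Pearson) interval with no add-on needed. For the 36 of 50 infants walking by 12 months:

```r
# Exact binomial test and Clopper-Pearson 95% CI for 36 successes in 50 trials
binom.test(36, 50, p = 0.5)
```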
The example below uses data from the Age at Walking example, comparing the proportion
of infants walking by 1 year in the exercise group (group=1) and control group (group=2).
The table( ) command is used to find the number of infants walking by 1 year in each study
group, and the proportion walking can be calculated from these frequencies. The prop.test( )
command performs the chi-square test comparing the two proportions; for the two-sample
situation, first enter a vector representing the number of successes in each of the two groups
(using the c( ) command to create a column vector), and then a vector representing the
number of subjects in each of the two groups. To use the usual large-sample formula in
calculating the confidence interval, include the 'correct=FALSE' option to turn off the small
sample size correction factor in the calculation (although in this example, with only 17
subjects in the control group, the small sample version of the confidence interval might be
more appropriate).
> table(by1year,group)
group
by1year 1 2
0 5 9
1 28 8
> 28/33
[1] 0.8484848
> 8/17
[1] 0.4705882
> prop.test(c(28,8),c(33,17),correct=FALSE)

        2-sample test for equality of proportions without continuity correction

95 percent confidence interval:
 0.1109476 0.6448456
sample estimates:
   prop 1    prop 2
0.8484848 0.4705882
The prop.test( ) command does several different analyses, and it's a good idea to
check the title to make sure R is comparing two groups ('2-sample test for equality…'). The
p-value (p=0.0048) is a two-tailed p-value testing the null hypothesis of no difference
between the two proportions. Since the p-value is less than the conventional 0.05, this
example shows a significant difference in the percent of infants walking by 1 year; more
infants in the exercise group are walking by 1 year than in the control group. The procedure
gives a chi-square statistic which is equal to the square of the z-statistic. Here the z-statistic
would be the square root of 7.9478 or z=2.819. The procedure also gives the results of a
confidence interval for the difference between the two proportions (see section 2.1.5).
2.4 One factor ANOVA comparing means across
several groups
As an example, suppose we want to compare the mean days to healing for 5 different
treatments for fever blisters.
1. 'DaysHeal' is the number of days to healing (fewer days indicate more effective
medication) and our outcome variable;
2. 'Treatment' is a group variable coded 1 through 5 for the 5 treatments;
3. 'TreatName' is a character variable, with character values (TreatA, TreatB, etc.)
rather than numeric values for treatment group.
4. There are 6 subjects given each of the 5 treatments, for a sample of 30 subjects
overall. For most analyses, R prefers numeric variables, but for Analysis of Variance,
R prefers that the grouping variable be a character variable rather than a numeric
variable.
When R performs an ANOVA, there is a lot of potential output. So I generally save the
'results' of the ANOVA as an object, and then ask for different parts of the output through
different commands. To perform the ANOVA:
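The aov( ) command was lost from this copy; based on the variable descriptions above it would look something like the following (the data values here are made up for illustration):

```r
# Made-up illustrative data: 5 treatments, 6 subjects each (30 overall)
DaysHeal  <- c(4, 5, 6, 5, 4, 6,   6, 7, 5, 6, 7, 6,   5, 5, 6, 4, 5, 6,
               7, 8, 6, 7, 7, 8,   4, 3, 5, 4, 4, 5)
TreatName <- rep(c("TreatA", "TreatB", "TreatC", "TreatD", "TreatE"), each = 6)

# Save the ANOVA results as an object so parts of the output can be
# requested later with model.tables( ), summary( ), and TukeyHSD( )
fever_anova <- aov(DaysHeal ~ TreatName)
```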
(If the grouping variable is a numeric variable, you can declare it to be categorical using the
factor( ) function. For example, for the numeric 'Treatment' variable, the above ANOVA
command becomes fever_anova <- aov(DaysHeal ~ factor(Treatment)).)
We can now request different summary results about the analysis using the results of this
analysis. To see the means for the study groups:
> model.tables(fever_anova,"means",digits=3)
Tables of means
Grand mean
5.633333
TreatmentF
TreatmentF
1 2 3 4 5
The subset( ) command or the tapply( ) function can be used to get standard deviations and
sample sizes for each group, as described in Section 5b: Finding means and standard
deviations for subgroups.
To request the ANOVA table and p-value for the overall ANOVA comparing means across
the 5 groups:
> summary(fever_anova)
---
Given the overall ANOVA shows significance, we can request pairwise comparisons using
Tukey's multiple comparison procedure:
> TukeyHSD(fever_anova)
$TreatmentF
Preference   Observed Frequency   Expected Proportion Under the Null
Test A       10                   0.333
Test B       15                   0.333
Test C       20                   0.333
To analyze these data in R, first create an object (arbitrarily named 'obsfreq' in the example)
that contains the observed frequencies. Second, we create an object that contains the
expected probabilities under the null (arbitrarily named 'nullprobs'; the third probability was
rounded to .334 because the probabilities must sum to 1.00; perhaps a better solution would
have been to give the probabilities as 1/3,1/3,1/3, which would also work). Third, we
compare the observed frequencies to the expected probabilities through the chisq.test( )
function:
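The object-creation commands did not survive in this copy; from the description above they would be:

```r
# Observed frequencies and null probabilities from the preference table
obsfreq   <- c(10, 15, 20)
nullprobs <- c(0.333, 0.333, 0.334)   # or simply c(1/3, 1/3, 1/3)
```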
> chisq.test(obsfreq,p=nullprobs)
data: obsfreq
From the Age at Walking example, suppose we want to compare the percent of males
(coded sexmale=1) between the two groups in our age first walking example. We can first
use the 'table( )' function to get the observed counts for the underlying frequency table:
> table(group,sexmale)
sexmale
group 0 1
1 17 16
2 9 8
In group 1, there are 16 males and 17 females, so 48.5% (16/33) of group 1 is male.
In group 2, 47.1% (8/17) are male. The 'prop.table( )' function will calculate these proportions
in R:
> prop.table(table(group,sexmale),1)
sexmale
group 0 1
1 0.5151515 0.4848485
2 0.5294118 0.4705882
The 'prop.table( )' command calculates proportions from the indicated table; in
this example we want to calculate proportions within groups, and the '1' in the 'prop.table( )'
example above indicates that we want proportions calculated within groups for the first
variable in the table (within group, so we're calculating the percent of males and females
within group 1, and the percent of males and females within group 2). Had we indicated '2' in
the above example, R would have calculated proportions within sex, giving the proportions in
groups 1 and 2 for males, and the proportions within groups 1 and 2 for females.
> 16/(16+17)
[1] 0.4848485
> 8/(8+9)
[1] 0.4705882
> chisq.test(table(group,sexmale),correct=FALSE)
R can also perform a chi-square test on frequencies from a contingency table. For example,
suppose we want to compare percent of subjects testing positive on a marker for an
exposure across three groups:
              Group 1    Group 2    Group 3
Test Positive 20 (40%)   5 (33.3%)  40 (50%)
Test Negative 30         10         40
First, we create an object ('obsfreq' in the example) containing the observed frequencies
from the observed table. I printed the object as a check that it was created correctly:
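The creation command is missing from this copy; rbind( ) builds the matrix one row at a time:

```r
# Observed frequencies, entered row by row (test positive, test negative)
obsfreq <- rbind(c(20, 5, 40),
                 c(30, 10, 40))
```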
> obsfreq
     [,1] [,2] [,3]
[1,]   20    5   40
[2,]   30   10   40
The 'chisq.test( )' function will then calculate the chi-square statistic for the test
of independence for this table:
> chisq.test(obsfreq)
data: obsfreq
The usual chi-square test is appropriate for large sample sizes. For 2x2 tables with small
samples (an expected frequency less than 5), the usual chi-square test exaggerates
significance, and Fisher's exact test is generally considered to be a more appropriate
procedure. The fisher.test() function performs Fisher's exact test in R:
> fisher.test(group,sexmale)

        Fisher's Exact Test for Count Data

data:  group and sexmale
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.2480199 3.5592990
sample estimates:
odds ratio
 0.9455544
R gives the two-tailed p-value, as indicated by the wording of the alternative hypothesis. The
odds ratio and a 95% confidence interval for the odds ratio are also given. Since Fisher's
test is usually used for small sample situations, the CI for the odds ratio includes a correction
for small sample sizes.
Epidemiologic analyses are available through 'epitools', an add-on package to R. To use the
epitools functions, you must first do a one-time installation. In R, click on the
'Packages' menu, then 'Install Package(s)', then select a download site (from the US), then
select the epitools package. This will install the add-on package onto your computer. To use
the package, you must also load it into R: click on the 'Packages' menu, then 'Load
Package', then select epitools. While you only need to install the package once onto your
computer, you will need to load the package into R each time you want to use it.
The data layout matters for calculating RRs. For the riskratio( ) function from epitools, data
should be set up in the following format:
          No Disease   Disease
Control
Exposed
riskratio( ) calculates the RR of disease for those in the exposed group relative
to the control group.
For the Age at Walking example, I categorized age at walking as early walking (under 12
months, coded 0) and late walking (12 months or older, coded 1). To find the relative risk for
late walking, for kids in Group 2 vs. Group 1, I first printed the 2x2 table as a check, then
used the riskratio() function to calculate the relative risk and large sample 95% confidence
interval.
> table(group,LateWalker)
LateWalker
group FALSE TRUE
1 28 5
2 8 9
> riskratio.wald(group,LateWalker)
$data
Outcome
1 28 5 33
2 8 9 17
Total 36 14 50
$measure
1 1.000000 NA NA
$p.value
two-sided
1 NA NA NA
$correction
[1] FALSE
attr(,"method")
The RR here is 3.49 ( (9/17) / (5/33) ) , with a 95% CI of (1.39 , 8.80). There are several
versions of a CI for a relative risk, and using 'riskratio.wald( )' requests the standard normal
approximation formula; 'riskratio.small( )' uses a correction to the CI for small samples. R will
choose the appropriate version of the CI if 'riskratio( )' is specified.
The epitools add-on package also has a function to calculate odds ratios and confidence
intervals for odds ratios. You must first load the epitools package into R (see Section 16d).
Orientation of the table matters when calculating the OR, and the orientation described
above for the relative risk also applies for the odds ratio. Calculating the odds ratio ( (9/8) /
(5/28) = 6.3 ) and 95% CI for late walkers, for Group 2 vs. Group 1 in the Age at Walking
example:
> oddsratio.wald(group,LateWalker)
$data
Outcome
1 28 5 33
2 8 9 17
Total 36 14 50
$measure
$p.value
two-sided
1 NA NA NA
$correction
[1] FALSE
attr(,"method")
The 'oddsratio.wald( )' option gives the usual estimate for the odds ratio, with
OR=6.3 and 95% CI of (1.64 , 24.21); 'oddsratio.small( )' uses a correction for
small sample size in calculating the CI.
The wilcox.test( ) function performs the Wilcoxon rank sum test (for two
independent samples, with the 'paired=FALSE' option) and the Wilcoxon signed rank test (for
paired samples, with the 'paired=TRUE' option). With samples less than 50 and no ties, R
calculates an exact p-value, otherwise R uses a normal approximation with a correction
factor to calculate a p-value.
To perform a Wilcoxon rank sum test, data from the two independent groups must be
represented by two data vectors. In this example, we want to compare lactate levels for
subjects from Group=1 vs. Group=2 (the original data frame contains data on subjects from
both study groups, with the Group variable indicating group membership). The following
commands create separate data vectors for lactate for subjects in the two study groups (see
Section 7 for the subset command; I printed the two data vectors as a check):
> lactate.sga
[1] 5.79 4.60 4.20 1.65 2.38 5.67 12.60 3.40 7.57
2.48 4.36
> lactate.controls
> wilcox.test(lactate.sga,lactate.controls,paired=FALSE)
> summary(lactate.sga)
> wilcox.test(Lactate[Group==2],Lactate[Group==1],paired=FALSE)
The wilcox.test( ) function will perform the Wilcoxon signed rank test comparing
medians for paired samples. The paired data must be represented by two data vectors with
the same number of subjects. In this example, the prescores and postscores variables
represent paired test results before and after an intervention. Note that
the wilcox.test( )function does not provide descriptive statistics, and so
the median( )function was used to calculate the median test scores pre and post
intervention. The summary( )function would give the range and interquartile range in
addition to the median.
> wilcox.test(prescores,postscores,paired=TRUE)
V = 8, p-value = 0.3508
Warning message:
In wilcox.test.default(prescores, postscores, paired = TRUE) :
> median(prescores)
[1] 61
> median(postscores)
[1] 59
This section describes how to calculate necessary sample size or power for a study
comparing two groups on either a measurement outcome variable (through the independent
sample t-test) or a categorical outcome variable (through the chi-square test of
independence).
To find the required sample size to achieve a specified power, specify delta, sd, and power.
To find the power for a specified scenario, specify n, delta, and sd. R assumes you are
testing at the two-tailed p=.05 level; you can over-ride these defaults by including
sig.level=xx or alternative='one.sided'.
> power.t.test(delta=.25,sd=0.7,power=.80)
n = 124.0381
delta = 0.25
sd = 0.7
sig.level = 0.05
power = 0.8
alternative = two.sided
Finding power:
> power.t.test(n=50,delta=.25,sd=0.7)
n = 50
delta = 0.25
sd = 0.7
sig.level = 0.05
power = 0.4239677
alternative = two.sided
To find the necessary sample size, specify p1, p2, and power. To find the power for a
particular situation, specify n, p1, and p2. R assumes you are testing at the two-tailed p=.05
level; you can over-ride these defaults by including sig.level=xx or alternative='one.sided'.
Examples:
Finding power:
> power.prop.test(n=100,p1=.2,p2=.1)
n = 100
p1 = 0.2
p2 = 0.1
sig.level = 0.05
power = 0.5081911
alternative = two.sided
> power.prop.test(p1=.2,p2=.1,power=.8)
n = 198.9634
p1 = 0.2
p2 = 0.1
sig.level = 0.05
power = 0.8
alternative = two.sided
ID sexM ht_cm fev1_litres
1 1 174
2 1 181
3 0 184
4 1 177
5 1 177
4.1.1 Scatterplots
The plot( ) function will graph a scatter plot. To plot FEV1 (the dependent or outcome
variable) on the Y axis, and height (the independent or predictor variable) on the X axis:
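The plot command itself is missing here; with the variable names from the data listing it would be the following (made-up values are used so the sketch is self-contained; xlab and ylab are optional axis labels):

```r
# Made-up illustrative values standing in for the real data set
ht_cm       <- c(174, 181, 184, 177, 177, 168, 190)
fev1_litres <- c(3.8, 4.1, 4.5, 3.9, 4.0, 3.4, 4.8)

# Scatterplot: outcome (FEV1) on the Y axis, predictor (height) on the X axis
plot(ht_cm, fev1_litres, xlab = "Height (cm)", ylab = "FEV1 (litres)")
```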
4.1.2 Correlation
The 'cor( )' function calculates correlation coefficients between the variables in a data
set (vectors in a matrix object). For our height and lung function example, where 'fevheight'
is the matrix object representing the data set:
> cor(fevheight)
ID sexM ht_cm fev1_litres
The 'cor.test( )' function gives more detail around the correlation coefficient
between two measurement variables, testing the null hypothesis of zero correlation (no
association) and giving a CI for the correlation coefficient. For our height and lung function
example:
> cor.test(ht_cm,fev1_litres)
95 percent confidence interval:
 0.2104363 0.8224525
sample estimates:
      cor
 0.597332
4.1.3 Simple regression analysis
Regression analysis is performed through the 'lm( )' function. LM stands for Linear
Models, and this function can be used to perform simple regression, multiple regression, and
Analysis of Variance.
For simple regression (with just one independent or predictor variable), predicting FEV1 from
height:
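The command producing the output below was lost from this copy; it would be along these lines (made-up data for illustration; the real output values come from the fevheight data set):

```r
# Made-up illustrative data
ht_cm       <- c(174, 181, 184, 177, 177, 168, 190, 172)
fev1_litres <- c(3.8, 4.1, 4.5, 3.9, 4.0, 3.4, 4.8, 3.7)

# lm( ) fits the regression; summary( ) prints the output summarized below
summary(lm(fev1_litres ~ ht_cm))
```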
Call:
Residuals:
Coefficients:
---
The syntax here is actually calling two functions, the lm( ) function performs the
regression analysis, and the summary( ) function prints selected output from the
regression. The 'Estimate' column in the output gives the intercept and slope for the
regression:
The Pr(>|t|) column in the output gives the p-value for the slope. Here, the p-value for the
slope for height is .00542.
4.1.4 Spearman's nonparametric correlation coefficient
The cor.test( ) function that calculates the usual Pearson's correlation will also
calculate Spearman's nonparametric correlation coefficient (rho). With small samples and no
ties, an exact p-value is calculated, otherwise a normal approximation is used to calculate
the p-value. In this example, Lactate and Alanine are two variables measured on a sample of
n=16 subjects.
> cor.test(Lactate,Alanine,method='spearman')
sample estimates:
rho
0.7117647
Call:
Residuals:
---
Call:
Residuals:
Min 1Q Median 3Q Max
Coefficients:
---
Note that three dummy variables were included in the regression representing the four BMI
categories.
In this example, it would be more natural to use 'normal weight', which is coded as BMIcat of
2, as the reference group. You can specify the reference group for a categorical variable with
the 'relevel( )' command (for reference level, I think). Here, to specify '2' as the
reference category, we would use relevel(factor(BMIcat), ref="2") (getting a bit involved,
using R functions within functions within functions):
> summary(lm(sysbp ~ age + studygrp +
relevel(factor(BMIcat),ref="2")))
Call:
Coefficients:
---
For example, the following regression model predicts systolic blood pressure from sex, age,
BMI, and cholesterol levels:
Call:
Coefficients:
---
> mean(totchol.z)
[1] -2.196409e-16
> sd(totchol.z)
[1] 1
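The commands that created totchol.z were lost from this copy; a z-score version of a variable can be made by hand or with the scale( ) function (the cholesterol values below are made up for illustration):

```r
# Made-up illustrative cholesterol values
totchol <- c(180, 210, 195, 240, 225, 170, 200)

# z-score: subtract the mean, divide by the standard deviation
totchol.z <- (totchol - mean(totchol)) / sd(totchol)
# equivalently: totchol.z <- as.numeric(scale(totchol))
```

The resulting variable has mean 0 (up to rounding error) and SD exactly 1, as the checks above show.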
We can find the standardized coefficients by running the regression analysis on the z-score
version of the variables:
Call:
Residuals:
Coefficients:
---
The 'slopes' from this analysis are the standardized slopes. Note that the p-values for the
(now standardized) slopes match the p-values from the original version of the analysis, and
that the model R-square is the same as in the original version of the analysis.
In R, logistic regression is performed using the glm( ) function, for generalized linear
models. This function can fit several types of regression models, and the syntax specifies
the request for a logistic regression model.
As an example, we will look at factors associated with smoking among a sample of n=300
high school students from the Youth Risk Behavior Survey. The outcome variable is
'eversmokedaily1', coded as 1 for those who have smoked vs. 0 for those who have not. As
a preliminary analysis, we calculate the proportion of respondents who have ever smoked
daily:
> table(eversmokedaily1)
eversmokedaily1
0 1
229 69
> 69/(69+229)
[1] 0.2315436
In the following example, the glm( ) function performs the logistic regression, and
the summary() function requests the default output summarizing the analysis. The
'family=binomial(link=logit)' syntax specifies a logistic regression
model. As with the linear regression routine and the ANOVA routine in R, the 'factor( )'
command can be used to declare a categorical predictor (with more than two categories) in a
logistic regression; R will create dummy variables to represent the categorical predictor
using the lowest coded category as the reference group.
In entering this command, I hit the 'return' to type things in over 2 lines; R will allow you to
continue a command onto a second or third line.
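The glm( ) command itself did not survive in this copy. With the outcome above and assumed predictor names (age and sexM are placeholders, not from the original), the call would look like:

```r
# Made-up illustrative data for a sketch of the logistic regression call
set.seed(42)
eversmokedaily1 <- rbinom(300, 1, 0.23)           # outcome: ever smoked daily
age             <- sample(14:18, 300, replace = TRUE)
sexM            <- rbinom(300, 1, 0.5)

# family=binomial(link=logit) requests a logistic regression model
summary(glm(eversmokedaily1 ~ age + sexM,
            family = binomial(link = logit)))
```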
In this example, 'bmi_cat' is a categorical variable coded 1,2,3 or 4 for those in BMI
categories of underweight, normal weight, overweight, or obese. By default, R creates 3
dummy variables to represent BMI category, using the lowest coded group (here
'underweight') as the reference. You can change the reference category by using the
'relevel( )' command (see dummy variables in multiple linear regression, above). The format
of the relevel( ) command is:
relevel(factor(bmi_cat), ref="2")
This command would treat bmi_cat as a categorical predictor, and use category '2' (normal
weight) as the reference category when creating dummy variables:
relevel(factor(bmi_cat),ref='2') + alc_30days,
family=binomial(link=logit)))
Call:
Coefficients:
---
AIC: 261.20
Number of Fisher Scoring iterations: 5
In logistic regression, slopes can be converted to odds ratios for interpretation. Below we
calculate the odds ratio for those in the BMI overweight category, and we calculate the OR
and the 95% CI for the OR for those having had a drink in the past month vs. those not
having had a drink in the past month (the # indicates a comment that is ignored by R):
[1] 1.743196
[1] 0.9267183
[1] 3.279023
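The commands that produced the three numbers above were lost; the general pattern, reading the slope estimate b and its standard error se off the glm( ) coefficient table, is the following (the values below are placeholders, not the real output):

```r
# Placeholder slope and standard error from the glm( ) coefficient table
b  <- 0.56
se <- 0.32

exp(b)               # odds ratio
exp(b - 1.96 * se)   # lower 95% confidence limit for the OR
exp(b + 1.96 * se)   # upper 95% confidence limit for the OR
```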
The C-statistic (also called the AUC statistic) for the logistic regression can be obtained from
the lroc( ) command, which is in the 'epicalc' add-on package. To find the C-statistic, you
must first install and then load the epicalc package. Once the package is loaded, you can
find the C-statistic by first saving the results of the logistic regression, and then using the
lroc( ) command:
> lroc(logisticresults)
$model.description
$auc
[1] 0.5787582
The lroc( ) command gives a lot of additional output (more detail than we generally
need) and a graph of the ROC curve; the C-statistic is given at the top of the output, labeled
'auc'.
5 Survival Analysis
Survival analyses can be performed using the 'survival' add-on package in R (see Section
16d to download the package into R). First, load 'survival' into the R session by clicking on
the Packages menu, then Load Packages and selecting survival.
> summary(survfit(Surv(days.surv,death)))
> print(survfit(Surv(days.surv,death)))
  n events median 0.95LCL 0.95UCL
 13     11     25      18     Inf
> plot(survfit(Surv(days.surv,death)))
The 'print( )', 'plot( )', and 'survdiff( )' functions in the 'survival' add-on
package can be used to compare median survival times, plot K-M survival curves by
group, and perform the log-rank test to compare two groups on survival. In the following
example, 'survmonths' is survival time in months, 'event' is an indicator variable coded 1 for
those who have had the outcome event and 0 for those who are censored, and 'group' is an
indicator variable coded 1 for the experimental and 0 for the control group.
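The commands themselves are missing from this copy; with the variable names just described they would look something like the following (the data values are made up for illustration):

```r
library(survival)

# Made-up illustrative data: survival time in months, event indicator
# (1 = event, 0 = censored), and group indicator (1 = experimental)
survmonths <- c(5, 8, 12, 20, 3, 9, 15, 22, 7, 18)
event      <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)
group      <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

print(survfit(Surv(survmonths, event) ~ group))  # median survival by group
plot(survfit(Surv(survmonths, event) ~ group))   # K-M curves by group
survdiff(Surv(survmonths, event) ~ group)        # log-rank test
```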
          n events median 0.95LCL 0.95UCL
group=0  28     22      4       3       5
group=1  22     12     14       6     Inf
Call:
The chi-square statistic and p-value given above are for the log rank test of the null
hypothesis of no difference between the two survival curves.
Cox's proportional hazards regression can be performed using the 'coxph( )' and
'Surv( )' functions of the 'survival' add-on package. R gives the parameter estimates
for the Cox model, which can be exponentiated to give estimated hazard ratios (HRs), and
confidence intervals for the parameter estimates can be used to get confidence intervals for
the hazards ratios. The following performs a proportional hazards regression predicting
survival from treatment group (coded 0,1) and age in years, and then finds the HR and 95%
CI for the HR comparing groups.
The 'factor( )' function can be used to declare multi-category categorical predictors
in a Cox model (to be represented by dummy variables in the model), and the 'relevel(factor(
), ref='') command can be used to specify the reference category in creating dummy
variables (see the examples under multiple linear regression and multiple logistic regression
above).
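The coxph( ) call that produced the output below was lost; it would be along these lines (the data values are made up for illustration):

```r
library(survival)

# Made-up illustrative data
survmonths <- c(5, 8, 12, 20, 3, 9, 15, 22, 7, 18)
event      <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)
group      <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
age        <- c(61, 58, 70, 65, 72, 59, 66, 63, 68, 60)

# Cox proportional hazards regression predicting survival from group and age
summary(coxph(Surv(survmonths, event) ~ group + age))
```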
Call:
> exp(-1.94034)
[1] 0.1436551
> exp(-1.94034-1.93*0.144)
[1] 0.1087983
> exp(-1.94034+1.93*0.144)
[1] 0.1896794
Here, group is significantly related to survival (p<.001), with better survival in the treatment
group (group=1) than control group (group=0), with HR=0.144, 95% CI (0.109 , 0.190). Age
does not significantly relate to survival (p=0.76).
For studies with multiple outcomes, p-values can be adjusted to account for the multiple
comparisons issue. The 'p.adjust( )' command in R calculates adjusted p-values
from a set of un-adjusted p-values, using a number of adjustment procedures.
Adjustment procedures that give strong control of the family-wise error rate are the
Bonferroni, Holm, Hochberg, and Hommel procedures.
Adjustments that control for the false discovery rate, which is the expected proportion of
false discoveries among the rejected hypotheses, are the Benjamini and Hochberg, and
Benjamini, Hochberg, and Yekutieli procedures.
To calculate adjusted p-values, first save a vector of un-adjusted p-values. The following
example is from a study comparing two groups on 10 outcomes through t-tests and chi-
square tests, where 3 of the outcomes gave un-adjusted p-values below the conventional
0.05 level. The following calculates adjusted p-values using the Bonferroni, Hochberg, and
Benjamini and Hochberg (BH) methods:
> p.adjust(pvalues,method="bonferroni")
[1] 0.02 0.05 0.15 1.00 1.00 1.00 1.00 1.00 1.00
1.00
> p.adjust(pvalues,method="hochberg")
> p.adjust(pvalues,method="BH")
[1] 0.0200000 0.0250000 0.0500000 0.2825000
0.3783333 0.3783333 0.6485714
User-defined functions can also be created and saved in R. As a simple example, the
following code creates a user-defined function to calculate a 95% confidence interval for a
proportion. The function name is 'CIp', and the input for the function is p (the sample
proportion) and n (the sample size). The '+'s at the beginning of lines were typed by R and
indicate a continuation of the previous line/calculation. The '{ }'s in the function specification
indicate individual calculations or function calls within the function. 'cresult' is a column
vector containing lower (p.lower) and upper (p.upper) confidence limits, and the 'return( )'
function indicates which of the objects created in the function are to be printed when the
function is called.
+ return(cresult)
+ }
> CIp(.30,100)
> CIp(.3,1000)
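Only the tail of the function definition survived above; based on the description, the full function would look something like this sketch, using the usual large-sample formula:

```r
# 95% confidence interval for a proportion (large-sample formula)
# p = sample proportion, n = sample size
CIp <- function(p, n) {
  se <- sqrt(p * (1 - p) / n)     # standard error of the proportion
  p.lower <- p - 1.96 * se        # lower confidence limit
  p.upper <- p + 1.96 * se        # upper confidence limit
  cresult <- c(p.lower, p.upper)  # vector of lower and upper limits
  return(cresult)
}

CIp(.30, 100)
```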
The following creates a function to calculate two-tailed p-values from a t-statistic. The cat( )
function specifies the print out. Unlike the return( ) function (I think), cat( ) allows text labels
to be included in quotes and more than one object to be printed on a line. The '\n' in the
cat( ) function inserts a line return after printing the label and p-value, and multiple line
returns could be specified in a cat( ) statement.
+ cat('Two-tailed p-value',pval,'\n')
+ }
> pvalue.t(2.33,200)
> pvalue.t(-1.55,45)
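The body of this function was lost from this copy; a sketch consistent with the description is:

```r
# Two-tailed p-value from a t-statistic and its degrees of freedom
pvalue.t <- function(tstat, df) {
  pval <- 2 * pt(-abs(tstat), df)          # area in both tails
  cat('Two-tailed p-value', pval, '\n')    # labeled print-out
}

pvalue.t(2.33, 200)
pvalue.t(-1.55, 45)
```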