Lecture_5_(Managing_and_Understanding_Data)
2023-10-04
Contents
1 Learning objectives
2 Materials
2.1 R data structures
2.2 Managing data with R
1 Learning objectives:
• R data structures: Vectors, Factors
• Managing data with R
2 Materials
2.1 R data structures
2.1.1 Elements
x=2
x<-3
4->x
x
## [1] 4
Note that R is used interactively. To see the value stored in x, type x and press the Enter key.
(2). For a small data set, you can use the c() function to enter data directly at the command line. For
example, to create two vectors named height and arm_length, use the following commands.
arm_length<-c(67,67,77,62,69.84,74.168,29.5,81.8,71,71,74.8)
height<-c(170,176,172,183,167.64,175.26,165.1,177.8,170,178,176)
2.1.2 Vectors
The fundamental R data structure is the vector, which stores an ordered set of values called elements.
All elements of a vector must be of the same type.
Common types:
- integer
- numeric
- character (text data)
- logical (TRUE or FALSE values)
- absent or missing values: NULL (no value at all), NA (a missing value)
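As a quick illustration of the types listed above, the following sketch checks a few literal values with typeof() and class(). The values are arbitrary and chosen only for illustration.

```r
# Inspect the type of some illustrative values
typeof(5L)      # "integer"
typeof(5.5)     # "double" (class() reports "numeric")
class(5.5)      # "numeric"
typeof("text")  # "character"
typeof(TRUE)    # "logical"

# NULL has length 0, while NA occupies one position in a vector
length(NULL)    # 0
length(NA)      # 1

# Mixing types in c() coerces every element to one common type
mixed <- c(1, "a", TRUE)
mixed           # "1" "a" "TRUE"
```

Because a vector holds a single type, mixing numbers, text, and logicals silently converts everything to character, which is worth checking for after entering data by hand.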
x<-c(1,1,2,3,5,8,13)
x
## [1] 1 1 2 3 5 8 13
is.vector(x)
## [1] TRUE
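Always check for NA values.
x<-c(67,67,77,62,69.84,74.168,29.5,81.8,71,71,74.8)
x[5] <- NA # set the fifth element to missing
is.na(x) # TRUE wherever an element is missing
if(any(is.na(x))){ stop("there are NAs in the dataset!")} # raises an error because x contains an NA
mean(x) # NA, because one element is missing
mean(x,na.rm=TRUE) # mean of the non-missing values
x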
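Beyond detecting NAs, it is often useful to count them, locate them, and drop them. The following is a minimal sketch using a shortened version of the vector above; the names n_missing, where_missing, and x_complete are introduced here only for illustration.

```r
# A shortened example vector with one missing value
x <- c(67, 67, 77, 62, NA, 74.168)

n_missing <- sum(is.na(x))        # number of missing values: 1
where_missing <- which(is.na(x))  # position of the missing value: 5

# Keep only the non-missing elements
x_complete <- x[!is.na(x)]
length(x_complete)                # 5

# Dropping NAs first gives the same mean as na.rm = TRUE
all.equal(mean(x_complete), mean(x, na.rm = TRUE))  # TRUE
```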
subject_name<-c("John Doe","Jane Doe","Steve Graves")
temperature<-c(98.1,98.6,101.4)
flu_status<-c(FALSE,FALSE,TRUE)
temperature[2]
## [1] 98.6
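temperature[2:3] # elements 2 through 3
temperature[-2] # every element except the second
temperature[c(TRUE,TRUE,FALSE)] # keep the elements where the logical vector is TRUE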
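Logical indexing is most useful when the logical vector comes from a condition rather than being typed by hand. The sketch below reuses the patient vectors; the 98.5 threshold and the names high_temp and with_flu are assumptions made for illustration.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

# The comparison produces a logical vector, which then selects elements
high_temp <- temperature[temperature > 98.5]
high_temp   # 98.6 101.4

# A logical vector can index another vector of the same length
with_flu <- subject_name[flu_status]
with_flu    # "Steve Graves"
```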
2.1.3 Factors
A factor is a special case of a vector used to represent categorical features, that is, characteristics whose
values fall into a fixed set of categories.
gender<-factor(c("MALE","FEMALE","MALE"))
gender
When factors are created, we can add additional levels that may not appear in the data.
Example: for the blood factor of the three patients, we specify an additional vector of the four possible
blood types using the levels argument.
blood<-factor(c("O","AB","A"),
levels=c("A","B","AB","O"))
blood
## [1] O AB A
## Levels: A B AB O
Pain <- c(0,3,2,2,1); Pain
## [1] 0 3 2 2 1
SevPain <- as.factor(c(0,3,2,2,1)); SevPain # or: as.factor(Pain)
## [1] 0 3 2 2 1
## Levels: 0 1 2 3
is.factor(SevPain)
## [1] TRUE
is.vector(SevPain)
## [1] FALSE
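Two consequences of how factors store their levels are worth seeing in code: unused levels still appear in counts, and converting a factor of numbers back to numeric requires going through character. This is a small sketch reusing the blood and SevPain factors; the names blood_counts, codes, and values are introduced only for illustration.

```r
blood <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))
levels(blood)                 # "A" "B" "AB" "O"

# table() reports every level, including "B", which has no observations
blood_counts <- table(blood)
blood_counts[["B"]]           # 0

SevPain <- as.factor(c(0, 3, 2, 2, 1))

# as.integer() returns the internal level codes, not the original values
codes <- as.integer(SevPain)
codes                         # 1 4 3 3 2

# Convert via character to recover the original numbers
values <- as.numeric(as.character(SevPain))
values                        # 0 3 2 2 1
```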
2.1.4 Lists
A list is another special type of vector used for storing an ordered set of values. Unlike an ordinary
vector, a list allows values of different types to be collected.
A list is created using the list() function, as shown in the following example.
subject1<-list(fullname = subject_name[1],
temperature = temperature[1],
flu_status = flu_status[1],
gender = gender[1],
blood = blood[1])
subject1
## $fullname
## [1] "John Doe"
##
## $temperature
## [1] 98.1
##
## $flu_status
## [1] FALSE
##
## $gender
## [1] MALE
## Levels: FEMALE MALE
##
## $blood
## [1] O
## Levels: A B AB O
subject1$fullname
subject1[1]
## $fullname
## [1] "John Doe"
subject1$temperature
## [1] 98.1
subject1[c("temperature","blood")]
## $temperature
## [1] 98.1
##
## $blood
## [1] O
## Levels: A B AB O
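The difference between single and double brackets on a list is easy to miss: single brackets return a smaller list, while double brackets (like $) return the element itself. The following sketch uses a shortened version of subject1; the names as_list and as_value are introduced only for illustration.

```r
# A shortened version of the subject1 list
subject1 <- list(fullname = "John Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

as_list <- subject1["temperature"]     # a list of length 1
as_value <- subject1[["temperature"]]  # the numeric value itself

class(as_list)    # "list"
class(as_value)   # "numeric"

# Arithmetic works on the extracted value; it would fail on as_list
as_value + 1      # 99.1
```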
2.1.5 Data frames
A data frame is a structure analogous to a spreadsheet or database, since it has both rows and columns of
data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the
same number of values.
Example: Using the patient data vectors we created previously, the data.frame() function combines them
into a data frame:
## temperature blood
## 1 98.1 O
## 2 98.6 AB
## 3 101.4 A
pt_data[1,2] # extract the value in the first row and second column of the patient data frame
## [1] 98.1
pt_data[,1] # extract all rows of the first column
Note: stringsAsFactors = FALSE. In R versions before 4.0.0, if we do not specify this option, R automatically
converts every character vector to a factor. Since R 4.0.0 the default is FALSE.
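The call that builds pt_data is not shown in these notes. The following is a sketch of one construction consistent with the outputs above, assuming the columns are supplied in the order subject_name, temperature, flu_status, gender, blood (so that temperature is the second column).

```r
# Patient vectors and factors as defined earlier in the notes
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)
gender <- factor(c("MALE", "FEMALE", "MALE"))
blood <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))

# Assumed column order; stringsAsFactors = FALSE keeps names as character
pt_data <- data.frame(subject_name, temperature, flu_status, gender, blood,
                      stringsAsFactors = FALSE)

pt_data[c("temperature", "blood")]  # select two columns by name
pt_data[1, 2]                       # first row, second column: 98.1
pt_data$subject_name                # a character vector, not a factor
str(pt_data)                        # compact overview of column types
```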
2.1.6 Matrix
A matrix is a data structure that represents a two-dimensional table, with rows and columns of data.
To create a matrix, simply supply a vector of data to the matrix() function, along with a parameter specifying
the number of rows (nrow) or number of columns (ncol).
x<-1:12
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
length(x)
## [1] 12
dim(x)
## NULL
dim(x)<-c(3,4)
x
a<-matrix(1:12,nrow=3,byrow=TRUE) ## another way of defining a matrix
a
a<-matrix(1:12,nrow=3,byrow=FALSE)
a
rownames(a)
## NULL
rownames(a)<-c("A","B","C")
a
colnames(a)<-c("1","2","x","y")
a
## 1 2 x y
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
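Once a matrix has row and column names, elements can be extracted either by position or by name. The sketch below rebuilds the matrix a from above and adds a second matrix b (a name introduced here for illustration) to show how byrow changes where each value lands.

```r
# Column-wise fill (the default), with the names assigned above
a <- matrix(1:12, nrow = 3, byrow = FALSE)
rownames(a) <- c("A", "B", "C")
colnames(a) <- c("1", "2", "x", "y")

dim(a)        # 3 4
a[2, 3]       # row 2, column 3: 8
a["B", "x"]   # the same element by name: 8
a[, 1]        # the whole first column: 1 2 3
a["C", ]      # the whole third row: 3 6 9 12

# Row-wise fill places the same values in different positions
b <- matrix(1:12, nrow = 3, byrow = TRUE)
b[2, 3]       # 7
```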
2.2 Managing data with R
To save a particular data structure to a file that can be reloaded later or transferred to another system, you
can use the save() function. The save() function writes R data structures to the location specified by the
file parameter. R data files have the file extension .RData.
save(x, y, z, file = "mydata.RData") #Regardless of whether x, y, and z are vectors, factors, lists, or
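The notes mention reloading but do not show it. The base R counterpart of save() is load(); the following sketch writes three illustrative objects to a temporary directory and restores them. The object contents and the use of tempdir() are assumptions made so the example can run anywhere.

```r
# Three illustrative objects of different kinds
x <- 1:3
y <- c("a", "b")
z <- list(k = 1)

# Write to a temporary location rather than the working directory
path <- file.path(tempdir(), "mydata.RData")
save(x, y, z, file = path)

# Remove the objects, then restore them from the file
rm(x, y, z)
restored <- load(path)  # load() returns the names of the restored objects
restored
x                       # 1 2 3
```

Note that load() recreates the objects under their original names, so it can overwrite objects of the same name already in the workspace.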
2.2.2 Importing and saving data from CSV files
For a large data set, you can read it into R from an external file. R's input facilities are simple. You can use
the following R functions to import data:
- the read.table() function
- the read.csv() function
Example: A CSV file representing the medical dataset constructed previously would look as follows:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
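To try read.csv() without creating a file, the same CSV content can be passed through the text argument. This is a minimal sketch; the names csv_text and pt_csv are introduced only for illustration.

```r
# The CSV content from the example, held in a character string
csv_text <- "subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A"

# text = parses the string as if it were the contents of a file
pt_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(pt_csv)   # 3 rows, 5 columns
str(pt_csv)   # column types are inferred from the data
```

The first line is treated as the header, and read.csv() infers numeric and logical columns from the values.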
To load a CSV file into R, use the read.csv() function. For example, to load agpop.csv into R:
agpop<-read.csv("agpop.csv") # path is relative to the working directory; a full path can be given instead
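install.packages("tidyverse") # run once to install; not needed in every session
library(tidyverse) # loads dplyr, which provides select(), filter(), arrange(), mutate(), and summarise()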
dim(agpop)
## [1] 3078 15
str(agpop)
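Select columns with the select() function; everything() keeps the remaining columns after those listed.
Rename a column with rename(new = old).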
df1<-select(agpop,acres92,farms92)
df2<-select(agpop,region,county,state,everything())
df3<-rename(agpop,acr87 = acres87)
Filter rows with the filter() function, for example to get a new data frame in which the state is "AK" and
the region is "W".
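No filter() call is shown in these notes, so the following is a sketch on a small made-up data frame rather than on agpop itself. The data frame toy and its values are hypothetical; only the column names state, region, and acres92 are taken from the examples above. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for agpop with a few made-up rows
toy <- data.frame(state = c("AK", "AK", "AL", "WA"),
                  region = c("W", "W", "S", "W"),
                  acres92 = c(100, 200, 300, 400),
                  stringsAsFactors = FALSE)

# Conditions separated by commas are combined with AND
both <- filter(toy, state == "AK", region == "W")
nrow(both)    # 2

# Use | for OR
either <- filter(toy, state == "AK" | region == "W")
nrow(either)  # 3
```

The same pattern applied to the real data would be filter(agpop, state == "AK", region == "W").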
Arrange (sort) rows with the arrange() function. The following sorts the data frame by state and acres92,
first in ascending order and then in descending order.
df7<-arrange(agpop,state,acres92)
df8<-arrange(agpop,desc(state),desc(acres92))
df9<-mutate(agpop,Ratio=largef82/smallf92) # add a new column Ratio computed from existing columns
Aggregate functions: mean(), sum(), and max(). Group rows with group_by(). The following calculates the
total smallf82 by region.
by_region<-group_by(agpop,region)
smallf82_by_category<-summarise(by_region,smallf82value=sum(smallf82))
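The two-step group_by() and summarise() pattern above can also be written as a single pipeline, and summarise() can compute several aggregates at once. This sketch uses a small made-up data frame; toy and its values are hypothetical, with only the column names region and smallf82 taken from the example. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for agpop with made-up values
toy <- data.frame(region = c("W", "W", "S", "S", "S"),
                  smallf82 = c(10, 20, 5, 5, 10),
                  stringsAsFactors = FALSE)

# Group by region, then compute several aggregates per group
totals <- toy %>%
  group_by(region) %>%
  summarise(total = sum(smallf82),
            average = mean(smallf82),
            largest = max(smallf82))

totals  # one row per region
```

Each named argument to summarise() becomes a column in the result, with one row for every group.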