Lecture_5_(Managing_and_Understanding_Data)

The document provides an overview of managing and understanding data in R, focusing on data structures such as vectors, factors, lists, data frames, and matrices. It also covers practical aspects of data management, including saving/loading data, importing from CSV files, and using functions from the tidyverse package for data manipulation. Key learning objectives include understanding R data structures and managing data effectively with R.

Stats447/847 Lecture 5: Managing and Understanding Data

Kyle Gardiner, Jing Wang, Lina Li

2023-10-04

Contents
1 Learning objectives
2 Materials
2.1 R data structures
2.2 Managing data with R

1 Learning objectives:
• R data structures: Vectors, Factors
• Managing data with R

2 Materials

2.1 R data structures

2.1.1 Elements

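How do you enter data into R?

(1) To store a single number, assign it with = or <-. The arrow can also point to the right (->), in which case the name goes on the right-hand side. Each assignment overwrites the previous value of x. For example: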

x=2
x<-3
4->x
x

## [1] 4

Note that R is used interactively. To see what is stored in x, type x and press the Enter key.
(2) For a small data set, you can use the c() function to enter data directly at the command line. For example, to set up two vectors named height and arm_length, use the following commands.

arm_length<-c(67,67,77,62,69.84,74.168,29.5,81.8,71,71,74.8)
height<-c(170,176,172,183,167.64,175.26,165.1,177.8,170,178,176)
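
As a minimal sketch of what can be done once the two vectors exist, the lines below check that they have the same length and compute an element-by-element ratio. The name ratio is introduced here only for illustration.

```r
arm_length <- c(67, 67, 77, 62, 69.84, 74.168, 29.5, 81.8, 71, 71, 74.8)
height <- c(170, 176, 172, 183, 167.64, 175.26, 165.1, 177.8, 170, 178, 176)

# Both vectors describe the same subjects, so their lengths should match
same_length <- length(arm_length) == length(height)
same_length

# Arithmetic on vectors is applied element by element
ratio <- arm_length / height   # hypothetical derived quantity
round(ratio, 3)
```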

2.1.2 Vectors

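The fundamental R data structure is the vector, which stores an ordered set of values called elements. All elements of a vector must be of the same type. Common types:
- integer
- numeric
- character (text data)
- logical (TRUE or FALSE values)
In addition, two special values indicate absence: NULL (no value at all) and NA (a missing value).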

x <- c(1, 1, 2, 3, 5, 8, 13) # Numeric


x

## [1] 1 1 2 3 5 8 13

x <- c(TRUE, TRUE, FALSE, TRUE) # Logical


x

## [1] TRUE TRUE FALSE TRUE

x <- c ("Hello","world") # Character


x

## [1] "Hello" "world"

x <- c(1, TRUE, "Thursday") # Mixed


x # Remember: all elements in a vector must have the same type!

## [1] "1" "TRUE" "Thursday"

is.vector(x)

## [1] TRUE
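
The mixed example above works because R silently converts all elements to the most general type present. The following sketch uses class() to show which type R picks in a few cases, and what happens when the text is converted back to numbers. The variable names mixed and converted are illustrative.

```r
# class() reports the single type R chose for the whole vector
class(c(1L, 2L))         # "integer"
class(c(1L, 2.5))        # "numeric": the integer is promoted
class(c(1, TRUE))        # "numeric": TRUE becomes 1
class(c(1, TRUE, "a"))   # "character": everything becomes text

# Converting text back to numbers yields NA for anything that is not a number
mixed <- c(1, TRUE, "Thursday")
converted <- suppressWarnings(as.numeric(mixed))
converted
```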

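Always check for NA (missing) values before computing summaries:

x<-c(67,67,77,62,69.84,74.168,29.5,81.8,71,71,74.8)
x[5] <- NA # mark the fifth observation as missing

is.na(x) # TRUE where a value is missing
if(any(is.na(x))){ stop ("there are NAs in the dataset!")} # stop() raises an error; in a script this halts execution here

mean(x) # returns NA because one value is missing
mean(x,na.rm=TRUE) # drops the NA before averaging
x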
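
A short sketch of a few other common ways to inspect and handle missing values, using the same vector. The variable names are illustrative only.

```r
x <- c(67, 67, 77, 62, 69.84, 74.168, 29.5, 81.8, 71, 71, 74.8)
x[5] <- NA

n_missing <- sum(is.na(x))         # TRUE counts as 1, so this counts the NAs
where_missing <- which(is.na(x))   # positions of the NAs

mean_with_na <- mean(x)                    # NA: a single missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)   # mean of the 10 observed values

x_complete <- x[!is.na(x)]         # copy of x with the NAs dropped
length(x_complete)
```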

Some examples of vectors and how to index them:

subject_name<-c("John Doe","Jane Doe","Steve Graves")
temperature<-c(98.1,98.6,101.4)
flu_status<-c(FALSE,FALSE,TRUE)

temperature[2]

## [1] 98.6

temperature[2:3]

## [1] 98.6 101.4

temperature[-2]

## [1] 98.1 101.4

temperature[c(TRUE,TRUE,FALSE)]

## [1] 98.1 98.6
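
In practice the logical index is usually produced by a comparison rather than typed by hand. The sketch below assumes a cutoff of 100 purely for illustration; the cutoff and the name has_fever are not part of the lecture.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)

# A comparison yields a logical vector that can be used as an index
has_fever <- temperature > 100   # assumed cutoff, for illustration only
has_fever
temperature[has_fever]
subject_name[has_fever]

# Elements can also be named and then selected by name
names(temperature) <- subject_name
temperature["Jane Doe"]
```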

2.1.3 Factors

A factor is a special case of a vector used for categorical features, that is, features that represent a characteristic whose values fall into a fixed set of categories (levels).

gender<-factor(c("MALE","FEMALE","MALE"))
gender

## [1] MALE   FEMALE MALE
## Levels: FEMALE MALE

When factors are created, we can add additional levels that may not appear in the data.
Example: for the blood factor of the three patients, we specify an additional vector of the four possible blood types using the levels argument.

blood<-factor(c("O","AB","A"),
levels=c("A","B","AB","O"))
blood

## [1] O AB A
## Levels: A B AB O

Pain <- c(0,3,2,2,1);Pain

## [1] 0 3 2 2 1

SevPain <- as.factor(c(0,3,2,2,1)); SevPain # or: as.factor(Pain)

## [1] 0 3 2 2 1
## Levels: 0 1 2 3

levels(SevPain) <- c("none","mild","medium","severe")


SevPain

## [1] none   severe medium medium mild
## Levels: none mild medium severe

is.factor(SevPain)

## [1] TRUE

is.vector(SevPain)

## [1] FALSE
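
Two consequences of how factors are stored are worth seeing in code. This is a minimal sketch: table() counts every level, including unused ones, and converting a factor directly to numbers returns the internal level codes rather than the original values. The names blood_counts, f, codes, and values are illustrative.

```r
blood <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))
blood_counts <- table(blood)   # counts every level, including the unused "B"
blood_counts

Pain <- c(0, 3, 2, 2, 1)
f <- as.factor(Pain)           # levels are the text labels "0", "1", "2", "3"

codes <- as.numeric(f)                  # internal level codes: 1 4 3 3 2
values <- as.numeric(as.character(f))   # original values:      0 3 2 2 1
```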

2.1.4 Lists

A list is another special type of vector, used for storing an ordered set of values. Unlike the vectors above, a list allows values of different types to be collected.
A list is created using the list() function, as shown in the following example.

subject1<-list(fullname = subject_name[1],
temperature = temperature[1],
flu_status = flu_status[1],
gender = gender[1],
blood = blood[1])
subject1

## $fullname
## [1] "John Doe"
##
## $temperature
## [1] 98.1
##
## $flu_status
## [1] FALSE
##
## $gender
## [1] MALE
## Levels: FEMALE MALE
##
## $blood
## [1] O
## Levels: A B AB O

subject1$fullname

## [1] "John Doe"

subject1[1]

## $fullname
## [1] "John Doe"

subject1$temperature

## [1] 98.1

subject1[c("temperature","blood")]

## $temperature
## [1] 98.1
##
## $blood
## [1] O
## Levels: A B AB O
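
Note that single brackets return a smaller list, while double brackets (or $) return the element itself. The sketch below illustrates the difference on a shortened version of subject1; the age element and its value are hypothetical and added only for illustration.

```r
subject1 <- list(fullname = "John Doe", temperature = 98.1, flu_status = FALSE)

class(subject1[1])     # "list": single brackets return a sub-list
class(subject1[[1]])   # "character": double brackets return the element itself

# Arithmetic needs the element, not a one-element list
warmer <- subject1[["temperature"]] + 1
# subject1["temperature"] + 1   # would fail: a list is not a number

# Elements can be added or changed by name
subject1$age <- 30     # hypothetical value, for illustration only
names(subject1)
```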

2.1.5 Data frames

The data frame is a structure analogous to a spreadsheet or database table, since it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values.
Example: using the patient data vectors we created previously, the data.frame() function combines them into a data frame:

pt_data <- data.frame(subject_name, temperature, flu_status,
                      gender, blood, stringsAsFactors = FALSE)
pt_data$subject_name #obtain the subject_name vector

## [1] "John Doe" "Jane Doe" "Steve Graves"

pt_data[c("temperature","blood")] #extract several columns from a data frame

##   temperature blood
## 1        98.1     O
## 2        98.6    AB
## 3       101.4     A

pt_data[1,2] # extract the value in the first row and second column of the patient data frame

## [1] 98.1

pt_data[,1] # extract all rows of the first column

## [1] "John Doe" "Jane Doe" "Steve Graves"

pt_data[1,] # extract all columns for the first row

##   subject_name temperature flu_status gender blood
## 1     John Doe        98.1      FALSE   MALE     O

pt_data[,] # extract everything

##   subject_name temperature flu_status gender blood
## 1     John Doe        98.1      FALSE   MALE     O
## 2     Jane Doe        98.6      FALSE FEMALE    AB
## 3 Steve Graves       101.4       TRUE   MALE     A

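Note: stringsAsFactors = FALSE. In R versions before 4.0.0, omitting this option made data.frame() convert every character vector to a factor. Since R 4.0.0 the default is FALSE, so character vectors stay as characters unless you request the conversion.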
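
Rows of a data frame can also be selected by a condition, and new columns can be added by assignment. The following is a minimal sketch on a three-column version of the patient data. It assumes the temperatures are in degrees Fahrenheit, which the lecture does not state; the column name temp_c is illustrative.

```r
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

str(pt_data)                           # one line per column, with its type

flu_rows <- pt_data[pt_data$flu_status, ]   # rows where flu_status is TRUE
below_100 <- pt_data[pt_data$temperature < 100, "subject_name"]

# Add a derived column (assumes temperature is recorded in Fahrenheit)
pt_data$temp_c <- (pt_data$temperature - 32) * 5 / 9
```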

2.1.6 Matrix

A matrix is a data structure that represents a two-dimensional table, with rows and columns of data.

To create a matrix, simply supply a vector of data to the matrix() function, along with a parameter specifying
the number of rows (nrow) or number of columns (ncol).

x<-1:12
x

## [1] 1 2 3 4 5 6 7 8 9 10 11 12

length(x)

## [1] 12

dim(x) ## a plain vector has no dimensions (NULL); assigning to dim(x) turns the sequence into a matrix

## NULL

dim(x)<-c(3,4)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

a<-matrix(1:12,nrow=3,byrow=TRUE) ## another way of defining a matrix
a

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

a<-matrix(1:12,nrow=3,byrow=FALSE)
a

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

rownames(a)

## NULL

rownames(a)<-c("A","B","C")
a

##   [,1] [,2] [,3] [,4]
## A    1    4    7   10
## B    2    5    8   11
## C    3    6    9   12

colnames(a)<-c("1","2","x","y")
a

##   1 2 x  y
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
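
Matrix elements are indexed with [row, column], either by position or by name. A brief sketch that rebuilds the matrix above in one call (using the dimnames argument of matrix()) and reads from it:

```r
a <- matrix(1:12, nrow = 3, byrow = FALSE,
            dimnames = list(c("A", "B", "C"), c("1", "2", "x", "y")))

a[2, 3]       # single element: row 2, column 3
a["B", "x"]   # the same element, selected by name
a[, "y"]      # a whole column, returned as a named vector
dim(t(a))     # transposing swaps the dimensions
rowSums(a)    # one total per row
```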

2.2 Managing data with R

2.2.1 Saving and loading R data structures

To save a particular data structure to a file that can be reloaded later or transferred to another system, you
can use the save() function. The save() function writes R data structures to the location specified by the
file parameter. R data files have the file extension .RData.

save(x, y, z, file = "mydata.RData") # x, y, and z can be vectors, factors, lists, or any other R objects

load("mydata.RData") # recreate the x, y, and z data structures
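
The round trip can be tried safely in a temporary directory. The sketch below is one way to do so; it also shows saveRDS() and readRDS(), base R functions not covered above, which store a single object and let you choose its name when reading it back. The file names are illustrative.

```r
x <- 1:3
y <- c("a", "b")
path <- file.path(tempdir(), "mydata.RData")   # temporary location for the demo

save(x, y, file = path)
rm(x, y)                 # remove the objects from the workspace
restored <- load(path)   # load() returns the names of the objects it recreated
restored
exists("x")

# saveRDS()/readRDS() handle one object at a time
rds_path <- file.path(tempdir(), "x.rds")
saveRDS(x, rds_path)
x_copy <- readRDS(rds_path)
```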

2.2.2 Importing and saving data from CSV files

For a large data set, you can read the data into R from an external file. R's input facilities are simple. You can use the following R functions to import data:
The read.table() function;
The read.csv() function.
Example: a CSV file representing the medical dataset constructed previously would look as follows:

subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A

To load a CSV file into R, use the read.csv() function. The following loads agpop.csv into R:

agpop<-read.csv("agpop.csv") # path is relative to the working directory; a full path also works
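
Since agpop.csv may not be at hand, here is a self-contained sketch that writes the medical CSV shown earlier to a temporary file and reads it back. The file name pt_data.csv is hypothetical.

```r
csv_text <- c(
  "subject_name,temperature,flu_status,gender,blood_type",
  "John Doe,98.1,FALSE,MALE,O",
  "Jane Doe,98.6,FALSE,FEMALE,AB",
  "Steve Graves,101.4,TRUE,MALE,A"
)
path <- file.path(tempdir(), "pt_data.csv")   # hypothetical file name
writeLines(csv_text, path)

pt_data <- read.csv(path, stringsAsFactors = FALSE)
str(pt_data)   # column types are guessed from the file contents

# Convert the categorical columns to factors explicitly
pt_data$gender <- factor(pt_data$gender)
pt_data$blood_type <- factor(pt_data$blood_type, levels = c("A", "B", "AB", "O"))

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(pt_data, path, row.names = FALSE)
```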

install.packages("tidyverse") # run once; not needed again after the package is installed

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.3      v purrr   0.3.4
## v tibble  3.2.1      v dplyr   1.0.10
## v tidyr   1.2.0      v stringr 1.5.0
## v readr   2.1.2      v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

dim(agpop)

## [1] 3078 15

str(agpop)

## 'data.frame': 3078 obs. of 15 variables:
## $ county : chr "ALEUTIAN ISLANDS AREA" "ANCHORAGE AREA" "FAIRBANKS AREA" "JUNEAU AREA" ...
## $ state : chr "AK" "AK" "AK" "AK" ...
## $ acres92 : int 683533 47146 141338 210 50810 107259 167832 177189 48022 137426 ...
## $ acres87 : int 726596 59297 154913 214 85712 116050 192082 207906 50818 140107 ...
## $ acres82 : int 764514 256709 204568 127 98035 145044 223502 222066 49630 163638 ...
## $ farms92 : int 26 217 168 8 93 322 941 421 177 1121 ...
## $ farms87 : int 27 245 175 8 119 388 991 498 202 1199 ...
## $ farms82 : int 28 223 170 12 137 453 1119 587 228 1338 ...
## $ largef92: int 14 9 25 0 9 25 24 40 6 9 ...
## $ largef87: int 16 10 28 0 18 32 37 48 10 11 ...
## $ largef82: int 20 11 21 0 17 32 48 43 10 16 ...
## $ smallf92: int 6 41 12 5 12 8 90 9 6 43 ...
## $ smallf87: int 4 52 18 4 18 19 91 21 10 44 ...
## $ smallf82: int 1 38 25 8 19 17 95 36 15 64 ...
## $ region : chr "W" "W" "W" "W" ...

Select columns with the select() function:

df1<-select(agpop,acres92,farms92)

Reorder the columns. Here everything() keeps the remaining columns after the ones listed first:

df2<-select(agpop,region,county,state,everything())

Change a column name with the rename() function, written as new name = old name:

df3<-rename(agpop,acr87 = acres87)

Filter the rows with the filter() function. Here we get a new data frame in which the state is "AK" and the region is "W":

df4<-filter(agpop,state=="AK"& region == "W")

Keep the rows where largef82 is between 0 and 20, inclusive:

df5<-filter(agpop,largef82 <=20 & largef82 >=0)

Keep the rows where acres82 lies in the interval [5, 40000):

df6<-filter(agpop, acres82 <40000 & acres82>=5)

Sort rows with the arrange() function. Here we sort the data frame by state and then by acres92:

df7<-arrange(agpop,state,acres92)

Sort in descending order with desc():

df8<-arrange(agpop,desc(state),desc(acres92))

Create new columns with the mutate() function. Here Ratio = largef82/smallf92:

df9<-mutate(agpop,Ratio=largef82/smallf92)

Aggregate with summary functions such as mean(), sum(), and max(), after grouping the rows with group_by(). Here we calculate the total smallf82 by region:

by_region<-group_by(agpop,region)
smallf82_by_category<-summarise(by_region,smallf82value=sum(smallf82))
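
The steps above can also be chained with the pipe operator %>%, which passes the result of one function as the first argument of the next. The sketch below assumes dplyr is installed and uses a small toy data frame with made-up values in place of agpop; the names toy, acres92_k, n_counties, and total_smallf82 are illustrative.

```r
library(dplyr)

# Toy stand-in for agpop with made-up values, for illustration only
toy <- data.frame(
  state    = c("AK", "AK", "AL", "AL"),
  region   = c("W", "W", "S", "S"),
  acres92  = c(100, 300, 200, 50),
  smallf82 = c(1, 38, 25, 8)
)

result <- toy %>%
  filter(acres92 >= 100) %>%                 # keep rows with at least 100 acres
  mutate(acres92_k = acres92 / 1000) %>%     # add a column in thousands of acres
  group_by(region) %>%                       # one group per region
  summarise(n_counties = n(),                # number of rows in each group
            total_smallf82 = sum(smallf82)) %>%
  arrange(desc(total_smallf82))              # largest total first
result
```

Writing the steps as one pipeline avoids creating the intermediate data frames df1, df2, and so on when only the final result is needed.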
