Stats447/847 Lecture 5: Managing and Understanding Data

Kyle Gardiner, Jing Wang, Lina Li

2023-10-04

Contents
1 Learning objectives

2 Materials
2.1 R data structures
2.2 Managing data with R

1 Learning objectives:
• R data structures: Vectors, Factors
• Managing data with R

2 Materials

2.1 R data structures

2.1.1 Elements

How to input data to R?

(1). For a single number, you can assign a value with =, <-, or ->. For example:

x=2
x<-3
4->x
x

## [1] 4

Please note that R is used in an interactive manner. To see what is stored in x, type x and press the Enter
key.
(2). For a small data set, you can use the c() function to input data directly from the command line. As an
example, to set up two vectors named height and arm_length, use the following commands.

arm_length<-c(67,67,77,62,69.84,74.168,29.5,81.8,71,71,74.8)
height<-c(170,176,172,183,167.64,175.26,165.1,177.8,170,178,176)

2.1.2 Vectors

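The fundamental R data structure is the vector, which stores an ordered set of values called elements.
All elements of a vector must have the same type.
Common types:
- integer
- numeric
- character (text data)
- logical (TRUE or FALSE values)
Missing or absent values: NA marks a missing element, and NULL denotes the absence of any value.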

x <- c(1, 1, 2, 3, 5, 8, 13) # Numeric

## [1] 1 1 2 3 5 8 13

x <- c(TRUE, TRUE, FALSE, TRUE) # Logical

## [1] TRUE TRUE FALSE TRUE

x <- c ("Hello","world") # Character

## [1] "Hello" "world"

x <- c(1, TRUE, "Thursday") # Mixed

x # Remember: all elements in a vector must have the same type!

## [1] "1" "TRUE" "Thursday"

is.vector(x)

## [1] TRUE
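
The coercion seen in the mixed example can be checked with typeof(). The following is a minimal sketch; the variable names (ints, nums, lgl, mixed) are illustrative and not part of the lecture.

```r
# Each vector holds a single type; typeof() reports it.
ints  <- c(1L, 2L, 3L)           # the L suffix creates integer values
nums  <- c(1, 2.5, 3)
lgl   <- c(TRUE, FALSE)
mixed <- c(1, TRUE, "Thursday")  # mixed input is coerced to one type

typeof(ints)    # "integer"
typeof(nums)    # "double" (R's storage type for numeric values)
typeof(lgl)     # "logical"
typeof(mixed)   # "character"

# Coercion goes from logical to integer to double to character.
typeof(c(TRUE, 2L))   # "integer"
typeof(c(2L, 3.5))    # "double"
```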

Always check for NA (missing values)

x<-c(67,67,77,62,69.84,74.168,29.5,81.8,71,71,74.8)
x[5] <- NA

is.na(x)
if(any(is.na(x))){ stop ("there are NAs in the dataset!")}

mean(x)
mean(x,na.rm=TRUE)
x
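
Beyond detecting missing values, it often helps to count and locate them. A small sketch using the same data with the fifth value missing; the name x_complete is an assumption introduced here for illustration.

```r
x <- c(67, 67, 77, 62, NA, 74.168, 29.5, 81.8, 71, 71, 74.8)

sum(is.na(x))      # number of missing values: 1
which(is.na(x))    # position of the missing value: 5

# Drop the missing values explicitly (x_complete is an illustrative name).
x_complete <- x[!is.na(x)]
length(x_complete) # 10

# The mean of the complete values agrees with mean(x, na.rm = TRUE).
all.equal(mean(x_complete), mean(x, na.rm = TRUE))  # TRUE
```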

Some examples of vectors:

subject_name<-c("John Doe","Jane Doe","Steve Graves")
temperature<-c(98.1,98.6,101.4)
flu_status<-c(FALSE,FALSE,TRUE)

temperature[2]

## [1] 98.6

temperature[2:3]

## [1] 98.6 101.4

temperature[-2]

## [1] 98.1 101.4

temperature[c(TRUE,TRUE,FALSE)]

## [1] 98.1 98.6
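
Logical indexing is most useful when the logical vector comes from a condition rather than being typed by hand. A minimal sketch with the same patient vectors; the threshold of 100 and the name fever are illustrative assumptions.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature  <- c(98.1, 98.6, 101.4)
flu_status   <- c(FALSE, FALSE, TRUE)

# A comparison returns a logical vector of the same length.
fever <- temperature > 100   # FALSE FALSE TRUE

temperature[fever]           # 101.4
subject_name[flu_status]     # "Steve Graves"
which(temperature > 98.5)    # positions 2 and 3
```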

2.1.3 Factors

A factor is a special case of vector used to represent a characteristic with categories of values (a categorical variable).

gender<-factor(c("MALE","FEMALE","MALE"))
gender

## [1] MALE FEMALE MALE

## Levels: FEMALE MALE

When factors are created, we can add additional levels that may not appear in the data.
Example: for the blood factor of the three patients, we specify a vector of the four possible blood types
using the levels argument.

blood<-factor(c("O","AB","A"),
levels=c("A","B","AB","O"))
blood

## [1] O AB A
## Levels: A B AB O

Pain <- c(0,3,2,2,1);Pain

## [1] 0 3 2 2 1

SevPain <- as.factor(c(0,3,2,2,1)); SevPain # or: as.factor(Pain)

## [1] 0 3 2 2 1
## Levels: 0 1 2 3

levels(SevPain) <- c("none","mild","medium","severe")

SevPain

## [1] none severe medium medium mild

## Levels: none mild medium severe

is.factor(SevPain)

## [1] TRUE

is.vector(SevPain)

## [1] FALSE
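
One practical consequence of how factors are stored: a factor keeps integer codes that point into its levels, so converting a factor of numbers back to numeric needs an intermediate step. A short sketch, reusing the pain scores and blood types from above:

```r
SevPain <- factor(c(0, 3, 2, 2, 1))   # levels are "0" "1" "2" "3"

as.numeric(SevPain)                   # 1 4 3 3 2: the internal level codes
as.numeric(as.character(SevPain))     # 0 3 2 2 1: the original values

# table() counts every level, including levels absent from the data.
blood <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))
table(blood)                          # B has a count of 0
```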

2.1.4 Lists

A list is another special type of vector, used for storing an ordered set of values. Unlike an ordinary
vector, a list allows different types of values to be collected.
A list is created using the list() function, as shown in the following example.

subject1<-list(fullname = subject_name[1],
temperature = temperature[1],
flu_status = flu_status[1],
gender = gender[1],
blood = blood[1])
subject1

## $fullname
## [1] "John Doe"
##
## $temperature
## [1] 98.1
##
## $flu_status
## [1] FALSE
##
## $gender
## [1] MALE
## Levels: FEMALE MALE
##
## $blood
## [1] O
## Levels: A B AB O

subject1$fullname

## [1] "John Doe"

subject1[1]

## $fullname
## [1] "John Doe"

subject1$temperature

## [1] 98.1

subject1[c("temperature","blood")]

## $temperature
## [1] 98.1
##
## $blood
## [1] O
## Levels: A B AB O
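
Note that single brackets return a list, while $ and double brackets return the element itself. The sketch below illustrates the difference with a shortened version of the subject1 list (only three components are kept, to stay self-contained).

```r
subject1 <- list(fullname = "John Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

class(subject1["temperature"])     # "list": a sub-list of length 1
class(subject1[["temperature"]])   # "numeric": the stored value itself

# Arithmetic needs the value, so use [[ ]] or $.
subject1[["temperature"]] + 1      # 99.1
subject1$temperature + 1           # 99.1

names(subject1)                    # "fullname" "temperature" "flu_status"
```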

2.1.5 Data frames

The data frame is a structure analogous to a spreadsheet or database table, since it has both rows and columns of
data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the
same number of values.
Example: Using the patient data vectors we created previously, the data.frame() function combines them
into a data frame:

pt_data <- data.frame(subject_name, temperature, flu_status,
                      gender, blood, stringsAsFactors = FALSE)
pt_data$subject_name #obtain the subject_name vector

## [1] "John Doe" "Jane Doe" "Steve Graves"

pt_data[c("temperature","blood")] #extract several columns from a data frame

## temperature blood
## 1 98.1 O
## 2 98.6 AB
## 3 101.4 A

pt_data[1,2] # extract the value in the first row and second column of the patient data frame

## [1] 98.1

pt_data[,1] # extract all rows of the first column

## [1] "John Doe" "Jane Doe" "Steve Graves"

pt_data[1,] # extract all columns for the first row

## subject_name temperature flu_status gender blood

## 1 John Doe 98.1 FALSE MALE O

pt_data[,] # extract everything

## subject_name temperature flu_status gender blood

## 1 John Doe 98.1 FALSE MALE O
## 2 Jane Doe 98.6 FALSE FEMALE AB
## 3 Steve Graves 101.4 TRUE MALE A

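Note: stringsAsFactors = FALSE. In R versions before 4.0.0, if we do not specify this option, R automatically
converts every character vector to a factor. Since R 4.0.0 the default is FALSE, so stating it explicitly
keeps the code's behavior the same across versions.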
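
Rows can also be selected by a condition, and new columns can be added by assignment. The following is a minimal sketch that rebuilds a reduced version of the patient data frame so it runs on its own; the column name fever and the threshold of 100 are illustrative assumptions.

```r
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

str(pt_data)   # compact overview: 3 obs. of 3 variables

# Keep the rows that satisfy a condition, and all columns.
warm <- pt_data[pt_data$temperature > 98.5, ]
warm$subject_name            # "Jane Doe" "Steve Graves"

# Add a derived column by assigning to a new name.
pt_data$fever <- pt_data$temperature > 100
ncol(pt_data)                # 4
```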

2.1.6 Matrix

A matrix is a data structure that represents a two-dimensional table, with rows and columns of data.

To create a matrix, simply supply a vector of data to the matrix() function, along with a parameter specifying
the number of rows (nrow) or number of columns (ncol).

x<-1:12
x

## [1] 1 2 3 4 5 6 7 8 9 10 11 12

length(x)

## [1] 12

dim(x) ## a plain vector has no dim attribute

## NULL

dim(x)<-c(3,4) ## assigning dim turns the sequence into a 3 x 4 matrix
x

## [,1] [,2] [,3] [,4]

## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12

a<-matrix(1:12,nrow=3,byrow=TRUE) ## another way of defining a matrix
a

## [,1] [,2] [,3] [,4]

## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12

a<-matrix(1:12,nrow=3,byrow=FALSE)
a

## [,1] [,2] [,3] [,4]

## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12

rownames(a)

## NULL

rownames(a)<-c("A","B","C")
a

## [,1] [,2] [,3] [,4]

## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12

colnames(a)<-c("1","2","x","y")
a

## 1 2 x y
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
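
Once a matrix has row and column names, elements can be extracted by name as well as by position. A small sketch that rebuilds the matrix a from above in one call (using the dimnames argument of matrix()):

```r
a <- matrix(1:12, nrow = 3,
            dimnames = list(c("A", "B", "C"), c("1", "2", "x", "y")))

a["B", "x"]   # 8: row B, column x
a[2, 3]       # 8: the same element by position
a[, "y"]      # column y as a named vector: 10 11 12
a["A", ]      # row A as a named vector: 1 4 7 10

dim(a)        # 3 4
dim(t(a))     # 4 3: t() transposes rows and columns
```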

2.2 Managing data with R

2.2.1 Saving and loading R data structures

To save a particular data structure to a file that can be reloaded later or transferred to another system, you
can use the save() function. The save() function writes R data structures to the location specified by the
file parameter. R data files have the file extension .RData.

save(x, y, z, file = "mydata.RData") # x, y, and z can be any R data structures (vectors, factors, lists, ...)

load("mydata.RData") # recreate the x, y, and z data structures
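
A save/load round trip can be tried safely in a temporary directory. The sketch below also shows saveRDS() and readRDS(), base R functions that store a single object without its name; the objects and file names used here are illustrative.

```r
x <- 1:3
y <- c("a", "b")

# Write to a temporary location so no working files are touched.
path <- file.path(tempdir(), "mydata.RData")
save(x, y, file = path)

rm(x, y)               # remove the objects from the workspace
loaded <- load(path)   # restores x and y under their original names
loaded                 # "x" "y": load() returns the restored names

# saveRDS()/readRDS() handle one object; the reader chooses the name.
rds_path <- file.path(tempdir(), "x.rds")
saveRDS(x, rds_path)
x2 <- readRDS(rds_path)
identical(x, x2)       # TRUE
```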

2.2.2 Importing and saving data from CSV files

For a large data set, you can read it into R from an external file. R's input facilities are simple. You can use
the following R functions to import data:
- The read.table() function
- The read.csv() function
Example: A CSV file representing the medical dataset constructed previously would look as follows:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
To load a CSV file into R, read.csv() is used. For example, loading agpop.csv into R:

agpop<-read.csv("agpop.csv") # import data; use the full path if the file is not in the working directory
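
Saving to CSV works in the opposite direction with write.csv(). A minimal round-trip sketch using the medical dataset from the example above, written to a temporary file (the names csv_path and pt_back are illustrative):

```r
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = c("MALE", "FEMALE", "MALE"),
  blood_type   = c("O", "AB", "A"),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "pt_data.csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(pt_data, csv_path, row.names = FALSE)

# Read the file back; the header line supplies the column names.
pt_back <- read.csv(csv_path, stringsAsFactors = FALSE)
str(pt_back)
```

Here read.csv() infers the column types from the file contents, so temperature comes back as numeric and flu_status as logical.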

install.packages("tidyverse") # only needed once per R installation

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --

## v ggplot2 3.4.3 v purrr 0.3.4
## v tibble 3.2.1 v dplyr 1.0.10
## v tidyr 1.2.0 v stringr 1.5.0
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

dim(agpop)

## [1] 3078 15

str(agpop)

## 'data.frame': 3078 obs. of 15 variables:

## $ county : chr "ALEUTIAN ISLANDS AREA" "ANCHORAGE AREA" "FAIRBANKS AREA" "JUNEAU AREA" ...
## $ state : chr "AK" "AK" "AK" "AK" ...
## $ acres92 : int 683533 47146 141338 210 50810 107259 167832 177189 48022 137426 ...
## $ acres87 : int 726596 59297 154913 214 85712 116050 192082 207906 50818 140107 ...
## $ acres82 : int 764514 256709 204568 127 98035 145044 223502 222066 49630 163638 ...
## $ farms92 : int 26 217 168 8 93 322 941 421 177 1121 ...
## $ farms87 : int 27 245 175 8 119 388 991 498 202 1199 ...
## $ farms82 : int 28 223 170 12 137 453 1119 587 228 1338 ...
## $ largef92: int 14 9 25 0 9 25 24 40 6 9 ...
## $ largef87: int 16 10 28 0 18 32 37 48 10 11 ...
## $ largef82: int 20 11 21 0 17 32 48 43 10 16 ...
## $ smallf92: int 6 41 12 5 12 8 90 9 6 43 ...
## $ smallf87: int 4 52 18 4 18 19 91 21 10 44 ...
## $ smallf82: int 1 38 25 8 19 17 95 36 15 64 ...
## $ region : chr "W" "W" "W" "W" ...

Select columns with the select() function:

df1<-select(agpop,acres92,farms92)

Re-order the columns (everything() keeps the remaining columns in their original order):

df2<-select(agpop,region,county,state,everything())

Change a column name with the rename() function (new name = old name):

df3<-rename(agpop,acr87 = acres87)

Filter rows with the filter() function. Get a new data frame in which the state is "AK" and the region is "W":

df4<-filter(agpop,state=="AK"& region == "W")

Keep the rows where largef82 is between 0 and 20 (inclusive):

df5<-filter(agpop,largef82 <=20 & largef82 >=0)

Filter the data frame to the rows where acres82 is in [5, 40000):

df6<-filter(agpop, acres82 <40000 & acres82>=5)

Arrange (sort) rows with the arrange() function. Sort the data frame by state and then by acres92:

df7<-arrange(agpop,state,acres92)

Sort in descending order with desc():

df8<-arrange(agpop,desc(state),desc(acres92))

Create new columns with the mutate() function. Here Ratio = largef82/smallf92 (a row with smallf92 equal to 0 would yield Inf or NaN):

df9<-mutate(agpop,Ratio=largef82/smallf92)

Aggregate with summary functions such as mean(), sum(), and max(), grouping rows with group_by(). Calculate the total smallf82 by region:

by_region<-group_by(agpop,region)
smallf82_by_category<-summarise(by_region,smallf82value=sum(smallf82))
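
The verbs above can also be chained with the pipe operator %>%, which passes the result of one step as the first argument of the next. The sketch below runs on a small made-up data frame (toy) rather than agpop; all values in it are invented for illustration, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data with agpop-style column names; the values are made up.
toy <- data.frame(
  state    = c("AK", "AK", "AL", "AL"),
  region   = c("W", "W", "S", "S"),
  acres92  = c(683533, 47146, 100000, 50000),
  smallf82 = c(1, 38, 10, 20)
)

result <- toy %>%
  filter(acres92 >= 50000) %>%                 # keep the larger counties
  group_by(region) %>%                         # one group per region
  summarise(total_smallf82 = sum(smallf82),    # total per region
            n = n()) %>%                       # rows per region
  arrange(desc(total_smallf82))                # largest total first

result
```

Each step takes a data frame and returns a data frame, so the chain reads top to bottom without intermediate variables such as df1 or by_region.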
