Data Preprocessing

Uploaded by

bauuaverma2002

Managing Data with R

• One of the challenges faced while working
with massive datasets involves gathering,
preparing, and otherwise managing data from
a variety of sources.
Saving, loading, and removing R data structures
• To save a data structure to a file that can be
reloaded later or transferred to another system,
use the save() function.
• The save() function writes one or more R data
structures to the location specified by the file
parameter.
• Suppose you have three objects named x, y, and
z that you would like to save in a permanent file:
> save(x, y, z, file = "mydata.RData")
• The load() command can recreate any data
structures that have been saved to an .RData
file. To load the mydata.RData file we saved in
the preceding code, simply type:
> load("mydata.RData")
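• As a minimal sketch of the save/load round trip: the values assigned to x, y, and z and the temporary file path below are assumptions made for illustration, not part of the original example.

```r
# Hypothetical objects standing in for x, y, and z
x <- 1:5
y <- c("a", "b")
z <- list(flag = TRUE)

# Use a temporary file so nothing is written to the working directory
rdata_path <- tempfile(fileext = ".RData")
save(x, y, z, file = rdata_path)

# Remove the originals, then restore them from the file
rm(x, y, z)
restored <- load(rdata_path)  # load() invisibly returns the names it restored

print(restored)
print(exists("x"))
```

• Capturing the return value of load() is a convenient way to see which objects a file contained, since load() otherwise restores them silently.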
• After working on an R session for some time,
you may have accumulated a number of data
structures.
• The ls() listing function returns a vector of all
the data structures currently in the memory.
> ls()
[1] "blood" "flu_status" "gender" "m"
[5] "pt_data" "subject_name" "subject1" "symptoms"
[9] "temperature"
• R will automatically remove these from its memory upon
quitting the session, but for large data structures, you may
want to free up the memory sooner.
• The rm() remove function can be used for this purpose. For
example, to eliminate the m and subject1 objects, simply
type:
> rm(m, subject1)
• The rm() function can also be supplied with a character vector
of the object names to be removed. This works with the ls()
function to clear the entire R session:
> rm(list=ls())
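• The listing and removal steps can be tried without clearing a real session. The sketch below works inside a separate environment (the name scratch and the object values are assumptions for illustration) and uses the envir argument of assign(), ls(), and rm().

```r
# A scratch environment, so the reader's own workspace is left alone
scratch <- new.env()
assign("m", matrix(1:4, nrow = 2), envir = scratch)
assign("subject1", list(name = "John Doe"), envir = scratch)
assign("temperature", c(98.1, 98.6, 101.4), envir = scratch)

# ls() returns the object names as a sorted character vector
print(ls(envir = scratch))

# Remove two objects by name, as in rm(m, subject1)
rm(list = c("m", "subject1"), envir = scratch)
remaining <- ls(envir = scratch)
print(remaining)

# Remove everything that is left, as in rm(list = ls())
rm(list = ls(envir = scratch), envir = scratch)
print(length(ls(envir = scratch)))
```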
Importing and saving data from CSV files
• The most common tabular text file format is the
CSV (Comma-Separated Values) file, which as the
name suggests, uses the comma as a delimiter.
• The CSV files can be imported to and exported
from many common applications. A CSV file
representing the medical dataset constructed
previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• Given a patient data file named pt_data.csv
located in the R working directory, the read.csv()
function can be used as follows to load the file
into R:
> pt_data <- read.csv("pt_data.csv",
stringsAsFactors = FALSE)
• By default, R assumes that the CSV file includes a
header line listing the names of the features in
the dataset.
• If a CSV file does not have a header, specify the
option header = FALSE, as shown in the following
command, and R will assign default feature names
of the form V1, V2, and so on:
> mydata <- read.csv("mydata.csv",
stringsAsFactors = FALSE, header = FALSE)
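• A small sketch of the header = FALSE behavior, assuming a headerless file written to a temporary path (the path and the two-row file are illustrative; the rows reuse the patient records shown earlier):

```r
# Write a two-row CSV file with no header line to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("John Doe,98.1,FALSE,MALE,O",
    "Jane Doe,98.6,FALSE,FEMALE,AB"),
  csv_path
)

# With header = FALSE the first line is treated as data, not as column names
mydata <- read.csv(csv_path, stringsAsFactors = FALSE, header = FALSE)

print(names(mydata))
print(nrow(mydata))
```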
• To save a data frame to a CSV file, use the
write.csv() function. If your data frame is
named pt_data, simply enter:
> write.csv(pt_data, file = "pt_data.csv",
row.names = FALSE)
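• The write and read steps can be checked together as a round trip. In this sketch the data frame is rebuilt from the three patient records shown above, and the temporary file path is an assumption used to avoid touching the working directory.

```r
# Rebuild the patient data from the CSV listing shown earlier
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = c("MALE", "FEMALE", "MALE"),
  blood_type   = c("O", "AB", "A"),
  stringsAsFactors = FALSE
)

# row.names = FALSE keeps R's row labels out of the file
csv_path <- tempfile(fileext = ".csv")
write.csv(pt_data, file = csv_path, row.names = FALSE)

# Read the file back and compare it with the original
pt_copy <- read.csv(csv_path, stringsAsFactors = FALSE)
print(names(pt_copy))
print(all.equal(pt_data, pt_copy))
```

• Without row.names = FALSE, write.csv() adds a leading column of row names, which would come back as an extra column when the file is read again.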
Exploring and understanding data
• After collecting data and loading it into R's
data structures, the next step in the machine
learning process involves examining the data
in detail.
• We will explore the usedcars.csv dataset,
which contains actual data about used cars.
• Since the dataset is stored in the CSV form, we
can use the read.csv() function to load the
data into an R data frame:
> usedcars <- read.csv("usedcars.csv",
stringsAsFactors = FALSE)
Exploring the structure of data
• One of the first questions to ask is how the
dataset is organized.
• The str() function provides a method to display
the structure of R data structures such as data
frames, vectors, or lists. It can be used to create
the basic outline for our data dictionary:
> str(usedcars)
• Using such a simple command, we learn a wealth
of information about the dataset.
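• Since usedcars.csv is not reproduced here, the sketch below applies str() to a small stand-in data frame. The name usedcars_demo and all of its values are invented for illustration and are not the real dataset.

```r
# Hypothetical stand-in for the used car data; values are invented
usedcars_demo <- data.frame(
  year    = c(2011L, 2010L, 2012L, 2010L, 2009L),
  model   = c("SEL", "SE", "SES", "SE", "SE"),
  price   = c(21992, 14999, 13950, 12995, 3800),
  mileage = c(7413L, 36469L, 29517L, 60161L, 119720L),
  stringsAsFactors = FALSE
)

# str() prints the number of observations and variables,
# then each variable's type and its first few values
str(usedcars_demo)

# The column types are also available as a named character vector
col_types <- sapply(usedcars_demo, class)
print(col_types)
```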
Exploring numeric variables
• To investigate the numeric variables in the
used car data, we will employ a common set
of measurements to describe values known as
summary statistics.
• The summary() function displays several
common summary statistics. Let's take a look
at a single feature, year:
> summary(usedcars$year)
• We can also use the summary() function to
obtain summary statistics for several numeric
variables at the same time:
> summary(usedcars[c("price", "mileage")])
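• A sketch of both forms of summary(), using invented price and mileage values in place of the real dataset (the name demo and its values are assumptions for illustration):

```r
# Hypothetical values standing in for the used car data
demo <- data.frame(
  price   = c(21992, 14999, 13950, 12995, 3800),
  mileage = c(7413, 36469, 29517, 60161, 119720)
)

# For a single numeric vector, summary() returns six named statistics
price_summary <- summary(demo$price)
print(names(price_summary))
print(price_summary)

# For several columns at once, summary() returns a table with one column per variable
print(summary(demo[c("price", "mileage")]))
```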
Measuring the central tendency – mean and median
• Measures of central tendency are a class of
statistics used to identify a value that falls in
the middle of a set of data.
• You most likely are already familiar with one
common measure of center: the average. In
common use, when something is deemed
average, it falls somewhere between the
extreme ends of the scale.
• R also provides a mean() function, which
calculates the mean for a vector of numbers, such
as a set of three salaries:
> mean(c(36000, 44000, 56000))
[1] 45333.33
• The summary() output for the price and mileage
variables includes their mean values. The means
suggest that the typical used car in this dataset was
listed at a price of $12,962 and had a mileage of
44,261 miles.
• Another commonly used measure of central
tendency is the median, which is the value that
occurs halfway through an ordered list of
values.
• As with the mean, R provides a median()
function, which we can apply to our salary data,
as shown in the following example:
> median(c(36000, 44000, 56000))
[1] 44000
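• To make the difference between the two measures concrete, the sketch below recomputes the mean by hand and then adds one extreme value. The fourth salary of 500000 is a hypothetical figure chosen for illustration.

```r
salaries <- c(36000, 44000, 56000)

# The mean is the sum of the values divided by how many there are
manual_mean <- sum(salaries) / length(salaries)
print(all.equal(manual_mean, mean(salaries)))

# Add one hypothetical extreme salary and recompute both measures
with_outlier <- c(salaries, 500000)
outlier_mean <- mean(with_outlier)
outlier_median <- median(with_outlier)

print(outlier_mean)
print(outlier_median)
```

• With an even number of values, median() averages the two middle values. The extreme salary pulls the mean far above most of the data, while the median moves only slightly.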
Measuring spread – quartiles and the five-number summary
• To measure the diversity, we need to employ another type
of summary statistics that is concerned with the spread of
data, or how tightly or loosely the values are spaced.
• The five-number summary is a set of five statistics that
roughly depict the spread of a feature's values.
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
• Minimum and maximum are the most
extreme feature values, indicating the smallest
and largest values, respectively.
• R provides the min() and max() functions to
calculate these values on a vector of data.
• In R, the range() function returns both the
minimum and maximum values:
> range(usedcars$price)
• Combining range() with the diff() difference
function computes the range of the data, that is, the
difference between the maximum and the minimum:
> diff(range(usedcars$price))
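• A short sketch of these functions on a small vector; the price values are invented for illustration and are not the real usedcars data.

```r
# Hypothetical prices standing in for usedcars$price
price <- c(21992, 14999, 13950, 12995, 3800)

print(min(price))
print(max(price))

# range() returns the minimum and maximum as a two-element vector
price_range <- range(price)
print(price_range)

# diff() subtracts the first element from the second: max - min
price_spread <- diff(price_range)
print(price_spread)
```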
• The quartiles divide a dataset into four portions.
• The seq() function is used to generate vectors of
evenly-spaced values. Passing such a vector to the
quantile() function makes it easy to obtain other
slices of data, such as the quintiles (five groups),
as shown in the following command:
> quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100%
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
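• The sketch below shows quantile() with its default cut points and with a seq()-generated vector. The price values are invented for illustration, so the results will not match the output above.

```r
# Hypothetical prices standing in for usedcars$price
price <- c(21992, 14999, 13950, 12995, 3800)

# With no probs argument, quantile() returns the five-number summary:
# minimum, Q1, median, Q3, and maximum
quartiles <- quantile(price)
print(quartiles)

# seq() builds the cut points 0, 0.2, ..., 1 for the quintiles
cut_points <- seq(from = 0, to = 1, by = 0.20)
quintiles <- quantile(price, cut_points)
print(quintiles)
```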
Exploring categorical variables
• The used car dataset had three categorical
variables: model, color, and transmission.
• Additionally, we might consider treating the
year variable as categorical; although it has
been loaded as a numeric (int) type vector,
each year is a category that could apply to
multiple cars.
• A table that presents a single categorical
variable is known as a one-way table.
• The table() function can be used to generate
one-way tables for our used car data.
> table(usedcars$year)
> table(usedcars$model)
> table(usedcars$color)
• The table() output lists the categories of the
nominal variable and a count of the number of
values falling into this category.
• R can also perform the calculation of table
proportions directly, by using the prop.table()
command on a table produced by the table()
function:
> model_table <- table(usedcars$model)
> prop.table(model_table)
• The results of prop.table() can be combined
with other R functions to transform the output.
> color_pct <- table(usedcars$color)
> color_pct <- prop.table(color_pct) * 100
> round(color_pct, digits = 1)
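• The same three steps can be traced on a small vector. The car_colors values below are invented for illustration and do not come from the real dataset.

```r
# Hypothetical colors standing in for usedcars$color
car_colors <- c("Black", "Silver", "Black", "Red",
                "Black", "Silver", "Blue", "Red")

# One-way table: a count per category, with categories sorted alphabetically
color_counts <- table(car_colors)
print(color_counts)

# Proportions sum to 1; multiplying by 100 gives percentages
color_pct <- prop.table(color_counts) * 100
color_pct <- round(color_pct, digits = 1)
print(color_pct)
```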
Exploring relationships between variables
• So far, we have examined variables one at a
time, calculating only univariate statistics.
• The next step is to consider bivariate relationships,
which describe the relationship between two variables.
• Relationships of more than two variables are
called multivariate relationships.
