Data Preprocessing

Uploaded by

bauuaverma2002

Managing Data with R


• One of the challenges faced while working
with massive datasets involves gathering,
preparing, and otherwise managing data from
a variety of sources.
Saving, loading, and removing R data structures
• To save a data structure to a file that can be
reloaded later or transferred to another system,
use the save() function.
• The save() function writes one or more R data
structures to the location specified by the file
parameter.
• Suppose you have three objects named x, y, and
z that you would like to save in a permanent file:
> save(x, y, z, file = "mydata.RData")
• The load() command can recreate any data
structures that have been saved to an .RData
file. To load the mydata.RData file we saved in
the preceding code, simply type:
> load("mydata.RData")
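• As a minimal sketch of the save-and-reload round trip described above, the following uses three made-up objects and a temporary file (both are assumptions for illustration, not part of the original example):

```r
# Hypothetical objects used only for illustration
x <- c(1, 2, 3)
y <- "hello"
z <- list(a = TRUE)

# Write to a temporary file rather than the working directory
path <- tempfile(fileext = ".RData")
save(x, y, z, file = path)

# Remove the objects, then restore them from the file
rm(x, y, z)
restored <- load(path)  # load() invisibly returns the names it restored

print(restored)
print(exists("x"))
```

• Because load() restores objects under their original names, it overwrites any existing objects that share those names.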
• After working on an R session for some time,
you may have accumulated a number of data
structures.
• The ls() listing function returns a character vector with the
names of all the data structures currently in memory.
> ls()
[1] "blood" "flu_status" "gender" "m"
[5] "pt_data" "subject_name" "subject1" "symptoms"
[9] "temperature"
• R will automatically remove these from its memory upon
quitting the session, but for large data structures, you may
want to free up the memory sooner.
• The rm() remove function can be used for this purpose. For
example, to eliminate the m and subject1 objects, simply
type:
> rm(m, subject1)
• The rm() function can also be supplied with a character vector
of the object names to be removed. This works with the ls()
function to clear the entire R session:
> rm(list=ls())
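• The same ls() and rm() pattern can be tried safely in a scratch environment, so that nothing in your own session is removed. This is only a sketch: the scratch environment and the object values are assumptions made for illustration.

```r
# Work in a scratch environment so the global workspace is left untouched
scratch <- new.env()
assign("m", matrix(1:4, nrow = 2), envir = scratch)
assign("subject1", list(name = "John Doe"), envir = scratch)
assign("temperature", c(98.1, 98.6, 101.4), envir = scratch)

# ls() returns the object names as a character vector
before <- ls(envir = scratch)

# Remove two objects by name, passed as a character vector
rm(list = c("m", "subject1"), envir = scratch)
after_partial <- ls(envir = scratch)

# Remove everything that is left
rm(list = ls(envir = scratch), envir = scratch)
after_all <- ls(envir = scratch)

print(before)
print(after_partial)
print(length(after_all))
```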
Importing and saving data from CSV files
• The most common tabular text file format is the
CSV (Comma-Separated Values) file, which as the
name suggests, uses the comma as a delimiter.
• CSV files can be imported into and exported
from many common applications. A CSV file
representing the medical dataset constructed
previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• Given a patient data file named pt_data.csv
located in the R working directory, the read.csv()
function can be used as follows to load the file
into R:
> pt_data <- read.csv("pt_data.csv",
stringsAsFactors = FALSE)
• By default, R assumes that the CSV file includes a
header line listing the names of the features in
the dataset.
• If a CSV file does not have a header, specify the
option header = FALSE, as shown in the following
command, and R will assign default feature names
of the form V1, V2, and so on:
> mydata <- read.csv("mydata.csv",
stringsAsFactors = FALSE, header = FALSE)
• To save a data frame to a CSV file, use the
write.csv() function. If your data frame is
named pt_data, simply enter:
> write.csv(pt_data, file = "pt_data.csv",
row.names = FALSE)
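• A short round-trip sketch ties write.csv() and read.csv() together. It rebuilds the three patient rows shown earlier and writes them to a temporary file; the temporary path is an assumption used here so that nothing is written to the working directory.

```r
# The three patient records from the CSV example
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = c("MALE", "FEMALE", "MALE"),
  blood_type   = c("O", "AB", "A"),
  stringsAsFactors = FALSE
)

# Write without row names so the file contains only the five features
path <- tempfile(fileext = ".csv")
write.csv(pt_data, file = path, row.names = FALSE)

# Default: the first line is treated as the header
with_header <- read.csv(path, stringsAsFactors = FALSE)

# header = FALSE: the first line is read as data and columns are named V1, V2, ...
no_header <- read.csv(path, stringsAsFactors = FALSE, header = FALSE)

print(names(with_header))
print(names(no_header))
print(nrow(no_header))  # one more row than pt_data, because the header became data
```

• Reading a file that does have a header with header = FALSE turns the header line into a data row, which usually forces every column to be read as character.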
Exploring and understanding data
• After collecting data and loading it into R's
data structures, the next step in the machine
learning process involves examining the data
in detail.
• We will explore the usedcars.csv dataset,
which contains actual data about used cars.
• Since the dataset is stored in the CSV form, we
can use the read.csv() function to load the
data into an R data frame:
> usedcars <- read.csv("usedcars.csv",
stringsAsFactors = FALSE)
Exploring the structure of data
• One of the first questions to ask is how the
dataset is organized.
• The str() function provides a method to display
the structure of R data structures such as data
frames, vectors, or lists. It can be used to create
the basic outline for our data dictionary:
> str(usedcars)
• Using such a simple command, we learn a wealth
of information about the dataset.
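• Since the usedcars.csv file is not reproduced here, the following sketch applies str() to a small stand-in data frame. The column names follow the features mentioned in these notes, but the three rows of values are hypothetical.

```r
# Hypothetical stand-in for the used car data (values are made up)
cars_demo <- data.frame(
  year  = c(2011L, 2010L, 2012L),
  model = c("SEL", "SE", "SES"),
  price = c(21992, 20995, 19995),
  stringsAsFactors = FALSE
)

# str() prints the number of observations and variables,
# followed by one line per column with its type and first values
str(cars_demo)
```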
Exploring numeric variables
• To investigate the numeric variables in the
used car data, we will employ a common set
of measurements for describing values, known
as summary statistics.
• The summary() function displays several
common summary statistics. Let's take a look
at a single feature, year:
> summary(usedcars$year)
• We can also use the summary() function to
obtain summary statistics for several numeric
variables at the same time:
> summary(usedcars[c("price", "mileage")])
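• The two forms of the summary() call can be sketched on a small data frame. The values below are hypothetical and chosen only so that the statistics are easy to check by hand.

```r
# Hypothetical values standing in for two numeric features
cars_demo <- data.frame(
  price   = c(10, 20, 30, 40, 100),
  mileage = c(5, 15, 25, 35, 45)
)

# Summary of a single numeric vector
price_summary <- summary(cars_demo$price)
print(price_summary)

# Summary of several columns at once (one column of statistics per feature)
print(summary(cars_demo[c("price", "mileage")]))

# Individual statistics can be pulled out by name
print(price_summary[["Median"]])  # 30
print(price_summary[["Mean"]])    # 40
```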
Measuring the central tendency – mean and median
• Measures of central tendency are a class of
statistics used to identify a value that falls in
the middle of a set of data.
• You most likely are already familiar with one
common measure of center: the average. In
common use, when something is deemed
average, it falls somewhere between the
extreme ends of the scale.
• R also provides a mean() function, which
calculates the mean for a vector of numbers:
> mean(c(36000, 44000, 56000))
[1] 45333.33
• The summary() output for the price and mileage
variables lists their mean values. The means suggest
that the typical used car in this dataset was
listed at a price of $12,962 and had a mileage of
44,261.
• Another commonly used measure of central
tendency is the median, which is the value that
occurs halfway through an ordered list of
values.
• As with the mean, R provides a median()
function, which we can apply to the same three
salary values, as shown in the following example:
> median(c(36000, 44000, 56000))
[1] 44000
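• One reason to look at both measures is that the mean is pulled toward extreme values, while the median is not. The sketch below reuses the three salary values and adds one hypothetical very large salary to show the effect; the added value is an assumption for illustration.

```r
# The three salary values from the examples above
salaries <- c(36000, 44000, 56000)

# Add one hypothetical extreme value
with_outlier <- c(salaries, 1000000)

print(mean(salaries))        # 45333.33
print(median(salaries))      # 44000

print(mean(with_outlier))    # 284000
print(median(with_outlier))  # 50000
```

• With an even number of values, median() returns the average of the two middle values, here 44000 and 56000.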
Measuring spread – quartiles and the five-number summary
• To measure the diversity of the data, we need to employ another
type of summary statistic that is concerned with the spread of
data, or how tightly or loosely the values are spaced.
• The five-number summary is a set of five statistics that
roughly depict the spread of a feature's values.
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
• Minimum and maximum are the most
extreme feature values, indicating the smallest
and largest values, respectively.
• R provides the min() and max() functions to
calculate these values on a vector of data.
• The range() function returns both the
minimum and maximum values:
> range(usedcars$price)
• Combining range() with the diff() difference
function gives the range as a single number, the
difference between the maximum and the minimum:
> diff(range(usedcars$price))
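• A small sketch of min(), max(), range(), and diff() on a short price vector. The smallest and largest values match the minimum and maximum prices reported in these notes; the three middle values are hypothetical.

```r
# Short price vector; the three middle values are made up
prices <- c(3800, 12995, 13592, 14999, 21992)

print(min(prices))  # 3800
print(max(prices))  # 21992

# range() returns the minimum and maximum together as a two-element vector
price_range <- range(prices)
print(price_range)

# diff() subtracts the first element from the second: max - min
price_spread <- diff(price_range)
print(price_spread)  # 18192
```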
• The quartiles divide a dataset into four portions,
each containing the same number of values.
• The quantile() function returns quantiles for a vector
of values. The seq() function is used to generate vectors of
evenly-spaced values, which makes it easy to obtain
other slices of data, such as the quintiles (five
groups), as shown in the following command:
> quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100%
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
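• The following sketch shows quantile() with its default cut points (the quartiles) and with cut points generated by seq() (the quintiles). The price vector is a hypothetical five-value stand-in, so the results differ from the usedcars output above.

```r
# Hypothetical five-value price vector
prices <- c(3800, 12995, 13592, 14999, 21992)

# With no probs argument, quantile() returns the five-number summary:
# minimum, Q1, median, Q3, maximum
quartiles <- quantile(prices)
print(quartiles)

# seq() generates the cut points 0, 0.2, 0.4, 0.6, 0.8, 1 for quintiles
quintile_probs <- seq(from = 0, to = 1, by = 0.20)
quintiles <- quantile(prices, probs = quintile_probs)
print(quintiles)

# The 50% quantile is the median
print(quartiles[["50%"]] == median(prices))
```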
Exploring categorical variables
• The used car dataset had three categorical
variables: model, color, and transmission.
• Additionally, we might consider treating the
year variable as categorical; although it has
been loaded as a numeric (int) type vector,
each year is a category that could apply to
multiple cars.
• A table that presents a single categorical
variable is known as a one-way table.
• The table() function can be used to generate
one-way tables for our used car data.
> table(usedcars$year)
> table(usedcars$model)
> table(usedcars$color)
• The table() output lists the categories of the
nominal variable and a count of the number of
values falling into each category.
• R can also perform the calculation of table
proportions directly, by using the prop.table()
command on a table produced by the table()
function:
> model_table <- table(usedcars$model)
> prop.table(model_table)
• The results of prop.table() can be combined
with other R functions to transform the output.
For example, to display the proportions as
percentages rounded to one decimal place:
> color_pct <- table(usedcars$color)
> color_pct <- prop.table(color_pct) * 100
> round(color_pct, digits = 1)
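• The table(), prop.table(), and round() steps can be checked by hand on a small vector. The eight colour values below are hypothetical and serve only to illustrate the calls.

```r
# Hypothetical colour values for eight cars
color <- c("Black", "Black", "Silver", "Red",
           "Silver", "Black", "Blue", "Red")

# One-way table: a count for each category, ordered alphabetically
color_table <- table(color)
print(color_table)

# Proportions sum to 1; multiply by 100 and round for percentages
color_pct <- round(prop.table(color_table) * 100, digits = 1)
print(color_pct)
```

• Here Black accounts for 3 of the 8 values, so its percentage is 37.5.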
Exploring relationships between variables
• So far, we have examined variables one at a
time, calculating only univariate statistics.
• Bivariate relationships consider the
relationship between two variables.
• Relationships of more than two variables are
called multivariate relationships.
