Data Preprocessing
Managing Data with R
• One of the challenges faced while working with massive datasets involves gathering, preparing, and otherwise managing data from a variety of sources.

Saving, loading, and removing R data structures
• To save a data structure to a file that can be reloaded later or transferred to another system, use the save() function.
• The save() function writes one or more R data structures to the location specified by the file parameter.
• Suppose you have three objects named x, y, and z that you would like to save in a permanent file:
> save(x, y, z, file = "mydata.RData")
• The load() command can recreate any data structures that have been saved to an .RData file. To load the mydata.RData file we saved in the preceding code, simply type:
> load("mydata.RData")
• After working on an R session for some time, you may have accumulated a number of data structures.
• The ls() listing function returns a vector of all the data structures currently in memory:
> ls()
[1] "blood"        "flu_status"   "gender"       "m"
[5] "pt_data"      "subject_name" "subject1"     "symptoms"
[9] "temperature"
• R will automatically remove these from its memory upon quitting the session, but for large data structures, you may want to free up the memory sooner.
• The rm() remove function can be used for this purpose. For example, to eliminate the m and subject1 objects, simply type:
> rm(m, subject1)
• The rm() function can also be supplied with a character vector of the object names to be removed. This works with the ls() function to clear the entire R session:
> rm(list = ls())

Importing and saving data from CSV files
• The most common tabular text file format is the CSV (Comma-Separated Values) file, which, as the name suggests, uses the comma as a delimiter.
• CSV files can be imported to and exported from many common applications.
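Before turning to CSV files in detail, the save(), load(), and rm() steps described above can be tried end to end. The following is a minimal sketch, assuming three made-up objects and a temporary file created with tempfile() so that nothing is written to the working directory; the object values are illustrative and do not come from the original data.

```r
# Hypothetical objects, for illustration only
x <- 1:3
y <- c("a", "b")
z <- list(flag = TRUE)

# Save all three objects to a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(x, y, z, file = rdata_path)

# Remove the originals by name
rm(x, y, z)

# load() recreates the objects and invisibly returns their names
restored <- load(rdata_path)
print(restored)

# rm() also accepts a character vector through its list argument;
# rm(list = ls()) would clear everything, so only two names are removed here
rm(list = c("y", "z"))
print(x)
```

Using rm(list = ...) with an explicit vector of names is a safer habit in shared scripts than rm(list = ls()), which also removes objects the script did not create.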
• A CSV file representing the medical dataset constructed previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• Given a patient data file named pt_data.csv located in the R working directory, the read.csv() function can be used as follows to load the file into R:
> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)
• By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset.
• If a CSV file does not have a header, specify the option header = FALSE, as shown in the following command, and R will assign default feature names of the form V1, V2, and so on:
> mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE, header = FALSE)
• To save a data frame to a CSV file, use the write.csv() function. If your data frame is named pt_data, simply enter:
> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)

Exploring and understanding data
• After collecting data and loading it into R's data structures, the next step in the machine learning process involves examining the data in detail.
• We will explore the usedcars.csv dataset, which contains actual data about used cars.
• Since the dataset is stored in CSV form, we can use the read.csv() function to load the data into an R data frame:
> usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)

Exploring the structure of data
• One of the first questions to ask is how the dataset is organized.
• The str() function displays the structure of R data structures such as data frames, vectors, or lists. It can be used to create the basic outline for our data dictionary:
> str(usedcars)
• With this one simple command, we learn a wealth of information about the dataset.
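The CSV steps above can be tried without any external file. The following is a sketch that rebuilds the three-patient example as a data frame, writes it to a temporary CSV file, reads it back, and inspects it with str(). The temporary paths and the use of write.table() to produce a header-less file are assumptions made for this illustration only.

```r
# Rebuild the three-patient example from the text
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = c("MALE", "FEMALE", "MALE"),
  blood_type   = c("O", "AB", "A"),
  stringsAsFactors = FALSE
)

# Write with a header row, then read the file back
csv_path <- tempfile(fileext = ".csv")
write.csv(pt_data, file = csv_path, row.names = FALSE)
pt_copy <- read.csv(csv_path, stringsAsFactors = FALSE)

# str() shows one line per feature with its type and first values
str(pt_copy)

# Write the same data without a header row to mimic a header-less CSV
no_header_path <- tempfile(fileext = ".csv")
write.table(pt_data, file = no_header_path, sep = ",",
            row.names = FALSE, col.names = FALSE)

# header = FALSE makes R assign the default names V1, V2, ...
no_header <- read.csv(no_header_path, stringsAsFactors = FALSE, header = FALSE)
print(names(no_header))
```

Since R 4.0.0, stringsAsFactors = FALSE is already the default for read.csv() and data.frame(); it is written out here to match the commands in the text.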

Exploring numeric variables
• To investigate the numeric variables in the used car data, we will employ a common set of measurements for describing values, known as summary statistics.
• The summary() function displays several common summary statistics. Let's take a look at a single feature, year:
> summary(usedcars$year)
• We can also use the summary() function to obtain summary statistics for several numeric variables at the same time:
> summary(usedcars[c("price", "mileage")])

Measuring the central tendency – mean and median
• Measures of central tendency are a class of statistics used to identify a value that falls in the middle of a set of data.
• You are most likely already familiar with one common measure of center: the average, or mean. In common use, when something is deemed average, it falls somewhere between the extreme ends of the scale.
• R provides a mean() function, which calculates the mean for a vector of numbers, such as three salaries:
> mean(c(36000, 44000, 56000))
[1] 45333.33
• The summary() output lists mean values for the price and mileage variables. The means suggest that the typical used car in this dataset was listed at a price of $12,962 and had a mileage of 44,261.
• Another commonly used measure of central tendency is the median, which is the value that occurs halfway through an ordered list of values.
• As with the mean, R provides a median() function, which we can apply to the same salary data, as shown in the following example:
> median(c(36000, 44000, 56000))
[1] 44000

Measuring spread – quartiles and the five-number summary
• To measure the diversity of the data, we need another type of summary statistic, one concerned with the spread of the data, or how tightly or loosely the values are spaced.
• The five-number summary is a set of five statistics that roughly depict the spread of a feature's values:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
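The central tendency measures and the five-number summary listed above can be computed on the three-value salary vector from the text. This is a minimal sketch; the variable names are chosen for illustration, and quantile() with its default arguments is used to obtain the five cut points.

```r
# The salary vector used in the mean() and median() examples
salaries <- c(36000, 44000, 56000)

salary_mean   <- mean(salaries)    # 45333.33
salary_median <- median(salaries)  # 44000

# With default arguments, quantile() returns the minimum, Q1, median,
# Q3, and maximum, which together form the five-number summary
five_num <- quantile(salaries)
print(five_num)
#    0%   25%   50%   75%  100%
# 36000 40000 44000 50000 56000

# summary() reports the same five values plus the mean
print(summary(salaries))
```

With only three values, Q1 and Q3 are interpolated halfway between neighboring observations, which is why 40000 and 50000 appear even though neither is an actual salary.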
• The minimum and maximum are the most extreme feature values, indicating the smallest and largest values, respectively.
• R provides the min() and max() functions to calculate these values on a vector of data.
• The range() function returns both the minimum and maximum value:
> range(usedcars$price)
• Combining range() with the diff() difference function allows you to examine the range of the data in a single command:
> diff(range(usedcars$price))
• The quartiles divide a dataset into four portions.
• The seq() function generates vectors of evenly spaced values. Passing such a vector as the second argument of quantile() makes it easy to obtain other slices of data, such as the quintiles (five groups), as shown in the following command:
> quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100%
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0

Exploring categorical variables
• The used car dataset has three categorical variables: model, color, and transmission.
• Additionally, we might consider treating the year variable as categorical; although it has been loaded as a numeric (int) type vector, each year is a category that could apply to multiple cars.
• A table that presents a single categorical variable is known as a one-way table.
• The table() function can be used to generate one-way tables for our used car data:
> table(usedcars$year)
> table(usedcars$model)
> table(usedcars$color)
• The table() output lists the categories of the nominal variable and a count of the number of values falling into each category.
• R can also calculate table proportions directly, by using the prop.table() command on a table produced by the table() function:
> model_table <- table(usedcars$model)
> prop.table(model_table)
• The results of prop.table() can be combined with other R functions to transform the output.
> color_pct <- table(usedcars$color)
> color_pct <- prop.table(color_pct) * 100
> round(color_pct, digits = 1)

Exploring relationships between variables
• So far, we have examined variables one at a time, calculating only univariate statistics.
• Bivariate relationships consider the relationship between two variables.
• Relationships of more than two variables are called multivariate relationships.
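To make the univariate versus bivariate distinction concrete, here is a sketch on a small made-up data frame (toy_cars is hypothetical and is not the usedcars.csv data). It builds a one-way percentage table for a single variable, as in the color example, and then passes two variables to table() to obtain a two-way table. The text does not show the two-way form, so treat it as one possible way to look at two categorical variables together.

```r
# Made-up data for illustration; not the usedcars.csv data from the text
toy_cars <- data.frame(
  color        = c("Black", "Red", "Black", "Silver", "Red", "Black"),
  transmission = c("AUTO", "MANUAL", "AUTO", "AUTO", "AUTO", "MANUAL"),
  stringsAsFactors = FALSE
)

# Univariate: one-way table of a single variable, expressed as percentages
color_pct <- round(prop.table(table(toy_cars$color)) * 100, digits = 1)
print(color_pct)

# Bivariate: passing two variables to table() counts each combination
color_by_trans <- table(toy_cars$color, toy_cars$transmission)
print(color_by_trans)
```

Each cell of the two-way table counts the cars that share one color and one transmission type, so the cell counts add up to the number of rows in the data frame.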