Module IV

CSE1006 – Foundations of Data Analytics

Module 4
Data Analysis
Data Import: Reading Data into R
R base functions for importing data

• R base functions for importing data from txt|csv files


– Read tab separated values
read.delim(file.choose())
– Read comma (",") separated values
read.csv(file.choose())
– Read semicolon (";") separated values
read.csv2(file.choose())
Using read.csv() methods

Syntax:
read.csv(path, header = TRUE, sep = ",")
path : The path of the file to be imported.
header : TRUE by default. Indicates whether the first line
holds the column headings.
sep = "," : The separator between the values in each row.
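The arguments above can be tried without an interactive file picker. The following is a minimal sketch, assuming a small made-up file written to a temporary path (the column names and values are illustrative only):

```r
# Write a small hypothetical CSV to a temporary file (illustrative data only).
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "asha,81", "ravi,74"), path)

# header = TRUE takes the first line as column names; sep = "," splits the fields.
scores <- read.csv(path, header = TRUE, sep = ",")
str(scores)

# With header = FALSE the first line is read as data and columns are named V1, V2.
raw <- read.csv(path, header = FALSE, sep = ",")
nrow(raw)
```

Passing an explicit path instead of file.choose() keeps the script reproducible, which matters once the import step is part of a larger analysis.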
Reading Excel files in R

• read_excel() is used to import (read) an Excel file. It is
available only after the readxl library has been loaded.
• library(readxl)
• read_excel(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL, na = "",
trim_ws = TRUE, skip = 0, n_max = Inf, guess_max =
min(1000, n_max), progress =
readxl_progress(), .name_repair = "unique")
Exporting Data: Writing data from R to files
• To Excel files (we can use xlsx package or writexl package)
write.xlsx(data,file,sheetName, col.names, row.names, append)
• data – the data frame to be written to the Excel file
• file – the output path where the Excel file is to be stored
• sheetName – desired name for the sheet
• row.names, col.names – (logical) if TRUE, the row and column
names are written to the Excel file
• append – if TRUE, the data is added to the existing Excel file
(as a new sheet) instead of replacing the file
Examples
• Using xlsx package
install.packages("xlsx")
library(xlsx)
write.xlsx(iris, file = "iris.xlsx", sheetName="iris_data")
• Using writexl package
install.packages("writexl")
library(writexl)
write_xlsx(air, "c:\\Users\\mypc\\Desktop\\air1.xlsx")
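One way to check an export is to read the file back in. The sketch below assumes the writexl and readxl packages are installed, and it writes to a temporary file rather than a fixed Desktop path:

```r
# Assumes the writexl and readxl packages are installed.
library(writexl)
library(readxl)

# Write the built-in iris data frame to a temporary .xlsx file.
path <- tempfile(fileext = ".xlsx")
write_xlsx(iris, path)

# Read it back; read_excel() returns a tibble.
iris_back <- read_excel(path, sheet = 1)
dim(iris_back)
```

The round trip keeps the rows and columns, but column types can change (for example, a factor such as Species typically comes back as text), so it is worth inspecting the result with str().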
Data cleaning and summarizing with the dplyr package
• dplyr is a powerful R-package to transform and
summarize tabular data with rows and columns.
• The package contains a set of functions (or “verbs”) that
perform common data manipulation operations such as
filtering for rows, selecting specific columns, re-ordering
rows, adding new columns and summarizing data.
• In addition, dplyr contains a useful function to perform
another common task which is the “split-apply-combine”
concept.
Why dplyr?
• dplyr is a package that provides a set of tools for
efficiently manipulating datasets in R. dplyr is the next
iteration of plyr, focusing only on data frames. The main
advantages include:
1. Speed. Compared to the plyr library (home of the familiar ddply
function), dplyr can be anywhere between 20X and 100X faster in its
calculations.
2. Cleaner Code. The syntax allows for function chaining, which
prevents clutter in the code and in turn makes the code
easier to write and read.
3. Simpler Code. dplyr has a limited number of functions that are
focused on the most common requirements of data
manipulation. The syntax is both simple and efficient.
Install and load dplyr

• To install dplyr
install.packages("dplyr")
• To load dplyr
library(dplyr)
Important dplyr verbs to remember

dplyr verb       Description
1. select()      select columns
2. filter()      filter rows
3. arrange()     re-order or arrange rows
4. mutate()      create new columns
5. summarise()   summarise values
6. rename()      rename variables in a data frame
7. group_by()    allows for group operations in the “split-apply-combine” concept
%>%              the “pipe” operator, used to connect multiple verb actions together into a pipeline
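The verbs are easiest to understand when chained. The following is a minimal sketch of a “split-apply-combine” pipeline on the built-in airquality data, assuming dplyr is installed; the names monthly, mean_temp and n_days are made up for illustration:

```r
library(dplyr)  # assumes dplyr is installed

# Split airquality by Month, apply mean() to Temp within each group,
# and combine the results into one row per month.
monthly <- airquality %>%
  group_by(Month) %>%
  summarise(mean_temp = mean(Temp), n_days = n()) %>%
  arrange(desc(mean_temp))

monthly
```

Each %>% passes the result on its left as the first argument of the verb on its right, so the pipeline reads top to bottom in the order the steps are carried out.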
dplyr vs. Base R Functions
• dplyr functions often run faster than equivalent base R
functions on data frames, because they were
written in a computationally efficient manner.
• Their syntax is also more consistent, and they are
designed for data frames rather than vectors.
SQL Queries vs. dplyr
Selecting columns using select()
• Syntax
select(dataframe, column name)
Example
select(airquality, Ozone)
• To select all the columns except a specific column,
use the "-" (subtraction) operator (also known as
negative indexing)
Example
select(airquality, -Temp)
• Some additional options to select columns based on specific
criteria include
• starts_with() = Select columns that start with a character string
• ends_with() = Select columns that end with a character string
• contains() = Select columns that contain a character string
• matches() = Select columns that match a regular expression
• one_of() = Select columns whose names are from a group of
names
• Example
select(airquality, starts_with("Mo"))
Selecting rows using filter()

• Example
• Keep the rows of airquality where Temp (temperature) is
greater than 62.
filter(airquality, Temp > 62)
• Keep the rows of airquality where Temp is greater
than 62 and the day is the 3rd or later.
filter(airquality, Temp > 62, Day >= 3)
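For comparison with base R, the sketch below runs the same two-condition filter both ways. It assumes dplyr is installed; the names warm_dplyr and warm_base are illustrative only:

```r
library(dplyr)  # assumes dplyr is installed

# dplyr: conditions separated by commas are combined with AND.
warm_dplyr <- filter(airquality, Temp > 62, Day >= 3)

# Base R equivalent using logical indexing.
warm_base <- airquality[airquality$Temp > 62 & airquality$Day >= 3, ]

# Same rows either way (row names may differ, so compare the values only).
nrow(warm_dplyr) == nrow(warm_base)
```

The two approaches agree here because Temp and Day contain no missing values. When a condition evaluates to NA, filter() drops the row, whereas base logical indexing returns a row of NAs.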
Arrange or re-order rows using arrange()
• To arrange (or re-order) rows by a particular column
arrange(airquality, Wind) (OR)
airquality %>% arrange(Wind) %>% head
• Now, we will select four columns from airquality and arrange the
rows by Temp, breaking ties by Wind. Finally, we
show the head of the final data frame.
airquality %>% select(Temp, Wind, Ozone, Day) %>%
arrange(Temp, Wind) %>% head
• To arrange in descending order:
arrange(airquality, desc(Wind))
Create new columns using mutate()

Example:
Add new column “year” in air data frame.
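A minimal sketch of this example, assuming dplyr is installed and that `air` is simply a copy of the built-in airquality data frame (the slides do not show how `air` was created). The temp_c column is an extra illustration, not part of the original example:

```r
library(dplyr)  # assumes dplyr is installed

# Assumption: `air` is a copy of the built-in airquality data frame.
air <- airquality

# airquality holds measurements from 1973, so the new column is a constant.
air <- mutate(air, year = 1973)

# New columns can also be computed from existing ones (Temp is in Fahrenheit).
air <- air %>% mutate(temp_c = (Temp - 32) * 5 / 9)
head(air, 3)
```

mutate() returns a new data frame and leaves its input unchanged, so the result has to be assigned back to `air` to keep the new columns.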
Renaming columns using rename()

• Example:
Rename columns Month to M and Day to D
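A minimal sketch of this example on the built-in airquality data, assuming dplyr is installed (air_renamed is an illustrative name):

```r
library(dplyr)  # assumes dplyr is installed

# rename() uses new_name = old_name; all other columns are left unchanged.
air_renamed <- rename(airquality, M = Month, D = Day)
names(air_renamed)
```

Note the order of the arguments: the new name goes on the left of the equals sign and the existing name on the right.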
Exploratory Data Analysis
• It involves visualizing your data using graphical and numerical
summaries.
• We can visualize data graphically using the following charts.
– Histogram
– Box plot
– Pie graph
– Line chart
– Bar plot
– Scatter Plot
Histogram
• A histogram represents the frequencies of values of a variable
bucketed into ranges.
• A histogram is similar to a bar chart, but the difference is that it groups
the values into continuous ranges.
• The height of each bar in a histogram represents the number of
values present in that range.
• R creates histogram using hist() function. This function takes a
vector as an input and uses some more parameters to plot
histograms.
Syntax of hist() function
• hist(v,main,xlab,xlim,ylim,breaks,col,border)
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give description of x-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks controls how the values are binned: a single number suggests the
number of bins, and a vector gives the exact breakpoints.
Example:
v <- c(9,13,21,8,36,22,12,41,31,33,19)
hist(v, xlab = "Weight", col = "yellow", border = "blue")
hist(v, xlab = "Weight", col = "green", border = "red",
     xlim = c(0,40), ylim = c(0,5), breaks = 5)
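In hist(), a single number passed to breaks is treated as a suggested number of bins, not a bar width. The sketch below uses plot = FALSE to inspect the bins without drawing anything. The names h_suggest and h_exact are illustrative only:

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# A single number is a suggestion; R picks "pretty" breakpoints near it.
h_suggest <- hist(v, breaks = 5, plot = FALSE)
h_suggest$breaks

# A vector fixes the breakpoints exactly: five bins of width 10.
h_exact <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)
h_exact$counts
```

With the explicit breakpoints, the counts for the bins (0,10], (10,20], (20,30], (30,40] and (40,50] are 2, 3, 2, 3 and 1. Supplying a vector is the more reliable choice when specific bin edges are required.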
Saving chart in a file

# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v, xlab = "Weight", col = "yellow", border = "blue")
# Save the file.
dev.off()
Bar plot
• Bar Plots are suitable for showing comparison
between cumulative totals across several groups.
• A bar chart represents data in rectangular bars with
length of the bar proportional to the value of the
variable.
• R uses the function barplot() to create bar charts. In
bar chart each of the bars can be given different
colors.
Syntax of barplot() function
• barplot(H,xlab,ylab,main, names.arg,col)
• H is a vector or matrix containing numeric values
used in bar chart.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under
each bar.
• col is used to give colors to the bars in the graph.
Example:
H <- c(7,12,28,3,41)
barplot(H)
Example:
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue", col = "blue",
        main = "Revenue chart", border = "red")
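barplot() also returns the x positions of the bar midpoints, which is useful for annotating the chart. The following sketch adds value labels above the bars. The name mids and the extra ylim headroom are illustrative choices, not part of the original example:

```r
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# barplot() returns the x positions of the bar midpoints.
mids <- barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
                col = "blue", main = "Revenue chart",
                ylim = c(0, max(H) * 1.15))  # extra headroom for the labels

# Write each value just above its bar.
text(x = mids, y = H, labels = H, pos = 3)
```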
Box plot

• A boxplot summarizes how the values in a data set are
distributed.
• Three quartiles divide the data set into four parts. The graph
represents the minimum, maximum, median, first
quartile and third quartile in the data set.
• It is also useful in comparing the distribution of data
across data sets by drawing boxplots for each of them.
• Boxplots are created in R by using the boxplot() function.
Syntax of boxplot() function
• boxplot(x, data, notch, varwidth, names, main)
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw the width
of each box in proportion to the sample size.
• names are the group labels which will be printed
under each boxplot.
• main is used to give a title to the graph.
Creating the Boxplot

The script below creates a boxplot of mpg (miles per gallon)
for each value of cyl (number of cylinders).
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")
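The numbers behind the boxes can be inspected directly. The sketch below calls boxplot() with plot = FALSE, which returns the statistics instead of drawing. The row labels are added here for readability and are not produced by R:

```r
# plot = FALSE returns the statistics without drawing.
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# One column per group; rows run from the lower whisker to the upper whisker.
stats <- b$stats
colnames(stats) <- b$names
rownames(stats) <- c("lower whisker", "lower hinge", "median",
                     "upper hinge", "upper whisker")
stats

# Points beyond the whiskers are reported separately as outliers.
b$out
```

The hinges are close to the first and third quartiles. The whiskers reach the minimum and maximum only when no points are flagged as outliers.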
Pie Graph
pie(x, labels, radius, main, col, clockwise)
• x is a vector containing the numeric values used in
the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie
chart (value between −1 and +1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are
drawn clockwise or anti clockwise.
Example: Consider population of 4 cities in India
x <- c(12691836,10927986,5104047, 4631392)
labels <- c("Mumbai", "Delhi", "Bangalore", "Kolkata")
pie(x,labels)
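Slice labels are often more informative with percentages. The following sketch reuses the same population figures. The names cities, pct and slice_labels, the chart title and the rainbow() palette are illustrative choices:

```r
x <- c(12691836, 10927986, 5104047, 4631392)
cities <- c("Mumbai", "Delhi", "Bangalore", "Kolkata")

# Share of each city in the combined total, rounded to one decimal place.
pct <- round(100 * x / sum(x), 1)
slice_labels <- paste0(cities, " (", pct, "%)")

pie(x, labels = slice_labels, main = "City population share",
    col = rainbow(length(x)), clockwise = TRUE)
```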
Line Chart
• A line chart is a graph that connects a series of points by
drawing line segments between them.
• The plot() function in R is used to create the line graph.
plot(v, type, col, xlab, ylab, main)
• Where
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw
only the lines and "o" to draw both points and lines.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the Title of the chart.
• col is used to give colors to both the points and lines.
Example:
v <- c(7,12,28,3,41)
plot(v, type = "o")
Line Chart Title, Color and Labels
Example: v <- c(7,12,28,3,41)
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
Scatter plot
• Scatter plots make it easy to visualize data and are
useful for simple data inspection.
• Scatterplots show many points plotted in the
Cartesian plane. Each point represents the values of
two variables. One variable is chosen in the
horizontal axis and another in the vertical axis.
• The simple scatterplot is created using
the plot() function.
Syntax
• The basic syntax for creating scatterplot in R is
• plot(x, y, main, xlab, ylab, xlim, ylim, axes)
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label in the horizontal axis.
• ylab is the label in the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the
plot.
Example:
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)
Various forms of plot() function
• plot(x, y)
• plot(xy) : If x and y are vectors, plot(x, y) produces a scatterplot of y against x. The same
effect can be produced by supplying one argument (second form) as either a list
containing two elements x and y or a two-column matrix.
• plot(x): If x is a time series, this produces a time-series plot. If x is a numeric vector, it
produces a plot of the values in the vector against their index in the vector. If x is a
complex vector, it produces a plot of imaginary versus real parts of the vector elements.
• plot(f)
• plot(f, y): f is a factor object, y is a numeric vector. The first form generates a bar plot of f;
the second form produces boxplots of y for each level of f.
• plot(df)
• plot(~ expr)
• plot(y ~ expr): df is a data frame, y is any object, expr is a list of object names separated
by ‘+’ (e.g., a + b + c). The first two forms produce distributional plots of the variables in
a data frame (first form) or of a number of named objects (second form). The third form
plots y against every object named in expr.
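The forms listed above can be tried on the built-in mtcars data. This is a minimal sketch in which the choice of columns is illustrative only. When run interactively, R may pause between the plots produced by the formula form:

```r
# plot(x): numeric vector against its index.
plot(mtcars$mpg)

# plot(f): bar plot of a factor; plot(f, y): boxplots of y per level of f.
f <- factor(mtcars$cyl)
plot(f)
plot(f, mtcars$mpg)

# plot(df): for an all-numeric data frame this draws a scatterplot matrix.
plot(mtcars[, c("mpg", "wt", "hp")])

# plot(y ~ expr): y against each variable named on the right-hand side.
plot(mpg ~ wt + hp, data = mtcars)
```

plot() is a generic function, so the chart it draws depends on the class of its first argument. That is why the same call name covers all of these cases.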
