Module IV
Data Analysis
Data Import: Reading Data into R
R base functions for importing data
Syntax:
read.csv(path, header = TRUE, sep = ",")
path : The path of the file to be imported.
header : TRUE by default. Indicates whether the first line of the
file holds the column headings.
sep = "," : The separator between the values in each row.
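A minimal sketch of the call above. The file name, the column names and the values are invented for illustration; the file is written to a temporary location first so the example runs on its own.

```r
# Write a small CSV to a temporary file (hypothetical data, for illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Asha,81", "Ravi,74"), csv_path)

# Read it back: the first line supplies the column headings, values are comma-separated.
students <- read.csv(csv_path, header = TRUE, sep = ",")
str(students)
```

With header = FALSE the first line would be read as data and R would name the columns V1, V2, and so on.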
Reading Excel files in R
• To install dplyr
install.packages("dplyr")
• To load dplyr
library(dplyr)
Important dplyr verbs to remember
• Example
• Filter the rows of airquality where the temperature (Temp) is
greater than 62.
filter(airquality, Temp > 62)
• Filter the rows of airquality where the temperature is greater
than 62 and the day of the month (Day) is 3 or later.
filter(airquality, Temp > 62, Day >= 3)
Arrange or re-order rows using arrange()
• To arrange (or re-order) rows by a particular column
arrange(airquality,Wind) (OR)
airquality %>% arrange(Wind) %>% head
• Now, we will select four columns from airquality and arrange the
rows by Temp, with ties in Temp ordered by Wind. Finally,
show the head of the resulting data frame.
airquality %>% select(Temp, Wind, Ozone, Day) %>%
arrange(Temp, Wind) %>% head
• To arrange in a descending order:
arrange(airquality, desc(Wind))
Create new columns using mutate()
Example:
Add a new column “year” to the air data frame.
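One way this example might look, as a sketch: it assumes that air is simply a copy of the built-in airquality data and that the new column holds the constant 1973, the year in which the airquality measurements were recorded.

```r
library(dplyr)

# 'air' is an assumed name for a copy of the built-in airquality data frame.
air <- airquality

# mutate() adds the new column and keeps all existing ones.
air <- mutate(air, year = 1973)
head(air)
```

mutate() can also compute a column from existing ones, for example mutate(air, TempC = (Temp - 32) * 5 / 9), where TempC is a hypothetical column name.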
Renaming columns using rename()
• Example:
Rename the columns Month to M and Day to D.
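A sketch of this renaming on the built-in airquality data. In rename() the new name goes on the left of the equals sign and the existing name on the right; the object name renamed is only illustrative.

```r
library(dplyr)

# new_name = old_name; all other columns are left unchanged.
renamed <- rename(airquality, M = Month, D = Day)
names(renamed)
```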
Exploratory Data Analysis
• Exploratory data analysis involves examining your data using
graphical and numerical summaries.
• We can visualize data graphically using the following charts.
– Histogram
– Box plot
– Pie graph
– Line chart
– Bar plot
– Scatter Plot
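The numerical side of exploratory analysis can be sketched with base R alone. The choice of the Temp column of airquality here is only an assumption made for illustration.

```r
# Five-number summary plus the mean for one numeric column.
temp_summary <- summary(airquality$Temp)
print(temp_summary)

# Spread of the same column.
temp_sd <- sd(airquality$Temp)
print(temp_sd)
```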
Histogram
• A histogram represents the frequencies of the values of a variable
bucketed into ranges.
• A histogram is similar to a bar chart, but the difference is that it
groups the values into continuous ranges.
• The height of each bar in a histogram represents the number of
values that fall in that range.
• R creates a histogram using the hist() function. This function takes
a numeric vector as input and accepts further parameters that
control how the histogram is drawn.
Syntax of hist() function
• hist(v,main,xlab,xlim,ylim,breaks,col,border)
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give description of x-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to give the break points between bars, or a
suggested number of bars; this determines the width of each bar.
Example:
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
hist(v, xlab = "Weight", col = "yellow", border = "blue")
hist(v, xlab = "Weight", col = "green", border = "red",
xlim = c(0, 40), ylim = c(0, 5), breaks = 5)
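To see how the bucketing works, the sketch below passes explicit break points instead of a single number and inspects the counts without drawing anything. The break points 0, 10, ..., 50 are an assumption chosen for illustration; a single number such as breaks = 5 is only a suggestion that R may adjust.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# plot = FALSE returns the computed bins instead of drawing the chart.
h <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)

h$breaks   # the edges of the ranges
h$counts   # number of values in each range: 2 3 2 3 1
```

Each count is the height of the corresponding bar, and the counts add up to the 11 values in v.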
Saving chart in a file
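A common base R pattern for this is to open a graphics device, draw the chart, and close the device. The sketch below assumes a PNG file named histogram.png in the session's temporary directory; both the format and the file name are illustrative choices.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Assumed output location; replace with any path you like.
out_file <- file.path(tempdir(), "histogram.png")

png(filename = out_file)   # open a PNG graphics device
hist(v, xlab = "Weight", col = "yellow", border = "blue")
dev.off()                  # close the device so the file is written

file.exists(out_file)
```

The chart is only written to disk once dev.off() is called.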