Module IV

Uploaded by

saikumar.addanki990

CSE1006 – Foundations of Data Analytics

Module 4
Data Analysis
Data Import: Reading Data into R
R base functions for importing data

• R base functions for importing data from txt|csv files

– Read tab separated values
read.delim(file.choose())
– Read comma (",") separated values
read.csv(file.choose())
– Read semicolon (";") separated values
read.csv2(file.choose())
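Since file.choose() opens an interactive dialog, a non-interactive sketch may be easier to try. The example below assumes two small made-up files written to a temporary location; the file contents and variable names are illustrative only.

```r
# Hypothetical sample files written to temporary locations.
tab_file  <- tempfile(fileext = ".txt")
semi_file <- tempfile(fileext = ".csv")
writeLines(c("city\ttemp", "Pune\t31"), tab_file)
writeLines(c("city;temp", "Pune;31,5"), semi_file)

tab_data  <- read.delim(tab_file)   # tab-separated values
semi_data <- read.csv2(semi_file)   # semicolon-separated, comma as decimal mark

tab_data$temp
semi_data$temp
```

Note that read.csv2() also expects a comma as the decimal mark, which is why "31,5" is read as a number.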
Using the read.csv() function

Syntax:
read.csv(path, header = TRUE, sep = ",")
path : The path of the file to be imported.
header : Whether the first line holds the column headings.
Default: TRUE.
sep = "," : The separator between the values in each row.
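As a minimal sketch of the header and sep arguments, the following writes a small CSV file to a temporary location and reads it back twice. The file contents and the names scores and raw are made up for illustration.

```r
# Hypothetical sample data written to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("name,marks", "Asha,82", "Ravi,75"), path)

# header = TRUE treats the first line as column names.
scores <- read.csv(path, header = TRUE, sep = ",")
str(scores)

# header = FALSE keeps the first line as an ordinary data row.
raw <- read.csv(path, header = FALSE, sep = ",")
nrow(raw)
```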
Reading Excel files in R

• The read_excel() function imports (reads) an Excel file.
It is available only after loading the readxl package.
• library(readxl)
• read_excel(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL, na = "",
trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
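A short sketch of a typical call, assuming the readxl package is installed. It uses the sample workbook that ships with readxl (located with readxl_example()) rather than a file from the course; the variable names are illustrative.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")
excel_sheets(path)   # list the sheet names in the workbook

# Read only the first five data rows of the "mtcars" sheet.
cars <- read_excel(path, sheet = "mtcars", n_max = 5)
cars
```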
Exporting Data: Writing data from R to files
• To Excel files (we can use xlsx package or writexl package)
write.xlsx(data,file,sheetName, col.names, row.names, append)
• data – the dataframe to be written to excel file
• file – the output path in which the excel file has to be stored
• sheetName – desired name for the sheet
• row.names, col.names – (logical) if TRUE, the row and column
names are written to the Excel file
• append – (logical) if TRUE, the data is appended to the existing Excel file
Examples
• Using xlsx package
install.packages("xlsx")
library(xlsx)
write.xlsx(iris, file = "iris.xlsx", sheetName="iris_data")
• Using writexl package
install.packages("writexl")
library(writexl)
write_xlsx(air, "c:\\Users\\mypc\\Desktop\\air1.xlsx")
Data cleaning and summarizing with the dplyr package
• dplyr is a powerful R-package to transform and
summarize tabular data with rows and columns.
• The package contains a set of functions (or “verbs”) that
perform common data manipulation operations such as
filtering for rows, selecting specific columns, re-ordering
rows, adding new columns and summarizing data.
• In addition, dplyr provides group_by(), which supports
another common task: the "split-apply-combine" pattern.
Why dplyr?
• dplyr provides a set of tools for efficiently manipulating
datasets in R. It is the next iteration of plyr, focusing only
on data frames. The main advantages include:
1. Speed. Compared to the plyr library (home of the familiar ddply
function), dplyr is reported to be anywhere between 20X and 100X
faster in its calculations.
2. Cleaner code. The syntax allows function chaining, which
prevents clutter and makes the code easier to write and read.
3. Simpler code. dplyr has a limited number of functions that are
focused on the most common requirements of data
manipulation. The syntax is both simple and efficient.
Install and load dplyr

• To install dplyr
install.packages("dplyr")
• To load dplyr
library(dplyr)
Important dplyr verbs to remember

dplyr verb       Description

1. select()      select columns
2. filter()      filter rows
3. arrange()     re-order or arrange rows
4. mutate()      create new columns
5. summarise()   summarise values
6. rename()      rename variables in a data frame
7. group_by()    allows for group operations in the
                 "split-apply-combine" concept
%>%: the "pipe" operator is used to connect multiple verb
actions together into a pipeline
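A minimal sketch of how the pipe chains verbs into a "split-apply-combine" pipeline, assuming dplyr is installed. It uses the built-in airquality data set; the names monthly and mean_temp are illustrative.

```r
library(dplyr)

# Split by Month, apply mean() to Temp within each group, combine into one table.
monthly <- airquality %>%
  group_by(Month) %>%
  summarise(mean_temp = mean(Temp))

monthly
```

Each verb receives the result of the previous step as its first argument, so the pipeline reads top to bottom.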
dplyr vs. Base R Functions
• dplyr functions often process data frames faster than
equivalent base R code, because they are written in a
computationally efficient manner.
• Their syntax is also more consistent, and they are
designed for data frames rather than vectors.
SQL Queries vs. dplyr
Selecting columns using select()
• Syntax
select(dataframe, column name)
Example
select(airquality, Ozone)
• To select all the columns except a specific column,
use the "-" (subtraction) operator (also known as
negative indexing)
Example
select(airquality, -Temp)
• Some additional options to select columns based on specific
criteria include
• starts_with() = Select columns that start with a character string
• ends_with() = Select columns that end with a character string
• contains() = Select columns that contain a character string
• matches() = Select columns that match a regular expression
• one_of() = Select columns whose names are from a group of
names
• Example
select(airquality, starts_with("Mo"))
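The select() forms above can be tried together as follows. This is a sketch that assumes dplyr is installed and uses the built-in airquality data set; the result names are illustrative.

```r
library(dplyr)

ozone_only <- select(airquality, Ozone)              # a single column
no_temp    <- select(airquality, -Temp)              # every column except Temp
mo_cols    <- select(airquality, starts_with("Mo"))  # names starting with "Mo"
d_cols     <- select(airquality, matches("^D"))      # names matching a regular expression

names(no_temp)
names(mo_cols)
names(d_cols)
```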
Selecting rows using filter()

• Example
• Keep the rows of airquality where Temperature is more
than 62.
filter(airquality, Temp > 62)
• Keep the rows of airquality where Temperature is more
than 62 and the day is the 3rd or later.
filter(airquality, Temp > 62, Day >= 3)
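As a sketch of how multiple conditions combine, the example below (assuming dplyr is installed) shows that conditions separated by commas behave like a logical AND. The result names are illustrative.

```r
library(dplyr)

warm      <- filter(airquality, Temp > 62)
warm_late <- filter(airquality, Temp > 62, Day >= 3)

# Comma-separated conditions are equivalent to joining them with &.
warm_late_and <- filter(airquality, Temp > 62 & Day >= 3)

c(all = nrow(airquality), warm = nrow(warm), warm_late = nrow(warm_late))
```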
Arrange or re-order rows using arrange()
• To arrange (or re-order) rows by a particular column
arrange(airquality, Wind) (OR)
airquality %>% arrange(Wind) %>% head()
• Now, we will select four columns from airquality, arrange the
rows by Temp and then, within equal Temp values, by Wind.
Finally, show the head of the final data frame.
airquality %>% select(Temp, Wind, Ozone, Day) %>%
arrange(Temp, Wind) %>% head()
• To arrange in a descending order:
arrange(airquality, desc(Wind))
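To make the effect of a second sort column visible, here is a small sketch on a made-up data frame (toy is hypothetical, not part of the course data) that contains a tie in Temp. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with two rows sharing the same Temp.
toy <- data.frame(Temp = c(70, 65, 70), Wind = c(9, 12, 4))

# Rows are ordered by Temp first; Wind only decides the order among ties.
sorted <- arrange(toy, Temp, Wind)
sorted

# desc() reverses the order for one column.
sorted_desc <- arrange(toy, desc(Wind))
sorted_desc
```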
Create new columns using mutate()

Example:
Add new column “year” in air data frame.
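A possible version of this example is sketched below. It assumes that air is a copy of the built-in airquality data set and that the new column should hold 1973, the year in which the airquality measurements were taken; both are assumptions, since the slide does not show the code.

```r
library(dplyr)

air <- airquality                 # assumption: air is a copy of airquality
air <- mutate(air, year = 1973)   # add a constant column named year

head(air)
```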
Renaming columns using rename()

• Example:
Rename columns Month to M and Day to D
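One way to write this, assuming air is a copy of the built-in airquality data set (the slide does not show the code). In rename(), each argument has the form new_name = old_name.

```r
library(dplyr)

air <- airquality                        # assumption: air is a copy of airquality
air <- rename(air, M = Month, D = Day)   # new_name = old_name

names(air)
```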
Exploratory Data Analysis
• It involves visualizing your data using graphical and numerical
summaries.
• We can visualize data graphically using the following charts.
– Histogram
– Box plot
– Pie graph
– Line chart
– Bar plot
– Scatter Plot
Histogram
• A histogram represents the frequencies of values of a variable
bucketed into ranges.
• A histogram is similar to a bar chart, but the difference is that
it groups the values into continuous ranges.
• The height of each bar in a histogram represents the number of
values present in that range.
• R creates histogram using hist() function. This function takes a
vector as an input and uses some more parameters to plot
histograms.
Syntax of hist() function
• hist(v,main,xlab,xlim,ylim,breaks,col,border)
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give description of x-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to set the break points between the bars, or a
single number suggesting how many bars to use.
Example: v <- c(9,13,21,8,36,22,12,41,31,33,19)
hist(v, xlab = "Weight", col = "yellow", border = "blue")
hist(v, xlab = "Weight", col = "green", border = "red",
xlim = c(0,40), ylim = c(0,5), breaks = 5)
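A minimal sketch of how breaks behaves, using the same vector v. With a single number, R treats breaks as a suggestion and chooses rounded break points itself; a vector of break points is used exactly as given. plot = FALSE is used here only to inspect the bins without drawing, and the bin edges 0, 10, ..., 50 are an illustrative choice.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# A single number is only a suggestion for the number of bins.
h_suggested <- hist(v, breaks = 5, plot = FALSE)
h_suggested$breaks

# A vector of break points fixes the bins exactly.
h_fixed <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)
h_fixed$counts

# Draw the histogram with the fixed bins.
hist(v, breaks = seq(0, 50, by = 10), xlab = "Weight",
     col = "yellow", border = "blue")
```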
Saving chart in a file

# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v, xlab = "Weight", col = "yellow", border = "blue")
# Save the file.
dev.off()
Bar plot
• Bar Plots are suitable for showing comparison
between cumulative totals across several groups.
• A bar chart represents data in rectangular bars with
length of the bar proportional to the value of the
variable.
• R uses the function barplot() to create bar charts. In
bar chart each of the bars can be given different
colors.
Syntax of barplot() function
• barplot(H,xlab,ylab,main, names.arg,col)
• H is a vector or matrix containing numeric values
used in bar chart.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under
each bar.
• col is used to give colors to the bars in the graph.
Example: H <- c(7,12,28,3,41)
barplot(H)
Example: H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
col = "blue", main = "Revenue chart", border = "red")
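As a small extension of the example above, barplot() returns the x positions of the bar midpoints, which can be used to write each value above its bar. This is a sketch; the ylim value is an illustrative choice that leaves room for the labels.

```r
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# barplot() returns the midpoint of each bar on the x axis.
mids <- barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
                col = "blue", main = "Revenue chart", ylim = c(0, 50))

# Write each value just above its bar.
text(x = mids, y = H, labels = H, pos = 3)
```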
Box plot

• A boxplot shows how the values in a data set are distributed.
• It divides the data set into four parts using three quartiles.
This graph represents the minimum, maximum, median, first
quartile and third quartile in the data set.
• It is also useful in comparing the distribution of data
across data sets by drawing boxplots for each of them.
• Boxplots are created in R by using the boxplot() function.
Syntax of boxplot() function
• boxplot(x, data, notch, varwidth, names, main)
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw the width
of each box in proportion to the sample size.
• names are the group labels which will be printed
under each boxplot.
• main is used to give a title to the graph.
Creating the Boxplot

The script below creates a boxplot of mpg (miles per gallon)
for each value of cyl (number of cylinders) in the mtcars data set.
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon",
main = "Mileage Data")
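To see the numbers behind the picture, boxplot() can be called with plot = FALSE, in which case it returns the statistics instead of drawing. The sketch below uses the same formula on mtcars; the name b is illustrative.

```r
# Compute the boxplot statistics without drawing.
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

b$names   # one group per number of cylinders
b$stats   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker

# Draw the same plot with notches and widths based on group size.
boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE,
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
        main = "Mileage Data")
```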
Pie Graph
pie(x, labels, radius, main, col, clockwise)
• x is a vector containing the numeric values used in
the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie
chart.(value between −1 and +1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are
drawn clockwise or anti clockwise.
Example: Consider population of 4 cities in India
x <- c(12691836, 10927986, 5104047, 4631392)
labels <- c("Mumbai", "Delhi", "Bangalore", "Kolkata")
pie(x, labels)
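A sketch that extends the example with the remaining arguments. It adds each city's share of the total to its label; the chart title, the rainbow() palette and the names cities and pct are illustrative choices, not part of the original example.

```r
x <- c(12691836, 10927986, 5104047, 4631392)
cities <- c("Mumbai", "Delhi", "Bangalore", "Kolkata")

# Share of each city in the total, as a percentage with one decimal place.
pct <- round(100 * x / sum(x), 1)
slice_labels <- paste0(cities, " (", pct, "%)")

pie(x, labels = slice_labels, main = "Population of 4 cities",
    col = rainbow(length(x)), clockwise = TRUE)
```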
Line Chart
• A line chart is a graph that connects a series of points by
drawing line segments between them.
• The plot() function in R is used to create the line graph.
plot(v,type,col,xlab,ylab,main)
• Where
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw
only the lines and "o" to draw both points and lines.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the Title of the chart.
• col is used to give colors to both the points and lines.
Example:
v <- c(7,12,28,3,41)
plot(v, type = "o")
Line Chart Title, Color and Labels
Example: v <- c(7,12,28,3,41)
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
Scatter plot
• Scatter plots make it easy to visualize data and to
inspect it quickly.
• Scatterplots show many points plotted in the
Cartesian plane. Each point represents the values of
two variables. One variable is chosen in the
horizontal axis and another in the vertical axis.
• The simple scatterplot is created using
the plot() function.
Syntax
• The basic syntax for creating scatterplot in R is
• plot(x, y, main, xlab, ylab, xlim, ylim, axes)
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label in the horizontal axis.
• ylab is the label in the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the
plot.
Example:
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)
Various forms of plot() function
• plot(x, y)
• plot(xy) : If x and y are vectors, plot(x, y) produces a scatterplot of y against x. The same
effect can be produced by supplying one argument (second form) as either a list
containing two elements x and y or a two-column matrix.
• plot(x): If x is a time series, this produces a time-series plot. If x is a numeric vector, it
produces a plot of the values in the vector against their index in the vector. If x is a
complex vector, it produces a plot of imaginary versus real parts of the vector elements.
• plot(f)
• plot(f, y): f is a factor object, y is a numeric vector. The first form generates a bar plot of f
; the second form produces boxplots of y for each level of f.
• plot(df)
• plot(~ expr)
• plot(y ~ expr): df is a data frame, y is any object, expr is a list of object names separated
by ‘+’ (e.g., a + b + c). The first two forms produce distributional plots of the variables in
a data frame (first form) or of a number of named objects (second form). The third form
plots y against every object named in expr.
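The forms listed above can be tried on the built-in mtcars data set. This is a sketch: the choice of columns and the factor f built from cyl are illustrative.

```r
f <- factor(mtcars$cyl)   # a factor with one level per number of cylinders

plot(mtcars$wt, mtcars$mpg)            # plot(x, y): scatterplot of y against x
plot(mtcars$mpg)                       # plot(x): values against their index
plot(f)                                # plot(f): bar plot of counts per level
plot(f, mtcars$mpg)                    # plot(f, y): boxplots of y for each level
plot(mtcars[, c("mpg", "wt", "hp")])   # plot(df): plots of the data frame's variables
plot(mpg ~ wt + hp, data = mtcars)     # plot(y ~ expr): mpg against wt, then against hp
```

When run as a script, each call starts a new plot, so the last form produces one plot per variable named on the right-hand side.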
