Data Analysis using R and Python

Contents

1 Introduction
1.1 Why R?
1.2 Why Python?
1.3 Installing R software
1.4 Installing Python Software
1.5 Basic Operators in R
1.5.1 Arithmetic Operators
1.5.2 Logical Operators
1.6 Basic Operators in Python
1.6.1 Arithmetic Operators
1.6.2 Logical Operators
1.7 Data types in R
1.8 Data types in Python
1.9 Data Analysis

2 Graphics And Visualization
2.1 Graphics and Visualization using R
2.1.1 Applications of lattice
2.1.2 Applications of ggplot2
2.2 Graphics and Visualization using Python
2.2.1 Applications of matplotlib
2.2.2 Applications of seaborn

3 Data Cleaning
3.1 Why is data cleaning important?
3.2 Problems if we don't clean our data
3.3 Data cleaning process
3.3.1 ETL process
3.3.2 Extraction of data using R
3.3.3 Extraction of data using Python
3.3.4 Transformation
3.3.5 Transformation using R
3.3.6 Transformation using Python

4 Exploratory Data Analysis
4.1 Univariate Exploratory Data Analysis
4.2 Multivariate Exploratory Data Analysis
4.3 Exploratory Data Analysis using R
4.4 Exploratory Data Analysis using Python

5 Regression Analysis
5.1 Linear Regression Analysis
5.1.1 Assumptions in the Linear Regression model
5.2 Logistic Regression Analysis
5.3 Regression Analysis using R
5.3.1 Multiple Linear Regression
5.3.2 Logistic Regression
5.4 Regression Analysis using Python

Bibliography
Chapter 1

Introduction

Python and R are two very popular open-source software packages used for data analysis.
Both are of equal importance and can be used as complements to each other. When it
comes to choosing one: R offers more depth for data analysis and data modelling, while
Python is easier to learn and tends to present graphs in a more polished way.

1.1 Why R?
• Open Source: R is open-source software and can be installed easily.

• Learning: R requires you to learn and understand coding. It is a relatively low-level
programming style, and tasks can require longer code.

• Statistical Analysis: R was designed by statisticians for doing statistical analysis,
and statistical analysis in R is made strong by a large number of packages. As of
9 June 2018, the CRAN package repository features 12,611 available packages. Some of
the well-known packages are:
For Visualization: ggplot2 (see Hadley Wickham, Winston Chang, RStudio (2016))
For Reporting: knitr (see Yihui Xie (2018))
For Geography: maps, ggmap (see David Kahle, Hadley Wickham (2016))

1.2 Why Python?


• Open source: Python is also open-source software and can be installed easily.

• Learning: Python is easy to learn and is known for its simplicity in the programming
world.

• Speed: People are often inappropriately obsessed with speed. Python is a high-level
language, which brings a number of benefits that accelerate development; another
benefit is that it is easy to learn.


• Statistical Analysis: Python is widely used in scientific computing and statistical
analysis in industry and academia. A large number of analytics libraries are
available, covering numerical analysis, statistical analysis, data analysis and
visualization (see https://docs.python.org/3/library/).

1.3 Installing R software


R is open source and can be obtained from the Comprehensive R Archive Network (CRAN)
as follows:
• Open an internet browser and go to www.r-project.org.

• Click the "download R" link in the middle of the page under "Getting Started."

• Select a CRAN location (a mirror site) and click the corresponding link.

• Click on the "Download R for (Mac) OS X" link at the top of the page.

• Click on the file containing the latest version of R under "Files."

• Save file, double-click it to open, and follow the installation instructions.

1.4 Installing Python Software


Python is open source and can be downloaded from the website www.python.org, which
can be done as follows:
• Go to the python website www.python.org and click on the ’Download’ menu
choice.
• Next click on the Python 3.7 (note that the version number may change) ’Windows
Installer’ to download the installer. If you know you’re running a 64-bit OS, you
can choose the x86-64 installer.
• Be sure to save the file that you are downloading.

• Once you have downloaded the file, open it. (You can also double-click on it to open
it.)

• Now follow the installation instructions.

• Now, for doing data analysis, we also need to install packages.


Note: For new users of Python, Anaconda (from https://www.continuum.io) is recommended,
because installing the packages used in data analysis can be complicated. Anaconda
provides almost all the necessary packages you need at work. Otherwise, you would need
to install every package separately, and you will often run into installation errors
because of incompatible package versions.

Once the software is installed, it can be executed by launching the corresponding
executable. The prompt, by default '>' in R and '>>>' in Python, indicates that the
software is waiting for your commands. At this stage a new user may think, "What do I
do now?" For a first session, let's start with the most basic use: both programs work
as powerful calculators. We can write commands at the prompt and evaluate them
directly.

1.5 Basic Operators in R


The R language has the usual basic operators. The common operators are:

1.5.1 Arithmetic Operators

Following are the arithmetic operators in R.

Arithmetic Operators

Operator  Meaning                                    Example
+         Addition                                   > 7+6
                                                     [1] 13
-         Subtraction                                > 9-5
                                                     [1] 4
*         Multiplication                             > 6*2
                                                     [1] 12
/         Division                                   > 9/3
                                                     [1] 3
%%        Modulus, i.e. remainder of the division    > 5%%2
          of the left operand by the right           [1] 1 (i.e. remainder of 5/2)
^         Exponent: left operand raised to           > 2^3
          the power of the right                     [1] 8 (i.e. 2 to the power 3)
pi        Pi (R knows about pi, i.e. π)              > pi
                                                     [1] 3.141593
sin       Sine trigonometric function                > sin(90*(pi/180))
                                                     [1] 1 (converts the angle to
                                                     radians, then takes sin())
log       Logarithm (natural by default)             > log(25)
                                                     [1] 3.218876

4.12e-2 # scientific notation

## [1] 0.0412

cos(120*pi/180)

## [1] -0.5

log(100, base=10) # logarithm of x with base y; if base is not
                  # specified, log() returns the natural logarithm

## [1] 2


exp(25) #Returns the exponential of x

## [1] 72004899337

sqrt(25) #Returns the square root of x

## [1] 5

factorial(4) #Returns the factorial of x (x!)

## [1] 24

choose(5,3) # number of possible combinations when drawing
            # 3 elements at a time from 5 possibilities

## [1] 10

1.5.2 Logical Operators


Following are logical operators in R:

Logical operators

Operator  Meaning
<         Less than
>         Greater than
<=        Less than or equal to
>=        Greater than or equal to
==        Exactly equal to
!=        Not equal to
!x        Not x
x|y       x or y
x&y       x and y

x=c(1:5)
y=2
x<y

## [1] TRUE FALSE FALSE FALSE FALSE

x>y

## [1] FALSE FALSE TRUE TRUE TRUE

x<=y

## [1] TRUE TRUE FALSE FALSE FALSE

x>=y

## [1] FALSE TRUE TRUE TRUE TRUE

x==y

## [1] FALSE TRUE FALSE FALSE FALSE

x!=y

## [1] TRUE FALSE TRUE TRUE TRUE

!x

## [1] FALSE FALSE FALSE FALSE FALSE

!y

## [1] FALSE

1.6 Basic Operators in Python


Python also has the usual basic arithmetic operators. The common operators are:

1.6.1 Arithmetic Operators

Some arithmetic operators used in Python are given in the table below.

In [1]: import math

In [2]: math.log10(100) # return the base-10 logarithm

Out[2]: 2.0

Arithmetic operators

Operator   Meaning                                    Example
+          Addition                                   >>> 3+5
                                                      8
-          Subtraction                                >>> 6-4
                                                      2
*          Multiplication                             >>> 5*2
                                                      10
/          Division                                   >>> 8/4
                                                      2.0
%          Modulus: remainder of the division         >>> 5%2
           of the left operand by the right           1 (remainder of 5/2)
**         Exponent: left operand raised              >>> 2**3
           to the power of the right                  8 (i.e. 2 to the power 3)
math.sin   Sine trigonometric function                >>> math.sin(90*(math.pi/180))
                                                      1.0
math.log   Logarithm (natural by default)             >>> math.log(25)
                                                      3.2188758248682006


In [3]: math.cos(25)#return the cosine of 25 radians

Out[3]: 0.9912028118634736

In [4]: math.factorial(5) #return the value of 5!

Out[4]: 120

In [5]: math.gamma(5) #return the gamma function at 5

Out[5]: 24.0

In [6]: math.degrees(90) # convert an angle of 90 radians to degrees

Out[6]: 5156.620156177409

1.6.2 Logical Operators


The following are Logical operators used in Python:

In [1]: x=2
y=7

Logical operators

Operator   Meaning
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to
==         Exactly equal to
!=         Not equal to
not x      Not x
x or y     x or y
x and y    x and y

Note that in Python the logical operators are written not, or and and (the R-style
!, | and & symbols are bitwise operators in Python).

In [2]: if(x==y):
print("x is equal to y")
else:
print("x is not equal to y")

x is not equal to y

In [3]: if (x>y):
print("x is greater than y")
else:
print("x is not greater than y")

x is not greater than y

In [4]: if (x<y):
print("x is less than y")
else:
print("x is not less than y")

x is less than y

In [5]: if (x>=y):
print("x is greater than equal to y")
else:
print("x is not greater than equal to y")

x is not greater than equal to y



In [6]: if (x<=y):
print("x is less than equal to y")
else:
print("x is not less than equal to y")
x is less than equal to y

Every value has a data type. Data types are classes, and variables are instances of
these classes.

1.7 Data types in R


R has various data types, which include vectors, matrices, arrays, data frames and
lists.
• Vectors: When we have to create a vector with more than one element, we use the
c() function, which combines elements into a vector. For example:

#create a vector
X<- c(1,2,3,5,6,7)
X

## [1] 1 2 3 5 6 7

print(class(X))# to get class of vector

## [1] "numeric"

• Matrices: A matrix is a 2-dimensional data structure. It consists of elements of the
same class and can be created by giving vector input to the matrix() function. For
example:

M <- matrix(c(1,2,3,4,5,6), nrow=2)
M

## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6

• List: A list is a special type of vector which can contain elements of different
data types. For example:

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)

## [[1]]
## [1] 2 5 3
##
## [[2]]
## [1] 21.3
##
## [[3]]
## function (x) .Primitive("sin")

• Dataframe: The data frame is the most commonly used data type for storing tabular
data. It differs from the matrix data type: in a matrix every element must be of the
same class, while in a data frame each column can be of a different class, with every
column acting as a list. Every time you read data into R it will be stored in a data
frame. For example:

df <- data.frame(name=c("ash","jane","paul","mark"), score=c(67,56,87,91))
df

## name score
## 1 ash 67
## 2 jane 56
## 3 paul 87
## 4 mark 91

1.8 Data types in Python


Python has five standard data types.

• Number: Number variables are created by the standard Python method. For
example:

In [1]: a=5

In [2]: print(a,type(a))

5 <class 'int'>

In [3]: b=0.265

In [4]: print(b,type(b))

0.265 <class 'float'>

• String: Strings can be defined using single ('), double (") or triple (''') quotes.
For example:

In [5]: greeting='greeting'

In [6]: print(greeting,type(greeting))

greeting <class 'str'>

• List: Lists are among the most useful data structures in Python. A list is simply
defined by comma-separated values in square brackets. Lists can contain items of
different types, but usually the items have the same type. Python lists are mutable,
and individual elements of a list can be changed. For example:

In [7]: square_list=[0,1,2,4,5,15,25,30]

In [8]: print(square_list,type(square_list))

[0, 1, 2, 4, 5, 15, 25, 30] <class 'list'>

In [9]: square_list[0]

Out[9]: 0

In [10]: square_list[2:6]

Out[10]: [2, 4, 5, 15]

• Tuple: A tuple is represented by a number of values separated by commas. Tuples are
immutable and cannot be changed; they are faster to process than lists. Hence, if your
list is unlikely to change, you should use a tuple instead of a list. For example:

In [11]: tuple_ex=1,2,3,5,9,77,44,55,21,36,59

In [12]: print(tuple_ex,type(tuple_ex))

(1, 2, 3, 5, 9, 77, 44, 55, 21, 36, 59) <class 'tuple'>

• Dictionary: Dictionaries in Python are collections of key:value pairs. Dictionaries
can be used to sort, iterate and compare data. Dictionaries are created using braces
({}) with pairs separated by commas (,) and each key associated with its value by a
colon (:). In a dictionary the keys must be unique. For example:

In [13]: dict={"A":1,"B":2,"C":3}

In [14]: dict.keys()

Out[14]: dict_keys(['A', 'B', 'C'])

In [15]: dict.values()

Out[15]: dict_values([1, 2, 3])
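As a small added illustration (not from the source), a dictionary can also be iterated
over its key:value pairs using the standard items() method:

In [16]: for key, value in dict.items():
             print(key, value)

A 1
B 2
C 3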

1.9 Data Analysis


Data analysis is a process of cleaning, transforming and modelling data with the goal
of extracting useful information, suggesting conclusions and supporting decision-making.
In other words, the main purpose of data analysis is to look at what the data is trying
to tell us. Data analysis with a good statistical program is not difficult, and it does
not require much knowledge of the mathematics or formulas that the program uses for the
analysis. Data analysis requires a few things:
• Clean data that is ready for analysis.

• A clear idea of what questions you want the data to answer.
The process of data analysis has the following steps:

• Data cleaning

• Exploratory data analysis

• Data analysis

• Visualization and reporting

• Decision making
Chapter 2

Graphics And Visualization

Graphics and visualization are powerful tools for describing data and assisting its
analysis. Their power arises from the fact that they describe large quantities of
information quickly. Graphics play an important role in good data analysis. They are
useful for exploring large data sets, assist in describing and summarizing the data,
and they can be tightly integrated with formal analytical statistical tools such as
model-fitting techniques so that the analysis process can be refined.

2.1 Graphics and Visualization using R


R offers a variety of powerful tools for graphics and visualization. Each graphical
function has a large number of options for producing graphs, making it flexible: it is
possible to display data and outcomes in a wide variety of ways. Base R plotting
commands are used to display a variety of graphs and are divided into two basic groups:

• High-level plotting functions: A high-level plotting function creates a new graph on
the device, possibly with axes, labels, etc. Functions such as hist(), plot() and
boxplot() produce an entire plot or initialize a plot; a high-level plotting function
starts a new plot, erasing the current plot if one exists. Some of the standard plot
functions are:

S.No Function Name of the plot


1 plot() Scatter plot
2 hist() Histogram
3 boxplot() Box plot or Box-and-whiskers plot
4 stripchart() Strip-chart
5 barplot() Bar-diagram
6 stem() Stem and leaf display

A number of arguments can be passed to a high-level plotting function, as follows:

S.No. Arguments Explanation


1 main=" " Title of the plot
2 xlab=" " Label for x axis


3 ylab=" " Label for y axis


4 xlim= Specify x limit
5 ylim= Specify y limit
6 type="p/l/o" Style of plotting symbol
7 pch=" " Shape of the points
8 lty=" " Style of the line

• Low-level plotting functions: Sometimes high-level plotting functions do not produce
the graph we desire, or we want to add more information to the graph. We then use
low-level plotting functions, which add more information to an existing plot, such as
lines, labels and extra points. Some low-level plotting functions are listed below (a
short example follows the table):

S.No.  Function   Explanation

1      lines()    Lines
2      abline()   Lines given by intercept and slope
3      points()   Points
4      text()     Text in the plot
5      legend()   List of symbols (a legend)
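As a minimal sketch of how the two groups work together (the built-in cars dataset is
our choice here, purely for illustration):

# high-level function: creates a new scatter plot
plot(cars$speed, cars$dist, main="Speed vs Stopping Distance",
     xlab="Speed", ylab="Distance", pch=19)
# low-level functions: add information to the existing plot
abline(lm(dist ~ speed, data=cars), lty=2)        # line from intercept and slope
text(7, 80, "fitted line")                        # text annotation
legend("topleft", legend="observations", pch=19)  # legend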

R has many powerful packages for graphical representation and visualization. Two most
widely used packages are:

• lattice: lattice was created by Deepayan Sarkar (see Sarkar (2017)). It is a
powerful, high-level data visualization system. lattice is based on the Grid graphics
engine and requires the grid add-on package. lattice provides its own interface for
modifying a set of graphical and non-graphical settings.

• ggplot2: ggplot2 was created by Hadley Wickham in 2005 (see Wickham (2016)).
Since 2005 ggplot2 has grown in use to become one of the most popular R packages.
In comparison to base R, ggplot2 allows the user to add or remove plot components at
a high level of abstraction. This abstraction comes at a cost: ggplot2 is slower than
lattice graphics.

Dataset:
We use the Motor Trend Car Road Tests (mtcars) dataset from the R datasets package.
Description: The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and performance for 32
automobiles (1973-74 models).

2.1.1 Applications of lattice

The lattice package created by Deepayan Sarkar is an attempt to improve on base R
graphics. The lattice package provides better defaults and the ability to easily
display multivariate relationships (see Sarkar D. (2017)). The plotting functions
available in lattice:

Lattice Function  Description                              R base analogue

bwplot()          Boxplots                                 boxplot()
barchart()        Barcharts                                barplot()
histogram()       Histograms                               hist()
densityplot()     Density plots                            none
qqmath()          Quantile-quantile plots (data set vs     qqnorm()
                  theoretical distribution)
dotplot()         Dotplots                                 dotchart()
xyplot()          Scatterplots                             plot()
stripplot()       Stripplots                               stripchart()
qq()              Quantile-quantile plots (data set vs     qqplot()
                  data set)
cloud()           3-D scatterplots                         none
wireframe()       3-D surfaces                             persp()
levelplot()       Level plots                              image()
contourplot()     Contour plots                            contour()
splom()           Scatterplot matrices                     pairs()
parallel()        Parallel coordinate plots                none

The lattice package supports trellis graphs, which show the relationships between
variables. The basic format of a lattice call is:
graph_type(formula, data=)
Here we show some applications of lattice functions with examples on the mtcars dataset:

# Lattice Examples
library(lattice)
attach(mtcars)

# create factors with value labels


cyl.factor <-factor(cyl,levels=c(4,6,8),
labels=c("cyl_4","cyl_6","cyl_8"))
gear.factor<-factor(gear,levels=c(3,4,5),
labels=c("gears_3","gears_4","gears_5"))

# kernel density plot


densityplot(~mpg,
main="Density_Plot",
xlab="Miles_per_Gallon")

[Figure: Density_Plot — kernel density of Miles_per_Gallon]

# kernel density plots by factor level


densityplot(~mpg|cyl.factor,
main="Density Plot by Number of Cylinders",
xlab="Miles_per_Gallon")
[Figure: Density Plot by Number of Cylinders — panels cyl_4, cyl_6, cyl_8]

# kernel density plots by factor level (alternate layout)


densityplot(~mpg|cyl.factor,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon",
layout=c(1,3))
[Figure: Density Plot by Number of Cylinders — panels stacked in a single column]

# boxplots for each combination of two factors


bwplot(cyl.factor~mpg|gear.factor,
ylab="Cylinders", xlab="Miles per Gallon",
main="Mileage by Cylinders and Gears",
layout=(c(1,3)))
[Figure: Mileage by Cylinders and Gears — boxplots of Miles per Gallon by Cylinders,
one panel per number of gears]

# scatterplots for each combination of two factors


xyplot(mpg~wt|cyl.factor*gear.factor,
main="Scatterplots by Cylinders and Gears",
ylab="Miles per Gallon", xlab="Car Weight")
[Figure: Scatterplots by Cylinders and Gears — Miles per Gallon vs Car Weight]

# 3d scatterplot by factor level


cloud(mpg~wt*qsec|cyl.factor,
main="3D Scatterplot by Cylinders")
[Figure: 3D Scatterplot by Cylinders — mpg vs wt and qsec]

# dotplot for each combination of two factors


dotplot(cyl.factor~mpg|gear.factor,
main="Dotplot Plot by Number of Gears and Cylinders",
xlab="Miles Per Gallon")
[Figure: Dotplot Plot by Number of Gears and Cylinders — Miles Per Gallon]

# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
main="Scatter Plot of mtcars Data")
[Figure: Scatter Plot of mtcars Data — scatterplot matrix of mpg, disp, hp, drat, wt]

2.1.2 Applications of ggplot2


Here we show some applications of the ggplot2 package.

Advantages of ggplot2
• Plot at high-level of abstraction.
• Very flexible
• Consistent grammar of graphics

Grammar of graphics
• data
• aesthetic mapping
• geometric objects
• scales

• coordinate system

• position adjustments

• statistical transformation

Basic format of ggplot2: ggplot(data = , aes(x =, y =, ...)) + geom_xxx()

• ggplot(): creates the plot object and specifies the data.

• geom_xxx(): There are many types of geometric objects some are as follow:

– geom_bar: bars with bases on the x-axis


– geom_boxplot: boxes-and-whiskers
– geom_histogram: histogram
– geom_smooth: smoothed conditional means (e.g. loess smooth)
– geom_line: lines
– geom_ribbon: bands spanning y-values across a range of x-values
– geom_point: points (scatterplot)
– geom_errorbar: T-shaped error bars
• aes(): specify the aesthetic elements.

Installing and attaching the package ggplot2

Like other packages, ggplot2 can be installed using the install.packages() function
and loaded using the library() function.
# installing the package
> install.packages("ggplot2")

# attaching the library ggplot2

library("ggplot2")

## Warning: package 'ggplot2' was built under R version 3.4.4


##
## Attaching package: 'ggplot2'
## The following object is masked from 'mtcars':
##
## mpg

• How to make scatter plot



ggplot(mtcars, aes(x=mpg,y=disp))+geom_point()

[Figure: scatter plot of disp vs mpg]

Distinguishing groups: To distinguish groups, first change the integer variables into
factors. We can then distinguish either by colour or by shape; here we use both in the
same graph with two different grouping variables, as sketched below:
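A minimal sketch consistent with the legend of the figure below (colour mapped to
factor(vs) and shape to factor(carb)); the exact call used in the source is not shown,
so treat the aesthetic choices as an assumption:

ggplot(mtcars, aes(x=mpg, y=disp,
                   colour=factor(vs),      # first group: distinguished by colour
                   shape=factor(carb))) +  # second group: distinguished by shape
  geom_point()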

[Figure: disp vs mpg, points coloured by factor(vs) and shaped by factor(carb)]

• How to make a histogram, density plot and boxplot: Suppose we want to plot the
distribution of mpg in mtcars and show trends by different groups. We can also show a
density graph by simply using the geom_density() function, as follows:

ggplot(mtcars, aes(x=mpg, fill=factor(vs)))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram of mpg, bars filled by factor(vs)]

ggplot(mtcars, aes(x=mpg, colour=factor(vs)))+geom_density()

[Figure: density of mpg, curves coloured by factor(vs)]

ggplot(mtcars, aes(x=factor(vs),y=disp))+geom_boxplot()+coord_flip()
[Figure: horizontal boxplots of disp by factor(vs)]

#coord_flip() function to flip the coordinates

• How to make a trend line: A trend line can aid the eye in seeing patterns in the
presence of overplotting. Use geom_smooth() to add a trend line to your graph:

ggplot(mtcars, aes(x=mpg,y=disp))+geom_point()+
geom_smooth(aes(colour=factor(vs)))#aes() will show the trend line

## `geom_smooth()` using method = 'loess'

[Figure: disp vs mpg with loess trend lines by factor(vs)]

#by groups trend line is not necessarily describing the regression


#results of your data. It may be very DIFFERENT from the regression
#line of your model.

• Faceting: Faceting generates small multiples, each showing a different subset of the
data.

– facet_null(): a single plot, the default.

– facet_wrap(): "wraps" a 1-dimensional ribbon of panels into 2 dimensions,
e.g. facet_wrap(~var)

– facet_grid(): produces a 2D grid of panels defined by the variables which form the
rows and columns, e.g. facet_grid(rows ~ cols)

Faceting is an alternative approach to using aesthetics (like colour, shape or size)
to differentiate groups. It is good when groups overlap a lot, but small differences
between panels become harder to observe.

library(dplyr)

##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union

mtcars_new<-select(mtcars,-carb)
ggplot(mtcars, aes(x=mpg,y=disp))+geom_point(data=mtcars_new,
colour="grey70")+
geom_point(aes(colour = carb))+facet_wrap(~carb)
[Figure: faceted scatterplots of disp vs mpg, one panel per carb value]

• Theme: The theme system in ggplot2 enables a user to control the non-data elements
of a ggplot object. It makes ggplot2 a flexible and powerful graphing tool for data
visualization.

ggplot(mtcars, aes(x=mpg,y=disp))+geom_point(data=mtcars_new,
colour="grey70")+
geom_point(aes(colour = carb))+facet_wrap(~carb)+theme_dark()

[Figure: the same faceted plot drawn with theme_dark()]

• Setting axis limits and labelling scales: We commonly need to adjust axes, so
ggplot2 provides several convenient functions to label and adjust axes and other
aesthetics:
– lims, xlim, ylim: set axis limits

– expand_limits: extend the limits of scales for various aesthetics

– xlab, ylab, ggtitle, labs: give labels (titles) to the x-axis, y-axis, or graph;
labs can set labels for all aesthetics and the title

ggplot(mtcars, aes(mpg, disp, colour=factor(vs))) + geom_point()+


labs(colour ="Cylinders")+labs(x = "Miles_per_gallon",
y="Displacement")+
ggtitle("New Scatter Plot")

[Figure: "New Scatter Plot" — Displacement vs Miles_per_gallon, coloured by group]

2.2 Graphics and Visualization using Python


Python has many visualization tools for building interactive visualizations, and it is
possible to make beautiful plots for display in Python. A number of visualization
tools are in wide use. A few of them are:

• matplotlib

• seaborn

• ggplot

• mayavi

• chaco

But here we will focus on two of the above: matplotlib and seaborn.

2.2.1 Applications of matplotlib

matplotlib is a plotting package designed for creating quality plots. matplotlib was
created by John Hunter in 2002 to enable MATLAB-like plotting in Python. matplotlib
has a number of add-on toolkits, such as mplot3d for 3D plots and basemap for mapping.
There are several ways to interact with matplotlib, and the most common is through the
pylab mode in IPython. matplotlib is probably the single most used Python package for
2D graphics. It provides both a very quick way to visualize data from Python and
publication-quality figures in many formats. pyplot provides a convenient interface
to the matplotlib object-oriented plotting library and is modeled closely after
MATLAB(TM). First we import (see https://matplotlib.org/) all the libraries needed
using the following commands:

In [1]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

• Simple plotting a line for two array using plot() function and subplot() function.

In [2]: # plot a line, implicitly creating a subplot(111)
        plt.plot([1,2,3],[6,7,8])
        # now create a subplot which represents the top plot of a grid
        # with 2 rows and 1 column. Since this subplot will overlap the
        # first, the plot (and its axes) previously created will be removed
        plt.subplot(211)
        plt.plot([1,5,9],[6,8,2])
        # creates 2nd subplot with yellow background
        plt.subplot(212, facecolor='y')
        plt.plot([1,5,9],[6,8,2])

Out[2]: [<matplotlib.lines.Line2D at 0x202a31f93c8>]



In [3]: data=pd.read_csv("E:/jimmy/J project/mtcars.csv",header=0)


print(data.head())

Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear \


0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3

carb
0 4
1 4
2 1
3 1
4 2

• Scatter plot, line plot, bar plot and histogram in matplotlib:

• scatter() function for plotting a scatter plot
• plot() function for plotting a line plot
• bar() function for plotting a bar plot
• hist() function for plotting a histogram
• plt.xlabel() is used to label the x-axis.
• plt.ylabel() is used to label the y-axis.
• plt.title() is used to give a title to the graph.

In [4]: fig=plt.figure()
        fig.add_subplot(2,2,1) # axes of first graph
        # scatter plot can be drawn using the scatter() function
        plt.scatter(data.mpg, data.disp)
        plt.ylabel('disp')
        plt.title('Scatter Plot')

        fig.add_subplot(2,2,2) # axes of second graph
        # line plot can be drawn using the plot() function
        plt.plot(data.mpg, data.disp)
        plt.ylabel('disp')
        plt.title('Line Plot')

        fig.add_subplot(2,2,3, facecolor='y') # axes of third graph
        # bar plot can be drawn using the bar() function
        plt.bar(data.mpg, data.disp)
        plt.xlabel('mpg')
        plt.ylabel('disp')
        plt.title('Bar Plot')

        fig.add_subplot(2,2,4, facecolor='gray') # axes of fourth graph
        # histogram can be drawn using the hist() function
        plt.hist(data.mpg)
        plt.xlabel('mpg')
        plt.ylabel('disp')
        plt.title('Histogram Plot')

Out[4]: Text(0.5,1,'Histogram Plot')

• Plotting a bar plot to distinguish displacement by different groups.

In [5]: plt.bar(data.gear, data.disp, label='gear')
        plt.bar(data.cyl, data.disp, label='cyl')
        plt.legend()

Out[5]: <matplotlib.legend.Legend at 0x202a38ccb00>



• To plot separate line plots for all the variables in a single command:

In [6]: data.plot(subplots=True, figsize=(8,12)); plt.legend(loc='best')

Out[6]: <matplotlib.legend.Legend at 0x202a3973278>



In [7]: data.plot(kind='bar', stacked=True, figsize=(8,8))

Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x202a4d61828>



• Auto-correlation plots are commonly used for checking randomness in a data set.
This randomness is ascertained by computing autocorrelations for data values at
varying time lags. If the data are random, such autocorrelations should be near zero
for any and all time-lag separations. If non-random, then one or more of the
autocorrelations will be significantly non-zero. They can be computed using the
following functions:

In [8]: from pandas.plotting import lag_plot, autocorrelation_plot


lag_plot(data.mpg)

Out[8]: <matplotlib.axes._subplots.AxesSubplot at 0x202a5314a58>



In [9]: autocorrelation_plot(data.mpg)

Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x202a4df8ef0>



• Pie chart: A pie chart can be easily drawn using the pie() function.

In [10]: labels = 'eng', 'mat', 'fre', 'Logs'
         fracs = [15,40,25,30]
         explode = (0, 0.05, 0, 0)

         plt.pie(fracs, labels=labels, autopct='%1.1f%%', shadow=True)

         # autopct : None (default), string, or function, optional
         # If not None, it is a string or function used to label the wedges with
         # their numeric value. The label will be placed inside the wedge. If it
         # is a format string, the label will be fmt%pct. If it is a function,
         # it will be called.
         plt.show()

Now we use the iris dataset to show some more attractive plots in matplotlib.

In [11]: iris=pd.read_csv("E:/jimmy/J project/Book1.csv",header=0)

In [12]: iris.head()

Out[12]: sepal_length sepal_width petal_length petal_width Name


0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

• Andrews plot: Andrews plot or Andrews curve is a way to visualize structure in


high-dimensional data.

In [13]: from pandas.plotting import andrews_curves

         plt.figure()
         andrews_curves(iris, 'Name')

Out[13]: <matplotlib.axes._subplots.AxesSubplot at 0x202a550b6d8>

• Parallel coordinate plot: A parallel coordinate plot maps each row in the data table
as a line, or profile. Each attribute of a row is represented by a point on the line.
This makes parallel coordinate plots similar in appearance to line charts, but the
way data is translated into a plot is substantially different.

In [14]: from pandas.plotting import parallel_coordinates

In [15]: plt.figure()
         parallel_coordinates(iris, 'Name')

Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x202a56ad518>



• RadViz is based on a simple spring tension minimization algorithm. Basically you


set up a bunch of points in a plane. In our case, they are equally spaced on a unit
circle. Each point represents a single attribute. You then pretend that each sample
in the data set is attached to each of these points by a spring, the stiffness of which
is proportional to the numerical value of that attribute (they are normalized to unit
interval). The point in the plane, where our sample settles to (where the forces acting
on our sample are at an equilibrium) is where a dot representing our sample will be
drawn. Depending on which class that sample belongs it will be colored differently.

In [16]: from pandas.plotting import radviz

         plt.figure()
         radviz(iris, 'Name')

Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x202a53c2080>



2.2.2 Applications of seaborn


Seaborn is a library for making attractive and informative statistical graphics in
Python. It is built on top of matplotlib and is integrated with the PyData stack (see
https://seaborn.pydata.org/). Some of the features that seaborn offers are listed
below (a short sketch follows the list):

• Several built-in themes for styling matplotlib graphics

• Tools for choosing color palettes to make beautiful plots that reveal patterns in your
data

• Functions for visualizing univariate and bivariate distributions and for comparing
them between subsets of data

• Tools that fit and visualize linear regression models for different kinds of indepen-
dent and dependent variables

• Functions that visualize matrices of data and use clustering algorithms to discover
structure in those matrices

• A function to plot statistical time series data with flexible estimation and represen-
tation of uncertainty around the estimate

• High-level abstractions for structuring grids of plots that let you easily build com-
plex visualizations
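As a brief illustrative sketch of a few of these features (the dataset loading and
styling calls are standard seaborn API, but the example itself is ours, not from the
source):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")  # example dataset bundled with seaborn

sns.set_style("darkgrid")        # one of the built-in themes
# compare the distribution of a variable between subsets of the data
sns.boxplot(x="species", y="petal_length", data=iris)
plt.show()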
Chapter 3

Data Cleaning

Most statistical theory focuses on data modelling, prediction and inference, and it
assumes that the data are accurate and ready for analysis. Inaccurate data is data
that is incorrect, incomplete, out-of-date, or wrongly formatted. In practice, it is
very rare that the raw data one works with is in the correct format and without
errors. Often our data can be quite messy, and if we analyse it directly it might
corrupt our analysis, so we need to process our data before doing any further
analysis. A data analyst spends most of their time on data cleaning, i.e. preparing
the data for statistical analysis. Data cleaning is the process of correcting errors
and transforming raw data into consistent data that can be analysed. R and Python
provide good environments for data cleaning.

3.1 Why is data cleaning important?

Data cleaning is the activity of transforming raw data into consistent data without
errors, duplicates and inconsistencies, i.e.:

• Cleaning and transforming data into high-quality data.

• Getting reliable and unbiased data.

• Getting valid, accurate and complete data.

No quality data, no quality decisions: quality decisions must be based on good-quality
data (data with errors may lead to misleading statistics). The ultimate goal of data
cleaning is to make inconsistent data ready for analysis.

3.2 Problems if we don't clean our data

• Inaccurate or biased conclusions: If we don't clean our data, then conclusions made
from that data may be inaccurate or biased.

• Violation of statistical assumptions: Statistical assumptions may be violated in raw
data, leading to unreliable conclusions.


3.3 Data cleaning process


3.3.1 ETL process
ETL stands for extract, transform and load:

• Extract: Extract data from original data source

• Transform: Manipulate the data into useful format and clean the data.

• Load: Load the data into data warehouse intended for analysis

3.3.2 Extraction of data using R


R offers a wide range of packages to import data in many formats, such as .txt, .csv
and .xls/.xlsx (Excel).

• Import data from Excel: We use the 'readxl' package to access Excel files. In the
Excel file, the first row should contain the variable/column names.

library(readxl)
dataset <- read_excel("C:/Users/example2.xls")

• Import data from CSV: In a CSV file, the first row should contain the variable/column
names, each element is comma separated, and header is TRUE. We use a command as
follows:

mydata <- read.table("c:/mydata.csv", header = TRUE, sep = ",", row.names = "id")

3.3.3 Extraction of data using Python

To import data in Python we use the pandas package. pandas is a powerful data analysis
package with several functions to read data (if you are using Anaconda, pandas comes
pre-installed). Data can be in any of the popular formats - CSV, TXT, XLS/XLSX
(Excel), sas7bdat (SAS), RData (R), etc.
Using pandas:
First you need to load pandas by running the following command:

import pandas as pd

• To import a CSV file:
mydata = pd.read_csv("C:/Users/Documents/file1.csv")

• To import an Excel file:
mydata = pd.read_excel("C:/User/file1.xls")

• To import a text file:
mydata = pd.read_table("C:/Users/example2.txt")
mydata = pd.read_csv("C:/example2.txt", sep="\t") # if tab separated file

3.3.4 Transformation

• Rebuild missing data: Recreate missing information as and when possible, such as
post codes, states, countries, phone area codes, gender, or web addresses from email
addresses.

• Standardize and normalize data: The entries in fields or the categories of the given
data set must be homogeneous, i.e. all entries must have the same format for name,
address, email, contact number, abbreviated/full names of provinces, titles, and so
on. This step ensures that similar values, e.g. sir, Mr., Mr, are all converted to Mr,
and road, st., strt. are all converted to St. Convert telephone numbers to their
standard format, or as required. (A small sketch of this step follows the list.)

• De-duplicate data: Identify potential duplicates. Seek high-accuracy matches with a
tolerance for misspellings, missing values or different address orders. For
mission-critical data, these results should be manually reviewed and the database then
updated accordingly.

• Verification to enrich data: Validate the data against internal and external data
sources to append value-adding information. E.g., business contacts can be validated
against the yellow pages to verify their current phone numbers and addresses.
Likewise, various other fields, including credit ratings, geo-coordinates, key
contacts, employee size, profit, revenue and time zones, can be fetched for each
company.
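As a minimal sketch of the standardization step in R (the example values and
replacement rules are hypothetical):

titles <- c("Mr.", "Mr", "mr", "sir")
# strip periods and lower-case, then map all variants to one canonical form
clean <- tolower(gsub("\\.", "", titles))
standardized <- ifelse(clean %in% c("mr", "sir"), "Mr", titles)
standardized

## [1] "Mr" "Mr" "Mr" "Mr"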

3.3.5 Transformation using R


• Merging Data: Merging two datasets requires at least one variable in common. In R
we use the merge() function to merge two datasets.

mydata3 <- merge(mydata1, mydata2)
mydata3

## id name salary age
## 1 1 Rick 623.30 32
## 2 2 Dan 515.20 31
## 3 3 Michelle 611.00 26
## 4 4 Ryan 729.00 29
## 5 5 Gary 843.25 36
## 6 6 Ryan 552.10 29

By default the data frames are merged on the columns with names they both have,
but separate specifications of the columns can be given by by.x and by.y. The rows
in the two data frames that match on the specified columns are extracted, and joined
together.

mydata4 <- merge(mydata1, mydata2) # merged on all common columns by default
mydata4

## id name salary age


## 1 1 Rick 623.30 32
## 2 2 Dan 515.20 31
## 3 3 Michelle 611.00 26
## 4 4 Ryan 729.00 29
## 5 5 Gary 843.25 36
## 6 6 Ryan 552.10 29

• Removing Duplicates: In R, removing duplicates can be done by using the unique()
function, but another related and interesting function to achieve the same end is
duplicated(). dplyr::distinct() keeps only unique elements and is more efficient than
unique(); distinct() is best suited for interactive use.

df <- data.frame(A=c("foo","foo","foo","foo","bar"),
                 B=c(0,1,1,1,1),
                 C=c("A","A","A","B","A"))
df

## A B C
## 1 foo 0 A
## 2 foo 1 A
## 3 foo 1 A
## 4 foo 1 B
## 5 bar 1 A

#The dplyr package can be loaded as follow:


# Load
library(dplyr)
#Remove duplicate rows based on all columns:
distinct(df)

## A B C
## 1 foo 0 A
## 2 foo 1 A
## 3 foo 1 B
## 4 bar 1 A

The function distinct() in the dplyr package can be used to keep only unique/distinct
rows from a data frame. If there are duplicate rows, only the first row is preserved.
It is an efficient version of the base R function unique().
• Missing Observations: In R, missing values are represented by the symbol NA (not
available). Impossible values (e.g., dividing by zero) are represented by the symbol
NaN (not a number).
– Detecting missing values: Missing values can be detected using the is.na()
function.

df <- data.frame(A=1:5,
                 B=c("a","c","e","f","h"),
                 C=c("one","two",NA,"three",NA))
df

## A B C
## 1 1 a one
## 2 2 c two
## 3 3 e <NA>
## 4 4 f three
## 5 5 h <NA>

is.na(df$C)

## [1] FALSE FALSE TRUE FALSE TRUE

– Ways to exclude missing values: Math functions generally have a way to exclude
missing values in their calculations. mean(), median(), colSums(), var(), sd(),
min() and max() all take the na.rm argument. When this is TRUE, missing values are
omitted. The default is FALSE, meaning that each of these functions returns NA if
any input number is NA.
More functions can be used to exclude missing values. If you have a large number of
observations in your dataset, then try deleting those observations (rows) that
contain missing values (or do not include missing values while model building, for
example by setting na.action=na.omit).
* na.omit: Drop out any rows with missing values anywhere in them and
forgets them forever.
* na.exclude: Drop out rows with missing values, but keeps track of where
they were (so that when you make predictions, for example, you end up
with a vector whose length is that of the original response.)
* na.pass: returns the object unchanged
* na.fail: returns the object only if it contains no missing values

na.omit(df)

## A B C
## 1 1 a one
## 2 2 c two
## 4 4 f three

A couple of other packages supply more efficient results:



– The Hmisc library can be used to replace missing values with the mean, the median
or a constant:

library(Hmisc)
df2 <- data.frame(A=c(1,2,3,4,5,NA),
                  B=c(0.5,0.8,1.2,NA,0.1,1.5),
                  C=c(15,26,NA,12,NA,NA))
df2

## A B C
## 1 1 0.5 15
## 2 2 0.8 26
## 3 3 1.2 NA
## 4 4 NA 12
## 5 5 0.1 NA
## 6 NA 1.5 NA

impute(df2$A, mean) # replace with the mean

## 1 2 3 4 5 6
## 1 2 3 4 5 3*

impute(df2$B, median) # replace with the median

## 1 2 3 4 5 6
## 0.5 0.8 1.2 0.8* 0.1 1.5

impute(df2$C, 0) # replace with a constant

## 1 2 3 4 5 6
## 15 26 0* 12 0* 0*

3.3.6 Transformation using Python


• Merging data: A merge or join operation combines data sets by linking rows using
one or more keys. A few merge function arguments:
– left: DataFrame to be merged on the left side.
– right: DataFrame to be merged on the right side.
– on: Column names to join on. Must be found in both DataFrame objects.
– left_on: Columns in the left DataFrame to use as join keys.
– right_on: Columns in the right DataFrame to use as join keys.
In [10]: from pandas import Series, DataFrame
         import pandas as pd

In [11]: df_1=DataFrame({'key': ['b','b','a','c','a','a','b'],
                         'd': ['0','1','2','3','4','5','6']})
         print(df_1)

d key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b

In [12]: df_2=DataFrame({'key': ['a','b','d'],
                         'd2': ['0','1','2']})
         print(df_2)

d2 key
0 0 a
1 1 b
2 2 d

In [13]: pd.merge(df_1,df_2)

Out[13]: d key d2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0

In [14]: df_3=DataFrame({'lkey': ['b','b','a','c','a','a','b'],
                         'data1': range(7)})
         print(df_3)

data1 lkey
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b

In [15]: df_4=DataFrame({'rkey': ['a','b','d'],
                         'data2': range(3)})
         print(df_4)

data2 rkey
0 0 a
1 1 b
2 2 d

In [16]: pd.merge(df_3,df_4,left_on='lkey',right_on='rkey',how='outer')

Out[16]: data1 lkey data2 rkey


0 0.0 b 1.0 b
1 1.0 b 1.0 b
2 6.0 b 1.0 b
3 2.0 a 0.0 a
4 4.0 a 0.0 a
5 5.0 a 0.0 a
6 3.0 c NaN NaN
7 NaN NaN 2.0 d

• Removing Duplicates: Duplicate rows may be found in a DataFrame for any number of
reasons. In Python, removing duplicates can be done using drop_duplicates().
DataFrame.drop_duplicates() returns a DataFrame with duplicate rows removed,
optionally considering only certain columns.

In [8]: import pandas as pd


df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"],
"B":[0,1,1,1], "C":["A","A","B","A"]})
print(df)

A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A

In [9]: df.drop_duplicates(subset=['A','C'], keep=False)

Out[9]: A B C
2 foo 1 B
3 bar 1 A

where keep : {‘first’, ‘last’, False}, default ‘first’

– first : Drop duplicates except for the first occurrence.


– last : Drop duplicates except for the last occurrence.
– False : Drop all duplicates.

• Missing Observations: By "missing" we simply mean NA ("not available") or "not
present for whatever reason". Many data sets simply arrive with missing data, either
because it exists and was not collected or because it never existed. In pandas, one
of the most common ways that missing data is introduced into a data set is by
re-indexing.

In [28]: import pandas as pd
         import numpy as np
         df = pd.DataFrame(np.random.randn(5, 3),
                           index=['a','c','e','f','h'],
                           columns=['one','two','three'])
         df['four'] = 'bar'
         df['five'] = df['one'] > 0
         print(df)

one two three four five


a -0.336057 0.512864 -0.854062 bar False
c -0.424267 -0.101321 0.948349 bar False
e 0.957720 -0.602851 0.859344 bar True
f 0.734610 -0.769397 0.355850 bar True
h -1.733099 -0.451442 -0.785071 bar False

In [29]: df2 = df.reindex(['a','b','c','d','e','f','g','h'])
         print(df2)

one two three four five


a -0.336057 0.512864 -0.854062 bar False
b NaN NaN NaN NaN NaN
c -0.424267 -0.101321 0.948349 bar False
d NaN NaN NaN NaN NaN
e 0.957720 -0.602851 0.859344 bar True
f 0.734610 -0.769397 0.355850 bar True
g NaN NaN NaN NaN NaN
h -1.733099 -0.451442 -0.785071 bar False

pandas objects are equipped with various data manipulation methods for dealing
with missing data.

– Filling missing values: fillna


* The fillna function can “fill in” NA values with non-NA data in a couple of
ways, which we illustrate:
* Replace NA with a scalar value
* Fill gaps forward or backward
In [30]: df2.fillna(0)

Out[30]: one two three four five


a -0.336057 0.512864 -0.854062 bar False
b 0.000000 0.000000 0.000000 0 0
c -0.424267 -0.101321 0.948349 bar False
d 0.000000 0.000000 0.000000 0 0
e 0.957720 -0.602851 0.859344 bar True
f 0.734610 -0.769397 0.355850 bar True
g 0.000000 0.000000 0.000000 0 0
h -1.733099 -0.451442 -0.785071 bar False

In [31]: df2.fillna(method='pad')

Out[31]: one two three four five

a -0.336057 0.512864 -0.854062 bar False
b -0.336057 0.512864 -0.854062 bar False
c -0.424267 -0.101321 0.948349 bar False
d -0.424267 -0.101321 0.948349 bar False
e 0.957720 -0.602851 0.859344 bar True
f 0.734610 -0.769397 0.355850 bar True
g 0.734610 -0.769397 0.355850 bar True
h -1.733099 -0.451442 -0.785071 bar False

To remind you, these are the available filling methods:

– pad / ffill: fill values forward.

– bfill / backfill: fill values backward.

You may wish to simply exclude labels from a data set which refer to missing data.
To do this, use the dropna method:

In [36]: df2.dropna(axis=0)

Out[36]: one two three four five


a -0.336057 0.512864 -0.854062 bar False
c -0.424267 -0.101321 0.948349 bar False
e 0.957720 -0.602851 0.859344 bar True
f 0.734610 -0.769397 0.355850 bar True
h -1.733099 -0.451442 -0.785071 bar False

You can also fillna using a dict or Series that is align-able. The labels of the dict or
index of the Series must match the columns of the frame you wish to fill. The use
case of this is to fill a DataFrame with the mean of that column.

In [38]: df2.fillna(df2.mean())

Out[38]: one two three four five


a -0.336057 0.512864 -0.854062 bar False
b -0.160219 -0.282429 0.104882 NaN 0.4
c -0.424267 -0.101321 0.948349 bar False
d -0.160219 -0.282429 0.104882 NaN 0.4
e 0.957720 -0.602851 0.859344 bar True
f 0.734610 -0.769397 0.355850 bar True
g -0.160219 -0.282429 0.104882 NaN 0.4
h -1.733099 -0.451442 -0.785071 bar False
Chapter 4

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis and a critical step in
analyzing data. It is where the experimenter takes a bird's-eye view of the data and
tries to make some sense of it. Exploratory Data Analysis takes place after gathering
and cleaning data, and is often implemented before any formal statistical technique is
applied. Among the main purposes of this type of analysis are getting to know our
data, its tendencies and its quality, and also checking or even starting to formulate
our hypotheses.
Here are some reasons why we use EDA:
• Detection of mistakes.

• Gain maximum insight into the dataset and its underlying structure.

• Determining relationships among the explanatory variables.

• Check assumptions associated with any model fitting or hypothesis test.

• Detection of outliers
Most EDA techniques are graphical in nature, with a few quantitative techniques. The
reason for the heavy reliance on graphics is that by its nature the main role of EDA
is to explore open-mindedly. The particular graphical techniques employed in EDA are
often quite simple, consisting of various techniques of:
• Plotting the raw data (such as data traces, histograms, bihistograms, probability
plots, lag plots, block plots, and Youden plots).

• Plotting simple statistics, such as mean plots, standard deviation plots, box plots,
and main effects plots of the raw data.
Types of Exploratory Data Analysis:
EDA falls into two main areas:
• Univariate EDA - looking at one variable of interest, like age, height, income level
etc.
• Multivariate EDA - analysis of multiple variables at the same time.


4.1 Univariate Exploratory Data Analysis

In univariate EDA our interest is in analyzing each variable on its own, like age,
gender, income etc. The usual goal of univariate EDA is to better appreciate the
"sample distribution". Outlier detection is also a part of this analysis. Below are
some techniques used in univariate EDA (a short R sketch follows the list):

• Summary statistics: Summary statistics summarize and provide information about your
sample data. They include where the average lies and whether your data is skewed.
Summary statistics fall into three main categories:

– Measures of location and central tendency(e.g. mean, median, mode etc.).


– Measure of dispersion(e.g. Standard deviation)
– Measures of shape(e.g. skewness and kurtosis)

A common collection of statistics used as summary statistics is the five-number
summary, i.e. the minimum, 25th percentile, median, 75th percentile and maximum of
the data.

• Histogram: The purpose of a histogram is to graphically summarize the distribution


of a univariate data set. The histogram graphically shows the following:

– Center (i.e., the location) of the data.


– Spread (i.e., the scale) of the data.
– Skewness of the data.
– Presence of outliers.
– Presence of multiple modes in the data.

• Stem and leaf plots: A simple substitute for the histogram is the stem and leaf
plot. Nevertheless, a histogram is generally considered better for estimating the
shape of a sample distribution than the stem and leaf plot.

• Boxplots: A boxplot is a visualization of the five-number summary with more
information. A boxplot graphically shows the following:

– Displays variable’s location and spread.


– Provide indication of data symmetry and skewness.
– Shows outliers

• Density plot: A Density Plot visualizes the distribution of data over a continuous
interval or time period.
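A minimal sketch of these univariate techniques in base R, using the built-in cars
dataset (the dataset choice is ours, for illustration only):

summary(cars$speed)        # five-number summary plus the mean
hist(cars$speed)           # histogram: centre, spread, skewness, outliers
stem(cars$speed)           # stem and leaf display
boxplot(cars$speed)        # boxplot: location, spread, outliers
plot(density(cars$speed))  # density plot over a continuous interval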

4.2 Multivariate Exploratory Data Analysis


Multivariate EDA techniques generally show the relationship between two or more vari-
ables in the form of either cross-tabulation or statistics. Below are some techniques used
in multivariate EDA:

• Correlation matrix: A correlation matrix measures the degree of relationship between
the variables under consideration. The degree of relationship is expressed by a
correlation coefficient r ranging from -1 to +1 (-1 ≤ r ≤ +1). It deals with the
association between two or more variables.

• Scatter plot: A scatter plot is a diagrammatic representation of bivariate data. It is
used to plot data points on a horizontal and a vertical axis in an attempt to show
how much one variable is affected by another. Scatter plots are important in statistics
because they can show the extent of correlation; besides the extent, a scatter plot
also shows the direction of the correlation:

– If the vertical (or y-axis) variable increases as the horizontal (or x-axis) variable
increases, the correlation is positive.
– If the y-axis variable decreases as the x-axis variable increases or vice-versa, the
correlation is negative.
– If it is impossible to establish either of the above criteria, then the correlation is
zero.

• Multiple boxplot: Unlike a regular box plot, in which the range of values of one
variable is represented, a multiple box plot represents the ranges of values of several
variables. It can be used to visualize several variables together and to compare two
or more variables.

• Multiple histogram: A panel of histograms enables you to compare the data dis-
tributions of different groups. You can create the histograms in a column (stacked
vertically) or in a row.

4.3 Exploratory Data Analysis using R


We will use cars, a popular dataset from the R package datasets.
Description: The data give the speed of cars and the distances taken to stop. Note that
the data were recorded in the 1920s.
The data are imported using the read.csv() function.

data <- read.csv("E:/jimmy/J project/cars.csv", header = TRUE, sep = ",")

head(data)

## speed dist
## 1 4 2

## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10

We can see some basic characteristics of the dataset using dim(), str(), names(), head(),
tail(), summary() functions.

dim(data)
## [1] 50 2
str(data)
## 'data.frame': 50 obs. of 2 variables:

## $ speed: int 4 4 7 7 8 9 10 10 10 11 ...


## $ dist : int 2 10 4 22 16 10 18 26 34 17 ...
names(data)
## [1] "speed" "dist"
head(data)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
tail(data)
## speed dist
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
summary(data)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00

In R we use the dplyr package for doing EDA.

Some of the key "verbs" provided by the dplyr package are:
• select: return a subset of the columns of a data frame, using a flexible notation
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder the rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarise / summarize: generate summary statistics of different variables in the
data frame, possibly within strata
Installing the dplyr package
The dplyr package can be installed from CRAN. To install from CRAN, just run
> install.packages("dplyr")

#After installing the package it is important to load it into your R session


library(dplyr)

The arrange() function is used to reorder rows of a data frame according to one of the
variables/columns.
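
For illustration, the output below corresponds to sorting the data by speed and showing
the first rows (a minimal sketch; the original call was not shown):

head(arrange(data, speed))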

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10

The select() function can be used to select columns of a data frame that you want to
focus on. Often you’ll have a large data frame containing "all" of the data, but any given
analysis might only use a subset of variables or observations.
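
For illustration, the two outputs below correspond to selecting the speed column and
viewing its first and last rows (a minimal sketch; the original calls were not shown):

head(select(data, speed))
tail(select(data, speed))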

## speed
## 1 4
## 2 4
## 3 7
## 4 7
## 5 8
## 6 9

## speed
## 45 23
## 46 24
## 47 24
## 48 24
## 49 24
## 50 25

Renaming a variable in a data frame in R is surprisingly hard to do! The rename()
function is designed to make this process easier.
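
For illustration, the output below corresponds to renaming speed to velocity (a minimal
sketch; dplyr uses rename(data, new_name = old_name)):

head(rename(data, velocity = speed))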

## velocity dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10

Univariate Exploratory Data Analysis

• Five-number summary: the five-number summary can be computed using the
fivenum() function. It is often more convenient to use the summary() function.
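
For illustration, both can be applied to the speed variable; the summary() call produces
the output shown below:

fivenum(data$speed)
summary(data$speed)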

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 4.0 12.0 15.0 15.4 19.0 25.0

• Histogram: Histogram can be drawn using hist() function. We can get a little more
detail by using the rug() function to show us the actual data points.

hist(data$speed, main="Histogram of Speed")


rug(data$speed)
[Figure: Histogram of Speed; x-axis: data$speed, y-axis: Frequency]

• Stem-and-leaf plot: a stem-and-leaf plot can be drawn using the stem() function.
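
The output below is produced by the following call (a minimal sketch):

stem(data$speed)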

## 4 | 00
## 6 | 00
## 8 | 00
## 10 | 00000
## 12 | 00000000
## 14 | 0000000
## 16 | 00000
## 18 | 0000000
## 20 | 00000
## 22 | 00
## 24 | 00000

• Box plot: Box-plot can be drawn using boxplot() function.

boxplot(data$speed, main="Boxplot of Speed")



[Figure: Boxplot of Speed]

• Density plot: a density plot can be drawn using plot(density()), where the density()
function returns the density estimate and the plot() function draws it.

plot(density(data$speed), main="Density plot of Speed")


[Figure: Density plot of Speed; N = 50, bandwidth = 2.15]

Multivariate Exploratory Data Analysis

• Correlation matrix: The function cor() can be used to compute a correlation matrix.
The function rcorr() in the Hmisc package can be used to compute the significance
levels for Pearson and Spearman correlations.
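
A minimal sketch on the cars data (the rcorr() call assumes the Hmisc package is
installed):

cor(data) # correlation matrix of speed and dist
library(Hmisc)
rcorr(as.matrix(data)) # correlations with significance levels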

• Scatter plot: The function splom() in the lattice package can be used to display a
scatter plot matrix. The function chart.Correlation() in the PerformanceAnalytics
package can be used to display a chart of scatter plots and correlations between
variables.

library(lattice)
splom(data)
[Figure: Scatter plot matrix of dist and speed]

• Multiple boxplot: a multiple boxplot can be drawn simply by passing additional
variables to the boxplot() function.

boxplot(data$speed,data$dist, main="Boxplot of Speed and distance")


[Figure: Boxplot of Speed and distance]

• Multiple histogram: multiple histograms can be drawn simply by calling the hist()
function for each variable of interest after the par() function, which divides the
graphics window into rows and columns.

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))


hist(data$speed, col = "gray")
hist(data$dist, col = "gray")
[Figure: Histograms of data$speed and data$dist, stacked vertically]

4.4 Exploratory Data Analysis using Python


We will use the same cars dataset described in the previous section.
Description: The data give the speed of cars and the distances taken to stop. Note that
the data were recorded in the 1920s.
The data are imported using the read_csv() function of the pandas module.

In [1]: import pandas as pd


data=pd.read_csv("E:/jimmy/J project/cars.csv")
data

Out[1]: speed dist


1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
11 11 28

12 12 14
13 12 20
14 12 24
15 12 28
16 13 26
17 13 34
18 13 34
19 13 46
20 14 26
21 14 36
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
29 17 32
30 17 40
31 17 50
32 18 42
33 18 56
34 18 76
35 18 84
36 19 36
37 19 46
38 19 68
39 20 32
40 20 48
41 20 52
42 20 56
43 20 64
44 22 66
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85

• We can see some basic characteristics of the dataset using DataFrame.info(),


DataFrame.tail(), DataFrame.head(), DataFrame.loc[], DataFrame.describe() func-
tions.

In [2]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 1 to 50
Data columns (total 2 columns):
speed 50 non-null int64
dist 50 non-null int64
dtypes: int64(2)
memory usage: 1.2 KB

In [3]: data.tail()

Out[3]: speed dist


46 24 70
47 24 92
48 24 93
49 24 120
50 25 85

In [4]: data.head()

Out[4]: speed dist


1 4 2
2 4 10
3 7 4
4 7 22
5 8 16

In [5]: data.loc[3:6]

Out[5]: speed dist


3 7 4
4 7 22
5 8 16
6 9 10

In [6]: data.describe()

Out[6]: speed dist


count 50.000000 50.000000
mean 15.400000 42.980000
std 5.287644 25.769377
min 4.000000 2.000000
25% 12.000000 26.000000
50% 15.000000 36.000000
75% 19.000000 56.000000
max 25.000000 120.000000

Univariate Exploratory Data Analysis

• Five-number summary: the five-number summary can be computed using the
describe() method of a DataFrame column.

In [7]: data.speed.describe()

Out[7]: count 50.000000


mean 15.400000
std 5.287644
min 4.000000
25% 12.000000
50% 15.000000
75% 19.000000
max 25.000000
Name: speed, dtype: float64

• Histogram: a histogram can be drawn using the hist() method, which plots via matplotlib.

In [8]: import matplotlib.pyplot as plt


%matplotlib inline
data.speed.hist()

Out[8]: <matplotlib.axes._subplots.AxesSubplot at 0x27dd9d87cf8>



• Box plot: a box plot can be drawn using the DataFrame.plot.box() method, which
also plots via matplotlib.

In [9]: data['speed'].plot.box()

Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x27dd9dc2710>

• Density plot: a density plot can be drawn using the DataFrame.plot.kde() method.

In [10]: data['speed'].plot.kde()

Out[10]: <matplotlib.axes._subplots.AxesSubplot at 0x27dd94a0160>



Note: the seaborn module also provides tools for examining univariate and bivariate
distributions. They can be used as follows:

import seaborn as sns # for calling the seaborn module
sns.distplot(x) # histogram with a density estimate
sns.distplot(x, kde=False, rug=True) # histogram only, with a rug plot
sns.distplot(x, hist=False, rug=True) # density plot only, with a rug plot

Multivariate Exploratory Data Analysis


• Correlation matrix: DataFrame.corr(method='pearson') computes the pairwise correlation
of columns, where the method argument is one of {'pearson', 'kendall', 'spearman'}.

In [11]: data.corr(method='pearson')

Out[11]: speed dist


speed 1.000000 0.806895
dist 0.806895 1.000000

• Scatter plot: the scatter_matrix() function in the pandas.plotting module can be used
to plot a scatter plot matrix.

In [12]: from pandas.plotting import scatter_matrix as scattermatrix

In [13]: scattermatrix(data, diagonal='kde')

The diagonal='kde' argument shows density plots on the diagonal; by default, histograms
are shown.


Out[13]: array([[<matplotlib.axes._subplots.AxesSubplot object at


0x0000027DDBD52438>,
<matplotlib.axes._subplots.AxesSubplot object at
0x0000027DDBDB0048>],
[<matplotlib.axes._subplots.AxesSubplot object at
0x0000027DDBDE7588>,
<matplotlib.axes._subplots.AxesSubplot object at
0x0000027DDBE20550>]], dtype=object)

• Multiple box plot: multiple box plots can be drawn using
DataFrame.boxplot(column=[...]) as follows:

In [14]: plt.figure();
         BP = data.boxplot(column=['speed', 'dist'])

• Multiple histogram: multiple histograms can be drawn similarly to multiple box
plots, using DataFrame.hist(column=[...]) as follows:

In [15]: data.hist(column=['speed', 'dist'])

Out[15]: array([[<matplotlib.axes._subplots.AxesSubplot object at


0x0000027DDBECA710>,
<matplotlib.axes._subplots.AxesSubplot object at
0x0000027DDBF66908>]], dtype=object)

Note: for more visualization tools in pandas, refer to
https://pandas.pydata.org/pandas-docs/stable/visualization.html
Chapter 5

Regression Analysis

Regression analysis is used to study the nature of the relationship between two or more
variables, i.e. the probable form of the mathematical relation between X and Y (where X
represents the explanatory variables and Y represents the response variable). Regression
is also used to predict or estimate the value of one variable (the response or dependent
variable) corresponding to a given value of another variable (an explanatory or
independent variable).
Linear model: a model is said to be linear when it is linear in its parameters.
Non-linear model: a model is said to be non-linear when it is non-linear in its parameters.

5.1 Linear Regression Analysis


In a scatter diagram it is quite often seen that the points of the two variables tend to
cluster around some curve, called the curve of regression. If the curve is a straight line,
it is called the line of regression and there is linear regression between the variables; if
the curve is not a straight line, the regression between the variables is non-linear.
The Linear regression model is given by:

Y = Xβ + ε (5.1)
where:

• Y denotes the dependent(or response) variable.

• X denotes the k independent(or explanatory) variable x1, x2, ..., xk.

• β denotes the regression coefficient associated with x1, x2, ..., xk variables.

We can write equation (5.1) as y = x1β1 + x2β2 + ... + xkβk + ε for k explanatory variables.
This is called the multiple linear regression model.

Example: Income and education of a person are related; it is expected that, on average, a
higher level of education provides a higher income. So the simple linear regression model can be

expressed as:

Income = β0 + β1education + εi (5.2)


β0 reflects the income when education is zero, as it is expected that even an illiterate
person can have some income, and β1 reflects the average change in income per unit
change in education. Further, this model neglects the fact that most people have a higher
income when they are older than when they are younger.
So a better model is the multiple linear regression model, which can be expressed as:

Income = β0 + β1education + β2age + εi (5.3)

5.1.1 Assumptions of the linear regression model

The linear regression model has five key assumptions:

• There should be a linear relationship between the dependent and independent
variables.

• The error term should be normally distributed.

• The error term must have constant variance. Constant variance of the error term is
known as homoskedasticity; non-constant variance is known as heteroskedasticity.

• The independent variables should not be correlated with one another. The presence
of correlation among the independent variables is known as multicollinearity.

• There should be no correlation between the error terms of different observations.
The presence of such correlation is known as autocorrelation.

Note: a normal probability plot and a plot of residuals versus the corresponding fitted
values are helpful in detecting several common types of violation of the model assumptions.

5.2 Logistic Regression Analysis


When the dependent variable is binary or categorical we use logistic regression. The
logistic model is used to predict the probability of occurrence of an event by fitting data
to a logistic curve. In this regression, the response variable has only two possible
outcomes, coded as 0 or 1. It makes use of several predictor variables that may be either
categorical or numerical.

Example: The probability that a person has a heart attack within a specified time period
can be predicted from knowledge of the person's age, sex, cholesterol level, weight, etc.

The logistic model belongs to a class of models known as Generalized Linear Models
(GLM). The logistic regression model uses the odds of the event of interest, given by:

Odds = p / (1 − p) (5.4)

where p is the probability of the event of interest. Logistic regression is based on the log
odds, ln(Odds). The equation below defines the logistic regression model for k
independent variables:

ln(Odds) = β0 + x1β1 + x2β2 + ... + xkβk + εi (5.5)

where:
• k = number of independent variables in the model

• εi = random error in observation i
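
For example, if the probability of the event of interest is p = 0.8, then the odds are
0.8/(1 − 0.8) = 4 and the log odds are ln(4) ≈ 1.386.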

5.3 Regression Analysis using R


Description of dataset: for the analysis we use a dataset of record times recorded in 1984
for 35 Scottish hill races, with the following component variables:
• dist: distance in miles(on the map).

• climb: total height gained during the route, in feet.

• time: record time in minutes.

5.3.1 Multiple Linear Regression


Loading dataset in R
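
The code that loads the data is not shown in the text; a minimal sketch, assuming the
same CSV file used in the Python section later in this chapter:

Hills_data <- read.csv("E:/jimmy/j2/Hills.csv", header = TRUE)
head(Hills_data)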

## X dist climb time


## 1 Greenmantle 2.5 650 16.083
## 2 Carnethy 6.0 2500 48.350
## 3 Craig Dunain 6.0 900 33.650
## 4 Ben Rha 7.5 800 45.600
## 5 Ben Lomond 8.0 3070 62.267
## 6 Goatfell 8.0 2866 73.217

Fitting Multiple Linear Regression:

# fit the multiple linear regression model
model1 <- lm(time ~ dist + climb, data = Hills_data)
summary(model1)

##
## Call:
## lm(formula = time ~ dist + climb, data = Hills_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.215 -7.129 -1.186 2.371 65.121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.992039 4.302734 -2.090 0.0447 *
## dist 6.217956 0.601148 10.343 9.86e-12 ***
## climb 0.011048 0.002051 5.387 6.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.68 on 32 degrees of freedom
## Multiple R-squared: 0.9191,Adjusted R-squared: 0.914
## F-statistic: 181.7 on 2 and 32 DF, p-value: < 2.2e-16

To check the assumptions of the model we can use the following steps.

For a quick check of the model assumptions we can use the plot() function, which gives
a 2×2 panel containing the following:

• Residual versus fitted values.

• Normal quantile-quantile plot.

• Standardized residual versus Fitted values.

• Residual versus Leverage.

par(mfrow = c(2, 2)) # partition the graphics window into 2 rows and 2 columns


plot(model1)
[Figure: 2×2 diagnostic plots for model1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

• A component+residual plot can be drawn using the crPlots() function in the car package.

library(car)

## Warning: package 'car' was built under R version 3.4.4


##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode

crPlots(model1)
[Figure: Component + Residual plots for dist and climb]

5.3.2 Logistic Regression


Description of data: for the analysis we use the contraceptive use data, showing the
distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility
Survey, according to age, education, desire for more children and current use of
contraception.
Loading the dataset in R:
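
The loading code is not shown in the output below; a minimal sketch, assuming the
grouped data are read from Germán Rodríguez's GLM course site (this URL is an
assumption, not part of the original):

# hypothetical source for the cuse table; adjust the path to your own copy
cuse <- read.table("https://data.princeton.edu/wws509/datasets/cuse.dat", header = TRUE)
cuse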

## age education wantsMore notUsing using
## 1 <25 low yes 53 6
## 2 <25 low no 10 4
## 3 <25 high yes 212 52
## 4 <25 high no 50 10
## 5 25-29 low yes 60 14
## 6 25-29 low no 19 10
## 7 25-29 high yes 155 54
## 8 25-29 high no 65 27
## 9 30-39 low yes 112 33
## 10 30-39 low no 77 80
## 11 30-39 high yes 118 46
## 12 30-39 high no 68 78
## 13 40-49 low yes 35 6
## 14 40-49 low no 46 48
## 15 40-49 high yes 8 8
## 16 40-49 high no 12 31

• The contrasts() function shows how a factor variable is dummy-coded by R, and the
str() function shows the structure of the data.
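
A minimal sketch of both calls (assuming the data frame is named cuse as in the loading
sketch above; the str() output follows below):

contrasts(cuse$education) # shows the 0/1 treatment coding of the factor levels
str(cuse)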

## 'data.frame': 16 obs. of 5 variables:

## $ age : Factor w/ 4 levels "<25","25-29",..: 1 1 1 1 2 2 2 2 3 3 ...


## $ education: Factor w/ 2 levels "high","low": 2 2 1 1 2 2 1 1 2 2 ...
## $ wantsMore: Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 2 1 2 1 ...
## $ notUsing : int 53 10 212 50 60 19 155 65 112 77 ...
## $ using : int 6 4 52 10 14 10 54 27 33 80 ...
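
The fitting code for the diagnostic plots shown below is not included in the text; a
minimal sketch of a grouped-data logistic fit and the standard diagnostic plots (the
model and data frame names are assumptions):

# fit a binomial GLM on grouped successes/failures
model2 <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
              family = binomial, data = cuse)
par(mfrow = c(2, 2)) # 2 x 2 panel of diagnostic plots
plot(model2)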
[Figure: 2×2 diagnostic plots for the logistic model: Residuals vs Predicted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

5.4 Regression Analysis using Python


Description of dataset: for the analysis we use the same dataset of record times recorded
in 1984 for 35 Scottish hill races, with the following component variables:

• dist: distance in miles(on the map).

• climb: total height gained during the route, in feet.

• time: record time in minutes.



Statsmodels is a Python module that provides functions for the estimation of many
different statistical models, as well as for conducting statistical tests and statistical data
exploration. We use statsmodels for conducting the multiple linear regression and for
checking the model assumptions by analyzing the residuals (see
http://www.statsmodels.org/dev/regression.html).
Loading all the important modules used in the regression analysis:

In [1]: import pandas as pd


import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
%matplotlib inline

Loading Dataset for Multiple regression analysis:

In [2]: Hills_data=pd.read_csv("E:/jimmy/j2/Hills.csv",header=0)
print(Hills_data)

Unnamed: 0 dist climb time


0 Greenmantle 2.5 650 16.083
1 Carnethy 6.0 2500 48.350
2 Craig Dunain 6.0 900 33.650
3 Ben Rha 7.5 800 45.600
4 Ben Lomond 8.0 3070 62.267
5 Goatfell 8.0 2866 73.217
6 Bens of Jura 16.0 7500 204.617
7 Cairnpapple 6.0 800 36.367
8 Scolty 5.0 800 29.750
9 Traprain 6.0 650 39.750
10 Lairig Ghru 28.0 2100 192.667
11 Dollar 5.0 2000 43.050
12 Lomonds 9.5 2200 65.000
13 Cairn Table 6.0 500 44.133
14 Eildon Two 4.5 1500 26.933
15 Cairngorm 10.0 3000 72.250
16 Seven Hills 14.0 2200 98.417
17 Knock Hill 3.0 350 78.650
18 Black Hill 4.5 1000 17.417
19 Creag Beag 5.5 600 32.567
20 Kildcon Hill 3.0 300 15.950
21 Meall Ant-Suidhe 3.5 1500 27.900
22 Half Ben Nevis 6.0 2200 47.633
23 Cow Hill 2.0 900 17.933
24 N Berwick Law 3.0 600 18.683
25 Creag Dubh 4.0 2000 26.217
26 Burnswark 6.0 800 34.433

27 Largo Law 5.0 950 28.567


28 Criffel 6.5 1750 50.500
29 Acmony 5.0 500 20.950
30 Ben Nevis 10.0 4400 85.583
31 Knockfarrel 6.0 600 32.383
32 Two Breweries 18.0 5200 170.250
33 Cockleroi 4.5 850 28.100
34 Moffat Chase 20.0 5000 159.833

In [3]: print(Hills_data.shape, Hills_data.dtypes)

(35, 4) Unnamed: 0 object


dist float64
climb int64
time float64
dtype: object

Fitting a Multiple regression model

In [4]: model=sm.ols(formula='time~dist+climb', data=Hills_data).fit()

In [5]: print(dir(model)) # lists the attributes and methods of the fitted model

['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__',
 '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__',
 '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__',
 '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__',
 '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
 '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cache',
 '_data_attr', '_get_robustcov_results', '_is_nested',
 '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss',
 'compare_f_test', 'compare_lm_test', 'compare_lr_test',
 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1',
 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model',
 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test',
 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction',
 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load',
 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs',
 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues',
 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj',
 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 'tvalues',
 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']

The summary() function displays the details of the result.

In [6]: print(model.summary())

OLS Regression Results


============================================================================
Dep. Variable: time R-squared: 0.919
Model: OLS Adj. R-squared: 0.914
Method: Least Squares F-statistic: 181.7
Date: Sun, 25 Mar 2018 Prob (F-statistic): 3.40e-18
Time: 17:04:52 Log-Likelihood: -142.11
No. Observations: 35 AIC: 290.2
Df Residuals: 32 BIC: 294.9
Df Model: 2
Covariance Type: nonrobust
============================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------
Intercept -8.9920 4.303 -2.090 0.045 -17.756 -0.228
dist 6.2180 0.601 10.343 0.000 4.993 7.442
climb 0.0110 0.002 5.387 0.000 0.007 0.015
============================================================================
Omnibus: 47.910 Durbin-Watson: 2.249
Prob(Omnibus): 0.000 Jarque-Bera (JB): 233.976
Skew: 3.026 Prob(JB): 1.56e-51
Kurtosis: 14.127 Cond. No. 4.20e+03
============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 4.2e+03. This might indicate that there
are strong multicollinearity or other numerical problems.

The coefficient of determination is equal to the R-squared value, i.e. 0.919. The warning
message indicates that strong multicollinearity might be present.
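
Individual results from the dir() listing above can also be accessed directly; a minimal
sketch:

print(model.params) # estimated regression coefficients
print(model.rsquared) # coefficient of determination, 0.919 here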
To check the assumptions of the model we can use the following steps:

• Normal probability plot, to check the assumption of normality of the residuals.

In [7]: import statsmodels.api as sma


sma.qqplot(model.resid)

Out[7]:

• Residuals versus fitted values plot, to check whether the model assumptions hold.

In [8]: plt.scatter(model.fittedvalues.values,model.resid)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

• For a quick check of all the regressors, you can use the plot_partregress_grid() function.

In [9]: fig=plt.figure(figsize=(10,8))
fig=sma.graphics.plot_partregress_grid(model,fig=fig)

• The function plot_regress_exog() gives a 2×2 plot containing the following:

– The dependent variable and fitted values with confidence intervals versus the
chosen independent variable.
– Residuals versus the chosen independent variable.
– A partial regression plot.
– A CCPR (component and component-plus-residual) plot.

This function can be used to quickly check the assumptions with respect to a single
regressor.

In [10]: fig = plt.figure(figsize=(10,8))


fig = sma.graphics.plot_regress_exog(model, "climb", fig=fig)
Bibliography

[1] David Kahle, Hadley Wickham (2016). Spatial Visualization with ggplot2.
https://cran.r-project.org/web/packages/ggmap/ggmap.pdf

[2] Hadley Wickham, Winston Chang, RStudio (2016). Create Elegant Data Visualizations
Using the Grammar of Graphics.
https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

[3] P. Dalgaard. Introductory Statistics with R. Springer. Covers material typical of an
introductory statistics course, using R for examples. Assumes no advanced mathematics
beyond algebra.

[4] J. C. Pinheiro, D. M. Bates. Mixed Effects Models in S and S-Plus. Springer. An
advanced title covering linear and nonlinear mixed effects models.

[5] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S, 4th Ed. Springer.
A more advanced book using S. Emphasis on linear models and multivariate data analysis.
Includes some coverage of R but is more specific to S-Plus.

[6] R Bloggers (2017). Data Science Job Report 2017. Retrieved from:
https://www.r-bloggers.com/

[7] R Development Core Team (2007b). R: A language and environment for statistical
computing. Vienna, Austria: R Foundation for Statistical Computing. Available from
http://www.r-project.org

[8] Revolution Analytics (2014, February 20). What is R? Retrieved February 21, 2016,
from: http://www.inside-r.org/what-is-r

[9] Sarkar, D. (2017). lattice: Lattice graphics. Retrieved from
https://cran.r-project.org/web/packages/lattice/lattice.pdf

[10] The R Foundation (2017). The R Project for Statistical Computing. Retrieved from:
https://www.r-project.org/

[11] Yihui Xie (2016). A General-Purpose Package for Dynamic Report Generation in R.
https://cran.r-project.org/web/packages/knitr/knitr.pdf

[12] Wickham H (2015). R Packages: Organize, Test, Document, and Share Your Code.
O'Reilly, Sebastopol.

[13] Allen B. Downey. Think Python. O'Reilly Media, first edition, August 2012.

[14] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent
Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher,
Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: machine learning in Python.
Journal of Machine Learning Research, 12:2825-2830, 2011.

[15] Guo, Philip (2007, May). Why Python is a great language for teaching beginners in
introductory programming classes. Retrieved from:
http://pgbovine.net/python-teaching.htm

[16] Hinsen K (2007). ScientificPython Manual. URL
http://dirac.cnrs-orleans.fr/ScientificPython/ScientificPythonManual/.

[17] Hughes, Zachariah (2015, March). Personal Experience Section in MatLab vs. Python
vs. R.

[18] Kevin Sheppard. Introduction to Python for econometrics, statistics and data
analysis. Self-published, University of Oxford, version 2.1 edition, February 2014.

[19] matplotlib documentation, https://matplotlib.org/

[20] Oliphant TE (2006). Guide to NumPy. Provo, UT. URL http://www.tramy.us/.

[21] pandas documentation, https://pandas.pydata.org/

[22] seaborn documentation, https://seaborn.pydata.org/

[23] statsmodels documentation, http://www.statsmodels.org/dev/regression.html

[24] Skipper Seabold and Josef Perktold. Statsmodels: econometric and statistical
modeling with Python. In Proceedings of the 9th Python in Science Conference, 2010.
Annotation: Description of the statsmodels package for Python.

[25] Temple Lang D (2005). "R/S-PLUS-Python Interface." URL
http://www.omegahat.org/RSPython/.

[26] Van Rossum G, others (2016). Python Programming Language. URL
http://www.python.org/

[27] Wes McKinney. Python for data analysis. O'Reilly, Sebastopol, California, first
edition, October 2012. Annotation: Book on data analysis with Python introducing the
pandas library.

