Data_Analysis_using_R_and_Python
1 Introduction 7
1.1 Why R? ............................................................................................................................ 7
1.2 Why Python? .................................................................................................................. 7
1.3 Installing R software ................................................................................................... 8
1.4 Installing Python Software ......................................................................................... 8
1.5 Basic Operators in R ...................................................................................................... 9
1.5.1 Arithmetic Operators:....................................................................................... 9
1.5.2 Logical Operators ............................................................................................ 10
1.6 Basic Operators in Python .......................................................................................... 11
1.6.1 Arithmetic Operators ...................................................................................... 11
1.6.2 Logical Operators ............................................................................................ 12
1.7 Data types in R ............................................................................................................. 14
1.8 Data types in Python ................................................................................................... 15
1.9 Data Analysis ................................................................................................................ 17
3 Data Cleaning 49
3.1 Why is data cleaning important? ................................................................................ 49
3.2 Problems if we don’t clean our data .......................................................................... 49
3.3 Data cleaning process ................................................................................................. 50
3.3.1 ETL process ..................................................................................................... 50
3.3.2 Extraction of data using R.............................................................................. 50
3.3.3 Extraction of data using Python ..................................................................... 50
3.3.4 Transformation ................................................................................................ 51
3.3.5 Transformation using R.................................................................................. 51
3.3.6 Transformation using Python ........................................................................ 56
4 Exploratory Data Analysis 61
4.1 Univariate Exploratory Data Analysis ....................................................................... 62
4.2 Multivariate Exploratory Data Analysis ................................................................... 63
4.3 Exploratory Data Analysis using R .......................................................................... 63
4.4 Exploratory Data Analysis using Python ................................................................. 72
5 Regression Analysis 81
5.1 Linear Regression Analysis ......................................................................................... 81
5.1.1 Assumptions in Linear regression model .................................................... 82
5.2 Logistic Regression Analysis ...................................................................................... 82
5.3 Regression Analysis using R ...................................................................................... 83
5.3.1 Multiple Linear Regression............................................................................ 83
5.3.2 Logistic Regression ......................................................................................... 86
5.4 Regression Analysis using Python ............................................................................ 88
Bibliography 97
Chapter 1
Introduction
Python and R are two very popular open-source software packages used for data analysis.
Both are of equal importance and can be used to complement each other. When it comes
to choosing one, R offers more depth for data analysis and data modelling, while Python
is easier to learn and tends to present graphs in a more polished way.
1.1 Why R?
• Open Source: R is an open-source software package and can be installed easily.
• Learning: R requires you to learn and understand coding. It has a steeper learning
curve, and programs can require longer code.
1.2 Why Python?
• Learning: Python is easy to learn. It is known for its simplicity in the programming
world.
• Speed: People are often overly concerned with speed. Python is a high-level language,
which brings a number of benefits that help accelerate coding.
• Click the "download R" link in the middle of the page under "Getting Started."
• Select a CRAN location (a mirror site) and click the corresponding link.
• Click on the "Download R for (Mac) OS X" link at the top of the page.
• Once you have downloaded the file, open it by double-clicking on it.
1.5 Basic Operators in R
1.5.1 Arithmetic Operators
4.12e-2
## [1] 0.0412
cos(120*pi/180)
## [1] -0.5
1.5.2 Logical Operators
Operators Meaning
< Less than
> Greater than
<= Less than equal to
>= Greater than equal to
== Exactly equal to
!= Not equal to
!x Not x
x|y x or y
x&y x and y
x=c(1:5)
y=2
x<y
## [1]  TRUE FALSE FALSE FALSE FALSE
x>y
## [1] FALSE FALSE  TRUE  TRUE  TRUE
x<=y
## [1]  TRUE  TRUE FALSE FALSE FALSE
x>=y
## [1] FALSE  TRUE  TRUE  TRUE  TRUE
x==y
## [1] FALSE  TRUE FALSE FALSE FALSE
x!=y
## [1]  TRUE FALSE  TRUE  TRUE  TRUE
!x
## [1] FALSE FALSE FALSE FALSE FALSE
!y
## [1] FALSE
1.6 Basic Operators in Python
1.6.1 Arithmetic Operators
Operator   Meaning                                  Example
%          Modulus - remainder of the division      >>> 5%2
           of the left operand by the right         1 (remainder of 5/2)
**         Exponent - left operand raised           >>> 2**3
           to the power of the right                8 (i.e. 2 to the power 3)
math.sin   Sine trigonometric function              >>> math.sin(90*(math.pi/180))
                                                    1.0
math.log   Natural logarithm                        >>> math.log(25)
                                                    3.2188758248682006
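The operators in this table can be verified directly; a minimal, self-contained script:

```python
import math

# modulus: remainder of dividing the left operand by the right
print(5 % 2)    # 1 (remainder of 5/2)

# exponent: left operand raised to the power of the right
print(2 ** 3)   # 8 (i.e. 2 to the power 3)

# trigonometric and logarithmic functions live in the math module
print(math.sin(90 * (math.pi / 180)))  # 1.0
print(math.log(25))                    # 3.2188758248682006
```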
In [1]: x=2
y=7
1.6.2 Logical Operators
Operators Meaning
< Less than
> Greater than
<= Less than equal to
>= Greater than equal to
== Exactly equal to
!= Not equal to
not x Not x
x or y x or y
x and y x and y
In [2]: if (x==y):
            print("x is equal to y")
        else:
            print("x is not equal to y")
x is not equal to y
In [3]: if (x>y):
            print("x is greater than y")
        else:
            print("x is not greater than y")
x is not greater than y
In [4]: if (x<y):
            print("x is less than y")
        else:
            print("x is not less than y")
x is less than y
In [5]: if (x>=y):
            print("x is greater than equal to y")
        else:
            print("x is not greater than equal to y")
x is not greater than equal to y
In [6]: if (x<=y):
            print("x is less than equal to y")
        else:
            print("x is not less than equal to y")
x is less than equal to y
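The comparisons used in the examples above are themselves expressions that evaluate to bool, so they can also be inspected directly; a small sketch using the same x=2, y=7:

```python
x = 2
y = 7

print(x < y)    # True
print(x >= y)   # False
print(x == y)   # False
print(x != y)   # True

# 'not' negates a boolean; 'and' / 'or' combine booleans
print(not (x < y))        # False
print(x < y and x != y)   # True

# unlike many languages, Python allows chained comparisons
print(0 < x < y)          # True
```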
1.7 Data types in R
Every value has a data type. Data types are classes, and variables are instances
of these classes.
• Vector: A vector contains elements of the same basic type. For example:
#create a vector
X<- c(1,2,3,5,6,7)
X
## [1] 1 2 3 5 6 7
class(X)
## [1] "numeric"
• List: A list is a special type of vector which can contain elements of different data types.
For example:
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
## [[1]]
## [1] 2 5 3
##
## [[2]]
## [1] 21.3
##
## [[3]]
## function (x) .Primitive("sin")
• Dataframe: The data frame is a commonly used data type for storing tabular data. It dif-
fers from the matrix data type: in a matrix every element must be of the same class,
whereas a data frame can hold vectors of different classes, with every column acting as
a list element. Whenever you read data into R, it is stored in a data frame. For example:
df <- data.frame(name=c("ash","jane","paul","mark"), score=c(67,56,87,91))
df
## name score
## 1 ash 67
## 2 jane 56
## 3 paul 87
## 4 mark 91
1.8 Data types in Python
• Number: Number variables are created by standard Python assignment. For
example:
In [1]: a=5
In [2]: print(a,type(a))
5 <class 'int'>
In [3]: b=0.265
In [4]: print(b,type(b))
0.265 <class 'float'>
• String: String variables are created by enclosing characters in quotes. For example:
In [5]: greeting="Hello"
In [6]: print(greeting,type(greeting))
Hello <class 'str'>
• List: Lists are among the most useful data structures in Python. A list can be defined
simply by comma-separated values in square brackets. A list can contain items of
different types, though usually the items are of the same type. Python lists are mutable:
individual elements of a list can be changed. For example:
In [7]: square_list=[0,1,2,4,5,15,25,30]
In [8]: print(square_list,type(square_list))
In [9]: square_list[0]
Out[9]: 0
In [10]: square_list[2:6]
Out[10]: [2, 4, 5, 15]
• Tuple: A tuple is similar to a list but immutable. It is defined by comma-separated
values. For example:
In [11]: tuple_ex=1,2,3,5,9,77,44,55,21,36,59
In [12]: print(tuple_ex,type(tuple_ex))
(1, 2, 3, 5, 9, 77, 44, 55, 21, 36, 59) <class 'tuple'>
• Dictionary: A dictionary is a collection of key:value pairs defined inside curly braces.
For example:
In [13]: dict={"A":1,"B":2,"C":3}
In [14]: dict.keys()
Out[14]: dict_keys(['A', 'B', 'C'])
In [15]: dict.values()
Out[15]: dict_values([1, 2, 3])
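A short sketch contrasting the three container types above — lists are mutable, tuples are not, and dictionaries map keys to values:

```python
# lists are mutable: elements can be reassigned in place
square_list = [0, 1, 2, 4, 5, 15, 25, 30]
square_list[3] = 9
print(square_list[:4])       # [0, 1, 2, 9]

# tuples are immutable: item assignment raises TypeError
tuple_ex = (1, 2, 3)
try:
    tuple_ex[0] = 99
except TypeError:
    print("tuples cannot be modified")

# dictionaries support insertion and lookup by key
marks = {"A": 1, "B": 2, "C": 3}
marks["D"] = 4
print(sorted(marks.keys()))  # ['A', 'B', 'C', 'D']
print(marks["B"])            # 2
```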
1.9 Data Analysis
Before analysing data you need:
• A clear idea of what questions you want the data to answer.
The process of data analysis has the following steps:
• Data cleaning
• Data analysis
• Visualization and reporting
• Decision making
Chapter 2
Graphics and Visualization
Graphics and visualization are powerful tools for describing data and assisting its analysis.
Their power arises from the fact that they convey large quantities of information quickly.
Graphics play an important role in good data analysis: they are useful for exploring large
data sets, they assist in describing and summarizing the data, and they can be tightly
integrated with formal statistical tools such as model-fitting techniques so that the analysis
process can be refined.
2.1 Graphics and Visualization using R
R has many powerful packages for graphical representation and visualization. The two most
widely used packages are:
• lattice: lattice was created by Deepayan Sarkar, see Sarkar (2017). lattice is a powerful,
high-level data visualization system. It is based on the Grid graphics engine and requires
the grid add-on package. lattice provides its own interface for modifying sets of graphical
and non-graphical settings.
• ggplot2: ggplot2 was created by Hadley Wickham and implements the grammar of
graphics; it is discussed later in this section.
Dataset:
We use the Motor Trend Car Road Tests (mtcars) dataset from R's datasets package.
Description: The data were extracted from the 1974 Motor Trend US magazine, and com-
prise fuel consumption and 10 aspects of automobile design and performance for 32
automobiles (1973-74 models).
The lattice package supports trellis graphs, which show the relationships between variables.
The basic format of a lattice call is:
graph_type(formula, data=)
Here we show some applications of lattice functions with examples on the mtcars dataset:
# Lattice Examples
library(lattice)
attach(mtcars)
[Figure: density plot of Miles_per_Gallon]
[Figure: density plots of Miles_per_Gallon in separate panels by number of cylinders (cyl_4, cyl_6, cyl_8)]
[Figure: dotplot of Car Weight by cylinders (cyl_4, cyl_6, cyl_8), panelled by gears (gears_3, gears_4, gears_5)]
[Figure: 3D scatterplot of mpg, qsec and wt, panelled by cylinders]
# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
main="Scatter Plot of mtcars Data")
Advantages of ggplot2
• Plot at high-level of abstraction.
• Very flexible
• Consistent grammar of graphics
Grammar of graphics
• data
• aesthetic mapping
• geometric objects
• scales
30 CHAPTER 2. GRAPHICS AND VISUALIZATION
• coordinate system
• position adjustments
• statistical transformation
• geom_xxx(): There are many types of geometric objects; some are as follows:
ggplot(mtcars, aes(x=mpg,y=disp))+geom_point()
[Figure: scatter plot of disp versus mpg]
[Figure: scatter plot of disp versus mpg with points coloured by factor(carb) and shaped by factor(vs)]
• How to make histograms, density plots and boxplots: Suppose we want to plot the
distribution of mpg in mtcars and show trends by different groups. We can also show a
density graph by simply adding the geom_density() function as follows:
[Figure: histogram of mpg filled by factor(vs)]
[Figure: density plot of mpg by factor(vs)]
ggplot(mtcars, aes(x=factor(vs),y=disp))+geom_boxplot()+coord_flip()
[Figure: horizontal boxplots of disp by factor(vs)]
• How to make a trend line: A trend line can aid the eye in seeing patterns in the presence
of overplotting. Use geom_smooth() to add a trend line to your graph:
ggplot(mtcars, aes(x=mpg,y=disp))+geom_point()+
geom_smooth(aes(colour=factor(vs))) # aes(colour=) draws one trend line per group
[Figure: scatter plot of disp versus mpg with smoothed trend lines by factor(vs)]
• Faceting: Faceting generates small multiples, each showing a different subset of the
data.
Faceting is an alternative to using aesthetics (like color, shape or size) to differentiate
groups. It is a good choice when groups overlap a lot, but it makes small differences
between groups harder to observe.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mtcars_new<-select(mtcars,-carb)
ggplot(mtcars, aes(x=mpg,y=disp))+geom_point(data=mtcars_new,
colour="grey70")+
geom_point(aes(colour = carb))+facet_wrap(~carb)
[Figure: scatter plots of disp versus mpg faceted by carb, with the full data shown in grey behind each panel]
• Theme: The theme system in ggplot2 enables the user to control the non-data elements of
a ggplot object. It makes ggplot2 a flexible and powerful graphing tool for data
visualization.
ggplot(mtcars, aes(x=mpg,y=disp))+geom_point(data=mtcars_new,
colour="grey70")+
geom_point(aes(colour = carb))+facet_wrap(~carb)+theme_dark()
[Figure: the same faceted plot drawn with theme_dark()]
• Setting axis limits and labelling scales: We commonly need to adjust the axes, so ggplot2
provides several convenient functions to label axes and adjust axes and other aes-
thetics:
– lims, xlim, ylim: set axis limits
[Figure: scatter plot of Displacement versus Miles_per_gallon coloured by Cylinders, with custom axis labels]
2.2 Graphics and Visualization using Python
Python offers several packages for graphics and visualization:
• matplotlib
• seaborn
• ggplot
• mayavi
• chaco
Here we focus on two of the above: matplotlib and seaborn.
matplotlib is a plotting package designed for creating quality plots. It was created
by John Hunter in 2002 to enable MATLAB-like plotting in Python. matplotlib has a
number of add-on toolkits, such as mplot3d for 3D plots and basemap for mapping. There
are several ways to interact with matplotlib; the most common is through pylab mode
in IPython. matplotlib is probably the single most used Python package for 2D graphics.
It provides both a very quick way to visualize data from Python and publication-quality
figures in many formats (see https://matplotlib.org/). pyplot provides a convenient
interface to the matplotlib object-oriented plotting library and is modelled closely after
MATLAB. First we import all the libraries needed using the following commands:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
• Simple plotting a line for two array using plot() function and subplot() function.
carb
0 4
1 4
2 1
3 1
4 2
In [4]: fig=plt.figure()
        fig.add_subplot(2,2,1) # axes of the first graph
        # a scatter plot can be drawn using the scatter() function
        plt.scatter(data.mpg, data.disp)
        plt.ylabel('disp')
        plt.legend()
• To plot line plots for all the variables in a single command.
In [9]: autocorrelation_plot(data.mpg)
• Pie chart: A pie chart can be drawn easily using the pie() function.
In [10]: labels = ['eng', 'mat', 'fre', 'Logs']
         fracs = [15,40,25,30]
         explode = (0, 0.05, 0, 0)
Now we use the iris dataset to show some more attractive plots in matplotlib:
In [12]: iris.head()
• Parallel coordinate plot: A parallel coordinate plot maps each row in the data table
as a line, or profile. Each attribute of a row is represented by a point on the line.
This makes parallel coordinate plots similar in appearance to line charts, but the
way data is translated into a plot is substantially different.
In [15]: plt.figure()
         parallel_coordinates(iris, 'Name')
seaborn is a Python visualization library built on matplotlib. Among other things it provides:
• Tools for choosing color palettes to make beautiful plots that reveal patterns in your
data
• Tools that fit and visualize linear regression models for different kinds of indepen-
dent and dependent variables
• Functions that visualize matrices of data and use clustering algorithms to discover
structure in those matrices
• A function to plot statistical time series data with flexible estimation and represen-
tation of uncertainty around the estimate
• High-level abstractions for structuring grids of plots that let you easily build com-
plex visualizations
Chapter 3
Data Cleaning
Most statistical theory focuses on data modelling, prediction and inference, and assumes
that the data are accurate and ready for analysis. Here, inaccurate data means data that are
incorrect, incomplete, out of date, or wrongly formatted. In practice, it is very rare that the
raw data one works with are in the correct format and free of errors. Often our data are
quite messy, and analysing them directly might corrupt the analysis, so we need to process
the data before any further analysis. A data analyst spends most of his or her time on data
cleaning, i.e. preparing the data for statistical analysis. Data cleaning is the process of
correcting errors and transforming raw data into consistent data that can be analysed.
R and Python provide good environments for data cleaning.
No quality data, no quality decisions: quality decisions must be based on good-quality
data (data with errors may lead to misleading statistics). The ultimate goal of data
cleaning is to turn inconsistent data into data that are ready for analysis.
3.3.1 ETL process
• Extract: Read the data from one or more sources.
• Transform: Manipulate the data into a useful format and clean the data.
• Load: Load the data into the data warehouse intended for analysis.
• Import data from Excel: We use the 'readxl' package to access Excel files. In the Excel
file, the first row should contain the variable/column names.
library(readxl)
dataset = read_excel("C:/Users/example2.xls")
• Import data from CSV: In a CSV file the first row should contain the variable/column
names, each element is comma separated, and header is true. In Python we first import
the pandas library:
import pandas as pd
3.3.4 Transformation
• Rebuild Missing Data: Recreating missing information as and when possible, such
as Post codes, states, country, phone area codes, gender, web address from email
addresses etc.
• Standardize and Normalize Data: The entries or categories in each field of the given
data set must be homogeneous, i.e. all entries must have the same format for Name,
Address, Email, Contact Number, abbreviated/full names of provinces, titles and so on.
This step ensures that similar entries, e.g. "sir", "Mr.", "Mr", are all converted to
"Mr.", and "road", "st.", "strt." are all converted to "St.". Convert telephone numbers
to their standard format, as required.
• De-Duplicate Data: Identify potential duplicates. Seek high-accuracy matches with a
tolerance for misspellings, missing values or different address orders. For mission-
critical data, these results should be manually reviewed and the database then updated
accordingly.
• Verification to Enrich Data: Validate the data against internal and external data
sources to append value-adding information. For example, business contacts can be
validated against the yellow pages to verify their current phone numbers and addresses.
Various other fields, including credit ratings, geo-coordinates, key contacts, employee
size, profit, revenue and time zones, can be fetched for each company in the same way.
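The standardization step above can be sketched in plain Python. The lookup tables below (title_map, street_map) are hypothetical examples built from the substitutions mentioned in the text, not a complete rule set:

```python
# hypothetical lookup tables for normalizing titles and street suffixes
title_map = {"sir": "Mr.", "mr": "Mr.", "mr.": "Mr."}
street_map = {"road": "St.", "st.": "St.", "strt.": "St."}

def standardize_title(raw):
    """Map assorted spellings of a title onto one canonical form."""
    return title_map.get(raw.strip().lower(), raw.strip())

def standardize_street(raw):
    """Map assorted street suffixes onto one canonical form."""
    return street_map.get(raw.strip().lower(), raw.strip())

print(standardize_title("sir"))     # Mr.
print(standardize_title("Mr"))      # Mr.
print(standardize_street("strt."))  # St.
```

Unknown values pass through unchanged, so the mapping can be grown incrementally as new variants are discovered in the data.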
## 1 1 Rick 623.30 32
## 2 2 Dan 515.20 31
## 3 3 Michelle 611.00 26
## 4 4 Ryan 729.00 29
## 5 5 Gary 843.25 36
## 6 6 Ryan 552.10 29
By default the data frames are merged on the columns whose names they share, but
separate specifications of the columns can be given via by.x and by.y. The rows in the
two data frames that match on the specified columns are extracted and joined together.
mydata4<-merge(mydata1,mydata2, by="Name") # merge on the shared Name column
mydata4
## 3 foo 1 A
## 4 foo 1 B
## 5 bar 1 A
## A B C
## 1 foo 0 A
## 2 foo 1 A
## 3 foo 1 B
## 4 bar 1 A
The distinct() function in the dplyr package can be used to keep only unique/distinct
rows of a data frame. If there are duplicate rows, only the first row is preserved.
It is an efficient version of the base R function unique().
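The keep-the-first-row behaviour of distinct() (and of base R's unique()) can be imitated in plain Python — a sketch, not the dplyr implementation:

```python
def distinct(rows):
    """Remove duplicate rows, preserving the first occurrence of each."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)   # rows must be hashable to serve as set keys
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [("foo", 0, "A"), ("foo", 1, "A"), ("foo", 1, "A"),
        ("foo", 1, "B"), ("bar", 1, "A")]
print(distinct(rows))
# [('foo', 0, 'A'), ('foo', 1, 'A'), ('foo', 1, 'B'), ('bar', 1, 'A')]
```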
• Missing Observations: In R, missing values are represented by the symbol NA (not
available). Impossible values (e.g., dividing by zero) are represented by the symbol
NaN (not a number).
– Detecting missing values: Missing values can be detected using is.na() func-
tion.
## A B C
## 1 1 a one
## 2 2 c two
## 3 3 e <NA>
## 4 4 f three
## 5 5 h <NA>
– Ways to exclude missing values : Math functions generally have a way to ex-
clude missing values in their calculations. mean(), median(), colSums(), var(),
sd(), min() and max() all take the na.rm argument. When this is TRUE, missing
values are omitted. The default is FALSE, meaning that each of these functions
returns NA if any input number is NA.
More functions can be used to exclude missing values. If you have a large number of
observations in your dataset, you can try deleting the observations (rows) that contain
missing values, or exclude them while building the model, for example by setting
na.action=na.omit.
* na.omit: Drop out any rows with missing values anywhere in them and
forgets them forever.
* na.exclude: Drop out rows with missing values, but keeps track of where
they were (so that when you make predictions, for example, you end up
with a vector whose length is that of the original response.)
* na.pass: returns the object unchanged
* na.fail: returns the object only if it contains no missing values
## A B C
## 1 1 a one
## 2 2 c two
## 4 4 f three
– The Hmisc library can be used to replace missing values with the mean, median
or mode:
## A B C
## 1 1 0.5 15
## 2 2 0.8 26
## 3 3 1.2 NA
## 4 4 NA 12
## 5 5 0.1 NA
## 6 NA 1.5 NA
## 1 2 3 4 5 6
## 1 2 3 4 5 3*
## 1 2 3 4 5 6
## 0.5 0.8 1.2 0.8* 0.1 1.5
## 1 2 3 4 5 6
## 15 26 0* 12 0* 0*
In [11]: df_1=pd.DataFrame({'key':['b','b','a','c','a','a','b'],
                            'd':[0,1,2,3,4,5,6]})
         print(df_1)
d key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
In [12]: df_2=pd.DataFrame({'key':['a','b','d'],
                            'd2':[0,1,2]})
         print(df_2)
d2 key
0 0 a
1 1 b
2 2 d
In [13]: pd.merge(df_1,df_2)
Out[13]: d key d2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
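By default pd.merge performs an inner join on the shared key column; the underlying logic can be sketched in plain Python using the same key/d/d2 values:

```python
# (key, d) pairs from df_1 and (key, d2) pairs from df_2
df_1 = [("b", 0), ("b", 1), ("a", 2), ("c", 3), ("a", 4), ("a", 5), ("b", 6)]
df_2 = [("a", 0), ("b", 1), ("d", 2)]

def inner_join(left, right):
    """Pair every left row with every right row that shares its key."""
    merged = []
    for key_l, d in left:
        for key_r, d2 in right:
            if key_l == key_r:
                merged.append((d, key_l, d2))
    return merged

result = inner_join(df_1, df_2)
print(len(result))   # 6 matching rows
```

Keys that appear on only one side ('c' on the left, 'd' on the right) are dropped, which is exactly the inner-join behaviour visible in the merged output.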
df_3=pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
print(df_3)
data1 lkey
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
df_4=pd.DataFrame({'rkey':['a','b','d'],
                   'data2':range(3)})
print(df_4)
   data2 rkey
0      0    a
1      1    b
2      2    d
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
Out[9]: A B C
2 foo 1 B
3 bar 1 A
print(df)
print(df2)
pandas objects are equipped with various data manipulation methods for dealing
with missing data.
You may wish to simply exclude labels from a data set which refer to missing data.
To do this, use the dropna method:
In [36]: df2.dropna(axis=0)
You can also call fillna with a dict or Series that is align-able. The labels of the dict or
the index of the Series must match the columns of the frame you wish to fill. A common
use case is to fill a DataFrame with the mean of each column:
In [38]: df2.fillna(df2.mean())
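The effect of df2.fillna(df2.mean()) — replacing each missing entry by the mean of the observed values in its column — can be imitated on a plain Python list, with None standing in for NaN:

```python
def fill_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

col = [1.0, None, 3.0, None, 5.0]
print(fill_with_mean(col))   # [1.0, 3.0, 3.0, 3.0, 5.0]
```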
Chapter 4
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to data analysis and a critical step in
analyzing data. It is where the experimenter takes a bird's-eye view of the data and tries to
make some sense of it. Exploratory data analysis takes place after gathering and cleaning
the data, and is often carried out before any formal statistical technique is applied. Among
the main purposes of this type of analysis are getting to know our data, their tendencies
and their quality, and checking or even starting to formulate our hypotheses.
Here are some reasons why we use EDA:
• Detection of mistakes.
• Gain maximum insight into the dataset and its underlying structure.
• Detection of outliers
Most EDA techniques are graphical in nature with a few quantitative techniques. The
reason for the heavy reliance on graphics is that by its nature the main role of EDA is to
open-mindedly explore. The particular graphical techniques employed in EDA are often
quite simple consisting of various techniques of:
• Plotting the raw data (such as data traces, histograms, bihistograms, probability
plots, lag plots, block plots, and Youden plots).
• Plotting simple statistics such as mean plots, standard deviation plots, box plots,
and main effects plots of the raw data.
Types of Exploratory Data Analysis:
EDA falls into two broad areas, each of which can be carried out graphically or non-graphically:
• Univariate EDA - looking at one variable of interest, like age, height, income level
etc.
• Multivariate EDA - analysis of multiple variables at the same time.
4.1 Univariate Exploratory Data Analysis
• Stem and leaf plots: A simple alternative to the histogram is the stem and leaf plot.
Nevertheless, a histogram is generally considered better for estimating the shape of a
sample distribution than the stem and leaf plot.
• Density plot: A density plot visualizes the distribution of data over a continuous
interval or time period.
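The quantitative side of univariate EDA — location and spread — is available in Python's standard statistics module; a quick sketch on a small made-up sample:

```python
import statistics

sample = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 13]

print(statistics.mean(sample))    # about 8.45
print(statistics.median(sample))  # 9
print(statistics.mode(sample))    # 10
print(statistics.stdev(sample))   # sample standard deviation
```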
4.2 Multivariate Exploratory Data Analysis
– If the vertical (or y-axis) variable increases as the horizontal (or x-axis) variable
increases, the correlation is positive.
– If the y-axis variable decreases as the x-axis variable increases or vice-versa, the
correlation is negative.
– If neither of the above criteria can be established, the correlation is zero.
• Multiple boxplot: Unlike a regular box plot, which represents the range of values of one
variable, a multiple box plot represents the ranges of values of several variables. It can
be used to visualize multiple variables together and to compare two or more variables.
• Multiple histogram: A panel of histograms enables you to compare the data dis-
tributions of different groups. You can create the histograms in a column (stacked
vertically) or in a row.
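The sign rules for scatter plots above correspond to the Pearson correlation coefficient, r = sum((xi - mx)(yi - my)) / sqrt(sum((xi - mx)^2) * sum((yi - my)^2)); a plain-Python sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))   # close to  1.0 (positive correlation)
print(pearson(x, [10, 8, 6, 4, 2]))   # close to -1.0 (negative correlation)
```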
4.3 Exploratory Data Analysis using R
We use the cars dataset (the speed of cars and the distances taken to stop):
head(data)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
We can see some basic characteristics of the dataset using dim(), str(), names(), head(),
tail(), summary() functions.
dim(data)
## [1] 50 2
str(data)
## 'data.frame': 50 obs. of 2 variables:
The arrange() function is used to reorder rows of a data frame according to one of the
variables/columns.
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
The select() function can be used to select columns of a data frame that you want to
focus on. Often you’ll have a large data frame containing "all" of the data, but any given
analysis might only use a subset of variables or observations.
## speed
## 1 4
## 2 4
## 3 7
## 4 7
## 5 8
## 6 9
## speed
## 45 23
## 46 24
## 47 24
## 48 24
## 49 24
## 50 25
After renaming speed to velocity (for example with dplyr's rename() function), the data
frame looks like:
## velocity dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
• Histogram: A histogram can be drawn using the hist() function. We can get a little more
detail by using the rug() function to show the actual data points.
[Figure: "Histogram of Speed" - histogram of data$speed with a rug of the data points]
## 4 | 00
## 6 | 00
## 8 | 00
## 10 | 00000
## 12 | 00000000
## 14 | 0000000
## 16 | 00000
## 18 | 0000000
## 20 | 00000
## 22 | 00
## 24 | 00000
[Figure: "Boxplot of Speed" - boxplot of data$speed]
• Density plot: A density plot can be drawn using plot(density()), where the density()
function returns the density estimate and the plot() function draws the result.
[Figure: density plot of speed (N = 50, Bandwidth = 2.15)]
• Scatter plot: The function splom() (in the lattice package) can be used to display a
scatter plot matrix. The function chart.Correlation() (in the PerformanceAnalytics
package) can be used to display a chart of scatter plots and correlations between variables.
library(lattice)
splom(data)
[Figure: scatter plot matrix of speed and dist]
• Multiple histogram: Multiple histograms can be drawn simply by calling the hist()
function for the different variables of interest after the par() function, which divides
the graphics window into rows and columns.
[Figure: histograms of data$speed and data$dist arranged in two rows]
12 12 14
13 12 20
14 12 24
15 12 28
16 13 26
17 13 34
18 13 34
19 13 46
20 14 26
21 14 36
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
29 17 32
30 17 40
31 17 50
32 18 42
33 18 56
34 18 76
35 18 84
36 19 36
37 19 46
38 19 68
39 20 32
40 20 48
41 20 52
42 20 56
43 20 64
44 22 66
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
In [2]: data.info()
Int64Index: 50 entries, 1 to 50
Data columns (total 2 columns):
speed 50 non-null int64
dist 50 non-null int64
dtypes: int64(2)
memory usage: 1.2 KB
In [3]: data.tail()
In [4]: data.head()
In [5]: data.loc[3:6]
In [6]: data.describe()
In [7]: data.speed.describe()
• Histogram: A histogram can be drawn using the hist() function from the matplotlib module.
Note: The seaborn module provides tools for examining univariate and bivariate
distributions. They can be used as follows:
* import seaborn as sns # import the seaborn module
* sns.distplot(x) # density plot with histogram
* sns.distplot(x, kde=False, rug=True) # histogram
* sns.distplot(x, hist=False, rug=True) # density plot
In [14]: plt.figure();
         BP=data.boxplot(column=['speed','dist'])
• Multiple histogram: Multiple histograms can be drawn similarly to the multiple box-
plot, using DataFrame.hist(column=[]) as follows:
Chapter 5
Regression Analysis
Regression analysis is used to determine the nature of the relationship between two or more
variables, i.e. the probable form of the mathematical relation between X and Y (where X
represents the explanatory variables and Y represents the response variable). Regression is
also used to predict or estimate the value of one variable (the response or dependent
variable) corresponding to a given value of another variable (an explanatory or independent
variable).
Linear model: A model is said to be linear when it is linear in the parameters. Non-linear
model: A model is said to be non-linear when it is non-linear in the parameters.
Y = Xβ + ε (5.1)
where:
• Y denotes the vector of responses,
• X denotes the matrix of observations on the explanatory variables x1, x2, ..., xk,
• β denotes the vector of regression coefficients associated with the x1, x2, ..., xk variables,
• ε denotes the random error term.
For k explanatory variables, equation (5.1) can be written as y = x1β1 + x2β2 + ... + xkβk + ε.
This is called the multiple linear regression model.
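For a single explanatory variable with an intercept, the least-squares estimates have the closed forms b1 = sum((xi - mx)(yi - my)) / sum((xi - mx)^2) and b0 = my - b1*mx. A plain-Python sketch on made-up data (not one of the book's datasets):

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

# a noise-free line y = 1 + 2x recovers its own coefficients
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
print(simple_ols(x, y))   # (1.0, 2.0)
```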
Example: Income and education of a person are related; it is expected that, on average, a
higher level of education provides a higher income. So the simple linear regression model can be
expressed as:
• The error term must have constant variance. The presence of constant variance
among the error terms is known as homoskedasticity, and its absence is known as
heteroskedasticity.
• There should be no correlation between the residuals or error terms. The presence of
such correlation is known as autocorrelation.
Note: The normal probability plot and the plot of residuals versus fitted values are helpful
in detecting several common types of violation of the model assumptions.
Example: The probability that a person has a heart attack within a specified time period
can be predicted from knowledge of person’s age, sex, cholesterol level, weight, etc.
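Logistic regression models this probability through the logistic (sigmoid) function p = 1 / (1 + exp(-(b0 + b1*x))). A sketch with purely hypothetical coefficients, not fitted to any real heart-attack data:

```python
import math

def predicted_probability(x, b0, b1):
    """Probability from a one-predictor logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# hypothetical coefficients: the modelled risk rises with age
b0, b1 = -5.0, 0.08
print(round(predicted_probability(40, b0, b1), 3))  # 0.142
print(round(predicted_probability(70, b0, b1), 3))  # 0.646
```

Whatever the coefficients, the output always lies strictly between 0 and 1, which is why the logistic function is used for probabilities.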
where:
• k = the number of independent variables in the model.
5.3 Regression Analysis using R
5.3.1 Multiple Linear Regression
We fit a multiple linear regression of time on dist and climb using the Hills_data dataset:
model1 <- lm(time ~ dist + climb, data = Hills_data)
summary(model1)
##
## Call:
## lm(formula = time ~ dist + climb, data = Hills_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.215 -7.129 -1.186 2.371 65.121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.992039 4.302734 -2.090 0.0447 *
## dist 6.217956 0.601148 10.343 9.86e-12 ***
## climb 0.011048 0.002051 5.387 6.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.68 on 32 degrees of freedom
## Multiple R-squared:  0.9191, Adjusted R-squared:  0.914
## F-statistic: 181.7 on 2 and 32 DF, p-value: < 2.2e-16
[Figure: standard diagnostic plots for the fitted model (residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage with Cook's distance contours); observations 7, 18, and 31 are flagged as potential outliers or influential points.]
library(car)
crPlots(model1)
[Figure: component-plus-residual plots for dist and climb produced by crPlots(model1).]
[Figure: diagnostic plots based on standardized deviance residuals (residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage); observations 4, 8, and 15 are flagged.]
Assuming pandas has been imported as pd in an earlier cell (import pandas as pd):
In [2]: Hills_data=pd.read_csv("E:/jimmy/j2/Hills.csv",header=0)
        print(Hills_data)
In [6]: print(model.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 4.2e+03. This might indicate that there
are strong multicollinearity or other numerical problems.
The coefficient of determination (R-squared) is 0.846. The warning message indicates that strong multicollinearity may be present.
To check the model assumptions, we can use the following steps:
In [8]: plt.scatter(model.fittedvalues.values,model.resid)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()
5.4 Regression Analysis using Python
• For a quick check of all the regressors, you can use the plot_partregress_grid() function.
In [9]: fig=plt.figure(figsize=(10,8))
fig=sma.graphics.plot_partregress_grid(model,fig=fig)