Notes For Python Part II
Plotting
Graphics Importing/Exporting Data Series/DataFrame Stats Dist Regression
Line plots
Using the plt.plot() function, we can produce the following plots.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')  # seaborn theme for the background
# Data
y = np.random.randn(100)
x = np.cumsum(np.random.rand(100))
# Plot 1
%matplotlib inline
plt.plot(y)
# Plot 2
plt.plot(y, 'r-o', label='a line graph')
plt.legend()
plt.xlabel('x label')
plt.title('A line plot')
plt.ylabel('y label')
# Plot 3
plt.plot(x, y, 'r-d', label='a line graph')
plt.legend()
plt.xlabel('x label')
plt.title('USING PLOT')
plt.ylabel('y label')
In Plot 2, r-o indicates red (r), solid line (-) and circle (o) marker. Similarly, in
Plot 3, r-d indicates color red (r), solid line (-) and diamond (d) marker.
Titles are added with title() and legends are added with legend(). The
legend requires that the line has a label.
The labels for the x and y axis are added by xlabel() and ylabel().
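As a small sketch of how the format string maps onto keyword arguments, the following draws the same style twice: once with 'r-o' and once with color, linestyle and marker spelled out. The variable names, the axes-based calls (ax.plot instead of plt.plot) and the Agg backend line are choices made for this illustration and are not part of the original notes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(20)

fig, ax = plt.subplots()
# Format string: red (r), solid line (-), circle markers (o)
(line_fmt,) = ax.plot(y, "r-o", label="format string")
# The same style written with explicit keyword arguments
(line_kw,) = ax.plot(y + 1, color="r", linestyle="-", marker="o", label="keywords")

ax.set_xlabel("x label")
ax.set_ylabel("y label")
ax.set_title("A line plot")
ax.legend()  # the legend picks up the label given to each line
```

Both lines end up with the same color, line style and marker, so the format string can be seen as shorthand for the three keywords.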
[Figure: (c) Plot 3]
Now we will specify the arguments of plt.plot() explicitly in the following
example.
y = np.random.randn(100)
x = np.cumsum(np.random.rand(100))
plt.plot(x, y, alpha=1, color='#FF7F00',
         label='Line Label', linestyle='-',
         linewidth=2, marker='o', markeredgecolor='#000000',
         markeredgewidth=1, markerfacecolor='#FF7F99',
         markersize=5)
plt.legend()
plt.xlabel('x label')
plt.title('USING PLOT')
plt.ylabel('y label')
The most useful keyword arguments of plt.plot() are listed in the table
below.
The functions getp() and setp() can be used to get the list of properties for
a line (or any matplotlib object), and setp() can also be used to set a
particular property.
Some options for color, linestyle and marker are given in the following
table.
The functions getp() and setp() can be used in the following way.
# Using setp()
h = plt.plot(np.random.randn(10))
plt.setp(h, alpha=0.5, linestyle='--', linewidth=2, label='Line Label',
         marker='o', color='red')
plt.legend()
plt.xlabel('x label')
plt.title('USING PLOT')
plt.ylabel('y label')
# To see all the properties of the line and their current values:
plt.getp(h)
# To see the valid values for a property, pass the name of
# the property to setp() without a value:
plt.setp(h, 'alpha')
plt.setp(h, 'color')
plt.setp(h, 'linestyle')
plt.setp(h, 'marker')
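getp() can also return the value of a single property when the property name is passed as a second argument. A minimal sketch (the variable names and the Agg backend line are assumptions made for this illustration):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

lines = plt.plot(np.arange(10))  # plt.plot returns a list of Line2D objects
plt.setp(lines, linestyle="--", linewidth=2, alpha=0.5)

# Passing a property name to getp returns the current value of that property
style = plt.getp(lines[0], "linestyle")
width = plt.getp(lines[0], "linewidth")
alpha = plt.getp(lines[0], "alpha")
print(style, width, alpha)
```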
# Scatter plots
z = np.random.randn(100, 2)
z[:, 1] = 0.5 * z[:, 0] + np.sqrt(0.5) * z[:, 1]
x = z[:, 0]
y = z[:, 1]
plt.scatter(x, y, c='#FF7F99', marker='o',
            alpha=1, label='Scatter Data')
# Bar plots
y = np.random.rand(5)
x = np.arange(5)
b = plt.bar(x, y, width=1, color='#FF7F99',
            edgecolor='#000000', linewidth=1)
# Pie plots
y = np.random.rand(5)
y = y / np.sum(y)
y[y < .05] = .05
labels = ['One', 'Two', 'Three', 'Four', 'Five']
colors = ['#FF0000', '#FFFF00', '#00FF00', '#00FFFF', '#0000FF']
plt.pie(y, labels=labels, colors=colors)
# Histograms
x = np.random.randn(1000)
plt.hist(x, bins=30)
plt.hist(x, bins=30, density=True, color='#FF7F00')
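The effect of density=True in the last histogram can be checked numerically. The sketch below (variable names and the Agg backend are assumptions of this illustration) uses the bar heights and bin edges returned by hist() to confirm that the bars enclose a total area of one.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

fig, ax = plt.subplots()
# hist returns the bar heights, the bin edges and the drawn patches
heights, edges, _ = ax.hist(x, bins=30, density=True, color="#FF7F00")

# With density=True, height times bin width summed over all bins equals one
area = np.sum(heights * np.diff(edges))
print(area)
```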
Seaborn
The main strength of the Seaborn library, apart from generating good-looking
graphics, is its collection of easy-to-use statistical plots.
Examples of these are the kdeplot and distplot, which plot a kernel-density
estimate plot and a histogram plot with a kernel-density estimate overlaid on
top of the histogram, respectively.
# density plot
x = np.random.randn(100)
sns.distplot(x, bins=30);
x = np.random.randn(100)
y = 1 + np.random.randn(100)
sns.distplot(x, bins=30);
sns.distplot(y, bins=30);
Relatedly, we can use the jointplot function to plot the joint distribution for
two separate datasets.
# joint plots (recent seaborn versions require x and y as keyword arguments)
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="kde", space=0, color="b")
with sns.axes_style("white"):
    (sns.jointplot(x=x, y=y, color="b")
        .plot_joint(sns.kdeplot, zorder=0, n_levels=6))
All of the data readers in pandas load data into a pandas DataFrame.
Comma-separated value (CSV) files can be read using read_csv.
Excel files, both 97/2003 (xls) and 2007/10/13 (xlsx), can be imported using
read_excel.
Stata files can be read using read_stata.
read_excel()
read_excel() supports reading data from both xls (Excel 2003) and xlsx
(Excel 2007/10/13) formats.
Notable keyword arguments include:
header, an integer indicating which row to use for the column labels. The
default is 0 (top) row.
skiprows, typically an integer indicating the number of rows at the top of the
sheet to skip before reading the file. The default is 0.
skipfooter (skip_footer in older pandas versions), typically an integer indicating the number of rows at the bottom
of the sheet to skip when reading the file. The default is 0.
index_col, an integer or column name indicating the column to use as the index.
If not provided, a basic numeric index is generated.
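read_csv() accepts the same four keywords, so their effect can be sketched on a small in-memory CSV without needing an Excel file. The text and numbers below are made up for this illustration; engine="python" is set because skipfooter is not supported by the default C parser.

```python
import io
import pandas as pd

# Made-up file content: one note line on top, a header row, data, one footer line
raw = "\n".join([
    "Source: made-up numbers for illustration",
    "state,gdp_2009,gdp_2010",
    "AK,44.2,43.5",
    "AL,149.8,153.8",
    "AZ,221.4,221.0",
    "Total rows: 3",
])

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,         # skip the note line at the top
    header=0,           # first remaining row holds the column labels
    skipfooter=1,       # skip the footer line at the bottom
    index_col="state",  # use the 'state' column as the index
    engine="python",    # skipfooter needs the Python parsing engine
)
print(df)
```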
read_csv()
Import/Export Data
# Importing data
import pandas as pd
# Use read_excel() to import data
state_gdp = pd.read_excel('US_state_GDP.xls', sheet_name='Sheet1')
type(state_gdp)
state_gdp.head()
# Use read_csv() to import data
csv_data = pd.read_csv('US_state_GDP.csv')
type(csv_data)
csv_data.head()
# Use read_stata() to import data
stata_data = pd.read_stata('US_state.dta')
type(stata_data)
stata_data.head()
# Exporting data
# Note: writing the old .xls format relies on the xlwt engine, which pandas 2.0
# no longer supports; on recent versions use the .xlsx call instead.
state_gdp.to_excel('State_GDP_from_DataFrame.xls')
state_gdp.to_excel('State_GDP_from_DataFrame.xls', sheet_name='State GDP')
state_gdp.to_excel('State_GDP_from_DataFrame.xlsx')
state_gdp.to_csv('State_GDP_from_DataFrame.csv', index=False)
state_gdp.to_stata('State_GDP_from_DataFrame.dta', write_index=False)
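Since the data files used above are not included here, the export and import steps can be tried on a small made-up DataFrame. This sketch writes a CSV file to a temporary directory and reads it back; the file name and values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up values, used only to illustrate the round trip
frame = pd.DataFrame({"state": ["AK", "AL", "AZ"],
                      "gdp_2009": [44.5, 149.25, 221.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    frame.to_csv(path, index=False)  # index=False: do not write the row index
    restored = pd.read_csv(path)     # read the file back into a DataFrame

print(restored.equals(frame))
```

Without index=False, the row index would be written as an extra unnamed first column and would come back as a data column when the file is read.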
Pandas
Series are the primary building block of the data structures in pandas, and in
many ways a Series behaves similarly to a NumPy array.
Series
import numpy as np
import pandas as pd
# Series
s = pd.Series([0.1, 1.2, 2.3, 3.4, 4.5])
s
Out:
0    0.1
1    1.2
2    2.3
3    3.4
4    4.5
dtype: float64
type(s)
Out: pandas.core.series.Series
# From array
a = np.array([0.1, 1.2, 2.3, 3.4, 4.5])
s = pd.Series(a)
s
Out:
0    0.1
1    1.2
2    2.3
3    3.4
4    4.5
dtype: float64
# From tuple
a = (1, 2, 3, 4, 'abs', np.nan)
s = pd.Series(a)
s
Out:
0      1
1      2
2      3
3      4
4    abs
5    NaN
dtype: object
In mathematical operations, indices that do not match are given the value NaN
(not a number).
s1 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s2 = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
s3 = pd.Series({'c': 0.1, 'd': 1.2, 'e': 2.3})
s1 + s2
Out:
a    1.1
b    3.2
c    5.3
dtype: float64
s1 * s2
Out:
a    0.1
b    2.4
c    6.9
dtype: float64
s1 + s3
Out:
a    NaN
b    NaN
c    2.4
d    NaN
e    NaN
dtype: float64
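If NaN is not wanted for unmatched labels, one option (not covered in the notes above) is the add() method with fill_value, which substitutes a value for the missing side before adding. A minimal sketch:

```python
import pandas as pd

s1 = pd.Series({"a": 0.1, "b": 1.2, "c": 2.3})
s3 = pd.Series({"c": 0.1, "d": 1.2, "e": 2.3})

# Labels present in only one Series are treated as 0 in the other
total = s1.add(s3, fill_value=0)
print(total)
```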
s1 = pd.Series([1.0, 2, 3])
s1.values  # access to values
Out: array([1., 2., 3.])
s1.index
Out: RangeIndex(start=0, stop=3, step=1)
s1.index = ['cat', 'dog', 'elephant']  # set index labels
s1.index
Out: Index(['cat', 'dog', 'elephant'], dtype='object')
# Try the following
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']
s1.head()
s1.tail()
s1.isnull()
s1.notnull()
s1.loc['e']
s1.iloc[4]
s1.drop('e')
s1.dropna()
s1.fillna(-99)
s1 = pd.Series(np.arange(10.0, 20.0))
s1.describe()
summ = s1.describe()
summ
summ['mean']
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
pd.concat([s1, s2])  # Series.append() was removed in pandas 2.0
pd.concat([s1, s2], ignore_index=True)
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
s1.replace(1, -99)
s1.update(s2)
s1
DataFrame
DataFrames collect multiple series in the same way that a spreadsheet collects
multiple columns of data.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['dogs', 'cats'],
                  index=['Alice', 'Bob'])
df
Out:
       dogs  cats
Alice     1     2
Bob       3     4
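To make the link between Series and DataFrame concrete, the same table can be built from two Series that share an index, and a single column can be pulled back out as a Series. A short sketch (the variable names are chosen for this illustration):

```python
import pandas as pd

dogs = pd.Series([1, 3], index=["Alice", "Bob"])
cats = pd.Series([2, 4], index=["Alice", "Bob"])

# Each dictionary key becomes a column name; the Series are aligned on their index
df = pd.DataFrame({"dogs": dogs, "cats": cats})

col = df["dogs"]  # selecting one column returns a Series
print(type(col).__name__)
```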
The use of DataFrames is demonstrated with state-level GDP data from the US,
a data set that contains a mix of data types.
Finally, rows can also be selected by logical selection, using a Boolean array
with one element per row of the DataFrame.
state_long_recession = state_gdp['gdp_growth_2010'] < 0
state_gdp[state_long_recession]  # returns states for which gdp_growth_2010 is negative
Standard slicing cannot select rows and columns at the same time. For that,
we can use loc[rowselector, columnselector].
state_gdp.loc[10:15, 'state']
state_gdp.loc[state_long_recession, 'state']
state_gdp.loc[state_long_recession, ['state', 'gdp_growth_2010']]
state_gdp.loc[state_long_recession, ['state', 'gdp_growth_2009', 'gdp_growth_2010']]
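The same selections can be tried without the GDP file. The sketch below uses a tiny DataFrame with hypothetical values standing in for state_gdp; only the column names are taken from the example above.

```python
import pandas as pd

# Hypothetical values standing in for the state_gdp data set
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "gdp_growth_2009": [-1.0, 0.5, -2.0, 1.5],
    "gdp_growth_2010": [0.8, -0.3, -0.1, 2.0],
})

negative_2010 = toy["gdp_growth_2010"] < 0  # Boolean Series, one entry per row
rows_only = toy[negative_2010]              # selected rows, all columns
rows_and_cols = toy.loc[negative_2010, ["state", "gdp_growth_2010"]]
label_slice = toy.loc[1:2, "state"]         # loc slices by label and includes both ends
print(rows_and_cols)
```

Note that loc[1:2] returns two rows here, because label-based slices include the end label, unlike ordinary Python slices.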
Columns can be deleted by (i) the del keyword, (ii) pop(column) and (iii)
drop(list of columns, axis=1).
del simply deletes the Series from the DataFrame.
pop(column) deletes the Series and also returns it as an output.
drop() returns a new DataFrame with the Series dropped, without modifying the
original DataFrame.
# Deleting a column
state_gdp_copy = state_gdp.copy()
state_gdp_copy.index = state_gdp['state_code']  # replace index with state_code
# Keep only 'gdp_2009', 'gdp_growth_2011' and 'gdp_growth_2012'
state_gdp_copy = state_gdp_copy[['gdp_2009', 'gdp_growth_2011', 'gdp_growth_2012']]
state_gdp_copy.head()
# Drop 'gdp_2009'
state_gdp_copy = state_gdp_copy.drop('gdp_2009', axis=1)
state_gdp_copy.head()
# Delete 'gdp_growth_2012'
gdp_growth_2012 = state_gdp_copy.pop('gdp_growth_2012')
gdp_growth_2012.head()
state_gdp_copy.head()
# Delete 'gdp_growth_2011'
del state_gdp_copy['gdp_growth_2011']
state_gdp_copy.head()
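The difference between the three approaches is easiest to see on a small DataFrame. In this sketch the column names and values are made up; it records the columns after each step to show which calls change the original object.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

dropped = df.drop(["a"], axis=1)    # returns a new DataFrame; df keeps column 'a'
cols_after_drop = list(df.columns)

b = df.pop("b")                     # removes 'b' from df and returns it as a Series
del df["c"]                         # removes 'c' from df in place, returns nothing

print(cols_after_drop, list(dropped.columns), list(df.columns))
```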
Some useful functions and methods are listed in the table below.
# Using insert()
# .copy() gives an independent DataFrame, so insert() does not act on a view of state_gdp
state_gdp_2012 = state_gdp[['state', 'gdp_2012']].copy()  # create a new DataFrame: state_gdp_2012
state_gdp_2012.insert(1, 'gdp_growth_2012', state_gdp['gdp_growth_2012'])
state_gdp_2012.head()
# Using drop(), dropna() and drop_duplicates()
df = pd.DataFrame(np.array([[1, np.nan, 3, 8], [np.nan, 2, 3, 5], [10, 2, 3, np.nan],
                            [10, 2, 3, np.nan], [10, 2, 3, 11]]))
df.columns = ['one', 'two', 'three', 'four']  # assign names to columns
df.index = ['a', 'b', 'c', 'd', 'e']  # assign labels to index
df.drop('a', axis=0)  # removes row 'a'
df.drop(['a', 'c'], axis=0)  # removes rows 'a' and 'c'
df.dropna()  # removes every row containing a NaN (only row 'e' remains)
df.drop_duplicates()  # removes row 'd'
df.drop('one', axis=1)  # removes column 'one'
Statistical Distributions
Computer-simulated random numbers are not truly random; they are
generally referred to as pseudo-random numbers.
The legacy functions in numpy.random (such as randn and rand) share one core
random number generator based on the Mersenne Twister.
numpy.random.seed initializes this random number generator. To reproduce the
same random numbers, set the same seed before drawing them.
np.random.seed(0)
np.random.randn()
Out: 1.764052345967664
np.random.seed(0)
np.random.randn()
Out: 1.764052345967664
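NumPy (version 1.17 and later) also offers a newer interface, np.random.default_rng, which is not used in these notes. Each generator object carries its own seed and state, and its default bit generator is PCG64 rather than the Mersenne Twister. The sketch below shows reproducibility under both interfaces.

```python
import numpy as np

# Legacy interface: reseeding the global generator replays the same draws
np.random.seed(0)
first = np.random.randn(3)
np.random.seed(0)
second = np.random.randn(3)

# Generator interface: two generators built from the same seed give the same draws
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
draws_a = rng_a.standard_normal(3)
draws_b = rng_b.standard_normal(3)

print(np.array_equal(first, second), np.array_equal(draws_a, draws_b))
```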
import scipy as sp
import scipy.stats  # makes sure sp.stats is loaded on older SciPy versions
sp.stats.norm.rvs(loc=2, scale=3, size=10)  # generates 10 rvs from N(2, 9)
Out:
array([ 0.31613221, -1.05744118,  2.28474865,  5.43251686, -1.97227871,
        2.06680403,  2.18448145,  5.38146375,  2.71106676,  1.60296263])
sp.stats.norm.pdf(1.96, loc=0, scale=1)  # evaluate normal pdf at 1.96
Out: 0.058440944333451476
sp.stats.norm.cdf(-1.96, loc=0, scale=1)  # evaluate normal cdf at -1.96
Out: 0.024997895148220435
sp.stats.norm.ppf(0.95, loc=0, scale=1)  # return quantile at the lower tail prob 0.95
Out: 1.6448536269514722
x = sp.stats.norm.rvs(loc=1, scale=5, size=1000)
location, scale = sp.stats.norm.fit(x)
location, scale = sp.stats.norm.fit(x, loc=1, scale=3)  # loc and scale give starting values for the search
print('(location, scale)=', (location, scale))
Out: (location, scale)= (1.2632581286252547, 4.8100742790320625)
sp.stats.norm.median(loc=3, scale=1)  # returns median of N(3, 1)
Out: 3.0
sp.stats.norm.mean(loc=3, scale=2)  # returns mean of N(3, 4)
Out: 3.0
sp.stats.norm.var(loc=3, scale=2)  # returns variance of N(3, 4)
Out: 4.0
sp.stats.norm.std(loc=3, scale=2)  # returns std of N(3, 4)
Out: 2.0
sp.stats.norm.moment(2, loc=0, scale=1)  # the second non-central moment of N(0, 1)
Out: 1.0
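Two relationships among these methods are worth checking numerically: ppf is the inverse of cdf, and loc and scale simply shift and rescale the standard normal. A minimal sketch (the evaluation point 5.0 is an arbitrary choice for this illustration):

```python
from scipy import stats

p = stats.norm.cdf(-1.96, loc=0, scale=1)  # lower-tail probability at -1.96
q = stats.norm.ppf(p, loc=0, scale=1)      # ppf maps the probability back to -1.96

# For N(2, 9), i.e. loc=2 and scale=3, standardizing gives the same probability
z = (5.0 - 2.0) / 3.0
difference = stats.norm.cdf(5.0, loc=2, scale=3) - stats.norm.cdf(z)

print(p, q, difference)
```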
Regression analysis
There are two options for running OLS regressions in statsmodels: (i) smf.ols()
and (ii) sm.OLS(), where smf and sm are the usual aliases for
statsmodels.formula.api and statsmodels.api. The first option lets users specify
statistical models using R-style formulas.
Wage Regression
The data set contains 1000 observations on wages from the US Bureau of the Census
Current Population Survey, March 1995.
The underlying population is the employed labor force, age 18-65. The
variables are as follows:
1. hourly wage
2. female (1 = worker is female)
3. non-white (1 = worker is non-white)
4. unionmember (1 = worker is unionized)
5. education (years of education)
6. experience (years of work experience)
7. age
The smf module hosts many of the same functions found in the sm module
(e.g. OLS, GLM). Use dir(smf) to list available models.
import statsmodels.formula.api as smf
# wage_data is assumed to be a DataFrame that already holds the wage data
model1 = smf.ols(formula='wage ~ female + nonwhite + unionmember + education + experience',
                 data=wage_data)
# We need to use .fit() to obtain parameter estimates
result1 = model1.fit()
# We now have the fitted regression model stored in result1
# To view the OLS regression results, we can call the .summary() method
result1.summary()