Principal Component Analysis Notes : Info
Principal Component Analysis Notes : Info
Principal Component Analysis Notes : Info
Info¶
Principal Component Analysis (PCA) is a simple yet popular and useful linear
transformation technique that is used in numerous applications, such as stock
market predictions, the analysis of gene expression data, and many more. In
this tutorial, we will see that PCA is not just a “black box”, and we are going
to unravel its internals in 3 basic steps.
PCA uses¶
PCA is effected by scale so you need to scale the features in your data before
applying PCA. Use StandardScaler to help you standardize the dataset’s fea-
tures onto unit scale (mean = 0 and variance = 1) which is a requirement for
the optimal performance of many machine learning algorithms. If you want to
see the negative effect not scaling your data can have, scikit-learn has a section
on the effects of not standardizing your data.
http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_
importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-
py
Using PCA in Python We will perform PCA two ways. First, with techniques
from linear algebra, to get an idea of what we are doing. Second, with packages
from scikit-learn, which allow us to do PCA in a single line!
1
With techniques from linear algebra This blog post by Sebastian Raschka pro-
vides a clean and useful overview of our approach to PCA. This is what we will
implement in the following lines to reduce our data to k dimensions:
Standardize the data. (To make quantitative comparisons of variance, we want
to be sure each measurement varies to a similar extent; more on this later.) The
variance can be like 10,9,8,7,11, so variance about mean of each feature is what
we should take. for e.g. other feature might be 102,100,89,105 - the variance of
both these features will be similar.
But they’re different distances from zero! We’ll center all measurements at the
same point, the mean.
$\sigma^2=\sum\frac{(x-\mu)^2}{N}$
Compute the covariance matrix and use eigenvalue decomposition to obtain
the eigenvectors and eigenvalues. Select the k largest eigenvalues and their
associated eigenvectors. Transform the data into a k dimensional subspace using
those k eigenvectors. Let’s give it a try!
We won’t cover the math behind this procedure. However, it can be shown that
the principal component directions are given by the eigenvectors of the matrix,
and the magnitudes of the components are given by the eigenvalues.
Most of the available algorithms to do PCA use singular value decomposi-
tion instead for computational efficiency. But regardless of the algorithm
the objective is still the same: compute the eigenvectors and eigenvalues from
the covariance matrix.
What it means¶
2
Munging¶
Steps to follow¶
In [7]:
from sklearn.preprocessing import StandardScaler
df_scaled_data1=StandardScaler().fit_transform(X=df_data1)
# Remember to check if the data has been scaled ;-)
IMP!¶
What algo is getting used in PCA? What are the default features of the algo
Understand these to be able to better tune and understand your findings
In [124]:
3
PCA_OBJ=PCA(n_components=4,svd_solver="full")
# If we gave 0 < n_components < 1, it would select principal components
# that can explain variance greater than the % specified
# One would be interested in getting lower no of PCAs than the input feature set
# But want to see/visualize how the other two components fare
In [125]:
PC=PCA_OBJ.fit(X=df_scaled_data1)
# Going through two step process of first fitting the data and then transforming it
In [126]:
PC.explained_variance_,PC.n_components_,PC.n_features_,PC.noise_variance_,PC.singular_values
Out[126]:
(array([2.93035378, 0.92740362, 0.14834223, 0.02074601]),
4,
4,
0.0,
array([20.89551896, 11.75513248, 4.7013819 , 1.75816839]))
In [36]:
arr_PC=PC.transform(X=df_scaled_data1)
In [38]:
df_PC=pd.DataFrame(arr_PC,columns=["PC1","PC2","PC3","PC4"])
In [50]:
df_PC_Target=pd.concat([df_PC,df['target']],axis=1)
# why I need to do this - is because I have to filter for target
# Which means the rows across the transformation are retained
In [51]:
finalDf=df_PC_Target
In [55]:
%matplotlib inline
In [56]:
import matplotlib.pyplot as plt
In [122]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('PC1', fontsize = 15,color="white")
ax.set_ylabel('PC2', fontsize = 15,color="white")
ax.set_title('2 component PCA', fontsize = 20,color="white")
4
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
indicesToKeep = finalDf['target'] == target
ax.scatter(finalDf.loc[indicesToKeep, 'PC1']
, finalDf.loc[indicesToKeep, 'PC2']
, c = color
, s = 50)
ax.legend(targets)
ax.grid()
In [46]:
from itertools import combinations
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
5
plt.style.use("ggplot")
class PandasScatterPlot():
def __init__(self,pdx,choice,target,title,size,subplots,drop=None):
"""
pdx= panda database
choice = [of columns to plot against the target column]
drop = [drop the columns that we don't need for plot]
target = target column name
size = [width,height] of the graphs/plots
subplots = [nrows,ncols] for the subplot() option
"""
if drop != None:
self.pdx=pdx.drop(columns=drop)
else:
self.pdx=pdx
self.target=target
self.choice=choice
self.title=title
self.target_set=set(pdx[target].values)
self.choice_list=list(combinations(choice,2))
self.size=size
self.subplots=subplots
#print(self.choice_list)
return
def plot(self):
# this has to be worked out
# @todo
#subplot_list=[211,212,221,222,231,232]
#subplot_list=[311,312,312,321,322,323]
count=0
# Becaureful if rows or cols =1 , different handling
f,ax=plt.subplots(self.subplots[0],self.subplots[1],constrained_layout=True) # rows,
f.suptitle(self.title,color="black",fontsize=25)
f.set_size_inches(self.size[0],self.size[1]) # width, height
compare_list=list(self.choice_list)
if self.subplots[0] == 1 or self.subplots[1] == 1:
# then the plot has to be ax[] and not ax[,]
max_value=self.subplots[0]+self.subplots[1]-1
for row in range(max_value):
#for col in range(self.subplots[1]):
#print("Plotting : {}".format(compare))
compare=compare_list[count]
6
count=count+1
In [ ]:
PDSP=PandasScatterPlot(df,size=[10,38],subplots=[6,1],choice=filter_data_cols,drop=None,targ
In [ ]:
PDSP.target_set
In [ ]:
7
PDSP.plot()
In [339]:
PDSP=PandasScatterPlot(df_PC_Target,size=[16,20],subplots=[3,2],choice=['PC1','PC2','PC3','P
In [331]:
PDSP.plot()
8
9
Try out PCA with random data and see what are
the plots like¶
The intent is to generate random data and see how it performs with respect to
PCA.
• generate Arr[100,4] random matrix 0 to 1.0
• generate Classification Label/Target randomly from 3 variables R,G,B
• take PCA of the data and then plot 2-D for all four PCAs and check how
well it performed
In [77]:
arr2=np.random.random([100,4])
In [78]:
df3_target=pd.DataFrame(np.array([np.random.choice(['R','G','B']) for i in range(100)]),colu
Another way to do the above categorical column addition
- df2=pd.DataFrame(arr2,columns=['D1','D2','D3','D4'])
- df2['target']=pd.DataFrame(np.array([np.random.choice(['R','G','B']))
In [98]:
pd.value_counts(df3_target.values.transpose()[0])
Out[98]:
G 43
R 32
B 25
dtype: int64
In [99]:
df2=pd.DataFrame(arr2,columns=['D1','D2','D3','D4'])
In [101]:
df2.mean(axis=0,skipna=True)
Out[101]:
D1 0.518460
D2 0.508468
D3 0.508047
D4 0.465573
dtype: float64
In [102]:
df2.var(axis=0,skipna=True)
10
Out[102]:
D1 0.089540
D2 0.085677
D3 0.075701
D4 0.089832
dtype: float64
In [104]:
from sklearn.decomposition import PCA
In [105]:
df5=df4_target.drop(columns=['target']).values
In [116]:
PCA1=PCA(n_components=4,svd_solver="full")
PCA2=PCA1.fit(X=df5)
PC_2=PCA2.transform(X=df5)
In [117]:
df6_PC=pd.DataFrame(PC_2,columns=['PC1','PC2','PC3','PC4'])
In [118]:
df6_PC_target=pd.concat([df6_PC,df3_target],axis=1)
In [119]:
df6_PC_target.shape
Out[119]:
(100, 5)
In [120]:
df6_PC_target.head()
Out[120]:
In [121]:
plot3=PandasScatterPlot(df6_PC_target,size=[10,15],subplots=[3,2],choice=['PC1','PC2','PC3',
11
In [122]:
plot3.plot()
12
PCA for random data¶
The PCA shows that it wouldn’t have been useful So after we get the PCA
values how to check their usefulness?
• Check the Explained Variance Ratio
• Check the Singular values
In [123]:
PCA2.singular_values_ # Singular Values for random data
Out[123]:
array([3.20485419, 3.11146084, 2.71367996, 2.53335704])
In [130]:
PCA2.explained_variance_ratio_ # Check the variance ratio
Out[130]:
array([0.30447082, 0.28698408, 0.2182963 , 0.1902488 ])
In [133]:
PCA2.n_features_
Out[133]:
4
In [127]:
PC.singular_values_ # Singular values for Iris data
Out[127]:
array([20.89551896, 11.75513248, 4.7013819 , 1.75816839])
In [131]:
PC.explained_variance_ratio_ # Check the variance raio
Out[131]:
array([0.72770452, 0.23030523, 0.03683832, 0.00515193])
13
import seaborn as sns
In [143]:
import sklearn.datasets
# Import the Iris dataset and convert it into a Pandas DataFrame
iris = sklearn.datasets.load_iris()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [144]:
sns.pairplot(df_iris,hue='species')
Out[144]:
<seaborn.axisgrid.PairGrid at 0x24fd3b44cc0>
14
PCA not using sklearn¶
In [192]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
pd1 = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','tar
15
df_scaled_data2=StandardScaler().fit_transform(X=df_data4)
# Remember to check if the data has been scaled ;-)
16
array([[ 1.00671141, -0.11010327, 0.87760486, 0.82344326],
[-0.11010327, 1.00671141, -0.42333835, -0.358937 ],
[ 0.87760486, -0.42333835, 1.00671141, 0.96921855],
[ 0.82344326, -0.358937 , 0.96921855, 1.00671141]])
In [195]:
eig_vals, eig_vecs=np.linalg.eig(cov_mat)
In [196]:
eig_vals
Out[196]:
array([2.93035378, 0.92740362, 0.14834223, 0.02074601])
In [197]:
eig_vecs
Out[197]:
array([[ 0.52237162, -0.37231836, -0.72101681, 0.26199559],
[-0.26335492, -0.92555649, 0.24203288, -0.12413481],
[ 0.58125401, -0.02109478, 0.14089226, -0.80115427],
[ 0.56561105, -0.06541577, 0.6338014 , 0.52354627]])
In [208]:
np.dot(eig_vecs,eig_vecs.transpose()).round(2)
# Zeros in upper and ower indicate that the vecs are perpendicular to each other
# Just checking !!!
Out[208]:
array([[ 1., -0., -0., -0.],
[-0., 1., 0., -0.],
[-0., 0., 1., 0.],
[-0., -0., 0., 1.]])
Key Understanding¶
Now we have the data(n,m) and eig(m,m) matrix, take their multtiplication to
get the final output of the Principal components So the Principal Components
= (Feature columns) * (eigen vectors) A Linear combination of the Feature
Columns
17
We will take 2 feature example as the Principal
components¶
As it is easier to visualize with 2
Two components chosen visually from Seaborn plot which shows best classifica-
tion of the flowers
In [163]:
filter_data_cols=['petal length','petal width']
df_data2=df.loc[:,filter_data_cols].values
# We are taking the numpy array version of the data and not the DataFrame version to be pass
df_scaled_data2=StandardScaler().fit_transform(X=df_data2)
# Remember to check if the data has been scaled ;-)
cov_mat2
eig_val2,eig_vec2=np.linalg.eig(cov_mat2)
print(eig_val2)
print(eig_vec2)
Out[163]:
(array([0.03749286, 1.97592996]), array([[-0.70710678, -0.70710678],
[ 0.70710678, -0.70710678]]))
In [209]:
# Swapping the cols
df_scaled_data2=StandardScaler().fit_transform(X=df_data2)
# Remember to check if the data has been scaled ;-)
cov_mat2
18
eig_val2,eig_vec2=np.linalg.eig(cov_mat2)
print(eig_val2) # CLEARLY showing that eigen value 0 accounts for 99% of the variance
print(eig_vec2)
[1.97592996 0.03749286]
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
In [169]:
# NOTE : Eig vectors are
Eig_vec_1 = eig_vec2[:,0] # First Eigen Vector
Eig_vec_2 = eig_vec2[:,1] # Second Eigen Vector
Analysis¶
Now we have the data(n,m) and eig(m,m) matrix, take their multtiplication to
get the final output of the Principal components So the Principal Components
= (Feature columns) * (eigen vectors) A Linear combination of the Feature
Columns
19
pd2=pd.DataFrame(PC1,columns=['PC1'])
In [179]:
pd2['PC2']=pd.DataFrame(PC2)
In [181]:
pd2['target']=pd1['target']
In [183]:
pd2.head()
Out[183]:
In [190]:
sns.pairplot(pd2,hue='target')
Out[190]:
<seaborn.axisgrid.PairGrid at 0x24fd3f5b978>
20
Something big wrong , big wrong¶
Which means we don’t need PC2, a single PC1 would have been more than
sufficient to plot the graphs!!! Introspect what wrong you did!!!
References¶
• Queries Do we normalize before or after taking PCA? Of-course
when we talk about variance We refer to subtracting mean and then squar-
ing it Do we remove outliers before doing variance As these values
can screw up the mean. Why not co-corr matrix vs co-variance matrix
The difference being co-variance has mean 0 for each feature set. ** PCA
as SVD Understand SVD as well
• Implementation ** Scikit [[https://towardsdatascience.com/pca-
using-python-scikit-learn-e653f8989e60%5D%5BExample_from_
datascience]] [[http://scikit-learn.org/stable/auto_examples/plot_
feature_stacker.html#sphx-glr-auto-examples-plot-feature-stacker-
py%5D%5BConcatenate_multiple_feature_extraction_feature]]
21
• References [[http://www.lauradhamilton.com/introduction-to-principal-
component-analysis-pca%5D%5BRefernce_to_understand_PCA]]
Chapter 12 from Bishop PCA Some of the PDFs from Book/PCA
** Learn the graphical techniques for PCA here [[https://www.kaggle.
com/strakul5/principal-component-analysis-of-pokemon-data%5D%
5BGraphical_techniques_PCA]]
In [213]:
%timeit
In [214]:
lambda x: x*x
Out[214]:
<function __main__.<lambda>(x)>
22