
PCA

Principal Component Analysis Notes¶

Info¶

Principal Component Analysis (PCA) is a simple yet popular and useful linear
transformation technique that is used in numerous applications, such as stock
market predictions, the analysis of gene expression data, and many more. In
this tutorial, we will see that PCA is not just a “black box”, and we are going
to unravel its internals in 3 basic steps.

PCA uses¶

PCA is mainly used for two purposes:


1. Visualization
2. Dimensionality reduction, to speed up processing

Example 1 (Visualization)¶

The following notes are from the web page
https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

Standardize the Data¶

PCA is affected by scale, so you need to scale the features in your data before
applying PCA. Use StandardScaler to standardize the dataset's features onto
unit scale (mean = 0 and variance = 1), which is a requirement for the optimal
performance of many machine learning algorithms. If you want to see the
negative effect that not scaling your data can have, scikit-learn has a section
on the effects of not standardizing your data:
http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
Using PCA in Python¶

We will perform PCA in two ways. First, with techniques from linear algebra, to get an idea of what we are doing. Second, with packages from scikit-learn, which allow us to do PCA in a single line!

With techniques from linear algebra¶

This blog post by Sebastian Raschka provides a clean and useful overview of our approach to PCA. This is what we will implement in the following lines to reduce our data to k dimensions:

1. Standardize the data. (To make quantitative comparisons of variance, we want each measurement to vary to a similar extent; more on this later.) For example, one feature's values might be 10, 9, 8, 7, 11 while another's are 102, 100, 89, 105: the variance about the mean is similar for both, but they sit at very different distances from zero. So we take each feature's variance about its own mean and center all measurements at the same point, the mean:
$\sigma^2=\frac{1}{N}\sum (x-\mu)^2$
2. Compute the covariance matrix and use eigenvalue decomposition to obtain the eigenvectors and eigenvalues.
3. Select the k largest eigenvalues and their associated eigenvectors.
4. Transform the data into a k-dimensional subspace using those k eigenvectors.

Let's give it a try!
We won't cover the math behind this procedure. However, it can be shown that
the principal component directions are given by the eigenvectors of the covariance matrix,
and the magnitudes of the components are given by the eigenvalues.
Most available PCA implementations use singular value decomposition (SVD)
instead, for computational efficiency. But regardless of the algorithm,
the objective is the same: compute the eigenvectors and eigenvalues of
the covariance matrix.
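As a hedged, self-contained illustration (my own toy data and variable names, not part of the original notes), the cell below checks on a small random matrix that the two routes agree: eigendecomposition of the covariance matrix and SVD of the centered data give the same variances, and the same directions up to sign.
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # toy data: 100 samples, 4 features
Xc = X - X.mean(axis=0)                  # center each feature

# Route 1: eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)           # (4, 4) covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]       # sort by decreasing eigenvalue
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Route 2: SVD of the centered data (what most PCA implementations actually do)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S**2 / (Xc.shape[0] - 1)      # squared singular values / (n-1) = eigenvalues

print(np.allclose(eig_vals, svd_vals))               # same explained variances
print(np.allclose(np.abs(eig_vecs), np.abs(Vt.T)))   # same directions, up to sign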

What it means¶

The eigenvectors (principal components) determine the directions of the new
feature space, and the eigenvalues determine their magnitude. In other words,
the eigenvalues explain the variance of the data along the new feature axes.
In [2]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])
In [3]:
import numpy as np

Munging¶

Steps to follow¶

1. Find the NaNs.
2. Find the outliers. Anything more than 2 standard deviations from the mean can be replaced with the 2-std boundary value; alternatively, throw those rows out, or keep them aside to test the system later.
3. Decide whether to replace NaNs (mean/mode/median). Mean is often the safer choice, as mode and median can have challenges with fractional random data. Read about "imputation" / "resampling" techniques for this (see the sketch after this list).
4. What if the NaN lies in your label/target column? Separate those rows out, and later analyze the predictions to see how your model responds to them.
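A minimal, hedged sketch of steps 1–3 (the 2-std clipping rule and mean imputation are just the options listed above, not a universal recipe; the helper name is mine):
In [ ]:
import numpy as np
import pandas as pd

def basic_munge(df_in, feature_cols):
    """Report NaNs, clip outliers beyond 2 standard deviations, mean-impute what remains."""
    out = df_in.copy()
    # 1. Find out the NaNs per feature column
    print(out[feature_cols].isna().sum())
    for col in feature_cols:
        mu, sigma = out[col].mean(), out[col].std()
        # 2. Replace anything more than 2 stds from the mean with the 2-std boundary value
        out[col] = out[col].clip(lower=mu - 2 * sigma, upper=mu + 2 * sigma)
        # 3. Mean imputation for any remaining NaNs
        out[col] = out[col].fillna(out[col].mean())
    # 4. Rows with NaN in the target column would be separated out before fitting
    return out

# df_clean = basic_munge(df, ['sepal length','sepal width','petal length','petal width'])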
In [3]:
df.head()
Out[3]:

sepal length sepal width petal length petal width target


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

In [7]:
from sklearn.preprocessing import StandardScaler

filter_data_cols=['sepal length','sepal width','petal length','petal width']


df_data1=df.loc[:,filter_data_cols].values
# We are taking the NumPy array version of the data, not the DataFrame, to be passed to the scaler

df_scaled_data1=StandardScaler().fit_transform(X=df_data1)
# Remember to check if the data has been scaled ;-)

from sklearn.decomposition import PCA
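The comment above says to check whether the data really has been scaled; a quick hedged way to do that (not part of the original cell) is to confirm that each column now has mean ≈ 0 and standard deviation ≈ 1:
In [ ]:
# Each feature of the scaled array should have mean ~0 and standard deviation ~1
print(df_scaled_data1.mean(axis=0).round(3))
print(df_scaled_data1.std(axis=0).round(3))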

IMP!¶

Which algorithm is used under the hood in PCA? What are its default parameters?
Understand these to be able to better tune and interpret your findings.
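One hedged way to answer these questions straight from the library (not in the original cells) is to print the estimator's default parameters with get_params() and read the docstring:
In [ ]:
from sklearn.decomposition import PCA

# Defaults: n_components=None keeps all components, and svd_solver='auto'
# chooses between the full and randomized SVD based on the data shape
print(PCA().get_params())
# help(PCA)   # uncomment to read the full docstring, including the solver details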
In [124]:

PCA_OBJ=PCA(n_components=4,svd_solver="full")
# If we gave 0 < n_components < 1, it would select the number of principal components
# needed to explain at least the specified fraction of the variance
# Normally one wants fewer principal components than input features,
# but here we keep all four so we can see how the other two components fare
In [125]:
PC=PCA_OBJ.fit(X=df_scaled_data1)
# Going through two step process of first fitting the data and then transforming it
In [126]:
PC.explained_variance_,PC.n_components_,PC.n_features_,PC.noise_variance_,PC.singular_values_
Out[126]:
(array([2.93035378, 0.92740362, 0.14834223, 0.02074601]),
4,
4,
0.0,
array([20.89551896, 11.75513248, 4.7013819 , 1.75816839]))
In [36]:
arr_PC=PC.transform(X=df_scaled_data1)
In [38]:
df_PC=pd.DataFrame(arr_PC,columns=["PC1","PC2","PC3","PC4"])
In [50]:
df_PC_Target=pd.concat([df_PC,df['target']],axis=1)
# We need the target column back so we can filter by class when plotting;
# the rows keep their order across the transformation, so this concat lines up correctly
In [51]:
finalDf=df_PC_Target
In [55]:
%matplotlib inline
In [56]:
import matplotlib.pyplot as plt
In [122]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('PC1', fontsize = 15,color="white")
ax.set_ylabel('PC2', fontsize = 15,color="white")
ax.set_title('2 component PCA', fontsize = 20,color="white")

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC1'],
               finalDf.loc[indicesToKeep, 'PC2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()

In [46]:
from itertools import combinations
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

plt.style.use("ggplot")

class PandasScatterPlot():
    def __init__(self, pdx, choice, target, title, size, subplots, drop=None):
        """
        pdx      = pandas DataFrame
        choice   = [columns to plot pairwise against each other, coloured by the target column]
        drop     = [columns we don't need for the plot]
        target   = target column name
        title    = overall figure title
        size     = [width, height] of the figure
        subplots = [nrows, ncols] for the plt.subplots() call
        """
        if drop is not None:
            self.pdx = pdx.drop(columns=drop)
        else:
            self.pdx = pdx
        self.target = target
        self.choice = choice
        self.title = title
        self.target_set = set(pdx[target].values)
        self.choice_list = list(combinations(choice, 2))   # all pairs of chosen columns
        self.size = size
        self.subplots = subplots
        return

    def plot(self):
        count = 0
        # Be careful: if nrows or ncols == 1 the axes array is 1-D and needs different indexing
        f, ax = plt.subplots(self.subplots[0], self.subplots[1], constrained_layout=True)  # rows, cols
        f.suptitle(self.title, color="black", fontsize=25)
        f.set_size_inches(self.size[0], self.size[1])  # width, height

        compare_list = list(self.choice_list)
        if self.subplots[0] == 1 or self.subplots[1] == 1:
            # then the axes have to be indexed as ax[i] and not ax[i, j]
            max_value = self.subplots[0] + self.subplots[1] - 1
            for row in range(max_value):
                compare = compare_list[count]
                count = count + 1
                ax[row].set_xlabel(compare[0], fontsize=15, color="black")
                ax[row].set_ylabel(compare[1], fontsize=15, color="black")
                colors = ['r', 'g', 'b', 'grey', 'black', 'orange', 'yellow', 'cyan']
                for target_choice, color in zip(self.target_set, colors):
                    indicesToKeep = self.pdx[self.target] == target_choice
                    ax[row].scatter(self.pdx.loc[indicesToKeep, compare[0]],
                                    self.pdx.loc[indicesToKeep, compare[1]],
                                    c=color, s=50)
                ax[row].legend(self.target_set)
                ax[row].grid()
        else:
            for row in range(self.subplots[0]):
                for col in range(self.subplots[1]):
                    compare = compare_list[count]
                    count = count + 1
                    ax[row, col].set_xlabel(compare[0], fontsize=15, color="black")
                    ax[row, col].set_ylabel(compare[1], fontsize=15, color="black")
                    colors = ['r', 'g', 'b', 'grey', 'black', 'orange', 'yellow', 'cyan']
                    for target_choice, color in zip(self.target_set, colors):
                        indicesToKeep = self.pdx[self.target] == target_choice
                        ax[row, col].scatter(self.pdx.loc[indicesToKeep, compare[0]],
                                             self.pdx.loc[indicesToKeep, compare[1]],
                                             c=color, s=50)
                    ax[row, col].legend(self.target_set)
                    ax[row, col].grid()
        plt.show()
        #f.subplots_adjust(hspace=0.5, wspace=0.5)
        return

In [ ]:
PDSP=PandasScatterPlot(df,size=[10,38],subplots=[6,1],choice=filter_data_cols,drop=None,target='target',title='Iris feature pairs')
In [ ]:
PDSP.target_set
In [ ]:

PDSP.plot()
In [339]:
PDSP=PandasScatterPlot(df_PC_Target,size=[16,20],subplots=[3,2],choice=['PC1','PC2','PC3','PC4'],target='target',title='Iris principal components')
In [331]:
PDSP.plot()

Try out PCA with random data and see what the plots look like¶

The intent is to generate random data and see how it performs with respect to PCA.

• Generate a random Arr[100,4] matrix with values between 0 and 1.0
• Generate a classification label/target randomly from the 3 values R, G, B
• Take the PCA of the data, plot 2-D scatter plots for all four principal components, and check how well it performed
In [77]:
arr2=np.random.random([100,4])
In [78]:
df3_target=pd.DataFrame(np.array([np.random.choice(['R','G','B']) for i in range(100)]),columns=['target'])
Another way to do the above categorical column addition:
- df2=pd.DataFrame(arr2,columns=['D1','D2','D3','D4'])
- df2['target']=pd.DataFrame(np.array([np.random.choice(['R','G','B']) for i in range(100)]))
In [98]:
pd.value_counts(df3_target.values.transpose()[0])
Out[98]:
G 43
R 32
B 25
dtype: int64
In [99]:
df2=pd.DataFrame(arr2,columns=['D1','D2','D3','D4'])
In [101]:
df2.mean(axis=0,skipna=True)
Out[101]:
D1 0.518460
D2 0.508468
D3 0.508047
D4 0.465573
dtype: float64
In [102]:
df2.var(axis=0,skipna=True)

Out[102]:
D1 0.089540
D2 0.085677
D3 0.075701
D4 0.089832
dtype: float64
In [104]:
from sklearn.decomposition import PCA
In [105]:
df5=df4_target.drop(columns=['target']).values
# (df4_target is presumably df2 with the target column attached; that concat cell is not shown above)
In [116]:
PCA1=PCA(n_components=4,svd_solver="full")
PCA2=PCA1.fit(X=df5)
PC_2=PCA2.transform(X=df5)
In [117]:
df6_PC=pd.DataFrame(PC_2,columns=['PC1','PC2','PC3','PC4'])
In [118]:
df6_PC_target=pd.concat([df6_PC,df3_target],axis=1)
In [119]:
df6_PC_target.shape
Out[119]:
(100, 5)
In [120]:
df6_PC_target.head()
Out[120]:

PC1 PC2 PC3 PC4 target


0 0.372860 -0.429390 0.140535 0.152105 R
1 0.286631 0.469318 -0.163808 0.087128 G
2 -0.315111 -0.324368 -0.154092 -0.007522 R
3 0.226525 -0.085991 -0.295795 0.166544 G
4 0.722024 0.182921 -0.062515 0.022801 R

In [121]:
plot3=PandasScatterPlot(df6_PC_target,size=[10,15],subplots=[3,2],choice=['PC1','PC2','PC3','PC4'],target='target',title='Random data principal components')

In [122]:
plot3.plot()

PCA for random data¶

The PCA shows that it would not have been useful here. So after we get the principal components, how do we check their usefulness? (See the sketch after this list.)

• Check the explained variance ratio
• Check the singular values
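A hedged sketch of the first check (my own cell, reusing the PCA2 object fitted above): look at the cumulative explained variance ratio and count how many components are needed to cross a threshold such as 90%.
In [ ]:
import numpy as np

cumulative = np.cumsum(PCA2.explained_variance_ratio_)
print(cumulative)
# Number of components needed to explain at least 90% of the variance
print(int(np.searchsorted(cumulative, 0.90) + 1))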
In [123]:
PCA2.singular_values_ # Singular Values for random data
Out[123]:
array([3.20485419, 3.11146084, 2.71367996, 2.53335704])
In [130]:
PCA2.explained_variance_ratio_ # Check the variance ratio
Out[130]:
array([0.30447082, 0.28698408, 0.2182963 , 0.1902488 ])
In [133]:
PCA2.n_features_
Out[133]:
4
In [127]:
PC.singular_values_ # Singular values for Iris data
Out[127]:
array([20.89551896, 11.75513248, 4.7013819 , 1.75816839])
In [131]:
PC.explained_variance_ratio_ # Check the variance ratio
Out[131]:
array([0.72770452, 0.23030523, 0.03683832, 0.00515193])

Seaborn plot tool¶

Use pairplot to get an idea of the correlations between all your features. It also shows the distribution of each parameter/column/feature.
In [139]:

import seaborn as sns
In [143]:
import sklearn.datasets
# Import the Iris dataset and convert it into a Pandas DataFrame
iris = sklearn.datasets.load_iris()

# Uncomment if you want to print the dataset description


# print(iris.DESCR)

# Make a DataFrame with a species column


df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['species'] = iris.target_names[iris.target]

# Take a look at df_iris


df_iris.head()
Out[143]:

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

In [144]:
sns.pairplot(df_iris,hue='species')
Out[144]:
<seaborn.axisgrid.PairGrid at 0x24fd3b44cc0>

PCA without using sklearn¶
In [192]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
pd1 = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

filter_data_cols=['sepal length','sepal width','petal length','petal width']


df_data4=df.loc[:,filter_data_cols].values
# We are taking the NumPy array version of the data, not the DataFrame, to be passed to the scaler

df_scaled_data2=StandardScaler().fit_transform(X=df_data4)
# Remember to check if the data has been scaled ;-)

Covariance Matrix and the math behind eigenvectors & eigenvalues¶

Our data set is (150, 4): 150 samples and 4 features. What we need is a covariance matrix of shape (4, 4). Why? Because we are trying to find the eigenvectors of these 4 components. Solving $\det(A-\lambda I)=0$ gives the characteristic equation, a degree-4 polynomial in $\lambda$, and hence 4 eigenvalues with their corresponding eigenvectors:

$\phi_1\lambda^4+\phi_2\lambda^3+\phi_3\lambda^2+\phi_4\lambda+\phi_5=0$

Why should $\det(A-\lambda I)$ be zero? Here $\lambda$ is an eigenvalue and $x$ is an eigenvector.

$Ax=\lambda x$ (the definition of an eigenvalue and eigenvector)
$Ax-\lambda x=0$
$(A-\lambda I)x=0$

If $(A-\lambda I)$ were invertible, the only solution satisfying this null-space equation would be $x=0$. So, given that a non-zero eigenvector exists, $(A-\lambda I)$ must be a singular matrix.

Singular matrix $\to$ non-invertible matrix $\to$ determinant $=0$

Hence $\det(A-\lambda I)=0$.

In essence, an eigenvector $\mathbf{v}$ of a linear transformation $T$ is a non-zero vector that does not change direction when $T$ is applied to it. Applying $T$ to the eigenvector only scales it by the scalar value $\lambda$, called an eigenvalue. This condition can be written as the equation

$T(\mathbf{v})=\lambda \mathbf{v}$

See the Wikipedia article on eigenvalues and eigenvectors for many more applications.
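As a small self-contained numeric check of the definitions above (a toy 2×2 symmetric matrix standing in for a covariance matrix; not the notebook's own data), each eigenvector keeps its direction under A, and the roots of the characteristic polynomial reproduce the eigenvalues:
In [ ]:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                  # toy symmetric "covariance-like" matrix
vals, vecs = np.linalg.eig(A)

for i in range(len(vals)):
    v = vecs[:, i]                          # eigenvectors are the COLUMNS of the returned matrix
    print(np.allclose(A @ v, vals[i] * v))  # A v == lambda v: the direction is unchanged

# The characteristic polynomial det(A - lambda*I) = 0 gives the same eigenvalues:
print(np.roots(np.poly(A)))                 # np.poly returns its coefficients for a square matrix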
In [193]:
# Note: these are NumPy ndarrays, so we can transpose to a (4, 150) matrix before np.cov
cov_mat=np.cov(df_scaled_data2.transpose())
In [194]:
cov_mat
Out[194]:

array([[ 1.00671141, -0.11010327, 0.87760486, 0.82344326],
[-0.11010327, 1.00671141, -0.42333835, -0.358937 ],
[ 0.87760486, -0.42333835, 1.00671141, 0.96921855],
[ 0.82344326, -0.358937 , 0.96921855, 1.00671141]])
In [195]:
eig_vals, eig_vecs=np.linalg.eig(cov_mat)
In [196]:
eig_vals
Out[196]:
array([2.93035378, 0.92740362, 0.14834223, 0.02074601])
In [197]:
eig_vecs
Out[197]:
array([[ 0.52237162, -0.37231836, -0.72101681, 0.26199559],
[-0.26335492, -0.92555649, 0.24203288, -0.12413481],
[ 0.58125401, -0.02109478, 0.14089226, -0.80115427],
[ 0.56561105, -0.06541577, 0.6338014 , 0.52354627]])
In [208]:
np.dot(eig_vecs,eig_vecs.transpose()).round(2)
# The off-diagonal zeros indicate that the eigenvectors are perpendicular (orthogonal) to each other
# Just checking!
Out[208]:
array([[ 1., -0., -0., -0.],
[-0., 1., 0., -0.],
[-0., 0., 1., 0.],
[-0., -0., 0., 1.]])

Key Understanding¶

Now we have the data (n, m) and the eigenvector matrix (m, m); take their matrix product to get the principal components. So: Principal Components = (feature columns) × (eigenvectors), i.e. each principal component is a linear combination of the feature columns.
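A hedged check of this statement, reusing arrays from earlier cells in this notebook (df_scaled_data2 with all four scaled features, eig_vecs from the covariance matrix, and arr_PC from sklearn); individual components may differ by a sign flip, and the eigenvector columns need to be in decreasing-eigenvalue order (which they already are above):
In [ ]:
# Principal components = (standardized feature columns) x (eigenvector matrix)
manual_PC = np.dot(df_scaled_data2, eig_vecs)          # shape (n_samples, 4)

# Compare against the sklearn transform computed earlier (arr_PC);
# compare absolute values because each component is only defined up to sign
print(np.allclose(np.abs(manual_PC), np.abs(arr_PC)))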

We will take a 2-feature example for the principal components¶

It is easier to visualize with 2 features. The two features are chosen visually from the Seaborn pairplot, as they show the best separation of the flower classes.
In [163]:
filter_data_cols=['petal length','petal width']
df_data2=df.loc[:,filter_data_cols].values
# We are taking the NumPy array version of the data, not the DataFrame, to be passed to the scaler

df_scaled_data2=StandardScaler().fit_transform(X=df_data2)
# Remember to check if the data has been scaled ;-)

cov_mat2=np.cov(df_scaled_data2.transpose()) # Convert the data to 2*n from n*2

cov_mat2

eig_val2,eig_vec2=np.linalg.eig(cov_mat2)

print(eig_val2)
print(eig_vec2)
Out[163]:
(array([0.03749286, 1.97592996]), array([[-0.70710678, -0.70710678],
[ 0.70710678, -0.70710678]]))
In [209]:
# Swapping the cols

filter_data_cols=['petal width','petal length']


df_data2=df.loc[:,filter_data_cols].values
# We are taking the NumPy array version of the data, not the DataFrame, to be passed to the scaler

df_scaled_data2=StandardScaler().fit_transform(X=df_data2)
# Remember to check if the data has been scaled ;-)

cov_mat2=np.cov(df_scaled_data2.transpose()) # Convert the data to 2*n from n*2

cov_mat2

eig_val2,eig_vec2=np.linalg.eig(cov_mat2)

print(eig_val2) # Clearly showing that the first eigenvalue (index 0) accounts for ~98% of the variance
print(eig_vec2)
[1.97592996 0.03749286]
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
In [169]:
# NOTE: the eigenvectors are the columns of eig_vec2
Eig_vec_1 = eig_vec2[:,0] # First Eigen Vector
Eig_vec_2 = eig_vec2[:,1] # Second Eigen Vector

# dot product is zero, they are perpendicular to each other


np.dot(Eig_vec_1,Eig_vec_2)
Out[169]:
0.0

Analysis¶

As above: the principal components are the product of the data (n, m) with the eigenvector matrix (m, m), so each principal component is a linear combination of the feature columns.

From linear algebra¶

When we multiply the data with Eig_vec_1:

• The data array is (150, 2); "petal width" is multiplied by Eig_vec_1[0] and "petal length" by Eig_vec_1[1]
• The products are then summed, so both attributes/features/columns contribute equally (50% each) to PC1 (the first principal component)
• For PC2 (the second principal component) we multiply with Eig_vec_2 instead
• In that case the first feature is subtracted from the second
In [171]:
# Finally the PC1 and PC2 computation
PC1 = np.dot(df_scaled_data2,Eig_vec_1)
PC2 = np.dot(df_scaled_data2,Eig_vec_2)
In [178]:

pd2=pd.DataFrame(PC1,columns=['PC1'])
In [179]:
pd2['PC2']=pd.DataFrame(PC2)
In [181]:
pd2['target']=pd1['target']
In [183]:
pd2.head()
Out[183]:

PC1 PC2 target


0 -0.020008 -0.020008 Iris-setosa
1 -0.020008 -0.020008 Iris-setosa
2 -0.060218 -0.060218 Iris-setosa
3 0.020202 0.020202 Iris-setosa
4 -0.020008 -0.020008 Iris-setosa

In [190]:
sns.pairplot(pd2,hue='target')
Out[190]:
<seaborn.axisgrid.PairGrid at 0x24fd3f5b978>

Something big wrong, big wrong¶

PC1 and PC2 come out identical, which would mean we don't need PC2 at all: a single PC1 would have been more than sufficient to plot the graphs. Introspect what went wrong!
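A hedged starting point for that introspection (my own guess, not something these notes establish: two identical columns in a notebook often come from stale state after re-running cells out of order), first confirming the symptom and then recomputing both projections fresh in a single cell:
In [ ]:
# Confirm whether the two columns really are identical
print(np.allclose(pd2['PC1'], pd2['PC2']))

# Recompute both projections in one cell so no stale eigenvectors are used
eig_val2, eig_vec2 = np.linalg.eig(np.cov(df_scaled_data2.transpose()))
PC1_fresh = df_scaled_data2 @ eig_vec2[:, 0]
PC2_fresh = df_scaled_data2 @ eig_vec2[:, 1]
print(np.allclose(PC1_fresh, PC2_fresh))   # should be False for two distinct eigenvectors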

References¶

• Queries
  - Do we normalize before or after taking PCA? (When we talk about variance, we mean subtracting the mean and then squaring the deviations.)
  - Do we remove outliers before computing the variance? Such values can skew the mean.
  - Why the covariance matrix rather than the correlation matrix? The covariance here is computed on mean-centered features; the correlation matrix additionally scales each feature to unit variance.
  - PCA as SVD: understand SVD as well.
• Implementation
  - Example from Towards Data Science: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
  - Concatenating multiple feature-extraction methods: http://scikit-learn.org/stable/auto_examples/plot_feature_stacker.html#sphx-glr-auto-examples-plot-feature-stacker-py
• References
  - Introduction to PCA: http://www.lauradhamilton.com/introduction-to-principal-component-analysis-pca
  - Chapter 12 of Bishop, Pattern Recognition and Machine Learning (PCA); some of the PDFs from Book/PCA
  - Graphical techniques for PCA: https://www.kaggle.com/strakul5/principal-component-analysis-of-pokemon-data
In [213]:
%timeit
In [214]:
lambda x: x*x
Out[214]:
<function __main__.<lambda>(x)>
