PCA Code-Checkpoint

The document explains the steps to perform principal component analysis (PCA) on a dataset. It begins with standardizing the dataset, then calculating the covariance matrix using both population and sample formulas. Eigenvalues and eigenvectors are extracted from the covariance matrix. The top eigenvectors corresponding to the highest eigenvalues are kept to transform the standardized data into a lower-dimensional principal component space. The results are verified using NumPy linear algebra functions and by repeating the process with Scikit-learn's PCA module.


Understanding the mathematics behind PCA

In [1]:

import numpy as np
import pandas as pd

Step 1: Let's take a small dataset to work with

In [2]:

A = np.array([[1, 2, 3, 4],   # np.array preferred; np.matrix is deprecated
              [5, 5, 6, 7],
              [1, 4, 2, 3],
              [5, 3, 2, 1],
              [8, 1, 2, 2]])

In [3]:
df = pd.DataFrame(A, columns=['f1', 'f2', 'f3', 'f4'])
df
Out[3]:

   f1  f2  f3  f4
0   1   2   3   4
1   5   5   6   7
2   1   4   2   3
3   5   3   2   1
4   8   1   2   2

Step 2: Standardize the dataset

In [4]:
df_std = (df - df.mean()) / (df.std())
df_std
Out[4]:

         f1        f2        f3        f4
0 -1.000000 -0.632456  0.000000  0.260623
1  0.333333  1.264911  1.732051  1.563740
2 -1.000000  0.632456 -0.577350 -0.173749
3  0.333333  0.000000 -0.577350 -1.042493
4  1.333333 -1.264911 -0.577350 -0.608121
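As a quick sanity check (not in the original notebook), each standardized column should now have mean 0 and unit sample standard deviation:

print(np.allclose(df_std.mean(), 0))  # True
print(np.allclose(df_std.std(), 1))   # True; pandas .std() uses the sample formula (ddof=1)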

Step 3: Find the covariance matrix of the standardized dataset

There are two formulas for this:

Sample formula (divide by N-1)
Population formula (divide by N)

Note: Either formula can be used. The two matrices differ only by the constant factor (N-1)/N, so they have the same eigenvectors and lead to the same principal components.


Covariance population formula (divide by N)

In [5]:
df_cov = np.cov(df_std.T, bias = 1)
df_cov
Out[5]:
array([[ 0.8 , -0.25298221, 0.03849002, -0.14479075],
[-0.25298221, 0.8 , 0.51120772, 0.49449803],
[ 0.03849002, 0.51120772, 0.8 , 0.75235479],
[-0.14479075, 0.49449803, 0.75235479, 0.8 ]])

Covariance sample formula (divide by N-1)

In [6]:
cov_mat = np.cov(df_std.T, bias = 0)
cov_mat
Out[6]:
array([[ 1. , -0.31622777, 0.04811252, -0.18098843],
[-0.31622777, 1. , 0.63900965, 0.61812254],
[ 0.04811252, 0.63900965, 1. , 0.94044349],
[-0.18098843, 0.61812254, 0.94044349, 1. ]])

In [9]:
## verify variance(f1) is as expected
print('var(f1) (population formula): ', ((df_std.f1)**2).sum()/5)
print('var(f1) (sample formula): ', ((df_std.f1)**2).sum()/4)

var(f1) (population formula): 0.8
var(f1) (sample formula): 1.0

In [11]:
## verify covariance(f1,f2) is as expected
print('covar(f1,f2) (population formula): ', ((df_std.f1)*(df_std.f2)).sum()/5)
print('covar(f1,f2) (sample formula): ', ((df_std.f1)*(df_std.f2)).sum()/4)

covar(f1,f2) (population formula): -0.25298221281347033
covar(f1,f2) (sample formula): -0.3162277660168379
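Both covariance calls reduce to a single matrix product, because df_std is already centered (its column means are zero). A minimal sketch verifying this, not in the original notebook:

X = np.array(df_std)   # standardized data, shape (5, 4), column means are 0
N = X.shape[0]
print(np.allclose(X.T @ X / N, df_cov))        # True: population formula
print(np.allclose(X.T @ X / (N - 1), cov_mat)) # True: sample formula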

Step 4: Calculate the eigenvalues and eigenvectors of the covariance matrix


In [12]:
eigen_val, eigen_vectors = np.linalg.eig(cov_mat)

In [13]:
print(eigen_val)

[2.51579324 1.0652885 0.39388704 0.02503121]
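The eigenvalues measure the variance captured along each component, and they sum to the trace of the covariance matrix (4 here, one unit per standardized feature). A short sketch of the explained-variance ratios, not in the original notebook:

print(eigen_val / eigen_val.sum())  # approx. [0.629 0.266 0.098 0.006]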

In [14]:
print(eigen_vectors)

[[ 0.16195986 -0.91705888 -0.30707099  0.19616173]
 [-0.52404813  0.20692161 -0.81731886  0.12061043]
 [-0.58589647 -0.3205394   0.1882497  -0.72009851]
 [-0.59654663 -0.11593512  0.44973251  0.65454704]]
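Each column of eigen_vectors is the unit eigenvector paired with the eigenvalue at the same index; a quick check of the defining property Av = λv (a sketch, not in the original notebook):

for lam, v in zip(eigen_val, eigen_vectors.T):   # columns are eigenvectors
    assert np.allclose(cov_mat @ v, lam * v)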
Step 5: Sort the eigenvalues and their corresponding eigenvectors

Since the eigenvalues already come out sorted in descending order in our case, this step is not needed here; the general pattern is sketched below.
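np.linalg.eig makes no ordering guarantee, so in general the sort would look like this (a minimal sketch):

order = np.argsort(eigen_val)[::-1]        # indices of eigenvalues, largest first
eigen_val = eigen_val[order]
eigen_vectors = eigen_vectors[:, order]    # reorder the matching columns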

In [75]:
n_components = 3

Step 6: Pick the top n_components eigenvectors (those with the largest eigenvalues)

In [76]:
top_eigen_vectors = eigen_vectors[:,:n_components]

In [77]:

top_eigen_vectors
Out[77]:
array([[ 0.16195986, -0.91705888, -0.30707099],
[-0.52404813, 0.20692161, -0.81731886],
[-0.58589647, -0.3205394 , 0.1882497 ],
[-0.59654663, -0.11593512, 0.44973251]])

In [78]:
top_eigen_vectors.shape

Out[78]:
(4, 3)

In [79]:
np.array(df_std).shape
Out[79]:
(5, 4)

Step 7: Project the standardized data onto the top eigenvectors:

df_std.shape × top_eigen_vectors.shape = transformed_data.shape
(5, 4) × (4, 3) = (5, 3)

In [80]:
transformed_data = np.matmul(np.array(df_std),top_eigen_vectors)

In [85]:
pd.DataFrame(data = transformed_data,
             columns = ['principal component ' + str(i+1) for i in range(n_components)])
Out[85]:

   principal component 1  principal component 2  principal component 3
0               0.014003               0.755975               0.941200
1              -2.556534              -0.780432              -0.106870
2              -0.051480               1.253135              -0.396673
3               1.014150               0.000239              -0.679886
4               1.579861              -1.228917               0.242230

In [82]:
transformed_data.shape

Out[82]:
(5, 3)

Step 8: Now let's see the result using the scikit-learn library


In [83]:
from sklearn.decomposition import PCA
pca = PCA(n_components=n_components)
principalComponents = pca.fit_transform(df_std)
principalDf = pd.DataFrame(data = principalComponents,
                           columns = ['principal component ' + str(i+1) for i in range(n_components)])

In [84]:
principalDf
Out[84]:

   principal component 1  principal component 2  principal component 3
0              -0.014003               0.755975               0.941200
1               2.556534              -0.780432              -0.106870
2               0.051480               1.253135              -0.396673
3              -1.014150               0.000239              -0.679886
4              -1.579861              -1.228917               0.242230
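The scikit-learn output matches the manual projection except that principal component 1 has the opposite sign. Eigenvectors are only determined up to sign, so a flipped component is expected; a quick check (a sketch, not in the original notebook):

# The two projections agree up to a per-component sign flip
print(np.allclose(np.abs(transformed_data), np.abs(principalComponents)))  # True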
