Implementing PCA in Python With Scikit

1. The document discusses implementing PCA in Python using scikit-learn to reduce the dimensionality of data by selecting the most important attributes that capture maximum information. 2. It demonstrates loading breast cancer data, standardizing it, running PCA to reduce the 30 dimensions to 3, and visualizing the results in 2D and 3D plots to show separation between the two classes. 3. It also shows how PCA components explain the variance in the data, with the first component explaining 44% of variance.

Uploaded by

Shobha Kumari Choudhary

Implementing PCA in Python with scikit-learn
In this article, we will learn about PCA (Principal Component Analysis) in Python
with scikit-learn. Let’s start our learning step by step.
WHY PCA?
• When a dataset has many input attributes, it is difficult to visualize the data.
This is related to a well-known term in the machine learning domain, the
'curse of dimensionality'.
• Basically, it refers to the fact that a large number of attributes in a dataset
can adversely affect the accuracy and training time of a machine learning
model.
• Principal Component Analysis (PCA) is one way to address this issue. It is
used for better data visualization and can, in some cases, improve accuracy.
How does PCA work?
• PCA is an unsupervised pre-processing step that is usually carried out before
applying an ML algorithm. PCA is based on an "orthogonal linear
transformation", a mathematical technique that projects the attributes
of a dataset onto a new coordinate system. Each new axis is a linear
combination of the original attributes. The axis along which the data varies
the most is called the first principal component and is placed at the first
coordinate.
• Similarly, the axis that captures the second-largest share of the variance,
while being orthogonal to the first, is called the second principal component,
and so on. In short, the complete dataset can be expressed in terms of
principal components. Often the first few principal components explain most
of the variance, but the exact share depends on the data; in the example
below, the first three explain about 73%.
• Principal component analysis, or PCA, thus converts data from a
high-dimensional space to a low-dimensional space by keeping the few
principal components that capture the most information (variance) about the
dataset, rather than by picking a subset of the original attributes.
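The projection described above can be sketched with plain NumPy. This is a minimal illustration, not scikit-learn's internal implementation: the toy data, the random seed, and all variable names are assumptions made for the example.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 3 features,
# where the first two features are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized features (3 x 3).
cov = np.cov(Z, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the first two principal directions.
scores = Z @ eigenvectors[:, :2]

# Fraction of the total variance carried by each direction.
explained_ratio = eigenvalues / eigenvalues.sum()
print(scores.shape)
print(explained_ratio)
```

Each column of eigenvectors is one principal direction, expressed as weights on the original features, which is why a principal component is a combination of attributes rather than a single attribute.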
Python Implementation:
• To implement PCA in scikit-learn, standardize/normalize the data before
applying PCA, because PCA is sensitive to the scale of the attributes.
• PCA is imported from sklearn.decomposition. We need to select the
required number of principal components.
• Usually, n_components is chosen to be 2 or 3 for visualization, but the
right value depends on the data.
• The attributes are passed to the fit and transform methods.
• The values of the principal components can be checked using components_,
while the share of variance explained by each principal component is
available in explained_variance_ratio_.
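When it is unclear how many components to keep, one option is to pass a fraction instead of an integer. The sketch below assumes a 95% variance target, which is an arbitrary choice for illustration; with svd_solver="full", scikit-learn then keeps the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load and standardize the 30 input attributes.
X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# 0.95 is an assumed target: keep enough components to explain 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_scaled)

# Number of components actually selected, and the reduced shape.
print(pca_95.n_components_)
print(X_reduced.shape)
```

The fitted attribute n_components_ reports how many components were selected, so the choice is made by the data rather than fixed in advance.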
1. Import all the libraries
# import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic to show plots inline; omit in a plain Python script
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569 data
items with 30 input attributes. There are two output classes: benign and
malignant. With 30 input features, the data cannot be plotted directly.

# import the breast_cancer dataset
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
print(data.keys())

# Check the output classes
print(data['target_names'])

# Check the input attributes
print(data['feature_names'])

Output: [the dictionary keys, the two class names (malignant, benign), and the 30 feature names]

3. Apply PCA
• Standardize the dataset prior to PCA.
• Import PCA from sklearn.decomposition.
• Choose the number of principal components.
Let us set it to 3. After executing this code, we find that the dimensions of
x are (569, 3), while the dimensions of the original data are (569, 30). Thus,
PCA has reduced the number of dimensions from 30 to 3. If we choose
n_components=2, the dimensions would be reduced to 2.

# construct a dataframe using pandas
df1=pd.DataFrame(data['data'],columns=data['feature_names'])

# Scale data before applying PCA
scaling=StandardScaler()

# Use fit and transform method
scaling.fit(df1)
Scaled_data=scaling.transform(df1)

# Set the n_components=3
principal=PCA(n_components=3)
principal.fit(Scaled_data)
x=principal.transform(Scaled_data)

# Check the dimensions of data after PCA
print(x.shape)

Output:
(569, 3)

4. Check Components
principal.components_ is an array in which the number of rows equals the
number of principal components, while the number of columns equals the
number of features in the original data. Here there are three rows, since
n_components was chosen to be 3, and each row has 30 columns, one per
original feature.

# Check the values of the eigenvectors
# produced by the principal components
print(principal.components_)
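The raw array is hard to read, so it can help to label it. The sketch below wraps components_ in a pandas DataFrame with the feature names as columns; the row labels PC1 to PC3 and the choice to list the five largest weights are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data.data)
principal = PCA(n_components=3).fit(scaled)

# One row per principal component, one column per original feature.
loadings = pd.DataFrame(
    principal.components_,
    columns=data.feature_names,
    index=["PC1", "PC2", "PC3"],
)
print(loadings.shape)

# The five features with the largest absolute weight in the first component.
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)

# The rows are orthonormal, so this product is close to the identity matrix.
gram = principal.components_ @ principal.components_.T
print(np.allclose(gram, np.eye(3)))
```

Reading a row of this table shows how strongly each original attribute contributes to that principal component.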

5. Plot the components (Visualization)

Plot the principal components for better data visualization. Although we took
n_components=3, here we plot a 2D graph using the first two principal
components as well as a 3D graph using all three. The colors show the two
output classes of the original dataset, benign and malignant. The principal
components show a visible separation between the two output classes.

plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')

Output: [Figure: 2D scatter plot of pc1 against pc2, colored by output class]

For three principal components, we need to plot a 3D graph. x[:,0] is the first
principal component; similarly, x[:,1] and x[:,2] are the second and third
principal components.

# import relevant libraries for 3d graph
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))

# choose projection 3d for creating a 3d graph
axis = fig.add_subplot(111, projection='3d')

# x[:,0] is pc1, x[:,1] is pc2 while x[:,2] is pc3
axis.scatter(x[:,0],x[:,1],x[:,2], c=data['target'],cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)

Output: [Figure: 3D scatter plot of PC1, PC2 and PC3, colored by output class]

6. Calculate variance ratio

explained_variance_ratio_ gives the fraction of the total variance explained by
each principal component.

# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)

Output:
[0.44272026 0.18971182 0.09393163]

The first principal component explains about 44% of the variance, and the
three components together explain about 73%.
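To see how the explained variance accumulates as components are added, the ratios can be summed cumulatively. This is a small sketch that fits PCA with all 30 components, which is a choice made here for illustration rather than part of the walkthrough above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# With no n_components argument, all 30 components are kept.
pca_all = PCA().fit(scaled)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
for k, total in enumerate(cumulative[:5], start=1):
    print(f"first {k} component(s): {total:.3f}")
```

Plotting cumulative against the component index is a common way to pick n_components: look for the point after which extra components add little.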
