6 - Data Pre-Processing-III
(Data Reduction)
TIET, PATIALA
Dimensionality/Data Reduction
▪ The number of input variables or features for a dataset is referred to as its
dimensionality.
▪ Dimensionality reduction refers to techniques that reduce the number of input
variables in a dataset.
▪ More input features often make a predictive modeling task more challenging; this is generally
referred to as the curse of dimensionality.
▪ There exists an optimal number of features in a feature set for a given machine learning task.
▪ Adding more features than strictly necessary results in performance degradation (because of the
added noise).
▪ Finding this optimal feature subset is a challenging task.
Dimensionality/Data Reduction
Benefits of data reduction
• Accuracy improvements.
• Over-fitting risk reduction.
• Speed up in training.
• Improved Data Visualization.
• Increased explainability of the ML model.
• Increased storage efficiency.
• Reduced storage cost.
Data Reduction Techniques
Feature Selection – find the best subset of features:
• Filter methods
• Wrapper methods
• Embedded methods
Feature Selection- Measuring Feature Redundancy
Distance-based:
▪ The most commonly used distance metrics are the various forms of the Minkowski distance (for a positive order r):

$d(F_1, F_2) = \left( \sum_{i=1}^{n} \left| F_{1i} - F_{2i} \right|^{r} \right)^{1/r}$

It takes the form of the Euclidean distance when r = 2 (L2 norm) and the Manhattan distance
when r = 1 (L1 norm).
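A minimal sketch of this distance in Python (plain NumPy; the function name and the small example vectors are illustrative assumptions):

```python
import numpy as np

def minkowski_distance(f1, f2, r):
    """Minkowski distance of positive order r between two feature vectors."""
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    return np.sum(np.abs(f1 - f2) ** r) ** (1.0 / r)

f1, f2 = [2.5, 0.5, 2.2], [2.4, 0.7, 2.9]
print(minkowski_distance(f1, f2, r=1))  # Manhattan distance (L1 norm)
print(minkowski_distance(f1, f2, r=2))  # Euclidean distance (L2 norm)
```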
▪Cosine similarity is another important metric for computing similarity between
features.
$\cos(F_1, F_2) = \frac{F_1 \cdot F_2}{|F_1|\,|F_2|}$
Where F1 and F2 denote feature vectors.
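A corresponding sketch for cosine similarity (NumPy only; the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(f1, f2):
    """cos(F1, F2) = (F1 . F2) / (|F1| |F2|)."""
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    return np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))

print(cosine_similarity([2.5, 0.5, 2.2], [2.4, 0.7, 2.9]))  # value in [-1, 1]; 1 means identical direction
```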
Feature Selection- Measuring Feature Redundancy
For binary features, following metrics are useful:
1. Hamming distance: the number of positions at which the two feature vectors differ.
2. Jaccard distance: 1- Jaccard Similarity
$\text{Jaccard Similarity} = \frac{n_{11}}{n_{01} + n_{10} + n_{11}}$
3. Simple Matching Coefficient (SMC):
$\text{SMC} = \frac{n_{11} + n_{00}}{n_{00} + n_{01} + n_{10} + n_{11}}$
where n11 and n00 represent the number of cases where both features have value 1 and value 0, respectively;
n10 denotes the cases where feature 1 has value 1 and feature 2 has value 0;
n01 denotes the cases where feature 1 has value 0 and feature 2 has value 1.
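A short sketch that computes n11, n10, n01 and n00 directly from two binary feature vectors and returns the three metrics (NumPy only; names and example vectors are illustrative):

```python
import numpy as np

def binary_feature_metrics(f1, f2):
    """Hamming distance, Jaccard distance and SMC for two binary (0/1) feature vectors."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    n11 = np.sum((f1 == 1) & (f2 == 1))
    n00 = np.sum((f1 == 0) & (f2 == 0))
    n10 = np.sum((f1 == 1) & (f2 == 0))
    n01 = np.sum((f1 == 0) & (f2 == 1))
    hamming = n10 + n01                          # positions where the vectors differ
    jaccard = 1 - n11 / (n01 + n10 + n11)        # Jaccard distance = 1 - Jaccard similarity
    smc = (n11 + n00) / (n00 + n01 + n10 + n11)  # matches (1-1 and 0-0) over all positions
    return hamming, jaccard, smc

print(binary_feature_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```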
Overall Feature Selection Process
Feature Selection Approaches
Filter Approach:
▪ In this approach, the feature subset is selected based on statistical measures.
▪ No learning algorithm is employed to evaluate the goodness of the selected features.
▪ Commonly used metrics include correlation, chi-square, Fisher score, ANOVA, information gain, etc.
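A hedged sketch of a filter-style selection using scikit-learn's SelectKBest with the chi-square score; the Iris dataset and k = 2 are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # 4 input features
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score of each feature
print(X_reduced.shape)    # (150, 2): only a statistical score is used, no learning model
```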
Feature Selection Approaches
Wrapper Approach:
▪ In this approach, for every candidate subset, the learning model is trained and
the result is evaluated by running the learning algorithm.
▪ Computationally very expensive but superior in performance.
▪ Requires a method to search the space of all possible feature subsets (a sketch of one such search follows below).
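As one concrete wrapper-style search, here is a sketch using scikit-learn's RFE (recursive feature elimination), which repeatedly trains the supplied model and drops the weakest features; the estimator and the target number of features are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)              # trains the model repeatedly while eliminating features

print(rfe.support_)        # boolean mask of the selected features
print(rfe.ranking_)        # rank 1 = selected
```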
Feature Selection Approaches
Wrapper Approach- Searching Methods:
▪ Forward Feature Selection
➢This is an iterative method wherein we start with the single best-performing variable against the target.
➢Next, we select the variable that gives the best performance in combination with the first selected
variable.
➢This process continues until a preset stopping criterion is met (see the sketch below).
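A minimal sketch of forward feature selection using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the estimator, cross-validation setting and n_features_to_select are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,   # preset stopping criterion
    direction="forward",      # start empty and add the best-performing feature each round
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())      # mask of the selected features
```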
Feature Extraction
▪ For a given feature set Fi (F1, F2, F3, …, Fn), feature extraction finds a mapping function that maps
it to a new feature set Fi' (F1', F2', F3', …, Fm') such that Fi' = f(Fi) and m < n.
▪ The eigenvectors of the covariance matrix are the directions of the axes along which the data has
the most variance (the most information); these directions are what we call the Principal Components.
Principal Component Analysis
Stepwise working of PCA
Step 1: Standardize the data by subtracting the mean of each feature.
Step 2: Compute the covariance matrix A of the features.
Step 3: Compute the eigenvalues of A by solving $\det(A - \lambda I) = 0$, and the corresponding eigenvectors from

$(A - \lambda I)X = 0$

The eigenvalues are simply the coefficients attached to the eigenvectors, and give the amount of
variance carried by each Principal Component.
Step 4: Sort the eigenvectors in decreasing order of their eigenvalues and choose the k
eigenvectors with the largest eigenvalues.
Step 5: Transform the data along the principal component axes.
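A compact NumPy sketch of these steps (mean-centering, covariance matrix, eigen-decomposition, sorting, projection), applied to the example dataset used in the following slides; the function and variable names are illustrative:

```python
import numpy as np

def pca(data, k):
    """Project `data` (samples x features) onto its top-k principal components."""
    centered = data - data.mean(axis=0)       # Step 1: subtract the mean of each feature
    cov = np.cov(centered, rowvar=False)      # Step 2: covariance matrix A
    eigvals, eigvecs = np.linalg.eigh(cov)    # Step 3: eigenvalues and eigenvectors of A
    order = np.argsort(eigvals)[::-1]         # Step 4: sort by decreasing eigenvalue
    components = eigvecs[:, order[:k]]        #         keep the k principal components
    return centered @ components              # Step 5: transform the data

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(pca(X, k=1))   # each sample reduced to a single principal-component score
```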
Principal Component Analysis-Example
Example:
Original dataset:

X     Y
2.5   2.4
0.5   0.7
2.2   2.9
1.9   2.2
3.1   3.0
2.3   2.7
2.0   1.6
1.0   1.1
1.5   1.6
1.1   0.9

Compute the covariance matrix, i.e.

$A = \begin{pmatrix} \mathrm{Cov}(X,X) & \mathrm{Cov}(X,Y) \\ \mathrm{Cov}(Y,X) & \mathrm{Cov}(Y,Y) \end{pmatrix}$
Principal Component Analysis-Example
Example:
For each sample, compute $X_i - \bar{X}$, $Y_i - \bar{Y}$, $(X_i - \bar{X})^2$, $(Y_i - \bar{Y})^2$ and $(X_i - \bar{X})(Y_i - \bar{Y})$; dividing the column sums by (n − 1) gives the covariance matrix

$A = \begin{pmatrix} 0.6165 & 0.6154 \\ 0.6154 & 0.7165 \end{pmatrix}$

Find the eigenvalues by setting the determinant of $(A - \lambda I)$ equal to zero:

$\det \begin{pmatrix} 0.6165 - \lambda & 0.6154 \\ 0.6154 & 0.7165 - \lambda \end{pmatrix} = 0 \;\Rightarrow\; \lambda_1 = 1.284028, \quad \lambda_2 = 0.049083$
Principal Component Analysis-Example
Example: Compute the eigenvectors from $(A - \lambda I)X = 0$ using $\lambda_1 = 1.284028$ and $\lambda_2 = 0.049083$.

For $V_1$ (using $\lambda_1$):

$\begin{pmatrix} 0.6165 - \lambda_1 & 0.6154 \\ 0.6154 & 0.7165 - \lambda_1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} -0.6675 & 0.6154 \\ 0.6154 & -0.5675 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$

For $V_2$ (using $\lambda_2$):

$\begin{pmatrix} 0.6165 - \lambda_2 & 0.6154 \\ 0.6154 & 0.7165 - \lambda_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0.5674 & 0.6154 \\ 0.6154 & 0.6674 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$
Principal Component Analysis-Example
Example:
Solving these systems gives the (normalized) eigenvectors:

$V_1 = \begin{pmatrix} 0.67787 \\ 0.73517 \end{pmatrix}, \qquad V_2 = \begin{pmatrix} -0.73517 \\ 0.67787 \end{pmatrix}$
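A small NumPy check of this worked example; it reproduces the covariance matrix, the eigenvalues, and (up to sign) the eigenvectors reported above:

```python
import numpy as np

X = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
Y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

A = np.cov(X, Y)                      # ~[[0.6165, 0.6154], [0.6154, 0.7165]]
eigvals, eigvecs = np.linalg.eigh(A)  # eigh: for symmetric matrices, eigenvalues in ascending order

print(A)
print(eigvals)   # ~[0.0491, 1.2840]  -> lambda_2, lambda_1
print(eigvecs)   # columns are the eigenvectors V2 and V1 (possibly with flipped signs)
```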