Module 2.1: Feature Selection

Course Code: CSA3002

MACHINE LEARNING ALGORITHMS

Course Type: LPC – 2-2-3


Course Objectives
• The objective of the course is to familiarize learners with the concepts of
Machine Learning Algorithms and to attain Skill Development through
Experiential Learning techniques.
Course Outcomes
At the end of the course, students should be able to:
1. Understand training and testing of datasets using machine learning
techniques.
2. Apply optimization and parameter tuning techniques to machine
learning algorithms.
3. Apply a machine learning model to solve various problems using
machine learning algorithms.
4. Apply machine learning algorithms to create models.
Feature Selection/Extraction Techniques
• Feature selection plays a crucial role in the machine learning process.
• Its main aim is to identify the subset of features that have the most
influential effect on the target variable.
• By removing irrelevant or noisy features, we can simplify the model,
enhance its interpretability, reduce training time, and avoid overfitting.
• This involves assessing the importance of each feature and choosing
the most informative ones.
Why is Feature Selection Important?
• Feature selection offers several advantages in the field of machine learning.
• Firstly, it enhances model performance by focusing on the most relevant
features.
• By eliminating irrelevant features, we can reduce the dimensionality of the
dataset, thereby mitigating the curse of dimensionality and improving the
model's ability to generalize.
• Feature selection contributes significantly to model interpretability.
• By selecting the most important features, we gain a better understanding of the
underlying factors that influence the model's predictions.
• This interpretability holds particular significance in domains like healthcare
and finance, where transparency and explainability are crucial.
Common Feature Selection Techniques
• There are various approaches to performing feature selection, each
with its strengths and limitations.
• Three common categories of feature selection techniques: filter
methods, wrapper methods, and embedded methods.
Filter Methods
• A filter method in machine learning is a feature selection technique
used to select relevant features from a dataset before applying a
learning algorithm.
• Two commonly used filter methods include Variance Threshold and
Chi-Square Test.
Variance Threshold
• The Variance Threshold method identifies features with low variance,
assuming that features with minimal variation across the dataset
contribute less to the model.
• By establishing a threshold, we can select features with variance above
this defined threshold and discard the rest.
Example: Predicting Student Grades
• Suppose you have a dataset containing information about students, and
you want to predict their final exam grades. The dataset includes the
following features:
• Age: Age of the student (numeric).
• Gender: Gender of the student (categorical: 'Male' or 'Female').
• Study Hours: Number of hours the student spends studying per week
(numeric).
• Test 1 Score: Score on the first practice test (numeric).
• Test 2 Score: Score on the second practice test (numeric).
Before applying the Variance Threshold method, let's take a look at the dataset (numeric features shown; Gender is categorical):

Age   Study Hours   Test 1 Score   Test 2 Score
18    10            85             87
19    12            78             80
20    15            92             91
18    8             80             79
21    10            88             86
• Now, let's calculate the variance for each feature:
• Age: Variance = ((18-19.2)^2 + (19-19.2)^2 + (20-19.2)^2 + (18-19.2)^2 + (21-19.2)^2) / 5 = 1.36
• Gender: This is categorical, so it doesn't have a variance; we can't measure variance for categories.
• Study Hours: Variance = ((10-11)^2 + (12-11)^2 + (15-11)^2 + (8-11)^2 + (10-11)^2) / 5 = 5.6
• Test 1 Score: Variance = ((85-84.6)^2 + (78-84.6)^2 + (92-84.6)^2 + (80-84.6)^2 + (88-84.6)^2) / 5 = 26.24
• Test 2 Score: Variance = ((87-84.6)^2 + (80-84.6)^2 + (91-84.6)^2 + (79-84.6)^2 + (86-84.6)^2) / 5 = 20.24
• Now, let's say you decide to set a threshold for variance at 5.0. Any feature with a variance below this threshold will be considered low-variance and will be removed.
• In this case, "Age" will be removed because its variance (1.36) is below 5.0 and "Gender" is excluded because it is categorical, while "Study Hours," "Test 1 Score," and "Test 2 Score" will be retained.
After applying the Variance Threshold method, the dataset retains only the Study Hours, Test 1 Score, and Test 2 Score columns.
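The same selection can be reproduced with scikit-learn's VarianceThreshold. The snippet below is a minimal sketch that uses the numeric values from the variance calculations above; the categorical Gender column is left out because VarianceThreshold works only on numeric data.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Numeric student data (values taken from the variance calculations above)
df = pd.DataFrame({
    "Age":          [18, 19, 20, 18, 21],
    "Study Hours":  [10, 12, 15, 8, 10],
    "Test 1 Score": [85, 78, 92, 80, 88],
    "Test 2 Score": [87, 80, 91, 79, 86],
})

# Keep only features whose variance is above the 5.0 threshold
selector = VarianceThreshold(threshold=5.0)
X_reduced = selector.fit_transform(df)

print(selector.variances_)                           # [1.36, 5.6, 26.24, 20.24]
print(df.columns[selector.get_support()].tolist())   # ['Study Hours', 'Test 1 Score', 'Test 2 Score']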
Chi-Square Test
• The Chi-Square Test feature selection method is commonly used when
dealing with categorical data and is used to identify the most relevant
features by assessing the independence between each categorical
feature and the target variable.
• It measures the association or dependency between two categorical
variables, and in feature selection, it helps us select features that have
a significant relationship with the target variable.
Example: Predicting Loan Approval
• Bank dataset contains the following categorical features:
• Credit Score: Categorical feature representing the applicant's credit score
category (e.g., "Low," "Medium," "High").
• Employment Status: Categorical feature indicating the applicant's
employment status (e.g., "Employed," "Unemployed," "Self-Employed").
• Marital Status: Categorical feature representing the applicant's marital status
(e.g., "Married," "Single," "Divorced").
• Loan Purpose: Categorical feature indicating the purpose of the loan (e.g.,
"Home Purchase," "Debt Consolidation," "Education").
• Loan Approval: Binary target variable indicating whether the loan application
was approved (1 for approved, 0 for denied).
• To perform feature selection using the Chi-Square Test, follow these
steps:
• Step 1: Create a contingency table (also known as a cross-tabulation
table) between each categorical feature and the target variable. The
table shows the counts of each combination of feature and target
variable values. For instance, you would cross-tabulate the "Credit Score"
categories against the "Loan Approval" outcomes.
• Step 2: Calculate the Chi-Square statistic for each contingency table. The
Chi-Square statistic measures the difference between the observed and
expected frequencies of each combination. The formula for Chi-Square is:
• χ^2 = Σ [(Observed - Expected)^2 / Expected]
• Where:
• Observed: The actual count in a cell of the contingency table.
• Expected: The expected count in the same cell if there were no
relationship between the feature and the target variable.
• Expected = (Total count in the row * Total count in the column) / Total
count in the dataset
• For example, if 40 of 100 applicants have a "High" credit score and 60 of
the 100 loans were approved, the expected count for the ("High", Approved)
cell is (40 * 60) / 100 = 24.
• Step 3: Repeat the above calculation for all other cells in the contingency table.
• Once you have calculated the Chi-Square values for all cells, sum them to
get the overall Chi-Square statistic for the entire table:
• Chi-Square Total = Σ (Chi-Square for each cell)
• Compute the Chi-Square Total for each feature (Credit Score, Employment
Status, Marital Status, Loan Purpose) against the target variable, Loan Approval.
• Step 4: Set a significance level (for example, 0.05) as a threshold. Features with
p-values below this threshold are considered statistically significant and are selected.
• Calculate the p-value from the Chi-Square statistic using the Chi-Square
distribution with (rows - 1) * (columns - 1) degrees of freedom. You can use
statistical software or an online calculator for this calculation, as in the sketch below.
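• As a sketch of Steps 1-4, the snippet below builds a hypothetical contingency table for Credit Score versus Loan Approval (the counts are invented for illustration) and uses scipy's chi2_contingency to compute the Chi-Square statistic, expected counts, and p-value:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = Credit Score (Low, Medium, High),
# columns = Loan Approval (Denied, Approved); counts are illustrative only
observed = np.array([
    [30, 10],   # Low
    [20, 25],   # Medium
    [ 5, 40],   # High
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Chi-Square statistic:", chi2_stat)
print("Degrees of freedom:", dof)     # (3 - 1) * (2 - 1) = 2
print("p-value:", p_value)
print("Expected counts:\n", expected)

# Keep the feature if the association is significant at the 0.05 level
if p_value < 0.05:
    print("Credit Score is significantly associated with Loan Approval")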
Filter method in Python for Feature selection
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Load data (a diabetes.csv file with an "Outcome" target column is assumed)
df = pd.read_csv("diabetes.csv")
X = np.array(df.drop(["Outcome"], axis=1))
y = np.array(df["Outcome"])

# Feature selection using the chi-squared test
selector = SelectKBest(chi2, k=2)  # Select the top 2 features
X_new = selector.fit_transform(X, y)

print(X_new[:5])
Wrapper Methods
• These methods use a machine learning model to evaluate feature
importance. They recursively select or exclude features based on model
performance.
• Recursive Feature Elimination and Forward Selection are popular wrapper
methods.
• Recursive Feature Elimination
• Recursive Feature Elimination (RFE) is an iterative approach that begins
with all features and eliminates the least important feature in each iteration.
• This process continues until a specified number of features remains. In
each iteration, RFE ranks the features using the importance scores (for
example, the model's coefficients or feature importances) and eliminates
the lowest-ranked feature.
Wrapper Methods for Feature selection
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load data (the Iris dataset is used here as an example)
X, y = load_iris(return_X_y=True)

# Model whose coefficients RFE uses to rank features
model = LogisticRegression(max_iter=200)

# RFE: keep the 2 most important features
rfe = RFE(model, n_features_to_select=2)
X_new = rfe.fit_transform(X, y)

print(X_new[:5])
Recursive Feature Elimination - Example
• Step 1: Import Libraries and Load the Dataset. First, you need to import the necessary libraries and load the
diabetes dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset


diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
• Step 2: Split the Dataset. Split the dataset into training and testing
sets to evaluate the model later.
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=100)
• Step 3: Create a Linear Regression Model. Create a linear regression
model that will be used for feature selection.
• model = LinearRegression()
• Step 4: Initialize RFE with the Model and Desired Number of
Features. Initialize RFE with the linear regression model and specify
the number of features you want to select.
• num_features_to_select = 5 # You can change this to the desired
number of features
• rfe = RFE(model, n_features_to_select=num_features_to_select)
• Step 5: Fit RFE to the Training Data. Fit RFE to the training data to
perform feature selection.
• rfe.fit(X_train, y_train)
• Step 6: Get the Selected Features. Retrieve the indices of the selected
features using rfe.support_.
• selected_features = np.where(rfe.support_)[0]
• print("Selected Feature Indices:", selected_features)
• Step 7: Train a Model Using the Selected Features. Train a linear regression
model using the selected features.
• X_train_selected = X_train[:, selected_features]
• X_test_selected = X_test[:, selected_features]

• model.fit(X_train_selected, y_train)
• Step 8: Make Predictions and Evaluate the Model. Make predictions
on the test set and calculate the mean squared error to evaluate the
model's performance.
• y_pred = model.predict(X_test_selected)
• mse = mean_squared_error(y_test, y_pred)
• print("Mean Squared Error:", mse)
• Forward Selection
• Forward Selection starts with an empty set of features and gradually
adds the most promising feature at each step.
• The model's performance is evaluated after each feature addition, and
the process continues until a specified number of features are selected.
Step-by-step process
• Step 1: Empty Feature Set
• Begin with an empty set of features. This set will gradually grow as the algorithm progresses.
• Step 2: Model Training and Evaluation
• Train a machine learning model (e.g., linear regression, decision tree, etc.) using the dataset
with the currently selected features (initially, this is an empty set).
• Evaluate the model's performance using a suitable metric, such as mean squared error (MSE)
for regression tasks or accuracy for classification tasks.
• Step 3: Feature Selection
• In each iteration of forward selection, consider adding one of the remaining candidate
features to the set of selected features.
• Train a new model with the current set of selected features plus the candidate feature.
• Evaluate the performance of the new model using the same metric as in Step 2.
• Step 4: Select the Best Feature
• Among all the candidate features considered in the current iteration, choose the one that leads to the best
improvement in model performance. This is typically determined by comparing the model's performance
metrics.
• Add the selected feature to the set of selected features.
• Step 5: Stopping Criterion
• Decide on a stopping criterion. This could be a predefined number of features to select, a specific
performance threshold, or any other relevant criterion.
• Check if the stopping criterion is met. If it is, stop the forward selection process. Otherwise, continue to the
next iteration.
• Step 6: Final Model
• Once the stopping criterion is met, the selected features form the final set of features to be used in your
model.
• Train a final model using all the selected features.
• Evaluate the final model on a separate test dataset to assess its performance in a more realistic scenario.
• Suppose you're working on a predictive modeling task to predict house prices.
• You have a dataset with features like the number of bedrooms, square footage,
presence of a garage, distance to the nearest school, and age of the house.
• You start with an empty set of features.
• In each iteration, you consider adding one feature to the set and measure how
much it improves the model's ability to predict house prices.
• You continue this process until you've added a predefined number of features or
until you're satisfied with the model's performance.
• The selected features form the final set used in your model to predict house
prices.
Feature selection by Forward Selection
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import train_test_split

# Load data
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Logistic regression model


model = LogisticRegression(max_iter=200)
# Forward Selection
sfs = SFS(model,
          k_features=2,        # Number of features to select
          forward=True,        # Forward selection
          floating=False,      # No floating steps
          scoring='accuracy',
          cv=5)                # 5-fold cross-validation

sfs = sfs.fit(X_train, y_train)

# Selected features
selected_features = X.columns[list(sfs.k_feature_idx_)]
print("Selected features:", selected_features)

# Performance with selected features


print("Cross-validation score with selected features:", sfs.k_score_)
Embedded Methods
• Embedded methods for feature selection are techniques where the
feature selection process is performed during the training of the
model itself.
• Lasso Regression and Random Forest Importance are two widely
used embedded methods.
• Example: Lasso Regression (L1 Regularization):
• Lasso (Least Absolute Shrinkage and Selection Operator) can shrink
the less important feature's coefficients to zero, effectively selecting a
subset of features.
• Tool: Lasso regression, Tree-based models (e.g., Random Forest).
L1 Regularization
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
import numpy as np

# Create sample dataset


X, y = make_regression(n_samples=100, n_features=10, noise=0.1)

# Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print(np.nonzero(lasso.coef_)) # Non-zero coefficients indicate selected features


• Lasso Regression
• Lasso Regression introduces a regularization term that penalizes the
absolute values of the feature coefficients. As a result, some
coefficients become zero, effectively removing the corresponding
features from the model. This technique encourages sparsity in the
coefficients and performs feature selection simultaneously.
• Lasso is a modification of linear regression, where the model is
penalized for the sum of absolute values of the weights. Thus, the
absolute values of weight will be (in general) reduced, and many will
tend to be zeros.
• Initialize and Train the Lasso Regression Model. Create a Lasso
Regression model and specify the strength of the regularization
penalty, typically denoted as alpha. Larger alpha values lead to
stronger regularization, which results in more aggressive feature selection.
You can use techniques like cross-validation to choose an appropriate
alpha value, as in the sketch after these steps.
• alpha = 0.01 # Adjust the value of alpha based on your data and
requirements
• lasso_model = Lasso(alpha=alpha)
• lasso_model.fit(X_train, y_train)
• Feature Selection. Lasso Regression automatically performs feature
selection by shrinking the coefficients of less important features
towards zero. After training the model, you can examine the
coefficients to identify which features were selected.
• selected_features = X.columns[lasso_model.coef_ != 0]
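• Putting these steps together, the following is a minimal sketch on scikit-learn's built-in diabetes dataset; it uses LassoCV so that alpha is chosen by cross-validation, as suggested above (the dataset and split settings are illustrative assumptions, not part of the original example):

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Load the data as a DataFrame so selected features can be reported by name
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LassoCV picks alpha by 5-fold cross-validation
lasso_model = LassoCV(cv=5, random_state=42)
lasso_model.fit(X_train, y_train)

print("Chosen alpha:", lasso_model.alpha_)
selected_features = X.columns[lasso_model.coef_ != 0]
print("Selected features:", list(selected_features))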
• Random Forest Importance
• Random Forest Importance (RFI) is a technique used to perform
feature selection by leveraging the capabilities of a Random Forest
classifier or regressor. Random Forest is an ensemble learning method
that combines multiple decision trees to make predictions. RFI
measures the importance of each feature in the Random Forest
model and ranks them based on their contribution to the model's
predictive performance. Features that contribute the most to
reducing impurity or error are considered more important.
• Train a Random Forest Model. Create and train a Random Forest
classifier using your training data.
• rf_classifier = RandomForestClassifier(n_estimators=100,
random_state=42)
• rf_classifier.fit(X_train, y_train)
• Feature Importance Calculation. Retrieve the feature importances
from the trained Random Forest model.
• feature_importances = rf_classifier.feature_importances_
• Rank Features. Rank the features based on their importance scores, in
descending order. You can use this ranking to select the top features
for your model.
• feature_ranking = pd.DataFrame({'Feature': X.columns, 'Importance':
feature_importances})
• feature_ranking = feature_ranking.sort_values(by='Importance',
ascending=False)
• Select Top Features. Choose the top N features based on your
requirements. You can select a fixed number of features or use a
threshold on the importance score.
• top_n_features = feature_ranking.head(N) # Replace N with the
desired number of features
• selected_features = top_n_features['Feature'].tolist()
• Train and Evaluate the Model with Selected Features. Train a Random Forest
model using only the selected features and evaluate its performance (a
consolidated sketch follows these steps).
• X_train_selected = X_train[selected_features]
• X_test_selected = X_test[selected_features]

• rf_classifier.fit(X_train_selected, y_train)
• y_pred = rf_classifier.predict(X_test_selected)

• accuracy = accuracy_score(y_test, y_pred)


• print("Accuracy with Selected Features:", accuracy)
Evaluation Metrics for Feature Selection
• In order to measure the effectiveness of feature selection techniques, it is
necessary to have suitable evaluation metrics.
• There are several commonly employed metrics, such as accuracy,
precision, recall, F1-score, and area under the receiver operating
characteristic curve (AUC-ROC).
• These metrics offer valuable information on how effectively the model
performs when utilizing the selected features, as opposed to using all
available features.
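• As a sketch of how such a comparison might look, the snippet below trains the same classifier on all features and on a filter-selected subset, then reports the metrics listed above (the dataset, classifier, and k = 10 are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Example data and a simple filter-based selection of the 10 best features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

def evaluate(X_tr, X_te, label):
    # Fit a classifier and report the common evaluation metrics
    clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_train)
    y_pred = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)[:, 1]
    print(label)
    print("  accuracy :", accuracy_score(y_test, y_pred))
    print("  precision:", precision_score(y_test, y_pred))
    print("  recall   :", recall_score(y_test, y_pred))
    print("  F1-score :", f1_score(y_test, y_pred))
    print("  AUC-ROC  :", roc_auc_score(y_test, y_prob))

evaluate(X_train, X_test, "All features:")
evaluate(X_train_sel, X_test_sel, "Selected features:")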
