Module 2.1 Feature Selection
import numpy as np
import pandas as pd

# Load data
df = pd.read_csv("diabetes.csv")
X = np.array(df.drop(["Outcome"], axis=1))
y = np.array(df["Outcome"])
Wrapper Methods
• These methods use a machine learning model to evaluate feature importance, recursively selecting or excluding features based on model performance.
• Recursive Feature Elimination and Forward Selection are popular wrapper
methods.
Recursive Feature Elimination
• Recursive Feature Elimination (RFE) is an iterative approach that begins
with all features and eliminates the least important feature in each iteration.
• This process continues until a specified number of features remains. RFE assigns each feature an importance score based on how much its removal affects the model's performance.
Wrapper Methods for Feature Selection
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# X and y as prepared from diabetes.csv above
model = LogisticRegression(max_iter=1000)

# RFE: recursively eliminate features until two remain
rfe = RFE(model, n_features_to_select=2)
X_new = rfe.fit_transform(X, y)
print(X_new[:5])
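To check which original columns were kept, rfe.support_ gives a boolean mask over the input features:

# Map the mask back to the diabetes column names
print(df.drop(["Outcome"], axis=1).columns[rfe.support_])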
Recursive Feature Elimination - Example
• Step 1: Import Libraries and Load the Dataset. First, import the necessary libraries and load the diabetes dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
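• A plausible sketch of the intermediate steps (load the data, split it, fit RFE with a linear model, and reduce both splits to the selected features); the split parameters and the choice of five features here are assumptions, not fixed values.

# Load and split the data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit RFE with a linear model; n_features_to_select=5 is an assumption
rfe = RFE(LinearRegression(), n_features_to_select=5)
rfe.fit(X_train, y_train)

# Keep only the selected features in both splits
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

# Final model to be trained on the selected features
model = LinearRegression()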
model.fit(X_train_selected, y_train)
• Step 8: Make Predictions and Evaluate the Model. Make predictions on the test set and calculate the mean squared error to evaluate the model's performance.
y_pred = model.predict(X_test_selected)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Forward Selection
• Forward Selection starts with an empty set of features and gradually
adds the most promising feature at each step.
• The model's performance is evaluated after each feature addition, and the process continues until a specified number of features has been selected.
Step-by-step process
• Step 1: Empty Feature Set
• Begin with an empty set of features. This set will gradually grow as the algorithm progresses.
• Step 2: Model Training and Evaluation
• Train a machine learning model (e.g., linear regression or a decision tree) using the dataset with the currently selected features (initially, this is an empty set).
• Evaluate the model's performance using a suitable metric, such as mean squared error (MSE)
for regression tasks or accuracy for classification tasks.
• Step 3: Feature Selection
• In each iteration of forward selection, consider adding one of the remaining candidate
features to the set of selected features.
• Train a new model with the current set of selected features plus the candidate feature.
• Evaluate the performance of the new model using the same metric as in Step 2.
• Step 4: Select the Best Feature
• Among all the candidate features considered in the current iteration, choose the one that leads to the best
improvement in model performance. This is typically determined by comparing the model's performance
metrics.
• Add the selected feature to the set of selected features.
• Step 5: Stopping Criterion
• Decide on a stopping criterion. This could be a predefined number of features to select, a specific
performance threshold, or any other relevant criterion.
• Check if the stopping criterion is met. If it is, stop the forward selection process. Otherwise, continue to the
next iteration.
• Step 6: Final Model
• Once the stopping criterion is met, the selected features form the final set of features to be used in your
model.
• Train a final model using all the selected features.
• Evaluate the final model on a separate test dataset to assess its performance in a more realistic scenario.
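A minimal from-scratch sketch of the loop in Steps 1-6, assuming a regression task on the diabetes dataset scored by cross-validated MSE (the dataset, model, and four-feature stopping point are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
n_features_to_select = 4              # Step 5: stopping criterion (assumed)

selected = []                         # Step 1: start with an empty feature set
remaining = list(range(X.shape[1]))

while len(selected) < n_features_to_select:
    best_score, best_feature = -np.inf, None
    for f in remaining:               # Step 3: consider each remaining candidate
        candidate = selected + [f]
        # Steps 2-3: train and evaluate with the candidate set (negated MSE)
        score = cross_val_score(LinearRegression(), X[:, candidate], y,
                                scoring="neg_mean_squared_error", cv=5).mean()
        if score > best_score:
            best_score, best_feature = score, f
    selected.append(best_feature)     # Step 4: keep the best candidate
    remaining.remove(best_feature)

print("Selected feature indices:", selected)   # Step 6: final feature set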
Forward Selection - Example
• Suppose you're working on a predictive modeling task: predicting house prices.
• You have a dataset with features like the number of bedrooms, square footage,
presence of a garage, distance to the nearest school, and age of the house.
• You start with an empty set of features.
• In each iteration, you consider adding one feature to the set and measure how
much it improves the model's ability to predict house prices.
• You continue this process until you've added a predefined number of features or
until you're satisfied with the model's performance.
• The selected features form the final set used in your model to predict house
prices.
Feature Selection by Forward Selection
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Load data
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Forward selection: greedily add features until two are chosen
sfs = SFS(LogisticRegression(max_iter=1000),
          k_features=2,
          forward=True,
          scoring="accuracy",
          cv=5)
sfs = sfs.fit(X, y)

# Selected features
selected_features = X.columns[list(sfs.k_feature_idx_)]
print("Selected features:", selected_features)
# Lasso Regression: L1 regularization drives weak coefficients to exactly zero
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Selected features:", X.columns[lasso.coef_ != 0])
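The two lines that follow train a Random Forest on previously selected features; a minimal assumed setup (the split parameters and the reuse of the iris selected_features from above are assumptions) would be:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed setup: split the iris data and keep only the selected columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

rf_classifier = RandomForestClassifier(random_state=42)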
rf_classifier.fit(X_train_selected, y_train)
y_pred = rf_classifier.predict(X_test_selected)
print("Accuracy:", accuracy_score(y_test, y_pred))