Data Preprocessing for Machine Learning in Python

Section 1: Introduction to Data Preprocessing & Tools

Key Concepts Covered:


• Definition and purpose of data preprocessing.
• Types of data transformations: cleaning, encoding, scaling, dimensionality reduction,
oversampling.
• Differences between numerical and categorical variables.
• Introduction to Jupyter Notebooks for interactive data analysis.

1. What is Data Preprocessing?


o Definition: A set of techniques to transform raw data into a format suitable for machine
learning models.
o Why it matters: Models require structured, clean, and normalized data to perform
effectively.
o Key Transformations:
▪ Cleaning: Handling missing values (e.g., imputation).
▪ Encoding: Converting categorical variables into numerical representations.
▪ Scaling: Normalizing numerical features (e.g., Min-Max, Standardization).
▪ Dimensionality Reduction: Reducing feature count (e.g., PCA).
▪ Oversampling: Addressing class imbalance (e.g., SMOTE).
2. Numerical vs. Categorical Variables
o Numerical Variables:
▪ Represent quantitative data (e.g., integers, floats).
▪ Examples: Age (25, 30), Temperature (98.6°F).
o Categorical Variables:
▪ Represent discrete, finite categories (e.g., labels, binary values).
▪ Examples: Color (Red/Blue), Gender (Male/Female).
o Comparison Table:

Feature Numerical Variables Categorical Variables

Data Type Continuous/Discrete Discrete (finite set)

Examples Height, Salary Country, Product Category

Preprocessing Techniques Scaling, Normalization One-Hot Encoding, Labeling

3. Introduction to Jupyter Notebooks


o What is Jupyter?: An open-source, browser-based interactive environment for
combining code, visualizations, and documentation.
o Key Features:
▪ Cells: Execute code blocks independently (mix code, markdown, and outputs).
▪ Kernel: Runs code in the background (supports Python, R, etc.).
▪ Keyboard Shortcuts:
▪ Ctrl + Enter: Execute current cell.
▪ Shift + Enter: Execute cell and move to the next.
▪ Esc, then A/B: Insert a cell above/below (command mode).
o Setup Guide:
# Install JupyterLab
pip install jupyterlab
# Launch JupyterLab (use `jupyter notebook` for the classic interface)
jupyter lab

o Best Practices:
▪ Use markdown cells to document steps.
▪ Split code into logical cells (e.g., data loading, cleaning, modeling).

Common Pitfalls & Tips:


• Pitfall 1: Mixing code and documentation without structure.
o Tip: Use markdown headers to separate sections (e.g., "Data Loading", "Exploratory
Analysis").
• Pitfall 2: Not restarting the kernel after major changes.
o Tip: Use Kernel > Restart & Run All to ensure reproducibility.

Additional Context
1. Why Use Jupyter Over Other IDEs?
o Ideal for iterative data exploration (run code line-by-line).
o Supports inline visualizations (e.g., matplotlib plots).
o Shareable format (export to HTML/PDF).
2. Installing Jupyter via Anaconda (Alternative Method):
o Anaconda simplifies package management for data science:
conda install -c conda-forge jupyterlab

3. Jupyter Lab vs. Jupyter Notebook:


o Jupyter Lab: Modern interface with tabs, panels, and extensions.
o Jupyter Notebook: Classic single-document interface.

4. Critical Libraries for Data Preprocessing:


import pandas as pd # Data manipulation
import numpy as np # Numerical operations
from sklearn.preprocessing import StandardScaler, OneHotEncoder # Scaling/Encoding
Section 2: Data Cleaning
Key Concepts Covered:
• Handling missing values in numerical and categorical variables.
• Strategies for imputation: mean, median, constant, most frequent, and K-Nearest Neighbors
(KNN).
• Using ColumnTransformer and make_column_selector to automate feature-specific
transformations.
• Practical exercises to apply imputation techniques.

1. Why Clean Data?


o Problem: Most machine learning models cannot handle missing values (NaNs).
o Goal: Replace NaNs with meaningful values while avoiding data leakage.
o Impact: Improves model reliability, reduces bias, and ensures compatibility with
algorithms.
2. Identifying Numerical vs. Categorical Variables
o Numerical Variables:
▪ Continuous (e.g., temperature) or discrete (e.g., age).
▪ Detected using df.select_dtypes(include=['int64', 'float64']).
o Categorical Variables:
▪ Discrete labels (e.g., "Red", "Yes/No").
▪ Detected using df.select_dtypes(include=['object', 'category']).
o Code Example:
numerical_cols = df.select_dtypes(exclude=['object']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

3. Imputation Techniques for Numerical Data


o SimpleImputer (scikit-learn):
▪ Strategies:

Strategy Use Case

mean Symmetrical data distributions.

median Skewed distributions.

constant Domain-specific fixed value.

▪ Code Example:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

o KNNImputer:
▪ Infers missing values from the k nearest neighbors.
▪ Advantages: Adapts to data patterns; uses feature relationships.
▪ Code Example:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5, weights='distance')
X_knn = imputer.fit_transform(X)

4. Imputation Techniques for Categorical Data


o Strategies:
▪ most_frequent: Replace NaNs with the mode.
▪ constant: Replace with a placeholder (e.g., "Unknown").
o Code Example:
imputer = SimpleImputer(strategy='most_frequent')
cat_imputed = imputer.fit_transform(df[categorical_cols])

5. Automating Imputation with ColumnTransformer


o Purpose: Apply different imputers to numerical and categorical columns in one step.
o Code Example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='mean'), numerical_cols),
('cat', SimpleImputer(strategy='most_frequent'), categorical_cols)
]
)
X_clean = preprocessor.fit_transform(X)

6. make_column_selector for Dynamic Feature Selection


o Use Case: Automatically select columns by data type.
o Code Example:
from sklearn.compose import make_column_selector

preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='median'),
make_column_selector(dtype_exclude='object')),
('cat', SimpleImputer(strategy='constant', fill_value='Unknown'),
make_column_selector(dtype_include='object'))
]
)
Common Pitfalls & Tips:
• Pitfall 1: Imputing before splitting data into train/test sets.
o Tip: Always split data first to avoid leakage:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2)
imputer.fit(X_train) # Fit only on training data!

• Pitfall 2: Using KNNImputer on high-dimensional data.


o Tip: Reduce dimensionality first (e.g., with PCA) to improve performance.

Additional Context:
1. Evaluating Imputation Quality:
o Compare model performance (e.g., RMSE) before and after imputation (a short sketch follows these notes).
o Use cross-validation to assess robustness.
2. Advanced Techniques (Preview):
o Iterative Imputer: Models missing values as a function of other features (e.g., MICE).
o Domain-Specific Imputation: Replace NaNs using business logic (e.g., "Unknown" for
missing categories).
3. Why Avoid Default SimpleImputer Settings?
o The default strategy='mean' may not suit skewed data. Always visualize distributions
first!
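
The cross-validation idea in note 1 can be made concrete with a minimal, hedged sketch. It assumes X_train and y_train are a numerical training set with missing values and a numeric target; Ridge is used purely as a placeholder model:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Compare imputers by the cross-validated RMSE of the same downstream model
for imputer in (SimpleImputer(strategy='mean'),
                SimpleImputer(strategy='median'),
                KNNImputer(n_neighbors=5)):
    pipe = make_pipeline(imputer, Ridge())
    rmse = -cross_val_score(pipe, X_train, y_train, cv=5,
                            scoring='neg_root_mean_squared_error').mean()
    print(imputer, round(rmse, 3))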

Visual Aids:
1. Imputation Workflow:
Detect NaNs → Split Data → Fit Imputer on Train → Transform Train/Test

2. KNNImputer Illustration:
o Missing value (?) inferred from nearest neighbors (A, B, C) using weighted average.

Record: [10, 20, ?, 40]


Neighbors:
- [10, 20, 30, 40] → Weight = 1/distance
- [10, 20, 25, 40] → Weight = 1/distance
Imputed Value = (30*0.5 + 25*0.5) / (0.5 + 0.5) = 27.5
Section 3: Encoding Categorical Features

Key Concepts Covered:


• One-Hot Encoding: Transforming categorical variables into binary columns.
• Ordinal Encoding: Mapping ordered categories to integers (e.g., "low" → 0, "medium" → 1).
• Label Encoding: Converting target labels into integers (for classification tasks).
• Handling Unknown Categories: Strategies for unseen values during transformation.
• Automation with ColumnTransformer: Applying encoders to specific columns dynamically.

Detailed Notes:
1. One-Hot Encoding
o Purpose: Convert categorical variables into binary (0/1) columns.
o Use Case: Nominal data (no inherent order, e.g., colors, countries).
o Code Example:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(df[['category_column']])

o Key Parameters:
▪ sparse=False: Return a dense matrix instead of the default sparse one (the parameter is named sparse_output in scikit-learn 1.2+).
▪ drop='first': Remove one column to avoid multicollinearity.
▪ handle_unknown='ignore': Encode unseen categories as all zeros.
o Pitfall: High cardinality (e.g., 1,000 categories) increases dimensionality.
▪ Solution: Use feature selection or dimensionality reduction.
2. Ordinal Encoding
o Purpose: Encode ordered categories as integers (e.g., "low" < "medium" < "high").
o Use Case: Ordinal data with natural ranking.
o Code Example:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ordinal = encoder.fit_transform(df[['ordinal_column']])

o Custom Order: Define ranking explicitly using the categories parameter.


3. Label Encoding
o Purpose: Convert target labels (e.g., class names) into integers.
o Use Case: Preparing labels for classification models.
o Code Example:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
o Warning: Do not use for input features—models may misinterpret integers as ordinal.
4. Handling Unknown Categories
o Problem: New categories in test/production data not seen during training.
o Strategies:
▪ handle_unknown='error': Raise an error (default).
▪ handle_unknown='ignore': Encode as zeros (for one-hot) or a placeholder.
5. Automating with ColumnTransformer
o Purpose: Apply different encoders to different columns in one step.
o Code Example:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector

preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(), make_column_selector(dtype_include='object')),
('ordinal', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['ordinal_column'])
],
remainder='passthrough'
)
X_transformed = preprocessor.fit_transform(X)

o Dynamic Selection: Use make_column_selector to filter columns by data type.

Common Pitfalls & Tips:


• Pitfall 1: Using label encoding for input features.
o Tip: Use one-hot or ordinal encoding instead to avoid implying false order.
• Pitfall 2: Ignoring high cardinality in one-hot encoding.
o Tip: Use drop='first' or target encoding (covered in later sections).
• Pitfall 3: Not fitting encoders on training data only.
o Tip: Always fit on training data to prevent data leakage:
encoder.fit(X_train)
X_test_encoded = encoder.transform(X_test)

Additional Context:
1. One-Hot vs. Ordinal vs. Label Encoding:

Technique Use Case Dimensionality Impact

One-Hot Nominal data (no order) High (creates k columns)

Ordinal Ordinal data (natural order) Low (1 column)

Label Target variable encoding Low (1 column)


2. Handling High Cardinality (a short sketch follows this list):
o Frequency Encoding: Replace categories with their occurrence counts.
o Target Encoding: Encode categories based on target mean (advanced).

3. Production Readiness:
o Always specify categories in OneHotEncoder/OrdinalEncoder to handle future unseen
values.
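
The frequency-encoding sketch referenced in note 2, as a minimal illustration. It assumes df is a pandas DataFrame and 'city' is a hypothetical high-cardinality column:
# Frequency encoding: replace each category by how often it occurs in the data
# ('city' is a hypothetical column name)
freq = df['city'].value_counts()
df['city_freq'] = df['city'].map(freq)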

Visual Aids:
1. One-Hot Encoding Workflow:
Original Data: [A, B, A, C]
Encoded:
|A|B|C|
|---|---|--|
|1|0|0|
|0|1|0|
|1|0|0|
|0|0|1|

2. Ordinal Encoding Example:


Categories: ['low', 'medium', 'high'] → Encoded: [0, 1, 2]

3. ColumnTransformer Diagram:
Input Data → [OneHotEncoder on Column 1] → [OrdinalEncoder on Column 2] → Merged Output
Section 4: Transformation of Numerical Features

Key Concepts Covered:


• Power Transformations: Reducing skewness (Yeo-Johnson, Box-Cox).
• Binning: Converting numerical features into categorical bins (uniform, quantile, k-means).
• Binary Thresholding: Converting values to 0/1 based on a threshold.
• Custom Transformations: Using FunctionTransformer for arbitrary functions.
• Automation: Applying transformations dynamically with ColumnTransformer.

Detailed Notes:
1. Power Transformations
o Purpose: Reduce skewness and approximate normality for models sensitive to feature
distributions (e.g., KNN, clustering).
o Methods:
▪ Yeo-Johnson Transformation: Works with zero, positive, and negative values.
▪ Box-Cox Transformation: Requires strictly positive values.
o Code Example:
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson transformation (the default method)
pt_yeojohnson = PowerTransformer(method='yeo-johnson', standardize=True)
X_yeojohnson = pt_yeojohnson.fit_transform(X)

# Box-Cox transformation: every value in X must be strictly positive
# (shift or filter the data beforehand if it contains zeros or negatives)
pt_boxcox = PowerTransformer(method='box-cox', standardize=True)
X_boxcox = pt_boxcox.fit_transform(X)

2. Binning (Discretization)
o Strategies:

Strategy Use Case

Uniform Equal-width bins (e.g., 0-50, 50-100).

Quantile Equal-frequency bins (e.g., quintiles).

k-means Bins based on clustering algorithm.

o Code Example:
from sklearn.preprocessing import KBinsDiscretizer

# Uniform bins (5 bins)


uniform_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_uniform = uniform_binner.fit_transform(X)
# Quantile bins (5 bins)
quantile_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_quantile = quantile_binner.fit_transform(X)

# k-means bins (5 clusters)


kmeans_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
X_kmeans = kmeans_binner.fit_transform(X)

3. Binary Thresholding
o Purpose: Convert numerical features to binary (0/1) based on a threshold.
o Code Example:
from sklearn.preprocessing import Binarizer

# Threshold = 12
binarizer = Binarizer(threshold=12)
X_binary = binarizer.fit_transform(X)

4. Custom Transformations
o Use Case: Apply custom logic (e.g., log transform, scaling).
o Code Example:
from sklearn.preprocessing import FunctionTransformer
import numpy as np

# Log transformation
log_transformer = FunctionTransformer(np.log, validate=True)
X_log = log_transformer.fit_transform(X)

# Custom multiplier function


def multiply_by(x, factor=2):
    return x * factor

custom_transformer = FunctionTransformer(
multiply_by,
kw_args={'factor': 3},
validate=True
)
X_custom = custom_transformer.fit_transform(X)
5. Automation with ColumnTransformer
o Purpose: Apply different transformations to specific columns.
o Code Example:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
transformers=[
('power_transform', PowerTransformer(), ['feature1']),
('binarize', Binarizer(threshold=10), ['feature2']),
('log_transform', FunctionTransformer(np.log), ['feature3'])
],
remainder='passthrough'
)
X_processed = preprocessor.fit_transform(X)

Common Pitfalls & Tips:


• Pitfall 1: Applying Box-Cox to non-positive data.
o Tip: Use the Yeo-Johnson method instead, or shift/clip the data so every value is strictly positive first.

• Pitfall 2: Data leakage when fitting transformers.


o Tip: Fit transformers on training data only:
preprocessor.fit(X_train)
X_test_transformed = preprocessor.transform(X_test)

• Pitfall 3: High dimensionality from excessive binning.


o Tip: Use encode='ordinal' instead of one-hot encoding for bins.

Additional Context:
1. Transformation Comparison:

Technique Use Case Impact on Dimensionality

Power Transform Skewed data → Normal distribution None (1:1 mapping)

Binning Create ordinal/categorical bins Increases only if bins are one-hot encoded (k bins → k columns); none with ordinal encoding

Binary Threshold Flag values above/below a cutoff None (1:1 mapping; each value becomes 0/1)

2. Advanced Techniques:
o Interaction Terms: Combine features (e.g., feature1 * feature2).
o Polynomial Features: Create non-linear relationships (e.g., feature1^2).
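
A minimal sketch of the polynomial/interaction idea above, assuming X contains only numerical features:
from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds squared terms and pairwise interaction terms (e.g., feature1 * feature2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)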
Visual Aids:
1. Power Transformation Example:
Original Skewed Data → [Yeo-Johnson/Box-Cox] → Symmetrical Distribution

2. Binning Workflow:
Numerical Data → [Uniform/Quantile/k-means] → Ordinal Categories

3. ColumnTransformer Flow:
Input Data → [Power Transform on Col1] → [Binarize Col2] → [Log Transform Col3] → Merged Output

Section 5: Pipelines
Key Concepts Covered:
• Pipeline Definition: Sequences of transformations applied in order.
• Pipeline Construction: Using make_pipeline and Pipeline classes.
• Integration with ColumnTransformer: Combining feature-specific transformations.
• Parameter Tuning: Modifying pipeline components with set_params.
• Nested Pipelines: Embedding pipelines within ColumnTransformer.

Detailed Notes:
1. What Are Pipelines?
o Purpose: Streamline data preprocessing by chaining transformations (e.g., imputation
→ scaling → encoding).
o Benefits:
▪ Avoid data leakage by ensuring transformations are fitted only on training data.
▪ Simplify code and ensure reproducibility.
2. Building Pipelines
o Using make_pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer

# Create a pipeline: Impute missing values → Apply power transform


pipeline = make_pipeline(
SimpleImputer(strategy='median'),
PowerTransformer(method='yeo-johnson')
)
X_transformed = pipeline.fit_transform(X)

o Using the Pipeline Class (Explicit Naming):


from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('power_transform', PowerTransformer())
])
3. Combining Pipelines with ColumnTransformer
o Example: Apply different pipelines to numerical and categorical features.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define pipelines
numerical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('transformer', PowerTransformer())
])

categorical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(sparse=False))
])

# Combine using ColumnTransformer


preprocessor = ColumnTransformer([
('num', numerical_pipeline, ['age', 'income']),
('cat', categorical_pipeline, ['gender'])
])

X_processed = preprocessor.fit_transform(X)

4. Modifying Pipeline Parameters


o Use set_params: Adjust hyperparameters dynamically.
# Change imputation strategy in the numerical pipeline
preprocessor.set_params(num__imputer__strategy='mean')

# Update categorical encoder to handle unknowns


preprocessor.set_params(cat__encoder__handle_unknown='ignore')

Common Pitfalls & Tips:


• Pitfall 1: Data leakage from fitting on the entire dataset.
o Tip: Always use pipeline.fit(X_train) and pipeline.transform(X_test).
• Pitfall 2: Incorrect parameter syntax in set_params.
o Tip: Use double underscores (__) to navigate nested components
(e.g., num__imputer__strategy).
• Pitfall 3: Overlooking sparse matrices in one-hot encoding.
o Tip: Set sparse=False (sparse_output=False in scikit-learn 1.2+) in OneHotEncoder for dense output.
Additional Context:
1. Integrating Models into Pipelines:
from sklearn.linear_model import LogisticRegression

# Add a model as the final pipeline step


full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])

# Train and predict in one step


full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

2. Cross-Validation with Pipelines:


from sklearn.model_selection import cross_val_score
scores = cross_val_score(full_pipeline, X, y, cv=5)

3. Why Pipelines Matter:


o Ensure consistent transformations across training/testing.
o Simplify deployment by encapsulating preprocessing and modeling.

Visual Aids:
1. Pipeline Workflow:
Raw Data → [Imputer] → [Power Transformer] → [Model] → Predictions

2. ColumnTransformer Diagram:
Input Data → [Numerical Pipeline] → [Categorical Pipeline] → Merged Output

3. Parameter Tuning Syntax:


pipeline.set_params(component__subcomponent__parameter=value)
Section 6: Scaling

Key Concepts
• Scaling Purpose: Ensures features have comparable magnitudes to prevent models from
biasing toward higher-magnitude features.
• Normalization (Min-Max Scaling): Scales features to [0, 1] range.
• Standardization (Z-Score Scaling): Centers features to mean=0 and variance=1.
• Robust Scaling: Uses median and interquartile range (IQR) to reduce outlier impact.
• Inverse Transformation: Scikit-learn scalers allow reverting scaled data to original form.

1. Why Scaling Matters


• Model Sensitivity:
o Distance-based algorithms (e.g., KNN, SVM) and gradient-descent optimizers (e.g.,
neural networks, logistic regression) require scaled features.
o Example: A feature ranging [0, 1000] vs. [0, 1] can dominate distance calculations.
• Outliers:
o MinMaxScaler and StandardScaler are sensitive to outliers; RobustScaler is preferred
for skewed data.
2. Scaling Techniques

Method Formula Use Case

MinMaxScaler X_scaled = (X − X_min) / (X_max − X_min) Bounded ranges (e.g., images).

StandardScaler X_scaled = (X − μ) / σ Normally distributed data.

RobustScaler X_scaled = (X − median) / IQR Data with outliers.

Code Implementation:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Normalization
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)

# Standardization
standard = StandardScaler()
X_standard = standard.fit_transform(X)

# Robust Scaling
robust = RobustScaler()
X_robust = robust.fit_transform(X)
3. Handling Outliers
• RobustScaler:
o Uses median (resistant to outliers) and IQR (75th - 25th percentile).
o Example: If a feature has outliers in housing prices, use RobustScaler instead
of StandardScaler.

4. Pipeline Integration
• Steps:
1. Impute Missing Values: Use SimpleImputer.
2. Scale Features: Apply scaler in a pipeline.
3. Column-Specific Transformations: Use ColumnTransformer to target numerical
features.
Example Pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define numerical transformer


numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', MinMaxScaler()) # Replace with StandardScaler/RobustScaler
])

# Apply to numerical columns


preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_columns)
])

# Fit and transform


X_processed = preprocessor.fit_transform(X)

5. Changing Scalers in Pipelines


• Use set_params to switch scalers without rebuilding the entire pipeline:
# Change from MinMaxScaler to StandardScaler
preprocessor.set_params(num__scaler=StandardScaler())

6. Inverse Transformation
• Revert scaled data to original scale:
# For MinMaxScaler
X_original = minmax.inverse_transform(X_minmax)
Additional Notes
1. Data Leakage Warning:
o Always split data into train/test sets before scaling. Fit the scaler on the training data
only, then transform both train and test sets.

2. When Not to Scale:


o Tree-based models (e.g., Decision Trees, Random Forests) are invariant to feature
scales.

3. Visualization Tip:
o Plot distributions pre- and post-scaling to observe effects:
import seaborn as sns
sns.kdeplot(X['feature'], label='Original')
sns.kdeplot(X_standard[:, 0], label='Standardized')

4. Practical Advice:
o Experiment with all three scalers and validate model performance (e.g., cross-
validation).
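
A hedged sketch of that comparison, assuming a classification dataset (X, y) and using KNN purely as an example of a scale-sensitive model:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Train the same scale-sensitive model with each scaler and compare mean CV accuracy
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    pipe = make_pipeline(scaler, KNeighborsClassifier())
    print(type(scaler).__name__, round(cross_val_score(pipe, X, y, cv=5).mean(), 3))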

Section 7: Principal Component Analysis (PCA)


Key Concepts
• Purpose: Reduces dimensionality by transforming features into uncorrelated components
(principal components) sorted by variance.
• Covariance Matrix: PCA diagonalizes the covariance matrix, removing linear correlations
between features.
• Explained Variance: Retains components that capture the most variance, discarding less
informative ones.
• Scaling Requirement: Features must be scaled (standardized) before PCA if they have
different magnitudes.

1. How PCA Works


• Mathematical Foundation:
o PCA performs eigen decomposition on the covariance matrix of the data.
o Eigenvectors represent principal components (directions of maximum variance).
o Eigenvalues indicate the variance explained by each component.
o Formula: Covariance Matrix = (1 / (n − 1)) XᵀX.
• Steps:
1. Standardize Data: Center and scale features (use StandardScaler).
2. Compute Covariance Matrix: Captures feature relationships.
3. Eigen Decomposition: Extract eigenvectors (components) and eigenvalues (variance).
4. Sort Components: Order components by descending eigenvalues.
5. Select Top-k Components: Choose components that retain desired variance.
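
A hedged NumPy sketch of these five steps (not the scikit-learn implementation; it assumes X_scaled is an already standardized 2-D array, as produced with StandardScaler below):
import numpy as np

cov = np.cov(X_scaled, rowvar=False)             # Step 2: covariance matrix of the standardized data
eigvals, eigvecs = np.linalg.eigh(cov)           # Step 3: eigen decomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]                # Step 4: sort components by descending eigenvalue
k = 2                                            # Step 5: keep the top-k components (k=2 here)
X_pca_manual = X_scaled @ eigvecs[:, order[:k]]  # project data onto the selected components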
2. Implementing PCA with Scikit-Learn

Code Example:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load dataset (e.g., diabetes dataset)


from sklearn.datasets import load_diabetes
data = load_diabetes()
X = data.data

# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (retain 80% variance)


pca = PCA(n_components=0.8)
X_pca = pca.fit_transform(X_scaled)

print(f"Original shape: {X.shape}, Reduced shape: {X_pca.shape}")


print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

3. Determining the Optimal Number of Components


• Scree Plot: Visualize the variance explained by each component to identify "elbow points":
import matplotlib.pyplot as plt

pca_full = PCA().fit(X_scaled)
plt.plot(range(1, len(pca_full.explained_variance_ratio_)+1),
pca_full.explained_variance_ratio_.cumsum(),
marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Scree Plot')
plt.show()
o Elbow Point: Where the curve bends sharply (e.g., 80% cumulative variance).
• Variance Threshold:
o Set n_components as a float (e.g., 0.8 for 80% variance).

4. Practical Example: Diabetes Dataset


• Dataset: 442 samples, 10 features.
• Steps:
1. Standardize features.
2. Fit PCA without specifying n_components to analyze variance ratios.
3. Plot cumulative variance to choose components (e.g., 5 components for 80% variance).
Output Interpretation:
• First component explains ~40% variance; 5 components needed for 80%.

5. PCA vs. Other Techniques


• Linear Discriminant Analysis (LDA): Supervised method maximizing class separability.
• t-SNE/UMAP: Non-linear techniques for visualization (not for feature reduction).
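
For contrast, a minimal LDA sketch. LDA is supervised, so it needs class labels; this assumes a classification dataset (X_scaled, y) with at least three classes, rather than the regression diabetes data used above:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA projects onto at most (n_classes - 1) directions that maximize class separability
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)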

Additional Notes
1. When to Use PCA:
o High-dimensional datasets (e.g., images, genomics).
o Multicollinearity in linear models (e.g., regression).
o Noise reduction or feature extraction for clustering.
2. Scaling is Mandatory:
o PCA is sensitive to feature scales. Unscaled data skews variance calculations.
3. Interpretability:
o Principal components are linear combinations of original features and lack direct
business meaning.
4. Common Pitfalls:
o Data Leakage: Fit PCA on training data only, then transform test data.
o Over-Reduction: Retaining too few components loses critical information.
5. Advanced Topics (Preview):
o Incremental PCA: For large datasets that don’t fit in memory.
o Kernel PCA: Non-linear dimensionality reduction.
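
Minimal sketches of the two previewed variants, under the same assumption that X_scaled is a standardized array:
from sklearn.decomposition import IncrementalPCA, KernelPCA

# Incremental PCA: fits in mini-batches, useful when the data does not fit in memory
ipca = IncrementalPCA(n_components=5, batch_size=100)
X_ipca = ipca.fit_transform(X_scaled)

# Kernel PCA: non-linear dimensionality reduction via the kernel trick
kpca = KernelPCA(n_components=5, kernel='rbf')
X_kpca = kpca.fit_transform(X_scaled)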

Section 8: Filter-Based Feature Selection

Key Concepts
• Purpose: Reduces dimensionality by selecting features most relevant to the target variable.
• Methods:
o Statistical Tests: F-test, ANOVA, mutual information, chi-square.
o Model-Based: Feature importances from algorithms like Random Forest.
• Scenarios:
o Numerical features vs. numerical target (e.g., Pearson correlation).
o Categorical features vs. categorical target (e.g., chi-square).
o Mixed feature/target types (e.g., mutual information).

1. Why Feature Selection?


• Benefits:
o Reduces training time and overfitting.
o Improves model interpretability (identifies key drivers).
o Enhances performance by eliminating noise.
2. Filter Methods by Data Type

Feature Type Target Type Method Implementation in Scikit-Learn

Numerical Numerical Pearson correlation (F-test) f_regression

Numerical Categorical ANOVA f_classif

Categorical Numerical Mutual Information mutual_info_regression

Categorical Categorical Chi-square chi2

3. Implementing Filter Methods


Example 1: Numerical Features & Numerical Target (F-test)
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Load dataset as a DataFrame so column names are available
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Select top 5 features using F-test


selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

# Get selected feature names


selected_features = X.columns[selector.get_support()]

Example 2: Categorical Features & Categorical Target (Chi-square)


from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OrdinalEncoder

# Encode categorical features


encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Select top 3 features using chi-square


selector = SelectKBest(score_func=chi2, k=3)
X_selected = selector.fit_transform(X_encoded, y_categorical)

Example 3: Mutual Information


from sklearn.feature_selection import mutual_info_classif

# For classification (categorical target)


mi_scores = mutual_info_classif(X, y, discrete_features='auto')
4. Model-Based Feature Selection
Using Random Forest Feature Importances:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train model
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Select features with importance > mean


selector = SelectFromModel(model, threshold='mean')
X_selected = selector.fit_transform(X_train, y_train)

5. Handling Categorical Variables


• Encoding: Use OrdinalEncoder or OneHotEncoder before applying filter methods.
• Mutual Information: Specify discrete_features (True for all features, or a list of column indices) for encoded categorical features:
mi_scores = mutual_info_regression(X_encoded, y, discrete_features=[0, 1, 2])

Additional Notes
1. Mutual Information Parameters:
o n_neighbors (default=3): Controls bias-variance tradeoff. Increase for smoother
estimates.
o Example:
mi_scores = mutual_info_classif(X, y, n_neighbors=5)

2. Chi-Square Assumptions:
o Requires non-negative features (e.g., counts or one-hot encoded data).
o Avoid if expected frequencies in contingency tables are <5 (use Fisher’s exact test).

3. Pipeline Integration:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_classif, k=10))
])
X_processed = pipeline.fit_transform(X, y)

4. Common Pitfalls:
o Data Leakage: Always fit feature selectors on the training set.
o Scaling: Standardize numerical features before using distance-based metrics (e.g.,
mutual information).
5. Alternative Models for Feature Importance:
o Linear Models: Use coefficients (e.g., Lasso regression).
o Tree-Based Models: Use feature_importances_ (e.g., XGBoost).
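
A minimal sketch of coefficient-based selection with Lasso (assumes a regression task; alpha=0.01 is an arbitrary example value that should be tuned with cross-validation):
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Features whose Lasso coefficients are shrunk to (near) zero are dropped
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X_train, y_train)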

Key Takeaways:
• Filter methods are computationally efficient but ignore feature interactions.
• Always validate selected features using cross-validation.
• Combine filter methods with domain knowledge for interpretable results.

Section 9: Building a Complete Preprocessing Pipeline


Key Concepts
• Pipeline Purpose: Streamline preprocessing steps (cleaning, encoding, scaling) and integrate
dimensionality reduction (PCA) and feature selection.
• ColumnTransformer: Apply different transformations to numerical and categorical features.
• Modularity: Use set_params to dynamically adjust pipeline components (e.g., PCA
components, imputation strategy).
• Integration: Combine imputation, scaling, encoding, PCA, and feature selection into a single
workflow.

1. Pipeline Components
1. Data Cleaning:
o Numerical Features: Impute missing values with median.
o Categorical Features: Impute missing values with most frequent category.
2. Feature Transformation:
o Numerical Features: Standardize using StandardScaler.
o Categorical Features: Encode using OneHotEncoder.
3. Dimensionality Reduction: Apply PCA to reduce features.
4. Feature Selection: Use statistical tests (e.g., ANOVA) to select top features.

2. Pipeline Implementation

Step 1: Import Libraries


import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
Step 2: Define Column Transformers
# Numerical Pipeline
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])

# Categorical Pipeline
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Column Selectors
numerical_features = ['age', 'income'] # Example numerical columns
categorical_features = ['gender', 'city'] # Example categorical columns

# Combine Transformers
preprocessor = ColumnTransformer(transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])

Step 3: Add PCA and Feature Selection


# Full Pipeline
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('pca', PCA(n_components=10)), # Reduce to 10 components
('feature_selector', SelectKBest(score_func=f_classif, k=5)) # Select top 5 features
])

Step 4: Fit and Transform Data


# Example dataset (load_data() is a placeholder for your own data-loading code)
X_train, y_train = load_data()

# Fit and transform


X_processed = full_pipeline.fit_transform(X_train, y_train)

3. Modifying Pipeline Parameters


Use set_params to adjust components without rebuilding the pipeline:
# Change PCA components to 15 and select top 3 features
full_pipeline.set_params(
pca__n_components=15,
feature_selector__k=3
)
# Update imputation strategy for numerical features
full_pipeline.set_params(
preprocessor__num__imputer__strategy='mean' # Use mean instead of median
)
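
Because these are ordinary estimator parameters, they can also be tuned automatically. A hedged sketch that appends a classifier (LogisticRegression, purely as an example) and grid-searches two nested parameters; the grid values are illustrative:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

model_pipeline = Pipeline(steps=[
    ('preprocessing', full_pipeline),
    ('classifier', LogisticRegression(max_iter=1000))
])

param_grid = {
    'preprocessing__pca__n_components': [5, 10],
    'preprocessing__feature_selector__k': [3, 5]
}
search = GridSearchCV(model_pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)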
4. Pipeline Execution Flow
1. Preprocessing:
o Clean and transform numerical/categorical features separately.
2. PCA:
o Reduce dimensionality of the combined dataset.
3. Feature Selection:
o Select top features based on ANOVA F-test.

Additional Notes
1. Why ColumnTransformer?
o Ensures numerical and categorical features are processed independently.
o Avoids data leakage and incorrect scaling/encoding.
2. Order of Operations:
o PCA is applied after preprocessing to ensure standardized inputs.
o Feature selection is done after PCA to focus on the most informative reduced
components.
3. Handling Nested Pipelines:
o Use double underscores (__) to access nested parameters
(e.g., preprocessor__num__imputer).
4. Common Pitfalls:
o Data Leakage: Always fit the pipeline on the training set and transform the test set.
o Categorical Encoding: Avoid one-hot encoding high-cardinality features (use
alternatives like target encoding).
5. Visualizing the Pipeline:
from sklearn import set_config
set_config(display='diagram')
full_pipeline # Displays an interactive diagram

Key Takeaways:
• Pipelines ensure reproducibility and reduce code complexity.
• Modular design allows easy experimentation with different preprocessing strategies.
• Always validate the pipeline using cross-validation to avoid overfitting.
Section 10: Handling Imbalanced Data with SMOTE
Key Concepts
• Purpose: Address class imbalance in classification problems by generating synthetic samples
for minority classes.
• SMOTE Algorithm: Creates synthetic data points via interpolation between nearest neighbors
of minority class.
• Preprocessing: Requires standardization/normalization of numerical features.
• Categorical Handling: SMOTE-NC variant supports mixed data types (numerical +
categorical).

1. Why SMOTE?
• Imbalance Issues:
o Models may bias toward majority class, leading to poor minority class prediction.
o Common in fraud detection, medical diagnosis, etc.
• SMOTE Benefits:
o Avoids overfitting caused by simple duplication (e.g., random oversampling).
o Balances class distribution for fairer model training.

2. How SMOTE Works


1. Select a Minority Class Sample: Randomly choose a data point.
2. Find k-Nearest Neighbors: Identify k similar minority class points (default k=5).
3. Generate Synthetic Sample: Interpolate between the selected point and a random neighbor.
o Formula: X_new = X_i + λ(X_j − X_i), where λ ∈ [0, 1].

3. Implementation Steps
Step 1: Standardize Numerical Features
import pandas as pd
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Load dataset (e.g., wine dataset)


from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply SMOTE
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_scaled, y)

# Verify balance
print(pd.Series(y_resampled).value_counts())
Step 2: Inverse Transformation (Optional)
# Revert to original feature space
X_original_scale = scaler.inverse_transform(X_resampled)

4. Handling Categorical Features


• Use SMOTE-NC for datasets with categorical variables:
from imblearn.over_sampling import SMOTENC

# Specify categorical feature indices (e.g., first 2 columns)


sm_nc = SMOTENC(categorical_features=[0, 1], random_state=42)
X_resampled, y_resampled = sm_nc.fit_resample(X, y)

5. Practical Example with Wine Dataset


import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Load data
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Check imbalance
print(y.value_counts())

# Standardize and apply SMOTE


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
sm = SMOTE(random_state=42)
X_balanced, y_balanced = sm.fit_resample(X_scaled, y)

# Inverse standardization (if needed)


X_original = scaler.inverse_transform(X_balanced)

6. Best Practices
• Data Leakage: Fit scaler on training data only, then transform test data.
• Evaluation Metrics: Use precision, recall, F1-score, or ROC-AUC instead of accuracy.
• Skewed Features: Apply PowerTransformer before SMOTE for highly skewed distributions.
• When to Avoid SMOTE:
o Extremely small minority class (e.g., < 10 samples).
o Time-series data (temporal dependencies).
Additional Notes
1. Alternatives to SMOTE (a short sketch follows these notes):
o ADASYN: Generates more samples near decision boundaries.
o Undersampling: Random/Clean undersampling of majority class.
o Class Weights: Assign higher weights to minority classes during model training.

2. Pipeline Integration:
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
3. Cross-Validation: Always validate performance using stratified k-fold to maintain class
balance.
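
Hedged sketches of the alternatives from note 1 (reusing X_scaled and y from above; ADASYN and RandomUnderSampler come from the imbalanced-learn package):
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

# ADASYN: oversamples the minority class, focusing on harder-to-learn regions
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X_scaled, y)

# Random undersampling: drop majority-class samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_scaled, y)

# Class weights: keep the data as-is and reweight the classes inside the model
clf = RandomForestClassifier(class_weight='balanced', random_state=42)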

Key Takeaways:
• SMOTE improves model fairness but requires careful preprocessing.
• Always validate synthetic data quality to avoid introducing noise.
• Combine SMOTE with robust evaluation metrics for reliable results.
