Data Preprocessing for Machine Learning in Python
o Best Practices:
▪ Use markdown cells to document steps.
▪ Split code into logical cells (e.g., data loading, cleaning, modeling).
Additional Context
1. Why Use Jupyter Over Other IDEs?
o Ideal for iterative data exploration (run code line-by-line).
o Supports inline visualizations (e.g., matplotlib plots).
o Shareable format (export to HTML/PDF).
2. Installing Jupyter via Anaconda (Alternative Method):
o Anaconda simplifies package management for data science:
conda install -c conda-forge jupyterlab
Section 2: Handling Missing Values
o SimpleImputer:
▪ Replaces missing values in each column with a single statistic (mean, median, most frequent, or a constant).
▪ Code Example:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
o KNNImputer:
▪ Infers missing values using values from *k* nearest neighbors.
▪ Advantages: Adapts to data patterns; uses feature relationships.
▪ Code Example:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5, weights='distance')
X_knn = imputer.fit_transform(X)
o Combining imputers with ColumnTransformer:
▪ Apply different imputation strategies to numerical and categorical columns.
▪ Code Example:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='mean'), numerical_cols),
('cat', SimpleImputer(strategy='most_frequent'), categorical_cols)
]
)
X_clean = preprocessor.fit_transform(X)
▪ Selecting columns automatically by dtype with make_column_selector:
from sklearn.compose import ColumnTransformer, make_column_selector
preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='median'),
make_column_selector(dtype_exclude='object')),
('cat', SimpleImputer(strategy='constant', fill_value='Unknown'),
make_column_selector(dtype_include='object'))
]
)
Common Pitfalls & Tips:
• Pitfall 1: Imputing before splitting data into train/test sets.
o Tip: Always split data first to avoid leakage:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2)
imputer.fit(X_train) # Fit only on training data!
Additional Context:
1. Evaluating Imputation Quality:
o Compare model performance (e.g., RMSE) before/after imputation.
o Use cross-validation to assess robustness.
2. Advanced Techniques (Preview):
o Iterative Imputer: Models missing values as a function of the other features (e.g., MICE); see the sketch after this list.
o Domain-Specific Imputation: Replace NaNs using business logic (e.g., "Unknown" for
missing categories).
3. Why Avoid Default SimpleImputer Settings?
o The default strategy='mean' may not suit skewed data. Always visualize distributions
first!
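A minimal sketch of the iterative (MICE-style) imputation mentioned in point 2 above; IterativeImputer is still flagged as experimental in scikit-learn and must be enabled explicitly:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (import enables IterativeImputer)
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
X_iterative = imputer.fit_transform(X)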
Visual Aids:
1. Imputation Workflow:
Detect NaNs → Split Data → Fit Imputer on Train → Transform Train/Test
2. KNNImputer Illustration:
o Missing value (?) inferred from nearest neighbors (A, B, C) using weighted average.
Section 3: Encoding Categorical Features
Detailed Notes:
1. One-Hot Encoding
o Purpose: Convert categorical variables into binary (0/1) columns.
o Use Case: Nominal data (no inherent order, e.g., colors, countries).
o Code Example:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # 'sparse_output' was named 'sparse' before scikit-learn 1.2
X_encoded = encoder.fit_transform(df[['category_column']])
o Key Parameters:
▪ sparse_output=False: Return a dense array instead of a sparse matrix (this parameter was named sparse before scikit-learn 1.2).
▪ drop='first': Remove one column to avoid multicollinearity.
▪ handle_unknown='ignore': Encode unseen categories as all zeros.
o Pitfall: High cardinality (e.g., 1,000 categories) increases dimensionality.
▪ Solution: Use feature selection or dimensionality reduction.
2. Ordinal Encoding
o Purpose: Encode ordered categories as integers (e.g., "low" < "medium" < "high").
o Use Case: Ordinal data with natural ranking.
o Code Example:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ordinal = encoder.fit_transform(df[['ordinal_column']])
3. Combining Encoders with ColumnTransformer
o Code Example:
from sklearn.compose import ColumnTransformer, make_column_selector
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(), make_column_selector(dtype_include='object')),
('ordinal', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['ordinal_column'])
],
remainder='passthrough'
)
X_transformed = preprocessor.fit_transform(X)
Additional Context:
1. One-Hot vs. Ordinal vs. Label Encoding:
|Encoder|Output|Typical use|
|---|---|---|
|OneHotEncoder|One binary column per category|Nominal features (no order)|
|OrdinalEncoder|Single integer column per feature|Ordered categorical features|
|LabelEncoder|Single integer column|Target labels (y), not input features|
3. Production Readiness:
o Always specify categories in OneHotEncoder/OrdinalEncoder to handle future unseen
values.
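o Example (a minimal sketch; the category values and column name are illustrative):
from sklearn.preprocessing import OneHotEncoder
# With explicit categories, values outside the list are handled by handle_unknown at transform time
encoder = OneHotEncoder(categories=[['red', 'green', 'blue']], handle_unknown='ignore')
X_encoded = encoder.fit_transform(df[['color']])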
Visual Aids:
1. One-Hot Encoding Workflow:
Original Data: [A, B, A, C]
Encoded:
|A|B|C|
|---|---|---|
|1|0|0|
|0|1|0|
|1|0|0|
|0|0|1|
3. ColumnTransformer Diagram:
Input Data → [OneHotEncoder on Column 1] → [OrdinalEncoder on Column 2] → Merged Output
Section 4: Transformation of Numerical Features
Detailed Notes:
1. Power Transformations
o Purpose: Reduce skewness and approximate normality for models sensitive to feature
distributions (e.g., KNN, clustering).
o Methods:
▪ Yeo-Johnson Transformation: Works with both positive and negative values (the default method).
▪ Box-Cox Transformation: Requires strictly positive values.
o Code Example:
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson transformation (default)
pt_johnson = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt_johnson.fit_transform(X)
# Box-Cox transformation (all values must be strictly positive)
pt_boxcox = PowerTransformer(method='box-cox', standardize=True)
X_transformed = pt_boxcox.fit_transform(X)  # raises an error if X contains values <= 0
2. Binning (Discretization)
o Strategies: 'uniform' (equal-width bins), 'quantile' (equal-frequency bins), and 'kmeans' (bin edges from k-means clustering).
o Code Example:
from sklearn.preprocessing import KBinsDiscretizer
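# A minimal usage sketch; n_bins=5 and strategy='quantile' are illustrative choices, not from the notes
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_binned = binner.fit_transform(X)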
3. Binary Thresholding
o Purpose: Convert numerical features to binary (0/1) based on a threshold.
o Code Example:
from sklearn.preprocessing import Binarizer
# Threshold = 12
binarizer = Binarizer(threshold=12)
X_binary = binarizer.fit_transform(X)
4. Custom Transformations
o Use Case: Apply custom logic (e.g., log transform, scaling).
o Code Example:
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Log transformation
log_transformer = FunctionTransformer(np.log, validate=True)
X_log = log_transformer.fit_transform(X)
# Define the custom function used with kw_args
def multiply_by(X, factor=1):
    return X * factor
custom_transformer = FunctionTransformer(
multiply_by,
kw_args={'factor': 3},
validate=True
)
X_custom = custom_transformer.fit_transform(X)
5. Automation with ColumnTransformer
o Purpose: Apply different transformations to specific columns.
o Code Example:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('power_transform', PowerTransformer(), ['feature1']),
('binarize', Binarizer(threshold=10), ['feature2']),
('log_transform', FunctionTransformer(np.log), ['feature3'])
],
remainder='passthrough'
)
X_processed = preprocessor.fit_transform(X)
Additional Context:
1. Transformation Comparison:
|Transformer|Effect|Typical use|
|---|---|---|
|PowerTransformer|Reduces skewness, approximates normality|Skewed continuous features|
|KBinsDiscretizer|Groups values into bins|Turning continuous features into ordinal categories|
|Binarizer|Thresholds values to 0/1|Flagging values above a cutoff|
|FunctionTransformer|Applies custom logic (e.g., log)|Domain-specific transformations|
2. Advanced Techniques:
o Interaction Terms: Combine features (e.g., feature1 * feature2).
o Polynomial Features: Create non-linear relationships (e.g., feature1^2).
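o Example: a minimal sketch of interaction/polynomial terms with scikit-learn's PolynomialFeatures:
from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds squared terms and pairwise interaction terms (e.g., feature1 * feature2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)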
Visual Aids:
1. Power Transformation Example:
Original Skewed Data → [Yeo-Johnson/Box-Cox] → Symmetrical Distribution
2. Binning Workflow:
Numerical Data → [Uniform/Quantile/k-means] → Ordinal Categories
3. ColumnTransformer Flow:
Input Data → [Power Transform on Col1] → [Binarize Col2] → [Log Transform Col3] → Merged Output
Section 5: Pipelines
Key Concepts Covered:
• Pipeline Definition: Sequences of transformations applied in order.
• Pipeline Construction: Using make_pipeline and Pipeline classes.
• Integration with ColumnTransformer: Combining feature-specific transformations.
• Parameter Tuning: Modifying pipeline components with set_params.
• Nested Pipelines: Embedding pipelines within ColumnTransformer.
Detailed Notes:
1. What Are Pipelines?
o Purpose: Streamline data preprocessing by chaining transformations (e.g., imputation
→ scaling → encoding).
o Benefits:
▪ Avoid data leakage by ensuring transformations are fitted only on training data.
▪ Simplify code and ensure reproducibility.
2. Building Pipelines
o Using the Pipeline class (make_pipeline is a shorthand that names the steps automatically):
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder
# Define pipelines (make_pipeline(SimpleImputer(strategy='median'), PowerTransformer()) is the equivalent shorthand)
numerical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('transformer', PowerTransformer())
])
categorical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(sparse_output=False))  # 'sparse_output' was named 'sparse' before scikit-learn 1.2
])
# Combine the pipelines column-wise with a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
('num', numerical_pipeline, make_column_selector(dtype_exclude='object')),
('cat', categorical_pipeline, make_column_selector(dtype_include='object'))
])
X_processed = preprocessor.fit_transform(X)
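The parameter-tuning idea from the Key Concepts above (set_params with double-underscore names) can be sketched as follows; the new strategy value is illustrative:
# Reach the imputer inside the 'num' pipeline of the ColumnTransformer and change its strategy
preprocessor.set_params(num__imputer__strategy='mean')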
Visual Aids:
1. Pipeline Workflow:
Raw Data → [Imputer] → [Power Transformer] → [Model] → Predictions
2. ColumnTransformer Diagram:
Input Data → [Numerical Pipeline] → [Categorical Pipeline] → Merged Output
Section 6: Scaling Numerical Features
Key Concepts
• Scaling Purpose: Ensures features have comparable magnitudes to prevent models from
biasing toward higher-magnitude features.
• Normalization (Min-Max Scaling): Scales features to [0, 1] range.
• Standardization (Z-Score Scaling): Centers features to mean=0 and variance=1.
• Robust Scaling: Uses median and interquartile range (IQR) to reduce outlier impact.
• Inverse Transformation: Scikit-learn scalers allow reverting scaled data to original form.
Scaler Comparison:
|Scaler|Formula|Best suited for|
|---|---|---|
|MinMaxScaler|X_scaled = (X − X_min) / (X_max − X_min)|Bounding features to [0, 1]|
|StandardScaler|X_scaled = (X − μ) / σ|Normally distributed data|
|RobustScaler|X_scaled = (X − median) / IQR|Data with outliers|
Code Implementation:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Normalization
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)
# Standardization
standard = StandardScaler()
X_standard = standard.fit_transform(X)
# Robust Scaling
robust = RobustScaler()
X_robust = robust.fit_transform(X)
3. Handling Outliers
• RobustScaler:
o Uses median (resistant to outliers) and IQR (75th - 25th percentile).
o Example: If a feature has outliers in housing prices, use RobustScaler instead
of StandardScaler.
4. Pipeline Integration
• Steps:
1. Impute Missing Values: Use SimpleImputer.
2. Scale Features: Apply scaler in a pipeline.
3. Column-Specific Transformations: Use ColumnTransformer to target numerical
features.
Example Pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
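from sklearn.preprocessing import StandardScaler
# A minimal sketch of the pipeline described in the steps above
# (assumes numerical_features is the list of numeric column names)
numerical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(transformers=[
('num', numerical_pipeline, numerical_features)
], remainder='passthrough')
X_processed = preprocessor.fit_transform(X)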
6. Inverse Transformation
• Revert scaled data to original scale:
# For MinMaxScaler
X_original = minmax.inverse_transform(X_minmax)
Additional Notes
1. Data Leakage Warning:
o Always split data into train/test sets before scaling. Fit the scaler on the training data
only, then transform both train and test sets.
3. Visualization Tip:
o Plot distributions pre- and post-scaling to observe effects:
import seaborn as sns
import matplotlib.pyplot as plt
sns.kdeplot(X['feature'], label='Original')
sns.kdeplot(X_standard[:, 0], label='Standardized')
plt.legend()
4. Practical Advice:
o Experiment with all three scalers and validate model performance (e.g., cross-
validation).
Section 7: Dimensionality Reduction with PCA
Code Example:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_full = PCA().fit(X_scaled)
plt.plot(range(1, len(pca_full.explained_variance_ratio_)+1),
pca_full.explained_variance_ratio_.cumsum(),
marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Scree Plot')
plt.show()
o Elbow Point: Where the curve bends sharply (e.g., 80% cumulative variance).
• Variance Threshold:
o Set n_components as a float (e.g., 0.8 for 80% variance).
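o Example (a minimal sketch of the variance-threshold approach):
# A float n_components keeps the smallest number of components explaining that fraction of variance
pca = PCA(n_components=0.8)
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, 'components retained')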
Additional Notes
1. When to Use PCA:
o High-dimensional datasets (e.g., images, genomics).
o Multicollinearity in linear models (e.g., regression).
o Noise reduction or feature extraction for clustering.
2. Scaling is Mandatory:
o PCA is sensitive to feature scales. Unscaled data skews variance calculations.
3. Interpretability:
o Principal components are linear combinations of original features and lack direct
business meaning.
4. Common Pitfalls:
o Data Leakage: Fit PCA on training data only, then transform test data.
o Over-Reduction: Retaining too few components loses critical information.
5. Advanced Topics (Preview):
o Incremental PCA: For large datasets that don’t fit in memory.
o Kernel PCA: Non-linear dimensionality reduction.
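o Example: a minimal sketch of IncrementalPCA (the component count and batch size are illustrative):
from sklearn.decomposition import IncrementalPCA
# Fits on mini-batches, so the full matrix never needs to be in memory at once
ipca = IncrementalPCA(n_components=10, batch_size=500)
X_ipca = ipca.fit_transform(X_scaled)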
Section 8: Feature Selection
Key Concepts
• Purpose: Reduces dimensionality by selecting features most relevant to the target variable.
• Methods:
o Statistical Tests: F-test, ANOVA, mutual information, chi-square.
o Model-Based: Feature importances from algorithms like Random Forest.
• Scenarios:
o Numerical features vs. numerical target (e.g., Pearson correlation).
o Categorical features vs. categorical target (e.g., chi-square).
o Mixed feature/target types (e.g., mutual information).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Load dataset (the diabetes target is continuous, so use a regressor)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Train model and inspect feature importances
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.feature_importances_)
Additional Notes
1. Mutual Information Parameters:
o n_neighbors (default=3): Controls bias-variance tradeoff. Increase for smoother
estimates.
o Example:
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y, n_neighbors=5)
2. Chi-Square Assumptions:
o Requires non-negative features (e.g., counts or one-hot encoded data).
o Avoid if expected frequencies in contingency tables are <5 (use Fisher’s exact test).
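o Example (a minimal sketch; the feature matrix name and k are illustrative):
from sklearn.feature_selection import SelectKBest, chi2
# X_counts must contain only non-negative values (e.g., counts or one-hot encoded columns)
selector = SelectKBest(score_func=chi2, k=5)
X_chi2 = selector.fit_transform(X_counts, y)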
3. Pipeline Integration:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_classif, k=10))
])
X_processed = pipeline.fit_transform(X, y)
4. Common Pitfalls:
o Data Leakage: Always fit feature selectors on the training set.
o Scaling: Standardize numerical features before using distance-based metrics (e.g.,
mutual information).
5. Alternative Models for Feature Importance:
o Linear Models: Use coefficients (e.g., Lasso regression).
o Tree-Based Models: Use feature_importances_ (e.g., XGBoost).
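o Example: a minimal sketch of model-based selection with SelectFromModel and a Lasso (alpha is an illustrative value):
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
# Keeps features whose Lasso coefficients exceed the default importance threshold
selector = SelectFromModel(Lasso(alpha=0.01))
X_selected = selector.fit_transform(X_train, y_train)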
Key Takeaways:
• Filter methods are computationally efficient but ignore feature interactions.
• Always validate selected features using cross-validation.
• Combine filter methods with domain knowledge for interpretable results.
Section 9: Building a Complete Preprocessing Pipeline
1. Pipeline Components
1. Data Cleaning:
o Numerical Features: Impute missing values with median.
o Categorical Features: Impute missing values with most frequent category.
2. Feature Transformation:
o Numerical Features: Standardize using StandardScaler.
o Categorical Features: Encode using OneHotEncoder.
3. Dimensionality Reduction: Apply PCA to reduce features.
4. Feature Selection: Use statistical tests (e.g., ANOVA) to select top features.
2. Pipeline Implementation
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Numerical Pipeline
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical Pipeline
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Column Selectors
numerical_features = ['age', 'income'] # Example numerical columns
categorical_features = ['gender', 'city'] # Example categorical columns
# Combine Transformers
preprocessor = ColumnTransformer(transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])
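The visualization note below refers to a full_pipeline; a minimal sketch that chains the preprocessor with the PCA, feature-selection, and modeling steps described above (the variance fraction, k, final estimator, and X_train/y_train are illustrative assumptions):
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
# Note: PCA needs dense input, so add sparse_output=False to the OneHotEncoder above if its output is sparse
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('pca', PCA(n_components=0.95)),
('selector', SelectKBest(score_func=f_classif, k=5)),
('classifier', RandomForestClassifier(random_state=0))
])
full_pipeline.fit(X_train, y_train)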
Additional Notes
1. Why ColumnTransformer?
o Ensures numerical and categorical features are processed independently.
o Avoids data leakage and incorrect scaling/encoding.
2. Order of Operations:
o PCA is applied after preprocessing to ensure standardized inputs.
o Feature selection is done after PCA to focus on the most informative reduced
components.
3. Handling Nested Pipelines:
o Use double underscores (__) to access nested parameters
(e.g., preprocessor__num__imputer).
4. Common Pitfalls:
o Data Leakage: Always fit the pipeline on the training set and transform the test set.
o Categorical Encoding: Avoid one-hot encoding high-cardinality features (use
alternatives like target encoding).
5. Visualizing the Pipeline:
from sklearn import set_config
set_config(display='diagram')
full_pipeline # Displays an interactive diagram
Key Takeaways:
• Pipelines ensure reproducibility and reduce code complexity.
• Modular design allows easy experimentation with different preprocessing strategies.
• Always validate the pipeline using cross-validation to avoid overfitting.
Section 10: Handling Imbalanced Data with SMOTE
Key Concepts
• Purpose: Address class imbalance in classification problems by generating synthetic samples
for minority classes.
• SMOTE Algorithm: Creates synthetic data points by interpolating between a minority-class sample and its nearest minority-class neighbors.
• Preprocessing: Requires standardization/normalization of numerical features.
• Categorical Handling: SMOTE-NC variant supports mixed data types (numerical +
categorical).
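For the SMOTE-NC variant mentioned above, a minimal sketch using imbalanced-learn's SMOTENC (the categorical column indices are illustrative):
from imblearn.over_sampling import SMOTENC
# categorical_features lists the positions of the categorical columns in X
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)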
1. Why SMOTE?
• Imbalance Issues:
o Models may bias toward majority class, leading to poor minority class prediction.
o Common in fraud detection, medical diagnosis, etc.
• SMOTE Benefits:
o Avoids overfitting caused by simple duplication (e.g., random oversampling).
o Balances class distribution for fairer model training.
3. Implementation Steps
Step 1: Standardize Numerical Features
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import pandas as pd
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply SMOTE
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_scaled, y)
# Verify balance
print(pd.Series(y_resampled).value_counts())
Step 2: Inverse Transformation (Optional)
# Revert to original feature space
X_original_scale = scaler.inverse_transform(X_resampled)
Example Dataset (Wine):
import pandas as pd
from sklearn.datasets import load_wine
# Load data
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
# Check imbalance
print(y.value_counts())
6. Best Practices
• Data Leakage: Fit scaler on training data only, then transform test data.
• Evaluation Metrics: Use precision, recall, F1-score, or ROC-AUC instead of accuracy.
• Skewed Features: Apply PowerTransformer before SMOTE for highly skewed distributions.
• When to Avoid SMOTE:
o Extremely small minority class (e.g., < 10 samples).
o Time-series data (temporal dependencies).
Additional Notes
1. Alternatives to SMOTE:
o ADASYN: Generates more samples near decision boundaries.
o Undersampling: Random or cleaning-based (e.g., Tomek links) undersampling of the majority class.
o Class Weights: Assign higher weights to minority classes during model training (see the sketch after these notes).
2. Pipeline Integration:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
3. Cross-Validation: Always validate performance using stratified k-fold to maintain class
balance.
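A minimal sketch of the class-weights alternative from note 1, using scikit-learn's built-in class_weight option (the classifier choice and X_train/y_train are illustrative):
from sklearn.ensemble import RandomForestClassifier
# 'balanced' weights each class inversely proportional to its frequency in y_train
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)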
Key Takeaways:
• SMOTE improves model fairness but requires careful preprocessing.
• Always validate synthetic data quality to avoid introducing noise.
• Combine SMOTE with robust evaluation metrics for reliable results.