ML Assignment 6
January 3, 2025
Given a dataset Customer.csv with a mix of numerical and categorical features, discuss the
challenges and considerations in using Random Forest for classification or regression tasks.
Propose strategies for encoding categorical variables and handling missing data to improve model
performance and interpretability.
Metadata for the dataset is also provided; please read it before starting the implementation.
[5]: import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, mean_squared_error
# `url` must hold the location of Customer.csv; its definition is not shown in this export
data = pd.read_csv(url)
Delicatessen
0 1338
1 1776
2 7844
3 1788
4 5185
[11]: data.columns
[11]: Index(['Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
'Detergents_Paper', 'Delicatessen'],
dtype='object')
numerical_imputer = SimpleImputer(strategy='mean')
data[numerical_features] = numerical_imputer.fit_transform(data[numerical_features])
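The mean imputation above relies on a `numerical_features` list whose definition is not shown. The following is a minimal sketch of how numerical and categorical columns might be imputed separately. The toy values and the split into `numerical_features` and `categorical_features` are assumptions for illustration, not the notebook's actual choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values; column names are taken from data.columns.
toy = pd.DataFrame({
    "Channel": [1, 2, np.nan, 2],
    "Fresh": [100.0, np.nan, 300.0, 200.0],
    "Milk": [10.0, 20.0, np.nan, 30.0],
})

# Assumed split of the columns (hypothetical, not shown in the notebook).
numerical_features = ["Fresh", "Milk"]
categorical_features = ["Channel"]

# Numerical columns: replace missing values with the column mean.
numerical_imputer = SimpleImputer(strategy="mean")
toy[numerical_features] = numerical_imputer.fit_transform(toy[numerical_features])

# Categorical columns: replace missing values with the most frequent category,
# because a mean of category codes has no meaning.
categorical_imputer = SimpleImputer(strategy="most_frequent")
toy[categorical_features] = categorical_imputer.fit_transform(toy[categorical_features])

print(toy)
```

Fitting the imputers on the training split only, and then transforming the test split, would avoid leaking test-set statistics into training.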
↪get_feature_names_out(categorical_features))
# Add the one-hot encoded variables back to the dataset and drop the original categorical columns
# Make predictions
y_pred = rf_classifier.predict(X_test)
Accuracy: 0.90
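The cells that split the data, fit `rf_classifier` and print the accuracy are not shown. The sketch below illustrates those steps on synthetic data. The target definition, the feature subset and the hyperparameters are all assumptions, so the score it prints is unrelated to the one reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed features (made-up values).
X = pd.DataFrame({
    "Grocery": rng.integers(100, 10000, size=200),
    "Detergents_Paper": rng.integers(10, 5000, size=200),
})
# Hypothetical binary target; the notebook's actual target column is not shown.
y = (X["Detergents_Paper"] > 2500).astype(int)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest on the training split only.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions and score them against the held-out labels.
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```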
# Make predictions
y_pred = rf_regressor.predict(X_test)
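The regression cells are likewise incomplete. A minimal sketch of fitting `rf_regressor` and scoring it with `mean_squared_error` (imported at the top of the notebook) follows. The continuous target here is synthetic and hypothetical, since the notebook's regression target is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed features (made-up values).
X = pd.DataFrame({
    "Fresh": rng.integers(100, 10000, size=200),
    "Milk": rng.integers(100, 8000, size=200),
})
# Hypothetical continuous target: a noisy function of one feature.
y = 0.5 * X["Milk"] + rng.normal(0, 50, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest on the training split only.
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Make predictions and report the mean squared error on the held-out rows.
y_pred = rf_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse:.2f}")

# Impurity-based importances give a rough view of which features drive the model.
for name, importance in zip(X.columns, rf_regressor.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances are spread across the dummy columns of a one-hot encoded feature, so they are best summed per original feature before interpretation.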