
ML Assignment 6

The document discusses the challenges of using Random Forest for classification or regression tasks with a dataset containing both numerical and categorical features. It proposes strategies for encoding categorical variables using One-Hot Encoding and handling missing data through imputation methods to enhance model performance and interpretability. The implementation includes loading the dataset, preprocessing, training the model, and evaluating its accuracy and mean squared error.

Uploaded by

anuj rawat

abeluwlee

January 3, 2025

Given a dataset Customer.csv with a mix of numerical and categorical features, discuss the challenges and considerations in using Random Forest for classification or regression tasks.
Propose strategies for encoding categorical variables and handling missing data to improve model performance and interpretability.
Metadata for the dataset is also provided; please read it before starting the implementation.
[5]: import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, mean_squared_error

[6]: # Load the dataset
url = "https://itv-contentbucket.s3.ap-south-1.amazonaws.com/Exams/ML/EDA/customers.csv"
data = pd.read_csv(url)

# Display the first few rows
print(data.head())

   Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0        2       3  12669  9656     7561     214              2674          1338
1        2       3   7057  9810     9568    1762              3293          1776
2        2       3   6353  8808     7684    2405              3516          7844
3        1       3  13265  1196     4221    6404               507          1788
4        2       3  22615  5410     7198    3915              1777          5185

[11]: data.columns

[11]: Index(['Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
       'Detergents_Paper', 'Delicatessen'],
      dtype='object')

[19]: # Imputation for missing numerical values
numerical_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']
numerical_imputer = SimpleImputer(strategy='mean')
data[numerical_features] = numerical_imputer.fit_transform(data[numerical_features])

# Imputation for missing categorical values
# Note: 'Channel_2.0' and 'Region_2.0' are not in the data.columns output above
# (which lists 'Channel' and 'Region'); they appear to come from an encoding
# step that is not shown in this export.
categorical_features = ['Channel_2.0', 'Region_2.0']
categorical_imputer = SimpleImputer(strategy='most_frequent')
data[categorical_features] = pd.DataFrame(
    categorical_imputer.fit_transform(data[categorical_features]),
    columns=categorical_features)
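One consideration the imputation step raises is that statistics such as the column mean can be learned from the training rows only and then reused on the test rows, so the test set does not influence preprocessing. The following is a minimal sketch of that idea on a hypothetical toy frame (the values are invented; only the column names `Fresh` and `Milk` are borrowed from the dataset), not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a few missing values (not the real customers.csv).
toy = pd.DataFrame({
    "Fresh": [100.0, np.nan, 300.0, 400.0, 500.0, 600.0],
    "Milk": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0],
})

# Count missing values per column before deciding on a strategy.
missing_counts = toy.isna().sum()

# Split first, then learn the imputation statistics from the training rows only.
train, test = train_test_split(toy, test_size=2, random_state=0)

imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=toy.columns, index=train.index)
# The test rows are filled with the means learned from the training rows.
test_filled = pd.DataFrame(
    imputer.transform(test), columns=toy.columns, index=test.index)
```

After fitting, `imputer.statistics_` holds the per-column training means that were used as fill values.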

[20]: # One-Hot Encoding
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_data = pd.DataFrame(
    one_hot_encoder.fit_transform(data[categorical_features]),
    columns=one_hot_encoder.get_feature_names_out(categorical_features))

# Add the one-hot encoded variables back to the dataset and drop the original
# categorical columns
data = data.drop(categorical_features, axis=1)
data = pd.concat([data, encoded_data], axis=1)

[22]: # Confirm the column names
print(data.columns)

# Use the 'Region_3.0' indicator column as the target variable
# Note: X still contains 'Region_2.0_1.0', which is derived from the same
# Region field as the target.
target_variable = 'Region_3.0'
X = data.drop(target_variable, axis=1)
y = data[target_variable]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Index(['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper',
       'Delicatessen', 'Region_3.0', 'Channel_2.0_1.0', 'Region_2.0_1.0'],
      dtype='object')

[23]: # Initialize and train the model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.90
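An accuracy figure is easier to interpret next to a trivial baseline, because a classifier that always predicts the most frequent class can already score highly when the classes are imbalanced. The sketch below illustrates this with scikit-learn's `DummyClassifier` on invented labels with an assumed 80/20 class split; it says nothing about the actual class balance of `Region_3.0`, which is not shown in this document.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 160 positives and 40 negatives.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 160 + [0] * 40)

# Stratify so the test split keeps the same class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

baseline_acc = accuracy_score(y_te, y_base)
baseline_bal = balanced_accuracy_score(y_te, y_base)
print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Baseline balanced accuracy: {baseline_bal:.2f}")
```

Reporting balanced accuracy alongside plain accuracy is one way to see whether a model is doing more than predicting the majority class.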

[24]: # Initialize and train the model
# y is the 'Region_3.0' column selected in cell [22], so the regressor is
# fitted to that indicator rather than to a continuous quantity.
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

Mean Squared Error: 0.11
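When the target is a 0/1 indicator, the mean squared error of a predicted score is the same quantity as the Brier score of predicted probabilities. A classifier's `predict_proba` output can therefore be scored directly, as an alternative to fitting a separate regressor. The sketch below demonstrates that equivalence on invented data, assuming a binary 0/1 target; it is an illustration and does not reproduce the result above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical binary problem: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Probability of the positive class (label 1) for each test row.
proba = clf.predict_proba(X_te)[:, 1]

# For 0/1 labels these two numbers coincide.
brier = brier_score_loss(y_te, proba)
mse = mean_squared_error(y_te, proba)
print(f"Brier score: {brier:.4f}  MSE of probabilities: {mse:.4f}")
```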
