Machine Learning Model Building

Machine Learning Models:

Supervised Learning: Classification

Step1: Read the data.
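A minimal sketch, assuming the train and test splits live in CSV files named train.csv and test.csv (the file names are an assumption):

import pandas as pd

# Assumed file names; adjust to wherever the dataset actually lives
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
train_data.info()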

Step2: Get info on the data and do the missing value imputation accordingly.
columns = test_data.columns.to_list()
for col in columns:
    print("Missing values % of", col, train_data[col].isna().sum() / train_data.shape[0])

def missing_val_treatment(df):
    for col in columns:
        miss_perc = df[col].isna().sum() / df.shape[0]
        if 0 < miss_perc < 0.7 and df[col].dtype == 'O':
            # Categorical column with some missing values: fill with the mode
            df[col].fillna(df[col].mode()[0], inplace = True)
        elif 0 < miss_perc < 0.7 and df[col].dtype != 'O':
            # Numeric column with some missing values: fill with the median
            df[col].fillna(df[col].median(), inplace = True)
        elif miss_perc > 0.7:
            # Mostly missing: drop the column altogether
            df.drop(col, axis = 1, inplace = True)
        else:
            pass
If any rows containing NA need to be dropped instead:
df.dropna(subset = [column_name], inplace = True)

Step3: Check each column for outliers (for example with box plots) and deal with them accordingly; one option is IQR-based capping, sketched below.
train_data.boxplot(num_cols)
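A minimal capping sketch, assuming num_cols holds the numeric column names (as in Step7 below):

def cap_outliers_iqr(df, cols):
    # Clip each numeric column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        df[col] = df[col].clip(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
    return df

train_data = cap_outliers_iqr(train_data, num_cols)
test_data = cap_outliers_iqr(test_data, num_cols)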
Step4: Check for irrelevant columns such as Name or Address, which contain a lot of free text and are mostly unique across rows, and drop them.
train_data.drop(['Name','Ticket'], axis = 1, inplace = True)
test_data.drop(['Name','Ticket'], axis = 1, inplace = True)

Step5: Look for date-time columns. Here’s how you can deal with them:
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'],
                                    format = '%Y-%m-%dT%H:%M:%SZ', errors = 'coerce')
Filtering w.r.t. date columns:
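A minimal sketch of filtering rows on the parsed date column (the cutoff date is chosen purely for illustration):

# Keep only rows scheduled on or after an assumed cutoff date
df = df[df['ScheduledDay'] >= pd.Timestamp('2016-01-01')]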

Date Feature Engineering:


df['ScheduledDay_year'] = df['ScheduledDay'].dt.year
df['ScheduledDay_month'] = df['ScheduledDay'].dt.month
df['ScheduledDay_week'] = df['ScheduledDay'].dt.isocalendar().week
df['ScheduledDay_day'] = df['ScheduledDay'].dt.day
df['ScheduledDay_hour'] = df['ScheduledDay'].dt.hour
df['ScheduledDay_minute'] = df['ScheduledDay'].dt.minute
df['ScheduledDay_dayofweek'] = df['ScheduledDay'].dt.dayofweek

Step6: One-Hot Encoding


train_enc = pd.get_dummies(train_data, drop_first = True)
test_enc = pd.get_dummies(test_data, drop_first = True)
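Note that encoding the train and test sets separately can leave mismatched dummy columns when a category appears in only one of them. A minimal alignment sketch, assuming 'Survived' is the target column (as used in Step9 below):

# Reindex the encoded test set to the training columns (minus the target), filling missing dummies with 0
test_enc = test_enc.reindex(columns = train_enc.columns.drop('Survived'), fill_value = 0)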

Step7: Data Normalization


from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# num_cols is assumed to hold the numeric column names
for col in num_cols:
    # Scale only numeric, non-datetime columns; fit on train and reuse the fitted scaler on test
    if test_data[col].dtype != 'O' and not pd.api.types.is_datetime64_any_dtype(test_data[col]):
        train_enc[col] = sc.fit_transform(train_enc[[col]])
        test_enc[col] = sc.transform(test_enc[[col]])
train_enc.hist()

Step8: Check for class imbalance in the target variable and, if needed, apply a resampling technique such as SMOTE (a sketch follows).
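A minimal SMOTE sketch, assuming the imbalanced-learn package is installed and using the X_train/y_train split defined in Step9 below (resampling is applied to the training split only):

from imblearn.over_sampling import SMOTE   # requires the imbalanced-learn package

# Inspect the class balance of the target ('Survived' is the target used in these notes)
print(train_enc['Survived'].value_counts(normalize = True))

# Oversample the minority class on the training split only
sm = SMOTE(random_state = 0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)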

Step9: Apply ML models to the cleaned data:


from sklearn.model_selection import train_test_split
train_data, val_data = train_test_split(train_enc, test_size = 0.1, random_state = 0)
print(train_data.shape)
print(val_data.shape)
X_train = train_data.drop('Survived', axis = 1)
X_test = val_data.drop('Survived', axis = 1)
y_train = train_data['Survived']
y_test = val_data['Survived']

OR
X = train_data.drop('Survived', axis = 1)
y = train_data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
model = lr.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training accuracy:", train_accuracy)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Testing accuracy:", test_accuracy)

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

model2 = RandomForestClassifier(n_estimators = 80, oob_score = True, random_state = 0)
model2 = model2.fit(X_train, y_train)
print(X_train.columns)
model2.feature_importances_
model2.oob_score_
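To read the importances against the column names, a small pairing sketch (assumes pandas is already imported as pd):

# Pair each feature with its importance and sort for readability
importances = pd.Series(model2.feature_importances_, index = X_train.columns).sort_values(ascending = False)
print(importances)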

oobs = []
w_values = list(range(20, 300, 10))
for w in w_values:
    model2 = RandomForestClassifier(n_estimators = w, oob_score = True, random_state = 0)
    model2.fit(X_train, y_train)
    oob = model2.oob_score_
    oobs.append(oob)
max_oob_index = oobs.index(max(oobs))
best_w = w_values[max_oob_index]
best_w

model2 = RandomForestClassifier(n_estimators = 280, oob_score = True, random_state = 0)
model2.fit(X_train, y_train)
model2.oob_score_

model3 = AdaBoostClassifier(n_estimators = 100, random_state = 0)


model3.fit(X_train, y_train)
model3.score(X_test, y_test)

y_pred2 = model2.predict(test_enc)

result = pd.DataFrame()
result['PassengerId'] = test_enc['PassengerId']
result['Survived'] = y_pred2
result

result.to_csv("gender_submission.csv",index=False)

Supervised Learning: Regression


Import required modules:

from sklearn.model_selection import train_test_split


X = train_data.drop("MedHouseVal", axis = 1)
y = train_data['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

from sklearn.linear_model import LinearRegression


import statsmodels.api as sma

X_train = sma.add_constant(X_train)
X_test = sma.add_constant(X_test)

model = sma.OLS(y_train, X_train)


model = model.fit()
model.summary()
y_pred = model.predict(X_test)

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
r2 = r2_score(y_test, y_pred)
print("R2:", r2)

test_data = sma.add_constant(test_data)
y_pred2 = model.predict(test_data)
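The LinearRegression class imported earlier is otherwise unused; a minimal sklearn equivalent of the OLS fit above (the 'const' column added by statsmodels is dropped because sklearn fits the intercept itself):

lin_reg = LinearRegression()
lin_reg.fit(X_train.drop('const', axis = 1), y_train)
y_pred_sk = lin_reg.predict(X_test.drop('const', axis = 1))
print("R2 (sklearn):", r2_score(y_test, y_pred_sk))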

Unsupervised Learning: K-Means Clustering


Following are the steps to perform K-Means:
Step1: Perform EDA and scale the features (a minimal scaling sketch follows).
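scaled_data used in the next steps is assumed to be the standardized feature matrix; a minimal sketch of producing it (the input DataFrame name data is an assumption):

from sklearn.preprocessing import StandardScaler

# 'data' is the assumed input DataFrame; standardize the numeric features before clustering
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data.select_dtypes(include = 'number'))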
Step2: Fit K-Means and use the elbow method (WCSS against the number of clusters) to look for a suitable k.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 0)
kmeans = kmeans.fit(scaled_data)
wcss = []
for i in range(1, 30):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)
import matplotlib.pyplot as plt
plt.plot(range(1,30), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Step3: Identify the point (number of clusters) where the WCSS curve starts flattening out. If it is not clearly identifiable (the elbow method fails), use the silhouette score to pick the number of clusters.
Take the range of cluster counts over which you are unsure where the flattening starts; say you are unsure between k = 3 and k = 13.
from sklearn.metrics import silhouette_score
for i in range(3, 14):   # k = 3 through k = 13 inclusive
    labels = KMeans(n_clusters = i).fit(scaled_data).labels_
    print("SC for k = " + str(i) + " is " + str(silhouette_score(scaled_data, labels)))

Now identify at what value of k the silhouette score is highest.

km = KMeans(n_clusters = 3, init = 'k-means++', random_state = 0)
y_means = km.fit_predict(scaled_data)
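To use the clusters downstream, a minimal sketch of attaching the labels back to the original data and profiling each cluster (the DataFrame name data is the same assumption as above):

# Attach the cluster labels and look at the average feature values per cluster
data['cluster'] = y_means
print(data.groupby('cluster').mean(numeric_only = True))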
