Machine Learning Model Building
Step2: Inspect the data (info, missing-value counts) and impute missing values accordingly.
# use the test set's columns so the target column is excluded from the check
columns = test_data.columns.to_list()
for col in columns:
    print("Missing values % of", col, train_data[col].isna().sum() / train_data.shape[0])
def missing_val_treatment(df):
    for col in df.columns.to_list():
        miss_perc = df[col].isna().sum() / df.shape[0]
        if 0 < miss_perc < 0.7 and df[col].dtype == 'O':
            # categorical column: impute with the mode
            df[col] = df[col].fillna(df[col].mode()[0])
        elif 0 < miss_perc < 0.7 and df[col].dtype != 'O':
            # numeric column: impute with the median
            df[col] = df[col].fillna(df[col].median())
        elif miss_perc >= 0.7:
            # too many missing values: drop the column entirely
            df.drop(col, axis = 1, inplace = True)
        else:
            pass
If rows containing NA in a particular column should be dropped instead:
df.dropna(subset = [column_name], inplace = True)
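A minimal usage sketch of the function above, applied to both dataframes (missing_val_treatment mutates them in place):

missing_val_treatment(train_data)
missing_val_treatment(test_data)
# confirm that no missing values remain after imputation/dropping
print(train_data.isna().sum().sum(), test_data.isna().sum().sum())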
Step3: Check whether any columns contain outliers (box plots of the numeric columns are a quick way to see them) and deal with the outliers accordingly; a capping sketch follows the plotting code below.
num_cols = train_data.select_dtypes(include = 'number').columns.to_list()
train_data.boxplot(column = num_cols)
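One common way to deal with the outliers is IQR-based capping. A minimal sketch, assuming capping (rather than dropping rows) and the usual 1.5 * IQR whiskers, neither of which is specified in the notes:

def cap_outliers(df, col):
    # cap values outside the 1.5 * IQR whiskers at the whisker values
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

for col in num_cols:
    cap_outliers(train_data, col)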
Step4: Check for irrelevant columns such as Name or Address, which contain a lot of free text and are mostly unique across rows, and drop them.
train_data.drop(['Name','Ticket'], axis = 1, inplace = True)
test_data.drop(['Name','Ticket'], axis = 1, inplace = True)
Step5: Look for date-time columns. Here’s how you can deal with them:
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'], format = '%Y-%m-%dT%H:%M:%SZ', errors = 'coerce')
Filtering w.r.t. date columns:
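A sketch of date-based filtering once the column is a datetime; the cutoff date and the derived ScheduledDow column are purely illustrative:

# keep only rows scheduled on or after an illustrative cutoff date
df = df[df['ScheduledDay'] >= pd.Timestamp('2016-05-01')]
# date parts such as the day of week often make useful features
df['ScheduledDow'] = df['ScheduledDay'].dt.dayofweek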
Step8: Look for imbalance in the data w.r.t. the target variable and, if it is imbalanced, apply a sampling technique like SMOTE (a sketch follows the train/test split below, since oversampling should be applied to the training split only).
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = train_data.drop('Survived', axis = 1)
y = train_data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)
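The notes jump from the split straight to model.predict, so the fitting step (and the SMOTE step from Step8) is missing here. A minimal sketch, assuming a RandomForestClassifier (the classifier used later in the notes), that the features are already numerically encoded, and that imbalanced-learn is installed:

from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# oversample the minority class in the training split only
X_train, y_train = SMOTE(random_state = 0).fit_resample(X_train, y_train)

# fit the model referred to as `model` below
model = RandomForestClassifier(random_state = 0)
model.fit(X_train, y_train)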
y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training accuracy:", train_accuracy)
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Testing accuracy:", test_accuracy)
# tune the number of trees via the out-of-bag (OOB) score
oobs = []
w_values = list(range(20, 300, 10))
for w in w_values:
    model2 = RandomForestClassifier(n_estimators = w, oob_score = True, random_state = 0)
    model2.fit(X_train, y_train)
    oobs.append(model2.oob_score_)

# pick the n_estimators value with the highest OOB score
max_oob_index = oobs.index(max(oobs))
best_w = w_values[max_oob_index]
best_w
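After the loop, model2 still holds the model fitted with the last w, not the best one, so refit with best_w before predicting on the test set (a short sketch; assumes the same X_train/y_train as above):

model2 = RandomForestClassifier(n_estimators = best_w, oob_score = True, random_state = 0)
model2.fit(X_train, y_train)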
y_pred2 = model2.predict(test_enc)
result = pd.DataFrame()
result['PassengerId'] = test_enc['PassengerId']
result['Survived'] = y_pred2
result
result.to_csv("gender_submission.csv",index=False)
# statsmodels variant: add the intercept column explicitly before fitting/predicting
import statsmodels.api as sma

X_train = sma.add_constant(X_train)
X_test = sma.add_constant(X_test)
test_data = sma.add_constant(test_data)
y_pred2 = model.predict(test_data)
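Here `model` would be a fitted statsmodels estimator. A hedged sketch assuming a Logit model (the notes do not say which estimator is used); note that Logit's predict returns probabilities, so they must be thresholded to get class labels:

model = sma.Logit(y_train, X_train).fit()
# threshold the predicted probabilities into 0/1 labels
y_pred2 = (model.predict(test_data) > 0.5).astype(int)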
Step3: Identify the point (number of clusters) where the inertia starts to remain roughly constant. If it is not identifiable (the elbow method fails), use the Silhouette score to get the number of clusters (a sketch of the inertia curve follows).
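A sketch of the inertia curve the elbow method reads, assuming scaled_data from the earlier preprocessing steps (the same array the silhouette code below uses):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# inertia (within-cluster sum of squares) for a range of candidate k values
ks = list(range(2, 15))
inertias = [KMeans(n_clusters = k, random_state = 0).fit(scaled_data).inertia_ for k in ks]
plt.plot(ks, inertias, marker = 'o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()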
Take the range of cluster counts over which you are unsure where the flattening starts. Let’s say you’re torn between k=3 and k=13.
from sklearn.metrics import silhouette_score

# compute the silhouette score for every k in the unsure range (13 inclusive)
# and pick the k with the highest score
for i in range(3, 14):
    labels = KMeans(n_clusters = i).fit(scaled_data).labels_
    print("SC for k = " + str(i) + " is " + str(silhouette_score(scaled_data, labels)))