27 Jupyter Notebook
Introduction
In this challenge, Santander invites Kagglers to help them identify which customers will make a specific
transaction in the future, irrespective of the amount of money transacted. The data provided for this competition
has the same structure as the real data they have available to solve this problem.
The data is anonymized: each row contains 200 numerical values, each identified only by a number.
In what follows we explore the data, prepare it for modelling, train a model, predict the target value for the
test set, and then prepare a submission.
Stay tuned, I will update this Kernel frequently over the next few days.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import math
import warnings
warnings.filterwarnings("ignore")
In [2]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the smallest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # non-numeric columns (here ID_code) become categoricals
            df[col] = df[col].astype('category')
    return df

def import_data(file):
    """create a dataframe and optimize its memory usage"""
    # the CSVs hold only ID_code, target and numeric var_* columns, so no date parsing is needed
    df = pd.read_csv(file)
    df = reduce_mem_usage(df)
    return df
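For comparison, pandas ships its own downcasting helper, `pd.to_numeric(..., downcast=...)`. The sketch below applies it to a small hypothetical frame (`demo` and its column names are made up for illustration and are not part of the competition data). Unlike the function above, this helper never goes below float32 for floating-point columns, so it saves less memory but keeps more precision.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for illustration.
demo = pd.DataFrame({
    "small_int": np.array([1, 2, 3], dtype=np.int64),
    "small_float": np.array([0.5, 1.5, 2.5], dtype=np.float64),
})

before = demo.memory_usage(deep=True).sum()

# Let pandas choose the smallest dtype that can hold each column.
demo["small_int"] = pd.to_numeric(demo["small_int"], downcast="integer")
demo["small_float"] = pd.to_numeric(demo["small_float"], downcast="float")

after = demo.memory_usage(deep=True).sum()

print(demo.dtypes)
print("bytes before:", before, "after:", after)
```

Whether float16 or float32 is the better floor depends on how much rounding the later analysis can tolerate.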
In [40]:
train = import_data("train.csv")
test = import_data("test.csv")
The train dataset has 202 columns, while the test dataset has 201. The extra column
in the train dataset is the target, which is not present in the test dataset.
In [4]:
train.head(2)
Out[4]:
The data is entirely masked, so even with domain knowledge we cannot tell which features are
significant. We can try basic aggregate features such as the mean, standard deviation, counts, and median. We will
do feature engineering later.
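To make the idea of basic aggregate features concrete, here is a minimal sketch on a hypothetical toy frame (`toy`, its size, and the new column names are assumptions for illustration; the feature engineering cell of this notebook is not reproduced here). Each statistic is computed row-wise over the original feature columns only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the anonymized feature columns.
toy = pd.DataFrame(rng.normal(size=(5, 4)), columns=[f"var_{i}" for i in range(4)])
feature_cols = toy.columns.tolist()

# Take a copy of the original features so that earlier aggregates
# are not included when computing later ones.
values = toy[feature_cols]
toy["mean"] = values.mean(axis=1)
toy["std"] = values.std(axis=1)
toy["min"] = values.min(axis=1)
toy["max"] = values.max(axis=1)
toy["median"] = values.median(axis=1)

print(toy.shape)
print(toy.head(2))
```

Fixing `feature_cols` before adding anything matters: computing the statistics over `toy` itself would let each new column contaminate the next one.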
Basic Stats
Target Distribution
In [5]:
sns.countplot(x=train['target'])
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a4300b0208>
In [6]:
train.target.value_counts()
Out[6]:
0 179902
1 20098
Name: target, dtype: int64
In [7]:
t0=train[train['target']==0]
t1=train[train['target']==1]
In [8]:
From the counts above, nearly 90% of the target values are 0 (we assume 0 means the customer
did not make the transaction) and only about 10% are 1 (we assume 1 means the customer made the transaction).
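The percentages can be checked directly from the printed counts. The short sketch below uses only the two numbers from the `value_counts()` output; the variable names are ours.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 179902, 1: 20098}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts[0] / counts[1]

for label, share in shares.items():
    print(f"target={label}: {share:.2%}")
print(f"negatives per positive: {imbalance_ratio:.2f}")
```

With this much imbalance, a model that always predicts 0 would already be right about nine times out of ten, so accuracy alone says little; this is presumably why the models below are scored with AUC.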
In [9]:
train.drop(['ID_code'],axis=1,inplace=True)
labels=train['target']
train.drop(['target'],axis=1,inplace=True)
In [10]:
train.select_dtypes(include='float16')
Out[10]:
In [11]:
train.astype(np.float64).describe()
Out[11]:
The standard deviation is relatively large for both train and test variable data;
the min, max, mean, and std values for train and test data look quite close.
Missing Values:
In [12]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))
In [13]:
missing_data(train)
Out[13]:
var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 ... var_
Total 0 0 0 0 0 0 0 0 0 0 ...
Percent 0 0 0 0 0 0 0 0 0 0 ...
Types float16 float16 float16 float16 float16 float16 float16 float16 float16 float16 ... flo
In [14]:
missing_data(test)
Out[14]:
ID_code var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 ... va
Total 0 0 0 0 0 0 0 0 0 0 ...
Percent 0 0 0 0 0 0 0 0 0 0 ...
Types category float16 float16 float16 float16 float16 float16 float16 float16 float16 ... f
There are no missing values in either the train or the test dataset.
Performing EDA
Mean
In [15]:
features = train.columns.tolist()
In [16]:
plt.figure(figsize=(16,6))
plt.title("Mean in train and test set")
sns.distplot(train[features].mean(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].mean(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()
Standard Deviation
In [17]:
plt.figure(figsize=(16,6))
plt.title("Standard Deviation in train and test set")
sns.distplot(train[features].std(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].std(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()
Skewness
In [18]:
plt.figure(figsize=(16,6))
plt.title("Skewness in train and test set")
sns.distplot(train[features].skew(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].skew(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()
Min
In [19]:
plt.figure(figsize=(16,6))
plt.title("Min in train and test set")
sns.distplot(train[features].min(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].min(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()
Max
In [20]:
plt.figure(figsize=(16,6))
plt.title("Max in train and test set")
sns.distplot(train[features].max(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].max(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()
The plots above show that the row-wise statistics of the train and test sets have nearly the same distributions on the same scales.
Duplicate Values
In [21]:
# ID_code and target were dropped from train above, so every remaining column is a feature
features = train.columns.values
unique_max_train = []
unique_max_test = []
for feature in features:
    values = train[feature].value_counts()
    unique_max_train.append([feature, values.max(), values.idxmax()])
    values = test[feature].value_counts()
    unique_max_test.append([feature, values.max(), values.idxmax()])
In [22]:
Out[22]:
Feature         var_68   var_108  var_12   var_126  var_25   var_43   var_91   var_125  var_148  var
Max duplicates  40233    6127     3221     2746     2087     1836     1811     1780     1578     1
Value           5.01953  14.2031  13.9766  11.5391  13.6875  11.5078  6.98438  12.5547  4.02344  14.
In [23]:
Out[23]:
Feature         var_68   var_108  var_12   var_126  var_25   var_43   var_91   var_125  var_148  va
Max duplicates  39964    5987     3164     2747     2116     1944     1848     1824     1617
Value           5.01953  14.2031  13.9766  11.5391  13.6406  11.4609  7.03125  12.5391  4.00781  14.
The same columns in the train and test sets have the same, or very nearly the same, number of duplicates of the
same, or very nearly the same, values. This is an interesting pattern that we might be able to use later.
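One caveat is worth checking before relying on these counts. The columns were downcast to float16 by `reduce_mem_usage`, and float16 keeps only about three significant decimal digits, so raw values that differ in the third decimal place can collapse into a single value. The sketch below illustrates the effect with hypothetical numbers (they are chosen for illustration and are not taken from the dataset).

```python
import numpy as np

# Hypothetical raw values, chosen only to illustrate rounding; not from the dataset.
raw = np.array([5.0214, 5.0208, 5.0189, 5.0301], dtype=np.float64)
half = raw.astype(np.float16)

distinct_raw = np.unique(raw).size
distinct_half = np.unique(half).size

print("float16 values:", half.astype(np.float64))
print("distinct float64 values:", distinct_raw)
print("distinct float16 values:", distinct_half)
# Relative spacing of float16 at 1.0 (about 0.001).
print("float16 eps:", np.finfo(np.float16).eps)
```

If the duplicate counts matter for feature engineering, re-running the count on the original float64 columns would show how much of the pattern is real and how much comes from rounding.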
Feature Engineering
In [24]:
In [25]:
train.head(2)
Out[25]:
In [26]:
train.drop(['kurt'],axis=1,inplace=True)
In [27]:
train.head()
Out[27]:
In [28]:
test.head(2)
Out[28]:
In [29]:
test.drop(['kurt','ID_code'],axis=1,inplace=True)
TSNE
In [30]:
In [31]:
train_data.shape
Out[31]:
(200000, 207)
In [32]:
As the plot above shows, TSNE does not separate the two classes. The points overlap heavily,
with the positive points concentrated in the middle and the negative points surrounding them.
Modelling
Regression Model
Light GBM
In [33]:
In [34]:
In [35]:
param = {
'bagging_freq': 5,
'bagging_fraction': 0.4,
'boost_from_average':'false',
'boost': 'gbdt',
'feature_fraction': 0.05,
'learning_rate': 0.01,
'max_depth': -1,
'metric':'auc',
'min_data_in_leaf': 80,
'min_sum_hessian_in_leaf': 10.0,
'num_leaves': 13,
'num_threads': 8,
'tree_learner': 'serial',
'objective': 'binary',
'verbosity': 1
}
In [37]:
#https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train
#https://www.kaggle.com/ashishpatel26/kfold-lightgbm/code
#(learned from here how to use stratified k-fold with model)
#https://github.com/KazukiOnodera/Santander-Customer-Transaction-Prediction/blob/master/fin
num_round = 1000000
clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_
oof[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iterat
fold_importance_df = pd.DataFrame()
fold_importance_df["Feature"] = features
fold_importance_df["importance"] = clf.feature_importance()
fold_importance_df["fold"] = fold_ + 1
feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
Fold 0
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.898416 valid_1's auc: 0.879423
[2000] training's auc: 0.909095 valid_1's auc: 0.886415
[3000] training's auc: 0.916569 valid_1's auc: 0.890816
[4000] training's auc: 0.922256 valid_1's auc: 0.893047
[5000] training's auc: 0.926935 valid_1's auc: 0.894655
[6000] training's auc: 0.931187 valid_1's auc: 0.895585
[7000] training's auc: 0.935173 valid_1's auc: 0.896432
[8000] training's auc: 0.938821 valid_1's auc: 0.896624
[9000] training's auc: 0.942428 valid_1's auc: 0.896811
[10000] training's auc: 0.945787 valid_1's auc: 0.896703
[11000] training's auc: 0.949007 valid_1's auc: 0.896715
[12000] training's auc: 0.952091 valid_1's auc: 0.896705
Early stopping, best iteration is:
[9295] training's auc: 0.94345 valid_1's auc: 0.896898
Fold 1
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.898197 valid_1's auc: 0.880849
[2000] training's auc: 0.908856 valid_1's auc: 0.888719
[3000] training's auc: 0.916275 valid_1's auc: 0.892485
[4000] training's auc: 0.921915 valid_1's auc: 0.895032
[5000] training's auc: 0.926672 valid_1's auc: 0.896269
[6000] training's auc: 0.930977 valid_1's auc: 0.897043
[7000] training's auc: 0.934918 valid_1's auc: 0.897488
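The training cell above fills an out-of-fold array (`oof[val_idx] = ...`) inside a stratified K-fold loop whose header is not shown. The sketch below illustrates that pattern only. It runs on synthetic data and uses scikit-learn's `LogisticRegression` as a lightweight stand-in for `lgb.train`; the data, the number of folds, and the model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
# Synthetic, imbalanced stand-in data (roughly 10% positives).
n_samples = 2000
y = (rng.random(n_samples) < 0.1).astype(int)
X = rng.normal(size=(n_samples, 5)) + y[:, None] * 0.8

# Stratification keeps the positive rate about the same in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(n_samples)

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    model = LogisticRegression(max_iter=1000)  # stand-in for lgb.train
    model.fit(X[trn_idx], y[trn_idx])
    # Each row is predicted exactly once, by a model that never saw it.
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    print(f"Fold {fold_}: AUC = {roc_auc_score(y[val_idx], oof[val_idx]):.4f}")

print(f"Overall out-of-fold AUC = {roc_auc_score(y, oof):.4f}")
```

The overall out-of-fold AUC is usually a steadier estimate than any single fold. In the log above, the training AUC keeps rising while the validation AUC flattens near 0.897, which is the gap early stopping is meant to catch.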
In [38]:
plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",asc
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
plt.savefig('FI.png')
In [41]:
sub_df = pd.DataFrame({"ID_code":test["ID_code"].values})
sub_df["target"] = predictions
sub_df.to_csv("lgbm.csv", index=False)
In [42]:
lgbm=pd.read_csv('lgbm.csv')
lgbm.head()
Out[42]:
ID_code target
0 test_0 0.054296
1 test_1 0.214405
2 test_2 0.210391
3 test_3 0.220053
4 test_4 0.042658
Classification Model
1. Logistic Regression
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from random import randrange, uniform
from scipy.stats import chi2_contingency
%matplotlib inline
In [2]:
trans = pd.read_csv("train.csv")
In [3]:
for i in range(2,202):
    #print(i)
    q75, q25 = np.percentile(trans.iloc[:,i], [75 ,25])
    iqr = q75 - q25
In [4]:
trans.shape
Out[4]:
(175073, 202)
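The loop above computes the quartiles and the interquartile range for each feature, and the shape shows that roughly 12% of the 200,000 rows were then dropped; the filtering step itself is not shown. The sketch below is one common way to do it, on a hypothetical toy frame. The 1.5 × IQR fences are the conventional choice and are an assumption here, not necessarily what this notebook used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical toy frame with one planted outlier in each column.
toy = pd.DataFrame(rng.normal(size=(200, 2)), columns=["var_0", "var_1"])
toy.loc[0, "var_0"] = 50.0
toy.loc[1, "var_1"] = -50.0

# Start by keeping every row, then exclude rows outside any column's fences.
keep = pd.Series(True, index=toy.index)
for col in toy.columns:
    q75, q25 = np.percentile(toy[col], [75, 25])
    iqr = q75 - q25
    lower = q25 - 1.5 * iqr  # 1.5 is the conventional multiplier, assumed here
    upper = q75 + 1.5 * iqr
    keep &= toy[col].between(lower, upper)

filtered = toy[keep]
print(toy.shape, "->", filtered.shape)
```

Building one mask from fences computed on the full data keeps the result independent of column order. With 200 features, even a small per-column rejection rate adds up, which is consistent with the large drop in row count seen above.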
In [5]:
trans.to_csv("outlier values.csv")
In [6]:
plt.boxplot(trans['var_0'] ,vert=True,patch_artist=True)
Out[6]:
In [7]:
In [8]:
In [9]:
print(x_train.shape)
print(x_test.shape)
(122551, 200)
(52522, 200)
In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from collections import Counter
from math import log
In [11]:
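import math
import warnings
from sklearn.model_selection import RandomizedSearchCV
warnings.filterwarnings("ignore")
C = LogisticRegression()
parameter_data = [0.0001,0.001,0.01,0.1,1,5,10,20,30,40]
# assumption: base-10 log, as in the log values printed in the Random Forest section
log_my_data = [math.log10(x) for x in parameter_data]
#print(log_my_data)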
print("Printing parameter Data and Corresponding Log value")
data={'Parameter value':parameter_data,'Corresponding Log Value':log_my_data}
param=pd.DataFrame(data)
print("="*100)
print(param)
parameters = {'C':parameter_data}
clf = RandomizedSearchCV(C, parameters, cv=3, scoring='roc_auc', return_train_score=True, n
clf.fit(x_train, y_train)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std= clf.cv_results_['std_test_score']
Out[11]:
9/27/2019 aganirbanghosh007@gmail.com_27 - Jupyter Notebook
<matplotlib.collections.PathCollection at 0x1ea8f06d320>
In [14]:
y_data_pred = []
y_data_pred.extend(clf.predict_proba(data[:])[:,1])
return y_data_pred
In [23]:
In [41]:
In [43]:
Out[43]:
In [44]:
test =pd.read_csv("test.csv")
In [45]:
id_code = test.iloc[:,0]
In [46]:
Out[46]:
ID_code target
0 test_0 0
1 test_1 0
2 test_2 0
3 test_3 0
4 test_4 0
In [47]:
test_logistic = df.join(test)
test_logistic.to_csv('logisticmodelpred.csv')
test_logistic.head()
Out[47]:
ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 ... va
0 test_0 0 11.0656 7.7798 12.9536 9.4292 11.4327 -2.3805 5.8493 18.2675 ... -2
1 test_1 0 8.5304 1.2543 11.3047 5.1858 9.1974 -4.0117 6.0196 18.6316 ... 10
2 test_2 0 5.4827 -10.3581 10.1407 7.0479 10.2628 9.8052 4.8950 20.2537 ... -0
3 test_3 0 8.5374 -1.3222 12.0220 6.5749 8.8458 3.1744 4.9397 20.5660 ... 9
4 test_4 0 11.7058 -0.1327 14.1295 7.7506 9.1035 -8.5848 6.8595 10.6048 ... 4
In [48]:
sns.set_style('whitegrid')
sns.countplot(x='target',data=test_logistic,palette='RdBu_r')
test_logistic['target'].value_counts()
Out[48]:
0 194098
1 5902
Name: target, dtype: int64
Naive Bayes
In [49]:
In [58]:
from sklearn.naive_bayes import GaussianNB
neigh = GaussianNB()
neigh.fit(x_train, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the p
# not the predicted outputs
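The comment above is worth illustrating: AUC measures how well the scores rank positives above negatives, so passing hard 0/1 labels throws most of the ranking away. The sketch below uses a tiny hypothetical example (the labels and probabilities are made up) to compare the two.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical positive-class probabilities, e.g. model.predict_proba(X)[:, 1].
y_proba = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.9])
# Hard labels obtained by thresholding at 0.5, similar to what model.predict(X) returns.
y_label = (y_proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, y_proba)
auc_from_label = roc_auc_score(y_true, y_label)

print(f"AUC from probabilities: {auc_from_proba:.4f}")
print(f"AUC from hard labels:   {auc_from_label:.4f}")
```

The same consideration applies to any submission that is scored by AUC: writing `predict_proba(...)[:, 1]` rather than the 0/1 output of `predict` keeps the ranking information.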
In [63]:
CM = confusion_matrix(y_test, y_test_pred)
CM = pd.crosstab(y_test, y_test_pred)
Out[63]:
In [65]:
predictions_test = neigh.predict(test)
In [67]:
Out[67]:
ID_code target
0 test_0 0
1 test_1 0
2 test_2 0
3 test_3 0
4 test_4 0
In [68]:
test_nb = df.join(test)
In [69]:
test_nb.head(2)
Out[69]:
ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 ... var_
0 test_0 0 11.0656 7.7798 12.9536 9.4292 11.4327 -2.3805 5.8493 18.2675 ... -2.1
1 test_1 0 8.5304 1.2543 11.3047 5.1858 9.1974 -4.0117 6.0196 18.6316 ... 10.6
In [70]:
sns.set_style('whitegrid')
sns.countplot(x='target',data=test_nb,palette='RdBu_r')
test_nb['target'].value_counts()
Out[70]:
0 192096
1 7904
Name: target, dtype: int64
In [71]:
Random Forest
In [73]:
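# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.ht
import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
C = RandomForestClassifier()
n_estimators=[10,50,100,200]
max_depth=[1, 5, 10, 50]
# base-10 logs, matching the values printed in the output below
log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators = [math.log10(x) for x in n_estimators]
print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)
print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)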
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score']
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
============================================================================
========================
Parameter value Corresponding Log Value
0 1 0.00000
1 5 0.69897
2 10 1.00000
3 50 1.69897
Printing parameter Data and Corresponding Log value for Estimators
============================================================================
========================
Parameter value Corresponding Log Value
0 10 1.00000
1 50 1.69897
2 100 2.00000
3 200 2.30103
In [74]:
# https://plot.ly/python/3d-axes/
import plotly.graph_objs as go
trace1 = go.Scatter3d(x=log_n_estimators, y=log_max_depth, z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators, y=log_max_depth, z=cv_auc, name = 'Cross validati
data = [trace1, trace2]
In [76]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV
neigh = RandomForestClassifier(n_estimators=100,max_depth=10,class_weight='balanced')
neigh.fit(x_train, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the p
# not the predicted outputs
In [78]:
CM = confusion_matrix(y_test, y_test_pred)
CM = pd.crosstab(y_test, y_test_pred)
Out[78]:
In [79]:
predictions_rfc = neigh.predict(test)
In [80]:
Out[80]:
ID_code target
0 test_0 1
1 test_1 0
2 test_2 0
3 test_3 0
4 test_4 0
In [81]:
test_rfc = df.join(test)
In [82]:
sns.set_style('whitegrid')
sns.countplot(x='target',data=test_rfc,palette='RdBu_r')
test_rfc['target'].value_counts()
Out[82]:
0 182954
1 17046
Name: target, dtype: int64
In [83]:
test_rfc.to_csv('RandomForestPrediction.csv')