Titanic
Titanic
Titanic
# https://www.kaggle.com/c/titanic/data
import pandas as pd
import numpy as np
test = pd.read_csv("https://raw.githubusercontent.com/flores58c/CST_383_Titanic_
train = pd.read_csv("https://raw.githubusercontent.com/flores58c/CST_383_Titanic
In [2]: train.info()
train.describe
<class 'pandas.core.frame.DataFrame'>
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 1/13
11/21/21, 10:23 PM titanic
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Predictions
We will be predicting survivability based on categorical data of Pclass, Sex, Age. We will use
machine learning methods of knn and or linear regression.
Data Wrangling
In [3]: #find null/na values
train.isna().sum()
Out[3]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 2/13
11/21/21, 10:23 PM titanic
# we can remove cabin column too much NaN values
train.drop("Cabin",axis=1)
#not sure if we should replace some nan columns in age with mean values
Out[4]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare E
Braund,
0 1 0 3 Mr. Owen male 22.0 1 0 A/5 21171 7.2500
Harris
Cumings,
Mrs. John
1 2 1 1 Bradley female 38.0 1 0 PC 17599 71.2833
(Florence
Briggs
Th...
Heikkinen,
2 3 1 3 Miss. female 26.0 0 0 STON/O2.
3101282 7.9250
Laina
Futrelle,
Mrs.
3 4 1 1 Jacques female 35.0 1 0 113803 53.1000
Heath
(Lily May
Peel)
Allen, Mr.
4 5 0 3 William male 35.0 0 0 373450 8.0500
Henry
... ... ... ... ... ... ... ... ... ... ...
Montvila,
886 887 0 2 Rev. male 27.0 0 0 211536 13.0000
Juozas
Graham,
887 888 1 1 Miss. female 19.0 0 0 112053 30.0000
Margaret
Edith
Johnston,
Miss. W./C. 23.4500
888 889 0 3 Catherine female NaN 1 2 6607
Helen
"Carrie"
Behr, Mr.
889 890 1 1 Karl male 26.0 0 0 111369 30.0000
Howell
Dooley,
890 891 0 3 Mr. male 32.0 0 0 370376 7.7500
Patrick
891 rows × 11 columns
In [5]: #changing Sex column to binary 1=male 0=female
le = LabelEncoder()
train["Sex"]=le.fit_transform(train["Sex"])
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 3/13
11/21/21, 10:23 PM titanic
In [6]: #Taking average, rounded age of only males in data. Will possibly import to N/A
male_age = train['Sex'] == 1
mean_for_male_age
Out[6]: 31.0
In [7]: #Taking average, rounded age of only females in data. Will possibly import to N/
female_age = train['Sex'] == 0
mean_for_female_age
Out[7]: 28.0
train["Age"].isnull().sum()
Out[8]: 0
Data Exploration
In [9]: # data discovery
# Survived: 0 = No, 1 = Yes
test.size == train.size
train.columns
dtype='object')
/Users/abeebe/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:3
6: FutureWarning: Pass the following variables as keyword args: x, y. From versi
on 0.12, the only valid positional argument will be `data`, and passing other ar
guments without an explicit keyword will result in an error or misinterpretatio
n.
warnings.warn(
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 4/13
11/21/21, 10:23 PM titanic
In [11]: # The dimensions we will be using are Survived, Sex, Pclass, & Age
train.groupby('Pclass').size().plot.bar()
Out[11]: <AxesSubplot:xlabel='Pclass'>
# train[(train.Survived == 0)]
# train['Survived'].size OR len(train)
Out[13]: Sex 0 1
Survived
0 81 468
1 233 109
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 5/13
11/21/21, 10:23 PM titanic
Out[14]: Pclass 1 2 3
Survived
0 80 97 372
1 136 87 119
sns.distplot(train.Age)
/Users/abeebe/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.p
y:2551: FutureWarning: `distplot` is a deprecated function and will be removed i
n a future version. Please adapt your code to use either `displot` (a figure-lev
el function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)
In [16]: # Find conditional probability of people under and over 30 who survived
# del train["over_30"]
# train.head
train["Over30"] = None
pd.crosstab(train.Survived, train.over_30)
In [17]: # Did men pay more for their fare than women, did older passengers pay less than
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 6/13
11/21/21, 10:23 PM titanic
#sns.scatterplot(data = train, x = "Age", y = "Fare", hue = "Pclass")
plt.show()
In [19]: sns.regplot(x="Fare",y="Survived",data=train,logistic=True)
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 7/13
11/21/21, 10:23 PM titanic
Paying more had a somewhat positive correlation. Not sure if that's the only reason.
In [20]: #grid of scatterplots
sns.pairplot(train)
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 8/13
11/21/21, 10:23 PM titanic
Machine Learning
In [21]: predictors = ['Age']
target = 'Survived'
X = train[predictors].values
y = train[target].values
reg.fit(X_train,y_train)
Out[21]: LinearRegression()
In [22]: plt.scatter(X_train,y_train)
plt.plot(X_test,reg.predict(X_test),linestyle='dashed',color = 'black')
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 9/13
11/21/21, 10:23 PM titanic
Answering the question which age range is most likely to survive. The answer is unclear and
can't be explicitly solved in this plot.
In [23]: def rmse(predicted, actual):
rmse(reg.predict(X_test),y_test)
Out[23]: 0.48659958319426705
Looking at our linear model the error is about half for fitting the data. So this would not be a
good representation for our prediction.
In [24]: # Predicting odds of survival based on age, class, and sex, with KNN Regression.
predictors = ['Age','Sex','Pclass']
target = 'Survived'
X = train[predictors].values
y = train[target].values
reg = KNeighborsRegressor(n_neighbors=20)
reg.fit(X,y)
plot_f_3['Sex'] = 0
plot_f_3['Pclass'] = 3
plot_f_2['Sex'] = 0
plot_f_2['Pclass'] = 2
plot_f_1['Sex'] = 0
plot_f_1['Pclass'] = 1
plot_m_3['Sex'] = 1
plot_m_3['Pclass'] = 3
plot_m_2['Sex'] = 1
plot_m_2['Pclass'] = 2
plot_m_1['Sex'] = 1
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 10/13
11/21/21, 10:23 PM titanic
plot_m_1['Pclass'] = 1
plt1= fig.add_subplot(121)
plt2= fig.add_subplot(122)
plt1.legend()
plt2.legend()
plt1.set_xlabel('Age')
plt1.set_ylabel('Odds of survival')
plt2.set_xlabel('Age')
plt2.set_ylabel('Odds of survival')
plt1.set_title('Females')
plt2.set_title('Males')
predictors = ['Fare']
target = 'Age'
X = train[predictors].values
y = train[target].values
plt.scatter(X,y)
plt.plot(X,reg2.predict(X),linestyle='dashed',color = 'black')
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 11/13
11/21/21, 10:23 PM titanic
The older passengers were more frugal whereas passengers who were in the middle age ranges
paid a bit more.
In [26]: #using amount paid to predict Survivability
predictors = ['Fare']
target = 'Survived'
X = train[predictors].values
y = train[target].values
reg3 = LinearRegression()
reg3.fit(X,y)
plt.scatter(X,y)
plt.plot(X,reg3.predict(X),linestyle='dashed',color = 'black')
This regression plot shows there is a positive correlation with paying more and surviving. This is
because first class had a closer access to the life boats on the titanic.
In [27]: predictors = ['Age']
target = 'Survived'
X = train[predictors].values
y = train[target].values
reg.fit(X_train, y_train)
plot_tree(reg)
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 12/13
11/21/21, 10:23 PM titanic
In [ ]:
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 13/13