Titanic Survival Prediction Using Machine Learning

FINAL SEM PROJECT

Uploaded by

kavyaanjappa.2004

Titanic Survival Prediction Using Machine Learning

BEGINNER, CLASSIFICATION, MACHINE LEARNING, PROJECT, PYTHON

This article was published as a part of the Data Science Blogathon

Hey folks, in this article we will learn how to analyze the data and predict whether a person who boarded
the RMS Titanic survived or not, using a Logistic Regression model from machine learning.

Brief description of Logistic Regression:

A simple yet crisp description of Logistic Regression would be: “it is a supervised learning classification
algorithm used to predict the probability of a target variable. The nature of target or dependent variable is
dichotomous, which means there would be only two possible classes,” as stated in the Tutorialspoint
article.

The graph of logistic regression is as shown below:

image source: link
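
Since the image itself is not reproduced here, a small sketch may help to picture the curve. It assumes the graph in question is the standard logistic (sigmoid) function, which logistic regression uses to turn a score into a probability between 0 and 1; the name `sigmoid` is only an illustrative choice.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Scores far below zero map close to 0, scores far above zero map close to 1,
# and a score of exactly 0 maps to 0.5.
scores = np.array([-5.0, 0.0, 5.0])
probabilities = sigmoid(scores)
print(probabilities)
```

A probability above 0.5 is then read as one class and a probability below 0.5 as the other, which is how the two possible classes come about.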

Let’s dive into the process:

Now let us begin the main part of this article.

If you prefer an audio-visual understanding of this process, you can refer to this video below. It goes
through everything in this article with a little more detail and will help make it easy for you to start
programming your own machine-learning model, even if you don’t have python installed on your computer.

Or you can use both as supplementary materials for learning about machine learning!
Project 15. Titanic Survival Prediction using Machin…

For better understanding, let’s split the task into smaller parts and depict them in a workflow as shown
below :

(image source is the video linked above: image_link )

Now that we know what we have to do to accomplish this task, we shall begin with the very first and the
most important thing needed in machine learning: a dataset.

What is a dataset:
A data set, as the name suggests, is a collection of data. In Machine Learning projects, we need a training
data set. It is the actual data set used to train the model for performing various actions.

Here, in this case, we will be using a dataset available on the internet. One can find various such datasets
over the internet.

The dataset that I’ve used in my code was the data available on Kaggle. You can also download it from
here.

One thing must be kept in mind: in general, the more good-quality data we have to train our model on, the
more accurate our results tend to be. Don’t worry if all of this sounds weird to you; it will all make sense
in a few minutes.

Let’s Begin with our Coding:

To code, we need a suitable environment. In my case I’ve used Google Colab, as it runs in the browser and
spares you the task of installing and setting up Python on your PC. You may use any editor you like.

The first thing that we need to do is import the dependencies that we will be using in our code.

Importing dependencies :

We will be using NumPy, pandas, Matplotlib, seaborn and scikit-learn (sklearn).

As we move ahead, you will get to know the use of each of these modules.

Now, we need to upload the downloaded dataset, into this program, so that our code can read the data and
perform the necessary actions using it.

As we have downloaded a CSV file, we shall be using Pandas to store that data in a variable.

Our dataset is now stored in the variable named titanic_data.
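
The loading step itself is not shown as text above, so here is a minimal sketch of what it could look like. To keep it runnable without the download, it reads a few made-up rows laid out with the same column names as the Kaggle file; with the real data you would pass the path of the CSV instead (the name `train.csv` in the comment is an assumption about where you saved it).

```python
import io
import pandas as pd

# A few made-up rows in the same column layout as the Kaggle file,
# so the snippet runs without downloading anything.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
    "3,1,3,Passenger C,female,,0,0,T3,7.92,,\n"
)

# With the real download this would be pd.read_csv('train.csv')
# (file name and location are assumptions; use wherever you saved it).
titanic_data = pd.read_csv(sample_csv)

print(titanic_data.shape)           # (rows, columns)
print(titanic_data.isnull().sum())  # number of empty cells per column
```

Empty fields in the CSV show up as NaN in the data frame, which is what the isnull() check counts.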

To get a brief idea about how the data is loaded, we use the command “variable_name.head()” to get a
glimpse of the dataset in the form of a table.
The output came out to be as follows:

The meaning of the values (SibSp, Parch) can be found on the website from which we have downloaded the
dataset.

We have learned from Kaggle while downloading the data set, that the data has 891 rows and 12 columns.

Now, let’s check how many cells are left empty in the table.

titanic_data.isnull().sum()

The output came out to be as follows:

We cannot leave the cells empty, because the model cannot be trained on missing values. We therefore have to
either drop those columns or fill the empty cells with the most suitable values.

Handling the missing values:

Dropping the “Cabin” column from the data frame, as most of its values are missing and it won’t be of much
importance

titanic_data = titanic_data.drop(columns='Cabin')

Replacing the missing values in the “Age” column with the mean value

titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())

Finding the mode of the “Embarked” column, i.e. the value that occurs the maximum number of times

print(titanic_data['Embarked'].mode())

Replacing the missing values in the “Embarked” column with the mode value

titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])

Now let us check if there are still any cells remaining empty.

Running the isnull() command again, we get the satisfactory output, that no such empty cells are present.
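
To see the three steps (drop, fill with the mean, fill with the mode) in isolation, here is a small sketch on a made-up table. The data frame `df` and its values are purely illustrative and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with the same kinds of gaps as the Titanic table.
df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})

df = df.drop(columns="Cabin")                     # mostly empty, so drop it
df["Age"] = df["Age"].fillna(df["Age"].mean())    # numeric gap -> mean of the column
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # text gap -> most frequent value

print(df)
print(df.isnull().sum().sum())  # total number of empty cells left
```

Assigning the result of fillna() back to the column, rather than filling in place on a selected column, is the form that behaves the same across recent pandas versions.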

We have already noticed from the table that two of the columns we want to use contain string-type values: the
“Sex” column and the “Embarked” column.

Encoding the categorical columns.

Let’s convert the categories in these two columns into integer values:

titanic_data.replace({'Sex':{'male':0,'female':1}, 'Embarked':{'S':0,'C':1,'Q':2}}, inplace=True)

Now if we run the titanic_data.head() command again, we find that the values have been replaced
successfully.
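
One thing worth checking after an encoding step like this is that no category was left out. The sketch below shows the idea on a made-up table, using the pandas `map` method as an alternative to `replace` (this is a swapped-in technique, not the call used above); the dictionary names are illustrative.

```python
import pandas as pd

# Made-up rows covering every category of the two columns.
df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "Q"]})

sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

df["Sex"] = df["Sex"].map(sex_codes)
df["Embarked"] = df["Embarked"].map(embarked_codes)

# map() turns any value missing from the dictionary into NaN,
# so a NaN count of zero means every category was encoded.
unmapped = int(df.isnull().sum().sum())
print(df)
print(unmapped)
```

Note that the numbers are only labels here: 2 for Q does not mean "more" than 0 for S.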

We also see that there are a few columns which are not of much importance in this process: PassengerId,
Name and Ticket. We will leave them out, together with the target column Survived, when we build the
feature set.

Now it’s time to begin implementing machine learning.

Let’s split the data into the feature and target variables.

X = titanic_data.drop(columns=['PassengerId','Name','Ticket','Survived'])
Y = titanic_data['Survived']

Here, X is the feature variable, containing the features Pclass, Sex, Age, SibSp, Parch, Fare and Embarked,
i.e. everything except the identifier columns and the Survived column.

Y, on the other hand, is the target variable, as that is the result that we want to determine, i.e., whether a
person survived.

Now, we will be splitting the data into four variables, namely, X_train, Y_train, X_test, Y_test.

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=2)
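
A quick way to see what this call produces is to look at the shapes of the four pieces. The sketch below uses placeholder arrays of zeros with 891 rows and 7 feature columns instead of the real table, which is an assumption made only so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the Titanic table (891)
# and 7 feature columns; the values themselves do not matter here.
X = np.zeros((891, 7))
Y = np.zeros(891)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

print(X.shape, X_train.shape, X_test.shape)
```

With 891 rows and test_size=0.2, this gives 712 rows for training and 179 rows for testing.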

Let’s understand the variables :

X_train: contains a set of rows from variable ‘X’, used to train the model.

Y_train: contains the output (whether the person is alive or dead) for the corresponding rows of X_train.

X_test: contains the remaining rows of variable ‘X’, excluding the ones in X_train.

Y_test: contains the output (whether the person is alive or dead) for the corresponding rows of X_test.

test_size: the fraction of the data that goes into the test set (here 0.2 means that the rows will be
divided between the training and test variables in an 80:20 ratio). You can use other values; a value
below 0.3 is usually preferred, so that most of the data is left for training.

random_state: a seed for the shuffling done before the split, so that the same rows end up in each set
every time the code is run.
Logistic Regression :

Let’s create a model named model

model = LogisticRegression()

Now let us train the model, with our training values(X_train , Y_train)

model.fit(X_train, Y_train)

The model trains in a way like this: “When the values of X are these, the value of Y is this.”
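
To make that idea concrete, here is a tiny sketch with made-up data that has a single feature. The names `X_toy`, `y_toy` and `toy_model` are illustrative and the numbers have nothing to do with the Titanic table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data: small values belong to class 0, large ones to class 1.
X_toy = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# predict_proba returns one row per sample: [P(class 0), P(class 1)].
new_points = np.array([[2.0], [8.0]])
probabilities = toy_model.predict_proba(new_points)
labels = toy_model.predict(new_points)
print(probabilities)
print(labels)
```

The predict method simply reports the class with the higher probability, so the small value is labelled 0 and the large one 1.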

Checking the Accuracy:

First, let’s check how accurate the model is when it predicts the values for our training data:

Let’s name a variable X_train_prediction, which will store the model’s predicted outputs for X_train.

X_train_prediction = model.predict(X_train)

Now, to check how accurate its predictions were, we compare the values of X_train_prediction with Y_train,
which holds the original real-life data.

training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy score of training data : ', training_data_accuracy)

The output comes out to be 0.8075842696629213, which is pretty decent.

Now, let’s try it again with X_test and Y_test:

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score of test data : ', test_data_accuracy)

The output came out to be 0.7821229050279329, which is very close to the accuracy on our training data.

Thus our model performs about as well on data it has not seen as on the data it was trained on, which
suggests that it is not overfitting.
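
The accuracy score itself is nothing more than the fraction of predictions that match the true values. Here is a short sketch with made-up labels that computes it by hand and with accuracy_score; the arrays `y_true` and `y_pred` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

by_hand = (y_true == y_pred).mean()            # fraction of matching positions
from_sklearn = accuracy_score(y_true, y_pred)  # same quantity from scikit-learn
print(by_hand, from_sklearn)
```

Four of the five predictions match, so both values are 0.8. Read the same way, a test accuracy of 0.7821 on a 179-row test set corresponds to 140 correct predictions.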

Checking for a Random Person:

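Now let’s check for a single person, using one row of the table from Kaggle, with Sex and Embarked encoded
as integers in the same way as before.

# Order of the values: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
# Note that these values exclude the Survived value, as it is to be determined by the model itself
input_data = (3,0,35,0,0,8.05,0)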

Now let’s change these values to a NumPy array :

input_data_as_numpy_array = np.asarray(input_data)

Our model expects a two-dimensional array with one row per person and one column per feature, so we need to
reshape this one-dimensional array into a single row.

input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
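
If the reshape step feels abstract, printing the shapes before and after makes it clearer. This is a small sketch using the same tuple of values; the names `as_array` and `as_row` are illustrative.

```python
import numpy as np

input_data = (3, 0, 35, 0, 0, 8.05, 0)
as_array = np.asarray(input_data)
as_row = as_array.reshape(1, -1)   # 1 row, as many columns as needed

print(as_array.shape)  # one-dimensional: 7 values
print(as_row.shape)    # two-dimensional: 1 sample with 7 features
```

The -1 tells NumPy to work out the number of columns by itself, so the same line works for any number of features.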

Now, Let’s predict using our model:

prediction = model.predict(input_data_reshaped)
#print(prediction)
if prediction[0]==0:
    print("Dead")
if prediction[0]==1:
    print("Alive")

On running the code, we get the same result as the one given in the table for this person.

This single check agrees with the data, though the test accuracy above is the better measure of how well our
model is performing. You can train the model using a bigger dataset to try to get better results.

End Notes :

The results of a machine learning model can often be improved by using a bigger dataset, but preparing the
data and training will be more tedious and time-consuming. Feel free to make any necessary changes to this
code, and customize it as per your requirements. A similar logic can be applied to perform various kinds of
predictions.

Thanks for reading…Have a good day..!!

About the Author:

Hey, I am Pinak Datta, currently a second-year student pursuing Computer Science Engineering at
Kalinga Institute of Industrial Technology. I love Web development, Competitive Coding, and a bit of
Machine Learning too. Please feel free to connect with me through my socials.

Linked-in

Instagram

Facebook

Mail

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s
discretion.

Article Url - https://www.analyticsvidhya.com/blog/2021/07/titanic-survival-prediction-using-machine-learning/

Pinak Datta
2nd year CSE Undergrad who loves to write clean code 😛
