
BANK CUSTOMER CHURN PREDICTION USING

MACHINE LEARNING
A PROJECT REPORT

Submitted by

S. SHYAM KOUSHIK 221801370016
T.L.S. SUPREETHA 221801370030
V. SRAVAN KUMAR 221801370037
V. DINESH KUMAR 221801370039
P. SUBHASH SIDDIK 221801370051
P. TEJESH 221801370066
P. CHARISHMA JYOTI 221801370074
Y. CHANDU 221801370076

Under the esteemed guidance of

Mrs. P. Anuradha, M.Tech, (Ph.D),

in partial fulfilment for the award of the degree of

BACHELOR OF TECHNOLOGY IN

COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE


AND ENGINEERING
CENTURION UNIVERSITY OF TECHNOLOGY AND
MANAGEMENT ANDHRA PRADESH.
VIZIANAGARAM CAMPUS

BONAFIDE CERTIFICATE

Certified that this project report, “Bank Customer Churn Prediction Using Machine Learning”, is the bonafide work of S. SHYAM KOUSHIK (221801370016), T.L.S. SUPREETHA (221801370030), V. SRAVAN KUMAR (221801370037), V. DINESH KUMAR (221801370039), P. SUBHASH SIDDIK (221801370051), P. TEJESH (221801370066), P. CHARISHMA JYOTI (221801370074), and Y. CHANDU (221801370076), who carried out the project work under my supervision.

This is to further certify, to the best of my knowledge, that this project has not been carried out earlier in this institute or the university.

SIGNATURE

Mrs.P.ANURADHA
Assistant professor

Certified that the above-mentioned project has been duly carried out as per the
norms of the college and statutes of the university.

SIGNATURE
Dr .P.SUBRAT KUMAR
Associate professor

SIGNATURE
DR.P.A. SUNNY DAYAL
Dean Associate professor
HEAD OF THE DEPARTMENT / DEAN OF THE SCHOOL
Professor of Computer Science and Engineering
DEPARTMENT SEAL

ACKNOWLEDGEMENTS

I am immensely thankful to Assistant Professor P. Anuradha, of the Department of


Computer Science and Engineering at SoET, Vizianagaram Campus. P. Anuradha Ma’am
led me through the complexities of this project effortlessly, displaying unparalleled
generosity and guidance.

I thank Prof. Dr. Subrat Kumar Parida, Head of the Department of Computer
Science and Engineering, SoET, Vizianagaram Campus, for extending his support
during the course of this investigation.

I thank Dr. P. A. Sunny Dayal, Dean of SoET, Vizianagaram Campus for their
invaluable guidance, insightful feedback, and continuous support throughout the course of
this project. Your expertise and mentorship have been invaluable.

I thank Dr. P. Pallavi, Registrar, CUTM, Vizianagaram Campus for their assistance and
cooperation in facilitating the necessary resources and administrative support essential for
the successful execution of this project.

I thank P. Prasanta Kumar Mohanty, Vice Chancellor, CUTM, Vizianagaram Campus


for fostering an environment that encourages academic excellence and innovation. Your
vision has been a constant source of inspiration.

I also express my deepest appreciation to my parents for their unconditional love,


encouragement, and belief in my abilities. Their unwavering support has been the
cornerstone of my achievements.

I am sincerely grateful to each one of you for your contributions, guidance, and
unwavering support, without which this project would not have been possible.
DECLARATION

We hereby declare that the work described in this project report, entitled "BANK CUSTOMER CHURN PREDICTION USING MACHINE LEARNING", which is submitted by us in partial fulfilment for the award of Bachelor of Technology in the Department of Computer Science and Engineering to the Centurion University of Technology & Management, Andhra Pradesh, is the result of work done by us under the guidance of Mrs. P. Anuradha.

The work is original and has not been submitted for any degree of this or any other university.

Submitted by,

V. DINESH KUMAR (221801370039)

P. TEJESH (221801370066)

S. SHYAM KOUSHIK (221801370016)

T.L.S SUPREETHA (221801370030)

P. SUBHASH SIDDIK (221801370051)

P. CHARISHMA JYOTI (221801370074)

Y. CHANDU (221801370076)
V.SRAVAN KUMAR (221801370037)

ABSTRACT

Customer churn, or the rate at which customers cease doing business with a company, is a
critical concern for banks, as retaining existing customers is often more cost-effective than
acquiring new ones. In this study, we apply machine learning techniques to predict
customer churn in the banking sector. We explore various features such as demographic
information, transaction history, and customer interactions to develop predictive models.
Specifically, we employ algorithms including logistic regression, random forest, and
gradient boosting machines to build and evaluate the models. Additionally, we investigate
the impact of feature engineering, hyperparameter tuning, and model interpretability
techniques on the performance and interpretability of the models. Our findings
demonstrate the effectiveness of machine learning in identifying customers at risk of
churn, enabling proactive retention strategies and ultimately contributing to improved
customer satisfaction and business profitability in the banking industry.
TABLE OF CONTENTS

Chapter 1 Introduction

Chapter 2 System Analysis

2.1 Existing System

2.2 Proposed System

2.3 Algorithm

2.4 System Requirements

2.4.1 Software Requirements

2.4.2 Hardware Requirements

Chapter 3 System Design

3.1 System Architecture

3.2 Modules

3.3 Data Flow Diagrams

Chapter 4 Technology Description

Chapter 5 Implementation

5.1 Steps for Implementation

5.2 Coding

Chapter 6 Output Screen

Conclusion

Future Scope

Bibliography

References

List of Diagrams

2.3.1 Logistic Regression

2.3.2 Random Forest

3.1 System Architecture

3.3 Data Flow Diagrams

List of Tables

3.1 List of Attributes


INTRODUCTION

The heart is a muscular organ that pumps blood through the body and is the central part of the body's cardiovascular system. The cardiovascular system also comprises a network of blood vessels, such as veins, arteries, and capillaries, which deliver blood all over the body. Abnormalities in normal blood flow from the heart cause several types of heart disease, commonly known as cardiovascular diseases (CVDs). Heart diseases are the leading cause of death worldwide. According to the World Health Organization (WHO), 17.5 million deaths occur globally because of heart attacks and strokes. More than 75% of deaths from cardiovascular diseases occur in middle-income and low-income countries. Also, 80% of the deaths that occur due to CVDs are because of stroke and heart attack.

Therefore, predicting cardiac abnormalities at an early stage, and building tools for the prediction of heart disease, can save many lives and help doctors design effective treatment plans, which ultimately reduces the mortality rate due to cardiovascular diseases. Data mining, or machine learning, is a discovery method for analyzing large volumes of data from different perspectives and summarizing it into useful information. Nowadays, healthcare industries generate a huge amount of data pertaining to disease diagnosis, patients, and so on. Data mining provides a number of techniques for discovering hidden patterns or similarities in such data.

In this paper, machine learning algorithms are proposed for the implementation of a heart disease prediction system, which was validated on two open-access heart disease prediction datasets. Data mining is the computer-based process of extracting useful information from enormous databases. The hidden patterns it uncovers can be utilized for healthcare diagnosis. However, the available raw medical data are widely distributed, voluminous, and heterogeneous in nature. This data needs to be collected in an organized form, and can then be integrated to form a medical information system. Disease prediction plays a significant role in data mining. This paper analyzes heart disease prediction using classification algorithms.

The primary goal of this study is to develop a heart disease prediction system. The system extracts information related to heart disease from a historical heart disease dataset and uses it to build a classifier that predicts the disease from the inputs supplied by the user, reducing the cost of medical tests. The scope of the project is to apply machine learning algorithms to larger datasets, which helps improve the accuracy of the results. Machine learning techniques can give more accurate results than relying on a doctor's experience alone.

We predict heart disease using classification algorithms. Machine learning classification algorithms such as Random Forest and Logistic Regression are used to explore different kinds of heart-related problems.
SYSTEM ANALYSIS

2.1 EXISTING SYSTEM

Clinical decisions are often made based on doctors' intuition and experience rather than on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors, and excessive medical costs, which affect the quality of service provided to patients. There are many ways in which a medical misdiagnosis can present itself. Whether a doctor or the hospital staff is at fault, a misdiagnosis of a serious illness can have extreme and harmful effects. The National Patient Safety Foundation reports that 42% of medical patients feel they have experienced a medical error or missed diagnosis. Patient safety is sometimes negligently given the back seat to other concerns, such as the cost of medical tests, drugs, and operations. Medical misdiagnoses are a serious risk to the healthcare profession. If they continue, people will fear going to the hospital for treatment. We can help put an end to medical misdiagnosis by informing the public and by filing claims and suits against the medical practitioners at fault.

Disadvantages:

• Prediction is not possible at early stages.

• In the existing system, practical use of collected data is time-consuming.

• Any mistake made by the doctor or hospital staff in predicting could lead to fatal incidents.

• A highly expensive and laborious process needs to be performed before treating the patient to find out whether he/she has any chance of developing heart disease in the future.
2.2 PROPOSED SYSTEM

This section gives an overview of the proposed system and describes the components, techniques, and tools used to develop it. To develop an intelligent and user-friendly heart disease prediction system, an efficient software tool is needed to train on large datasets and compare multiple machine learning algorithms. After choosing the most robust algorithm, with the best accuracy and performance measures, it will be used to develop a smartphone-based application for detecting and predicting heart disease risk level.

2.3 ALGORITHMS

2.3.1 Logistic Regression

Logistic Regression is a popular statistical technique for predicting binomial outcomes (y = 0 or 1). More generally, it predicts categorical outcomes (binomial or multinomial values of y). The predictions of Logistic Regression (LogR) are probabilities of an event occurring, i.e. the probability of y = 1 given certain values of the input variables x. Thus, the outputs of LogR range between 0 and 1.

LogR models the data points using the standard logistic function, which is an S-shaped curve, also called the sigmoid curve, and is given by the equation:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where z is a linear combination of the input variables x.

Logistic Regression Assumptions:

• Binary logistic regression requires the dependent variable to be binary.

• For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.

• Only the meaningful variables should be included.

• The independent variables should be independent of each other.

• Logistic regression requires quite large sample sizes.

• Even though logistic (logit) regression is frequently used for binary variables (2 classes), it can also be used for categorical dependent variables with more than 2 classes; in this case it is called Multinomial Logistic Regression.

Figure 2.3.1 : Logistic Regression
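
The logistic function described in this section can be illustrated with a short sketch. The coefficients `b0` and `b1` and the input values are hypothetical numbers chosen only for illustration; they are not fitted to the project's dataset.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model: z = b0 + b1 * x
b0, b1 = -4.0, 0.08
x = np.array([20.0, 60.0, 80.0])

# Probability of y = 1 for each input value
probabilities = sigmoid(b0 + b1 * x)

# Turn probabilities into class labels with a 0.5 cut-off
predicted_class = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_class)
```

Larger values of z push the probability towards 1 and smaller values push it towards 0, which is why a 0.5 cut-off corresponds to the sign of z.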

2.3.2 Random Forest

Random forest is a supervised learning algorithm that is used for both classification and regression. However, it is mainly used for classification problems. Just as a forest is made up of trees, and more trees make a more robust forest, random forest builds decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method that is better than a single decision tree because it reduces over-fitting by averaging the results.

Random Forest works through the following steps:

• First, start with the selection of random samples from a given dataset.

• Next, the algorithm constructs a decision tree for every sample and gets the prediction result from every decision tree.

• Then, voting is performed over the predicted results.

• Finally, the most voted prediction result is selected as the final prediction result.

The following diagram illustrates its working:

Figure 2.3.2 : Random Forest
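
The sampling and voting steps can be sketched by hand as follows. This is only an illustration of the idea, assuming a synthetic dataset from `make_classification` and 25 trees; in practice scikit-learn's `RandomForestClassifier` does all of this internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the project's dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Step 1: draw a random sample (with replacement) from the dataset
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: build one decision tree per sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: collect the prediction of every tree, shape (n_trees, n_samples)
votes = np.stack([tree.predict(X) for tree in trees])

# Step 4: the class predicted by most trees is the final prediction
majority = (votes.mean(axis=0) >= 0.5).astype(int)

print("Agreement with training labels:", (majority == y).mean())
```

An odd number of trees is used so that a binary vote can never end in a tie.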

2.4 System Requirements

2.4.1 Software Requirements:

REQUIREMENT SOFTWARE
Operating System - Windows 10 or above
IDE - Jupyter Notebook
Programming Language - Python
2.4.2 Hardware Requirements :

REQUIREMENT HARDWARE
Processor - 1.6 GHz or Faster Processor
RAM - 8 GB or above
Hard disk
System Design

3.1 System Architecture

Figure 3.1 : System Architecture


The following shows the list of attributes on which we are working :

Table 3.1 : List of Attributes

3.2 MODULES

The entire work of this project is divided into 4 modules.

They are:

a. Data Pre-processing
b. Feature
c. Classification
d. Prediction

3.3 DATA FLOW DIAGRAMS


Figure 3.3 Data Flow Diagrams
TECHNOLOGY DESCRIPTION

4.1 Technology Explanation:

Heart attack prediction using machine learning involves applying various algorithms to analyze medical data and predict the likelihood of a person experiencing a heart attack. The technology used breaks down as follows:

• Machine Learning Algorithms: Algorithms such as Logistic Regression, Random Forest, Support Vector Machines (SVM), or Artificial Neural Networks (ANN) are commonly employed to analyze medical data and make predictions.

• Feature Selection: Feature importance analysis, or dimensionality reduction techniques such as Principal Component Analysis (PCA), are used to keep the most relevant information from attributes such as age, blood pressure, and cholesterol levels.

• Data Preprocessing: Steps including data cleaning, handling missing values, normalization, and standardization are performed to ensure the quality of the data used to train the machine learning model.

• Model Evaluation: Techniques like cross-validation, ROC curves, and confusion matrices are used to evaluate the performance of the trained model and fine-tune its parameters.
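
The evaluation techniques named above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data generated by `make_classification` (an assumption; the project's own dataset is not used here), with Logistic Regression standing in for whichever model is being evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training portion
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Fit once on the full training portion, then evaluate on held-out data
model.fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))

# Area under the ROC curve uses predicted probabilities, not labels
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("CV accuracy per fold:", cv_scores)
print("Confusion matrix:\n", cm)
print("ROC AUC:", auc)
```

In the confusion matrix, rows are true classes and columns are predicted classes, so the off-diagonal cells count the two kinds of error separately.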


4.2 PACKAGES USED AND INSTALLATION PROCESS

• Python: The project is implemented in the Python programming language.

• Scikit-learn: This library provides simple and efficient tools for data mining and data analysis, including various machine learning algorithms. Install it using pip:

o pip install scikit-learn

• Pandas: Used for data manipulation and analysis. Install it using pip:

o pip install pandas

• NumPy: Essential for numerical computing in Python. Install it using pip:

o pip install numpy

• Matplotlib and seaborn: These libraries are used for data visualization. Install them using pip:

o pip install matplotlib seaborn
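
After installation, a quick check such as the following sketch can confirm that the packages are visible to the Python interpreter. The helper name `missing_packages` is hypothetical. Note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names of the packages listed above
# (scikit-learn is imported as "sklearn")
REQUIRED = ["sklearn", "pandas", "numpy", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```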

4.3 MANUAL

• Data Collection: Gather medical data including age, gender, blood pressure, cholesterol levels, etc., from individuals.

• Data Preprocessing: Preprocess the data by cleaning, handling missing values, and performing feature scaling.

• Feature Selection: Use techniques like PCA or feature importance analysis to select relevant features.

• Model Training: Train the machine learning model using algorithms such as Logistic Regression, Random Forest, or SVM.

• Model Evaluation: Evaluate the performance of the trained model using techniques like cross-validation, ROC curves, and confusion matrices.

• Prediction: Input the medical data of a new individual into the trained model to predict the likelihood of a heart attack.

• Result Interpretation: Interpret the prediction result and take necessary actions, such as advising the individual to seek medical attention if the likelihood of a heart attack is high.
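
The prediction and interpretation steps can be sketched as below. This is an illustration under stated assumptions: the data is synthetic (13 features, matching the number of input attributes in the code of section 5.2), the first row stands in for a new individual, and the 0.5 `THRESHOLD` is a hypothetical cut-off, not a clinical recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the collected medical data (assumption)
X, y = make_classification(n_samples=300, n_features=13, random_state=1)

# Scaling and the classifier are chained so that a new record is
# preprocessed exactly the same way as the training data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# Stand-in for the measurements of one new individual
new_individual = X[:1]

# Probability of the positive class (second column of predict_proba)
risk = pipeline.predict_proba(new_individual)[0, 1]

THRESHOLD = 0.5  # hypothetical cut-off
advice = "seek medical attention" if risk >= THRESHOLD else "no action flagged"
print(f"Predicted likelihood: {risk:.2f} -> {advice}")
```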


IMPLEMENTATION

5.1 STEPS FOR IMPLEMENTATION

1. Install the required packages for building the classifiers (Logistic Regression and Random Forest).

2. Load the libraries into the workspace from the packages.

3. Read the input dataset.

4. Normalise the input dataset.

5. Divide the normalised data into two parts:

a. Train data

b. Test data

(Note: 80% of the normalised data is used as train data and 20% as test data.)
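
Steps 4 and 5 can be sketched as follows. Note that this sketch is a variation on the steps above: it splits the data first and fits the scaler on the training portion only, a common way to keep test data from influencing the scaling. The random array is a synthetic stand-in for the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 rows, 5 numeric features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(300, 5))
y = rng.integers(0, 2, size=300)

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to both portions
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```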

5.2 CODING

Sample code:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

heart_df=pd.read_csv("heartnew.csv")

heart_df

heart_df.head()

heart_df.isnull()

heart_df.isnull().sum()

heart_df.info()

dict_names = {
'age': 'Age',
'sex': 'Sex',
'cp': 'Chest_Pain',
'trtbps': 'Resting_Pressure',
'chol': 'Cholesterol',
'fbs': 'Fasting_Blood_Sugar',
'restecg': 'Resting_Ecg_Results',
'thalachh': 'Maximum_Heart_Rate',
'exng': 'Exercise_Induced_Angina',
'oldpeak': 'Old_Peak',
'slp': 'Slope',
'caa': 'Major_Vessels',
'thall': 'Thallium_Rate',
'output': 'Target'
}

# Rename the columns in one call; keys missing from the data are ignored
heart_df.rename(columns=dict_names, inplace=True)
heart_df.head()

heart_df.shape

heart_df.info()

heart_df.describe()

heart_df.duplicated().sum()

heart_df.drop_duplicates(inplace=True)

heart_df.shape

# Count how many rows fall in each target class
heart_df['Target'].value_counts()

plt.figure(figsize=(13,4))
age_counts = heart_df['Age'].value_counts().sort_index()
plt.bar(age_counts.index, age_counts.values,
color=plt.cm.viridis(np.linspace(0, 1, len(age_counts))))
plt.title('Frequency Of Each Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(13,4))
sns.histplot(heart_df['Maximum_Heart_Rate'], bins=20, kde=True,
color='pink')
plt.title('Distribution of Maximum Heart Rate')
plt.xlabel('Maximum Heart Rate')
plt.ylabel('Frequency')
plt.show()

##

plt.figure(figsize=(7,4))
sns.countplot(x='Sex', data=heart_df, palette='Set1')
plt.title('Distribution of Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(8, 6))
chest_pain_counts = heart_df['Chest_Pain'].value_counts().sort_index()
colors = plt.cm.viridis(np.linspace(0, 1, len(chest_pain_counts)))
ax = chest_pain_counts.plot(kind='bar', width=0.9, color=colors)
ax.set_title('Chest Pain Levels Frequency')
ax.set_xlabel('Chest Pain Level')
ax.set_ylabel('Frequency')
plt.show()

plt.figure(figsize=(13, 4))
sns.histplot(heart_df['Resting_Pressure'], kde=True, color='skyblue')
plt.title('Distribution of Resting Pressure')
plt.xlabel('Resting Pressure')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(13, 4))
sns.histplot(heart_df['Cholesterol'], kde=True, color='red')
plt.title('Distribution of Cholesterol')
plt.xlabel('Cholesterol Level')
plt.ylabel('Frequency')
plt.show()

x = heart_df.drop(columns=['Target'])
y = heart_df['Target']

x.shape

y.shape

scaler = StandardScaler()
x = scaler.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)

y_pred = logistic_model.predict(x_test)

accuracy_logistic = accuracy_score(y_test, y_pred)


accuracy_logistic

# Train a random forest on the same split and measure its test accuracy
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

y_pred = rf_model.predict(x_test)

accuracy_rf = accuracy_score(y_test, y_pred)

accuracy_rf

OUTPUT SCREENS

Figure 6.1 Dataset

Figure 6.2 Dataset


Figure 6.3 Dataset

Figure 6.4 Random Forest Classifier


Figure 6.5 Logistic Regression

Figure 6.6 Maximum Heart Rate


Figure 6.7 Frequency of Each age

Figure 6.8 Distribution of Sex


Figure 6.9 : Resting Pressure

Figure 6.10 : Cholesterol Level


CONCLUSION

In this project, we introduced a heart disease prediction system that uses different classifier techniques for the prediction of heart disease. The techniques are Random Forest and Logistic Regression; in our analysis, Random Forest achieved better accuracy than Logistic Regression. Our aim is to further improve the performance of the Random Forest by removing unnecessary and irrelevant attributes from the dataset and keeping only those that are most informative for the classification task.
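
One possible way to remove less informative attributes is sketched below. It is an assumption-laden illustration, not the method used in this project: the data is synthetic, and the rule of keeping attributes whose importance is at or above the median is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 12 attributes, only 4 of them informative
X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Fit a forest and keep attributes with importance >= the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="median",
).fit(X, y)

importances = selector.estimator_.feature_importances_
kept = selector.get_support(indices=True)  # column indices that survive
X_reduced = selector.transform(X)

print("Importances:", importances.round(3))
print("Kept columns:", kept)
print("Reduced shape:", X_reduced.shape)
```

Whether the reduced attribute set actually improves accuracy has to be checked on held-out data, since dropping attributes can also remove useful signal.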


FUTURE SCOPE

As illustrated earlier, the system can be used as a clinical assistant for clinicians. The disease prediction through risk factors can be hosted online, so that any internet user can access the system through a web browser and understand their risk of heart disease. The proposed model can be implemented for real-time applications. Using the proposed model, other types of heart disease can also be determined, such as rheumatic heart disease, hypertensive heart disease, ischemic heart disease, cardiovascular disease, and inflammatory heart disease. Other healthcare systems can be formulated using this proposed model in order to identify diseases at an early stage. The proposed model requires an efficient processor with a good memory configuration to run in real time. The proposed model has a wide area of application, such as grid computing, cloud computing, and robotic modeling. To increase the performance of our classifier in future, we will work on ensembling two algorithms, Random Forest and AdaBoost. By ensembling these two algorithms, we expect to achieve higher performance.
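
A combination of Random Forest and AdaBoost could be sketched as below. This is one possible approach, assuming scikit-learn's `VotingClassifier` with soft voting and synthetic data; the report does not specify how the two algorithms would be combined, and the sketch makes no claim about the accuracy that would be reached on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=400, n_features=13, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# Soft voting averages the class probabilities of the two models
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=7)),
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=7)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print("Ensemble test accuracy:", accuracy)
```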
BIBLIOGRAPHY

1. S. E.-S. S. I. D. K. A. A. F Ali, "A smart healthcare monitoring system for

heart disease prediction based on ensemble deep learning and feature fusion,"

2020.

2. C. T. G. S. S Mohan, "Effective heart disease prediction using hybrid machine

learning techniques," 2019.

3. M. R. M. I. M. I. S Nashif, "Heart disease detection by using machine learning

algorithms and a real-time cardiovascular health monitoring system," 2018.

4. Y. H. K. H. L. W. L. W. M Chen, "Disease prediction by machine learning

over big data from healthcare communities," 2017.

5. S. S. K Deepika, "Predictive analytics to prevent and control chronic diseases,"

2016.

6. J. S. N. S. A Dey, "Analysis of supervised machine learning algorithms for

heart disease prediction with reduced number of attributes using principal

component analysis," 2016.

7. M. S. B Bahrami, "Prediction and Diagnosis of Heart Disease by Data Mining

Techniques," 2015.

8. R. S. E. D. K Vembandasamy, "Heart diseases detection using Naive Bayes

algorithm," 2015.

9. E. A. Y. K. AF Otoom, "Effective diagnosis and monitoring of heart disease,"

2015.

10. S. P. V Chaurasia, "Early prediction of heart diseases using data mining

techniques," 2013.

11. S. S. G Parthiban, "Applying machine learning methods in diagnosing heart

disease for diabetic patients," 2012.


REFERENCES

https://www.kaggle.com/code/kanncaa1/heart-attack-analysis-prediction

https://www.youtube.com/watch?v=tSBAag6lAQo
