Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
M. Neeraja
18ME1A0563
2018-2022
RAMACHANDRA COLLEGE OF ENGINEERING
(NBA Accredited, Accredited by NAAC at B++, Approved by AICTE, New Delhi, Affiliated to JNTUK, Kakinada) Vatluru, Eluru, A.P.
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
Dr. Satyabrata Dash
Professor, Department of CSE
Project Guide

Dr. V. Suryanarayana
Professor & HOD, Department of CSE

External Examiner
DECLARATION
ACKNOWLEDGEMENT
I wish to take this opportunity to express my deep gratitude to all the people who have extended their cooperation in various ways during my project work. It is my pleasure and responsibility to acknowledge the help of all those individuals.
I extend my sincere thanks to Dr. Satyabrata Dash, Associate Professor in the Department of CSE, for helping me in the successful completion of my project under his supervision.
I am very grateful to Dr. V. Suryanarayana, Head of the Department, Department of Computer Science & Engineering, for his assistance and encouragement in all respects in carrying out my project work.
I express my deepest gratitude to Dr. M. Muralidhar Rao, Principal, Ramachandra College of Engineering, Eluru, for his valuable suggestions during the preparation of the draft of this document.
I express my deepest gratitude to the Management of Ramachandra College of Engineering, Eluru, for their support and encouragement in completing my project work and for providing the necessary facilities.
I sincerely thank all the faculty members and staff of the Department of CSE for their valuable advice, suggestions, and constant encouragement, which played a vital role in carrying out this project work.
Finally, I thank one and all who directly or indirectly helped me to complete my project work successfully.
M. Neeraja
18ME1A0563
INDEX

CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 About the Project
1.2 Purpose
1.3 Motivation
CHAPTER 2: LITERATURE SURVEY
2.1 Feasibility
2.1.1 Technical Feasibility
2.1.2 Economic Feasibility
2.1.3 Operational Feasibility
2.2 Existing System
2.3 Proposed System
2.4 Block Diagram
CHAPTER 3: SYSTEM ANALYSIS AND DESIGN
3.1 Requirement Specification
3.2 Software Requirements
3.3 Hardware Requirements
3.4 System Design
3.4.1 System Architecture
3.4.2 Input Design
3.4.3 Output Design
3.5 UML Diagrams Introduction
3.5.1 Use Case Diagram
3.5.2 Sequence Diagram
3.5.3 State Chart Diagram
3.5.4 Activity Diagram
3.6 Software Environment
3.6.1 Python Programming
3.6.2 Python IDLE
3.6.3 Visual Studio Code
3.6.4 Web Technologies
3.7 Machine Learning
3.7.1 Logistic Regression
CHAPTER 4: METHODOLOGY
4.1 Project Description
4.2 Dataset
4.3 Data Sampling
4.4 Pre-Processing
4.5 Attributes
4.6 Sample Code
CHAPTER 5: RESULTS/OUTPUTS
CHAPTER 6: CONCLUSION
CHAPTER 7: REFERENCES
ABSTRACT
All over the world, in different sectors, churn prediction plays a very important role in the growth of an organization. Customer churn is very harmful to a company's revenue and profit. The most important step to avoid churn is to detect churn and its reasons, and accordingly initiate prevention measures. Nowadays machine learning plays a vital role in addressing this problem. The objective of this project is to predict churn in the banking sector by using well-known machine learning techniques such as Logistic Regression. The classification model is built by analyzing historical data and then applying the prediction model based on that analysis. Retaining customers within an organization is one of the primary growth concerns in today's world. Predicting the customer churn rate will help the bank know which categories of customers generally tend to leave the bank. Churn is based on various factors, including switching to a competitor, canceling a subscription because of poor customer service, or discontinuing all contact with a brand due to insufficient interaction with customers. Staying connected with customers for a long period of time is more effective than trying to attract new customers, and resolving the issues that have gone amiss can keep customers happy. This project finds the major factors that lead customers to churn and analyzes them using machine learning algorithms. Churn indicates how many existing customers tend to leave the business, so lowering churn has an immense positive impact on revenue streams. Churn rates track lost customers and growth rates track new customers; comparing and analyzing both of these metrics tells exactly how much the business is growing over time. In this predictive process, popular models have been used to achieve a decent level of accuracy.
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.2 PURPOSE
Customer churn prediction is the practice of assigning a churn probability to each
customer in the company database, according to a predicted relationship between that
customer's historical information and their future churning behavior. Practically, the probability
to end the relationship with the company is then used to rank the customers from most to least
likely to churn, and customers with the highest propensity to churn receive marketing
retention campaigns. It reveals the behavior of those customers over a time period, and
assesses the prediction quality of the models against the actual outcome of those customers.
All the leading organizations have been working in the customer's best interest. A customer has a choice due to healthy competition among service providers, and there is no end to better services. Shortage of data, targeted sales, and the upgrading of companies are the major challenges in attracting new customers. It is found that customer value and increasing revenue depend more on retaining current customers than on acquiring new ones.
Companies know their existing customers, maintain a strong relationship with them, and have a huge amount of data about them, which is the key to increasing profit and customer value. It is very important to find out whether a customer will churn in the near future or stay with the bank, since this affects the revenue streams of the bank.
1.3 MOTIVATION
Churn will continue to exist, and customer management is a better way to ensure sustainable business growth and long-term profitability than capturing new customers.
It is, therefore, in the best interest of any company to keep track of the behavior of its
customers in order to potentially anticipate any signs of dissatisfaction that could eventually
lead to churning. Such actions may be instrumental to reach out to those customers and
hopefully save the relationship with the bank.
CHAPTER 2
LITERATURE SURVEY
In the banking industry, the scope of the term is wide, and it is currently being applied within several different fields of the business. Credit card churn occurs when a customer ceases to use their credit card within a specific timeframe. Likewise, network banking churn may be defined as a customer who stops using their internet (home banking) service; Chiang, Wang, Lee, & Lin (2003) covered this topic by measuring the periodicity of users' transaction times.
Additionally, Glady et al. (2008) defined a churner as a customer with less than 2,500
Euros of assets at the bank (savings, securities, or other kinds of products), and therefore
paved the way for two distinct definitions of churn that exist in the organization that will be
studied in this paper: the notion of voluntary churn and involuntary churn. The current study
will be particularly geared towards tracking the behavioral history of past churners within a
specific time frame in order to pinpoint certain patterns that might indicate that a customer is
at risk of churning.
As Oyeniyi and Adeyemo (2015) pointed out, "churning is an important problem that has been studied across several areas of interest, such as mobile and telephony, insurance, and healthcare. Other sectors where the customer churn problem has been analyzed include online social network churn analysis and the retail banking industries." Although the broad or most generally accepted definition of churn refers to the loss of a customer by a specific company, one must analyze the concept with regard to the context in which it is being employed.
Eichinger, Nauck, and Klawonn (2006) defined customer attrition as a customer leaving for a competitor. This notion has been backed by Qiasi, Roozbehani, & Minaei-bidgoli (2002), who consider churn to occur when a customer discontinues the use of an organization's products and services in favor of a competitor's products and services. On the other hand, Neslin et al. (2006) described customer churn as the propensity of customers to cease doing business with a company in a given time period.
M.A.H. Farquad [4] proposed a hybrid approach to overcome the drawbacks of general
SVM model which generates a black box model (i.e., it does not reveal the knowledge gained
during training in human understandable form). The hybrid approach contains three phases: In
the first phase, SVM-RFE (SVM-recursive feature elimination) is employed to reduce the
feature set. In the second phase, the dataset with reduced features is then used to obtain the SVM model, and the support vectors are extracted. In the final phase, rules are generated using a Naive Bayes Tree (NBTree, which is a combination of a decision tree with a naive Bayesian classifier).
The dataset used here is a bank credit card customer dataset (Business Intelligence Cup 2004), which is highly unbalanced, with 93.24% loyal and 6.76% churned customers. The experiments showed that the model does not scale to large datasets.
Wouter Verbeke [6] proposed the application of the Ant-Miner+ and ALBA algorithms on a publicly available churn prediction dataset in order to build accurate as well as comprehensible classification rule-set churn prediction models. Ant-Miner+ is a high-performing data mining method based on the principles of Ant Colony Optimization which allows domain knowledge to be included by imposing monotonicity constraints on the final rule-set. The advantages of Ant-Miner+ are high accuracy, comprehensibility of the generated models, and the possibility to demand intuitive predictive models. The Active Learning Based Approach (ALBA) for SVM rule extraction is a rule extraction algorithm which combines the high predictive accuracy of a non-linear support vector machine model with the comprehensibility of the rule-set format.
The results, which were benchmarked against C4.5, RIPPER, SVM, and logistic regression, showed that ALBA, combined with RIPPER, results in the highest accuracy, while sensitivity is highest for C4.5 and RIPPER applied on an oversampled dataset. Ant-Miner+ results in less sensitive rule-sets, but allows domain knowledge to be included, and produces comprehensible rule-sets that are much smaller than the rule-sets induced with C4.5. RIPPER also results in small and comprehensible rule-sets, but can lead to unintuitive models that violate domain knowledge.
Ning Lu proposed the use of boosting algorithms to enhance a customer churn prediction model, in which customers are separated into two clusters based on the weight assigned by the boosting algorithm. As a result, a high-risk customer cluster is found. Logistic regression is used as the base learner, and a churn prediction model is built on each cluster, respectively. The experimental results showed that the boosting algorithm provides a good separation of churn data when compared with a single logistic regression model.
A random sampling method can be used to change the distribution of the data in order to reduce the imbalance of the dataset. Imbalance in the dataset is caused by the low proportion of churners.
Ssu-Han Chen used a novel mechanism based on the gamma Cumulative SUM (CUSUM) chart, in which the gamma CUSUM chart monitors each individual customer's Inter-Arrival Time (IAT). A finite mixture model is introduced to design the reference value and decision interval of the chart, and a hierarchical Bayesian model is used to capture the heterogeneity of customers. Recency, another time-interval variable complementary to IAT, is combined into the model and tracks the recent status of login behavior. In addition, benefiting from the basic nature of control charts, the graphical interface for each customer is an additional advantage of the proposed method. The results showed that the accuracy rate (ACC) of the gamma CUSUM chart is 5.2% higher than that of the exponential CUSUM, while the Average Time to Signal (ATS) is about two days longer than required for the exponential CUSUM.
Koen W. De Bock proposed two rotation-based ensemble classifiers, namely Rotation Forest and RotBoost, as modeling techniques for customer churn prediction. An ensemble classifier is a combination of several member classifier models into one aggregated model, including the fusion rule used to combine the member classifiers' outputs. In Rotation Forests, feature extraction is applied to feature subsets in order to rotate the input data for training the base classifiers, while RotBoost combines Rotation Forest with AdaBoost. Four datasets from real-life customer churn prediction projects are used here. The results showed that Rotation Forests outperform RotBoost in terms of area under the curve (AUC) and top-decile lift, while RotBoost demonstrates higher accuracy than Rotation Forests. They also compared three alternative feature extraction algorithms, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Sparse Random Projections (SRP), on the classification performance of both RotBoost and Rotation Forest. In general, the performance of a rotation-based ensemble classifier depends upon: (i) the performance criteria used to measure classification performance, and (ii) the implemented feature extraction algorithm.
Lee et al. focused on building an accurate and succinct predictive model for churn prediction by using a Partial Least Squares (PLS) based method on datasets with highly correlated variables. They not only present a prediction model to accurately predict customers' churning behaviour, but also employ a simple yet implementable churn marketing program. The proposed methodology allows marketing managers to maintain an optimal (or at least near-optimal) level of churners effectively and efficiently through the marketing programs. Here, PLS is employed as the prediction modelling method.
Y. Xie et al. [16] used an improved balanced random forests (IBRF) model, which is a combination of balanced random forests and weighted random forests, in order to overcome the data distribution problem. The nature of IBRF is that the best features are iteratively learned by altering the class distribution and by putting higher penalties on misclassification of the minority class.
2.1 FEASIBILITY STUDY
The preliminary investigation examines project feasibility: the likelihood that the application will be useful to the user. The main objective of the feasibility study is to test the technical, operational, and economic feasibility of adding new modules and debugging traditional desktop-centric applications, and of porting them to mobile devices. All systems are feasible if they are given unlimited resources and infinite time. The three major areas one should consider while determining the feasibility of the project are:
• Technical Feasibility
• Economic Feasibility
• Operational Feasibility
2.1.1 Technical Feasibility
Evaluating the technical feasibility is the trickiest part of a feasibility study. This is because, at this point in time, not many detailed designs of the system exist, making it difficult to assess issues like performance and costs (on account of the kind of technology to be deployed). A number of issues have to be considered while doing a technical analysis, for example finding out whether the organization currently possesses the required technologies: Is the required technology available within the organization? If so, is the capacity sufficient? For instance: "Will the current printer be able to handle the new reports and forms required for the new system?"
2.1.2 Economic Feasibility
Economic feasibility attempts to weigh the costs of developing and implementing a new system against the benefits that would accrue from having the new system in place. This feasibility study gives top management the economic justification for the new system.
A simple economic analysis which gives an actual comparison of costs and benefits is much more meaningful in this case. The benefits could include increased customer satisfaction, improvement in product quality, better decision-making, timeliness of information, expedited activities, improved accuracy of operations, better documentation and record keeping, faster retrieval of information, and better employee morale.
2.1.3 Operational Feasibility
Proposed projects are beneficial only if they can be turned into information systems that meet the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed, and whether there are major barriers to implementation. The following questions help test the operational feasibility of a project:
• Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people will not see reasons for change, there may be resistance; if it is not, users may welcome a change that brings about a more operational and useful system.
• Has the user been involved in the planning and development of the project? Early involvement reduces the chances of resistance to the system and in general increases the likelihood of a successful project.
Since the proposed system was to help reduce the hardships encountered in the existing manual system, the new system was considered operationally feasible.
2.2 EXISTING SYSTEM
Churn studies have been used for years to achieve profitability and to establish a sustainable customer–company relationship. A customer churn in the banking sector indicates a customer who has closed all their active accounts. A customer who has not used their bank for a few months or a year can also be considered churned. Organizations develop churn management systems as a part of their customer relationship management. If churn is higher than growth, the business is getting smaller, and the cost of acquiring new customers is high.
2.3 PROPOSED SYSTEM
In the proposed system, the models used in this project predict the customers likely to churn based on the identified characteristics. This enables the bank to take the necessary actions and decrease the churn rate by retaining such customers. Our study investigates whether the methods used in churn prediction have the ability to process huge amounts of customer data. Offering a comprehensive knowledge base can unblock stuck users so they reach their goals, and can hold on to customers for a long period of time. This project uses a regression model which works well with very large datasets; this model is very efficient to train, gives the direction of association, and provides discrete output.
CHAPTER 3
SYSTEM ANALYSIS AND DESIGN
3.1 REQUIREMENT SPECIFICATION
Types
There are two types of requirements specification. They are:
• Functional requirements specification
• Non-functional requirements specification
Non-Functional Requirement Specifications
3.2 SOFTWARE REQUIREMENTS
Operating System : Windows 10
Specification : Internet Browser
3.4 SYSTEM DESIGN
System design is the process or art of defining the architecture, components, modules,
interfaces, and data for a system to satisfy specified requirements. One could see it as the
application of systems theory to product development. There is some overlap and synergy
with the disciplines of systems analysis, systems architecture, and systems engineering.
3.4.1 System Architecture
As stated above, the collected customer data is preprocessed. Next, we extract the features from the cleaned data. At last, we classify each customer based upon the extracted features. In our proposed algorithm we have tried to solve the problems that we come across in the existing system.
This system identifies whether a customer is likely to churn or not. If the customer is likely to leave, it produces the result "Customer will leave the bank"; if not, it produces the result "Customer stays in the bank". Based on this information, the bank can retain at-risk customers by taking proper retention measures at an early stage.
3.4.2 Input Design
Input design plays a vital role in the life cycle of software development, and it requires very careful attention from developers. The goal of input design is to feed data to the application as accurately as possible, so inputs are supposed to be designed effectively so that errors occurring while feeding data are minimized. According to software engineering concepts, the input forms or screens are designed to provide validation control over the input limit, range, and other related validations.
This system has input screens in almost all the modules. Error messages are developed to alert the user whenever he commits a mistake and to guide him in the right way so that invalid entries are not made. Let us look at this more deeply under module design.
Input design is the process of converting user-created input into a computer-based format. The goal of the input design is to make data entry logical and free from errors. Errors in the input are controlled by the input design. The application has been developed in a user-friendly manner. The forms have been designed in such a way that during processing the cursor is placed in the position where data must be entered. The user is also provided with an option to select an appropriate input from various alternatives related to the field in certain cases.
Validations are required for each data entry. Whenever a user enters erroneous data, an error message is displayed, and the user can move on to the subsequent pages only after completing all the entries on the current page.
3.4.3 Output Design
The output from the computer is required mainly to create an efficient method of communication between the system and its users. The output of this system indicates whether a given customer is likely to stay with the bank or leave it, so that retention campaigns can be planned for the customers who are predicted to churn.
The application starts running when it is executed for the first time. The developed system is highly user-friendly and can be easily understood by anyone using it, even for the first time.
3.5 UML DIAGRAMS INTRODUCTION
The UML was developed in 1994–95 by Grady Booch, Ivar Jacobson, and James Rumbaugh at Rational Software. In 1997, it was adopted as a standard by the Object Management Group (OMG).
UML CONCEPTS
The Unified Modeling Language (UML) is a standard language for writing software blueprints. The UML is a language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. The UML is a language which provides a vocabulary and the rules for combining words in that vocabulary for the purpose of communication. A modeling language is a language whose vocabulary and rules focus on the conceptual and physical representation of a system. Modeling yields an understanding of a system.
UML DIAGRAMS
UML diagrams have been adopted by the Object Management Group (OMG) as the standard for modeling software development.
BEHAVIORAL DIAGRAMS
3.5.1 Use case diagram
To model a system, the most important aspect is to capture the dynamic behavior. Dynamic behavior means the behavior of the system when it is running/operating. Static behavior alone is not sufficient to model a system; dynamic behavior is more important than static behavior.
These internal and external agents are known as actors. Use case diagrams consist of actors, use cases, and their relationships. The diagram is used to model the system/subsystem of an application. A single use case diagram captures a particular functionality of a system.
The purpose of the use case diagram is to capture the dynamic aspect of a system. However, this definition is too generic to describe the purpose, as the other four diagrams (activity, sequence, collaboration, and statechart) also have the same purpose. We will look into some specific purposes which distinguish it from the other four diagrams. When the initial task is complete, use case diagrams are modeled to present the outside view.
Actors
An actor is a person, organization, or external system that plays a role in one or more
interactions with the system.
3.5.2 Sequence Diagram
Fig 3.5.2.1 Sequence diagram
3.5.3 State Chart Diagram
The name of the diagram itself clarifies its purpose and other details. It describes the different states of a component in a system. The states are specific to a component/object of a system. A statechart diagram describes a state machine. A state machine can be defined as a machine that defines the different states of an object, where these states are controlled by external or internal events.
Purpose of Statechart Diagrams
A statechart diagram is one of the five UML diagrams used to model the dynamic nature of a system. Statechart diagrams define the different states of an object during its lifetime, and these states are changed by events. Statechart diagrams are useful for modeling reactive systems, which can be defined as systems that respond to external or internal events. Statechart diagrams are also used for forward and reverse engineering of a system. However, the main purpose is to model reactive systems.
Following are the main purposes of using statechart diagrams:
• To model the dynamic aspect of a system.
• To describe the different states of an object during its lifetime.
• To model reactive systems that respond to external or internal events.
• To support forward and reverse engineering of a system.
Fig 3.5.3.1 Statechart diagram
3.5.4 Activity Diagram
The basic purposes of activity diagrams are similar to those of the other four diagrams: an activity diagram captures the dynamic behavior of the system. The other four diagrams are used to show the message flow from one object to another, but the activity diagram is used to show the flow from one activity to another.
The purpose of an activity diagram can be described as:
• Describe the sequence from one activity to another.
3.6 SOFTWARE ENVIRONMENT
Software factories were soon created to introduce discipline and repeatability, software visualization tools, the capture of customer needs or requirements, automated software testing, and software reuse. Computer-assisted software engineering (CASE) was also created to enhance software productivity and reliability by automating document production, diagram design, code compilation, software testing, configuration management, management reporting, and the sharing of data by multiple developers.
3.6.1 Python Programming
Features of Python
2. Expressive Language
Python can perform complex tasks using a few lines of code. As a simple example, for the hello world program you simply type print("Hello World"); it takes only one line, while Java or C takes multiple lines.
3. Interpreted Language
Python is an interpreted language; this means a Python program is executed one line at a time. The advantage of being an interpreted language is that it makes debugging easy and the code portable.
4. Cross-platform Language
Python can run equally well on different platforms such as Windows, Linux, UNIX, and Macintosh. So, we can say that Python is a portable language. It enables programmers to develop software for several competing platforms by writing a program only once.
6. Object-Oriented Language
Python supports object-oriented programming, in which the concepts of classes and objects come into existence. It supports inheritance, polymorphism, encapsulation, etc. The object-oriented approach helps the programmer write reusable code and develop applications in less code.
7. Extensible
This implies that other languages such as C/C++ can be used to compile parts of the code, which can then be used in our Python code. Python converts the program into byte code, and any platform can use that byte code.
Graphical User Interface Support
Graphical user interfaces are used for developing desktop applications. PyQt5, Tkinter, and Kivy are libraries used for developing GUI applications.
10. Integrated
Python can be easily integrated with languages like C, C++, and Java. Python executes code line by line, which makes it easy to debug the code.
11. Embeddable
Code from other programming languages can be used in Python source code, and our Python source code can be used in other programming languages as well; other languages can embed Python into their code.
12. Dynamic Memory Allocation
In Python, we don't need to specify the data type of a variable. When we assign some value to a variable, it automatically allocates memory to the variable at run time. Suppose we assign the integer value 10 to x; then we don't need to write int x = 10. We just write x = 10.
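As a quick illustration of this behavior, here is a minimal Python sketch (the variable name is ours, purely for demonstration):

x = 10          # no type declaration; Python allocates memory at run time
print(type(x))  # <class 'int'>
x = "ten"       # the same name can later be bound to a different type
print(type(x))  # <class 'str'>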
Applications of Python
Advantages of Python
1. Extensive Libraries
Python downloads with an extensive library and contains code for various purposes like
regular expressions, documentation-generation, unittesting, web browsers, threading,
databases, CGI, email, image manipulation, and more. So, we don’t have to write the
complete code for that manually.
2. Extensible
Python can be extended to other languages. You can write some of your code in languages
like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complementary to extensibility, Python is embeddable as well. You can put your Python code in the source code of a different language, like C++.
4. Improved Productivity
The language's simplicity and extensive libraries render programmers more productive than Java and C++ do.
5. IoT Opportunities
Since Python forms the basis of new platforms like Raspberry Pi, it finds the future bright for
the Internet of Things.
6. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms. While Python functions help us with code reusability, classes and objects let us model the real world. A class allows the encapsulation of data and functions into one unit.
3.6.2 Python IDLE
Exploring IDLE Software Features
• Remember that it is not advisable to write multiple lines of code that have functions/classes in the IDLE shell.
• In such cases, you can go to the File option of IDLE and click on New file.
• IDLE can be customized using the options present in the Format, Edit and Options
menu.
3.6.3 Visual Studio Code
Visual Studio Code is a code editor from Microsoft. It is used to develop console applications, web sites, web applications, and web services, and it runs on Windows, Linux, and macOS. It supports different programming languages by means of language services, which allow the code editor and debugger to support (to varying degrees) nearly any programming language, provided a language-specific service exists. In this project it is used to develop the application efficiently and to make input and output design easy.
Features:
Built with love for the Web
VS Code includes enriched built-in support for Node.js development with JavaScript and
TypeScript, powered by the same underlying technologies that drive Visual Studio. VS Code
also includes great tooling for web technologies such as JSX/React, HTML, CSS, SCSS,
Less, and JSON.
Visual Studio Code includes a public extensibility model that lets developers build and use
extensions, and richly customize their edit-build-debug experience.
Fig 3.6.3.1 Visual Studio Code environment
3.6.4 WEB TECHNOLOGIES
HTML
HTML is an acronym which stands for Hyper Text Markup Language, which is used for creating web pages and web applications. Let's see what is meant by hypertext, markup language, and web page.
Hyper Text: Hypertext simply means "text within text." Text that has a link within it is hypertext. Whenever you click on a link which brings you to a new webpage, you have clicked on hypertext. Hypertext is a way to link two or more web pages (HTML documents) with each other.
Markup language: A markup language is a computer language that is used to apply
layout and formatting conventions to a text document. Markup language makes text
more interactive and dynamic. It can turn text into images, tables, links, etc.
Web Page: A web page is a document which is commonly written in HTML and translated by a web browser. A web page can be identified by entering a URL. A web page can be of the static or dynamic type. With the help of HTML alone, we can create static web pages.
Hence, HTML is a markup language which is used for creating attractive web pages with the help of styling, so that they appear in a nice format in a web browser. An HTML document is made of many HTML tags, and each HTML tag contains different content.
Uses of HTML
One of the common uses of HTML is navigation between web pages, which is implemented with the anchor tag.
Responsive Design
Responsive design is an integral part of web development. HTML images have an attribute known as 'srcset', which references the images that the browser will parse and their respective sizes. Combined with media queries, this makes the selected images responsive.
HTML features such as localStorage and IndexedDB have transformed the approach to the storage of user data. HTML5 brought in these new features, and most browsers support them. Depending on user permission, these features can be very useful when collecting and storing data.
CSS
CSS stands for Cascading Style Sheets. It is a style sheet language which is used to describe
the look and formatting of a document written in markup language. It provides an additional
feature to HTML.
Uses of CSS
Earlier, style and formatting information had to be repeated on every web page. This was a very long process. For example, if you are developing a large website where font and color information is added on every single page, it becomes a long and expensive process. CSS was created to solve this problem.
3.7 MACHINE LEARNING
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these
algorithms to train them, and on the basis of training, they build the model & perform a
specific task.
Supervised Learning
In supervised learning, we train the machine with the input and corresponding output, and then we ask the machine to predict the output using the test dataset.
The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.
• Image Segmentation: Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
• Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis. This is done by using medical images and past labelled data with labels for disease conditions. Through such a process, the machine can identify a disease in new patients.
• Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
• Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
• Speech Recognition - Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using the
same, such as voice-activated passwords, voice commands, etc.
Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output without
any supervision.
In unsupervised learning, the models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright
in document network analysis of text data for scholarly articles.
• Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-
commerce websites.
• Anomaly Detection: Anomaly detection is a popular application of unsupervised learning,
which can identify unusual data points within the dataset. It is used to discover fraudulent
transactions.
• Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting information of each user
located at a particular location.
Semi-Supervised Learning
Reinforcement Learning
In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.
The reinforcement learning process is similar to the way a human being learns; for example, a child learns various things through experiences in his day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of an agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishment and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
In regression, we plot a graph between the variables which best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
Outliers: An outlier is an observation which contains either a very low value or a very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided.
Multicollinearity: If the independent variables are highly correlated with each other, this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most affecting variables.
Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. If our algorithm does not perform well even with the training dataset, the problem is called underfitting.
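To make the multicollinearity check above concrete, here is a small hedged sketch in Python; the column names are illustrative stand-ins for the churn features, not the project's actual code:

import pandas as pd

# Toy frame standing in for the numeric churn features
df = pd.DataFrame({'Age': [25, 40, 31, 55],
                   'Tenure': [1, 8, 3, 10],
                   'Balance': [0.0, 83000.0, 12000.0, 99000.0]})

# Pairwise correlations; independent variables with |r| close to 1
# indicate multicollinearity and are candidates for removal
print(df.corr())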
Types of Regression
There are various types of regressions which are used in data science and machine
learning. Here we are discussing some important types of regression which are given below:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression
3.7.1 Logistic Regression
When we provide the input values (data) to the function, it gives the S-curve as follows:
Fig 3.7.1.1 Logistic regression
It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
• Binary (0/1, pass/fail)
• Multi (cats, dogs, lions)
• Ordinal (low, medium, high)
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to probabilities. It maps any real value into another value within the range of 0 and 1.
The value of the logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form; the S-form curve is called the sigmoid function or the logistic function. In logistic regression we use the concept of the threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
• In logistic regression, y can be between 0 and 1 only, so let's divide the above equation by (1 − y):
y/(1 − y); 0 for y = 0, and infinity for y = 1
• But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it becomes:
log[y/(1 − y)] = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
The above equation is the final equation for logistic regression.
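To see the sigmoid mapping numerically, here is a minimal Python sketch (our own illustration, not part of the project code):

import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))   # 0.5   -> exactly at a 0.5 threshold
print(sigmoid(4))   # ~0.982 -> rounded up to class 1
print(sigmoid(-4))  # ~0.018 -> rounded down to class 0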
Based on the problem statement, we need a predictive model that can do binary classification, i.e., predict a Yes/No or 1/0 type of output variable.
One predictive model commonly implemented for binary classification and prediction of a binary outcome is logistic regression.
Logistic regression is a binary classification algorithm belonging to the family of generalized linear models. It can also be used to solve problems with more than two classes.
It is possible to use logistic regression to create a model using the customer churn data and
use it to predict if a particular customer of a set of customers will discontinue the service.
For example, one of the variables in the data can be the "annual income". Another variable is the "gender" of the customer. The outcome of the logistic regression function will tell us how income and/or gender determine the probability of service discontinuation by the customer.
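As a hedged sketch of this idea with scikit-learn (the numbers and the feature encoding below are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: [annual income, gender (1 = male, 0 = female)]
X = np.array([[35000, 0], [90000, 1], [42000, 0], [120000, 1]])
y = np.array([1, 0, 1, 0])  # 1 = discontinued the service, 0 = stayed

clf = LogisticRegression().fit(X, y)
# Probability that a new customer with income 50,000 discontinues
print(clf.predict_proba([[50000, 0]])[0, 1])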
CHAPTER 4
METHODOLOGY
4.2 Dataset
In this study, a bank dataset is considered where a huge number of customers are leaving the bank. Almost 10,000 records of the bank, collected from the Kaggle repository, are going to help the model investigate and predict which of the customers are about to leave the bank soon. To test and evaluate the features, the total dataset is sliced into two subsets: a training and a testing dataset. The training dataset is used to fit the statistical model, and the testing dataset is used to predict the result and calculate the accuracy metrics that determine the model accuracy. For validation of the model, the accuracy of the classifier is calculated on the basis of the confusion matrix.
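A minimal sketch of this split, assuming the public Kaggle bank-churn file and its usual column names ("Churn_Modelling.csv", target column "Exited"); these names are assumptions, not taken from the report:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('Churn_Modelling.csv')   # ~10,000 customer records (assumed file name)
X = df.drop(columns=['Exited'])           # features
y = df['Exited']                          # 1 = churned, 0 = stayed

# 70% of the observations for training, 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)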
4.4 Pre-Processing
In this phase, different preprocessing techniques like handling missing values, data cleaning, and feature extraction have been performed. To identify the missing values in the dataset, an imputation technique is used to impute blank and null values. Noisy data and irrelevant attributes are removed; attributes that are not so important are removed before model building. Finally, for determining the performance of predictive models, feature extraction plays an important role in correct prediction. Some important features with descriptions that can be useful for model construction are shown below.
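A hedged sketch of these preprocessing steps, continuing with the DataFrame df loaded in the previous sketch (before the train/test split); the identifier and categorical column names follow the public bank-churn dataset and are assumptions:

import pandas as pd

# Drop identifier columns that carry no predictive signal (assumed names)
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

# Impute blank/null values: mode for categorical, median for numeric
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

# Encode the categorical attributes as numbers
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df = pd.get_dummies(df, columns=['Geography'], drop_first=True)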
4.5 Attributes
Attributes considered from the dataset:
5. Tenure — Refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
10. EstimatedSalary — People with low salaries are more likely to leave the bank compared to those with higher salaries.
In Fig 4.5.2 below, box plots of some important attributes are given. When it comes to the distribution of all data points around the mean, box plots are used to identify the median and the respective quartiles in a well-structured manner.
Fig 4.5.2 Box plots of important attributes
Figure: Training data score vs. testing data score (both approximately between 0.70 and 0.725).
4.6 SAMPLE CODE
Introduction
Coding is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a set of instructions that computers use to perform specific operations or to exhibit desired behaviors. The process of writing source code often requires expertise in many different subjects, including knowledge of the application domain, specialized algorithms, and formal logic.
Coding
Front-End Template (Main.html):
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Customer Churn Prediction</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/css/bootstrap.min.css"
integrity="sha384-
9aIt2nRpC12Uk9gS9baDl411NQApFmC26EwAOH8WgZl5MYYxFfc+NcPb1dKGj7Sk"
crossorigin="anonymous">
<link href="https://fonts.googleapis.com/css?family=Ubuntu" rel="stylesheet">
<link rel="stylesheet" href="static/css/style.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.11.5/dist/umd/popper.min.js"
integrity="sha384-
Xe+8cL9oJa6tN/veChSP7q+mnSPaj5Bcu9mPX5F5xIGE0DVittaqT5lorf0EI7Vk"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/js/bootstrap.min.js"
integrity="sha384kjU+l4N0Yf4ZOJErLsIcvOU2qSb74wXpOhqTvwVx3OElZRweTnQ6d31fX
EoRD1Jy"
crossorigin="anonymous"></script>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/css/bootstrap.min.css"
rel="stylesheet"
integrity="sha384-
0evHe/X+R7YkIZDRvuzKMRqM+OrBnVFBL6DOitfPri4tjfHxaWutUpFmBp4vmVor"
crossorigin="anonymous">
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.2.0-beta1/dist/js/bootstrap.bundle.min.js"
integrity="sha384-
pprn3073KE6tl6bjs2QrFaJGz5/SUsLqktiwsUTF55Jfv3qYSDhgCecCxMW52nD2"
crossorigin="anonymous"></script>
</head>
<body>
<div class="container">
<br><br>
{% if mes%}
<div class="alert alert-{{col}} alert-dismissible fade show" role="alert">
<strong>{{mes}}</strong>.
<button type="button" class="btn-close" data-bs-dismiss="alert" aria-label="Close"></button>
</div>
{%endif%}
</div>
<div class="wrapper">
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<a class="navbar-brand" href="index.html">
</a>
</div>
</div>
</nav>
<div class="container">
<div class="row">
<div class="col-sm-3"></div>
<div class="col-sm-6"><b>Enter User Information in the Form Below</b></div>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-sm-3"></div>
<div class="col-sm-3"><br>
<form action="{{ url_for('predict')}}" method="post">
<option disabled selected value> -- select -- </option>
<option value="1">Yes</option>
<option value="0">No</option>
</select><br>
<label for="isactivemember">Is an Active Bank Member:</label><br>
<select name="isactivemember" required" id="isActiveMember">
<option disabled selected value> -- select --</option>
<option value="1">Yes</option>
<option value="0">No</option>
</select>
<br>
<label for="estimatedsalary">Estimated Salary:</label><br>
<input type="number" min="0" step="any" required id="EstimatedSalary"
name="estimatedsalary"><br>
<br>
<br>
<button type="submit" class="btnbtn-success">Submit</button></form>
</div>
</div>
</div>
<br><br>
<br>
<!--<div class="container">
<div class="row">
<div class="col-sm-3"></div>
<div class="col-sm-6">
<div class="table-responsive">
<table class="table" border="2">
<thead>
<tr>
<th>Model Name</th>
<th>Prediction</th>
</tr>
</thead>
</table>
</div>
</div>
</div>
</div> -->
</body>
</html>
Back-End Application (app.py):

import numpy as np
import joblib
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the trained model; other candidate models are kept here for reference
# dt_model = joblib.load('models/nate_decision_tree.sav')
# knn_model = joblib.load('models/nate_knn.sav')
# rf_model = joblib.load('models/nate_random_forest.sav')
# svm_model = joblib.load('models/SVM_model.sav')
# xgb_model = joblib.load('models/XGBoost_model.sav')
lr_model = joblib.load('nate_logistic_regression.sav')

def decode(pred):
    # Convert 1 or 0 to Yes or No
    return 'Yes' if pred == 1 else 'No'

@app.route('/')
def home():
    # Initial rendering of the input form
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    cols = ['CreditScore',
            'Geography',
            'Gender',
            'Age',
            'Tenure',
            'Balance',
            'NumOfProducts',
            'HasCrCard',
            'IsActiveMember',
            'EstimatedSalary']
    # Read the submitted form values in the same order as the model features
    # (the form is assumed to submit numeric encodings for all ten features)
    values = [float(request.form[c.lower()]) for c in cols]
    print(values)
    new_array = np.array(values).reshape(1, -1)
    # Map feature names to the submitted values (useful for debugging)
    custd = {k: v for k, v in zip(cols, values)}
    print(custd)
    # In this application, a prediction of 1 means the customer stays
    f = lr_model.predict(new_array)[0]
    if f == 1:
        value = "Customer stays in the bank..!"
        col = "success"
    else:
        value = "Customer will leave the bank..!"
        col = "danger"
    return render_template('index.html', mes=value, col=col)

if __name__ == "__main__":
    app.run(debug=True)
CHAPTER 5
EXPERIMENTAL RESULTS
The dataset contains details of a bank's customers, and the target variable is a binary variable reflecting whether the customer left the bank (closed his account) or continues to be a customer. It consists of 10,000 records with demographic and bank history information from customers in three countries: France, Germany, and Spain. We continue by splitting the data into separate training and test sets: 30% of the observations will be set aside for the test set, and the rest, 70%, will be used as the training set.
Evaluating the performance of the model using different metrics is integral to every data
science project. Here is what you have to keep an eye on:
Accuracy
Accuracy is a metric for how many of the predictions the model makes are true. The higher the accuracy, the better. However, it is not the only important metric when you estimate performance.
Loss
Loss describes the percentage of bad predictions. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater.
Precision
The precision metric marks how often the model is correct when identifying positive results. For example, how often the model diagnoses cancer in patients who really have cancer.
Recall
This metric measures the number of correct predictions, divided by the number of results that
should have been predicted correctly. It refers to the percentage of total relevant results
correctly classified by your algorithm.
Confusion matrix
A confusion matrix is an N×N square table, where N is the number of classes that the model needs to classify. Usually, this method is applied to classification tasks where each column represents a label. For example, if you need to categorize fruits into three categories (oranges, apples, and bananas), you draw a 3×3 table. One axis will be the actual label, and the other will be the predicted one.
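A hedged sketch of computing these metrics for the churn model, assuming X_train/X_test hold the preprocessed numeric features and y_train/y_test the churn labels from the earlier sketches:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))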
Fig 5.1(b) Accuracy factor (support)
5.2 OUTPUT SCREENSHOTS
Fig 5.2.3 Displays the output "Customer stays in the bank"
Fig 5.2.4 Displays the output "Customer will leave the bank"
CHAPTER 6
CONCLUSION
In this project, we proposed an algorithm that can predict customer churn and compared its outcomes with other commonly used techniques. The algorithm deals with large datasets. Based on the given customer details, it predicts whether the customer will stay in the bank or leave it, and it gives an accuracy of 71.00%. The proposed system would be helpful in predicting whether a customer will stay in or leave the bank, and it also helps in the growth of a company.
FUTURE ENHANCEMENTS
Further, we intend to add some features to this bank customer churn prediction project. We will provide a more user-friendly interface, and in the future we are going to add signup and login pages so that only authenticated users can access the application and see the result of the churn prediction. We also plan to improve user interaction by allowing users to upload their own dataset directly in the application and see the prediction for each customer there. We are also planning to extend the application so that it can produce churn predictions not only for bank customers but for all kinds of business firms. This application will be a great help for businesses to improve their operations and to retain their customers through their own promotional strategies.
CHAPTER 7
REFERENCES
Here are some of the references for the project, from various sites, journals, papers, etc.
12. Shui Hua Han, Shui Xiu Lu, Stephen C.H. Leung, "Segmentation of telecom customers based on customer value by decision tree model", Expert Systems with Applications 39 (2012) 3964–3973.
13. https://learnpython.com/blog/python-customer-churn-prediction/
14. https://neptune.ai/blog/how-to-implement-customer-churn-prediction
15. Koen W. De Bock, Dirk Van den Poel, “An empirical evaluation of rotation-based
ensemble classifiers for customer churn prediction”, Expert Systems with Applications 38
(2011) 12293–12301.
16. H. Lee, Y. Lee, H. Cho, K. Im, Y.S. Kim, “Mining churning behaviors and developing
retention strategies based on a partial least squares (PLS) model”, Decision Support
System 52 (2011) 207–216.
17. Yaya Xie, Xiu Li, E.W.T. Ngai, Weiyun Ying, “Customer churn prediction using
improved balanced random forests”, Expert Systems with Applications 36 (2009) 5445–
5449.