
Logistic Regression

Project with Python


Meryem HARIM
EGR 3303
SPRING 2023
Abstract: This project aims to manipulate a large fake dataset containing advertising information
using Python. To that end, a prediction model is created to analyze one main feature, Clicked on
Ad, and the ways it relates to other variables. The purpose is to visualize the relationships between
those variables, knowing that they are not linear, which adds a layer of complexity. Consequently,
we will create a logistic regression model to predict whether users would click on an ad based on
the historical data provided. Ultimately, we will be able to assess the precision and accuracy of our
model through different metrics.

Keywords: Logistic Regression, prediction, prediction model, data visualization.

Table of Contents:
I. Introduction
II. Importing Libraries
III. Getting the Data
IV. Exploratory Data Analysis
V. Logistic Regression
VI. Predictions and Evaluation
VII. Conclusion

I. Introduction:
1. Description of Logistic Regression Models:
Logistic regression models are, by definition, a “statistical analysis method to predict a binary
outcome”1. Their purpose is to predict a dependent variable by analyzing its relationships with other,
independent variables. This binary outcome (yes or no, 0 or 1, ...) is computed by the logistic
regression model through a set of coefficients applied to the independent variables. Such a prediction
serves both to describe a set of data and to provide future predictions.

Logistic regression models are utilized in a wide range of fields such as marketing and healthcare.
This machine learning technique has proven easy to interpret and adept at capturing non-linear
relationships between various features.

2. Use Case Examples:


Below are three case examples of fields that utilize logistic regression models. The outcome of each
of those case examples is “yes” or “no”.
a. Marketing: Will customer X purchase a specific product given their gender, age, and
purchase history?
b. Healthcare: Will patient X develop lung cancer given their age, medical history, weight, and
diet?
c. Education: Will student X pass the exam with a grade equal to or greater than 70% given the
hours they spent studying, their current performance in the course, and the difficulty of the exam?

II. Importing Libraries:


To build a logistic regression model, we would need the following libraries:
a. “pandas”: manipulate and pre-process data.
b. “numpy”: numerical computations including array operations.
c. “matplotlib”: data visualization and graphs.

d. “seaborn”: advanced data visualization.
We would also need the logistic regression model itself, which scikit-learn provides, in order to carry
out this project. All of these libraries and the model can be imported using the following code:
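A minimal sketch of these imports (the scikit-learn imports are used in the later modeling and
evaluation sections):

    import pandas as pd                                    # data manipulation and pre-processing
    import numpy as np                                     # numerical computations
    import matplotlib.pyplot as plt                        # data visualization and graphs
    import seaborn as sns                                  # advanced data visualization
    from sklearn.model_selection import train_test_split  # splitting data into train/test sets
    from sklearn.linear_model import LogisticRegression   # the logistic regression model
    from sklearn.metrics import classification_report     # model evaluation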

III. Getting the Data:


The first step in developing our model is to import the data file into Jupyter. The file is named
“advertising”, and we must make sure to include its extension as well, which is .csv (Comma-
Separated Values). We load it into a data frame called ad_data; this variable will refer to our dataset
for the remainder of this project.
Once we import the data, we check that we have imported the correct file by inspecting the head of
the dataset. The head method displays the first five rows of the dataset, which is particularly useful
for large datasets such as this one.
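A minimal sketch of these operations, assuming the file sits in the notebook's working directory:

    ad_data = pd.read_csv('advertising.csv')  # load the CSV into a data frame
    ad_data.head()                            # display the first five rows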
These operations yield the following results:

IV. Exploratory Data Analysis:


1. Histogram of Age:

We are interested in visualizing the age distribution of our data. To do this, we can use the plot
method of pandas. First, we run a quick check that all columns are, in fact, included in the dataset,
which also helps avoid typos at later stages. Then, we use our data frame ad_data to set up the two
axes: the x-axis representing age and the y-axis representing the frequency at which each age appears
in our dataset. We add a colour specification for the aesthetic of the bars, as well as axis labels and a
title for the histogram.
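A sketch of this step; the exact column name ('Age'), the bin count, and the colour are assumptions:

    ad_data.columns                                    # verify the column names first
    ad_data['Age'].plot.hist(bins=30, color='purple')  # histogram via the pandas plot method
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.title('Age Distribution of Users')
    plt.show()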

Consequently, we get the final histogram as follows:

2. Joint Plot of Area Income vs Age:


We want to visualize the relationship between our two variables: Area Income and Age. To do this,
we can create a joint plot using the jointplot function of the seaborn library.
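A sketch of this call, assuming the columns are named 'Age' and 'Area Income':

    sns.jointplot(x='Age', y='Area Income', data=ad_data, kind='reg', color='purple')
    plt.show()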

This code generates a scatter plot showing the correlation between our two variables, which we pass
as x and y for simplicity, whereas data is the parameter assigned to our entire data frame.

By default, the joint plot also shows the distribution of each variable separately along its margins;
setting kind='reg' additionally overlays a linear regression fit. Again, we have switched the default
colour to purple.

3. Joint Plot Showing the KDE Distributions of Daily Time Spent on Site vs Age:
Just as in the previous step, we use the jointplot function once again, this time to plot Daily Time
Spent on Site vs Age with KDE distributions, which are obtained by passing kind='kde'. Again, we
specify our x and y variables and assign our data frame to the data parameter.
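A sketch, again assuming the column names:

    sns.jointplot(x='Age', y='Daily Time Spent on Site', data=ad_data, kind='kde', color='purple')
    plt.show()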

The resulting graph is not only a visualization of the relationship between our parameters, but also
shows the marginal distribution of each variable along its respective axis. The code is similar to that
of the previous step. However, because we set kind to 'kde', there is an evident difference between
the previous graph and the one given below.

4. Joint Plot of Daily Time Spent on Site vs Daily Internet Usage:
As in steps two and three, we can use the jointplot function to get a graph showing the relationship
between Daily Time Spent on Site and Daily Internet Usage. The same method is followed: we
specify the two variables x and y, feed the data frame to the data parameter, and choose the theme
colour.
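A sketch, with the same column-name assumption:

    sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=ad_data, color='purple')
    plt.show()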

This code yields the following visualization:

5. Pair Plot with the Hue Defined by the Clicked on Ad Column Feature:
This task can be achieved by means of the pairplot function from the seaborn library, which creates
a pair plot of the dataset. The result of this code is a set of scatter plots, each showing the relationship
between two variables in our dataset.
We then set the hue parameter to the Clicked on Ad column to differentiate the points by the values
in this column.
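A sketch of this call (seaborn automatically restricts the grid to the numeric columns):

    sns.pairplot(ad_data, hue='Clicked on Ad')
    plt.show()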

The generated plot consists of a grid in which each row and column correspond to a variable of our
data frame ad_data, with a colour code that displays the relationship, or lack thereof, between each
combination of two variables in the dataset, through scatter plots and kernel density plots, as follows:

Each point in the scatter plots represents a single data point in the dataset. This plot therefore allows
us to identify patterns and relationships between our variables with respect to the “Clicked on Ad”
column.

6. Interpretation of Resulting Plots:

Several remarks can be drawn from the plots resulting from our data exploration. First, the histogram
of age is a clear representation of the age ranges of consumers. It lets us see which age segments are
the most engaged on the website before looking at the rest of the data. In fact, this histogram shows
that the single most common user age is 31, and that the most recurrent ages fall in the range of 26
to 36 years old, with a less significant number of consumers in the younger and older age ranges.
Notably, a downward trend starts at 53 years old and continues up to the oldest users, who are 61
years old but appear with a very small frequency.

The joint plot of Area Income vs Age is our first graph to explore the relationship between two
variables. With the previous conclusion from the age histogram, we can analyse this joint plot with a
clear idea about the age ranges we are dealing with. That is, we already expect outliers for young and
old consumers, namely at the two extremes of our x-axis. However, this joint plot offers a new piece
of information: in addition to belonging to the most common age range, the majority of users come
from an area income falling between 55,000 and 75,000, which is the region with the highest density
of scatter points.

The joint plot of Daily Time Spent on Site vs Age shows the relationship between the amount of time
a consumer spends on the website daily and their age. This KDE seaborn joint plot shows a higher
density at the upper left corner of the graph, meaning that the people spending the most time on the
site are between 25 and 36 years old, with daily time falling in the range of 70 to 91 minutes. This is
also intuitive since most of our users belong to that same age range. Nevertheless, we can still read
off data about other ages and the time they spend on the site. For instance, most 40-year-old visitors
spend approximately between 70 and 89 minutes on the website, whereas only a few spend less than
30 minutes.

The joint plot of Daily Time Spent on Site vs Daily Internet Usage compares the amount of time a
consumer spends on the website daily with their overall daily internet usage. We can see that there is
a positive correlation between these two features, meaning that as the amount of time a consumer
spends on the internet daily increases, the amount of time they spend on the website also tends to
increase. For example, users with over 200 minutes of daily internet usage are active on the website
for over 65 minutes, which is at least 32.5% of their online screen time. Hence, we can look at both
features and note what share of a user's total time online the daily time spent on site makes up. From
the given graph, we can see that a large part of their scrolling is, in fact, done on the site.
The pair plot with hue defined by Clicked on Ad exhibits the relationship between each pair of
features in the dataset, with the points coloured based on whether or not the consumer clicked on the
advertisement. This graph offers a global view containing valuable information: we get a clear
perspective on the relationship between each pair of variables, and on how those relationships relate
to ad clicks. While this could be done through a thorough analysis of each of the 25 panels displayed,
we can also look through a macroscopic lens to identify some patterns. For instance, we can see that
older people with little time spent online tend to click on the ad more than older people with a higher
time on site. Similarly, users with high daily internet usage who live in high-income areas are less
likely to click on the ad, as opposed to people with less internet usage who live in lower-income
areas. Related comparisons and analyses can be made for each of the 25 panels, which will then
enable us to draw final conclusions.

These graphs are extremely useful, as they allow us to determine relationships among variables that
are not linearly related. This is precisely the power of logistic models, which we will explore in more
depth in the upcoming sections.

V. Logistic Regression:
1. Splitting the Data into Training Set and Testing Set:
To split the data into a training set and a testing set, we use the train_test_split function from the
sklearn.model_selection module. We first separate the data into an X array containing the features to
train on and a y array containing the target variable (Clicked on Ad), then pass both to this function.
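A sketch of this step; the exact feature columns and the 30% test share are assumptions, not fixed by
the text:

    X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income',
                 'Daily Internet Usage']]             # assumed numeric feature set
    y = ad_data['Clicked on Ad']                      # target variable
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)         # assumed 70/30 split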

2. Training and Fitting a Logistic Regression Model on the Training Set:


To do this, we import the LogisticRegression class from the sklearn.linear_model module, then create
an instance of the logistic regression model, which we store in a variable named logmodel. We then
train the model by calling the fit method on this variable, passing it the X_train features and the
y_train targets. At this point, the optimal coefficients for the logistic regression are found from the
training data.
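A sketch of these two steps:

    logmodel = LogisticRegression()   # create an instance of the model
    logmodel.fit(X_train, y_train)    # fit the coefficients on the training data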

The fitted model can then be used to make predictions on the testing set and to evaluate its performance.

VI. Predictions and Evaluation:
1. Predicting Values for the Testing Data:

After fitting the logistic regression model on the training set, we can use it to make predictions on the
testing set using the predict method. This produces an array of predicted values, one for each sample
in the testing set.
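A sketch of this step:

    predictions = logmodel.predict(X_test)   # one predicted label (0 or 1) per test sample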

2. Creating a Classification Report for the Model:


This can be done in Python through the classification_report function, which generates a report of
the performance of our logistic regression model on the testing set.
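A sketch of the evaluation call:

    print(classification_report(y_test, predictions))   # precision, recall, f1-score, support per class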

This assessment returns several metrics: precision, recall, f1-score, and support. These metrics are
particularly useful for evaluating our logistic model.

3. Explanation of Evaluations Metrics Used:


The classification report provides metrics for each class in the target variable, and each of these
metrics gives an idea of the quality of performance for our two classes (clicked ad or did not click
ad), such that:

➢ Precision: measures the proportion of true positive predictions out of all the positive
predictions made by the model. In this case, we have an overall precision of 90%, which
indicates that the model's positive predictions are largely correct.
➢ Recall: measures the proportion of actual positive instances in the testing set that the model
correctly identifies. Here, the model captures 96% of the actual instances of the first class
(did not click ad) and 84% of the second class (clicked ad).
➢ F1-score: the harmonic mean of precision and recall, providing a balanced measure of the
model's performance. From our precision and recall scores, the F1-score works out to 90%
for the first class and 89% for the second class.
➢ Support: the number of instances in the testing set that belong to each class: 146 for the
no-click class and 154 for the click class.

VII. Conclusion:

Throughout this project, we used different features imported from diverse Python libraries within
Jupyter's interface, with the aim of studying a dataset and generating a predictive algorithm for one
specific feature. These operations helped us visualize relationships between the variables of our
dataset and draw conclusions that allow us to make predictions.

The main limitation the predictive model might face relates to the accuracy of the given data, on
which it heavily relies. Otherwise, the evaluation metrics showed high accuracy and good precision
for our model.

In a real-life setting, the data visualization tools we used and the plots we generated could serve to
better understand the site's users and optimize the site accordingly, so as to reach customer segments
we have not yet attracted. Likewise, the predictive model provides insights into which users are likely
to click on our ad, and which are not.

Reference:
1 Lawton, G., Burns, E., & Rosencrance, L. (2022, January 20). What is logistic regression? Definition
from SearchBusinessAnalytics. TechTarget. Retrieved April 10, 2023, from
https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression#:~:text=Logistic%20regression%20is%20a%20statistical,or%20more%20existing%20independent%20variables.