
Logistic Regression

Project with Python


Meryem HARIM
EGR 3303
SPRING 2023
Abstract: This project aims to manipulate a large fake dataset containing advertising information
using Python. To that end, a prediction model is created to analyze one main feature, Clicked on
Ad, and the ways it relates to other variables. The purpose is to visualize the relationships between
those variables, knowing that they are not linear, which adds a layer of complexity. Consequently,
we will create a logistic regression model to predict whether users would click on an ad based on
the historical data provided. Ultimately, we will be able to assess the precision and accuracy of our
model through different metrics.

Keywords: Logistic Regression, prediction, prediction model, data visualization.

Table of Contents:
I. Introduction
II. Importing Libraries
III. Getting the Data
IV. Exploratory Data Analysis
V. Logistic Regression
VI. Predictions and Evaluation
VII. Conclusion

I. Introduction:
1. Description of Logistic Regression Models:
Logistic regression models are, by definition, a “statistical analysis method to predict a binary
outcome”1. Their purpose is to predict a dependent variable by analyzing its relationships with other,
independent variables. This binary outcome (yes or no, 0 or 1, ...) is computed by the logistic
regression model through a set of coefficients applied to the independent variables. Such a prediction
serves both to describe a set of data and to provide future predictions.

Logistic regression models are utilized in a wide range of fields such as marketing and healthcare.
This machine learning technique has proven easy to interpret and adept at capturing non-linear
relationships between various features.

2. Use Case Examples:


Below are three case examples of fields that utilize logistic regression models. The outcome of each
of those case examples is “yes” or “no”.
a. Marketing: Will customer X purchase a specific product given their gender, age, and
purchase history?
b. Healthcare: Will patient X develop lung cancer given their age, medical history, weight, and
diet?
c. Education: Will student X pass the exam with a grade equal to or greater than 70% given the
hours they spent studying, their current performance in the course, and the difficulty of the exam?

II. Importing Libraries:


To build a logistic regression model, we would need the following libraries:
a. “pandas”: manipulate and pre-process data.
b. “numpy”: numerical computations including array operations.
c. “matplotlib”: data visualization and graphs.

d. “seaborn”: advanced data visualization.
We would also need the logistic regression model itself, which scikit-learn provides, in order to carry
out this project. All of these libraries and the model can be imported using the following code:
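A minimal sketch of these imports (the scikit-learn imports are used in the later modeling and
evaluation sections):

    import pandas as pd                                    # data manipulation and pre-processing
    import numpy as np                                     # numerical computations
    import matplotlib.pyplot as plt                        # data visualization and graphs
    import seaborn as sns                                  # advanced data visualization
    from sklearn.model_selection import train_test_split  # splitting data into train/test sets
    from sklearn.linear_model import LogisticRegression   # the logistic regression model
    from sklearn.metrics import classification_report     # model evaluation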

III. Getting the Data:


The first step in developing our model is to import the data file into Jupyter. The file is named
“advertising”, and we must make sure to include its extension as well, which is .csv (Comma-
Separated Values). We load it into a data frame called ad_data; this variable will refer to our dataset
for the remainder of this project.
Once we import the data, we check that we have imported the correct file by inspecting the head of
the dataset. The head method displays the first five rows of the dataset, which is particularly useful
for large datasets such as this one.
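A minimal sketch of these operations, assuming the file sits in the notebook's working directory:

    ad_data = pd.read_csv('advertising.csv')  # load the CSV into a data frame
    ad_data.head()                            # display the first five rows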
These operations yield the following results:

IV. Exploratory Data Analysis:


1. Histogram of Age:

We are interested in visualizing the age distribution of our data. To do this, we can use the plot
method of pandas. First, we run a quick check that all columns are, in fact, included in the dataset,
which also helps avoid typos at later stages. Then, we use our data frame ad_data to set up the two
axes: the x-axis representing age and the y-axis representing the frequency at which each age appears
in our dataset. We add a colour specification for the aesthetic of the bars, as well as axis labels and a
title for the histogram.
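A sketch of this step; the exact column name ('Age'), the bin count, and the colour are assumptions:

    ad_data.columns                                    # verify the column names first
    ad_data['Age'].plot.hist(bins=30, color='purple')  # histogram via the pandas plot method
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.title('Age Distribution of Users')
    plt.show()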

Consequently, we get the final histogram as follows:

2. Joint Plot of Area Income vs Age:


We want to visualize the relationship between our two variables: Area Income and Age. To do this,
we can create a joint plot using the jointplot function of the seaborn library.
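A sketch of this call, assuming the columns are named 'Age' and 'Area Income':

    sns.jointplot(x='Age', y='Area Income', data=ad_data, kind='reg', color='purple')
    plt.show()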

This code generates a scatter plot showing the correlation between our two variables, which we pass
as x and y for simplicity, whereas data is the parameter assigned to our entire data frame.

By default, the joint plot also shows the distribution of each variable separately along its margins;
setting kind='reg' additionally overlays a linear regression fit. Again, we have switched the default
colour to purple.

3. Joint Plot Showing the KDE Distributions of Daily Time Spent on Site vs Age:
Just as in the previous step, we use the jointplot function once again, this time to plot Daily Time
Spent on Site vs Age with KDE distributions, which are obtained by passing kind='kde'. Again, we
specify our x and y variables and assign our data frame to the data parameter.
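A sketch, again assuming the column names:

    sns.jointplot(x='Age', y='Daily Time Spent on Site', data=ad_data, kind='kde', color='purple')
    plt.show()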

The resulting graph is not only a visualization of the relationship between our parameters, but also
shows the marginal distribution of each variable along its respective axis. The code is similar to that
of the previous step. However, because we set kind to 'kde', there is an evident difference between
the previous graph and the one given below.

4. Joint Plot of Daily Time Spent on Site vs Daily Internet Usage:
As in steps two and three, we can use the jointplot function to get a graph showing the relationship
between Daily Time Spent on Site and Daily Internet Usage. The same method is followed: we
specify the two variables x and y, feed the data frame to the data parameter, and choose the theme
colour.
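A sketch, with the same column-name assumption:

    sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=ad_data, color='purple')
    plt.show()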

This code yields the following visualization:

5. Pair Plot with the Hue Defined by the Clicked on Ad Column Feature:
This task can be achieved by means of the pairplot function from the seaborn library, which creates
a pair plot of the dataset. The result of this code is a set of scatter plots, each showing the relationship
between two variables in our dataset.
We then set the hue parameter to the Clicked on Ad column to differentiate the points by the values
in this column.
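A sketch of this call (seaborn automatically restricts the grid to the numeric columns):

    sns.pairplot(ad_data, hue='Clicked on Ad')
    plt.show()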

The generated plot consists of a grid in which each row and column correspond to a variable of our
data frame ad_data, with a colour code that displays the relationship, or lack thereof, between each
combination of two variables in the dataset, through scatter plots and kernel density plots, as follows:

Each point in the scatter plots represents a single data point in the dataset. This plot therefore allows
us to identify patterns and relationships between our variables with respect to the “Clicked on Ad”
column.

6. Interpretation of Resulting Plots:

Several remarks can be drawn from the plots resulting from our data exploration. First, the histogram
of age is a clear representation of the age ranges of consumers. It lets us see which age segments are
the most engaged on the website before looking at the rest of the data. In fact, this histogram shows
that the single most common user age is 31, and that the most recurrent ages fall in the range of 26
to 36 years old, with a less significant number of consumers in the younger and older age ranges.
Notably, a downward trend starts at 53 years old and continues up to the oldest users, who are 61
years old but appear with a very small frequency.

The joint plot of Area Income vs Age is our first graph to explore the relationship between two
variables. With the previous conclusion from the age histogram, we can analyse this joint plot with a
clear idea about the age ranges we are dealing with. That is, we already expect outliers for young and
old consumers, namely at the two extremes of our x-axis. However, this joint plot offers a new piece
of information: in addition to belonging to the most common age range, the majority of users come
from an area income falling between 55,000 and 75,000, which is the region with the highest density
of scatter points.

The joint plot of Daily Time Spent on Site vs Age shows the relationship between the amount of time
a consumer spends on the website daily and their age. This KDE seaborn joint plot shows a higher
density at the upper left corner of the graph, meaning that the people spending the most time on the
site are between 25 and 36 years old, with daily time falling in the range of 70 to 91 minutes. This is
also intuitive since most of our users belong to that same age range. Nevertheless, we can still read
off data about other ages and the time they spend on the site. For instance, most 40-year-old visitors
spend approximately between 70 and 89 minutes on the website, whereas only a few spend less than
30 minutes.

The joint plot of Daily Time Spent on Site vs Daily Internet Usage compares the amount of time a
consumer spends on the website daily with their overall daily internet usage. We can see that there is
a positive correlation between these two features, meaning that as the amount of time a consumer
spends on the internet daily increases, the amount of time they spend on the website also tends to
increase. For example, users with over 200 minutes of daily internet usage are active on the website
for over 65 minutes, which is at least 32.5% of their online screen time. Hence, we can look at both
features and note what share of a user's total time online the daily time spent on site makes up. From
the given graph, we can see that a large part of their scrolling is, in fact, done on the site.
The pair plot with hue defined by Clicked on Ad exhibits the relationship between each pair of
features in the dataset, with the points coloured based on whether or not the consumer clicked on the
advertisement. This graph offers a global view containing valuable information: we get a clear
perspective on the relationship between each pair of variables, and on how those relationships relate
to ad clicks. While this could be done through a thorough analysis of each of the 25 panels displayed,
we can also look through a macroscopic lens to identify some patterns. For instance, we can see that
older people with little time spent online tend to click on the ad more than older people with a higher
time on site. Similarly, users with high daily internet usage who live in high-income areas are less
likely to click on the ad, as opposed to people with less internet usage who live in lower-income
areas. Related comparisons and analyses can be made for each of the 25 panels, which will then
enable us to draw final conclusions.

These graphs are extremely useful, as they allow us to determine relationships among variables that
are not linearly related. This is precisely the power of logistic models, which we will explore in more
depth in the upcoming sections.

V. Logistic Regression:
1. Splitting the Data into Training Set and Testing Set:
To split the data into a training set and a testing set, we use the train_test_split function from the
sklearn.model_selection module. We first separate the data into an X array containing the features to
train on and a y array containing the target variable (Clicked on Ad), then pass both to this function.
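A sketch of this step; the exact feature columns and the 30% test share are assumptions, not fixed by
the text:

    X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income',
                 'Daily Internet Usage']]             # assumed numeric feature set
    y = ad_data['Clicked on Ad']                      # target variable
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)         # assumed 70/30 split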

2. Training and Fitting a Logistic Regression Model on the Training Set:


To do this, we import the LogisticRegression class from the sklearn.linear_model module, then create
an instance of the logistic regression model, which we store in a variable named logmodel. We then
train the model by calling the fit method on this variable, passing it the X_train features and the
y_train targets. At this point, the optimal coefficients for the logistic regression are found from the
training data.
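A sketch of these two steps:

    logmodel = LogisticRegression()   # create an instance of the model
    logmodel.fit(X_train, y_train)    # fit the coefficients on the training data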

The fitted model can then be used to make predictions on the testing set and to evaluate its performance.

VI. Predictions and Evaluation:
1. Predicting Values for the Testing Data:

After fitting the logistic regression model on the training set, we can use it to make predictions on the
testing set using the predict method. This produces an array of predicted values, one for each sample
in the testing set.
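A sketch of this step:

    predictions = logmodel.predict(X_test)   # one predicted label (0 or 1) per test sample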

2. Creating a Classification Report for the Model:


This can be done in Python through the classification_report function, which generates a report of
the performance of our logistic regression model on the testing set.
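A sketch of the evaluation call:

    print(classification_report(y_test, predictions))   # precision, recall, f1-score, support per class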

This assessment returns several metrics: precision, recall, f1-score, and support. These metrics are
particularly useful for evaluating our logistic model.

3. Explanation of Evaluations Metrics Used:


The classification report provides metrics for each class in the target variable, and each of these
metrics gives an idea of the quality of performance for our two classes (clicked ad or did not click
ad), such that:

➢ Precision: measures the proportion of true positive predictions out of all the positive
predictions made by the model. In this case, we have an overall precision of 90%, which
indicates that the model's positive predictions are largely correct.
➢ Recall: measures the proportion of actual positive instances in the testing set that the model
correctly identifies. Here, the model captures 96% of the actual instances of the first class
(did not click ad) and 84% of the second class (clicked ad).
➢ F1-score: the harmonic mean of precision and recall, providing a balanced measure of the
model's performance. From our precision and recall scores, the F1-score works out to 90%
for the first class and 89% for the second class.
➢ Support: the number of instances in the testing set that belong to each class: 146 for the
no-click class and 154 for the click class.

VII. Conclusion:

Throughout this project, we used different features imported from diverse Python libraries within
Jupyter's interface, with the aim of studying a dataset and generating a predictive algorithm for one
specific feature. These operations helped us visualize relationships between the variables of our
dataset and draw conclusions that allow us to make predictions.

The main limitation the predictive model might face relates to the accuracy of the given data, on
which it heavily relies. Otherwise, the evaluation metrics showed high accuracy and good precision
for our model.

In a real-life setting, the data visualization tools we used and the plots we generated could serve to
better understand the site's users and optimize the site accordingly, so as to reach customer segments
we have not yet attracted. Likewise, the predictive model provides insights into which users are likely
to click on our ad, and which are not.

Reference:
1 Lawton, G., Burns, E., & Rosencrance, L. (2022, January 20). What is logistic regression? Definition
from SearchBusinessAnalytics. TechTarget. Retrieved April 10, 2023, from
https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression#:~:text=Logistic%20regression%20is%20a%20statistical,or%20more%20existing%20independent%20variables.