Logistic Regression
Logistic regression is one of the most popular algorithms used in the field of data science.
Understanding Regression
Linear regression outputs continuous numeric values, whereas logistic regression transforms its output to
return a probability value, which can then be mapped to two or more classes.
Sigmoid Function
The hypothesis of logistic regression limits its output to values between 0 and 1. The sigmoid
function is given as:

sigmoid(z) = 1 / (1 + e^(-z))
Decision Boundary
The hypothesis of logistic regression can be given as:

h(x) = sigmoid(β0 + β1x1 + ... + βnxn) = 1 / (1 + e^-(β0 + β1x1 + ... + βnxn))
Decision Boundary: a threshold value between 0 and 1 is chosen; predicted probabilities above the
threshold map to one class and those below it to the other.
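As a minimal sketch of the two ideas above (the function names and the 0.5 threshold are my own choices, not from the original text):

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued input into the (0, 1) range.
    return 1 / (1 + np.exp(-z))

def classify(z, threshold=0.5):
    # Decision boundary: class 1 when the probability meets the threshold.
    return int(sigmoid(z) >= threshold)

print(classify(2.0))   # sigmoid(2.0) ~ 0.88 -> class 1
print(classify(-1.5))  # sigmoid(-1.5) ~ 0.18 -> class 0
```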
Cost Function
The cost function to be minimized in logistic regression (the log loss, or binary cross-entropy, over m training examples) can be given as:

J(β) = -(1/m) Σ [ y log(h(x)) + (1 - y) log(1 - h(x)) ]
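The cost function can be sketched numerically as follows; the function name and toy values are illustrative only. A confident correct prediction costs little, while a confident wrong one is penalized heavily:

```python
import numpy as np

def log_loss(y, p):
    # Binary cross-entropy: -(1/m) * sum(y*log(p) + (1-y)*log(1-p))
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss([1], [0.9]))  # ~0.105 (confident and correct)
print(log_loss([1], [0.1]))  # ~2.303 (confident and wrong)
```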
Types
I) Binary Logistic Regression: The categorical response has only two possible outcomes.
Advantages
1. Makes no assumption about the distribution of classes in feature space.
4. It gives the direction of association between the dependent and independent variables involved.
Disadvantages
1. Can lead to overfitting if the number of features exceeds the number of observations.
II) Credit card fraud can also be detected through logistic regression, using factors such as the date of the
transaction, amount, place, type of purchase, and many more.
In [3]: import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)
y = 1/(1 + np.exp(-x))  # sigmoid function
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("Sigmoid(x)")
plt.show()
Logistic Regression Practical
In [4]: #Collect and import the dataset; before importing data, import the pandas package.
import pandas as pd
dataset = pd.read_csv("Loan.csv")
In this dataset we have three columns. The first two (Income and Loan Amount) are the predictors
(independent variables), while the last one, Target, is the response (or dependent variable).
We will use this dataset to train a logistic regression model to predict whether a borrower will default
on a new loan based on their income and the amount of money they intend to borrow.
    Income  Loan Amount  Target
25      15           85     yes
26      18           90     yes
27      16          100     yes
28      22          105     yes
29      14          110     yes
In [7]: #structure of the dataset: column names, non-null counts, and dtypes
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Income 30 non-null int64
1 Loan Amount 30 non-null int64
2 Target 30 non-null object
dtypes: int64(2), object(1)
memory usage: 848.0+ bytes
The primary objective in this step is to split our data into training and test sets. The training set will be used
to train the model, while the test set will be used to evaluate it.
In [15]: y = dataset['Target']
# y is the dependent (target) variable.
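The matching predictor matrix x is not shown in this excerpt; based on the columns described earlier, it would presumably be built as below (the toy DataFrame stands in for the real Loan.csv):

```python
import pandas as pd

# Toy stand-in for the Loan data described above (values are illustrative).
dataset = pd.DataFrame({
    "Income": [15, 18, 16],
    "Loan Amount": [85, 90, 100],
    "Target": ["yes", "yes", "no"],
})

# Select the two predictor columns as the independent variables.
x = dataset[["Income", "Loan Amount"]]
print(x.shape)  # (3, 2)
```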
Using the train_test_split() function, we can split x and y into x_train, x_test, y_train, and y_test.
Within the train_test_split() function, we will set:
train_size to 0.70 to 0.80: depending on the data size, we assign 70% to 80% of the data for training,
while the remaining 20% to 30% is assigned to the test set.
stratify to y, which means that we want the data split using a stratified random sampling approach based on
the values of y.
random_state to 123, so we get the same result every time we do this split.
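The split described above can be sketched as follows; the synthetic arrays stand in for the Loan data, which is not fully shown here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Loan data: 30 rows, 2 predictors, binary target.
rng = np.random.default_rng(123)
x = rng.normal(size=(30, 2))
y = np.array(["yes"] * 15 + ["no"] * 15)

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    train_size=0.7,    # 70% train, 30% test
    stratify=y,        # preserve the yes/no proportions in both splits
    random_state=123,  # reproducible split
)
print(x_train.shape, x_test.shape)  # (21, 2) (9, 2)
```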
(21, 2)
Out[17]:
The above result shows us that 21 of the 30 instances in the dataset were assigned to the training set.
In [18]: #shape of test data
x_test.shape
(9, 2)
Out[18]:
The above result shows us that 9 of the 30 instances in the dataset were assigned to the test set.
In [20]: #To train the model, we pass the training data (x_train & y_train) to the fit() method of the classifier.
#The classifier is instantiated here because its definition is not shown in this excerpt.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
model = classifier.fit(x_train, y_train)
In [21]: #Recall that there are 9 instances (or rows) in the test set.
#To predict labels for the test instances, we pass the independent variables of the test set to predict().
model.predict(x_test)
In [22]: #To evaluate how accurate our model is, we pass the test data (x_test and y_test) to the score() method.
model.score(x_test, y_test)
0.8888888888888888
Out[22]:
The result tells us the logistic regression model is able to correctly predict 8 out of 9 (89%) of the labels in the
test set.
The accuracy of a model only gives us a one-dimensional perspective of performance. To get a broader
perspective, we need to generate a confusion matrix of the model's performance.
array([[3, 1],
       [0, 5]], dtype=int64)
Out[23]:
The output is a 2×2 array that shows how many instances the model predicted correctly or incorrectly as
either Yes or No. This confusion matrix can be read as follows:
The first row of the matrix shows that of the 4 instances that were actually No, the model predicted 3 of
them as No but 1 of them as Yes. The second row of the matrix shows that of the 5 instances that were
actually Yes, the model predicted all 5 correctly as Yes.
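The matrix above could be reproduced with scikit-learn's confusion_matrix; the label vectors below are hypothetical, chosen only to match the counts described:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels matching the counts above. Rows are actual classes and
# columns are predicted classes, in alphabetical label order ("no", "yes").
y_true = ["no"] * 4 + ["yes"] * 5
y_pred = ["no", "no", "no", "yes"] + ["yes"] * 5

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [0 5]]
```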
Interpret the Model
Having built the model and evaluated its performance on the test data, we can now interpret the model's
output, starting with the model coefficients.
The relationship between the dependent and independent variables in a logistic regression model is generally
represented as follows:

ln(P / (1 - P)) = β0 + β1x1 + β2x2 + ... + βnxn

In this representation, the left-hand side of the equation is known as the logit, or log-odds, of the probability
of an outcome or class P. β0 is the intercept, and β1 to βn are the coefficients of the independent variables x1 to xn.
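To see the equation in action, we can plug in numbers and invert the logit to recover a probability; the coefficient values below are hypothetical, rounded versions of those reported later in this section:

```python
import numpy as np

# Hypothetical values matching the coefficients reported in this section.
beta0 = 15.47                    # intercept β0
betas = np.array([-1.02, 0.15])  # β1 (Income), β2 (Loan Amount)
x_new = np.array([16, 100])      # a borrower with income 16 and loan amount 100

log_odds = beta0 + betas @ x_new  # right-hand side of the equation
p = 1 / (1 + np.exp(-log_odds))   # invert the logit to recover P
print(round(log_odds, 2))         # log-odds of defaulting
print(p)                          # corresponding probability, near 1 here
```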
In [24]: model.intercept_
#To get the intercept (β0), we use the intercept_ attribute of the model.
array([15.4670632])
Out[24]:
In [25]: model.coef_
# To get the other model coefficients (β1, β2), we use the coef_ attribute of the model.
array([[-1.0178107 , 0.14656096]])
Out[25]:
array([-1.02, 0.15])
Out[26]:
The above code makes the coefficients easier to work with: it converts them from a two-dimensional
array to a one-dimensional array and rounds the values to two decimal places.
round() is a mathematical function that rounds an array to the given number of decimals. Syntax:
numpy.round(arr, decimals=0, out=None)
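The conversion described above can be sketched as follows, using the coefficient values from the output shown earlier:

```python
import numpy as np

# The coef_ attribute returns a 2-D array of shape (1, n_features).
coef_2d = np.array([[-1.0178107, 0.14656096]])

# Take the first (and only) row to get a 1-D array, then round to 2 decimals.
coef_1d = np.round(coef_2d[0], 2)
print(coef_1d)  # [-1.02  0.15]
```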
Income -1.02
Above, we create a Pandas DataFrame using the coefficient values, with the column names from the training
data as row indexes.
The first coefficient tells us that, when all other variables are held constant, a $1 increase in a borrower's
income decreases the log-odds that they will default on their loan by 1.02. Likewise, the second coefficient
tells us that a $1 increase in the amount a customer borrows increases the log-odds that they will default on
their loan by 0.15, when all other variables are held constant.
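Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret; this step is an addition to the original text, using the rounded coefficients above:

```python
import numpy as np

# Coefficients on the log-odds scale, from the fitted model above.
coef = np.array([-1.02, 0.15])

# exp(coef) converts a log-odds change into a multiplicative change in the
# odds of default per one-unit increase in each variable.
odds_ratios = np.exp(coef)
print(odds_ratios.round(2))  # [0.36 1.16]
```

In words: each extra unit of income multiplies the odds of default by about 0.36, while each extra unit borrowed multiplies them by about 1.16.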