Data Analytics Regression Unit III
Regression analysis is a set of statistical processes for estimating the relationships between a dependent
variable (often called the ‘outcome variable’) and one or more independent variables (often called
‘predictors’, ‘covariates’, or ‘features’).
The terminology you will often hear related to regression analysis is:
For an input x, if the output is continuous, this is called a regression problem. For example, based on
historical information about demand for smartphones in our mobile shop, you are asked to predict the
demand for the next month. Regression is concerned with the prediction of continuous quantities.
As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let's say you want to estimate growth in sales of a company based on current economic conditions. You
have recent company data which indicates that the growth in sales is around two and a half times the
growth in the economy. Using this insight, we can predict future sales of the company based on current
and past information. For example, if the economy is expected to grow by 4%, this relationship would
suggest sales growth of roughly 10%.
There are multiple benefits of using regression analysis. They are as follows:
1. It indicates the significant relationships between dependent variable and independent variable.
2. It indicates the strength of impact of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured on different scales, such
as the effect of price changes and the number of promotional activities. These benefits help market
researchers / data analysts / data scientists to evaluate and select the best set of variables to be used
for building predictive models.
There are various kinds of regression techniques available to make predictions. These techniques are
mostly driven by three metrics (number of independent variables, type of dependent variables and shape
of regression line).
For the creative ones, you can even cook up new regressions, if you feel the need to use a combination of
the parameters above, which people haven’t used before. But before you start that, let us understand the
most commonly used regressions:
• To find a target function that can fit the input data with minimum error.
• The error function for a regression task can be expressed in terms of the sum of absolute or squared errors:
E = Σ |yi − ŷi| (sum of absolute errors) or E = Σ (yi − ŷi)² (sum of squared errors)
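To make this concrete, here is a minimal Python sketch (standard library only) computing both error measures; the y_true and y_pred values are invented for illustration.

    # Sketch: sum of absolute and of squared errors for a regression task.
    # y_true and y_pred are invented values for illustration.
    y_true = [4, 5, 7, 10, 15]
    y_pred = [3.3, 4.9, 7.9, 10.9, 13.9]

    sae = sum(abs(t - p) for t, p in zip(y_true, y_pred))     # sum of absolute errors
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # sum of squared errors

    print(f"sum of absolute errors: {sae:.2f}")
    print(f"sum of squared errors:  {sse:.2f}")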
Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among the first few
topics which people pick while learning predictive modeling. In this technique, the dependent variable is
continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression
line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one or
more independent variables (X) using a best fit straight line (also known as regression line).
It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line and
e is the error term. This equation can be used to predict the value of the target variable based on the
given predictor variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear
regression has more than one independent variable, whereas simple linear regression has only one
independent variable. Now, the question is: "How do we obtain the best fit line?"
In the simplest case, the regression model allows for a linear relationship between the forecast
variable y and a single predictor variable x:
y_t = β0 + β1*x_t + ε_t.
An artificial example of data from such a model is shown in Figure. The coefficients β0 and β1 denote
the intercept and the slope of the line respectively. The intercept β0 represents the predicted value
of y when x=0. The slope β1 represents the average predicted change in y resulting from a one unit
increase in x.
The simplest case of linear regression is to find a relationship using a linear model (i.e., a line) between
an input independent variable (a single input feature) and an output dependent variable. This is called
Bivariate Linear Regression.
On the other hand, when a linear model represents the relationship between a dependent output and
multiple independent input variables, it is called Multivariate Linear Regression.
The dependent variable is continuous and independent variables may or may not be continuous. We find
the relationship between them with the help of the best fit line which is also known as the Regression
line.
This task can be easily accomplished by the Least Squares Method. It is the most common method used
for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the
squares of the vertical deviations from each data point to the line. Because the deviations are first
squared, when added, there is no cancelling out between positive and negative values.
• We can evaluate the model performance using the metric R-square. In multiple linear regression,
multiple terms are added together, but the parameters are still linear.
Important Points:
The least squares modeling procedure is the best linear unbiased estimator, i.e., BLUE (Best Linear
Unbiased Estimator). The simple model needs five fundamental assumptions to be satisfied and the
multiple regression model needs six assumptions to be satisfied. Among these, four assumptions relate
to the model's residuals. They are as follows:
1. The residuals have zero mean.
2. The residuals have constant variance (homoscedasticity).
3. The residuals are uncorrelated with one another (no autocorrelation).
4. The residuals are normally distributed.
Here, the first assumption of zero mean is fulfilled automatically by the nature of least squares estimation.
The assumption of normally distributed residuals does not matter as far as the BLUE property is
concerned: the Gauss-Markov theorem only requires the residuals to have zero mean and constant
variance. Hypothesis testing, however, does require normality of the residuals. The remaining three
assumptions are important in OLS estimation. They do not always hold, and forecasting performance
suffers when any of them is violated.
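As a quick illustration of the zero-mean property, the following sketch (assuming NumPy is available) fits a line by least squares and checks that the residuals average to zero; the data points are made up.

    # Sketch: least squares residuals have (numerically) zero mean.
    # Data points are made up; NumPy's polyfit does the least squares fit.
    import numpy as np

    x = np.array([2, 3, 5, 7, 9])
    y = np.array([4, 5, 7, 10, 15])

    m, b = np.polyfit(x, y, 1)           # degree-1 least squares fit
    residuals = y - (m * x + b)

    print("mean of residuals:", residuals.mean())   # ~0 up to floating point error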
• The least squares method is a statistical procedure to find the best fit for a set of data points by
minimizing the sum of the offsets or residuals of points from the plotted curve. Least squares
regression is used to predict the behavior of dependent variables.
• Ordinary least squares is a method used by linear regression to get parameter estimates.
• This entails fitting a line so that the sum of the squared distance from each point to the regression
line (residual) is minimized.
• Let’s visualize this in the diagram below where the red line is the regression line and the blue lines
are the residuals.
Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points, and a similar number
of points above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x,y) point calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up")
Step 3: Calculate Slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
(N is the number of points.)
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of a line
y = mx + b
Done!
Example
Let's have an example to see how to do it!
Example: Sam found how many hours of sunshine vs how many ice creams were sold at the shop from
Monday to Friday:
"x" "y"
Hours of Ice Creams
Sunshine Sold
2 4
3 5
5 7
7 10
9 15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
x    y    x²    xy
2    4    4     8
3    5    9     15
5    7    25    35
7    10   49    70
9    15   81    135
Σx = 26   Σy = 41   Σx² = 168   Σxy = 263
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²) = (5×263 − 26×41) / (5×168 − 26²) = 249 / 164 ≈ 1.518
b = (Σy − m Σx) / N = (41 − 1.518×26) / 5 ≈ 0.305
y = mx + b
y = 1.518x + 0.305
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above
equation to estimate that he will sell y = 1.518 × 8 + 0.305 ≈ 12.45, i.e., about 12 or 13 ice creams.
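To verify the arithmetic, here is a small Python sketch of the same least squares computation on Sam's data; it reproduces m ≈ 1.518 and b ≈ 0.305 and then evaluates the forecast for 8 hours of sunshine.

    # Sketch: the least squares steps from above, on Sam's data.
    x = [2, 3, 5, 7, 9]      # hours of sunshine
    y = [4, 5, 7, 10, 15]    # ice creams sold
    n = len(x)

    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n

    print(f"m = {m:.3f}, b = {b:.3f}")               # m = 1.518, b = 0.305
    print(f"forecast for 8 hours: {m * 8 + b:.2f}")  # ≈ 12.45 ice creams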
And for multiple linear regression, since we have more than one independent variable, the equation
becomes:
Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn + e
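As an illustrative sketch (assuming NumPy), a multiple linear regression with two predictors can be fitted by least squares; the data below are constructed so that the true coefficients are b0 = b1 = b2 = 1.

    # Sketch: fitting Y = b0 + b1*X1 + b2*X2 by least squares with NumPy.
    # The data are invented so that the true coefficients are b0 = b1 = b2 = 1.
    import numpy as np

    # Design matrix: a column of ones (for the intercept b0), then X1 and X2.
    X = np.array([
        [1, 2, 3],
        [1, 3, 5],
        [1, 5, 4],
        [1, 7, 8],
        [1, 9, 10],
    ])
    y = np.array([6, 9, 10, 16, 20])   # equals 1 + X1 + X2 for each row

    # Least squares solution for b = (b0, b1, b2).
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("coefficients:", b)          # approximately [1, 1, 1]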
Now the question that comes to mind is: what error is this? Can we visualize it? How do we find it? In a
linear model, or any model, we do not have to work through the mathematics by hand; everything is done
by the model itself.
Let's interpret the graph above. In linear regression the best fit line will look somewhat like this; the only
difference will be the number of data points. To make it easier, a small number of data points has been
taken.
Suppose there is an observed value Yi. The squared distance between this Yi and the predicted value,
summed over all points, is what we call the "SUM OF SQUARED ERRORS" (SSE). This is the
unexplained variance, and we have to minimize it to get the best accuracy.
The squared distance between the predicted value ŷ and the mean of the dependent variable, summed
over all points, is called the "SUM OF SQUARES DUE TO REGRESSION" (SSR). This is the explained
variance of our model, and we want to maximize it.
The total variation in the model (SSR + SSE = SST) is called the "TOTAL SUM OF SQUARES" (SST).
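The following sketch computes SSE, SSR and SST (and R-square) for the ice cream example above; the fitted slope and intercept are the rounded values from earlier, so the sums will be slightly approximate.

    # Sketch: SSE, SSR, SST and R-square for the ice cream example.
    # m and b are the rounded values fitted earlier, so sums are approximate.
    x = [2, 3, 5, 7, 9]
    y = [4, 5, 7, 10, 15]
    m, b = 1.518, 0.305

    y_hat = [m * xi + b for xi in x]          # predicted values
    y_bar = sum(y) / len(y)                   # mean of the dependent variable

    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation

    print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}")   # SSR + SSE ≈ SST
    print(f"R-square = {ssr / sst:.3f}")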
1. Positive Relationship – When the regression line between the two variables has an upward slope, the
variables are said to be in a Positive Relationship: it means that if we increase the value of x (the
independent variable) then we will see an increase in our dependent variable.
2. Negative Relationship – When the regression line between the two variables has a downward slope,
the variables are said to be in a Negative Relationship: it means that if we increase the value of the
independent variable (x) then we will see a decrease in our dependent variable (y).
3. No Relationship – If the best fit line is flat (not sloped) then we can say that there is no relationship
among the variables. It means there will be no change in our dependent variable (y) by increasing or
decreasing our independent variable (x) value.
Correlation
When two sets of data are strongly linked together we say they have a High Correlation.
The word Correlation is made of Co- (meaning "together"), and Relation
Note:
The value of the correlation coefficient shows how strong the correlation is (not how steep the line is),
and whether it is positive or negative.
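As a sketch, the Pearson correlation coefficient r can be computed directly from its definition; Sam's ice cream data is reused here purely for illustration.

    # Sketch: Pearson correlation coefficient r from its definition.
    import math

    x = [2, 3, 5, 7, 9]
    y = [4, 5, 7, 10, 15]
    n = len(x)

    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y))

    r = cov / (sx * sy)
    print(f"r = {r:.3f}")   # close to +1: strong positive correlation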
Variable Rationalization
• It is the process of clustering data sets into more manageable parts for optimizing query
performance.
• It is used to divide the data, but in a different way; this process can be thought of as grouping
objects by attributes.
• It is a method that increases the performance of big data operations.
• Variable rationalization is different from partitioning, where every partition contains segments of
files.
This process is very much true to the scientific method: we learn things through models in order to gain
understanding of the things being investigated, and to make predictions that can then be tested. The
process of building variable models involves asking questions, gathering and manipulating data, building
models, and ultimately testing and evaluating them.
We are going to discuss life cycle phases of data analytics in which we will cover various life cycle
phases and will discuss them one by one.
Phase 1: Discovery –
The data science team learns and investigates the problem.
Develop context and understanding.
Come to know about data sources needed and available for the project.
The team formulates initial hypotheses that can be later tested with data.
Phase 2: Data Preparation –
The team prepares an analytics sandbox and performs extraction, transformation and loading (ETL)
of the data into it, along with data conditioning.
Phase 3: Model Planning –
The team determines the methods, techniques and workflow it intends to follow in the subsequent
model building phase, and explores the data to learn about the relationships between variables.
Phase 4: Model Building –
The team develops data sets for training, testing and production purposes, and builds and executes
models based on the work done in the model planning phase.
Phase 5: Communicate Results –
The team, in collaboration with major stakeholders, determines if the results of the project are a
success or a failure, and develops a narrative to summarize and convey the findings.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to a full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model
in a production environment on a small scale, and make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Logistic Regression
Classification techniques are an essential part of machine learning and data mining applications.
Approximately 70% of problems in Data Science are classification problems. There are many kinds of
classification problems, but logistic regression is a common and useful regression method for solving the
binary classification problem. Another category of classification is multinomial classification, which
handles the case where multiple classes are present in the target variable.
Logistic Regression can be used for various classification problems such as spam detection, diabetes
prediction, whether a given customer will purchase a particular product or churn to another competitor,
whether the user will click on a given advertisement link or not, and many more examples are in the
bucket.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms for two-
class classification. It is easy to implement and can be used as the baseline for any binary classification
problem. Its basic fundamental concepts are also constructive in deep learning. Logistic regression
describes and estimates the relationship between one dependent binary variable and independent
variables.
We can call Logistic Regression a Linear Regression model, but Logistic Regression passes the linear
output through a more complex function, known as the 'sigmoid function' or 'logistic function', instead of
using the linear function directly.
The hypothesis of logistic regression is required to limit its output between 0 and 1. Linear functions fail
to represent it, as they can take values greater than 1 or less than 0, which is not possible under the
hypothesis of logistic regression.
The sigmoid function, also called the logistic function, gives an 'S'-shaped curve that can take any real-
valued number and map it into a value between 0 and 1: σ(z) = 1 / (1 + e^(−z)). As z goes to positive
infinity, the predicted y becomes 1, and as z goes to negative infinity, the predicted y becomes 0. If the
output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less
than 0.5, we can classify it as 0 or NO. For example: if the output is 0.75, we can say in terms of
probability that there is a 75 percent chance that the patient will suffer from cancer.
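A minimal sketch of the sigmoid function and the 0.5 decision rule described above; the input score is a hypothetical value of the model's linear part.

    # Sketch: the sigmoid (logistic) function and the 0.5 decision rule.
    import math

    def sigmoid(z):
        """Map any real number into the interval (0, 1)."""
        return 1.0 / (1.0 + math.exp(-z))

    score = 1.1                      # hypothetical linear score (a + b*X)
    p = sigmoid(score)
    label = 1 if p > 0.5 else 0      # threshold rule described above

    print(f"p = {p:.3f} -> class {label}")   # p ≈ 0.750 -> class 1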
Example:
Whether or not to lend to a bank customer (outcomes are yes or no).
Assessing cancer risk (outcomes are high or low).
Will a team win tomorrow’s game (outcomes are yes or no).
Multinomial Logistic Regression: In this kind of classification, the dependent variable can have 3
or more possible unordered types, or types having no quantitative significance. For example,
these variables may represent "Type A", "Type B" or "Type C".
Example:
Color(Red,Blue, Green)
School Subjects (Science, Math and Art)
Ordinal Logistic Regression: In this kind of classification, the dependent variable can have 3 or
more possible ordered types, or types having a quantitative significance. For example, these
variables may represent "poor", "good", "very good" or "excellent", and each category can have a
score like 0, 1, 2, 3.
Example:
Medical Condition (Critical, Serious, Stable, Good)
Survey Results (Disagree, Neutral and Agree)
Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated
using the Maximum Likelihood Estimation (MLE) approach.
• Linear regression is used to predict a continuous dependent variable using a given set of
independent variables; logistic regression is used to predict a categorical dependent variable using
a given set of independent variables.
• Linear regression is used for solving regression problems; logistic regression is used for solving
classification problems.
• In linear regression, we predict the value of continuous variables; in logistic regression, we predict
the values of categorical variables.
• In linear regression, we find the best fit line, by which we can easily predict the output; in logistic
regression, we find the S-curve, by which we can classify the samples.
• The least squares estimation method is used for estimating the parameters of linear regression;
the maximum likelihood estimation method is used for estimating the parameters of logistic
regression.
• The output of linear regression must be a continuous value, such as price or age; the output of
logistic regression must be a categorical value such as 0 or 1, Yes or No.
• In linear regression, there may be collinearity between the independent variables; in logistic
regression, there should not be collinearity between the independent variables.
Here is a more realistic and detailed scenario for when logistic regression might be used:
Logistic regression may be used when predicting whether bank customers are likely to default on
their loans. This is a calculation a bank makes when deciding if it will or will not lend to a customer
and assessing the maximum amount the bank will lend to those it has already deemed to be
creditworthy. In order to make this calculation, the bank will look at several factors. Lend is the
target in this logistic regression, and based on the likelihood of default that is calculated, a lender
will choose whether to take the risk of lending to each customer.
These factors, also known as features or independent variables, might include credit score,
income level, age, job status, marital status, gender, the neighborhood of current residence
and educational history.
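As a hedged sketch of such a default model using scikit-learn: the feature columns, their values, and the labels below are all invented for illustration, not taken from the scenario.

    # Hedged sketch: logistic regression for loan default with scikit-learn.
    # All feature values and labels below are invented for illustration.
    from sklearn.linear_model import LogisticRegression

    # Columns (hypothetical): credit score, income (thousands), age.
    X = [
        [720, 85, 45],
        [650, 40, 29],
        [580, 25, 23],
        [700, 60, 38],
        [540, 20, 21],
        [690, 55, 33],
    ]
    y = [0, 0, 1, 0, 1, 1]   # 1 = defaulted, 0 = repaid (made-up labels)

    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)

    # Estimated probability of default for a new applicant (illustrative values).
    new_applicant = [[610, 35, 27]]
    print("P(default):", model.predict_proba(new_applicant)[0][1])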
Logistic regression is also often used for medical research and by insurance companies. In order
to calculate cancer risks, researchers would look at certain patient habits and genetic
predispositions as predictive factors. To assess whether or not a patient is at a high risk of
developing cancer, factors such as age, race, weight, smoking status, drinking status, exercise
habits, overall medical history, family history of cancer and place of residence and workplace,
accounting for environmental factors, would be considered.
Logistic regression is used in many other fields and is a common tool of data scientists.
Maximum Likelihood Estimation (MLE)
As the name suggests, in statistics it is a method for estimating the parameters of an assumed probability
distribution, where the likelihood function measures the goodness of fit of a statistical model on data for
given values of the parameters. The estimation of parameters is done by maximizing the likelihood
function, so that the data we are using is made most probable under the model. The likelihood function
for discrete random variables can be given by
L(θ; x) = P(X = x | θ)
where x is the observed outcome of the random variable X, and the likelihood is a function of θ. By the
above function, we can say the likelihood is equal to the probability that outcome x is observed when the
parameter of the model is θ.
Here the likelihood function can be put into hypothesis testing for finding the probability of various
outcomes using the set of parameters defined in the null hypothesis.
The main goal of maximum likelihood estimation is to make inferences about the population which
generated the sample, by evaluating the joint density at the observed data set. As we have seen above,
the likelihood function can be maximized by taking
θ̂ = arg max over θ of Ln(θ; x)
Here the motive of the estimation is to select the parameters that give the best fit for the model, making
the data most probable. The specific value θ̂ that maximizes the likelihood function Ln is called the
maximum likelihood estimate.
Ordinary Least squares estimates are computed by fitting a regression line on given data points that has
the minimum sum of the squared deviations (least square error). Both are used to estimate the
parameters of a linear regression model. MLE assumes a joint probability mass function, while OLS
doesn't require any stochastic assumptions for minimizing distance.
Maximum likelihood estimation, or MLE, is a method used in estimating the parameters of a statistical
model, and for fitting a statistical model to data. If you want to find the height measurement of every
basketball player in a specific location, you can use the maximum likelihood estimation. Normally, you
would encounter problems such as cost and time constraints. If you could not afford to measure all of the
basketball players’ heights, the maximum likelihood estimation would be very handy. Using the maximum
likelihood estimation, you can estimate the mean and variance of the height of your subjects. The MLE
would set the mean and variance as parameters in determining the specific parametric values in a given
model.
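Following the basketball height example, here is a minimal sketch of the closed-form maximum likelihood estimates for a normal distribution (the sample mean and the biased variance); the height values are invented.

    # Sketch: closed-form MLE for a normal distribution (mean and variance).
    # The height values below are invented for illustration.
    import math

    heights = [198.2, 201.5, 190.3, 205.0, 195.7, 188.9]   # cm
    n = len(heights)

    mu_hat = sum(heights) / n                                # MLE of the mean
    var_hat = sum((h - mu_hat) ** 2 for h in heights) / n    # MLE of the variance
                                                             # (divides by n, not n - 1)
    print(f"mu_hat = {mu_hat:.2f} cm, sigma_hat = {math.sqrt(var_hat):.2f} cm")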
To sum it up, maximum likelihood estimation delivers the set of parameter values under which the
observed data is most probable, for example the parameters of a normal distribution. Given a fixed set of
data and its probability model, these parameter values would most likely have produced the observed
data. MLE gives us a unified approach to estimation. But in some cases we cannot use maximum
likelihood estimation, because of recognized errors or because a maximum of the likelihood does not
actually exist.
“OLS” stands for “ordinary least squares” while “MLE” stands for “maximum likelihood estimation.”
The ordinary least squares, or OLS, can also be called the linear least squares. This is a method
for approximately determining the unknown parameters located in a linear regression model.
Maximum likelihood estimation, or MLE, is a method used in estimating the parameters of a
statistical model and for fitting a statistical model to data.
Model Theory
Model Theory is the part of mathematics which shows how to apply logic to the study of structures in pure
mathematics. On the one hand it is the ultimate abstraction; on the other, it has immediate applications to
every-day mathematics.
The fundamental tenet of Model Theory is that mathematical truth, like all truth, is relative. A statement
may be true or false, depending on how and where it is interpreted.
This isn't necessarily due to mathematics itself, but is a consequence of the language that we use to
express mathematical ideas.
Model Theory is divided into two parts, namely pure and applied. Pure model theory studies the abstract
properties of first order theories, and from there derives structure theorems for their models. Applied
model theory studies concrete algebraic structures from a model-theoretic point of view, and uses results
from pure model theory to establish functionalities and uniformities of definition. Applied model theory is
connected strongly with other branches of mathematics.
A fitted model describes the relationship between a response variable and one or more predictor
variables. There are many different models that you can fit, including simple linear regression, multiple
linear regression, analysis of variance (ANOVA), analysis of covariance (ANCOVA), and binary logistic
regression.
Linear fit
A linear model describes the relationship between a continuous response variable and the explanatory
variables using a linear function.
Logistic fit
A logistic model describes the relationship between a categorical response variable and the explanatory
variables using a logistic function.
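As a sketch, both kinds of fit can be run with scikit-learn; the toy arrays below are assumptions for illustration (the linear pair reuses the ice cream example).

    # Sketch: a linear fit and a logistic fit with scikit-learn.
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[2], [3], [5], [7], [9]])

    # Linear fit: continuous response.
    y_cont = np.array([4, 5, 7, 10, 15])
    lin = LinearRegression().fit(X, y_cont)
    print("slope:", lin.coef_[0], "intercept:", lin.intercept_)   # ≈ 1.518, 0.305

    # Logistic fit: categorical (binary) response.
    y_cat = np.array([0, 0, 0, 1, 1])
    log = LogisticRegression().fit(X, y_cat)
    print("P(class 1) at x = 6:", log.predict_proba([[6]])[0][1])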
Construction
Data modelling is the process of creating a data model for storing the data in a database.
A model is nothing but a representation of the data objects, the associations between different data
objects, and the rules.
Data modelling helps in visually representing the data, and enforces business rules, regulatory
compliance and government policies on the data.
The logical designs are translated into physical models which comprise the storage devices, databases
and files that hold the data. Businesses have traditionally used relational database technology such as
SQL to build data models, since it is uniquely suited to flexibly linking data set keys and data types in
support of the requirements of business processes.
Design a system rather than a schema: In the era of traditional data, a relational database schema
could cover the relationships and links between the data needed by the business for its information
support. This is not the case for big data, which may have no database at all or may use a database
such as NoSQL. Big data models must therefore be created on systems rather than on databases.
Use Data Modelling Tools: IT decision makers must include the ability to create data models for big
data among the requirements when considering big data tools and methodologies.
Focus on the Core Data of Business: Enterprises receive data in large volumes, and most of it is
extraneous. The best method is to identify the core business data and apply the suitable big data
tools and methodologies to it.
Deliver the Quality Data: Better data models and relationships are achieved for big data when
organisations focus on developing sound definitions for the data. Thorough metadata describes the
source of the data and its purpose. This knowledge about the data helps in placing it properly in data
models that support the business.
Search for Key Inroads into the Data: A commonly used vector into big data today is geographical
location. Depending on the business, industries have other common keys into the big data required
by users. Data models can be created which support the company's information access paths by
identifying these common entry points into the data.
Applications of Data Analytics
Digital Advertising: Data algorithms control digital advertisements, from banners displayed on various
websites to digital billboards in big cities.
Marketing: Analytics is used to observe the buying patterns of consumers, analyse trends to identify the
target audience through various advertising techniques that appeal to consumers, forecast supply needs,
etc.
Finance: Analytics is important to the finance sector. Data scientists are in high demand in investment
banking, portfolio management, financial planning, forecasting, budgeting, etc.
CRM: Analytics enables analysis of the performance indicators that help in decision making, and
provides strategies to boost the relationship with consumers. Demographics, data about other socio-
economic factors, purchasing patterns, life cycle, etc. are important to the CRM department.
Manufacturing: Analytics helps in supply chain management, inventory management, measuring
performance against targets, risk mitigation plans and even improving efficiency based on product data.
Travel: Analytics helps optimize the traveler's buying experience through social media and mobile/weblog
data analysis. Data analytics applications can deliver personalized travel recommendations based on
results from social media data.
Customer interactions: Insurers can assess their services through regular customer surveys conducted
after interactions with claim handlers. This is important for knowing which of their offerings work well.
Manage risk: In the insurance industry, risk management is the main focus. Data analytics offers
insurance companies data on claims, actuarial matters and risk, covering the important decisions that the
company must take. Evaluations are done by an underwriter before anyone gets insured; later on, the
appropriate insurance premium is set. Nowadays, analytical software is used to detect fraudulent claims.
Delivery Logistics: Various logistics companies like UPS, DHL, FedEx, etc. use data to improve their
operational efficiency. Through data analytics applications, these companies have found the most
suitable shipping routes, the best delivery times and the most suitable means of transport. Data
generated through GPS gives them opportunities to take advantage of data analytics and data science.
Energy Management: Data analytics is applied to energy management in areas such as energy
optimization, smart grid management, energy distribution and building automation for utility companies.
The data analytics applications focus mainly on monitoring and controlling dispatch crews and network
devices, and on managing service outages.
HR Professionals: HR professionals use data to fetch information about the educational background of
skilled candidates, the employee attrition rate, the number of years of experience, age, gender, etc. This
data plays a pivotal role in the candidate selection procedure.
Fraud and Risk Detection: Analytics helps organizations avoid losses: by extracting data from customers
at the time of a loan application, they can easily analyze and infer whether there is any probability of the
customer defaulting.