Introduction_to_ML_Linear_Regression_Lecture_Slides

Uploaded by

kr5c96y7km
Copyright
© All Rights Reserved
Machine Learning

kumar.sovan@gmail.com
6LGU0EZJIR

Linear Regression

This file is meant for personal use by kumar.sovan@gmail.com only.


Sharing or publishing the contents in part or full is liable for legal action.
Linear Relations between two variables

• Do heavier cars have lower mileage?

• Can we use data to better understand the relationship between the two variables: weight and mpg?

[Figure: scatterplot of Mpg against Weight]
Data Source: StatLib (http://lib.stat.cmu.edu/datasets/)
Which one has a stronger relationship?

Measures of Association

• We need a measure of association between two variables.

• By association we mean the strength (and direction) of a linear relationship between two numerical variables.

• The relationship is “strong” if the points in a scatterplot cluster tightly around some straight line. If this line rises from left to right, the relationship is “positive”. If it falls from left to right, the relationship is “negative”.

• We know that the (sample) variance of a variable X is

$$\mathrm{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

• On a similar note, let's define the “covariance” between X and Y as

$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

• Covariance:

• The covariance between X and Y is the same as the covariance between Y and X.

• The covariance between a variable and itself is the variance of the variable.

• The magnitude of a covariance is difficult to interpret because covariance is not scale invariant: it changes with the units of X and Y.

• Correlation

• We can scale covariance to make it a scale-invariant measure of linear association!

• The correlation between X and Y is

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{s_X \, s_Y}$$

where s_X and s_Y are the standard deviations of X and Y.

• Correlation is always between -1 and +1. The correlation between a variable and itself is 1.

• The correlation between X and Y is the same as the correlation between Y and X.

• Correlation is scale invariant.
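The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy and the sample (n - 1) form of covariance. The data values are made up for illustration and are not taken from the StatLib dataset, and the helper names `sample_cov` and `sample_corr` are hypothetical.

```python
import numpy as np

# Hypothetical illustrative data (not taken from the StatLib dataset).
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0])  # lbs
mpg = np.array([34.0, 29.0, 25.0, 21.0, 18.0, 14.0])


def sample_cov(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)


def sample_corr(x, y):
    """Correlation: covariance scaled by the two standard deviations."""
    return sample_cov(x, y) / np.sqrt(sample_cov(x, x) * sample_cov(y, y))


weight_kg = weight * 0.4536  # same cars, weight expressed in kilograms

cov_lbs = sample_cov(weight, mpg)
cov_kg = sample_cov(weight_kg, mpg)
corr_lbs = sample_corr(weight, mpg)
corr_kg = sample_corr(weight_kg, mpg)

print("covariance:", cov_lbs, cov_kg)     # changes with the unit of weight
print("correlation:", corr_lbs, corr_kg)  # unchanged by the unit of weight
```

Changing the unit of weight rescales the covariance but leaves the correlation untouched, which is why correlation is the easier quantity to interpret.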


Interpreting Correlations

• Correlation between Weight and Mpg is -0.83

• Do heavier cars tend to have lower mileage?

• If we increase the weight of a car, will its Mpg decrease?

• Correlation and covariance are measures of linear association only.

• Correlation can be misleading when the association is nonlinear.

• Outliers can have significant effects on correlations. Outliers that are clearly identifiable are best deleted before correlation computations.
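Both cautions can be illustrated with a few lines of NumPy. This is a sketch on made-up data: the first pair of variables is related exactly but not linearly, and the second is a perfect straight line to which a single hypothetical outlier is appended.

```python
import numpy as np

# Hypothetical data: y is determined exactly by x, but not linearly.
x = np.arange(-5.0, 6.0)  # -5, -4, ..., 5
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]  # essentially zero

# Hypothetical data: a perfect straight line ...
u = np.arange(1.0, 11.0)  # 1, 2, ..., 10
v = 2.0 * u + 1.0
r_clean = np.corrcoef(u, v)[0, 1]  # essentially one

# ... and the same line with one outlying point appended.
u_out = np.append(u, 11.0)
v_out = np.append(v, -40.0)
r_outlier = np.corrcoef(u_out, v_out)[0, 1]

print("nonlinear but exact relationship:", r_nonlinear)
print("perfect line:", r_clean)
print("perfect line plus one outlier:", r_outlier)
```

A correlation near zero therefore does not mean "no relationship", and a single extreme point can change both the size and the sign of the correlation.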

[Figure omitted. Source: Wikipedia]
Salaries and Expenses

• Next: If a car’s weight is 4000, what would we expect its Mpg to be?

• Previously: Measuring strength of relationship

• Now: Capturing relationships using a simple model (equation)

[Figure: scatterplot of Mpg against Weight]
How easy is it to fit a straight line?

[Figure: scatterplot of Mpg against Weight]
One possibility that makes sense...

• Choose a line that (in some sense) minimizes the vertical distances from the points to the line.

• We choose to minimize the sum of the squares of these vertical distances, for mathematical convenience.

• This method is called “Least Squares Estimation”; fitting a line this way is usually referred to as “Linear Regression”.

Least Squares Estimation

• Note that:

Observed Value = Fitted Value + Residual

• Fitted Value: The predicted value of the response variable. It is the height of the line (its y-value) at the observation's x-value.

• Residual: The difference between the actual and fitted values of the response variable.

• Observed Value: The actual value of the response variable.

• The Least Squares line is the one that minimizes the sum of the squared residuals.

• If we denote the ith residual by e_i, then we are minimizing $\sum_i e_i^2$.

• Virtually all statistical software automates this method.
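As an illustration of what such software does for a single predictor, here is a minimal sketch using NumPy. The closed-form slope and intercept are the standard least squares solution rather than something stated on the slide, the data values are made up, and `fit_line` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical illustrative data (not taken from the StatLib dataset).
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0])  # lbs
mpg = np.array([34.0, 29.0, 25.0, 21.0, 18.0, 14.0])


def fit_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    dx = x - x.mean()
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


intercept, slope = fit_line(weight, mpg)

fitted = intercept + slope * weight   # fitted values: heights of the line
residuals = mpg - fitted              # observed value minus fitted value
sse = np.sum(residuals ** 2)          # the quantity least squares minimizes

print("intercept:", intercept, "slope:", slope)
print("sum of squared residuals:", sse)
```

With a fitted line in hand, `intercept + slope * 4000` gives the Mpg the model expects for a car weighing 4000 (for this made-up data only).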

So...

• If a car’s weight is 4000, what would we expect its Mpg to be?

• We managed to use the data to construct a regression model. Using this model, we answered the above question.

How good is our regression fit?

• We need measures of goodness of fit.

Measures of Regression Fit

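• Standard deviation of the residuals. Sometimes also called the Root Mean Squared Error (RMSE):

$$s_e = \sqrt{\frac{\sum e_i^2}{n - 2}}$$

• Compare the RMSE to the standard deviation of y: the smaller the RMSE is relative to it, the more of the variation in y the line accounts for.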
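The comparison can be sketched as follows, assuming NumPy, made-up data, and a line fitted with `np.polyfit`. The residual standard deviation divides by n - 2, as on the slide, while the standard deviation of y uses the usual n - 1.

```python
import numpy as np

# Hypothetical illustrative data (not taken from the StatLib dataset).
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0])  # lbs
mpg = np.array([34.0, 29.0, 25.0, 21.0, 18.0, 14.0])

# Degree-1 polyfit returns the least squares slope and intercept.
slope, intercept = np.polyfit(weight, mpg, 1)
residuals = mpg - (intercept + slope * weight)

n = len(mpg)
rmse = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # standard deviation of residuals
sd_y = np.std(mpg, ddof=1)                        # standard deviation of y

print("RMSE:", rmse)
print("Std. dev. of y:", sd_y)
```

If the RMSE is not much smaller than the standard deviation of y, the line predicts little better than always guessing the mean of y.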

Measures of Regression Fit

• Coefficient of determination

$$R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}$$

• Lends itself to a really nice interpretation: it is the percentage of variation of the dependent variable explained by the regression.

• In simple linear regression, it is simply the square of the correlation!

• $R^2 = \mathrm{SSR}/\mathrm{SST}$ has no units and lies between 0 and 1. Here $\mathrm{SST} = \sum (y_i - \bar{y})^2$ is the total sum of squares and SSR is the part of it explained by the regression.
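These identities can be verified numerically. The sketch below assumes NumPy and made-up data, and reads SSR as the regression (explained) sum of squares, which is the reading consistent with the interpretation given above.

```python
import numpy as np

# Hypothetical illustrative data (not taken from the StatLib dataset).
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0])  # lbs
mpg = np.array([34.0, 29.0, 25.0, 21.0, 18.0, 14.0])

slope, intercept = np.polyfit(weight, mpg, 1)
fitted = intercept + slope * weight

sse = np.sum((mpg - fitted) ** 2)         # sum of squared residuals
sst = np.sum((mpg - mpg.mean()) ** 2)     # total variation of y
ssr = np.sum((fitted - mpg.mean()) ** 2)  # variation explained by the regression

r_squared = 1.0 - sse / sst
r = np.corrcoef(weight, mpg)[0, 1]

print("R^2 from residuals:", r_squared)
print("SSR / SST:", ssr / sst)
print("squared correlation:", r ** 2)
```

All three printed values agree for a simple linear regression fitted by least squares with an intercept.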


Multiple Regression

• One dependent variable. More than one independent variable.

• The regression model (equation), with k independent variables:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

• The above is the equation of a hyperplane of dimension k (in the space of the k independent variables and the dependent variable).

• Again, use similar arguments to find the best hyperplane by minimizing the least squares measure.

• Very easily computed using most statistics or ML tools.

1. mpg: miles per gallon


2. cyl: cylinders
3. disp: displacement (cu. inches)
4. hp: horsepower
5. wt: weight (lbs)
6. acc: acceleration (secs for 0-60mph)
7. yr: model year
8. origin (American, European, Japanese)
9. car name

Data Source: StatLib (http://lib.stat.cmu.edu/datasets/)
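A minimal sketch of a multiple regression fit, assuming NumPy's `np.linalg.lstsq` and two of the listed variables (wt and hp) as predictors of mpg. The rows below are made-up values, not records from the StatLib file.

```python
import numpy as np

# Hypothetical rows: weight (lbs), horsepower, and mpg. Made-up values.
wt = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0, 3200.0, 2800.0])
hp = np.array([70.0, 90.0, 95.0, 130.0, 150.0, 190.0, 100.0, 110.0])
mpg = np.array([34.0, 29.0, 26.0, 21.0, 18.0, 14.0, 24.0, 26.0])

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(wt), wt, hp])

# Least squares solution: coefficients minimizing the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, mpg, rcond=None)

fitted = X @ coef
residuals = mpg - fitted
sse_multiple = np.sum(residuals ** 2)

# For comparison: the simple regression of mpg on weight alone.
slope, intercept = np.polyfit(wt, mpg, 1)
sse_simple = np.sum((mpg - (intercept + slope * wt)) ** 2)

print("intercept, wt coefficient, hp coefficient:", coef)
print("SSE with wt and hp:", sse_multiple)
print("SSE with wt only:", sse_simple)
```

Adding a predictor can never increase the sum of squared residuals on the data used for fitting, which is one reason a fit measure that accounts for the number of predictors is useful.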
Standard Error and Adjusted R2

• Standard Error for multiple regression, with n observations and k independent variables:

$$s_e = \sqrt{\frac{\sum e_i^2}{n - k - 1}}$$

• Adjusted R^2:

$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

• A measure that adjusts for the number of independent variables used

• Used to judge whether additional independent variables belong in the model

• Cannot be interpreted as “percentage of variation explained”
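A minimal sketch of the two quantities, assuming the standard definitions with n observations and k independent variables. The function names and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def standard_error(sse, n, k):
    """Residual standard error: sqrt(SSE / (n - k - 1))."""
    return math.sqrt(sse / (n - k - 1))


def adjusted_r2(r2, n, k):
    """Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs: the same R^2 = 0.9 on n = 11 observations,
# reached with 2 predictors versus 5 predictors.
adj_two = adjusted_r2(0.9, n=11, k=2)    # approximately 0.875
adj_five = adjusted_r2(0.9, n=11, k=5)   # approximately 0.8

# A weak fit with many predictors can give a negative value.
adj_weak = adjusted_r2(0.05, n=11, k=5)  # approximately -0.9

se = standard_error(8.0, n=11, k=2)      # sqrt(8 / 8) = 1.0

print(adj_two, adj_five, adj_weak, se)
```

The same raw R^2 is penalized more heavily when it takes more predictors to reach it, and the possibility of a negative value shows why the adjusted figure is not a percentage of variation explained.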

Pros and Cons

• Advantages

• Simple, elegant model

• Computationally very efficient

• Easy to interpret the output’s coefficients

• Disadvantages

• Sometimes it's just too simple to capture real-world complexities

• Assumes a linear relationship between dependent and independent variables

• Outliers can have a large effect on the output

• Assumes independence between attributes
