
Supervised Learning Part 1: Intro to Statistical Learning

Juwa Nyirenda
juwa.nyirenda@uct.ac.za

University of Cape Town


Slide Credit: Slides adapted from those by Dr Etienne Pienaar

April 20, 2021


Admin

Please read the course outline to familiarise yourself with the content of
the course and how it is assessed. The course outline can be accessed by
clicking on the first Vula tab on the left margin of the course Vula page.
Quiz 1 to be submitted by 23h55 on Monday, 26 April 2021
Quiz 2 to be submitted by 23h55 on Monday, 3 May 2021
Assignment 1 will be given out on Monday, 3 May 2021. Due date 23h55
on Monday, 10 May 2021.
Catch up meetings on MS Teams on Fridays between 16h00 and 18h00.
Admin

Relevant chapters:
Chapter 2
Chapter 3
Chapter 6 (6.1, 6.2)
Chapter 4
Chapter 8
James, G., Witten, D., Hastie, T. and Tibshirani, R.,
2013. An introduction to statistical learning
(Vol. 112).
New York: Springer.
ML/SL/Data Science/AI/Deep Learning/Analytics?!

All of these concepts are interrelated, and have become buzzwords over the past several years.
Data Rich Environments in Industry: Tech/Banking/Insurance1

Netflix prize: Based on ratings of 18,000 movies by 400,000
Netflix customers, predict customers' ratings of other movies.

(source: Holloway, 2010)

1
Slide credit: M. Varughese
Data Rich Environments in Industry: Spam filters, Chatbots,
Recommender Systems, Self-driving cars, Smart Policing,
Healthcare, etc.2

2
Slide credit: gizmodo.com, rcdroarena.com, www.wiseyak.com
Applications in Computer Vision
Applications in NLP
Course Goals

By the end of the course, you should be able to:


Understand how various statistical learning algorithms work
Implement them on your own
Look at a real world problem and identify if a statistical learning
algorithm is an appropriate solution
If so, identify what types of algorithms might be applicable
Feel inspired to work on and learn more about statistical learning.
Skills: Paradigms for Statistical Learning
We bisect 'Statistical Learning' into two broad paradigms of learning,
namely:
Supervised Learning
We are interested in the relationship between a set of predictor
variables and a measurable outcome.
Data consists of input variables and output variables.
Examples of models that fit within this paradigm are: Linear and
Logistic Regression, Decision/Regression Trees, Neural Networks,
Support Vector Machines.
Unsupervised Learning
We are interested in intrinsic
patterns/clusters/features/partitions/groupings in a set of
observations.
Data simply consists of a number of measurements.
Examples of such methods include Principal Components Analysis
(PCA), Clustering, and Self-Organising Maps (SOMs).
Unsupervised Learning Example

[Figure: unsupervised learning example; unlabelled observations plotted against Input dim. 1 and Input dim. 2.]
Supervised Learning

[Figure: supervised learning example; observations plotted against Input dim. 1 and Input dim. 2.]
Supervised Examples: Advertising

Advertising data: Sales and expenditure (000's) on TV, Radio, and
Newspaper advertising for 200 markets.
Can we predict Sales (output variable) based on advertising expenditure
(input variables/predictors)?
> # Load advertising dataset. Store in data frame called dat:
> dat = read.table('Advertising.txt', h = TRUE)
> # First 10 rows of Advertising data:
> head(dat, 10)

TV Radio Newspaper Sales


1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75.0 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1.0 4.8
10 199.8 2.6 21.2 10.6
Supervised Examples: Advertising

[Figure: pairwise scatter plots of Sales against the predictor variables TV, Radio, and Newspaper.]

Sales vs. various predictor variables.


Supervised Examples: Income

Income data: Income and years of Education and Seniority for 30
individuals.
Can we predict Income (outcome) based on years of Education and
Seniority?

> # Load Income dataset and store in data frame called dat:
> dat = read.table('Income.txt', h = TRUE)
> # First 10 rows of Income data:
> head(dat, 10)

Education Seniority Income


1 21.58621 113.10345 99.91717
2 18.27586 119.31034 92.57913
3 12.06897 100.68966 34.67873
4 17.03448 187.58621 78.70281
5 19.93103 20.00000 68.00992
6 18.27586 26.20690 71.50449
7 19.93103 150.34483 87.97047
8 21.17241 82.06897 79.81103
9 20.34483 88.27586 90.00633
10 10.00000 113.10345 45.65553
Supervised Examples: Income

[Figure: Income plotted against Education and against Seniority.]

Income vs. various predictor variables.


Supervised Examples: Email Classification

Can we detect whether an email is spam3?

We might use the relative frequency of occurrence of
characters/words/misspellings... (a short sketch of computing such a feature follows the table).
type free your our mail order dollar ! (
email 0.000 0.000 0.270 0.550 0.000 0.000 0.000 0.549
email 0.000 1.780 0.890 0.000 0.000 0.000 0.000 0.298
email 0.000 0.000 0.000 1.960 0.000 0.000 0.000 0.373
email 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
email 0.000 0.000 0.600 0.000 0.100 0.000 0.000 0.049
spam 0.000 2.830 0.940 0.940 0.000 0.000 0.000 0.000
spam 1.050 2.100 0.000 0.000 0.000 0.182 0.365 0.365
email 1.380 1.380 0.000 0.690 0.000 0.000 2.378 0.000
email 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.098
email 0.930 0.000 0.000 0.000 0.000 0.000 0.000 0.163

3
Term originates from a Monty Python sketch*
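
As a rough sketch of how such a feature could be computed (this is not how the table above was built; the message vector below is invented for illustration), one can count the relative frequency of a chosen word in each message:

> # Hypothetical illustration: percentage frequency of the word "free" per message.
> emails <- c("Claim your free prize now", "Meeting moved to the free slot at noon")
> words  <- strsplit(tolower(emails), "[^a-z]+")   # split each message into words
> sapply(words, function(w) 100 * sum(w == "free") / length(w))
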
Variables: Measurement Scales

Both predictor and output variables can be defined as either qualitative
or quantitative.
Supervised learning problems where the output variables are quantitative
in nature are referred to as regression tasks.
Problems where the output variables are qualitative in nature are
referred to as classification tasks.
The particular methods one chooses to perform these tasks are often
informed by the measurement scale of the output, i.e., qualitative vs.
quantitative.
The set of predictor variables is usually a combination of both
quantitative and qualitative measurements for both regression and
classification problems. Indeed, one often spends some time
encoding/engineering the feature set4 in order to improve model
performance or interpretation (a short encoding sketch follows).

4
Synonym for input variables. Used interchangeably.
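
As a small illustration of the kind of feature encoding mentioned above (a sketch only; the qualitative variable here is invented), a categorical predictor can be expanded into dummy/indicator columns before modelling:

> # Hypothetical example: encode a qualitative predictor as indicator (dummy) columns.
> region <- factor(c("North", "South", "South", "West"))   # invented qualitative feature
> X_enc  <- model.matrix(~ region)   # "North" is absorbed into the baseline/intercept
> head(X_enc)
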
Supervised Examples: Mortality

But this should be obvious, right?


Consider the following: Does smoking status affect mortality?
age smoking deaths person_years
1 35_44 smoker 32 52407
2 45_54 smoker 104 43248
3 55_64 smoker 206 28612
4 65_74 smoker 186 12663
5 75_84 smoker 102 5317
6 35_44 non-smoker 2 18790
7 45_54 non-smoker 12 10673
8 55_64 non-smoker 28 5710
9 65_74 non-smoker 28 2585
10 75_84 non-smoker 31 1462
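
The raw counts alone can mislead because the age groups differ in exposure. A quick sketch of an age-specific comparison, assuming the table above is stored in a data frame called mort (the name is an assumption), is:

> # Deaths per 1,000 person-years, by age band and smoking status
> # (assumes the table above sits in a data frame called 'mort'):
> mort$rate <- 1000 * mort$deaths / mort$person_years
> xtabs(rate ~ age + smoking, data = mort)
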
Variables: Notation

We use Y to notionally represent responses/output variables/target
variables.
For an observed outcome of Y, we use y. Furthermore, we often use an
index subscript, say yi, to denote the i-th observation, i.e., a particular
instance of an outcome.
{y1, y2, ..., yN} thus represents a set of N observations.
For a given Y we also observe a vector of p features
X = (X1, X2, ..., Xp).
We may refer to an observed outcome of this feature vector using
x = (x1, x2, ..., xp).
So for each observation, we have a pair (y, x).
For example, for our first observation of the advertising data, we have
y = 22.1 and x = (230.1, 37.8, 69.2).
Variables: R

In computing environments, 'variables' are objects which contain data.
Often, the relationship between the mathematical variables and those in
the workspace becomes opaque.
Typically, in R we work with vectors and matrices.
Sales may denote a collection of observations {y1, y2, ..., yN}.
Sales is thus a (column) vector of observations which can be indexed:
> Sales[1:5] # Vector

[1] 22.1 10.4 9.3 18.5 12.9

We like to assign variable names that correspond to the output and
features of the data when we build models, for interpretive purposes.
Hint: When writing pseudo-code, make sure to annotate the
mathematical counterparts so you know what you are doing.
Variables: R

Since we usually have multiple features for every response, we tend to
work with collections of vectors (one for each feature).
Mathematically, these features are usually joined in the form of a matrix.
This can be achieved quite naturally in the computing environment:
> X = cbind(TV, Radio, Newspaper) # Matrix: features by column
> head(X, 5)

TV Radio Newspaper
[1,] 230.1 37.8 69.2
[2,] 44.5 39.3 45.1
[3,] 17.2 45.9 69.3
[4,] 151.5 41.3 58.5
[5,] 180.8 10.8 58.4

Each row of this matrix now (say, [3,]) corresponds to the feature set for
an observation in the vector Sales (say, Sales[3]).
Statistical Learning: Regression

One way to describe the relationship between responses and
predictors is via the equation:

    Y = f(X) + ε,                                    (1)

where f(X) is the systematic component and ε is the random component.
f(.) is some fixed, but unknown function.
ε is a random error term, independent of X, with mean 0.
We may subsequently posit a model for f, say fˆ, which estimates f.
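
To make the decomposition concrete, here is a small simulation sketch (purely illustrative: the choice of f and the noise level are arbitrary) that generates data as a systematic component plus random error:

> # Illustrative simulation of Y = f(X) + error for an arbitrary 'true' f:
> set.seed(1)
> x   <- runif(100)                       # predictor values
> f_x <- 4 + 2 * sin(2 * pi * x)          # systematic component
> eps <- rnorm(100, mean = 0, sd = 0.5)   # random error, independent of X, mean 0
> y   <- f_x + eps                        # what we actually observe
> plot(x, y); lines(sort(x), f_x[order(x)], lwd = 2)   # data vs. the true f
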
Statistical Learning: Regression

[Figure: two panels, "Data" (the observations, with the point (x25, y25) highlighted) and "Systematic Component" (the true f), plotted over X and Y.]

What we have vs. what is.


Statistical Learning: Regression

[Figure: two panels, "Random Component" (error, data, and systematic component) and "Data" (true f, data, and a posited model), plotted over X and Y.]

What we posit.
Statistical Learning: Regression

[Figure: the data, a fitted model, and the resulting errors, plotted over X and Y.]

How we use what we have to get what we posit.


Statistical Learning: Classification

Compute f(X) at each point (X1, X2) on a lattice (grid).
Dashed line is where f(X) = 0.5.
[Figure: labelled observations plotted over the inputs X1 and X2.]

Classification on a lattice over the input space.


Where is the response variable in this graph?
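
A rough sketch of how one might compute a fitted classifier over such a lattice in R (the logistic regression here is just a stand-in for f, and dat2, with columns X1, X2 and a binary class, is an assumed data frame):

> # Hypothetical sketch: evaluate a fitted classifier on a grid of (X1, X2) values.
> fit  <- glm(class ~ X1 + X2, data = dat2, family = binomial)
> grid <- expand.grid(X1 = seq(-1, 1, length = 100),
+                     X2 = seq(-1, 1, length = 100))
> grid$p <- predict(fit, newdata = grid, type = "response")
> # Dashed contour at p = 0.5 marks the decision boundary:
> contour(unique(grid$X1), unique(grid$X2),
+         matrix(grid$p, nrow = 100), levels = 0.5, lty = 2)
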
Statistical Learning: Classification

[Figure: the true f over the input space (X1, X2) alongside the observed data.]

f(.) vs. the data.


Statistical Learning: Classification

[Figure: the true f compared with the estimated fˆ over the input space (X1, X2).]

f(.) vs. fˆ(.)
Why estimate f ?

Prediction: When we have X, and our assumption that
mean(ε) = 0 is valid, then we can predict the associated outcome
Y using:

    Ŷ = fˆ(X).                                       (2)

Classification: the boundary defined by fˆ(X) can be used to decide an
appropriate class.
Inference: We wish to infer aspects of the relationship between X
and Y from the data. The goal here is different, although we often
recover predictions from such studies.
Which predictors are associated with the response?
What is the relationship between the predictors and response?
Methods for finding f

[Figure: methods arranged by trade-off, from high interpretability/low flexibility to low interpretability/high flexibility: Subset Selection and the Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.]

Flexibility/Complexity vs. Interpretability.


Parametric Methods: Advertising Revisited

Let's formulate an hypothesis on the structure of f.

If we consider predicting Sales using expenditure on TV only, we'd
have:

    Y = β0 + β1 × X + ε,                             (3)

where Y denotes sales, X denotes expenditure, and {β0, β1} are
coefficients of the linear equation.
In less formal notation, we posit:

    Sales ≈ β0 + β1 × TV.                            (4)
Parametric Methods: Advertising Revisited

We thus have an equation which posits the relationship between TV and
Sales.
We need to find the parameters, (β0, β1), such that the equation best
represents the data in some sense.
One possible way to do this is to minimise the distance between
predictions under this model and the observed data, i.e.:

    MSQE = Ave((Y − Ŷ)²)
         = (1/N) Σ_{i=1}^{N} (yi − ŷi)²              (5)
         = (1/N) Σ_{i=1}^{N} (yi − (β0 + β1 xi))².

Once we have done so, we have estimates (β̂0 , β̂1 ).
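
In R, minimising (5) over (β0, β1) is exactly what least squares via lm() does; a quick sketch using the dat data frame loaded earlier:

> # Fit Sales on TV by least squares, which minimises the average squared error in (5):
> fit <- lm(Sales ~ TV, data = dat)
> coef(fit)                # estimates of beta_0 (intercept) and beta_1 (slope)
> mean(residuals(fit)^2)   # the minimised MSQE on the training data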


Parametric Methods: Advertising Revisited

By minimising the discrepancies between the model equation and
the observations, we get:

[Figure: left panel shows the fitted line fˆ through the Sales vs. TV data, with residuals marked; right panel tracks the error over the iterations of the fitting procedure.]

We can repeat this for other predictors as well.


Parametric Methods: Advertising Revisited

[Figure: scatter plots of Sales against TV, Radio, and Newspaper, each with a fitted linear regression line.]

Linear fit of Sales vs. various predictor variables.


Parametric Methods: Income Revisited

But what about the joint relationship? That is, we know that more
than one factor contributes to income.
So we include more predictors: e.g., for a given combination of
Education and Seniority, what Income can we expect to observe?
Let's just extend our equation:

    Y = β0 + β1 × X1 + β2 × X2 + ε,                  (6)

where Y denotes income, X1 denotes years of education, X2
denotes seniority, and {β0, β1, β2} are coefficients of the linear
equation. I.e.,

    Income ≈ β0 + β1 × Education + β2 × Seniority.   (7)
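
A sketch of fitting Equation (7) by least squares, assuming the Income data frame dat loaded earlier:

> # Fit the plane in Equation (7) by least squares (dat as loaded from Income.txt):
> fit <- lm(Income ~ Education + Seniority, data = dat)
> coef(fit)   # estimates of beta_0, beta_1 and beta_2
> # Expected Income at a chosen combination of the predictors:
> predict(fit, newdata = data.frame(Education = 16, Seniority = 100))
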
Parametric Methods: Income Revisited

[Figure: 3D surface of Income over Years of Education and Seniority, with the observed data points.]

f(X1, X2) vs. observations.


Parametric Methods: Income Revisited

[Figure: fitted regression plane over Years of Education and Seniority.]

Parametric equation for a plane as fˆ(X1, X2) (Equation 7).


Parametric Methods: Income Revisited
[Figure: Income plotted against Education and against Seniority.]

Income vs. various predictor variables.


Non-Parametric Methods

Consider a more mechanical hypothesis on the structure of f.

Draw a set of predictors. Pick the K observations that are 'closest' to X
and average the responses from those examples.
Mathematically, we have:

    ŷi = fˆ(xi) = Ave(yj | xj ∈ NK(xi))              (8)

for every i = 1, 2, ..., N.
The similarity argument is simple, and doesn't require us to find any
parameters (ish).
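
A bare-bones sketch of Equation (8) for a single query point (illustrative only; in practice one would typically use a package such as FNN or caret):

> # Minimal k-NN regression for one query point x0, following Equation (8):
> knn_predict <- function(x0, x, y, K = 3) {
+   nbrs <- order(abs(x - x0))[1:K]   # indices of the K observations closest to x0
+   mean(y[nbrs])                     # average of their responses
+ }
> # E.g., predict Income at Education = 16 using the Income data loaded earlier:
> knn_predict(x0 = 16, x = dat$Education, y = dat$Income, K = 3)
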
Non-Parametric Methods: Income Revisited

[Figure: k-NN fits of Income vs. Education for k = 3 and k = 10; predictions of ŷ = 35.25 and ŷ = 29.08 at a query point are marked.]
Non-Parametric Methods: Income Revisited

[Figure: k-NN regression surfaces of Income over Education and Seniority for k = 5 and k = 15.]

fˆ for k = 5 and k = 15.

Circles Revisited
Some Closing Remarks

Which approach is better?
For a given approach, which parameters are 'best'?
We never have f(.), so how do we even know we are close to
the true pattern?
What insights do these models give about the true process?
