Machine Learning - Lecture 2 (Student)
Overview
⚫ Linear Regression vs. Classification
The linear regression model assumes that the response variable Y is quantitative. But
in many situations, the response variable is instead qualitative (categorical).
⚫ Examples of Classification
1. A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions. Which of the three
conditions does the individual have?
2. An online banking service must be able to determine whether or not a
transaction being performed on the site is fraudulent, on the basis of the user’s
IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a
given disease, a biologist would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.
⚫ Why not linear regression?
Suppose that we are trying to predict the medical condition of a patient in the
emergency room on the basis of her symptoms. In this simplified example, there are
three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could
consider encoding these values as a quantitative response variable, Y, as follows:
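Following the presentation this example is drawn from, one natural coding of the three diagnoses is

Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}

This coding imposes an ordering on the outcomes and implies that the gap between stroke and drug overdose is the same as the gap between drug overdose and epileptic seizure, neither of which is justified; a different coding would give a different least squares fit. This is why linear regression is not appropriate for a qualitative response with more than two levels.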
Logistic Regression
⚫ An Example
We will illustrate the concept of classification using the simulated Default data set. We
are interested in predicting whether an individual will default on his or her credit
card payment, on the basis of annual income and monthly credit card balance. The
data set is displayed in Figure 4.1. We have plotted annual income and monthly credit
card balance for a subset of 10,000 individuals.
⚫ Estimation
The coefficients are estimated by maximizing a likelihood function:
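As a sketch of the standard setup with a single predictor X (notation as in ISLR), the logistic model and its likelihood are

p(X) = P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},
\qquad
\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr).

The estimates \hat{\beta}_0 and \hat{\beta}_1 are chosen to maximize this likelihood (maximum likelihood estimation).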
⚫ Making Predictions
Once the coefficients have been estimated, we can predict the default probability for an
individual with a balance of, say, $1,000 by plugging the estimates into the fitted model.
Likewise, a student with a credit card balance of $1,500 and an income of $40,000 has an
estimated probability of default that can be computed the same way (see the R sketch below).
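A minimal R sketch of how such predictions can be computed, assuming the Default data from the ISLR2 package (variables default, student, balance, income); the printed probabilities depend on the fitted coefficients:

library(ISLR2)
# simple logistic regression: default probability as a function of balance
fit1 <- glm(default ~ balance, data = Default, family = binomial)
predict(fit1, newdata = data.frame(balance = 1000), type = "response")
# multiple logistic regression: balance, income, and student status
fit2 <- glm(default ~ balance + income + student, data = Default, family = binomial)
predict(fit2,
        newdata = data.frame(balance = 1500, income = 40000, student = "Yes"),
        type = "response")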
⚫ Multiple Classes (K > 2)
The two-class logistic regression model has multiple-class extensions, but in
practice they tend not to be used all that often (one common form, the softmax
representation, is sketched below).
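For reference, the softmax (multinomial logistic regression) representation models the class probabilities for K classes as

P(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}, \qquad k = 1, \dots, K.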
Linear Discriminant Analysis
⚫ Why not use logistic regression?
1. When the classes are well-separated, the parameter estimates for the
logistic regression model are surprisingly unstable.
➢ Bayes' Theorem
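In the standard LDA notation, let \pi_k be the prior probability that an observation comes from class k and f_k(x) the density of X within class k; Bayes' theorem then gives the posterior

P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.

LDA estimates \pi_k and f_k (assuming each f_k is Gaussian) and assigns an observation to the class with the largest posterior probability.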
Computer Session
library(ISLR2)   # Smarket: daily S&P 500 stock market data, 2001-2005
names(Smarket)
summary(Smarket)
## (numeric summaries of the quantitative columns omitted)
## Direction
## Down:602
## Up :648
pairs(Smarket)        # scatterplot matrix of all variables
cor(Smarket[, -9])    # correlations (drop the qualitative Direction column)
attach(Smarket)       # make Year, Lag1, ..., Direction directly accessible below
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket,
family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
coef(glm.fit)
summary(glm.fit)$coef
glm.probs=predict(glm.fit, type="response")   # fitted probabilities of Direction = "Up" on the training data
glm.probs[1:10]
## (fitted probabilities for observations 1-8 omitted)
##         9        10
## 0.5176135 0.4888378
contrasts(Direction)
## Up
## Down 0
## Up 1
glm.pred=rep("Down", 1250)      # start with every prediction set to "Down"
glm.pred[glm.probs > .5]="Up"   # since Up is coded as 1, probabilities > 0.5 mean "Up"
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
(507+145)/1250
## [1] 0.5216
mean(glm.pred==Direction)
## [1] 0.5216
train=(Year<2005)                # training set: observations before 2005
Smarket.2005=Smarket[!train, ]   # held-out test set: observations from 2005
dim(Smarket.2005)
## [1] 252 9
Direction.2005=Direction[!train]
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket,
family=binomial, subset=train)   # refit using only the pre-2005 training observations
glm.probs=predict(glm.fit, Smarket.2005, type="response")   # predicted probabilities for 2005
glm.pred=rep("Down ", 252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred==Direction.2005)
## [1] 0.4801587
mean(glm.pred!=Direction.2005)
## [1] 0.5198413