
Lecture 2: Classification I

Overview
⚫ Linear Regression vs. Classification
The linear regression model assumes that the response variable Y is quantitative. But
in many situations, the response variable is instead qualitative (categorical).
⚫ Examples of Classification
1. A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions. Which of the three
conditions does the individual have?
2. An online banking service must be able to determine whether or not a
transaction being performed on the site is fraudulent, on the basis of the user’s
IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a
given disease, a biologist would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.
⚫ Why not linear regression?
Suppose that we are trying to predict the medical condition of a patient in the
emergency room on the basis of her symptoms. In this simplified example, there are
three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could
consider encoding these values as a quantitative response variable, Y, as follows:

Y = 1 if stroke; Y = 2 if drug overdose; Y = 3 if epileptic seizure.

Unfortunately, this coding implies an ordering on the outcomes and insists that the
gap between stroke and drug overdose is the same as the gap between drug overdose
and epileptic seizure. A different, equally sensible coding would imply a different
relationship among the conditions and yield a different fitted model, as the sketch
below illustrates.
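A quick illustration of this point: under two equally valid codings of the same
three-class outcome, least squares produces different fitted models. The sketch below
uses simulated data; all values are illustrative assumptions, not the lecture's example.

set.seed(1)
x <- rnorm(30)                                         # simulated predictor
cls <- sample(c("stroke", "overdose", "seizure"), 30, replace = TRUE)
y1 <- c(stroke = 1, overdose = 2, seizure = 3)[cls]    # one coding of Y
y2 <- c(stroke = 2, overdose = 1, seizure = 3)[cls]    # an equally valid coding
coef(lm(y1 ~ x))                                       # the two codings give
coef(lm(y2 ~ x))                                       # different linear models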

⚫ Three of the most widely used classifiers: logistic regression, linear
discriminant analysis, and K-nearest neighbors.

Logistic Regression
⚫ An Example
We will illustrate the concept of classification using the simulated Default data set. We
are interested in predicting whether an individual will default on his or her credit
card payment, on the basis of annual income and monthly credit card balance. The
data set is displayed in Figure 4.1. We have plotted annual income and monthly credit
card balance for a subset of 10,000 individuals.
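A minimal sketch of a plot in the spirit of Figure 4.1, assuming the ISLR2 package
(which contains the Default data) is installed; the color choices here are our
assumptions, not the textbook's.

library(ISLR2)
plot(Default$balance, Default$income,
     col = ifelse(Default$default == "Yes", "orange", "blue"),
     pch = 20, xlab = "Balance", ylab = "Income")   # defaulters shown in orange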

⚫ The Logistic Model

➢ For the Default data, logistic regression models the probability of default. For
example, the probability of default given balance can be written as

Pr(default = Yes | balance),

which we abbreviate p(balance); its values lie between 0 and 1.

➢ Using a linear regression model to represent these probabilities,

p(X) = β0 + β1X,

is problematic: for balances close to zero it can predict negative probabilities, and
for very large balances it can predict probabilities above 1. Logistic regression
instead uses the logistic function,

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)),

whose output always lies between 0 and 1.

➢ Taking log-odds, or the logit:

log[ p(X) / (1 − p(X)) ] = β0 + β1X,

so the logit is linear in X.
⚫ Estimation
The coefficients β0 and β1 are estimated by maximum likelihood: we seek estimates
for which the predicted probability p̂(xi) of default for each individual matches
that individual's observed default status as closely as possible. Formally, we
maximize the likelihood function

ℓ(β0, β1) = ∏(i: yi = 1) p(xi) × ∏(i′: yi′ = 0) (1 − p(xi′)).

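As a sketch of what maximizing this likelihood involves, the following simulates a
simple logistic regression problem and maximizes the log-likelihood numerically with
optim(); the true coefficients (−0.5, 1.2) are arbitrary choices for the simulation.

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))   # simulated 0/1 response
negloglik <- function(beta) {                 # negative log-likelihood of (beta0, beta1)
  p <- plogis(beta[1] + beta[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
optim(c(0, 0), negloglik)$par                 # numerical MLE
coef(glm(y ~ x, family = binomial))           # agrees with glm's estimates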
⚫ Making Predictions

Using the maximum likelihood estimates for the Default data (β̂0 = −10.6513 and
β̂1 = 0.0055, as in ISLR Table 4.1), we predict that the default probability for an
individual with a balance of $1,000 is

p̂(1000) = e^(−10.6513 + 0.0055 × 1000) / (1 + e^(−10.6513 + 0.0055 × 1000)) ≈ 0.00576,

which is below 1%. For a balance of $2,000 the predicted probability rises to about 0.586.

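The same numbers can be checked in one line with R's built-in logistic CDF, plogis();
the coefficient values are taken from ISLR's Table 4.1, as noted above.

plogis(-10.6513 + 0.0055 * 1000)   # ~0.00576 for a balance of $1,000
plogis(-10.6513 + 0.0055 * 2000)   # ~0.586 for a balance of $2,000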
⚫ Multiple Logistic Regression

With p predictors X = (X1, ..., Xp), the model becomes

log[ p(X) / (1 − p(X)) ] = β0 + β1X1 + ... + βpXp.

For example, a student with a credit card balance of $1,500 and an income of $40,000
has an estimated probability of default of about 0.058.
⚫ Multiple-Class (K > 2)
The two-class logistic regression model has multiple-class extensions (multinomial
logistic regression), but in practice they tend not to be used all that often.
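For completeness, here is a brief sketch of one such extension using nnet::multinom
on the built-in iris data, whose response has three classes; this example is our
assumption, not one from the lecture.

library(nnet)                                         # recommended package shipped with R
mfit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
head(predict(mfit, type = "probs"))                   # one probability column per class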

Linear Discriminant Analysis
⚫ Why not use logistic regression?
1. When the classes are well separated, the parameter estimates for the
logistic regression model are surprisingly unstable.
2. If n is small and the distribution of the predictors X is approximately
normal in each of the classes, the linear discriminant model is again more
stable than the logistic regression model.
3. As mentioned above, linear discriminant analysis is popular when we have
more than two response classes.
⚫ Using Bayes' Theorem for Classification
➢ Rule of multiplication:

P(A ∩ B) = P(A) P(B | A).

➢ Bayes' Thm:

P(A | B) = P(B | A) P(A) / P(B).

➢ Let πk represent the overall or prior probability that a randomly chosen
observation comes from the kth class; this is the probability that a given
observation is associated with the kth category of the response variable Y.
Let fk(x) ≡ Pr(X = x | Y = k) denote the density function of X for an observation that
comes from the kth class. In other words, fk(x) is relatively large if there is a
high probability that an observation in the kth class has X ≈ x. Bayes'
Thm states that

Pr(Y = k | X = x) = πk fk(x) / Σ(l = 1 to K) πl fl(x);

we will use the abbreviation pk(x) = Pr(Y = k | X = x). In general, estimating πk is
easy if we have a random sample of Ys from the population: we simply
compute the fraction of the training observations that belong to the kth
class. However, estimating fk(x) tends to be more challenging, unless we
assume some simple forms for these densities. We refer to pk(x) as the
posterior probability that an observation X = x belongs to the kth
class. That is, it is the probability that the observation belongs to the kth
class, given the predictor value for that observation. (to be continued.....)
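Before continuing, a toy sketch of the Bayes theorem computation above, assuming
(purely for illustration) a single predictor whose class densities are normal with a
shared standard deviation, as LDA will assume later; the priors and means are made-up values.

pi_k <- c(0.3, 0.7)            # assumed prior probabilities pi_k
mu_k <- c(-1, 1); sigma <- 1   # assumed class means, shared sd
posterior <- function(x) {
  fx <- dnorm(x, mean = mu_k, sd = sigma)  # f_k(x) for k = 1, 2
  pi_k * fx / sum(pi_k * fx)               # p_k(x) = posterior probabilities
}
posterior(0.5)                 # e.g., posteriors for an observation at x = 0.5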

Computer Session
library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.3.2

names(Smarket)

## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"


## [7] "Volume" "Today" "Direction"

summary(Smarket)

## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##

pairs(Smarket)

cor(Smarket[, -9])

## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
attach(Smarket)
plot(Volume)

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket,
family=binomial)
summary(glm.fit)

##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3

coef(glm.fit)

## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
## Volume
## 0.135440659

summary(glm.fit)$coef

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3 0.011085108 0.04993854 0.2219750 0.8243333
## Lag4 0.009358938 0.04997413 0.1872757 0.8514445
## Lag5 0.010313068 0.04951146 0.2082966 0.8349974
## Volume 0.135440659 0.15835970 0.8552723 0.3924004

glm.probs=predict(glm.fit, type="response")
glm.probs[1:10]

## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378

contrasts(Direction)

## Up
## Down 0
## Up 1

glm.pred=rep("Down", 1250)
glm.pred[glm.probs > .5]="Up"
table(glm.pred, Direction)

## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507

(507+145)/1250

## [1] 0.5216

mean(glm.pred==Direction)

## [1] 0.5216

train=(Year<2005)
Smarket.2005=Smarket[!train, ]
dim(Smarket.2005)

## [1] 252 9

Direction.2005=Direction[!train]
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket,
family=binomial, subset=train)
glm.probs=predict(glm.fit, Smarket.2005, type="response")
glm.pred=rep("Down ", 252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred, Direction.2005)

## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44

mean(glm.pred==Direction.2005)

## [1] 0.4801587

mean(glm.pred!=Direction.2005)

## [1] 0.5198413
