
14: Classification, Statistical Sins


Lecture 14: Classification, Statistical Sins

6.0002 LECTURE 14 1
Announcements
§Reading
◦ Chapter 21
§Course evaluations
◦ Online evaluation now through noon on Friday, December 16
§Study code for the final exam will be made available later today

Compare to KNN Results (from Monday)

Average of 10 80/20 splits
                   KNN (k=3)    LR
Accuracy           0.744        0.804
Sensitivity        0.629        0.719
Specificity        0.829        0.859
Pos. Pred. Val.    0.728        0.767

Average of LOO testing
                   KNN (k=3)    LR
Accuracy           0.769        0.786
Sensitivity        0.663        0.705
Specificity        0.842        0.842
Pos. Pred. Val.    0.743        0.754

§Performance is not much different
§Logistic regression is slightly better
§Logistic regression provides insight about variables
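
The four measures in the comparison all come from the same confusion counts. A minimal sketch of how they relate, assuming hypothetical counts (the function name and the numbers are illustrative, not taken from the lecture's runs):

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the four measures from confusion counts.

    tp/fp = true/false positives, tn/fn = true/false negatives.
    Assumes every denominator is nonzero.
    """
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'sensitivity': tp / (tp + fn),    # fraction of actual positives found
        'specificity': tn / (tn + fp),    # fraction of actual negatives found
        'pos_pred_val': tp / (tp + fp),   # fraction of positive calls that are right
    }

# Hypothetical counts, chosen only for illustration
m = classification_metrics(tp=60, fp=20, tn=100, fn=30)
for name, value in m.items():
    print(f'{name} = {value:.3f}')
```

Sensitivity and specificity use disjoint parts of the data (actual positives versus actual negatives), which is why one can rise while the other falls.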

Looking at Feature Weights
model.classes_ = ['Died' 'Survived']
For label Survived
C1 = 1.66761946545
C2 = 0.460354552452
C3 = -0.50338282535
age = -0.0314481062387
male gender = -2.39514860929

Be wary of reading too much into the weights
Features are often correlated

L1 regression tends to drive the weight of one of a set of correlated variables to zero

L2 (default) regression spreads the weight across the correlated variables

Correlated Features, an Example
§c1 + c2 + c3 = 1
◦ I.e., values are not independent
◦ Is being in 1st class good, or being in the other classes bad?
§Suppose we eliminate c1?
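
Because c1 + c2 + c3 = 1, any model over all three indicators can be rewritten without c1 by substituting c1 = 1 - c2 - c3. A sketch of that algebra, using the C1, C2, C3 weights reported under Looking at Feature Weights (the function names are illustrative). A refit, regularized model need not land on exactly these implied values:

```python
# Weights for label Survived, as reported for the original features
w_c1, w_c2, w_c3 = 1.66761946545, 0.460354552452, -0.50338282535

def class_term_original(c1, c2, c3):
    """Cabin-class contribution to the log-odds with all three indicators."""
    return w_c1 * c1 + w_c2 * c2 + w_c3 * c3

def class_term_without_c1(c2, c3):
    """Same contribution after substituting c1 = 1 - c2 - c3.

    The constant w_c1 would be absorbed into the intercept.
    """
    return w_c1 + (w_c2 - w_c1) * c2 + (w_c3 - w_c1) * c3

# The two forms agree for every passenger class
for c1, c2, c3 in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    assert abs(class_term_original(c1, c2, c3)
               - class_term_without_c1(c2, c3)) < 1e-12

implied_c2 = w_c2 - w_c1   # negative: measured relative to 1st class
implied_c3 = w_c3 - w_c1
print('implied C2 weight:', round(implied_c2, 3))
print('implied C3 weight:', round(implied_c3, 3))
```

With c1 removed, the remaining class weights are read relative to 1st class, so a positive C2 weight can turn negative without the model's predictions changing.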

Comparative Results

Average of 20 80/20 splits LR

                   Original Features    Modified Features
Accuracy           0.778                0.779
Sensitivity        0.687                0.674
Specificity        0.842                0.853
Pos. Pred. Val.    0.755                0.765

model.classes_ = ['Died' 'Survived'] (both models)
For label Survived
                   Original Features    Modified Features
C1                 1.68864047459        (eliminated)
C2                 0.390605976351       -1.08356816806
C3                 -0.46270349333       -1.92251427055
age                -0.0307090135358     -0.026056041377
male gender        -2.41191131088       -2.36239279331

Changing the Cutoff

                   Try p = 0.1    Try p = 0.9
Accuracy           0.493          0.656
Sensitivity        0.976          0.176
Specificity        0.161          0.984
Pos. Pred. Val.    0.444          0.882
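
The trade-off can be reproduced on a small scale. A sketch, assuming a made-up list of predicted probabilities and true labels (not the Titanic data), where an example is called positive when its probability exceeds the cutoff p:

```python
def apply_cutoff(probs, p):
    """Label an example positive when its predicted probability exceeds p."""
    return [prob > p for prob in probs]

def confusion_counts(predicted, actual):
    """Return (tp, fp, tn, fn) for parallel lists of booleans."""
    tp = sum(1 for pr, ac in zip(predicted, actual) if pr and ac)
    fp = sum(1 for pr, ac in zip(predicted, actual) if pr and not ac)
    tn = sum(1 for pr, ac in zip(predicted, actual) if not pr and not ac)
    fn = sum(1 for pr, ac in zip(predicted, actual) if not pr and ac)
    return tp, fp, tn, fn

# Hypothetical predicted probabilities and true labels (True = positive)
probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.05]
actual = [True, True, False, True, False, True, False, False]

for p in (0.1, 0.5, 0.9):
    tp, fp, tn, fn = confusion_counts(apply_cutoff(probs, p), actual)
    print(f'p = {p}: sensitivity = {tp / (tp + fn):.2f}, '
          f'specificity = {tn / (tn + fp):.2f}')
```

A low cutoff calls almost everything positive, so sensitivity is high and specificity low; a high cutoff does the reverse.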

ROC (Receiver Operating Characteristic)
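
An ROC curve plots sensitivity against 1 - specificity as the cutoff sweeps over all values, and the area under it summarizes the classifier in one number. A minimal sketch with hypothetical scores and labels; the helper names are illustrative and this is not the lecture's own plotting code:

```python
def roc_points(probs, actual):
    """Return (1 - specificity, sensitivity) pairs, one per cutoff."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing called positive
    # Lower the cutoff past each distinct score, from highest to lowest
    for cutoff in sorted(set(probs), reverse=True):
        tp = sum(1 for pr, ac in zip(probs, actual) if pr >= cutoff and ac)
        fp = sum(1 for pr, ac in zip(probs, actual) if pr >= cutoff and not ac)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under (x, y) points ordered by increasing x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true labels (True = positive)
probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.05]
actual = [True, True, False, True, False, True, False, False]

points = roc_points(probs, actual)
auroc = area_under_curve(points)
print('AUROC =', auroc)
```

A classifier that guesses at random traces the diagonal, with area 0.5; a perfect one reaches the top-left corner, with area 1.0.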

Output

There are Three Kinds of Lies

LIES
DAMNED LIES
and
STATISTICS

Humans and Statistics

Human Mind Statistics

Image of brain © source unknown. All rights reserved. This content is excluded from our
Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

Humans and Statistics

“If you can't prove what you want to prove,
demonstrate something else and pretend they are
the same thing. In the daze that follows the collision
of statistics with the human mind, hardly anyone will
notice the difference.” – Darrell Huff

Anscombe’s Quartet
§Four groups each containing 11 x, y pairs

Summary Statistics
§Summary statistics for the four groups are identical
◦ Mean x = 9.0
◦ Mean y = 7.5
◦ Variance of x = 10.0
◦ Variance of y = 3.75
◦ Linear regression model: y = 0.5x + 3
§Are the four data sets really similar?
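
These summary statistics can be checked directly. A sketch using the standard published values of Anscombe's quartet (the lecture's own data file is not shown, so the listing below is an assumption about what it contains). The variances quoted match the population variance, so `pvariance` is used:

```python
from statistics import mean, pvariance

# Standard published values of Anscombe's quartet; groups I-III share x
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I': (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares fit."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

summary = {}
for name, (xs, ys) in quartet.items():
    slope, intercept = least_squares_line(xs, ys)
    summary[name] = (mean(xs), mean(ys), pvariance(xs), pvariance(ys),
                     slope, intercept)
    print(name, [round(v, 2) for v in summary[name]])
```

The four rows of output agree to about two decimal places, even though the groups look nothing alike once plotted.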

Let’s Plot the Data

Moral: Statistics about the data are not the same as the data
Moral: Use visualization tools to look at the data itself

Lying with Pictures

Telling the Truth with Pictures

Moral: Look carefully at the axis labels and scales

Lying with Pictures

Moral: Ask whether the things being compared are actually comparable
Screenshot of Fox News © 20th / 21st Century Fox. All rights reserved. This content is excluded from
our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

Garbage In, Garbage Out

“On two occasions I have been asked [by
members of Parliament], ‘Pray, Mr. Babbage,
if you put into the machine wrong figures,
will the right answers come out?’ I am not
able rightly to apprehend the kind of
confusion of ideas that could provoke such a
question.” – Charles Babbage (1791-1871)

Calhoun’s Response to Errors in Data

“there were so many errors they balanced one another, and led to
the same conclusion as if they were all correct.”
Were the measurement errors unbiased and independent of each
other, and therefore almost identically distributed on either
side of the mean?
No, later analysis showed that the errors were not random but
systematic.
“it was the census that was insane and not the colored people.” –
James Freeman Clarke
Moral: Analysis of bad data can lead to dangerous conclusions.
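
The difference between errors that balance out and errors that do not can be seen in a small simulation. A sketch with made-up numbers (the true value, noise level, and bias are all assumptions for illustration, unrelated to the census data):

```python
import random

random.seed(0)      # fixed seed so the run is repeatable
true_value = 100.0  # hypothetical quantity being measured
n = 10_000

# Unbiased errors: centered on zero, independent of each other
unbiased = [true_value + random.gauss(0, 5) for _ in range(n)]
# Systematic errors: same spread, but shifted in one direction
systematic = [true_value + random.gauss(2, 5) for _ in range(n)]

unbiased_mean = sum(unbiased) / n
systematic_mean = sum(systematic) / n
print(f'mean with unbiased errors:   {unbiased_mean:.2f}')
print(f'mean with systematic errors: {systematic_mean:.2f}')
```

Averaging many measurements shrinks unbiased errors toward zero, but a systematic shift survives no matter how many measurements are taken.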

Sampling
§All statistical techniques are based upon the
assumption that by sampling a subset of a population
we can infer things about the population as a whole
§As we have seen, if random sampling is used, one can
make meaningful mathematical statements about the
expected relation of the sample to the entire
population
§Easy to get random samples in simulations
§Not so easy in the field, where some examples are
more convenient to acquire than others
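
A simulation makes the contrast concrete. A sketch, assuming a hypothetical population ordered so that the easiest examples to acquire come first and happen to have the smallest values:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: values 0..999, easiest to reach first
population = list(range(1000))
pop_mean = sum(population) / len(population)

# Random sample: every member is equally likely to be chosen
random_sample = random.sample(population, 100)
random_mean = sum(random_sample) / len(random_sample)

# Convenience sample: just take the first 100 that come to hand
convenience_sample = population[:100]
convenience_mean = sum(convenience_sample) / len(convenience_sample)

print(f'population mean:         {pop_mean:.1f}')
print(f'random sample mean:      {random_mean:.1f}')
print(f'convenience sample mean: {convenience_mean:.1f}')
```

The random sample's mean lands near the population mean; the convenience sample's mean reflects only how the sample was gathered.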

Non-representative Sampling
§“Convenience sampling” is not usually random, e.g.,
◦ Survivor bias, e.g., course evaluations at end of course or
grading final exam in 6.0002 on a strict curve
◦ Non-response bias, e.g., opinion polls conducted by mail
or online
§When samples are not random and independent, we can
still do things like compute means and standard
deviations, but we should not draw conclusions from
them using things like the empirical rule and central
limit theorem.
§Moral: Understand how data was collected, and
whether assumptions used in the analysis are satisfied.
If not, be wary.

MIT OpenCourseWare
https://ocw.mit.edu

6.0002 Introduction to Computational Thinking and Data Science


Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
