
Python Training

Visualizing Data: Matplotlib

A fundamental part of the data scientist's toolkit is data visualization. Although it is very easy to create visualizations, it's much harder to produce good ones.

A wide variety of tools exist for visualizing data. We will be using the matplotlib library, which is widely used.
Simple Line Charts

Line charts are a good choice for showing trends.
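A minimal sketch of a line chart; the years and GDP values here are illustrative only, not from the slides:

from matplotlib import pyplot as plt

# Illustrative data: nominal GDP by year
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# Line chart: years on the x-axis, gdp on the y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.show()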
Bar Charts

A bar chart is a good choice when you want to show how some quantity varies among some discrete set of items.

A bar chart can also be a good choice for plotting histograms of bucketed numeric values.
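A minimal sketch of both uses; the movie names, award counts, and grades are made up purely for illustration:

from collections import Counter
from matplotlib import pyplot as plt

# Bar chart: a quantity over a discrete set of items
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
plt.bar(range(len(movies)), num_oscars)
plt.xticks(range(len(movies)), movies)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")
plt.show()

# Histogram: grades bucketed by decile
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)
plt.bar([x + 5 for x in histogram.keys()],   # shift each bar to the center of its decile
        list(histogram.values()),
        10,                                   # each bar has width 10
        edgecolor=(0, 0, 0))
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()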
Scatterplots

A scatterplot is the right choice for visualizing the relationship between two paired sets of data.
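A minimal sketch with made-up paired data (friend counts versus minutes spent on the site):

from matplotlib import pyplot as plt

# Illustrative paired data
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# Label each point with its user label
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),
                 xytext=(5, -5),              # slightly offset each label
                 textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()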
For further exploration

The matplotlib Gallery: https://matplotlib.org/2.0.2/gallery.html
Seaborn: https://seaborn.pydata.org/
Altair: https://altair-viz.github.io/
D3.js: https://d3js.org/
Bokeh: https://docs.bokeh.org/en/latest/
Linear Algebra

Linear algebra is the branch of mathematics that deals with vector spaces.

Abstractly, vectors are objects that can be added together to form new vectors and that can be multiplied by scalars, also to form new vectors.
Adding two vectors

Adding the two vectors [1, 2] and [3, 5] results in [1+3, 2+5], that is [4, 7]. Graphically, this corresponds to placing the two vectors tip to tail.
Subtracting two vectors

Subtracting is also done elementwise: [1, 2] minus [3, 5] results in [1-3, 2-5], that is [-2, -3].
Scalar multiplication

We can multiply a vector by a scalar simply by multiplying each element of the vector by that number.
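A minimal sketch of these three operations, representing vectors as plain Python lists of floats (one possible convention, not the only one):

from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements."""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements."""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c."""
    return [c * v_i for v_i in v]

assert add([1, 2], [3, 5]) == [4, 7]
assert subtract([1, 2], [3, 5]) == [-2, -3]
assert scalar_multiply(2, [1, 2, 3]) == [2, 4, 6]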
Matrices

A matrix is a two-dimensional collection of numbers. We will represent matrices as lists of lists, with each inner list having the same size and representing a row of the matrix. If A is a matrix, then A[i][j] is the element in the ith row and the jth column.
N and M of a matrix

If a matrix has n rows and k columns, we will refer to it as an n x k matrix. We can think of each row of an n x k matrix as a vector of length k, and each column as a vector of length n.
Get a row or a column as a vector

Create an identity matrix (for example, a 5 x 5 identity matrix).
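A minimal sketch of these helpers, using the list-of-lists representation described above:

from typing import List

Matrix = List[List[float]]

def shape(A: Matrix):
    """Returns (# of rows of A, # of columns of A)."""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def get_row(A: Matrix, i: int) -> List[float]:
    """Returns the i-th row of A (as a vector)."""
    return A[i]

def get_column(A: Matrix, j: int) -> List[float]:
    """Returns the j-th column of A (as a vector)."""
    return [A_i[j] for A_i in A]

def identity_matrix(n: int) -> Matrix:
    """Returns the n x n identity matrix: 1s on the diagonal, 0s elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

assert identity_matrix(5)[0] == [1, 0, 0, 0, 0]   # first row of the 5 x 5 example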
Statistics

Statistics refers to the mathematics and techniques with which we understand data.
Describing a Single Set of Data

As a first approach, you put the friend counts into a histogram using Counter and plt.bar. Unfortunately, this chart is still too difficult to slip into conversations, so you start generating some statistics.
Central tendencies – The mean

Usually, we will want some notion of where our data is centered. Most commonly we'll use the mean (or average), which is just the sum of the data divided by its count.
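One way to write this, as a minimal sketch:

from typing import List

def mean(xs: List[float]) -> float:
    """The sum of the data divided by its count."""
    return sum(xs) / len(xs)

assert mean([1, 2, 3, 4]) == 2.5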
Central tendencies – The median

The median is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).
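A minimal sketch:

from typing import List

def median(xs: List[float]) -> float:
    """Finds the middle-most value of xs."""
    sorted_xs = sorted(xs)
    n = len(xs)
    midpoint = n // 2
    if n % 2 == 1:
        # odd number of points: return the middle value
        return sorted_xs[midpoint]
    # even number of points: average the two middle values
    return (sorted_xs[midpoint - 1] + sorted_xs[midpoint]) / 2

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2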
Central
tendencies
– The
quantile

A genralization of the
median is the quantile,
which represent the
value under which a
certain percentile of the
data lie ( the median
represent the value
underwhich 50 % of
the data lies
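One simple convention for computing a quantile (a sketch; other definitions interpolate between values):

from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Returns the value under which the p-th fraction of the data lies."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert quantile(data, 0.1) == 2
assert quantile(data, 0.5) == 6   # with this convention, close to (not always equal to) the median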
Central tendencies – The mode

The mode is the value (or values) that appears most often in the data.
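A minimal sketch, returning a list since there might be more than one mode:

from collections import Counter
from typing import List

def mode(xs: List[float]) -> List[float]:
    """Returns the most common value(s)."""
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

assert set(mode([1, 2, 2, 3, 3, 4])) == {2, 3}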
Dispersion

Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps us understand the distribution of the data.

The simplest measure is the range, the difference between the max and the min. The range is 0 precisely when the max and the min are equal, which can only happen when the elements of a dataset are all the same.
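A minimal sketch:

from typing import List

def data_range(xs: List[float]) -> float:
    """The difference between the largest and smallest values."""
    return max(xs) - min(xs)

assert data_range([1, 5, 3, 9]) == 8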
Dispersion - The variance

A more complex measure of dispersion is the variance, which is computed as (roughly) the average squared deviation from the mean.
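A sketch of one common definition (the sample variance, dividing by n - 1), together with the standard deviation:

import math
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def variance(xs: List[float]) -> float:
    """(Almost) the average squared deviation from the mean."""
    assert len(xs) >= 2, "variance requires at least two elements"
    n = len(xs)
    deviations = [x - mean(xs) for x in xs]
    return sum(d ** 2 for d in deviations) / (n - 1)

def standard_deviation(xs: List[float]) -> float:
    """The square root of the variance."""
    return math.sqrt(variance(xs))

assert variance([1, 2, 3, 4, 5]) == 2.5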
Correlation

The amount of time people spend on the site is related to the number of friends they have on the site. After digging through traffic logs, we have come up with a list called daily_minutes that shows how many minutes per day each user spends on our site, and we have ordered it so that its elements correspond to the elements of our previous num_friends list. We'd like to investigate the relationship between these two metrics.
Correlation - Covariance

Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means.

A large positive covariance means that x tends to be large when y is large and small when y is small. A large negative covariance means the opposite: that x tends to be small when y is large and vice versa. A covariance close to zero means that no such relationship exists.
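A minimal sketch (using the same n - 1 convention as the sample variance above):

from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def covariance(xs: List[float], ys: List[float]) -> float:
    """Measures how xs and ys vary in tandem about their means."""
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# e.g. covariance(num_friends, daily_minutes)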
Correlation

Correlation divides the covariance by the standard deviations of both variables.

The correlation is unitless and always lies between -1 (perfect anticorrelation) and 1 (perfect correlation). A number like 0.25 represents a relatively weak positive correlation.
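A sketch, building on the covariance and standard_deviation sketches above:

from typing import List

def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance scaled by both standard deviations, so the result is unitless."""
    stdev_x = standard_deviation(xs)   # from the variance sketch above
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0   # if there is no variation, we define the correlation to be zero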
Correlation with an outlier
Simpson's paradox

One not uncommon surprise when analyzing data is Simpson's paradox, in which correlations can be misleading when confounding variables are ignored.

=> It certainly looks like the West Coast data scientists are friendlier than the East Coast data scientists.
Simpson's paradox

But when playing with the data, you discover something very strange: if you look only at people with PhDs, the East Coast data scientists have more friends on average. And if you look only at people without PhDs, the East Coast data scientists also have more friends on average.

-> Once you account for the users' degrees, the correlation goes in the opposite direction!
Probability

"The laws of probability, so true in general, so fallacious in particular." - Edward Gibbon

We write P(E) to mean "the probability of the event E."

We'll use probability theory to build models. We'll use probability theory to evaluate models. We'll use probability theory all over the place.
Dependence and Independence

We say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise, they are independent.

Two events E and F are independent if the probability that both happen is the product of the probabilities that each one happens:

P(E, F) = P(E)P(F)
Conditional probability

When two events E and F are independent, then:

P(E, F) = P(E)P(F)
Conditional probability

If they are not necessarily independent (and if the probability of F is not 0), then we define the probability of E conditional on F as:

P(E|F) = P(E, F) / P(F)

-> The probability that E happens, given that we know that F happens.
Conditional probability

When E and F are independent, you can check that this gives:

P(E|F) = P(E)
Family example

A family with two unknown children. If we assume that:
Each child is equally likely to be a boy or a girl.
The gender of the second child is independent of the gender of the first child.

Then:
Two girls: 1/4
One girl, one boy: 1/4
One boy, one girl: 1/4
Two boys: 1/4
Family example

The probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G):

P(B|G) = P(B, G) / P(G) = P(B) / P(G) = (1/4) / (1/2) = 1/2

since the event "B and G" ("both children are girls and the older child is a girl") is just the event B. (Once you know that both children are girls, it's necessarily true that the older child is a girl.)
Family example

We could also ask about the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L):

P(B|L) = P(B, L) / P(L) = P(B) / P(L) = (1/4) / (3/4) = 1/3

The family example with Python:
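One possible simulation of this example (a sketch using random.choice; the exact counts vary from run to run, but the ratios should approach 1/2 and 1/3):

import random

def random_kid() -> str:
    return random.choice(["boy", "girl"])

random.seed(0)
both_girls = 0
older_girl = 0
either_girl = 0

for _ in range(10000):
    older = random_kid()
    younger = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)    # should be close to 1/2
print("P(both | either):", both_girls / either_girl)  # should be close to 1/3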
Bayes's Theorem

Let's say we need to know the probability of some event E conditional on some other event F occurring, but we only have information about the probability of F conditional on E occurring.

P(E|F) = P(E, F) / P(F) = P(F|E)P(E) / P(F)
Bayes's Theorem

The event F can be split into the two mutually exclusive events "F and E" and "F and not E":

P(F) = P(F, E) + P(F, not E)

So that:

P(E|F) = P(F|E)P(E) / [ P(F|E)P(E) + P(F|not E)P(not E) ]
Bayes's Theorem

Imagine a certain disease that affects 1 in every 10,000 people, and imagine that there is a test for this disease that gives the correct result ("diseased" if you have the disease, "nondiseased" if you don't) 99% of the time: P(T|D) = 0.99.

What does a positive test mean? Let's use T for the event "your test is positive" and D for the event "you have the disease". Then Bayes's theorem says that the probability that you have the disease, conditional on testing positive, is:

P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|not D) P(not D) ] = 0.98%

P(T|D) = 0.99 = the probability that someone with the disease tests positive.
P(D) = 1/10000 = 0.0001 = the probability that a given person has the disease.
P(T|not D) = 0.01 = the probability that someone without the disease tests positive.
P(not D) = 0.9999 = the probability that a given person doesn't have the disease.
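The same arithmetic as a quick Python check (a sketch; the numbers are the ones given above):

p_t_given_d = 0.99       # P(T|D): test is positive given disease
p_d = 0.0001             # P(D): prevalence of the disease
p_t_given_not_d = 0.01   # P(T|not D): false positive rate
p_not_d = 1 - p_d        # P(not D)

# Bayes's theorem
p_d_given_t = (p_t_given_d * p_d) / (p_t_given_d * p_d + p_t_given_not_d * p_not_d)
print(p_d_given_t)       # roughly 0.0098, i.e. about 0.98%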
Application of Bayes's theorem:

Say we have a population of 1 million people. You'd expect 100 of them to have the disease (P(D)), and 99 of those 100 to test positive (P(T|D)). On the other hand, you'd expect 999,900 of them (P(not D)) not to have the disease, and 9,999 of those (P(T|not D) = 0.01) to test positive. That means you'd expect only 99 out of (99 + 9,999) positive testers, about 0.98%, to actually have the disease (P(D|T)).
Continuous distributions

A coin flip corresponds to a discrete distribution, one that associates positive probability with discrete outcomes. Often we'll want to model distributions across a continuum of outcomes. (For our purposes, these outcomes will always be real numbers, although that's not always the case in real life.) For example, the uniform distribution puts equal weight on all the numbers between 0 and 1.

Because there are infinitely many numbers between 0 and 1, the weight it assigns to individual points must necessarily be zero. For this reason, we represent a continuous distribution with a probability density function (pdf) such that the probability of seeing a value in a certain interval equals the integral of the density function over the interval.
Probability Density Function

A simpler way of understanding this is that if a distribution has density function f, then the probability of seeing a value between x and x + h is approximately h * f(x) if h is small.

The density function for the uniform distribution is just 1 on the interval [0, 1) and 0 elsewhere. The probability that a random variable following that distribution is between 0.2 and 0.3 is 1/10. Python's random.random() is a (pseudo)random variable with a uniform density.
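In code, a minimal sketch of that density:

def uniform_pdf(x: float) -> float:
    """Density of the uniform distribution on [0, 1)."""
    return 1 if 0 <= x < 1 else 0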
Cumulative Distribution Function (CDF)

Gives the probability that a random variable is less than or equal to a certain value.
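A minimal sketch for the uniform distribution:

def uniform_cdf(x: float) -> float:
    """Returns the probability that a uniform random variable is <= x."""
    if x < 0:
        return 0    # the uniform variable is never less than 0
    elif x < 1:
        return x    # e.g. P(X <= 0.4) = 0.4
    else:
        return 1    # the uniform variable is always less than 1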
The Normal Distribution

The normal distribution is the king of distributions. It is the classic bell-curve-shaped distribution and is completely determined by two parameters: its mean (mu) and its standard deviation (sigma). The mean indicates where the bell is centered, and the standard deviation how "wide" it is.
Implementation of the Normal Distribution - Probability Density Function

When mu = 0 and sigma = 1, it's called the standard normal distribution.
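A minimal sketch of the density:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)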
Implementation of the Normal Distribution - Cumulative Distribution Function

Sometimes we need to invert normal_cdf to find the value corresponding to a specified probability.
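A sketch using math.erf for the CDF and a binary search for its inverse (the tolerance is an arbitrary choice):

import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for a normal variable, via the error function."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p: float, mu: float = 0, sigma: float = 1,
                       tolerance: float = 0.00001) -> float:
    """Finds the approximate z with normal_cdf(z) == p, by binary search."""
    # If not standard, compute for the standard normal and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z = -10.0   # normal_cdf(-10) is (very close to) 0
    hi_z = 10.0     # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z   # midpoint is too low, search above it
        else:
            hi_z = mid_z    # midpoint is too high, search below it
    return mid_z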
For further exploration

scipy.stats: https://docs.scipy.org/doc/scipy/reference/stats.html
Reading a probability book
Gradient Descent

"Those who boast of their descent, brag on what they owe to others."

We have a function f that we need to minimize or maximize: we need to find the input v that produces the largest (or smallest) possible value.
Algorithm

1. Pick a random starting point.
2. Compute the gradient.
3. Take a small step in the direction of the gradient (the direction that causes the function to increase the most).
4. Repeat from the new point.

=> We can likewise try to minimize the function by taking small steps in the opposite direction.
If a function has a unique global minimum, this procedure is likely to find it. If a function has multiple (local) minima, this procedure might "find" the wrong one of them, in which case you might re-run the procedure from a variety of starting points. If a function has no minimum, then it's possible the procedure might go on forever.
Estimating the gradient

If f is a function of one variable, its derivative at a point x measures how f(x) changes when we make a very small change to x. It is defined as the limit of the difference quotients (f(x + h) - f(x)) / h as h goes to zero.

The derivative is the slope of the tangent line at (x, f(x)), while the difference quotient is the slope of the not-quite-tangent line that runs through (x, f(x)) and (x + h, f(x + h)). As h gets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line.
In Python:
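A sketch of estimating derivatives and gradients numerically (the step size h is chosen by the caller):

from typing import Callable, List

Vector = List[float]

def difference_quotient(f: Callable[[float], float], x: float, h: float) -> float:
    """Approximates the derivative of f at x using step size h."""
    return (f(x + h) - f(x)) / h

def partial_difference_quotient(f: Callable[[Vector], float],
                                v: Vector, i: int, h: float) -> float:
    """Approximates the i-th partial derivative of f at v."""
    w = [v_j + (h if j == i else 0) for j, v_j in enumerate(v)]
    return (f(w) - f(v)) / h

def estimate_gradient(f: Callable[[Vector], float],
                      v: Vector, h: float = 0.0001) -> Vector:
    """Approximates the gradient with one partial difference quotient per coordinate."""
    return [partial_difference_quotient(f, v, i, h) for i in range(len(v))]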
Using the gradient

The sum_of_squares function is smallest when its input v is a vector of zeros. Imagine we didn't know that. Let's use gradients to find the minimum.
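A sketch of the loop; the step size 0.01 and the 1,000 iterations are arbitrary choices for illustration:

import random
from typing import List

Vector = List[float]

def sum_of_squares_gradient(v: Vector) -> Vector:
    """The exact gradient of sum(v_i ** 2) is [2 * v_i for each i]."""
    return [2 * v_i for v_i in v]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Moves step_size in the direction of gradient from v."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

v = [random.uniform(-10, 10) for _ in range(3)]   # a random starting point

for epoch in range(1000):
    grad = sum_of_squares_gradient(v)
    v = gradient_step(v, grad, -0.01)   # negative step size: move against the gradient

print(v)   # should be very close to [0, 0, 0]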
Choosing the right step size

Although the rationale for moving against the gradient is clear, how far to move is not. Indeed, choosing the right step size is more of an art than a science. Popular options include:
Using a fixed step size
Gradually shrinking the step size over time
At each step, choosing the step size that minimizes the value of the objective function

The step size that works depends on the problem: too small, and your gradient descent will take forever; too big, and you will take giant steps that might make the function you care about get larger or even be undefined.

=> So we will need to experiment.
Using Gradient Descent to Fit Models

We use gradient descent to fit parameterized models to data. In the usual case, we will have some dataset and some (hypothesized) model for the data that depends (in a differentiable way) on one or more parameters. We will also have a loss function that measures how well the model fits our data => smaller is better.
Example

In this example we know the parameters of the linear relationship between x and y: y = 20 * x + 5. Imagine we would like to learn them from data.
Derivation - Chain Rule

The chain rule is used for calculating the derivative of composite functions. It can also be expressed in Leibniz's notation: if a variable z depends on the variable y, which itself depends on the variable x, then z, via the intermediate variable y, depends on x as well, and dz/dx = (dz/dy) * (dy/dx). This is called the chain rule.
Partial derivatives

Let's calculate how the cost function changes relative to m and b. This involves the concept of partial derivatives, which says that if there is a function of two variables, then to find the partial derivative of that function with respect to one variable, treat the other variable as a constant. This will become clearer with an example:
Calculating Gradient Descent

Let us now apply these rules of calculus to our original equation and find the derivative of the cost function with respect to both m and b. Revisiting the cost function equation:
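A sketch of how this plays out in code for the y = 20 * x + 5 example above, using the squared error per point. For error = (m * x + b) - y, the partial derivative with respect to m is 2 * error * x and with respect to b is 2 * error; the learning rate of 0.001 and 5,000 epochs are illustrative choices:

import random
from typing import List

def linear_gradient(x: float, y: float, theta: List[float]) -> List[float]:
    """Gradient of the squared error for one point, with respect to [m, b]."""
    m, b = theta
    predicted = m * x + b
    error = predicted - y
    return [2 * error * x, 2 * error]

# Data generated from the known relationship y = 20 * x + 5
inputs = [(x, 20 * x + 5) for x in range(-50, 50)]

theta = [random.uniform(-1, 1), random.uniform(-1, 1)]   # random starting parameters
learning_rate = 0.001

for epoch in range(5000):
    # Average the gradients over the whole dataset (batch gradient descent)
    grads = [linear_gradient(x, y, theta) for x, y in inputs]
    grad = [sum(g[i] for g in grads) / len(grads) for i in range(2)]
    theta = [theta_i - learning_rate * grad_i
             for theta_i, grad_i in zip(theta, grad)]

print(theta)   # should end up close to [20, 5]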
Minibatch Gradient Descent

As we mentioned before, often we'll be using gradient descent to choose the parameters of a model in a way that minimizes some notion of error. Using the previous batch approach, each gradient step requires us to make a prediction and compute the gradient for the whole dataset, which makes each step take a long time.

Now, usually these error functions are additive, which means that the predictive error on the whole dataset is simply the sum of the predictive errors for each data point.

When this is the case, we can instead apply minibatch gradient descent, which computes the gradient (and takes a step) based on a small "minibatch" sampled from the larger dataset, cycling over our data repeatedly until it reaches a stopping point.
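One way to generate minibatches from a dataset (a sketch):

import random
from typing import Iterator, List, TypeVar

T = TypeVar('T')

def minibatches(dataset: List[T], batch_size: int, shuffle: bool = True) -> Iterator[List[T]]:
    """Generates batch_size-sized minibatches from the dataset."""
    # Start indices 0, batch_size, 2 * batch_size, ...
    batch_starts = [start for start in range(0, len(dataset), batch_size)]
    if shuffle:
        random.shuffle(batch_starts)   # visit the batches in a random order
    for start in batch_starts:
        yield dataset[start:start + batch_size]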
Stochastic Gradient Descent

Another variation is stochastic gradient descent, in which you take gradient steps based on one training example at a time.

On this problem, stochastic gradient descent finds the optimal parameters in a much smaller number of epochs, but there are always tradeoffs.

Conclusion: basing gradient steps on small minibatches (or single data points) allows you to take more of them, but the gradient for a single point might lie in a very different direction from the gradient for the dataset as a whole.
Machine learning

Data science is mostly turning business problems into data problems and collecting data and understanding data and cleaning data and formatting data, after which machine learning is almost an afterthought. Even so, it's an interesting and essential afterthought that you pretty much have to know about in order to do data science.
Modeling

What is a model? It's simply a specification of a mathematical (or probabilistic) relationship that exists between different variables.

If you're trying to raise money for your social networking site, you might build a business model (likely in a spreadsheet) that takes inputs like "number of users", "ad revenue per user", and "number of employees" and outputs your annual profit for the next several years.
What is machine learning?

We'll use machine learning to refer to creating and using models that are learned from data. In other contexts this might be called predictive modeling or data mining, but we will stick with machine learning.

Typically, our goal will be to use existing data to develop models that we can use to predict various outcomes for new data, such as:
Predicting whether an email message is spam or not
Predicting whether a credit card transaction is fraudulent
Predicting which advertisement a shopper is most likely to click on

Arthur Samuel (1959): Machine learning: field of study that gives computers the ability to learn without being explicitly programmed.
Machine learning: Exercise

Tom Mitchell (1998): Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

⇒ Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

Classifying emails as spam or not spam
Watching you label emails as spam or not spam
The number (or fraction) of emails correctly classified as spam/not spam
□ None of the above: this is not a machine learning problem
Machine learning: Exercise (solution)

Tom Mitchell (1998): Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

⇒ Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

Classifying emails as spam or not spam: T
Watching you label emails as spam or not spam: E
The number (or fraction) of emails correctly classified as spam/not spam: P
□ None of the above: this is not a machine learning problem
Overfitting and Underfitting

The horizontal line shows the best fit degree 0 polynomial. It severely underfits the training data. The best fit degree 9 polynomial goes through every training data point exactly, but it very severely overfits: if we were to pick a few more data points, it would quite likely miss them by a lot. And the degree 1 line strikes a nice balance: it's pretty close to every point, and (if these data are representative) the line will likely be close to new data points as well.
Solution

The most fundamental approach involves using different data to train the model and to test the model. The simplest way to do this is to split your dataset, so that (for example) two-thirds of it is used to train the model, after which we measure the model's performance on the remaining third:
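A sketch of such a split; the proportion is whatever you pass as prob:

import random
from typing import List, Tuple, TypeVar

X = TypeVar('X')

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
    """Split data into fractions [prob, 1 - prob]."""
    data = data[:]          # shallow copy, so we don't mutate the caller's list
    random.shuffle(data)    # shuffle so the split is random
    cut = int(len(data) * prob)
    return data[:cut], data[cut:]

train, test = split_data(list(range(1000)), 0.75)
assert len(train) == 750 and len(test) == 250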
Email spam example

Imagine building a model to make a binary judgment: is this email spam or not spam?

True positive: "This message is spam, and we correctly predicted spam."
False positive: "This message is not spam, but we predicted spam."
False negative: "This message is spam, but we predicted not spam."
True negative: "This message is not spam, and we correctly predicted not spam."
Correctness – Use case

Suppose there is a test that can be given to a newborn baby that predicts, with greater than 98% accuracy, whether the newborn will ever develop leukemia: predict leukemia if and only if the baby is named Luke (which sounds sort of like "leukemia"). These days approximately 5 babies out of 1,000 are named Luke, and the lifetime prevalence of leukemia is about 1.4%, or 14 out of every 1,000 people. If we believe these two factors are independent and apply the "Luke is for leukemia" test to 1 million people, we'd expect to see a confusion matrix like:
Correctness - Accuracy

Accuracy is defined as the fraction of correct predictions.
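As a sketch, here is accuracy applied to the "Luke is for leukemia" test above; the four counts follow from multiplying the 0.5% named-Luke rate and the 1.4% leukemia rate over 1 million independent people:

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of correct predictions (true positives plus true negatives)."""
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

# "Luke is for leukemia" over 1,000,000 people (assuming independence):
tp = 70        # named Luke, develops leukemia:     0.005 * 0.014 * 1_000_000
fp = 4930      # named Luke, no leukemia:           0.005 * 0.986 * 1_000_000
fn = 13930     # not named Luke, develops leukemia: 0.995 * 0.014 * 1_000_000
tn = 981070    # not named Luke, no leukemia:       0.995 * 0.986 * 1_000_000

print(accuracy(tp, fp, fn, tn))   # about 0.981, an impressive-looking number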
Precision and recall

In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space. Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on relevance.
Correctness - Precision

Precision measures how accurate our positive predictions were:
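A minimal sketch, reusing the Luke counts from above:

def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    """Of everything we predicted positive, how much actually was positive."""
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))   # 0.014 for the Luke test: terrible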
Correctness - Recall

Recall measures what fraction of the positives our model identified:
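A minimal sketch, again with the Luke counts:

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    """Of everything that actually was positive, how much we identified."""
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))   # 0.005 for the Luke test: also terrible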
F1 Score: Harmonic mean

Sometimes precision and recall are combined into the F1 score:
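A sketch, using the precision and recall functions defined above:

def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    """Harmonic mean of precision and recall; always lies between the two."""
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)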
Choice of a model

Usually the choice of a model involves a trade-off between precision and recall. A model that predicts "yes" when it's even a little bit confident will probably have a high recall but a low precision; a model that predicts "yes" only when it's extremely confident is likely to have a low recall and a high precision.

Alternatively, you can think of this as a trade-off between false positives and false negatives. Saying "yes" too often will give you lots of false positives; saying "no" too often will give you lots of false negatives.
Feature Extraction and Selection

Features are whatever inputs we provide to our model.

Things become more interesting as your data becomes more complicated. Imagine trying to build a spam filter to predict whether an email is junk or not. Most models won't know what to do with a raw email, which is just a collection of text. You'll have to extract features. For example:
Does the email contain the word "Viagra"? (Yes or no)
How many times does the letter d appear? (Number)
What was the domain of the sender? (Choice from a discrete set of options)
Feature Extraction and Selection

Pretty much always, we'll extract features from our data that fall into one of these three categories. What's more, the type of features we have constrains the type of models we can use.
The Naive Bayes classifier is suited to yes-or-no features, like the first one in the preceding list.
Regression models require numeric features (which could include dummy variables that are 0s and 1s).
Decision trees can deal with numeric or categorical data.
For further exploration

https://www.coursera.org/learn/machine-learning
K-Nearest Neighbors

The model

Nearest neighbors is one of the simplest predictive models there is. It makes no mathematical assumptions, and it doesn't require any sort of heavy machinery. The only things it requires are:
Some notion of distance
An assumption that points that are close to one another are similar

The prediction for each new point depends only on the handful of points closest to it.
In the general situation, we have some data points and we have a corresponding set of labels. The labels could be True and False, indicating whether each input satisfies some condition like "is spam?" or "is poisonous?" or "would be enjoyable to watch?" Or they could be categories, like movie ratings (G, PG, PG-13, R, NC-17). Or they could be the names of presidential candidates. Or they could be favorite programming languages.

In our case, the data points will be vectors, which means that we can use the distance between them.

Let's say we've picked a number k like 3 or 5. Then when we want to classify some new data point, we find the k nearest labeled points and let them vote on the new output.

But this doesn't do anything intelligent with ties. For example, imagine we're rating movies and the five nearest movies are rated G, G, PG, PG, and R. Then G has two votes and PG also has two votes. In that case, we have several options:
Pick one of the winners at random.
Weight the votes by distance and pick the weighted winner.
Reduce k until we find a unique winner.

We will implement the third, as sketched below:
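A sketch of that third option, assuming labels arrive ordered from nearest to farthest:

from collections import Counter
from typing import List

def majority_vote(labels: List[str]) -> str:
    """Assumes that labels are ordered from nearest to farthest."""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner                      # unique winner, so return it
    return majority_vote(labels[:-1])      # tie: drop the farthest label and try again

# Tie between G and PG, broken by successively dropping the farthest label
assert majority_vote(['G', 'G', 'PG', 'PG', 'R']) == 'G'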
Create a classifier
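Given the majority_vote sketch above, a classifier could look like this; LabeledPoint and the Euclidean distance are assumptions about how the data is represented:

import math
from collections import namedtuple
from typing import List

# Assumed representation: a point (vector of floats) together with its label
LabeledPoint = namedtuple("LabeledPoint", ["point", "label"])

def distance(v: List[float], w: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def knn_classify(k: int,
                 labeled_points: List[LabeledPoint],
                 new_point: List[float]) -> str:
    # Order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points,
                         key=lambda lp: distance(lp.point, new_point))
    # Take the labels of the k closest and let them vote
    k_nearest_labels = [lp.label for lp in by_distance[:k]]
    return majority_vote(k_nearest_labels)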
Example: The Iris Dataset

The Iris dataset is a staple of machine learning. It contains a bunch of measurements for 150 flowers representing three species of iris. For each flower we have its petal length, petal width, sepal length, and sepal width, as well as its species.
We can download it from: https://archive.ics.uci.edu/ml/datasets/Iris
For example, the first row looks like:
5.1,3.5,1.4,0.2,Iris-setosa
Training set

Apply k-nearest neighbors

The training set will be the "neighbors" that we will use to classify the points in the test set. We just need to choose a value for k, the number of neighbors who get to vote. Too small (think k = 1), and we let outliers have too much influence; too large (think k = 105), and we just predict the most common class in the dataset.

In a real application (and with more data), we might create a separate validation set and use it to choose k. Here we will use k = 5.
Prediction

On this simple dataset, the model predicts almost perfectly. There's one versicolor for which it predicts virginica, but otherwise it gets things exactly right.
Naive Bayes

A social network isn't much good if people can't network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most of your members are responsible citizens who send only well-received "how's it going?" messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs.

Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.
A really dumb spam filter

Imagine a "universe" that consists of receiving a message chosen randomly from all possible messages. Let S be the event "the message is spam" and B be the event "the message contains the word bitcoin." Then Bayes's theorem tells us that the probability that the message is spam conditional on containing the word bitcoin is:

P(S|B) = [ P(B|S)P(S) ] / [ P(B|S)P(S) + P(B|not S)P(not S) ]

If we have a large collection of messages we know are spam, and a large collection of messages we know are not spam, then we can easily estimate P(B|S) and P(B|not S). If we further assume that any message is equally likely to be spam or not spam (so that P(S) = P(not S) = 0.5), then:

P(S|B) = P(B|S) / [ P(B|S) + P(B|not S) ]
Application

If 50% of spam messages contain the word bitcoin, but only 1% of nonspam messages do, then the probability that any given bitcoin-containing message is spam is:

0.5 / (0.5 + 0.01) = 98%
A More Sophisticated Spam Filter

Imagine now that we have a vocabulary of many words W1 ... Wn. To move this into the realm of probability theory, we'll write Xi for the event "a message contains the word Wi."

Also imagine that (through some unspecified-at-this-point process) we've come up with an estimate P(Xi|S) for the probability that a spam message contains the ith word, and a similar estimate P(Xi|not S) for the probability that a nonspam message contains the ith word.
The formula
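The slide presumably shows the standard Naive Bayes formula here. The key "naive" assumption is that, conditional on the message being spam (or not spam), the word events Xi are independent, so their probabilities multiply:

P(X1 = x1, ..., Xn = xn | S) = P(X1 = x1 | S) * ... * P(Xn = xn | S)

and, with P(S) = P(not S) = 0.5 as before:

P(S | X = x) = [ product of P(Xi = xi | S) ] / [ product of P(Xi = xi | S) + product of P(Xi = xi | not S) ]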
Application
Underflow

In practice, you usually want to avoid multiplying lots of probabilities together, to avoid a problem called underflow, in which computers don't deal well with floating-point numbers that are too close to zero. Recalling from algebra that log(ab) = log(a) + log(b) and that exp(log x) = x, we usually compute p1 * ... * pn as:

exp(log(p1) + ... + log(pn))
Estimating the probabilities

The only challenge left is coming up with estimates for P(Xi|S) and P(Xi|not S), the probabilities that a spam message (or nonspam message) contains the word Wi. If we have a fair number of "training" messages labeled as spam and not-spam, an obvious first try is to estimate P(Xi|S) simply as the fraction of spam messages containing the word Wi.
The problem

This causes a big problem, though. Imagine that in our training set the vocabulary word "data" only occurs in nonspam messages. Then we'd estimate P(data|S) = 0. The result is that our Naive Bayes classifier would always assign spam probability 0 to any message containing the word "data", even a message like "data on cheap bitcoin and authentic rolex watches." To avoid this problem, we usually use some kind of smoothing.

In particular, we'll choose a pseudocount, k, and estimate the probability of seeing the ith word in a spam message as:

P(Xi|S) = (k + number of spams containing Wi) / (2k + number of spams)
Example

For example, if "data" occurs in 0 of 98 spam documents, and if k is 1, we estimate P(data|S) as 1/100 = 0.01, which allows our classifier to still assign some nonzero spam probability to messages that contain the word "data":

P(data|S) = (k + number of spams containing "data") / (2k + number of spams) = (1 + 0) / (2 + 98) = 1/100
Implementation
Build our classifier

Now we have all the pieces we need to build our classifier. First, let's create a simple function to tokenize messages into distinct words. We'll first convert each message to lowercase; use re.findall() to extract "words" consisting of letters, numbers, and apostrophes; and finally use set() to get just the distinct words.
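A sketch of such a tokenizer:

import re
from typing import Set

def tokenize(text: str) -> Set[str]:
    text = text.lower()                          # convert to lowercase
    all_words = re.findall("[a-z0-9']+", text)   # extract words: letters, digits, apostrophes
    return set(all_words)                        # keep only the distinct words

assert tokenize("Data Science is science") == {"data", "science", "is"}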
The classifier

As our classifier needs to keep track of tokens, counts, and labels from the training data, we'll make it a class. The constructor takes one parameter, the pseudocount to use when computing probabilities. It also initializes an empty set of tokens, counters to track how often each token is seen in spam messages and ham messages, and counts of how many spam and ham (not spam) messages it was trained on:
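A sketch of that class; Message is an assumed small container for (text, is_spam) pairs, and tokenize is the function sketched above:

from collections import defaultdict, namedtuple
from typing import Iterable

# Assumed representation of a training example
Message = namedtuple("Message", ["text", "is_spam"])

class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k                                 # smoothing pseudocount
        self.tokens = set()                        # all tokens seen in training
        self.token_spam_counts = defaultdict(int)  # token -> # of spam messages containing it
        self.token_ham_counts = defaultdict(int)   # token -> # of ham messages containing it
        self.spam_messages = self.ham_messages = 0

    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            # Count the message itself
            if message.is_spam:
                self.spam_messages += 1
            else:
                self.ham_messages += 1
            # Count each distinct token in the message
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] += 1
                else:
                    self.token_ham_counts[token] += 1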
The probabilities

We will want to predict P(spam | token). As we saw earlier, to apply Bayes's theorem we need to know P(token | spam) and P(token | ham) for each token in the vocabulary.
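Continuing the sketch, a helper method (belonging to the NaiveBayesClassifier class above) that returns the smoothed estimates:

    def _probabilities(self, token: str):
        """Returns (P(token | spam), P(token | ham)) using pseudocount smoothing."""
        spam = self.token_spam_counts[token]
        ham = self.token_ham_counts[token]
        p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
        p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)
        return p_token_spam, p_token_ham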
The prediction

Finally, we are ready to write our predict method. As mentioned earlier, rather than multiplying together lots of small probabilities, we'll instead sum up the log probabilities. And then we have a classifier!
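A sketch of that method (again belonging to the class above, and assuming math has been imported); it sums log probabilities over the whole vocabulary, including tokens that do not appear in the message:

    def predict(self, text: str) -> float:
        text_tokens = tokenize(text)
        log_prob_if_spam = log_prob_if_ham = 0.0
        # Iterate over every token in our vocabulary
        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)
            if token in text_tokens:
                # Token appears in the message: add the log of P(seeing it)
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_ham += math.log(prob_if_ham)
            else:
                # Token doesn't appear: add the log of P(not seeing it) = log(1 - P)
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_ham += math.log(1.0 - prob_if_ham)
        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_ham = math.exp(log_prob_if_ham)
        return prob_if_spam / (prob_if_spam + prob_if_ham)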
Testing our model

Let's make sure our model works by writing some unit tests for it.
Using our model

We will use the SpamAssassin public corpus: https://spamassassin.apache.org/old/publiccorpus/
We will take the files prefixed with 20021010.
To keep things really simple, we will just look at the subject lines of each email.
Get the data

How do we identify the subject line? When we look through the files, they all seem to start with "Subject:".
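A sketch of pulling out just the subject lines, assuming the corpus has been downloaded and extracted into a local emails/ directory (the path, the folder-name check, and errors="ignore" are assumptions):

import glob
from typing import List

# Assumed location of the extracted spam/ham folders
path = "emails/*/*"

data: List[Message] = []   # Message as defined in the classifier sketch above

for filename in glob.glob(path):
    is_spam = "ham" not in filename   # assumption: folders named *ham* hold non-spam
    # Some emails contain odd characters; ignore decoding errors for simplicity
    with open(filename, errors="ignore") as email_file:
        for line in email_file:
            if line.startswith("Subject:"):
                subject = line[len("Subject:"):].strip()
                data.append(Message(subject, is_spam))
                break   # only the first Subject line per file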
Split the data

Now we can split the data into training data and test data, and then we are ready to build a classifier.
Prediction

Let's generate some predictions and check how our model does.
Performance

This gives:
84 true positives (spam classified as spam)
25 false positives (ham classified as spam)
703 true negatives (ham classified as ham)
44 false negatives (spam classified as ham)

Precision = 84 / (84 + 25) = 77%
Recall = 84 / (84 + 44) = 65%

Which are not bad numbers for such a simple model. (You could do better if you looked at more than the subject lines.)
Performance

We can also inspect the model's innards to see which words are least and most indicative of spam. The spammiest words include things like sale, mortgage, money, and rates, whereas the hammiest words include things like spambayes, users, and apt. So that also gives us some intuitive confidence that our model is basically doing the right thing.
How to get better performance

Get more data in the training set.
Look at the message content, not just the subject line.
The classifier takes into account every word that appears in the training set, even words that appear only once. Modify the classifier to accept an optional min_count threshold and ignore tokens that don't appear at least that many times.
