
Python Training

Visualizing Data: Matplotlib

A fundamental part of the data scientist's toolkit is data visualization. Although it is very easy to create visualizations, it's much harder to produce good ones.

A wide variety of tools exist for visualizing data. We will be using the matplotlib library, which is widely used.
Simple Line Charts

Line charts are a good choice for showing trends.
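A minimal sketch of a line chart; the years and GDP values here are illustrative only, not from the slides:

from matplotlib import pyplot as plt

# Illustrative data: nominal GDP by year
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# Line chart: years on the x-axis, gdp on the y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.show()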
Bar Charts

A bar chart is a good choice when you want to show how some quantity varies among some discrete set of items.

A bar chart can also be a good choice for plotting histograms of bucketed numeric values.
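A minimal sketch of both uses; the movie names, award counts, and grades are made up purely for illustration:

from collections import Counter
from matplotlib import pyplot as plt

# Bar chart: a quantity over a discrete set of items
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
plt.bar(range(len(movies)), num_oscars)
plt.xticks(range(len(movies)), movies)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")
plt.show()

# Histogram: grades bucketed by decile
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)
plt.bar([x + 5 for x in histogram.keys()],   # shift each bar to the center of its decile
        list(histogram.values()),
        10,                                   # each bar has width 10
        edgecolor=(0, 0, 0))
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()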
Scatterplots

A scatterplot is the right choice for visualizing the relationship between two paired sets of data.
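A minimal sketch with made-up paired data (friend counts versus minutes spent on the site):

from matplotlib import pyplot as plt

# Illustrative paired data
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# Label each point with its user label
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),
                 xytext=(5, -5),              # slightly offset each label
                 textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()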
For further exploration

The matplotlib Gallery: https://matplotlib.org/2.0.2/gallery.html
Seaborn: https://seaborn.pydata.org/
Altair: https://altair-viz.github.io/
D3.js: https://d3js.org/
Bokeh: https://docs.bokeh.org/en/latest/
Linear Algebra

Linear algebra is the branch of mathematics that deals with vector spaces.

Abstractly, vectors are objects that can be added together to form new vectors and that can be multiplied by scalars, also to form new vectors.
Adding two vectors

Adding the two vectors [1, 2] and [3, 5] results in [1+3, 2+5], that is [4, 7]. Graphically, this corresponds to placing the two vectors tip to tail.
Subtracting two vectors

Subtracting is also done elementwise: [1, 2] minus [3, 5] results in [1-3, 2-5], that is [-2, -3].
Scalar multiplication

We can multiply a vector by a scalar simply by multiplying each element of the vector by that number.
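A minimal sketch of these three operations, representing vectors as plain Python lists of floats (one possible convention, not the only one):

from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements."""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements."""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c."""
    return [c * v_i for v_i in v]

assert add([1, 2], [3, 5]) == [4, 7]
assert subtract([1, 2], [3, 5]) == [-2, -3]
assert scalar_multiply(2, [1, 2, 3]) == [2, 4, 6]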
Matrices

A matrix is a two-dimensional collection of numbers. We will represent matrices as lists of lists, with each inner list having the same size and representing a row of the matrix. If A is a matrix, then A[i][j] is the element in the ith row and the jth column.
N and M of a matrix

If a matrix has n rows and k columns, we will refer to it as an n x k matrix. We can think of each row of an n x k matrix as a vector of length k, and each column as a vector of length n.
Get a row or a column as a vector

Create an identity matrix (for example, a 5 x 5 identity matrix).
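A minimal sketch of these helpers, using the list-of-lists representation described above:

from typing import List

Matrix = List[List[float]]

def shape(A: Matrix):
    """Returns (# of rows of A, # of columns of A)."""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def get_row(A: Matrix, i: int) -> List[float]:
    """Returns the i-th row of A (as a vector)."""
    return A[i]

def get_column(A: Matrix, j: int) -> List[float]:
    """Returns the j-th column of A (as a vector)."""
    return [A_i[j] for A_i in A]

def identity_matrix(n: int) -> Matrix:
    """Returns the n x n identity matrix: 1s on the diagonal, 0s elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

assert identity_matrix(5)[0] == [1, 0, 0, 0, 0]   # first row of the 5 x 5 example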
Statistics

Statistics refers to the mathematics and techniques with which we understand data.
Describing a Single Set of Data

As a first approach, you put the friend counts into a histogram using Counter and plt.bar. Unfortunately, this chart is still too difficult to slip into conversations, so you start generating some statistics.
Central tendencies – The mean

Usually, we will want some notion of where our data is centered. Most commonly we'll use the mean (or average), which is just the sum of the data divided by its count.
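One way to write this, as a minimal sketch:

from typing import List

def mean(xs: List[float]) -> float:
    """The sum of the data divided by its count."""
    return sum(xs) / len(xs)

assert mean([1, 2, 3, 4]) == 2.5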
Central tendencies – The median

The median is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).
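A minimal sketch:

from typing import List

def median(xs: List[float]) -> float:
    """Finds the middle-most value of xs."""
    sorted_xs = sorted(xs)
    n = len(xs)
    midpoint = n // 2
    if n % 2 == 1:
        # odd number of points: return the middle value
        return sorted_xs[midpoint]
    # even number of points: average the two middle values
    return (sorted_xs[midpoint - 1] + sorted_xs[midpoint]) / 2

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2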
Central
tendencies
– The
quantile

A genralization of the
median is the quantile,
which represent the
value under which a
certain percentile of the
data lie ( the median
represent the value
underwhich 50 % of
the data lies
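One simple convention for computing a quantile (a sketch; other definitions interpolate between values):

from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Returns the value under which the p-th fraction of the data lies."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert quantile(data, 0.1) == 2
assert quantile(data, 0.5) == 6   # with this convention, close to (not always equal to) the median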
Central tendencies – The mode

The mode is the value (or values) that appears most often in the data.
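A minimal sketch, returning a list since there might be more than one mode:

from collections import Counter
from typing import List

def mode(xs: List[float]) -> List[float]:
    """Returns the most common value(s)."""
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

assert set(mode([1, 2, 2, 3, 3, 4])) == {2, 3}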
Dispersion

Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps us understand the distribution of the data.

The simplest measure is the range, the difference between the max and the min. The range is 0 precisely when the max and the min are equal, which can only happen when the elements of a dataset are all the same.
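A minimal sketch:

from typing import List

def data_range(xs: List[float]) -> float:
    """The difference between the largest and smallest values."""
    return max(xs) - min(xs)

assert data_range([1, 5, 3, 9]) == 8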
Dispersion - The variance

A more complex measure of dispersion is the variance, which is computed as (roughly) the average squared deviation from the mean.
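A sketch of one common definition (the sample variance, dividing by n - 1), together with the standard deviation:

import math
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def variance(xs: List[float]) -> float:
    """(Almost) the average squared deviation from the mean."""
    assert len(xs) >= 2, "variance requires at least two elements"
    n = len(xs)
    deviations = [x - mean(xs) for x in xs]
    return sum(d ** 2 for d in deviations) / (n - 1)

def standard_deviation(xs: List[float]) -> float:
    """The square root of the variance."""
    return math.sqrt(variance(xs))

assert variance([1, 2, 3, 4, 5]) == 2.5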
Correlation

The amount of time people spend on the site is related to the number of friends they have on the site. After digging through traffic logs, we have come up with a list called daily_minutes that shows how many minutes per day each user spends on our site, and we have ordered it so that its elements correspond to the elements of our previous num_friends list. We'd like to investigate the relationship between these two metrics.
Correlation - Covariance

Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means.

A large positive covariance means that x tends to be large when y is large and small when y is small. A large negative covariance means the opposite: that x tends to be small when y is large and vice versa. A covariance close to zero means that no such relationship exists.
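A minimal sketch (using the same n - 1 convention as the sample variance above):

from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def covariance(xs: List[float], ys: List[float]) -> float:
    """Measures how xs and ys vary in tandem about their means."""
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# e.g. covariance(num_friends, daily_minutes)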
Correlation

Correlation divides the covariance by the standard deviations of both variables.

The correlation is unitless and always lies between -1 (perfect anticorrelation) and 1 (perfect correlation). A number like 0.25 represents a relatively weak positive correlation.
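A sketch, building on the covariance and standard_deviation sketches above:

from typing import List

def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance scaled by both standard deviations, so the result is unitless."""
    stdev_x = standard_deviation(xs)   # from the variance sketch above
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0   # if there is no variation, we define the correlation to be zero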
Correlation with an outlier
Simpson's paradox

One not uncommon surprise when analyzing data is Simpson's paradox, in which correlations can be misleading when confounding variables are ignored.

=> It certainly looks like the West Coast data scientists are friendlier than the East Coast data scientists.
Simpson's paradox

But when playing with the data, you discover something very strange: if you look only at people with PhDs, the East Coast data scientists have more friends on average. And if you look only at people without PhDs, the East Coast data scientists also have more friends on average.

-> Once you account for the users' degrees, the correlation goes in the opposite direction!
Probability

"The laws of probability, so true in general, so fallacious in particular." - Edward Gibbon

We write P(E) to mean "the probability of the event E."

We'll use probability theory to build models. We'll use probability theory to evaluate models. We'll use probability theory all over the place.
Dependence and Independence

We say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise, they are independent.

Two events E and F are independent if the probability that both happen is the product of the probabilities that each one happens:

P(E, F) = P(E)P(F)
Conditional probability

When two events E and F are independent, then:

P(E, F) = P(E)P(F)
Conditional probability

If they are not necessarily independent (and if the probability of F is not 0), then we define the probability of E conditional on F as:

P(E|F) = P(E, F) / P(F)

-> The probability that E happens, given that we know that F happens.
Conditional probability

When E and F are independent, you can check that this gives:

P(E|F) = P(E)
Family example

A family with two unknown children. If we assume that:
Each child is equally likely to be a boy or a girl.
The gender of the second child is independent of the gender of the first child.

Then:
Two girls: 1/4
One girl, one boy: 1/4
One boy, one girl: 1/4
Two boys: 1/4
Family example

The probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G):

P(B|G) = P(B, G) / P(G) = P(B) / P(G) = (1/4) / (1/2) = 1/2

since the event "B and G" ("both children are girls and the older child is a girl") is just the event B. (Once you know that both children are girls, it's necessarily true that the older child is a girl.)
Family example

We could also ask about the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L):

P(B|L) = P(B, L) / P(L) = P(B) / P(L) = (1/4) / (3/4) = 1/3

The family example with Python:
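One possible simulation of this example (a sketch using random.choice; the exact counts vary from run to run, but the ratios should approach 1/2 and 1/3):

import random

def random_kid() -> str:
    return random.choice(["boy", "girl"])

random.seed(0)
both_girls = 0
older_girl = 0
either_girl = 0

for _ in range(10000):
    older = random_kid()
    younger = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)    # should be close to 1/2
print("P(both | either):", both_girls / either_girl)  # should be close to 1/3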
Bayes's Theorem

Let's say we need to know the probability of some event E conditional on some other event F occurring, but we only have information about the probability of F conditional on E occurring.

P(E|F) = P(E, F) / P(F) = P(F|E)P(E) / P(F)
Bayes's Theorem

The event F can be split into the two mutually exclusive events "F and E" and "F and not E":

P(F) = P(F, E) + P(F, not E)

So that:

P(E|F) = P(F|E)P(E) / [ P(F|E)P(E) + P(F|not E)P(not E) ]
Bayes's Theorem

Imagine a certain disease that affects 1 in every 10,000 people, and imagine that there is a test for this disease that gives the correct result ("diseased" if you have the disease, "nondiseased" if you don't) 99% of the time: P(T|D) = 0.99.

What does a positive test mean? Let's use T for the event "your test is positive" and D for the event "you have the disease". Then Bayes's theorem says that the probability that you have the disease, conditional on testing positive, is:

P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|not D) P(not D) ] = 0.98%

P(T|D) = 0.99 = the probability that someone with the disease tests positive.
P(D) = 1/10000 = 0.0001 = the probability that a given person has the disease.
P(T|not D) = 0.01 = the probability that someone without the disease tests positive.
P(not D) = 0.9999 = the probability that a given person doesn't have the disease.
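The same arithmetic as a quick Python check (a sketch; the numbers are the ones given above):

p_t_given_d = 0.99       # P(T|D): test is positive given disease
p_d = 0.0001             # P(D): prevalence of the disease
p_t_given_not_d = 0.01   # P(T|not D): false positive rate
p_not_d = 1 - p_d        # P(not D)

# Bayes's theorem
p_d_given_t = (p_t_given_d * p_d) / (p_t_given_d * p_d + p_t_given_not_d * p_not_d)
print(p_d_given_t)       # roughly 0.0098, i.e. about 0.98%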
Application of Bayes's theorem:

Say we have a population of 1 million people. You'd expect 100 of them to have the disease (P(D)), and 99 of those 100 to test positive (P(T|D)). On the other hand, you'd expect 999,900 of them (P(not D)) not to have the disease, and 9,999 of those (P(T|not D) = 0.01) to test positive. That means you'd expect only 99 out of (99 + 9,999) positive testers, about 0.98%, to actually have the disease (P(D|T)).
Continuous distributions

A coin flip corresponds to a discrete distribution, one that associates positive probability with discrete outcomes. Often we'll want to model distributions across a continuum of outcomes. (For our purposes, these outcomes will always be real numbers, although that's not always the case in real life.) For example, the uniform distribution puts equal weight on all the numbers between 0 and 1.

Because there are infinitely many numbers between 0 and 1, the weight it assigns to individual points must necessarily be zero. For this reason, we represent a continuous distribution with a probability density function (pdf) such that the probability of seeing a value in a certain interval equals the integral of the density function over the interval.
Probability Density Function

A simpler way of understanding this is that if a distribution has density function f, then the probability of seeing a value between x and x + h is approximately h * f(x) if h is small.

The density function for the uniform distribution is just 1 on the interval [0, 1) and 0 elsewhere. The probability that a random variable following that distribution is between 0.2 and 0.3 is 1/10. Python's random.random() is a (pseudo)random variable with a uniform density.
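In code, a minimal sketch of that density:

def uniform_pdf(x: float) -> float:
    """Density of the uniform distribution on [0, 1)."""
    return 1 if 0 <= x < 1 else 0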
Cumulative Distribution Function (CDF)

Gives the probability that a random variable is less than or equal to a certain value.
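A minimal sketch for the uniform distribution:

def uniform_cdf(x: float) -> float:
    """Returns the probability that a uniform random variable is <= x."""
    if x < 0:
        return 0    # the uniform variable is never less than 0
    elif x < 1:
        return x    # e.g. P(X <= 0.4) = 0.4
    else:
        return 1    # the uniform variable is always less than 1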
The Normal Distribution

The normal distribution is the king of distributions. It is the classic bell-curve-shaped distribution and is completely determined by two parameters: its mean (mu) and its standard deviation (sigma). The mean indicates where the bell is centered, and the standard deviation how "wide" it is.
Implementation of the Normal Distribution - Probability Density Function

When mu = 0 and sigma = 1, it's called the standard normal distribution.
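A minimal sketch of the density:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)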
Implementation of the Normal Distribution - Cumulative Distribution Function

Sometimes we need to invert normal_cdf to find the value corresponding to a specified probability.
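A sketch using math.erf for the CDF and a binary search for its inverse (the tolerance is an arbitrary choice):

import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for a normal variable, via the error function."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p: float, mu: float = 0, sigma: float = 1,
                       tolerance: float = 0.00001) -> float:
    """Finds the approximate z with normal_cdf(z) == p, by binary search."""
    # If not standard, compute for the standard normal and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z = -10.0   # normal_cdf(-10) is (very close to) 0
    hi_z = 10.0     # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z   # midpoint is too low, search above it
        else:
            hi_z = mid_z    # midpoint is too high, search below it
    return mid_z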
For further exploration

scipy.stats: https://docs.scipy.org/doc/scipy/reference/stats.html
Reading a probability book
Gradient Descent

"Those who boast of their descent, brag on what they owe to others."

We have a function f that we need to minimize or maximize: we need to find the input v that produces the largest (or smallest) possible value.
Algorithm

1. Pick a random starting point.
2. Compute the gradient.
3. Take a small step in the direction of the gradient (the direction that causes the function to increase the most).
4. Repeat from the new point.

=> We can likewise try to minimize the function by taking small steps in the opposite direction.
If a function has a unique global minimum, this procedure is likely to find it. If a function has multiple (local) minima, this procedure might "find" the wrong one of them, in which case you might re-run the procedure from a variety of starting points. If a function has no minimum, then it's possible the procedure might go on forever.
Estimating the gradient

If f is a function of one variable, its derivative at a point x measures how f(x) changes when we make a very small change to x. It is defined as the limit of the difference quotients (f(x + h) - f(x)) / h as h goes to zero.

The derivative is the slope of the tangent line at (x, f(x)), while the difference quotient is the slope of the not-quite-tangent line that runs through (x, f(x)) and (x + h, f(x + h)). As h gets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line.
In Python:
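A sketch of estimating derivatives and gradients numerically (the step size h is chosen by the caller):

from typing import Callable, List

Vector = List[float]

def difference_quotient(f: Callable[[float], float], x: float, h: float) -> float:
    """Approximates the derivative of f at x using step size h."""
    return (f(x + h) - f(x)) / h

def partial_difference_quotient(f: Callable[[Vector], float],
                                v: Vector, i: int, h: float) -> float:
    """Approximates the i-th partial derivative of f at v."""
    w = [v_j + (h if j == i else 0) for j, v_j in enumerate(v)]
    return (f(w) - f(v)) / h

def estimate_gradient(f: Callable[[Vector], float],
                      v: Vector, h: float = 0.0001) -> Vector:
    """Approximates the gradient with one partial difference quotient per coordinate."""
    return [partial_difference_quotient(f, v, i, h) for i in range(len(v))]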
Using the gradient

The sum_of_squares function is smallest when its input v is a vector of zeros. Imagine we didn't know that. Let's use gradients to find the minimum.
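A sketch of the loop; the step size 0.01 and the 1,000 iterations are arbitrary choices for illustration:

import random
from typing import List

Vector = List[float]

def sum_of_squares_gradient(v: Vector) -> Vector:
    """The exact gradient of sum(v_i ** 2) is [2 * v_i for each i]."""
    return [2 * v_i for v_i in v]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Moves step_size in the direction of gradient from v."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

v = [random.uniform(-10, 10) for _ in range(3)]   # a random starting point

for epoch in range(1000):
    grad = sum_of_squares_gradient(v)
    v = gradient_step(v, grad, -0.01)   # negative step size: move against the gradient

print(v)   # should be very close to [0, 0, 0]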
Choosing the right step size

Although the rationale for moving against the gradient is clear, how far to move is not. Indeed, choosing the right step size is more of an art than a science. Popular options include:
Using a fixed step size
Gradually shrinking the step size over time
At each step, choosing the step size that minimizes the value of the objective function

The step size that works depends on the problem: too small, and your gradient descent will take forever; too big, and you will take giant steps that might make the function you care about get larger or even be undefined.

=> So we will need to experiment.
Using Gradient Descent to Fit Models

We use gradient descent to fit parameterized models to data. In the usual case, we will have some dataset and some (hypothesized) model for the data that depends (in a differentiable way) on one or more parameters. We will also have a loss function that measures how well the model fits our data => smaller is better.
Example

In this example we know the parameters of the linear relationship between x and y: y = 20 * x + 5. Imagine we would like to learn them from data.
Derivation - Chain Rule

The chain rule is used for calculating the derivative of composite functions. It can also be expressed in Leibniz's notation: if a variable z depends on the variable y, which itself depends on the variable x, then z, via the intermediate variable y, depends on x as well, and dz/dx = (dz/dy) * (dy/dx). This is called the chain rule.
Partial derivatives

Let's calculate how the cost function changes relative to m and b. This involves the concept of partial derivatives, which says that if there is a function of two variables, then to find the partial derivative of that function with respect to one variable, treat the other variable as a constant. This will become clearer with an example:
Calculating Gradient Descent

Let us now apply these rules of calculus to our original equation and find the derivative of the cost function with respect to both m and b. Revisiting the cost function equation:
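A sketch of how this plays out in code for the y = 20 * x + 5 example above, using the squared error per point. For error = (m * x + b) - y, the partial derivative with respect to m is 2 * error * x and with respect to b is 2 * error; the learning rate of 0.001 and 5,000 epochs are illustrative choices:

import random
from typing import List

def linear_gradient(x: float, y: float, theta: List[float]) -> List[float]:
    """Gradient of the squared error for one point, with respect to [m, b]."""
    m, b = theta
    predicted = m * x + b
    error = predicted - y
    return [2 * error * x, 2 * error]

# Data generated from the known relationship y = 20 * x + 5
inputs = [(x, 20 * x + 5) for x in range(-50, 50)]

theta = [random.uniform(-1, 1), random.uniform(-1, 1)]   # random starting parameters
learning_rate = 0.001

for epoch in range(5000):
    # Average the gradients over the whole dataset (batch gradient descent)
    grads = [linear_gradient(x, y, theta) for x, y in inputs]
    grad = [sum(g[i] for g in grads) / len(grads) for i in range(2)]
    theta = [theta_i - learning_rate * grad_i
             for theta_i, grad_i in zip(theta, grad)]

print(theta)   # should end up close to [20, 5]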
Minibatch Gradient Descent

As we mentioned before, often we'll be using gradient descent to choose the parameters of a model in a way that minimizes some notion of error. Using the previous batch approach, each gradient step requires us to make a prediction and compute the gradient for the whole dataset, which makes each step take a long time.

Now, usually these error functions are additive, which means that the predictive error on the whole dataset is simply the sum of the predictive errors for each data point.

When this is the case, we can instead apply minibatch gradient descent, which computes the gradient (and takes a step) based on a small "minibatch" sampled from the larger dataset, cycling over our data repeatedly until it reaches a stopping point.
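One way to generate minibatches from a dataset (a sketch):

import random
from typing import Iterator, List, TypeVar

T = TypeVar('T')

def minibatches(dataset: List[T], batch_size: int, shuffle: bool = True) -> Iterator[List[T]]:
    """Generates batch_size-sized minibatches from the dataset."""
    # Start indices 0, batch_size, 2 * batch_size, ...
    batch_starts = [start for start in range(0, len(dataset), batch_size)]
    if shuffle:
        random.shuffle(batch_starts)   # visit the batches in a random order
    for start in batch_starts:
        yield dataset[start:start + batch_size]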
Stochastic Gradient Descent

Another variation is stochastic gradient descent, in which you take gradient steps based on one training example at a time.

On this problem, stochastic gradient descent finds the optimal parameters in a much smaller number of epochs, but there are always tradeoffs.

Conclusion: basing gradient steps on small minibatches (or single data points) allows you to take more of them, but the gradient for a single point might lie in a very different direction from the gradient for the dataset as a whole.
Machine learning

Data science is mostly turning business problems into data problems and collecting data and understanding data and cleaning data and formatting data, after which machine learning is almost an afterthought. Even so, it's an interesting and essential afterthought that you pretty much have to know about in order to do data science.
Modeling

What is a model? It's simply a specification of a mathematical (or probabilistic) relationship that exists between different variables.

If you're trying to raise money for your social networking site, you might build a business model (likely in a spreadsheet) that takes inputs like "number of users", "ad revenue per user", and "number of employees" and outputs your annual profit for the next several years.
What is machine learning?

We'll use machine learning to refer to creating and using models that are learned from data. In other contexts this might be called predictive modeling or data mining, but we will stick with machine learning.

Typically, our goal will be to use existing data to develop models that we can use to predict various outcomes for new data, such as:
Predicting whether an email message is spam or not
Predicting whether a credit card transaction is fraudulent
Predicting which advertisement a shopper is most likely to click on

Arthur Samuel (1959): Machine learning: field of study that gives computers the ability to learn without being explicitly programmed.
Machine learning: Exercise

Tom Mitchell (1998): Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

⇒ Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

Classifying emails as spam or not spam
Watching you label emails as spam or not spam
The number (or fraction) of emails correctly classified as spam/not spam
□ None of the above: this is not a machine learning problem
Machine learning: Exercise (solution)

Tom Mitchell (1998): Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

⇒ Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

Classifying emails as spam or not spam: T
Watching you label emails as spam or not spam: E
The number (or fraction) of emails correctly classified as spam/not spam: P
□ None of the above: this is not a machine learning problem
Overfitting and Underfitting

The horizontal line shows the best fit degree 0 polynomial. It severely underfits the training data. The best fit degree 9 polynomial goes through every training data point exactly, but it very severely overfits: if we were to pick a few more data points, it would quite likely miss them by a lot. And the degree 1 line strikes a nice balance: it's pretty close to every point, and (if these data are representative) the line will likely be close to new data points as well.
Solution

The most fundamental approach involves using different data to train the model and to test the model. The simplest way to do this is to split your dataset, so that (for example) two-thirds of it is used to train the model, after which we measure the model's performance on the remaining third:
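A sketch of such a split; the proportion is whatever you pass as prob:

import random
from typing import List, Tuple, TypeVar

X = TypeVar('X')

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
    """Split data into fractions [prob, 1 - prob]."""
    data = data[:]          # shallow copy, so we don't mutate the caller's list
    random.shuffle(data)    # shuffle so the split is random
    cut = int(len(data) * prob)
    return data[:cut], data[cut:]

train, test = split_data(list(range(1000)), 0.75)
assert len(train) == 750 and len(test) == 250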
Email spam example

Imagine building a model to make a binary judgment: is this email spam or not spam?

True positive: "This message is spam, and we correctly predicted spam."
False positive: "This message is not spam, but we predicted spam."
False negative: "This message is spam, but we predicted not spam."
True negative: "This message is not spam, and we correctly predicted not spam."
Correctness – Use case

Suppose there is a test that can be given to a newborn baby that predicts, with greater than 98% accuracy, whether the newborn will ever develop leukemia: predict leukemia if and only if the baby is named Luke (which sounds sort of like "leukemia"). These days approximately 5 babies out of 1,000 are named Luke, and the lifetime prevalence of leukemia is about 1.4%, or 14 out of every 1,000 people. If we believe these two factors are independent and apply the "Luke is for leukemia" test to 1 million people, we'd expect to see a confusion matrix like:
Correctness - Accuracy

Accuracy is defined as the fraction of correct predictions.
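As a sketch, here is accuracy applied to the "Luke is for leukemia" test above; the four counts follow from multiplying the 0.5% named-Luke rate and the 1.4% leukemia rate over 1 million independent people:

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of correct predictions (true positives plus true negatives)."""
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

# "Luke is for leukemia" over 1,000,000 people (assuming independence):
tp = 70        # named Luke, develops leukemia:     0.005 * 0.014 * 1_000_000
fp = 4930      # named Luke, no leukemia:           0.005 * 0.986 * 1_000_000
fn = 13930     # not named Luke, develops leukemia: 0.995 * 0.014 * 1_000_000
tn = 981070    # not named Luke, no leukemia:       0.995 * 0.986 * 1_000_000

print(accuracy(tp, fp, fn, tn))   # about 0.981, an impressive-looking number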
Precision and recall

In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space. Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on relevance.
Correctness - Precision

Precision measures how accurate our positive predictions were:
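A minimal sketch, reusing the Luke counts from above:

def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    """Of everything we predicted positive, how much actually was positive."""
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))   # 0.014 for the Luke test: terrible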
Correctness - Recall

Recall measures what fraction of the positives our model identified:
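A minimal sketch, again with the Luke counts:

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    """Of everything that actually was positive, how much we identified."""
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))   # 0.005 for the Luke test: also terrible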
F1 Score: Harmonic mean

Sometimes precision and recall are combined into the F1 score:
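A sketch, using the precision and recall functions defined above:

def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    """Harmonic mean of precision and recall; always lies between the two."""
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)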
Choice of a model

Usually the choice of a model involves a trade-off between precision and recall. A model that predicts "yes" when it's even a little bit confident will probably have a high recall but a low precision; a model that predicts "yes" only when it's extremely confident is likely to have a low recall and a high precision.

Alternatively, you can think of this as a trade-off between false positives and false negatives. Saying "yes" too often will give you lots of false positives; saying "no" too often will give you lots of false negatives.
Feature Extraction and Selection

Features are whatever inputs we provide to our model.

Things become more interesting as your data becomes more complicated. Imagine trying to build a spam filter to predict whether an email is junk or not. Most models won't know what to do with a raw email, which is just a collection of text. You'll have to extract features. For example:
Does the email contain the word "Viagra"? (Yes or no)
How many times does the letter d appear? (Number)
What was the domain of the sender? (Choice from a discrete set of options)
Feature Extraction and Selection

Pretty much always, we'll extract features from our data that fall into one of these three categories. What's more, the type of features we have constrains the type of models we can use.
The Naive Bayes classifier is suited to yes-or-no features, like the first one in the preceding list.
Regression models require numeric features (which could include dummy variables that are 0s and 1s).
Decision trees can deal with numeric or categorical data.
For further exploration

https://www.coursera.org/learn/machine-learning
K-Nearest Neighbors

The model

Nearest neighbors is one of the simplest predictive models there is. It makes no mathematical assumptions, and it doesn't require any sort of heavy machinery. The only things it requires are:
Some notion of distance
An assumption that points that are close to one another are similar

The prediction for each new point depends only on the handful of points closest to it.
In the general situation, we have some data points and we have a corresponding set of labels. The labels could be True and False, indicating whether each input satisfies some condition like "is spam?" or "is poisonous?" or "would be enjoyable to watch?" Or they could be categories, like movie ratings (G, PG, PG-13, R, NC-17). Or they could be the names of presidential candidates. Or they could be favorite programming languages.

In our case, the data points will be vectors, which means that we can use the distance between them.

Let's say we've picked a number k like 3 or 5. Then when we want to classify some new data point, we find the k nearest labeled points and let them vote on the new output.

But this doesn't do anything intelligent with ties. For example, imagine we're rating movies and the five nearest movies are rated G, G, PG, PG, and R. Then G has two votes and PG also has two votes. In that case, we have several options:
Pick one of the winners at random.
Weight the votes by distance and pick the weighted winner.
Reduce k until we find a unique winner.

We will implement the third, as sketched below:
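A sketch of that third option, assuming labels arrive ordered from nearest to farthest:

from collections import Counter
from typing import List

def majority_vote(labels: List[str]) -> str:
    """Assumes that labels are ordered from nearest to farthest."""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner                      # unique winner, so return it
    return majority_vote(labels[:-1])      # tie: drop the farthest label and try again

# Tie between G and PG, broken by successively dropping the farthest label
assert majority_vote(['G', 'G', 'PG', 'PG', 'R']) == 'G'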
Create a classifier
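Given the majority_vote sketch above, a classifier could look like this; LabeledPoint and the Euclidean distance are assumptions about how the data is represented:

import math
from collections import namedtuple
from typing import List

# Assumed representation: a point (vector of floats) together with its label
LabeledPoint = namedtuple("LabeledPoint", ["point", "label"])

def distance(v: List[float], w: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def knn_classify(k: int,
                 labeled_points: List[LabeledPoint],
                 new_point: List[float]) -> str:
    # Order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points,
                         key=lambda lp: distance(lp.point, new_point))
    # Take the labels of the k closest and let them vote
    k_nearest_labels = [lp.label for lp in by_distance[:k]]
    return majority_vote(k_nearest_labels)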
Example: The Iris Dataset

The Iris dataset is a staple of machine learning. It contains a bunch of measurements for 150 flowers representing three species of iris. For each flower we have its petal length, petal width, sepal length, and sepal width, as well as its species.
We can download it from: https://archive.ics.uci.edu/ml/datasets/Iris
For example, the first row looks like:
5.1,3.5,1.4,0.2,Iris-setosa
Training set

Apply k-nearest neighbors

The training set will be the "neighbors" that we will use to classify the points in the test set. We just need to choose a value for k, the number of neighbors who get to vote. Too small (think k = 1), and we let outliers have too much influence; too large (think k = 105), and we just predict the most common class in the dataset.

In a real application (and with more data), we might create a separate validation set and use it to choose k. Here we will use k = 5.
Prediction

On this simple dataset, the model predicts almost perfectly. There's one versicolor for which it predicts virginica, but otherwise it gets things exactly right.
Naive Bayes

A social network isn't much good if people can't network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most of your members are responsible citizens who send only well-received "how's it going?" messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs.

Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.
A really dumb spam filter

Imagine a "universe" that consists of receiving a message chosen randomly from all possible messages. Let S be the event "the message is spam" and B be the event "the message contains the word bitcoin." Then Bayes's theorem tells us that the probability that the message is spam conditional on containing the word bitcoin is:

P(S|B) = [ P(B|S)P(S) ] / [ P(B|S)P(S) + P(B|not S)P(not S) ]

If we have a large collection of messages we know are spam, and a large collection of messages we know are not spam, then we can easily estimate P(B|S) and P(B|not S). If we further assume that any message is equally likely to be spam or not spam (so that P(S) = P(not S) = 0.5), then:

P(S|B) = P(B|S) / [ P(B|S) + P(B|not S) ]
Application

If 50% of spam messages contain the word bitcoin, but only 1% of nonspam messages do, then the probability that any given bitcoin-containing message is spam is:

0.5 / (0.5 + 0.01) = 98%
A More Sophisticated Spam Filter

Imagine now that we have a vocabulary of many words W1 ... Wn. To move this into the realm of probability theory, we'll write Xi for the event "a message contains the word Wi."

Also imagine that (through some unspecified-at-this-point process) we've come up with an estimate P(Xi|S) for the probability that a spam message contains the ith word, and a similar estimate P(Xi|not S) for the probability that a nonspam message contains the ith word.
The formula
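The slide presumably shows the standard Naive Bayes formula here. The key "naive" assumption is that, conditional on the message being spam (or not spam), the word events Xi are independent, so their probabilities multiply:

P(X1 = x1, ..., Xn = xn | S) = P(X1 = x1 | S) * ... * P(Xn = xn | S)

and, with P(S) = P(not S) = 0.5 as before:

P(S | X = x) = [ product of P(Xi = xi | S) ] / [ product of P(Xi = xi | S) + product of P(Xi = xi | not S) ]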
Application
Underflow

In practice, you usually want to avoid multiplying lots of probabilities together, to avoid a problem called underflow, in which computers don't deal well with floating-point numbers that are too close to zero. Recalling from algebra that log(ab) = log(a) + log(b) and that exp(log x) = x, we usually compute p1 * ... * pn as:

exp(log(p1) + ... + log(pn))
Estimating the probabilities

The only challenge left is coming up with estimates for P(Xi|S) and P(Xi|not S), the probabilities that a spam message (or nonspam message) contains the word Wi. If we have a fair number of "training" messages labeled as spam and not-spam, an obvious first try is to estimate P(Xi|S) simply as the fraction of spam messages containing the word Wi.
The problem

This causes a big problem, though. Imagine that in our training set the vocabulary word "data" only occurs in nonspam messages. Then we'd estimate P(data|S) = 0. The result is that our Naive Bayes classifier would always assign spam probability 0 to any message containing the word "data", even a message like "data on cheap bitcoin and authentic rolex watches." To avoid this problem, we usually use some kind of smoothing.

In particular, we'll choose a pseudocount, k, and estimate the probability of seeing the ith word in a spam message as:

P(Xi|S) = (k + number of spams containing Wi) / (2k + number of spams)
Example

For example, if "data" occurs in 0 of 98 spam documents, and if k is 1, we estimate P(data|S) as 1/100 = 0.01, which allows our classifier to still assign some nonzero spam probability to messages that contain the word "data":

P(data|S) = (k + number of spams containing "data") / (2k + number of spams) = (1 + 0) / (2 + 98) = 1/100
Implementation
Build our classifier

Now we have all the pieces we need to build our classifier. First, let's create a simple function to tokenize messages into distinct words. We'll first convert each message to lowercase; use re.findall() to extract "words" consisting of letters, numbers, and apostrophes; and finally use set() to get just the distinct words.
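A sketch of such a tokenizer:

import re
from typing import Set

def tokenize(text: str) -> Set[str]:
    text = text.lower()                          # convert to lowercase
    all_words = re.findall("[a-z0-9']+", text)   # extract words: letters, digits, apostrophes
    return set(all_words)                        # keep only the distinct words

assert tokenize("Data Science is science") == {"data", "science", "is"}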
The classifier

As our classifier needs to keep track of tokens, counts, and labels from the training data, we'll make it a class. The constructor takes one parameter, the pseudocount to use when computing probabilities. It also initializes an empty set of tokens, counters to track how often each token is seen in spam messages and ham messages, and counts of how many spam and ham (not spam) messages it was trained on:
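A sketch of that class; Message is an assumed small container for (text, is_spam) pairs, and tokenize is the function sketched above:

from collections import defaultdict, namedtuple
from typing import Iterable

# Assumed representation of a training example
Message = namedtuple("Message", ["text", "is_spam"])

class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k                                 # smoothing pseudocount
        self.tokens = set()                        # all tokens seen in training
        self.token_spam_counts = defaultdict(int)  # token -> # of spam messages containing it
        self.token_ham_counts = defaultdict(int)   # token -> # of ham messages containing it
        self.spam_messages = self.ham_messages = 0

    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            # Count the message itself
            if message.is_spam:
                self.spam_messages += 1
            else:
                self.ham_messages += 1
            # Count each distinct token in the message
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] += 1
                else:
                    self.token_ham_counts[token] += 1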
The probabilities

We will want to predict P(spam | token). As we saw earlier, to apply Bayes's theorem we need to know P(token | spam) and P(token | ham) for each token in the vocabulary.
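Continuing the sketch, a helper method (belonging to the NaiveBayesClassifier class above) that returns the smoothed estimates:

    def _probabilities(self, token: str):
        """Returns (P(token | spam), P(token | ham)) using pseudocount smoothing."""
        spam = self.token_spam_counts[token]
        ham = self.token_ham_counts[token]
        p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
        p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)
        return p_token_spam, p_token_ham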
The prediction

Finally, we are ready to write our predict method. As mentioned earlier, rather than multiplying together lots of small probabilities, we'll instead sum up the log probabilities. And then we have a classifier!
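A sketch of that method (again belonging to the class above, and assuming math has been imported); it sums log probabilities over the whole vocabulary, including tokens that do not appear in the message:

    def predict(self, text: str) -> float:
        text_tokens = tokenize(text)
        log_prob_if_spam = log_prob_if_ham = 0.0
        # Iterate over every token in our vocabulary
        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)
            if token in text_tokens:
                # Token appears in the message: add the log of P(seeing it)
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_ham += math.log(prob_if_ham)
            else:
                # Token doesn't appear: add the log of P(not seeing it) = log(1 - P)
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_ham += math.log(1.0 - prob_if_ham)
        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_ham = math.exp(log_prob_if_ham)
        return prob_if_spam / (prob_if_spam + prob_if_ham)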
Testing our model

Let's make sure our model works by writing some unit tests for it.
Using our model

We will use the SpamAssassin public corpus: https://spamassassin.apache.org/old/publiccorpus/
We will take the files prefixed with 20021010.
To keep things really simple, we will just look at the subject lines of each email.
Get the data

How do we identify the subject line? When we look through the files, they all seem to start with "Subject:".
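A sketch of pulling out just the subject lines, assuming the corpus has been downloaded and extracted into a local emails/ directory (the path, the folder-name check, and errors="ignore" are assumptions):

import glob
from typing import List

# Assumed location of the extracted spam/ham folders
path = "emails/*/*"

data: List[Message] = []   # Message as defined in the classifier sketch above

for filename in glob.glob(path):
    is_spam = "ham" not in filename   # assumption: folders named *ham* hold non-spam
    # Some emails contain odd characters; ignore decoding errors for simplicity
    with open(filename, errors="ignore") as email_file:
        for line in email_file:
            if line.startswith("Subject:"):
                subject = line[len("Subject:"):].strip()
                data.append(Message(subject, is_spam))
                break   # only the first Subject line per file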
Split the data

Now we can split the data into training data and test data, and then we are ready to build a classifier.
Prediction

Let's generate some predictions and check how our model does.
Performance

This gives:
84 true positives (spam classified as spam)
25 false positives (ham classified as spam)
703 true negatives (ham classified as ham)
44 false negatives (spam classified as ham)

Precision = 84 / (84 + 25) = 77%
Recall = 84 / (84 + 44) = 65%

Which are not bad numbers for such a simple model. (You could do better if you looked at more than the subject lines.)
Performance

We can also inspect the model's innards to see which words are least and most indicative of spam. The spammiest words include things like sale, mortgage, money, and rates, whereas the hammiest words include things like spambayes, users, and apt. So that also gives us some intuitive confidence that our model is basically doing the right thing.
How to get better performance

Get more data in the training set.
Look at the message content, not just the subject line.
The classifier takes into account every word that appears in the training set, even words that appear only once. Modify the classifier to accept an optional min_count threshold and ignore tokens that don't appear at least that many times.
