Data Science Python Bayes
Visualizing Data: matplotlib
A fundamental part of the data scientist's toolkit is data visualization. Although it is very easy to create visualizations, it's much harder to produce good ones. A wide variety of tools exist for visualizing data. We will be using the matplotlib library, which is widely used.
Simple Line Charts
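A minimal sketch of a simple line chart with matplotlib; the years and GDP values here are illustrative placeholders, not data from this course.

from matplotlib import pyplot as plt

# illustrative data: nominal GDP by year
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart: years on the x-axis, gdp on the y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.show()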
A matrix is a two-dimensional collection of numbers. We will represent matrices as lists of lists, with each inner list having the same size and representing a row of the matrix. If A is a matrix, then A[i][j] is the element in the ith row and the jth column.
N and M of a Matrix
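A small sketch of how n (rows) and m (columns) can be read off a matrix stored as a list of rows; the helper name shape is illustrative.

def shape(A):
    # number of rows and columns of a matrix stored as a list of row lists
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

A = [[1, 2, 3],
     [4, 5, 6]]
assert shape(A) == (2, 3)  # 2 rows (n), 3 columns (m)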
Central tendencies – The mean
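A minimal sketch of the mean, assuming the usual definition (sum of the values divided by their count).

def mean(xs):
    # the average of a list of numbers
    return sum(xs) / len(xs)

assert mean([1, 2, 3, 4]) == 2.5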
Central tendencies – The median
The median is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).
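A sketch of the median following this definition; the function name is illustrative.

def median(xs):
    # middle value (odd length) or average of the two middle values (even length)
    sorted_xs = sorted(xs)
    n = len(xs)
    midpoint = n // 2
    if n % 2 == 1:
        return sorted_xs[midpoint]
    return (sorted_xs[midpoint - 1] + sorted_xs[midpoint]) / 2

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2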
Central tendencies – The quantile
A generalization of the median is the quantile, which represents the value under which a certain percentile of the data lies (the median represents the value under which 50% of the data lies).
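A sketch of a quantile function along these lines, assuming the p-th quantile is read off the sorted data.

def quantile(xs, p):
    # the value under which the fraction p of the data lies
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert quantile(data, 0.10) == 2
assert quantile(data, 0.50) == 6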
Central tendencies – The mode
The mode in statistics refers to the value (or values) in a set of numbers that appears most often.
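A sketch of the mode, returning a list since several values can tie for most frequent.

from collections import Counter

def mode(xs):
    # all values whose count equals the maximum count
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x for x, count in counts.items() if count == max_count]

assert set(mode([1, 2, 2, 3, 3, 4])) == {2, 3}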
Dispersion
Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps to understand the distribution of the data.
A simple measure of dispersion is the range: the difference between the max and the min. The range is 0 precisely when the max and the min are equal, which can only happen when the elements of a dataset are all the same.
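A sketch of the range as just described; the name data_range avoids shadowing Python's built-in range.

def data_range(xs):
    # difference between the largest and smallest values
    return max(xs) - min(xs)

assert data_range([1, 5, 3]) == 4
assert data_range([7, 7, 7]) == 0  # range is 0 when all elements are equal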
Dispersion – The variance
A more complex measure of dispersion is the variance, which is computed as the average squared deviation from the mean (dividing by n − 1 rather than n when the data is a sample).
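A sketch of the variance under that definition, dividing by n − 1 since our data is a sample.

def variance(xs):
    # mean squared deviation from the mean, with n - 1 for a sample
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

assert variance([1, 2, 3, 4, 5]) == 2.5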
Correlation
The amount of time people spend on the site is related to the number of friends they have on the site. After digging through traffic logs, we have come up with a list called daily_minutes that shows how many minutes per day each user spends on our site, and we have ordered it so that its elements correspond to the elements of our previous num_friends list. We'd like to investigate the relationship between these two metrics.
Correlation – Covariance
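A sketch of covariance and correlation for two paired lists such as num_friends and daily_minutes; it leans on the standard-library statistics module rather than any particular course code.

import statistics

def covariance(xs, ys):
    # average product of deviations from the means (n - 1 for a sample)
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    # covariance rescaled by both standard deviations; always lies in [-1, 1]
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    if sx > 0 and sy > 0:
        return covariance(xs, ys) / (sx * sy)
    return 0  # if either variable has no variation, correlation is zero

# e.g. correlation(num_friends, daily_minutes) for our two lists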
Two events E and F are independent if the probability that both happen is the product of their individual probabilities: P(E, F) = P(E)P(F).
Conditional probability
Family example
Let G be the event "the older child is a girl", B the event "both children are girls", and L the event "at least one of the children is a girl". Then:
P(B|G) = P(B, G)/P(G) = P(B)/P(G) = (1/4)/(1/2) = 1/2
since the event "B and G" ("both children are girls and the older child is a girl") is just the event B. (Once you know that both children are girls, it's necessarily true that the older child is a girl.)
P(B|L) = P(B, L)/P(L) = P(B)/P(L) = (1/4)/(3/4) = 1/3
The family example with Python
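A simulation sketch of the family example; it draws many random two-child families and checks both conditional probabilities empirically.

import random

def random_kid():
    # each child is independently a boy or a girl with equal probability
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)    # ~ 1/2
print("P(both | either):", both_girls / either_girl)  # ~ 1/3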
Bayes's Theorem
Imagine a certain disease that affects 1 in every 10,000 people, and imagine that there is a test for this disease that gives the correct result ("diseased" if you have the disease, "nondiseased" if you don't) 99% of the time.
What does a positive test mean? Let's use T for the event "your test is positive" and D for the event "you have the disease". Then Bayes's theorem says that the probability that you have the disease conditional on testing positive is:
P(D|T) = P(T|D) P(D) / [P(T|D) P(D) + P(T|not D) P(not D)] = 0.98%
P(T|D) = 0.99 = the probability that someone with the disease tests positive.
P(D) = 1/10000 = 0.0001 = the probability that a given person has the disease.
P(T|not D) = 0.01 = the probability that someone without the disease tests positive.
P(not D) = 0.9999 = the probability that a given person doesn't have the disease.
Application of Bayes's theorem
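A sketch of the computation for the disease-test example, plugging the four quantities above into Bayes's theorem.

p_t_given_d = 0.99       # P(T|D): test positive given disease
p_d = 0.0001             # P(D): prior probability of having the disease
p_t_given_not_d = 0.01   # P(T|not D): false positive rate
p_not_d = 1 - p_d        # P(not D)

p_d_given_t = (p_t_given_d * p_d) / (p_t_given_d * p_d + p_t_given_not_d * p_not_d)
print(p_d_given_t)  # ~0.0098, i.e. about 0.98%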
scipy.stats (https://docs.scipy.org/doc/scipy/reference/stats.html)
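A small illustrative use of scipy.stats, here with the standard normal distribution.

from scipy import stats

print(stats.norm.cdf(0))      # 0.5: half the mass lies below the mean
print(stats.norm.pdf(0))      # ~0.3989: density at the mean
print(stats.norm.ppf(0.975))  # ~1.96: the 97.5th percentile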
Reading a probability book
Gradient Descent
If f is a function of one variable, its derivative at a point x measures how f(x) changes when we make a very small change to x. It is defined as the limit of the difference quotients (f(x + h) − f(x)) / h as h approaches zero.
The derivative is the slope of the tangent line at (x, f(x)), while the difference quotient is the slope of the not-quite-tangent line that runs through (x + h, f(x + h)). As h gets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line.
In Python
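A sketch of the difference quotient in Python; square and the sample values are illustrative.

def difference_quotient(f, x, h):
    # slope of the not-quite-tangent line through (x, f(x)) and (x + h, f(x + h))
    return (f(x + h) - f(x)) / h

def square(x):
    return x * x

# the true derivative of square at x = 3 is 2 * 3 = 6
for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, difference_quotient(square, 3, h))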
Using the gradient
The sum_of_squares function is smallest when its input v is a vector of zeros. But imagine we didn't know that. Let's use gradients to find the minimum.
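A minimal gradient-descent sketch under these assumptions; the helper names (gradient_step, sum_of_squares_gradient) and the step size 0.01 are illustrative choices.

import random

def gradient_step(v, gradient, step_size):
    # move step_size in the direction of gradient, starting from v
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

def sum_of_squares_gradient(v):
    # gradient of sum(v_i ** 2) is [2 * v_1, ..., 2 * v_n]
    return [2 * v_i for v_i in v]

v = [random.uniform(-10, 10) for _ in range(3)]  # random starting point
for epoch in range(1000):
    grad = sum_of_squares_gradient(v)
    v = gradient_step(v, grad, -0.01)  # negative step size: move downhill

print(v)  # very close to [0, 0, 0], the minimum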
Choosing the right step size
What is a model? It's simply a specification of a mathematical (or probabilistic) relationship that exists between different variables.
If you're trying to raise money for your social networking site, you might build a business model (likely in a spreadsheet) that takes inputs like "number of users" and "ad revenue per user" and "number of employees" and outputs your annual profit for the next several years.
What is machine learning?
Nearest neighbors is one of the simplest predictive models there is. It makes no mathematical assumptions, and it doesn't require any sort of heavy machinery. The only things it requires are:
Some notion of distance
An assumption that points that are close to one another are similar
The prediction for each new point depends only on the handful of points closest to it.
In the general situation, we have some data points and we
have a corresponding set of labels. The labels could be True
and False, indicating whether each input satisfies some
condition like “is spam?” or “is poisonous?” or “would be
enjoyable to watch?” Or they could be categories, like movie
ratings (G, PG, PG-13, R, NC-17). Or they could be the names of
presidential candidates. Or they could be favorite programming
languages.
In our case, the data points will be vectors, which means that we can use the distance between them.
Let's say we've picked a number k like 3 or 5. Then when we want to classify some new data point, we find the k nearest labeled points and let them vote on the new output. If the vote is tied, we reduce k until we find a unique winner, as in the sketch below.
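A sketch of such a classifier; the names knn_classify and majority_vote are illustrative, and ties are broken by shrinking k as described above.

from collections import Counter
import math

def distance(v, w):
    # Euclidean distance between two vectors
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def majority_vote(labels):
    # labels are ordered nearest to farthest; on a tie, drop the farthest and retry
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([c for c in vote_counts.values() if c == winner_count])
    if num_winners == 1:
        return winner
    return majority_vote(labels[:-1])  # reduce k until we find a unique winner

def knn_classify(k, labeled_points, new_point):
    # labeled_points is a list of (point, label) pairs
    by_distance = sorted(labeled_points,
                         key=lambda pl: distance(pl[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return majority_vote(k_nearest_labels)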
On this simple dataset, the model predicts almost perfectly. There's one versicolor for which it predicts virginica, but otherwise it gets things exactly right.
Naïve Bayes
Imagine a "universe" that consists of receiving a message chosen randomly from all possible messages. Let S be the event "the message is spam" and B be the event "the message contains the word bitcoin." Then Bayes's theorem tells us that the probability that the message is spam conditional on containing the word bitcoin is:
P(S|B) = [ P(B|S)P(S) ] / [ P(B|S)P(S) + P(B|not S)P(not S) ]
If we have a large collection of messages we know are spam, and a large collection of messages we know are not spam, then we can easily estimate P(B|S) and P(B|not S). If we further assume that any message is equally likely to be spam or not spam (so that P(S) = P(not S) = 0.5), then:
P(S|B) = [ P(B|S) ] / [ P(B|S) + P(B|not S) ]
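A sketch of this estimate in code, assuming equal priors as stated; the function name and example rates are illustrative.

def p_spam_given_word(p_word_given_spam, p_word_given_ham):
    # assumes P(S) = P(not S) = 0.5, so the priors cancel
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

# e.g. if "bitcoin" appears in 50% of spam but only 1% of ham:
print(p_spam_given_word(0.5, 0.01))  # ~0.98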
Application
This causes a big problem, though. Imagine that in our training set the vocabulary word "data" only occurs in nonspam messages. Then we'd estimate P(data|S) = 0. The result is that our Naive Bayes classifier would always assign spam probability 0 to any message containing the word "data", even a message like "data on cheap bitcoin and authentic rolex watches." To avoid this problem, we usually use some kind of smoothing. In particular, we'll choose a pseudocount k and estimate the probability of seeing the ith word in a spam message as:
P(Xi|S) = (k + number of spams containing wi) / (2k + number of spams)
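A sketch of the smoothed estimate; the default pseudocount k = 0.5 is an illustrative choice.

def p_word_given_spam(spam_count, total_spams, k=0.5):
    # smoothed estimate of P(w_i | S) with pseudocount k
    return (spam_count + k) / (total_spams + 2 * k)

# "data" never appears in our 100 spams, but the smoothed estimate is nonzero:
print(p_word_given_spam(0, 100))  # 0.5 / 101 ~ 0.005, instead of 0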
Example
This gives:
84 true positives (spam classified as spam)
25 false positives (ham classified as spam)
703 true negatives (ham classified as ham)
44 false negatives (spam classified as ham)
Precision = 84 / (84 + 25) = 77%
Recall = 84 / (84 + 44) = 65%
which are not bad numbers for such a simple model (we could do better if we looked at more than the subject lines).
Performance
We can also inspect the model's innards to see which words are least and most indicative of spam. The spammiest words include things like sale, mortgage, money, and rates, whereas the hammiest words include things like spambayes, users, and apt. So that also gives us some intuitive confidence that our model is basically doing the right thing.
How to get better
performance
Get more data in the training set.
Look at the message content, not just the subject line.
The classifier takes into account every word that
appears in the training set, even words that
appear only once. Modify the classifier to accept
an optional min_count threshold and ignore
tokens that don’t appear at least that many times.