CS 725: Foundations of Machine Learning: Lecture 2. Overview of Probability Theory for ML
June 2019
Probability is Quantification of Uncertainty
• We are trying to build systems that understand and (possibly) interact with the real world
• We often cannot prove that something is true, but we can still ask how likely different outcomes are, or ask for the most likely explanation
Random Variable and Sample Space
• A random variable X represents the outcome or the state of the world and
takes values from a sample space or domain
• Sample space: the space of all possible outcomes
1. Can be continuous (Example: how much it will rain tomorrow)
2. Or discrete (Example: the outcome of tossing a pair of coins; S = {HH, HT, TH, TT})
• Pr(x) is the probability mass (density) function
• Assigns a number to each point in sample space
• Non-negative, sums (integrates) to 1
• Intuitively: how often x occurs, or how much we believe in x
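As a minimal sketch (assuming a fair coin, so all four outcomes are equally likely), the pmf for the pair-of-coins example can be written out and checked against these requirements:

```python
# Sketch: pmf over the sample space of tossing a pair of fair coins.
S = ["HH", "HT", "TH", "TT"]
pmf = {outcome: 0.25 for outcome in S}  # assumes fair, independent coins

# Non-negative and sums to 1, as required of a pmf.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```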
Events
• An event E is a subset of the sample space S; its probability is Pr(E) = Σ_{x∈E} Pr(x)
A review of probability theory
• Note:
• Pr(S) = 1 and Pr(∅) = 0
• Pr(Eᶜ) = 1 − Pr(E), where Eᶜ = S \ E
• Pr(E₁ ∪ E₂) = Pr(E₁) + Pr(E₂) − Pr(E₁ ∩ E₂)
• If E₁, E₂, …, Eₙ are pairwise disjoint events, then

Pr(⋃_{i=1}^{n} Eᵢ) = Σ_{i=1}^{n} Pr(Eᵢ)
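A quick numeric check of these identities on the coin-pair sample space (a sketch; the events chosen are illustrative):

```python
# Events as subsets of the coin-pair sample space; uniform pmf, so Pr(E) = |E|/4.
S = {"HH", "HT", "TH", "TT"}

def pr(event):
    return len(event) / len(S)

E1 = {"HH", "HT"}  # first toss is heads
E2 = {"HH", "TH"}  # second toss is heads

# Inclusion-exclusion: Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) − Pr(E1 ∩ E2)
assert pr(E1 | E2) == pr(E1) + pr(E2) - pr(E1 & E2)
# Complement rule: Pr(S \ E1) = 1 − Pr(E1)
assert pr(S - E1) == 1 - pr(E1)
# Additivity for disjoint events: {HH} and {TT} are disjoint
assert pr({"HH"} | {"TT"}) == pr({"HH"}) + pr({"TT"})
```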
Distribution Functions for discrete data
Continuous Distributions
Cumulative Distribution Function
Suppose X is a continuous random variable which takes values from the sample
space R, and has a pdf f . Its cdf is defined as F : R → [0, 1]:
F(a) = Pr(X ≤ a) = ∫_{−∞}^{a} f(x) dx

Note: the pdf of a continuous random variable can be obtained by differentiating its cdf:

f(a) = dF(x)/dx |_{x=a}
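These two relationships can be checked numerically; a sketch using scipy's standard normal as the example distribution (scipy assumed available):

```python
# Check F(a) = ∫_{-∞}^{a} f(x) dx and f(a) = dF/dx at x = a
# for the standard normal distribution.
from scipy.integrate import quad
from scipy.stats import norm

a, h = 0.7, 1e-5

# Integrating the pdf up to a recovers the cdf value F(a).
cdf_by_integration, _ = quad(norm.pdf, -float("inf"), a)
assert abs(cdf_by_integration - norm.cdf(a)) < 1e-7

# A central finite difference of the cdf recovers the pdf value f(a).
pdf_by_differentiation = (norm.cdf(a + h) - norm.cdf(a - h)) / (2 * h)
assert abs(pdf_by_differentiation - norm.pdf(a)) < 1e-6
```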
Multiple Random Variables
• For discrete X and Y with sample spaces S₁ and S₂, the joint distribution assigns a probability Pr((X, Y) = (x, y)) to every pair (x, y) ∈ S₁ × S₂
Multiple Random Variables (cont)
• For continuous:
If f(x, y) is a joint pdf, then

F(a, b) = Pr(X ≤ a, Y ≤ b) = ∫_{−∞}^{b} ∫_{−∞}^{a} f(x, y) dx dy

f(a, b) = ∂²F(x, y)/∂x∂y |_{(a,b)}
• Marginal distribution: for x ∈ S₁, Pr(X = x) = Σ_{y∈S₂} Pr((X, Y) = (x, y)). The marginal distribution of Y is defined similarly.
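A minimal sketch of marginalization for a discrete joint distribution stored as a table (the joint probabilities here are made up for illustration):

```python
import numpy as np

# Joint pmf Pr(X = x, Y = y): rows index x ∈ S1, columns index y ∈ S2.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])
assert np.isclose(joint.sum(), 1.0)

marginal_x = joint.sum(axis=1)  # Pr(X = x) = Σ_y Pr(X = x, Y = y) -> [0.3, 0.7]
marginal_y = joint.sum(axis=0)  # Pr(Y = y) = Σ_x Pr(X = x, Y = y) -> [0.4, 0.6]
print(marginal_x, marginal_y)
```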
Conditional Probability
The conditional probability of Y given X is Pr(Y | X) = Pr(X, Y)/Pr(X) (for Pr(X) > 0). Applying this identity twice yields Bayes' theorem:

Pr(Y | X) = Pr(X | Y) Pr(Y) / Pr(X)
Using Bayes’ Theorem
A lab test has a probability 0.95 of detecting a disease when applied to a person
suffering from said disease, and a probability 0.10 of giving a false positive when
applied to a non-sufferer. If 0.5% of the population are sufferers, what is the probability that a person who tests positive actually suffers from the disease?
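A sketch of the computation (variable names are illustrative): the overall positive-test probability comes from the law of total probability, and Bayes' theorem then gives the posterior.

```python
# Bayes' theorem applied to the lab-test example above.
p_pos_given_disease = 0.95   # sensitivity: Pr(+ | disease)
p_pos_given_healthy = 0.10   # false-positive rate: Pr(+ | no disease)
p_disease = 0.005            # prevalence: 0.5% of the population

# Law of total probability: Pr(+) = Pr(+|D) Pr(D) + Pr(+|not D) Pr(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes: Pr(D | +) = Pr(+ | D) Pr(D) / Pr(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ≈ 0.0456: low, because the disease is rare
```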
Independence of Random Variables
• Random variables X and Y are independent if Pr(X = x, Y = y) = Pr(X = x) Pr(Y = y) for all x, y; equivalently, Pr(Y | X) = Pr(Y)
Expectation
If X is a random variable taking (say) real values (i.e., S ⊆ R), we can define an
“expected value” for X as:
E(X) = Σ_{x∈S} x · Pr(X = x)

For continuous:

E[X] = ∫_{−∞}^{∞} x f(x) dx
• Var[X] = E[X²] − (E[X])²
• Var[X + β] = Var[X] and Var[αX] = α² Var[X]
• If X₁, …, Xₙ are pairwise independent, then Var[Σᵢ Xᵢ] = Σᵢ Var[Xᵢ] (Proof: HW)
• If X₁, …, Xₙ are pairwise independent, each with variance σ², then Var[(1/n) Σᵢ Xᵢ] = σ²/n (see the simulation sketch below)
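A simulation sketch of the last property (the distribution, seed, and sizes are illustrative): the variance of the mean of n independent draws comes out close to σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 25, 200_000, 4.0

# Each row holds n i.i.d. draws with variance sigma2;
# each row mean is one sample of (1/n) Σ_i X_i.
draws = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
sample_means = draws.mean(axis=1)

print(sample_means.var())  # ≈ sigma2 / n = 0.16
```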
Covariance
• Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• If X and Y are independent, then E[XY] = E[X] E[Y], so Cov(X, Y) = 0
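A sketch checking the independence case empirically: for independently drawn X and Y, E[XY] ≈ E[X] E[Y], so the sample covariance is near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=1_000_000)  # X and Y drawn independently
y = rng.uniform(size=1_000_000)

cov = (x * y).mean() - x.mean() * y.mean()  # E[XY] − E[X] E[Y]
print(cov)  # ≈ 0, up to sampling noise
```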
Important Discrete Random Variables
Bernoulli(q): X ∈ {0, 1} with Pr(X = 1) = q
• E[X] = (1 − q) · 0 + q · 1 = q
• Var[X] = E[X²] − (E[X])² = q − q² = q(1 − q)

Binomial(n, q): X is the number of successes in n independent Bernoulli(q) trials
1. Pr[X = k] = C(n, k) q^k (1 − q)^{n−k}
2. E[X] = Σᵢ E[Yᵢ], where each Yᵢ is a Bernoulli(q) random variable ⇒ E[X] = nq
3. Var[X] = Σᵢ Var[Yᵢ] (since the Yᵢ are independent) ⇒ Var[X] = nq(1 − q) (see the sketch below)
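A sketch verifying the Binomial(n, q) mean and variance, both via scipy's closed forms and by summing simulated Bernoulli trials (n, q, and the seed are illustrative):

```python
import numpy as np
from scipy.stats import binom

n, q = 20, 0.3

# Closed-form mean and variance agree with nq and nq(1 − q).
mean, var = binom.stats(n, q, moments="mv")
assert np.isclose(mean, n * q) and np.isclose(var, n * q * (1 - q))

# Simulation: X = Σ_i Y_i with Y_i ~ Bernoulli(q).
rng = np.random.default_rng(2)
x = rng.binomial(1, q, size=(500_000, n)).sum(axis=1)
print(x.mean(), x.var())  # ≈ 6.0 and ≈ 4.2
```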
Normal (Gaussian) Distribution
X ∼ N(μ, σ²) with mean μ and variance σ² has pdf

φ_{μ,σ²}(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
1-D Gaussian distribution
[Figure: 1-D Gaussian pdfs φ_{μ,σ²}(x) for (μ = 0, σ² = 0.2), (μ = 0, σ² = 1.0), (μ = 0, σ² = 5.0), and (μ = −2, σ² = 0.5), plotted over x ∈ [−5, 5]. Source: https://upload.wikimedia.org/wikipedia/commons/7/74/Normal_Distribution_PDF.svg]
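A sketch that reproduces the curves in the figure by evaluating the Gaussian pdf at the four (μ, σ²) settings (matplotlib and scipy assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-5, 5, 400)
for mu, sigma2 in [(0, 0.2), (0, 1.0), (0, 5.0), (-2, 0.5)]:
    # scipy parameterizes by standard deviation, hence the sqrt.
    plt.plot(x, norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)),
             label=f"mu={mu}, var={sigma2}")
plt.xlabel("x")
plt.legend()
plt.show()
```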
2-D Gaussian distribution
[Figure: surface plot of a 2-D Gaussian density over x, y ∈ [−3, 3].]
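A sketch of sampling from a 2-D Gaussian like the one pictured (the mean and covariance are illustrative) and checking the sample statistics:

```python
import numpy as np

rng = np.random.default_rng(3)
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])  # symmetric positive-definite covariance

samples = rng.multivariate_normal(mean, cov, size=200_000)
print(samples.mean(axis=0))           # ≈ mean
print(np.cov(samples, rowvar=False))  # ≈ cov
```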
Properties of Normal Distribution