Lecture 3: Introduction to Data Analysis and Machine Learning
Posterior PDF p(λ|D,H) contains all information on λ.
λ* = Maximum A Posteriori (MAP) value
• Error estimate: $\sigma_\lambda^{-2} = -\left.\dfrac{d^2 \ln p(\lambda|D,H)}{d\lambda^2}\right|_{\lambda=\lambda^*}$
• Laplace approximation: $\lambda = \lambda^* \pm \sigma_\lambda$
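A minimal numerical sketch of this recipe (a hypothetical one-parameter Gaussian example; all names and numbers are illustrative): find the MAP point by minimizing $-\ln p$, then estimate the error from the curvature at the peak.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data: N noisy measurements of a constant lam_true with error sigma.
rng = np.random.default_rng(0)
sigma, lam_true = 1.0, 3.0
data = rng.normal(lam_true, sigma, size=100)

def neg_log_post(lam):
    # Flat prior: -ln p(lam|D) = -ln L(lam) up to a constant.
    return 0.5 * np.sum((data - lam) ** 2) / sigma**2

lam_map = minimize_scalar(neg_log_post).x   # MAP point

# Laplace approximation: 1/sigma_lam^2 = d^2(-ln p)/d lam^2 at the peak,
# estimated here with a finite difference.
h = 1e-4
curv = (neg_log_post(lam_map + h) - 2 * neg_log_post(lam_map)
        + neg_log_post(lam_map - h)) / h**2
sigma_lam = 1.0 / np.sqrt(curv)
print(f"lambda = {lam_map:.3f} +- {sigma_lam:.3f}")  # expect ~ sigma/sqrt(N) = 0.1
```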
Posterior predictive distribution
• Predicting a future observation conditional on the current data y and the model posterior: we marginalize over all models at fixed current data y
• We have seen it in example 2.6 (lecture 2 slide 28)
• Example: we measure a quantity, but each measurement has some error σ. After N measurements we get mean μ₁ and standard deviation σ₁ = σ/N^{1/2}. The next measurement will be located around μ₁, with the two errors combined.
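A short numerical sketch of this example (assuming Gaussian errors and a flat prior; the numbers are made up): the predictive scatter for the next measurement combines σ and σ₁ in quadrature.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, N = 2.0, 25
y = rng.normal(10.0, sigma, size=N)   # N measurements with error sigma

mu1 = y.mean()                        # posterior mean of the quantity
sigma1 = sigma / np.sqrt(N)           # its uncertainty, sigma/N^(1/2)

# Posterior predictive for the next measurement: centered on mu1,
# with the two errors added in quadrature.
sigma_pred = np.hypot(sigma, sigma1)
print(f"y_next ~ N({mu1:.2f}, {sigma_pred:.2f}^2)")
```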
Modern statistical methods (Bayesian or not)
Gelman et al., Bayesian Data Analysis, 3rd edition
INTRODUCTION TO MODELING OF DATA
$$p(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(y(x_i) - y_i)^2}{2\sigma_i^2}}, \qquad p(\vec{y}) = \prod_i p(y_i)$$
We can parametrize the model in terms of M free parameters:
$$y(x_i|a_1, a_2, a_3, \ldots, a_M)$$
• Instead of the full posterior we can ask for the best-fit values of the parameters $a_1, a_2, a_3, \ldots, a_M$
$$\mathrm{MLE}: \quad \frac{\partial L}{\partial \vec{a}} = 0 \quad \text{or} \quad \frac{\partial \ln L}{\partial \vec{a}} = 0$$
Maximum likelihood estimator for Gaussian errors:
$$p(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(y(x_i) - y_i)^2}{2\sigma_i^2}}, \qquad p(\vec{y}) = \prod_i p(y_i)$$
$$-2\ln L = \sum_i \left\{ \frac{(y_i - y(x_i|a_1, \ldots, a_M))^2}{\sigma_i^2} + \ln \sigma_i^2 \right\}$$
where the first (weighted residual) term, summed over i, is $\chi^2$.
Since $\sigma_i$ does not depend on $a_k$, MLE means minimizing $\chi^2$ with respect to $a_k$:
$$\frac{\partial \chi^2}{\partial a_k} = 0 \;\;\rightarrow\;\; \sum_i \frac{y_i - y(x_i)}{\sigma_i^2}\, \frac{\partial y(x_i)}{\partial a_k} = 0$$
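As an illustration, χ² can also be minimized numerically; a minimal sketch with scipy on synthetic straight-line data (the data and starting point are placeholders):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
sig = 0.1 * np.ones_like(x)                  # known per-point errors sigma_i
y = 1.0 + 2.0 * x + rng.normal(0, sig)       # data drawn from y = a + b x

def chi2(params):
    a, b = params
    return np.sum(((y - (a + b * x)) / sig) ** 2)

res = minimize(chi2, x0=[0.0, 0.0])          # MLE = argmin chi^2
print("a_hat, b_hat =", res.x)
```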
Linear Regression
Minimize $\chi^2 = \sum_i \left(\dfrac{y_i - a - b\,x_i}{\sigma_i}\right)^2$ with respect to a and b:
Define:
$$S = \sum_i \frac{1}{\sigma_i^2}, \quad S_x = \sum_i \frac{x_i}{\sigma_i^2}, \quad S_y = \sum_i \frac{y_i}{\sigma_i^2}, \quad S_{xx} = \sum_i \frac{x_i^2}{\sigma_i^2}, \quad S_{xy} = \sum_i \frac{x_i y_i}{\sigma_i^2}$$
Matrix form:
$$\begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix}$$
Solve this with linear algebra.
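A minimal sketch of this solve with numpy, using the S-sums defined above (the function name and data arrays are placeholders):

```python
import numpy as np

def fit_line(x, y, sig):
    """Weighted straight-line fit y = a + b x via the normal equations."""
    w = 1.0 / sig**2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()
    A = np.array([[S, Sx], [Sx, Sxx]])       # = C^{-1}, the precision matrix
    rhs = np.array([Sy, Sxy])
    a_hat, b_hat = np.linalg.solve(A, rhs)
    return a_hat, b_hat, np.linalg.inv(A)    # also return the covariance C

# Example use with the toy data from the previous sketch:
# a_hat, b_hat, C = fit_line(x, y, sig)
```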
$$\begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix}, \qquad C^{-1} = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad C = \frac{1}{\Delta} \begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}$$
Solution: define $\Delta = S\,S_{xx} - S_x^2$, then
$$\begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = C \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix} = \frac{1}{\Delta} \begin{pmatrix} S_{xx} S_y - S_x S_{xy} \\ S\, S_{xy} - S_x S_y \end{pmatrix}$$
What about the errors?
• We approximate the log posterior around its peak with a
quadratic function
• The posterior is thus approximated as a Gaussian
• This goes under the name Laplace approximation
• Note that the errors need to be described by a covariance matrix
• It is exact for linear parameters (such as a and b)
MLE/MAP + Laplace
$$-2\ln p(a, b|y_i) = -2\ln L(a, b) \quad \text{(flat prior)}$$
Taylor expansion around the peak ($\hat{a}$, $\hat{b}$): the first derivative is 0. Let $a = x_1$, $b = x_2$:
$$-2\ln L(x_1, x_2) = -2\ln L(\hat{x}_1, \hat{x}_2) - 2\cdot\frac{1}{2} \sum_{i,j=1,2} \left.\frac{\partial^2 \ln L}{\partial x_i \partial x_j}\right|_{x_i=\hat{x}_i} \Delta x_i\, \Delta x_j$$
where $\Delta x_i = x_i - \hat{x}_i$. The quadratic form defines the precision matrix $C^{-1}$:
$$-\ln L = \mathrm{const} + \frac{1}{2} \sum_{ij} \Delta x_i\, C^{-1}_{ij}\, \Delta x_j$$
Note: $\langle \Delta x_i \Delta x_j \rangle = C_{ij}$
For the straight-line fit:
$$\frac{\partial^2 \chi^2}{\partial a^2} = 2\sum_i \frac{1}{\sigma_i^2} = 2S, \qquad \frac{\partial^2 \chi^2}{\partial b^2} = 2\sum_i \frac{x_i^2}{\sigma_i^2} = 2S_{xx}, \qquad \frac{\partial^2 \chi^2}{\partial a \partial b} = 2\sum_i \frac{x_i}{\sigma_i^2} = 2S_x$$
$$C^{-1} = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad C = \frac{1}{\Delta} \begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}$$
$S^{-1}$ is the error (variance) on $a$ at fixed $b$.
Marginalized error on $a$: integrate out $b$.
Marginal errors are larger: $\sigma_a^2 = C_{aa} \geq S^{-1}$.
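A quick numerical check of this statement (the values of the sums are made up; any positive-definite precision matrix works):

```python
import numpy as np

# Precision matrix of (a, b) from the straight-line fit.
S, Sx, Sxx = 120.0, 60.0, 45.0               # illustrative values of the sums
Cinv = np.array([[S, Sx], [Sx, Sxx]])
C = np.linalg.inv(Cinv)

var_a_fixed_b = 1.0 / Cinv[0, 0]             # error on a at fixed b: S^{-1}
var_a_marginal = C[0, 0]                     # marginalized over b: C_aa
print(var_a_fixed_b <= var_a_marginal)       # True: marginal errors are larger
```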
Show:
$$\int da\; e^{-\frac{1}{2}\left[(a-\hat{a})^2 C^{-1}_{aa} + 2(a-\hat{a})(b-\hat{b})\, C^{-1}_{ab} + (b-\hat{b})^2 C^{-1}_{bb}\right]} \propto e^{-\frac{1}{2} \frac{(b-\hat{b})^2}{C_{bb}}}$$
Solution: complete the square in $\Delta a = a - \hat{a}$, with $\Delta b = b - \hat{b}$:
$$C^{-1}_{aa}\left(\Delta a + \frac{C^{-1}_{ab}}{C^{-1}_{aa}}\Delta b\right)^2 - \frac{(C^{-1}_{ab})^2}{C^{-1}_{aa}}\Delta b^2 + C^{-1}_{bb}\Delta b^2$$
$$\int da\; e^{-\frac{1}{2} C^{-1}_{aa} \left(\Delta a + \frac{C^{-1}_{ab}}{C^{-1}_{aa}}\Delta b\right)^2} = \sqrt{2\pi / C^{-1}_{aa}}$$
$$\propto e^{-\frac{1}{2}\left[C^{-1}_{bb} - \frac{(C^{-1}_{ab})^2}{C^{-1}_{aa}}\right]\Delta b^2} = e^{-\frac{1}{2}\frac{\Delta b^2}{C_{bb}}}$$
since for a $2\times 2$ matrix $C_{bb} = C^{-1}_{aa} / \det(C^{-1})$.
Bayesian Posterior and Marginals
• The posterior distribution p(a,b|y_i) is described by ellipses of the quadratic form $C^{-1}$ in the (a,b) plane
Asymptotic theorems
(Le Cam 1953, adapted to Bayesian posteriors)
• There are caveats when this does not apply, e.g. when the data are not informative about a parameter or some linear combination of parameters, when the number of parameters M is comparable to the number of data points N, or when posteriors are improper or likelihoods are unbounded… Always exercise care!
Design matrix
For a general linear model $y(x) = \sum_j a_j X_j(x)$, define the design matrix $A_{ij} = X_j(x_i)/\sigma_i$ and the data vector $b_i = y_i/\sigma_i$; the curvature matrix is $\alpha_{kj} = \partial^2 \chi^2 / \partial a_k \partial a_j$.
To solve the normal equations for the best-fit values and the precision matrix we need to learn linear algebra numerical methods: the topic of the next lecture.
Gaussian posterior
Credible intervals under Gaussian posterior approx.
• We like to quote posteriors in terms of X% credible intervals
• For Gaussian likelihoods the most compact credible regions correspond to a constant change Δχ² relative to MAP/MLE
• The intervals depend on the dimension ν: example for X = 68
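These Δχ² levels can be computed from the χ² distribution with ν degrees of freedom; a sketch with scipy.stats (68.3% is the 1σ probability content):

```python
from scipy.stats import chi2

# Delta chi^2 enclosing 68.3% probability for a nu-dimensional
# Gaussian credible region (nu = number of projected parameters).
for nu in (1, 2, 3):
    print(nu, round(float(chi2.ppf(0.683, df=nu)), 2))
# nu=1: 1.00,  nu=2: 2.30,  nu=3: 3.53
```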
We rarely go above ν = 2 dimensions in projections
(difficult to visualize)
Introduction to Machine Learning
• From some input x, the output can be:
  • Summary z: unsupervised learning (descriptive, hindsight)
  • Prediction y: supervised learning (predictive, insight)
  • Action a to maximize reward r: reinforcement learning (prescriptive, foresight)
• Value vs difficulty (although this view is subjective)
• Supervised learning: classification and regression
• Unsupervised learning: e.g. dimensionality reduction
Chris Wiggins taxonomy, Gartner/Recht graph
Data Analysis versus Machine Learning
• In physical sciences we usually compare data to a physics-based model to infer the parameters of the model. This is often an analytic model as a function of physical parameters (e.g. linear regression). This is the Bayesian Data Analysis component of this course. We need a likelihood and a prior.
• In machine learning we usually do not have a model; all we have is data. If the data are labeled, we can also do inference on new unlabeled data: we can learn that data with a certain value of the label have certain properties, so that when we evaluate new data we can assign a value of the label to them. This works for both regression (continuous label values) and classification (discrete label values). ML is a fancy version of interpolation.
• Hybrid: likelihood-free inference (LFI), i.e. inference using ML methods. Instead of doing a prior + likelihood analysis we make labeled synthetic data realizations using simulations, and use ML methods to infer the parameter values given the actual data realization. We pay the price of sampling noise, in that we may not have sufficient simulations for the ML methods to learn the labels well.
• For very complicated, high-dimensional problems, a full Bayesian analysis may not be feasible and LFI can be an attractive alternative. We will be learning both approaches in this course.
Supervised Learning (SL)
• Answering a specific question: e.g. regression or classification
• Supervised learning is essentially interpolation
• General approach: frame the problem, collect the data
• Choose the SL algorithm
• Choose the objective function (decide what to optimize)
• Train the algorithm, test (cross-validate)
Classes of problems: regression
Basic machine learning procedure
• We have some data x and some labels y, such that Y = (x, y). We wish to find some model g(a) and some cost or loss function C(Y, g(a)) that we minimize, such that the model g(a) explains the data Y.
• E.g. Y = (x, y), C = χ²
• g = a₀X₀(xᵢ) + a₁X₁(xᵢ) + … + a_{M−1}X_{M−1}(xᵢ)
• In ML we divide the data into training data Y_train (e.g. 90%) and test data Y_test (e.g. 10%)
• We fit the model to the training data: the value of the minimized loss function at a_min is called the in-sample error E_in = C(Y_train, g(a_min))
• We test the results on the test data, getting the out-of-sample error E_out = C(Y_test, g(a_min)) > E_in
• This is called the cross-validation technique (a sketch follows this list)
• If we compare different models, the held-out data above are called validation data, while a separate test set is used to test the different models, each trained on the training data (3-way split, e.g. 60%, 30%, 10%)
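A minimal sketch of this procedure with scikit-learn (the data, model, and split fractions are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(200, 1))
y = 2.0 * x[:, 0] + rng.normal(0, 0.1, size=200)

# 90% / 10% train-test split as on the slide.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.1, random_state=0)

model = LinearRegression().fit(x_tr, y_tr)
E_in = mean_squared_error(y_tr, model.predict(x_tr))    # in-sample error
E_out = mean_squared_error(y_te, model.predict(x_te))   # out-of-sample error
print(E_in, E_out)   # typically E_out >= E_in
```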
Data analysis versus machine learning
• Data analysis: fitting existing data to a physics-based model to obtain model parameters y. Parameters are fixed: we know the physics up to the parameter values. Parameter posteriors are the goal.
• ML: use a model derived from existing data to predict regression or classification parameters y for new data.
• Example: polynomial regression. This will be a HW 4 problem.
• We can fit the training data to a simple model or a complex model.
• In the absence of noise a complex model (many fitting parameters a) is always better.
• In the presence of noise a complex model is often worse.
• Note that the parameters a have no meaning on their own; they are just a means to reach the goal of predicting y.
[Figure: polynomial regression fits to data generated from f(x) = 2x (no noise) and from f(x) = 2x − 10x^5 + 15x^10.]
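A sketch reproducing this kind of experiment with numpy's polyfit (the target function is the noisy one from the figure; the noise level and polynomial degrees are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: 2 * x - 10 * x**5 + 15 * x**10
x_tr = rng.uniform(0, 1, 30)
y_tr = f(x_tr) + rng.normal(0, 0.5, 30)      # noisy training data
x_te = np.linspace(0, 1, 100)                # test grid

for deg in (1, 3, 10):                       # simple vs complex models
    coef = np.polyfit(x_tr, y_tr, deg)       # least-squares polynomial fit
    E_out = np.mean((np.polyval(coef, x_te) - f(x_te)) ** 2)
    print(deg, E_out)                        # high degree often worse with noise
```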
Over-fitting noise with too complex models (bias-variance trade-off)
Bias-variance trade-off
Another example: k-nearest neighbors
How do predictions change as we average over more nearest neighbours?
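A sketch to experiment with this question, using scikit-learn's k-nearest-neighbour regressor on toy data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 100))[:, None]
y = np.sin(4 * x[:, 0]) + rng.normal(0, 0.2, 100)

# Small k: low bias, high variance; large k: smoother, higher bias.
for k in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    print(k, knn.predict([[0.5]]))
```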
Statistical learning theory
• We have data and we can change the number of data points
• We have models and we can change the complexity (the number of model parameters, in simple versions)
• Trade-off at fixed model complexity:
  • small data size suffers from a large variance (we are overfitting noise)
  • large data size suffers from model bias
• Variance is quantified by E_in vs E_out
• E_in and E_out approach the bias for large data
• To reduce the bias, increase the complexity
Bias-variance trade-off vs complexity
• Low complexity: large bias
• Large complexity: large variance
• Optimum when the two are balanced
• Complexity can be controlled by regularization (we will discuss it further)
Representational power
• We are learning a manifold M, f: X → Y
• To learn complex manifolds we need high representational power
• We need a universal approximator with good generalization properties (from in-sample to out-of-sample, i.e. not over-fitting)
• This is where neural networks excel: they can fit anything (literally, including pure noise), yet can also generalize
Unsupervised machine learning
• Discovering structure in unlabeled data
• Examples: clustering, dimensionality reduction
• The promise: easier to do regression, classification
• Easier visualization
Dimensionality reduction
• PCA (lecture 4), ICA (lecture 5)
• Manifold projection: we want to reduce the dimensionality while preserving the pairwise distances between data points (e.g. t-SNE, ISOMAP, UMAP)
• If reduced too much we get the crowding problem
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
• Tries to connect nearby points using a locally varying metric
• Best on the market at the moment
• You will try it in HW3
• Example: MNIST digits separate in the 2d UMAP plane (a usage sketch follows below)
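A minimal usage sketch with the umap-learn package (loading MNIST through scikit-learn; the subsampling and parameters are illustrative):

```python
import umap                                  # pip install umap-learn
from sklearn.datasets import fetch_openml

# 70000 MNIST digits, 784 pixels each; subsample for speed.
X, labels = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, labels = X[:5000], labels[:5000]

# Project the 784-dimensional pixel space onto a 2d plane.
embedding = umap.UMAP(n_components=2, n_neighbors=15).fit_transform(X)
print(embedding.shape)                       # (5000, 2)
```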
Clustering algorithms
• For unsupervised learning (no labels available) we also need to identify distinct classes
• Clustering algorithms look for clusters of data in the original space or in a reduced-dimensionality space
• We will look at k-means and the Gaussian mixture model later
• Clustering algorithms such as HDBSCAN connect close points together: friends-of-friends algorithms
• HW 3: UMAP + clustering (see the sketch below)
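A sketch combining the two HW 3 ingredients (the `embedding` array is assumed from the UMAP sketch above; the cluster-size parameters are illustrative):

```python
import hdbscan                               # pip install hdbscan
from sklearn.cluster import KMeans

# k-means needs the number of clusters specified up front...
kmeans_labels = KMeans(n_clusters=10, n_init=10).fit_predict(embedding)

# ...while HDBSCAN links nearby points (friends-of-friends style) and
# finds the number of clusters itself; label -1 marks noise points.
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding)
print(set(kmeans_labels), set(hdb_labels))
```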
Literature
• Numerical Recipes, Press et al., Chapter 15 (http://apps.nrbook.com/c/index.html)
• Bayesian Data Analysis, Gelman et al., Chapters 1-4
• https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
• A high-bias, low-variance introduction to machine learning for physicists, https://arxiv.org/pdf/1803.08823.pdf (pictures on slides 34-42 taken from this review)