
LECTURE 3: INTRODUCTION TO DATA ANALYSIS AND MACHINE LEARNING
PHYS188/288: Bayesian Data Analysis and Machine Learning for Physical Sciences, Uroš Seljak


• Goal of data analysis: to determine some parameters from the data
• We want to combine new data with previous information on the parameters (prior: theoretical or empirical)
• We multiply the likelihood of the parameters given the data by the prior to get the posterior
• Goal of machine learning: we want to predict parameters of new data given some existing labeled data (supervised learning), or to search for patterns and reduce dimensionality (unsupervised learning)

• The goals of statistics and ML are often similar or related


• Methodologies and language are often very different
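A minimal numerical sketch of "posterior ∝ prior × likelihood" on a parameter grid; the Gaussian prior, the data constraint, and all numbers are toy assumptions for illustration, not taken from the slides:

```python
import numpy as np

# Toy illustration (assumed numbers): posterior ∝ prior × likelihood on a grid.
lam = np.linspace(-5, 5, 1001)
prior = np.exp(-0.5 * (lam - 0.0) ** 2 / 2.0 ** 2)       # prior: lambda ~ N(0, 2^2)
likelihood = np.exp(-0.5 * (lam - 1.5) ** 2 / 0.5 ** 2)   # data prefer lambda ~ 1.5 +- 0.5

posterior = prior * likelihood
dlam = lam[1] - lam[0]
posterior /= posterior.sum() * dlam                        # normalize on the grid

mean = (lam * posterior).sum() * dlam
print(f"posterior mean = {mean:.3f}")                      # pulled toward the data
```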
Goals of Data Analysis

• Data analysis: summarizing the posterior information: mean or mode, variance… Typically we are interested in more than the mean and variance (skewness, kurtosis, the full PDF)
• Posterior intervals: e.g. a 95% credible interval can be constructed as central (relative to the median) or as highest posterior density. Typically these agree, but they can differ, e.g. for strongly skewed posteriors.

Posterior PDF $p(\lambda|D,H)$ contains all information on $\lambda$

$\lambda^*$ = Maximum A Posteriori (MAP) estimate

$p(\lambda|D,H) \propto p(D|\lambda,H)$ if $p(\lambda) \propto$ constant

If $p(\lambda) \propto$ constant (uniform prior) → $\lambda^*$ = maximum likelihood estimator (MLE), and MLE = MAP

$$\left.\frac{d}{d\lambda}\, p(\lambda|D,H)\right|_{\lambda=\lambda^*} = 0 \qquad \text{Approximate } p(\lambda|D) \text{ as a Gaussian around } \lambda^*$$

• Error estimate: $\displaystyle \left.\frac{d^2}{d\lambda^2}\ln p(\lambda|D,H)\right|_{\lambda=\lambda^*} = -\frac{1}{\sigma^2}$

• Laplace approximation: $\lambda = \lambda^* \pm \sigma$
Posterior predictive distribution
• Predicting a future observation conditional on the current data y and the model posterior: we marginalize over all models at fixed current data y
• We have seen it in example 2.6 (lecture 2, slide 28)
• Example: we measure a quantity, but each measurement has some error σ. After N measurements we get a mean µ1 with error σ1 = σ/N^{1/2}. The next measurement will be located around µ1, with the two errors combined in quadrature (σ² + σ1²).

Two sources of uncertainty! This will be discussed further when we cover hierarchical models.
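A minimal sketch of this example (the number of measurements, error, and true value are assumptions); the predictive spread adds the measurement error and the posterior error in quadrature:

```python
import numpy as np

# Minimal sketch (assumed numbers): N measurements of a constant signal, each
# with known Gaussian error sigma. With a flat prior, the posterior mean is mu1
# with error sigma1 = sigma/sqrt(N); the posterior predictive for the next
# measurement combines sigma and sigma1 in quadrature.
rng = np.random.default_rng(0)
sigma, N, true_mu = 1.0, 25, 3.0
y = true_mu + sigma * rng.normal(size=N)

mu1 = y.mean()                         # posterior mean of the signal
sigma1 = sigma / np.sqrt(N)            # posterior error of the signal
sigma_pred = np.hypot(sigma, sigma1)   # predictive error: sqrt(sigma^2 + sigma1^2)

print(f"posterior:        {mu1:.3f} +- {sigma1:.3f}")
print(f"next measurement: {mu1:.3f} +- {sigma_pred:.3f}")
```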

Modern statistical methods (Bayesian or not)
Gelman et al., Bayesian Data Analysis, 3rd edition

INTRODUCTION TO MODELING OF DATA

• We are given N data measurements (x_i, y_i)
• Each measurement comes with an error estimate σ_i
• We have a parametrized model for the data, y = y(x_i)
• We think the error probability is Gaussian and the measurements are uncorrelated:

$$p(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\left[-\frac{(y(x_i)-y_i)^2}{2\sigma_i^2}\right], \qquad p(\vec y\,) = \prod_i p(y_i)$$

We can parametrize the model in terms of M free parameters: y(x_i | a_1, a_2, a_3, …, a_M)

The Bayesian formalism gives us the full posterior information on the parameters of the model:
$$p(\vec y\,|\vec a) = \prod_i p(y_i|\vec a) = L(\vec a)$$

$$p(a_1, \ldots, a_M|\vec y\,) = \frac{\prod_i p(y_i|\vec a)\; p(\vec a)}{p(\vec y\,)}$$

We can assume a flat prior, p(a_1, a_2, a_3, …, a_M) = constant.

In this case the posterior is proportional to the likelihood L.

The normalization (evidence, marginal) $p(\vec y\,)$ is not needed if we just need the relative posterior density.
Maximum likelihood estimator (MLE)

• Instead of the full posterior, we can ask what is the best-fit value of the parameters a_1, a_2, a_3, …, a_M

• We can define this in different ways: mean, median, mode

• Choosing the mode (peak posterior or peak likelihood) means we want to maximize the likelihood: the maximum likelihood estimator (or MAP for a non-uniform prior)

$$\mathrm{MLE}: \quad \frac{\partial L}{\partial \vec a} = 0 \quad \text{or} \quad \frac{\partial \ln L}{\partial \vec a} = 0$$

Maximum likelihood estimator for Gaussian errors

$$p(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\left[-\frac{(y(x_i)-y_i)^2}{2\sigma_i^2}\right], \qquad p(\vec y\,) = \prod_i p(y_i)$$

$$-2\ln L = \sum_i \left\{ \frac{(y_i - y(x_i|a_1, \ldots, a_M))^2}{\sigma_i^2} + \ln \sigma_i^2 \right\}$$

The first term is $\chi^2$. Since $\sigma_i$ does not depend on the parameters $a$, the MLE means minimizing $\chi^2$ with respect to $a_k$:

$$\frac{\partial \chi^2}{\partial a_k} = 0 \;\;\rightarrow\;\; \sum_i \frac{y_i - y(x_i)}{\sigma_i^2}\, \frac{\partial y(x_i)}{\partial a_k} = 0$$

This is a system of M nonlinear equations for M unknowns.
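A minimal numerical sketch of minimizing χ² when these equations are nonlinear; the exponential model, data, and starting point are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (assumed model and data): minimize chi^2 numerically when the
# M equations are nonlinear. Toy model y(x|a) = a0 * exp(-a1 * x).
def model(x, a):
    return a[0] * np.exp(-a[1] * x)

def chi2(a, x, y, sigma):
    return np.sum(((y - model(x, a)) / sigma) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
sigma = 0.1 * np.ones_like(x)
y = model(x, [2.0, 0.7]) + sigma * rng.normal(size=x.size)

res = minimize(chi2, x0=[1.0, 1.0], args=(x, y, sigma))
print("MLE parameters:", res.x)        # should be close to [2.0, 0.7]
```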


Fitting data to a straight line: the model is a line

Linear regression: $y(x) = a + b\,x$

$\chi^2$ measures how well the model agrees with the data. Minimize $\chi^2$:

$$\chi^2(a,b) = \sum_i \frac{(y_i - a - b\,x_i)^2}{\sigma_i^2}$$
Define:
$$S = \sum_i \frac{1}{\sigma_i^2}, \quad S_x = \sum_i \frac{x_i}{\sigma_i^2}, \quad S_y = \sum_i \frac{y_i}{\sigma_i^2}, \quad S_{xx} = \sum_i \frac{x_i^2}{\sigma_i^2}, \quad S_{xy} = \sum_i \frac{x_i y_i}{\sigma_i^2}$$

Matrix form:
$$\begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix}$$

Solve this with linear algebra.
$$C^{-1} = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad C = \frac{1}{\Delta}\begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}, \qquad \Delta \equiv S\,S_{xx} - S_x^2$$

Solution:
$$\hat a = \frac{S_{xx}\,S_y - S_x\,S_{xy}}{\Delta}, \qquad \hat b = \frac{S\,S_{xy} - S_x\,S_y}{\Delta}$$

This gives the best fit $\hat a$ and $\hat b$.
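A minimal sketch of this fit on synthetic data (the true line, errors, and number of points are assumptions):

```python
import numpy as np

# Minimal sketch (assumed data): weighted straight-line fit y = a + b*x via the
# sums S, Sx, Sy, Sxx, Sxy and the closed-form solution above.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
sigma = 0.5 * np.ones_like(x)
y = 1.0 + 2.0 * x + sigma * rng.normal(size=x.size)   # true a = 1, b = 2

w = 1.0 / sigma**2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()

Delta = S * Sxx - Sx**2
a_hat = (Sxx * Sy - Sx * Sxy) / Delta
b_hat = (S * Sxy - Sx * Sy) / Delta
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```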

What about the errors?
• We approximate the log posterior around its peak with a
quadratic function
• The posterior is thus approximated as a Gaussian
• This goes under the name Laplace approximation
• Note that the errors need to be described as a matrix
• It is exact for linear parameters (such as a and b)

MLE/MAP + Laplace

$$-2\ln p(a, b|y_i) = -2\ln L(a, b) \quad (\text{flat prior})$$

Taylor expansion around the peak ($\hat a$, $\hat b$): the first derivative is 0. Let $a = x_1$, $b = x_2$:

$$-2\ln L(x_1, x_2) = -2\ln L(\hat x_1, \hat x_2) - 2\cdot\frac{1}{2}\sum_{i,j=1,2} \left.\frac{\partial^2 \ln L}{\partial x_i \partial x_j}\right|_{x_i=\hat x_i}\,\delta x_i\,\delta x_j, \qquad \delta x_i \equiv x_i - \hat x_i$$

so that
$$\ln L \approx \ln L(\hat x_1, \hat x_2) - \frac{1}{2}\sum_{ij} \delta x_i\, C^{-1}_{ij}\, \delta x_j, \qquad \text{Note: } \langle \delta x_i\, \delta x_j \rangle = C_{ij}$$

Gaussian posterior approximation: we are dropping terms beyond 2nd order.

$$-\left.\frac{\partial^2 \ln L}{\partial x_i \partial x_j}\right|_{\hat x} \equiv C^{-1}_{ij}$$
($C^{-1} = \alpha$ is called the precision matrix; the precision matrix is also called the Hessian matrix.)

$$L \propto e^{-\frac{1}{2}\sum_{ij} \delta x_i\, C^{-1}_{ij}\, \delta x_j}$$
$$-2\ln L = \chi^2$$

$$\frac{\partial^2 \chi^2}{\partial a^2} = 2\sum_i \frac{1}{\sigma_i^2} = 2S, \qquad
\frac{\partial^2 \chi^2}{\partial b^2} = 2\sum_i \frac{x_i^2}{\sigma_i^2} = 2S_{xx}, \qquad
\frac{\partial^2 \chi^2}{\partial a\, \partial b} = 2\sum_i \frac{x_i}{\sigma_i^2} = 2S_x$$

$$C^{-1} = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad C = \frac{1}{\Delta}\begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}$$

$S^{-1}$ is the error (variance) on $a$ at a fixed $b$.

Marginalized error on $a$: integrate out $b$. Marginal errors are larger: $\sigma_a^2 = C_{aa} > S^{-1}$.
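A minimal sketch of these error estimates, using the same assumed x values and σ as in the fit sketch above:

```python
import numpy as np

# Minimal sketch (assumed x and sigma): Laplace errors for the line fit. The
# precision matrix is [[S, Sx], [Sx, Sxx]]; its inverse C is the parameter
# covariance. Compare the conditional variance of a (b fixed) with C_aa.
x = np.linspace(0, 10, 30)
sigma = 0.5 * np.ones_like(x)

w = 1.0 / sigma**2
S, Sx, Sxx = w.sum(), (w * x).sum(), (w * x**2).sum()

precision = np.array([[S, Sx], [Sx, Sxx]])
C = np.linalg.inv(precision)

print("conditional variance of a (b fixed):", 1.0 / S)
print("marginal variance of a (C_aa):      ", C[0, 0])   # always >= 1/S
```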
Show:
$$\int da\; e^{-\frac{1}{2}\left[(a-\hat a)^2 C^{-1}_{aa} + 2(a-\hat a)(b-\hat b)\, C^{-1}_{ab} + (b-\hat b)^2 C^{-1}_{bb}\right]} \;\propto\; e^{-\frac{1}{2}\,\frac{(b-\hat b)^2}{C_{bb}}}$$

(Complete the square in $a$.)

Solution: writing $\delta a = a - \hat a$, $\delta b = b - \hat b$, the bracket becomes
$$C^{-1}_{aa}\left(\delta a + \frac{C^{-1}_{ab}}{C^{-1}_{aa}}\,\delta b\right)^2 - \frac{(C^{-1}_{ab})^2}{C^{-1}_{aa}}\,\delta b^2 + C^{-1}_{bb}\,\delta b^2$$

The Gaussian integral over $a$ gives a constant,
$$\int da\; e^{-\frac{1}{2} C^{-1}_{aa}\left(\delta a + \frac{C^{-1}_{ab}}{C^{-1}_{aa}}\,\delta b\right)^2} = \sqrt{\frac{2\pi}{C^{-1}_{aa}}}$$

so the remaining $b$ dependence is
$$\propto\; e^{-\frac{1}{2}\left[C^{-1}_{bb} - \frac{(C^{-1}_{ab})^2}{C^{-1}_{aa}}\right]\delta b^2} = e^{-\frac{1}{2}\,\frac{\delta b^2}{C_{bb}}}$$

Bayesian Posterior and Marginals
• The posterior distribution p(a, b|y_i) is described by a 2-d ellipse, set by $C^{-1}$, in the (a, b) plane

• At any fixed value of a (or b), the posterior of b (or a) is a Gaussian with variance $[C^{-1}_{bb(aa)}]^{-1}$

• If we want to know the error on b (or a) independent of a (or b), we need to marginalize over a (or b)

• This marginalization can be done analytically (completing the square), and leads to $C_{bb(aa)}$ as the variance of b (or a)

• This will increase the error: $C_{bb(aa)} > [C^{-1}_{bb(aa)}]^{-1}$

Asymptotics theorems
(Le Cam 1953, adapted to Bayesian posteriors)

• At a fixed number of parameters, posteriors approach a multivariate Gaussian in the large-N limit (N: number of data points): this is because the 2nd-order Taylor expansion of ln L becomes more and more accurate in this limit, i.e. we can drop 3rd- and higher-order terms, by the central limit theorem

• The marginalized means approach the true values, and the covariance approaches the inverse of the Fisher matrix, defined as the ensemble average of the precision matrix, $\langle C^{-1} \rangle$

• The likelihood dominates over the prior in the large-N limit

Asymptotics theorems
(Le Cam 1953, adapted to Bayesian posteriors)
• There are caveats when this does not apply, e.g. when the data are not informative about a parameter or some linear combination of parameters, when the number of parameters M is comparable to N, when posteriors are improper or likelihoods are unbounded… Always exercise care!

• In practice the asymptotic limit is often not achieved for nonlinear models, i.e. we cannot linearize the model across the region of non-zero posterior: this is why we will use advanced Bayesian methods to evaluate the posteriors instead of the Gaussian approximation

• It is useful to know that this limit exists, but since we cannot know ahead of time whether we are in this limit, in practice we cannot assume it: we will be doing full Bayesian posteriors in this course, but we will also sometimes compare to the Gaussian limit
Multivariate linear least squares

• We can generalize the model to a generic functional form:
y_i = a_0 X_0(x_i) + a_1 X_1(x_i) + … + a_{M-1} X_{M-1}(x_i)

• The problem is linear in a_j and can be nonlinear in x_i, e.g. X_j(x_i) = x_i^j

• We can define the design matrix A_ij = X_j(x_i)/σ_i and b_i = y_i/σ_i

Design matrix (figure credit: Numerical Recipes, Press et al.)


Solution by the normal equations, $\partial\chi^2/\partial a_k = 0$:
$$\sum_j \alpha_{kj}\, a_j = \sum_i A_{ik}\, b_i, \qquad \alpha_{kj} = \frac{1}{2}\frac{\partial^2 \chi^2}{\partial a_k\, \partial a_j} = (A^T A)_{kj}$$

To solve the normal equations and obtain the best-fit values and the precision matrix we need to learn linear algebra numerical methods: the topic of the next lecture.
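A minimal sketch of the normal equations for a polynomial basis; the basis, data, and noise level are assumptions, and the numerical linear algebra is deferred to the next lecture:

```python
import numpy as np

# Minimal sketch (assumed data): multivariate linear least squares with basis
# X_j(x) = x**j. Build A_ij = X_j(x_i)/sigma_i and b_i = y_i/sigma_i, solve the
# normal equations (A^T A) a = A^T b, and read errors from C = (A^T A)^-1.
rng = np.random.default_rng(3)
M = 3                                      # number of basis functions
x = np.linspace(-1, 1, 40)
sigma = 0.1 * np.ones_like(x)
y = 0.5 - 1.0 * x + 2.0 * x**2 + sigma * rng.normal(size=x.size)

A = np.vander(x, M, increasing=True) / sigma[:, None]   # A_ij = x_i**j / sigma_i
b = y / sigma

alpha = A.T @ A                            # precision matrix
a_hat = np.linalg.solve(alpha, A.T @ b)    # best-fit parameters
C = np.linalg.inv(alpha)                   # parameter covariance
print("best fit:", a_hat)
print("errors:  ", np.sqrt(np.diag(C)))    # marginalized 1-sigma errors
```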
Gaussian posterior

Marginalization over nuisance parameters


• If we want to know the error on j-th parameter we need to
marginalize over all other parameters

• In analogy to the 2-d case, this leads to σ_j² = C_jj

• So we need to invert the precision matrix α = C⁻¹ to get C


• Analytic marginalization is only possible for a multi-variate
Gaussian distribution: a great advantage of using a Gaussian

• If the posterior is not Gaussian it may be made more


Gaussian by a nonlinear transformation of the variable
What about multi-dimensional projections?
• Suppose we are interested in ν components of a, marginalizing over the remaining M − ν components.

• We take the components of C corresponding to the ν parameters to create the ν × ν matrix C_proj
• Invert the matrix to get the precision matrix C_proj⁻¹
• The posterior distribution is proportional to
exp(−δa_projᵀ C_proj⁻¹ δa_proj / 2),
which is distributed as exp(−Δχ²/2),
i.e. a χ² with ν degrees of freedom

Credible intervals under Gaussian posterior approx.
• We like to quote posteriors in terms of X% credible intervals
• For Gaussian likelihoods the most compact posterior regions correspond to a constant change Δχ² relative to the MAP/MLE
• The intervals depend on the dimension: example for X = 68 (see the sketch below)
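A minimal sketch of how the Δχ² threshold grows with the number of projected dimensions, using scipy's χ² quantile function:

```python
from scipy.stats import chi2

# Minimal sketch: delta-chi^2 that encloses 68.3% of a Gaussian posterior as a
# function of the projected dimension nu (nu = 1 recovers the familiar
# delta-chi^2 = 1, i.e. the 1-sigma interval).
for nu in (1, 2, 3):
    print(f"nu = {nu}: delta chi^2 = {chi2.ppf(0.683, df=nu):.2f}")
```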

We rarely go above ν = 2 dimensions in projections
(difficult to visualize)

Introduction to Machine Learning
• From some input x, the output can be:
• Summary z: unsupervised learning (descriptive, hindsight)
• Prediction y: supervised learning (predictive, insight)
• Action a to maximize reward r: reinforcement learning (prescriptive, foresight)
• Value vs difficulty (although this view is subjective)
• Supervised learning: classification and regression
• Unsupervised learning: e.g. dimensionality reduction
Chris Wiggins' taxonomy; Gartner/Recht graph
Data Analysis versus Machine Learning
• In the physical sciences we usually compare data to a physics-based model to infer parameters of the model. This is often an analytic model as a function of physical parameters (e.g. linear regression). This is the Bayesian Data Analysis component of this course. We need a likelihood and a prior.
• In machine learning we usually do not have a model; all we have is data. If the data are labeled, we can also do inference on new unlabeled data: we learn that data with a certain value of the label have certain properties, so that when we evaluate new data we can assign a value of the label to them. This works both for regression (continuous label values) and classification (discrete label values). ML is a fancy version of interpolation.
• Hybrid: likelihood-free inference (LFI), i.e. inference using ML methods. Instead of doing a prior + likelihood analysis we make labeled synthetic data realizations using simulations, and use ML methods to infer the parameter values given the actual data realization. We pay the price of sampling noise, in that we may not have sufficient simulations for the ML methods to learn the labels well.
• For very complicated, high-dimensional problems a full Bayesian analysis may not be feasible and LFI can be an attractive alternative. We will be learning both approaches in this course.

Supervised Learning (SL)
• Answering a specific question: e.g. regression or classification
• Supervised learning is essentially interpolation
• General approach: frame the problem, collect the data
• Choose the SL algorithm
• Choose the objective function (decide what to optimize)
• Train the algorithm, test (cross-validate)

Classes of problems: regression

Basic machine learning procedure
• We have some data x and some labels y, such that Y = (x, y). We wish to find some model g(a) and some cost or loss function C(Y, g(a)) that we wish to minimize, such that the model g(a) explains the data Y.
• E.g. Y = (x, y), C = χ²
• g = a_0 X_0(x_i) + a_1 X_1(x_i) + … + a_{M-1} X_{M-1}(x_i)

• In ML we divide the data into training data Y_train (e.g. 90%) and test data Y_test (e.g. 10%)
• We fit the model to the training data: the value of the minimum loss function at a_min is called the in-sample error E_in = C(Y_train, g(a_min))
• We test the results on the test data, getting the out-of-sample error E_out = C(Y_test, g(a_min)) > E_in
• This is called the cross-validation technique (see the sketch below)
• If we compare different models, the held-out data used to select among models (each trained on the training data) are called validation data, and a separate test set is kept to assess the final model: a 3-way split, e.g. 60%, 30%, 10%
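A minimal cross-validation sketch; the toy data, polynomial degree, and split fraction are assumptions, using scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Minimal sketch (assumed toy data): 90/10 train/test split with a degree-10
# polynomial model; E_in is the loss on the training data, E_out on the test
# data, and typically E_out > E_in.
rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 200)[:, None]
y = 2 * x[:, 0] + 0.3 * rng.normal(size=x.shape[0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(x_train, y_train)

E_in = mean_squared_error(y_train, model.predict(x_train))
E_out = mean_squared_error(y_test, model.predict(x_test))
print(f"E_in = {E_in:.4f}, E_out = {E_out:.4f}")
```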
Data analysis versus machine learning
• Data analysis: fitting existing data to a physics-based model to obtain model parameters y. The parameters are fixed: we know the physics up to the parameter values. Parameter posteriors are the goal.
• ML: use a model derived from existing data to predict regression or classification parameters y for new data.
• Example: polynomial regression. This will be the HW 4 problem
• We can fit the training data to a simple model or a complex model
• In the absence of noise a complex model (many fitting parameters a) is always better
• In the presence of noise a complex model is often worse
• Note that the parameters a have no meaning on their own; they are just a means to reach the goal of predicting y
[Figures: polynomial fits to data generated from f(x) = 2x (no noise) and from f(x) = 2x - 10x^5 + 15x^10]
Over-fitting noise with too complex models (bias-variance trade-off)

Bias-variance trade-off

Another example: k-nearest neighbors
How do the predictions change as we average over more nearest neighbours?
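A minimal sketch of this; the toy sinusoidal data and the values of k are assumptions, using scikit-learn's k-NN regressor:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Minimal sketch (assumed toy data): k-nearest-neighbour regression. Small k
# follows the noise (high variance); large k over-smooths (high bias).
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]
y = np.sin(x[:, 0]) + 0.2 * rng.normal(size=x.shape[0])

x_grid = np.linspace(0, 2 * np.pi, 200)[:, None]
for k in (1, 5, 50):
    pred = KNeighborsRegressor(n_neighbors=k).fit(x, y).predict(x_grid)
    mse = np.mean((pred - np.sin(x_grid[:, 0])) ** 2)
    print(f"k = {k:2d}: mean squared deviation from sin(x) = {mse:.3f}")
```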

Statistical learning theory
• We have data and we can change the number of data points
• We have models and we can change the complexity (the number of model parameters, in simple versions)
• Trade-off at fixed model complexity:
  • small data size suffers from a large variance (we are overfitting noise)
  • large data size suffers from model bias
• Variance is quantified by E_in vs E_out
• E_in and E_out approach the bias for large data
• To reduce the bias, increase the complexity
Bias-variance trade-off vs complexity
• Low complexity: large bias
• Large complexity: large variance
• Optimum when the two are balanced
• Complexity can be controlled by regularization (we will discuss it further)
Representational power
• We are learning a manifold M: f: X → Y

• To learn complex manifolds we need high representational power
• We need a universal approximator with good generalization properties (from in-sample to out-of-sample, i.e. not over-fitting)
• This is where neural networks excel: they can fit anything (literally, including pure noise), yet can also generalize
Unsupervised machine learning
• Discovering structure in unlabeled data
• Examples: clustering, dimensionality reduction
• The promise: easier to do regression, classification
• Easier visualization
Dimensionality reduction
• PCA (lecture 4), ICA (lecture 5)
• Manifold projection: we want to reduce the dimensionality while preserving the pairwise distances between data points (e.g. t-SNE, ISOMAP, UMAP)
• If reduced too much we get the crowding problem

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

• Tries to connect nearby points using a locally varying metric
• Best on the market at the moment
• You will try it in HW 3
• Example: MNIST digits separate in the 2d UMAP plane (see the sketch below)
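A minimal sketch, assuming the umap-learn package is installed; the small scikit-learn digits set stands in for MNIST here:

```python
import umap                          # assumes the umap-learn package
from sklearn.datasets import load_digits

# Minimal sketch (assumed setup): embed the 8x8 digits into 2d with UMAP;
# digits of the same class tend to form separate clusters in the plane.
digits = load_digits()
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(digits.data)
print(embedding.shape)               # (1797, 2): one 2d point per digit image
```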

Clustering algorithms

• For unsupervised learning (no labels available) we also need to identify distinct classes
• Clustering algorithms look at clusters of data in the original space or in a reduced-dimensionality space
• We will look at k-means and the Gaussian mixture model later
• Clustering algorithms such as HDBSCAN connect close particles together: friends-of-friends algorithms
• HW 3: UMAP + clustering (a minimal sketch follows below)
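A minimal clustering sketch; k-means on the scikit-learn digits is an assumed stand-in here, whereas HW 3 would instead cluster a reduced (e.g. UMAP) representation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Minimal sketch (assumed setup): k-means with 10 clusters in the original
# 64-dimensional pixel space of the digits.
digits = load_digits()
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)
print(kmeans.labels_[:20])           # cluster assignment of the first 20 images
```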

Literature
• Numerical Recipes, Press et al., Chapter 15 (http://apps.nrbook.com/c/index.html)
• Bayesian Data Analysis, Gelman et al., Chapters 1-4
• https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
• A high-bias, low-variance introduction to machine learning for physicists, https://arxiv.org/pdf/1803.08823.pdf (pictures on slides 34-42 are taken from this review)