
LECTURE 3: INTRODUCTION TO DATA ANALYSIS AND MACHINE LEARNING
PHYS188/288: Bayesian Data Analysis and Machine Learning for Physical Sciences, Uroš Seljak


• Goal of data analysis: to determine some parameters from the data
• We want to combine new data with previous information on the parameters (prior: theoretical or empirical)
• We multiply the likelihood of the parameters given the data by the prior to get the posterior
• Goal of machine learning: we want to predict parameters of new data given some existing labeled data (supervised learning), or to search for patterns and reduce dimensionality (unsupervised learning)

• The goals of statistics and ML are often similar or related


• Methodologies and language are often very different
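A minimal numerical sketch of "posterior ∝ prior × likelihood" on a parameter grid; the Gaussian prior, the data constraint, and all numbers are toy assumptions for illustration, not taken from the slides:

```python
import numpy as np

# Toy illustration (assumed numbers): posterior ∝ prior × likelihood on a grid.
lam = np.linspace(-5, 5, 1001)
prior = np.exp(-0.5 * (lam - 0.0) ** 2 / 2.0 ** 2)       # prior: lambda ~ N(0, 2^2)
likelihood = np.exp(-0.5 * (lam - 1.5) ** 2 / 0.5 ** 2)   # data prefer lambda ~ 1.5 +- 0.5

posterior = prior * likelihood
dlam = lam[1] - lam[0]
posterior /= posterior.sum() * dlam                        # normalize on the grid

mean = (lam * posterior).sum() * dlam
print(f"posterior mean = {mean:.3f}")                      # pulled toward the data
```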
Goals of Data Analysis

• Data analysis: summarizing the posterior information: mean or mode, variance… Typically we are interested in more than the mean and variance (skewness, kurtosis, the full PDF)
• Posterior intervals: e.g. a 95% credible interval can be constructed as central (relative to the median) or as highest posterior density. Typically these agree, but they can differ, e.g. for strongly skewed posteriors.

Posterior PDF $p(\lambda|D,H)$ contains all information on $\lambda$

$\lambda^*$ = Maximum A Posteriori (MAP) estimate

$p(\lambda|D,H) \propto p(D|\lambda,H)$ if $p(\lambda) \propto$ constant

If $p(\lambda) \propto$ constant (uniform prior) → $\lambda^*$ = maximum likelihood estimator (MLE), and MLE = MAP

$$\left.\frac{d}{d\lambda}\, p(\lambda|D,H)\right|_{\lambda=\lambda^*} = 0 \qquad \text{Approximate } p(\lambda|D) \text{ as a Gaussian around } \lambda^*$$

• Error estimate: $\displaystyle \left.\frac{d^2}{d\lambda^2}\ln p(\lambda|D,H)\right|_{\lambda=\lambda^*} = -\frac{1}{\sigma^2}$

• Laplace approximation: $\lambda = \lambda^* \pm \sigma$
Posterior predictive distribution
• Predicting a future observation conditional on the current data y and the model posterior: we marginalize over all models at fixed current data y
• We have seen it in example 2.6 (lecture 2, slide 28)
• Example: we measure a quantity, but each measurement has some error σ. After N measurements we get a mean µ1 with error σ1 = σ/N^{1/2}. The next measurement will be located around µ1, with the two errors combined in quadrature (σ² + σ1²).

Two sources of uncertainty! This will be discussed further when we cover hierarchical models.
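A minimal sketch of this example (the number of measurements, error, and true value are assumptions); the predictive spread adds the measurement error and the posterior error in quadrature:

```python
import numpy as np

# Minimal sketch (assumed numbers): N measurements of a constant signal, each
# with known Gaussian error sigma. With a flat prior, the posterior mean is mu1
# with error sigma1 = sigma/sqrt(N); the posterior predictive for the next
# measurement combines sigma and sigma1 in quadrature.
rng = np.random.default_rng(0)
sigma, N, true_mu = 1.0, 25, 3.0
y = true_mu + sigma * rng.normal(size=N)

mu1 = y.mean()                         # posterior mean of the signal
sigma1 = sigma / np.sqrt(N)            # posterior error of the signal
sigma_pred = np.hypot(sigma, sigma1)   # predictive error: sqrt(sigma^2 + sigma1^2)

print(f"posterior:        {mu1:.3f} +- {sigma1:.3f}")
print(f"next measurement: {mu1:.3f} +- {sigma_pred:.3f}")
```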

Modern statistical methods (Bayesian or not)
Gelman et al., Bayesian Data Analysis, 3rd edition

INTRODUCTION TO MODELING OF DATA

• We are given N data measurements (x_i, y_i)
• Each measurement comes with an error estimate σ_i
• We have a parametrized model for the data, y = y(x_i)
• We think the error probability is Gaussian and the measurements are uncorrelated:

$$p(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\left[-\frac{(y(x_i)-y_i)^2}{2\sigma_i^2}\right], \qquad p(\vec y\,) = \prod_i p(y_i)$$

We can parametrize the model in terms of M free parameters: y(x_i | a_1, a_2, a_3, …, a_M)

The Bayesian formalism gives us the full posterior information on the parameters of the model:
$$p(\vec y\,|\vec a) = \prod_i p(y_i|\vec a) = L(\vec a)$$

$$p(a_1, \ldots, a_M|\vec y\,) = \frac{\prod_i p(y_i|\vec a)\; p(\vec a)}{p(\vec y\,)}$$

We can assume a flat prior, p(a_1, a_2, a_3, …, a_M) = constant.

In this case the posterior is proportional to the likelihood L.

The normalization (evidence, marginal) $p(\vec y\,)$ is not needed if we just need the relative posterior density.
Maximum likelihood estimator (MLE)

• Instead of the full posterior, we can ask what is the best-fit value of the parameters a_1, a_2, a_3, …, a_M

• We can define this in different ways: mean, median, mode

• Choosing the mode (peak posterior or peak likelihood) means we want to maximize the likelihood: the maximum likelihood estimator (or MAP for a non-uniform prior)

$$\mathrm{MLE}: \quad \frac{\partial L}{\partial \vec a} = 0 \quad \text{or} \quad \frac{\partial \ln L}{\partial \vec a} = 0$$

Maximum likelihood estimator for Gaussian errors

$$p(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\left[-\frac{(y(x_i)-y_i)^2}{2\sigma_i^2}\right], \qquad p(\vec y\,) = \prod_i p(y_i)$$

$$-2\ln L = \sum_i \left\{ \frac{(y_i - y(x_i|a_1, \ldots, a_M))^2}{\sigma_i^2} + \ln \sigma_i^2 \right\}$$

The first term is $\chi^2$. Since $\sigma_i$ does not depend on the parameters $a$, the MLE means minimizing $\chi^2$ with respect to $a_k$:

$$\frac{\partial \chi^2}{\partial a_k} = 0 \;\;\rightarrow\;\; \sum_i \frac{y_i - y(x_i)}{\sigma_i^2}\, \frac{\partial y(x_i)}{\partial a_k} = 0$$

This is a system of M nonlinear equations for M unknowns.
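A minimal numerical sketch of minimizing χ² when these equations are nonlinear; the exponential model, data, and starting point are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (assumed model and data): minimize chi^2 numerically when the
# M equations are nonlinear. Toy model y(x|a) = a0 * exp(-a1 * x).
def model(x, a):
    return a[0] * np.exp(-a[1] * x)

def chi2(a, x, y, sigma):
    return np.sum(((y - model(x, a)) / sigma) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
sigma = 0.1 * np.ones_like(x)
y = model(x, [2.0, 0.7]) + sigma * rng.normal(size=x.size)

res = minimize(chi2, x0=[1.0, 1.0], args=(x, y, sigma))
print("MLE parameters:", res.x)        # should be close to [2.0, 0.7]
```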


Fitting data to a straight line: the model is a line

Linear regression: $y(x) = a + b\,x$

$\chi^2$ measures how well the model agrees with the data. Minimize $\chi^2$:

$$\chi^2(a,b) = \sum_i \frac{(y_i - a - b\,x_i)^2}{\sigma_i^2}$$
Define:
$$S = \sum_i \frac{1}{\sigma_i^2}, \quad S_x = \sum_i \frac{x_i}{\sigma_i^2}, \quad S_y = \sum_i \frac{y_i}{\sigma_i^2}, \quad S_{xx} = \sum_i \frac{x_i^2}{\sigma_i^2}, \quad S_{xy} = \sum_i \frac{x_i y_i}{\sigma_i^2}$$

Matrix form:
$$\begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix}$$

Solve this with linear algebra.
$$C^{-1} = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad C = \frac{1}{\Delta}\begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}, \qquad \Delta \equiv S\,S_{xx} - S_x^2$$

Solution:
$$\hat a = \frac{S_{xx}\,S_y - S_x\,S_{xy}}{\Delta}, \qquad \hat b = \frac{S\,S_{xy} - S_x\,S_y}{\Delta}$$

This gives the best fit $\hat a$ and $\hat b$.
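A minimal sketch of this fit on synthetic data (the true line, errors, and number of points are assumptions):

```python
import numpy as np

# Minimal sketch (assumed data): weighted straight-line fit y = a + b*x via the
# sums S, Sx, Sy, Sxx, Sxy and the closed-form solution above.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
sigma = 0.5 * np.ones_like(x)
y = 1.0 + 2.0 * x + sigma * rng.normal(size=x.size)   # true a = 1, b = 2

w = 1.0 / sigma**2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()

Delta = S * Sxx - Sx**2
a_hat = (Sxx * Sy - Sx * Sxy) / Delta
b_hat = (S * Sxy - Sx * Sy) / Delta
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```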

What about the errors?
• We approximate the log posterior around its peak with a
quadratic function
• The posterior is thus approximated as a Gaussian
• This goes under the name Laplace approximation
• Note that the errors need to be described as a matrix
• It is exact for linear parameters (such as a and b)

MLE/MAP + Laplace

$$-2\ln p(a, b|y_i) = -2\ln L(a, b) \quad (\text{flat prior})$$

Taylor expansion around the peak ($\hat a$, $\hat b$): the first derivative is 0. Let $a = x_1$, $b = x_2$:

$$-2\ln L(x_1, x_2) = -2\ln L(\hat x_1, \hat x_2) - 2\cdot\frac{1}{2}\sum_{i,j=1,2} \left.\frac{\partial^2 \ln L}{\partial x_i \partial x_j}\right|_{x_i=\hat x_i}\,\delta x_i\,\delta x_j, \qquad \delta x_i \equiv x_i - \hat x_i$$

so that
$$\ln L \approx \ln L(\hat x_1, \hat x_2) - \frac{1}{2}\sum_{ij} \delta x_i\, C^{-1}_{ij}\, \delta x_j, \qquad \text{Note: } \langle \delta x_i\, \delta x_j \rangle = C_{ij}$$

Gaussian posterior approximation: we are dropping terms beyond 2nd order.

$$-\left.\frac{\partial^2 \ln L}{\partial x_i \partial x_j}\right|_{\hat x} \equiv C^{-1}_{ij}$$
($C^{-1} = \alpha$ is called the precision matrix; the precision matrix is also called the Hessian matrix.)

$$L \propto e^{-\frac{1}{2}\sum_{ij} \delta x_i\, C^{-1}_{ij}\, \delta x_j}$$
$$-2\ln L = \chi^2$$

$$\frac{\partial^2 \chi^2}{\partial a^2} = 2\sum_i \frac{1}{\sigma_i^2} = 2S, \qquad
\frac{\partial^2 \chi^2}{\partial b^2} = 2\sum_i \frac{x_i^2}{\sigma_i^2} = 2S_{xx}, \qquad
\frac{\partial^2 \chi^2}{\partial a\, \partial b} = 2\sum_i \frac{x_i}{\sigma_i^2} = 2S_x$$

$$C^{-1} = \begin{pmatrix} S & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad C = \frac{1}{\Delta}\begin{pmatrix} S_{xx} & -S_x \\ -S_x & S \end{pmatrix}$$

$S^{-1}$ is the error (variance) on $a$ at a fixed $b$.

Marginalized error on $a$: integrate out $b$. Marginal errors are larger: $\sigma_a^2 = C_{aa} > S^{-1}$.
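A minimal sketch of these error estimates, using the same assumed x values and σ as in the fit sketch above:

```python
import numpy as np

# Minimal sketch (assumed x and sigma): Laplace errors for the line fit. The
# precision matrix is [[S, Sx], [Sx, Sxx]]; its inverse C is the parameter
# covariance. Compare the conditional variance of a (b fixed) with C_aa.
x = np.linspace(0, 10, 30)
sigma = 0.5 * np.ones_like(x)

w = 1.0 / sigma**2
S, Sx, Sxx = w.sum(), (w * x).sum(), (w * x**2).sum()

precision = np.array([[S, Sx], [Sx, Sxx]])
C = np.linalg.inv(precision)

print("conditional variance of a (b fixed):", 1.0 / S)
print("marginal variance of a (C_aa):      ", C[0, 0])   # always >= 1/S
```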
Show:
$$\int da\; e^{-\frac{1}{2}\left[(a-\hat a)^2 C^{-1}_{aa} + 2(a-\hat a)(b-\hat b)\, C^{-1}_{ab} + (b-\hat b)^2 C^{-1}_{bb}\right]} \;\propto\; e^{-\frac{1}{2}\,\frac{(b-\hat b)^2}{C_{bb}}}$$

(Complete the square in $a$.)

Solution: writing $\delta a = a - \hat a$, $\delta b = b - \hat b$, the bracket becomes
$$C^{-1}_{aa}\left(\delta a + \frac{C^{-1}_{ab}}{C^{-1}_{aa}}\,\delta b\right)^2 - \frac{(C^{-1}_{ab})^2}{C^{-1}_{aa}}\,\delta b^2 + C^{-1}_{bb}\,\delta b^2$$

The Gaussian integral over $a$ gives a constant,
$$\int da\; e^{-\frac{1}{2} C^{-1}_{aa}\left(\delta a + \frac{C^{-1}_{ab}}{C^{-1}_{aa}}\,\delta b\right)^2} = \sqrt{\frac{2\pi}{C^{-1}_{aa}}}$$

so the remaining $b$ dependence is
$$\propto\; e^{-\frac{1}{2}\left[C^{-1}_{bb} - \frac{(C^{-1}_{ab})^2}{C^{-1}_{aa}}\right]\delta b^2} = e^{-\frac{1}{2}\,\frac{\delta b^2}{C_{bb}}}$$

Bayesian Posterior and Marginals
• The posterior distribution p(a, b|y_i) is described by a 2-d ellipse, set by $C^{-1}$, in the (a, b) plane

• At any fixed value of a (or b), the posterior of b (or a) is a Gaussian with variance $[C^{-1}_{bb(aa)}]^{-1}$

• If we want to know the error on b (or a) independent of a (or b), we need to marginalize over a (or b)

• This marginalization can be done analytically (completing the square), and leads to $C_{bb(aa)}$ as the variance of b (or a)

• This will increase the error: $C_{bb(aa)} > [C^{-1}_{bb(aa)}]^{-1}$

Asymptotics theorems
(Le Cam 1953, adapted to Bayesian posteriors)

• At a fixed number of parameters, posteriors approach a multivariate Gaussian in the large-N limit (N: number of data points): this is because the 2nd-order Taylor expansion of ln L becomes more and more accurate in this limit, i.e. we can drop 3rd- and higher-order terms, by the central limit theorem

• The marginalized means approach the true values, and the covariance approaches the inverse of the Fisher matrix, defined as the ensemble average of the precision matrix, $\langle C^{-1} \rangle$

• The likelihood dominates over the prior in the large-N limit

Asymptotics theorems
(Le Cam 1953, adapted to Bayesian posteriors)
• There are caveats when this does not apply, e.g. when the data are not informative about a parameter or some linear combination of parameters, when the number of parameters M is comparable to N, when posteriors are improper or likelihoods are unbounded… Always exercise care!

• In practice the asymptotic limit is often not achieved for nonlinear models, i.e. we cannot linearize the model across the region of non-zero posterior: this is why we will use advanced Bayesian methods to evaluate the posteriors instead of the Gaussian approximation

• It is useful to know that this limit exists, but since we cannot know ahead of time whether we are in this limit, in practice we cannot assume it: we will be doing full Bayesian posteriors in this course, but we will also sometimes compare to the Gaussian limit
Multivariate linear least squares

• We can generalize the model to a generic functional form:
y_i = a_0 X_0(x_i) + a_1 X_1(x_i) + … + a_{M-1} X_{M-1}(x_i)

• The problem is linear in a_j and can be nonlinear in x_i, e.g. X_j(x_i) = x_i^j

• We can define the design matrix A_ij = X_j(x_i)/σ_i and b_i = y_i/σ_i

Design matrix (figure credit: Numerical Recipes, Press et al.)


Solution by the normal equations, $\partial\chi^2/\partial a_k = 0$:
$$\sum_j \alpha_{kj}\, a_j = \sum_i A_{ik}\, b_i, \qquad \alpha_{kj} = \frac{1}{2}\frac{\partial^2 \chi^2}{\partial a_k\, \partial a_j} = (A^T A)_{kj}$$

To solve the normal equations and obtain the best-fit values and the precision matrix we need to learn linear algebra numerical methods: the topic of the next lecture.
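A minimal sketch of the normal equations for a polynomial basis; the basis, data, and noise level are assumptions, and the numerical linear algebra is deferred to the next lecture:

```python
import numpy as np

# Minimal sketch (assumed data): multivariate linear least squares with basis
# X_j(x) = x**j. Build A_ij = X_j(x_i)/sigma_i and b_i = y_i/sigma_i, solve the
# normal equations (A^T A) a = A^T b, and read errors from C = (A^T A)^-1.
rng = np.random.default_rng(3)
M = 3                                      # number of basis functions
x = np.linspace(-1, 1, 40)
sigma = 0.1 * np.ones_like(x)
y = 0.5 - 1.0 * x + 2.0 * x**2 + sigma * rng.normal(size=x.size)

A = np.vander(x, M, increasing=True) / sigma[:, None]   # A_ij = x_i**j / sigma_i
b = y / sigma

alpha = A.T @ A                            # precision matrix
a_hat = np.linalg.solve(alpha, A.T @ b)    # best-fit parameters
C = np.linalg.inv(alpha)                   # parameter covariance
print("best fit:", a_hat)
print("errors:  ", np.sqrt(np.diag(C)))    # marginalized 1-sigma errors
```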
Gaussian posterior

Marginalization over nuisance parameters


• If we want to know the error on j-th parameter we need to
marginalize over all other parameters

• In analogy to the 2-d case, this leads to σ_j² = C_jj

• So we need to invert the precision matrix α = C⁻¹ to get C


• Analytic marginalization is only possible for a multi-variate
Gaussian distribution: a great advantage of using a Gaussian

• If the posterior is not Gaussian it may be made more


Gaussian by a nonlinear transformation of the variable
What about multi-dimensional projections?
• Suppose we are interested in ν components of a, marginalizing over the remaining M − ν components.

• We take the components of C corresponding to the ν parameters to create the ν × ν matrix C_proj
• Invert the matrix to get the precision matrix C_proj⁻¹
• The posterior distribution is proportional to
exp(−δa_projᵀ C_proj⁻¹ δa_proj / 2),
which is distributed as exp(−Δχ²/2),
i.e. a χ² with ν degrees of freedom

Credible intervals under Gaussian posterior approx.
• We like to quote posteriors in terms of X% credible intervals
• For Gaussian likelihoods the most compact posterior regions correspond to a constant change Δχ² relative to the MAP/MLE
• The intervals depend on the dimension: example for X = 68 (see the sketch below)
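A minimal sketch of how the Δχ² threshold grows with the number of projected dimensions, using scipy's χ² quantile function:

```python
from scipy.stats import chi2

# Minimal sketch: delta-chi^2 that encloses 68.3% of a Gaussian posterior as a
# function of the projected dimension nu (nu = 1 recovers the familiar
# delta-chi^2 = 1, i.e. the 1-sigma interval).
for nu in (1, 2, 3):
    print(f"nu = {nu}: delta chi^2 = {chi2.ppf(0.683, df=nu):.2f}")
```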

We rarely go above ν = 2 dimensions in projections
(difficult to visualize)

Introduction to Machine Learning
• From some input x, the output can be:
• Summary z: unsupervised learning (descriptive, hindsight)
• Prediction y: supervised learning (predictive, insight)
• Action a to maximize reward r: reinforcement learning (prescriptive, foresight)
• Value vs difficulty (although this view is subjective)
• Supervised learning: classification and regression
• Unsupervised learning: e.g. dimensionality reduction
Chris Wiggins' taxonomy; Gartner/Recht graph
Data Analysis versus Machine Learning
• In the physical sciences we usually compare data to a physics-based model to infer parameters of the model. This is often an analytic model as a function of physical parameters (e.g. linear regression). This is the Bayesian Data Analysis component of this course. We need a likelihood and a prior.
• In machine learning we usually do not have a model; all we have is data. If the data are labeled, we can also do inference on new unlabeled data: we learn that data with a certain value of the label have certain properties, so that when we evaluate new data we can assign a value of the label to them. This works both for regression (continuous label values) and classification (discrete label values). ML is a fancy version of interpolation.
• Hybrid: likelihood-free inference (LFI), i.e. inference using ML methods. Instead of doing a prior + likelihood analysis we make labeled synthetic data realizations using simulations, and use ML methods to infer the parameter values given the actual data realization. We pay the price of sampling noise, in that we may not have sufficient simulations for the ML methods to learn the labels well.
• For very complicated, high-dimensional problems a full Bayesian analysis may not be feasible and LFI can be an attractive alternative. We will be learning both approaches in this course.

Supervised Learning (SL)
• Answering a specific question: e.g. regression or classification
• Supervised learning is essentially interpolation
• General approach: frame the problem, collect the data
• Choose the SL algorithm
• Choose the objective function (decide what to optimize)
• Train the algorithm, test (cross-validate)

Classes of problems: regression

Basic machine learning procedure
• We have some data x and some labels y, such that Y = (x, y). We wish to find some model g(a) and some cost or loss function C(Y, g(a)) that we wish to minimize, such that the model g(a) explains the data Y.
• E.g. Y = (x, y), C = χ²
• g = a_0 X_0(x_i) + a_1 X_1(x_i) + … + a_{M-1} X_{M-1}(x_i)

• In ML we divide the data into training data Y_train (e.g. 90%) and test data Y_test (e.g. 10%)
• We fit the model to the training data: the value of the minimum loss function at a_min is called the in-sample error E_in = C(Y_train, g(a_min))
• We test the results on the test data, getting the out-of-sample error E_out = C(Y_test, g(a_min)) > E_in
• This is called the cross-validation technique (see the sketch below)
• If we compare different models, the held-out data used to select among models (each trained on the training data) are called validation data, and a separate test set is kept to assess the final model: a 3-way split, e.g. 60%, 30%, 10%
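A minimal cross-validation sketch; the toy data, polynomial degree, and split fraction are assumptions, using scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Minimal sketch (assumed toy data): 90/10 train/test split with a degree-10
# polynomial model; E_in is the loss on the training data, E_out on the test
# data, and typically E_out > E_in.
rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 200)[:, None]
y = 2 * x[:, 0] + 0.3 * rng.normal(size=x.shape[0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(x_train, y_train)

E_in = mean_squared_error(y_train, model.predict(x_train))
E_out = mean_squared_error(y_test, model.predict(x_test))
print(f"E_in = {E_in:.4f}, E_out = {E_out:.4f}")
```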
Data analysis versus machine learning
• Data analysis: fitting existing data to a physics-based model to obtain model parameters y. The parameters are fixed: we know the physics up to the parameter values. Parameter posteriors are the goal.
• ML: use a model derived from existing data to predict regression or classification parameters y for new data.
• Example: polynomial regression. This will be the HW 4 problem
• We can fit the training data to a simple model or a complex model
• In the absence of noise a complex model (many fitting parameters a) is always better
• In the presence of noise a complex model is often worse
• Note that the parameters a have no meaning on their own; they are just a means to reach the goal of predicting y
[Figures: polynomial fits to data generated from f(x) = 2x (no noise) and from f(x) = 2x - 10x^5 + 15x^10]
Over-fitting noise with too complex models (bias-variance trade-off)

Bias-variance trade-off

Another example: k-nearest neighbors
How do the predictions change as we average over more nearest neighbours?
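A minimal sketch of this; the toy sinusoidal data and the values of k are assumptions, using scikit-learn's k-NN regressor:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Minimal sketch (assumed toy data): k-nearest-neighbour regression. Small k
# follows the noise (high variance); large k over-smooths (high bias).
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]
y = np.sin(x[:, 0]) + 0.2 * rng.normal(size=x.shape[0])

x_grid = np.linspace(0, 2 * np.pi, 200)[:, None]
for k in (1, 5, 50):
    pred = KNeighborsRegressor(n_neighbors=k).fit(x, y).predict(x_grid)
    mse = np.mean((pred - np.sin(x_grid[:, 0])) ** 2)
    print(f"k = {k:2d}: mean squared deviation from sin(x) = {mse:.3f}")
```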

Statistical learning theory
• We have data and we can change the number of data points
• We have models and we can change the complexity (the number of model parameters, in simple versions)
• Trade-off at fixed model complexity:
  • small data size suffers from a large variance (we are overfitting noise)
  • large data size suffers from model bias
• Variance is quantified by E_in vs E_out
• E_in and E_out approach the bias for large data
• To reduce the bias, increase the complexity
Bias-variance trade-off vs complexity
• Low complexity: large bias
• Large complexity: large variance
• Optimum when the two are balanced
• Complexity can be controlled by regularization (we will discuss it further)
Representational power
• We are learning a manifold M: f: X → Y

• To learn complex manifolds we need high representational power
• We need a universal approximator with good generalization properties (from in-sample to out-of-sample, i.e. not over-fitting)
• This is where neural networks excel: they can fit anything (literally, including pure noise), yet can also generalize
Unsupervised machine learning
• Discovering structure in unlabeled data
• Examples: clustering, dimensionality reduction
• The promise: easier to do regression, classification
• Easier visualization
Dimensionality reduction
• PCA (lecture 4), ICA (lecture 5)
• Manifold projection: we want to reduce the dimensionality while preserving the pairwise distances between data points (e.g. t-SNE, ISOMAP, UMAP)
• If reduced too much we get the crowding problem

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

• Tries to connect nearby points using a locally varying metric
• Best on the market at the moment
• You will try it in HW 3
• Example: MNIST digits separate in the 2d UMAP plane (see the sketch below)
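A minimal sketch, assuming the umap-learn package is installed; the small scikit-learn digits set stands in for MNIST here:

```python
import umap                          # assumes the umap-learn package
from sklearn.datasets import load_digits

# Minimal sketch (assumed setup): embed the 8x8 digits into 2d with UMAP;
# digits of the same class tend to form separate clusters in the plane.
digits = load_digits()
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(digits.data)
print(embedding.shape)               # (1797, 2): one 2d point per digit image
```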

Clustering algorithms

• For unsupervised learning (no labels available) we also need to identify distinct classes
• Clustering algorithms look at clusters of data in the original space or in a reduced-dimensionality space
• We will look at k-means and the Gaussian mixture model later
• Clustering algorithms such as HDBSCAN connect close particles together: friends-of-friends algorithms
• HW 3: UMAP + clustering (a minimal sketch follows below)
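A minimal clustering sketch; k-means on the scikit-learn digits is an assumed stand-in here, whereas HW 3 would instead cluster a reduced (e.g. UMAP) representation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Minimal sketch (assumed setup): k-means with 10 clusters in the original
# 64-dimensional pixel space of the digits.
digits = load_digits()
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)
print(kmeans.labels_[:20])           # cluster assignment of the first 20 images
```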

Literature
• Numerical Recipes, Press et al., Chapter 15 (http://apps.nrbook.com/c/index.html)
• Bayesian Data Analysis, Gelman et al., Chapters 1-4
• https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
• A high-bias, low-variance introduction to machine learning for physicists, https://arxiv.org/pdf/1803.08823.pdf (pictures on slides 34-42 are taken from this review)