
Getting to Know Maximum Likelihood

Marco Steenbergen

September 2023

All models contain parameters that drive their behavior. We do not know what the values of those parameters are; they have to be estimated from our data. Estimation theory is concerned with developing principled ways to bring data to bear on the estimation process. The goal is to derive estimators, that is, formulas that use the data to obtain a parameter estimate, with desirable properties. This is one half of the bigger process known as statistical inference. The other half is hypothesis testing, where the data are used to test a hypothesis about a specific value of a parameter. These notes describe one particular approach to estimation, namely maximum likelihood. Maximum likelihood estimation (or MLE) is a versatile approach that can handle a wide range of models, much wider than ordinary least squares. This is the reason why MLE forms the cornerstone of the course. MLE also generates estimators that, under certain general assumptions, have desirable asymptotic properties such as consistency and normality. Finally, MLE provides a unified framework for both estimation and hypothesis testing [5]. The goal of these notes is to familiarize you with the idea of maximum likelihood, computational methods, as well as issues of which you should be aware. Consult this document as background to the applications of MLE in specific modeling contexts.

1 The Idea of (Maximum) Likelihood

Imagine that someone tosses a coin 10 times and records five heads and five tails. Now ask yourself the following question: is it more likely that we obtained those results from a fair coin or from a coin biased towards yielding heads?

The example illustrates the binomial distribution. That distribution has two parameters: (1) the number of trials or the sample size, in this case 10, and (2) the probability of a success, defined here as the outcome heads. For a fair coin, the probability of success is π = 0.5. Further, the expected number of heads over n = 10 trials is 5, which is exactly the outcome we observed. By contrast, a coin biased toward heads, where π > 0.5, should have produced more than five heads in expectation. The conclusion, then, is that the outcome of five heads is more likely to have been produced by a fair coin.

What we just observed is the likelihood principle in action. Fisher [4, p. 310] defines likelihood as follows: “The likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of the observations should be that observed.” The maximum likelihood estimator is the value that maximizes the likelihood of observing the data.

At this point, we must introduce some notation. We shall designate a single generic parameter as θ. The estimator of that parameter is θ̂. Our data consist of n observations, y_1, y_2, ..., y_n, where n is the sample size. We assume that the observations are n draws from either a probability density function or a probability mass function. (Probability mass functions are used for discrete random variables, whereas probability density functions apply to continuous random variables.) The likelihood function is

\[
L = p(y_1, y_2, \ldots, y_n \mid \theta) \tag{1}
\]

Here, p(.) denotes a distribution. We must specify this distribution to perform maximum likelihood; by contrast, least squares estimation does not require any distributional assumptions. The distribution encapsulates our beliefs about the data generating process (DGP).
We often simplify the likelihood by assuming that the sample consists of independent draws from the distribution. (More precisely, the draws are conditionally independent: all dependencies have been captured through one or more parameters of p(.), and conditional on knowing those parameters there are no residual dependencies.) In this case, the joint distribution in equation (1) can be replaced by a product of marginal distributions, resulting in

\[
L = \prod_{i=1}^{n} p(y_i \mid \theta) \tag{2}
\]

Throughout the course, we shall be using equation (2).

Example 1: Assuming n independent coin tosses resulting in y heads, the binomial likelihood is given by

\[
L = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y}
\]

For the earlier coin tossing example, we substitute n = 10 and y = 5. This yields the likelihood function in Figure 1.

[Figure 1: Binomial Likelihood for n = 10 Trials with y = 5 Successes.]

Figure 1 shows that the apex of the likelihood is reached at 0.5. Hence, the ML estimate is π̂ = 0.5. Formally,

\[
\hat{\theta} = \arg\max_{\theta \in \Theta} L \tag{3}
\]

Here, Θ is the parameter space, i.e., the set of feasible values of the parameter. In Example 1, where the parameter is a probability, the parameter space consists of all real numbers between 0 and 1. The ML estimator is thus a value contained in the parameter space that maximizes the likelihood.

We make one small change to the estimation procedure. Equation (2) contains the product operator. From a computational perspective, this can cause problems when the random variable is discrete. In that case, p(.) is a probability and the product over n probability terms quickly becomes too small for computers to handle. To overcome this problem, we typically work with the log-likelihood:

\[
\ell = \ln(L) = \sum_{i=1}^{n} \ln\left[p(y_i \mid \theta)\right] \tag{4}
\]

Maximizing equation (4) yields the same solution as maximizing the likelihood. However, the log-likelihood is computationally simpler than the likelihood. The binomial log-likelihood is shown in Figure 2.

[Figure 2: Binomial Log-Likelihood for n = 10 Trials with y = 5 Successes.]

In sum, MLE looks for the parameter(s) of a distribution most likely to have given rise to the data. This simple principle is very powerful and can be applied to nearly any probability density or mass function, regardless of its complexity.
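The logic of Example 1 is easy to verify numerically. The following is a minimal R sketch (the grid and variable names are our own, not part of the notes) that evaluates the binomial log-likelihood over candidate values of π and locates its apex.

```r
## Evaluate the binomial log-likelihood for n = 10 trials and y = 5
## successes on a grid of candidate values for pi.
n <- 10
y <- 5
pi_grid <- seq(0.01, 0.99, by = 0.01)  # interior of the parameter space

## dbinom() with log = TRUE returns the log of the binomial mass function.
loglik <- dbinom(y, size = n, prob = pi_grid, log = TRUE)

pi_grid[which.max(loglik)]             # 0.5, the apex in Figures 1 and 2
plot(pi_grid, loglik, type = "l", xlab = "pi", ylab = "log-likelihood")
```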
2 Deriving and Optimizing the Log-Likelihood

2.1 Distributions with a Single Parameter

Deriving the log-likelihood is a straightforward process:

1. Settle on a probability density or mass function, p(y|θ).
2. Take the natural logarithm for a single sampling unit: ln[p(y_i|θ)].
3. Sum over all of the sampling units to derive ℓ.

Note that in some instances, e.g., the BHHH algorithm, it suffices to stop with step 2.

Example 2: The exponential distribution is often used in event duration modeling. In this distribution, the random variable can take on any non-negative real value. The probability density function is

\[
p(y \mid \lambda) = \lambda \exp(-\lambda y)
\]

Here, λ > 0 is the sole parameter of the distribution. For a single sampling unit, the log-density is

\[
\ln\left[p(y_i \mid \lambda)\right] = \ln(\lambda) - \lambda y_i
\]

Summing over all of the units yields

\[
\ell = n \ln(\lambda) - \lambda \sum_{i=1}^{n} y_i
\]

To maximize the log-likelihood with respect to the parameter, we rely on a well-known result from calculus: at an extreme value (such as a maximum) the slope of the tangent line is 0. The slope of the tangent line is given by the first derivative of the log-likelihood with respect to θ. Setting this derivative equal to zero is known as the first-order condition for a maximum. This condition is satisfied at any extreme value. To ensure that a value yields a maximum, we need to evaluate the second-order condition that the second derivative is negative.

Example 3: For the exponential distribution, we obtain the following results.

• The first derivative of the exponential log-likelihood with respect to λ is

\[
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} y_i
\]

• The first-order condition sets the derivative to zero and yields

\[
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} y_i}
\]

• The second derivative of the log-likelihood with respect to λ is

\[
\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2}
\]

Both the numerator and the denominator are positive, which means that the second derivative as a whole is negative. This means that λ̂ yields a maximum for the log-likelihood.

Notice that the formula for the ML estimator includes information that can be gleaned from the data: (1) the sample size, n, and (2) the total score, Σᵢ yᵢ. Estimators should be functions only of the data or of other estimators.
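As a quick check on Example 3, here is a minimal R sketch (the simulated data and names are our own) that compares the closed-form estimator λ̂ = n/Σᵢ yᵢ with a numeric maximization of the same log-likelihood.

```r
## Compare the analytic ML estimator of the exponential rate with a
## numeric maximization of the log-likelihood.
set.seed(42)
y <- rexp(100, rate = 2)            # simulated durations; true rate is 2

lambda_hat <- length(y) / sum(y)    # closed form: n / sum(y_i)

## ell(lambda) = n * ln(lambda) - lambda * sum(y_i), maximized numerically.
loglik <- function(lambda) length(y) * log(lambda) - lambda * sum(y)
opt <- optimize(loglik, interval = c(1e-3, 10), maximum = TRUE)

c(analytic = lambda_hat, numeric = opt$maximum)   # nearly identical
```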
2.2 Distributions with Multiple Parameters

Most distributions contain multiple parameters. This does not fundamentally change things, but the mathematics are a bit more involved. Consider a distribution with P parameters, θ_1, θ_2, ..., θ_P. We collect them in the vector θ and write the log-likelihood as

\[
\ell = \sum_{i=1}^{n} \ln\left[p(y_i \mid \boldsymbol{\theta})\right] \tag{5}
\]

Note that a vector is nothing more than a container of like objects, in our case parameters. By default, we assume this vector to be a column with the objects stacked on top of each other.

Example 4: The normal distribution is the most widely used probability density in the social sciences. It is given by

\[
p(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\frac{(y - \mu)^2}{\sigma^2}\right]
\]

There are two parameters: (1) µ, which captures the center of the distribution and can be any real number, and (2) σ, which captures the dispersion of the random variable and must be strictly positive. We can combine the two parameters into a single vector

\[
\boldsymbol{\theta} = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \begin{pmatrix} \mu \\ \sigma \end{pmatrix}
\]

The log-likelihood can be obtained as follows. For a single unit,

\[
\ln\left[p(y_i \mid \boldsymbol{\theta})\right] = -0.5\ln(2\pi) - \ln(\sigma) - \frac{1}{2}\frac{(y_i - \mu)^2}{\sigma^2}
\]

Aggregating over all units,

\[
\ell = -0.5\, n \ln(2\pi) - n\ln(\sigma) - 0.5\sum_{i=1}^{n}\frac{(y_i - \mu)^2}{\sigma^2}
\]

Since we have multiple parameters, the first-order condition has to be written in terms of partial derivatives: ∂ℓ/∂θ = 0, where 0 is a vector consisting of P zeros stacked on top of each other. We can expand the expression:

\[
\frac{\partial \ell}{\partial \boldsymbol{\theta}} =
\begin{pmatrix}
\frac{\partial \ell}{\partial \theta_1} \\
\frac{\partial \ell}{\partial \theta_2} \\
\vdots \\
\frac{\partial \ell}{\partial \theta_P}
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 0
\end{pmatrix} \tag{6}
\]

Equation (6) clearly shows that each element of ∂ℓ/∂θ is a partial derivative that is set equal to 0. We refer to the vector of partial derivatives as the gradient and denote it with the symbol ∇.

Example 5: For the normal distribution,

\[
\nabla = \begin{pmatrix} \frac{\partial \ell}{\partial \mu} \\[4pt] \frac{\partial \ell}{\partial \sigma} \end{pmatrix}
= \begin{pmatrix} \frac{\sum_{i=1}^{n}(y_i - \mu)}{\sigma^2} \\[4pt] \frac{\sum_{i=1}^{n}(y_i - \mu)^2}{\sigma^3} - \frac{n}{\sigma} \end{pmatrix}
\]

Setting the first partial derivative to 0 yields

\[
\hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}
\]

Setting the second partial derivative to 0 yields

\[
\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n}}
\]

So far, we have said nothing about the second-order condition. The reason is that it is much more complex in the multiparameter case. First, there is not one second derivative, but a whole matrix of them. We call that matrix the hessian and it takes the following general form:

\[
\boldsymbol{H} = \frac{\partial^2 \ell}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\top}} =
\begin{pmatrix}
\frac{\partial^2 \ell}{\partial \theta_1 \partial \theta_1} & \frac{\partial^2 \ell}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 \ell}{\partial \theta_1 \partial \theta_P} \\
\frac{\partial^2 \ell}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 \ell}{\partial \theta_2 \partial \theta_2} & \cdots & \frac{\partial^2 \ell}{\partial \theta_2 \partial \theta_P} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ell}{\partial \theta_P \partial \theta_1} & \frac{\partial^2 \ell}{\partial \theta_P \partial \theta_2} & \cdots & \frac{\partial^2 \ell}{\partial \theta_P \partial \theta_P}
\end{pmatrix} \tag{7}
\]

(The ⊤-operator is the transpose. In our case, it converts a column vector into a row vector, i.e., a row of numbers.) The first element in the first row comes about by taking the partial derivative of the log-likelihood with respect to the first parameter and then differentiating the result a second time, again with respect to the first parameter. The second element in the first row is the result of taking the first partial derivative of ℓ with respect to θ_1 and then differentiating the outcome with respect to θ_2. Other entries in the matrix are constructed in a similar fashion.

Example 6: For the normal density, we obtain

\[
\boldsymbol{H} = \begin{pmatrix} \frac{\partial^2 \ell}{\partial \mu \partial \mu} & \frac{\partial^2 \ell}{\partial \mu \partial \sigma} \\[4pt] \frac{\partial^2 \ell}{\partial \sigma \partial \mu} & \frac{\partial^2 \ell}{\partial \sigma \partial \sigma} \end{pmatrix}
= \begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{2\sum_{i=1}^{n}(y_i - \mu)}{\sigma^3} \\[4pt] -\frac{2\sum_{i=1}^{n}(y_i - \mu)}{\sigma^3} & -\frac{3\sum_{i=1}^{n}(y_i - \mu)^2}{\sigma^4} + \frac{n}{\sigma^2} \end{pmatrix}
\]

The second complication is that the second-order condition is different. In the multi-parameter case, the requirement is that H is negative-definite. This means that the eigenvalues of H are all negative. This holds true for the normal distribution. However, do not panic if you have never heard of eigenvalues before. In practice, we leave MLE to computer algorithms (see below). Those algorithms will let it be known if negative-definiteness is not fulfilled.
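To make this concrete, here is a minimal R sketch (an illustration of our own, not part of the notes) that maximizes the two-parameter normal log-likelihood numerically and checks the second-order condition via the eigenvalues of the hessian.

```r
## Maximize the normal log-likelihood over theta = (mu, sigma) with
## optim(), then check that the hessian is negative-definite.
set.seed(1)
y <- rnorm(200, mean = 3, sd = 2)

## optim() minimizes, so we supply the negative log-likelihood and
## guard against candidate values of sigma outside the parameter space.
negll <- function(theta) {
  if (theta[2] <= 0) return(Inf)
  -sum(dnorm(y, mean = theta[1], sd = theta[2], log = TRUE))
}

fit <- optim(c(0, 1), negll, hessian = TRUE)
fit$par                  # close to mean(y) and sqrt(mean((y - mean(y))^2))

## optim() reports the hessian of the *negative* log-likelihood, so H is
## -fit$hessian; all of its eigenvalues should be negative at a maximum.
eigen(-fit$hessian)$values
```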
3 The Precision of ML Estimators

Consider the two log-likelihood functions in Figure 3, drawn for n = 2 and n = 10 trials, respectively.

[Figure 3: Sample Size and the Log-Likelihood, drawn for n = 2 and n = 10.]

In both log-likelihoods, the apex occurs at π̂ = 0.5. But pay attention to the very slight drop-off of the log-likelihood directly to the left and the right of the ML estimate when there are two trials. This suggests that the evidence that the coin is fair is far from overwhelming. It is better when the number of trials is set to ten. In the language of statistics, there is considerable imprecision in our estimate when n = 2. Conversely, the variance of the estimator is high when the sample size is this small. The question is how we measure this variance.

This is where the hessian turns out to be of great relevance. Statistical theory teaches us that we can convert the hessian into a variance-covariance matrix of the estimators (or VCE) in two steps:

1. Convert the hessian into the expected Fisher information matrix:

\[
\boldsymbol{J} = -E[\boldsymbol{H}] \tag{8}
\]

2. Invert the expected Fisher information:

\[
V[\hat{\boldsymbol{\theta}}] = \boldsymbol{J}^{-1} \tag{9}
\]

Example 7: For the binomial distribution, the hessian is given by

\[
H = -\frac{y}{\pi^2} - \frac{n - y}{(1 - \pi)^2}
\]

where n is the number of trials and y is the number of successes across those trials. We do not use boldface on the hessian because it is a single number (a scalar) and not a matrix. The expected Fisher information is

\[
J = \frac{E[y]}{\pi^2} + \frac{n - E[y]}{(1 - \pi)^2}
\]

For a binomial random variable, E[y] = nπ. Substitution yields

\[
J = \frac{n}{\pi(1 - \pi)}
\]

Finally, inversion yields

\[
V[\hat{\pi}] = \frac{\pi(1 - \pi)}{n}
\]

Let us relate Example 7 to Figure 3. While we do not know the true value of π, we can substitute the estimator for this parameter. This yields the estimated variance of π̂. For both n = 2 and n = 10 trials, the numerator of the estimated variance is 0.5 × (1 − 0.5) = 0.25. The denominator varies, however, resulting in different estimated variances. For n = 2, the estimated variance is 0.25/2 = 0.125. For n = 10, the estimated variance is 0.25/10 = 0.025. We clearly see the difference in variances, owing to the different sample sizes.

Before we consider another example, let us consider two points. First, if we omit the inversion step, then we obtain the precision of the estimator. Precision, then, is the inverse of the variance of an estimator. Second, we typically report the standard error of an estimator:

\[
SE(\hat{\theta}_j) = \sqrt{V[\hat{\theta}_j]} \tag{10}
\]

Example 8: The expectation over the hessian of the normal distribution is given by

\[
E[\boldsymbol{H}] = \begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{2\sum_{i=1}^{n}(E[y_i] - \mu)}{\sigma^3} \\[4pt] -\frac{2\sum_{i=1}^{n}(E[y_i] - \mu)}{\sigma^3} & -\frac{3\sum_{i=1}^{n}E[(y_i - \mu)^2]}{\sigma^4} + \frac{n}{\sigma^2} \end{pmatrix}
\]

We know that E[y_i] = µ. For the off-diagonal elements, this means that E[y_i] − µ = µ − µ = 0. As a result, the off-diagonal elements are 0. We also know that the definition of the variance is σ² = E[(y_i − µ)²]. This means that

\[
-\frac{3\sum_{i=1}^{n}E[(y_i - \mu)^2]}{\sigma^4} + \frac{n}{\sigma^2} = -\frac{3n\sigma^2}{\sigma^4} + \frac{n}{\sigma^2} = -\frac{2n}{\sigma^2}
\]

With those results in place, the expected Fisher information is

\[
\boldsymbol{J} = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{2n}{\sigma^2} \end{pmatrix}
\]

The VCE is then

\[
\boldsymbol{V} = \begin{pmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{\sigma^2}{2n} \end{pmatrix}
\]

In this matrix, the first diagonal element is the variance of µ̂ and the second one the variance of σ̂. The estimators of the mean and standard deviation do not co-vary because the off-diagonal elements are 0. Because V depends on the unknown population variance, we generally use the estimated variance-covariance matrix, where σ² is replaced with an estimator.
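As an illustration of Example 8, the following R sketch (sample and names assumed by us) computes the estimated VCE and the standard errors for a normal sample.

```r
## Estimated VCE and standard errors for the normal model, using the
## closed-form expected information from Example 8.
set.seed(7)
y <- rnorm(500, mean = 0.5, sd = 1.1)
n <- length(y)

mu_hat    <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))    # ML estimator divides by n

## V = J^{-1} = diag(sigma^2 / n, sigma^2 / (2n)), evaluated at sigma_hat.
V  <- diag(c(sigma_hat^2 / n, sigma_hat^2 / (2 * n)))
se <- sqrt(diag(V))                        # eq. (10)
rbind(estimate = c(mu_hat, sigma_hat), se = se)
```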
4 Properties of ML Estimators

Assuming a correctly specified model, which includes the correct specification of a distribution, ML estimators have the following asymptotic properties. (In statistics, asymptotic literally means that the sample size goes to infinity. In practice, it is often interpreted in terms of large samples; what “large” means is discussed below. Some other, more technical conditions also need to be met: no parameter may drive the support of the random variable, the first three derivatives must be finite, the number of parameters must be finite [7], and the expected Fisher information must be positive-definite.)

1. ML estimators are consistent, meaning that, as the sample size grows, the estimator converges in probability to the true parameter value. This is a crucial property when it comes to trusting ML estimates.

2. ML estimators are asymptotically efficient because they reach the Cramér-Rao lower bound. (The lower bound states that for any unbiased estimator θ̂, it must hold that V[θ̂] ≥ J⁻¹. For ML estimators, we have seen that V[θ̂] = J⁻¹.) This, too, is an important property because it means that ML estimators give the highest possible precision for a given sample size.

3. ML estimators are asymptotically normally distributed. This property means that the sampling distribution is known and that we can use it to build confidence intervals and test hypotheses.

It is important to stress that these are asymptotic properties. In small samples, ML estimators do not always show desirable properties. For instance, the ML estimator of σ in the normal distribution is biased in small samples. In that situation it also does not display normality. This raises the all-important question of what sample size is large enough to start relying on the asymptotic results for ML estimators. There is no real consensus about this question, as it depends on the specific model and the behavior of the data. Nevertheless, most scholars agree that MLE is risky when n < 100. They also tend to agree that, under most circumstances, MLE can be used when n > 400. Whether it is wise to use MLE when 100 ≤ n ≤ 400 is more controversial.

5 Numeric Optimization

In section 2, we used calculus to find the ML estimator. This gets quite cumbersome, especially when there are many parameters. Moreover, first derivatives of log-likelihood functions are often so complex that an analytic solution for the first-order condition does not exist. We need an alternative, and that is to rely on computers. We can use computers for numeric differentiation and for numeric optimization. In practice, this is how MLE is implemented. Because of this, it is good to understand the principles and be aware of potential issues that might arise in the practice of MLE.

5.1 Numeric Differentiation

The first-order condition centers on the gradient, but differentiation can be tedious and time-consuming. Fortunately, we can let the computer take care of this task. Numeric differentiation goes back to the very definition of a first derivative:

\[
\nabla = \lim_{\Delta \to 0} \frac{\ell(\theta + \Delta) - \ell(\theta)}{\Delta}
\]

Numerically, we could set ∆ to some small number and then approximate the gradient by computing [ℓ(θ + ∆) − ℓ(θ)]/∆. Since we evaluate ℓ at two points, to wit θ and θ + ∆, this form of numeric differentiation is known as the two-point method [2]. We can improve on the two-point method by applying centering:

\[
\nabla \approx \frac{\ell(\theta + 0.5\Delta) - \ell(\theta - 0.5\Delta)}{\Delta} \tag{11}
\]

This generally yields greater precision. Another way of improving the precision is to increase the number of evaluation points to three or more.

Example 9: Consider the exponential distribution from example 3. Imagine we have collected n = 100 observations that yield Σᵢ yᵢ = 50. The ML estimator is λ̂ = 2. Analytically, the first derivative evaluated at this value is exactly zero. In a numeric approximation, imagine we set ∆ = 0.001 and evaluate the log-likelihood at 2 ± 0.5∆. The centered numeric derivative then depends on the difference between 100 × ln(2 + 0.0005) − (2 + 0.0005) × 50 and 100 × ln(2 − 0.0005) − (2 − 0.0005) × 50, for which the first eight decimal digits are all 0. Division by ∆ yields an approximation of the gradient of 0.000001, which is practically 0. Typically, we set ∆ even smaller than we did here, resulting in an even more precise approximation of the gradient.
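Equation (11) takes only a few lines of R. The sketch below (a small helper of our own devising) applies the centered two-point method to the exponential log-likelihood of Example 9.

```r
## Centered numeric differentiation, eq. (11), applied to the
## exponential log-likelihood with n = 100 and sum(y_i) = 50.
loglik <- function(lambda) 100 * log(lambda) - lambda * 50

num_grad <- function(f, theta, delta = 1e-3) {
  (f(theta + 0.5 * delta) - f(theta - 0.5 * delta)) / delta
}

num_grad(loglik, 2)    # about 1e-6: practically zero at the MLE, as in Example 9
num_grad(loglik, 0.5)  # about 150, matching the analytic gradient n/lambda - 50
```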
5.2 Hill-Climbing Algorithms

The Essence of Hill-Climbing

Imagine we seek to estimate a single parameter, θ. Instead of relying on calculus to find the value of θ that maximizes the log-likelihood, we make an initial guess of that value. This value, θ_0, is known as the starting value or the initial value. We would be extremely lucky should θ_0 indeed maximize the log-likelihood. In all likelihood, we will have to pick a new value such that θ_1 = θ_0 + α_0, where α is an adjustment. We keep making such adjustments until we have maximized the log-likelihood function. The production of each new estimate is called an iteration of the algorithm. If we generalize the idea to multiple parameters, the tth iteration can be written as

\[
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \boldsymbol{\alpha}^{t} \tag{12}
\]

We could select α at random, but this is not very efficient. Hill-climbing algorithms use a more systematic approach by making α a function of the gradient. As the name implies, the gradient contains information about the direction in which the log-likelihood is changing (like the grade of a road). If the gradient is positive, then the log-likelihood is still increasing in the parameter. This suggests that we have not reached the apex yet and should make a positive adjustment to the parameter. This is illustrated in Figure 4.

[Figure 4: Hill-Climbing in a Log-Likelihood, showing a starting value θ_0 and an updated value θ_1.]

By contrast, if the gradient is negative, this means that we are descending from the apex. We now need to reduce the parameter. The general principle of hill-climbing, then, is to iterate according to

\[
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + f\left(\nabla^{t}\right) \tag{13}
\]

Here, ∇ᵗ is based on the current guess of the parameters and f(.) is some function defined over the gradient.

Variations on Hill-Climbing

The function f(.) can be specified in a number of different ways. The simplest variant is the steepest ascent algorithm, which updates estimates according to

\[
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \lambda \nabla^{t} \tag{14}
\]

Here, λ > 0 is known as the step size. It determines how much the gradient influences the parameter update [9]. The logic of steepest ascent is making a comeback in the social sciences through deep learning. (Since we minimize an objective function in that context, we use steepest descent, where θᵗ⁺¹ = θᵗ − λ∇ᵗ.) The method is not widely used for more classical forms of modeling of the type we discuss in this course.

For the statistical models in this course, the Newton-Raphson algorithm is widely used. Here, f(.) is a function of negative one times the hessian, a matrix that is also known as the observed Fisher information. Specifically,

\[
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \left(-\boldsymbol{H}^{t}\right)^{-1} \nabla^{t} \tag{15}
\]

(We use negative one times the hessian because the elements of that matrix are negative and we seek to preserve a positive adjustment. Note that we can also add a step size.) The effect of using the hessian is that it captures the rate of change in the gradient. The closer we get to the maximum of the log-likelihood, the smaller the changes in the gradient are. It makes sense to take this into account in the parameter updates because it allows us to make smaller adjustments the closer we get to the maximum.

Example 10: In example 9, we had collected data from 100 units yielding a total score of 50. The maximum likelihood estimate is λ̂ = 2. Imagine we started with λ_0 = 0.5. Using the results from example 3, the gradient and hessian at the starting value are

\[
\nabla = \frac{100}{0.5} - 50 = 150 \quad \text{and} \quad H = -\frac{100}{0.5^2} = -400
\]

The Newton-Raphson algorithm now suggests that

\[
\lambda_1 = 0.5 + (-1 \times -400)^{-1} \times 150 = 0.875
\]

At the new guess,

\[
\nabla = \frac{100}{0.875} - 50 = \frac{450}{7} \quad \text{and} \quad H = -\frac{100}{0.875^2} = -\frac{6400}{49}
\]

The algorithm now suggests λ_2 = 1.367. The adjustment is now smaller because we are getting closer to the maximum.
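The two iterations of Example 10 are easy to continue in R. Here is a minimal sketch of the Newton-Raphson updates (the loop length and names are our own choices).

```r
## Newton-Raphson iterations, eq. (15), for the exponential rate with
## n = 100 and sum(y_i) = 50, starting from lambda = 0.5.
n <- 100; total <- 50
gradient <- function(lambda)  n / lambda - total   # d ell / d lambda
hessian  <- function(lambda) -n / lambda^2         # d^2 ell / d lambda^2

lambda <- 0.5
for (t in 1:6) {
  lambda <- lambda - gradient(lambda) / hessian(lambda)  # scalar form of eq. (15)
  cat("iteration", t, ": lambda =", round(lambda, 4), "\n")
}
## Prints 0.875, 1.3672, ... and settles on the ML estimate of 2.
```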
One of the big advantages of the Newton-Raphson algorithm is that it is guaranteed to converge to a fixed point when the hessian is well-behaved. Convergence is also typically fast; near the maximum it is quadratic. (The order of convergence determines how quickly an algorithm reaches a solution. Define the approximation error as ϵᵗ = θᵗ − T, where T is the target. The sequence θ¹, θ², ... displays convergence of order β if

\[
\lim_{t \to \infty} \frac{|\epsilon^{t+1}|}{|\epsilon^{t}|^{\beta}} = c
\]

for some constant c. If β = 1, we have linear convergence. If β = 2, convergence is quadratic, which means that the number of correct digits roughly doubles at every iteration.) On the downside, the algorithm has difficulties when the hessian is not negative-definite, which happens with flat areas in the log-likelihood. Moreover, the algorithm is computationally intensive: with P parameters, the number of computations at each iteration is 0.5P(P + 3). This is a major reason why the algorithm is not used in deep learning models, where P can be in the millions.

One of these problems, the lack of negative-definiteness, is addressed by the Fisher scoring algorithm, which is also known as the efficient scoring algorithm. Here, −H is replaced by J = −E[H]. The iterations are given by

\[
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \left(\boldsymbol{J}^{t}\right)^{-1} \nabla^{t} \tag{16}
\]

Note that J⁻¹ is the VCE, which by definition is positive-definite, whereas −H does not necessarily have this property.

Example 11: For the exponential distribution, V[λ̂] = λ²/n. The updating term is of the type

\[
J^{-1}\nabla = \frac{\lambda^2}{n}\left(\frac{n}{\lambda} - \sum_{i=1}^{n} y_i\right) = \lambda - \lambda^2 \bar{y}
\]

For n = 100, Σᵢ yᵢ = 50, and λ_0 = 0.5, the first iteration yields

\[
\lambda_1 = 0.5 + \left[0.5 - (0.5)^2 \times 0.5\right] = 0.875
\]

In this simple case, the update is identical to what the Newton-Raphson algorithm produces. However, this is not always the case.

While Fisher scoring has certain advantages, it has two drawbacks. First, deriving J can be complex. Second, the algorithm is only linearly convergent. Still, it remains widely used in statistics.

In recent decades, quasi-Newton algorithms have become popular. These algorithms forego the need to compute the hessian, thus greatly simplifying computation. At the same time, they still give a great deal of control over the size of the parameter updates. One of the quasi-Newton algorithms was developed by Berndt et al. [1] and goes under the name of the BHHH algorithm. Here, the observed Fisher information matrix is approximated by the outer product of gradients or OPG:

\[
\boldsymbol{O} = \sum_{i=1}^{n} \nabla_i \nabla_i^{\top} \tag{17}
\]

We compute the vector of partial derivatives for each unit in the sample and then compute the outer product of those derivatives, which has the same dimensionality as J. We sum over all of the units to accumulate the outer products. Updates then take place using

\[
\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \left(\boldsymbol{O}^{t}\right)^{-1} \nabla^{t} \tag{18}
\]

The algorithm offers two major advantages. First, no second derivative has to be computed. Second, the OPG is positive-definite. A sketch of a BHHH update appears below.

Another way of bypassing the hessian is to rely on the BFGS algorithm developed by Charles George Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno. This algorithm relies on arc gradients and, as such, has the nice property of being robust: where other algorithms may fail to converge, BFGS often works. The complex computational details can be found in Thisted [9].
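As promised, here is a minimal R sketch of a BHHH update for the exponential model (simulated data and iteration count chosen by us). For a single parameter, the OPG reduces to the sum of squared per-unit gradients.

```r
## BHHH updates, eq. (18), for the exponential rate: the observed
## information is approximated by the outer product of per-unit gradients.
set.seed(3)
y <- rexp(100, rate = 2)
lambda <- 0.5                           # starting value

for (t in 1:20) {
  g_i <- 1 / lambda - y                 # per-unit gradients, the inputs to eq. (17)
  opg <- sum(g_i^2)                     # OPG; a scalar with one parameter
  lambda <- lambda + sum(g_i) / opg     # eq. (18)
}
c(bhhh = lambda, closed_form = length(y) / sum(y))   # should nearly agree
```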
Convergence

At what point should we stop updating our parameters? At first sight, this may look like a trivial question: when all of the partial derivatives are zero. The problem is that computers achieve an exact value of 0 only for Boolean operations, and hill-climbing is a floating point computation. In practice, then, we use an alternative stopping criterion: the relative absolute change in the log-likelihood. We decide that convergence is achieved when

\[
\frac{\left|\ell^{t+1} - \ell^{t}\right|}{\left|\ell^{t}\right|} < \text{tolerance} \tag{19}
\]

The tolerance is a very small number (usually 10⁻⁸). If the log-likelihood shows a very small relative improvement, then we stop the algorithm because additional iterations are unlikely to drastically increase it. (Those of you who are well-versed in calculus may wonder whether there is a risk that convergence is to a local maximum. Fortunately, the log-likelihood functions we shall encounter all have a unique maximum.)

Convergence is not guaranteed. The following are common causes of convergence problems.

1. Early stopping: If we stop the algorithm too early, it may not have reached its maximum yet. One should avoid setting the maximum number of iterations too low or tinkering with the tolerance, as those strategies may result in a premature termination of parameter updates.

2. Starting values: A poor choice of starting values can cause convergence to be slow or fail altogether. Most routines performing MLE have optimized procedures for starting values. Still, it may help to double-check the results against other starting values in order to flag any discrepancies.

3. Small n: Small sample sizes are a common cause of convergence problems, especially in the Newton-Raphson algorithm. The reason is that small sample sizes can induce nearly flat areas in the log-likelihood, resulting in problems with the hessian.

4. Estimation at the boundary of the parameter space: Some parameters, like the mean of the normal distribution, are unbounded. Others, such as the standard deviation, are bounded. If the ML estimate of a parameter like σ is very close to its lower bound of zero, then this can cause convergence problems. Often, we see that statisticians convert a bounded parameter to an unbounded one (e.g., ln(σ)) to avoid problems. Thanks to the invariance property of maximum likelihood, the estimators of the transformed parameters are also ML estimators and they can be easily converted back.

5. Scaling: Vast differences in the scales of variables can cause convergence problems. Long [6] suggests that the ratio of the largest to the smallest standard deviation should not exceed a factor of 10. Strategies that homogenize scales are normalization, where all variables are put on a 0-to-1 scale, and standardization, where all variables have a mean of 0 and a standard deviation of 1.

6. Inappropriate model: If one imposes a false model on the data, this, too, can result in convergence problems.

7. Data cleaning: Mis-codings in the data can produce convergence problems. Hence, the first order of business is always to inspect the data and to see whether the distributions make sense. Collinearity can also play havoc with convergence, so inspections of bivariate and multivariate relations among variables are important as well.

Variance of ML Estimators

Depending on the algorithm, different estimators of the VCE are produced. Remember that, formally, the VCE is equal to the inverse of the expected Fisher information. This is what is being used in the Fisher scoring algorithm, so that the VCE in this algorithm corresponds to the formal definition. The Newton-Raphson algorithm is based on the observed Fisher information. Consequently, the VCE is the inverse of that matrix. The BHHH algorithm computes the VCE by inverting the OPG, since that is the central element in the algorithm. The different estimators are shown in Table 1. Asymptotically, all of the estimators in the table are identical.

Table 1: Numerical Approaches to the VCE

  Algorithm         VCE
  Fisher Scoring    V₁ = (−E[H])⁻¹
  Newton-Raphson    V₂ = (−H)⁻¹
  BHHH              V₃ = (Σᵢ ∇ᵢ∇ᵢ^⊤)⁻¹
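The following R sketch (our own, building on the exponential example) computes the three VCE variants of Table 1 at the ML estimate. For the exponential distribution the observed and expected information coincide, so V₁ and V₂ are identical here, while the OPG-based V₃ differs slightly in finite samples.

```r
## The three VCE estimators of Table 1 for the exponential model,
## evaluated at the ML estimate.
set.seed(5)
y <- rexp(400, rate = 2)
n <- length(y)
lambda_hat <- n / sum(y)

V1 <- lambda_hat^2 / n                    # (-E[H])^{-1}: inverse expected information
V2 <- 1 / (n / lambda_hat^2)              # (-H)^{-1}: inverse observed information
V3 <- 1 / sum((1 / lambda_hat - y)^2)     # inverse OPG (BHHH)

c(fisher = V1, newton = V2, bhhh = V3)    # asymptotically identical
```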
6 Hypothesis Testing

6.1 Joint Hypotheses

When we speak of a joint hypothesis test, we mean that several parameters are tested simultaneously. This can take on several forms: (1) some parameters may be restricted to a specified value (usually 0) or (2) certain parameters may be constrained to be equal. It is also possible to impose inequality constraints, whereby one parameter is assumed to be greater than another. Those kinds of inequality constraints are less common in the social sciences and will not be considered here.

Joint hypothesis tests frequently can be cast in terms of competing models of the data. For example, in one model, all parameters are estimated, while in another model some are constrained. In this way, nested models are obtained. A model M_R is nested in another model, M_U, if it imposes constraints on a subset of the parameters in the latter. More formally, two models are nested if both contain the same terms, including the same response variable, and if one of them contains at least one additional term.

Example 12: Consider two alternative accounts of a normally distributed random variable: (1) Y ∼ N(µ, σ) and (2) Y ∼ N(µ, 1). The second model is nested in the first model, as it comes about by constraining σ to 1. We can also say that the first model is exactly the same as the second one, except for adding an additional term in the form of σ.

In statistics, there are three major ways of testing joint hypotheses: (1) the likelihood ratio test; (2) the Wald test; and (3) the Lagrange multiplier or score test [3]. These differ in terms of the models that require estimation (see Table 2). The likelihood ratio test requires that both M_U and M_R are estimated. The Wald test requires estimation only of M_U. Finally, the score test depends only on estimation of M_R. Here, the subscript R stands for “restricted,” while the subscript U denotes “unrestricted.” Asymptotically, all three tests lead to the same conclusion.

Table 2: Approaches to Testing Joint Hypotheses

                        Estimated Models
  Approach              Unrestricted   Restricted
  Likelihood Ratio      Yes            Yes
  Wald                  Yes            No
  Lagrange Multiplier   No             Yes

Likelihood Ratio Test

When both the restricted and unrestricted models are estimated, each yields a value of the likelihood. We can take the ratio of these likelihoods as the basis of our hypothesis test. Specifically,

\[
\lambda = \frac{L_R}{L_U} \tag{20}
\]

is the likelihood ratio. The unrestricted model can be no less likely to have given rise to the data than the restricted model, so that L_U ≥ L_R. Consequently, 0 < λ ≤ 1. If the restrictions imposed by M_R are valid, then λ will be close to 1. On the other hand, if they do not do justice to the data generating process, then λ ≪ 1. Thus, smaller values of the likelihood ratio indicate evidence against the restricted model and, therefore, against the null hypothesis that imposes the constraints.

Since the sampling distribution of (20) is complex, we rely on a transformation:

\[
LR = -2\ln(\lambda) \overset{\text{asy}}{\sim} \chi^2_q \tag{21}
\]

Here, LR is the likelihood ratio (LR) test statistic and q is the number of constraints in M_R. The LR test statistic asymptotically follows a chi-squared distribution with q degrees of freedom. Note that −2 ln(λ) = 2(ℓ_U − ℓ_R).

Example 13: Building on example 12, we fit M_U: Y ∼ N(µ, σ) and M_R: Y ∼ N(µ, 1). With n = 1000 observations, we obtain ℓ_U = −540.099 and ℓ_R = −541.750. Hence, LR = 2 × (−540.099 − (−541.750)) = 3.302. When referred to a χ²_1-distribution, we obtain p = 0.069. Using a customary Type-I error of 0.05, we fail to reject the null hypothesis H_0: σ = 1.
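Example 13 is straightforward to reproduce in R. The sketch below (simulated data of our own, so the resulting numbers differ somewhat from those quoted above) fits both models and carries out the test.

```r
## Likelihood ratio test of H0: sigma = 1 in the normal model, as in
## Example 13 (simulated data, so the numbers differ from the text).
set.seed(11)
y <- rnorm(1000, mean = 0, sd = 1.04)

mu_hat    <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))

ll_u <- sum(dnorm(y, mu_hat, sigma_hat, log = TRUE))  # unrestricted model
ll_r <- sum(dnorm(y, mu_hat, 1,         log = TRUE))  # restricted: sigma = 1

lr <- 2 * (ll_u - ll_r)                   # eq. (21)
c(LR = lr, p = pchisq(lr, df = 1, lower.tail = FALSE))
```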
Likelihood ratio tests show how estimation and testing are connected through the log-likelihood function, resulting in a unified approach to inference [5]. These tests also have the benefit that we learn information about both the unrestricted and restricted model. The Wald and Lagrange multiplier tests do not offer this advantage. Still, there are circumstances where the likelihood ratio test should be avoided. This is true of pseudo-log-likelihoods, which are obtained in certain models (e.g., the pseudo-Poisson model) or with certain adjustments for survey design. Pseudo-log-likelihoods are approximations and, as the name implies, not true log-likelihoods. Hence, using them to inform a likelihood ratio test is ill-advised. The Wald test is a good alternative for those situations.

Wald Test

In the Wald [10] test, we only estimate the unrestricted model. We take the estimates from that model and then compare those to the constraints that have been imposed. Large discrepancies between the estimates and the constraints, relative to the standard error, are taken as evidence against the null hypothesis.

For linear constraints, which are our focus, the Wald test can be understood in terms of a series of linear equations. In matrix notation, those constraints take the form of

\[
\boldsymbol{R\theta} = \boldsymbol{r} \tag{22}
\]

Here, r is a q × 1 vector of values, where q is again the number of constraints. Further, θ is the P × 1 vector of parameters from M_U. Finally, R is a q × P design matrix connecting parameters to the hypothesized values. Equation (22) is versatile, as the following examples illustrate.

Example 14: Imagine a model M_U with three parameters. We want to test the null hypothesis H_0: θ_2 = θ_3 = 0. That is, we impose two constraints, one each for θ_2 and θ_3, respectively. The values of those constraints are both 0, so that

\[
\boldsymbol{r} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\]

We need to define the matrix R in such a way that (22) yields the equations θ_2 = 0 and θ_3 = 0. This is done by setting

\[
\boldsymbol{R} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]

The first column in R pertains to θ_1. Because the null hypothesis does not restrict this parameter, the elements in the column are set to 0. The second column pertains to θ_2, which we do restrict. As such, one element in the column is 1 and the other is 0. The element that is set to 1 corresponds to the first constraint (θ_2 = 0) and, as such, the 1 is the first value in the column. Finally, the third column pertains to θ_3, which we also constrain. Here, the second element in the column is set to 1, corresponding to the second constraint (θ_3 = 0).

Example 15: Again assume that M_U contains three parameters. However, in this case H_0: θ_1 + θ_2 − θ_3 = 2. We impose one constraint: once two parameters are known, the third one follows from H_0. In this case, r becomes the scalar 2. Further,

\[
\boldsymbol{R} = \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}
\]

The Wald test uses the ingredients from equation (22). It substitutes the parameter estimates to form the expression Rθ̂ − r. If the null hypothesis is valid, then

\[
\boldsymbol{R\hat{\theta}} - \boldsymbol{r} = \boldsymbol{0}
\]

This means that the unconstrained estimates coincide with the constraints. If the null hypothesis is false, then we should expect that the unconstrained estimates no longer correspond to the values of the constraints. The bigger the discrepancies are, the more evidence there is against the null hypothesis.

The discrepancies should be compared to their sampling variance. If the variance is high, for instance because the sample size is small, then large discrepancies should occur frequently even if the null hypothesis is true. That is different when the variance is small. The variance of the discrepancies is given by

\[
V\left[\boldsymbol{R\hat{\theta}} - \boldsymbol{r}\right] = \boldsymbol{R}\, V[\hat{\boldsymbol{\theta}}]\, \boldsymbol{R}^{\top}
\]

We can now derive the Wald test statistic:

\[
W = \left(\boldsymbol{R\hat{\theta}} - \boldsymbol{r}\right)^{\top} \left[\boldsymbol{R}\, V[\hat{\boldsymbol{\theta}}]\, \boldsymbol{R}^{\top}\right]^{-1} \left(\boldsymbol{R\hat{\theta}} - \boldsymbol{r}\right) \overset{\text{asy}}{\sim} \chi^2_q \tag{23}
\]

If this looks awful to you, then rest assured that programs like R will do all of the work for you.
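To illustrate, here is a minimal R sketch of eq. (23). It uses the rounded estimates from the worked example that follows, so the resulting W differs slightly from the value reported there.

```r
## Wald test of H0: mu = 0 and sigma = 1 in the normal model, using the
## rounded estimates from the example below (n = 500).
theta_hat <- c(0.459, 1.104)                       # mu_hat, sigma_hat
V <- diag(c(1.104^2 / 500, 1.104^2 / (2 * 500)))   # estimated VCE from Example 8
R <- diag(2)                                       # both parameters constrained
r <- c(0, 1)

d <- R %*% theta_hat - r                           # discrepancies
W <- drop(t(d) %*% solve(R %*% V %*% t(R)) %*% d)  # eq. (23)
c(W = W, p = pchisq(W, df = 2, lower.tail = FALSE))
```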
Example 16: For the normal distribution, we test the hypothesis that µ = 0 and σ = 1, i.e., that the data were generated by a standard normal distribution. The constraint may be written as

\[
\underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{\boldsymbol{R}}
\underbrace{\begin{pmatrix} \mu \\ \sigma \end{pmatrix}}_{\boldsymbol{\theta}}
=
\underbrace{\begin{pmatrix} 0 \\ 1 \end{pmatrix}}_{\boldsymbol{r}}
\]

For a sample of n = 500 observations, we obtain µ̂ = 0.459 and σ̂ = 1.104. Hence,

\[
\boldsymbol{R\hat{\theta}} - \boldsymbol{r} = \begin{pmatrix} 0.459 \\ 0.104 \end{pmatrix}
\]

Further,

\[
\boldsymbol{R}\, V[\hat{\boldsymbol{\theta}}]\, \boldsymbol{R}^{\top} = \begin{pmatrix} 2.439 \times 10^{-3} & 0 \\ 0 & 1.220 \times 10^{-3} \end{pmatrix}
\]

Application of (23) yields W = 95.150. When referred to a χ²_2-distribution, p < 0.001 and we reject the null hypothesis.

The Lagrange Multiplier Test

The Lagrange multiplier test is not as common in the social sciences as the likelihood ratio and Wald tests. Its main use is when the restricted model makes an assumption that would be difficult to relax, as it would significantly increase model complexity. In this case, it can be useful to estimate the restricted model and ask whether it is necessary to release the constraint. The mathematics underlying the test are complex. In a nutshell, we can think of MLE of the restricted model as constrained optimization: one or more parameters that could have been estimated have been fixed. Constrained optimization involves so-called Lagrange multipliers in the restricted log-likelihood, values that weight the constraints. If the constraints are valid, then the multipliers are zero. Hence, the test ascertains whether the multipliers are significantly different from zero. Details can be found in Engle [3].

6.2 Tests of Hypotheses about Single Parameters

For a single parameter, we can test H_0: θ_j = k, where j = 1, ..., P and k is some constant (frequently 0). The likelihood ratio, Wald, and Lagrange multiplier tests can be used to test a single parameter just as easily as multiple parameters. Indeed, this is what we did in example 13, where θ = σ and k = 1.

For single parameters, you may be familiar with the z-test:

\[
z_j = \frac{\hat{\theta}_j - k}{SE[\hat{\theta}_j]} \overset{\text{asy}}{\sim} N(0, 1) \tag{24}
\]

This test is identical to the Wald test. For a single parameter, it can be shown that W = z². This follows a χ²_1-distribution because that is the distribution of a single standard normal variate squared.

Example 17: Consider again example 13. For this example, σ̂ = 1.041 with a standard error of 0.023. Given H_0: σ = 1,

\[
z = \frac{1.041 - 1.000}{0.023} \approx 1.758
\]

This generates a p-value of 0.079, which is not sufficient to reject the null hypothesis.
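A short R helper for eq. (24) (our own; note that with the rounded inputs from Example 17, z comes out near 1.78 rather than the reported 1.758, which was computed from unrounded estimates):

```r
## z-test for a single parameter, eq. (24), and its Wald equivalent.
z_test <- function(estimate, k, se) {
  z <- (estimate - k) / se
  c(z = z, p = 2 * pnorm(-abs(z)), W = z^2)   # two-sided p; W = z^2
}
z_test(1.041, k = 1, se = 0.023)
```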
7 Maximum Likelihood in Practice

In these notes, we have primarily focused on the theory of MLE and the ideas behind numeric optimization. For those who want to program their own ML estimators, one source is Steenbergen [8]. The need to program your own estimators depends on whether you are willing to accept the assumptions made in “canned” software routines. If you are happy with those assumptions, then the present notes give you useful background knowledge on how R packages generate parameter estimates. If you want to push the envelope a little, then the present notes and Steenbergen [8] should put you in a good spot to develop your own estimators.

References

[1] Ernst R. Berndt et al. “Estimation and Inference in Nonlinear Structural Models”. Annals of Economic and Social Measurement 3.4 (1974), pp. 653–665.

[2] Richard L. Burden and J. Douglas Faires. Numerical Analysis. 7th ed. Pacific Grove, CA: Brooks/Cole, 2001.

[3] Robert F. Engle. “Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics”. In: Handbook of Econometrics, Volume 2. Ed. by Zvi Griliches and Michael D. Intriligator. Amsterdam: North-Holland, 1984, pp. 776–826.

[4] R. A. Fisher. “On the Mathematical Foundations of Theoretical Statistics”. Philosophical Transactions of the Royal Society of London A 222.594-604 (1922), pp. 309–368.

[5] Gary King. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Ann Arbor, MI: University of Michigan Press, 1998.

[6] J. Scott Long. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences 7. Thousand Oaks, CA: Sage Publications, 1997.

[7] J. Neyman and Elizabeth L. Scott. “Consistent Estimates Based on Partially Consistent Observations”. Econometrica 16.1 (1948), pp. 1–32.

[8] Marco R. Steenbergen. “A Primer of Maximum Likelihood Programming in R”. Zurich, 2019.

[9] Ronald A. Thisted. Elements of Statistical Computing: Numerical Computation. New York: Chapman & Hall, 1988.

[10] Abraham Wald. “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations Is Large”. Transactions of the American Mathematical Society 54.3 (1943), pp. 426–482.