A comparison of variational approximations for fast inference in mixed logit models

KBI_1503

Nicolas Depraetere (corresponding author)
KU Leuven, Faculty of Business and Economics, Naamsestraat 69, 3000 Leuven, Belgium.
Email: Nicolas.Depraetere@kuleuven.be, Tel: +32 16 326576

Martina Vandebroek
KU Leuven, Faculty of Business and Economics, Naamsestraat 69, 3000 Leuven, Belgium.
Leuven Statistics Research Center, W. de Croylaan 54, 3001 Leuven-Heverlee, Belgium.
Email: Martina.Vandebroek@kuleuven.be, Tel: +32 16 326975

Abstract

Variational Bayesian methods aim to address some of the weaknesses (computation time, storage costs and convergence monitoring) of mainstream MCMC-based inference, at the cost of a biased approximation to the posterior distribution. We investigate the performance of variational approximations in the context of the mixed logit model, which is arguably one of the most used models for discrete choice data. A typical treatment using the variational Bayesian methodology is hindered by the fact that the expectation of the so-called log-sum-exponential function has no closed form expression. Therefore, one has to resort to approximating or bounding this term. In this paper we compare seven different possible bounds or approximations. We found that quadratic bounds do not perform particularly well. A recently proposed non-quadratic bound, on the other hand, did perform quite well. We also found that the approximation used in a previous study only performed well for specific settings. Our proposed approximation based on quasi Monte Carlo sampling, on the other hand, performed consistently well across all simulation settings while remaining computationally tractable.

Keywords: Bayesian statistics, Variational Bayes, Discrete choice, Mixed logit models

Acknowledgement

Nicolas Depraetere was funded by project G.0385.10N of the Flemish Research Foundation (FWO Flanders), Belgium.

Introduction

Choice data are often encountered in marketing research. Such data arise when observed subjects (also called persons, households, companies, ...) make choices out of finite sets of mutually exclusive alternatives (revealed preferences) or are asked to make hypothetical choices in hypothetical situations designed by the researcher (stated preferences). Each alternative is characterized by a set of attributes which determine its utility to the subjects. It is generally assumed that subjects will select the alternative with the highest utility as their preferred choice. An example would be the purchase of a car. Important attributes here could be the price, the brand, mileage, size, aesthetic attributes, ... Then, depending on the relative importance of these attributes and their particular values, the decision maker chooses the car he prefers. An overview of the rich discrete choice literature can be found in Train (2009). Usually the goal of the researcher is to gauge the relative importance of the attributes to the decision makers, based on the observed or stated choices.
This is usually done by estimating discrete choice models using logit or probit link functions between a linear function of the attributes and the observed, categorical outcomes. Estimation of non-trivial discrete choice models rapidly becomes difficult, and many such models, like the mixed logit model, are therefore estimated with a hierarchical Bayesian procedure (see for instance Rossi et al. (2005) and Train (2009)). A Bayesian analysis of discrete choice models, however, is hindered by the fact that there is no naturally intuitive conjugate prior which would allow analytical expressions for the posterior distributions of the model parameters. Hence, a Bayesian analysis will have to take recourse to numerical methods. For relatively low dimensional problems the necessary integrals could be computed using Gaussian quadrature. This, however, quickly becomes infeasible when the dimensionality of the unknown parameter vector increases. Therefore, most researchers have turned to stochastic approximations (this also holds for maximum likelihood, where many models are estimated with maximum simulated likelihood (Train 2009)). Typically this is done with a Markov chain Monte Carlo (MCMC) simulation where one draws dependent samples from the posterior distribution of the unknown model parameters and latent variables.

Although this approach works well in theory, there are some caveats in practice. A properly specified Markov chain is guaranteed to converge to its equilibrium distribution, but this may take a long time, especially in large complicated models. Furthermore, the assessment of convergence is non-trivial. Another practical difficulty can be the storage cost. Large models with subject specific parameters require ever more storage as the number of draws from the posterior increases and the number of subjects increases (Braun and McAuliffe 2010). Nowadays, it is not unheard of to use subsets of the available data for inference in large discrete choice models, which can lead to biased estimation (Zanutto and Bradlow 2006).

To overcome these practical problems, i.e. computation time, convergence assessment and storage, one can turn to variational approximation methods. Applied in a Bayesian context, variational Bayes (VB) optimizes a well-defined functional in order to approximate the posterior distribution. The aim is to find a tractable distribution of the model parameters and latent variables that minimizes some measure of distance between the true posterior distribution (as is sampled from in a properly specified MCMC chain) and the approximate posterior distribution. The distance measure is usually taken to be the Kullback-Leibler divergence. As the inference problem is now transformed into an optimization problem, convergence is generally faster and much easier to assess. Furthermore, the storage requirements are much more modest than for MCMC. For instance, if the posterior of a particular $K$-dimensional parameter vector is approximated by a multivariate normal distribution, one would require the storage of $K$ numbers for the posterior mean and $K(K+1)/2$ numbers for the unique elements of the posterior covariance matrix. In an MCMC scheme, however, one requires the storage of $R \times \left(K + K(K+1)/2\right)$ numbers, where $R$ represents the number of draws from the posterior. As $R$ is typically in the order of thousands (or more), one can readily see that the difference in storage capacity can be quite large.
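As a concrete illustration of this storage argument, the short R snippet below evaluates both expressions; the values of K and R are arbitrary choices made for the sake of the example, not settings used anywhere in the paper.

```r
# Toy comparison of the storage requirements discussed above.
K <- 10      # dimension of the parameter vector (arbitrary choice)
R <- 10000   # number of retained MCMC draws (arbitrary choice)

vb_storage   <- K + K * (K + 1) / 2          # variational: mean + unique covariance elements
mcmc_storage <- R * (K + K * (K + 1) / 2)    # MCMC: the same quantities stored for every draw

c(variational = vb_storage, mcmc = mcmc_storage, ratio = mcmc_storage / vb_storage)
```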
The downside of the variational approach is that it is necessarily biased, as the true posterior is approximated by a (much) simpler, more tractable distribution. The amount of bias can be reduced by allowing more complicated approximate posterior distributions, but this more or less defeats the purpose. The most common approach to variational Bayesian approximations is to factorize the posterior distribution into a product of more tractable distributions. This approach is also employed in this paper.

As variational approximations are relatively new to the statistical literature, there are not many results yet on their statistical properties. Wang and Titterington (2006) investigated the convergence of factorized, or so-called mean-field, variational Bayesian approximations in Gaussian finite mixture models. They showed that asymptotically, as the sample size grows, their estimators converge locally to the maximum likelihood estimator. Wang and Titterington (2005), however, showed that the resulting variance estimates are too narrow compared to maximum likelihood, which leads to over-optimistic inference. This phenomenon has been observed by many researchers working with factorized approximations, see for instance Bishop (2006), Consonni and Marin (2007) and Rue et al. (2009). More recently, Hall et al. (2011) and Ormerod and Wand (2012) investigated the properties of Gaussian variational approximations (approximating the distribution of random effects by normal distributions) of maximum likelihood estimation in Poisson mixed models and generalized linear mixed models respectively, and proved some consistency results for simple models. In the context of mixed discrete choice models, Braun and McAuliffe (2010) used a similar approach in a Bayesian framework. They empirically investigated the performance of these approximations and showed that their approach performed similarly to MCMC, while their methods were significantly faster and required far less memory.

The goal of this paper is to assess the accuracy of several variational approximations in the context of the very popular mixed logit model. We use several bounds which have never been used for these types of models. Furthermore, we propose a particular approximation based on quasi Monte Carlo sampling which will be shown to work very well. In the following section we will briefly introduce the conditional and the mixed logit model. After that, in the subsequent section we will introduce variational Bayesian approximations for these models. Then, we will present the results of several simulation studies on synthetic data, which will be followed by a conclusion.

Logit Models for Discrete Choice

In this section we will briefly introduce the conditional and mixed logit model. In this paper the subjects, also called agents or decision makers, will be denoted by index $h$ going from 1 to $H$. Each of these subjects is faced with $T_h$ choice sets. This could, for instance, be multiple purchases by the same agent at different time points. Or, in the case of stated preferences, this represents the number of hypothetical situations in which the subject is asked to make a choice. Furthermore, we will assume that each choice set is defined by a finite number of $J$ alternatives, indexed by $j = 1, \ldots, J$. Each of these alternatives is characterized by $K$ attributes. The values of these attributes are stored in $K$-dimensional vectors $x_{htj} = (x_{htj1}, \ldots, x_{htjK})^T$ which contain the $K$ attribute values of alternative $j$, encountered by subject $h$ at choice situation $t$.
We can collect all $J$ attribute vectors in a $J \times K$-dimensional matrix, denoted by $X_{ht}$, which is called the design matrix of choice set $t$ for subject $h$. The result of the subjects' decisions is stored in $J$-dimensional binary vectors $y_{ht} = (y_{ht1}, \ldots, y_{htJ})^T$. In these binary vectors a 1 indicates the chosen alternative and the non-chosen alternatives are indicated by 0s. Hence, each of these vectors contains exactly one 1. As the dependent variable is a binary vector we can adequately model it with a multinomial distribution. Furthermore, we will assume that the choice probabilities are functions of a linear combination of the alternatives' attributes and the tastes of the subject. A subject's taste is represented by a $K$-dimensional vector $\beta_h$ which contains the relative importance of each attribute for subject $h$. All that is required now is a link function to link the probabilities (which must be non-negative and sum to 1) and the linear predictor $x_{htj}^T \beta$. For logit models this is the logit link and this leads to the following expression for the probability that subject $h$, at choice point $t$, selects the $j$th alternative:

$$P(y_{htj} = 1 \mid x_{htj}, \beta) = p_{htj} = \frac{e^{x_{htj}^T \beta}}{\sum_{j'=1}^{J} e^{x_{htj'}^T \beta}}. \qquad (1)$$

This is the conditional logit model introduced by McFadden (1974). In a full Bayesian analysis we require a prior distribution on the unknown parameter vector $\beta$, which is generally taken to be multivariate normal with mean vector $\zeta$ and covariance matrix $\Omega$. The full specification of a Bayesian conditional logit model is then:

$$y_{ht} \mid X_{ht}, \beta \sim \text{Multinomial}(p_{ht1}, \ldots, p_{htJ}), \quad h = 1, \ldots, H, \; t = 1, \ldots, T_h$$
$$\beta \mid \zeta, \Omega \sim N_K(\zeta, \Omega).$$

Note that in this specification all subjects have the same taste vector. The assumption that the tastes are homogeneous in the population is a fairly restrictive assumption which is usually far from true. Furthermore, when subjects make multiple choices it seems hard to argue that these observations are independent. A popular approach to overcome these shortcomings of the conditional logit model is to allow taste heterogeneity among the subjects, i.e. each subject $h$ has his own personal taste vector $\beta_h$. One could estimate these personal taste vectors by estimating $H$ conditional logit models, one for each subject. This would be a good approach if each subject makes a large number of choices. However, one usually only observes a limited number of choices per subject, which would make the resulting inferences very noisy. A way to overcome this is by specifying a distribution of these tastes, usually a multivariate normal distribution. In this scenario, each subject's specific taste parameters are estimated by taking the other tastes into account, which is a way of borrowing strength from the other observations. Furthermore, the parameters of the mixing distribution need to be estimated and are usually of prime interest to the researcher. Observations made by the same subjects are now no longer independent, which adds to the plausibility of the model. In this paper we will assume a multivariate normal distribution as the mixing distribution of the tastes in the population. As before, in a fully Bayesian analysis we require prior distributions on the mean vector and the covariance matrix of the mixing distribution. We will assume the typical conjugate priors, i.e. a normal prior for the mean and an inverse-Wishart distribution for the covariance matrix. The full mixed logit model is then (see note 1):

$$y_{ht} \mid X_{ht}, \beta_h \sim \text{Multinomial}(p_{ht1}, \ldots, p_{htJ}), \quad h = 1, \ldots, H, \; t = 1, \ldots, T_h$$
$$\beta_h \mid \zeta, \Omega \sim N_K(\zeta, \Omega), \quad h = 1, \ldots, H$$
$$\zeta \mid \beta_0, \Omega_0 \sim N_K(\beta_0, \Omega_0)$$
$$\Omega \mid S^{-1}, \nu \sim W^{-1}\left(S^{-1}, \nu\right).$$
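To make this data-generating process concrete, here is a minimal R sketch that simulates choices from the mixed logit model above; the sample sizes, mean vector and covariance matrix are illustrative choices, not the settings used in the paper's experiments.

```r
# Minimal sketch: simulate data from the mixed logit model defined above.
# H, T, J, K, zeta and Omega are illustrative values.
set.seed(1)
H <- 50; T <- 5; J <- 3; K <- 3
zeta  <- seq(-2, 2, length.out = K)   # population mean tastes
Omega <- 0.25 * diag(K)               # population taste heterogeneity

softmax <- function(v) { v <- drop(v); exp(v - max(v)) / sum(exp(v - max(v))) }

# beta_h ~ N_K(zeta, Omega), one column per subject
beta <- t(chol(Omega)) %*% matrix(rnorm(K * H), K, H) + zeta

data <- list()
for (h in 1:H) {
  for (t in 1:T) {
    X <- matrix(rnorm(J * K, sd = 0.5), J, K)        # design matrix of choice set t
    p <- softmax(X %*% beta[, h])                    # choice probabilities, equation (1)
    y <- as.vector(rmultinom(1, size = 1, prob = p)) # observed choice (one-hot vector)
    data[[length(data) + 1]] <- list(h = h, t = t, X = X, y = y)
  }
}
```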
Now that the model is fully specified, a Bayesian analysis proceeds by obtaining the posterior distribution of the unknown parameters $\theta = (\zeta, \Omega, \beta_{1:H})$, which is given by

$$p(\zeta, \Omega, \beta_{1:H} \mid D) = \frac{p(\zeta)\, p(\Omega) \prod_{h=1}^{H} p(\beta_h \mid \zeta, \Omega) \prod_{t=1}^{T_h} p(y_{ht} \mid X_{ht}, \beta_h)}{\int p(\zeta)\, p(\Omega) \prod_{h=1}^{H} p(\beta_h \mid \zeta, \Omega) \prod_{t=1}^{T_h} p(y_{ht} \mid X_{ht}, \beta_h)\, d\zeta\, d\Omega\, d\beta_{1:H}}$$

where $D$ represents the observed data. Even though we use (conditionally) conjugate priors, the denominator is not analytically integrable and we will have to resort to numerical approximations. Furthermore, to obtain marginal posterior distributions of the parameters of the mixing distribution one also requires numerical approximations to integrate the $\beta_h$'s from the numerator.

Variational Bayesian Approximation

Variational Bayes

Variational Bayesian approximations cover a wide set of methods to approximate the posterior distribution. Recent tutorials and literature overviews can be found in Bishop (2006), Ormerod and Wand (2010) and Titterington (2011). The main idea is that one tries to approximate the posterior distribution with a simpler distribution. So, how does one select such a simpler distribution and how does one evaluate how well it approximates the posterior? There are of course several possibilities, but most often one tries to minimize the Kullback-Leibler divergence between the approximating distribution $q(\theta)$, where $\theta$ contains all the unknowns in the model, and the posterior distribution $p(\theta \mid D)$, where $D$ refers to the observed data. To start, we rewrite the Kullback-Leibler divergence of $p(\theta \mid D)$ from $q(\theta)$ as

$$\mathrm{KL}(q(\theta)\,\|\,p(\theta \mid D)) = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid D)}\, d\theta = \int q(\theta) \log \frac{q(\theta)\, p(D)}{p(\theta, D)}\, d\theta = \int q(\theta) \log \frac{q(\theta)}{p(\theta, D)}\, d\theta + \log p(D).$$

The right term on the last line is the natural logarithm of the marginal likelihood of the data under the model, sometimes called the evidence, and it is independent of the model parameters. The left term of the last line is the Kullback-Leibler divergence of the joint distribution of the data and the unknowns from the approximating distribution. We can rewrite this equation and obtain the following decomposition

$$\log p(D) = \int q(\theta) \log \frac{p(\theta, D)}{q(\theta)}\, d\theta + \mathrm{KL}(q(\theta)\,\|\,p(\theta \mid D)) \geq \int q(\theta) \log \frac{p(\theta, D)}{q(\theta)}\, d\theta = \mathcal{L}(q(\theta))$$

since $\mathrm{KL}(\cdot\,\|\,\cdot)$ is non-negative. We have now obtained a lower bound $\mathcal{L}(q(\theta))$ on the logarithm of the marginal data likelihood. Making $q(\theta)$ as equal (in the Kullback-Leibler divergence sense) to the posterior as possible can now be seen to be equivalent to maximizing this bound with respect to $q(\theta)$. Using classical calculus of variations one can show that this optimization results in the optimum $q(\theta) = p(\theta \mid D)$. So we have not made much headway yet, as we start out from an intractable posterior $p(\theta \mid D)$. We can, however, put restrictions on $q(\theta)$. The most common type of restriction is to factorize the approximating distribution, which is known as mean-field theory in physics (Parisi 1988). For the mixed logit model, for instance, we could restrict our approximation, with $\theta = (\zeta, \Omega, \beta_{1:H})$, as $q(\zeta, \Omega, \beta_{1:H}) = q(\zeta) \times q(\Omega) \times \prod_{h=1}^{H} q(\beta_h)$. It can be shown that in fully conjugate exponential family models, the optimal approximate $q(\cdot)$ densities are in the same family as their priors (Winn and Bishop 2005; Ormerod and Wand 2010). From this we can already deduce that $q(\zeta) \sim N_K(\mu_\zeta, \Sigma_\zeta)$ and $q(\Omega) \sim W^{-1}(\Upsilon^{-1}, \omega)$.
The posterior approximate densities $q(\beta_h)$, $h = 1, \ldots, H$, however, are not conjugate due to the logit link. A natural way to parameterize them is to assume they are independent multivariate normal densities, i.e. $q(\beta_h) \sim N_K(\mu_h, \Sigma_h)$, $h = 1, \ldots, H$, which would be conjugate if one approximated the logistic terms by quadratic functions of $\beta_h$, as their prior is a normal distribution, $N_K(\zeta, \Omega)$. This factorization was also employed in Braun and McAuliffe (2010). In the terminology of Ormerod and Wand (2010), this is a mix of a product density transform or a mean-field approximation (factorized approximate posterior) and a parametric density transform (assuming a specific parametric posterior). As mentioned before, the drawback of this assumed factorization is that posterior dependencies between the different sets of parameters are lost. This results in posterior second moments being too small, and hence posterior credible intervals can be (much) too narrow, showing inflated confidence. Further on, this approach will be mixed with yet another approach, the tangent transform, where a tangent lower bound is placed on the variational lower bound.

With factorized approximate posterior distributions one generally maximizes the lower bound by a coordinate ascent algorithm. As will be shown in the next section, each parameter vector depends on (subsets of) the other parameter vectors. Typically, one initializes several sets of parameters and then cycles through the update equations until some measure of convergence is satisfied. Hence, at each step, all other parameter vectors are held constant while one particular parameter vector is found to maximize the resulting lower bound. When there is conjugacy, these updates are usually closed form and can be performed efficiently. When there is no conjugacy, on the other hand, it might be necessary to use numerical optimization.

Variational Bayes for the Mixed Logit Model

We have seen before that to obtain a variational approximation one can maximize

$$\mathcal{L}(q(\theta)) = \int q(\zeta)\, q(\Omega) \prod_{h=1}^{H} q(\beta_h)\, \log \frac{p(\zeta, \Omega, \beta_{1:H}, D)}{q(\zeta)\, q(\Omega) \prod_{h=1}^{H} q(\beta_h)}\, d\zeta\, d\Omega \prod_{h=1}^{H} d\beta_h = E_{q(\theta)}\left[\log p(\zeta, \Omega, \beta_{1:H}, D)\right] + \sum_{h=1}^{H} \mathbb{H}[q(\beta_h)] + \mathbb{H}[q(\zeta)] + \mathbb{H}[q(\Omega)] \qquad (2)$$

with respect to the parameters of the variational posterior distribution. The first part of the right hand side is the expectation of the log joint probability of the data and the parameters (log likelihood plus log prior) with respect to the variational posterior distribution, and the $\mathbb{H}[q(\cdot)]$ terms are the differential entropies of the variational posterior (see note 2). Plugging the known density families into (2) we obtain the following expression, which needs to be maximized with respect to the variational parameters:

$$\mathcal{L}(q(\theta)) = \sum_{h=1}^{H} \sum_{t=1}^{T_h} \left\{ y_{ht}^T X_{ht}\, E_{q(\beta_h)}[\beta_h] - E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \right\} + \sum_{h=1}^{H} E_{q(\zeta)\, q(\Omega)\, q(\beta_h)}\left[\log N_K(\beta_h; \zeta, \Omega)\right]$$
$$+ E_{q(\Omega)}\left[\log W^{-1}\left(\Omega; S^{-1}, \nu\right)\right] + E_{q(\zeta)}\left[\log N_K\left(\zeta; \beta_0, \Omega_0\right)\right] + \sum_{h=1}^{H} \mathbb{H}[q(\beta_h)] + \mathbb{H}[q(\zeta)] + \mathbb{H}[q(\Omega)] \qquad (3)$$

where $q(\beta_h) = N_K(\beta_h; \mu_h, \Sigma_h)$, $q(\zeta) = N_K(\zeta; \mu_\zeta, \Sigma_\zeta)$ and $q(\Omega) = W^{-1}(\Omega; \Upsilon^{-1}, \omega)$.

Because the variational posterior distribution is factorized, all but one of the expectations in this expression are fairly simple to evaluate. Plugging these expectations into (3), one can calculate derivatives of the lower bound with respect to $\mu_\zeta$, $\Sigma_\zeta$, $\omega$ and $\Upsilon$ and equate them to 0. This yields closed form update equations for these sets of parameters. More details on this can be found in appendix A. The only parts of (3) which are troublesome are the updates with respect to $\mu_h$ and $\Sigma_h$ for all $h = 1, \ldots, H$.
The expected value of the log-sum of exponentials in equation (3) has no analytically closed form. Hence, we have no analytical form as a function of the variational parameters over which we can maximize. In the following subsections we will list a number of possible avenues to deal with this problem. Some of these solutions try to approximate the problematic expectation in terms of parameters of the variational posterior. Other solutions bound this expectation in various ways, which results in a lower bound on the lower bound. Some of these avenues lead to closed form update equations while others require numeric optimization.

Approximating or Bounding the Log-Sum-Exponential Function

There are $H \times T_h$ terms in equation (3) which have no analytical expectations with respect to a normal distribution:

$$\mathrm{LSE}(X_{ht}\beta_h) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right), \quad \forall h = 1, \ldots, H,\; t = 1, \ldots, T_h. \qquad (4)$$

Therefore, these terms do not allow analytic expressions in terms of the means, $\mu_h$, and covariance matrices, $\Sigma_h$, of the approximate posterior distribution. In order to optimize the lower bound with respect to these parameters, one has to replace equation (4) with an approximation or a bound which can be expressed in terms of the individual agents' parameters. Here we will describe several approaches to do this. Details beyond this exposition can be found in appendix B.

Taylor series. An approach which was proposed by Braun and McAuliffe (2010) is to replace (4) by a second order Taylor series expansion around the current mean $\mu_h$ and taking the expectation, which yields

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \approx \log\left(\sum_{j=1}^{J} e^{x_{htj}^T \mu_h}\right) + \frac{1}{2}\mathrm{tr}\left(\Sigma_h X_{ht}^T \mathrm{diag}(p_{ht}) X_{ht}\right) - \frac{1}{2}\mathrm{tr}\left(\Sigma_h X_{ht}^T p_{ht} p_{ht}^T X_{ht}\right)$$

where $p_{ht}$ is a $J$-dimensional vector with entries $p_{htj} = e^{x_{htj}^T \mu_h} \big/ \sum_{j'=1}^{J} e^{x_{htj'}^T \mu_h}$, $\forall j = 1, \ldots, J$, and $\mathrm{diag}(x)$ is an operator that constructs a diagonal matrix out of the elements in $x$. The resulting expression, plugged into equation (3), can now be optimized with respect to $(\mu_{1:H}, \Sigma_{1:H})$ but does not allow a closed form update equation. Hence, the coordinate ascent algorithm will require a numeric maximization step using any efficient optimization algorithm. Note that using this approach there is no longer a guarantee that the function which is maximized remains a lower bound, and the resulting approximate posterior is therefore no longer guaranteed to be the closest approximate posterior (with the chosen factorization and parameterization) to the real posterior distribution with respect to the Kullback-Leibler divergence. Nevertheless, Braun and McAuliffe (2010) empirically showed that the resulting inference is very close to MCMC and is therefore useful. Note also that Braun and McAuliffe (2010) used another simplification in that they restricted the subjects' posterior covariances, $\Sigma_h$, $\forall h = 1, \ldots, H$, to diagonal matrices. In this paper we use both the unrestricted and the restricted approach. The unrestricted approach treats the covariance matrices as dense matrices and is denoted by BM. The restricted approach restricts the subjects' covariance matrices to diagonal matrices and is denoted by BMD.
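For concreteness, the R sketch below evaluates this Taylor-series approximation for a single choice set; the design matrix, mean and covariance in the example are arbitrary placeholder inputs.

```r
# Sketch of the Taylor-series (BM) approximation of E[log sum_j exp(x_j' beta)]
# under beta ~ N(mu, Sigma), for one choice set with J x K design matrix X.
lse_taylor <- function(X, mu, Sigma) {
  eta <- drop(X %*% mu)
  lse <- log(sum(exp(eta - max(eta)))) + max(eta)   # log-sum-exp at the mean
  p   <- exp(eta - max(eta)); p <- p / sum(p)       # softmax probabilities at the mean
  lse +
    0.5 * sum(diag(Sigma %*% t(X) %*% diag(p) %*% X)) -
    0.5 * sum(diag(Sigma %*% t(X) %*% p %*% t(p) %*% X))
}

# Example with arbitrary inputs:
X <- matrix(rnorm(3 * 2), 3, 2); mu <- c(0.5, -0.5); Sigma <- 0.3 * diag(2)
lse_taylor(X, mu, Sigma)
```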
Quasi Monte Carlo. A different, viable approach would be to approximate the expectation by a Monte Carlo method. Lawrence et al. (2004) used importance sampling to obtain approximations to an intractable expectation within their variational algorithm to improve grid placement for the analysis of DNA microarray data. Girolami and Rogers (2006) also used importance sampling for their variational Bayesian treatment of multinomial probit regression with Gaussian process priors. In this paper, however, we propose to make use of the considerable research in the field of quasi Monte Carlo (QMC). Quasi Monte Carlo samples are samples which are constructed to proportionally fill the high density regions of the distribution from which one wishes to sample. The resulting approximate integration can then be performed in a stable manner with fewer samples than with regular Monte Carlo. Inspired by Yu et al. (2010), we use the extensible shifted lattice points (ESLP) algorithm as proposed by Hickernell et al. (2000). This algorithm scales well with the dimensionality of the integral. For details on generating such QMC samples we refer the reader to the previous two references and appendix C. We thus approximate the expectation of (4) as:

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \approx \frac{1}{R} \sum_{r=1}^{R} \log\left(\sum_{j=1}^{J} e^{x_{htj}^T \left(L_h z^{(r)} + \mu_h\right)}\right)$$

where $L_h$ is the lower triangular Cholesky factor of $\Sigma_h$ such that $\Sigma_h = L_h L_h^T$, $R$ is the number of QMC samples from a standard multivariate normal distribution and each draw is represented by $z^{(r)}$. This expression once again does not allow for analytic update equations and hence requires numeric optimization. For a small number of draws the lower bound can also not be guaranteed, but this can be alleviated by increasing the sample size. The number of QMC samples we used is determined by a parameter $m$, which results in a sample size of $R = 2^m$. We used several values for $m$ in the range of 6-12 and found that $R = 2^6 = 64$ worked well enough in our applications. Furthermore, we also considered a restricted version where the subjects' covariance matrices were restricted to diagonal matrices. We refer to these respective approaches as QMC and QMCD.
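To illustrate this QMC approximation, here is a minimal R sketch; for simplicity it uses pseudo-random standard normal draws in place of the extensible shifted lattice points used in the paper, since any matrix of standard normal (quasi) Monte Carlo draws can be plugged in.

```r
# Sketch of the (Q)MC approximation of E[log sum_j exp(x_j' beta)] under
# beta ~ N(mu, Sigma). Z holds R standard normal draws (R x K); in the paper
# these would come from an extensible shifted lattice rule rather than rnorm().
lse_qmc <- function(X, mu, Sigma, Z) {
  L <- t(chol(Sigma))                                              # lower triangular Cholesky factor
  B <- Z %*% t(L) + matrix(mu, nrow(Z), length(mu), byrow = TRUE)  # rows are beta^(r) = L z^(r) + mu
  vals <- apply(B, 1, function(b) {
    eta <- drop(X %*% b)
    log(sum(exp(eta - max(eta)))) + max(eta)                       # stable log-sum-exp
  })
  mean(vals)
}

# Example with arbitrary inputs and R = 64 draws:
K <- 2; R <- 64
X <- matrix(rnorm(3 * K), 3, K); mu <- c(0.5, -0.5); Sigma <- 0.3 * diag(K)
Z <- matrix(rnorm(R * K), R, K)
lse_qmc(X, mu, Sigma, Z)
```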
Jensen's inequality. A different approach to approximate the expectation of (4) was used by Blei and Lafferty (2007), based on Jensen's inequality, in the context of topic models. As $\log(\cdot)$ is a concave function one can simply apply Jensen's inequality to obtain:

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \leq \log\left(\sum_{j=1}^{J} E_{q(\beta_h)}\left[e^{x_{htj}^T \beta_h}\right]\right) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T \mu_h + \frac{1}{2} x_{htj}^T \Sigma_h x_{htj}}\right).$$

Again we considered an unrestricted and a restricted version, and we denote these respective methods as JI and JID. Knowles and Minka (2011) improved the flexibility of the former bound by introducing additional parameters $a = (a_{1:H,1:T_h})$, with each $a_{ht}$ a $J$-dimensional vector. Their bound results in

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \leq \sum_{j=1}^{J} a_{htj}\, x_{htj}^T \mu_h + \log\left(\sum_{j=1}^{J} e^{\left(x_{htj} - \bar{x}_{ht}\right)^T \mu_h + \frac{1}{2}\left(x_{htj} - \bar{x}_{ht}\right)^T \Sigma_h \left(x_{htj} - \bar{x}_{ht}\right)}\right), \quad \text{with } \bar{x}_{ht} = \sum_{j=1}^{J} a_{htj} x_{htj}.$$

The additional flexibility tightens the inequality at the expense of introducing more parameters which need to be optimized. We again consider an unrestricted and a restricted version of the subjects' covariance matrices. These approaches, denoted here by KM and KMD respectively, require an additional step in the coordinate ascent algorithm to update the extra $HTJ$-dimensional parameter vector $a$. Both JI and KM bound the required expectation and hence they keep the lower bound property of the variational algorithm intact. Both require numeric optimization in each iteration in order to maximize the subjects' variational parameters.

Böhning-Lindsay. A different approach, which does not require numeric optimization, is to introduce a quadratic approximation to (4). The first quadratic approximation we consider is due to a bound on the second order Taylor series expansion of equation (4) around an extra $J$-dimensional vector of variational parameters $\Psi_{ht}$. The resulting expectation is then

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \leq \log\left(\sum_{j=1}^{J} e^{\Psi_{htj}}\right) + \left(X_{ht}\mu_h - \Psi_{ht}\right)^T \nabla\left(\Psi_{ht}\right) + \frac{1}{2}\mathrm{tr}\left(X_{ht}^T A X_{ht} \Sigma_h\right) + \frac{1}{2}\left(X_{ht}\mu_h - \Psi_{ht}\right)^T A \left(X_{ht}\mu_h - \Psi_{ht}\right)$$

where $A = \frac{1}{2}\left(I_J - \mathbf{1}_J \mathbf{1}_J^T / J\right)$, $I_J$ is the $J$-dimensional identity matrix, $\mathbf{1}_J$ is a $J$-dimensional vector of ones, and $\Psi_{ht}$ is a $J$-dimensional vector of extra variational parameters. Note also that $\nabla(\Psi_{ht})$ is the gradient of equation (4) evaluated at $\Psi_{ht}$. This quadratic bound follows from a result from Böhning and Lindsay (1988) and Böhning (1992). We denote this method by BL. This bound has been successfully used by Khan et al. (2010) for a variational treatment of mixed-data factor analysis due to its computational efficiency.

Bouchard. A different quadratic bound we considered is due to Bouchard (2007), which is a multinomial generalization of a quadratic bound developed by Jaakkola and Jordan (2000). This bound, denoted by BO, also introduces extra variational parameters $(\alpha_{1:H,1:T_h}, t_{1:H,1:T_h,1:J})$ and is

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \leq \alpha_{ht} + \sum_{j=1}^{J} \left\{ \frac{x_{htj}^T \mu_h - \alpha_{ht} - t_{htj}}{2} + \lambda(t_{htj})\left[\left(x_{htj}^T \mu_h - \alpha_{ht}\right)^2 - t_{htj}^2 + x_{htj}^T \Sigma_h x_{htj}\right] + \log\left(1 + e^{t_{htj}}\right) \right\}$$

where $\lambda(t) = \frac{1}{4t}\tanh\left(\frac{t}{2}\right)$. This bound also leads to closed form updates for the subjects' parameters, conditional on the optimal $HT$-dimensional vector $\alpha = \alpha_{1:H,1:T_h}$ and $HTJ$-dimensional vector $t = t_{1:H,1:T_h,1:J}$. We used this bound in several experimental settings but found that it was way too loose and yielded very biased approximations for the posterior mean parameters. Therefore we will not show any results for this bound, but it is included here for completeness' sake.

Jebara-Choromanska. The final quadratic approach considered is due to Jebara and Choromanska (2012), who developed an algorithm to find a quadratic bound around some $\tilde{\beta}_h$ to equation (4) which tightens BL and generalizes BO. After taking expectations this bound leads to

$$E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T \beta_h}\right)\right] \leq \log z_{ht} + \frac{1}{2}\left(\mu_h - \tilde{\beta}_h\right)^T S_{ht} \left(\mu_h - \tilde{\beta}_h\right) + \frac{1}{2}\mathrm{tr}\left[\Sigma_h S_{ht}\right] + \left(\mu_h - \tilde{\beta}_h\right)^T m_{ht}$$

where $z_{ht}$ is a scalar, $m_{ht}$ is a $K$-dimensional vector and $S_{ht}$ is a $K \times K$-dimensional matrix, which are determined by the algorithm of Jebara and Choromanska (2012). The algorithm can be found in appendix B. We will denote this approach as JC. It should be noted that all three quadratic approaches considered here maintain the lower bound property of the variational objective function. Furthermore, due to the fact that the subjects' variational parameters can be updated with closed form updates, these algorithms tend to be computationally very efficient. The requirement that the bounds are quadratic, on the other hand, hinders their flexibility.

Final algorithm and comments. Now that all bounds and approximations have been introduced, we can formulate a typical coordinate ascent algorithm which was used to estimate the models in our simulations. As an example, we give the algorithm to obtain the QMC approximation to the posterior distribution (more detail on the derivation of this algorithm is provided in the appendices).
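As a concrete illustration of the Böhning-Lindsay bound introduced above, the following R sketch evaluates it for a single choice set; the design matrix, mean, covariance and Psi in the example are arbitrary placeholder inputs (in the full algorithm Psi would be optimized as well).

```r
# Sketch of the Boehning-Lindsay (BL) quadratic bound on E[log sum_j exp(x_j' beta)]
# under beta ~ N(mu, Sigma), with extra variational parameters Psi (length J).
lse_bl <- function(X, mu, Sigma, Psi) {
  J <- nrow(X)
  A <- 0.5 * (diag(J) - matrix(1, J, J) / J)             # A = (I_J - 1_J 1_J'/J) / 2
  grad <- exp(Psi - max(Psi)); grad <- grad / sum(grad)  # gradient of (4) at Psi (a softmax)
  d <- drop(X %*% mu) - Psi
  log(sum(exp(Psi - max(Psi)))) + max(Psi) +             # log-sum-exp at Psi
    sum(d * grad) +
    0.5 * sum(diag(t(X) %*% A %*% X %*% Sigma)) +
    0.5 * drop(t(d) %*% A %*% d)
}

# Example with arbitrary inputs:
X <- matrix(rnorm(3 * 2), 3, 2); mu <- c(0.5, -0.5); Sigma <- 0.3 * diag(2)
lse_bl(X, mu, Sigma, Psi = drop(X %*% mu))
```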
These algorithms require several user-specified 14 input decisions, namely the specification of hyperparameters for the prior distributions, Algorithm 1 QM C Input X, y, β 0 , Ω0 , S −1 , ν, Z Initialize: µ1:H , Σ1:H , µζ , Σζ , ω = ν + H, Convergence = False n h  T io−1 P Υ = S −1 + HΣζ + H Σ + µ − µ µ − µ h h ζ h ζ h=1 while Convergence = False do for h = 1 to h = H do Obtain Lh from Lh LTh = Σh (r) ∀r = 1, . . . , R β h = Lh z (r) + µh ,  P (r) P P Th T J xT htj β h log e µh , Σh = arg max t=1 y ht X ht µh − R1 R r=1 j=1 µh ,Σh − 12 tr [ωΥΣh ] − ω2 µh ΥµTh + ωµTh Υµζ + 12 log |Σh | end for  −1 −1 Σζ = HωΥ + Ω 0   P −1 µ + Ω β µζ = Σζ ωΥ H 0 0 h=1 h n h  T io−1 P Υ = S −1 + HΣζ + H Σ + µ − µ µ − µ h h ζ h ζ h=1 Test Convergence end while Output: µ1:H , Σ1:H , µζ , Σζ , Υ the initialization method, the convergence criterion and tolerance. We used uninformative but proper priors. The prior mean and variance of ζ were taken as respectively a K-dimensional zero vector, β 0 = 0K , and 100 times the K-dimensional identity matrix, Ω0 = 100 × I K . An uninformative prior for an inverse-Wishart distribution is somewhat harder to select. We decided to put the prior degrees of freedom at ν = K + 3 and the prior scale matrix at S −1 = 2 × I K , i.e. two times the K-dimensional identity matrix. As such the prior expected value of Ω is the K-dimensional identity matrix and the variances of all the elements are infinite. All that is left to specify now is the convergence criterion of the algorithm and a decent method to initialize the parameters. As convergence criterion we took the relative change in the joint Euclidean norm of all the variational parameters. The tolerance level for this convergence criterion was set at 10−4 . To initialize µζ and Σζ we used the Laplace approximation to the posterior distribution of a regular conditional logit model based on all the data. As such a model ignores possible (and in our case known) heterogeneity we multiplied the resulting initial posterior variance Ωζ by a factor H. We subsequently initialized the agent specific variational parameters µh and Σh by a Laplace approximation of an individual specific conditional logit model with priors given by µζ and Σζ . All computations were done using R (R 15 Core Team 2012) where we also used the optim routine to perform the numerical optimization parts with the BFGS algorithm using analytical gradients for all parameters. We now conclude this section with a brief summary of some published results concerning variational algorithms with respect to multinomial logit models. As far as we are aware the different approaches from the previous subsections have not been compared to each other in the context of discrete choice models. Knowles and Minka (2011) tested several bounds (JID , KMD , BL and BO) and the Taylor series approximation (BMD here) for their tightness with respect to the expected log-sum-exponential function. They found that BL was very loose and that KMD dominates JID . Furthermore, KMD performed best of all bounds except when the inputs of the log-sum-exponential function were extremely variable. In that instance, BO performed best but that bound performed much worse in all other cases. They also found that BMD performed best but did not consider it for further experiments due to the fact that it is not a bound. 
We now conclude this section with a brief summary of some published results concerning variational algorithms with respect to multinomial logit models. As far as we are aware, the different approaches from the previous subsections have not been compared to each other in the context of discrete choice models. Knowles and Minka (2011) tested several bounds (JID, KMD, BL and BO) and the Taylor series approximation (BMD here) for their tightness with respect to the expected log-sum-exponential function. They found that BL was very loose and that KMD dominates JID. Furthermore, KMD performed best of all bounds except when the inputs of the log-sum-exponential function were extremely variable. In that instance, BO performed best, but that bound performed much worse in all other cases. They also found that BMD performed best overall but did not consider it for further experiments due to the fact that it is not a bound. They also tested BO, KMD and JID on some simulated multinomial logit datasets using a slightly different algorithm (non-conjugate variational message passing) than our variational Bayes method and found that KMD performed best. With respect to the mixed logit model we are not aware of any results except for Braun and McAuliffe (2010), who used JI and BMD. They did not report results on JI as they found that, in their settings, BMD was much better. In the following sections all these seven approaches will be compared on their performance in the context of mixed logit models.

Numerical Experiments

Performance Assessment

In order to assess the performance of the various variational approaches in the context of a mixed logit model, we performed several simulation experiments which will be detailed in the next subsections. As a benchmark for the variational algorithms we used the function rhierMnlRwMixture from the bayesm package (Rossi 2012), which uses a Gibbs sampler with a random walk Metropolis step as explained in Rossi et al. (2005, pg. 136-137). Rather than using this MCMC chain to explore the posterior distribution completely, we chose to run this algorithm for as long as it took the variational QMC algorithm to converge. This eliminates the need for and the trouble of checking for convergence of the high dimensional MCMC chain for each experiment, which is a nontrivial problem. Once the chain was stopped we removed the first half of the draws as burn-in. From the resulting draws we only kept every fifth draw, i.e. thinning, for practical convenience, resulting in $R$ draws. So for each dataset we obtain seven variational results, which are sets of parameters from the approximate parametric posterior distribution, and an MCMC sample of size $R$ from (a part of) the posterior distribution.

To measure the accuracy of the different approaches we used the same procedure as Braun and McAuliffe (2010), which compares out-of-sample predictions with the true predictive choice distribution. So, suppose a new choice set $X_{new}$ is presented to an agent and suppose, for the moment, that we know this agent's tastes, $\beta_h$. We can then calculate the predictive choice distribution for this new choice set based on model (1). This predictive choice distribution yields the probabilities that this respective agent selects the various alternatives specified in the new choice set. Most of the time, however, we are not interested in a specific agent's choices but rather in the choice probabilities of the 'average' agent. Hence, we need to integrate these probabilities over the population heterogeneity distribution, which yields

$$p^{true}(y_{new} \mid X_{new}, \zeta, \Omega) = \int p(y_{new} \mid X_{new}, \beta)\, N_K(\beta \mid \zeta, \Omega)\, d\beta. \qquad (5)$$

However, unlike in simulation studies, one generally does not know the heterogeneity distribution. One can, however, estimate this distribution. As we have posterior distributions over model parameters in a Bayesian setting, this will require another set of integrals to integrate over the posterior distributions of the parameters of the mixing distribution. This results in an estimated predictive choice distribution given by

$$\hat{p}(y_{new} \mid X_{new}, D) = \int\!\!\int\!\!\int p(y_{new} \mid X_{new}, \beta)\, N_K(\beta \mid \zeta, \Omega)\, q\left(\zeta \mid \hat{\mu}_\zeta, \hat{\Sigma}_\zeta\right) q\left(\Omega \mid \hat{\Upsilon}^{-1}, \hat{\omega}\right) d\beta\, d\zeta\, d\Omega. \qquad (6)$$

In order to calculate the true predictive choice distribution in (5) we averaged the choice probabilities over 1,000,000 draws of $\beta$ from the known, true heterogeneity distribution, $N_K(\zeta, \Omega)$. Braun and McAuliffe (2010) used this sample size to ensure that the Monte Carlo error of this estimation is negligible compared to the variability of their results. To calculate the estimated predictive choice distribution for the variational results in (6) we generate 500 samples of $\zeta$ and $\Omega$ from $q(\zeta \mid \mu_\zeta, \Sigma_\zeta)\, q(\Omega \mid \Upsilon^{-1}, \omega)$. For each of these 500 samples we draw 10,000 $\beta$ vectors to evaluate the estimated predictive choice distribution. The average of these 5,000,000 predictive choice distributions is then the estimated predictive choice distribution. Similarly, for the MCMC results, we use 10,000 $\beta$ samples for each of the $R$ draws from the posterior to obtain the estimated predictive choice distribution. The true predictive choice distribution and the estimated predictive choice distribution were always assessed for 25 new randomly generated choice sets. To compare the true and the estimated predictive choice distributions we used the total variation metric, which is a metric that compares probability distributions. The total variation error can then be calculated as (Levin et al. 2009)

$$\mathrm{TV}\left(p^{true}(y_{new} \mid X_{new}, \zeta, \Omega),\, \hat{p}(y_{new} \mid X_{new}, D)\right) = \frac{1}{2} \sum_{j=1}^{J} \left| p_j^{true}(y_{new} \mid X_{new}, \zeta, \Omega) - \hat{p}_j(y_{new} \mid X_{new}, D) \right|.$$

This error is contained in the interval $[0, 1]$ and obviously smaller errors are preferred. For each simulation scenario this metric was calculated for all the replications, and the reported results are based on the median total variation error over the 25 new choice sets.
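A minimal R sketch of this Monte Carlo evaluation and of the total variation comparison is given below; the draw counts and inputs are kept deliberately small and arbitrary, far below those used in the experiments.

```r
# Sketch: Monte Carlo approximation of the true predictive choice distribution (5)
# for one new choice set, and the total variation error against an estimate.
softmax <- function(v) { v <- drop(v); exp(v - max(v)) / sum(exp(v - max(v))) }

predictive_true <- function(X_new, zeta, Omega, n_draws = 10000) {
  L <- t(chol(Omega))
  probs <- replicate(n_draws,
                     softmax(X_new %*% (L %*% rnorm(length(zeta)) + zeta)))
  rowMeans(probs)   # average choice probabilities over the heterogeneity distribution
}

tv_error <- function(p_true, p_hat) 0.5 * sum(abs(p_true - p_hat))

# Example with arbitrary inputs; a uniform 'estimate' is used only for illustration.
K <- 3; J <- 3
zeta <- seq(-2, 2, length.out = K); Omega <- 0.25 * diag(K)
X_new <- matrix(rnorm(J * K, sd = 0.5), J, K)
p_true <- predictive_true(X_new, zeta, Omega)
tv_error(p_true, rep(1 / J, J))
```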
Uncorrelated Taste Parameters

In this simulation study five experimental factors were varied. The number of decision makers, $H$, was considered to be 250 or 1000. The number of choice sets per agent was taken as $T_h = 1$, 5, 15 or 25. The number of alternatives $J$ was either 3 or 12, and the number of attributes $K$ was 3 or 10. Finally, there was a setting with a relatively high population taste heterogeneity where the true $\Omega$ was set equal to the $K$-dimensional identity matrix, $I_K$. In the relatively low taste heterogeneity setting the true $\Omega$ was set to 0.25 times the $K$-dimensional identity matrix, $0.25 \times I_K$. The true mean $\zeta$ was set at $K$ equally spaced values between $-2$ and 2. Finally, the attribute values were independent identically distributed normal variables, $N(0, 0.5^2)$. Each of these simulation settings was replicated 10 times.

-Insert Figure 1 about here-

-Insert Figure 2 about here-

The results of these simulations can be found in figures 1 and 2, which show the average total variation error and average completion time (in minutes) over the 10 replications for all experimental settings. In the cases where the variational algorithm was run both with a diagonal version of the decision makers' covariance matrices and with an unrestricted version, i.e. dense covariance matrices, we only report the restricted results. The reason for this will be explained in a following subsection. We initially included the BO approach in the simulation but stopped this early, as this method yielded very biased estimates for the posterior means of the subjects, which indicates that the bound is too loose to work properly in these experimental settings. Note also that the figures do not include timing information for the MCMC chains. As these chains were generally not run until convergence, these times are not very interesting here. Just know that these chains were run for as long as the QMC approach was run, which is generally somewhat slower than the QMCD version.
Looking at figure 1 we can see that there are four distinct clusters with respect to accuracy. We can clearly see that in most settings MCMC performs some orders of magnitude worse than the other algorithms, which is an indication that it did not have enough time to fully explore the posterior distribution. The difference is smallest when the number of choice sets per agent is small, i.e. when there is not much information per decision maker in the data. The Taylor series approximation, BM, on the other hand performs very well when the number of choice sets is relatively large but very poorly when the number is low. This shows that this approximation can be very inaccurate when there is not a lot of information per agent. Furthermore, we can discern two other distinct groups. The group with KM and QMC is clearly the most accurate overall, which is followed by the group with BL, JC and JI. We can also see that the difference between these two groups increases when the sample size $H$ increases. Finally, it can also be seen that generally the accuracy is higher when there are fewer attributes, i.e. fewer parameters in the model, and when there are more choice sets per agent.

Looking at figure 2, many of these patterns reappear. The major difference here is that MCMC does only slightly worse than KM and QMC. Hence, for these settings, the MCMC chains converge quicker than when the heterogeneity is low. This group is again followed by the group of BL, JC and JI, which are very similar when the number of alternatives is small, $J = 3$, and which can be ordered as JI, JC and BL when the number of alternatives is large, $J = 12$. BM, as before, does very well when there are relatively many choice sets per decision maker and is very inaccurate when there are not. Looking at both figures, it can clearly be seen that BL is by far the fastest of the algorithms, followed by JC and JI. The former has the edge when there are three alternatives while the latter has the edge when there are twelve alternatives. When there are only three alternatives, we can see that QMC is faster than BM and KM. This distinction however disappears when there are twelve alternatives.

Correlated Taste Parameters

All the specifications of the population variance of the heterogeneous tastes in the previous section were diagonal matrices. This is a highly idealized set-up which represents a small subset of potential heterogeneity distributions. Furthermore, it is doubtful that in reality tastes for different attributes are truly independent. Therefore, we performed a second set of simulations to assess the performance of the various variational approaches in a setting with non-zero covariances between the taste parameters. In order to obtain plausible settings for the population tastes we simulated data based on results reported by Train and Weeks (2005). The original data were obtained by Train and Hudson (2000) and contained stated-preference choices made by 500 households among alternative-fueled vehicles. Each of the respondents considered 15 choice sets with several attributes describing the alternatives. Train and Weeks (2005) estimated several discrete choice models with these data and we will use estimated parameters from a model with $K = 7$ attributes (Train and Weeks 2005, pg. 13-14): price, willingness to pay (WTP) for operating cost, WTP for range, WTP for an electric car, WTP for a hybrid car, WTP for high performance, and WTP for medium/high performance.
Note that several of these coefficients reflect tastes for categorical attributes in the original data. We will, however, use these parameters with continuous attribute values. This results in a population mean taste vector $\zeta = (-1.4934, -0.0489, 0.7636, -2.5353, 0.8738, 0.3584, 0.6047)$ and the following covariance matrix $\Omega$, which represents the population taste heterogeneity for these attributes:

$$\Omega = \begin{bmatrix}
3.2844 & 0.0532 & 0.6262 & -2.0619 & 1.0965 & 0.4893 & 0.7940 \\
0.0532 & 0.0028 & 0.0101 & -0.0333 & 0.0179 & 0.0084 & 0.0133 \\
0.6262 & 0.0101 & 0.1812 & -0.3915 & 0.2091 & 0.0932 & 0.1494 \\
-2.0619 & -0.0333 & -0.3915 & 1.9827 & -0.6851 & -0.3038 & -0.5110 \\
1.0965 & 0.0179 & 0.2091 & -0.6851 & 2.1182 & 0.1584 & 0.2688 \\
0.4893 & 0.0084 & 0.0932 & -0.3038 & 0.1584 & 0.5720 & 0.1174 \\
0.7940 & 0.0133 & 0.1494 & -0.5110 & 0.2688 & 0.1174 & 3.8189
\end{bmatrix}.$$

Based on these taste parameters we simulated data specifying different levels of the number of agents, $H = 250$ or $H = 1000$. The number of choice sets per agent was taken as $T_h = 1$, 5, 15 or 25 and the number of alternatives per choice set as $J = 3$ or $J = 12$. Each of these simulation scenarios was once again replicated 10 times and all the attributes were once more independent identically normally distributed variables, $N(0, 0.5^2)$. The performance of the various variational algorithms was again assessed by the median total variation error of the predictive choice distributions for 25 new, random choice sets as described in the previous section.

-Insert Figure 3 about here-

The results of these simulations can be found in figure 3, which shows the average total variation error and average completion time (in minutes) over the 10 replications for all experimental settings. In the cases where the variational algorithm was run with a restricted version of the decision makers' covariance matrices and with an unrestricted version, i.e. dense covariance matrices, we again only report the restricted results. Note also that, again, the figure does not include timing information for the MCMC chains.

Looking at figure 3 we can see a similar overall pattern as before. The QMC approach seems to be the most accurate overall. The BM approach is very similar to QMC when the number of choice sets is large enough but is extremely inaccurate when this is not the case. QMC and BM (sometimes) are closely followed by KM. The worst accuracy is seen to come from the MCMC approach, which indicates it clearly did not have enough time for the chain to converge. The BL, JC and JI methods fall somewhere in between. We can also observe that the accuracy increases when the number of alternatives is lower and when the number of choice sets is larger. We observe again that BL is by far the fastest method, followed by JC when there are three alternatives and followed by JI when there are twelve alternatives. We can also see that QMC is slightly faster than KM, which in turn is slightly faster than BM, when there are three alternatives. When there are twelve alternatives, however, we see that KM is slightly faster than QMC and BM.

Diagonal Restriction

In the previous sections the presented results for the BM, JI, KM and QMC methods were based on the diagonally restricted versions.

-Insert Figure 4 about here-

In figure 4 we show the performances of the restricted versions against the unrestricted versions. With respect to time we can see that in most cases the restricted versions converged faster than their unrestricted counterparts, which is to be expected.
The case of BM is an outlier here in that in quite a few cases the restricted version did not converge faster. These, however, are the cases where the algorithm did not perform accurately at all, i.e. it diverged. With respect to the accuracy we can see that the restricted and the unrestricted versions are very similar. Considering the significant speed-up of the algorithms, we can see that using restricted decision makers' covariance matrices works very well.

Conclusions and Future Work

We have compared several approaches to approximate the posterior distribution in the framework of mixed logit models. We found that several bounds were too loose to adequately capture the posterior variation, which resulted in relatively poor performance. These bounds are the quadratic bounds BL, BO and JC and the non-quadratic bound JI. The proposed bound of Knowles and Minka (2011), on the other hand, did perform well in the context of mixed logit models. Furthermore, it appears that the approximations, as opposed to the bounds, considered here outperform the bounds. The QMC approach especially performed well in all experimental settings. The BM approximation performed equally well whenever the data contained enough information. In datasets with a small number of agents and/or a small number of choice sets, however, this approximation's bias becomes too large and its performance decreases considerably. All in all, it appears that, using an appropriate approximation or bound, the variational approach is viable in the context of mixed logit models. This may indicate an avenue of potential further research, i.e. the development of new, non-quadratic bounds which may simplify the algorithms or speed up the optimization. Another potential avenue for further research may be to look for an optimal combination of all useful bounds and/or approximations, i.e. the development of some hybrid algorithm. We also did not consider the question of optimal visiting schedules for the various parameter updates. It is very likely that the coordinate ascent algorithm's convergence can be improved by optimizing such a schedule. Finally, as the variational approach seems to work adequately, more complicated models could be considered. For instance, the requirement that the mixing distribution is normal is a suspect assumption which is likely not very realistic. One could improve the flexibility of the model by using a finite mixture of mixed logit models. Traditional MCMC becomes very burdensome for these types of models due to the multimodality in the posterior and the label switching. A variational approach, on the other hand, only focuses on one mode and hence there is no need to explore all the equivalent modes due to the label switching. This will speed up the inference considerably at the potential small cost of some approximation bias.

References

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
David M. Blei and John D. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.
Dankmar Böhning. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1):197–200, 1992.
Dankmar Böhning and Bruce G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4):641–663, 1988.
Guillaume Bouchard. Efficient bounds for the softmax function and applications to approximate inference in hybrid models. In NIPS 2007 Workshop on Approximate Inference in Hybrid Models, pages 1–9, 2007.
Michael Braun and Jon McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335, 2010.
Guido Consonni and Jean-Michel Marin. Mean-field variational approximate Bayesian inference for latent variable models. Computational Statistics & Data Analysis, 52(2):790–798, 2007.
Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.
Maya Gupta and Santosh Srivastava. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy, 12(4):818–843, 2010.
Peter Hall, John T. Ormerod, and Matt Wand. Theory of Gaussian variational approximation for a Poisson mixed model. Statistica Sinica, 21:369–389, 2011.
Fred J. Hickernell, Hee S. Hong, Pierre L'Écuyer, and Christiane Lemieux. Extensible lattice sequences for quasi-Monte Carlo quadrature. SIAM Journal on Scientific Computing, 22(3):1117–1138, 2000.
Tommi S. Jaakkola and Michael I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
Tony Jebara and Anna Choromanska. Majorization for CRFs and latent likelihoods. In Neural Information Processing Systems (NIPS), 2012.
Mohammad E. Khan, Benjamin M. Marlin, Guillaume Bouchard, and Kevin P. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, pages 1–9, 2010.
David A. Knowles and Thomas P. Minka. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems, pages 1701–1709, 2011.
Neil D. Lawrence, Marta Milo, Mahesan Niranjan, Penny Rashbass, and Stephan Soullier. Reducing the variability in cDNA microarray image processing by Bayesian inference. Bioinformatics, 20(4):518–526, 2004.
David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
Daniel McFadden. Conditional logit analysis of qualitative choice behaviour. In P. Zarembka, editor, Frontiers in Econometrics. Academic Press, New York, 1974.
John T. Ormerod and Matt Wand. Explaining variational approximations. The American Statistician, 64(2):140–153, 2010.
John T. Ormerod and Matt Wand. Gaussian variational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics, 21(1):2–17, 2012.
Giorgio Parisi. Statistical Field Theory. Addison-Wesley, 1988.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. URL http://www.R-project.org/.
Peter Rossi. bayesm: Bayesian Inference for Marketing/Micro-econometrics, 2012. URL http://CRAN.R-project.org/package=bayesm. R package version 2.2-5.
Peter Rossi, Greg Allenby, and Robert McCulloch. Bayesian Statistics and Marketing. John Wiley & Sons, 2005.
Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Methodological), 71:319–392, 2009.
Mike Titterington. The EM algorithm, variational approximations and expectation propagation for mixtures. In K. L. Mengersen, C. P. Robert, and D. M. Titterington, editors, Mixtures: Estimation and Applications, pages 1–30. John Wiley & Sons, Chichester, UK, 2011.
Kenneth Train. Discrete Choice Methods with Simulation. Cambridge University Press, second edition, 2009.
Kenneth Train and Kathleen Hudson. The Impact of Information on Vehicle Choices and the Demand for Electric Vehicles in California. Technical report, Toyota and General Motors, 2000.
Kenneth Train and Melvyn Weeks. Discrete choice models in preference space and willingness-to-pay space. In Riccardo Scarpa and Anna Alberini, editors, Applications of Simulation Methods in Environmental and Resource Economics, pages 1–16. Springer Netherlands, 2005.
Bo Wang and Mike Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 373–380. Society for Artificial Intelligence and Statistics, 2005.
Bo Wang and Mike Titterington. Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3):625–650, 2006.
John Winn and Christopher M. Bishop. Variational message passing. Journal of Machine Learning Research, 6:661–694, 2005.
Jie Yu, Peter Goos, and Martina Vandebroek. Comparing different sampling schemes for approximating the integrals involved in the efficient design of stated choice experiments. Transportation Research Part B: Methodological, 44(10):1268–1289, 2010.
Elaine L. Zanutto and Eric T. Bradlow. Data pruning in consumer choice models. Quantitative Marketing and Economics, 4(3):267–287, 2006.

Notes

1. Note that we use the same notation in this paper as Braun and McAuliffe (2010) for consistency.
2. The differential entropy of a density $f(x)$ is defined as $\mathbb{H}[f(x)] = -\int f(x) \log f(x)\, dx$.

Figure 1: Total variation error and times till convergence (in minutes) for the low heterogeneity setting ($\Omega = 0.25 \times I_K$), in panels by $J \in \{3, 12\}$, $K \in \{3, 10\}$ and $H \in \{250, 1000\}$, for the methods BL, BMD, JC, JID, KMD, QMCD and MCMC. Each observation is the average over 10 replications. The x-axes represent the number of choice sets per agent. Note that the y-axes are on a logarithmic scale. Some results are clipped from above to improve the readability.
[Figure 2. Panel grid titled $\Omega = 1 \times I_K$, with columns J = 3 & K = 3, J = 3 & K = 10, J = 12 & K = 3 and J = 12 & K = 10, rows showing TV error and time [min] for H = 250 and H = 1000, and legend BL, BMD, JC, JID, KMD, QMCD, MCMC.]
Figure 2: Total variation error and times till convergence (in minutes) for the high heterogeneity setting. Each observation is the average over 10 replications. The x-axes represent the number of choice sets per agent. Note that the y-axes are on a logarithmic scale. Some results are clipped from above to improve the readability.

[Figure 3. Panel grid titled $\Omega = \Omega_{WTP}$, with columns J = 3 and J = 12, rows showing TV error and time [min] for H = 250 and H = 1000, and legend BL, BMD, JC, JID, KMD, QMCD, MCMC.]
Figure 3: Total variation error and times till convergence (in minutes). Each observation is the average over 10 replications. The x-axes represent the number of choice sets per agent. Note that the y-axes are on a logarithmic scale.

[Figure 4. Scatter plot panels for the BM, JI, KM and QMC methods, plotting restricted against unrestricted performance in terms of TV error and time [min].]
Figure 4: Total variation error and times till convergence (in minutes) for all settings. Each observation is the average over 10 replications. The x-axes represent the performance of the algorithms with unrestricted decision makers' covariance matrices. The y-axes represent the performance of the algorithms with diagonally restricted decision makers' covariance matrices. The green dots represent the cases where the number of choice sets, $T_h$, was 15 or 25, whereas the cases where the number of choice sets was 1 or 5 are represented by red triangles. The dashed lines are the 45° lines, which represent identical performance.
A Variational Bayes for the Mixed Logit Model

In this section we develop equation (3), i.e. the expected joint log probability of the data and the priors under the factorized posterior approximation plus the entropy of the variational posterior distribution, in more detail. In order to avoid confusion about different parameterizations, we define the following form for the normal and inverse-Wishart densities:
$$
p\left(\zeta; \mu_\zeta, \Sigma_\zeta\right) \propto |\Sigma_\zeta|^{-\frac{1}{2}} e^{-\frac{1}{2}\left(\zeta-\mu_\zeta\right)^T \Sigma_\zeta^{-1}\left(\zeta-\mu_\zeta\right)},
\qquad
p\left(\Omega; \Upsilon^{-1}, \omega\right) \propto |\Omega|^{-\frac{\omega+K+1}{2}} e^{-\frac{1}{2}\mathrm{tr}\left(\Upsilon^{-1}\Omega^{-1}\right)}.
$$
As the densities of $q(\beta_h)$, $h = 1, \ldots, H$, have the same form as the density of $\zeta$, only the details for the latter are shown. The log joint probability of the mixed logit model is, up to a constant (hyperparameters of the priors are set before estimation and are thus constants; any term that only contains constants required for normalization of the normal and inverse-Wishart distributions is dropped here):
$$
\sum_{h=1}^{H}\sum_{t=1}^{T_h}\left\{ y_{ht}^T X_{ht}\beta_h - \log\left[\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right]\right\}
+ \sum_{h=1}^{H}\left[-\frac{1}{2}\log|\Omega| - \frac{1}{2}\beta_h^T\Omega^{-1}\beta_h - \frac{1}{2}\zeta^T\Omega^{-1}\zeta + \beta_h^T\Omega^{-1}\zeta\right]
- \frac{1}{2}\zeta^T\Omega_0^{-1}\zeta + \zeta^T\Omega_0^{-1}\beta_0 - \frac{\nu+K+1}{2}\log|\Omega| - \frac{1}{2}\mathrm{tr}\left(S^{-1}\Omega^{-1}\right). \tag{7}
$$
Because the assumed posterior distribution is factorized, we only require moments of the normal and inverse-Wishart distributions to evaluate the expected value of equation (7), and these are fairly easy to derive. In what follows all expectations are with respect to the approximate posterior densities of the variables over which the expectation is taken. For the normal expectations and entropy of the parameter $\zeta$ we have:
$$
E[\zeta] = \mu_\zeta, \qquad
E\left[\zeta^T\Omega_0\zeta\right] = E\left[\mathrm{tr}\left(\Omega_0\zeta\zeta^T\right)\right] = \mathrm{tr}\left[\Omega_0\left(\Sigma_\zeta + \mu_\zeta\mu_\zeta^T\right)\right],
$$
$$
H[\zeta] = \frac{K}{2}\log\left(2\pi e\right) + \frac{1}{2}\log|\Sigma_\zeta| = \frac{1}{2}\log|\Sigma_\zeta| + \text{constant}.
$$
The same expectations are required to evaluate the expectations for the $\beta_{h=1:H}$ parameters. For the inverse-Wishart expectations of the parameter $\Omega$ we have the following expectations (Gupta and Srivastava 2010):
$$
E\left[\Omega^{-1}\right] = \omega\Upsilon, \qquad
E\left[\log|\Omega|\right] = -\log|\Upsilon| - \sum_{k=1}^{K}\psi\left(\frac{\omega+1-k}{2}\right),
$$
$$
H[\Omega] = \sum_{k=1}^{K}\log\Gamma\left(\frac{\omega+1-k}{2}\right) + \frac{\omega K}{2} - \frac{K+1}{2}\log|\Upsilon| - \frac{\omega+K+1}{2}\sum_{k=1}^{K}\psi\left(\frac{\omega+1-k}{2}\right) + \text{constant},
$$
where $\psi(\cdot)$ represents the digamma function, $\psi(x) = \frac{d}{dx}\log\Gamma(x)$, and $\Gamma(x)$ represents the gamma function, $\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt$. When we plug these expectations into equation (7) we obtain the following expected joint log probability of the data and the priors, again up to a constant:
$$
\sum_{h=1}^{H}\sum_{t=1}^{T_h}\left\{ y_{ht}^T X_{ht}\mu_h - E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right]\right\}
+ \sum_{h=1}^{H}\left\{\frac{1}{2}\log|\Upsilon| + \frac{1}{2}\sum_{k=1}^{K}\psi\left(\frac{\omega+1-k}{2}\right) - \frac{1}{2}\mathrm{tr}\left[\omega\Upsilon\left(\Sigma_h + \Sigma_\zeta + \mu_h\mu_h^T + \mu_\zeta\mu_\zeta^T\right)\right] + \omega\mu_h^T\Upsilon\mu_\zeta\right\}
$$
$$
- \frac{1}{2}\mathrm{tr}\left[\Omega_0^{-1}\left(\Sigma_\zeta + \mu_\zeta\mu_\zeta^T\right)\right] + \mu_\zeta^T\Omega_0^{-1}\beta_0 + \frac{\nu+K+1}{2}\log|\Upsilon| + \frac{\nu+K+1}{2}\sum_{k=1}^{K}\psi\left(\frac{\omega+1-k}{2}\right) - \frac{\omega}{2}\mathrm{tr}\left(S^{-1}\Upsilon\right). \tag{8}
$$
The entropy of the variational posterior is, up to a constant,
$$
\sum_{h=1}^{H}\frac{1}{2}\log|\Sigma_h| + \frac{1}{2}\log|\Sigma_\zeta| + \sum_{k=1}^{K}\log\Gamma\left(\frac{\omega+1-k}{2}\right) + \frac{\omega K}{2} - \frac{K+1}{2}\log|\Upsilon| - \frac{\omega+K+1}{2}\sum_{k=1}^{K}\psi\left(\frac{\omega+1-k}{2}\right). \tag{9}
$$
We can now calculate the derivatives of the lower bound, i.e. (8) + (9), with respect to $\mu_\zeta$, $\Sigma_\zeta$, $\omega$ and $\Upsilon$ and set them to 0:
$$
\nabla_{\Sigma_\zeta} = -\frac{1}{2}\left(H\omega\Upsilon + \Omega_0^{-1} - \Sigma_\zeta^{-1}\right),
$$
$$
\nabla_{\mu_\zeta} = -\left(H\omega\Upsilon + \Omega_0^{-1}\right)\mu_\zeta + \omega\Upsilon\sum_{h=1}^{H}\mu_h + \Omega_0^{-1}\beta_0,
$$
$$
\nabla_{\Upsilon} = \frac{\nu+H}{2}\Upsilon^{-1} - \frac{\omega}{2}\left\{ S^{-1} + H\Sigma_\zeta + \sum_{h=1}^{H}\left[\Sigma_h + \left(\mu_h-\mu_\zeta\right)\left(\mu_h-\mu_\zeta\right)^T\right]\right\},
$$
$$
\frac{\partial\left((8)+(9)\right)}{\partial\omega} = \frac{K}{2} + \frac{H+\nu-\omega}{2}\sum_{k=1}^{K}\frac{\partial\psi\left(\frac{\omega+1-k}{2}\right)}{\partial\omega} - \frac{1}{2}\mathrm{tr}\left\{\Upsilon\left[S^{-1} + H\Sigma_\zeta + \sum_{h=1}^{H}\left(\Sigma_h + \left(\mu_h-\mu_\zeta\right)\left(\mu_h-\mu_\zeta\right)^T\right)\right]\right\}.
$$
Solving for the variational parameters we get the following closed form update equations:
$$
\Sigma_\zeta = \left(H\omega\Upsilon + \Omega_0^{-1}\right)^{-1},
\qquad
\mu_\zeta = \Sigma_\zeta\left(\omega\Upsilon\sum_{h=1}^{H}\mu_h + \Omega_0^{-1}\beta_0\right),
$$
$$
\omega = \nu + H,
\qquad
\Upsilon = \left\{S^{-1} + H\Sigma_\zeta + \sum_{h=1}^{H}\left[\Sigma_h + \left(\mu_h-\mu_\zeta\right)\left(\mu_h-\mu_\zeta\right)^T\right]\right\}^{-1}.
$$
The degrees of freedom parameter $\omega$ of the approximate posterior of $\Omega$ is not data dependent and can be fixed at its optimal value from the start. The only unspecified parts of the estimation algorithm are the updates with respect to $\mu_h$ and $\Sigma_h$ for all $h = 1, \ldots, H$.
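To make this coordinate ascent step concrete, the following R sketch collects the four closed form updates above in a single helper. It is an illustration under stated assumptions, not the authors' implementation: the function name update_population and its arguments are hypothetical, mu_h is assumed to be an H x K matrix of subject-level variational means, Sigma_h a list of the corresponding K x K covariance matrices, and beta0, Omega0inv (Omega_0^{-1}), S_inv (S^{-1}) and nu the prior hyperparameters.

# One pass of the closed form updates for the population-level variational parameters.
update_population <- function(mu_h, Sigma_h, beta0, Omega0inv, S_inv, nu, omega, Upsilon) {
  H <- nrow(mu_h)
  # Sigma_zeta = (H * omega * Upsilon + Omega_0^{-1})^{-1}
  Sigma_zeta <- solve(H * omega * Upsilon + Omega0inv)
  # mu_zeta = Sigma_zeta (omega * Upsilon * sum_h mu_h + Omega_0^{-1} beta_0)
  mu_zeta <- Sigma_zeta %*% (omega * (Upsilon %*% colSums(mu_h)) + Omega0inv %*% beta0)
  # omega = nu + H does not depend on the data and can be fixed from the start
  omega <- nu + H
  # Upsilon = {S^{-1} + H Sigma_zeta + sum_h [Sigma_h + (mu_h - mu_zeta)(mu_h - mu_zeta)^T]}^{-1}
  scatter <- Reduce(`+`, lapply(seq_len(H), function(h) {
    d <- mu_h[h, ] - drop(mu_zeta)
    Sigma_h[[h]] + tcrossprod(d)
  }))
  Upsilon <- solve(S_inv + H * Sigma_zeta + scatter)
  list(mu_zeta = mu_zeta, Sigma_zeta = Sigma_zeta, omega = omega, Upsilon = Upsilon)
}

In a full run this update would alternate with the subject-level updates of mu_h and Sigma_h that are discussed in Appendix B.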
B Derivation of Bounds and Approximations

B.1 Taylor Series

Consider the second order Taylor series expansion of the function $f(\beta_h) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)$ around the current mean $\mu_h$, which results in
$$
f\left(\beta_h\right) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right) \approx \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\mu_h}\right) + \left(\beta_h-\mu_h\right)^T\nabla\left(\mu_h\right) + \frac{1}{2}\left(\beta_h-\mu_h\right)^T H\left(\mu_h\right)\left(\beta_h-\mu_h\right),
$$
where $\nabla(\mu_h) = X_{ht}^T p_{ht}$, $H(\mu_h) = X_{ht}^T\left[\mathrm{diag}\left(p_{ht}\right) - p_{ht}p_{ht}^T\right]X_{ht}$, $p_{ht}$ is a $J$-dimensional vector with entries
$$
p_{htj} = \frac{e^{x_{htj}^T\mu_h}}{\sum_{j'=1}^{J} e^{x_{htj'}^T\mu_h}}, \quad \forall j = 1,\ldots,J,
$$
and $\mathrm{diag}(x)$ is an operator that constructs a diagonal matrix out of the elements in $x$. Taking expectations with respect to $\beta_h$ this leads to
$$
E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right] \approx \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\mu_h}\right) + \frac{1}{2}\mathrm{tr}\left[\Sigma_h H\left(\mu_h\right)\right].
$$
If we plug this approximation into equation (7) and collect all terms from equations (7) and (9) that depend only on $\mu_h$ and $\Sigma_h$, we obtain the following maximization problem:
$$
\arg\max_{\mu_h,\Sigma_h}\ \sum_{t=1}^{T_h}\left\{ y_{ht}^T X_{ht}\mu_h - \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\mu_h}\right) - \frac{1}{2}\mathrm{tr}\left[\Sigma_h H\left(\mu_h\right)\right]\right\} - \frac{\omega}{2}\mathrm{tr}\left[\Upsilon\Sigma_h\right] - \frac{\omega}{2}\mu_h^T\Upsilon\mu_h + \omega\mu_h^T\Upsilon\mu_\zeta + \frac{1}{2}\log|\Sigma_h|.
$$
This approach is the BM and BMD method, where the latter restricts $\Sigma_h$ to a diagonal matrix. Obviously this approach will only work well if the approximation is close enough.

B.2 Quasi Monte Carlo

The maximization problem for the QMC approach is
$$
\arg\max_{\mu_h,\Sigma_h}\ \sum_{t=1}^{T_h}\left\{ y_{ht}^T X_{ht}\mu_h - \frac{1}{R}\sum_{r=1}^{R}\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h^{(r)}}\right)\right\} - \frac{\omega}{2}\mathrm{tr}\left[\Upsilon\Sigma_h\right] - \frac{\omega}{2}\mu_h^T\Upsilon\mu_h + \omega\mu_h^T\Upsilon\mu_\zeta + \frac{1}{2}\log|\Sigma_h|,
$$
where $\beta_h^{(r)}$ denotes the $r$th draw from $q(\beta_h)$, obtained by transforming the standard normal QMC draws of Appendix C. This approach was considered with both an unrestricted and a diagonally restricted covariance matrix $\Sigma_h$.

B.3 Jensen's Inequality

As $\log(\cdot)$ is a concave function one can apply Jensen's inequality to obtain:
$$
E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right] \le \log\left(\sum_{j=1}^{J} E_{q(\beta_h)}\left[e^{x_{htj}^T\beta_h}\right]\right) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\mu_h + \frac{1}{2}x_{htj}^T\Sigma_h x_{htj}}\right).
$$
The latter expectation is simply the moment generating function of a multivariate normal distribution. If we plug this bound into equation (7) and collect all terms from equations (7) and (9) that depend only on $\mu_h$ and $\Sigma_h$, we obtain the following maximization problem:
$$
\arg\max_{\mu_h,\Sigma_h}\ \sum_{t=1}^{T_h}\left\{ y_{ht}^T X_{ht}\mu_h - \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\mu_h + \frac{1}{2}x_{htj}^T\Sigma_h x_{htj}}\right)\right\} - \frac{\omega}{2}\mathrm{tr}\left[\Upsilon\Sigma_h\right] - \frac{\omega}{2}\mu_h^T\Upsilon\mu_h + \omega\mu_h^T\Upsilon\mu_\zeta + \frac{1}{2}\log|\Sigma_h|.
$$
This approach is the JI and JID method, where the latter restricts $\Sigma_h$ to a diagonal matrix. To obtain the KM and KMD methods we need to introduce additional variational parameters. We start from the identity
$$
\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right) = \sum_{j=1}^{J} a_{htj}x_{htj}^T\beta_h + \log\left(\sum_{j=1}^{J} e^{\left(x_{htj} - \sum_{j'=1}^{J} a_{htj'}x_{htj'}\right)^T\beta_h}\right).
$$
Taking expectations and once again applying Jensen's inequality then leads to
$$
E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right] \le \sum_{j=1}^{J} a_{htj}x_{htj}^T\mu_h + \log\left(\sum_{j=1}^{J} e^{\left(x_{htj} - \sum_{j'} a_{htj'}x_{htj'}\right)^T\mu_h + \frac{1}{2}\left(x_{htj} - \sum_{j'} a_{htj'}x_{htj'}\right)^T\Sigma_h\left(x_{htj} - \sum_{j'} a_{htj'}x_{htj'}\right)}\right).
$$
Plugging this into equation (7) we obtain a maximization problem similar to the previous one. However, we have introduced extra variational parameters $a = \left(a_{1:H,1:T_h,1:J}\right)$ which also need to be updated. Taking derivatives and equating them to 0 results in the following fixed point update equations:
$$
a_{htj} = \frac{e^{x_{htj}^T\mu_h + \frac{1}{2}\left(x_{htj} - 2\sum_{j''=1}^{J} a_{htj''}x_{htj''}\right)^T\Sigma_h x_{htj}}}{\sum_{j'=1}^{J} e^{x_{htj'}^T\mu_h + \frac{1}{2}\left(x_{htj'} - 2\sum_{j''=1}^{J} a_{htj''}x_{htj''}\right)^T\Sigma_h x_{htj'}}}, \quad \forall h,t,j.
$$
This approach is the KM and KMD method, where the latter restricts $\Sigma_h$ to a diagonal matrix.
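The approximations of Sections B.2 and B.3 differ only in how the expected log-sum-exponential term is handled. The R sketch below (an illustration, not taken from the paper's code) evaluates this term for a single choice set with both the Jensen upper bound and a (quasi) Monte Carlo average; X denotes the J x K design matrix of the choice set, mu and Sigma the variational mean and covariance of beta_h, and Z an R x K matrix of (quasi) random standard normal draws, for example generated as in Appendix C.

# Numerically stable log-sum-exp of a vector.
lse <- function(v) { m <- max(v); m + log(sum(exp(v - m))) }

# Jensen upper bound: log sum_j exp(x_j' mu + 0.5 * x_j' Sigma x_j).
jensen_bound <- function(X, mu, Sigma) {
  lse(X %*% mu + 0.5 * rowSums((X %*% Sigma) * X))
}

# (Quasi) Monte Carlo approximation: average log-sum-exp over draws beta^(r) = mu + L z^(r).
mc_approx <- function(X, mu, Sigma, Z) {
  L <- t(chol(Sigma))                  # lower triangular factor with L L' = Sigma
  betas <- t(mu + L %*% t(Z))          # R x K matrix of draws from q(beta_h)
  mean(apply(betas %*% t(X), 1, lse))
}

By Jensen's inequality the first quantity is always at least as large as the exact expectation that the second quantity approximates, which gives a simple sanity check for an implementation.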
B.4 Böhning-Lindsay

Define $A = \frac{1}{2}\left(I_J - 1_J 1_J^T/J\right)$, where $I_J$ is the $J$-dimensional identity matrix and $1_J$ is a $J$-dimensional vector of ones. Böhning and Lindsay (1988) and Böhning (1992) show that $A \ge H$ with respect to the Loewner ordering ($A \ge H$ with respect to the Loewner ordering if $A - H$ is positive semi-definite). If we take a second order Taylor series expansion of the function $f(X_{ht}\beta_h) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)$ around some parameter vector $\Psi_{ht}$, we know that for some specific vector $\Psi_{ht}^*$ we get the following equality:
$$
f\left(X_{ht}\beta_h\right) = \log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right) = \log\left(\sum_{j=1}^{J} e^{\Psi_{htj}}\right) + \left(X_{ht}\beta_h - \Psi_{ht}\right)^T\nabla\left(\Psi_{ht}\right) + \frac{1}{2}\left(X_{ht}\beta_h - \Psi_{ht}\right)^T H\left(\Psi_{ht}^*\right)\left(X_{ht}\beta_h - \Psi_{ht}\right),
$$
where $\nabla(\Psi_{ht})$ and $H(\Psi_{ht}^*)$ are the gradient and Hessian of $f(X_{ht}\beta_h)$ evaluated at $\Psi_{ht}$ and $\Psi_{ht}^*$ respectively. Replacing $H(\Psi_{ht}^*)$ with $A$ and taking expectations over $\beta_h$ we obtain the following bound:
$$
E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right] \le \log\left(\sum_{j=1}^{J} e^{\Psi_{htj}}\right) + \left(X_{ht}\mu_h - \Psi_{ht}\right)^T\nabla\left(\Psi_{ht}\right) + \frac{1}{2}\mathrm{tr}\left(X_{ht}^T A X_{ht}\Sigma_h\right) + \frac{1}{2}\left(X_{ht}\mu_h - \Psi_{ht}\right)^T A\left(X_{ht}\mu_h - \Psi_{ht}\right).
$$
From this bound it is possible to generate analytic update equations for the subject specific parameters by plugging it into equation (7) and equating the derivatives with respect to $\mu_h$ and $\Sigma_h$ to 0, which results in:
$$
\Sigma_h = \left(\omega\Upsilon + \sum_{t=1}^{T_h} X_{ht}^T A X_{ht}\right)^{-1}, \quad h = 1,\ldots,H,
$$
$$
\mu_h = \Sigma_h\left\{\omega\Upsilon\mu_\zeta + \sum_{t=1}^{T_h} X_{ht}^T\left[y_{ht} - \nabla\left(\Psi_{ht}\right) + A\Psi_{ht}\right]\right\}, \quad h = 1,\ldots,H.
$$
Using derivatives again, it can be seen that the update for the extra variational parameters $\Psi_{ht}$, $\forall h,t$, turns out to be $\Psi_{ht} = X_{ht}\mu_h$.

B.5 Bouchard

Bouchard (2007) observed that $\sum_{j=1}^{J} e^{x_j} \le \prod_{j=1}^{J}\left(1 + e^{x_j}\right)$. Replacing $x_j$ by $x_{htj}^T\beta_h - \alpha_{ht}$ and taking logarithms we arrive at
$$
\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right) \le \alpha_{ht} + \sum_{j=1}^{J}\log\left(1 + e^{x_{htj}^T\beta_h - \alpha_{ht}}\right).
$$
Jaakkola and Jordan (2000) derived the well known tangential bound
$$
\log\left(1 + e^x\right) \le \frac{x-t}{2} + \frac{1}{4t}\tanh\left(\frac{t}{2}\right)\left(x^2 - t^2\right) + \log\left(1 + e^t\right).
$$
Combining these two results and taking expectations with respect to $\beta_h$ we obtain the following quadratic upper bound:
$$
E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right] \le \alpha_{ht} + \sum_{j=1}^{J}\left\{\frac{x_{htj}^T\mu_h - \alpha_{ht} - t_{htj}}{2} + \lambda\left(t_{htj}\right)\left[\left(x_{htj}^T\mu_h - \alpha_{ht}\right)^2 - t_{htj}^2 + x_{htj}^T\Sigma_h x_{htj}\right] + \log\left(1 + e^{t_{htj}}\right)\right\},
$$
where $\lambda(t) = \frac{1}{4t}\tanh\left(\frac{t}{2}\right)$. From this bound it is possible to generate analytic update equations for the subject specific parameters by plugging it into equation (7) and equating the derivatives with respect to $\mu_h$ and $\Sigma_h$ to 0, which results in:
$$
\Sigma_h = \left(\omega\Upsilon + 2\sum_{t=1}^{T_h}\sum_{j=1}^{J}\lambda\left(t_{htj}\right)x_{htj}x_{htj}^T\right)^{-1}, \quad \forall h = 1,\ldots,H,
$$
$$
\mu_h = \Sigma_h\left[\omega\Upsilon\mu_\zeta + \sum_{t=1}^{T_h}\sum_{j=1}^{J}\left(y_{htj} - \frac{1}{2} + 2\alpha_{ht}\lambda\left(t_{htj}\right)\right)x_{htj}\right], \quad \forall h = 1,\ldots,H.
$$
The extra variational parameters can be updated by the fixed point equations
$$
\alpha_{ht} = \frac{J/2 - 1 + 2\sum_{j=1}^{J}\lambda\left(t_{htj}\right)x_{htj}^T\mu_h}{2\sum_{j=1}^{J}\lambda\left(t_{htj}\right)}, \quad \forall h,t,
\qquad
t_{htj} = \sqrt{\left(x_{htj}^T\mu_h - \alpha_{ht}\right)^2 + x_{htj}^T\Sigma_h x_{htj}}, \quad \forall h,t,j.
$$
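Because the Bouchard updates are all elementwise, they are cheap to implement. The following R sketch (illustrative only, with hypothetical names) performs one fixed point pass for a single choice set; xb holds the J linear predictors X_ht mu_h and quad the quadratic forms x_htj' Sigma_h x_htj, e.g. diag(X %*% Sigma %*% t(X)).

# lambda(t) = tanh(t / 2) / (4 t), with its limit 1/8 at t = 0.
lambda <- function(t) ifelse(t == 0, 1 / 8, tanh(t / 2) / (4 * t))

bouchard_step <- function(xb, quad, alpha, t) {
  J <- length(xb)
  lam <- lambda(t)
  # Fixed point update for alpha_ht given the current t_htj.
  alpha <- (J / 2 - 1 + 2 * sum(lam * xb)) / (2 * sum(lam))
  # Update t_htj given the new alpha_ht.
  t <- sqrt((xb - alpha)^2 + quad)
  lam <- lambda(t)
  # Evaluate the quadratic upper bound on the expected log-sum-exponential term.
  bound <- alpha + sum((xb - alpha - t) / 2 +
                       lam * ((xb - alpha)^2 - t^2 + quad) +
                       log1p(exp(t)))
  list(alpha = alpha, t = t, bound = bound)
}

Iterating bouchard_step a few times typically suffices, since each update tightens the bound in one block of the extra variational parameters.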
B.6 Jebara-Choromanska

Jebara and Choromanska (2012) developed an algorithm to find a quadratic bound
$$
\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right) \le \log z_{ht} + \frac{1}{2}\left(\beta_h - \tilde{\beta}_h\right)^T S_{ht}\left(\beta_h - \tilde{\beta}_h\right) + \left(\beta_h - \tilde{\beta}_h\right)^T m_{ht}
$$
around some $\tilde{\beta}_h$. The algorithm outputs $z_{ht}$, $m_{ht}$ and $S_{ht}$ and is:

Algorithm 2
Input: $\tilde{\beta}_h$, $X_{ht}$
Initialize: $j = 1$, $z_{ht} = 0$, $m_{ht} = 0_K$, $S_{ht} = z_{ht} I_K$
while $j \le J$ do
    $\alpha = e^{x_{htj}^T\tilde{\beta}_h}$
    $l = x_{htj} - m_{ht}$
    $S_{ht} = S_{ht} + \frac{\tanh\left(\frac{1}{2}\log\left(\alpha/z_{ht}\right)\right)}{2\log\left(\alpha/z_{ht}\right)}\, l l^T$
    $m_{ht} = m_{ht} + \frac{\alpha}{z_{ht}+\alpha}\, l$
    $z_{ht} = z_{ht} + \alpha$
    $j = j + 1$
end while
Output: $z_{ht}$, $m_{ht}$, $S_{ht}$

After taking expectations this bound leads to
$$
E_{q(\beta_h)}\left[\log\left(\sum_{j=1}^{J} e^{x_{htj}^T\beta_h}\right)\right] \le \log z_{ht} + \frac{1}{2}\left(\mu_h - \tilde{\beta}_h\right)^T S_{ht}\left(\mu_h - \tilde{\beta}_h\right) + \frac{1}{2}\mathrm{tr}\left[\Sigma_h S_{ht}\right] + \left(\mu_h - \tilde{\beta}_h\right)^T m_{ht}.
$$
This quadratic bound again leads to analytic updates of the subjects' variational parameters in the form of
$$
\Sigma_h = \left(\omega\Upsilon + \sum_{t=1}^{T_h} S_{ht}\right)^{-1},
\qquad
\mu_h = \Sigma_h\left(\omega\Upsilon\mu_\zeta + \sum_{t=1}^{T_h}\left[X_{ht}^T y_{ht} + S_{ht}\tilde{\beta}_h - m_{ht}\right]\right).
$$
We chose to update $\tilde{\beta}_h$ as $\mu_\zeta$.

C Generating Quasi Monte Carlo Samples

In this section we briefly show how we constructed the QMC samples. We chose to construct the QMC samples according to Hickernell et al. (2000); these are called extensible shifted lattice points (ESLP). For more details on the properties and optimal construction of such samples we refer the reader to that reference. The goal of QMC sampling is to sample from the $K$-dimensional unit cube $[0,1)^K$ in such a way that the discrepancy between the empirical distribution of the QMC sample and the continuous uniform distribution is small. If this goal is achieved, relatively precise high dimensional integration can be performed with a relatively small number of samples, which benefits the computational efficiency of the algorithm.

Say that we require $R$ samples where $R$ is an integer power of an integer base, i.e. $R = b^m$ with $b \ge 2$ and $b$ and $m$ integers. We also require a generating vector $h$ of dimension $K$. Following Hickernell et al. (2000) we use the generating vector $h = \left(1, \eta, \eta^2, \ldots, \eta^{K-1}\right)^T$. The next step is to write the integers $0, 1, 2, \ldots, b^m - 1$ in base $b$ form. So, for instance, if $b = 2$ and $m = 3$, we have $R = 2^3 = 8$ samples. The integer 0 would be written as $0\times 2^0 + 0\times 2^1 + 0\times 2^2$, 1 would be written as $1\times 2^0 + 0\times 2^1 + 0\times 2^2$, and so on up to $7 = 1\times 2^0 + 1\times 2^1 + 1\times 2^2$. So now we have for each of the integers $0, \ldots, b^m - 1$ the coefficients of its base $b$ representation, which can be written as
$$
i = \sum_{k=0}^{m-1} i_k b^k = i_0 b^0 + i_1 b^1 + \ldots
$$
Define now the function $\phi_b(i)$ as
$$
\phi_b(i) = \sum_{k=0}^{m-1} i_k b^{-(k+1)} = i_0 b^{-1} + i_1 b^{-2} + \ldots
$$
The final element needed to generate the QMC sample is a random shift vector $u = \left(u_1, \ldots, u_K\right)^T$, which is an element of the unit cube $[0,1)^K$. The $i$th QMC sample is now defined as
$$
\left(\left\{\phi_b(i) h_1 + u_1\right\}, \ldots, \left\{\phi_b(i) h_K + u_K\right\}\right)^T,
$$
where $\{x\}$ denotes the fractional part of $x$, i.e. $\{x\} = x \pmod 1$. Hickernell et al. (2000) used a periodizing transformation on the final QMC samples as this appeared to increase the accuracy of the method. We also used this transformation, which is defined as $x' = |2x - 1|$. Finally, as we are interested in samples from a multivariate normal distribution rather than from a multivariate uniform distribution, we apply the inverse normal distribution transformation to all coordinates. This results in a QMC sample from a standard $K$-dimensional normal distribution. In our algorithms we used base $b = 2$, exponents $m = 6, \ldots, 12$ for the conditional logit model and $m = 6, 7, 8$ for the mixed logit model. Furthermore, we used $\eta = 1571$ from Hickernell et al. (2000, Table 4.1), which is appropriate for exponents $m = 6, \ldots, 12$ and up to $K = 33$ dimensions.
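As an illustration of the construction just described (and not the authors' code), the R sketch below generates an extensible shifted lattice point set in base b = 2 and maps it to standard normal coordinates. The generating vector is reduced modulo R = b^m, which leaves the fractional parts {phi_b(i) h_k + u_k} unchanged while avoiding numerical overflow for larger K; the function names are hypothetical.

# phi_b(i): reflect the base-b digits of the integer i around the decimal point.
radical_inverse <- function(i, b = 2, m = 10) {
  digits <- (i %/% b^(0:(m - 1))) %% b
  sum(digits * b^(-(1:m)))
}

qmc_normal <- function(m, K, eta = 1571, b = 2) {
  R <- b^m
  u <- runif(K)                        # random shift vector in [0, 1)^K
  h <- numeric(K); h[1] <- 1           # generating vector (1, eta, eta^2, ...) reduced modulo R
  for (k in seq_len(K - 1)) h[k + 1] <- (h[k] * eta) %% R
  phi <- sapply(0:(R - 1), radical_inverse, b = b, m = m)
  lattice <- (outer(phi, h) + matrix(u, R, K, byrow = TRUE)) %% 1
  lattice <- abs(2 * lattice - 1)      # periodizing transformation x' = |2x - 1|
  qnorm(lattice)                       # R x K sample from the standard K-variate normal
}

For example, qmc_normal(m = 8, K = 10) returns a 256 x 10 matrix, matching the largest number of draws used for the mixed logit model; different random shifts u give independent randomizations of the same lattice.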