
DP-EM: Differentially Private Expectation Maximization

Authors: Mijung Park, Jimmy Foulds, Kamalika Chaudhuri, Max Welling

Abstract: The iterative nature of the expectation maximization (EM) algorithm presents a challenge for privacy-preserving estimation, as each iteration increases the amount of noise needed. We propose a practical private EM algorithm that overcomes this challenge using two innovations: (1) a novel moment perturbation formulation for differentially private EM (DP-EM), and (2) the use of two recently developed composition methods to bound the privacy "cost" of multiple EM iterations: the moments accountant (MA) and zero-mean concentrated differential privacy (zCDP). Both MA and zCDP bound the moment generating function of the privacy loss random variable and achieve a refined tail bound, which effectively decreases the amount of additive noise. We present empirical results showing the benefits of our approach, as well as similar performance between these two composition methods in the DP-EM setting for Gaussian mixture models. Our approach can be readily extended to many iterative learning algorithms, opening up various exciting future directions.

Submitted 31 October, 2016; v1 submitted 23 May, 2016; originally announced May 2016.

arXiv:1603.07294 [pdf, other]

On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis

Authors: James Foulds, Joseph Geumlek, Max Welling, Kamalika Chaudhuri

Abstract: Bayesian inference has great promise for the privacy-preserving analysis of sensitive data, as posterior sampling automatically preserves differential privacy, an algorithmic notion of data privacy, under certain conditions (Dimitrakakis et al., 2014; Wang et al., 2015). While this one posterior sample (OPS) approach elegantly provides privacy "for free," it is data inefficient in the sense of asymptotic relative efficiency (ARE). We show that a simple alternative based on the Laplace mechanism, the workhorse of differential privacy, is as asymptotically efficient as non-private posterior inference, under general assumptions. This technique also has practical advantages including efficient use of the privacy budget for MCMC. We demonstrate the practicality of our approach on a time-series analysis of sensitive military records from the Afghanistan and Iraq wars disclosed by the Wikileaks organization.

Submitted 8 June, 2016; v1 submitted 23 March, 2016; originally announced March 2016.

Comments: Updated to match the accepted UAI version. Generalized the ARE result and included a more detailed proof. Improved some figures, etc

Journal ref: Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI), 2016

arXiv:1603.04733 [pdf, other]

Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors

Authors: Christos Louizos, Max Welling

Abstract: We introduce a variational Bayesian neural network where the parameters are governed via a probability distribution on random matrices. Specifically, we employ a matrix variate Gaussian \cite{gupta1999matrix} parameter posterior distribution where we explicitly model the covariance among the input and output dimensions of each layer. Furthermore, with approximate covariance matrices we can achieve a more efficient way to represent those correlations that is also cheaper than fully factorized parameter posteriors. We further show that with the "local reparametrization trick" \cite{kingma2015variational} on this posterior distribution we arrive at a Gaussian Process \cite{rasmussen2006gaussian} interpretation of the hidden units in each layer and we, similarly with \cite{gal2015dropout}, provide connections with deep Gaussian processes. We continue in taking advantage of this duality and incorporate "pseudo-data" \cite{snelson2005sparse} in our model, which in turn allows for more efficient sampling while maintaining the properties of the original model. The validity of the proposed approach is verified through extensive experiments.

Submitted 23 June, 2016; v1 submitted 15 March, 2016; originally announced March 2016.

Comments: Updated results with the original folds in the regression experiments. Appearing in the International Conference on Machine Learning (ICML) 2016

arXiv:1603.02518 [pdf, other]

A New Method to Visualize Deep Neural Networks

Authors: Luisa M. Zintgraf, Taco S. Cohen, Max Welling

Abstract: We present a method for visualising the response of a deep neural network to a specific input. For image data for instance our method will highlight areas that provide evidence in favor of, and against choosing a certain class. The method overcomes several shortcomings of previous methods and provides great additional insight into the decision making process of convolutional networks, which is important both to improve models and to accelerate the adoption of such methods in e.g. medicine. In experiments on ImageNet data, we illustrate how the method works and can be applied in different ways to understand deep neural nets.

Submitted 12 June, 2017; v1 submitted 8 March, 2016; originally announced March 2016.

Comments: Please note that this version of the article is outdated. The new version (published at ICLR2017) includes additional experiments on MRI scans and can be found at arXiv:1702.04595

arXiv:1602.08323 [pdf, other]

Deep Spiking Networks

Authors: Peter O'Connor, Max Welling

Abstract: We introduce an algorithm to do backpropagation on a spiking network. Our network is "spiking" in the sense that our neurons accumulate their activation into a potential over time, and only send out a signal (a "spike") when this potential crosses a threshold and the neuron is reset. Neurons only update their states when receiving signals from other neurons. Total computation of the network thus scales with the number of spikes caused by an input rather than network size. We show that the spiking Multi-Layer Perceptron behaves identically, during both prediction and training, to a conventional deep network of rectified-linear units, in the limiting case where we run the spiking network for a long time. We apply this architecture to a conventional classification problem (MNIST) and achieve performance very close to that of a conventional Multi-Layer Perceptron with the same architecture. Our network is a natural architecture for learning based on streaming event-based data, and is a stepping stone towards using spiking neural networks to learn efficiently on streaming data.

Submitted 7 November, 2016; v1 submitted 26 February, 2016; originally announced February 2016.

Comments: 8 pages main paper + 1 page reference + 7 pages Appendix

MSC Class: 68T01 ACM Class: F.1.1

arXiv:1602.07576 [pdf, ps, other]

Group Equivariant Convolutional Networks

Authors: Taco S. Cohen, Max Welling

Abstract: We introduce Group equivariant Convolutional Neural Networks (G-CNNs), a natural generalization of convolutional neural networks that reduces sample complexity by exploiting symmetries. G-CNNs use G-convolutions, a new type of layer that enjoys a substantially higher degree of weight sharing than regular convolution layers. G-convolutions increase the expressive capacity of the network without increasing the number of parameters. Group convolution layers are easy to use and can be implemented with negligible computational overhead for discrete groups generated by translations, reflections and rotations. G-CNNs achieve state of the art results on CIFAR10 and rotated MNIST.

Submitted 3 June, 2016; v1 submitted 24 February, 2016; originally announced February 2016.

Journal ref: Proceedings of the International Conference on Machine Learning (ICML), 2016

arXiv:1602.03014 [pdf, other]

Herding as a Learning System with Edge-of-Chaos Dynamics

Authors: Yutian Chen, Max Welling

Abstract: Herding defines a deterministic dynamical system at the edge of chaos. It generates a sequence of model states and parameters by alternating parameter perturbations with state maximizations, where the sequence of states can be interpreted as "samples" from an associated MRF model. Herding differs from maximum likelihood estimation in that the sequence of parameters does not converge to a fixed point and differs from an MCMC posterior sampling approach in that the sequence of states is generated deterministically. Herding may be interpreted as a "perturb and map" method where the parameter perturbations are generated using a deterministic nonlinear dynamical system rather than randomly from a Gumbel distribution. This chapter studies the distinct statistical characteristics of the herding algorithm and shows that the fast convergence rate of the controlled moments may be attributed to edge of chaos dynamics. The herding algorithm can also be generalized to models with latent variables and to a discriminative learning setting. The perceptron cycling theorem ensures that the fast moment matching property is preserved in the more general framework.

Submitted 1 March, 2016; v1 submitted 9 February, 2016; originally announced February 2016.

arXiv:1511.00830 [pdf, other]

The Variational Fair Autoencoder

Authors: Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, Richard Zemel

Abstract: We investigate the problem of learning representations that are invariant to certain nuisance or sensitive factors of variation in the data while retaining as much of the remaining information as possible. Our model is based on a variational autoencoding architecture with priors that encourage independence between sensitive and latent factors of variation. Any subsequent processing, such as classification, can then be performed on this purged latent representation. To remove any remaining dependencies we incorporate an additional penalty term based on the "Maximum Mean Discrepancy" (MMD) measure. We discuss how these architectures can be efficiently trained on data and show in experiments that this method is more effective than previous work in removing unwanted sources of variation while maintaining informative latent representations.

Submitted 9 August, 2017; v1 submitted 3 November, 2015; originally announced November 2015.

Comments: Fixed typo in eq. 3 and 4

arXiv:1510.04815 [pdf, ps, other]

Scalable MCMC for Mixed Membership Stochastic Blockmodels

Authors: Wenzhe Li, Sungjin Ahn, Max Welling

Abstract: We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in mixed-membership stochastic blockmodels (MMSB). Our algorithm is based on the stochastic gradient Riemannian Langevin sampler and achieves both faster speed and higher accuracy at every iteration than the current state-of-the-art algorithm based on stochastic variational inference. In addition we develop an approximation that can handle models that entertain a very large number of communities. The experimental results show that SG-MCMC strictly dominates competing algorithms in all cases.

Submitted 21 October, 2015; v1 submitted 16 October, 2015; originally announced October 2015.

Comments: 9 pages, 18 figures

arXiv:1506.04416 [pdf, other]

Bayesian Dark Knowledge

Authors: Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling

Abstract: We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities, e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time). We describe a method for "distilling" a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [Hernandez-Lobato and Adams, 2015] and an approach based on variational Bayes [Blundell et al., 2015]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.

Submitted 6 November, 2015; v1 submitted 14 June, 2015; originally announced June 2015.

Comments: final version submitted to NIPS 2015

arXiv:1506.03693 [pdf, other]

Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference

Authors: Edward Meeds, Max Welling

Abstract: We describe an embarrassingly parallel, anytime Monte Carlo method for likelihood-free models. The algorithm starts with the view that the stochasticity of the pseudo-samples generated by the simulator can be controlled externally by a vector of random numbers u, in such a way that the outcome, knowing u, is deterministic. For each instantiation of u we run an optimization procedure to minimize the distance between summary statistics of the simulator and the data. After reweighing these samples using the prior and the Jacobian (accounting for the change of volume in transforming from the space of summary statistics to the space of parameters) we show that this weighted ensemble represents a Monte Carlo estimate of the posterior distribution. The procedure can be run embarrassingly parallel (each node handling one sample) and anytime (by allocating resources to the worst performing sample). The procedure is validated on six experiments.

Submitted 2 December, 2015; v1 submitted 11 June, 2015; originally announced June 2015.

Comments: NIPS 2015 camera ready

arXiv:1506.02557 [pdf, other]

Variational Dropout and the Local Reparameterization Trick

Authors: Diederik P. Kingma, Tim Salimans, Max Welling

Abstract: We investigate a local reparameterization technique for greatly reducing the variance of stochastic gradients for variational Bayesian inference (SGVB) of a posterior over model parameters, while retaining parallelizability. This local reparameterization translates uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such parameterizations can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. Additionally, we explore a connection with dropout: Gaussian dropout objectives correspond to SGVB with local reparameterization, a scale-invariant prior and proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose variational dropout, a generalization of Gaussian dropout where the dropout rates are learned, often leading to better models. The method is demonstrated through several experiments.

Submitted 20 December, 2015; v1 submitted 8 June, 2015; originally announced June 2015.

arXiv:1505.04413 [pdf, other]

Harmonic Exponential Families on Manifolds

Authors: Taco S. Cohen, Max Welling

Abstract: In a range of fields including the geosciences, molecular biology, robotics and computer vision, one encounters problems that involve random variables on manifolds. Currently, there is a lack of flexible probabilistic models on manifolds that are fast and easy to train. We define an extremely flexible class of exponential family distributions on manifolds such as the torus, sphere, and rotation groups, and show that for these distributions the gradient of the log-likelihood can be computed efficiently using a non-commutative generalization of the Fast Fourier Transform (FFT). We discuss applications to Bayesian camera motion estimation (where harmonic exponential families serve as conjugate priors), and modelling of the spatial distribution of earthquakes on the surface of the earth. Our experimental results show that harmonic densities yield a significantly higher likelihood than the best competing method, while being orders of magnitude faster to train.

Submitted 20 May, 2015; v1 submitted 17 May, 2015; originally announced May 2015.

Comments: fixed typo

Journal ref: Proceedings of the International Conference on Machine Learning, 2015

arXiv:1503.01916 [pdf, other]

Hamiltonian ABC

Authors: Edward Meeds, Robert Leenders, Max Welling

Abstract: Approximate Bayesian computation (ABC) is a powerful and elegant framework for performing inference in simulation-based models. However, due to the difficulty in scaling likelihood estimates, ABC remains useful for relatively low-dimensional problems. We introduce Hamiltonian ABC (HABC), a set of likelihood-free algorithms that apply recent advances in scaling Bayesian learning using Hamiltonian Monte Carlo (HMC) and stochastic gradients. We find that a small number of forward simulations can effectively approximate the ABC gradient, allowing Hamiltonian dynamics to efficiently traverse parameter spaces. We also describe a new simple yet general approach of incorporating random seeds into the state of the Markov chain, further reducing the random walk behavior of HABC. We demonstrate HABC on several typical ABC problems, and show that HABC samples comparably to regular Bayesian inference using true gradients on a high-dimensional problem from machine learning.

Submitted 6 March, 2015; originally announced March 2015.

Comments: Submission to UAI 2015

arXiv:1503.01596 [pdf, other]

Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC

Authors: Sungjin Ahn, Anoop Korattikara, Nathan Liu, Suju Rajan, Max Welling

Abstract: Despite having various attractive qualities such as high prediction accuracy and the ability to quantify uncertainty and avoid over-fitting, Bayesian Matrix Factorization has not been widely adopted because of the prohibitive cost of inference. In this paper, we propose a scalable distributed Bayesian matrix factorization algorithm using stochastic gradient MCMC. Our algorithm, based on Distributed Stochastic Gradient Langevin Dynamics, can not only match the prediction accuracy of standard MCMC methods like Gibbs sampling, but at the same time is as fast and simple as stochastic gradient descent. In our experiments, we show that our algorithm can achieve the same level of prediction accuracy as Gibbs sampling an order of magnitude faster. We also show that our method reduces the prediction error as fast as distributed stochastic gradient descent, achieving a 4.1% improvement in RMSE for the Netflix dataset and a 1.8% improvement for the Yahoo music dataset.

Submitted 9 March, 2015; v1 submitted 5 March, 2015; originally announced March 2015.

arXiv:1412.7659 [pdf, other]

Transformation Properties of Learned Visual Representations

Authors: Taco S. Cohen, Max Welling

Abstract: When a three-dimensional object moves relative to an observer, a change occurs on the observer's image plane and in the visual representation computed by a learned model. Starting with the idea that a good visual representation is one that transforms linearly under scene motions, we show, using the theory of group representations, that any such representation is equivalent to a combination of the elementary irreducible representations. We derive a striking relationship between irreducibility and the statistical dependency structure of the representation, by showing that under restricted conditions, irreducible representations are decorrelated. Under partial observability, as induced by the perspective projection of a scene onto the image plane, the motion group does not have a linear action on the space of images, so that it becomes necessary to perform inference over a latent representation that does transform linearly. This idea is demonstrated in a model of rotating NORB objects that employs a latent representation of the non-commutative 3D rotation group SO(3).

Submitted 7 April, 2015; v1 submitted 24 December, 2014; originally announced December 2014.

Comments: T.S. Cohen & M. Welling, Transformation Properties of Learned Visual Representations. In International Conference on Learning Representations (ICLR), 2015

Journal ref: Proceedings of the International Conference on Learning Representations, 2015

arXiv:1412.3051 [pdf, other]

POPE: Post Optimization Posterior Evaluation of Likelihood Free Models

Authors: Edward Meeds, Michael Chiang, Mary Lee, Olivier Cinquin, John Lowengrub, Max Welling

Abstract: In many domains, scientists build complex simulators of natural phenomena that encode their hypotheses about the underlying processes. These simulators can be deterministic or stochastic, fast or slow, constrained or unconstrained, and so on. Optimizing the simulators with respect to a set of parameter values is common practice, resulting in a single parameter setting that minimizes an objective subject to constraints. We propose a post optimization posterior analysis that computes and visualizes all the models that can generate equally good or better simulation results, subject to constraints. These optimization posteriors are desirable for a number of reasons, including easy interpretability, automatic parameter sensitivity and correlation analysis, and posterior predictive analysis. We develop a new sampling framework based on approximate Bayesian computation (ABC) with one-sided kernels. In collaboration with two groups of scientists we applied POPE to two important biological simulators: a fast and stochastic simulator of stem-cell cycling and a slow and deterministic simulator of tumor growth patterns.

Submitted 9 December, 2014; originally announced December 2014.

arXiv:1412.2432 [pdf, other]

MLitB: Machine Learning in the Browser

Authors: Edward Meeds, Remco Hendriks, Said Al Faraby, Magiel Bruntink, Max Welling

Abstract: With few exceptions, the field of Machine Learning (ML) research has largely ignored the browser as a computational engine. Beyond an educational resource for ML, the browser has vast potential to not only improve the state-of-the-art in ML research, but also, inexpensively and on a massive scale, to bring sophisticated ML learning and prediction to the public at large. This paper introduces MLitB, a prototype ML framework written entirely in JavaScript, capable of performing large-scale distributed computing with heterogeneous classes of devices. The development of MLitB has been driven by several underlying objectives whose aim is to make ML learning and usage ubiquitous (by using ubiquitous compute devices), cheap and effortlessly distributed, and collaborative. This is achieved by allowing every internet capable device to run training algorithms and predictive models with no software installation and by saving models in universally readable formats. Our prototype library is capable of training deep neural networks with synchronized, distributed stochastic gradient descent. MLitB offers several important opportunities for novel ML research, including: development of distributed learning algorithms, advancement of web GPU algorithms, novel field and mobile applications, privacy preserving computing, and green grid-computing. MLitB is available as open source software.

Submitted 17 June, 2015; v1 submitted 7 December, 2014; originally announced December 2014.

Comments: Revised for PeerJ Computer Science

arXiv:1410.6460 [pdf, other]

Markov Chain Monte Carlo and Variational Inference: Bridging the Gap

Authors: Tim Salimans, Diederik P. Kingma, Max Welling

Abstract: Recent advances in stochastic gradient variational inference have made it possible to perform variational Bayesian inference with posterior approximations containing auxiliary random variables. This enables us to explore a new synthesis of variational inference and Monte Carlo methods where we incorporate one or more steps of MCMC into our variational approximation. By doing so we obtain a rich class of inference algorithms bridging the gap between variational methods and MCMC, and offering the best of both worlds: fast posterior approximation through the maximization of an explicit objective, with the option of trading off additional computation for additional accuracy. We describe the theoretical foundations that make this possible and show some promising first results.

Submitted 19 May, 2015; v1 submitted 23 October, 2014; originally announced October 2014.

arXiv:1408.2047 [pdf]

Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior

Authors: Yutian Chen, Max Welling

Abstract: In recent years a number of methods have been developed for automatically learning the (sparse) connectivity structure of Markov Random Fields. These methods are mostly based on L1-regularized optimization, which has a number of disadvantages such as the inability to assess model uncertainty and expensive cross-validation to find the optimal regularization parameter. Moreover, the model's predictive performance may degrade dramatically with a suboptimal value of the regularization parameter (which is sometimes desirable to induce sparseness). We propose a fully Bayesian approach based on a "spike and slab" prior (similar to L0 regularization) that does not suffer from these shortcomings. We develop an approximate MCMC method combining Langevin dynamics and reversible jump MCMC to conduct inference in this model. Experiments show that the proposed model learns a good combination of the structure and parameter values without the need for separate hyper-parameter tuning. Moreover, the model's predictive performance is much more robust than L1-based methods with hyper-parameter settings that induce highly sparse model structures.

Submitted 9 August, 2014; originally announced August 2014.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-174-184

arXiv:1406.5298 [pdf, other]

Semi-Supervised Learning with Deep Generative Models

Authors: Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling

Abstract: The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.

Submitted 31 October, 2014; v1 submitted 20 June, 2014; originally announced June 2014.

Comments: To appear in the proceedings of Neural Information Processing Systems (NIPS) 2014

arXiv:1402.7025 [pdf, other]

Exploiting the Statistics of Learning and Inference

Authors: Max Welling

Abstract: When dealing with datasets containing a billion instances or with simulations that require a supercomputer to execute, computational resources become part of the equation. We can improve the efficiency of learning and inference by exploiting their inherent statistical nature. We propose algorithms that exploit the redundancy of data relative to a model by subsampling data-cases for every update and reasoning about the uncertainty created in this process. In the context of learning we propose to test for the probability that a stochastically estimated gradient points more than 180 degrees in the wrong direction. In the context of MCMC sampling we use stochastic gradients to improve the efficiency of MCMC updates, and hypothesis tests based on adaptive mini-batches to decide whether to accept or reject a proposed parameter update. Finally, we argue that in the context of likelihood-free MCMC one needs to store all the information revealed by all simulations, for instance in a Gaussian process. We conclude that Bayesian methods will continue to play a crucial role in the era of big data and big simulations, but only if we overcome a number of computational challenges.

Submitted 4 March, 2014; v1 submitted 26 February, 2014; originally announced February 2014.

Comments: Proceedings of the NIPS workshop on "Probabilistic Models for Big Data"

arXiv:1402.4437 [pdf, ps, other]

Learning the Irreducible Representations of Commutative Lie Groups

Authors: Taco Cohen, Max Welling

Abstract: We present a new probabilistic model of compact commutative Lie groups that produces invariant-equivariant and disentangled representations of data. To define the notion of disentangling, we borrow a fundamental principle from physics that is used to derive the elementary particles of a system from its symmetries. Our model employs a newfound Bayesian conjugacy relation that enables fully tractable probabilistic inference over compact commutative Lie groups -- a class that includes the groups that describe the rotation and cyclic translation of images. We train the model on pairs of transformed image patches, and show that the learned invariant representation is highly effective for classification.

Submitted 25 May, 2014; v1 submitted 18 February, 2014; originally announced February 2014.

Journal ref: Proceedings of the International Conference on Machine Learning, 2014

arXiv:1402.0480 [pdf, other]

Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets

Authors: Diederik P. Kingma, Max Welling

Abstract: Hierarchical Bayesian networks and neural networks with stochastic hidden units are commonly perceived as two separate types of models. We show that either of these types of models can often be transformed into an instance of the other, by switching between centered and differentiable non-centered parameterizations of the latent variables. The choice of parameterization greatly influences the efficiency of gradient-based posterior inference; we show that they are often complementary to each other, we clarify when each parameterization is preferred and show how inference can be made robust. In the non-centered form, a simple Monte Carlo estimator of the marginal likelihood can be used for learning the parameters. Theoretical results are supported by experiments.

Submitted 22 January, 2015; v1 submitted 3 February, 2014; originally announced February 2014.

Journal ref: Proceedings of The 31st International Conference on Machine Learning, pp. 1782-1790, 2014

arXiv:1401.2838 [pdf, other]

GPS-ABC: Gaussian Process Surrogate Approximate Bayesian Computation

Authors: Edward Meeds, Max Welling

Abstract: Scientists often express their understanding of the world through a computationally demanding simulation program. Analyzing the posterior distribution of the parameters given observations (the inverse problem) can be extremely challenging. The Approximate Bayesian Computation (ABC) framework is the standard statistical tool to handle these likelihood-free problems, but it requires a very large number of simulations. In this work we develop two new ABC sampling algorithms that significantly reduce the number of simulations necessary for posterior inference. Both algorithms use confidence estimates for the accept probability in the Metropolis-Hastings step to adaptively choose the number of necessary simulations. Our GPS-ABC algorithm stores the information obtained from every simulation in a Gaussian process which acts as a surrogate function for the simulated statistics. Experiments on a challenging realistic biological problem illustrate the potential of these algorithms.

Submitted 13 January, 2014; originally announced January 2014.

arXiv:1312.6114 [pdf, other]

Auto-Encoding Variational Bayes

Authors: Diederik P Kingma, Max Welling

Abstract: How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.

Submitted 10 December, 2022; v1 submitted 20 December, 2013; originally announced December 2013.

Comments: Fixes a typo in the abstract, no other changes
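
The reparameterization mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian q(z) = N(mu, sigma^2) and an arbitrary differentiable test function f; the function name `reparam_grad_estimate` and its parameters are hypothetical, and this is not the authors' implementation. Writing z = mu + sigma * eps with eps ~ N(0, 1) moves the randomness into eps, so gradients of E[f(z)] with respect to mu and log sigma can be estimated by plain Monte Carlo.

```python
import math
import random


def reparam_grad_estimate(mu, log_sigma, f_grad, num_samples=20000, rng=None):
    """Estimate gradients of E_{z ~ N(mu, sigma^2)}[f(z)] w.r.t. mu and log_sigma.

    f_grad is the derivative of f (an assumption of this sketch: f is
    differentiable and its derivative is supplied by the caller).
    """
    rng = rng or random.Random(0)
    sigma = math.exp(log_sigma)
    grad_mu = 0.0
    grad_log_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)      # noise independent of the parameters
        z = mu + sigma * eps           # reparameterized sample
        df = f_grad(z)
        grad_mu += df                  # dz/dmu = 1
        grad_log_sigma += df * sigma * eps  # dz/dlog_sigma = sigma * eps
    return grad_mu / num_samples, grad_log_sigma / num_samples


# Usage: for f(z) = z^2, E[f(z)] = mu^2 + sigma^2, so the exact gradients
# are 2 * mu and 2 * sigma^2.
g_mu, g_ls = reparam_grad_estimate(1.5, 0.0, lambda z: 2.0 * z)
```

With f(z) = z^2 the estimates can be checked against the closed-form gradients, which is a convenient sanity test before substituting a real lower-bound objective.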

arXiv:1305.2452 [pdf, ps, other]

Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation

Authors: James Foulds, Levi Boyles, Christopher Dubois, Padhraic Smyth, Max Welling

Abstract: In the internet era there has been an explosion in the amount of digital text information available, leading to difficulties of scale for traditional inference algorithms for topic models. Recent advances in stochastic variational inference algorithms for latent Dirichlet allocation (LDA) have made it feasible to learn topic models on large-scale corpora, but these methods do not currently take full advantage of the collapsed representation of the model. We propose a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state-of-the-art method. We show connections between collapsed variational Bayesian inference and MAP estimation for LDA, and leverage these connections to prove convergence properties of the proposed algorithm. In experiments on large-scale text corpora, the algorithm was found to converge faster and often to a better solution than the previous method. Human-subject experiments also demonstrated that the method can learn coherent topics in seconds on small corpora, facilitating the use of topic models in interactive document analysis software.

Submitted 10 May, 2013; originally announced May 2013.

arXiv:1304.5299 [pdf, other]

Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Authors: Anoop Korattikara, Yutian Chen, Max Welling

Abstract: Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.

Submitted 14 February, 2014; v1 submitted 18 April, 2013; originally announced April 2013.

Comments: v4 - version accepted by ICML2014
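
A rough sketch of the sequential-test idea in the abstract above follows. It assumes the MH decision has been rewritten as "accept if the mean per-datapoint log-likelihood difference exceeds a threshold mu0", and it grows a mini-batch until a test on that mean is confident. Several details are assumptions of this sketch rather than facts taken from the listing: the function name `approx_mh_decision`, the normal approximation used for the tail probability, and the precomputed list of differences (a real implementation would evaluate them lazily, which is where the savings come from).

```python
import math
import random
from statistics import NormalDist, mean, stdev


def approx_mh_decision(log_lik_diffs, mu0, batch_size=50, eps=0.05, rng=None):
    """Decide whether mean(log_lik_diffs) > mu0 from as few datapoints as possible.

    Returns (accept, num_datapoints_used).
    """
    rng = rng or random.Random(0)
    total = len(log_lik_diffs)
    order = list(range(total))
    rng.shuffle(order)                 # sample without replacement
    n = 0
    while True:
        n = min(n + batch_size, total)
        batch = [log_lik_diffs[i] for i in order[:n]]
        l_bar = mean(batch)
        if n == total:
            return l_bar > mu0, n      # all data used: exact decision
        # Standard error of the mean with a finite-population correction.
        s = (stdev(batch) / math.sqrt(n)) * math.sqrt(1.0 - (n - 1) / (total - 1))
        if s == 0.0:
            continue                   # no spread observed yet; look at more data
        t = (l_bar - mu0) / s
        delta = 1.0 - NormalDist().cdf(abs(t))  # normal approximation (assumption)
        if delta < eps:
            return l_bar > mu0, n      # confident enough to stop early


# Usage: differences clearly above the threshold are accepted from one batch.
_rng = random.Random(1)
diffs = [1.0 + 0.1 * _rng.gauss(0.0, 1.0) for _ in range(10000)]
accept, used = approx_mh_decision(diffs, mu0=0.0)
```

The tolerance `eps` controls the trade-off the abstract describes: a larger value stops earlier and uses less data, at the cost of more wrong decisions and hence more bias.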

arXiv:1301.4168 [pdf, other]

Herded Gibbs Sampling

Authors: Luke Bornn, Yutian Chen, Nando de Freitas, Mareija Eskelin, Jing Fang, Max Welling

Abstract: The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this paper, we introduce a herding variant of this algorithm, called herded Gibbs, that is entirely deterministic. We prove that herded Gibbs has an $O(1/T)$ convergence rate for models with independent variables and for fully connected probabilistic graphical models. Herded Gibbs is shown to outperform Gibbs in the tasks of image denoising with MRFs and named entity recognition with CRFs. However, the convergence for herded Gibbs for sparsely connected probabilistic graphical models is still an open problem.

Submitted 15 March, 2013; v1 submitted 17 January, 2013; originally announced January 2013.

Comments: 19 pages, including the appendix. Submission for ICLR 2013
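
To give a feel for the deterministic, O(1/T) behaviour claimed in the abstract above, here is a toy sketch of herding for a single binary variable with a fixed probability p. This is only the one-variable building block, not herded Gibbs itself, and the function name `herd_bernoulli` and the zero initial weight are assumptions of this sketch. A running weight accumulates p at every step and emits a 1 whenever it reaches 1, so the empirical frequency tracks p without any random numbers.

```python
def herd_bernoulli(p, num_steps):
    """Deterministically generate a 0/1 sequence whose frequency of ones tracks p."""
    w = 0.0                  # herding weight, stays in [0, 1)
    xs = []
    for _ in range(num_steps):
        w += p               # accumulate the target probability
        if w >= 1.0:
            xs.append(1)     # enough mass accumulated: emit a one
            w -= 1.0
        else:
            xs.append(0)
    return xs


# Usage: the frequency of ones differs from p by less than 1 / T.
xs = herd_bernoulli(0.3, 1000)
freq = sum(xs) / len(xs)
```

Because the number of ones equals T * p minus the final weight, the error in the empirical frequency is bounded by 1 / T, compared with the roughly 1 / sqrt(T) error of independent random draws.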

arXiv:1301.2317 [pdf]

Belief Optimization for Binary Networks: A Stable Alternative to Loopy Belief Propagation

Authors: Max Welling, Yee Whye Teh

Abstract: We present a novel inference algorithm for arbitrary, binary, undirected graphs. Unlike loopy belief propagation, which iterates fixed point equations, we directly descend on the Bethe free energy. The algorithm consists of two phases: first, we update the pairwise probabilities, given the marginal probabilities at each unit, using an analytic expression. Next, we update the marginal probabilities, given the pairwise probabilities, by following the negative gradient of the Bethe free energy. Both steps are guaranteed to decrease the Bethe free energy, and since it is lower bounded, the algorithm is guaranteed to converge to a local minimum. We also show that the Bethe free energy is equal to the TAP free energy up to second order in the weights. In experiments we confirm that when belief propagation converges it usually finds the same solutions as our belief optimization method. However, in cases where belief propagation fails to converge, belief optimization continues to converge to reasonable beliefs. The stable nature of belief optimization makes it ideally suited for learning graphical models from data.

Submitted 10 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001)

Report number: UAI-P-2001-PG-554-561

arXiv:1212.2513 [pdf]

Efficient Parametric Projection Pursuit Density Estimation

Authors: Max Welling, Richard S. Zemel, Geoffrey E. Hinton

Abstract: Product models of low dimensional experts are a powerful way to avoid the curse of dimensionality. We present the "under-complete product of experts" (UPoE), where each expert models a one dimensional projection of the data. The UPoE is fully tractable and may be interpreted as a parametric probabilistic model for projection pursuit. Its ML learning rules are identical to the approximate learning rules proposed before for under-complete ICA. We also derive an efficient sequential learning algorithm and discuss its relationship to projection pursuit density estimation and feature induction algorithms for additive random field models.

Submitted 19 October, 2012; originally announced December 2012.

Comments: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

Report number: UAI-P-2003-PG-575-582

arXiv:1210.4916 [pdf]

A Cluster-Cumulant Expansion at the Fixed Points of Belief Propagation

Authors: Max Welling, Andrew E. Gelfand, Alexander T. Ihler

Abstract: We introduce a new cluster-cumulant expansion (CCE) based on the fixed points of iterative belief propagation (IBP). This expansion is similar in spirit to the loop-series (LS) recently introduced in [1]. However, in contrast to the latter, the CCE enjoys the following important qualities: 1) it is defined for arbitrary state spaces, 2) it is easily extended to fixed points of generalized belief propagation (GBP), 3) disconnected groups of variables will not contribute to the CCE and 4) the accuracy of the expansion empirically improves upon that of the LS. The CCE is based on the same Möbius transform as the Kikuchi approximation, but unlike GBP does not require storing the beliefs of the GBP-clusters nor does it suffer from convergence issues during belief updating.

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-883-892

arXiv:1210.4857 [pdf]

Generalized Belief Propagation on Tree Robust Structured Region Graphs

Authors: Andrew E. Gelfand, Max Welling

Abstract: This paper provides some new guidance in the construction of region graphs for Generalized Belief Propagation (GBP). We connect the problem of choosing the outer regions of a Loop-Structured Region Graph (SRG) to that of finding a fundamental cycle basis of the corresponding Markov network. We also define a new class of tree-robust Loop-SRG for which GBP on any induced (spanning) tree of the Markov network, obtained by setting to zero the off-tree interactions, is exact. This class of SRG is then mapped to an equivalent class of tree-robust cycle bases on the Markov network. We show that a tree-robust cycle basis can be identified by proving that for every subset of cycles, the graph obtained from the edges that participate in a single cycle only, is multiply connected. Using this we identify two classes of tree-robust cycle bases: planar cycle bases and "star" cycle bases. In experiments we show that tree-robustness can be successfully exploited as a design principle to improve the accuracy and convergence of GBP.

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-296-305

arXiv:1210.2162 [pdf, other]

Semisupervised Classifier Evaluation and Recalibration

Authors: Peter Welinder, Max Welling, Pietro Perona

Abstract: How many labeled examples are needed to estimate a classifier's performance on a new dataset? We study the case where data is plentiful, but labels are expensive. We show that by making a few reasonable assumptions on the structure of the data, it is possible to estimate performance curves, with confidence bounds, using a small number of ground truth labels. Our approach, which we call Semisupervised Performance Evaluation (SPE), is based on a generative model for the classifier's confidence scores. In addition to estimating the performance of classifiers on new datasets, SPE can be used to recalibrate a classifier by re-estimating the class-conditional confidence distributions.

Submitted 8 October, 2012; originally announced October 2012.

arXiv:1207.4158 [pdf]

On the Choice of Regions for Generalized Belief Propagation

Authors: Max Welling

Abstract: Generalized belief propagation (GBP) has proven to be a promising technique for approximate inference tasks in AI and machine learning. However, the choice of a good set of clusters to be used in GBP has remained more of an art than a science until this day. This paper proposes a sequential approach to adding new clusters of nodes and their interactions (i.e. "regions") to the approximation. We first review and analyze the recently introduced region graphs and find that three kinds of operations ("split", "merge" and "death") leave the free energy and (under some conditions) the fixed points of GBP invariant. This leads to the notion of "weakly irreducible" regions as the natural candidates to be added to the approximation. Computational complexity of the GBP algorithm is controlled by restricting attention to regions with small "region-width". Combining the above with an efficient (i.e. local in the graph) measure to predict the improved accuracy of GBP leads to the sequential "region pursuit" algorithm for adding new regions bottom-up to the region graph. Experiments show that this algorithm can indeed perform close to optimally.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-585-592

arXiv:1207.1426 [pdf]

Structured Region Graphs: Morphing EP into GBP

Authors: Max Welling, Thomas P. Minka, Yee Whye Teh

Abstract: GBP and EP are two successful algorithms for approximate probabilistic inference, which are based on different approximation strategies. An open problem in both algorithms has been how to choose an appropriate approximation structure. We introduce 'structured region graphs', a formalism which marries these two strategies, reveals a deep connection between them, and suggests how to choose good approximation structures. In this formalism, each region has an internal structure which defines an exponential family, whose sufficient statistics must be matched by the parent region. Reduction operators on these structures allow conversion between EP and GBP free energies. Thus it is revealed that all EP approximations on discrete variables are special cases of GBP, and conversely that some well-known GBP approximations, such as overlapping squares, are special cases of EP. Furthermore, region graphs derived from EP have a number of good structural properties, including maxent-normality and overall counting number of one. The result is a convenient framework for producing high-quality approximations with a user-adjustable level of complexity.

Submitted 4 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

Report number: UAI-P-2005-PG-609-614

arXiv:1206.6868 [pdf]

Bayesian Random Fields: The Bethe-Laplace Approximation

Authors: Max Welling, Sridevi Parise

Abstract: While learning the maximum likelihood value of parameters of an undirected graphical model is hard, modelling the posterior distribution over parameters given data is harder. Yet, undirected models are ubiquitous in computer vision and text modelling (e.g. conditional random fields). But where Bayesian approaches for directed models have been very successful, a proper Bayesian treatment of undirected models is still in its infant stages. We propose a new method for approximating the posterior of the parameters given data based on the Laplace approximation. This approximation requires the computation of the covariance matrix over features which we compute using the linear response approximation based in turn on loopy belief propagation. We develop the theory for conditional and 'unconditional' random fields with or without hidden variables. In the conditional setting we introduce a new variant of bagging suitable for structured domains. Here we run the loopy max-product algorithm on a 'super-graph' composed of graphs for individual models sampled from the posterior and connected by constraints. Experiments on real world data validate the proposed methods.

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI2006)

Report number: UAI-P-2006-PG-512-519

arXiv:1206.6845 [pdf]

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

Authors: Ian Porteous, Alexander T. Ihler, Padhraic Smyth, Max Welling

Abstract: Nonparametric Bayesian approaches to clustering, information retrieval, language modeling and object recognition have recently shown great promise as a new paradigm for unsupervised data analysis. Most contributions have focused on the Dirichlet process mixture models or extensions thereof for which efficient Gibbs samplers exist. In this paper we explore Gibbs samplers for infinite complexity mixture models in the stick breaking representation. The advantage of this representation is improved modeling flexibility. For instance, one can design the prior distribution over cluster sizes or couple multiple infinite mixture models (e.g. over time) at the level of their parameters (i.e. the dependent Dirichlet process model). However, Gibbs samplers for infinite mixture models (as recently introduced in the statistics literature) seem to mix poorly over cluster labels. Among other issues, this can have the adverse effect that labels for the same cluster in coupled mixture models are mixed up. We introduce additional moves in these samplers to improve mixing over cluster labels and to bring clusters into correspondence. An application to modeling of storm trajectories is used to illustrate these ideas.

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI2006)

Report number: UAI-P-2006-PG-385-392

arXiv:1206.6380 [pdf]

Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

Authors: Sungjin Ahn, Anoop Korattikara, Max Welling

Abstract: In this paper we address the following question: Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data-items for every sample we generate? An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it will sample from a normal approximation of the posterior, while for slow mixing rates it will mimic the behavior of SGLD with a pre-conditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and as such an efficient optimizer during burn-in.

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

arXiv:1206.3297 [pdf]

Hybrid Variational/Gibbs Collapsed Inference in Topic Models

Authors: Max Welling, Yee Whye Teh, Hilbert Kappen

Abstract: Variational Bayesian inference and (collapsed) Gibbs sampling are the two important classes of inference algorithms for Bayesian networks. Both have their advantages and disadvantages: collapsed Gibbs sampling is unbiased but is also inefficient for large count values and requires averaging over many samples to reduce variance. On the other hand, variational Bayesian inference is efficient and accurate for large count values but suffers from bias for small counts. We propose a hybrid algorithm that combines the best of both worlds: it samples very small counts and applies variational updates to large counts. This hybridization is shown to significantly improve testset perplexity relative to variational inference at no computational cost.

Submitted 13 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Report number: UAI-P-2008-PG-587-594

arXiv:1206.1088 [pdf, other]

Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior

Authors: Yutian Chen, Max Welling

Abstract: In recent years a number of methods have been developed for automatically learning the (sparse) connectivity structure of Markov Random Fields. These methods are mostly based on L1-regularized optimization which has a number of disadvantages such as the inability to assess model uncertainty and expensive cross-validation to find the optimal regularization parameter. Moreover, the model's predictive performance may degrade dramatically with a suboptimal value of the regularization parameter (which is sometimes desirable to induce sparseness). We propose a fully Bayesian approach based on a "spike and slab" prior (similar to L0 regularization) that does not suffer from these shortcomings. We develop an approximate MCMC method combining Langevin dynamics and reversible jump MCMC to conduct inference in this model. Experiments show that the proposed model learns a good combination of the structure and parameter values without the need for separate hyper-parameter tuning. Moreover, the model's predictive performance is much more robust than L1-based methods with hyper-parameter settings that induce highly sparse model structures.

Submitted 22 June, 2012; v1 submitted 5 June, 2012; originally announced June 2012.

Comments: Accepted in the Conference on Uncertainty in Artificial Intelligence (UAI), 2012

arXiv:1205.2662 [pdf]

On Smoothing and Inference for Topic Models

Authors: Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh

Abstract: Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-27-34

arXiv:1205.2605 [pdf]

Herding Dynamic Weights for Partially Observed Random Field Models

Authors: Max Welling

Abstract: Learning the parameters of a (potentially partially observable) random field model is intractable in general. Instead of focussing on a single optimal parameter value we propose to treat parameters as dynamical quantities. We introduce an algorithm to generate complex dynamics for parameters and (both visible and hidden) state vectors. We show that under certain conditions averages computed over trajectories of the proposed dynamical system converge to averages computed over the data. Our "herding dynamics" does not require expensive operations such as exponentiation and is fully deterministic.

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-599-606

arXiv:1203.3472 [pdf]

Super-Samples from Kernel Herding

Authors: Yutian Chen, Max Welling, Alex Smola

Abstract: We extend the herding algorithm to continuous spaces by using the kernel trick. The resulting "kernel herding" algorithm is an infinite memory deterministic process that learns to approximate a PDF with a collection of samples. We show that kernel herding decreases the error of expectations of functions in the Hilbert space at a rate O(1/T) which is much faster than the usual O(1/√T) for iid random samples. We illustrate kernel herding by approximating Bayesian predictive distributions.

Submitted 15 March, 2012; originally announced March 2012.

Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

Report number: UAI-P-2010-PG-109-116

arXiv:cond-mat/0510500 [pdf, ps, other]

doi 10.1140/epjb/e2006-00063-7

Avalanches and Self-Organized Criticality in Superconductors

Authors: Rinke J. Wijngaarden, Marco S. Welling, Christof M. Aegerter, Mariela Menghini

Abstract: We review the use of superconductors as a playground for the experimental study of front roughening and avalanches. Using the magneto-optical technique, the spatial distribution of the vortex density in the sample is monitored as a function of time. The roughness and growth exponents corresponding to the vortex landscape are determined and compared to the exponents that characterize the avalanches in the framework of Self-Organized Criticality. For those situations where a thermo-magnetic instability arises, an analytical non-linear and non-local model is discussed, which is found to be consistent to great detail with the experimental results. On anisotropic substrates, the anisotropy regularizes the avalanches.

Submitted 19 October, 2005; originally announced October 2005.

arXiv:cond-mat/0411260 [pdf, ps, other]

doi 10.1016/j.physa.2004.08.019

Dynamic roughening of the magnetic flux landscape in YBa$_2$Cu$_3$O$_{7-x}$

Authors: C. M. Aegerter, M. S. Welling, R. J. Wijngaarden

Abstract: We study the magnetic flux landscape in YBa$_2$Cu$_3$O$_{7-x}$ thin films as a two dimensional rough surface. The vortex density in the superconductor forms a self-affine structure in both space and time. This is characterized by a roughness exponent $α= 0.76(3)$ and a growth exponent $β= 0.57(6)$. This is due to the structure and distribution of flux avalanches in the self-organized critical state, which is formed in the superconductor. We also discuss our results in the context of other roughening systems in the presence of quenched disorder.

Submitted 10 November, 2004; originally announced November 2004.

Comments: 13 pages, 7 figures, accepted for publication in Physica A

arXiv:cond-mat/0410369 [pdf, ps, other]

doi 10.1103/PhysRevB.71.104515

Self-organized criticality induced by quenched disorder: experiments on flux avalanches in NbH$_x$ films

Authors: M. S. Welling, C. M. Aegerter, R. J. Wijngaarden

Abstract: We present an experimental study of the influence of quenched disorder on the distribution of flux avalanches in type-II superconductors. In the presence of much quenched disorder, the avalanche sizes are power-law distributed and show finite size scaling, as expected from self-organized criticality (SOC). Furthermore, the shape of the avalanches is observed to be fractal. In the absence of quenched disorder, a preferred size of avalanches is observed and avalanches are smooth. These observations indicate that a certain minimum amount of disorder is necessary for SOC behavior. We relate these findings to the appearance or non-appearance of SOC in other experimental systems, particularly piles of sand.

Submitted 14 October, 2004; originally announced October 2004.

Comments: 4 pages, 4 figures

arXiv:cond-mat/0407490 [pdf, ps, other]

doi 10.1103/PhysRevLett.94.037002

Dendritic flux avalanches and nonlocal electrodynamics in thin superconducting films

Authors: Igor S. Aranson, Alex Gurevich, Marco S. Welling, Rinke J. Wijngaarden, Vitalii K. Vlasko-Vlasov, Valerii M. Vinokur, Ulrich Welp

Abstract: We present numerical and analytical studies of coupled nonlinear Maxwell and thermal diffusion equations which describe nonisothermal dendritic flux penetration in superconducting films. We show that spontaneous branching of propagating flux filaments occurs due to nonlocal magnetic flux diffusion and positive feedback between flux motion and Joule heat generation. The branching is triggered by a thermomagnetic edge instability which causes stratification of the critical state. The resulting distribution of magnetic microavalanches depends on a spatial distribution of defects. Our results are in good agreement with experiments performed on Nb films.

Submitted 19 July, 2004; originally announced July 2004.

Comments: 4 pages, 3 figures, see http://mti.msd.anl.gov/aran_h1.htm for extensive collection of movies of dendritic flux and temperature patterns

Journal ref: Phys. Rev. Lett. 94, 037002 (2005)

arXiv:cond-mat/0305591 [pdf, ps, other]

Self-organized criticality in the Bean state in YBa$_2$Cu$_3$O$_{7-x}$ thin films

Authors: C. M. Aegerter, M. S. Welling, R. J. Wijngaarden

Abstract: The penetration of magnetic flux into a thin film of YBa$_2$Cu$_3$O$_{7-x}$ is studied when the external field is ramped slowly. In this case the flux penetrates in bursts or avalanches. The size of these avalanches is distributed according to a power law with an exponent of $τ$ = 1.29(2). The additional observation of finite-size scaling of the avalanche distributions, with an avalanche dimension D = 1.89(3), gives strong indications towards self-organized criticality in this system. Furthermore we determine exponents governing the roughening dynamics of the flux surface using some universal scaling relations. These exponents are compared to those obtained from a standard roughening analysis.

Submitted 26 May, 2003; originally announced May 2003.

Comments: 4 pages, 4 figures, submitted to PRL

arXiv:gr-qc/9708054 [pdf, ps, other]

doi 10.1088/0264-9381/15/10/008

Quantum Mechanics of a Point Particle in 2+1 Dimensional Gravity

Authors: Hans-Juergen Matschull, Max Welling

Abstract: We study the phase space structure and the quantization of a pointlike particle in 2+1 dimensional gravity. By adding boundary terms to the first order Einstein Hilbert action, and removing all redundant gauge degrees of freedom, we arrive at a reduced action for a gravitating particle in 2+1 dimensions, which is invariant under Lorentz transformations and a group of generalized translations. The momentum space of the particle turns out to be the group manifold SL(2). Its position coordinates have non-vanishing Poisson brackets, resulting in a non-commutative quantum spacetime. We use the representation theory of SL(2) to investigate its structure. We find a discretization of time, and some semi-discrete structure of space. An uncertainty relation forbids a fully localized particle. The quantum dynamics is described by a discretized Klein Gordon equation.

Submitted 21 April, 1998; v1 submitted 22 August, 1997; originally announced August 1997.

Comments: 58 pages, 3 eps figures, presentation of the classical theory improved

Report number: THU-97/22

Journal ref: Class.Quant.Grav. 15 (1998) 2981-3030

Showing 151–200 of 208 results for author: Welling, M