-
DP-EM: Differentially Private Expectation Maximization
Authors:
Mijung Park,
Jimmy Foulds,
Kamalika Chaudhuri,
Max Welling
Abstract:
The iterative nature of the expectation maximization (EM) algorithm presents a challenge for privacy-preserving estimation, as each iteration increases the amount of noise needed. We propose a practical private EM algorithm that overcomes this challenge using two innovations: (1) a novel moment perturbation formulation for differentially private EM (DP-EM), and (2) the use of two recently developed composition methods to bound the privacy "cost" of multiple EM iterations: the moments accountant (MA) and zero-mean concentrated differential privacy (zCDP). Both MA and zCDP bound the moment generating function of the privacy loss random variable and achieve a refined tail bound, which effectively decreases the amount of additive noise. We present empirical results showing the benefits of our approach, as well as similar performance between these two composition methods in the DP-EM setting for Gaussian mixture models. Our approach can be readily extended to many iterative learning algorithms, opening up various exciting future directions.
Submitted 31 October, 2016; v1 submitted 23 May, 2016;
originally announced May 2016.
-
On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis
Authors:
James Foulds,
Joseph Geumlek,
Max Welling,
Kamalika Chaudhuri
Abstract:
Bayesian inference has great promise for the privacy-preserving analysis of sensitive data, as posterior sampling automatically preserves differential privacy, an algorithmic notion of data privacy, under certain conditions (Dimitrakakis et al., 2014; Wang et al., 2015). While this one posterior sample (OPS) approach elegantly provides privacy "for free," it is data inefficient in the sense of asymptotic relative efficiency (ARE). We show that a simple alternative based on the Laplace mechanism, the workhorse of differential privacy, is as asymptotically efficient as non-private posterior inference, under general assumptions. This technique also has practical advantages including efficient use of the privacy budget for MCMC. We demonstrate the practicality of our approach on a time-series analysis of sensitive military records from the Afghanistan and Iraq wars disclosed by the Wikileaks organization.
Submitted 8 June, 2016; v1 submitted 23 March, 2016;
originally announced March 2016.
-
Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors
Authors:
Christos Louizos,
Max Welling
Abstract:
We introduce a variational Bayesian neural network where the parameters are governed via a probability distribution on random matrices. Specifically, we employ a matrix variate Gaussian \cite{gupta1999matrix} parameter posterior distribution where we explicitly model the covariance among the input and output dimensions of each layer. Furthermore, with approximate covariance matrices we can achieve a more efficient way to represent those correlations that is also cheaper than fully factorized parameter posteriors. We further show that with the "local reparametrization trick" \cite{kingma2015variational} on this posterior distribution we arrive at a Gaussian Process \cite{rasmussen2006gaussian} interpretation of the hidden units in each layer and we, similarly to \cite{gal2015dropout}, provide connections with deep Gaussian processes. We continue in taking advantage of this duality and incorporate "pseudo-data" \cite{snelson2005sparse} in our model, which in turn allows for more efficient sampling while maintaining the properties of the original model. The validity of the proposed approach is verified through extensive experiments.
Submitted 23 June, 2016; v1 submitted 15 March, 2016;
originally announced March 2016.
-
A New Method to Visualize Deep Neural Networks
Authors:
Luisa M. Zintgraf,
Taco S. Cohen,
Max Welling
Abstract:
We present a method for visualising the response of a deep neural network to a specific input. For image data, for instance, our method will highlight areas that provide evidence in favor of, and against, choosing a certain class. The method overcomes several shortcomings of previous methods and provides great additional insight into the decision making process of convolutional networks, which is important both to improve models and to accelerate the adoption of such methods in e.g. medicine. In experiments on ImageNet data, we illustrate how the method works and can be applied in different ways to understand deep neural nets.
Submitted 12 June, 2017; v1 submitted 8 March, 2016;
originally announced March 2016.
-
Deep Spiking Networks
Authors:
Peter O'Connor,
Max Welling
Abstract:
We introduce an algorithm to do backpropagation on a spiking network. Our network is "spiking" in the sense that our neurons accumulate their activation into a potential over time, and only send out a signal (a "spike") when this potential crosses a threshold and the neuron is reset. Neurons only update their states when receiving signals from other neurons. Total computation of the network thus scales with the number of spikes caused by an input rather than network size. We show that the spiking Multi-Layer Perceptron behaves identically, during both prediction and training, to a conventional deep network of rectified-linear units, in the limiting case where we run the spiking network for a long time. We apply this architecture to a conventional classification problem (MNIST) and achieve performance very close to that of a conventional Multi-Layer Perceptron with the same architecture. Our network is a natural architecture for learning based on streaming event-based data, and is a stepping stone towards using spiking neural networks to learn efficiently on streaming data.
Submitted 7 November, 2016; v1 submitted 26 February, 2016;
originally announced February 2016.
-
Group Equivariant Convolutional Networks
Authors:
Taco S. Cohen,
Max Welling
Abstract:
We introduce Group equivariant Convolutional Neural Networks (G-CNNs), a natural generalization of convolutional neural networks that reduces sample complexity by exploiting symmetries. G-CNNs use G-convolutions, a new type of layer that enjoys a substantially higher degree of weight sharing than regular convolution layers. G-convolutions increase the expressive capacity of the network without increasing the number of parameters. Group convolution layers are easy to use and can be implemented with negligible computational overhead for discrete groups generated by translations, reflections and rotations. G-CNNs achieve state of the art results on CIFAR10 and rotated MNIST.
Submitted 3 June, 2016; v1 submitted 24 February, 2016;
originally announced February 2016.
-
Herding as a Learning System with Edge-of-Chaos Dynamics
Authors:
Yutian Chen,
Max Welling
Abstract:
Herding defines a deterministic dynamical system at the edge of chaos. It generates a sequence of model states and parameters by alternating parameter perturbations with state maximizations, where the sequence of states can be interpreted as "samples" from an associated MRF model. Herding differs from maximum likelihood estimation in that the sequence of parameters does not converge to a fixed point and differs from an MCMC posterior sampling approach in that the sequence of states is generated deterministically. Herding may be interpreted as a "perturb and map" method where the parameter perturbations are generated using a deterministic nonlinear dynamical system rather than randomly from a Gumbel distribution. This chapter studies the distinct statistical characteristics of the herding algorithm and shows that the fast convergence rate of the controlled moments may be attributed to edge of chaos dynamics. The herding algorithm can also be generalized to models with latent variables and to a discriminative learning setting. The perceptron cycling theorem ensures that the fast moment matching property is preserved in the more general framework.
Submitted 1 March, 2016; v1 submitted 9 February, 2016;
originally announced February 2016.
-
The Variational Fair Autoencoder
Authors:
Christos Louizos,
Kevin Swersky,
Yujia Li,
Max Welling,
Richard Zemel
Abstract:
We investigate the problem of learning representations that are invariant to certain nuisance or sensitive factors of variation in the data while retaining as much of the remaining information as possible. Our model is based on a variational autoencoding architecture with priors that encourage independence between sensitive and latent factors of variation. Any subsequent processing, such as classification, can then be performed on this purged latent representation. To remove any remaining dependencies we incorporate an additional penalty term based on the "Maximum Mean Discrepancy" (MMD) measure. We discuss how these architectures can be efficiently trained on data and show in experiments that this method is more effective than previous work in removing unwanted sources of variation while maintaining informative latent representations.
Submitted 9 August, 2017; v1 submitted 3 November, 2015;
originally announced November 2015.
-
Scalable MCMC for Mixed Membership Stochastic Blockmodels
Authors:
Wenzhe Li,
Sungjin Ahn,
Max Welling
Abstract:
We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in mixed-membership stochastic blockmodels (MMSB). Our algorithm is based on the stochastic gradient Riemannian Langevin sampler and achieves both faster speed and higher accuracy at every iteration than the current state-of-the-art algorithm based on stochastic variational inference. In addition we develop an approximation that can handle models that entertain a very large number of communities. The experimental results show that SG-MCMC strictly dominates competing algorithms in all cases.
Submitted 21 October, 2015; v1 submitted 16 October, 2015;
originally announced October 2015.
-
Bayesian Dark Knowledge
Authors:
Anoop Korattikara,
Vivek Rathod,
Kevin Murphy,
Max Welling
Abstract:
We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities, e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time).
We describe a method for "distilling" a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [Hernandez-Lobato and Adams, 2015] and an approach based on variational Bayes [Blundell et al., 2015]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.
Submitted 6 November, 2015; v1 submitted 14 June, 2015;
originally announced June 2015.
-
Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference
Authors:
Edward Meeds,
Max Welling
Abstract:
We describe an embarrassingly parallel, anytime Monte Carlo method for likelihood-free models. The algorithm starts with the view that the stochasticity of the pseudo-samples generated by the simulator can be controlled externally by a vector of random numbers u, in such a way that the outcome, knowing u, is deterministic. For each instantiation of u we run an optimization procedure to minimize the distance between summary statistics of the simulator and the data. After reweighting these samples using the prior and the Jacobian (accounting for the change of volume in transforming from the space of summary statistics to the space of parameters) we show that this weighted ensemble represents a Monte Carlo estimate of the posterior distribution. The procedure can be run embarrassingly parallel (each node handling one sample) and anytime (by allocating resources to the worst performing sample). The procedure is validated on six experiments.
Submitted 2 December, 2015; v1 submitted 11 June, 2015;
originally announced June 2015.
-
Variational Dropout and the Local Reparameterization Trick
Authors:
Diederik P. Kingma,
Tim Salimans,
Max Welling
Abstract:
We investigate a local reparameterization technique for greatly reducing the variance of stochastic gradients for variational Bayesian inference (SGVB) of a posterior over model parameters, while retaining parallelizability. This local reparameterization translates uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such parameterizations can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. Additionally, we explore a connection with dropout: Gaussian dropout objectives correspond to SGVB with local reparameterization, a scale-invariant prior and proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose variational dropout, a generalization of Gaussian dropout where the dropout rates are learned, often leading to better models. The method is demonstrated through several experiments.
Submitted 20 December, 2015; v1 submitted 8 June, 2015;
originally announced June 2015.
-
Harmonic Exponential Families on Manifolds
Authors:
Taco S. Cohen,
Max Welling
Abstract:
In a range of fields including the geosciences, molecular biology, robotics and computer vision, one encounters problems that involve random variables on manifolds. Currently, there is a lack of flexible probabilistic models on manifolds that are fast and easy to train. We define an extremely flexible class of exponential family distributions on manifolds such as the torus, sphere, and rotation groups, and show that for these distributions the gradient of the log-likelihood can be computed efficiently using a non-commutative generalization of the Fast Fourier Transform (FFT). We discuss applications to Bayesian camera motion estimation (where harmonic exponential families serve as conjugate priors), and modelling of the spatial distribution of earthquakes on the surface of the earth. Our experimental results show that harmonic densities yield a significantly higher likelihood than the best competing method, while being orders of magnitude faster to train.
Submitted 20 May, 2015; v1 submitted 17 May, 2015;
originally announced May 2015.
-
Hamiltonian ABC
Authors:
Edward Meeds,
Robert Leenders,
Max Welling
Abstract:
Approximate Bayesian computation (ABC) is a powerful and elegant framework for performing inference in simulation-based models. However, due to the difficulty in scaling likelihood estimates, ABC remains useful for relatively low-dimensional problems. We introduce Hamiltonian ABC (HABC), a set of likelihood-free algorithms that apply recent advances in scaling Bayesian learning using Hamiltonian Monte Carlo (HMC) and stochastic gradients. We find that a small number of forward simulations can effectively approximate the ABC gradient, allowing Hamiltonian dynamics to efficiently traverse parameter spaces. We also describe a new simple yet general approach of incorporating random seeds into the state of the Markov chain, further reducing the random walk behavior of HABC. We demonstrate HABC on several typical ABC problems, and show that HABC samples comparably to regular Bayesian inference using true gradients on a high-dimensional problem from machine learning.
Submitted 6 March, 2015;
originally announced March 2015.
-
Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC
Authors:
Sungjin Ahn,
Anoop Korattikara,
Nathan Liu,
Suju Rajan,
Max Welling
Abstract:
Despite having various attractive qualities such as high prediction accuracy and the ability to quantify uncertainty and avoid over-fitting, Bayesian Matrix Factorization has not been widely adopted because of the prohibitive cost of inference. In this paper, we propose a scalable distributed Bayesian matrix factorization algorithm using stochastic gradient MCMC. Our algorithm, based on Distributed Stochastic Gradient Langevin Dynamics, can not only match the prediction accuracy of standard MCMC methods like Gibbs sampling, but at the same time is as fast and simple as stochastic gradient descent. In our experiments, we show that our algorithm can achieve the same level of prediction accuracy as Gibbs sampling an order of magnitude faster. We also show that our method reduces the prediction error as fast as distributed stochastic gradient descent, achieving a 4.1% improvement in RMSE for the Netflix dataset and a 1.8% improvement for the Yahoo music dataset.
Submitted 9 March, 2015; v1 submitted 5 March, 2015;
originally announced March 2015.
-
Transformation Properties of Learned Visual Representations
Authors:
Taco S. Cohen,
Max Welling
Abstract:
When a three-dimensional object moves relative to an observer, a change occurs on the observer's image plane and in the visual representation computed by a learned model. Starting with the idea that a good visual representation is one that transforms linearly under scene motions, we show, using the theory of group representations, that any such representation is equivalent to a combination of the elementary irreducible representations. We derive a striking relationship between irreducibility and the statistical dependency structure of the representation, by showing that under restricted conditions, irreducible representations are decorrelated. Under partial observability, as induced by the perspective projection of a scene onto the image plane, the motion group does not have a linear action on the space of images, so that it becomes necessary to perform inference over a latent representation that does transform linearly. This idea is demonstrated in a model of rotating NORB objects that employs a latent representation of the non-commutative 3D rotation group SO(3).
Submitted 7 April, 2015; v1 submitted 24 December, 2014;
originally announced December 2014.
-
POPE: Post Optimization Posterior Evaluation of Likelihood Free Models
Authors:
Edward Meeds,
Michael Chiang,
Mary Lee,
Olivier Cinquin,
John Lowengrub,
Max Welling
Abstract:
In many domains, scientists build complex simulators of natural phenomena that encode their hypotheses about the underlying processes. These simulators can be deterministic or stochastic, fast or slow, constrained or unconstrained, and so on. Optimizing the simulators with respect to a set of parameter values is common practice, resulting in a single parameter setting that minimizes an objective subject to constraints. We propose a post optimization posterior analysis that computes and visualizes all the models that can generate equally good or better simulation results, subject to constraints. These optimization posteriors are desirable for a number of reasons, among them easy interpretability, automatic parameter sensitivity and correlation analysis, and posterior predictive analysis. We develop a new sampling framework based on approximate Bayesian computation (ABC) with one-sided kernels. In collaboration with two groups of scientists we applied POPE to two important biological simulators: a fast and stochastic simulator of stem-cell cycling and a slow and deterministic simulator of tumor growth patterns.
Submitted 9 December, 2014;
originally announced December 2014.
-
MLitB: Machine Learning in the Browser
Authors:
Edward Meeds,
Remco Hendriks,
Said Al Faraby,
Magiel Bruntink,
Max Welling
Abstract:
With few exceptions, the field of Machine Learning (ML) research has largely ignored the browser as a computational engine. Beyond an educational resource for ML, the browser has vast potential to not only improve the state-of-the-art in ML research, but also, inexpensively and on a massive scale, to bring sophisticated ML learning and prediction to the public at large. This paper introduces MLitB, a prototype ML framework written entirely in JavaScript, capable of performing large-scale distributed computing with heterogeneous classes of devices. The development of MLitB has been driven by several underlying objectives whose aim is to make ML learning and usage ubiquitous (by using ubiquitous compute devices), cheap and effortlessly distributed, and collaborative. This is achieved by allowing every internet capable device to run training algorithms and predictive models with no software installation and by saving models in universally readable formats. Our prototype library is capable of training deep neural networks with synchronized, distributed stochastic gradient descent. MLitB offers several important opportunities for novel ML research, including: development of distributed learning algorithms, advancement of web GPU algorithms, novel field and mobile applications, privacy preserving computing, and green grid-computing. MLitB is available as open source software.
Submitted 17 June, 2015; v1 submitted 7 December, 2014;
originally announced December 2014.
-
Markov Chain Monte Carlo and Variational Inference: Bridging the Gap
Authors:
Tim Salimans,
Diederik P. Kingma,
Max Welling
Abstract:
Recent advances in stochastic gradient variational inference have made it possible to perform variational Bayesian inference with posterior approximations containing auxiliary random variables. This enables us to explore a new synthesis of variational inference and Monte Carlo methods where we incorporate one or more steps of MCMC into our variational approximation. By doing so we obtain a rich class of inference algorithms bridging the gap between variational methods and MCMC, and offering the best of both worlds: fast posterior approximation through the maximization of an explicit objective, with the option of trading off additional computation for additional accuracy. We describe the theoretical foundations that make this possible and show some promising first results.
Submitted 19 May, 2015; v1 submitted 23 October, 2014;
originally announced October 2014.
-
Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior
Authors:
Yutian Chen,
Max Welling
Abstract:
In recent years a number of methods have been developed for automatically learning the (sparse) connectivity structure of Markov Random Fields. These methods are mostly based on L1-regularized optimization which has a number of disadvantages such as the inability to assess model uncertainty and expensive crossvalidation to find the optimal regularization parameter. Moreover, the model's predictive performance may degrade dramatically with a suboptimal value of the regularization parameter (which is sometimes desirable to induce sparseness). We propose a fully Bayesian approach based on a "spike and slab" prior (similar to L0 regularization) that does not suffer from these shortcomings. We develop an approximate MCMC method combining Langevin dynamics and reversible jump MCMC to conduct inference in this model. Experiments show that the proposed model learns a good combination of the structure and parameter values without the need for separate hyper-parameter tuning. Moreover, the model's predictive performance is much more robust than L1-based methods with hyper-parameter settings that induce highly sparse model structures.
Submitted 9 August, 2014;
originally announced August 2014.
-
Semi-Supervised Learning with Deep Generative Models
Authors:
Diederik P. Kingma,
Danilo J. Rezende,
Shakir Mohamed,
Max Welling
Abstract:
The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.
Submitted 31 October, 2014; v1 submitted 20 June, 2014;
originally announced June 2014.
-
Exploiting the Statistics of Learning and Inference
Authors:
Max Welling
Abstract:
When dealing with datasets containing a billion instances or with simulations that require a supercomputer to execute, computational resources become part of the equation. We can improve the efficiency of learning and inference by exploiting their inherent statistical nature. We propose algorithms that exploit the redundancy of data relative to a model by subsampling data-cases for every update and reasoning about the uncertainty created in this process. In the context of learning we propose to test for the probability that a stochastically estimated gradient points more than 180 degrees in the wrong direction. In the context of MCMC sampling we use stochastic gradients to improve the efficiency of MCMC updates, and hypothesis tests based on adaptive mini-batches to decide whether to accept or reject a proposed parameter update. Finally, we argue that in the context of likelihood free MCMC one needs to store all the information revealed by all simulations, for instance in a Gaussian process. We conclude that Bayesian methods will continue to play a crucial role in the era of big data and big simulations, but only if we overcome a number of computational challenges.
Submitted 4 March, 2014; v1 submitted 26 February, 2014;
originally announced February 2014.
-
Learning the Irreducible Representations of Commutative Lie Groups
Authors:
Taco Cohen,
Max Welling
Abstract:
We present a new probabilistic model of compact commutative Lie groups that produces invariant-equivariant and disentangled representations of data. To define the notion of disentangling, we borrow a fundamental principle from physics that is used to derive the elementary particles of a system from its symmetries. Our model employs a newfound Bayesian conjugacy relation that enables fully tractable probabilistic inference over compact commutative Lie groups -- a class that includes the groups that describe the rotation and cyclic translation of images. We train the model on pairs of transformed image patches, and show that the learned invariant representation is highly effective for classification.
Submitted 25 May, 2014; v1 submitted 18 February, 2014;
originally announced February 2014.
-
Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets
Authors:
Diederik P. Kingma,
Max Welling
Abstract:
Hierarchical Bayesian networks and neural networks with stochastic hidden units are commonly perceived as two separate types of models. We show that either of these types of models can often be transformed into an instance of the other, by switching between centered and differentiable non-centered parameterizations of the latent variables. The choice of parameterization greatly influences the efficiency of gradient-based posterior inference; we show that they are often complementary to each other, we clarify when each parameterization is preferred and show how inference can be made robust. In the non-centered form, a simple Monte Carlo estimator of the marginal likelihood can be used for learning the parameters. Theoretical results are supported by experiments.
Submitted 22 January, 2015; v1 submitted 3 February, 2014;
originally announced February 2014.
-
GPS-ABC: Gaussian Process Surrogate Approximate Bayesian Computation
Authors:
Edward Meeds,
Max Welling
Abstract:
Scientists often express their understanding of the world through a computationally demanding simulation program. Analyzing the posterior distribution of the parameters given observations (the inverse problem) can be extremely challenging. The Approximate Bayesian Computation (ABC) framework is the standard statistical tool to handle these likelihood free problems, but it requires a very large number of simulations. In this work we develop two new ABC sampling algorithms that significantly reduce the number of simulations necessary for posterior inference. Both algorithms use confidence estimates for the accept probability in the Metropolis Hastings step to adaptively choose the number of necessary simulations. Our GPS-ABC algorithm stores the information obtained from every simulation in a Gaussian process which acts as a surrogate function for the simulated statistics. Experiments on a challenging realistic biological problem illustrate the potential of these algorithms.
Submitted 13 January, 2014;
originally announced January 2014.
-
Auto-Encoding Variational Bayes
Authors:
Diederik P Kingma,
Max Welling
Abstract:
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
Submitted 10 December, 2022; v1 submitted 20 December, 2013;
originally announced December 2013.
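As an illustration of the reparameterization idea named in the abstract above, here is a minimal NumPy sketch. It is not the authors' implementation: the test function f(z) = z^2, the Gaussian form of the sampling distribution, and the name `reparam_grad` are all assumptions chosen so that the result can be checked against the closed form E[z^2] = mu^2 + sigma^2.

```python
import numpy as np

def reparam_grad(mu, log_sigma, n_samples=100_000, seed=0):
    """Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu and log_sigma.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)   # noise that does not depend on the parameters
    z = mu + sigma * eps                   # reparameterized sample: z is a deterministic function of (mu, sigma, eps)
    df_dz = 2.0 * z                        # derivative of f(z) = z^2
    grad_mu = np.mean(df_dz)               # chain rule with dz/dmu = 1
    grad_log_sigma = np.mean(df_dz * sigma * eps)  # chain rule with dz/dlog_sigma = sigma * eps
    return grad_mu, grad_log_sigma

# Closed-form gradients for comparison: d/dmu = 2*mu, d/dlog_sigma = 2*sigma^2.
g_mu, g_log_sigma = reparam_grad(mu=1.5, log_sigma=0.0)
print(g_mu, g_log_sigma)
```

Because the randomness is moved into `eps`, the sample `z` is differentiable in the parameters, which is what lets a standard stochastic gradient method be applied to the estimator.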
-
Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation
Authors:
James Foulds,
Levi Boyles,
Christopher Dubois,
Padhraic Smyth,
Max Welling
Abstract:
In the internet era there has been an explosion in the amount of digital text information available, leading to difficulties of scale for traditional inference algorithms for topic models. Recent advances in stochastic variational inference algorithms for latent Dirichlet allocation (LDA) have made it feasible to learn topic models on large-scale corpora, but these methods do not currently take full advantage of the collapsed representation of the model. We propose a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state of the art method. We show connections between collapsed variational Bayesian inference and MAP estimation for LDA, and leverage these connections to prove convergence properties of the proposed algorithm. In experiments on large-scale text corpora, the algorithm was found to converge faster and often to a better solution than the previous method. Human-subject experiments also demonstrated that the method can learn coherent topics in seconds on small corpora, facilitating the use of topic models in interactive document analysis software.
Submitted 10 May, 2013;
originally announced May 2013.
-
Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget
Authors:
Anoop Korattikara,
Yutian Chen,
Max Welling
Abstract:
Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.
Submitted 14 February, 2014; v1 submitted 18 April, 2013;
originally announced April 2013.
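The sequential-test idea in the abstract above can be sketched as follows. The exact MH rule accepts when the mean log-likelihood difference between the proposed and current parameters exceeds a threshold `mu0` that collects the uniform draw and the prior and proposal ratios, scaled by 1/N. This sketch makes several assumptions that are not taken from the abstract: the function name, the defaults, a normal approximation to the test statistic, and precomputed differences (a real implementation would evaluate them lazily, since avoiding that computation is the point).

```python
import math
import numpy as np

def approx_mh_decision(loglik_diffs, mu0, eps=0.05, batch_size=100, seed=0):
    """Sequentially test whether mean(loglik_diffs) > mu0.

    Returns (accept, n_used). Illustrative only; assumes batch_size >= 2.
    """
    diffs = np.asarray(loglik_diffs, dtype=float)
    rng = np.random.default_rng(seed)
    n_total = diffs.size
    order = rng.permutation(n_total)       # visit datapoints without replacement
    n = 0
    while True:
        n = min(n + batch_size, n_total)
        seen = diffs[order[:n]]
        mean = seen.mean()
        if n == n_total:
            # All data used: this coincides with the exact MH decision.
            return bool(mean > mu0), n
        # Standard error of the mean with a finite-population correction.
        se = seen.std(ddof=1) / math.sqrt(n) * math.sqrt(1.0 - (n - 1) / (n_total - 1))
        if se == 0.0:
            continue                       # no spread observed yet; look at more data
        z = abs(mean - mu0) / se
        # Normal-approximation probability that the observed sign is wrong.
        delta = 0.5 * math.erfc(z / math.sqrt(2.0))
        if delta < eps:
            return bool(mean > mu0), n

# Usage: differences that clearly favor the proposal are decided early.
demo_rng = np.random.default_rng(1)
demo_diffs = demo_rng.normal(0.5, 1.0, size=10_000)
print(approx_mh_decision(demo_diffs, mu0=0.0))
```

The tolerance `eps` controls the trade-off described in the abstract: a larger value decides on less data at the price of more bias, and `eps=0` falls back to using every datapoint.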
-
Herded Gibbs Sampling
Authors:
Luke Bornn,
Yutian Chen,
Nando de Freitas,
Mareija Eskelin,
Jing Fang,
Max Welling
Abstract:
The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this paper, we introduce a herding variant of this algorithm, called herded Gibbs, that is entirely deterministic. We prove that herded Gibbs has an $O(1/T)$ convergence rate for models with independent variables and for fully connected probabilistic graphical models. Herded Gibbs is shown to outperform Gibbs in the tasks of image denoising with MRFs and named entity recognition with CRFs. However, the convergence for herded Gibbs for sparsely connected probabilistic graphical models is still an open problem.
Submitted 15 March, 2013; v1 submitted 17 January, 2013;
originally announced January 2013.
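To make the "entirely deterministic" claim and the $O(1/T)$ rate concrete, here is a minimal sketch of herding for a single binary variable with target probability `p`, the simplest instance of the independent-variable case. It shows the basic herding update only, not the full herded Gibbs algorithm; the function name and the zero initial weight are assumptions.

```python
def herd_bernoulli(p, n_steps):
    """Deterministically generate a 0/1 sequence whose running mean tracks p."""
    w = 0.0                       # herding weight; the starting value is an arbitrary choice
    samples = []
    for _ in range(n_steps):
        x = 1 if w > 0.0 else 0   # pick the state the weight currently favors
        w += p - x                # shift the weight by the mismatch between target and sample
        samples.append(x)
    return samples

seq = herd_bernoulli(0.3, 1000)
print(sum(seq) / len(seq))
```

The weight stays bounded, and the sum of the mismatches `p - x` equals the change in the weight, so the error of the running mean after T steps is at most of order 1/T. No random numbers are drawn at any point.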
-
Belief Optimization for Binary Networks: A Stable Alternative to Loopy Belief Propagation
Authors:
Max Welling,
Yee Whye Teh
Abstract:
We present a novel inference algorithm for arbitrary, binary, undirected graphs. Unlike loopy belief propagation, which iterates fixed point equations, we directly descend on the Bethe free energy. The algorithm consists of two phases: first we update the pairwise probabilities, given the marginal probabilities at each unit, using an analytic expression. Next, we update the marginal probabilities, given the pairwise probabilities by following the negative gradient of the Bethe free energy. Both steps are guaranteed to decrease the Bethe free energy, and since it is lower bounded, the algorithm is guaranteed to converge to a local minimum. We also show that the Bethe free energy is equal to the TAP free energy up to second order in the weights. In experiments we confirm that when belief propagation converges it usually finds the same solutions as our belief optimization method. However, in cases where belief propagation fails to converge, belief optimization continues to converge to reasonable beliefs. The stable nature of belief optimization makes it ideally suited for learning graphical models from data.
Submitted 10 January, 2013;
originally announced January 2013.
-
Efficient Parametric Projection Pursuit Density Estimation
Authors:
Max Welling,
Richard S. Zemel,
Geoffrey E. Hinton
Abstract:
Product models of low dimensional experts are a powerful way to avoid the curse of dimensionality. We present the "under-complete product of experts" (UPoE), where each expert models a one dimensional projection of the data. The UPoE is fully tractable and may be interpreted as a parametric probabilistic model for projection pursuit. Its ML learning rules are identical to the approximate learning rules proposed before for under-complete ICA. We also derive an efficient sequential learning algorithm and discuss its relationship to projection pursuit density estimation and feature induction algorithms for additive random field models.
Submitted 19 October, 2012;
originally announced December 2012.
-
A Cluster-Cumulant Expansion at the Fixed Points of Belief Propagation
Authors:
Max Welling,
Andrew E. Gelfand,
Alexander T. Ihler
Abstract:
We introduce a new cluster-cumulant expansion (CCE) based on the fixed points of iterative belief propagation (IBP). This expansion is similar in spirit to the loop-series (LS) recently introduced in [1]. However, in contrast to the latter, the CCE enjoys the following important qualities: 1) it is defined for arbitrary state spaces, 2) it is easily extended to fixed points of generalized belief propagation (GBP), 3) disconnected groups of variables will not contribute to the CCE and 4) the accuracy of the expansion empirically improves upon that of the LS. The CCE is based on the same Möbius transform as the Kikuchi approximation, but unlike GBP does not require storing the beliefs of the GBP-clusters nor does it suffer from convergence issues during belief updating.
Submitted 16 October, 2012;
originally announced October 2012.
-
Generalized Belief Propagation on Tree Robust Structured Region Graphs
Authors:
Andrew E. Gelfand,
Max Welling
Abstract:
This paper provides some new guidance in the construction of region graphs for Generalized Belief Propagation (GBP). We connect the problem of choosing the outer regions of a Loop-Structured Region Graph (SRG) to that of finding a fundamental cycle basis of the corresponding Markov network. We also define a new class of tree-robust Loop-SRG for which GBP on any induced (spanning) tree of the Markov network, obtained by setting to zero the off-tree interactions, is exact. This class of SRG is then mapped to an equivalent class of tree-robust cycle bases on the Markov network. We show that a tree-robust cycle basis can be identified by proving that for every subset of cycles, the graph obtained from the edges that participate in a single cycle only, is multiply connected. Using this we identify two classes of tree-robust cycle bases: planar cycle bases and "star" cycle bases. In experiments we show that tree-robustness can be successfully exploited as a design principle to improve the accuracy and convergence of GBP.
Submitted 16 October, 2012;
originally announced October 2012.
-
Semisupervised Classifier Evaluation and Recalibration
Authors:
Peter Welinder,
Max Welling,
Pietro Perona
Abstract:
How many labeled examples are needed to estimate a classifier's performance on a new dataset? We study the case where data is plentiful, but labels are expensive. We show that by making a few reasonable assumptions on the structure of the data, it is possible to estimate performance curves, with confidence bounds, using a small number of ground truth labels. Our approach, which we call Semisupervised Performance Evaluation (SPE), is based on a generative model for the classifier's confidence scores. In addition to estimating the performance of classifiers on new datasets, SPE can be used to recalibrate a classifier by re-estimating the class-conditional confidence distributions.
Submitted 8 October, 2012;
originally announced October 2012.
-
On the Choice of Regions for Generalized Belief Propagation
Authors:
Max Welling
Abstract:
Generalized belief propagation (GBP) has proven to be a promising technique for approximate inference tasks in AI and machine learning. However, the choice of a good set of clusters to be used in GBP has remained more of an art than a science until this day. This paper proposes a sequential approach to adding new clusters of nodes and their interactions (i.e. "regions") to the approximation. We first review and analyze the recently introduced region graphs and find that three kinds of operations ("split", "merge" and "death") leave the free energy and (under some conditions) the fixed points of GBP invariant. This leads to the notion of "weakly irreducible" regions as the natural candidates to be added to the approximation. Computational complexity of the GBP algorithm is controlled by restricting attention to regions with small "region-width". Combining the above with an efficient (i.e. local in the graph) measure to predict the improved accuracy of GBP leads to the sequential "region pursuit" algorithm for adding new regions bottom-up to the region graph. Experiments show that this algorithm can indeed perform close to optimally.
Submitted 11 July, 2012;
originally announced July 2012.
-
Structured Region Graphs: Morphing EP into GBP
Authors:
Max Welling,
Thomas P. Minka,
Yee Whye Teh
Abstract:
GBP and EP are two successful algorithms for approximate probabilistic inference, which are based on different approximation strategies. An open problem in both algorithms has been how to choose an appropriate approximation structure. We introduce 'structured region graphs', a formalism which marries these two strategies, reveals a deep connection between them, and suggests how to choose good approximation structures. In this formalism, each region has an internal structure which defines an exponential family, whose sufficient statistics must be matched by the parent region. Reduction operators on these structures allow conversion between EP and GBP free energies. Thus it is revealed that all EP approximations on discrete variables are special cases of GBP, and conversely that some well-known GBP approximations, such as overlapping squares, are special cases of EP. Furthermore, region graphs derived from EP have a number of good structural properties, including maxent-normality and overall counting number of one. The result is a convenient framework for producing high-quality approximations with a user-adjustable level of complexity.
Submitted 4 July, 2012;
originally announced July 2012.
-
Bayesian Random Fields: The Bethe-Laplace Approximation
Authors:
Max Welling,
Sridevi Parise
Abstract:
While learning the maximum likelihood value of parameters of an undirected graphical model is hard, modelling the posterior distribution over parameters given data is harder. Yet, undirected models are ubiquitous in computer vision and text modelling (e.g. conditional random fields). But where Bayesian approaches for directed models have been very successful, a proper Bayesian treatment of undirected models is still in its infant stages. We propose a new method for approximating the posterior of the parameters given data based on the Laplace approximation. This approximation requires the computation of the covariance matrix over features which we compute using the linear response approximation based in turn on loopy belief propagation. We develop the theory for conditional and 'unconditional' random fields with or without hidden variables. In the conditional setting we introduce a new variant of bagging suitable for structured domains. Here we run the loopy max-product algorithm on a 'super-graph' composed of graphs for individual models sampled from the posterior and connected by constraints. Experiments on real world data validate the proposed methods.
Submitted 27 June, 2012;
originally announced June 2012.
-
Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation
Authors:
Ian Porteous,
Alexander T. Ihler,
Padhraic Smyth,
Max Welling
Abstract:
Nonparametric Bayesian approaches to clustering, information retrieval, language modeling and object recognition have recently shown great promise as a new paradigm for unsupervised data analysis. Most contributions have focused on the Dirichlet process mixture models or extensions thereof for which efficient Gibbs samplers exist. In this paper we explore Gibbs samplers for infinite complexity mixture models in the stick breaking representation. The advantage of this representation is improved modeling flexibility. For instance, one can design the prior distribution over cluster sizes or couple multiple infinite mixture models (e.g. over time) at the level of their parameters (i.e. the dependent Dirichlet process model). However, Gibbs samplers for infinite mixture models (as recently introduced in the statistics literature) seem to mix poorly over cluster labels. Among other issues, this can have the adverse effect that labels for the same cluster in coupled mixture models are mixed up. We introduce additional moves in these samplers to improve mixing over cluster labels and to bring clusters into correspondence. An application to modeling of storm trajectories is used to illustrate these ideas.
Submitted 27 June, 2012;
originally announced June 2012.
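For readers unfamiliar with the stick breaking representation mentioned in the abstract above, the following sketch draws truncated mixture weights from it. The Beta(1, alpha) stick fractions correspond to the Dirichlet process case and are an assumption here; the abstract notes that other priors over cluster sizes can be designed. The function name and the truncation level are likewise illustrative, and this is only the prior construction, not the Gibbs sampler the paper studies.

```python
import numpy as np

def stick_breaking_weights(alpha, n_components, seed=0):
    """Draw truncated stick-breaking weights; returns (weights, stick_fractions)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=n_components)   # fraction broken off the remaining stick at each step
    # Length of stick still available before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining                       # pi_k = v_k * prod_{j<k} (1 - v_j)
    return weights, v

weights, v = stick_breaking_weights(alpha=2.0, n_components=50)
print(weights[:5], 1.0 - weights.sum())           # first few weights and the untruncated leftover mass
```

The mass left after truncation equals the product of all `(1 - v_k)`, so it shrinks quickly as more components are kept. Coupling several mixture models at the level of the fractions `v` is the kind of flexibility the abstract refers to.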
-
Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring
Authors:
Sungjin Ahn,
Anoop Korattikara,
Max Welling
Abstract:
In this paper we address the following question: Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data-items for every sample we generate? An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it will sample from a normal approximation of the posterior, while for slow mixing rates it will mimic the behavior of SGLD with a pre-conditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and as such an efficient optimizer during burn-in.
Submitted 27 June, 2012;
originally announced June 2012.
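The baseline that this abstract builds on, SGLD, can be sketched in a few lines. The example below is plain SGLD with a constant step size, not the Fisher scoring extension the paper proposes. The model (a Gaussian mean with unit observation variance and a N(0, prior_var) prior), the function name, and all defaults are assumptions made for illustration.

```python
import numpy as np

def sgld_gaussian_mean(data, n_iters=5000, batch_size=10, step=1e-4,
                       prior_var=100.0, seed=0):
    """Plain SGLD samples for the mean of unit-variance Gaussian data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_total = data.size
    theta = 0.0
    samples = np.empty(n_iters)
    for t in range(n_iters):
        batch = rng.choice(data, size=batch_size, replace=False)  # touch only a mini-batch
        grad_prior = -theta / prior_var                           # gradient of the log prior
        # Mini-batch gradient of the log-likelihood, rescaled to the full dataset.
        grad_lik = (n_total / batch_size) * np.sum(batch - theta)
        noise = np.sqrt(step) * rng.standard_normal()             # injected Langevin noise, variance = step
        theta += 0.5 * step * (grad_prior + grad_lik) + noise
        samples[t] = theta
    return samples

demo_rng = np.random.default_rng(42)
demo_data = demo_rng.normal(2.0, 1.0, size=1000)
draws = sgld_gaussian_mean(demo_data)
print(draws[1000:].mean(), demo_data.mean())
```

With a constant step size the samples centre on the posterior mean, but their spread also includes mini-batch gradient noise, so it should not be read as the exact posterior variance.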
-
Hybrid Variational/Gibbs Collapsed Inference in Topic Models
Authors:
Max Welling,
Yee Whye Teh,
Hilbert Kappen
Abstract:
Variational Bayesian inference and (collapsed) Gibbs sampling are the two important classes of inference algorithms for Bayesian networks. Both have their advantages and disadvantages: collapsed Gibbs sampling is unbiased but is also inefficient for large count values and requires averaging over many samples to reduce variance. On the other hand, variational Bayesian inference is efficient and accurate for large count values but suffers from bias for small counts. We propose a hybrid algorithm that combines the best of both worlds: it samples very small counts and applies variational updates to large counts. This hybridization is shown to significantly improve test-set perplexity relative to variational inference at no computational cost.
Submitted 13 June, 2012;
originally announced June 2012.
-
Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior
Authors:
Yutian Chen,
Max Welling
Abstract:
In recent years a number of methods have been developed for automatically learning the (sparse) connectivity structure of Markov Random Fields. These methods are mostly based on L1-regularized optimization which has a number of disadvantages such as the inability to assess model uncertainty and expensive cross-validation to find the optimal regularization parameter. Moreover, the model's predictive performance may degrade dramatically with a suboptimal value of the regularization parameter (which is sometimes desirable to induce sparseness). We propose a fully Bayesian approach based on a "spike and slab" prior (similar to L0 regularization) that does not suffer from these shortcomings. We develop an approximate MCMC method combining Langevin dynamics and reversible jump MCMC to conduct inference in this model. Experiments show that the proposed model learns a good combination of the structure and parameter values without the need for separate hyper-parameter tuning. Moreover, the model's predictive performance is much more robust than L1-based methods with hyper-parameter settings that induce highly sparse model structures.
Submitted 22 June, 2012; v1 submitted 5 June, 2012;
originally announced June 2012.
-
On Smoothing and Inference for Topic Models
Authors:
Arthur Asuncion,
Max Welling,
Padhraic Smyth,
Yee Whye Teh
Abstract:
Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.
Submitted 9 May, 2012;
originally announced May 2012.
-
Herding Dynamic Weights for Partially Observed Random Field Models
Authors:
Max Welling
Abstract:
Learning the parameters of a (potentially partially observable) random field model is intractable in general. Instead of focussing on a single optimal parameter value we propose to treat parameters as dynamical quantities. We introduce an algorithm to generate complex dynamics for parameters and (both visible and hidden) state vectors. We show that under certain conditions averages computed over trajectories of the proposed dynamical system converge to averages computed over the data. Our "herding dynamics" does not require expensive operations such as exponentiation and is fully deterministic.
Submitted 9 May, 2012;
originally announced May 2012.
-
Super-Samples from Kernel Herding
Authors:
Yutian Chen,
Max Welling,
Alex Smola
Abstract:
We extend the herding algorithm to continuous spaces by using the kernel trick. The resulting "kernel herding" algorithm is an infinite memory deterministic process that learns to approximate a PDF with a collection of samples. We show that kernel herding decreases the error of expectations of functions in the Hilbert space at a rate $O(1/T)$, which is much faster than the usual $O(1/\sqrt{T})$ for iid random samples. We illustrate kernel herding by approximating Bayesian predictive distributions.
Submitted 15 March, 2012;
originally announced March 2012.
-
Avalanches and Self-Organized Criticality in Superconductors
Authors:
Rinke J. Wijngaarden,
Marco S. Welling,
Christof M. Aegerter,
Mariela Menghini
Abstract:
We review the use of superconductors as a playground for the experimental study of front roughening and avalanches. Using the magneto-optical technique, the spatial distribution of the vortex density in the sample is monitored as a function of time. The roughness and growth exponents corresponding to the vortex landscape are determined and compared to the exponents that characterize the avalanches in the framework of Self-Organized Criticality. For those situations where a thermo-magnetic instability arises, an analytical non-linear and non-local model is discussed, which is found to be consistent to great detail with the experimental results. On anisotropic substrates, the anisotropy regularizes the avalanches.
Submitted 19 October, 2005;
originally announced October 2005.
-
Dynamic roughening of the magnetic flux landscape in YBa$_2$Cu$_3$O$_{7-x}$
Authors:
C. M. Aegerter,
M. S. Welling,
R. J. Wijngaarden
Abstract:
We study the magnetic flux landscape in YBa$_2$Cu$_3$O$_{7-x}$ thin films as a two-dimensional rough surface. The vortex density in the superconductor forms a self-affine structure in both space and time. This is characterized by a roughness exponent $α = 0.76(3)$ and a growth exponent $β = 0.57(6)$. This is due to the structure and distribution of flux avalanches in the self-organized critical state, which is formed in the superconductor. We also discuss our results in the context of other roughening systems in the presence of quenched disorder.
Submitted 10 November, 2004;
originally announced November 2004.
-
Self-organized criticality induced by quenched disorder: experiments on flux avalanches in NbH$_x$ films
Authors:
M. S. Welling,
C. M. Aegerter,
R. J. Wijngaarden
Abstract:
We present an experimental study of the influence of quenched disorder on the distribution of flux avalanches in type-II superconductors. In the presence of much quenched disorder, the avalanche sizes are power-law distributed and show finite size scaling, as expected from self-organized criticality (SOC). Furthermore, the shape of the avalanches is observed to be fractal. In the absence of quenched disorder, a preferred size of avalanches is observed and avalanches are smooth. These observations indicate that a certain minimum amount of disorder is necessary for SOC behavior. We relate these findings to the appearance or non-appearance of SOC in other experimental systems, particularly piles of sand.
Submitted 14 October, 2004;
originally announced October 2004.
-
Dendritic flux avalanches and nonlocal electrodynamics in thin superconducting films
Authors:
Igor S. Aranson,
Alex Gurevich,
Marco S. Welling,
Rinke J. Wijngaarden,
Vitalii K. Vlasko-Vlasov,
Valerii M. Vinokur,
Ulrich Welp
Abstract:
We present numerical and analytical studies of coupled nonlinear Maxwell and thermal diffusion equations which describe nonisothermal dendritic flux penetration in superconducting films. We show that spontaneous branching of propagating flux filaments occurs due to nonlocal magnetic flux diffusion and positive feedback between flux motion and Joule heat generation. The branching is triggered by a thermomagnetic edge instability which causes stratification of the critical state. The resulting distribution of magnetic microavalanches depends on a spatial distribution of defects. Our results are in good agreement with experiments performed on Nb films.
Submitted 19 July, 2004;
originally announced July 2004.
-
Self-organized criticality in the Bean state in YBa$_2$Cu$_3$O$_{7-x}$ thin films
Authors:
C. M. Aegerter,
M. S. Welling,
R. J. Wijngaarden
Abstract:
The penetration of magnetic flux into a thin film of YBa$_2$Cu$_3$O$_{7-x}$ is studied when the external field is ramped slowly. In this case the flux penetrates in bursts or avalanches. The size of these avalanches is distributed according to a power law with an exponent of $τ$ = 1.29(2). The additional observation of finite-size scaling of the avalanche distributions, with an avalanche dimension D = 1.89(3), gives strong indications towards self-organized criticality in this system. Furthermore we determine exponents governing the roughening dynamics of the flux surface using some universal scaling relations. These exponents are compared to those obtained from a standard roughening analysis.
Submitted 26 May, 2003;
originally announced May 2003.
-
Quantum Mechanics of a Point Particle in 2+1 Dimensional Gravity
Authors:
Hans-Juergen Matschull,
Max Welling
Abstract:
We study the phase space structure and the quantization of a pointlike particle in 2+1 dimensional gravity. By adding boundary terms to the first-order Einstein-Hilbert action, and removing all redundant gauge degrees of freedom, we arrive at a reduced action for a gravitating particle in 2+1 dimensions, which is invariant under Lorentz transformations and a group of generalized translations. The momentum space of the particle turns out to be the group manifold SL(2). Its position coordinates have non-vanishing Poisson brackets, resulting in a non-commutative quantum spacetime. We use the representation theory of SL(2) to investigate its structure. We find a discretization of time, and some semi-discrete structure of space. An uncertainty relation forbids a fully localized particle. The quantum dynamics is described by a discretized Klein-Gordon equation.
Submitted 21 April, 1998; v1 submitted 22 August, 1997;
originally announced August 1997.