BM³E: Discriminative Density Propagation for Visual Tracking

C Sminchisescu, A Kanaujia… - IEEE Transactions on …, 2007 - ieeexplore.ieee.org
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007
We introduce BM³E, a Conditional Bayesian Mixture of Experts Markov Model, for consistent probabilistic estimates in discriminative visual tracking. The model applies to problems of temporal and uncertain inference and represents the unexplored bottom-up counterpart of pervasive generative models estimated with Kalman filtering or particle filtering. Instead of inverting a non-linear generative observation model at run-time, we learn to cooperatively predict complex state distributions directly from descriptors that encode image observations, typically bag-of-feature global image histograms or descriptors computed over regular spatial grids. These predictions are integrated in a conditional graphical model in order to enforce temporal smoothness constraints and allow a principled management of uncertainty. The algorithms combine sparsity, mixture modeling, and non-linear dimensionality reduction for efficient computation in high-dimensional continuous state spaces. The combined system automatically self-initializes and recovers from failure. The research makes three contributions: (1) we establish the density propagation rules for discriminative inference in continuous, temporal chain models; (2) we propose flexible supervised and unsupervised algorithms for learning feedforward, multivalued contextual mappings (multimodal state distributions) based on compact, conditional Bayesian mixture of experts models; (3) we validate the framework empirically for the reconstruction of 3D human motion in monocular video sequences. Our tests on both real and motion capture-based sequences show significant performance gains with respect to competing nearest-neighbor, regression, and structured prediction methods.
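As a rough illustration of the idea described above, the following sketch shows one step of discriminative density propagation with a conditional mixture of experts. It is not the authors' implementation. It assumes a scalar state and a scalar image descriptor, linear Gaussian experts, softmax gates evaluated only at the mean of each prior component, and pruning to the highest-weight components. The names `softmax`, `predict_experts`, `propagate`, and the expert dictionary layout (`gate`, `w`, `var`) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_experts(x_prev, r, experts):
    """Evaluate an assumed conditional mixture of experts p(x_t | x_prev, r).

    Each expert is a dict with linear gate weights 'gate', linear regressor
    weights 'w' (both over the features [1, x_prev, r]) and a noise
    variance 'var'. Returns a list of (gate_weight, mean, variance).
    """
    feats = (1.0, x_prev, r)
    scores = [sum(g * f for g, f in zip(e["gate"], feats)) for e in experts]
    gates = softmax(scores)
    comps = []
    for gate, e in zip(gates, experts):
        mean = sum(w * f for w, f in zip(e["w"], feats))
        comps.append((gate, mean, e["var"]))
    return comps


def propagate(prior, r, experts, max_components=3):
    """One propagation step: prior mixture over x_{t-1} -> mixture over x_t.

    'prior' is a list of (weight, mean, variance) Gaussian components.
    The integral over x_{t-1} is approximated per prior component:
    gates are evaluated at the component mean (a simplifying assumption),
    and the prior variance is pushed through each expert's linear slope.
    """
    posterior = []
    for p_weight, p_mean, p_var in prior:
        preds = predict_experts(p_mean, r, experts)
        for (gate, mean, var), e in zip(preds, experts):
            slope = e["w"][1]  # coefficient on x_prev
            posterior.append((p_weight * gate, mean, var + slope ** 2 * p_var))
    # Keep the mixture compact: retain the heaviest components, renormalize.
    posterior.sort(key=lambda c: c[0], reverse=True)
    kept = posterior[:max_components]
    total = sum(c[0] for c in kept)
    return [(w / total, m, v) for w, m, v in kept]


# Illustrative values only: two experts and a bimodal prior.
experts = [
    {"gate": (0.0, 2.0, 0.0), "w": (0.5, 0.9, 0.2), "var": 0.05},
    {"gate": (0.0, -2.0, 0.0), "w": (-0.5, 0.9, 0.2), "var": 0.05},
]
prior = [(0.5, -1.0, 0.1), (0.5, 1.0, 0.1)]
posterior = propagate(prior, r=0.3, experts=experts)
```

The output stays a Gaussian mixture, so the same function can be applied again at the next frame. Pruning to `max_components` is one simple way to stop the number of modes from growing multiplicatively over time; the paper's own approximation scheme may differ.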