
Hilbert Space Embeddings and Metrics on Probability Measures

Published: 01 August 2010

Abstract

A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γ_k, indexed by the kernel function k that defines the inner product in the RKHS.
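For intuition, γ_k admits a simple plug-in estimate: replace each mean embedding by the sample average of kernel features, which yields the familiar biased (V-statistic) maximum mean discrepancy statistic. The Python sketch below is an illustration under our own choices (a Gaussian kernel, and the function names gaussian_kernel and gamma_k_estimate), not code from the paper:

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); the Gaussian kernel is
        # integrally strictly positive definite, hence characteristic.
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    def gamma_k_estimate(X, Y, sigma=1.0):
        # Biased (V-statistic) estimate of gamma_k(P, Q): the RKHS distance
        # between the empirical mean embeddings of the two samples,
        # gamma_k^2 ~ mean k(x_i, x_j) + mean k(y_i, y_j) - 2 mean k(x_i, y_j).
        k_xx = gaussian_kernel(X, X, sigma).mean()
        k_yy = gaussian_kernel(Y, Y, sigma).mean()
        k_xy = gaussian_kernel(X, Y, sigma).mean()
        return np.sqrt(k_xx + k_yy - 2.0 * k_xy)

    # Toy check: samples from two Gaussians that differ in mean.
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(500, 1))
    Y = rng.normal(0.5, 1.0, size=(500, 1))
    print(gamma_k_estimate(X, Y))  # clearly larger than for two samples of the same law

Because the Gaussian kernel is characteristic, this estimate converges to zero as the sample sizes grow if and only if the two distributions coincide.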
We present three theoretical properties of γ_k. First, we consider the question of determining the conditions on the kernel k for which γ_k is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains) and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on ℝ^d, then it is characteristic if and only if the support of its Fourier transform is the entire ℝ^d. Second, we show that the distance between distributions under γ_k results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γ_k, we relate γ_k to other popular metrics on probability measures, and present conditions on the kernel k under which γ_k metrizes the weak topology.
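The second property is easiest to see in the frequency domain. For a bounded continuous translation-invariant kernel k(x, y) = ψ(x − y), Bochner's theorem represents ψ as the Fourier transform of a finite nonnegative measure Λ, and then (up to the constants of the chosen Fourier convention) γ_k is a weighted L² distance between characteristic functions:

    γ_k²(P, Q) = ∫_{ℝ^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω),

where φ_P and φ_Q are the characteristic functions of P and Q. This identity exhibits both claims at once: γ_k(P, Q) = 0 forces φ_P = φ_Q Λ-almost everywhere, which implies P = Q precisely when supp(Λ) = ℝ^d; and if φ_P − φ_Q is concentrated at frequencies where Λ has little mass (e.g., high frequencies under a Gaussian kernel, whose spectral measure decays exponentially), then γ_k(P, Q) is small even though P ≠ Q.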

      Published In

      cover image The Journal of Machine Learning Research
      The Journal of Machine Learning Research  Volume 11, Issue
      3/1/2010
      3637 pages
      ISSN:1532-4435
      EISSN:1533-7928
      Issue’s Table of Contents

      Publisher

      JMLR.org

      Publication History

      Published: 01 August 2010
      Published in JMLR Volume 11
