Abstract
Similarity and distance functions are essential to many learning algorithms, so learning them has attracted considerable interest. For structured data (e.g., strings or trees), edit similarities are widely used, and a few methods exist for learning them. However, these methods offer no theoretical guarantee on the generalization performance or discriminative power of the resulting similarities. Recently, a theory of learning with (ε, γ, τ)-good similarity functions was proposed. This theory bridges the gap between the properties of a similarity function and its performance in classification. In this paper, we propose a novel edit similarity learning approach (GESL) driven by the idea of (ε, γ, τ)-goodness, which allows us to derive generalization guarantees using the notion of uniform stability. We experimentally show that edit similarities learned with our method induce classification models that are both more accurate and sparser than those induced by the edit distance or by edit similarities learned with a state-of-the-art method.
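To make the notion of an edit measure with learnable costs concrete, the following is a minimal sketch of a weighted string edit distance computed by dynamic programming. It is not the GESL method itself: the function names `edit_distance`, `unit_cost`, and `cheap_vowels`, and the convention of representing the empty symbol by `""`, are assumptions made for illustration only. With unit costs the function reduces to the standard edit (Levenshtein) distance; learning an edit similarity amounts to choosing the cost function from data rather than fixing it by hand.

```python
from typing import Callable

EPS = ""  # hypothetical convention: the empty symbol marks an insertion or deletion


def unit_cost(a: str, b: str) -> float:
    """Standard edit costs: 0 for a match, 1 for any other operation."""
    return 0.0 if a == b else 1.0


def cheap_vowels(a: str, b: str) -> float:
    """Illustrative hand-set costs: vowel-for-vowel substitutions cost 0.5."""
    if a == b:
        return 0.0
    if a and b and a in "aeiou" and b in "aeiou":
        return 0.5
    return 1.0


def edit_distance(x: str, y: str,
                  cost: Callable[[str, str], float] = unit_cost) -> float:
    """Minimum total cost of turning x into y under the given operation costs."""
    m, n = len(x), len(y)
    # d[i][j] holds the cheapest way to turn x[:i] into y[:j].
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(x[i - 1], EPS)      # delete x[i-1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(EPS, y[j - 1])      # insert y[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + cost(x[i - 1], EPS),           # deletion
                d[i][j - 1] + cost(EPS, y[j - 1]),           # insertion
                d[i - 1][j - 1] + cost(x[i - 1], y[j - 1]),  # substitution or match
            )
    return d[m][n]
```

For example, `edit_distance("kitten", "sitting")` uses unit costs, while `edit_distance("bat", "bit", cheap_vowels)` shows how a different cost function changes which strings count as close.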
We would like to acknowledge support from the ANR LAMPADA 09-EMER-007-02 project and the PASCAL 2 Network of Excellence.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bellet, A., Habrard, A., Sebban, M. (2011). Learning Good Edit Similarities with Generalization Guarantees. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6911. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23780-5_22
DOI: https://doi.org/10.1007/978-3-642-23780-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23779-9
Online ISBN: 978-3-642-23780-5
eBook Packages: Computer Science (R0)