Abstract
We present two innovative machine-learning approaches to topic-model-based clustering in the XML domain. The first approach exploits well-established clustering techniques to partition the input XML documents by their meaning. Meaning is captured through a new Bayesian probabilistic topic model, whose novelty is the incorporation of Dirichlet-multinomial distributions for both content and structure. In the second approach, a novel Bayesian probabilistic generative model of XML corpora seamlessly integrates the aforesaid topic model with clustering. Topics and clusters are both conceived as interacting latent factors that govern the wording of the input XML documents. Experiments over real-world benchmark XML corpora show that the devised approaches are more effective than several state-of-the-art competitors.
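The abstract does not say how content and structure are encoded before topic modeling. As a minimal sketch of one plausible encoding, assumed here purely for illustration and not taken from the paper, each word can be paired with the root-to-leaf tag path under which it occurs, so that a topic model sees structure and content together. The function name `path_word_tokens` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_word_tokens(xml_text):
    """Return (tag_path, word) pairs for every word in an XML document.

    Assumption: structure is summarized by the slash-joined tag path
    from the root to the element that directly contains the word.
    """
    root = ET.fromstring(xml_text)
    tokens = []

    def visit(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        # Words that appear directly inside this element.
        for word in (node.text or "").lower().split():
            tokens.append((path, word))
        for child in node:
            visit(child, path)
            # Text following a child element still belongs to this element.
            for word in (child.tail or "").lower().split():
                tokens.append((path, word))

    visit(root, "")
    return tokens


# Hypothetical sample document, used only to show the output shape.
doc = "<paper><title>Topic models</title><body><sec>XML clustering</sec></body></paper>"
tokens = path_word_tokens(doc)
for path, word in tokens:
    print(path, word)
```

Counting such pairs per document would give the kind of bag-of-tokens input that Dirichlet-multinomial topic models typically consume; whether the paper uses this exact representation is not stated in the abstract.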
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Costa, G., Ortale, R. (2019). Mining Cluster Patterns in XML Corpora via Latent Topic Models of Content and Structure. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_19
DOI: https://doi.org/10.1007/978-3-030-16142-2_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16141-5
Online ISBN: 978-3-030-16142-2
eBook Packages: Computer Science (R0)