Mining Cluster Patterns in XML Corpora via Latent Topic Models of Content and Structure

Costa, Gianni; Ortale, Riccardo

doi:10.1007/978-3-030-16142-2_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11441)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

We present two innovative machine-learning approaches to topic model clustering for the XML domain. The first approach exploits consolidated clustering techniques to partition the input XML documents by their meaning. This meaning is captured through a new Bayesian probabilistic topic model, whose novelty is the incorporation of Dirichlet-multinomial distributions for both content and structure. In the second approach, a novel Bayesian probabilistic generative model of XML corpora seamlessly integrates the aforesaid topic model with clustering. Both are conceived as interacting latent factors that govern the wording of the input XML documents. Experiments over real-world benchmark XML corpora show that the devised approaches are more effective than several state-of-the-art competitors.
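
The idea of Dirichlet-multinomial distributions over both content and structure can be illustrated with a small generative sketch. This is not the authors' model, whose details are not given in the abstract. It is a toy example under stated assumptions: each token is a (tag path, word) pair, each topic has one distribution over words and one over tag paths, and path and word are drawn independently given the topic. The function names, vocabulary, tag paths and hyperparameter values are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(n_tokens, theta_prior, word_dists, path_dists,
                      words, paths, rng):
    """Generate one toy XML document as a list of (tag path, word) tokens.

    Assumption: given its topic, a token's tag path and word are drawn
    independently from that topic's structure and content distributions.
    """
    theta = sample_dirichlet(theta_prior, rng)  # per-document topic mixture
    tokens = []
    for _ in range(n_tokens):
        z = sample_categorical(theta, rng)                    # latent topic
        path = paths[sample_categorical(path_dists[z], rng)]  # structure
        word = words[sample_categorical(word_dists[z], rng)]  # content
        tokens.append((path, word))
    return theta, tokens

# Hypothetical toy vocabulary and tag paths, chosen for illustration only.
words = ["cluster", "topic", "schema", "element", "bayes"]
paths = ["article/title", "article/body/p", "article/refs/ref"]
n_topics = 2

rng = random.Random(0)
# Each topic has a Dirichlet-distributed word distribution and a
# Dirichlet-distributed tag-path distribution (symmetric priors assumed).
word_dists = [sample_dirichlet([0.5] * len(words), rng) for _ in range(n_topics)]
path_dists = [sample_dirichlet([0.5] * len(paths), rng) for _ in range(n_topics)]

theta, tokens = generate_document(
    8, [1.0] * n_topics, word_dists, path_dists, words, paths, rng
)
for path, word in tokens:
    print(f"{path}: {word}")
```

Under these assumptions, the per-document mixture `theta` is the kind of quantity that a clustering step could operate on, either afterwards (as in the first approach) or jointly with topic inference (as in the second).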

Author information

Authors and Affiliations

ICAR-CNR, Via P. Bucci 8/9C, Rende, CS, Italy
Gianni Costa & Riccardo Ortale

Corresponding author

Correspondence to Riccardo Ortale.

Editor information

Editors and Affiliations

Hong Kong University of Science and Technology, Hong Kong, China
Qiang Yang
Nanjing University, Nanjing, China
Zhi-Hua Zhou
University of Macau, Taipa, Macau, China
Zhiguo Gong
Southeast University, Nanjing, China
Min-Ling Zhang
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Sheng-Jun Huang

About this paper

Cite this paper

Costa, G., Ortale, R. (2019). Mining Cluster Patterns in XML Corpora via Latent Topic Models of Content and Structure. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_19

DOI: https://doi.org/10.1007/978-3-030-16142-2_19
Published: 20 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16141-5
Online ISBN: 978-3-030-16142-2
eBook Packages: Computer Science (R0)
