Semantic clustering of XML documents

Authors: Andrea Tagarelli, Sergio Greco

ACM Transactions on Information Systems (TOIS), Volume 28, Issue 1

Article No.: 3, Pages 1 - 56

https://doi.org/10.1145/1658377.1658380

Published: 29 January 2010

Abstract

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
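To make the idea of representing XML documents as transactional data concrete, here is a minimal Python sketch. It is not the authors' tree tuple model: it assumes, purely for illustration, that each item is a (tag path, term) pair, and it compares two transactions with plain Jaccard similarity and no lexical knowledge base. The function names `xml_to_transaction` and `jaccard` and both sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_transaction(xml_text):
    """Flatten an XML document into a set of (tag path, term) items.

    Illustrative simplification: the paper builds items from tree tuples
    and enriches them with semantic information; this sketch does neither.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def walk(node, path):
        current = f"{path}/{node.tag}"
        # Lowercased whitespace-separated tokens of the element's own text.
        text = (node.text or "").strip().lower()
        for term in text.split():
            items.add((current, term))
        for child in node:
            walk(child, current)

    walk(root, "")
    return items


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical sample documents sharing structure and part of their content.
doc_a = "<paper><title>XML clustering</title><author>Greco</author></paper>"
doc_b = "<paper><title>XML retrieval</title><author>Greco</author></paper>"

t_a = xml_to_transaction(doc_a)
t_b = xml_to_transaction(doc_b)
similarity = jaccard(t_a, t_b)
```

Because each item carries both a path and a term, two documents count as similar only when they share content in the same structural position, which is the intuition behind combining structure and content features.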

Supplementary Material

Tagarelli Appendix (a3-tagarelli-apndx.pdf)

Online appendix to semantic clustering of XML documents on article 3.

761.79 KB

Index Terms

Semantic clustering of XML documents
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Markup languages
2. Information systems

Recommendations

Semantic Structural Similarity for Clustering XML Documents
ICHIT '08: Proceedings of the 2008 International Conference on Convergence and Hybrid Information Technology

The amount of XML documents is increasing rapidly. In order to analyze the information represented in XML documents efficiently, researches on XML document clustering are actively in progress. The key issue is how to devise the similarity measure ...
Clustering XML Documents by Combining Content and Structure
ISISE '08: Proceedings of the 2008 International Symposium on Information Science and Engieering - Volume 01

XML has become a de facto standard for data representation and exchange over the Internet. With the emergence of more and more XML documents, the clustering of XML documents has become an active research area. XML documents lie between structured data ...
Collaborative clustering of XML documents

Clustering XML documents is extensively used to organize large collections of XML documents in groups that are coherent according to structure and/or content features. The growing availability of distributed XML sources and the variety of high-demand ...

Reviews

Reviewer: Aris Gkoulalas-Divanis

With the advent of Extensible Markup Language (XML) and its wide adoption in applications, data extraction from semi-structured documents to facilitate data analysis has become an attractive research direction. The existence of structure in documents provides the means for designing sophisticated approaches for data management and knowledge discovery. These approaches take into consideration both content and structure semantics. In this paper, Tagarelli and Greco propose a framework, along with algorithms to cluster semantically related semi-structured documents, based on commonalities in their structure and content. First, they apply structure analysis to the XML documents to remove the ambiguity in the different tag names and allow the selection of the most appropriate sense for each tag name. Following this, they analyze the documents based on their content similarity, using techniques that consider both syntactic and semantic term relevance. An important characteristic of the proposed approach is the use of a novel representation scheme for mapping XML document trees into transactions consisting of items that carry both structure and content characteristics. The authors employ a transactional clustering algorithm that quantifies similarity by taking into consideration the semantics of the data. Subsequently, the identified clusters of transactions derive a classification of the XML documents for the end user. The authors demonstrate the effectiveness of the proposed approach through experiments on real-world data that test it against state-of-the-art algorithms for clustering XML documents. Overall, this is interesting work. The paper is well structured, motivated, and presented, and the experimental results look promising. For these reasons, researchers in the field will benefit from reading it.

Online Computing Reviews Service
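The transactional clustering step mentioned in the review can be sketched roughly as follows. This is a generic centroid-based partitional loop over item sets, not the algorithm from the paper: it assumes plain Jaccard similarity (no semantic relatedness), a majority-vote centroid, and caller-supplied seed indices. The names `cluster` and `centroid` and the toy transactions are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def centroid(transactions):
    """Items occurring in at least half of the cluster's transactions."""
    counts = Counter(item for t in transactions for item in t)
    return {item for item, c in counts.items() if 2 * c >= len(transactions)}


def cluster(transactions, seeds, max_iter=10):
    """Assign each transaction to its most similar centroid until stable."""
    centroids = [set(transactions[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: jaccard(t, centroids[k]))
            for t in transactions
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for k in range(len(centroids)):
            members = [t for t, l in zip(transactions, labels) if l == k]
            if members:
                centroids[k] = centroid(members)
    return labels


# Hypothetical transactions; items are "path:term" strings for brevity.
transactions = [
    {"paper/title:xml", "paper/author:greco"},
    {"paper/title:xml", "paper/author:tagarelli"},
    {"movie/title:alien", "movie/year:1979"},
    {"movie/title:alien", "movie/director:scott"},
]
labels = cluster(transactions, seeds=[0, 2])
```

In this toy run the two paper-like transactions and the two movie-like transactions end up in separate clusters. A semantics-aware variant would replace `jaccard` with a similarity that also credits related but non-identical items.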

Published In

ACM Transactions on Information Systems Volume 28, Issue 1

January 2010

157 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1658377

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2010

Accepted: 01 February 2009

Revised: 01 June 2008

Received: 01 April 2007

Published in TOIS Volume 28, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 39
Total Downloads: 2,125
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 1

Reflects downloads up to 13 Sep 2024

Cited By

Belahyane I, Mammass M, Abioui H, Idarrou A (2024). Information Retrieval in XML Document: State of the Art. International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD'2023), 322-331. Online publication date: 21-Feb-2024.
https://doi.org/10.1007/978-3-031-54318-0_28

Tekli J, Tekli G, Chbeir R (2023). Combining offline and on-the-fly disambiguation to perform semantic-aware XML querying. Computer Science and Information Systems 20:1, 423-457. Online publication date: 2023.
https://doi.org/10.2298/CSIS220228063T

Tekli J, Tekli G, Chbeir R, Chbeir R, Manolopoulos Y, Bellatreche L, Benslimane D, Ivanovic M, Maamar Z (2021). Almost Linear Semantic XML Keyword Search. Proceedings of the 13th International Conference on Management of Digital EcoSystems, 129-138. Online publication date: 1-Nov-2021.
https://dl.acm.org/doi/10.1145/3444757.3485079

Amelio A, Tagarelli A (2019). Data Mining: Clustering. Encyclopedia of Bioinformatics and Computational Biology, 437-448. Online publication date: 2019.
https://doi.org/10.1016/B978-0-12-809633-8.20489-5

Li R, Tian B, Yan L, Qu Y (2018). Semantic Mapping of Energy Simulation Data Using Bag of Words and Graph Matching. 2018 26th International Conference on Geoinformatics, 1-5. Online publication date: Jun-2018.
https://doi.org/10.1109/GEOINFORMATICS.2018.8557096

Thulasi A, Remya K, Raju G (2017). Structure based XML document clustering: A review. 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS), 543-547. Online publication date: Dec-2017.
https://doi.org/10.1109/ICTUS.2017.8286068

Liu J, Zhang X, Zhang L (2017). Tree pattern matching in heterogeneous fuzzy XML databases. Knowledge-Based Systems 122:C, 119-130. Online publication date: 15-Apr-2017.
https://dl.acm.org/doi/10.1016/j.knosys.2017.02.003

Tekli J (2016). An Overview on XML Semantic Disambiguation from Unstructured Text to Semi-Structured Data. IEEE Transactions on Knowledge and Data Engineering 28:6, 1383-1407. Online publication date: 1-Jun-2016.
https://dl.acm.org/doi/10.1109/TKDE.2016.2525768

Tekli J, Charbel N, Chbeir R (2016). Building semantic trees from XML documents. Web Semantics: Science, Services and Agents on the World Wide Web 37:C, 1-24. Online publication date: 1-Mar-2016.
https://dl.acm.org/doi/10.1016/j.websem.2016.03.002

Asghari E, Keyvanpour M (2015). XML document clustering. Artificial Intelligence Review 43:3, 417-436. Online publication date: 1-Mar-2015.
https://dl.acm.org/doi/10.1007/s10462-012-9379-2
