
Semantic clustering of XML documents

Published: 29 January 2010

Abstract

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
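
As a rough illustration of what "representing XML substructures as transactional data" can look like, the sketch below flattens each document into a set of items and compares two such sets. It is a deliberately simplified stand-in, not the tree tuple model of the paper: the item shape (tag path paired with a content term), the function names `xml_to_transaction` and `jaccard`, and the toy documents are all assumptions made for this example, and no lexical knowledge base is involved.

```python
import xml.etree.ElementTree as ET


def xml_to_transaction(xml_text):
    """Flatten one XML document into a set of (tag path, term) items.

    Assumption: an "item" is a root-to-node tag path paired with one
    lowercased content term. The paper's tree tuples are richer than this.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def walk(node, path):
        current = path + "/" + node.tag
        text = (node.text or "").strip().lower()
        for term in text.split():
            items.add((current, term))
        for child in node:
            walk(child, current)

    walk(root, "")
    return items


def jaccard(a, b):
    """Jaccard similarity between two item sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical toy documents, used only to exercise the functions.
doc_a = "<paper><title>XML clustering</title><year>2010</year></paper>"
doc_b = "<paper><title>XML retrieval</title><year>2010</year></paper>"
doc_c = "<book><name>Data mining</name></book>"

t_a, t_b, t_c = (xml_to_transaction(d) for d in (doc_a, doc_b, doc_c))

print(jaccard(t_a, t_b))  # two shared items out of four distinct ones
print(jaccard(t_a, t_c))  # no shared items
```

Because items carry both a structural part (the path) and a content part (the term), two documents only look similar when they agree on both. A semantic variant would presumably replace the exact match on tags and terms with a relatedness score, which this sketch does not attempt.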

Supplementary Material

Tagarelli Appendix (a3-tagarelli-apndx.pdf)
Online appendix to semantic clustering of XML documents on article 3.

Reviews

Aris Gkoulalas-Divanis

With the advent of Extensible Markup Language (XML) and its wide adoption in applications, data extraction from semi-structured documents to facilitate data analysis has become an attractive research direction. The existence of structure in documents provides the means for designing sophisticated approaches for data management and knowledge discovery. These approaches take into consideration both content and structure semantics.

In this paper, Tagarelli and Greco propose a framework, along with algorithms to cluster semantically related semi-structured documents, based on commonalities in their structure and content. First, they apply structure analysis to the XML documents to remove the ambiguity in the different tag names and allow the selection of the most appropriate sense for each tag name. Following this, they analyze the documents based on their content similarity, using techniques that consider both syntactic and semantic term relevance. An important characteristic of the proposed approach is the use of a novel representation scheme for mapping XML document trees into transactions consisting of items that carry both structure and content characteristics. The authors employ a transactional clustering algorithm that quantifies similarity by taking into consideration the semantics of the data. Subsequently, the identified clusters of transactions derive a classification of the XML documents for the end user.

The authors demonstrate the effectiveness of the proposed approach through experiments on real-world data that test it against state-of-the-art algorithms for clustering XML documents. Overall, this is interesting work. The paper is well structured, motivated, and presented, and the experimental results look promising. For these reasons, researchers in the field will benefit from reading it.

Online Computing Reviews Service

Published In

ACM Transactions on Information Systems, Volume 28, Issue 1
January 2010
157 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1658377
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2010
Accepted: 01 February 2009
Revised: 01 June 2008
Received: 01 April 2007
Published in TOIS Volume 28, Issue 1

Author Tags

  1. XML document clustering
  2. XML structure and content mining
  3. XML transactional modeling
  4. XML tree tuples
  5. similarity measures

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 13 Sep 2024

Cited By

  • (2024) Information Retrieval in XML Document: State of the Art. International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD'2023). DOI: 10.1007/978-3-031-54318-0_28, pp. 322-331. Online publication date: 21-Feb-2024
  • (2023) Combining offline and on-the-fly disambiguation to perform semantic-aware XML querying. Computer Science and Information Systems 20:1. DOI: 10.2298/CSIS220228063T, pp. 423-457. Online publication date: 2023
  • (2021) Almost Linear Semantic XML Keyword Search. Proceedings of the 13th International Conference on Management of Digital EcoSystems. DOI: 10.1145/3444757.3485079, pp. 129-138. Online publication date: 1-Nov-2021
  • (2019) Data Mining: Clustering. Encyclopedia of Bioinformatics and Computational Biology. DOI: 10.1016/B978-0-12-809633-8.20489-5, pp. 437-448. Online publication date: 2019
  • (2018) Semantic Mapping of Energy Simulation Data Using Bag of Words and Graph Matching. 2018 26th International Conference on Geoinformatics. DOI: 10.1109/GEOINFORMATICS.2018.8557096, pp. 1-5. Online publication date: Jun-2018
  • (2017) Structure based XML document clustering: A review. 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS). DOI: 10.1109/ICTUS.2017.8286068, pp. 543-547. Online publication date: Dec-2017
  • (2017) Tree pattern matching in heterogeneous fuzzy XML databases. Knowledge-Based Systems 122:C. DOI: 10.1016/j.knosys.2017.02.003, pp. 119-130. Online publication date: 15-Apr-2017
  • (2016) An Overview on XML Semantic Disambiguation from Unstructured Text to Semi-Structured Data. IEEE Transactions on Knowledge and Data Engineering 28:6. DOI: 10.1109/TKDE.2016.2525768, pp. 1383-1407. Online publication date: 1-Jun-2016
  • (2016) Building semantic trees from XML documents. Web Semantics: Science, Services and Agents on the World Wide Web 37:C. DOI: 10.1016/j.websem.2016.03.002, pp. 1-24. Online publication date: 1-Mar-2016
  • (2015) XML document clustering. Artificial Intelligence Review 43:3. DOI: 10.1007/s10462-012-9379-2, pp. 417-436. Online publication date: 1-Mar-2015
