research-article

X-Class: Associative Classification of XML Documents by Structure

Authors:

Riccardo Ortale,

Ettore RitaccoAuthors Info & Claims

ACM Transactions on Information Systems (TOIS), Volume 31, Issue 1

Article No.: 3, Pages 1 - 40

https://doi.org/10.1145/2414782.2414785

Published: 01 January 2013 Publication History

Abstract

The supervised classification of XML documents by structure involves learning predictive models in which certain structural regularities discriminate the individual document classes. Hitherto, research has focused on the adoption of prespecified substructures. This is detrimental for classification effectiveness, since the a priori chosen substructures may not accord with the structural properties of the XML documents. Therein, an unexplored question is how to choose the type of structural regularity that best adapts to the structures of the available XML documents.

We tackle this problem through X-Class, an approach that handles all types of tree-like substructures and allows for choosing the most discriminatory one. Algorithms are designed to learn compact rule-based classifiers in which the chosen substructures discriminate the classes of XML documents.

X-Class is studied across various domains and types of substructures. Its classification performance is compared against several rule-based and SVM-based competitors. Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors. In particular, certain substructures allow the induction of very compact classifiers that generally outperform the rule-based competitors in terms of effectiveness over all chosen corpora of XML data. Furthermore, such classifiers are substantially as effective as the SVM-based competitor, with the additional advantage of a high-degree of interpretability.

References

[1]

Aggarwal, C., Ta, N., Wang, J., Feng, J., and Zaki, M. 2007. Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 46--55.

Digital Library

[2]

Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases. 487--499.

Digital Library

[3]

Aiolli, F., Da San Martino, G., and Sperduti, A. 2009. Route kernels for trees. In Proceedings of the International Conference on Machine Learning. 17--24.

Digital Library

[4]

Amer-Yahia, S. and Lalmas, M. 2006. Xml search: Languages, inex and scoring. ACM SIGMOD Record 35, 4, 16--23.

Digital Library

[5]

Antonie, M.-L. and Zaïane, O. 2002. Text document categorization by term association. In Proceedings of the IEEE International Conference on Data Mining. 19--26.

Digital Library

[6]

Antonie, M.-L. and Zaïane, O. 2004. An associative classifier based on positive and negative rules. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. 64--69.

Digital Library

[7]

Arunasalam, B. and Chawla, S. 2006. CCCS: A top-down association classifier for imbalanced class distribution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 517--522.

Digital Library

[8]

Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Boston, MA.

Digital Library

[9]

Baeza-Yates, R., Fuhr, N., and Andamaarek, Y. 2006. Special issue on xml retrieval. ACM Trans. Inform. Syst. 24, 4.

Digital Library

[10]

Baker, L. and McCallum, A. 1998. Distributional clustering of words for text classification. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 96--103.

Digital Library

[11]

Bennet, K. and Campbell, C. 2000. Support vector machines: Hype or hallelujah. SIGKDD Explo. 2, 2, 1--13.

Digital Library

[12]

Bratko, A. and Filipic̆, B. 2006. Exploiting structural information for semi-structured document categorization. Inf. Process. Manage. 42, 3, 679--694.

Digital Library

[13]

Brutlag, J. and Meek, C. 2000. Challenges of the email domain for text classification. In Proceedings of the International Conference on Machine Learning. 103--110.

Digital Library

[14]

Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discovery 2, 2, 121--167.

Digital Library

[15]

Candillier, L., Tellier, I., and Torre, F. 2006. Transforming xml trees for efficient classification and clustering. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 469--480.

Digital Library

[16]

Candillier, L., Tellier, I., and Torre, F. 2007. Xml document mining using graph neural network. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 458--472.

[17]

Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Digital Library

[18]

Cheng, H., Yan, X., Han, J., and Hsu, C.-W. 2007. Discriminative frequent pattern analysis for effective classification. In Proceedings of the International Conference on Data Engineering. 716--725.

[19]

Coenen, F. 2004. LUCS KDD implementations of CBA and CPAR. Department of Computer Science, The University of Liverpool. www.csc.liv.ac.uk/~frans/KDD/Software/.

[20]

Collins, M. and Duffy, N. 2001. Convolution kernels for natural language. In Proceedings of the Conference on Neural Information Processing Systems. 625--632.

[21]

Collins, M. and Duffy, N. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the Annual Meeting on Association for Computational Linguistics. 263--270.

Digital Library

[22]

Cong, G., Xu, X., Pan, F., Tung, A., and Yang, J. 2004. FARMER: Finding interesting rule groups in microarray datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 123--126.

Digital Library

[23]

Costa, G. and Ortale, R. 2012. On effective xml clustering by path commonality: An efficient and scalable algorithm. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 389--396.

Digital Library

[24]

Costa, G., Manco, G., Ortale, R., and Tagarelli, A. 2004. A tree-based approach to clustering xml documents by structure. In Proceedings of the International Conference on Principles and Practice of Knowledge Discovery in Databases. 137--148.

Digital Library

[25]

Costa, G., Ortale, R., and Ritacco, E. 2011a. Effective xml classification using content and structural information via rule learning. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 102--109.

Digital Library

[26]

Costa, G., Ortale, R., and Ritacco, E. 2011b. A transactional approach to associative xml classification by content and structure. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. 104--113.

[27]

Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Databases. 109--118.

Digital Library

[28]

Cristianini, N. and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Digital Library

[29]

Da San Martino, G. 2009. Kernel methods for tree structured data. Ph.D. dissertation, University of Bologna, Padova.

[30]

Dalamagas, T., Cheng, T., Winkel, K.-J., and Sellis, T. 2006. A methodology for clustering xml documents by structure. Inf. Syst. 31, 3, 187--228.

Digital Library

[31]

de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A. 2008. Probabilistic methods for structured document classification at inex’07. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 195--206.

Digital Library

[32]

Denoyer, L. and Gallinari, P. 2007. Report on the xml mining track at inex 2005 and inex 2006: Categorization and clustering of xml documents. ACM SIGIR Forum 41, 1, 79--90.

Digital Library

[33]

Denoyer, L. and Gallinari, P. 2008. Report on the xml mining track at inex 2007: Categorization and clustering of xml documents. ACM SIGIR Forum 42, 1, 22--28.

Digital Library

[34]

Flesca, S., Manco, G., Masciari, E., Pontieri, L., and Pugliese, A. 2005. Fast detection of xml structural similarity. IEEE Trans. Knowl. Data Eng. 17, 2, 160--175.

Digital Library

[35]

Fox, C. 1992. Lexical Analysis and Stoplists. Prentice Hall, Upper Saddle River, NJ.

[36]

Garboni, C., Masseglia, F., and Trousse, B. 2006. Sequential pattern mining for structure-based xml document classification. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 458--468.

Digital Library

[37]

Han, J. and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12.

Digital Library

[38]

Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California at Santa Cruz.

[39]

Hearst, M. 1998. Trends & controversies: Support vector machines. IEEE Intell. Syst. 13, 4, 18--28.

Digital Library

[40]

Helmer, S. 2007. Measuring the structural similarity of semistructured documents using entropy. In Proceedings of the International Conference on Very Large Data Base. 1022--1032.

Digital Library

[41]

Hofmann, T., Schölkopf, B., and Smola, A. 2008. Kernel methods in machine learning. Ann. Stat. 36, 3, 1171--1220.

[42]

Ikeda, K. 2006. Effects of kernel function on support vector machines in extreme cases. IEEE Trans. Neural Netw. 17, 1, 1--9.

Digital Library

[43]

Jaakkola, T. and Haussler, D. 1999. Exploiting generative models in discriminative classifiers. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 487--493.

Digital Library

[44]

Kashima, H. and Koyanagi, T. 2002. Kernels for semi-structured data. In Proceedings of the International Conference on Machine Learning. 291--298.

Digital Library

[45]

Knijf, J. D. 2007. Fat-cat: Frequent attributes tree based classification. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 485--496.

[46]

Kuboyama, T., Hirata, K., Kashima, H., Aoki-Kinoshita, K., and Yasuda, H. 2007. Information and media technologies. Ann. Stat. 2, 1, 292--299.

[47]

Kurt, A. and Tozal, E. 2006. Classification of xslt-generated Web documents with support vector machines. In Proceedings of the International Workshop on Knowledge Discovery from XML Documents. 33--42.

Digital Library

[48]

Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of the IEEE International Conference on Data Mining. 369--376.

Digital Library

[49]

Lian, W., Cheung, D., Mamoulis, N., and Yiu, S.-M. 2004. An efficient and scalable algorithm for clustering xml documents by structure. IEEE Trans. Knowl. Data Eng. 16, 1, 82--96.

Digital Library

[50]

Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proceedings of the ACM SIGKDD International Conference on Kwnoledge Discovery and Data Mining. 80--86.

[51]

Liu, B., Ma, Y., and Wong, C. 2000. Improving an association rule based classifier. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery. 504--509.

Digital Library

[52]

Manning, C., Raghavan, P., and Schütze., H. 2008. Introduction to Information Retrieval. Cambridge University Press.

Digital Library

[53]

Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York, NY.

Digital Library

[54]

Murugeshan, M., Lakshmi, K., and Mukherjee, S. 2007. A categorization approach for Wikipedia collection based on negative category information and initial descriptions. In Proceedings of the INitiative for the Evaluation of XML Retrieval (INEX’07). 212--214.

[55]

Ning, P., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley, Boston, MA.

[56]

Pearl, J. 1998. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan and Kaufmann, Burlington, MA.

Digital Library

[57]

Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.

[58]

Quinlan, J. and Cameron-Jones, R. 1993. FOIL: A midterm report. In Proceedings of the European Conference on Machine Learning. 3--20.

Digital Library

[59]

Schölkopf, B. and Smola, A. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA.

Digital Library

[60]

Steinwart, I. and Christmann, A. 2008. Support Vector Machines. Springer, Berlin.

Digital Library

[61]

Suzuki, J. and Isozaki, H. 2006. Sequence and tree kernels with statistical feature mining. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1321--1328.

[62]

Thabtah, F. 2007. A review of associative classification mining. Knowl. Eng. Rev. 22, 1, 37--65.

Digital Library

[63]

Theobald, M., Schenkel, R., and Weikum, G. 2003. Exploiting structure, annotation, and ontological knowledge for automatic classification of xml data. In Proceedings of the WebDB Workshop. 1--6.

[64]

Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons, Hoboken, NJ.

[65]

Vishwanathan, S. and Smola, A. 2002. Fast kernels for string and tree matching. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 569--576.

[66]

Walpole, R., Myers, R., Myers, S., and Ye, K. 2011. Probability & Statistics forEngineers & Scientists. Prentice Hall, Upper Saddle River, NJ.

[67]

Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., and Shi, B. 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 441--451.

[68]

Wang, J. and Karypis, G. 2005. HARMONY: Efficiently mining the best rules for classification. In Proceedings of the SIAM International Conference on Data Mining. 205--216.

[69]

Wolpert, D. 1992. Staked generalization. Neural Netw. 5, 241--259.

Digital Library

[70]

Wu, J. and Tang, J. 2008. A bottom-up approach for xml documents classification. In Proceedings of the International Symposium on Database Engineering and Applications. 131--137.

Digital Library

[71]

Xing, G., Guo, J., and Xia, Z. 2007. Classifying xml documents based on structure/content similarity. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 444--457.

[72]

Yang, J. and Chen, X. 2002. A semi-structured document model for text mining. J. Comput. Sci. Techol. 17, 5, 603--610.

Digital Library

[73]

Yang, J. and Wang, S. 2010. Extended vsm for xml document classification using frequent subtrees. In Focused Retrieval and Evaluation, Proceedings of the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, S. Geva, J. Kamps, and A. Trotman Eds., Springer, 441--448.

Digital Library

[74]

Yang, J. and Zhang, F. 2008. Xml document classification using extended vsm. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 234--244.

Digital Library

[75]

Yi, J. and Sundaresan, N. 2000. A classifier for semi-structured documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 340--344.

Digital Library

[76]

Yin, X. and Han, J. 2003. CPAR: Classification based on predictive association rules. In Proceedings of the SIAM International Conference on Data Mining. 331--335.

[77]

Zaki, M. 2005. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Knowl. Data Eng. 17, 8, 1021--1035.

Digital Library

[78]

Zaki, M. and Aggarwal, C. 2006. Xrules: An effective algorithm for structural classification of xml data. Mach. Learn. 62, 1--2, 137--170.

Digital Library

Cited By

Shombot EDusserre GBestak RAhmed N(2024)An application for predicting phishing attacks: A case of implementing a support vector machine learning modelCyber Security and Applications10.1016/j.csa.2024.1000362(100036)Online publication date: 2024
https://doi.org/10.1016/j.csa.2024.100036
Li JShi JLiu ZFeng C(2023)A parallel and balanced SVM algorithm on spark for data-intensive computingIntelligent Data Analysis10.3233/IDA-22677427:4(1065-1086)Online publication date: 20-Jul-2023
https://doi.org/10.3233/IDA-226774
Zhang JLi XWang L(2023)A Review Selection Method Based on Consumer Decision Phases in E-commerceACM Transactions on Information Systems10.1145/358726542:1(1-27)Online publication date: 21-Aug-2023
https://dl.acm.org/doi/10.1145/3587265
Show More Cited By

Index Terms

X-Class: Associative Classification of XML Documents by Structure
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Markup languages
2. Information systems

Recommendations

A weighted common structure based clustering technique for XML documents

XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data ...
Semantic clustering of XML documents

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically ...
A Novel Top-Down Algorithm of Frequent XML Query Pattern Mining
ICCEA '10: Proceedings of the 2010 Second International Conference on Computer Engineering and Applications - Volume 02

Providing efficient query to XML documents is crucial, as XML has become the most important technique to exchange data in WWW. Due to its tree-structure paradigm, XML is superior for its capability of storing and querying complex data. Therefore, ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems

ACM Transactions on Information Systems Volume 31, Issue 1

January 2013

163 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2414782

Issue’s Table of Contents

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2013

Accepted: 01 October 2012

Revised: 01 August 2012

Received: 01 January 2011

Published in TOIS Volume 31, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

21
Total Citations
View Citations
525
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)0

Reflects downloads up to 26 Jul 2024

Other Metrics

View Author Metrics

Citations

Cited By

Shombot EDusserre GBestak RAhmed N(2024)An application for predicting phishing attacks: A case of implementing a support vector machine learning modelCyber Security and Applications10.1016/j.csa.2024.1000362(100036)Online publication date: 2024
https://doi.org/10.1016/j.csa.2024.100036
Li JShi JLiu ZFeng C(2023)A parallel and balanced SVM algorithm on spark for data-intensive computingIntelligent Data Analysis10.3233/IDA-22677427:4(1065-1086)Online publication date: 20-Jul-2023
https://doi.org/10.3233/IDA-226774
Zhang JLi XWang L(2023)A Review Selection Method Based on Consumer Decision Phases in E-commerceACM Transactions on Information Systems10.1145/358726542:1(1-27)Online publication date: 21-Aug-2023
https://dl.acm.org/doi/10.1145/3587265
Costa GForestiero AOrtale R(2023)Rule-Based Detection of Anomalous Patterns in Device Behavior for Explainable IoT SecurityIEEE Transactions on Services Computing10.1109/TSC.2023.332782216:6(4514-4525)Online publication date: Dec-2023
https://doi.org/10.1109/TSC.2023.3327822
Rajab K(2019)New Associative Classification Method Based on Rule Pruning for Classification of DatasetsIEEE Access10.1109/ACCESS.2019.29503747(157783-157795)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2019.2950374
Costa GOrtale R(2018)Machine learning techniques for XML (co-)clustering by structure-constrained phrasesInformation Retrieval10.1007/s10791-017-9314-x21:1(24-55)Online publication date: 1-Feb-2018
https://dl.acm.org/doi/10.1007/s10791-017-9314-x
Almasi MSaniee Abadeh M(2018)A new MapReduce associative classifier based on a new storage format for large-scale imbalanced dataCluster Computing10.1007/s10586-018-2812-921:4(1821-1847)Online publication date: 1-Dec-2018
https://dl.acm.org/doi/10.1007/s10586-018-2812-9
Costa GOrtale R(2017)XML Clustering by Structure-Constrained Phrases: A Fully-Automatic Approach Using Contextualized N-GramsInternational Journal on Artificial Intelligence Tools10.1142/S021821301760002826:01(1760002)Online publication date: Mar-2017
https://doi.org/10.1142/S0218213017600028
Bechini AMarcelloni FSegatori A(2016)A MapReduce solution for associative classification of big dataInformation Sciences: an International Journal10.1016/j.ins.2015.10.041332:C(33-55)Online publication date: 1-Mar-2016
https://dl.acm.org/doi/10.1016/j.ins.2015.10.041
Costa GOrtale R(2016)Structure-Oriented Techniques for XML Document PartitioningNovel Applications of Intelligent Systems10.1007/978-3-319-14194-7_9(167-182)Online publication date: 28-Jan-2016
https://doi.org/10.1007/978-3-319-14194-7_9
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents