Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

X-Class: Associative Classification of XML Documents by Structure

Published: 01 January 2013 Publication History
  • Get Citation Alerts
  • Abstract

    The supervised classification of XML documents by structure involves learning predictive models in which certain structural regularities discriminate the individual document classes. Hitherto, research has focused on the adoption of prespecified substructures. This is detrimental for classification effectiveness, since the a priori chosen substructures may not accord with the structural properties of the XML documents. Therein, an unexplored question is how to choose the type of structural regularity that best adapts to the structures of the available XML documents.
    We tackle this problem through X-Class, an approach that handles all types of tree-like substructures and allows for choosing the most discriminatory one. Algorithms are designed to learn compact rule-based classifiers in which the chosen substructures discriminate the classes of XML documents.
    X-Class is studied across various domains and types of substructures. Its classification performance is compared against several rule-based and SVM-based competitors. Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors. In particular, certain substructures allow the induction of very compact classifiers that generally outperform the rule-based competitors in terms of effectiveness over all chosen corpora of XML data. Furthermore, such classifiers are substantially as effective as the SVM-based competitor, with the additional advantage of a high-degree of interpretability.

    References

    [1]
    Aggarwal, C., Ta, N., Wang, J., Feng, J., and Zaki, M. 2007. Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 46--55.
    [2]
    Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases. 487--499.
    [3]
    Aiolli, F., Da San Martino, G., and Sperduti, A. 2009. Route kernels for trees. In Proceedings of the International Conference on Machine Learning. 17--24.
    [4]
    Amer-Yahia, S. and Lalmas, M. 2006. Xml search: Languages, inex and scoring. ACM SIGMOD Record 35, 4, 16--23.
    [5]
    Antonie, M.-L. and Zaïane, O. 2002. Text document categorization by term association. In Proceedings of the IEEE International Conference on Data Mining. 19--26.
    [6]
    Antonie, M.-L. and Zaïane, O. 2004. An associative classifier based on positive and negative rules. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. 64--69.
    [7]
    Arunasalam, B. and Chawla, S. 2006. CCCS: A top-down association classifier for imbalanced class distribution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 517--522.
    [8]
    Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Boston, MA.
    [9]
    Baeza-Yates, R., Fuhr, N., and Andamaarek, Y. 2006. Special issue on xml retrieval. ACM Trans. Inform. Syst. 24, 4.
    [10]
    Baker, L. and McCallum, A. 1998. Distributional clustering of words for text classification. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 96--103.
    [11]
    Bennet, K. and Campbell, C. 2000. Support vector machines: Hype or hallelujah. SIGKDD Explo. 2, 2, 1--13.
    [12]
    Bratko, A. and Filipic̆, B. 2006. Exploiting structural information for semi-structured document categorization. Inf. Process. Manage. 42, 3, 679--694.
    [13]
    Brutlag, J. and Meek, C. 2000. Challenges of the email domain for text classification. In Proceedings of the International Conference on Machine Learning. 103--110.
    [14]
    Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discovery 2, 2, 121--167.
    [15]
    Candillier, L., Tellier, I., and Torre, F. 2006. Transforming xml trees for efficient classification and clustering. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 469--480.
    [16]
    Candillier, L., Tellier, I., and Torre, F. 2007. Xml document mining using graph neural network. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 458--472.
    [17]
    Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
    [18]
    Cheng, H., Yan, X., Han, J., and Hsu, C.-W. 2007. Discriminative frequent pattern analysis for effective classification. In Proceedings of the International Conference on Data Engineering. 716--725.
    [19]
    Coenen, F. 2004. LUCS KDD implementations of CBA and CPAR. Department of Computer Science, The University of Liverpool. www.csc.liv.ac.uk/~frans/KDD/Software/.
    [20]
    Collins, M. and Duffy, N. 2001. Convolution kernels for natural language. In Proceedings of the Conference on Neural Information Processing Systems. 625--632.
    [21]
    Collins, M. and Duffy, N. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the Annual Meeting on Association for Computational Linguistics. 263--270.
    [22]
    Cong, G., Xu, X., Pan, F., Tung, A., and Yang, J. 2004. FARMER: Finding interesting rule groups in microarray datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 123--126.
    [23]
    Costa, G. and Ortale, R. 2012. On effective xml clustering by path commonality: An efficient and scalable algorithm. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 389--396.
    [24]
    Costa, G., Manco, G., Ortale, R., and Tagarelli, A. 2004. A tree-based approach to clustering xml documents by structure. In Proceedings of the International Conference on Principles and Practice of Knowledge Discovery in Databases. 137--148.
    [25]
    Costa, G., Ortale, R., and Ritacco, E. 2011a. Effective xml classification using content and structural information via rule learning. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 102--109.
    [26]
    Costa, G., Ortale, R., and Ritacco, E. 2011b. A transactional approach to associative xml classification by content and structure. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. 104--113.
    [27]
    Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Databases. 109--118.
    [28]
    Cristianini, N. and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
    [29]
    Da San Martino, G. 2009. Kernel methods for tree structured data. Ph.D. dissertation, University of Bologna, Padova.
    [30]
    Dalamagas, T., Cheng, T., Winkel, K.-J., and Sellis, T. 2006. A methodology for clustering xml documents by structure. Inf. Syst. 31, 3, 187--228.
    [31]
    de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A. 2008. Probabilistic methods for structured document classification at inex’07. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 195--206.
    [32]
    Denoyer, L. and Gallinari, P. 2007. Report on the xml mining track at inex 2005 and inex 2006: Categorization and clustering of xml documents. ACM SIGIR Forum 41, 1, 79--90.
    [33]
    Denoyer, L. and Gallinari, P. 2008. Report on the xml mining track at inex 2007: Categorization and clustering of xml documents. ACM SIGIR Forum 42, 1, 22--28.
    [34]
    Flesca, S., Manco, G., Masciari, E., Pontieri, L., and Pugliese, A. 2005. Fast detection of xml structural similarity. IEEE Trans. Knowl. Data Eng. 17, 2, 160--175.
    [35]
    Fox, C. 1992. Lexical Analysis and Stoplists. Prentice Hall, Upper Saddle River, NJ.
    [36]
    Garboni, C., Masseglia, F., and Trousse, B. 2006. Sequential pattern mining for structure-based xml document classification. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 458--468.
    [37]
    Han, J. and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12.
    [38]
    Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California at Santa Cruz.
    [39]
    Hearst, M. 1998. Trends & controversies: Support vector machines. IEEE Intell. Syst. 13, 4, 18--28.
    [40]
    Helmer, S. 2007. Measuring the structural similarity of semistructured documents using entropy. In Proceedings of the International Conference on Very Large Data Base. 1022--1032.
    [41]
    Hofmann, T., Schölkopf, B., and Smola, A. 2008. Kernel methods in machine learning. Ann. Stat. 36, 3, 1171--1220.
    [42]
    Ikeda, K. 2006. Effects of kernel function on support vector machines in extreme cases. IEEE Trans. Neural Netw. 17, 1, 1--9.
    [43]
    Jaakkola, T. and Haussler, D. 1999. Exploiting generative models in discriminative classifiers. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 487--493.
    [44]
    Kashima, H. and Koyanagi, T. 2002. Kernels for semi-structured data. In Proceedings of the International Conference on Machine Learning. 291--298.
    [45]
    Knijf, J. D. 2007. Fat-cat: Frequent attributes tree based classification. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 485--496.
    [46]
    Kuboyama, T., Hirata, K., Kashima, H., Aoki-Kinoshita, K., and Yasuda, H. 2007. Information and media technologies. Ann. Stat. 2, 1, 292--299.
    [47]
    Kurt, A. and Tozal, E. 2006. Classification of xslt-generated Web documents with support vector machines. In Proceedings of the International Workshop on Knowledge Discovery from XML Documents. 33--42.
    [48]
    Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of the IEEE International Conference on Data Mining. 369--376.
    [49]
    Lian, W., Cheung, D., Mamoulis, N., and Yiu, S.-M. 2004. An efficient and scalable algorithm for clustering xml documents by structure. IEEE Trans. Knowl. Data Eng. 16, 1, 82--96.
    [50]
    Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proceedings of the ACM SIGKDD International Conference on Kwnoledge Discovery and Data Mining. 80--86.
    [51]
    Liu, B., Ma, Y., and Wong, C. 2000. Improving an association rule based classifier. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery. 504--509.
    [52]
    Manning, C., Raghavan, P., and Schütze., H. 2008. Introduction to Information Retrieval. Cambridge University Press.
    [53]
    Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York, NY.
    [54]
    Murugeshan, M., Lakshmi, K., and Mukherjee, S. 2007. A categorization approach for Wikipedia collection based on negative category information and initial descriptions. In Proceedings of the INitiative for the Evaluation of XML Retrieval (INEX’07). 212--214.
    [55]
    Ning, P., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley, Boston, MA.
    [56]
    Pearl, J. 1998. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan and Kaufmann, Burlington, MA.
    [57]
    Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
    [58]
    Quinlan, J. and Cameron-Jones, R. 1993. FOIL: A midterm report. In Proceedings of the European Conference on Machine Learning. 3--20.
    [59]
    Schölkopf, B. and Smola, A. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA.
    [60]
    Steinwart, I. and Christmann, A. 2008. Support Vector Machines. Springer, Berlin.
    [61]
    Suzuki, J. and Isozaki, H. 2006. Sequence and tree kernels with statistical feature mining. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1321--1328.
    [62]
    Thabtah, F. 2007. A review of associative classification mining. Knowl. Eng. Rev. 22, 1, 37--65.
    [63]
    Theobald, M., Schenkel, R., and Weikum, G. 2003. Exploiting structure, annotation, and ontological knowledge for automatic classification of xml data. In Proceedings of the WebDB Workshop. 1--6.
    [64]
    Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons, Hoboken, NJ.
    [65]
    Vishwanathan, S. and Smola, A. 2002. Fast kernels for string and tree matching. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 569--576.
    [66]
    Walpole, R., Myers, R., Myers, S., and Ye, K. 2011. Probability & Statistics forEngineers & Scientists. Prentice Hall, Upper Saddle River, NJ.
    [67]
    Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., and Shi, B. 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 441--451.
    [68]
    Wang, J. and Karypis, G. 2005. HARMONY: Efficiently mining the best rules for classification. In Proceedings of the SIAM International Conference on Data Mining. 205--216.
    [69]
    Wolpert, D. 1992. Staked generalization. Neural Netw. 5, 241--259.
    [70]
    Wu, J. and Tang, J. 2008. A bottom-up approach for xml documents classification. In Proceedings of the International Symposium on Database Engineering and Applications. 131--137.
    [71]
    Xing, G., Guo, J., and Xia, Z. 2007. Classifying xml documents based on structure/content similarity. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 444--457.
    [72]
    Yang, J. and Chen, X. 2002. A semi-structured document model for text mining. J. Comput. Sci. Techol. 17, 5, 603--610.
    [73]
    Yang, J. and Wang, S. 2010. Extended vsm for xml document classification using frequent subtrees. In Focused Retrieval and Evaluation, Proceedings of the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, S. Geva, J. Kamps, and A. Trotman Eds., Springer, 441--448.
    [74]
    Yang, J. and Zhang, F. 2008. Xml document classification using extended vsm. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 234--244.
    [75]
    Yi, J. and Sundaresan, N. 2000. A classifier for semi-structured documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 340--344.
    [76]
    Yin, X. and Han, J. 2003. CPAR: Classification based on predictive association rules. In Proceedings of the SIAM International Conference on Data Mining. 331--335.
    [77]
    Zaki, M. 2005. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Knowl. Data Eng. 17, 8, 1021--1035.
    [78]
    Zaki, M. and Aggarwal, C. 2006. Xrules: An effective algorithm for structural classification of xml data. Mach. Learn. 62, 1--2, 137--170.

    Cited By

    View all
    • (2024)An application for predicting phishing attacks: A case of implementing a support vector machine learning modelCyber Security and Applications10.1016/j.csa.2024.1000362(100036)Online publication date: 2024
    • (2023)A parallel and balanced SVM algorithm on spark for data-intensive computingIntelligent Data Analysis10.3233/IDA-22677427:4(1065-1086)Online publication date: 20-Jul-2023
    • (2023)A Review Selection Method Based on Consumer Decision Phases in E-commerceACM Transactions on Information Systems10.1145/358726542:1(1-27)Online publication date: 21-Aug-2023
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Information Systems
    ACM Transactions on Information Systems  Volume 31, Issue 1
    January 2013
    163 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2414782
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2013
    Accepted: 01 October 2012
    Revised: 01 August 2012
    Received: 01 January 2011
    Published in TOIS Volume 31, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Structural XML classification
    2. XML mining
    3. XML transactional modeling

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)5
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 26 Jul 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)An application for predicting phishing attacks: A case of implementing a support vector machine learning modelCyber Security and Applications10.1016/j.csa.2024.1000362(100036)Online publication date: 2024
    • (2023)A parallel and balanced SVM algorithm on spark for data-intensive computingIntelligent Data Analysis10.3233/IDA-22677427:4(1065-1086)Online publication date: 20-Jul-2023
    • (2023)A Review Selection Method Based on Consumer Decision Phases in E-commerceACM Transactions on Information Systems10.1145/358726542:1(1-27)Online publication date: 21-Aug-2023
    • (2023)Rule-Based Detection of Anomalous Patterns in Device Behavior for Explainable IoT SecurityIEEE Transactions on Services Computing10.1109/TSC.2023.332782216:6(4514-4525)Online publication date: Dec-2023
    • (2019)New Associative Classification Method Based on Rule Pruning for Classification of DatasetsIEEE Access10.1109/ACCESS.2019.29503747(157783-157795)Online publication date: 2019
    • (2018)Machine learning techniques for XML (co-)clustering by structure-constrained phrasesInformation Retrieval10.1007/s10791-017-9314-x21:1(24-55)Online publication date: 1-Feb-2018
    • (2018)A new MapReduce associative classifier based on a new storage format for large-scale imbalanced dataCluster Computing10.1007/s10586-018-2812-921:4(1821-1847)Online publication date: 1-Dec-2018
    • (2017)XML Clustering by Structure-Constrained Phrases: A Fully-Automatic Approach Using Contextualized N-GramsInternational Journal on Artificial Intelligence Tools10.1142/S021821301760002826:01(1760002)Online publication date: Mar-2017
    • (2016)A MapReduce solution for associative classification of big dataInformation Sciences: an International Journal10.1016/j.ins.2015.10.041332:C(33-55)Online publication date: 1-Mar-2016
    • (2016)Structure-Oriented Techniques for XML Document PartitioningNovel Applications of Intelligent Systems10.1007/978-3-319-14194-7_9(167-182)Online publication date: 28-Jan-2016
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media