DOI: 10.1145/2539150.2539260

An Evaluation of Similarity Search Methods Blending Structures and Keywords in XML Documents

Published: 02 December 2013

Abstract

Over the past few years, hundreds of XML-based document formats have appeared; office documents are a typical example. At the same time, demands on document search have grown and become more complex, since users need similarity search as well as keyword search. In our previous work, we proposed LAX+, an algorithm for measuring the similarity between XML trees. However, LAX+ performs rigid matching at the leaf-nodes of XML trees. In this paper, we propose two methods: KLAX and LAX&KEY. To measure the similarity between leaf-nodes more precisely, KLAX improves LAX+ by checking the number of common keywords in the leaf-nodes. LAX&KEY separately measures the similarity between XML trees with LAX+ and the similarity of common keywords in the XML trees, and then combines the two values. In our experiments with docx, xlsx, and pptx files, the proposed methods yield better precision and recall.
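The abstract does not give the formulas behind KLAX or LAX&KEY, so the following is only a minimal sketch of the two ideas it describes: replacing an exact leaf-node comparison with a keyword-overlap score, and combining a structural score with a keyword score. The Jaccard overlap, the naive tokenizer, the weighted-sum combination, the `alpha` parameter, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import re


def keywords(text):
    """Return the set of lowercased word tokens in a leaf node's text.

    A naive tokenizer used only for illustration.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rigid_leaf_match(a, b):
    """Exact comparison: 1.0 only when the two leaf texts are identical."""
    return 1.0 if a == b else 0.0


def keyword_leaf_similarity(a, b):
    """Graded comparison based on common keywords (Jaccard overlap).

    Assumed measure; the paper's actual keyword score may differ.
    """
    ka, kb = keywords(a), keywords(b)
    if not ka and not kb:
        return 1.0  # two empty leaves are treated as identical
    return len(ka & kb) / len(ka | kb)


def combine(structure_sim, keyword_sim, alpha=0.5):
    """Blend a structural score and a keyword score with a weighted sum.

    Both the weighted sum and alpha are hypothetical choices.
    """
    return alpha * structure_sim + (1 - alpha) * keyword_sim


if __name__ == "__main__":
    a = "quarterly sales report"
    b = "annual sales report"
    print(rigid_leaf_match(a, b))         # exact match fails on any difference
    print(keyword_leaf_similarity(a, b))  # partial credit for shared keywords
    print(combine(0.8, keyword_leaf_similarity(a, b)))
```

Under these assumptions, two leaves that share most of their words receive partial credit instead of being treated as a complete mismatch, which is the limitation of rigid leaf matching that the abstract points to.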

    Published In

    IIWAS '13: Proceedings of International Conference on Information Integration and Web-based Applications & Services
    December 2013
    753 pages
    ISBN:9781450321136
    DOI:10.1145/2539150

    In-Cooperation

    • @WAS: International Organization of Information Integration and Web-based Applications and Services

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Keyword Similarity
    2. OOXML
    3. XML Similarity

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    IIWAS '13
