Efficient Computation of the Tree Edit Distance

Published: 25 March 2015

Abstract

We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity but the worst case happens frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms.
In this article, we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller than or equal to the complexity of the best competitors for any input instance, that is, our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown to be optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in the literature. In our experiments on synthetic and real-world data, we empirically evaluate our solution and compare it to the state of the art.
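
To make the definition above concrete, the following is a minimal sketch of the textbook recursive formulation of the tree edit distance, assuming unit costs for delete, insert, and rename. It is not RTED and does not use any decomposition strategy from the article: it always decomposes at the rightmost root and is only meant for very small trees. The names `node`, `size`, `forest_distance`, and `tree_edit_distance` are hypothetical helpers introduced for illustration.

```python
from functools import lru_cache

# Assumption: a tree is a (label, children) pair, where children is a tuple
# of trees, and a forest is a tuple of trees. Tuples are hashable, so
# subproblems can be memoized directly.

def node(label, *children):
    """Build an ordered labelled tree."""
    return (label, tuple(children))

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g; its children take its place.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Map the two rightmost roots onto each other (free if labels match).
    rename = (forest_distance(f[:-1], g[:-1])
              + forest_distance(v_children, w_children)
              + (0 if v_label == w_label else 1))
    return min(delete, insert, rename)

def tree_edit_distance(t1, t2):
    """Edit distance between two trees, treated as one-tree forests."""
    return forest_distance((t1,), (t2,))

if __name__ == "__main__":
    t1 = node("f", node("d", node("a"), node("c", node("b"))), node("e"))
    t2 = node("f", node("c", node("d", node("a"), node("b"))), node("e"))
    print(tree_edit_distance(t1, t2))
```

The sketch fixes one decomposition direction for every input. The abstract's point is that the choice of decomposition matters for runtime, which is why RTED computes a strategy that depends on the input trees.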

Published In

ACM Transactions on Database Systems  Volume 40, Issue 1
March 2015
260 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2751312
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2015
Accepted: 01 September 2014
Revised: 01 May 2014
Received: 01 December 2013
Published in TODS Volume 40, Issue 1

Author Tags

  1. Approximate matching
  2. similarity search
  3. tree edit distance

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • RARE project funded by the Autonomous Province of Bolzano-South Tyrol, Italy
  • SyRA project of the Free University of Bozen-Bolzano, Italy

Article Metrics

  • Downloads (last 12 months): 114
  • Downloads (last 6 weeks): 15
Reflects downloads up to 25 Oct 2024

Cited By

  • (2024) Towards a Robust Waiting Strategy for Web GUI Testing for an Industrial Software System. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2065-2076. DOI: 10.1145/3691620.3695269. Online publication date: 27-Oct-2024
  • (2024) Effective, Platform-Independent GUI Testing via Image Embedding and Reinforcement Learning. ACM Transactions on Software Engineering and Methodology 33:7, 1-27. DOI: 10.1145/3674728. Online publication date: 21-Jun-2024
  • (2024) Improving Quality and Domain-Relevancy of Paraphrase Generation with Graph-Based Retrieval Augmented Generation. Proceedings of the 2024 10th International Conference on Computing and Artificial Intelligence, 196-208. DOI: 10.1145/3669754.3669784. Online publication date: 26-Apr-2024
  • (2024) Semi-supervised Crowdsourced Test Report Clustering via Screenshot-Text Binding Rules. Proceedings of the ACM on Software Engineering 1:FSE, 1540-1563. DOI: 10.1145/3660776. Online publication date: 12-Jul-2024
  • (2024) Enhancing Text-to-SQL Translation for Financial System Design. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 252-262. DOI: 10.1145/3639477.3639732. Online publication date: 14-Apr-2024
  • (2024) Mccatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1407-1420. DOI: 10.1109/ICDE60146.2024.00116. Online publication date: 13-May-2024
  • (2024) Parameter Efficient Diverse Paraphrase Generation Using Sequence-Level Knowledge Distillation. 2024 5th International Conference on Advancements in Computational Sciences (ICACS), 1-12. DOI: 10.1109/ICACS60934.2024.10473289. Online publication date: 19-Feb-2024
  • (2023) LinChemIn: SynGraph—a data model and a toolkit to analyze and compare synthetic routes. Journal of Cheminformatics 15:1. DOI: 10.1186/s13321-023-00714-y. Online publication date: 1-Apr-2023
  • (2023) NodeGit: Diffing and Merging Node Graphs. ACM Transactions on Graphics 42:6, 1-12. DOI: 10.1145/3618343. Online publication date: 5-Dec-2023
  • (2023) Clustering for the Analysis and Enrichment of Corpus of Images for the Spatio-temporal Monitoring of Restoration Sites. Proceedings of the 5th Workshop on analySis, Understanding and proMotion of heritAge Contents, 39-47. DOI: 10.1145/3607542.3617353. Online publication date: 2-Nov-2023
