research-article

Top-k Tree Similarity Join

Authors:

Wenjie Zhang

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Pages 1939 - 1948

https://doi.org/10.1145/3459637.3482304

Published: 30 October 2021

Abstract

Tree similarity join is useful for analyzing tree-structured data. The traditional threshold-based tree similarity join requires users to specify a similarity threshold, which is usually difficult to choose in advance. To remedy this issue, we advocate the problem of top-k tree similarity join. Given a collection of trees and a parameter k, the top-k tree similarity join finds the k tree pairs with the minimum tree edit distance (TED). Although we show that this problem can be solved by utilizing the threshold-based join, the efficiency of that approach is unsatisfactory. In this paper, we propose an efficient algorithm, namely TopKTJoin, which generates candidate tree pairs incrementally using an inverted index. We also derive a TED lower bound for the unseen tree pairs. Comparing this bound with the TED value of the k-th best join result seen so far allows the algorithm to terminate early without missing any correct results. To further improve efficiency, we propose two optimization techniques, one for the index structure and one for the verification mechanism. We conduct comprehensive performance studies on real and synthetic datasets. The experimental results demonstrate that TopKTJoin significantly outperforms the baseline method.
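
The early-termination idea in the abstract can be illustrated with a minimal Python sketch. It is not TopKTJoin: it uses no inverted index, and it substitutes the well-known size-difference bound (under unit costs, TED is at least the difference in node counts) for the lower bound derived in the paper. The nested-tuple tree representation and the names `size`, `forest_dist`, `ted`, and `topk_join` are assumptions made for this example only.

```python
import heapq
from functools import lru_cache
from itertools import combinations

# Assumed representation: a tree is (label, children), children is a tuple of trees.

def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)   # insert every node of g
    if not g:
        return sum(size(t) for t in f)   # delete every node of f
    (lv, cv), (lw, cw) = f[-1], g[-1]    # rightmost trees of each forest
    return min(
        forest_dist(f[:-1] + cv, g) + 1,  # delete the rightmost root of f
        forest_dist(f, g[:-1] + cw) + 1,  # insert the rightmost root of g
        # map the two rightmost roots onto each other (rename if labels differ)
        forest_dist(f[:-1], g[:-1]) + forest_dist(cv, cw) + (lv != lw),
    )

def ted(t1, t2):
    """Tree edit distance between two ordered labeled trees."""
    return forest_dist((t1,), (t2,))

def topk_join(trees, k):
    """Return (k closest pairs as (distance, i, j), number of TED computations)."""
    sizes = [size(t) for t in trees]
    # Candidates in nondecreasing order of the size-difference lower bound.
    candidates = sorted(
        (abs(sizes[i] - sizes[j]), i, j)
        for i, j in combinations(range(len(trees)), 2)
    )
    heap = []       # max-heap via negated distances; holds the k best pairs so far
    verified = 0
    for lower_bound, i, j in candidates:
        # All remaining pairs have TED >= lower_bound, so none of them can
        # beat the current k-th best result: stop early.
        if len(heap) == k and lower_bound >= -heap[0][0]:
            break
        d = ted(trees[i], trees[j])
        verified += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, j))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i, j))
    return sorted((-neg_d, i, j) for neg_d, i, j in heap), verified
```

Calling `topk_join(trees, k)` on a list of such tuples returns the k closest pairs together with the number of exact TED computations performed, which makes the effect of early termination visible. When several pairs tie at the k-th best distance, the sketch returns any valid choice among them.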

Index Terms

Top-k Tree Similarity Join
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis

Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

October 2021

4966 pages

ISBN: 9781450384469

DOI: 10.1145/3459637

General Chairs:
Gianluca Demartini, The University of Queensland, Australia
Guido Zuccon, The University of Queensland, Australia

Program Chairs:
J. Shane Culpepper, RMIT University, Australia
Zi Huang, The University of Queensland, Australia
Hanghang Tong, University of Illinois at Urbana-Champaign, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Australian Research Council Discovery Project
National Natural Science Foundation of China

Conference

CIKM '21

CIKM '21: The 30th ACM International Conference on Information and Knowledge Management

November 1 - 5, 2021

Queensland, Virtual Event, Australia

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
