DOI: 10.1145/3459637.3482304
research-article

Top-k Tree Similarity Join

Published: 30 October 2021

Abstract

Tree similarity join is useful for analyzing tree-structured data. The traditional threshold-based tree similarity join requires users to supply a similarity threshold, which is usually difficult to choose. To remedy this issue, we study the problem of top-k tree similarity join. Given a collection of trees and a parameter k, the top-k tree similarity join finds the k tree pairs with the minimum tree edit distance (TED). Although we show that this problem can be solved by utilizing the threshold-based join, the efficiency is unsatisfactory. In this paper, we propose an efficient algorithm, namely TopKTJoin, which generates candidate tree pairs incrementally using an inverted index. We also derive a TED lower bound for the unseen tree pairs. Combined with the TED value of the k-th best join result seen so far, this bound allows the algorithm to terminate early without missing any correct results. To further improve efficiency, we propose two optimization techniques targeting the index structure and the verification mechanism. We conduct comprehensive performance studies on real and synthetic datasets. The experimental results demonstrate that TopKTJoin significantly outperforms the baseline method.
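
The early-termination idea in the abstract can be illustrated with a minimal Python sketch. This is not TopKTJoin itself: the abstract does not specify the inverted index or the lower bound, so the sketch substitutes a simple size-difference bound (with unit-cost edits, TED is at least the difference in node counts) and a plain memoized recursion for TED on small ordered trees. The names `size`, `forest_dist`, `ted`, and `top_k_join`, and the tree encoding `(label, (child, ...))`, are assumptions made for illustration.

```python
import heapq
from functools import lru_cache

# A tree is (label, children) where children is a tuple of trees (assumed encoding).

def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f:
        return sum(size(t) for t in g)   # insert everything in g
    if not g:
        return sum(size(t) for t in f)   # delete everything in f
    (fl, fc), (gl, gc) = f[-1], g[-1]    # rightmost roots and their children
    return min(
        forest_dist(f[:-1] + fc, g) + 1,                       # delete root in f
        forest_dist(f, g[:-1] + gc) + 1,                       # insert root of g
        forest_dist(fc, gc) + forest_dist(f[:-1], g[:-1]) + (fl != gl),  # match/rename
    )

def ted(t1, t2):
    return forest_dist((t1,), (t2,))

def top_k_join(trees, k):
    """Return (k closest pairs as (dist, tree, tree), number of TED computations)."""
    order = sorted(trees, key=size)
    sizes = [size(t) for t in order]
    n = len(order)
    # Frontier of candidate pairs keyed by the size-difference lower bound.
    # For fixed i the bound grows with j, so the heap top bounds all unseen pairs.
    frontier = [(sizes[i + 1] - sizes[i], i, i + 1) for i in range(n - 1)]
    heapq.heapify(frontier)
    best = []        # max-heap of (-dist, i, j) holding the current top-k
    verified = 0
    while frontier:
        lb, i, j = heapq.heappop(frontier)
        if len(best) == k and lb >= -best[0][0]:
            break    # no unseen pair can beat the k-th best result
        d = ted(order[i], order[j])
        verified += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i, j))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i, j))
        if j + 1 < n:
            heapq.heappush(frontier, (sizes[j + 1] - sizes[i], i, j + 1))
    result = sorted(((-nd, order[i], order[j]) for nd, i, j in best),
                    key=lambda r: r[0])
    return result, verified

# Small usage example with four hand-made trees.
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()), ("d", ())))
t3 = ("a", (("b", ()),))
t4 = ("x", (("y", (("z", ()), ("w", ()), ("v", ()))), ("u", ()), ("t", ())))
pairs, verified = top_k_join([t1, t2, t3, t4], k=2)
```

In this example the loop stops after verifying only some of the six possible pairs, because the smallest remaining lower bound already matches the k-th best distance found. A tighter bound, such as one based on label histograms, would prune more aggressively; the size bound is used here only because it is easy to verify.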


Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN: 9781450384469
DOI: 10.1145/3459637
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. incremental algorithm
  2. inverted index
  3. label histogram
  4. tree edit distance
  5. tree similarity join

Qualifiers

  • Research-article

Conference

CIKM '21
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
