Local Similarity Search for Unstructured Text

Authors: Xiaoyang Zhang, Yoshiharu Ishikawa

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 1991 - 2005

https://doi.org/10.1145/2882903.2915211

Published: 26 June 2016

Abstract

As electronic documents grow in popularity, text is replicated for many reasons: people copy text segments from various sources and then modify them. In this paper, we study the problem of local similarity search, which finds partially replicated text. Unlike existing studies on similarity search, which find documents duplicated in their entirety, our goal is to identify documents that approximately share a pair of sliding windows differing by no more than τ tokens. The problem is technically challenging because, for sliding windows, the tokens to be indexed are less selective than those for entire documents, which makes algorithms based on set similarity joins less efficient. Our proposed method enumerates token combinations to obtain highly selective signatures. To balance the cost of signature generation against that of candidate generation, we partition the token universe and, for different partitions, generate combinations composed of different numbers of tokens. We devise a cost-aware algorithm to find a good partitioning of the token universe. We also leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop techniques to support large thresholds. Experiments on real datasets demonstrate the efficiency of our method compared with alternative solutions.
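
To make the problem statement concrete, the following is a minimal brute-force sketch of local similarity search between one query and one document. It is not the signature-based method proposed in the paper. It assumes that a window is a multiset of `w` consecutive tokens and that two windows "differ by no more than τ tokens" when their multiset overlap is at least `w - tau`; the abstract does not spell out the exact measure, so this reading is an assumption. The names `window_bags`, `overlap`, and `local_matches` are hypothetical. The sketch does show one idea from the abstract: adjacent windows overlap, so each window's token bag is derived from the previous one instead of being rebuilt.

```python
from collections import Counter
from typing import List, Tuple


def window_bags(tokens: List[str], w: int) -> List[Counter]:
    """Return the token multiset of every width-w sliding window.

    Adjacent windows share w - 1 tokens, so each bag is obtained from the
    previous one by dropping the outgoing token and adding the incoming one.
    """
    if w <= 0 or len(tokens) < w:
        return []
    bag = Counter(tokens[:w])
    bags = [bag.copy()]
    for i in range(w, len(tokens)):
        outgoing = tokens[i - w]
        bag[outgoing] -= 1
        if bag[outgoing] == 0:
            del bag[outgoing]  # keep the bag free of zero-count entries
        bag[tokens[i]] += 1
        bags.append(bag.copy())
    return bags


def overlap(a: Counter, b: Counter) -> int:
    """Multiset intersection size: sum of per-token minimum counts."""
    return sum((a & b).values())


def local_matches(query: List[str], doc: List[str],
                  w: int, tau: int) -> List[Tuple[int, int]]:
    """Brute-force baseline: all (query_pos, doc_pos) window pairs whose
    overlap is at least w - tau (assumed meaning of the threshold)."""
    q_bags = window_bags(query, w)
    d_bags = window_bags(doc, w)
    return [(i, j)
            for i, qb in enumerate(q_bags)
            for j, db in enumerate(d_bags)
            if overlap(qb, db) >= w - tau]


if __name__ == "__main__":
    q = "the quick brown fox jumps".split()
    d = "a quick brown dog jumps over".split()
    print(local_matches(q, d, w=3, tau=1))
```

This baseline compares every pair of windows, which is the cost the paper's signatures are meant to avoid; it is useful mainly as a correctness reference for a faster implementation.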

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 June 2016

Qualifiers

Research-article

Funding Sources

D2DCRC
JSPS

Conference

SIGMOD/PODS'16: International Conference on Management of Data

June 26 - July 1, 2016

San Francisco, California, USA

Sponsor: SIGMOD

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
