DOI: 10.1145/2882903.2915211
research-article

Local Similarity Search for Unstructured Text

Published: 26 June 2016

Abstract

With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search, which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because, for sliding windows, the tokens to be indexed are less selective than those of entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. To strike a balance between signature generation and candidate generation, we partition the token universe, and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop techniques to support large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.
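
The problem statement above can be made concrete with a brute-force baseline. The sketch below is not the signature-based method of the paper; it only checks the definition directly, comparing every pair of sliding windows. It assumes a fixed window length w, and it assumes windows are compared as multisets of tokens, so that two windows "differ by" the number of tokens in one that have no match in the other. The function names (windows, differs_by, local_similarity_search) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def windows(tokens, w):
    """Yield every sliding window of w consecutive tokens."""
    for i in range(len(tokens) - w + 1):
        yield tokens[i:i + w]


def differs_by(a, b):
    """Count tokens of window a that have no match in window b.

    Assumption: windows are treated as multisets, so token order
    inside a window is ignored.
    """
    overlap = sum((Counter(a) & Counter(b)).values())
    return len(a) - overlap


def local_similarity_search(query, docs, w, tau):
    """Return ids of documents that share with the query at least one
    pair of windows differing by no more than tau tokens."""
    query_windows = list(windows(query, w))
    hits = set()
    for doc_id, doc in docs.items():
        if any(differs_by(qw, dw) <= tau
               for qw in query_windows
               for dw in windows(doc, w)):
            hits.add(doc_id)
    return hits


query = "the quick brown fox jumps over the lazy dog".split()
docs = {
    "a": "a quick brown cat jumps over the fence".split(),
    "b": "completely unrelated text about databases".split(),
}
print(local_similarity_search(query, docs, w=4, tau=1))  # {'a'}
```

Document "a" is reported because its window "quick brown cat jumps" differs from the query window "quick brown fox jumps" by a single token. The quadratic number of window comparisons in this baseline is the cost that indexing selective signatures is meant to avoid.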

    Published In

    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN: 9781450335317
    DOI: 10.1145/2882903

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. k-wise signature
    2. local similarity search
    3. prefix filtering
    4. unstructured text

    Qualifiers

    • Research-article

    Funding Sources

    • D2DCRC
    • JSPS

    Conference

    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    San Francisco, California, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
