Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2247596.2247642acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

CRSI: a compact randomized similarity index for set-valued features

Published: 27 March 2012 Publication History

Abstract

We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds based on a single similarity index. We show how to build this similarity index a-priori, without knowledge about query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed using existing disk-based inverted indexing implementations and our algorithms exploit available techniques, like skip-lists, to further optimize the query performance. The index has provably small space footprints, is orders of magnitude smaller and faster to create/incrementally maintain than exact solutions, and the algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and, finally, we demonstrate the performance of the algorithms and verify their accuracy experimentally.

References

[1]
Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient Exact Set-Similarity Joins. In VLDB, pages 918--929, 2006.
[2]
Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
[3]
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting Distinct Elements in a Data Stream. In RANDOM, pages 1--10, 2002.
[4]
Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all Pairs Similarity Search. In WWW, pages 131--140, 2007.
[5]
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On Synopses for Distinct-Value Estimation under Multiset Operations. In SIGMOD, pages 199--210, 2007.
[6]
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient Query Evaluation using a Two-Level Retrieval Process. In CIKM, pages 426--434, 2003.
[7]
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise Independent Permutations (extended abstract). In STOC, pages 327--336, 1998.
[8]
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. Comput. Netw. ISDN Syst., 29:1157--1166, 1997.
[9]
Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, pages 380--388, 2002.
[10]
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE, page 5, 2006.
[11]
Steve Chien and Nicole Immorlica. Semantic Similarity between Search Engine Queries using Temporal Correlation. In WWW, pages 2--11, 2005.
[12]
Edith Cohen and Haim Kaplan. Bottom-k Sketches: Better and More Efficient Estimation of Aggregates. In SIGMETRICS, pages 353--354, 2007.
[13]
Edith Cohen and Haim Kaplan. Summarizing Data using Bottom-k Sketches. In PODC, pages 225--234, 2007.
[14]
Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient Similarity Search and Classification via Rank Aggregation. In SIGMOD, pages 301--312, 2003.
[15]
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity Search in high Dimensions via Hashing. In VLDB, pages 518--529, 1999.
[16]
Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, and Divesh Srivastava. Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB, pages 201--212, 2008.
[17]
Monika Henzinger. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In SIGIR, pages 284--291, 2006.
[18]
Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards removing the Curse of Dimensionality. In STOC, pages 604--613, 1998.
[19]
Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Comput. Surv., pages 43--45, 1996.
[20]
Mehran Sahami and Timothy D. Heilman. A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets. In WWW, pages 377--386, 2006.
[21]
Sunita Sarawagi and Alok Kirpal. Efficient Set Joins on Similarity Predicates. In SIGMOD, pages 743--754, 2004.
[22]
Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten. Evaluating Similarity Measures: a Large-Scale Study in the Orkut Social Network. In KDD, pages 678--684, 2005.
[23]
Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. In SIGIR, pages 563--570, 2008.
[24]
Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. Top-k Set Similarity Joins. In ICDE, pages 916--927, 2009.
[25]
Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. In WWW, pages 131--140, 2008.

Cited By

View all
  • (2021)Correlation Sketches for Approximate Join-Correlation QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3458456(1531-1544)Online publication date: 9-Jun-2021
  • (2021)Keyword search over schema-less RDF datasets by SPARQL query compilationInformation Systems10.1016/j.is.2021.101814102:COnline publication date: 1-Dec-2021
  • (2019)Keyword Search Algorithm over Large RDF DatasetsAdvances in Conceptual Modeling10.1007/978-3-030-34146-6_21(230-238)Online publication date: 27-Oct-2019

Index Terms

  1. CRSI: a compact randomized similarity index for set-valued features

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
    March 2012
    643 pages
    ISBN:9781450307901
    DOI:10.1145/2247596
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 March 2012

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. KMV synopsis
    2. set similarity

    Qualifiers

    • Research-article

    Conference

    EDBT '12

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)5
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 17 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2021)Correlation Sketches for Approximate Join-Correlation QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3458456(1531-1544)Online publication date: 9-Jun-2021
    • (2021)Keyword search over schema-less RDF datasets by SPARQL query compilationInformation Systems10.1016/j.is.2021.101814102:COnline publication date: 1-Dec-2021
    • (2019)Keyword Search Algorithm over Large RDF DatasetsAdvances in Conceptual Modeling10.1007/978-3-030-34146-6_21(230-238)Online publication date: 27-Oct-2019

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media