research-article

CRSI: a compact randomized similarity index for set-valued features

Authors:

Petros Venetis,

Yannis Sismanis,

Berthold Reinwald

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

Pages 384 - 395

https://doi.org/10.1145/2247596.2247642

Published: 27 March 2012

Abstract

We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental to many application areas, including data integration and cleaning, data profiling, and near-duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds using a single similarity index. We show how to build this index a priori, without knowledge of the query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed on existing disk-based inverted indexing implementations, and our algorithms exploit available techniques, such as skip-lists, to further optimize query performance. The index has a provably small space footprint and is orders of magnitude smaller and faster to create and incrementally maintain than exact solutions. The algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and we demonstrate their performance and verify their accuracy experimentally.
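As a rough illustration of the kind of estimator the abstract describes, the sketch below computes exact Jaccard similarity and then approximates it from small per-set synopses that keep only the k smallest hash values. This is a generic bottom-k style estimator written under our own assumptions, not the CRSI index or the paper's algorithms; the function names (`jaccard`, `kmv_synopsis`, `estimate_jaccard`), the SHA-1 based hash, and the parameter `k` are all hypothetical choices made for the example.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def _hash01(item):
    """Map an item to a pseudo-random value in [0, 1) (assumed hash choice)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(items, k):
    """Keep the k smallest distinct hash values of the set."""
    return sorted({_hash01(x) for x in items})[:k]


def estimate_jaccard(syn_a, syn_b, k):
    """Estimate Jaccard similarity from two synopses built with the same k.

    The k smallest hashes of the union act as a uniform sample of A ∪ B;
    the fraction of them present in both synopses estimates the similarity.
    """
    union_k = sorted(set(syn_a) | set(syn_b))[:k]
    if not union_k:
        return 1.0
    in_both = set(syn_a) & set(syn_b)
    return sum(1 for h in union_k if h in in_both) / len(union_k)


if __name__ == "__main__":
    a, b = range(0, 1000), range(500, 1500)
    k = 256  # larger k means more space and lower expected error
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(kmv_synopsis(a, k), kmv_synopsis(b, k), k))
```

Because the synopses are built without reference to any threshold, the same pair of synopses can answer queries for any similarity threshold, and `k` plays the role of a user-chosen accuracy parameter in this sketch.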

Cited By

Santos A, Bessa A, Chirigati F, Musco C, Freire J, Li G, Li Z, Idreos S, Srivastava D (2021). Correlation Sketches for Approximate Join-Correlation Queries. Proceedings of the 2021 International Conference on Management of Data, pages 1531-1544. DOI: 10.1145/3448016.3458456. Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3458456
Izquierdo Y, García G, Menendez E, Leme L, Neves A, Lemos M, Finamore A, Oliveira C, Casanova M (2021). Keyword search over schema-less RDF datasets by SPARQL query compilation. Information Systems, 102:C. DOI: 10.1016/j.is.2021.101814. Online publication date: 1-Dec-2021.
https://dl.acm.org/doi/10.1016/j.is.2021.101814
Izquierdo Y (2019). Keyword Search Algorithm over Large RDF Datasets. Advances in Conceptual Modeling, pages 230-238. DOI: 10.1007/978-3-030-34146-6_21. Online publication date: 27-Oct-2019.
https://doi.org/10.1007/978-3-030-34146-6_21

Index Terms

CRSI: a compact randomized similarity index for set-valued features
1. Information systems
  1. Information systems applications

Published In

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

March 2012

643 pages

ISBN:9781450307901

DOI:10.1145/2247596

Editors:
Elke Rundensteiner (Worcester Polytechnic Institute)
Volker Markl (Technische Universität Berlin, Germany)
Ioana Manolescu (INRIA, France)
Sihem Amer-Yahia (QCRI, Doha, Qatar)
Felix Naumann (Hasso Plattner Institute, Potsdam, Germany)
Ismail Ari (Ozyegin University, Turkey)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

EDBT '12

EDBT '12: 15th International Conference on Extending Database Technology

March 27 - 30, 2012

Berlin, Germany

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
