Sizing sketches: a rank-based analysis for similarity search

Authors: William Josephson, Moses Charikar, Kai Li

SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Pages 157 - 168

https://doi.org/10.1145/1254882.1254900

Published: 12 June 2007

Abstract

Sketches are compact data structures that can be used to estimate properties of the original data in building large-scale search engines and data analysis systems. Recent theoretical and experimental studies have shown that sketches constructed from feature vectors using randomized projections can effectively approximate the L1 distance between feature vectors with the Hamming distance between their sketches. Furthermore, such sketches can achieve good filtering accuracy while reducing the metadata space requirement and speeding up similarity searches by an order of magnitude. However, it is not clear how to choose the size of the sketches, since it depends on data type, dataset size, and desired filtering quality. In real system designs, it is necessary to understand how to choose sketch size without the dataset, or at least without the whole dataset.
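To make the L1-to-Hamming idea concrete, here is a minimal sketch of one possible construction. The abstract does not specify how the sketches are built, so everything below is an assumption for illustration: each bit picks a dimension with probability proportional to its value range and compares the feature against a uniformly drawn threshold. The names `make_sketcher`, `hamming`, `l1`, and `estimate_l1` are hypothetical. Under these assumptions, the probability that two sketches differ in a given bit equals the L1 distance divided by the sum of the ranges, so the scaled Hamming distance should approximate the L1 distance.

```python
import random


def make_sketcher(lows, highs, n_bits, seed=0):
    """Build a function that maps a feature vector to a list of n_bits bits.

    Assumption (not taken from the abstract): each bit picks one dimension
    with probability proportional to its value range and compares the
    feature against a threshold drawn uniformly from that range.
    """
    rng = random.Random(seed)
    ranges = [hi - lo for lo, hi in zip(lows, highs)]
    dims = rng.choices(range(len(lows)), weights=ranges, k=n_bits)
    thresholds = [rng.uniform(lows[i], highs[i]) for i in dims]

    def sketch(x):
        return [1 if x[i] > t else 0 for i, t in zip(dims, thresholds)]

    return sketch


def hamming(a, b):
    """Number of positions in which two equal-length bit lists differ."""
    return sum(u != v for u, v in zip(a, b))


def l1(x, y):
    """Exact L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(x, y))


def estimate_l1(sketch_a, sketch_b, lows, highs):
    """Scale the fraction of differing bits by the total value range."""
    total_range = sum(hi - lo for lo, hi in zip(lows, highs))
    return hamming(sketch_a, sketch_b) / len(sketch_a) * total_range


# Illustrative data: 8 features, each assumed to lie in [0, 1].
lows, highs = [0.0] * 8, [1.0] * 8
sketch = make_sketcher(lows, highs, n_bits=4096)
x, near, far = [0.10] * 8, [0.15] * 8, [0.90] * 8
for y in (near, far):
    print(l1(x, y), estimate_l1(sketch(x), sketch(y), lows, highs))
```

With fewer bits the estimate becomes noisier, which is the trade-off behind the sizing question the abstract raises.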

This paper presents an analytical model and experimental results to help system designers make such design decisions. We present a rank-based filtering model that describes the relationship between sketch size and dataset size based on the dataset distance distribution. Our experimental results with several datasets including images, audio, and 3D shapes show that the model yields good, conservative predictions. We show that the parameters of the model can be set with a small sample dataset and the resulting model can make good predictions for a large dataset. We illustrate how to apply the approach with a concrete example.
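The abstract does not give the analytical model itself, so the following is not the paper's rank-based model. It is only a hedged, empirical stand-in for the question the model answers: on a small sample, how many of a query's true nearest neighbors (ranked by L1 distance) survive when candidates are first filtered by sketch Hamming distance, and how that changes with sketch size. The function `filter_recall`, its parameters, the random threshold sketch, and the uniform sample data are all assumptions made for illustration.

```python
import random


def l1(x, y):
    """Exact L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(x, y))


def hamming(a, b):
    """Number of positions in which two equal-length bit lists differ."""
    return sum(u != v for u, v in zip(a, b))


def filter_recall(data, queries, n_bits, k, n_candidates, seed=0):
    """Average fraction of each query's true k nearest neighbours (by L1)
    that appear among the n_candidates objects closest in Hamming distance."""
    rng = random.Random(seed)
    n_dims = len(data[0])
    # Assumption: every feature lies in [0, 1), so thresholds come from [0, 1).
    dims = [rng.randrange(n_dims) for _ in range(n_bits)]
    thresholds = [rng.random() for _ in range(n_bits)]

    def sketch(x):
        return [1 if x[i] > t else 0 for i, t in zip(dims, thresholds)]

    sketches = [sketch(x) for x in data]
    ids = range(len(data))
    found = 0
    for q in queries:
        sq = sketch(q)
        # Ground truth: rank every object by its exact L1 distance.
        exact = sorted(ids, key=lambda j: l1(q, data[j]))[:k]
        # Filter: rank by sketch distance and keep the closest candidates.
        kept = sorted(ids, key=lambda j: hamming(sq, sketches[j]))[:n_candidates]
        found += len(set(exact) & set(kept))
    return found / (k * len(queries))


# Illustrative sample: 300 random 16-dimensional vectors and 10 queries.
rng = random.Random(42)
sample = [[rng.random() for _ in range(16)] for _ in range(300)]
queries = [[rng.random() for _ in range(16)] for _ in range(10)]
for n_bits in (16, 64, 256):
    print(n_bits, filter_recall(sample, queries, n_bits, k=5, n_candidates=30))
```

Running such a measurement on a sample only shows behavior at the sample's size; predicting behavior for a much larger dataset is the part the paper's model is meant to address.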

Index Terms

Sizing sketches: a rank-based analysis for similarity search
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

June 2007

398 pages

ISBN:9781595936394

DOI:10.1145/1254882

General Chair: Leana Golubchik (University of Southern California, USA)
Program Chairs: Mostafa Ammar (Georgia Institute of Technology, USA), Mor Harchol-Balter (Carnegie Mellon University, USA)

ACM SIGMETRICS Performance Evaluation Review Volume 35, Issue 1
SIGMETRICS '07 Conference Proceedings
June 2007
382 pages
ISSN:0163-5999
DOI:10.1145/1269899

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMETRICS07

SIGMETRICS07: ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

June 12 - 16, 2007

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%

Bibliometrics

Total Citations: 28
Total Downloads: 568
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 1

Reflects downloads up to 28 Dec 2024

Cited By

Gerniers A, Dupont P (2022). MicroCellClust 2: a hybrid approach for multivariate rare cell mining in large-scale single-cell data. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 148-153. Online publication date: 6-Dec-2022.
https://doi.org/10.1109/BIBM55620.2022.9995176

Mic V, Zezula P (2021). On the Similarity Search With Hamming Space Sketches. Intelligent Analytics With Advanced Multi-Industry Applications, pages 97-127. Online publication date: 2021.
https://doi.org/10.4018/978-1-7998-4963-6.ch005

Gong L, Wang H, Ogihara M, Xu J (2020). iDEC. Proceedings of the VLDB Endowment, 13:9, pages 1483-1497. Online publication date: 26-Jun-2020.
https://dl.acm.org/doi/10.14778/3397230.3397243

Zhang X, Wei C, Song Z (2020). Fast Locally Weighted PLS Modeling for Large-Scale Industrial Processes. Industrial & Engineering Chemistry Research, 59:47, pages 20779-20786. Online publication date: 11-Nov-2020.
https://doi.org/10.1021/acs.iecr.0c03932

Higuchi N, Imamura Y, Mic V, Shinohara T, Hirata K, Kuboyama T (2020). Pivot Selection for Narrow Sketches by Optimization Algorithms. Similarity Search and Applications, pages 33-46. Online publication date: 14-Oct-2020.
https://doi.org/10.1007/978-3-030-60936-8_3

Chen X, Wu F, Chen J, Li M (2019). DoRC: Discovery of rare cells from ultra-large scRNA-seq data. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 111-116. Online publication date: Nov-2019.
https://doi.org/10.1109/BIBM47256.2019.8983250

Mic V, Novak D, Zezula P (2018). Binary Sketches for Secondary Filtering. ACM Transactions on Information Systems, 37:1, pages 1-28. Online publication date: 6-Dec-2018.
https://dl.acm.org/doi/10.1145/3231936

Mic V, Novak D, Zezula P (2018). Modifying Hamming Spaces for Efficient Search. 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 945-953. Online publication date: Nov-2018.
https://doi.org/10.1109/ICDMW.2018.00137

Mic V, Novak D, Vadicamo L, Zezula P (2018). Selecting Sketches for Similarity Search. Advances in Databases and Information Systems, pages 127-141. Online publication date: 2-Sep-2018.
https://dl.acm.org/doi/10.1007/978-3-319-98398-1_9

Rachkovskij D (2017). Binary Vectors for Fast Distance and Similarity Estimation. Cybernetics and Systems Analysis, 53:1, pages 138-156. Online publication date: 1-Jan-2017.
https://dl.acm.org/doi/10.1007/s10559-017-9914-x
