DOI: 10.1145/1254882.1254900
Article

Sizing sketches: a rank-based analysis for similarity search

Published: 12 June 2007

Abstract

Sketches are compact data structures that can be used to estimate properties of the original data in building large-scale search engines and data analysis systems. Recent theoretical and experimental studies have shown that sketches constructed from feature vectors using randomized projections can effectively approximate the L1 distance between feature vectors with the Hamming distance between their sketches. Furthermore, such sketches can achieve good filtering accuracy while reducing the metadata space requirement and speeding up similarity searches by an order of magnitude. However, it is not clear how to choose the size of the sketches, since it depends on data type, dataset size, and desired filtering quality. In real system designs, it is necessary to understand how to choose the sketch size without the dataset, or at least without the whole dataset.
This paper presents an analytical model and experimental results to help system designers make such design decisions. We present a rank-based filtering model that describes the relationship between sketch size and dataset size based on the dataset distance distribution. Our experimental results with several datasets, including images, audio, and 3D shapes, show that the model yields good, conservative predictions. We show that the parameters of the model can be set with a small sample dataset and that the resulting model can make good predictions for a large dataset. We illustrate how to apply the approach with a concrete example.
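
To make the idea in the abstract concrete, here is a minimal sketch of one common way to build bit sketches whose Hamming distance tracks L1 distance, followed by a filter-then-rank search. It is an illustration under stated assumptions, not the paper's implementation: the function names (`make_sketcher`, `filtered_search`), the per-dimension bounds `lows` and `highs`, and the parameters `n_bits` and `n_candidates` are all hypothetical. Each sketch bit compares one randomly chosen dimension against a random threshold. When dimensions are drawn in proportion to their range and both vectors lie within the bounds, the expected Hamming distance between two sketches is `n_bits * L1(x, y) / sum(highs[i] - lows[i])`.

```python
import random

def make_sketcher(lows, highs, n_bits, seed=0):
    """Return a function that maps a feature vector to an n_bits-bit sketch (an int).

    Assumption: lows[i] and highs[i] bound dimension i of every feature vector.
    """
    rng = random.Random(seed)
    ranges = [hi - lo for lo, hi in zip(lows, highs)]
    # Each bit tests one dimension against one threshold. Dimensions are drawn
    # with probability proportional to their range, thresholds uniformly
    # within that dimension's range.
    picks = rng.choices(range(len(lows)), weights=ranges, k=n_bits)
    tests = [(i, rng.uniform(lows[i], highs[i])) for i in picks]

    def sketch(x):
        bits = 0
        for i, t in tests:
            bits = (bits << 1) | (x[i] > t)
        return bits

    return sketch

def hamming(a, b):
    """Number of bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")

def l1(x, y):
    """Exact L1 distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def filtered_search(query, data, sketches, sketch, k, n_candidates):
    """Filter by sketch Hamming distance, then rank the survivors by exact L1."""
    q = sketch(query)
    # Filtering step: keep the n_candidates items whose sketches are closest.
    candidates = sorted(range(len(data)),
                        key=lambda j: hamming(q, sketches[j]))[:n_candidates]
    # Ranking step: compute the exact distance only for the candidates.
    return sorted(candidates, key=lambda j: l1(query, data[j]))[:k]
```

In this sketch, a larger `n_bits` gives a less noisy distance estimate at the cost of more metadata per object, and a larger `n_candidates` makes it less likely that a true neighbor is filtered out. That trade-off between sketch size, dataset size, and filtering quality is the sizing question the abstract describes.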


Published In

SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
June 2007
398 pages
ISBN: 9781595936394
DOI: 10.1145/1254882
  • ACM SIGMETRICS Performance Evaluation Review, Volume 35, Issue 1
    SIGMETRICS '07 Conference Proceedings
    June 2007
    382 pages
    ISSN: 0163-5999
    DOI: 10.1145/1269899

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. feature-rich data
  2. similarity search
  3. sketch

Qualifiers

  • Article

Conference

SIGMETRICS07

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%

Cited By

  • (2022) MicroCellClust 2: a hybrid approach for multivariate rare cell mining in large-scale single-cell data. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 148-153. DOI: 10.1109/BIBM55620.2022.9995176. Online publication date: 6-Dec-2022
  • (2021) On the Similarity Search With Hamming Space Sketches. Intelligent Analytics With Advanced Multi-Industry Applications, pages 97-127. DOI: 10.4018/978-1-7998-4963-6.ch005. Online publication date: 2021
  • (2020) iDEC. Proceedings of the VLDB Endowment, 13:9, pages 1483-1497. DOI: 10.14778/3397230.3397243. Online publication date: 26-Jun-2020
  • (2020) Fast Locally Weighted PLS Modeling for Large-Scale Industrial Processes. Industrial & Engineering Chemistry Research, 59:47, pages 20779-20786. DOI: 10.1021/acs.iecr.0c03932. Online publication date: 11-Nov-2020
  • (2020) Pivot Selection for Narrow Sketches by Optimization Algorithms. Similarity Search and Applications, pages 33-46. DOI: 10.1007/978-3-030-60936-8_3. Online publication date: 14-Oct-2020
  • (2019) DoRC: Discovery of rare cells from ultra-large scRNA-seq data. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 111-116. DOI: 10.1109/BIBM47256.2019.8983250. Online publication date: Nov-2019
  • (2018) Binary Sketches for Secondary Filtering. ACM Transactions on Information Systems, 37:1, pages 1-28. DOI: 10.1145/3231936. Online publication date: 6-Dec-2018
  • (2018) Modifying Hamming Spaces for Efficient Search. 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 945-953. DOI: 10.1109/ICDMW.2018.00137. Online publication date: Nov-2018
  • (2018) Selecting Sketches for Similarity Search. Advances in Databases and Information Systems, pages 127-141. DOI: 10.1007/978-3-319-98398-1_9. Online publication date: 2-Sep-2018
  • (2017) Binary Vectors for Fast Distance and Similarity Estimation. Cybernetics and Systems Analysis, 53:1, pages 138-156. DOI: 10.1007/s10559-017-9914-x. Online publication date: 1-Jan-2017
