DOI: 10.1145/1265530.1265545

Finding near neighbors through cluster pruning

Published: 11 June 2007
    Abstract

    Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects, and in matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders; the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations, and how much do they save in computation?
    Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give a rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) we run experiments on both synthetic data from such generative processes and data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, are outperformed by our simple scheme whether the data points are stored in main or in external memory.
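
    The preprocessing and query steps described in the abstract can be illustrated with a minimal, non-recursive Python sketch. It is not the authors' implementation: the function names `build_index` and `query`, the parameters `num_leaders`, `k`, and `num_probe`, and the choice of Euclidean distance are all assumptions made for illustration, and the abstract does not say how many leaders to pick.

```python
import math
import random


def build_index(points, num_leaders, rng):
    """Preprocessing: choose random leaders, then partition the
    remaining points by which leader is closest (Euclidean, assumed)."""
    leader_ids = rng.sample(range(len(points)), num_leaders)
    # Each leader starts its own cluster and is a member of it.
    clusters = {lid: [lid] for lid in leader_ids}
    for i, p in enumerate(points):
        if i in clusters:
            continue  # leaders are already placed
        nearest = min(leader_ids, key=lambda lid: math.dist(p, points[lid]))
        clusters[nearest].append(i)
    return clusters


def query(points, clusters, q, k=1, num_probe=1):
    """Search: find the num_probe leaders closest to q, then rank only
    the points in those leaders' clusters. Returns up to k point indices."""
    closest = sorted(clusters, key=lambda lid: math.dist(q, points[lid]))
    candidates = [i for lid in closest[:num_probe] for i in clusters[lid]]
    return sorted(candidates, key=lambda i: math.dist(q, points[i]))[:k]


# Hypothetical usage on random 2-D points.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
clusters = build_index(points, num_leaders=32, rng=rng)
approx = query(points, clusters, (0.5, 0.5), k=5, num_probe=2)
```

    Raising `num_probe` scans more clusters, trading query cost for result quality; probing every cluster reduces to an exact linear scan.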

    Supplementary Material

    Low Resolution (p103-chierichetti_56k.mp4)
    High Resolution (p103-chierichetti_768k.mp4)

    Information & Contributors

    Information

    Published In

    PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
    June 2007
    328 pages
    ISBN: 9781595936851
    DOI: 10.1145/1265530
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. clustering
    2. generative model
    3. nearest neighbor

    Qualifiers

    • Article

    Conference

    SIGMOD/PODS07
    Acceptance Rates

    PODS '07 Paper Acceptance Rate 28 of 187 submissions, 15%;
    Overall Acceptance Rate 642 of 2,707 submissions, 24%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 15
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 12 Aug 2024

    Citations

    Cited By

    • (2024) A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2261-2265. DOI: 10.1145/3626772.3657931. Online publication date: 10-Jul-2024
    • (2024) Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 152-162. DOI: 10.1145/3626772.3657769. Online publication date: 10-Jul-2024
    • (2023) Self-Supervised Object Detection from Egocentric Videos. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5202-5214. DOI: 10.1109/ICCV51070.2023.00482. Online publication date: 1-Oct-2023
    • (2022) MSPPIR: Multi-Source Privacy-Preserving Image Retrieval in cloud computing. Future Generation Computer Systems, 134, pp. 78-92. DOI: 10.1016/j.future.2022.03.040. Online publication date: Sep-2022
    • (2020) Keyphrase generation for Vietnamese administrative documents: a collaborative approach. 2020 12th International Conference on Knowledge and Systems Engineering (KSE), pp. 43-48. DOI: 10.1109/KSE50997.2020.9287477. Online publication date: 12-Nov-2020
    • (2019) Index Maintenance Strategy and Cost Model for Extended Cluster Pruning. Similarity Search and Applications, pp. 32-39. DOI: 10.1007/978-3-030-32047-8_3. Online publication date: 23-Sep-2019
    • (2018) Alternative patterns of the multidimensional Hilbert curve. Multimedia Tools and Applications, 77:7, pp. 8419-8440. DOI: 10.1007/s11042-017-4744-4. Online publication date: 1-Apr-2018
    • (2017) PIC: Enable Large-Scale Privacy Preserving Content-Based Image Search on Cloud. IEEE Transactions on Parallel and Distributed Systems, 28:11, pp. 3258-3271. DOI: 10.1109/TPDS.2017.2712148. Online publication date: 6-Oct-2017
    • (2015) PIC. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP), pp. 949-958. DOI: 10.1109/ICPP.2015.104. Online publication date: 1-Sep-2015
    • (2015) Reconsideration about clustering analysis. 2015 IEEE 10th Conference on Industrial Electronics and Applications (ICIEA), pp. 1517-1524. DOI: 10.1109/ICIEA.2015.7334349. Online publication date: Jun-2015
