DOI: 10.1145/2324796.2324847
research-article

Distributed KNN-graph approximation via hashing

Published: 05 June 2012

Abstract

Efficiently constructing the K-Nearest Neighbor Graph (K-NNG) of large, high-dimensional datasets is crucial for many applications involving feature-rich objects, such as images or other multimedia content. In this paper we investigate the use of high-dimensional hashing methods for efficiently approximating the K-NNG, notably in distributed environments. We first discuss how strongly balancing issues affect the performance of such approaches and show why the baseline approach using Locality Sensitive Hashing does not perform well. Our new KNN-join method is based on RMMH, a recently introduced hash function family based on randomly trained classifiers. We show that the resulting hash tables are much more balanced and that the number of resulting collisions can be greatly reduced without degrading quality. We further improve the load balancing of our distributed approach by designing a parallelized local join algorithm, implemented within the MapReduce framework.
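
The general idea of approximating a K-NNG through hashing can be illustrated with a minimal single-machine sketch. This is not the paper's method: it uses plain random-hyperplane hashing (an LSH-style baseline) rather than RMMH, and the map and reduce phases are only imitated by ordinary loops. All function names and parameters (`approximate_knn_graph`, `num_tables`, `num_bits`, `join_cost`) are hypothetical and chosen for illustration. The `join_cost` helper counts the pairwise comparisons a local join would perform, which suggests why unbalanced buckets are expensive.

```python
import heapq
import random
from collections import defaultdict


def random_hyperplanes(num_bits, dim, rng):
    """Draw num_bits random Gaussian hyperplanes (one hash bit each)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(point, planes):
    """Sign-of-projection hash: one bit per hyperplane."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, point)) >= 0.0) for plane in planes
    )


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def join_cost(bucket_sizes):
    """Number of pairwise comparisons needed to join every bucket locally."""
    return sum(n * (n - 1) // 2 for n in bucket_sizes)


def approximate_knn_graph(points, k, num_tables=4, num_bits=4, seed=0):
    """Approximate K-NNG: bucket points per table, then join within buckets."""
    rng = random.Random(seed)
    dim = len(points[0])
    candidates = defaultdict(dict)  # i -> {j: squared distance}
    bucket_sizes = []
    for _ in range(num_tables):
        planes = random_hyperplanes(num_bits, dim, rng)
        # "Map" phase (imitated): key each point index by its hash code.
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            buckets[hash_code(p, planes)].append(idx)
        # "Reduce" phase (imitated): compare only points that collide.
        for members in buckets.values():
            bucket_sizes.append(len(members))
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    i, j = members[a], members[b]
                    if j not in candidates[i]:
                        d = squared_distance(points[i], points[j])
                        candidates[i][j] = d
                        candidates[j][i] = d
    # Keep the k closest candidates seen for each point.
    graph = {}
    for i in range(len(points)):
        nearest = heapq.nsmallest(
            k, candidates[i].items(), key=lambda kv: (kv[1], kv[0])
        )
        graph[i] = [j for j, _ in nearest]
    return graph, bucket_sizes
```

Under these assumptions, two buckets of 5 points cost 20 comparisons while buckets of 9 and 1 cost 36, even though both hold 10 points. The join cost grows quadratically with bucket size, so a few oversized buckets can dominate the running time and, in a distributed setting, overload individual workers.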



    Published In

    ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
    June 2012
    489 pages
    ISBN:9781450313292
    DOI:10.1145/2324796

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. KNN-graph
    2. MapReduce
    3. approximate
    4. distributed
    5. hashing
    6. scalable
    7. similarity search

    Qualifiers

    • Research-article

    Conference

    ICMR '12

    Acceptance Rates

    ICMR '12 Paper Acceptance Rate 50 of 145 submissions, 34%;
    Overall Acceptance Rate 254 of 830 submissions, 31%

    Cited By

    • (2023) Introduction to Distributed Nearest Hash. Procedia Computer Science 218:C (1571-1580). DOI: 10.1016/j.procs.2023.01.135. Online publication date: 1-Jan-2023
    • (2020) Towards distributed node similarity search on graphs. World Wide Web 23:6 (3025-3053). DOI: 10.1007/s11280-020-00819-6. Online publication date: 1-Nov-2020
    • (2019) A True O(n log n) Algorithm for the All-k-Nearest-Neighbors Problem. Combinatorial Optimization and Applications (362-374). DOI: 10.1007/978-3-030-36412-0_29. Online publication date: 23-Nov-2019
    • (2016) A Scalable Approach for Content-Based Image Retrieval in Peer-to-Peer Networks. IEEE Transactions on Knowledge and Data Engineering 28:4 (858-872). DOI: 10.1109/TKDE.2015.2505284. Online publication date: 1-Apr-2016
    • (2015) Dart. Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (90-97). DOI: 10.1109/CLOUD.2015.22. Online publication date: 27-Jun-2015
