Tutorial · DOI: 10.1145/3227609.3227691

Hubs in Nearest-Neighbor Graphs: Origins, Applications and Challenges

Published: 25 June 2018

Abstract

Hubs are points whose in-degree is much higher than expected in a k-nearest neighbor graph constructed from tabular data using some distance measure. The tendency of such graphs to contain hubs has drawn a fair amount of attention in recent years because of its observed impact on techniques used in many application domains. This companion paper summarizes the tutorial, which is organized in three parts: (1) Origins, which discusses the causes of the emergence of hubs (and their low in-degree counterparts, the anti-hubs) and their relationships with dimensionality, neighborhood size, distance concentration, and the notion of centrality; (2) Applications, which presents some notable effects of (anti-)hubs on techniques for machine learning, data mining, and information retrieval, identifies two approaches that researchers have adopted for handling hubs (fighting or embracing their existence), and reviews techniques and applications belonging to each group; and (3) Challenges, which discusses work in progress, open problems, and areas with significant opportunities for hub-related research.
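
To make the notion of in-degree in a k-nearest neighbor graph concrete, the following is a minimal sketch rather than anything taken from the tutorial. It assumes Euclidean distance and i.i.d. Gaussian data; the function names `k_occurrences` and `skewness`, the sample size, and the two dimensionalities are hypothetical choices made for illustration. It counts how often each point appears among the k nearest neighbors of other points and reports the skewness of that count, one common way to quantify how hub-dominated a graph is.

```python
import numpy as np


def k_occurrences(X, k):
    """In-degree of each point in the directed k-NN graph of X.

    Assumes Euclidean distance (an illustrative choice, not the only
    measure the abstract has in mind).
    """
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances via the expansion of |a - b|^2.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Indices of the k smallest distances in each row (order not needed).
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    # Count how many neighbor lists each point appears in.
    return np.bincount(nn.ravel(), minlength=X.shape[0])


def skewness(v):
    """Standardized third moment; large positive values suggest hubs."""
    c = v.astype(float) - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 5
    for d in (3, 100):  # hypothetical low- and high-dimensional settings
        X = rng.standard_normal((1000, d))
        n_k = k_occurrences(X, k)
        print(f"d={d}: mean={n_k.mean():.1f} max={n_k.max()} "
              f"zeros={np.sum(n_k == 0)} skew={skewness(n_k):.2f}")
```

Every point has out-degree k, so the mean in-degree is always k; hubs and anti-hubs show up only in the shape of the distribution, which is why the sketch prints the maximum, the number of points with in-degree zero, and the skewness rather than the mean alone.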

      Published In

      WIMS '18: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics
      June 2018
      398 pages
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Nearest neighbor graphs
      2. data mining
      3. hubness
      4. information retrieval
      5. machine learning

      Qualifiers

      • Tutorial
      • Research
      • Refereed limited

      Conference

      WIMS '18

      Acceptance Rates

      Overall Acceptance Rate 140 of 278 submissions, 50%
