tutorial

Hubs in Nearest-Neighbor Graphs: Origins, Applications and Challenges

Author:

Miloš Radovanović

WIMS '18: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics

Article No.: 5, Pages 1 - 4

https://doi.org/10.1145/3227609.3227691

Published: 25 June 2018

Abstract

k-nearest neighbor graphs constructed from tabular data using some distance measure tend to contain hubs, i.e., points whose in-degree is much higher than expected. This tendency has drawn a fair amount of attention in recent years because of its observed impact on techniques used in many application domains. This companion paper summarizes the tutorial, which is organized in three parts: (1) Origins, which discusses the causes of the emergence of hubs (and their low in-degree counterparts, the anti-hubs) and their relationships with dimensionality, neighborhood size, distance concentration, and the notion of centrality; (2) Applications, which presents some notable effects of (anti-)hubs on techniques for machine learning, data mining, and information retrieval, identifies two approaches that researchers have adopted for handling hubs (fighting or embracing their existence), and reviews techniques and applications belonging to each group; and (3) Challenges, which discusses work in progress, open problems, and areas with significant opportunities for hub-related research.
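To make the notion of a hub concrete, the following is a minimal sketch, not the tutorial's own code, that counts how often each point appears in the k-nearest-neighbor lists of other points (its in-degree in the directed k-NN graph) and summarizes the in-degree distribution by its skewness. The function names `k_occurrences` and `skewness`, the Euclidean distance, the i.i.d. Gaussian synthetic data, and the parameter values are all assumptions chosen for illustration.

```python
import numpy as np

def k_occurrences(X, k):
    """In-degree of every point in the directed k-NN graph of X (rows are points)."""
    # Pairwise squared Euclidean distances (the choice of distance is an assumption).
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    # Indices of the k nearest neighbors of each point (unordered within the k).
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    # Count how many neighbor lists each point appears in.
    return np.bincount(nn.ravel(), minlength=X.shape[0])

def skewness(v):
    """Standardized third moment; large positive values indicate a long right tail."""
    c = v - np.mean(v)
    return float(np.mean(c ** 3) / np.mean(c ** 2) ** 1.5)

rng = np.random.default_rng(0)
n, k = 1000, 10
results = {}
for dim in (3, 100):
    X = rng.standard_normal((n, dim))  # synthetic data, illustration only
    N_k = k_occurrences(X, k)
    results[dim] = skewness(N_k)
    # The mean in-degree is always k; hubs show up in the maximum and the skewness,
    # anti-hubs in the number of points with in-degree zero.
    print(f"dim={dim:3d} max={N_k.max():3d} "
          f"zero in-degree={int(np.sum(N_k == 0)):3d} skewness={results[dim]:.2f}")
```

Because every point has exactly k outgoing edges, the average in-degree equals k regardless of dimensionality; what changes is how unevenly the in-degree is spread. On data like this, the higher-dimensional sample would be expected to show a larger maximum in-degree, more points that are nobody's neighbor, and a more strongly right-skewed distribution.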

Index Terms

Hubs in Nearest-Neighbor Graphs: Origins, Applications and Challenges
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Published In

WIMS '18: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics

June 2018

398 pages

ISBN:9781450354899

DOI:10.1145/3227609

Copyright © 2018 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial
Research
Refereed limited

Funding Sources

Ministarstvo Prosvete, Nauke i Tehnološkog Razvoja

Conference

WIMS '18

WIMS '18: 8th International Conference on Web Intelligence, Mining and Semantics

June 25 - 27, 2018

Novi Sad, Serbia

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%
