
HubHSP graph: Capturing local geometrical and statistical data properties via spanning graphs

Published: 01 March 2024

Abstract

The computation of a continuous generative model to describe a finite sample of an infinite metric space can prove challenging and lead to erroneous hypotheses, particularly in high-dimensional spaces. In this paper, we follow a different route and define the Hubness Half Space Partitioning graph (HubHSP graph). By constructing this spanning graph over the dataset, we capture both the geometrical and statistical properties of the data without resorting to any continuity assumption. Leveraging the classical graph-theoretic apparatus, the HubHSP graph facilitates critical operations, including the creation of a representative sample of the original dataset, without relying on density estimation. This representative subsample is essential for a range of operations, including indexing, visualization, and machine learning tasks such as clustering or inductive learning. With the HubHSP graph, we bypass the limitations of traditional methods and obtain a holistic understanding of our dataset's properties, enabling us to unlock its full potential.
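As background, the half-space partitioning in the graph's name refers to the half-space proximal (HSP) test of Chávez et al.: for a given point, the nearest remaining candidate becomes a neighbor, and every candidate lying closer to that neighbor than to the point itself (i.e., falling in the neighbor's half-space) is discarded. A minimal Python sketch of this neighbor-selection step, assuming Euclidean distance; the function name `hsp_neighbors` and the toy points are illustrative, not from the paper:

```python
import math

def hsp_neighbors(u, points):
    """Half-space proximal test: pick the HSP neighbors of point u.

    Repeatedly take the closest remaining candidate v as a neighbor,
    then drop every candidate that is strictly closer to v than to u
    (it lies in v's half-space, so v already "covers" it)."""
    dist = math.dist  # Euclidean distance (Python >= 3.8)
    candidates = sorted((p for p in points if p != u),
                        key=lambda p: dist(u, p))
    neighbors = []
    while candidates:
        v = candidates.pop(0)          # closest remaining candidate
        neighbors.append(v)
        # keep only candidates at least as close to u as to v
        candidates = [w for w in candidates if dist(u, w) <= dist(v, w)]
    return neighbors

pts = [(0, 0), (1, 0), (2, 0), (0, 1), (3, 3)]
print(hsp_neighbors((0, 0), pts))  # → [(1, 0), (0, 1)]
```

Note how (2, 0) and (3, 3) are pruned by the neighbor (1, 0): they sit in its half-space, which is what bounds the out-degree and gives the spanning graph its local, density-free character.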


Published In

Information Systems, Volume 121, Issue C, March 2024, 217 pages

Publisher

Elsevier Science Ltd., United Kingdom


Author Tags

  1. Data analysis
  2. Data modeling
  3. Graph-based representation
  4. Graph centrality
  5. Half-space partitioning

Qualifiers

  • Research-article
