
HubHSP graph: Capturing local geometrical and statistical data properties via spanning graphs

Published: 01 March 2024

Abstract

The computation of a continuous generative model to describe a finite sample of an infinite metric space can prove challenging and lead to erroneous hypotheses, particularly in high-dimensional spaces. In this paper, we follow a different route and define the Hubness Half Space Partitioning graph (HubHSP graph). By constructing this spanning graph over the dataset, we capture both the geometrical and statistical properties of the data without resorting to any continuity assumption. Leveraging the classical graph-theoretic apparatus, the HubHSP graph facilitates critical operations, including the creation of a representative sample of the original dataset, without relying on density estimation. This representative subsample is essential for a range of operations, including indexing, visualization, and machine learning tasks such as clustering or inductive learning. With the HubHSP graph, we bypass the limitations of traditional methods and obtain a holistic understanding of our dataset's properties, enabling us to unlock its full potential.
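As background, the half-space partitioning in the graph's name refers to the half-space proximal (HSP) test of Chávez et al.: for a given point, the nearest remaining candidate becomes a neighbor, and every candidate lying closer to that neighbor than to the point itself (i.e., falling in the neighbor's half-space) is discarded. A minimal Python sketch of this neighbor-selection step, assuming Euclidean distance; the function name `hsp_neighbors` and the toy points are illustrative, not from the paper:

```python
import math

def hsp_neighbors(u, points):
    """Half-space proximal test: pick the HSP neighbors of point u.

    Repeatedly take the closest remaining candidate v as a neighbor,
    then drop every candidate that is strictly closer to v than to u
    (it lies in v's half-space, so v already "covers" it)."""
    dist = math.dist  # Euclidean distance (Python >= 3.8)
    candidates = sorted((p for p in points if p != u),
                        key=lambda p: dist(u, p))
    neighbors = []
    while candidates:
        v = candidates.pop(0)          # closest remaining candidate
        neighbors.append(v)
        # keep only candidates at least as close to u as to v
        candidates = [w for w in candidates if dist(u, w) <= dist(v, w)]
    return neighbors

pts = [(0, 0), (1, 0), (2, 0), (0, 1), (3, 3)]
print(hsp_neighbors((0, 0), pts))  # → [(1, 0), (0, 1)]
```

Note how (2, 0) and (3, 3) are pruned by the neighbor (1, 0): they sit in its half-space, which is what bounds the out-degree and gives the spanning graph its local, density-free character.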


Published In

Information Systems, Volume 121, Issue C, March 2024, 217 pages

Publisher

Elsevier Science Ltd., United Kingdom


Author Tags

  1. Data analysis
  2. Data modeling
  3. Graph-based representation
  4. Graph centrality
  5. Half-space partitioning

Qualifiers

  • Research-article
