
Interpretable Embedding and Visualization of Compressed Data

Published: 20 February 2023

Abstract

Traditional embedding methodologies, also known as dimensionality reduction techniques, assume the availability of exact pairwise distances between the high-dimensional objects to be embedded in a lower-dimensional space. In this article, we propose an embedding that overcomes this limitation and can operate on pairwise distances represented as ranges of lower and upper bounds. Such bounds are typically estimated when objects are compressed in a lossy manner, so our approach is highly applicable to big compressed datasets. Our methodology can preserve multiple aspects of the original data relationships (distances, correlations, and object scores/ranks), whereas existing techniques typically preserve only distances. Comparative experiments with prevalent embedding methodologies (ISOMAP, t-SNE, MDS, UMAP) illustrate that our approach faithfully preserves multiple object relationships, even in the presence of inexact distance information. The resulting visualization is also easily interpretable.
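
To make the setting concrete, below is a minimal sketch in Python (not the authors' algorithm) of an embedding that consumes interval rather than exact distances: each embedded pairwise distance is unpenalized while it lies inside its [lower, upper] bound, and incurs a quadratic hinge penalty once it leaves the interval, minimized by plain gradient descent. The function embed_with_bounds, the hinge-style loss, and the 20% bound widening in the toy example are all illustrative assumptions, not the paper's formulation.

import numpy as np

def embed_with_bounds(L, U, dim=2, lr=0.01, iters=2000, seed=0):
    """L, U: (n, n) symmetric lower/upper bounds on pairwise distances."""
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    X = rng.standard_normal((n, dim))            # random initial layout
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]     # (n, n, dim) difference vectors
        d = np.linalg.norm(diff, axis=-1)        # embedded pairwise distances
        np.fill_diagonal(d, 1.0)                 # avoid division by zero below
        # Hinge residual: zero while L <= d <= U, signed violation otherwise.
        r = np.where(d < L, d - L, np.where(d > U, d - U, 0.0))
        np.fill_diagonal(r, 0.0)
        # Gradient of 0.5 * sum(r**2) with respect to each point.
        grad = ((r / d)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X

# Toy usage: widen exact distances D into +/-20% intervals, mimicking the
# bounds a lossy compressor might report.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
X = embed_with_bounds(0.8 * D, 1.2 * D)
print(X)

Because the penalty vanishes anywhere inside the bounds, an optimizer of this kind can exploit the slack that lossy compression introduces, rather than overfitting a single point estimate of each distance.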


Cited By

  • (2023) Laplacian-based Cluster-Contractive t-SNE for High-Dimensional Data Visualization. ACM Transactions on Knowledge Discovery from Data 18, 1 (2023), 1–22. DOI: 10.1145/3612932. Online publication date: 6 September 2023.

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 2
February 2023
355 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3572847

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2023
Online AM (Accepted Manuscript): 18 May 2022
Accepted: 08 May 2022
Received: 30 January 2022
Published in TKDD Volume 17, Issue 2


Author Tags

  1. Dimensionality reduction
  2. data embedding
  3. compressed data

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Science and Technology of China
  • Anhui Dept. of Science and Technology
  • Toward Interpretable Machine Learning

