research-article

Open access

TabEE: Tabular Embeddings Explanations

Author:

Kathy Razmadze

Proceedings of the ACM on Management of Data, Volume 2, Issue 1

Article No.: 72, Pages 1 - 26

https://doi.org/10.1145/3639329

Published: 26 March 2024

Abstract

Tabular embedding methods have become increasingly popular because they improve results on a variety of tasks, including classic database tasks and machine learning predictions. However, most current methods treat these embedding models as "black boxes," making it difficult to understand the insights the models capture. Our research proposes a novel approach to interpreting these models, aiming to provide local and global explanations for the original data and to detect potential flaws in the embedding models. Because it treats the embedding model as a black box, the proposed solution applies to any tabular embedding algorithm. Furthermore, we propose methods for comparing different embedding models, which can help identify data biases that might undermine the models' credibility without the user's knowledge. We evaluate our approach on multiple datasets and multiple embeddings, demonstrating that the proposed explanations provide valuable insights into the behavior of tabular embedding methods. By making these models more transparent, we believe our research will contribute to the development of more effective and reliable embedding methods for a wide range of applications.

Index Terms

TabEE: Tabular Embeddings Explanations
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
  2. Machine learning
2. Information systems
  1. Data management systems


Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 1

SIGMOD

February 2024

1874 pages

EISSN: 2836-6573

DOI: 10.1145/3654807

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

BSF - the US-Israel Binational Science Foundation
ISF - the Israel Science Foundation
