research-article
Open access

TabEE: Tabular Embeddings Explanations

Published: 26 March 2024

Abstract

Tabular embedding methods have become increasingly popular because they improve results on a variety of tasks, including classic database tasks and machine learning predictions. However, most current methods treat these embedding models as "black boxes", making it difficult to understand the insights the models capture. Our research proposes a novel approach to interpreting these models, aiming to provide local and global explanations for the original data and to detect potential flaws in the embedding models. The proposed solution applies to any tabular embedding algorithm, because it relies only on a black-box view of the embedding model. Furthermore, we propose methods for comparing different embedding models, which can help identify data biases that might undermine the models' credibility without the user's knowledge. We evaluate our approach on multiple datasets and multiple embeddings, demonstrating that the proposed explanations provide valuable insights into the behavior of tabular embedding methods. By making these models more transparent, we believe our research will contribute to the development of more effective and reliable embedding methods for a wide range of applications.

Published In

Proceedings of the ACM on Management of Data  Volume 2, Issue 1
SIGMOD
February 2024
1874 pages
EISSN:2836-6573
DOI:10.1145/3654807
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 March 2024
Published in PACMMOD Volume 2, Issue 1

Author Tags

  1. model explainability
  2. table embedding

Qualifiers

  • Research-article

Funding Sources

  • BSF - the US-Israel Binational Science Foundation
  • ISF - the Israel Science Foundation
