Advancing Web Science through Foundation Model for Tabular Data

Published: 13 June 2024
Abstract

    As the landscape of web science expands, handling the vast datasets collected from the Web while preserving computational efficiency and privacy remains a significant challenge. Data distillation offers a compelling solution by condensing large datasets into a small distilled subset that retains their essential characteristics. My ongoing thesis work on tabular data distillation has shown that autoencoders and clustering algorithms can effectively distill tabular datasets. Building on this, my next step is to develop a versatile pre-trained model, analogous to BERT and RoBERTa, that can distill arbitrary tabular datasets, streamlining data size reduction, synthetic data generation, large-scale analysis, reproducibility, and privacy preservation. Such a foundation model would serve as a general-purpose tool for web science research, making both data and research more accessible and computationally efficient. It would not be limited to downstream classification: further uses include reducing dataset sizes for efficient analysis, producing privatized synthetic datasets, and enhancing reproducibility through shared distilled data. By developing a foundation model for tabular data distillation, I aim to open new avenues in web science and improve computational accessibility, privacy protection, and reproducibility, offering a versatile means of handling the large amounts of data generated on the Web while preserving its essence.
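    To make the distillation recipe the abstract sketches (encode rows, cluster the latents, keep one representative per cluster) concrete, here is a minimal, hedged sketch in Python. It is not the paper's actual method: PCA stands in for the trained autoencoder encoder, and the function name, dataset, and sizes are all illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    def distill(X, n_latent=2, n_prototypes=20, seed=0):
        """Condense X to n_prototypes rows: encode, cluster, select."""
        # Encode rows into a low-dimensional latent space
        # (PCA stands in for the autoencoder encoder).
        Z = PCA(n_components=n_latent, random_state=seed).fit_transform(X)
        # Cluster the latent codes.
        km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=seed).fit(Z)
        # Keep the original row nearest each cluster centre.
        idx = pairwise_distances_argmin(km.cluster_centers_, Z)
        return X[idx], idx

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))   # stand-in for a large tabular dataset
    X_small, idx = distill(X)
    print(X_small.shape)             # (20, 8): a 50x smaller subset
    ```

    The distilled subset can then be shared or analysed in place of the full table; a foundation model as proposed would replace the per-dataset PCA/clustering fit with a single pre-trained encoder.
    
    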



    Published In

    Websci Companion '24: Companion Publication of the 16th ACM Web Science Conference
    May 2024
    128 pages
    ISBN: 9798400704536
    DOI: 10.1145/3630744

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Data Distillation
    2. Privacy
    3. Reproducibility
    4. Tabular Data

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    Websci '24: 16th ACM Web Science Conference
    May 21-24, 2024
    Stuttgart, Germany

    Acceptance Rates

    Websci Companion '24 paper acceptance rate: 27 of 58 submissions (47%)
    Overall acceptance rate: 245 of 933 submissions (26%)
