Advancing Web Science through Foundation Model for Tabular Data

Published: 13 June 2024
Abstract

    As the landscape of web science expands, handling the vast datasets collected from the Web while preserving computational efficiency and privacy remains a significant challenge. Data distillation offers a compelling solution by condensing large datasets into a small distilled subset that retains their essential characteristics. My ongoing thesis work on tabular data distillation has shown that autoencoders and clustering algorithms can effectively distill tabular datasets. Building on this, my next step is to develop a versatile pre-trained model, analogous to BERT and RoBERTa, that can distill arbitrary tabular datasets, streamlining data size reduction, synthetic data generation, large-scale analysis, reproducibility, and privacy preservation. Such a foundation model would serve as a general-purpose tool for web science research, making both data and research more accessible and computationally efficient. It would not be limited to downstream classification: further uses include reducing dataset sizes for efficient analysis, producing privatized synthetic datasets, and enhancing reproducibility through shared distilled data. By developing a foundation model for tabular data distillation, I aim to open new avenues in web science and improve computational accessibility, privacy protection, and reproducibility, offering a versatile means of handling the large amounts of data generated on the Web while preserving its essence.
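    To make the distillation recipe the abstract sketches (encode rows, cluster the latents, keep one representative per cluster) concrete, here is a minimal, hedged sketch in Python. It is not the paper's actual method: PCA stands in for the trained autoencoder encoder, and the function name, dataset, and sizes are all illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    def distill(X, n_latent=2, n_prototypes=20, seed=0):
        """Condense X to n_prototypes rows: encode, cluster, select."""
        # Encode rows into a low-dimensional latent space
        # (PCA stands in for the autoencoder encoder).
        Z = PCA(n_components=n_latent, random_state=seed).fit_transform(X)
        # Cluster the latent codes.
        km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=seed).fit(Z)
        # Keep the original row nearest each cluster centre.
        idx = pairwise_distances_argmin(km.cluster_centers_, Z)
        return X[idx], idx

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))   # stand-in for a large tabular dataset
    X_small, idx = distill(X)
    print(X_small.shape)             # (20, 8): a 50x smaller subset
    ```

    The distilled subset can then be shared or analysed in place of the full table; a foundation model as proposed would replace the per-dataset PCA/clustering fit with a single pre-trained encoder.
    
    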



    Published In

    Websci Companion '24: Companion Publication of the 16th ACM Web Science Conference
    May 2024
    128 pages
    ISBN: 9798400704536
    DOI: 10.1145/3630744

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Data Distillation
    2. Privacy
    3. Reproducibility
    4. Tabular Data

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    Websci '24: 16th ACM Web Science Conference
    May 21-24, 2024
    Stuttgart, Germany

    Acceptance Rates

    Websci Companion '24 paper acceptance rate: 27 of 58 submissions (47%)
    Overall acceptance rate: 245 of 933 submissions (26%)
