
SOS: Score-based Oversampling for Tabular Data

Published: 14 August 2022

Abstract

Score-based generative models (SGMs) are a recent breakthrough in generating fake images. SGMs are known to surpass other generative models, e.g., generative adversarial networks (GANs) and variational autoencoders (VAEs). Inspired by their success, in this work we fully customize them for generating fake tabular data. In particular, we are interested in oversampling minority classes, since imbalanced classes frequently lead to sub-optimal training outcomes. To our knowledge, we are the first to present a score-based tabular data oversampling method. First, we redesign the score network so that it can process tabular data. Second, we propose two options for our generation method: the first is equivalent to a style transfer for tabular data, and the second uses the standard generative policy of SGMs. Lastly, we define a fine-tuning method that further enhances the oversampling quality. In our experiments with 6 datasets and 10 baselines, our method outperforms the other oversampling methods in all cases.

    Published In

    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2022
    5033 pages
    ISBN:9781450393850
    DOI:10.1145/3534678

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. oversampling
    2. score-based generative model
    3. tabular data synthesis

    Qualifiers

    • Research-article

    Funding Sources

    • IITP

    Conference

    KDD '22

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
