DOI: 10.1145/3626232.3653255
research-article

Coherent Multi-Table Data Synthesis for Tabular and Time-Series Data with GANs

Published: 19 June 2024

Abstract

As the use of private user data is increasingly monitored by regulatory institutions for security purposes, its transfer becomes more constrained. Synthetic data has recently emerged as a viable alternative that prevents the disclosure of protected user information while complying with data-sharing regulations. Both the public and private sectors commonly use a combination of tabular and time-series tables that often contain sensitive user-related information. These tables are usually intrinsically interlinked, as they describe the users and their behaviors over different perimeters. Moreover, they contain both numerical and categorical features, which adds complexity to the anonymization task. State-of-the-art generative methods, specialized in either tabular or time-series data, are able to generate high-quality synthetic data. However, if each table is generated independently, it becomes impossible to link the tables, and the usability of the synthetic data suffers. To address this issue, we propose both a coherent multi-table generative model that uses Generative Adversarial Networks (GANs) to sample tabular and time-series tables, and a conditional time-series generative model that handles both numerical and categorical features. Additionally, we conduct experiments to analyse the inner modules of our model and evaluate it on an in-house private dataset, in order to demonstrate the viability of the generated synthetic data for machine learning tasks.

Published In

CODASPY '24: Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy
June 2024
429 pages
ISBN:9798400704215
DOI:10.1145/3626232
General Chair: João P. Vilela
Program Chairs: Haya Schulmann, Ninghui Li
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data privacy
  2. deep learning
  3. gan
  4. generative ai
  5. synthetic data
  6. tabular data
  7. time-series data

Qualifiers

  • Research-article

Conference

CODASPY '24
Acceptance Rates

Overall Acceptance Rate 149 of 789 submissions, 19%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 54
  • Downloads (Last 12 months): 54
  • Downloads (Last 6 weeks): 16
Reflects downloads up to 30 Aug 2024
