research-article

Coherent Multi-Table Data Synthesis for Tabular and Time-Series Data with GANs

Authors: Clément Elliker, Emeric Tonnelier, Aymen Shabou

CODASPY '24: Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy

Pages 245 - 252

https://doi.org/10.1145/3626232.3653255

Published: 19 June 2024

Abstract

As regulatory institutions increasingly monitor the use of private user data for security purposes, its transfer becomes more constrained. Synthetic data has recently emerged as a viable alternative that prevents the disclosure of protected user information while complying with data-sharing regulations. Both the public and private sectors commonly use a combination of tabular and time-series tables that often contain sensitive user-related information. These tables are usually intrinsically interlinked, as they describe the users and their behaviors over different perimeters. Moreover, they contain both numerical and categorical features, which adds complexity to the anonymization task. State-of-the-art generative methods, specialized in either tabular or time-series data, are able to generate high-quality synthetic data. However, if each table is generated independently, it becomes impossible to link them, which reduces the usability of such synthetic data. To address this issue, we propose both a coherent multi-table generative model that uses Generative Adversarial Networks (GANs) to sample tabular and time-series tables, and a conditional time-series generative model that handles both numerical and categorical features. We also conduct a range of experiments to analyse the inner modules of our model and evaluate it on an in-house private dataset, in order to demonstrate the viability of the generated synthetic data for machine learning tasks.
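
The linkage problem described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the two samplers below are hypothetical random stand-ins for trained generators, and every name (`sample_tabular`, `sample_time_series`, `user_id`, `segment`, `amount`, `channel`) is an assumption made for illustration. The sketch only shows the sampling order that keeps tables linkable: draw a user row first, then draw that user's time series conditioned on the row, carrying the shared key along.

```python
import random


def sample_tabular(n_users, rng):
    """Hypothetical stand-in for a tabular generator: one row per synthetic user."""
    segments = ["student", "employee", "retired"]
    return [
        {"user_id": uid, "age": rng.randint(18, 90), "segment": rng.choice(segments)}
        for uid in range(n_users)
    ]


def sample_time_series(user_row, n_steps, rng):
    """Hypothetical stand-in for a conditional time-series generator.

    The sequence depends on the user's row (the condition) and mixes a
    numerical feature (amount) with a categorical one (channel).
    """
    base = {"student": 20.0, "employee": 60.0, "retired": 40.0}[user_row["segment"]]
    channels = ["card", "transfer", "cash"]
    return [
        {
            "user_id": user_row["user_id"],  # shared key keeps the tables linkable
            "step": t,
            "amount": round(rng.gauss(base, 5.0), 2),
            "channel": rng.choice(channels),
        }
        for t in range(n_steps)
    ]


def sample_coherent_tables(n_users, n_steps, seed=0):
    """Sample the user table first, then one conditioned sequence per user."""
    rng = random.Random(seed)
    users = sample_tabular(n_users, rng)
    events = [row for u in users for row in sample_time_series(u, n_steps, rng)]
    return users, events


users, events = sample_coherent_tables(n_users=3, n_steps=4)
```

Sampling the two tables from unrelated generators would give no principled way to assign a sequence to a user; conditioning the second sampler on the first table's row is what makes a join on `user_id` meaningful in this toy setting.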

Published In

CODASPY '24: Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy

June 2024

429 pages

ISBN: 9798400704215

DOI: 10.1145/3626232

General Chair: João P. Vilela, University of Porto, Portugal

Program Chairs: Haya Schulmann, Goethe-Universität Frankfurt | National Research Center for Applied Cybersecurity ATHENE, Germany; Ninghui Li, Purdue University, USA

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CODASPY '24: Fourteenth ACM Conference on Data and Application Security and Privacy

June 19 - 21, 2024

Porto, Portugal

Acceptance Rates

Overall Acceptance Rate 149 of 789 submissions, 19%
