Trading Off Scalability, Privacy, and Performance in Data Synthesis

Ling, Xiao; Menzies, Tim; Hazard, Christopher; Shu, Jack; Beel, Jacob

arXiv:2312.05436 (cs)

[Submitted on 9 Dec 2023]

Abstract: Synthetic data has recently been widely applied in the real world. One typical example is the creation of synthetic data for privacy-sensitive datasets. In this scenario, synthetic data substitutes for the real data, which contains private information, and is used for public testing of machine learning models. Another typical example is the over-sampling of imbalanced data, in which synthetic data is generated in the region of the minority samples to balance the ratio of positive to negative samples when training machine learning models. In this study, we concentrate on the first example and introduce (a) the Howso engine and (b) our proposed random-projection-based synthetic data generation framework. We evaluate these two algorithms on privacy preservation and accuracy, and compare them to two state-of-the-art synthetic data generation algorithms, DataSynthesizer and Synthetic Data Vault. We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score. On the other hand, our proposed random-projection-based framework generates synthetic data with the highest accuracy score and scales the fastest.

Comments:	13 pages, 2 figures, 6 tables, submitted to IEEE Access
Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2312.05436 [cs.SE]
	(or arXiv:2312.05436v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2312.05436

Submission history

From: Xiao Ling
[v1] Sat, 9 Dec 2023 02:04:25 UTC (5,948 KB)
