Accurate Synthetic Generation of Realistic Personal Information

Christen, Peter; Pudjijono, Agus

doi:10.1007/978-3-642-01307-2_47

Peter Christen²³ &
Agus Pudjijono²⁴

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 5476))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

3493 Accesses

Abstract

A large portion of data collected by many organisations today is about people, and often contains personal identifying information, such as names and addresses. Privacy and confidentiality are of great concern when such data is being shared between organisations or made publicly available. Research in (privacy-preserving) data mining and data linkage is suffering from a lack of publicly available real-world data sets that contain personal information, and therefore experimental evaluations can be difficult to conduct. In order to overcome this problem, we have developed a data generator that allows flexible creation of synthetic data containing personal information with realistic characteristics, such as frequency distributions, attribute dependencies, and error probabilities. Our generator significantly improves earlier approaches, and allows the generation of data for individuals, families and households.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Exploring Privacy-Preserving Techniques on Synthetic Data as a Defense Against Model Inversion Attacks

Privacy Risk from Synthetic Data: Practical Proposals

synple: A Platform for Privacy Preserving Synthetic Patient Data Generation

References

Christen, P.: Privacy-preserving data linkage and geocoding: Current approaches and research directions. In: ICDM PADM workshop, Hong Kong (2006)
Google Scholar
Hernandez, M., Stolfo, S.: Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery 2(1), 9–37 (1998)
Article Google Scholar
Christen, P., Goiser, K.: Quality and complexity measures for data linkage and deduplication. In: Quality Measures in Data Mining. Studies in Computational Intelligence, vol. 43, pp. 127–151. Springer, Heidelberg (2007)
Chapter Google Scholar
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1–16 (2007)
Article Google Scholar
Christen, P.: Probabilistic data generation for deduplication and data linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds.) IDEAL 2005. LNCS, vol. 3578, pp. 109–116. Springer, Heidelberg (2005)
Chapter Google Scholar
Bertolazzi, P., De Santis, L., Scannapieco, M.: Automated record matching in cooperative information systems. In: DQCIS, Siena, Italy (2003)
Google Scholar
Pudjijono, A.: Probabilistic data generation. Master of Computing (Honours) thesis, Department of Computer Science, The Australian National University (2008)
Google Scholar
Pollock, J., Zamora, A.: Automatic spelling correction in scientific and scholarly text. Communications of the ACM 27(4), 358–368 (1984)
Article Google Scholar
Christen, P.: A comparison of personal name matching: Techniques and practical issues. In: ICDM MCD workshop, Hong Kong (2006)
Google Scholar
Christen, P.: Febrl – An open source data cleaning, deduplication and record linkage system with a graphical user interface. In: ACM KDD, Las Vegas (2008)
Google Scholar
Phua, C., Lee, V., Smith-Miles, K.: The personal name problem and a recommended data mining solution. In: Encyclopedia of Data Warehousing and Mining, 2nd edn., Information Science Reference (2008)
Google Scholar
Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
Article Google Scholar
Hall, P., Dowling, G.: Approximate string matching. ACM Computing Surveys 12(4), 381–402 (1980)
Article MathSciNet Google Scholar
Kukich, K.: Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), 377–439 (1992)
Article Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science, The Australian National University, Canberra, ACT 0200, Australia
Peter Christen
Data Center, Ministry of Public Works of Republic of Indonesia, Jakarta, 12110, Indonesia
Agus Pudjijono

Authors

Peter Christen
View author publications
You can also search for this author in PubMed Google Scholar
Agus Pudjijono
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Christen, P., Pudjijono, A. (2009). Accurate Synthetic Generation of Realistic Personal Information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science(), vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_47

Download citation

DOI: https://doi.org/10.1007/978-3-642-01307-2_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics