DOI: 10.1145/2505515.2508207

GeCo: an online personal data generator and corruptor

Published: 27 October 2013

Abstract

We demonstrate GeCo, an online personal data GEnerator and COrruptor that facilitates the creation of realistic personal data ranging from names, addresses, and dates, to social security and credit card numbers, as well as numerical values such as salary or blood pressure. Using an intuitive Web interface, a user can create records containing such data according to their needs, and apply various corruption functions to generate duplicates of these records. Synthetic personal data are increasingly required in areas such as record de-duplication, fraud detection, cloud computing, and health informatics, where data quality issues can significantly affect the outcomes of data integration, processing, and mining projects. Privacy concerns, however, often make it difficult for researchers to obtain real data that contain personal details. Compared to other data generators that have to be downloaded, installed, and customized, GeCo allows the creation of personal data with much less effort. In this demonstration we show (1) how different types of attributes, and dependencies between them, can be specified; (2) how the generated data can be modified using various types of corruption functions; and (3) how a user can contribute to GeCo by providing attribute generation functions and look-up files. We believe GeCo will be a valuable tool for researchers who require realistic personal data to evaluate their algorithms with regard to efficiency and effectiveness.
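
The generate-then-corrupt workflow described in the abstract can be illustrated with a minimal Python sketch. This is not GeCo's code or API: the look-up lists, the record layout, the "-org"/"-dup" identifier scheme, and the function names generate_record, corrupt_edit, and make_duplicate are all assumptions made for illustration only.

```python
import random

# Hypothetical look-up values; GeCo's actual look-up files are not shown here.
GIVEN_NAMES = ["emma", "liam", "olivia", "noah"]
SURNAMES = ["smith", "jones", "nguyen", "taylor"]


def generate_record(rec_id, rng):
    """Create one synthetic record from the illustrative look-up lists."""
    return {
        "rec_id": f"rec-{rec_id}-org",
        "given_name": rng.choice(GIVEN_NAMES),
        "surname": rng.choice(SURNAMES),
        # Numerical attribute drawn from an assumed normal distribution.
        "salary": round(rng.gauss(60000, 15000), 2),
    }


def corrupt_edit(value, rng):
    """Apply one random character edit: insert, delete, substitute, or transpose."""
    if len(value) < 2:
        return value  # too short to edit safely
    pos = rng.randrange(len(value) - 1)
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    letter = rng.choice("abcdefghijklmnopqrstuvwxyz")
    if op == "insert":
        return value[:pos] + letter + value[pos:]
    if op == "delete":
        return value[:pos] + value[pos + 1:]
    if op == "substitute":
        return value[:pos] + letter + value[pos + 1:]
    # transpose: swap the characters at pos and pos + 1
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def make_duplicate(record, dup_num, rng):
    """Copy a record and corrupt one randomly chosen text attribute."""
    dup = dict(record)
    dup["rec_id"] = record["rec_id"].replace("-org", f"-dup-{dup_num}")
    attr = rng.choice(["given_name", "surname"])
    dup[attr] = corrupt_edit(dup[attr], rng)
    return dup


rng = random.Random(42)  # fixed seed so a run is repeatable
originals = [generate_record(i, rng) for i in range(3)]
duplicates = [make_duplicate(rec, 0, rng) for rec in originals]
```

Each duplicate keeps a link to its original through the shared identifier prefix, which is what lets a de-duplication algorithm be scored against known true matches. Note that a substitution or transposition can occasionally leave the value unchanged in this sketch.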

Published In

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013
2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data generation
  2. duplicates
  3. online demo
  4. synthetic data

Qualifiers

  • Demonstration

Conference

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management
October 27 - November 1, 2013
San Francisco, California, USA

Acceptance Rates

CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2024) Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling. Gates Open Research 8 (36). DOI: 10.12688/gatesopenres.15418.2. Online publication date: 18-Oct-2024
  • (2024) Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling. Gates Open Research 8 (36). DOI: 10.12688/gatesopenres.15418.1. Online publication date: 3-May-2024
  • (2024) Fast Bayesian Record Linkage for Streaming Data Contexts. Journal of Computational and Graphical Statistics 33:3 (833-844). DOI: 10.1080/10618600.2023.2283571. Online publication date: 3-Jan-2024
  • (2024) Gecko: A Python library for the generation and mutation of realistic personal identification data at scale. SoftwareX 27 (101846). DOI: 10.1016/j.softx.2024.101846. Online publication date: Sep-2024
  • (2023) Privacy-Preserving Record Linkage for Cardinality Counting. Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security (53-64). DOI: 10.1145/3579856.3590338. Online publication date: 10-Jul-2023
  • (2022) (Almost) all of entity resolution. Science Advances 8:12. DOI: 10.1126/sciadv.abi8021. Online publication date: 25-Mar-2022
  • (2022) Fairness and Cost Constrained Privacy-Aware Record Linkage. IEEE Transactions on Information Forensics and Security 17 (2644-2656). DOI: 10.1109/TIFS.2022.3191492. Online publication date: 2022
  • (2022) Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association 118:543 (1786-1795). DOI: 10.1080/01621459.2021.2013242. Online publication date: 28-Jan-2022
  • (2022) A Practical Approach to Proper Inference with Linked Data. The American Statistician 76:4 (384-393). DOI: 10.1080/00031305.2022.2041482. Online publication date: 23-Mar-2022
  • (2021) Optimization of the Mainzelliste software for fast privacy-preserving record linkage. Journal of Translational Medicine 19:1. DOI: 10.1186/s12967-020-02678-1. Online publication date: 15-Jan-2021
