research-article

Data generation using declarative constraints

Authors: Raghav Kaushik, Jian Li

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Pages 685 - 696

https://doi.org/10.1145/1989323.1989395

Published: 12 June 2011

Abstract

We study the problem of generating synthetic databases having declaratively specified characteristics. This problem is motivated by database system and application testing, data masking, and benchmarking. While the data generation problem has been studied before, prior approaches are either non-declarative or have fundamental limitations relating to the data characteristics that they can capture and efficiently support. We argue that a natural, expressive, and declarative mechanism for specifying data characteristics is through cardinality constraints; a cardinality constraint specifies that the output of a query over the generated database have a certain cardinality. While the data generation problem is intractable in general, we present efficient algorithms that can handle a large and useful class of constraints. We include a thorough empirical evaluation illustrating that our algorithms handle complex constraints, scale well as the number of constraints increases, and outperform applicable prior techniques.
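
To make the notion of a cardinality constraint concrete, the following is a minimal sketch of checking whether a candidate database satisfies a set of such constraints. It is not the paper's generation algorithm. The `orders` table, its columns, the sample rows, and the helper `satisfies` are hypothetical names chosen for illustration, and SQLite is assumed only as a convenient query engine.

```python
import sqlite3

def satisfies(conn, constraints):
    """Return True if every (query, k) pair yields exactly k rows.

    Each constraint pairs a SQL query with the cardinality its output
    must have over the database (hypothetical helper, for illustration).
    """
    for query, expected in constraints:
        (actual,) = conn.execute(f"SELECT COUNT(*) FROM ({query})").fetchone()
        if actual != expected:
            return False
    return True

# A tiny hand-written candidate database (assumed schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 50), (2, "EU", 200), (3, "US", 75), (4, "US", 300)],
)

# Declarative specification: each query's output must have the given size.
constraints = [
    ("SELECT * FROM orders", 4),
    ("SELECT * FROM orders WHERE region = 'EU'", 2),
    ("SELECT * FROM orders WHERE amount > 100", 2),
]

result = satisfies(conn, constraints)
```

A data generator in this setting would work in the opposite direction: given only the constraint list, it would have to produce rows for which a check like this succeeds.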


Published In

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

June 2011

1364 pages

ISBN: 9781450306614

DOI: 10.1145/1989323

General Chair: Timos Sellis (IMIS/RC Athena)

Program Chair: Renée J. Miller (University of Toronto)

Publications Chairs: Anastasios Kementsietsidis (IBM T.J. Watson Research Center), Yannis Velegrakis (University of Trento)
Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS '11: International Conference on Management of Data

Sponsor: SIGMOD

June 12 - 16, 2011

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 67
Total Downloads: 838
Downloads (Last 12 months): 49
Downloads (Last 6 weeks): 3

Reflects downloads up to 09 Nov 2024

Cited By

Sun Y, Zhu J, Xu X, Xu X, Sun Y, Song S, Li X, Yuan X (2024). Win-Win: On Simultaneous Clustering and Imputing over Incomplete Data. Proceedings of the VLDB Endowment, 17:11 (3045-3057). Online publication date: 30-Aug-2024.
https://doi.org/10.14778/3681954.3681982

Enzler E, Goerens O, Alonso G, Gruenheid A (2024). Performance Truthfulness of Differential Privacy for DB Testing. Proceedings of the Tenth International Workshop on Testing Database Systems (30-35). Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3662165.3662762

Weng S, Wang Q, Qu L, Zhang R, Cai P, Qian W, Zhou A (2024). Lauca: A Workload Duplicator for Benchmarking Transactional Database Performance. IEEE Transactions on Knowledge and Data Engineering, 36:7 (3180-3194). Online publication date: Jul-2024.
https://doi.org/10.1109/TKDE.2024.3360116

Wang Q, Li H, Hu Z, Zhang R, Yang C, Cai P, Zhou X, Zhou A (2024). Mirage: Generating Enormous Databases for Complex Workloads. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3989-4001). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00306

Li H, Wang Q, Hu Z, Huang X, Ni L, Zhang R, Cai P, Zhou X, Xu Q (2024). Touchstone+: Query Aware Database Generation for Match Operators. Database Systems for Advanced Applications (266-282). Online publication date: 1-Oct-2024.
https://doi.org/10.1007/978-981-97-5552-3_18

Yan C, Nath S, Lu S, Grundy J, Pollock L, Penta M (2023). Generating Test Databases for Database-Backed Applications. Proceedings of the 45th International Conference on Software Engineering (2048-2059). Online publication date: 14-May-2023.
https://dl.acm.org/doi/10.1109/ICSE48619.2023.00173

Negi P, Bindschaedler L, Alizadeh M, Kraska T, Leeka J, Gruenheid A, Interlandi M (2023). Unshackling Database Benchmarking from Synthetic Workloads. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3659-3662). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00292

Sanghi A, Haritsa J (2023). Synthetic Data Generation for Enterprise DBMS. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3585-3588). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00274

Oriol X, Teniente E, Maynou M, Nadal S (2023). Generating valid test data through data cloning. Future Generation Computer Systems, 144 (179-191). Online publication date: Jul-2023.
https://doi.org/10.1016/j.future.2023.02.020

Maltry M, Dittrich J (2022). A critical analysis of recursive model indexes. Proceedings of the VLDB Endowment, 15:5 (1079-1091). Online publication date: 18-May-2022.
https://dl.acm.org/doi/10.14778/3510397.3510405
