DOI: 10.1145/1938551.1938559
research-article

Generating, sampling and counting subclasses of regular tree languages

Published: 21 March 2011

Abstract

To experimentally validate learning and approximation algorithms for XML Schema Definitions (XSDs), we need algorithms that generate a corpus of XSDs uniformly at random, as well as a similarity measure that quantifies how closely a generated XSD resembles the target schema. In this paper, we provide the formal foundation for such a testbed. We adopt similarity measures based on counting the number of common and different trees in the two languages, and we develop the necessary machinery for computing them. We use the formalism of extended DTDs (EDTDs) to represent the unranked regular tree languages. In particular, we obtain an efficient algorithm to count the number of trees up to a certain size in an unambiguous EDTD. The class of unambiguous EDTDs encompasses the more familiar classes of single-type, restrained competition and bottom-up deterministic EDTDs. The single-type EDTDs correspond precisely to the core of XML Schema, while the others are strictly more expressive. We also show how constraints on the shape of allowed trees can be incorporated. As we make use of a translation into a well-known formalism for combinatorial specifications, we get for free a sampling procedure to draw members of any unambiguous EDTD. When dropping the restriction to unambiguous EDTDs, i.e., taking the full class of EDTDs into account, we show that the counting problem becomes #P-complete and provide an approximation algorithm. Finally, we discuss uniform generation of single-type EDTDs, i.e., the formal abstraction of XSDs. To this end, we provide an algorithm to generate k-occurrence automata (k-OAs) uniformly at random and show how this leads to uniform generation of single-type EDTDs.
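To give a feel for what counting trees up to a certain size involves, here is a minimal Python sketch. It is not the paper's algorithm, which goes through a translation into combinatorial specifications; it is a direct dynamic program under two assumptions made only for this illustration: each content model is given as a deterministic automaton over types, and the schema is unambiguous, so that counting typed derivations is the same as counting trees. The names `make_counter`, `count_up_to` and the dictionary encoding of content models are hypothetical.

```python
from functools import lru_cache


def make_counter(content_models):
    """Build a tree counter for a toy schema (hypothetical encoding).

    content_models maps each type to (start_state, accepting_states, delta),
    where delta maps (state, child_type) to the next state of a DFA that
    describes the allowed sequences of child types.
    """

    @lru_cache(maxsize=None)
    def trees(t, n):
        # Number of trees with exactly n nodes whose root has type t.
        if n < 1:
            return 0
        start, _, _ = content_models[t]
        # One node is the root; the remaining n - 1 nodes form the child hedge.
        return hedges(t, start, n - 1)

    @lru_cache(maxsize=None)
    def hedges(t, state, m):
        # Number of child sequences with m nodes in total that drive the
        # content-model DFA of type t from `state` to an accepting state.
        _, accepting, delta = content_models[t]
        total = 1 if (m == 0 and state in accepting) else 0
        for (q, child), nxt in delta.items():
            if q != state:
                continue
            # The next child subtree uses k nodes, the rest of the hedge m - k.
            for k in range(1, m + 1):
                total += trees(child, k) * hedges(t, nxt, m - k)
        return total

    return trees


def count_up_to(trees, root_type, max_size):
    # Number of trees with at most max_size nodes rooted at root_type.
    return sum(trees(root_type, n) for n in range(1, max_size + 1))


# Toy schema 1: a single type "a" with content model a* (any number of children).
any_arity = make_counter({"a": (0, {0}, {(0, "a"): 0})})

# Toy schema 2: a single type "a" with content model (a a)?,
# i.e. every node has either zero or exactly two children.
binary = make_counter({"a": (0, {0, 2}, {(0, "a"): 1, (1, "a"): 2})})

if __name__ == "__main__":
    print(count_up_to(any_arity, "a", 5))
    print(count_up_to(binary, "a", 7))
```

Both toy schemas use one type per label, so they are single-type and therefore unambiguous. For an ambiguous EDTD the same recurrence would count typings rather than trees and could overcount, which is consistent with the abstract's remark that the general counting problem is #P-complete.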



Published In

ICDT '11: Proceedings of the 14th International Conference on Database Theory
March 2011
285 pages
ISBN: 9781450305297
DOI: 10.1145/1938551
Program Chair: Tova Milo
Publisher

Association for Computing Machinery

New York, NY, United States

Conference

EDBT/ICDT '11
EDBT/ICDT '11: EDBT/ICDT '11 joint conference
March 21 - 24, 2011
Uppsala, Sweden

Cited By

  • (2015) Optimal Probabilistic Generation of XML Documents. Theory of Computing Systems 57:4 (806-842). DOI: 10.1007/s00224-014-9581-5. Online publication date: 1-Nov-2015
  • (2012) Bounded repairability for regular tree languages. Proceedings of the 15th International Conference on Database Theory (155-168). DOI: 10.1145/2274576.2274593. Online publication date: 26-Mar-2012
  • (2012) Finding optimal probabilistic generators for XML collections. Proceedings of the 15th International Conference on Database Theory (127-139). DOI: 10.1145/2274576.2274591. Online publication date: 26-Mar-2012
  • (2012) Auto-completion learning for XML. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (669-672). DOI: 10.1145/2213836.2213928. Online publication date: 20-May-2012