DOI: 10.1145/2274576.2274591
research-article

Finding optimal probabilistic generators for XML collections

Published: 26 March 2012

Abstract

Given a corpus of XML documents and its schema, we study the problem of finding an optimal generative probabilistic model, where optimality means maximizing the likelihood of generating that particular corpus. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model in the absence of constraints. We then study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints, and consider two kinds of generators for this case. The first is a continuation-test generator that performs schema satisfiability tests while generating documents; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. The second is a restart generator that may generate an invalid document and, when this happens, restarts and tries again. Finally, we consider the injection of data values into the structure to obtain a full XML document, and we study different approaches for generating these values.
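
As a rough illustration of two ideas in the abstract, the sketch below fits a structure-only generator by relative frequency and wraps it in a restart loop. It is not the paper's algorithm. It assumes a toy tree encoding `(label, [children])`, treats each observed sequence of child labels as a single production (a simplification of a real schema), and uses a hypothetical predicate `is_valid` as a stand-in for key, inclusion, or domain constraints. All function names are invented for this example.

```python
import random
from collections import Counter, defaultdict

def estimate(corpus):
    """Relative-frequency estimate of P(child-label sequence | parent label).

    Assumption: each distinct child sequence is one production, so the
    estimate is simply count / total per parent label.
    """
    counts = defaultdict(Counter)

    def visit(node):
        label, children = node
        counts[label][tuple(child[0] for child in children)] += 1
        for child in children:
            visit(child)

    for doc in corpus:
        visit(doc)
    return {
        label: {seq: n / sum(ctr.values()) for seq, n in ctr.items()}
        for label, ctr in counts.items()
    }

def generate(model, label, rng):
    """Sample a tree top-down from the estimated model."""
    seqs = list(model[label])
    weights = [model[label][seq] for seq in seqs]
    chosen = rng.choices(seqs, weights=weights)[0]
    return (label, [generate(model, child, rng) for child in chosen])

def restart_generate(model, root, is_valid, rng, max_tries=1000):
    """Restart strategy: sample freely, discard invalid documents, retry."""
    for _ in range(max_tries):
        doc = generate(model, root, rng)
        if is_valid(doc):
            return doc
    raise RuntimeError("no valid document found within max_tries")

# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ("r", [("a", []), ("a", [])]),
    ("r", [("a", [])]),
    ("r", [("a", []), ("b", [])]),
]
model = estimate(corpus)

# Stand-in constraint: the root must have no "b" child.
def no_b(doc):
    return all(child[0] != "b" for child in doc[1])

doc = restart_generate(model, "r", no_b, random.Random(0))
```

The restart loop leaves the underlying sampler untouched and only filters its output. A continuation-test generator would instead check, before each choice, whether the partial document can still be completed into a valid one, which is where the expensive satisfiability tests mentioned above come in.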



Published In

ICDT '12: Proceedings of the 15th International Conference on Database Theory
March 2012
329 pages
ISBN: 9781450307918
DOI: 10.1145/2274576
General Chair: Alin Deutsch
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. XML
  2. constraints
  3. generator
  4. probabilistic model
  5. schema

Qualifiers

  • Research-article

Conference

ICDT '12


Cited By

  • (2022) Designing XML Schema Inference Algorithm for Intra-enterprise Use. Perspectives in Business Informatics Research, pp. 35-49. DOI: 10.1007/978-3-031-16947-2_3. Online publication date: 16-Sep-2022
  • (2015) Optimal Probabilistic Generation of XML Documents. Theory of Computing Systems 57:4, pp. 806-842. DOI: 10.1007/s00224-014-9581-5. Online publication date: 1-Nov-2015
  • (2014) Discovering XSD Keys from XML Data. ACM Transactions on Database Systems 39:4, pp. 1-49. DOI: 10.1145/2638547. Online publication date: 30-Dec-2014
  • (2013) Discovering XSD keys from XML data. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 61-72. DOI: 10.1145/2463676.2463705. Online publication date: 22-Jun-2013
  • (2013) On the connections between relational and XML probabilistic data models. Proceedings of the 29th British National conference on Big Data, pp. 121-134. DOI: 10.1007/978-3-642-39467-6_13. Online publication date: 8-Jul-2013
  • (2013) Probabilistic XML: Models and Complexity. Advances in Probabilistic Databases for Uncertain Information Management, pp. 39-66. DOI: 10.1007/978-3-642-37509-5_3. Online publication date: 2013
  • (2012) Auto-completion learning for XML. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 669-672. DOI: 10.1145/2213836.2213928. Online publication date: 20-May-2012
  • (2012) The ERC webdam on foundations of web data management. Proceedings of the 21st International Conference on World Wide Web, pp. 211-214. DOI: 10.1145/2187980.2188011. Online publication date: 16-Apr-2012
