Research article. DOI: 10.1145/2463676.2463705

Discovering XSD keys from XML data

Published: 22 June 2013

Abstract

A great deal of research into the learning of schemas from XML data has been conducted in recent years to enable the automatic discovery of XML Schemas from XML documents when no schema, or only a low-quality one, is available. Unfortunately, and in strong contrast to, for instance, the relational model, the automatic discovery of even the simplest of XML constraints, namely XML keys, has been left largely unexplored in this context. A major obstacle here is the unavailability of a theory on reasoning about XML keys in the presence of XML schemas, which is needed to validate the quality of candidate keys. The present paper embarks on a fundamental study of such a theory and classifies the complexity of several crucial properties concerning XML keys in the presence of an XSD, such as testing for consistency, boundedness, satisfiability, universality, and equivalence. Of independent interest, novel results are obtained related to cardinality estimation of XPath result sets. A mining algorithm is then developed within the framework of levelwise search. The algorithm leverages known discovery algorithms for functional dependencies in the relational model, but incorporates the above-mentioned properties to assess and refine the quality of derived keys. An experimental study on an extensive body of real-world XML data evaluating the effectiveness of the proposed algorithm is provided.
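To make the idea of levelwise search concrete, the following is a minimal sketch for the simpler relational setting that the abstract says the algorithm builds on. It is not the paper's XSD key mining algorithm: it ignores XML structure, schemas, and the quality properties listed above, and it uses a simplified variant that enumerates all attribute combinations per level rather than generating candidates from the previous level. The function names `is_unique` and `minimal_keys` and the sample rows are illustrative assumptions.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in columns)
        if value in seen:
            return False
        seen.add(value)
    return True


def minimal_keys(rows, attributes):
    """Levelwise search for minimal keys of a table (illustrative only).

    Level k examines attribute sets of size k. A candidate is skipped when
    it contains a key found at a lower level, because it cannot be minimal.
    """
    keys = []
    for size in range(1, len(attributes) + 1):
        for candidate in combinations(attributes, size):
            # Prune supersets of keys that were already discovered.
            if any(set(key) <= set(candidate) for key in keys):
                continue
            if is_unique(rows, candidate):
                keys.append(candidate)
    return keys


# Hypothetical sample data, made up for this example.
rows = [
    {"isbn": "111", "title": "XML Basics", "year": 2010},
    {"isbn": "222", "title": "XML Basics", "year": 2012},
    {"isbn": "333", "title": "Schemas", "year": 2012},
]
print(minimal_keys(rows, ["isbn", "title", "year"]))
# prints [('isbn',), ('title', 'year')]
```

Here `isbn` alone identifies every row, so no larger set containing it is tested; `title` and `year` each repeat, but their combination is unique and is therefore reported as a second minimal key.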

    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN:9781450320375
    DOI:10.1145/2463676
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. key
    2. mining
    3. xml

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2018) Data Profiling. Synthesis Lectures on Data Management 10:4 (1-154). DOI: 10.2200/S00878ED1V01Y201810DTM052. Online publication date: 7-Nov-2018
    • (2017) Research on the translation from XSD to JSON schema. 2017 IEEE 9th International Conference on Communication Software and Networks (ICCSN) (1393-1396). DOI: 10.1109/ICCSN.2017.8230338. Online publication date: May-2017
    • (2015) The (Almost) Complete Guide to Tree Pattern Containment. Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (117-130). DOI: 10.1145/2745754.2745766. Online publication date: 20-May-2015
    • (2015) Profiling relational data. The VLDB Journal — The International Journal on Very Large Data Bases 24:4 (557-581). DOI: 10.1007/s00778-015-0389-y. Online publication date: 1-Aug-2015
    • (2014) Discovering XSD Keys from XML Data. ACM Transactions on Database Systems 39:4 (1-49). DOI: 10.1145/2638547. Online publication date: 30-Dec-2014
