Inferring XML Schema Definitions From XML Data
●Problem
●Solution
●Related work
●Background
●Contributions
➔ iLocal algorithm
➔ Reduce algorithm
➔ iXSD
●Experimental evaluation
Problem
XML
<library>
<borrowed>
<person>
<name/><tel/><email/>
</person>
<book>
<id/> <author/> <time/>
</book>
</borrowed>
<stock>
<book>
<id/> <author/>
<nbBooks/> <bookshelf/>
</book>
</stock>
</library>
DTD
<!ELEMENT library (borrowed*,stock+)>
<!ELEMENT borrowed (person,book+)>
<!ELEMENT stock (book)+>
<!ELEMENT person (name,tel+,email?)>
<!ELEMENT book (id,author,nbBooks?,(bookshelf|time)?)>
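To make the problem concrete, the following minimal sketch groups the child sequences of the example document by element name only, which is all the information a DTD can use. The function name `children_by_name` and the constant `DOC` are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example document from the slide, written on fewer lines.
DOC = """
<library>
  <borrowed>
    <person><name/><tel/><email/></person>
    <book><id/><author/><time/></book>
  </borrowed>
  <stock>
    <book><id/><author/><nbBooks/><bookshelf/></book>
  </stock>
</library>
"""

def children_by_name(xml_text):
    """Map each element name to the set of child-name sequences seen under it."""
    seen = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        seen[elem.tag].add(tuple(child.tag for child in elem))
    return seen

# Both kinds of book end up under the single name "book".
for sequence in sorted(children_by_name(DOC)["book"]):
    print(sequence)
```

A DTD has to cover both sequences with one content model for book, which is why the rule for book above needs optional parts and a choice.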
Solution
XML
<library>
<borrowed>
<person>
<name/><tel/><email/>
</person>
<book>
<id/> <author/> <time/>
</book>
</borrowed>
<stock>
<book>
<id/> <author/> <nbBooks/>
<bookshelf/>
</book>
</stock>
</library>
XSD
root -> library[library]
library -> borrowed[borrowed]*, stock[stock]
borrowed -> person[person], book[book1]+
stock -> book[book2]+
person -> name[emp], tel[emp]?, email[emp]+
book1 -> id[emp], author[emp], time[emp]
book2 -> id[emp], author[emp], nbBooks[emp], bookshelf[emp]
emp -> #PCDATA
Related Work
● Schema inference (SSD)
○ Restricting algorithms to trees
○ No order considered between the children of a node
-> XSD schemas can’t be derived
● DTD inference
● XSD inference
○ Trang
○ Xstruct
-> The expressiveness of the generated schema does not go beyond that of a DTD.
● Learning of tree automata
○ Inferring queries, not XSD
Background
Definition 1 :
k-local XSD: An XSD is k-local if the content model of every element depends only on the labels up to its k-th ancestor.
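As a small illustration of k-locality, the sketch below computes the context a k-local schema may look at for one element. It assumes that a path is encoded as a list of labels from the root down to the element, and that k counts the element itself (so k = 2 means the element and its parent). The helper name `k_suffix` is hypothetical.

```python
def k_suffix(path, k):
    """Return the last k labels of a root-to-element path (the whole path if it is shorter)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return tuple(path[-k:])

borrowed_book = ["library", "borrowed", "book"]
stock_book = ["library", "stock", "book"]

# With k = 1 only the element name is visible, as in a DTD.
print(k_suffix(borrowed_book, 1) == k_suffix(stock_book, 1))  # True
# With k = 2 the parent is visible too, so the two books get different contexts.
print(k_suffix(borrowed_book, 2) == k_suffix(stock_book, 2))  # False
```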
Definition 2 :
SORE (single occurrence regular expression): A regular expression r is single occurrence if every
element name occurs at most once in it. An XSD is single occurrence (SOXSD) if it contains only SOREs.
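The single occurrence condition can be checked mechanically. The sketch below assumes the expression is written in DTD-like syntax and that element names consist of letters, digits, underscores, dots and hyphens. The function name `is_single_occurrence` is hypothetical.

```python
import re
from collections import Counter

def is_single_occurrence(expr):
    """Return True if every element name occurs at most once in the expression."""
    names = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return all(count == 1 for count in Counter(names).values())

print(is_single_occurrence("name,tel+,email?"))        # True
print(is_single_occurrence("(id,author)|(id,time)"))   # False: id occurs twice
```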
Definition 3 :
SOA (single occurrence automaton): a graph A = (V,E) where all states in V-{in,out} are element
names, and E ⊆ (V-{out}) × (V-{in}) is the edge relation, so no edge leaves out and no edge enters in.
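One simple way to obtain such a graph from example strings is to add an edge for every pair of adjacent element names, plus edges from in to each first name and from each last name to out. The slides do not state how the SOA is learned, so this construction is an assumption, and the function name `learn_soa` is hypothetical.

```python
def learn_soa(strings):
    """Build (states, edges) from sequences of element names."""
    states = {"in", "out"}
    edges = set()
    for string in strings:
        previous = "in"
        for name in string:
            states.add(name)
            edges.add((previous, name))
            previous = name
        edges.add((previous, "out"))  # an empty string yields the edge (in, out)
    return states, edges

# Child sequences of person taken from two documents.
states, edges = learn_soa([("name", "tel", "email"), ("name", "email")])
for edge in sorted(edges):
    print(edge)
```

Each element name appears as exactly one state, which is what makes the automaton single occurrence.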
iLocal Algorithm:
➔ T = { p/k | p ∈ paths(C) }, the set of types
➔ ρ ← Ø ; τ ← Ø
➔ construct the content model for each type p/k in T:
◆ learn the SOA for the set k-strings(C, p/k) of all strings occurring in C below a
path q that is k-equivalent to p, that is, q/k = p/k
◆ transform this SOA into a SORE r
◆ add the rule p/k -> r to ρ
➔ for each path pa in paths(C), add (p/k, a) -> (pa)/k to τ
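The steps above can be sketched in code as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: p/k is taken to be the last k labels of the path p, the SOA is learned by adding an edge for every pair of adjacent names, and the SOA-to-SORE step is skipped, so ρ holds SOA edge sets instead of SOREs. The names `walk` and `ilocal_sketch` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def walk(elem, path=()):
    """Yield (path, child names) for elem and all its descendants; path includes the element."""
    path = path + (elem.tag,)
    yield path, tuple(child.tag for child in elem)
    for child in elem:
        yield from walk(child, path)

def ilocal_sketch(docs, k):
    k_strings = defaultdict(set)  # type p/k -> child strings seen below equivalent paths
    tau = {}                      # (p/k, a) -> (pa)/k
    for doc in docs:
        for path, children in walk(ET.fromstring(doc)):
            typ = path[-k:]
            k_strings[typ].add(children)
            for a in children:
                tau[(typ, a)] = (path + (a,))[-k:]
    rho = {}                      # type -> SOA edges (stand-in for the SORE)
    for typ, strings in k_strings.items():
        edges = set()
        for string in strings:
            previous = "in"
            for name in string:
                edges.add((previous, name))
                previous = name
            edges.add((previous, "out"))
        rho[typ] = edges
    return rho, tau

doc = ("<library><borrowed><person><name/><tel/><email/></person>"
       "<book><id/><author/><time/></book></borrowed>"
       "<stock><book><id/><author/><nbBooks/><bookshelf/></book></stock></library>")
rho, tau = ilocal_sketch([doc], k=2)
print(sorted(rho[("borrowed", "book")]))
print(sorted(rho[("stock", "book")]))
```

With k = 2 the two kinds of book are separate types, so each gets its own content model.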
Let C be the corpus consisting of these two XML documents:
XML 1
<library>
<borrowed>
<person>
<name/><tel/><email/>
</person>
<book>
<id/> <author/> <time/>
</book>
</borrowed>
<stock>
<book>
<id/> <author/> <nbBooks/>
<bookshelf/>
</book>
</stock>
</library>
XML 2
<library>
<borrowed>
<person>
<name/><email/>
</person>
<book>
<id/> <author/> <time/>
</book>
</borrowed>
<borrowed> ... </borrowed>
<stock>
<book>
<id/> <author/> <nbBooks/>
<bookshelf/>
</book>
<book>
<id/> <author/> <nbBooks/>
<bookshelf/>
<book/>
</book>
</stock>
</library>
Running iLocal:
k=2
SOA example
SORE example
Example: determining the types associated with the element names in these content models, for k = 2
Reduce of iLocal:
Experimental Evaluation
Personal Opinion
Conclusion
❖ Problem: a Document Type Definition is not expressive enough as an inference target for
XML: the content model of an element can depend only on the element name, not on the
context in which it is used.
❖ Solution: infer an XML Schema Definition. An XSD allows the content model of an
element to depend on the context in which it is used.
❖ Background:
➢ Definition of XSD
➢ SORE
➢ SOA
❖ Contribution
➢ iLocal
➢ Reduce
➢ iXSD = iLocal + Reduce
❖ Experimental Evaluation