2.2.1 Structure-Based Mapping.
The first category is schema-based mapping, also called the structure-based approach. Here, the schema (structure) is the XML schema or DTD, which describes the structure of XML data and facilitates data exchange among applications that agree on the meaning of tags. Schema knowledge helps design a more compact storage schema by eliminating redundancy, and it improves query efficiency by significantly reducing the number of joins (e.g., by inlining as many suitable elements as possible into a single table). This part therefore reviews the development of the structure-based approach and summarizes the surveyed methods in Table 2.
Inlining is proposed in [117]. It first applies a set of transformations to simplify the original DTD while preserving its semantics, then represents the simplified DTD as a DTD graph and converts that graph to relations. Mapping every element directly to its own relation, however, is likely to fragment the document excessively. Basic inlining addresses this fragmentation by inlining as many descendants of an element as possible into a single relation. Because Basic allows an element node to appear repeatedly in multiple tables, Shared inlining identifies such element nodes and creates separate tables that are shared among their parents. To control the number of tables, Shared also provides rules for deciding whether or not to create a relation. Finally, Hybrid inlining combines the join-reduction properties of Basic with the sharing features of Shared to improve query performance.
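To make the Shared decision more concrete, the following minimal sketch marks which elements of a toy DTD graph would receive their own relation. The rule set (a relation for every node with in-degree zero, in-degree of at least two, or reached through a `*` edge), the edge-triple encoding, and the name `shared_relation_nodes` are assumptions made for illustration; the treatment of recursive elements is omitted.

```python
from collections import Counter

def shared_relation_nodes(edges):
    """Return the element names that get their own relation.

    edges: iterable of (parent, child, starred) triples of a simplified
    DTD graph; starred is True when the child occurs under a '*' operator.
    """
    nodes, in_degree, below_star = set(), Counter(), set()
    for parent, child, starred in edges:
        nodes.update((parent, child))
        in_degree[child] += 1
        if starred:
            below_star.add(child)
    return {
        n for n in nodes
        if in_degree[n] == 0      # roots cannot be inlined into a parent
        or in_degree[n] >= 2      # shared by several parents
        or n in below_star        # set-valued child needs its own tuples
    }

# Hypothetical DTD graph: "author" is shared by "book" and "article".
edges = [
    ("book", "booktitle", False),
    ("book", "author", False),
    ("article", "title", False),
    ("article", "author", True),
    ("author", "name", False),
]
relations = shared_relation_nodes(edges)
```

All remaining elements (`booktitle`, `title`, `name`) would be inlined as columns of the relation of their nearest ancestor in `relations`.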
Yan and Fu [90] propose two algorithms (Global Schema Extraction and DTD-splitting Schema Extraction) that generate relational schemas from the XML data and the DTD. The two algorithms share the same framework: (1) simplify the DTD; (2) create schema prototype trees; (3) form relational schema prototypes; (4) find functional dependencies (FDs) and candidate keys; and (5) normalize the prototypes. They differ in how FDs and keys are obtained. The global algorithm analyzes the XML data to discover FDs and candidate keys and then uses them to normalize the prototypes. The DTD-splitting algorithm infers features of the XML data from the DTD and conducts schema decomposition (DTD split) before discovering FDs and keys.
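Discovering FDs from data ultimately rests on checking whether a candidate dependency holds in the extracted tuples. The sketch below shows such a naive instance-level check; it is not the authors' discovery algorithm, and the function name `fd_holds` and the sample rows are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key must be repeated by all later rows.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical tuples extracted from XML data.
rows = [
    {"isbn": "1", "title": "XML", "author": "Ann"},
    {"isbn": "1", "title": "XML", "author": "Bob"},
    {"isbn": "2", "title": "SQL", "author": "Ann"},
]
title_fd = fd_holds(rows, ["isbn"], ["title"])    # isbn determines title
author_fd = fd_holds(rows, ["isbn"], ["author"])  # isbn does not determine author
```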
NewInlining is proposed in [
84] and inspired by the shared-inlining method [
117]. It starts with simplifying DTDs by a set of new transformation rules. Next, it creates and inlines DTD graphs, where inlining rules eliminate the redundancy and deal with DTDs containing arbitrary cycles. Finally, it generates relational schemas based on the inlined graph.
Balmin et al. [
22] propose a schema-driven decomposition framework, which firstly adopts the labeled tree notation to represent XML data. Next, it utilizes schema graphs to abstract the syntax of XML schema definitions, decomposes the schema graph into fragments (including non-
MVD (Multi-valued dependency) fragments), and generates a relational table definition for each fragment. Finally, it decomposes the XML data and loads them into the corresponding tables.
ODTDMap is proposed in [16], which simplifies the DTD, creates a DTD graph, performs the inlining operation, and generates the database schema and the \(\delta\)-mapping. Besides, two data mapping algorithms (
OXInsert and
SDM) are proposed. Both
OXInsert and
SDM utilize global IDs of elements to help reconstruct XML documents.
Suri and Sharma [
123] propose mapping an XML DTD into relations, which has three steps: (1) simplifying the complexities of DTD; (2) creating a DTD graph based on simplified DTD; and (3) using the proposed inlining algorithm to generate a relational schema from the DTD graph, where the algorithm will decide to create one or two relations for the two elements appearing in a cycle.
MDF is proposed in [
12], which is a
mapping definition framework (MDF). MDF starts with annotating an input XML schema with a limited number of pre-defined annotations, then parses annotated XML schema, creates the relational schema, verifies mapping correctness and losslessness, and ends up with shredding the input XML document to tuples.
ShreX is proposed in [
11,
47], which provides a comprehensive system for mapping, loading, and querying XML documents. Specifically, a mapping is specified by annotating an XML schema to state how elements and attributes are stored in tables. Combining different annotations yields a variety of mappings. That is,
ShreX can use existing mapping strategies as well as potential new mapping techniques. The annotation processor’s function is to parse an annotated XML schema, check the validity of the mappings, and form the corresponding relational schema. Finally, the document shredder shreds an XML document and generates the tuples.
X-Ray is proposed in [
69], whose principal purpose is to support the existing schemas.
X-Ray offers several basic mapping options and decides which kind of mapping is reasonable in a given situation. Reasonable mappings serve as mapping patterns that support the mapping process at a syntactical level: by analyzing the database schema and the DTD, X-Ray suggests potential mappings and rules out others because of syntactical conflicts. Since those mapping patterns are universally applicable,
X-Ray employs them to represent mapping knowledge for mapping an XML to a relational schema.
SPIDER is proposed in [
10], which uses
SPIDER (Schema based Path IDentifiER) to identify paths from the root node to a node, adopts Sibling Dewey Order to identify multiple nodes appearing in the same path, and designs the following four relational tables to preserve the XML document.
(1)
Element (docID, nodeID, spider, sibling, parentID);
(2)
Attribute (docID, nodeID, spider, sibling, parentID, value);
(3)
Text (docID, nodeID, spider, sibling, parentID, value);
Discussion . With DTDs, the relational schema generated by a structure-based approach is tailored to specific XML documents. This is to say, structure-based approaches could use predefined rules to generate different relational schemas for different DTDs. These schemas usually tend to have a more compact storage representation and an excellent query performance [
126]. However, neither inlining nor annotation techniques consider semantic constraints. Besides, because path information is missing, some queries require many join operations over the relational schema generated by the above methods. Moreover, a large and complex XML schema may yield a relational schema with many simple tables even when the XML document instance is simple. As for
X-Ray [
69], it is just a research prototype supporting the existing schemas. Finally,
SPIDER [
10] uses a pair of spider and Sibling Dewey Order to identify each node. With these labeled nodes, it creates a four-table schema according to
XRel [
115]. Although this schema could reduce the range of relabeling (spider is not affected) when updating documents and make retrieval more efficient, it cannot store node orders exactly by employing a pair of spider and Sibling Dewey Order if a DTD contains multiple components having the same name but appearing in different places [
10]. Furthermore, XML documents are not required to have DTDs, so these methods are inapplicable when no DTD is available.
2.2.2 Model-Based Mapping.
Contrary to the previous part, this part deals with mapping in the absence of an XML schema. Schema absence is common in practice and makes querying such schemaless XML documents difficult. The model-based approach was therefore proposed to map XML documents without schema information into relational data. Generally, it is a generic mapping that regards an XML document as a tree model and designs the mapping based on nodes, edges, paths, and so on. Next, we review the development of this generic mapping.
(1) Fixed Schema . The work presented in this portion (and summarized in Table
3) is about mapping an XML document into relation tuples with a fixed number of tables.
Edge-Based Approach. This approach maintains the Parent-Child (using Source object - Target object) and Ancestor-Descendant (using self-join) relationships in the table.
Florescu and Kossmann [58, 59] regard an XML document as an ordered and labeled directed graph, where each XML element is a node, element-subelement relationships are edges, and the values of an XML document are leaves. They propose three alternative ways to record the edges of a graph:
(1)
Store all edges of the graph in a single table (i.e., the edge approach):
Edge (source, ordinal, name, flag, target).
(2)
Class every edge having an identical label to a table:
B \(_{name}\) (source, ordinal, flag, target);
(3)
Use a single universal table to store all the edges (i.e., the universal table):
Universal (source, ordinal \(_{n_{1}}\) , flag \(_{n_{1}}\) , target \(_{n_{1}}, \ldots ,\) ordinal \(_{n_{k}}\) , flag \(_{n_{k}}\) , target \(_{n_{k}}\) ).
They also propose two alternative ways to preserve the leaves:
(1)
Establish separate Value tables for each conceivable data type:
V \(_{type}\) (vID, value).
(2)
Store values together with edges (Inlining) to keep values and attributes in the same tables.
Together, these alternatives lead to six different relational schemas for storing XML documents (i.e., graphs).
In the above tables, the attribute \(source\) keeps the source id of each edge, \(target\) keeps the target id, \(flag\) distinguishes between internal nodes and leaves, \(ordinal\) holds the order of edges, and \(n_1, \ldots , n_k\) in the table Universal are the label names.
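The edge approach with inlined values can be sketched in a few lines. The example below shreds a small document into Edge (source, ordinal, name, flag, target) rows and answers a parent-child query with a self-join. The flag values `'ref'` and `'val'`, the virtual source id 0 for the document node, the omission of XML attributes, and the function name `shred_to_edges` are assumptions of this sketch rather than details taken from [58, 59].

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred_to_edges(xml_text):
    """Return Edge rows (source, ordinal, name, flag, target) for one document."""
    root = ET.fromstring(xml_text)
    rows, last_id = [], 0

    def visit(elem, elem_id):
        nonlocal last_id
        for ordinal, child in enumerate(elem, start=1):
            text = (child.text or "").strip()
            if len(child) == 0 and text:
                # Leaf with a value: inline the value into the edge row.
                rows.append((elem_id, ordinal, child.tag, "val", text))
            else:
                last_id += 1
                child_id = last_id
                rows.append((elem_id, ordinal, child.tag, "ref", child_id))
                visit(child, child_id)

    last_id += 1
    root_id = last_id
    rows.append((0, 1, root.tag, "ref", root_id))  # 0 = virtual document node
    visit(root, root_id)
    return rows

doc = "<book><title>XML</title><author><name>Ann</name></author></book>"
edges = shred_to_edges(doc)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Edge (source, ordinal, name, flag, target)")
conn.executemany("INSERT INTO Edge VALUES (?, ?, ?, ?, ?)", edges)

# Parent-child step author/name expressed as a self-join on Edge.
author_names = conn.execute(
    "SELECT v.target FROM Edge a JOIN Edge v ON v.source = a.target "
    "WHERE a.name = 'author' AND a.flag = 'ref' "
    "AND v.name = 'name' AND v.flag = 'val'"
).fetchall()
```

Each additional step of a path expression adds one more self-join, which is why long paths become expensive under this schema.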
Path-Based Approach. It preserves all available path expressions (from the root to each node in the XML tree) in a relational attribute.
XRel is proposed in [
71], which decomposes an XML document into nodes based on its tree structure, stores the simple path expressions (from the root to node) of these nodes, and preserves these nodes in different relational tables according to their types.
(2)
Element (docID, pathID, start, end, index, reindex);
(3)
Text (docID, pathID, start, end, value);
(4)
Attribute (docID, pathID, start, end, value).
XRel designs a schema containing four tables to store the combination of the path expression and the region of nodes in an XML tree as relational tuples. These could help record the topology information of the XML tree and the expanded names of nodes. The attributes start and end indicate start and end position of a region. The index represents the order of an element node among its siblings in the XML document order, and the reindex indicates the reverse document order.
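The combination of a path expression and a region can be illustrated as follows. The sketch assigns each element a simple path and a (start, end) pair and tests ancestor-descendant relationships by region containment. A traversal counter stands in for the start and end positions here, which is an assumption of the sketch, as are the names `element_regions` and `contains`.

```python
import xml.etree.ElementTree as ET

def element_regions(xml_text):
    """Return (pathExp, start, end) for each element, in document order."""
    root = ET.fromstring(xml_text)
    rows, clock = [], 0

    def visit(elem, parent_path):
        nonlocal clock
        path = parent_path + "/" + elem.tag
        clock += 1
        start = clock
        slot = len(rows)
        rows.append(None)          # reserve the slot to keep document order
        for child in elem:
            visit(child, path)
        clock += 1
        rows[slot] = (path, start, clock)

    visit(root, "")
    return rows

def contains(ancestor, descendant):
    """True if the region of ancestor strictly encloses that of descendant."""
    return ancestor[1] < descendant[1] and descendant[2] < ancestor[2]

regions = element_regions("<a><b><c/></b><d/></a>")
```

With regions stored as columns, the containment test becomes a pair of range comparisons in SQL, so ancestor-descendant queries need no recursive joins.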
SUCXENT is proposed in [
103], which stores the information of paths and nodes in tables:
(1)
Document (docID, docName);
(3)
PathValue (docID, pathID, leafOrder, siblingOrder, leftSibIxnLevel, leafValue);
(4)
TextContent (docID, linkID, text);
(5)
AncestorInfo (docID, siblingOrder, ancestorOrder, ancestorLevel).
The table Path preserves the paths of all leaf nodes. PathValue stores all leaf nodes: the column leftSibIxnLevel stores the level of the highest common ancestor of the leaf node and is used to reconstruct the XML document, and the column leafValue records the textual content of the leaf node. For large textual data (e.g., DNA sequences), however, leafValue only keeps a link, and SUCXENT uses another table, TextContent, to hold such data. AncestorInfo saves the ancestor information of each leaf node so that some queries can be answered quickly.
SUCXENT++ is proposed in [
104], which stores the leaf nodes and the associated paths together with newly introduced attributes to handle recursive XML queries.
(1)
Document (docID, docName);
(2)
Path (pathID, pathExp, cPathID);
(3)
PathValue (docID, pathID, leafOrder, cPathID, branchOrder, branchOrderSum, leafValue);
(4)
TextContent (docID, pathID, leafOrder, cPathID, branchOrder, branchOrderSum, text);
(5)
DocumentRValue (docID, level, rValue).
It introduces the attribute cPathID to convert any recursive path expression into a range query. The attributes branchOrder, branchOrderSum, and rValue help reduce storage consumption and the number of join operations.
Xlight is proposed in [
139], whose schema has the following five relational tables:
(1)
Document (docID, docName);
(3)
Data (docID, pathID, leafNo, leafGroup, linkLevel, leafValue, hasAttrib);
(4)
Ancestor (docID, leafGroup, ancestorPre, ancestorLevel);
(5)
Attribute (name, val, id, pre).
In this schema, the table Data stores all the information of leaf nodes in the XML document. Ancestor preserves the ancestor information of each leaf node. The attribute leafGroup marks the same number for any leaf nodes having the same parent. The linkLevel indicates the level that each path is linked with its previous path. The hasAttrib records the number of attributes in each path.
SMX/R is proposed in [
6], where
startPos/endPos denotes the starting (pre-order) /end (post-order) location of the node.
(1)
Path (docID, pathID, startPos, endPos, nodeLevel, nodeType, nodeValue);
(2)
PathIndex (pathID, pathExp, nodeName).
Edge- & Path- & Signature-Based Approach. It preserves path expressions (path-based method), parent-child relationships (edge-based method) in the tables. Besides, this approach assigns a different signature (number) to each distinctive label (node).
XParent is proposed in [
67], where the table
LabelPath provides a global view of the XML documents.
DataPath keeps parent-child relationships, which can be further materialized as ancestor-descendant relationships. The attribute
length and
order represent the number of edges in the label path and the order of the element among its siblings, respectively.
(1)
LabelPath (pathID, length, pathExp);
(2)
DataPath (parentNodeID, childNodeID); /Ancestor (nodeID, ancestorID, level);
(3)
Element (pathID, nodeID, order);
(4)
Data (pathID, nodeID, order, value).
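Materializing ancestor-descendant relationships from DataPath amounts to computing the transitive closure of the parent-child pairs. The sketch below does this in memory; interpreting the Ancestor column level as the distance in edges between a node and its ancestor is an assumption, and the name `materialize_ancestors` is hypothetical.

```python
def materialize_ancestors(data_path):
    """Derive (nodeID, ancestorID, level) rows from (parentNodeID, childNodeID) rows.

    level is taken here to be the number of edges between node and ancestor.
    """
    parent_of = {child: parent for parent, child in data_path}
    rows = []
    for node in parent_of:
        level, current = 1, parent_of[node]
        while current is not None:
            rows.append((node, current, level))
            current = parent_of.get(current)  # None once the root is passed
            level += 1
    return sorted(rows)

# Hypothetical DataPath content: node 1 is the root.
ancestors = materialize_ancestors([(1, 2), (2, 3), (1, 4)])
```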
XPred is proposed in [
132,
133], which stores the structural information (e.g., parent-child relationship and order) distributively into nodes to reduce the number of joins when doing queries.
(1)
Path (pathID, length, labelPath);
(2)
Node (nodeID, pathID, order, parentID);
(3)
Data (nodeID, pathID, order, parentID, value).
Wang et al. [
129] propose the following schema, where
ValueTable stores the leaf nodes with the value.
NoValueTable stores the inner nodes. The attribute
nodeID is the node identifier number assigned by pre-order traversal.
(1)
ValueTable (nodeID, name, value, pathExp, parentID, level);
(2)
NoValueTable (nodeID, name, parentID, level).
Ying et al. [
136] keep the parent-child relationship, path, and level information to support structural queries, especially for the twig query.
(3)
LeafNode (docID, leafNodeID, pathID, parentID, leafValue);
(4)
InnerNodes (docID, innerNodeID, nodeName, parentID, level, sibling).
Edge- & Path-Based Approach.
Khan and Rao [
70] propose the following schema to keep parent-child relationships and path information, where the attribute
pathExp, considered as the primary key, is the simple path expression (from root to node) of these nodes.
(1)
SampleTable (pathExp, dataItem, parentPathExp);
(2)
AttributeTable (pathExp, attributeName, attributeValue).
XPEV is proposed in [
105], whose schema is proposed by combining edge [
59] with path [
71]:
(2)
Edge (pathID, source, target, label, ordinal, flag);
(3)
Value (pathID, source, target, label, ordinal, value).
Path- & Signature-Based Approach. It preserves path expressions (path-based method) in the table and assigns a different signature (number) to each distinctive label (node).
LNV is proposed in [
50], where the attribute
pathNode (
pathSignature) is a list of nodes (signatures of labels) in the path ordered from the root. The attribute
value is the value associated with the end of the path. The attribute
typeNode denotes the leaf node’s type that can be an element, attribute, comment, or text. The attribute
position records where the element node is among its siblings.
(1)
LabelsSignatures (label, signature);
(2)
Path (docID, pathSignature, pathNode, value, typeNode, position).
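A small sketch may help show how label signatures and path signatures relate. It assigns consecutive integers to distinct labels and encodes each root-to-node path as the sequence of its label signatures. The numbering scheme, the `/` separator, and the name `build_signatures` are assumptions of this sketch.

```python
def build_signatures(paths):
    """Assign a number to each distinct label and encode every path.

    paths: list of label lists ordered from the root.
    Returns (LabelsSignatures mapping, list of pathSignature strings).
    """
    signatures = {}
    encoded = []
    for labels in paths:
        for label in labels:
            # New labels receive the next unused number.
            signatures.setdefault(label, len(signatures) + 1)
        encoded.append("/".join(str(signatures[l]) for l in labels))
    return signatures, encoded

signatures, path_signatures = build_signatures(
    [["book", "title"], ["book", "author", "name"]]
)
```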
Pointer-Based Approach. It preserves as many pointers to each node’s ancestors as possible.
XMLEase is proposed in [
51], where some redundant edges are introduced into the XML tree so that each node is connected to its ancestors instead of just its parent. How many ancestors are connected to each node depends on the number of ancestor columns in the pre-defined table.
(1)
Table (identifier, ancestor \(_1\) , ancestor \(_2\) , ...)
The attribute identifier denotes labels (values) for intermediate nodes (leaves) of the XML tree. Other columns keep identifier’s ancestors. In this way, it could speed up hierarchical data’s retrieval.
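The effect of a fixed number of ancestor columns can be seen in a short sketch. It fills one row per node with up to k ancestors and pads the rest with nulls. Ordering the columns from the nearest ancestor outward is an assumption, as are the name `ancestor_rows` and the sample tree.

```python
def ancestor_rows(parent_of, num_ancestor_columns):
    """Build (identifier, ancestor_1, ..., ancestor_k) rows.

    parent_of maps each node to its parent (None for the root).
    ancestor_1 is taken to be the parent, ancestor_2 the grandparent, etc.
    """
    rows = []
    for node in parent_of:
        ancestors, current = [], parent_of[node]
        while current is not None and len(ancestors) < num_ancestor_columns:
            ancestors.append(current)
            current = parent_of.get(current)
        # Shallow nodes leave unused ancestor columns as NULL (None).
        padding = [None] * (num_ancestor_columns - len(ancestors))
        rows.append((node, *ancestors, *padding))
    return rows

tree = {"library": None, "book": "library", "title": "book", "XML": "title"}
table = ancestor_rows(tree, 2)
```

Raising `num_ancestor_columns` lets deeper ancestors be reached without joins, at the price of more null cells for shallow nodes.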
Token-Based Approach. It uses a table to record XML document structure information and uses another table to preserve token (element, tag, attribute, or property) information.
Dweib et al. [
48] keep the XML document structure in the attribute
docStructure (a big text field containing a coded string). Any changes (e.g., adding a new tag or deleting an existing property) in the structure should be recorded in this attribute.
(1)
Documents (docID, docStructure);
(2)
Tokens (docID, tokenID, tokenName, tokenValue).
MAXDOR is proposed in [
49], which adopts a global label approach for identifying each token in an XML document and assigns additional labels (parent, left and right sibling) to each token to facilitate later insertion and relocation of a token. Besides,
MAXDOR uses the table
Documents to keep document information.
(1)
Documents (docID, docName, docElement, totalTokens, schemaInfo);
(2)
Tokens (docID, tokenID, lSib, parentID, rSib, tokenLevel, tokenName, tokenVal, tokenType).
Edge- & Signature-Based Approach. Each element or attribute is identified by a signature (number) and each path is identified by its parent from the root node in a recursive manner.
XRecursive is proposed in [
52,
53], whose schema is:
(1)
LabelStructure (labelName, signature, parentID);
(2)
LabelValue (signature, value, type).
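The recursive identification of paths through parentID can be sketched as below. The sketch assumes that parentID holds the signature of the parent label and that the root row carries no parent (None); the function name `label_path` and the sample rows are hypothetical.

```python
def label_path(label_structure, signature):
    """Rebuild the root-to-node path of a label by following parentID recursively.

    label_structure: rows (labelName, signature, parentID).
    """
    by_signature = {sig: (name, parent) for name, sig, parent in label_structure}

    def walk(sig):
        name, parent = by_signature[sig]
        if parent is None:          # reached the root label
            return "/" + name
        return walk(parent) + "/" + name

    return walk(signature)

rows = [("library", 1, None), ("book", 2, 1), ("title", 3, 2)]
title_path = label_path(rows, 3)
```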
Suri and Sharma [
122] design the following schema:
(2)
Data (docID, nodeID, parentID, nodeValue, nodeType, position).
Labeling-Based Approach. It uses a labeling technique to annotate each node.
s-XML is proposed in [
120], which adopts the Persistent Labeling [
62] to annotate each node in the XML tree and stores those labels in the attribute
selfLabel. In the following schema,
ParentTable preserves the non-leaf (internal) nodes.
ChildTable records the leaf (external) nodes.
(1)
ParentTable (nodeID, parentNodeName, NodeName, level, parentNodeID, selfLabel);
(2)
ChildTable (nodeID, level, parentNodeName, selfLabel, parentNodeID, value).
Path- & Labeling-Based Approach. It preserves path expressions in the table and adopts a labeling technology to annotate each node.
XMap is proposed in [
29], which uses ORDPATH labeling [
97] (conceptually similar to the Dewey technique) to materialize the parent-child relationship, stores it in the attribute
ordpath, and uses it to reflect a numbered tree edge of the path from the root to a node.
(1)
Data (ordpath, value, order, numberofElements, numberofAttributes, pathID);
XAncestor is proposed in [
107], where the table
AncestorPath stores the ancestor paths (root-to-parent) of the leaf nodes in the XML tree. The attribute
ancesPos is a position of the ancestor for the leaf node, whose value is obtained by Dewey order labeling.
(1)
AncestorPath (ancesPathID, ancesPathExp);
(2)
LeafNode (nodeName, ancesPathID, ancesPos, nodeValue).
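Dewey order labeling, on which ancesPos relies, gives each node the label of its parent extended by its position among its siblings. The following sketch shows the general technique rather than XAncestor's exact encoding; the names `dewey_labels` and `is_ancestor` are hypothetical.

```python
import xml.etree.ElementTree as ET

def dewey_labels(xml_text):
    """Return (deweyLabel, tag) pairs in document order; the root is labeled '1'."""
    root = ET.fromstring(xml_text)
    labels = []

    def visit(elem, label):
        labels.append((".".join(map(str, label)), elem.tag))
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels

def is_ancestor(a, d):
    """True if Dewey label a is a proper prefix (component-wise) of label d."""
    a_parts, d_parts = a.split("."), d.split(".")
    return len(a_parts) < len(d_parts) and d_parts[:len(a_parts)] == a_parts

labels = dewey_labels("<a><b><c/></b><d/></a>")
```

Comparing components rather than raw strings avoids treating `1.1` as an ancestor of `1.12`.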
Mini-XML is proposed in [
142], which adopts a persistent labeling approach to annotate leaf nodes. The specific format is (Level, [P-pathID, S-order]) stored in the attribute
pos, where Level is the depth of the current leaf node in the XML tree, P-pathID is the path id of the direct parent node, and S-order is the order among its siblings.
(1)
Path (pathID, pathExp, pos);
(2)
Leaf (leafID, name, value, pos).
(2) Non-Fixed Schema . Next, we will introduce some works that map an XML document into a relational schema with a non-fixed number of tables and summarize these methods in Table
4.
STORED is proposed in [
44,
45], which takes data instances as input and uses a heuristic data-mining algorithm to generate complex storage patterns with high combined support, from which tables are created. Each storage pattern keeps a pointer back to its subpattern with the highest data support, which is used to find the required attributes. Each semi-structured object that has all the required attributes of a relational table is stored in that table, and the remaining attributes may be filled with nulls. STORED uses the created relational schema to store most of the data. Outliers, i.e., parts of the semi-structured data that do not fit the generated schema, as well as data that may be inserted in the future, are stored in a self-describing structure (overflow graph) to guarantee that mapping and storing are lossless. Besides, STORED can take several parameters as input (e.g., the maximum number of relations allowed) to control the generated relational schema.
Kyu and Nyunt [75] utilize a data extraction approach to obtain a table name list, a table element list, a table attribute list, and the primary key of each table. They then use those lists to create a relational schema and present a data mapping algorithm to store XML data in relational tables.
Discussion. Compared to structure-based mapping, model-based approaches have been widely studied because they are typically simple to implement, generally require no extra schema information, and can perform better. Moreover, most model-based approaches can handle dynamic XML documents whose DTDs change from time to time, and they support XML documents without any extension of the relational model. Many methods (edge, path, signature, labeling, pointer, token, or combinations of these) support the model-based approach to map XML documents into relational tuples, and the number of tables they create varies with the method adopted. What the fixed-schema works have in common is that the schema stays the same regardless of how XML document instances change. These methods also have limitations, and their performance varies with the method taken, because some approaches generate very complex SQL queries involving many joins for complex path expressions. For example, the edge method may need many self-joins when reconstructing a large XML document. The path method needs considerable storage space to keep path information. The more ancestor columns the pointer method holds in the pre-defined relational table, the more likely “Null” pointers become, which wastes storage. The token method cannot handle complex semantic searches. The labeling method needs more space to store labels when dealing with a large XML document. Therefore, besides introducing new techniques or combining different strategies to improve query performance, researchers also attempt to create relational schemas with a non-fixed number of tables to fit XML document instances better.
Unfortunately, these methods also inevitably store much structural information to reconstruct the original XML data, and they do not exploit semantic information, which could help reduce space consumption.
2.2.3 Semantic Information Approach.
Recently, studies on XML constraints (e.g., keys and foreign keys) have raised interest in using semantic information to improve the quality of the generated relational schema. In this part, we introduce current research in this area, classify it as the third category of mapping approaches, and summarize the works in Table
5.
CPI is proposed in [
78,
79], which discovers semantic constraints hidden in DTDs and then rewrites the discovered constraints in relational notations. Since finding and preserving semantic constraints is independent of transformation algorithms, one could use other transformations instead of only the hybrid inlining algorithm.
XSchema is proposed in [
86], which provides two normal form representations of regular tree grammars - NF1 and NF2. NF1 representation is used for document validation and schema validation. NF2 forms the basis for mapping type definitions in XML schema to SQL. Besides, this paper defines
XSchema, a language-independent formalism to specify XML schemas. Next, it starts with simplifying
XSchema to get a simpler
XSchema, which does not have constraints that cannot be captured in the relational model. Then, it uses inlining [
117] to generate relational schemas, maps collection types, stores
IDREF and
IDREFS attributes, handles recursion, captures the order specified in the XML model, and keeps constraints such as key constraints and inclusion dependencies.
RRXS is proposed in [
36], which presents
XML functional dependencies (XFDs) to capture structural as well as semantic information. It offers a set of rewriting rules to obtain redundancy-reduced XFDs in polynomial time. Then,
RRXS translates optimized XFDs to relational functional dependencies and creates a
third normal form (3NF) decomposition to guide the design of the target relational schema, where the generated schema is redundancy-reduced and has a set of keys.
Xshrex is proposed in [
80], which is an extended
ShreX by adding more constraints like structural choice, unique, key & foreign key, and domain constraints. Although these constraints need to be checked when doing insertions, deletions, and updates, it does not yield prohibitive costs. On the contrary, queries could utilize the index created based on the user-defined primary and foreign key constraints to improve performance.
X2R-Xing is proposed in [
135], which starts by using a data structure called a marked schema tree to store the mapping from the DTD to a relational schema, where the node grouping algorithm generates the schema tree. Then the schema tree is used to shred XML documents into relational tuples. In this process, it indexes XML node groups based on range indexing. And it propagates key constraints for XML to keys in a relational schema.
Castro et al. [
33] propose using the conceptual model as the intermediate schema for achieving the mapping. For establishing parallelism between two data models (i.e., XML and relation), they use a class diagram in
UML (Unified Modeling Language). This is because of the simplicity with which schemas modeled in UML can be mapped to relational databases. In this intermediate schema, DTD constructors are mapped into classes, and the relationships between them are presented in the form of associations in the UML diagram. The attribute
level represents the nested levels for the main elements. The number of the appearance of an element is stored in the attribute
cardinality. The logical operators in DTD are preserved in the attribute
operator.
X2R-Ahmad is proposed in [
8], which first obtains the XML structure from the DTD and generates the DTD schema for describing XML. A functional dependency for XML (XFD) is expressed in the form (C, Q : X \(\rightarrow\) Y), where C is the downward context path (defined by an XPath expression), Q is a target path, X is the LHS (Left-Hand-Side), and Y is the RHS (Right-Hand-Side). Next, it uses a constraint-preserving algorithm to remove redundant paths in the XFDs. It then maps the paths to attributes to obtain a relational schema and several functional dependencies over this schema.
Monet-XML-Model is proposed in [
114], which offers a data model (Monet-XML-Model) based on a complete binary fragmentation of the document tree to represent, store, and query all related associations (e.g., parent-child relationships, attributes, and topological orders) in the document. It applies paths to group semantically related associations into the same relation. In this way, related data can be accessed directly in the form of a separate table for a given query, avoiding large scans.
Davidson et al. [
41] develop algorithms to find minimum cover functional dependencies from a set of XML keys on XML data through a given mapping (transforming an XML document to relational tables). With the functional dependencies, one could normalize the relational schema into, e.g., 3NF, BCNF to obtain efficient relational storage for XML data.
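Once functional dependencies over the relational schema are available, normalization relies on the standard attribute-closure computation. The sketch below uses the textbook closure algorithm to flag FDs that violate BCNF; it is not the authors' propagation algorithm, and the schema, the FDs, and the function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) tuples."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Return the non-trivial FDs whose left-hand side is not a superkey."""
    return [
        (lhs, rhs) for lhs, rhs in fds
        if not set(rhs) <= set(lhs)                   # skip trivial FDs
        and not set(schema) <= closure(lhs, fds)      # lhs is not a superkey
    ]

schema = ("isbn", "title", "author", "affiliation")
fds = [(("isbn",), ("title",)), (("author",), ("affiliation",))]
isbn_closure = closure(("isbn",), fds)
violations = bcnf_violations(schema, fds)
```

Each reported violation indicates a relation that would be split when normalizing toward BCNF.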
Discussion. When creating a relational schema, it is natural to consider normal forms and integrity constraints. Mapping XML to a relational schema with semantic information therefore fits the way relational schemas are usually designed. However, most works in this field need DTD information (e.g., [
33,
78,
79,
86]). Some methods may increase space consumption to keep redundant information (e.g., [
80,
135]). And several approaches (e.g., [
8,
114]) may create many simple tables, which increases the effort needed to reconstruct the original document. Besides, those methods do not consider the influence of workloads (queries and data updates) on the relational schema.
2.2.4 Cost-Driven Approach.
Given the flexibility of XML and the variety and complexity of transactions processed by XML applications, it is hard to say whether a structure-based or a model-based approach is better. Structure-based approaches take advantage of the DTD to generate a specific relational schema for each XML document, but they may not yield a “good” schema for arbitrary XML data of differing complexity. Moreover, some applications must deal with XML documents without DTDs, which motivated model-based mapping. This generic mapping, however, limits the performance of the relational schema because the target schema is pre-defined and fixed regardless of the XML schema, so it is unlikely to work well for all possible applications. In this section, we therefore review the cost-driven approach, which can generate a near-optimal relational schema. We classify it as the fourth category of mapping approaches and summarize current works in Table
6.
LegoDB is proposed in [
24,
25], which is a cost-based XML mapping system that takes an XML schema, an XQuery workload, and a set of sample documents as input, and outputs an efficient relational schema for a given application. In detail,
LegoDB starts with extracting statistical information from the given XML documents. This information is used to derive relational statistics that are needed by the relational optimizer to estimate the cost of the query workload. Then,
LegoDB utilizes the XML schema and XML statistics to generate an initial physical schema (
p-schema). Next, a set of
p-schema rewritings are applied to the generated
p-schema for getting a space of alternative
p-schema. Based on a greedy heuristic,
LegoDB explores an interesting subset of this space to find the best relational schema. In this process,
LegoDB derives a relational schema from the
p-schema, transforms XML statistics into relational statistics for the corresponding relational schema, translates the XQuery workload into the corresponding SQL equivalent, and uses a relational optimizer to obtain cost estimates.
Zheng et al. [
141] first use an annotated schema graph to represent the XML schema. All possible partitioning schemes on the annotated schema graph then constitute the solution space, and selecting an XML mapping schema can be regarded as finding the graph’s optimal partition. The Hill-Climbing algorithm is used to find the optimal solution in this solution space for an expected workload in a reasonable time. It starts from an initial schema generated by three approaches (
Attribute mapping [
58],
Shared, and
Hybrid mapping [
117]). Treating each mapping schema as a state in the solution space, the algorithm visits all neighboring states that can be reached from the current state through state transformations defined by four primitive operations and uses the cost model to estimate the cost of executing the workload at each new state. Finally, the state with minimal cost is returned as the optimal partitioning scheme, i.e., the target relational schema.
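The search loop can be sketched generically. In the example below, a state is the set of elements stored in their own table, a neighbor differs by one inlined or outlined element, and a toy function replaces the workload cost model. The single toggle operation, the cost numbers, and all names are assumptions for illustration and do not reproduce the four primitive operations or the cost model of [141]. Note that hill climbing of this kind is only guaranteed to reach a local minimum of the cost function.

```python
def hill_climb(initial, neighbors, cost):
    """Move to the cheapest neighbor until no neighbor improves the cost."""
    current, current_cost = initial, cost(initial)
    while True:
        best, best_cost = None, current_cost
        for candidate in neighbors(current):
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
        if best is None:
            return current, current_cost
        current, current_cost = best, best_cost

ELEMENTS = ("author", "review", "price")  # hypothetical schema elements

def toggle_neighbors(outlined):
    """States reachable by inlining or outlining exactly one element."""
    return [outlined ^ {e} for e in ELEMENTS]

def toy_cost(outlined):
    """Hypothetical workload cost: outlining 'review' saves scan cost,
    outlining 'author' or 'price' adds one join each."""
    cost = 10
    if "review" in outlined:
        cost -= 5
    cost += 3 * len(outlined & {"author", "price"})
    return cost

best_state, best_cost = hill_climb(frozenset(ELEMENTS), toggle_neighbors, toy_cost)
```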
FlexMap is proposed in [
109], which defines a schema tree by several type constructors to represent an XML schema. A relational configuration can be derived from a schema tree. FlexMap considers a set of transformations: inline and outline, type split/merge, commutativity and associativity, and union distribution/factorization. As transformations are applied and new configurations are derived,
FlexMap uses a cost model to estimate the cost of the query workload under each relational configuration. To find a good configuration,
FlexMap designs three greedy algorithms (InlineGreedy, ShallowGreedy, and DeepGreedy) to study how the quality of the final configuration is influenced by the choice of transformations and the query workload. In the end,
FlexMap optimizes DeepGreedy into GroupGreedy by means of a grouping transformation concept and uses a small threshold \(\delta\) to accelerate processing (terminating the iteration early).
Discussion. The cost-driven approach uses a cost model or a relational optimizer to obtain cost estimates for each storage schema in order to find or generate an “optimal” relational schema. However, the results depend heavily on the accuracy of the cost model, which must therefore be guaranteed [
109,
141]. Another problem is that the generated schema preserves few constraints [
24,
25]. Therefore, how to combine the cost-driven approach with semantic information to design a “good” relational schema is an interesting topic.