XML Information Retrieval Systems: A SURVEY
Awny Sayed
Information Technology Dept. - Ibri College of Applied Science
Sultanate of Oman
Mobile Number 00968-98838296,
awny.ibr@cas.edu.om
for semantic data. In unstructured information
retrieval, it is usually clear what the right
document unit is: files on your desktop, email
messages, web pages on the web etc. While the
first challenge in the semistructured information
retrieving is that we don’t have such a standard
traditional document unit or indexing unit that is
could be retrieved as a result to a query. The main
profit of the XML which is considered as a new
concept in the information retrieval branch is that
when we query the XML documents we can dive
deeply more than the document level allow to us
into more specific units as document fragments
(e.g. XML elements) which answer the user’s
query.
A new decision criterion that has been proposed
for selecting the most appropriate and specific part
of a document is the structured document retrieval
principle [10]: Structured document retrieval
principle: states that, {a system should always
retrieve the most specific part of a document
answering the query}. That principle motivates a
retrieval strategy that returns the smallest unit that
contains the information sought, but does not go
below this level.
In our survey, we give an overview of the
different XML information systems and classify
them according to their storage and indexing
strategies. For storage, we will answer the
question, what is the best way of storing xml
documents.
Moreover, we will provide a
classification of the different strategies used to
store XML documents. The classification is based
ABSTRACT: The continuous growth in the
XML information repositories has been matched
by increasing efforts in development of XML
retrieval systems, in large parts aiming at
supporting content-oriented XML retrieval. These
systems exploit the available structural
information, as market up in XML documents, in
order to return documents components- the so
called XML elements-instead of the complement
documents in repose to the user query. In this
paper, we provide an overview of the different
XML information retrieval systems and classify
them according to their storage and query
evaluation strategies.
Keywords: XML, XML storing, XML
indexing, XML querying, Information Retrieval
1. INTRODUCTION
Indexing data for efficient search capabilities is a
core problem in many domains of computer
science. As applications centered on semantic data
sources become more common, the need for more
sophisticated indexing and querying capabilities
arises. In particular, the need to search for specific
information in the presence becomes of particular
importance, as the information a user seeks may
exist as an entailment of the explicit data by
means of the terminology. This variant on
traditional indexing and search problems forms
the foundation of a range of possible technologies
1
on the underlying system used for it (e.g.,
relational systems, object-relational systems, or
native systems). For indexing and querying in our
survey we will classify indexes into three parts
(structured indexes, connection indexes, and path
indexes) based on the underlying XML data, its
tree-like structure or graph-like.
The rest of the paper is organized as follows.
Section 2 introduces XML storage techniques.
Sections 3 provide the details of the different
indexing techniques; Finally, Section 4 concludes
the paper and provides some suggestions for
possible future research directions on the subject.
child_offset to parent_offset. These two indices
are used to facilitate navigation through the XML
graph. Another index mapping (tagname, value)
or (attribute_name, attribute_value) to element
offset is built to help evaluate selection predicates.
A query engine can use these indices to retrieve
segments of an XML file relevant to the query,
reducing parsing time dramatically.
The main disadvantage of this approach is that
whenever the XML document is updated, the
element offset of preceding tags are also changed,
which invalidates the indices and they have to be
rebuilt. Regarding concurrency control it is
necessary to lock both the XML document and the
matching indices when some thread access data
(reading/writing) due to data consistency. When a
one thread is reading other threads can read as
well, but when some thread is updating other
threads cannot read or update the whole document
since it cannot be considered consistent. The worst
case is of course if new threads continue to access
the document for reading, then it will not be
possible to update any part of the document,
unless some sort of prioritizing algorithm is
implemented (and updates are given higher
priority, of course this could lock out reads).
2.XML STORAGE TECHNIQUES
The basic properties of XML data are hierarchical
tree-structured and semi-structured unlike
ordinary relational databases. With this in mind in
order to retrieve XML data efficiently we need
different types of indexing techniques. An XML
document can be modeled as a tree-like or a
graph- like depending on the containment of that
document to links or not. If the XML document
does not contain such global or internal links it is
modeled as a tree-like structure, otherwise if the
XML document contains whether a global or
internal links it is modeled as a graph-like
structure. A tree, with nodes representing XML
elements or attributes and edges representing
parent-children relationships. Boxes with rounded
corners represent attribute or text nodes.
2.2 The Relational DTD Approach
The second strategy is the shared-inclining
method proposed in and requires the existence of a
Document Type Definitions (DTD). In DTD All
element declarations begin with <! ELEMENT
(case-sensitive) and end with >. They include the
name of the element being declared followed by
the content specification. In this declaration, the
content specification is the keyword ANY (again
case-sensitive). The element declaration <!
ELEMENT SPEECH (SPEAKER, LINE+)> says
that a SPEECH element must contain a single
SPEAKER element followed by one or more
LINE elements, the + quantifier indicates that the
LINE must exist at least one time and no limits for
the maximum number of its recurrence. An
element that can only contain plain text is declared
using the keyword #PCDATA in parentheses, like
this: <! ELEMENT STAGEDIR (#PCDATA)>
This declaration says that a STAGEDIR can
contain only parsed character data, that is, text
that’s not markup. Like elements, the attributes
2.1 Text Approach
The first strategy stores each XML document as a
text file. One way to implement a query engine
with this approach is to parse the XML file into a
memory-resident tree against which the query is
then executed. The tree is retained in memory as
long as some nodes in the tree are needed for
query evaluation. [23] found that the parsing time
dominated query execution time and the approach
was unacceptably slow. To make this approach
competitive they adopted the following indexing
strategy. Using the offset off an XML element
inside the text file as its id, and build a path index
mapping (parent_offset, tag) to child_offset as
shown and an inverse path index mapping
2
used in a document must be declared in the DTD
for the document to be valid. Attributes are
declared by an attribute list in the following form:
<! ATTLIST Element_name Attribute_name Type
Default_value>.
A separate table is used to capture the setcontainment relationship between an element and
a set of children elements with the same tag. Each
tuple in a table is assigned an ID and contains a
parentlD column to identify its parent, an element
that can appear only once in its parent is inline. If
the DTD graph contains a cycle, a separate table
must be used to break the cycle, the relational
schema generated from the DTD and how the
document is stored are shown below.
When reconstructing the XML document from
this approach it is necessary to know how to build
the document in terms of layout. Whether it is a
partial or a full reconstruction does not matter
because the work is the same, only when it is
partial it is necessary to make specifications about
which part one wishes to reconstruct. There is
though a problem of recreating whitespace outside
contents because this information is lost when the
XML document is uploaded to the database.
cluster the table according to SourcelD. This
strategy has the benefit that sub-elements of one
XML element are stored close to each other. The
drawback of that Approach is that elements with
the same tag name are not clustered.
Consequently, queries such as "select all students
whose major is Computer Science" will incur a
large number of random I/Os. Similar to the
EDGE
model,
the
BINARY
approach
materializes the generic tree structure of XML
documents in database tables. Hence, it is a model
mapping approach as well
2.4 The Object Approach
An obvious way of storing XML documents in an
object manager is to store each XML element as a
separate object. However, since XML elements
are usually quite small, all the elements of an
XML document are stored in a single object with
the XML elements becoming light-weight objects
inside the object. [23] [24] use the term
LW_object to refer to the light-weight object and
file_object to denote the object corresponding to
the entire XML document.
The offset of the lw_object inside a file_object is
used as its identifier (lw_oid). The length field
records the total length of the lw_object. The flag
field contains bits that indicate whether this
lw_object has opt_child, opt_attr, or opt_text
fields. The tag field is the tag name of the XML
element. The parent field records the lw_oid of the
parent node. Opt_child records the lw_oids of the
first and last child, if the lw_object has children.
The sibling list of a node is implemented as
doubly linked list via the prev and next fields. Opt
attr records the (name, value) pair of each
attribute of the XML element. Text data is in-lined
in the opt_text field if the text is the only child of
the XML element; otherwise, the text data is
treated as a separate lw_object. [23] built a B-Tree
index that maps (tag, opt_text) and (attr_name,
attr_value) to lw_oid. An element is entered in this
index even if the opt_text field is empty so that
this index can be used to retrieve all XML
elements with a specific tag name. They also built
a path index those maps (parent_id, tag) to child
lw_oid. This optimized object approach is hard to
perform concurrent operations on since the
locking has to occur on the object representing the
2.3 Edge Approach
The third strategy is the "EDGE" approach
described in The directed graph of an XML file is
stored in a single Edge table. Each node in the
directed graph is assigned an id . Each tuple in the
Edge table corresponds to one edge in the directed
graph and contains the ids of the two nodes
connected by the edge, the tag of the target node,
and an ordinal number that is used to encode the
order of children nodes. When an element has
only one text child, the text is inlined.
TargetlD indicates that the edge points to a TEXT
node or ATTRIBUTE node. 0 in ordinal field
indicates an attribute edge. As suggested in an
index is built on (tag, data) in order to reduce the
execution time of selection queries. We found that
it was also very important to build indices on
(sourceid, ordinal) and (targetID). The former is
used to lookup children elements of a given
element and the later is used when traversing from
a child node to its parent. The clustering strategy
on the Edge table has significant impacts on query
performance. While we clustered the Edge table
on the Tag field, an alternative strategy is to
3
whole document; unless there should be build
some extra concurrency control into the
lw_objects themselves, but this would be overkill.
To when locking anything in this approach means
at least locking the whole XML document.
structure indexes are not prescriptive and thus may
change with any update. Generalizations of these
structures have gained increasing attention
recently, as flexible index structures for XML [9],
[16], [18], and size and performance issues in the
original proposals have been addressed [18].
Pre/post schema encoding XML tree-structure.
In addition, the ideas behind these structure
indexes have been used as statistical synopses for
estimating path expression selectivity [2],[20].
Moreover, the structure index proposed in[and
[13] presents a database index structure designed
to support path expressions evaluation on trees. It
has the capability to support all XPath axes and
start traversal from any arbitrary nodes in an XML
document. Building the index takes O (|E|), and
space consumption is O (|V|), where V denotes the
number of nodes in the XML tree and E the
number of edges. The main idea of this index
depends on the numbering schema. It computes
two numbers for each element name in the XML
data tree, one representing the pre-order and the
other representing the post-order. These numbers
are the result of a depth-first search on the XML
data tree. Starting with the root element, the preorder numbers are assigned in the order in which
the nodes are visited during this search. The postorder defines the order in which the nodes are left.
The authors explain that XPath axes (like ancestor
and descendant axes) can be evaluated using these
numbers. This index based on the following
property for evaluating path expressions: For any
two given nodes A and B in the tree, an arbitrary
node B is a descendant of a node A, if and only if
this condition is satisfied:
2.5 Native XML Storage Approach
Finally, we should have a look shortly at so-called
native XML databases, which are specialized to
store and process XML documents. Native storage
schemas aim at efficient support for loading and
storage complete documents as well as efficient
navigation in documents. A native XML storage
system store XML documents as flat files, i.e., it
uses a
text-based mapping. However, evaluation of
queries requires reconstructing the complete XML
documents, which is not efficient when only parts
of the documents are evaluated by the given
query. As a result, most native XML storage
schemas store XML documents as a tree structures
based on the tree data model of XML [12] . These
particular
approaches are model-mapping
approaches. Usually, native XML storage systems
rely on the DOM tree representation of XML
documents.
3. INDEXING TECHNIQUES
Since the hierarchical nature of the XML
documents there is a lot of interesting in a query
processing on data that conforms to a labeled- tree
or labeled- graph model. To summarize, the
structure of such data in the absence of a schema
and to support path expressions evaluation, several
structure indexes have been proposed for semistructure data described as follows
pre(A) < pre(B) and post (A) > post(B)
If we want to evaluate all descendants of a given
node using this schema, then the result is the set of
all nodes that satisfies the above condition.
The pre-/post-order approach can be determined
in a constant time by examining the pre-and postorder variable of the corresponding tree nodes.
The [22] stated that the drawback of this approach
is its lack of flexibility in case of changes to the
structure of the XML-document. That is, the pre/post-order variables need to be recomputed for
the number of tree nodes if any update into the
tree whether a new node is inserted or an existing
one is deleted.
3.1 Structure Indexes
The structure index I (G) of a data graph
G is a summary graph that preserves all the paths
in the data graph but contains a fewer number of
nodes and edges To summarize the structure of
such data in the absence of a schema and to
support path expression evaluation, novel
structural indexes [14], [19] have been proposed
for semi-structured data. Unlike a schema,
4
v with the concatenation strings assigned to the
edges of the path from the root node to v.
3.2 Connection Indexes
For every assignment, labels are unique. Node u is
ancestor of node v, iff the label of u is a prefix of
the label of v. One major problem related to this
approach is how to find an assignment that
minimizes the sum of the lengths of the labels,
unfortunately this problem is NP-hard [17] means
no optimal solution to this problem. The main goal
of the work in [17] is to find an assignment that
minimizes the maximum length of the labels by
using Huffman’s algorithm [14]. Several labeling
schemes are proposed using the above technique,
for example, [4] [6] proposed a labeling schema
for rooted trees that supports ancestor queries by
assigning to each node in the tree a label which is
a binary string. Given the labels of two nodes u
and v it can be determined in a constant time
whether u is an ancestor of v only by looking at
the labels. Another labeling schema proposed on
[25], it takes the advantages of the unique property
of prime numbers to meet this need. Answering
the ancestor-descendant queries for a given two
nodes by only looking at the labels (based on
prime numbers). An analytical study of the size
requirements of the prime numbers indicates that
this schema is compact and hardly affected.
A connection index is the index which supports
the XPath axes that are used as wildcards in path
expressions (ancestors-or-self, descendants-orself, ancestors, and descendants).
Labeling
schemes for rooted trees that support ancestor
queries have recently been developed in the
following researches.
In [4] and [16] they present a tree labeling scheme
based on two level partition of the tree, computed
by a recursive algorithm called prune&contract
algorithm. All these approaches are, so far, limited
to trees. We are not aware of any index structure
that supports the efficient evaluation of ancestor
and descendant queries on arbitrary graphs. The
one, but somewhat naive, exception is to precompute and store the transitive closure Cx = (Vx,
) of the complete XML graph Gx = (Vx ,Ex)
Cx is a very time-efficient connection index, but
is wasteful in terms of space. Therefore, its
effectiveness with regard to memory usage tends
to be poor (for large data that does not entirely fit
into memory) which in turn may result in
excessive disk I/O and poor response times. To
compute the transitive closure, time O(|V|3) is
needed using the Floyd- Warshall algorithm. This
can be lowered to O(|V|2 + |V|.|E|) using Johnson’s
algorithm. Computing transitive closures for very
large, disk-resident relations should, however, use
diskblock- aware external storage algorithms. [1]
[7] [8] implemented the “semi-naive” method
[BR86] that needs time O (| | . |V|). Although
Moreover, the authors introduced several
optimization techniques to reduce the size of the
schema. Unfortunately, these indexing techniques
were supposed to handle tree-structure data.
Extension of these techniques to the context of
graph data could be very difficult because of the
possibly exponential number of paths in the graph.
Moreover, it may require a lot of computing
power for the creation process and a lot of space
to store the index.
there are several approaches are proposed to
evaluate all the ancestors of a given node and test
the reachability between two given nodes. For
example, labeling schema proposed in [17] is
called a prefix-labeling schema to handle a
dynamic XML tree. The nodes in the XML tree
are labeled such that the ancestor relationship is
determined by whether one label is a prefix of the
other. New nodes can be inserted without
affecting the labels of the existing nodes. They
define an assignment of binary strings to the edges
of the tree, such that, the collection of strings
associated with the outgoing edges from any node
is prefix free, a prefix free assignment. At the first,
the simple prefix schema finds a prefix free
assignment to the tree. Then, it is label every node
3.3 Path Indexes
A path index is the index which supports the
navigational XPath axes (parent, child,
descendants-or-self,
ancestors-or-self,
descendants, and ancestors). Recent work on path
indexing is based on structural summaries of XML
graphs. Some approaches represent all paths
starting from document roots, e.g., Data Guide
[14] and Index Fabric [11]. T–indexes [19]
support a pre– defined subset of paths starting at
the root. APEX [9] is constructed by utilizing data
5
mining algorithms to summarize paths that appear
frequently in the query workload. The Index
Definition Scheme [16] is based on bisimilarity of
nodes. Depending on the application, the index
definition scheme can be used to define special
indexes (e.g. 1–Index, A(k)–Index, D(k)–Index
[QLO03], F&B–Index) where k is the maximum
length of the supported paths. Most of these
approaches can handle arbitrary graphs or can be
easily extended to this end. Most of these indexes
are quite efficient in evaluating simple path
queries. These indexes widely differ in space
utilization, support for paths with wildcards
(wildcard means the arbitrary long paths from
source point to targets in XML graph). These path
indexes depend on the structure summaries of the
XML graph. Structure summary is an important
technique for indexing XML arbitrary graph, in
case the general schema of the information is
missing. Using this summary of the data, one can
evaluate the path expression queries without
looking at the original data. In the following, we
will describe these indexes in details.
linked to the parent via a descriptive textual label.
Note that a single OEM object may have multiple
parent objects and that cycles are allowed. For
more details on OEM and its motivation.
[14] Describes the DataGuide that it is, intended to
be a concise, accurate, and convenient summary of
the structure of a database. They assume that the
source database is identified by its root object. To
achieve conciseness, they specify that a
DataGuide describes every unique label path of a
source exactly once, regardless of the number of
times it appears in that source.
To ensure accuracy, they specify that the
DataGuide encodes no label path that does not
appear in the source. In addition they require that
a DataGuide itself be an OEM object so we can
store and access it using the same techniques
available for processing OEM databases.
3.3.2 Indexing Template-compliant
Paths: T-index
Like DataGuide [14], 1-index [19] is intended to
be used by queries that search the database from
the root for nodes matching some arbitrary path
expressions. 1-index therefore, represents the same
set of paths from the root like DataGuide. The
main idea behind the index construction is the
generation of a non-deterministic automaton
(NFA) [22] to get more compact structure than the
DataGuide. To construct the 1-index of a data
graph, the authors compute for each node the
equivalence class using a bisimulation as
equivalence relation which is defined in the next
definition.
Definition 3-2 (Equivalence Relation ““): For
each node u in the data graph, let the set Lu= {w
a path from the root to node u labeled w}. The
set Lu may be infinite when the graph has cycles;
however, it is always a regular set. Given two
nodes u and v in the data graph we say that they
are language-equivalent in notation u v, if Lu=
Lv.
Definition 3-3 (Bisimilarity): Two nodes in the
data graph are bisimilar () if all label paths into
them are the same. In other words, if node u’ is
parent of node u, node v’ is the parent of node v. If
the two nodes u and v have the same label, then, u
v if u’v’.
Using bisimulation to deal with the index size and
the construction cost problems that DataGuide
3.3.1 Data Guide
DataGuide [14] is a "structural summary" for
semistructured data and may be considered as
analog of traditional database schema in context of
semistructured data management. The DataGuide
is a descriptive schema for XML data. While
prescriptive schemas (DTD, XML Schema, RelaxNG) act more as a traditional database schema,
restricting allowable XML data, a DataGuide
infers rather than imposes structure. DataGuide
describes actual (rather than possible) structure of
XML data extracting the structure from the XML
data. It may be used as schema for semistructered
data without any explicit schema declaration, such
as non-validated XML documents.
The dataguide is based on the Object Exchange
Model (OEM) which is the simple and flexible
data model that originates from the Tsimmis
project at Stanford University [PGW95]. OEM
itself is not particularly original, and the work
presented using OEM adapts easily to any graphstructured data model. A value may be atomic or
complex. Atomic values may be integers, reals,
strings, images, programs, or any other data
considered indivisible. A complex OEM value is a
collection of 0 or more OEM subobjects, each
6
index yields. Where the size of the DataGuide
may be large as the database itself, while 1-index
is at most linear.
similarity requirements that can tailored to support
a given set of frequently used path expressions
and to avoid the A(k)-index drawbacks. For parts
of the data graph targeted only by longer path
expressions, a larger k can be used for finer
partitioning. For parts targeted only for shorter
path expressions, a smaller k can be used for
coarser partitions. However, as a generalization of
1-index and A(k)-index, the D(k)-index processes
the adaptive ability to adjust its structure
according to the current query loads. D(k)-index
has a very nice property compared with 1-index
and A(k)-index because of dynamics. The author
provides an efficient algorithms to update the
D(k)-index with changes in the source data . The
general approach of the D(k)-index is flexible and
powerful, but the index design still has several
limitations that need to overcome. For example of
these limitations, the construction procedure of the
D(k)-index forces all index nodes with the same
label to have the same local similarity, which is
unnecessary and restrictive. The D(k)-index also
proposes
a
promoting
procedure
that
incrementally refines the index to support a given
set of frequently used path expressions. This
procedure increases the local similarity of an
index node if it reached by a given set of
frequently used path expressions in the index
graph. This index node will be partitioned into
smaller nodes, all with the same increased local
similarity. However, the problem is that in general
the index node to be refined also points to data
nodes that are irrelevant to the given set of
frequently used path expressions.
Definition (Index Graph): Index graph means that
we reduced the graph that summarized all the
paths from the root in the data graph, the nodes
that have the same label from root are collected
into one node called index node. The index graph
is smaller than the data. Path expressions can be
directly evaluated from the index graph and can
retrieval label-matching nodes without referring to
the original data graph.
M(k) Index: To overcome these limitations for
the D(k)-index, A M(k)-index (for “Mixed-k”) is
proposed in [15]. The authors built on the strength
of D(k)-index and proposed M(k)-index and
M*(k)-index to overcome its limitation. To
overcome the limitations of over-refinement of
irrelevant index nodes and data nodes, M(k)-index
is proposed to target only the data nodes relevant
The advantage of 1-index and its family (2-index
and T-index [19]) is that, it can be used to evaluate
any path expressions accurately without accessing
the data graph. However, the size of 1-index can
be quit larger for irregular XML data. Moreover,
not all structures are interesting and most queries
probably only involve short path expressions.
A(k)-index: A(k)-index [18] is a type of
approximate structural summary of data graph
since it does not reflect whole structure and nodes
of XML tree are grouped according to the local
structure. With these properties in mind we can
think of several issues as follows.
Not all structures are interesting.
Paths longer than k may be never used.
Complex paths may never show up.
Longer path results in large index graph,
which takes time to construct and
traverse while querying.
We can reach to one solution considering above
issues, that is, use of local similarity, which is
approximate structural summary. We focus on
features of A(k)-index in the following sections
within the view of implementation issues.
Taking advantages of local similarity [3], the A
(k)-index can be substantially smaller than 1-index
[19]. The parameter k control the “resolution” of
the entire A (k)-index; all index nodes have the
same local similarity of k. If k is too smaller, the
index cannot support long path expressions
accurately. If k is too large, the index may become
so large. At this case, evaluating any path
expression over this index will be expensive. The
time required to build the index is O(km) where m
is the number of edges in the data graph.
Furthermore, not all path expressions of length k
are equally common. The A(k)-index lacks the
ability to make certain parts have higher resolution
than the others do, so it can not be optimized for
complex path expressions with wildcards.
D(k)-Index: The D(k)-index is an adaptive
summary structure for the general graphstructured data proposed recently. It allows
different index nodes to have different local
7
to frequent queries. Like the D(k)-index, the M(k)index uses the k-bisimilarity equivalence relation
but allows different k values for different nodes; it
is also incrementally refined to support new
frequently used path expressions extracted from
the query workload. Unlike the D(k)-index,
however, M(k)-index is never over-refined for
irrelevant index or data nodes. Thus, the M(k)index has a smaller size without sacrificing
support for any frequent used path expressions. To
overcome the limitations of over-refinement due
to overqualified parents and single resolution each
node, M*(k)-index is introduced as a collection of
M(k)-indexes whose nodes are organized in a
partition hierarchy, allowing successively coarser
partitioning information to co-exist with the finest
partitioning information required. The M*(k)index maintains k-bisimilarity information for all
k up to some desired maximum, which can be
different across nodes and adjusted dynamically
according to the query workload. This feature
allows the M*(k)-index to avoid over-refinement
due to overqualified parents and support both
short and long path expression queries over the
same data nodes at the same time. Experiments
show that although keeping partitioning
information at different resolutions requires extra
storage space; it is negligible compared to the
savings achieved by avoiding over-refinement.
The performance gain from query processing
further justifies this new approach.
despite some similarities with unstructured text,
XML needs special treatment in terms of
relevance of its elements to a user query and in its
evaluation. Hence we need a new paradigm in its
retrieval techniques and evaluation metrics. On the
whole, XML as an research area holds immense
prospect which is not still extensively explored
and therefore remains an interesting field of
further research.
On the other hand, for Indexing and querying
XML data our survey introduces a short
classification
of
structures
indexes
for
semistructured data based on the navigational axes
they support. Structure index supports all
navigational for XPath axes. Connection index
supports the XPath axes that are used as wildcards
in path expressions (ancestor (descendant)-or-selfrelationship and ancestor-descendant relationship).
Path index supports only the following kinds of
XPath axes (parent-child relationship, ancestordescendant
relationship,
ancestor-or-self
relationship, and descendant-or-self relationship).
For heterogeneous XML documents in the Web
(divided
XML
documents
into
several
subcollections), a single index structure may not
be appropriate. Therefore, it will be investigated
whether it makes sense to combine several indexes
as building blocks. This would allow for building
an index for each subcollection and evaluating the
proposed queries by “navigating” through the
underlying sub-collection only.
4. CONCLUDING REMARKING
Moreover, The most common web technology that
will realize Web 3.0 is RDF (resource document
Framework) model. The Resource Description
Framework (RDF) is a flexible model for
representing information about resources in the
web. With the increasing amount of RDF data
which is becoming available, efficient and
scalable management of RDF data has become a
fundamental challenge to achieve the Semantic
Web vision. So, the most important question is,
could we apply the same technologies used to
store and retrieval Xml to RRD Data, this is still a
very hot topics for research.
After reviewing a number of existing XML
information and retrieval systems, we can draw
some conclusions about the state of the art and
general trends in the fields. Our Survey addresses
what exactly are the requirements for efficient
XML storage management. A storage management
schema must cover the following aspects
efficiently: lossless storage of XML documents,
complete and efficient reconstruction of
decomposed XML documents, and support for
processing path expressions on the XML
document structure, support for processing of
precise and vague predicates on XML content,
navigation in XML documents, and online updates
of XML documents. Moreover, IR community
applies with some modification standard IR
techniques for focused element-level retrieval. But
5. REFERENCES
8
[1]
Awny Sayed , Ahmed A A Radwan,
Mohamed masod. "Efficient Evaluation of
Relevance Feedback algorithms for XML
Content-based Retrieval System". International
journal of web information system, 2010.
[10] D. Chirtopher Manning, P. Raghaven and H.
Schuetze. Introduction to information Retrieval.
Cambridge University Press. 2007.
[11] B. Cooper, N. Sample, M. Franklin, G.
Hjaltason et. al. A fast index for semi structured
data. In VLDB, 2001.
[2] A. Aboulnaga, A. R. Alameldeen, and J. F.
Naughton. Estimating the selectivity of XML path
expressions for internet scale application. In
Proceedings of VLDB, 2001.
[12] Chenying Wang, Xiaojie Yuan, Shitao Yu,
Hua Ning, Huibin Zhang, "A Storage Scheme of
Native XML Database Supporting Efficient
Updates," Database Technology and Applications,
International Workshop on, pp. 522-525, 2009
First International Workshop on Database
Technology and Applications, 2009.
[3] S. Abiteboul, P. Buneman, and D. Suciu. Data
on the Web- From relation to Semistructured data
and XML. San Francisco , Morgan Kaufmann
Publishers, 2000.
[4] S. Abiteboul, H. Kaplan, and T. Milo.
Compact Labeling Schemes for ancestor queries.
In ACM/SIAM Symposium on Discrete
Algorithms (SODA), 2001.
[13] T. Grust and M. Keuulen, Tree Awareness
for Relational DBMS Kernels. In intelligent
search on XML data. , Springer Verlag. 2003.
[14] R. Goldman and J. Wisdom . DataGuides,
Enabling query formulation and optimization in
semstructured databases, In VLDB, 1997.
[5] S. Abiteboul, D. Quess, J, McHugh, J.
Wisdom, and J. Wiener. The Loral Query
Language for Semistrutured Data. In international
Journal of Digital Library, 1997.
[6] S. Allstruo and T. Rauhe. Improved Labeling
schema for ancestor queries.
In ACM/SIAM
Symposium on Discrete Algorithms (SODA),
USA, 2002.
[14] D. Huffman. A method for the construction
of minimum redundancy codes. In IRE, 40, pages
1098-1101, 1952.
[15] H. He and J. Yang. Multisoluation indexing
of XML for frequent queries, In ICDE, 2004.
[7] Awny Sayed “Fast and efficient
computation of connectivity queries over linked
XML documents graph. Issue 1, Vol. 4., 2009,
International journal of web information system.
[16] H. Kaplan, and T. Milo. Short and simple
labels for small distances and other functions. In
WADS , 2001.
[8] Awny Sayed. "A prime Number Labeling
Scheme for Reachability Queries over Complex
XML Collection". the 4th Indian International
Conference on Artificial Intelligence (IICAI-09)
2009.
[17] H. Kaplan, T. Milo, and R. Shabo. A
Comparison of labeling schemes for ancestor
queries. In SODA, 2002, USA.
[18] R. Kaushik, P. Shenoy, P. Bohannon, and E.
Gudes. Exploiting Local Similarity for indexing
paths in graph-structure data. In ICDE , 2002.
[19] T. Milo and D. Suciu. Index structures for
path Expression. In ICDE, 1999.
[NM 10] Natima Mebhaza . Analyzing the Impact
of XML Storage Models on the Performance of
[9] C. Chung, J. Min, and K,Shim. APEX: An
adaptive path index for XML data. SIGMD 2002.
9
Native XML Database Systems. In Seventh
International
Conference
on
Information
Technology Las Vegas, Nevada, USA, 2010.
[20] N. Polyzois, and M. Gaeofalakis, Statistical
synopses for graph- Structured data. In SIGMD,
2002, USA.
[21] J. Shanmugasundaram. K. Tuffe, G. He, C.
Zhang el al. Relational databases for querying
XML documents. : Limitation and opportunities.
In VLDB, 1999.
[22] A. sayed, R. Unland. Indexing Collection of
XML documents with arbitrary Linnks.
Dissertation from Duisburg-Essen Uni., Germany,
2005.
[23] F. Tain, D. DeWitt, J, Chen, and C. Zhang.
The design and performance evaluation of
alternative XML Storage strategies. In SIGMD,
2002.
[24] Wenxin Liang, Akihrio Takahashi and Haruo
Yokota.
A Low-Storage-Consumption XML
Labeling Method for Efficient Structural
Information Extraction, in DEXA 2009:7-22
10