Web Query Optimizer
Vladimir Zadorozhny
Laura Bright
Louiqa Raschid
Tolga Urhan
Maria Esther Vidal
University of Maryland
College Park, MD 20742
Abstract
We demonstrate a Web Query Optimizer (WQO) within
an architecture of mediators and wrappers, for WebSources
of limited capability in a wide area environment. The WQO
has several innovative features including a CBR (capability
based rewriting) Tool, an enhanced randomized relational
optimizer extended to a Web environment, and a WebWrapper cost model that can provide relevant metrics for accessing WebSources. The prototype has been tested against a
number of WebSources.
1. Overview
We consider an architecture of mediators and wrappers
[8], for WebSources of limited capability in a wide area
environment. We have developed a Web Query Optimizer
(WQO) within the mediator, where the mediator has been
developed as an extension of the Predator object-relational
database system [6]. The WQO has two innovative components. The first is a CBR (capability based rewriting) Tool
and the second is an enhanced randomized relational optimizer extended to a Web environment. The third innovation
of this architecture is a WebWrapper cost model that can
provide relevant metrics for accessing WebSources.
The mediator maintains a catalog for external relations
that are supported by wrapper calls to a WebSource. In addition to the schema for these relations, the catalog must provide information on the capabilities of these WebSources,
i.e., limitations on queries that can be submitted to these
WebSources. A Web Query Broker handles wrapper calls
during query execution, providing for each mediator relation (subgoal) a WebSource Implementation (WSI). A WSI
defines, in particular, the access pattern that should be used
to retrieve the data from a WebSource.
This research has been partially supported by the Defense Advanced
Research Projects Agency under grant 01-5-28838, by the National Science
Foundation under grant IRI9630102, and by CONICIT, Venezuela.
In a pre-optimization phase, the CBR Tool [7] produces
(multiple) pre-plan(s) for a mediator query. A pre-plan
consists of (possibly ordered) subgoals to be executed in
the WebSources and the mediator. The pre-plan identifies
one (or more) relevant WebSource Implementations (WSIs, i.e.,
wrapper calls) for a mediator subgoal, as well as restrictions and orderings imposed by the WebSource capabilities.
The WQO uses the pre-plan to drive the relational optimizer.
The WQO first chooses a "good" WSI. In doing so, the WQO
may explore specific evaluation strategies, such as top-down
versus bottom-up evaluation of mediator subgoals. It may
also choose between atomic and composed WSIs for some
subgoals. During optimization, subgoal orderings and subgoal restrictions identified in the pre-plan are provided to the
relational optimizer, and it respects them while producing a
good plan for the subgoals in the query. This is the second
innovation of our research: extending a traditional randomized optimizer to produce plans that respect the limitations
of WebSources. A WebWrapper cost model [2] provides a
number of metrics that can be used by the WQO in choosing
a good WSI and in choosing a good plan. These metrics are
obtained using query feedback, since WebSources typically
are autonomous and do not provide either access costs or
statistics. We have developed WebPT, a tool that can be
used to learn from query feedback and predict the response
time for accessing a WebSource across a wide area network
[3, 4]. Thus, our third innovation is using query feedback to
develop a cost model for autonomous sources. The Predator
evaluation engine has been extended with several operators
to handle the wrapper calls to WebSources. These include
an ExternalScan operator and a DependentJoin operator.
The physical implementation of these operators involves
the WSIs supported by the Web Query Broker. The prototype has been implemented in C++, Java, and Prolog. It has
been tested against a number of WebSources, including the
ACM Digital Library WebSource [1].
2. CBR Tool
To generate only valid plans, we introduce a graph
collapsing technique over the Dependency Graph (DG), which
records the ordering restrictions that a pre-plan imposes on
mediator subgoals. When randomly picking two subgoals
to join, the optimizer first consults the DG. Any two subgoals can be
safely joined provided that collapsing their corresponding
nodes in the DG does not produce a cycle, i.e., the graph
remains acyclic. The graph is then collapsed to reflect the effect of the new join operator. Thus, the Dependency Graph
keeps shrinking as each join operator is generated. Finally,
a single join operator is obtained, which corresponds to the
topmost operator of the query plan.
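The graph collapsing step can be illustrated with a minimal Python sketch. It is not the Predator implementation: the function names (`has_cycle`, `collapse`, `random_valid_plan`, `leaves`), the representation of a join as a nested `(outer, inner)` tuple, and the rule that a dependency's source becomes the outer input are all assumptions made for illustration.

```python
import random
from itertools import combinations

def has_cycle(nodes, edges):
    """Kahn's algorithm: True if the directed graph contains a cycle."""
    indegree = {n: 0 for n in nodes}
    for _, v in edges:
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return removed != len(nodes)

def collapse(nodes, edges, outer, inner):
    """Merge two DG nodes into one join node, dropping the edge between them."""
    merged = (outer, inner)
    def rename(n):
        return merged if n in (outer, inner) else n
    new_nodes = {rename(n) for n in nodes}
    new_edges = {(rename(u), rename(v)) for u, v in edges}
    new_edges = {(u, v) for u, v in new_edges if u != v}
    return new_nodes, new_edges

def random_valid_plan(nodes, edges, rng):
    """Randomly join pairs whose collapse keeps the DG acyclic (assumed policy)."""
    nodes, edges = set(nodes), set(edges)
    while len(nodes) > 1:
        candidates = []
        for a, b in combinations(sorted(nodes, key=repr), 2):
            # If b must precede a, b becomes the outer input of the join.
            outer, inner = (b, a) if (b, a) in edges else (a, b)
            new_nodes, new_edges = collapse(nodes, edges, outer, inner)
            if not has_cycle(new_nodes, new_edges):
                candidates.append((new_nodes, new_edges))
        nodes, edges = rng.choice(candidates)
    return next(iter(nodes))  # the topmost join operator

def leaves(plan):
    """Left-to-right subgoal order of a nested (outer, inner) join tree."""
    if isinstance(plan, str):
        return [plan]
    outer, inner = plan
    return leaves(outer) + leaves(inner)
```

With an edge from `u` to `v` meaning that subgoal `u` must be evaluated before `v`, joining two subgoals that have a third one ordered between them is rejected, because collapsing them would create a cycle through that third node. Every plan the sketch returns therefore lists the subgoals in an order consistent with the dependencies.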
After the usual syntactic and semantic checking is performed on the mediator query, it is rewritten as a set of
subgoals on mediator relations (mediator subgoals). The
CBR Tool accomplishes the following tasks: (a) it determines
whether a mediator query is safe, i.e., whether it can be evaluated respecting the capabilities of the given WebSources; (b) it generates a
set of pre-plans, where a pre-plan specifies a sequence of
mediator subgoals that can be evaluated safely; and (c) for each
mediator subgoal in the pre-plan, it identifies all the
relevant WSIs, each corresponding to a particular wrapper call.
The CBR Tool partitions and partially orders the WebSources
according to their capabilities.
A pre-plan provides the following information to the optimizer: (a) an ordering of mediator subgoals; (b) the WebSource
Implementations (WSIs) that support each mediator subgoal;
and (c) restrictions on queries to the relevant WebSources.
The restrictions are (1) attributes that require bindings, (2)
post-selection attributes, i.e., operations that cannot be evaluated by the WebSource, and (3) post-projection attributes
that can be output from the WebSource.
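The role of binding restrictions in deciding whether a subgoal ordering is safe can be sketched as follows. This is only an illustration under simplifying assumptions, not the CBR Tool's algorithm (which is described in [7]): the `WSI` record, its field names, and the brute-force enumeration in `safe_preplans` are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class WSI:
    """Hypothetical record for one WebSource Implementation (wrapper call)."""
    name: str
    requires_bound: frozenset  # attributes that must be bound in the call
    outputs: frozenset         # attributes the WebSource can return

def is_safe_order(order, wsi_for, constants):
    """True if every subgoal's required bindings are available when it runs.

    Bindings come from query constants or from subgoals evaluated earlier.
    """
    bound = set(constants)
    for subgoal in order:
        wsi = wsi_for[subgoal]
        if not wsi.requires_bound <= bound:
            return False
        bound |= wsi.outputs
    return True

def safe_preplans(wsi_for, constants):
    """Enumerate all safe subgoal orderings (exponential; for illustration only)."""
    return [order for order in permutations(sorted(wsi_for))
            if is_safe_order(order, wsi_for, constants)]
```

For example, if one subgoal needs an author name to be bound and outputs titles, while a second needs a title to be bound, then a query that supplies the author as a constant is safe only when the first subgoal is ordered before the second.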
Pre-plans produced by the CBR Tool can provide a mediator relation with multiple WSIs. Choosing the WSI and
corresponding pre-plan that are most appropriate under
given conditions is the responsibility of the WQO in the first
optimization stage. This choice may significantly influence
the behavior of the relational query optimizer in the second optimization stage. In particular, the WQO may extend the search
space of possible plans explored by the relational optimizer,
or enforce a certain query evaluation strategy.
4. WebWrapper Cost Model
A WebWrapper cost model provides a number of metrics
that can be used by the WQO in choosing a good WSI
and in choosing a good plan. We consider several costs
associated with sending a query to a remote WebSource,
such as remote query processing costs at the WebSource,
and the cost of downloading pages from the WebSource.
Remote cost is the initial cost of processing the query at the
remote WebSource before the source begins to return data. A
remote source may need time to process a query and retrieve
all the relevant data from some underlying database. In
many sources, this cost is significant, especially if the query
result is large. Download cost is the cost of downloading
the relevant data from the WebSource, after the WebSource
has processed the query and produced a page or set of pages
that contain the answers to the query.
We have developed WebPT (Web Prediction Tool), a
tool that can be used to learn from query feedback and predict
the response time for accessing a WebSource across a wide
area network. The WebPT uses query feedback to estimate
the remote and download costs. The estimate could be a
function of different dimensions, including the day, time,
and the quantity of data. In addition, the WebWrapper cost
model considers the number of pages that are downloaded
by the wrapper, since the answer to a query may access
multiple pages and the WebWrapper must download all the
pages that contain the relevant data.
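As a rough illustration of learning from query feedback, the sketch below averages observed response times per (day, time-of-day, data-quantity) bucket. It is a simple stand-in and not the WebPT algorithm, which is described in [3, 4]; the class name, the bucket boundaries, and the additive formula in `access_cost` are all assumptions.

```python
from collections import defaultdict

class FeedbackPredictor:
    """Averages observed response times per bucket (hypothetical stand-in)."""

    def __init__(self):
        # bucket -> [sum of observed seconds, number of observations]
        self._cells = defaultdict(lambda: [0.0, 0])

    @staticmethod
    def _bucket(day, hour, kbytes):
        # Assumed granularity: 6-hour time slots and 100 KB size classes.
        return (day, hour // 6, min(kbytes // 100, 9))

    def observe(self, day, hour, kbytes, seconds):
        """Record one query feedback observation."""
        cell = self._cells[self._bucket(day, hour, kbytes)]
        cell[0] += seconds
        cell[1] += 1

    def predict(self, day, hour, kbytes, default=None):
        """Mean observed time for the matching bucket, or default if unseen."""
        key = self._bucket(day, hour, kbytes)
        if key not in self._cells:
            return default
        total, count = self._cells[key]
        return total / count

def access_cost(remote_seconds, seconds_per_page, pages):
    """Assumed decomposition: remote cost plus download cost over all pages."""
    return remote_seconds + seconds_per_page * pages
```

One predictor instance could be kept for the remote cost and another for the per-page download cost, with `access_cost` combining the two estimates for a wrapper call that must fetch several pages.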
The WebWrapper cost model was integrated with the native
Predator cost model. The cost of a query plan in Predator is expressed in terms of various resource usages, e.g.,
disk usage, memory usage, etc. We extended this model
by introducing a WebWrapper resource and its usage. The
WebWrapper usage values include statistics, such as the
cardinality of a relation in a WebSource, and the wrapper execution cost. These values are provided by the WebWrapper,
which uses WebPT trained on query feedback for the WebSources. The cost of the ExternalScan operator is increased
according to the particular WebWrapper usage. The
statistics and the WebWrapper usage are also used to determine the cost of a specific implementation of the
DependentJoin operator.
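The idea of adding a WebWrapper resource to a resource-usage cost model can be sketched as below. Predator's actual cost structures are not shown here, so the `Usage` record, the weights in `plan_cost`, and in particular the assumption that a DependentJoin issues one wrapper call per outer tuple are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    """Resource usages of a plan fragment; web_wrapper is the added resource."""
    disk: float = 0.0
    memory: float = 0.0
    web_wrapper: float = 0.0

    def __add__(self, other):
        return Usage(self.disk + other.disk,
                     self.memory + other.memory,
                     self.web_wrapper + other.web_wrapper)

def external_scan(wrapper_seconds):
    """An ExternalScan is charged the predicted wrapper execution cost."""
    return Usage(web_wrapper=wrapper_seconds)

def dependent_join(outer, outer_cardinality, seconds_per_call):
    """Assumption: one wrapper call is issued per tuple of the outer input."""
    return outer + Usage(web_wrapper=outer_cardinality * seconds_per_call)

def plan_cost(usage, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of resource usages (the weights are illustrative)."""
    w_disk, w_memory, w_web = weights
    return (w_disk * usage.disk
            + w_memory * usage.memory
            + w_web * usage.web_wrapper)
```

Under these assumptions, the estimated cardinality of the outer relation directly scales the WebWrapper usage of a DependentJoin, which is one way the statistics provided by the WebWrapper could influence the choice between join implementations.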
3. Extended Randomized Optimizer
After the WSI for each mediator subgoal is chosen, subgoal orderings and subgoal restrictions identified in the pre-plan are provided to the relational optimizer, and it respects
them while producing a good plan for the subgoals in the
query. Moreover, we force the randomized optimizer to generate only valid plans using a graph collapsing technique.
The randomized optimizer [5] performs random walks
over the search space and picks the plan with the cheapest
cost among the plans it has examined. Each random walk
stage consists of an initial plan generation step followed by
a number of plan transformation steps. Both plan generation
and plan transformation have to respect the ordering restrictions on mediator subgoals imposed by a pre-plan. Plans
whose join orderings violate the ordering imposed by the
pre-plan must be avoided. On the basis of pre-plan ordering
and corresponding WSIs for mediator subgoals, the Web
Query Optimizer generates a Dependency Graph (DG) for
the randomized optimizer. The Dependency Graph reflects
the pre-plan order restrictions and is used by the optimizer
to generate only valid plans.
References
[1] ACM Digital Library. http://www.acm.org/dl/Search.html.
[2] L. Bright and L. Raschid. Cost Modeling of Wrappers for Web Accessible Data Sources (WebSources).
http://www.umiacs.umd.edu/labs/CLIP/DARPA/ww97.html.
(under review), 1999.
[3] L. Bright, L. Raschid, V. Zadorozhny, and T. Zhan. Learning
Response Times for WebSources: A Comparison of a Web
Prediction Tool (WebPT) and a Neural Network. Proc. of the
CoopIS Conference, 1999.
[4] J.-R. Gruser, L. Raschid, V. Zadorozhny, and T. Zhan. Learning Response Time for WebSources using Query Feedback
and Application in Query Optimization. To appear in The
VLDB Journal, 2000.
[5] Y. Ioannidis and Y. Kang. Randomized Algorithms for Optimizing Large Join Queries. Proc. of the ACM SIGMOD Conference,
1990.
[6] P. Seshadri and M. Paskin. PREDATOR: An OR-DBMS with
Enhanced Data Types. Proc. of the ACM SIGMOD Conference,
1997.
[7] M. Vidal and L. Raschid. WebSrcMed: A Mediator for Scaling up to Multiple Web Accessible Sources (WebSources).
ftp://www.umiacs.umd.edu/pub/mvidal/websrcmed.ps. (under
review), 1998.
[8] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, pages 38–49, March
1992.