Web query optimizer

Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073), 2000
Vladimir Zadorozhny, Laura Bright, Louiqa Raschid, Tolga Urhan, Maria Esther Vidal
University of Maryland, College Park, MD 20742

Abstract

We demonstrate a Web Query Optimizer (WQO) within an architecture of mediators and wrappers, for WebSources of limited capability in a wide area environment. The WQO has several innovative features, including a CBR (capability-based rewriting) Tool, an enhanced randomized relational optimizer extended to a Web environment, and a WebWrapper cost model that can provide relevant metrics for accessing WebSources. The prototype has been tested against a number of WebSources.

1. Overview

We consider an architecture of mediators and wrappers [8], for WebSources of limited capability in a wide area environment. We have developed a Web Query Optimizer (WQO) within the mediator, where the mediator has been developed as an extension of the Predator object-relational database system [6]. The WQO has two innovative components: a CBR (capability-based rewriting) Tool, and an enhanced randomized relational optimizer extended to a Web environment. A third innovation of this architecture is a WebWrapper cost model that can provide relevant metrics for accessing WebSources.

The mediator maintains a catalog for external relations that are supported by wrapper calls to a WebSource. In addition to the schema for these relations, the catalog must provide information on the capabilities of these WebSources, i.e., limitations on the queries that can be submitted to them. A Web Query Broker handles wrapper calls during query execution, providing for each mediator relation (subgoal) a WebSource implementation (WSI). A WSI defines, in particular, the access pattern that should be used to retrieve data from a WebSource.

This research has been partially supported by the Defense Advanced Research Projects Agency under grant 01-5-28838, by the National Science Foundation under grant IRI9630102, and by CONICIT, Venezuela.

In a pre-optimization phase, the CBR Tool [7] produces one or more pre-plans for a mediator query. A pre-plan consists of (possibly ordered) subgoals to be executed in the WebSources and the mediator. The pre-plan identifies one or more relevant WebSource implementations (WSIs, i.e., wrapper calls) for a mediator subgoal, as well as restrictions and orderings imposed by the WebSource capabilities.

The WQO uses the pre-plan to drive the relational optimizer. The WQO first chooses a "good" WSI. In doing so, the WQO may explore specific evaluation strategies, such as top-down versus bottom-up evaluation of mediator subgoals. It may also choose between atomic and composed WSIs for some subgoals. During optimization, the subgoal orderings and subgoal restrictions identified in the pre-plan are provided to the relational optimizer, which respects them while producing a good plan for the subgoals in the query. This is the second innovation of our research: extending a traditional randomized optimizer to produce plans that respect the limitations of WebSources.

A WebWrapper cost model [2] provides a number of metrics that can be used by the WQO in choosing a good WSI and in choosing a good plan. These metrics are obtained using query feedback, since WebSources are typically autonomous and provide neither access costs nor statistics. We have developed WebPT, a tool that learns from query feedback and predicts the response time for accessing a WebSource across a wide area network [3, 4]. Thus, our third innovation is using query feedback to develop a cost model for autonomous sources.

The Predator evaluation engine has been extended with several operators to handle the wrapper calls to WebSources.
These include an ExternalScan operator and a DependentJoin operator. The physical implementation of these operators involves the WSIs supported by the Web Query Broker. The prototype has been implemented in C++, Java, and Prolog. It has been tested against a number of WebSources, including the ACM Digital Library WebSource [1].
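
The paper does not spell out the semantics of the two operators, so the following is only a minimal sketch under a common reading: ExternalScan issues one wrapper call with a set of bindings, and DependentJoin feeds values from each outer tuple into the wrapper call as bindings. The names `external_scan`, `dependent_join`, `papers_by_author`, and the in-memory data are hypothetical stand-ins, not the Predator or Web Query Broker interfaces.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]
# Hypothetical shape of a wrapper call: keyword bindings in, rows out.
WrapperCall = Callable[..., Iterable[Row]]


def external_scan(wrapper_call: WrapperCall, **bindings: object) -> Iterator[Row]:
    """Assumed ExternalScan: one wrapper call with the given attribute bindings."""
    yield from wrapper_call(**bindings)


def dependent_join(
    outer: Iterable[Row], wrapper_call: WrapperCall, bind: Dict[str, str]
) -> Iterator[Row]:
    """Assumed DependentJoin: for each outer row, bind wrapper parameters
    from outer columns (bind maps parameter name -> outer column name),
    scan the source, and merge each returned row with the outer row."""
    for outer_row in outer:
        bindings = {param: outer_row[column] for param, column in bind.items()}
        for inner_row in external_scan(wrapper_call, **bindings):
            yield {**outer_row, **inner_row}


# In-memory stand-in for a WebSource that requires 'author' to be bound.
_PAPERS = [
    {"author": "Smith", "title": "Query Feedback"},
    {"author": "Jones", "title": "Mediators"},
    {"author": "Smith", "title": "Wrappers"},
]


def papers_by_author(author: str) -> Iterable[Row]:
    return [p for p in _PAPERS if p["author"] == author]


authors = [{"author": "Smith", "affiliation": "UMD"}]
joined = list(dependent_join(authors, papers_by_author, {"author": "author"}))
```

Here the outer relation supplies the binding that the limited-capability source requires, which is why such a join cannot be reordered freely by the optimizer.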

2. CBR Tool

After the usual syntactic and semantic checking is performed on the mediator query, it is rewritten as a set of subgoals on mediator relations (mediator subgoals). The CBR tool accomplishes the following tasks: (a) it determines whether a mediator query is safe, i.e., whether it can be evaluated while respecting the capabilities of the given WebSources; (b) it generates a set of pre-plans, where a pre-plan specifies a sequence of mediator subgoals that can be evaluated safely; (c) for each mediator subgoal in the pre-plan, it identifies all the relevant WSIs, corresponding to particular wrapper calls. The CBR tool partitions and partially orders the WebSources according to their capabilities.

A pre-plan provides the following information to the optimizer: (a) an ordering of mediator subgoals; (b) the WebSource implementations (WSIs) that support each mediator subgoal; and (c) restrictions on queries to the relevant WebSources. The restrictions are (1) attributes that require bindings, (2) post-selection attributes, i.e., operations that cannot be evaluated by the WebSource, and (3) post-projection attributes that can be output from the WebSource.

Pre-plans produced by the CBR Tool can provide a mediator relation with multiple WSIs. Choosing the WSI and corresponding pre-plan that are most appropriate under the given conditions is the responsibility of the WQO in the first optimization stage. This choice may significantly influence the behavior of the relational query optimizer in the second optimization stage. In particular, the WQO may extend the search space of possible plans explored by the relational optimizer, or enforce a particular query evaluation strategy.

3. Extended Randomized Optimizer

After the WSI for each mediator subgoal is chosen, the subgoal orderings and subgoal restrictions identified in the pre-plan are provided to the relational optimizer, which respects them while producing a good plan for the subgoals in the query.
Moreover, we force the randomized optimizer to generate only valid plans by using a graph collapsing technique.

The randomized optimizer [5] performs random walks over the search space and picks the plan with the lowest cost among the plans it has examined. Each random walk stage consists of an initial plan generation step followed by a number of plan transformation steps. Both plan generation and plan transformation have to respect the ordering restrictions on mediator subgoals imposed by a pre-plan. Plans whose join orderings violate the ordering imposed by the pre-plan must be avoided.

On the basis of the pre-plan ordering and the corresponding WSIs for mediator subgoals, the Web Query Optimizer generates a Dependency Graph (DG) for the randomized optimizer. The Dependency Graph reflects the pre-plan order restrictions and is used by the optimizer to generate only valid plans. To do that, we introduce a graph collapsing technique. When randomly picking two subgoals to join, we first consult the DG. Any two subgoals can be safely joined provided that collapsing their corresponding nodes in the DG does not produce cycles, i.e., the collapsed graph is still a valid DG. The graph is then collapsed to reflect the effect of the new join operator. Thus, the Dependency Graph keeps shrinking as each join operator is generated. Finally, a single join operator is obtained, which corresponds to the topmost operator of the query plan.

4. WebWrapper Cost Model

A WebWrapper cost model provides a number of metrics that can be used by the WQO in choosing a good WSI and in choosing a good plan. We consider several costs associated with sending a query to a remote WebSource, such as the remote query processing cost at the WebSource and the cost of downloading pages from the WebSource. Remote cost is the initial cost of processing the query at the remote WebSource before the source begins to return data.
A remote source may need time to process a query and retrieve all the relevant data from some underlying database. For many sources, this cost is significant, especially if the query result is large. Download cost is the cost of downloading the relevant data from the WebSource, after the WebSource has processed the query and produced a page or set of pages that contain the answers to the query.

We have developed WebPT (Web Prediction Tool), a tool that learns from query feedback and predicts the response time for accessing a WebSource across a wide area network. The WebPT uses query feedback to estimate the remote and download costs. The estimate can be a function of several dimensions, including the day, the time, and the quantity of data. In addition, the WebWrapper cost model considers the number of pages that are downloaded by the wrapper, since the answer to a query may span multiple pages and the WebWrapper must download all the pages that contain the relevant data.

The WebWrapper cost model was integrated with the native Predator cost model. The cost of a query plan in Predator is expressed in terms of various resource usages, e.g., disk usage and memory usage. We extended this model and introduced a WebWrapper resource and its usage. The WebWrapper usage values include statistics, such as the cardinality of a relation in a WebSource, and the wrapper execution cost. These values are provided by the WebWrapper, which uses a WebPT trained on query feedback for the WebSources. The cost of the ExternalScan operator is increased according to the particular WebWrapper usage. The statistics and the WebWrapper usage are also used to determine the cost of a specific implementation of the DependentJoin operator.
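
To make the cost components above concrete, here is a minimal sketch of a feedback-based estimator. It is not the WebPT algorithm: the bucketing by weekday, six-hour time slot, and 100 KB size band, the averaging of past observations, and the formula remote cost plus pages times per-page download cost are all assumptions chosen for illustration, and the names `FeedbackPredictor` and `wrapper_cost` are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Key = Tuple[int, int, int]


class FeedbackPredictor:
    """Toy stand-in for a query-feedback predictor (not the actual WebPT)."""

    def __init__(self) -> None:
        # bucket -> list of observed (remote_seconds, download_seconds_per_page)
        self._samples: Dict[Key, List[Tuple[float, float]]] = defaultdict(list)

    @staticmethod
    def _key(weekday: int, hour: int, page_bytes: int) -> Key:
        # Assumed dimensions: day, time of day, and quantity of data.
        return (weekday, hour // 6, page_bytes // 100_000)

    def record(self, weekday: int, hour: int, page_bytes: int,
               remote_s: float, download_s: float) -> None:
        """Store one observation obtained from executing a wrapper call."""
        key = self._key(weekday, hour, page_bytes)
        self._samples[key].append((remote_s, download_s))

    def predict(self, weekday: int, hour: int, page_bytes: int) -> Tuple[float, float]:
        """Average the feedback in the matching bucket, else over all feedback."""
        samples = self._samples.get(self._key(weekday, hour, page_bytes))
        if not samples:
            samples = [s for bucket in self._samples.values() for s in bucket]
        if not samples:
            raise ValueError("no query feedback recorded yet")
        return mean(r for r, _ in samples), mean(d for _, d in samples)


def wrapper_cost(predictor: FeedbackPredictor, weekday: int, hour: int,
                 page_bytes: int, pages: int) -> float:
    """Assumed cost: remote cost once, plus download cost for every page."""
    remote_s, download_s = predictor.predict(weekday, hour, page_bytes)
    return remote_s + pages * download_s


predictor = FeedbackPredictor()
predictor.record(weekday=1, hour=10, page_bytes=50_000, remote_s=2.0, download_s=0.5)
predictor.record(weekday=1, hour=11, page_bytes=60_000, remote_s=4.0, download_s=1.5)
estimate = wrapper_cost(predictor, weekday=1, hour=9, page_bytes=40_000, pages=3)
```

In an optimizer, an estimate like this would be the WebWrapper usage added to the cost of an ExternalScan, alongside the native resource usages.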

References

[1] ACM Digital Library. http://www.acm.org/dl/Search.html.
[2] L. Bright and L. Raschid. Cost Modeling of Wrappers for Web Accessible Data Sources (WebSources). http://www.umiacs.umd.edu/labs/CLIP/DARPA/ww97.html. (under review), 1999.
[3] L. Bright, L. Raschid, V. Zadorozhny, and T. Zhan. Learning Response Times for WebSources: A Comparison of a Web Prediction Tool (WebPT) and a Neural Network. Proc. of the CoopIS Conference, 1999.
[4] J.-R. Gruser, L. Raschid, V. Zadorozhny, and T. Zhan. Learning Response Time for WebSources using Query Feedback and Application in Query Optimization. To appear in The VLDB Journal, 2000.
[5] Y. Ioannidis and Y. Kang. Randomized Algorithms for Optimizing Large Join Queries. Proc. of the ACM SIGMOD Conference, 1990.
[6] P. Seshadri and M. Paskin. PREDATOR: An OR-DBMS with Enhanced Data Types. Proc. of the ACM SIGMOD Conference, 1997.
[7] M. Vidal and L. Raschid. WebSrcMed: A Mediator for Scaling up to Multiple Web Accessible Sources (WebSources). ftp://www.umiacs.umd.edu/pub/mvidal/websrcmed.ps. (under review), 1998.
[8] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, pages 38–49, March 1992.