Exploiting multiple paths to express scientific queries

Maria-Esther Vidal

Scientific and …, 2004


Exploiting Multiple Paths to Express Scientific Queries

Zoé Lacroix, Tiffany Morris, Kaushal Parekh
Arizona State University
{zoe.lacroix, tiffany.j.morris, kaushal}@asu.edu

Louiqa Raschid
University of Maryland
louiqa@umiacs.umd.edu

Maria-Esther Vidal
Universidad Simón Bolívar
mvidal@ldc.usb.ve

Abstract

The purpose of this demonstration is to present the main features of the BioNavigation system. Scientific data collection needed in various stages of scientific discovery is typically performed manually. For each scientific object of interest (e.g., a gene, a sequence), scientists query a succession of Web resources, following links between retrieved entries. Each step provides part of the intended characterization of the scientific object. This process is sometimes partially supported by hard-coded scripts or by complex queries that are evaluated by a mediation-based data integration system or against a data warehouse. These approaches fail to guide scientists during the collection process. In contrast, the BioNavigation approach presented in the paper provides scientists with information on the available alternative resources, their provenance, and the costs of data collection. The BioNavigation system enhances a mediation-based integration system and provides scientists with support for the following:

• To ask queries at a high conceptual level;
• To visualize the multiple alternative resources that may be exploited to execute their data collection queries;
• To choose the final execution path to evaluate their queries.

1. Introduction

A scientific protocol typically involves the collection of information about various entries retrieved from multiple data sources that are often linked in a large federation, e.g., the NCBI Nucleotide database and PubMed. A data collection protocol may require accessing several data sources successively, and following links from entry to entry, to constitute the pool of data for analysis and validation of the protocol.

Scientists are not necessarily familiar with all possible life science sources. In 2004, Galperin listed 548 public molecular biology databases [1], an increase of 42% over the previous year [2]. The increasing number of scientific resources and the diversity of their formats and capabilities, combined with the many changes made to the resources over time, make it difficult for scientists to keep their knowledge of information resources up to date. While the large number of such resources provides useful information to scientists, it also makes them difficult to exploit. It takes time to interrogate a new data source and to understand and become familiar with its format, capabilities, and the quality of the data it provides.

Yet most systems currently available to scientists that provide access to integrated biological resources, including DiscoveryLink [3], TAMBIS [4], and SRS [5], either expect the user to specify explicitly the resources involved in the data collection process or transparently choose a particular database for the user. Both approaches raise critical issues and may affect the data collection process, and thus the quality and completeness of the retrieved data. An approach relying on explicit user input expects the user to know all available resources and to choose the most appropriate one to exploit. Transparent access does not allow the user to provide any guidance in the collection process, so while the scientist avoids this tedious task, the provenance of the collected data is hidden from the user. In contrast, our approach allows scientists to identify and select, among all available resources, the ones they can use to answer their queries.
Combined with a mediation system, BioNavigation allows users to express their data collection queries at a high conceptual level. For example, a scientist interested in retrieving papers relevant to a specific disease will be able to express a query with two concept classes: disease and citation. The system will return all possible ways to evaluate such a query, expressed as paths in the graph of all available resources. For each path, the user has the opportunity to browse the characteristics of its resources and select the ones that are more likely to implement the pipeline. All resource characteristics are combined in a capability map that can be exploited in the source selection [6,7]. This map captures metadata that include the name, hosting institution, URL, format (in XML), and useful statistics such as the number of entries, information on the data quality (data curation), etc. The capability map allows scientists to explore resources they may not be familiar with but that are integrated in the mediation.

Our system is particularly useful because, in addition to exploiting alternate similar resources, it goes a step further and returns to the user paths that are ranked with respect to some benefit, e.g., maximizing the number of retrieved entries. BioNavigation uses the ESearch algorithm [8], which takes a regular expression as input and performs an exhaustive breadth-first search of all paths in a graph. Each path through the graph of all available resources (Figure 1) may have different properties that have varying benefits.

Figure 1. Physical graph

To better introduce our system, we present a complex query typical of biological research and describe how it may be evaluated using life science resources. Our focus is on the navigation of the many alternate links and paths. Consider the following complex query: Return all citations published since 1995 that mention “heart” and refer to sequences that are annotated as “calcium channel” [10].

The solution to this query requires knowledge of the capabilities of all available biological resources. There are many different paths exploiting various databases and applications that will enable the user to solve this query. A user can first access PubMed and retrieve all citations published since 1995 that mention “heart”, and then extract from this output all GenBank identifiers. The user can then retrieve the information from GenBank and keep only the sequences that are annotated as “calcium channel”. An alternative approach would be to follow the Entrez Nucleotide Links from PubMed. The user could also first access GenBank and retrieve all sequences that are annotated as “calcium channel”, extract the MEDLINE identifiers from the output, and then use these identifiers to retrieve the citations from PubMed and keep only the ones published since 1995 that mention “heart”. Other biological resources may also be used to retrieve the information. For instance, a scientist may retrieve sequence information from the EMBL Nucleotide database instead of NCBI.

Each of the above paths may return different information: the number of entries (sequences and citations) may differ, and the characterization of the retrieved entries (attributes) may differ. In addition, the cost of executing the query on each of these paths may also differ. This example clearly shows that different paths involving different combinations of biological resources can be used to obtain a dataset.

Traditional database integration approaches do not exploit alternative resources to evaluate a query. In contrast, the BioNavigation system allows users to explore the characteristics of the paths and then select the most appropriate path. The various paths of biological resources that evaluate queries on a mediator are addressed with GeneSeek¹ [11]. Unlike the BioNavigation system, GeneSeek only exploits the type (1-to-1 or 1-to-many) of links between entries, and does not return ranked possible paths.

¹ GeneSeek has been renamed BioMediation recently.

2. What Will Be Demonstrated

BioNavigation is a scientist-friendly interface to a mediation system that integrates multiple biological resources. The interface first helps the user graphically build a data collection query that consists of a regular expression. The user can then view the results as paths through available resources, and explore their characteristics.

The internal representation of the biological resources is twofold: a logical layer represents the scientific concepts (e.g., gene) typically used to express the queries, while the physical layer represents the resources integrated by the mediator. Data sources (e.g., PubMed) are represented by nodes in the physical graph shown in Figure 1. Source capabilities (e.g., the Entrez Nucleotide Links, BLAST) are represented as edges between data sources at the physical level. Each physical source may implement one or multiple scientific concepts (e.g., Unigene implements the concepts gene and sequence), which are nodes in the logical graph (see Figure 2). Biological capabilities implement logical links (edges between scientific concepts in the logical graph).

Figure 2. Mapping from logical to physical level

In the browsing mode, the user can explore the graphs, learning more about the different sources integrated in the mediation, the links, and their logical and physical mappings. In the querying mode, the user expresses a data collection pipeline through scientific concepts. The following sections describe these two modes of the interface.

2.1 Browsing Mode

The browsing functionality of the interface allows the scientist to explore the logical and physical graphs as well as the mapping between them, and to access the metadata of each available biological resource. The graphs are visualized using the Graph Visualization Framework [12], a Java-based architecture for visualization and manipulation of large graphs.
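
The two-layer representation described above can be illustrated with a minimal Python sketch. It is not the BioNavigation implementation: the class and function names are hypothetical, and every capability name other than the Entrez Nucleotide Links is invented for illustration. Concepts are mapped to the physical sources that implement them, and each physical capability records the logical link it implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Physical-level edge: a source capability linking two data sources."""
    name: str
    source: str
    target: str
    implements: tuple  # logical link as (from_concept, to_concept)


# Logical concept -> physical sources implementing it (small illustrative subset).
CONCEPT_SOURCES = {
    "citation": ["PubMed"],
    "gene": ["GeneCards", "Unigene"],
    "sequence": ["GenBank", "Unigene"],
}

# Only "Entrez Nucleotide Links" is named in the text; the others are hypothetical.
CAPABILITIES = [
    Capability("Entrez Nucleotide Links", "PubMed", "GenBank", ("citation", "sequence")),
    Capability("hypothetical-id-extraction", "PubMed", "GenBank", ("citation", "sequence")),
    Capability("hypothetical-gene-to-citation", "Unigene", "PubMed", ("gene", "citation")),
]


def concepts_of(source):
    """Return the scientific concepts that a physical source implements."""
    return sorted(c for c, srcs in CONCEPT_SOURCES.items() if source in srcs)


def implementations(from_concept, to_concept):
    """Return the physical capabilities that implement a given logical link."""
    return [cap for cap in CAPABILITIES if cap.implements == (from_concept, to_concept)]


if __name__ == "__main__":
    print(concepts_of("Unigene"))
    print([cap.name for cap in implementations("citation", "sequence")])
```

Storing the implemented logical link on each capability keeps both directions of the mapping cheap to query: from a concept or link down to its physical implementations, and from a source up to its concepts.
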
Step by step, the scientist may explore the logical graph by first selecting a concept, and then exploring all concepts linked to it, including incoming and outgoing links. Each concept and each link between concepts may be selected to display its physical implementation. For example, the concept citation maps to the physical data source PubMed, while the link between citation and sequence is mapped to the Entrez Nucleotide Links.

From the physical graph, the browsing mode allows the user to search for a particular data source by name. As with the logical graph, a node of the physical graph can be selected to display the incoming and outgoing links to and from other sources. The browsing mode also enables scientists to explore similar biological resources. For example, when the user selects a physical capability, the system may display all similar capabilities, i.e., the capabilities implementing the same logical links as the selected capability. Finally, the user may display the metadata for each biological resource, i.e., each node and edge of the physical graph. The browsing mode allows the user to better understand the available resources before expressing any query and to plan the navigation protocol according to the output of the query.

2.2 Query Mode

The query mode allows the user to express a query through scientific concepts, generates the regular expression input of ESearch [8], and then returns the paths. To express a query, the user selects the start and destination nodes, and intermediate nodes if desired. The selection results in a regular expression built from the symbols for each node. The regular expression can be at the logical level, at the physical level, or a combination of both. At the logical level the regular expression is made up of the class symbols, while at the physical level it contains the actual physical source names. In the case of a physical path, the user can also opt to get paths similar to the selected one. When logical and physical expressions are combined, one can, for example, use or avoid a particular physical source in part of the regular expression while the remaining part is more general or is at the logical level. The generated regular expression remains available for editing by advanced users who may want to tweak it manually.

For instance, a scientist who wants to “Retrieve all citations linked to genes of interest” would simply select the node gene, then click on the node citation, and select the logical link between the two selected classes. These selections generate the regular expression g.c, which will return all paths of length 1, i.e., GeneCards → PubMed and Unigene → PubMed.

Scientists interested in evaluating their queries on paths that include one or more specific intermediate data source(s) would simply click on the source(s) before clicking on the final source. For instance, to extend the above example with sequence as an intermediate source, the user would select gene, then sequence, then citation. This would generate the regular expression g.s.c.

The user may prefer to have paths with an unlimited number of unspecified intermediate data sources (path of length > 2). Using the same example, the regular expression generated would be g. +.c.

A final option is to exclude a particular physical source from the path or to request that a particular physical source be used in all returned paths. To retrieve citations relevant to genes with paths of length ≥ 2 that do not start with GeneCards, the system will generate the regular expression g!genecards. *.c. The regular expression corresponding to the retrieval of citations from PubMed linked to genes is g.pubmed.

The regular expression generated by the system is displayed to the user and submitted as input to the ESearch algorithm.
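
The fixed-length expressions above (g.c, g.s.c, g.pubmed) can be illustrated with a small breadth-first enumeration. This is only a sketch under stated assumptions, not the ESearch algorithm of [8]: the graph below is a toy subset in which only the links GeneCards → PubMed and Unigene → PubMed come from the text, the function names are hypothetical, and wildcard and exclusion operators are deliberately left out.

```python
from collections import deque

# Toy physical graph: source -> sources reachable through one capability.
# Links other than GeneCards/Unigene -> PubMed are assumed for illustration.
LINKS = {
    "GeneCards": ["PubMed", "GenBank"],
    "Unigene": ["PubMed", "GenBank"],
    "GenBank": ["PubMed"],
    "PubMed": ["GenBank"],
}

# Class symbol -> physical sources implementing that concept.
SYMBOLS = {
    "g": {"GeneCards", "Unigene"},   # gene
    "s": {"GenBank", "Unigene"},     # sequence
    "c": {"PubMed"},                 # citation
}


def matches(token, source):
    """A token is either a class symbol or a lower-cased physical source name."""
    if token in SYMBOLS:
        return source in SYMBOLS[token]
    return token == source.lower()


def find_paths(expression):
    """Breadth-first enumeration of paths matching a dot-separated expression."""
    tokens = expression.split(".")
    queue = deque([s] for s in sorted(LINKS) if matches(tokens[0], s))
    results = []
    while queue:
        path = queue.popleft()
        if len(path) == len(tokens):
            results.append(path)      # every token has been matched
            continue
        for nxt in LINKS[path[-1]]:
            if matches(tokens[len(path)], nxt):
                queue.append(path + [nxt])
    return results


if __name__ == "__main__":
    for expr in ("g.c", "g.s.c", "g.pubmed"):
        print(expr, find_paths(expr))
```

On this toy graph, g.c yields the two paths named in the text, GeneCards → PubMed and Unigene → PubMed. Because each token consumes exactly one node, the search always terminates, even though the graph contains a cycle between PubMed and GenBank.
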

2.3 Display Results

ESearch processes the regular expression and generates a result graph of paths that satisfy the regular expression, as well as a list of ranked paths. These returned paths are at the physical level and indicate the corresponding data sources and the physical links. Ranking the paths is based on a variety of criteria including (1) the cardinality or the number of data sources in the path, (2) the target object cardinality or the number of distinct objects in the target sources, (3) the path cardinality (ignoring duplicates in the intermediate nodes and the target node), and (4) the attribute cardinality for all the sources in the path.

Top ranked paths can be displayed in the physical graph using different colors for each top ranked path. The user can browse the result graph in a similar manner to the physical graph. If desired, the rationale behind the path rankings can also be explained to the user. The result graph points the user to the actual data sources to guide the implementation of the scientific pipeline. Scientists then have the ability to browse the retrieved paths and access their properties [6] in order to select the one that will be used by the mediation system to evaluate the query.

3. Conclusions and Future Work

The BioNavigation interface enhances existing mediation approaches. It provides scientists with the ability to browse through available integrated resources and to access their properties. It returns alternate paths that may be exploited to evaluate the scientists’ queries. With the BioNavigation interface, scientists may choose resources that satisfy their scientific criteria without being familiar with their complex characteristics (capabilities, format, etc.). In future work, we will extend the interface so that scientists can provide details on their own criteria for ranking paths.
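
Ranking by a criterion chosen by the user, in the spirit of the criteria listed in Section 2.3, can be sketched as follows. The record type, the function name, and all numbers are placeholders invented for illustration; they are not measurements from any real source, and the sketch does not reflect how ESearch computes its rankings.

```python
from dataclasses import dataclass


@dataclass
class RankedPath:
    sources: list          # physical sources along the path, in order
    target_objects: int    # distinct objects in the target source (placeholder value)
    attributes: int        # attributes over all sources in the path (placeholder value)

    @property
    def length(self):
        """Number of data sources in the path."""
        return len(self.sources)


def rank_paths(paths, criterion, maximize=True):
    """Order paths by one criterion, either maximizing or minimizing it."""
    return sorted(paths, key=lambda p: getattr(p, criterion), reverse=maximize)


# Invented statistics, for illustration only.
PATHS = [
    RankedPath(["GeneCards", "PubMed"], target_objects=120, attributes=15),
    RankedPath(["Unigene", "PubMed"], target_objects=300, attributes=12),
    RankedPath(["Unigene", "GenBank", "PubMed"], target_objects=250, attributes=30),
]

if __name__ == "__main__":
    for p in rank_paths(PATHS, "target_objects"):
        print(" -> ".join(p.sources), p.target_objects)
```

Passing the criterion as a parameter lets the same function maximize the number of retrieved objects for one scientist and minimize the number of sources in the path for another.
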
Criteria of interest have been identified in [6,9] and include the number of retrieved entries, the characterization (number of attributes) of the retrieved entries, the cost (time) of the evaluation, etc. Each of these criteria may be maximized or minimized. For example, a scientist expecting to obtain the best (most complete) characterization of a gene may choose a path that maximizes the number of attributes. BioNavigation will be coupled with the DiscoveryLink mediation system [3] to execute the queries once the evaluation path has been selected.

Acknowledgements: This research is partially supported by NSF grants 02230042 and 022847 and NIH National Library of Medicine grant R03 LM008046-01. Anna Joy and Mike Berens of the Brain Tumor Research Unit of the Translational Genomics Research Institute (TGen) helped define the BioNavigation system requirements. Pallavi Mundumby and Srilakshmi Bysani contributed to defining the capability maps. We also thank Julia Rice and Peter Schwarz of IBM for their input and for sharing their insight on DiscoveryLink.

4. References

[1] M. Y. Galperin, “The Molecular Biology Database Collection: 2004 update”, Nucleic Acids Research, Vol. 32, pp 3-22, 2004.
[2] A. D. Baxevanis, “The Molecular Biology Database Collection: 2003 update”, Nucleic Acids Research, Vol. 31, pp 1-12, 2003.
[3] L. Haas, B. Eckman, P. Kodali, E. Lin, J. Rice, and P. Schwarz, “Chapter 6- DiscoveryLink”, In Bioinformatics: Managing Scientific Data, Z. Lacroix and T. Critchlow (eds), Morgan Kaufmann, 2003, pp 303-334.
[4] R. Stevens, C. Goble, N. Paton, S. Bechhofer, G. Ng, P. Baker, and A. Brass, “Chapter 7- Complex Query Formulation Over Diverse Information Sources in TAMBIS”, In Bioinformatics: Managing Scientific Data, Z. Lacroix and T. Critchlow (eds), Morgan Kaufmann, 2003, pp 189-224.
[5] T. Etzold, H. Harris, and S. Beaulah, “Chapter 5- SRS An Integration Platform for Databanks and Analysis Tools”, In Bioinformatics: Managing Scientific Data, Z. Lacroix and T. Critchlow (eds), Morgan Kaufmann, 2003, pp 109-145.
[6] Z. Lacroix, L. Raschid, and B. A. Eckman, “Exploiting Biomolecular Source Capabilities for Query Optimization”, Journal of Bioinformatics and Computational Biology (in Press), 2004.
[7] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid, “Links and Paths through Life Sciences Data Sources”, In Proc. International Workshop on Data Integration in the Life Sciences (DILS 2004), Leipzig, Germany, March 25-26, 2004.
[8] Z. Lacroix, L. Raschid, and M-E. Vidal, “Efficient Techniques to Explore and Rank Paths in Life Science Data Sources”, In Proc. International Workshop on Data Integration in the Life Sciences (DILS 2004), Leipzig, Germany, March 25-26, 2004.
[9] B. Eckman, K. Deutsch, M. Janer, Z. Lacroix, and L. Raschid, “A Query Language for Life Sciences”, In Proc. IEEE Computer Society Bioinformatics Conference, Palo Alto, California, August 2003.
[10] B. Eckman, A. Kosky, and L. Laroco, “Extending traditional query-based integration approaches for functional characterization of post-genomic data”, Bioinformatics, Vol. 17 no. 7, pp 587-601, 2001.
[11] P. Mork, A. Halevy, and P. Tarczy-Hornoch, “A Model for Data Integration Systems of Biomedical Data Applied to Online Genetic Databases”, In Proc. American Medical Informatics Association (AMIA) Annual Symposium, Washington D.C., November 2001.
[12] M. S. Marshall, I. Herman, and G. Melancon, “An Object-oriented Design for Graph Visualization”, Software Practice and Experience, Vol. 31, pp 739-765, 2001.
