Exploiting Multiple Paths to Express Scientific Queries
Zoé Lacroix, Tiffany Morris, Kaushal Parekh
Arizona State University
{zoe.lacroix, tiffany.j.morris, kaushal}@asu.edu
Louiqa Raschid
University of Maryland
louiqa@umiacs.umd.edu
Maria-Esther Vidal
Universidad Simón Bolívar
mvidal@ldc.usb.ve
Abstract
The purpose of this demonstration is to present the
main features of the BioNavigation system. Scientific
data collection needed in various stages of scientific
discovery is typically performed manually. For each
scientific object of interest (e.g., a gene, a sequence),
scientists query a succession of Web resources
following links between retrieved entries. Each of the
steps provides part of the intended characterization of
the scientific object. This process is sometimes
partially supported by hard-coded scripts or complex
queries that will be evaluated by a mediation-based
data integration system or against a data warehouse.
These approaches fail to guide scientists during
the collection process. In contrast, the BioNavigation
approach presented in the paper provides the scientists
with information on the available alternative
resources, their provenance, and the costs of data
collection. The BioNavigation system enhances a
mediation-based integration system and provides
scientists with support for the following:
• To ask queries at a high conceptual level;
• To visualize the multiple alternative resources that
may be exploited to execute their data collection
queries;
• To choose the final execution path to evaluate
their queries.
1. Introduction
A scientific protocol typically involves the collection
of information about various entries retrieved from
multiple data sources that are often linked in a large
federation, e.g., the NCBI Nucleotide database and
PubMed. A data collection protocol may require
accessing several data sources successively, and
following links from entry to entry, to constitute the
pool of data for analysis and validation of the protocol.
Scientists are not necessarily familiar with all
possible life science sources. In 2004, Galperin listed
548 public molecular biology databases in [1], an
increase of 42% since the previous year [2]. The
growing number of scientific resources, the diversity of
their formats and capabilities, and the many changes
made to them over time make it difficult for scientists
to keep their knowledge of these resources up to date.
While the large number of such resources provides
useful information to scientists, it also makes them
difficult to exploit. It takes time to interrogate a new
data source and to become familiar with its format,
capabilities, and the quality of the data it
provides. Yet most systems currently available to
scientists that provide access to integrated biological
resources, including DiscoveryLink [3], TAMBIS [4],
and SRS [5], either expect the user to specify explicitly
the resources involved in the data collection process or
transparently choose a particular database for
the user. Both approaches raise critical issues and may
affect the data collection process, and thus the quality
and completeness of the retrieved data. An approach
relying on the explicit input of the user expects the user
to know all available resources and choose the most
appropriate one to exploit. Transparent access does
not allow the user to guide the collection process:
the scientist avoids the tedious task of choosing
resources, but the provenance of the collected data
is hidden from the user. In contrast, our
approach allows the scientists to identify and select
among all available resources the ones they can use to
answer their queries.
Combined with a mediation system, BioNavigation
allows users to express their data collection queries at a
high conceptual level. For example, a scientist
interested in retrieving papers relevant to a specific
disease will be able to express a query with two
concept classes: disease and citation. The system will
return all possible ways to evaluate such a query,
expressed as paths in the graph of all available
resources. For each path, the user has the opportunity
to browse the characteristics of its resources and select
the ones best suited to implement the pipeline.
All resource characteristics are combined in a
capability map that can be exploited in the source
selection [6,7]. This map captures metadata that
include the name, hosting institution, URL, format (in
XML), and useful statistics such as the number of
entries, information on the data quality (data curation),
etc. The capability map allows scientists to explore
resources they may not be familiar with but that are
integrated in the mediation.
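The kind of metadata a capability map holds can be sketched as a small record type. This is only an illustration: the field names, the two resources, and all values are hypothetical placeholders, not the actual BioNavigation capability map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceMetadata:
    """One capability-map entry (field names are illustrative assumptions)."""
    name: str
    institution: str
    url: str
    data_format: str
    entry_count: int  # number of entries the resource holds
    curated: bool     # whether the data is manually curated


# Hypothetical resources with placeholder values, not real statistics.
CAPABILITY_MAP = {
    meta.name: meta
    for meta in [
        ResourceMetadata("SourceA", "Institute A", "https://example.org/a",
                         "XML", 5000, True),
        ResourceMetadata("SourceB", "Institute B", "https://example.org/b",
                         "flat file", 20000, False),
    ]
}


def curated_sources(cap_map):
    """Return the names of curated resources, sorted alphabetically."""
    return sorted(name for name, meta in cap_map.items() if meta.curated)


def largest_source(cap_map):
    """Return the name of the resource with the most entries."""
    return max(cap_map.values(), key=lambda meta: meta.entry_count).name
```

Keeping each entry as an immutable record makes it straightforward to filter or compare resources on any single property, which is the kind of exploration the capability map is meant to support.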
Our system is particularly useful because, in addition
to exploiting alternate similar resources, it goes a step
further and returns to the user paths that are ranked
with respect to some benefit, e.g., maximizing the
number of retrieved entries. BioNavigation uses the
ESearch algorithm [8] that takes a regular expression
as input and performs an exhaustive breadth-first
search of all paths in a graph. Each path through the
graph of all available resources (Figure 1) may have
different properties that have varying benefits.
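An exhaustive breadth-first enumeration of paths can be sketched as follows. This is not the ESearch algorithm of [8]: it omits the regular expression entirely, and the toy graph, its edges, and the `max_edges` bound are assumptions made for illustration.

```python
from collections import deque

# Toy physical graph; the edges are assumed for illustration only.
TOY_GRAPH = {
    "GeneCards": ["PubMed", "GenBank"],
    "Unigene": ["PubMed", "GenBank"],
    "GenBank": ["PubMed"],
    "PubMed": ["GenBank"],
}


def enumerate_paths(graph, start, goal, max_edges=4):
    """Breadth-first enumeration of all cycle-free paths from start to goal."""
    found = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal and len(path) > 1:
            found.append(path)      # complete path; do not extend it further
            continue
        if len(path) - 1 >= max_edges:
            continue                # length bound reached
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # never revisit a source on the same path
                queue.append(path + [neighbor])
    return found
```

Because the queue is processed first-in, first-out, shorter paths are returned before longer ones. For example, `enumerate_paths(TOY_GRAPH, "GeneCards", "PubMed")` yields the direct path first and the path through GenBank second.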
Figure 1. Physical graph
To better introduce our system, we present a
complex query typical of biological research and
describe how it may be evaluated using life science
resources. Our focus is on the navigation of the many
alternate links and paths. Consider the following
complex query: Return all citations published since
1995 that mention “heart” and refer to sequences that
are annotated as “calcium channel” [10].
The solution to this query requires knowledge of the
capabilities of all available biological resources. There
are many different paths exploiting various databases
and applications that will enable the user to solve this
query. A user can first access PubMed and retrieve all
citations published since 1995 that mention “heart” and
then extract from this output all GenBank identifiers.
The user can then retrieve the information from
GenBank and keep the entries that are annotated as
“calcium channel”. An alternative approach would be
to follow the Entrez Nucleotide Links from PubMed.
The user could also first access GenBank and retrieve
all sequences that are annotated as “calcium channel”,
extract the MEDLINE identifiers from the output, and
then use these identifiers to retrieve the citations from
PubMed and keep the ones published since 1995 that
mention “heart”. Other biological
resources may also be used to retrieve the information.
For instance, a scientist may retrieve sequence
information from the EMBL Nucleotide database
instead of NCBI. Each of the above paths may return
different information: the number of entries (sequences
and citations) may differ, and the characterization of
the retrieved entries (attributes) may differ. In addition,
the cost of executing the query on each of these paths
may also differ.
This example clearly shows that different paths
involving different combinations of biological
resources can be used to obtain a dataset. Traditional
database integration approaches do not exploit
alternative resources to evaluate a query. In contrast,
the BioNavigation system allows users to explore the
characteristics of the paths and then select the most
appropriate path. The various paths of biological
resources that evaluate queries on a mediator are
addressed with GeneSeek¹ [11]. Unlike the
BioNavigation system, GeneSeek only exploits the
type (1-to-1 or 1-to-many) of links between entries, and
does not return ranked possible paths.
2. What Will Be Demonstrated
BioNavigation is a scientist-friendly interface to a
mediation system that integrates multiple biological
resources. Hence, the interface first helps the user build
graphically a data collection query that consists of a
regular expression. Then the user can view the results
as paths through available resources, and explore their
characteristics. The internal representation of the
biological resources is twofold: a logical layer
represents the scientific concepts (e.g., gene) typically
used to express the queries, while the physical layer
represents the resources integrated by the mediator.
Data sources (e.g., PubMed) are represented by nodes
in the physical graph shown in Figure 1. Source
capabilities (e.g., the Entrez Nucleotide Links,
BLAST) are represented as edges between data sources
at the physical level. Each physical source may
implement one or multiple scientific concepts (e.g.,
Unigene implements the concepts gene and sequence),
which are nodes on the logical graph (see Figure 2).
Biological capabilities implement logical links (edges
between scientific concepts in the logical graph).
Figure 2. Mapping from logical to physical level
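The mapping between the two layers can be sketched as a lookup from each physical source to the concepts it implements. The PubMed and Unigene entries follow the text; the GeneCards and GenBank entries are inferred from the examples in this paper, and the function names are hypothetical.

```python
from itertools import product

# Physical source -> scientific concepts it implements.
SOURCE_CONCEPTS = {
    "PubMed": {"citation"},
    "Unigene": {"gene", "sequence"},
    "GeneCards": {"gene"},     # inferred from the examples, an assumption
    "GenBank": {"sequence"},   # inferred from the examples, an assumption
}


def sources_implementing(concept, mapping=SOURCE_CONCEPTS):
    """Return the sources that implement a concept, sorted by name."""
    return sorted(src for src, concepts in mapping.items() if concept in concepts)


def expand_logical_path(concepts, mapping=SOURCE_CONCEPTS):
    """List candidate source sequences for a sequence of concepts.

    This ignores whether a physical link actually connects consecutive
    sources, so it over-approximates the real set of physical paths.
    """
    choices = [sources_implementing(concept, mapping) for concept in concepts]
    return [list(combo) for combo in product(*choices)]
```

Under these assumptions, `expand_logical_path(["gene", "citation"])` gives the two candidates GeneCards, PubMed and Unigene, PubMed. A real implementation would also check the edges of the physical graph before reporting a path.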
In the browsing mode, the user can explore the
graphs, learning more about the different sources
integrated in the mediation, the links and their logical
and physical mappings. In the querying mode, the user
expresses a data collection pipeline through scientific
concepts. The following sections describe these two
modes for the interface.
¹ GeneSeek has recently been renamed BioMediation.
2.1 Browsing Mode
The browsing functionality of the interface allows
the scientist to explore the logical and physical graphs
as well as the mapping between them, and access the
metadata of each available biological resource. The
graphs are visualized using the Graph Visualization
Framework [12], a Java based architecture for
visualization and manipulation of large graphs.
Step by step, the scientist may explore the logical
graph by first selecting a concept, and then exploring
all concepts linked to it, including incoming and
outgoing links. Each concept and each link between
concepts may be selected to display their physical
implementation. For example, the concept citation
maps to the physical data source PubMed, while the
link between citation and sequence is mapped to the
Entrez Nucleotide Links.
From the physical graph, the browsing mode allows
the user to search for a particular data source by name.
Similar to browsing the logical graph, a node of the
physical graph can be selected to display the incoming
and outgoing links to and from other sources. The
browsing mode also enables scientists to explore
similar biological resources. For example, by selecting
a physical capability, the system may display all
similar capabilities, i.e., the capabilities implementing
the same logical links as the selected capability.
Finally, the user may display the metadata for each
biological resource, that is, each node and edge of the
physical graph.
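Finding similar capabilities amounts to grouping physical capabilities by the logical link they implement. In the sketch below, only the Entrez Nucleotide Links entry comes from the text; the other two capabilities and the record layout are hypothetical.

```python
from collections import namedtuple

# A physical capability and the logical link (concept pair) it implements.
Capability = namedtuple("Capability", "name source target logical_link")

CAPABILITIES = [
    Capability("Entrez Nucleotide Links", "PubMed", "GenBank",
               ("citation", "sequence")),
    # The two entries below are invented for illustration.
    Capability("HypotheticalLinkA", "PubMed", "EMBL",
               ("citation", "sequence")),
    Capability("HypotheticalLinkB", "GeneCards", "PubMed",
               ("gene", "citation")),
]


def similar_capabilities(selected, capabilities):
    """Return names of other capabilities implementing the same logical link."""
    return [
        cap.name
        for cap in capabilities
        if cap.logical_link == selected.logical_link and cap != selected
    ]
```

Selecting the first capability would return only `HypotheticalLinkA`, since it is the only other edge that implements the link from citation to sequence.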
The browsing mode allows the user to better
understand the resources available before expressing
any query and to plan the navigation protocol
according to the output of the query.
2.2 Query Mode
The query mode allows the user to express a query
through scientific concepts, generates the regular
expression input of ESearch [8], and then returns the
paths. To express a query, the user selects the start and
destination nodes and intermediate nodes if desired.
The selection results in a regular expression built from
the symbols for each node. The regular expression can
be at either the logical or the physical level or a
combination of both. At the logical level the regular
expression is made up of the class symbols while at the
physical level it contains the actual physical source
names. In the case of the physical path, the user can
also opt to get paths similar to the selected one.
Logical and physical expressions can also be
combined. For example, one can use or avoid a
particular physical source in part of the regular
expression while the remaining part is more general or
is at the logical level. The generated regular
expression is available for editing for advanced users
who may want to tweak it manually.
For instance, scientists who want to “Retrieve all
citations linked to genes of interest” would simply
select the node gene and then click on the node citation
and select the logical link between the two selected
classes. These selections generate the regular
expression g.c, which will return all paths of length 1,
i.e., GeneCards → PubMed and Unigene → PubMed.
Scientists interested in evaluating their queries on paths
that include one or more specific intermediate data
source(s) would simply click on the source(s) before
clicking on the final source. For instance, to use the
above example, but to include sequence as an
intermediate source, the user would select gene, then
sequence, then citation. This would generate the
regular expression g.s.c. The user may prefer to have
paths with an unlimited number of unspecified
intermediate data sources (path of length > 2). Using
the same example the regular expression generated
would be g. +.c. A final option the user may utilize is
to exclude a particular physical source from the path or
to request that a particular physical source be used in
all returned paths. To retrieve citations relevant to
genes with paths of length ≥ 2 that do not start with
GeneCards, the system will generate the regular
expression g!genecards. *.c. The regular expression
corresponding to the retrieval of citations from
PubMed linked to genes is g.pubmed. The regular
expression generated by the system is displayed to the
user and submitted as input to the ESearch algorithm.
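How a sequence of selected concepts turns into a regular expression, and how a path is tested against it, can be sketched with Python's `re` module. The sketch uses Python regex syntax rather than the notation shown above, covers logical-level expressions only, and the symbol table and function names are assumptions.

```python
import re

# One-letter class symbols per concept (an assumed encoding).
SYMBOL = {"gene": "g", "sequence": "s", "citation": "c", "protein": "p"}


def build_pattern(start, goal, intermediates=(), any_intermediate=False):
    """Build a Python regex over class symbols from the user's selections."""
    parts = [SYMBOL[start]]
    parts += [SYMBOL[concept] for concept in intermediates]
    if any_intermediate:
        parts.append(".+")  # one or more unspecified intermediate classes
    parts.append(SYMBOL[goal])
    return "".join(parts)


def matches(pattern, concept_path):
    """Check whether a path, given as a list of concepts, fits the pattern."""
    encoded = "".join(SYMBOL[concept] for concept in concept_path)
    return re.fullmatch(pattern, encoded) is not None
```

Here `build_pattern("gene", "citation")` produces `gc`, which accepts only the direct gene to citation path, while passing `any_intermediate=True` produces `g.+c`, which requires at least one intermediate class. Excluding or forcing a particular physical source would need an extra check on source names, which this sketch leaves out.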
2.3 Display Results
ESearch processes the regular expression and
generates a result graph of paths that satisfy the regular
expression, as well as a list of ranked paths. These
returned paths are at the physical level and indicate the
corresponding data sources and the physical links.
Ranking the paths is based on a variety of criteria
including (1) the cardinality or the number of data
sources in the path, (2) the target object cardinality or
the number of distinct objects in the target sources, (3)
the path cardinality (ignoring duplicates in the
intermediate nodes and the target node), and (4) the
attribute cardinality for all the sources in the path. Top
ranked paths can be displayed in the physical graph
using different colors for each top ranked path. The
user can browse the result graph in a similar manner to
the physical graph. If desired, the rationale behind the
path rankings can also be explained to the user. The
result graph points the user to the actual data sources to
guide the implementation of the scientific pipeline.
Scientists then have the ability to browse the retrieved
paths and access their properties [6] in order to select
the one that will be used by the mediation system to
evaluate the query.
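Ranking the returned paths by one criterion can be sketched as a plain sort. The paths below reuse source names from the examples, but the path through GenBank and all object counts are invented placeholders, and `rank_paths` is a hypothetical helper, not part of ESearch.

```python
def rank_paths(paths, criterion, maximize=True):
    """Sort paths by a numeric criterion; ties keep their input order."""
    return sorted(paths, key=criterion, reverse=maximize)


PATHS = [
    ("GeneCards", "PubMed"),
    ("Unigene", "PubMed"),
    ("GeneCards", "GenBank", "PubMed"),  # hypothetical longer path
]

# Placeholder counts of distinct target objects per path (invented numbers).
TARGET_OBJECTS = {PATHS[0]: 120, PATHS[1]: 300, PATHS[2]: 80}

# Criterion (2): prefer paths that reach more distinct target objects.
by_targets = rank_paths(PATHS, TARGET_OBJECTS.get)

# Criterion (1): prefer paths that involve fewer data sources.
by_length = rank_paths(PATHS, len, maximize=False)
```

Passing the criterion as a function keeps the ranking independent of any particular statistic, so the same helper can maximize or minimize whichever property the scientist cares about.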
3. Conclusions and Future Work
The BioNavigation interface enhances existing
mediation approaches. It provides scientists with the
ability to browse through available integrated resources
and to access their properties. It returns alternate paths
that may be exploited to evaluate the scientists’
queries. With the BioNavigation interface, scientists
may choose resources that satisfy their scientific
criteria without being familiar with their complex
characteristics (capabilities, format, etc.).
In future work, we will extend the interface so that
scientists can provide details on their own criteria for
ranking paths. Criteria of interest have been identified
in [6,9] and include the number of retrieved entries, the
characterization (number of attributes) of the retrieved
entries, the cost (time) of the evaluation, etc. Each of
these criteria may be maximized or minimized. For
example, a scientist seeking the most complete
characterization of a gene may choose a path
that maximizes the number of attributes. BioNavigation
will be coupled with the DiscoveryLink mediation
system [3] to execute the queries when the evaluation
path has been selected.
Acknowledgements: This research is partially
supported by NSF grants 02230042 and 022847 and
NIH National Library of Medicine grant R03
LM008046-01. Anna Joy and Mike Berens of the Brain
Tumor Research Unit of the Translational Genomics
Research Institute (TGen) helped define the
BioNavigation system requirements. Pallavi
Mundumby and Srilakshmi Bysani contributed to
defining the capability maps. We also thank Julia Rice
and Peter Schwarz of IBM for their input and for
sharing their insight on DiscoveryLink.
4. References
[1] M. Y. Galperin, “The Molecular Biology Database
Collection: 2004 update”, Nucleic Acids Research, Vol. 32,
pp 3-22, 2004.
[2] A. D. Baxevanis, “The Molecular Biology Database
Collection: 2003 update”, Nucleic Acids Research, Vol. 31,
pp 1-12, 2003.
[3] L. Haas, B. Eckman, P. Kodali, E. Lin, J. Rice, and P.
Schwarz, “Chapter 6- DiscoveryLink”, In Bioinformatics:
Managing Scientific Data, Z. Lacroix and T. Critchlow (eds),
Morgan Kaufmann, 2003, pp 303-334.
[4] R. Stevens, C. Goble, N. Paton, S. Bechhofer, G. Ng, P.
Baker, and A. Brass, “Chapter 7- Complex Query
Formulation Over Diverse Information Sources in TAMBIS”,
In Bioinformatics: Managing Scientific Data, Z. Lacroix and
T. Critchlow (eds), Morgan Kaufmann, 2003, pp 189-224.
[5] T. Etzold, H. Harris, and S. Beaulah, “Chapter 5- SRS An
Integration Platform for Databanks and Analysis Tools”, In
Bioinformatics: Managing Scientific Data, Z. Lacroix and T.
Critchlow (eds), Morgan Kaufmann, 2003, pp 109-145.
[6] Z. Lacroix, L. Raschid, and B.A. Eckman, “Exploiting
Biomolecular Source Capabilities for Query Optimization”,
Journal of Bioinformatics and Computational Biology (in
Press), 2004.
[7] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid,
“Links and Paths through Life Sciences Data Sources”, In
Proc. International Workshop on Data Integration in the Life
Sciences (DILS 2004), Leipzig, Germany, March 25-26,
2004.
[8] Z. Lacroix, L. Raschid, and M-E. Vidal, “Efficient
Techniques to Explore and Rank Paths in Life Science Data
Sources”, In Proc. International Workshop on Data
Integration in the Life Sciences (DILS 2004), Leipzig,
Germany, March 25-26, 2004.
[9] B. Eckman, K. Deutsch, M. Janer, Z. Lacroix, L. Raschid,
“A Query Language for Life Sciences”, In Proc. IEEE
Computer Society Bioinformatics Conference, Palo Alto,
California, August 2003.
[10] B. Eckman, A. Kosky, and L. Laroco. “Extending
traditional query-based integration approaches for functional
characterization of post-genomic data”, Bioinformatics, Vol.
17 no. 7, pp 587-601, 2001.
[11] P. Mork, A. Halevy, and P. Tarczy-Hornoch. “A Model
for Data Integration Systems of Biomedical Data Applied to
Online Genetic Databases.” In Proc. American Medical
Informatics Association (AMIA) Annual Symposium,
Washington D.C., November 2001.
[12] M. S. Marshall, I. Herman and G. Melancon, “An
Object-oriented Design for Graph Visualization”, Software
Practice and Experience, 2001, Vol. 31, pp 739-765.