
BioNavigation: Selecting resources to evaluate scientific queries

2008

BIONAVIGATION: SELECTING RESOURCES TO EVALUATE SCIENTIFIC QUERIES

by Kaushal D. Parekh

An Internship Report Presented in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE

ARIZONA STATE UNIVERSITY

August 2005

ABSTRACT

Advances in genome science have created a surge of data. These data, critical to scientific discovery, are made available in thousands of heterogeneous public resources. Each of these resources provides biological data with its own organization, format, quality, and object identification scheme, together with a variety of capabilities that allow scientists to access, analyze, cluster, visualize and navigate through the datasets. The heterogeneity of biological resources and their increasing number make it difficult for scientists to understand and exploit them. Learning the properties of a new resource is a tedious and time-consuming process, made more difficult by the frequent changes to the resources (new or changed information and capabilities), which make it hard for scientists to keep their knowledge up to date. As a result, many scientists master a few resources while ignoring others that may provide additional data and useful capabilities.

The BioNavigation system complements existing data integration approaches by allowing users to explore biological resources. It provides the scientist with guidance in selecting the most effective evaluation path through the physical resources for an ontological query. It allows the user to (1) visualize the conceptual level ontology, the physical graph of resources, and the mappings between the two levels, and browse the graphs to obtain more information about the resources; (2) build queries with the help of the ontology by selecting the desired classes connected by labeled relationships; and (3) obtain all possible physical paths that implement the query and rank them to optimize user-selected criteria.
BioNavigation could also be coupled with a data integration tool that would allow users to collect data automatically after selecting the resources.

ACKNOWLEDGMENTS

This work was partially supported by the National Science Foundation, Division of Computer and Information Science and Engineering, through the grant IIS0223042 (September 2003 - August 2005). The project has also benefited from valuable input from Peter Schwarz and Julia Rice at the IBM Almaden Research Center and Barbara Eckman at IBM Life Sciences. Michael Berens, Anna Joy and scientists at the Neurogenomics Division of the Translational Genomics Research Institute (TGen), Phoenix, provided support in determining the requirements of the system and helped test the prototype. Students at the Scientific Data Management Lab, Hervé Ménager and Pallavi Mudumby, provided valuable feedback and comments. Finally and most importantly, I would like to thank my internship advisor, Dr. Zoé Lacroix, for providing me with the opportunity to work as a Research Assistant at the Scientific Data Management Lab and present our work at several prestigious conferences.

TABLE OF CONTENTS

LIST OF FIGURES
CHAPTER 1 Introduction and Motivation
  1. Complexity in Biological Resources
  2. Problems in Scientific Data Collection
  3. Existing Integration Systems
  4. The BioNavigation Approach
CHAPTER 2 Graph Representation of Resources
  1. Bi-Level Representation
    1.1. The Physical Graph
    1.2. The Logical or Conceptual Graph
  2. The BioMetaDatabase
    2.1. Metadata for Data Sources
    2.2. Metadata for capabilities
CHAPTER 3 Use of Ontology for Data Integration
  1. What is Ontology?
    1.1. Applications
  2. Need for Ontologies in Biological Data Management
  3. OWL: The Web Ontology Language
  4. Protégé Ontology Editor
  5. The BioNavigation Ontology
CHAPTER 4 Querying Integrated Biology Data Sources - ESearch Algorithm
  1. Query Language
  2. ESearch Algorithm
  3. Ranking Criteria
CHAPTER 5 The BioNavigation Interface
  1. Interface Requirements
    1.1. Browsing
    1.2. Querying
    1.3. Interpreting Results
  2. Using the BioNavigation System
CHAPTER 6 Future Work and Conclusions
REFERENCES

LIST OF FIGURES

1. Mapping physical resources to the conceptual level
2. An Example Ontology of Concepts and Associations
3. BNF grammar of regular expressions
4. The BioNavigation Interface
5. Genecards Properties Window
6. Properties Window for the OMIM to CGAP Link
7. Output for the ‘disease-protein’ Query
8. Disease to Citation with 3 Intermediate Nodes
9. Disease to Protein with one Intermediate Node
10. Using any number of Intermediate Nodes
11. Gene-Citation query with 0 or more intermediates: 2 outputs for target object and path cardinality ranking

CHAPTER 1 Introduction and Motivation

A scientific data collection protocol is specified in terms of the scientific classes being studied; it need not specify the data sources from which information about these classes is obtained. These protocols are also mostly navigational: scientists start by obtaining information about a particular scientific object, then follow the provided links to another object, and so on, thus forming a path. Scientists tend to express their protocols using only the resources they are familiar with rather than selecting the resource that best matches their needs. Most of the time they do not know which resource is best, and even when they are aware that such a resource exists, they are not familiar enough with its features and query interface to exploit it effectively.

1. Complexity in Biological Resources

With new advances in the biological sciences, the number of available data sources is increasing dramatically.
The key to scientific discovery lies in effectively exploiting the wealth of publicly available data, but this is not simple. For example, according to the 2005 update [Galp 05] in the Database issue of Nucleic Acids Research, there are currently 719 public molecular biology databases, compared to 548 in 2004 and 386 in 2003.

Not only is the number of sources large and increasing, but the data repositories themselves are highly heterogeneous. They organize biological data differently, they structure their data in multiple ways (even two resources with the same overall organization use different schemas), and they publish them in various formats (flat files, relational tables, XML, etc.). It is also common for data to overlap across multiple resources. Each resource offers a different level of curation, which affects data quality. In addition, resources are not always up to date; some sources may have more recent information than others.

Each data source offers its users a set of capabilities that help to access, navigate, visualize, and perform other operations on the datasets. These capabilities are also highly heterogeneous among different databases. For example, GeneCards [Rebh 97] allows users to search for genes through a single full text search, while Genew [Gene 05] allows searching for genes with additional specifications such as approved symbol, approved gene name, etc. Other sources provide analytic (e.g. NCBI BLAST¹) or navigational (e.g. PubMed² links from OMIM³ records) capabilities. It is difficult to keep up with the characteristics of each source and its capabilities, and as a consequence, scientists tend to limit themselves to the few that they are familiar with. They would rather spend their valuable time on research than on learning how to access a new data source, and the price they pay is missing information that could significantly affect their research.
The public resources evolve significantly over time, which adds to the above complexity. Although these changes allow the data sources to keep up with new data and improve the support provided to scientists, they contribute to the increasing burden of mastering the biological resources.

¹ NCBI BLAST - http://www.ncbi.nlm.nih.gov/BLAST/
² PubMed Literature Database - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
³ Online Mendelian Inheritance in Man - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

2. Problems in Scientific Data Collection

Exploiting the complex maze of publicly available biological resources to implement scientific data collection pipelines poses a multitude of challenges to biologists. The first challenge is to express a query that accurately reflects the scientific question at hand. Ideally, scientists should not have to deal with the properties of the data sources they intend to use while framing this query. The query should be constructed only in terms of the higher level scientific concepts involved, with the implementation details kept transparent. Instead, scientists build their queries to adapt to the characteristics and limitations of the resources that they are familiar with.

Another challenge lies in the availability of multiple resources serving similar purposes. For example, information about a particular ‘gene’ (which is a higher level scientific concept) can be obtained from alternate data sources such as NCBI Gene⁴, GeneCards⁵, or OMIM. Although these resources all provide information about genes, they are highly heterogeneous with respect to data format, number of records, level of curation, navigational capabilities or links to other resources, etc. Thus, when the query involves multiple scientific concepts, the same higher level query can be translated to various evaluation paths involving a number of different alternate data sources, links, and applications.
Each of these paths might have a different semantic meaning and is bound to provide the scientist with a different set of results [Lacr 04a]. Hence, it becomes important for the user to understand which path is best suited to the purpose at hand in order to get the best possible set of results from the query.

⁴ NCBI Gene Database - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
⁵ GeneCards gene database - http://www.genecards.org/

Once the scientist has decided which resources to use to evaluate the query, the challenge lies in formulating the query in the format acceptable to those resources and in collecting and utilizing the data. All resources have different query interfaces, and we cannot expect the biologist to always be up to date with the query language and data format of each of them. This problem is addressed by many available integrated database systems, so we do not deal with it here. Examples of such systems are described in the next section.

3. Existing Integration Systems

A few systems address the need for integrated access to multiple data sources; examples are DB2 Information Integrator [Haas 03], TAMBIS [Bake 98], and SRS [Etzo 03]. The characteristics of these systems are briefly described below.

• The DB2 Information Integrator system (now known as WebSphere Information Integrator, and previously known as DiscoveryLink) allows the integration of non-relational data sources (flat files, XML, Web resources) and other relational databases with the DB2 relational database so that they can be queried through a single DB2 query interface. This is done with the help of wrappers that encapsulate the query and search capabilities of the resources into user-defined functions. In simplified terms, the wrapper translates the relational query (written in SQL) into a resource-specific set of queries or web requests.
The data retrieved from these is then converted into a relational (tabular) form according to the predefined schema for that wrapper. The system comes with built-in wrappers for popular bioinformatics resources such as Entrez, BLAST, etc., and also provides toolkits for the C and Java languages to develop custom wrappers for additional resources.

• The TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) system acts as a virtual integrated data source by providing transparent information retrieval from various wrapped data sources with the help of a mediator. The mediator uses an ontology to describe a conceptual model of the data sources and assists users in expressing queries against this universal model. The user can thus write queries in terms of the universal model or ontology while remaining unaware of which resources will be used to implement them. The mediator then translates these conceptual queries to the corresponding mapped source queries, which are sent to the individual wrappers for the sources. These wrappers send the actual queries or calls to each respective resource, retrieve the data, and reformat it in accordance with the conceptual model so that results from heterogeneous resources are presented in the same format.

• The SRS (Sequence Retrieval System) provides a single, simple graphical interface to a large number of bioinformatics data sources and tools, which can be queried in the same way regardless of their heterogeneous formats. In the same manner, the results of the analysis can be viewed through a single interface and are presented in a uniform format. SRS also allows users to exploit the links between resources, allowing for queries that take the user from one data source to another and are thus navigational in nature.
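The mediator and wrapper roles described above can be pictured with a small Python sketch. Everything in it is hypothetical: the class names, the two toy wrappers, and the record fields are invented for illustration and are not taken from TAMBIS, SRS, or DB2 Information Integrator. It shows only the general pattern, in which a conceptual query is routed to wrappers and each wrapper returns records in one shared format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    """Uniform result format shared by all wrappers (hypothetical)."""
    concept: str      # conceptual class, e.g. "gene"
    identifier: str   # identifier as reported by the source
    source: str       # name of the source that produced the record


# A wrapper takes a search term and returns records in the uniform format.
Wrapper = Callable[[str], List[Record]]


def toy_source_a(term: str) -> List[Record]:
    # Stand-in for a real source-specific query; returns canned data.
    return [Record("gene", f"A:{term.upper()}", "SourceA")]


def toy_source_b(term: str) -> List[Record]:
    # A second stand-in source with a different identifier scheme.
    return [Record("gene", f"B-{term.lower()}", "SourceB")]


class Mediator:
    """Routes a conceptual query to every wrapper registered for the concept."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, List[Wrapper]] = {}

    def register(self, concept: str, wrapper: Wrapper) -> None:
        self._wrappers.setdefault(concept, []).append(wrapper)

    def query(self, concept: str, term: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self._wrappers.get(concept, []):
            results.extend(wrapper(term))
        return results


mediator = Mediator()
mediator.register("gene", toy_source_a)
mediator.register("gene", toy_source_b)
records = mediator.query("gene", "Tp53")
```

In this sketch the caller never names a source. Keeping a `source` field on each record is one simple way to preserve provenance when the mediator hides source selection.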
The problem with the above systems, and with most other available systems, is that they either expect the user to specify explicitly the resources involved in the data collection process (e.g. DB2 and SRS), or transparently choose a particular database for the user (e.g. TAMBIS). Both approaches have critical drawbacks that may affect the data collection process, and thus the quality and completeness of the retrieved data. As explained previously, we cannot expect the user to know all available resources and choose the most appropriate one to exploit. On the other hand, transparent access does not allow the user to play a role in the selection of the particular data sources and capabilities; while the scientist is able to avoid this tedious task, the provenance of the collected data is hidden from the user.

4. The BioNavigation Approach

To summarize the problems discussed above:

• Scientists’ data collection protocols may not effectively reflect the scientific question, since scientists limit themselves to familiar resources because of difficulties in learning about new ones and a lack of information about possible alternate resources.

• Multiple resources exist that provide the same or similar information, but there is high heterogeneity with respect to data format, number of records, quality of data, etc.

• Data collection protocols, which are navigational in nature, may be evaluated using many alternative paths through resources, each path bound to provide different results.

• Which path is the most suitable?

BioNavigation aims to address these problems by allowing scientists to identify and select, among all available resources, the ones they can use to answer their queries.

1. It provides them with important metadata about the sources and their capabilities, visualized in an easy to interpret format.

2. It also assists scientists in looking at their protocols at the higher conceptual level and building the corresponding queries graphically.

3. BioNavigation then presents the user with the possible implementations of the query so that the user can choose the one that best suits the purpose.

4. The user then just has to use one of their favorite tools (web interfaces, Perl scripts, or any mediation system described above) to get the data, but this time with the confidence that all possible resources were considered.

BioNavigation could also be used as an interface to a mediation or integration system, such as the ones described above, to evaluate the particular implementation path that the user selected. The remainder of this internship report describes the various aspects of the design and development of the BioNavigation system.

CHAPTER 2 Graph Representation of Resources

Scientists should be able to formulate their queries at the higher conceptual level of scientific classes and their relationships, without concern for which source will be used underneath to collect the data. This is the ontology level. Classes in the ontology are mapped to the data sources that represent them; for example, the scientific class ‘gene’ is represented by many sources such as Entrez Gene, GeneCards, etc. Similarly, the relationships in the ontology are mapped to the physical links between the data sources. These links could be in the form of navigational links, indices, or applications that capture the semantics of the ontology level relationships.

1. Bi-Level Representation

Most data sources typically represent a particular type of scientific class. For example, PubMed provides references to published literature, UniProt¹ provides information about proteins, etc. There can be several data sources for the same scientific class. For example, one can retrieve ‘DNA sequences’ from either NCBI Nucleotide² or EMBL³.
¹ UniProt - http://www.ebi.ac.uk/uniprot/
² NCBI Nucleotide - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide
³ EMBL - http://www.ebi.ac.uk/embl/

Data sources also provide navigational links connecting a record to other records in the same data source as well as in external data sources, in order to provide comprehensive information about the scientific object they represent. Scientists use these links to navigate from one source to another, gathering in the process information relevant to the scientific question being studied. Each such link represents a meaningful scientific relationship between two conceptual classes. For example, in NCBI Entrez, a link from a Gene record to a Nucleotide record containing its sequence represents the relationship ‘Has DNA Sequence’ between the two conceptual classes ‘Gene’ and ‘Nucleotide sequence’.

Figure 1 shows an example of such a mapping of physical resources to the higher level scientific concepts. In the figure, there are two kinds of links between the Gene and PubMed databases. These links have different semantic meanings although they are syntactically identical. On the other hand, the same conceptual class ‘gene’ is served by two different databases, OMIM and Gene, which have different capabilities. This is a very small example, and the real picture is much more complex.

Figure 1. Mapping physical resources to the conceptual level

Bioinformatics tools and applications also represent relationships between the scientific classes represented by the inputs and outputs of the application. Consider the example of a BLAST search for finding similar nucleotide sequences. In simplified terms, the input and output belong to the scientific class ‘Nucleotide sequence’ and the tool itself implements the relationship ‘Has Similar Sequence’ between two nucleotide sequences.
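The mapping between the two levels can be sketched as two small lookup tables. This is an illustrative sketch only: the dictionary layout and the helper names `sources_for` and `resources_for` are assumptions, and the class label "Citation" for PubMed is a placeholder chosen for the example. The source names and the two relationship labels are the ones mentioned in the text above.

```python
# Conceptual classes mapped to the physical sources that represent them.
# Source names come from the text; the dictionary layout is an assumption.
CLASS_TO_SOURCES = {
    "Gene": ["NCBI Gene", "OMIM"],
    "Nucleotide sequence": ["NCBI Nucleotide", "EMBL"],
    "Protein": ["UniProt"],
    "Citation": ["PubMed"],  # placeholder class label for published literature
}

# Conceptual relationships mapped to the physical resources implementing them.
# Key: (domain class, relationship label, range class).
# Value: list of (kind, description) pairs.
RELATION_TO_RESOURCES = {
    ("Gene", "Has DNA Sequence", "Nucleotide sequence"): [
        ("link", "NCBI Gene -> NCBI Nucleotide"),
    ],
    ("Nucleotide sequence", "Has Similar Sequence", "Nucleotide sequence"): [
        ("application", "NCBI BLAST"),
    ],
}


def sources_for(scientific_class):
    """Return the physical sources that can serve a conceptual class."""
    return CLASS_TO_SOURCES.get(scientific_class, [])


def resources_for(domain, label, range_):
    """Return the links or applications implementing a relationship."""
    return RELATION_TO_RESOURCES.get((domain, label, range_), [])
```

A query stated as "Gene, Has DNA Sequence, Nucleotide sequence" can then be expanded into candidate physical resources by combining `sources_for` on both classes with `resources_for` on the relationship.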
As described in the previous chapter, a scientific data collection protocol is ideally designed at the conceptual level, whereas the implementation is at the physical level of the resources. Thus it becomes important to define formally the two levels of representation used by the BioNavigation system.

1.1. The Physical Graph. The physical graph represents the resource level. In the first version of the BioNavigation system [Lacr 04b] and the ESearch algorithm [Lacr 04d, Lacr 04c], the physical graph consisted of data sources as nodes and the links between them as edges. A navigational query would then be represented as a sequence of data sources. There are two major limitations to this model of the resource level.

1. There can be more than one type of link between two particular sources, each with different semantics. For example, in Figure 1, consider the links to PubMed citations from the NCBI Gene database. There are two types of links, ‘PubMed Links’ and ‘GeneRIF Links’, which are the same at the physical level since both are links from a Gene record to a PubMed record. But they have different meanings: the first set of links consists of citations that are related to the gene in general, whereas the second set represents citations that specifically provide a functional annotation for the gene. Representing links as simple edges in a graph does not capture these differences, because it cannot distinguish multiple edges between the same pair of nodes.

2. The graph representation in the first version of BioNavigation also does not include the tools and applications which are often part of a data collection protocol. Although applications may represent a scientific relationship between two classes, they differ from links in that they are not bound to a specific data source; they can be plugged in between any two data sources that match the input and output class types of the application, respectively.
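The first limitation is easy to reproduce in a few lines. The sketch below is illustrative only; the variable names and tuple layout are assumptions, not the BioNavigation data model. It shows that a plain edge set cannot tell the two Gene to PubMed links apart, and that one possible way to keep them apart is to give each link its own node.

```python
# Model 1: sources are nodes, links are plain (start, end) edges.
simple_edges = set()
simple_edges.add(("NCBI Gene", "PubMed"))  # 'PubMed Links'
simple_edges.add(("NCBI Gene", "PubMed"))  # 'GeneRIF Links'
# Both links collapse into one edge, so their semantics are lost.
collapsed_count = len(simple_edges)

# Model 2: every link is a node of its own, placed between two sources.
link_nodes = {
    "PubMed Links": "citations related to the gene in general",
    "GeneRIF Links": "citations giving a functional annotation for the gene",
}
edges = set()
for link in link_nodes:
    edges.add(("NCBI Gene", link))  # source -> link node
    edges.add((link, "PubMed"))     # link node -> source


def links_between(start, end):
    """Return the link nodes that connect `start` to `end`, sorted by name."""
    return sorted(
        link for link in link_nodes
        if (start, link) in edges and (link, end) in edges
    )


distinct_links = links_between("NCBI Gene", "PubMed")
```

With links as nodes, the two Gene to PubMed connections remain separate objects that can each carry their own description and statistics.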
Taking the above limitations into consideration, we defined a new graph model [Lacr 05b] for the physical level in which all the resources, i.e., data sources, links and applications, are modeled as different types of nodes. The edges in the graph are used only to specify the direction of association.

The Physical Graph PG = (V_P, L) is a directed graph, where:

• V_P is a set of nodes, partitioned into three subsets S, A_P, and QC, such that S represents physical data sources, A_P represents applications, and QC represents query capabilities.

• L is a set of directed edges L ⊆ V_P × V_P that represents the directional associations between sources and applications or query capabilities. If a pair (a, b) belongs to L, then either a is a source and b is an application or query capability, or a is an application or query capability and b is a source.

1.2. The Logical or Conceptual Graph. The logical graph represents the higher conceptual level of scientific concepts or classes and the relationships or associations between them. This allows the query to be designed so that it expresses the scientific question accurately while remaining transparent with respect to the underlying resources.

The Logical Graph LG = (V_L, E) is a directed graph, where:

• V_L is a set of nodes, partitioned into two sets C and A, where C represents logical classes and A represents logical associations between classes.

• E is a set of directed edges E ⊆ (C × A) ∪ (A × C) that represents the roles played by logical classes in the associations.

The logical level is actually built as an ontology, which in simple terms is a definition of concepts and associations. This is described in detail in the next chapter.

2. The BioMetaDatabase

The BioMetaDatabase materializes the physical graph in the BioNavigation system. In addition to defining the graph structure of the physical resources, the database is a rich collection of meta-information about these resources, which serves two major purposes:

1. It helps the user obtain more information about a particular resource, which supports the choice of one resource over another.

2. It includes several semantic and statistical metrics about these resources, which are used by the BioNavigation system to rank the alternate paths that can be used to evaluate the query.

Most of the information contained in the BioMetaDatabase was collected as part of the Computational Bioscience class project in Spring 2004 [Mudu 04]. The database can be edited and updated via a web interface at BioMeta. The following two subsections describe the type of meta-information stored in this database for each kind of resource.

2.1. Metadata for Data Sources. The following is the list of attributes collected for each data source in the BioMetaDatabase:

1. ID - internal identifier for the BioMetaDatabase
2. Name - official name of the data source
3. URL - location of the source on the web
4. Description - a brief text describing the source
5. Species - the particular species (if any) about which the source holds information
6. Schema - schema of the source in XML DTD format
7. Scientific class - scientific class the source represents, e.g. OMIM belongs to the scientific class ‘gene’
8. Source Information URL - location of reference material for the data source on the web
9. Source Internal Identifier - the primary internal identifier for records in the data source, e.g. PMID for PubMed citations

For each data source, two additional attributes are collected for use by the ranking algorithm:

1. Cardinality - the number of records in the data source.
2. Attributes - the number of attributes for the records. A greater number of attributes should correspond to a greater amount of information in each record.

2.2. Metadata for capabilities. Capabilities are mostly links provided by data sources from a record in one source to a record in another source.
These links act as cross references and hence contribute to the richness of a dataset. Scientists typically do exploratory data collection, navigating through different data sources by following interesting links. Hence collecting information about these links and what they offer is very important. Currently the BioMetaDatabase holds the following attributes for the capabilities:

1. ID - internal identifier
2. Input source - source of the input for the capability. In most cases it is the data source that provides the capability.
3. Input scientific class - scientific class of the input
4. Input format - format of the input information
5. Output source - target data source of the capability
6. Output scientific class - scientific class of the output
7. Output format - format of the output information
8. Name - name of the capability as listed on the source website
9. URL - web location of the capability
10. Semantics - textual description of the capability and what it does
11. Implementation - how the capability is implemented (e.g. full text search, hyperlink, etc.)
12. Type - whether the capability is One to Many, Many to Many, etc.
13. Properties - any characteristic properties of a particular capability (e.g. ranked/unranked, duplicates, maximum length of input, maximum entries in output, any reference that explains the capability)

In addition to the above informational metadata, the following is also collected for each (unidirectional) link between two data sources:

1. Link cardinality - number of link instances existing between the two data sources (i.e. number of pairs of connected records)
2. Link participation - number of objects in the start source having at least one outgoing link to the target source
3. Link image - number of objects in the target source having at least one incoming link from the start source

These three statistics, in combination with the source cardinalities, are used to estimate, for example, the number of records that could be expected at the end of a long navigational path. Such measurements are used to rank the evaluation paths for the queries and are explained in detail in Chapter 4.

CHAPTER 3 Use of Ontology for Data Integration

As stated in Chapter 2, Section 1.2, the Logical Graph is represented using an ontology in the BioNavigation system, because an ontology provides a better representation for knowledge about scientific classes and their relationships and makes it easy for users to express their queries in terms of these ontological concepts. Before discussing in detail the ontology used in BioNavigation, we give a brief introduction to ontologies and their applications.

1. What is Ontology?

In computer science, an ontology is an ‘explicit specification of a conceptualization’, where:

• Conceptualization is the definition of the properties of important concepts and their relationships

• Explicit specification is the model specified in an unambiguous language that is both machine and human readable

Originally, in philosophy, ontology meant the study of being or existence as well as the basic categories thereof. All mentions of ontology in this report refer to the computer science definition of ontology.

An ontology is made up of four types of elements [Stev 00]:

1. Concept - A concept is a set or class of entities or things within a domain
2. Relation - Relations describe the interactions between concepts
3. Instance - Instances are things represented by concepts. Theoretically, instances are not part of an ontology, but the distinction between concept and instance is not always clear
4. Axiom - An axiom is a general rule and is used to constrain the values of concepts or instances

The relations are the most important part of an ontology since they give it meaning by connecting the various concepts. A relation can belong to one of two categories:

1. Taxonomies provide the hierarchical tree structure for concepts. These are mainly the two relations ‘isA’ and ‘isPartOf’. ‘isA’ describes the ‘subclass-superclass’ relation between concepts, whereas ‘isPartOf’ deals with the ‘subset-superset’ relation. Examples are ‘Man isA Animal’ or ‘Leaf isPartOf Tree’.

2. Associations are relationships which are not ‘sub-super’ type relations. Examples of this type of relation are ‘Person isAuthorOf Book’ or ‘Child isOffspringOf Parent’.

Like classes, relations can also be organized as taxonomies. Thus, the relation ‘isFatherOf’ is a subtype of the relation ‘isParentOf’, which is a subtype of ‘isAncestorOf’, and so on. Each relation has certain properties which give further meaning to the relationship between the involved classes. Some of the common properties are listed below:

1. Domain and Range of a relation restrict the concepts the relation can apply to. The Domain is the set of concepts that can be on the left hand side of a relation, while the Range is the set of concepts which can be on the right hand side. Thus, the domain of ‘isFatherOf’ is the class ‘Male’, and so is the range of ‘hasFather’.

2. Cardinality specifies the restriction on the number of concepts on each side of the relation. Examples are one-to-one, one-to-many, etc.

3. Transitivity (if A → B and B → C then A → C). For example, the relation ‘isAncestorOf’ is transitive; other relations may not be.

Ontologies themselves are broadly classified into two types. A Generic Ontology is one that captures all common high level concepts. It is also called an upper ontology or core ontology.
These have applications in Artificial Intelligence, where a generic ontology can be used as a Knowledge Base. A highly ambitious generic ontology, Cyc (Cyc Project, http://www.cyc.com/), aims to include all commonsense knowledge. A true generic ontology is highly impractical, if not impossible. A Domain Ontology is a more specialized ontology for specific applications. Commonly used ontologies are mostly domain specific and are usually knowledge bases for specialized applications such as Expert Systems.

1.1. Applications.

Ontologies have been widely used in the field of computer science for various purposes. They were first used in the field of Artificial Intelligence for Knowledge Representation, where they formed the basis of many knowledge-based or expert systems.

A more recent use of ontologies has been in the development of the Semantic Web [Hend 02]. The goal of the Semantic Web project is to create a universal medium for the exchange of data. It aims to overcome the limitations of the present Web by providing semantic meaning to Web resources. This will allow all the data shared on the Web to be processed by automated tools in addition to people. Ontologies form a very important layer in the Semantic Web framework, since they are used to assign machine-interpretable meaning to Web resources.

Another application of ontologies is in Ontology-based Query Processing [Mena 01]. An ontology can be used to provide semantic descriptions of data repositories. The use of an ontology for querying heterogeneous distributed data sources allows the user to form queries at a higher level, while making the aspects related to the syntax, location, and structure of the data repositories transparent. The ontology uses semantic metadata to capture the information content of the data repositories and their capabilities, and provides independence from the underlying data structure. The ontology can then be exploited in two ways:
1. Navigation or browsing of the ontology to view the concepts and their relationships
2. Building the query from the ontology by selecting interesting concepts and relations, which is then sent to the query processor

The query processor can access data with the help of mapping information that translates the user query into queries for the underlying repositories. Results from these queries can then be combined and presented to the user, who is unaware of the inner details.

2. Need for Ontologies in Biological Data Management

Biological data sources present huge volumes of structured, semi-structured and unstructured data. There is a serious problem of object identity (ambiguity of names): different data sources provide information about the same concepts using different names and identifiers, which poses a great challenge to integrated access. For example, the problem of the diversity of names and identifiers assigned to genes is well known and is being tackled to some extent by the HUGO Gene Nomenclature Committee [Gene 05]. There are innumerable applicable algorithms and implemented components or applications publicly available, but it is difficult to search for, identify and use these resources. There is continuous and dynamic growth at the data instance level as well as at the meta-levels (new facts, concepts, properties, data formats, etc. are being introduced daily). High heterogeneity exists at both the syntactic and semantic levels of representation between different data sources, and even among the databases belonging to the same organization [Lacr 04a]. Uncertainty and inconsistency are always an issue, due to missing or misrepresented information and to uncoordinated and uneven propagation of change. There is also incompatibility of context or logic during the integration of data elements or computational methods.
Use of ontologies addresses several of these problems as follows:
• An ontology specification can be used as a common vocabulary for the purpose of annotation
• Shared ontologies allow for neutral authoring and reuse of scientific knowledge
• Ontology-based query processing allows common access to heterogeneous information and the forming of queries over multiple databases
• Ontologies are also used for automated annotation and understanding of technical literature using natural language processing

The BioNavigation system handles the issues of accessing heterogeneous resources by allowing the user to visualize the conceptual level described in chapter 2, section 1.2, and to frame queries at that level. Using an ontology to represent this conceptual level graph is the most logical solution. It can capture the scientific knowledge the system needs in order to represent the scientific question being asked accurately, and thus get the user exactly what he is looking for. The system thus requires an ontology that can represent the complex relationships between different scientific concepts and also explain the relationships that exist between the various resources that map to these scientific concepts and relationships.

Several ontologies are currently used in the field of biology, and hence we looked at a few of them to identify a candidate ontology for our system. The Gene Ontology (GO) [Cons 00], the most commonly used biological ontology, explains the biological roles of genes and gene products. It has been used very successfully for the annotation of genes. The MGED (Microarray Gene Expression Data) Ontology deals with concepts, definitions, terms, and resources for the standardized description of a microarray experiment [Jr 02]. The BioCyc Ontology (see footnote 2) is a collection of pathway and genome information for various organisms.
Only one ontology, the one used in TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [Bake 98], was close to our requirements for the BioNavigation system. The TAMBIS Ontology, TaO, describes a wide range of bioinformatics tasks and resources to enable biologists to ask questions over multiple external databases using a common query interface. However, the TAMBIS system does not allow users to visualize the mapping between these scientific concepts and the underlying resources. It also does not capture the complexity of the biological data sources and their links, and so cannot inform the user about the alternate resources that could be used to evaluate his query. Hence the need to develop a new ontology, or to adapt existing ones, to meet these specific requirements of the BioNavigation system. The following sections describe the language and tools used for building and editing the ontology.

Footnote 2: BioCyc Database Collection - http://biocyc.org/

3. OWL: The Web Ontology Language

OWL is the ontology language standard developed by the World Wide Web Consortium (W3C) for the ontology layer of the Semantic Web framework [McGu 04]. It is being accepted as the standard language for building ontologies, and hence we used it for the development of the ontology for the BioNavigation system. OWL is an improvement over the earlier languages, RDF (Resource Description Framework) and RDF Schema, and provides greater machine interpretability. The OWL specification provides three levels of expressiveness with increasing complexity:
1. OWL Lite supports classification hierarchies and only simple constraints on relations. It is easy to process but not very expressive.
2. OWL DL is based on Description Logics and hence is more expressive while retaining computational completeness.
3. OWL Full provides maximum expressiveness, but provides no computational guarantees.
Based on our requirements for expressing rich relationships, we selected OWL DL as the language to represent the conceptual level of the BioNavigation system.

4. Protégé Ontology Editor

Protégé is a tool which allows the user to construct a domain ontology, customize data entry forms, and enter data or instances belonging to that ontology. Protégé can also be extended with graphical widgets (tables, diagrams, animation components) to access other knowledge-based systems and embedded applications, and it provides a library which other applications can use to access and display knowledge bases. Protégé has almost become a standard for ontology building and editing, and it has a plugin for the development of OWL ontologies. The Protégé OWL Plugin enables:
1. Loading and saving of OWL and RDF ontologies
2. Editing and visualizing OWL classes and their properties
3. Defining logical class characteristics as OWL expressions
4. Editing OWL individuals for Semantic Web markup

In general, Protégé is a very useful tool for ontology design, development and manipulation, and is used in the BioNavigation project for that purpose.

5. The BioNavigation Ontology

According to the previous discussion, the ontology used to represent the logical graph in BioNavigation needs to satisfy at least the following requirements:
• Represent scientific knowledge to enable scientists to express queries.
• Map all available resources to ontological concepts and relationships.

A couple of existing ontologies, such as the TAMBIS ontology and the myGrid ontology [Stev 03], satisfy only parts of these requirements. Our intention is to use existing ontologies as much as possible and, if necessary, to integrate a few of them to get a better result, because ontology development itself requires a lot of effort and it is wasteful to spend time reinventing the wheel.
We currently have a sample ontology for prototype development, and it serves the purpose well in demonstrating the usefulness of the system. This example of a conceptual ontology is shown in Figure 2 and involves the scientific classes disease, gene, citation, and protein, and their labeled associations or relationships.

Figure 2. An Example Ontology of Concepts and Associations

Consider a scientist interested in ‘retrieving citations related to a particular disease’. An evaluation path for this query could consist of initiating the retrieval process from a particular source that provides information on diseases and then, through the links it offers, obtaining related citations. One such path could exploit the NCBI PubMed Link from OMIM to PubMed. Hence, at the conceptual level the path would be ‘d in c’, formed from the class ‘disease’ or ‘d’, the class ‘citation’ or ‘c’, and the association ‘discussed in’ or ‘in’. The user might also want to include in his path any possible intermediate nodes in addition to the direct path; we took care of this by introducing the special ‘ε’ symbols in the query language discussed in chapter 4, section 1.

CHAPTER 4
Querying Integrated Biology Data Sources - ESearch Algorithm

A query is represented as a regular expression made up of the sequence of scientific classes and relationships to be followed. The user can also specify a wildcard character within a regular expression to indicate that any possible resource can be used in its place. The ESearch algorithm performs an exhaustive breadth-first search on the physical graph to find paths that match the user's query expression. The algorithm uses metadata about the data sources to estimate the relative ranks of these paths with respect to the ranking criteria selected by the user.
For example, the user can choose to favor the paths that return the maximum number of entries, in which case the list of paths is sorted according to the target cardinality measure calculated by ESearch.

1. Query Language

We now formally define the language that will be used to express the queries over the logical concepts in the set V_L. We use the following notations:
• v is either a class or a logical association in V_L, i.e., v ∈ V_L
• v<AnnotList> is an annotated class or association, where <AnnotList> is a list of expressions of the form OP <PhysicalImpName>, where OP is either ≠ or =, and <PhysicalImpName> corresponds to a data source, application or query capability in V_P such that <PhysicalImpName> belongs to φ(v).
• ε_c is a term representing any possible class in C; similarly, ε_a represents any possible association in A, and ε represents the path ε_a ε_c.

The query language L(RE) over the logical concepts in V_L is defined by the regular expression

L(RE) = X (ε + Y X)*

where
• X = ε_c | c | c<AnnotList>
• Y = ε_a | a | a<AnnotList>

Thus any conceptual level query starts with a logical concept and ends with a logical concept. Two concepts are always connected through a logical association. The term ε allows users to express queries such as ‘c1 ε* ε_a c2’, which means that the path between classes c1 and c2 could be of any length and consist of any possible intermediate classes and associations. A BNF grammar generating the regular expressions is shown in Figure 3.

Given the regular expression RE, our optimization algorithm will identify the set of physical paths in PG that correspond to the physical implementations of expressions of the language induced by RE, L(RE). The following definition formalizes the paths that are physical implementations of an expression in L(RE).
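As a rough illustration of the language L(RE) = X (ε + Y X)* defined above, the following Python sketch checks whether a tokenized query is well formed. It is not part of BioNavigation. The token spellings (eps, eps_a, eps_c for ε, ε_a, ε_c), the bracketed annotation syntax, and the labels "g" and "encodes" are assumptions made for this example; only the symbols d, c, p and the association "in" follow the examples in this report.

```python
import re

# Hypothetical vocabulary for the sketch. "d", "c", "p" and "in" follow the
# examples in the text; "g" and "encodes" are made-up labels.
CLASSES = {"d", "g", "c", "p"}
ASSOCIATIONS = {"in", "encodes"}

# A term is a name optionally followed by [OP name, OP name, ...], OP in {=, !=}.
TERM = re.compile(r"^(\w+)(?:\[(?:!=|=)\w+(?:,(?:!=|=)\w+)*\])?$")


def _is_term(token, names, wildcard):
    """True if token is the wildcard or an (optionally annotated) known name."""
    if token == wildcard:
        return True
    match = TERM.match(token)
    return bool(match) and match.group(1) in names


def is_class_term(token):
    """X = eps_c | c | c<AnnotList>"""
    return _is_term(token, CLASSES, "eps_c")


def is_assoc_term(token):
    """Y = eps_a | a | a<AnnotList>"""
    return _is_term(token, ASSOCIATIONS, "eps_a")


def is_valid_query(tokens):
    """Check a token list against X (eps + Y X)*."""
    if not tokens or not is_class_term(tokens[0]):
        return False
    i = 1
    while i < len(tokens):
        if tokens[i] == "eps":
            i += 1                      # eps stands for the pair eps_a eps_c
        elif (is_assoc_term(tokens[i]) and i + 1 < len(tokens)
              and is_class_term(tokens[i + 1])):
            i += 2                      # one Y X step
        else:
            return False
    return True


print(is_valid_query(["d", "in", "c"]))
print(is_valid_query(["d", "eps", "eps", "eps_a", "c"]))
print(is_valid_query(["d", "c"]))
```

The check mirrors the two observations made in the text: a query must start and end with a class term, and two class terms are always separated by an association term (or by ε, which already contains one).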
    <RE> := <cTerm> <Y>
    <cTerm> := <EpsilonC> | <ClassName> <SourceAnnotation>
    <Y> := <Epsilon> <Y> | <aTerm> <cTerm> <Y> | empty
    <aTerm> := <EpsilonA> | <AssociationName> <LinkAnnotation>
    <SourceAnnotation> := empty | "[" <SourceList> "]"
    <SourceList> := <AnnotatedSource> | <AnnotatedSource> "," <SourceList>
    <AnnotatedSource> := <OP> <SourceName>
    <LinkAnnotation> := empty | "[" <LinkList> "]"
    <LinkList> := <AnnotatedLink> | <AnnotatedLink> "," <LinkList>
    <AnnotatedLink> := <OP> <LinkName>
    <LinkName> := <ApplicationName> | <QueryCapName>
    <OP> := "!=" | "="

Figure 3. BNF grammar of regular expressions

2. ESearch Algorithm

A path p = (s_1, a_1, s_2, ..., s_{n-1}, a_{n-1}, s_n) in PG is defined as a list of sources s_i and applications a_i ∈ V_P. A regular expression r over the alphabet V_L expresses a retrieval query Q_r. The result of Q_r is the set of paths p in PG that interpret r, i.e., the set of paths in PG that correspond to physical implementations of the paths in LG that respect the regular expression r.

α is a one-to-many mapping from an expression e ∈ L(RE) into a set of paths in PG corresponding to the physical implementation of e.
• If e is ε_c, then α(e) = S.
• If e is ε_a, then α(e) = AP ∪ QC.
• If e is a logical concept l ∈ V_L, then α(e) = φ(l).
• If e = l<AnnotList>, where l ∈ V_L and <AnnotList> is partitioned into <AnnotListInc> and <AnnotListExc>, where the former corresponds to the list of sources that must be considered and the latter to the sources that must be excluded, then α(e) = φ(l) ∩ <AnnotListInc> − <AnnotListExc>.
• If e = e1 e2, then α(e1 e2) = {w1 w2 | w1 ∈ α(e1), w2 ∈ α(e2), edge(last(w1), first(w2)) ∈ L}, where last and first are functions that map a path to its last and first elements respectively, and L is the set of edges in PG (definition 1.1).

A naive method for evaluating a query Q_r is to traverse all paths in PG and to determine whether they interpret r.
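The mapping α defined above can be sketched in a few lines of Python. This is only an illustration of the definition applied to a single expression e ∈ L(RE) given as a list of terms; it is not the ESearch algorithm. The graph below, the link identifiers, and the choice to treat an empty inclusion list as "no restriction" are all assumptions made for the example.

```python
# Hypothetical toy physical graph PG: sources, links between them, and edges
# connecting a source to a link and a link to a source.
SOURCES = {"OMIM", "PubMed", "NCBI Protein"}
LINKS = {"OMIM->PubMed", "OMIM->Protein", "Protein->PubMed"}
EDGES = {
    ("OMIM", "OMIM->PubMed"), ("OMIM->PubMed", "PubMed"),
    ("OMIM", "OMIM->Protein"), ("OMIM->Protein", "NCBI Protein"),
    ("NCBI Protein", "Protein->PubMed"), ("Protein->PubMed", "PubMed"),
}

# phi maps each logical concept to its physical implementations (assumed values).
PHI = {
    "d": {"OMIM"},
    "c": {"PubMed"},
    "p": {"NCBI Protein"},
    "in": {"OMIM->PubMed", "Protein->PubMed"},
}


def alpha_term(term, include=(), exclude=()):
    """alpha for a single term, returned as a set of one-element paths."""
    if term == "eps_c":
        base = set(SOURCES)
    elif term == "eps_a":
        base = set(LINKS)
    else:
        base = set(PHI[term])
    if include:                      # assumption: empty list = no restriction
        base &= set(include)
    return {(x,) for x in base - set(exclude)}


def alpha_concat(paths1, paths2):
    """alpha(e1 e2): join paths whose end and start are connected by an edge."""
    return {w1 + w2 for w1 in paths1 for w2 in paths2
            if (w1[-1], w2[0]) in EDGES}


def alpha(terms):
    """alpha for a sequence of unannotated terms, built left to right."""
    result = alpha_term(terms[0])
    for term in terms[1:]:
        result = alpha_concat(result, alpha_term(term))
    return result


print(alpha(["d", "in", "c"]))
print(alpha(["d", "eps_a", "eps_c", "eps_a", "c"]))
```

On this toy graph the query ‘d in c’ keeps only the path through the direct OMIM to PubMed link, because the other implementation of ‘in’ does not start at a source implementing ‘d’.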
The time complexity of the naive evaluation is exponential in the size of PG, because PG has an exponential number of paths. A similar problem was addressed in [Mend 89], where it was shown that, for an arbitrary graph and regular expression, determining whether a particular edge occurs in a path that satisfies the regular expression and is in the answer is NP-complete.

The ESearch algorithm is based on an annotated deterministic finite state automaton (DFA) that recognizes a regular expression or query Q_r and the physical implementations that must be excluded from the final result. The algorithm performs an exhaustive breadth-first search of all paths in PG that respect the regular expression.

3. Ranking Criteria

The result of a query Q_r is a list of paths that represent the different ways in which the user can navigate through the data sources in order to evaluate Q_r. It is therefore important to assign ranks to these paths so that the user can easily select the most suitable one. We use three metrics for ranking the paths:
1. Path Cardinality - the number of instances of paths in the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) of entries e1 of S1 linked to an entry e2 of S2.
2. Target Object Cardinality - the number of distinct objects retrieved from the final data source.
3. Evaluation Cost - the cost of the evaluation plan, which involves both the local processing cost and remote network access delays.

These three metrics are meaningful to scientists: the path cardinality reflects the likelihood that a path exists between two sources, the target object cardinality estimates the number of retrieved entries, and the evaluation cost guides the scientist to the selection of an efficient evaluation path.
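To make the first two metrics concrete, the sketch below counts them exactly on a tiny, made-up set of link instances. Note the difference from the system itself: BioNavigation estimates these values from link statistics, whereas this example simply enumerates path instances. All entry identifiers, hop contents and path names are hypothetical.

```python
def path_instances(hops):
    """Enumerate path instances.

    hops is a list with one set of (from_entry, to_entry) pairs per link
    of the path, in navigation order.
    """
    instances = sorted(hops[0])
    for hop in hops[1:]:
        instances = [inst + (dst,)
                     for inst in instances
                     for (src, dst) in sorted(hop)
                     if src == inst[-1]]
    return instances


def path_cardinality(hops):
    """Number of path instances (pairs (e1, e2) for a path of length 1)."""
    return len(path_instances(hops))


def target_object_cardinality(hops):
    """Number of distinct entries reached in the final source."""
    return len({inst[-1] for inst in path_instances(hops)})


def rank(paths, metric):
    """Order named paths by a metric, largest value first."""
    return sorted(paths, key=lambda name: metric(paths[name]), reverse=True)


# Hypothetical link instances: disease -> nucleotide -> citation entries.
HOP1 = {("d1", "n1"), ("d1", "n2"), ("d2", "n2")}
HOP2 = {("n1", "c1"), ("n2", "c1"), ("n2", "c2")}
DIRECT = {("d1", "c1")}

PATHS = {"via Nucleotide": [HOP1, HOP2], "direct": [DIRECT]}

print(path_cardinality([HOP1, HOP2]))
print(target_object_cardinality([HOP1, HOP2]))
print(rank(PATHS, path_cardinality))
```

The two-hop path has five instances but reaches only two distinct citations, which shows why the two cardinalities can rank the same set of paths differently.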
These metrics are estimated for each path from the properties of the links that exist between the data sources in S (described in chapter 2, section 2.2), using the methods introduced in [Lacr 04d] and [Lacr 04c].

CHAPTER 5
The BioNavigation Interface

Design and development of the user interface for the BioNavigation system was the major task of the internship project. The following are the important features originally desired of the BioNavigation user interface:
1. Visualize the conceptual and the physical levels and the mappings between the two levels.
2. Browse the physical graph to obtain more information about the resources, e.g. their URL, data formats, schema, etc.
3. Build queries at the conceptual level by selecting the desired classes and relationships.
4. Interface with the ESearch algorithm and present the results to the user.
5. Integrate with a data integration tool that can implement the evaluation path selected by the user.

1. Interface Requirements

As with any software development project, it is very important to draw up the formal requirements of the system beforehand. The features desired above lead to specific requirements that can be classified into the following three categories, which reflect the different stages of a navigation process.

1.1. Browsing.

The browsing functionality of the interface allows the scientist to explore the scientific concepts and relationships, the integrated biological resources, and the mapping between them, and to access the metadata of each available biological resource. Step by step, the scientist may explore the logical graph by first selecting a concept and then exploring all concepts related to it, using the incoming and outgoing relations. Each concept and each relationship between concepts may be selected to display its physical implementation using the mapping. In the physical graph, the browsing mode allows the user to search for a particular data source by name.
Similar to browsing the logical graph, a node of the physical graph can be selected to display the incoming and outgoing links to and from other sources. Finally, the user may display the metadata for each biological resource, that is, for each node and edge of the physical graph. To achieve these features, the interface includes the following capabilities:
• A graph visualization component to display the two levels, where scientific classes and data sources are represented as labeled nodes, and the relationships between classes and the links between data sources are represented as labeled edges, in the conceptual and physical graphs respectively. We have used the Graph Visualization Framework (GVF) package [Mars 01] for this purpose in the first version of the BioNavigation prototype. In addition to drawing graphs, the framework provides facilities such as easy zooming and panning, alternative graph layouts, etc.
• A graph representation of the two levels in a format compatible with the visualization system. For this purpose the two graphs were translated to the GraphXML [Herm 00] format used by GVF. Thus all information stored in the BioMetaDatabase was converted to this XML format.
• Selection of nodes and edges using mouse clicks, to let the users obtain the meta information about the resources and concepts. A right click on a particular node or edge should display a context menu that depends on the node or edge type and allows the user to display the metadata from the BioMetaDatabase.

The next version of the BioNavigation system will use a better graph visualization system, the JUNG (Java Universal Network and Graph) Framework [OMad], which draws more pleasant looking graphs, is highly customizable, and has a well documented API. This will be very useful when the next version incorporates the more expressive ontology-based representation for the logical level.

1.2. Querying.
The query mode allows the user to express a query through scientific concepts, generates a regular expression input (defined in chapter 4) for ESearch, and then returns the paths. To express a query, the user selects the start and destination nodes, and intermediate nodes if desired. The selection results in a regular expression built from the symbols for each node. The regular expression can be either at the logical level or a combination of the logical and physical levels. For example, one can use or avoid a particular physical source in part of the regular expression while the remaining part is more general or is at the logical level. The generated regular expression is available for editing by advanced users who may want to tweak it manually. The BioNavigation interface thus has to support the following user operations:
• Select nodes and edges from the logical graph, as in the browsing mode, and add them sequentially to the regular expression query.
• Annotate selected scientific objects from the logical graph with specific physical resources, to restrict the algorithm to generating paths with or without a particular source. The user should be able to select such physical source constraints graphically.
• Specify whether the navigation path should include intermediate nodes and, if so, the number of intermediate nodes.
• Display the regular expression generated from the above selections, so that the user can verify and edit the query if necessary.
• Set ESearch-specific preferences such as the ranking criteria.
• Submit or clear the regular expression query built thus far.
• Maintain a history of previously submitted regular expressions, so that repeating a query with different preferences does not require selecting the nodes and edges manually again. The user should be able to select a past regular expression, change the ESearch preferences, and rerun the query with the new settings.
The above requirements led to the creation of a form-type interface with the necessary buttons, text boxes and pull-down menus for the user to build, modify and execute such navigational queries. The details of the interface are covered later in this chapter.

1.3. Interpreting Results.

The ESearch algorithm was implemented in Java by our collaborators at the University of Maryland, College Park, MD, and the Universidad Simon Bolivar, Caracas, Venezuela. The regular expression built using the query interface described above is sent to this implementation of ESearch, which is part of the BioNavigation system. ESearch then processes the regular expression and generates a result graph of paths that satisfy the regular expression, as well as a list of ranked paths. These returned paths are at the physical level and indicate the corresponding data sources and physical links. The requirements for this stage are:
• Format the ESearch results to present them to the user. Currently this is just a list of paths, displayed in a text window.
• Save the generated results, along with the query asked and the ranking criteria used, for future reference. This is done by saving the results in a text file.
• Allow the user to select a desired path from the results and highlight it on the physical level graph. This capability has not been implemented in the first version and will be included in the next one.
• Use a data integration or mediation system to take the user's selected path and send queries to the respective resources to execute the data collection protocol. This feature is also not available in the current interface but will be added soon.

2. Using the BioNavigation System

The BioNavigation interface and the ESearch algorithm are developed in Java and hence are platform independent.
Although BioNavigation utilizes external packages for purposes such as graph visualization, these are available through open source licenses and are included within the BioNavigation system itself, so no separate installation is required. The system requires the Java Runtime Environment (JRE) v1.4.2 or greater to be pre-installed on the user's machine. The BioNavigation system is freely available for academic and research purposes and can be obtained from our website, http://bioinformatics.eas.asu.edu/BioNavigation.html. The system is easy to install and use, and it includes an installation guide and a user manual.

The utility of the BioNavigation system is best explained using an example of a user's actions, from the start of the exploratory browsing process to the interpretation of the ESearch results. The following use cases and screen shots provide a better description of what the BioNavigation system does for the user.

Figure 4. The BioNavigation Interface

Figure 4 shows the BioNavigation interface, which displays to the user a graph representing the resources that can be queried. This graph is divided into two parts, representing the logical and the physical levels:
1. the top part (red ovals and blue edges) displays the scientific objects (e.g., a Gene, a Citation) that can be queried
2. the bottom part (blue cylinders and grey edges) displays the physical resources that map to the logical resources (e.g., GeneCards and Genew both provide information about the class gene).

Figure 5. Genecards Properties Window

Right-clicking a node in the physical graph and selecting the “Properties” option in the contextual menu opens a window displaying properties of this node, such as its main URL, its description, or the scientific class it describes (see Figure 5). These properties are the details obtained from the BioMetaDatabase.
Similarly, right-clicking an edge and selecting the “Properties” option in the contextual menu opens a window displaying properties of the capability described (i.e., the link between two resources), such as its input type or its implementation (see Figure 6).

Figure 6. Properties Window for the OMIM to CGAP Link

The “Build Query” tab of the BioNavigation tool allows the user to express logical queries and submit them. The output is a list of paths that can be followed to implement these queries, according to the preferences that were specified. The basic mechanism for querying this graph of resources is to specify a regular expression by selecting nodes and adding them by clicking on the “Add selected” button. For example, Figure 7 displays the query ‘disease-protein’ and its output. The corresponding regular expression is ‘d p’, and there is only one path on the current physical graph that implements this query: navigating from the OMIM to the NCBI Protein resource (shown in the result window).

Figure 7. Output for the ‘disease-protein’ Query

One can also specify the number of intermediate resources that can be used, by selecting one of the three options from the drop-down menu of the “Intermediate nodes” frame and clicking on the “Add” button. For example, Figure 8 displays a path query between a disease and a citation resource specifying that there must be three intermediate resources. The output offers two solutions: going from OMIM to PubMed by linking successively through either DBSNP, NCBI Nucleotide and NCBI Protein, or DBSNP, NCBI Protein and NCBI Nucleotide. Figure 9 displays a query retrieving proteins using a disease as an input and exploiting one intermediate resource. The two solutions proposed go from NCBI OMIM to NCBI Protein, linking through either NCBI Nucleotide or DBSNP. Figure 10 displays a similar query but specifying any number of intermediate nodes. The output offers four different solutions.
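The effect of the “Intermediate nodes” setting can be mimicked with a small breadth-first enumeration. The sketch below is not the ESearch implementation: it ignores the logical level, links and annotations, visits each source at most once per path (an assumption), and uses a hypothetical edge set that was chosen only so that the toy counts agree with the outputs described above. The real physical graph contains further sources and links.

```python
from collections import deque

# Hypothetical source-to-source edges; not taken from the BioMetaDatabase.
EDGES = {
    "OMIM": ["NCBI Protein", "NCBI Nucleotide", "DBSNP"],
    "DBSNP": ["NCBI Nucleotide", "NCBI Protein"],
    "NCBI Nucleotide": ["NCBI Protein", "PubMed"],
    "NCBI Protein": ["NCBI Nucleotide", "PubMed"],
    "PubMed": [],
}


def find_paths(start, target, intermediates=None):
    """Breadth-first enumeration of paths from start to target.

    intermediates=None accepts any number of intermediate sources;
    an integer k accepts only paths with exactly k of them.
    """
    results = []
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last == target and len(path) > 1:
            if intermediates is None or len(path) - 2 == intermediates:
                results.append(path)
            continue                      # a path ends when it reaches target
        if intermediates is not None and len(path) - 1 > intermediates:
            continue                      # already too many intermediates
        for nxt in EDGES[last]:
            if nxt not in path:           # assumption: no source visited twice
                queue.append(path + (nxt,))
    return results


for p in find_paths("OMIM", "PubMed", intermediates=3):
    print(" -> ".join(p))
print(len(find_paths("OMIM", "NCBI Protein")))
```

With this edge set, the direct disease-protein query yields one path, one intermediate yields two, and the unrestricted query yields four, while the disease-citation query with three intermediates yields the two orderings of NCBI Nucleotide and NCBI Protein after DBSNP.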
The different paths proposed by the tool when a query is submitted can be ranked according to different criteria. For instance, Figure 11 displays two different rankings of the output of a query specifying a path between a Gene and a Citation resource, with any number of intermediate resources. The two ranking criteria selected are:
1. On the left of the screen, the output is ranked with respect to the target object cardinality (i.e., the number of entries of the target resource referenced through the path).
2. On the right of the screen, the output is ranked with respect to the path cardinality (i.e., the number of links existing between the source and the target resource).

This example shows that, depending on the ranking criterion used, different paths are ranked higher according to the estimates for cardinalities, cost, etc., as described in chapter 4.

Figure 8. Disease to Citation with 3 Intermediate Nodes
Figure 9. Disease to Protein with one Intermediate Node
Figure 10. Using any number of Intermediate Nodes
Figure 11. Gene-Citation query with 0 or more intermediates: two outputs, for target object and path cardinality ranking

CHAPTER 6
Future Work and Conclusions

BioNavigation can enhance existing mediation approaches by providing scientists with the ability to browse through available integrated resources and to access their properties. It acts as a guidance system for scientists in designing their data collection protocols and queries. Certain innovative features distinguish the BioNavigation approach from existing systems. These are:
• The use of an ontology to graphically build navigational queries, and the ability to specify a wildcard ε*, allow users to identify alternate paths that may be exploited to evaluate the queries.
• The annotations can be used by advanced users to specify resources they require to be used (or not to be used) in the process.
• The ESearch algorithm designed and implemented for BioNavigation allows efficient search in the space of all possible evaluation paths.
• Moreover, three scientifically meaningful metrics give scientists a way to identify the paths that best meet their needs.

The BioNavigation interface still has room for many improvements and innovations, which will be part of the future work for this project. The current version of the interface has some limitations which will be overcome in the next version:
1. Top ranked paths could be highlighted in the physical graph, using a different color for each top ranked path, so that the user can browse the result graph in a manner similar to the physical graph.
2. The rationale behind the path rankings could be explained to the user so that he can better select the best measure. Also, the current three metrics are very limited, and we need to identify more semantically meaningful measures that users can relate to, for example the level of curation in a source (data quality), the trustworthiness (provenance) of the data, the user's preferences, etc.
3. Currently the results only point the user to the data sources that can be used to implement the scientific pipeline; the user has to execute it manually or using some other system. In the future, a scientist should be able to select a desired path and have the system query the resources to get the corresponding data.

These are just a few of the ideas we have in mind for the improvement of the BioNavigation system. The software is made freely available for distribution so that people can evaluate its utility and provide feedback that can help in further improvements. Another major future goal for BioNavigation is its ultimate integration with the SemanticBio system, a scientific workflow system which uses web services for data collection [Lacr 05a].
The integration of the two systems will allow scientists to select one of the result paths and collect data on that path. SemanticBio is also under development at the Scientific Data Management Lab.

REFERENCES

[Bake 98] P. G. Baker, A. Brass, S. Bechhofer, C. Goble, N. Paton, and R. Stevens. “TAMBIS - Transparent Access to Multiple Bioinformatics Information Sources”. In: Intelligent Systems for Molecular Biology (ISMB), pp. 25–43, AAAI Press, July 1998.

[Cons 00] The Gene Ontology Consortium. “Gene Ontology: tool for the unification of biology”. Nature Genetics, Vol. 25, pp. 25–29, May 2000.

[Etzo 03] T. Etzold, H. Harris, and S. Beaulah. SRS - An Integration Platform for Databanks and Analysis Tools, Chap. 5, pp. 109–145. Morgan Kaufmann Publishing, 2003.

[Galp 05] M. Y. Galperin. “The Molecular Biology Database Collection: 2005 update”. Nucleic Acids Research, Vol. 33, Database Issue, pp. 5–24, Jan. 2005.

[Gene 05] “Genew, HUGO Gene Nomenclature Committee (HGNC), Department of Biology, University College London”. 2005. http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl.

[Haas 03] L. Haas, B. Eckman, P. Kodali, E. Lin, J. Rice, and P. Schwarz. DiscoveryLink, Chap. 11, pp. 303–334. Morgan Kaufmann Publishing, 2003.

[Hend 02] J. Hendler, T. Berners-Lee, and E. Miller. “Integrating Applications on the Semantic Web”. Journal of the Institute of Electrical Engineers of Japan, Vol. 122, No. 10, pp. 676–680, Oct. 2002.

[Herm 00] I. Herman and M. S. Marshall. “GraphXML - An XML-based Graph Description Format”. In: Proceedings of the Symposium on Graph Drawing, pp. 52–62, 2000.

[Jr 02] C. J. Stoeckert Jr., H. C. Causton, and C. A. Ball. “Microarray databases: standards and ontologies”. Nature Genetics, Vol. 32, Supplement - Chipping Forecast II, pp. 469–473, Dec. 2002.

[Lacr 03] Z. Lacroix and T. Critchlow, Eds. Bioinformatics: Managing Scientific Data. Morgan Kaufmann Publishing, 2003.

[Lacr 04a] Z. Lacroix and V. Edupuganti. “How Biological Source Capabilities May Affect the Data Collection Process”. In: Computational Systems Bioinformatics Conference, pp. 596–597, IEEE Computer Society, 2004.

[Lacr 04b] Z. Lacroix, T. Morris, K. Parekh, L. Raschid, and M.-E. Vidal. “Exploiting Multiple Paths to Express Scientific Queries”. In: Scientific and Statistical Database Management (SSDBM), pp. 357–360, IEEE Computer Society, June 2004.

[Lacr 04c] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid. “Links and Paths Through Life Science Data Sources”. In: E. Rahm, Ed., First International Workshop on Data Integration in the Life Sciences, pp. 203–211, Springer, March 2004.

[Lacr 04d] Z. Lacroix, L. Raschid, and M.-E. Vidal. “Efficient Techniques to Explore and Rank Paths in Life Science Data Sources”. In: E. Rahm, Ed., First International Workshop on Data Integration in the Life Sciences, pp. 187–202, Springer, March 2004.

[Lacr 05a] Z. Lacroix and H. Ménager. “SemanticBio: Building Conceptual Scientific Workflows Over Web Services”. In: B. Ludäscher and L. Raschid, Eds., Second International Workshop on Data Integration in the Life Sciences, Springer, July 2005.

[Lacr 05b] Z. Lacroix, K. Parekh, M.-E. Vidal, M. Cardenas, and N. Marquez. “BioNavigation: Selecting Optimum Paths through Biological Resources to Evaluate Ontological Navigational Queries”. In: B. Ludäscher and L. Raschid, Eds., Second International Workshop on Data Integration in the Life Sciences, Springer, July 2005.

[Mars 01] M. S. Marshall, I. Herman, and G. Melancon. “An Object-oriented Design for Graph Visualization”. Software Practice and Experience, Vol. 31, pp. 739–765, 2001.

[McGu 04] D. L. McGuinness and F. van Harmelen. “OWL Web Ontology Language Overview”. W3C Recommendation, Feb. 2004. http://www.w3.org/TR/owl-features/.

[Mena 01] E. Mena and A. Illarramendi. Ontology-Based Query Processing for Global Information Systems. Kluwer Academic Publishers, 2001.

[Mend 89] A. O. Mendelzon and P. T. Wood. “Finding Regular Simple Paths in Graph Databases”. In: P. M. G. Apers and G. Wiederhold, Eds., Very Large Data Bases (VLDB), pp. 185–193, Morgan Kaufmann, 1989.

[Mudu 04] P. Mudumby, T. Morris, and S. Bysani. “Design and Development of a User Interface to Support Navigation for Scientific Discovery”. May 2004. http://math.la.asu.edu/~cbs/pdfs/projects/Spring 2004/Group1 report.pdf.

[OMad] J. O’Madadhain, D. Fisher, P. Smyth, S. White, and Y.-B. Boey. “Analysis and Visualization of Network Data using JUNG”. (preprint) http://jung.sourceforge.net/doc/JUNG journal.pdf.

[Rahm 04] E. Rahm, Ed. Data Integration in the Life Sciences (DILS), Springer, 2004.

[Rebh 97] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, and D. Lancet. “GeneCards: encyclopedia for genes, proteins and diseases, Weizmann Institute of Science, Bioinformatics Unit and Genome Center (Rehovot, Israel)”. 1997. http://bioinformatics.weizmann.ac.il/cards.

[rmLu 05] B. Ludäscher and L. Raschid, Eds. Data Integration in the Life Sciences (DILS), Springer, 2005.

[Stev 00] R. Stevens, C. A. Goble, and S. Bechhofer. “Ontology-Based Knowledge Representation for Bioinformatics”. Briefings in Bioinformatics, Vol. 1, No. 4, pp. 398–416, November 2000.

[Stev 03] R. D. Stevens, A. J. Robinson, and C. A. Goble. “myGrid: personalised bioinformatics on the information grid”. Bioinformatics, Vol. 19, No. 90001, pp. i302–i304, 2003.