Article in Proceedings / ... International Conference on Intelligent Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for Molecular Biology · February 1998
Source: PubMed
Patricia G. Baker (a), Andy Brass (a), Sean Bechhofer (b), Carole Goble (b), Norman Paton (b), Robert Stevens (b).

(a) School of Biological Sciences, Stopford Building, University of Manchester, Oxford Road, Manchester, M13 9PT, U.K.
Telephone: 44 (161) 275 2000. Fax: 44 (161) 275 5082.
pbaker@manchester.ac.uk, abrass@manchester.ac.uk

(b) Department of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PT, U.K.
Telephone: 44 (161) 275 6142. Fax: 44 (161) 275 6236.
seanb@cs.man.ac.uk, carole@cs.man.ac.uk, norm@cs.man.ac.uk, stevensr@cs.man.ac.uk
Footnote a: Because the biological knowledge base is a conceptual model of biological terminology, the words ‘concept’ and ‘term’ are used interchangeably in this paper.
[Figure: TAMBIS architecture diagram. Recoverable labels: Layer 1, Query formulation (Knowledge-Driven Graphical User Interface; Biological Concept Model), which passes a declarative query to Layer 2, Query planning and translation (source mediation; Query Transformation; Source Model), which passes an ordered execution plan to Layer 3, Query execution (wrapped sources).]
In order to share standardised and unambiguous information, controlled vocabularies, or terminologies, can be used as a framework for expressing and communicating ideas in a consistent manner. The TAMBIS biological Concept Model describes such a terminology. This knowledge base covers terms associated with proteins and nucleic acids, their component parts and their structures, biological functions and processes, tissues and taxonomy. The terminology has two key aspects:
• it is compositional, resembling a dictionary of elementary terms that are assembled according to a restricted grammar to form new complex composite terms. These composite terms can in turn be components in new compositional terms, so the terminology is recursive. For example, the term ‘Motif’ can be combined with the terms ‘isComponentOf’ and ‘Protein’ to create a new composite term ‘Motif which isComponentOf Protein’. This in turn could be combined with the terms ‘hasFunction’ and ‘Hydrolase’ to form a composite term ‘Motif which isComponentOf Protein and hasFunction Hydrolase’; this term is both a concept and a query.
• it is a classification scheme that organises terms into a hierarchy based on the ‘isa’ relationship (also known as the subsumption relationship). For example, ProteinSequence ‘isa’ Sequence, that is, it is a more specialised kind of Sequence.

To be truly effective, such a terminology needs to be represented in a scheme that can reason about the inferred relationships between terms and their components, can control the formation of terms, and can automatically classify terms based on their components so that the hierarchy takes care of itself. As terms are changed the scheme should also dynamically reclassify them to ensure the hierarchy’s correctness.

Description Logics (DL), also known as Terminology Logics, are a family of logics explicitly designed to represent taxonomic and conceptual knowledge of an application domain on an abstract level; for an overview see (Borgida 1995). DLs are usually given a Tarski-style declarative semantics, which allows them to be seen as sub-languages of first-order predicate logic. In the TAMBIS project we use the GRAIL DL (Rector 1996), developed at Manchester. Briefly, a DL is an ‘isa’-based classification system that allows a recursive, compositional model to be built from terms and binary relations. A base term can be combined with any number of relation-term pairs (or criteria) to create a more complex term. Any of these terms can be composite (complex) or elementary. Figure 2 gives a small fragment of the GRAIL classification, omitting the term constructors. In this example ‘Motif’ is the base term and ‘isComponentOf Protein’ is the criterion with which it is combined. GRAIL supports the automatic classification of concepts into ‘isa’ hierarchies by reasoning about the component descriptions of the concepts. Therefore, ‘Protein Motif’ would be classified automatically as a child of ‘Motif’ and a parent of ‘Poecilia Reticulata Protein Motif’ based on its definition. Only 3 of the 11 ‘isa’ relationships shown in figure 2 have been hand-crafted by the knowledge modeller. DLs support multi-dimensional classification so that the same concept can be classified in many ways, thus allowing for the different user views of a concept. The classification is dynamic, so as the description of a concept is further elaborated it is automatically reclassified. Description Logics therefore support the incremental description of terms.

The classification hierarchy supports imprecise and general queries and query exploration by moving around the hierarchy. The compositional nature of the representation allows for the flexible construction of queries at varying levels of complexity and abstraction. In DLs the modelling language and the query language are the same thing; to find the concept you define it and the classifier classifies it. If it is sound then it is positioned in the hierarchy and you can ask for its parent, children or the instances it describes. If it is unsound then it does not classify and, therefore, cannot appear in a query.

A whole family of knowledge representation systems has been built using DLs, and recent work has provided a sound formal basis for several DLs along with results concerning their complexity (Donini 1991). Significantly large models are now being produced; for example, the Galen-In-Use medical model (Rector 1997), expressed in GRAIL, comprises some 10,000 concepts and relations.

DLs are expressive, and usually have complete and decidable reasoning. However, the conflict when applying any DL is between computational tractability and expressiveness; GRAIL’s terminological language is less expressive than most other DLs but it compensates for this by supporting a powerful set of assertion axioms and a multi-layer sanctioning mechanism. These sanctions decree whether two concepts are permitted to be related via some relationship and so constrain the construction of complex concepts. Sanctions ensure that only semantically valid concepts are formed and that a large number of complex concepts can be inferred from a sparse model. As only reasonable concepts can be inferred from the model, the user is allowed to construct only those queries that it is reasonable to ask. For example, in figure 2, asserting that ‘SequenceComponent isComponentOf Protein’ is legal is sufficient to infer that ‘Motif isComponentOf Protein’ is legal, without having to create or position it until it is asked for. Therefore, only a small number of constraints need be asserted in order that a large number of concepts can be inferred.

In TAMBIS the biological Concept Model is used to:
• describe the metadata of the underlying data sources, representing an over-arching universal schema
• express queries in the modelling language
• drive a GUI for query formulation
• mediate between the various data sources by exploiting the biological concept hierarchy to assist in the identification and resolution of equivalences or near equivalences; similar approaches have been taken in non-biological projects, for example SIMS (Arens 1993).

As (Markowitz 1995) and (Davidson 1995) suggest, integration is costly and the quest for an agreed schema is futile. However, our biological terminology does not attempt to force a global schema representing a consistent integrated view of all the component databases. Instead it seeks to describe what is in the component databases and, rather than resolve conflicts, it acknowledges them and indicates possible equivalences.

The Knowledge-Driven User Interface

Queries are formulated against the biological concept model in the GRAIL language. It would be inappropriate for biologists to learn either GRAIL or the contents of the knowledge base. Instead, TAMBIS provides a forms-based GUI that is driven by the terminological model. The interface supports two tasks:
• exposure of the terminological model and
• guided query formulation and manipulation.

During the query formulation process the model may be browsed to find what can sensibly be said of a concept of interest. A convenient mechanism for browsing the model without query formulation is provided by the navigation
[Figure 2 diagram. Recoverable labels: elementary terms Protein, Organism, Function, SequenceComponent, Motif, Hydrolase and Poecilia reticulata; relationships isComponentOf, hasOrganismSource and hasFunction; composite terms ‘SequenceComponent isComponentOf Protein’, ‘SequenceComponent hasFunction Hydrolase’ and ‘Motif <isComponentOf (Protein hasOrganismSource PoeciliaReticulata) hasFunction Hydrolase>’.]
Figure 2. A simplified fragment of the TAMBIS GRAIL model showing the power of auto-classification; the only ‘isa’ relationships that
have been ‘hand-crafted’ by the knowledge worker are indicated by the solid arrows. All the other terms are implied by the sanctioning
scheme and automatically and dynamically classified upon request, as indicated by the broken arrows. The solid lines indicate the
sanctioned relationships between terms. It is these relationships that allow the construction of all of the composite terms shown.
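The auto-classification described in this caption can be illustrated with a toy structural subsumption check. This is only a sketch in Python, not GRAIL's actual classifier: the `Term` class, the `ASSERTED_ISA` table and the `subsumes` function are hypothetical names, and the asserted ‘isa’ links are assumptions loosely modelled on the terms in the figure.

```python
from dataclasses import dataclass

# Hand-asserted 'isa' links between elementary terms (assumed for illustration).
ASSERTED_ISA = {
    "Motif": "SequenceComponent",
    "PoeciliaReticulata": "Organism",
    "Hydrolase": "Function",
}


@dataclass(frozen=True)
class Term:
    """A base term plus zero or more (relation, Term) criteria."""
    base: str
    criteria: tuple = ()

    def which(self, relation, filler):
        """Return a new, more specialised composite term."""
        return Term(self.base, self.criteria + ((relation, filler),))


def base_isa(child, parent):
    """True if `child` equals `parent` or reaches it via asserted links."""
    while child is not None:
        if child == parent:
            return True
        child = ASSERTED_ISA.get(child)
    return False


def subsumes(general, specific):
    """True if everything described by `specific` also fits `general`."""
    if not base_isa(specific.base, general.base):
        return False
    # Each criterion of the general term must be matched by an equally
    # or more specialised criterion on the specific term.
    return all(
        any(rel == s_rel and subsumes(filler, s_filler)
            for s_rel, s_filler in specific.criteria)
        for rel, filler in general.criteria
    )


motif = Term("Motif")
protein_motif = motif.which("isComponentOf", Term("Protein"))
guppy_protein = Term("Protein").which("hasOrganismSource", Term("PoeciliaReticulata"))
guppy_protein_motif = motif.which("isComponentOf", guppy_protein)

# None of these composite 'isa' links was asserted; they follow from the definitions.
print(subsumes(motif, protein_motif))                # True
print(subsumes(protein_motif, guppy_protein_motif))  # True
print(subsumes(guppy_protein_motif, protein_motif))  # False
```

Under these assumptions, ‘Protein Motif’ is placed below ‘Motif’ and above ‘Poecilia Reticulata Protein Motif’ purely from the term definitions, which mirrors the broken arrows in the figure. A real DL classifier also handles features this sketch ignores, such as sanction checking and reclassification when definitions change.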
tool. Figure 3 shows the navigator focused on the concept ‘Protein Structure’. The concept currently in focus occupies the centre of the frame and related concepts from the Knowledge Base are displayed around it. The model may be browsed by promoting any of the related concepts to be the central concept. The new central concept is then surrounded by all its related concepts.

Having identified a concept of interest, for example ‘motif’, the user may want to form a query based on that concept. A Query Manipulation tool gives the user an option to add more information about the concept (or specialise the concept) by presenting all the legitimate criteria that can be applied to the concept ‘motif’ (see figure 4).

The user may choose one or many of these criteria. If they choose, for example, ‘isComponentOf Protein’, the query is equivalent to the English expression “find all protein motifs”. Having constructed the query the user may manipulate the whole query or any of its component sub-queries by (i) the addition or removal of criteria or (ii) the replacement of terms with more specialised or more general terms. Figure 5a shows a query that has been built by further specialisation of the term ‘Protein’ in the above query by addition of the criterion ‘hasOrganismSource PoeciliaReticulata’. The query is equivalent to the English expression “find all motifs occurring in guppy proteins”.

It is important to appreciate that in TAMBIS the term concept is interchangeable with the term query. Therefore, in constructing a concept (a description of what the user wants) the user is constructing a query (“what things exist that fit the description I have just given?”).

Query Planning and Translation

Queries expressed in GRAIL are declarative and source independent. GRAIL queries thus specify what information is required, but neither how it should be obtained nor from where. It is the role of the query planning and translation layer to provide this additional information. This layer takes as input a GRAIL query and generates as output an execution plan in CPL. The planning and translation process is broken into three main steps:
• Translation into a Query Internal Form (QIF): The GRAIL query is unnested and certain query constructs are simplified.
• Query Planning: A search algorithm considers alternative evaluation orders for the components of the QIF generated at step 1, with a view to identifying both valid and efficient ways of evaluating the query.
• Code Generation: The query plan that results from the planning phase is converted into a CPL program for execution.

The following subsections elaborate on the above steps, both detailing what is done at each stage and outlining the auxiliary data structures that are required.

Translation into Query Internal Form (QIF). GRAIL queries are intrinsically nested structures. However, nested language structures generally imply some evaluation order, so we follow a number of earlier query planners in unnesting the source query prior to query optimisation (Paton 1990, Fegaras 1997). The QIF is a list of query components, each of which is a tuple (Base, Variable, Criteria, Cost, Cardinality) representing the evaluation of part of the query. Base is the base concept of the component, Variable is the name of the variable used to store values retrieved as a result of evaluating the component, Criteria represents the set of criteria associated with Base, Cost is an estimate of the cost of evaluating the component, and Cardinality is the anticipated size of the collection that will result from evaluating the component. Values for Cost and Cardinality are computed by the planner. Figure 5a shows an example query that is equivalent to the English query “find all motifs in Poecilia reticulata (guppy) proteins”. The GRAIL representation
Figure 3. TAMBIS prototype user interface navigation tool showing the navigation of the concept ‘Protein Structure’. The
central term is surrounded by related terms. Each related term is coloured according to its relationship with the central term.
There are four possible relationships: parent terms - concepts immediately above it in the hierarchy with which it has an ‘isa’
relationship e.g. ‘Structure’; child terms - concepts immediately below it in the hierarchy which have ‘isa’ relationships with
it e.g. ‘Protein Tertiary Structure’; defining terms - relation-term pairs that form part of its definition e.g. ‘is structure of
Protein’; sanctioned terms - concepts with which it has appropriately sanctioned relationships but which do not form part of
the concept’s definition e.g. ‘is determined by Method of Determining Structure’.
Figure 4. An example from the TAMBIS user interface prototype showing the relationships that can be used to specialise the
concept of ‘motif’.
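One way to picture how an interface could derive the list of legitimate criteria shown in Figure 4 is a lookup over sanctions that are inherited down the ‘isa’ hierarchy. The sketch below is a hypothetical illustration in Python, not the TAMBIS implementation: the `ISA` and `SANCTIONS` tables and the function names are assumptions, with entries loosely based on the terms in Figure 2.

```python
# Asserted 'isa' links (child -> parent); assumed for illustration.
ISA = {"Motif": "SequenceComponent"}

# Sanctions: (subject concept, relation, permitted kind of filler).
SANCTIONS = [
    ("SequenceComponent", "isComponentOf", "Protein"),
    ("SequenceComponent", "hasFunction", "Function"),
    ("Protein", "hasOrganismSource", "Organism"),
]


def ancestors_or_self(concept):
    """Yield the concept and then each asserted ancestor in turn."""
    while concept is not None:
        yield concept
        concept = ISA.get(concept)


def legitimate_criteria(concept):
    """Relation-filler pairs sanctioned for `concept` or any of its ancestors."""
    lineage = set(ancestors_or_self(concept))
    return [(rel, filler) for subject, rel, filler in SANCTIONS if subject in lineage]


def is_sanctioned(concept, relation, filler):
    """True if the criterion (relation, filler) may be added to `concept`."""
    return any(
        rel == relation and allowed in ancestors_or_self(filler)
        for rel, allowed in legitimate_criteria(concept)
    )


print(legitimate_criteria("Motif"))
print(is_sanctioned("Motif", "isComponentOf", "Protein"))  # True
print(is_sanctioned("Protein", "isComponentOf", "Motif"))  # False
```

Because ‘Motif’ inherits the sanctions asserted on ‘SequenceComponent’ in this toy model, a single assertion is enough to offer ‘isComponentOf Protein’ for every kind of sequence component, while a nonsensical criterion such as ‘Protein isComponentOf Motif’ is never presented.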
a) [Screenshot of the query in the knowledge-driven GUI; image not reproduced.]
b)
Motif which isComponentOf (Protein which
hasOrganismSource PoeciliaReticulata)
c)
[ ( Motif, Motif-1, [(isComponentOf Protein, Protein-1)], -1, 1),
(Protein, Protein-1, [(hasSourceOrganism PoeciliaReticulata,
null)], -1, -1) ]
d)
{Motif-1|
\Protein-1<-get-sp-entry-by-os(" POECILIA+RETICULATA"),
Motif-1<-do-prosite-scan-by-entry-rec(Protein-1)}
e) [Screenshot of the results displayed in a Web browser; image not reproduced.]
Figure 5. An example showing the stages in the information retrieval process using TAMBIS. a) The knowledge-driven GUI
allows the user to construct a declarative, conceptual and source independent query. The query formulated at the interface is
represented in GRAIL as shown in b). c) The single GRAIL query is transformed into query internal form (QIF). d) The QIF
is transformed into a functional, source-dependent query in CPL. e) The results from the CPL wrapped sources are presented
to the user via a Web browser.
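The step from panel b) to panel c) can be sketched in Python as a small unnesting routine. This is an illustrative reconstruction rather than the TAMBIS translator: the nested-tuple input format, the `Component` type, the `unnest` function and the use of `None` for cost and cardinality values that a planner has yet to compute are all assumptions. The relation name follows panel b).

```python
from collections import namedtuple

# One QIF component, following the tuple layout given in the text.
Component = namedtuple("Component", "base variable criteria cost cardinality")


def unnest(query, components=None, counters=None):
    """Flatten a nested query into a list of QIF-style components.

    `query` is (base, [(relation, filler), ...]), where a filler is either
    an elementary term name (str) or another nested query of the same shape.
    Returns (variable bound for this query, list of components).
    """
    if components is None:
        components, counters = [], {}
    base, criteria = query
    counters[base] = counters.get(base, 0) + 1
    variable = f"{base}-{counters[base]}"
    component = Component(base, variable, [], None, None)
    components.append(component)  # outer component precedes its sub-queries
    for relation, filler in criteria:
        if isinstance(filler, str):
            # Elementary filler: a constant, so no variable is needed.
            component.criteria.append((relation, filler, None))
        else:
            # Nested filler: give it its own component and refer to its variable.
            filler_var, _ = unnest(filler, components, counters)
            component.criteria.append((relation, filler[0], filler_var))
    return variable, components


# Motif which isComponentOf (Protein which hasOrganismSource PoeciliaReticulata)
nested = ("Motif", [("isComponentOf",
                     ("Protein", [("hasOrganismSource", "PoeciliaReticulata")]))])
_, qif = unnest(nested)
for part in qif:
    print(part)
```

The result is a flat list with one component per base concept, linked only through variable names such as `Protein-1`. That flat form is what leaves a planner free to choose the evaluation order, as in panel d), where the protein entries are retrieved before the motif scan is applied to them.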
References

Arens Y., Chee C.Y., Hsu C-H., Knoblock C.A., Retrieving and Integrating Data from Multiple Information Sources, Journal on Intelligent and Cooperative Information Systems, 2:127-158, 1993.

Borgida A., Description Logics in Data Management, IEEE Transactions on Knowledge and Data Engineering, 7(5): 671-682, 1995.

Buneman P., Davidson S.B., Hart K., Overton C. and Wong L., A Data Transformation System for Biological Data Sources, in Proceedings of VLDB, Sept. 1995 (Zurich, Switzerland).

Davidson S.B., Overton C., Buneman P., Challenges in Integrating Biological Data Sources, Journal of Computational Biology, Vol 2, No 4, 1995.

Donini F., Lenzerini M., Nardi D., Nutt W., The Complexity of Concept Languages, KR-91, pp. 151-162, 1991.

Etzold T., Ulyanov A., Argos P., SRS: information retrieval system for molecular biology data banks, Methods Enzymol., 266: 114-128, 1996.

Fegaras L., An experimental optimizer for OQL, Technical Report TR-CSE-97-007, CSE, University of Texas at Arlington, 1997.

Karp P., A Strategy for Database Interoperation, Journal of Computational Biology, 1996.

Kemp G.J.L. and Gray P.M.D., Using the Functional Data Model to Integrate Distributed Biological Data Sources, Proc. 8th Int. Conf. on Scientific and Statistical Database Management, IEEE Press, 176-195, 1996.

Markowitz V.M. and Ritter O., Characterizing Heterogeneous Molecular Biology Database Systems, Journal of Computational Biology, 2(4), 1995.

Paton N.W. and Gray P.M.D., Optimising and Executing Daplex Queries Using Prolog, The Computer Journal, Vol 33, No 6, 547-555, 1990.

Rector A.L., Bechhofer S., Goble C.A., Horrocks I., Nowlan W.A., Solomon W.D., The GALEN modelling language for medical terminology, AI in Medicine, 1996.

Rector A. and Horrocks I., Experience building a Large, Re-usable Medical Ontology using a Description Logic with Transitivity and Concept Inclusions, AAAI Spring Symposium on Ontological Engineering, 1997.

Rodriguez-Tome P., Helgesen C., Lijnzaad P., Jungfer K., A CORBA server for the radiation hybrid database, Proceedings of the ISMB 1997, 5:250-253.

Warren D.H.D., Efficient Processing of Interactive Relational Database Queries Expressed in Logic, Proc. 7th VLDB, 272-281, 1981.

Wiederhold G., Mediators in the Architecture of future Information Systems, IEEE Computer 21(3), March 1992, pp. 38-50.