Ontologies for Bioinformatics
Kenneth Baclawski
Tianhua Niu
Baclawski, Kenneth
Ontologies for bioinformatics / Kenneth Baclawski, Tianhua Niu.
p. cm.—(Computational molecular biology)
Includes bibliographical references and index.
ISBN 0-262-02591-4 (alk. paper)
1. Bioinformatics–Methodology. I. Niu, Tianhua. II. Title. III. Series.
QH324.2.B33 2005
572.8’0285–dc22 2005042803
Contents
Preface
I Introduction to Ontologies
1 Hierarchies and Relationships
1.1 Traditional Record Structures
1.2 The eXtensible Markup Language
1.3 Hierarchical Organization
1.4 Creating and Updating XML
1.5 The Meaning of a Hierarchy
1.6 Relationships
1.7 Namespaces
1.8 Exercises
2 XML Semantics
2.1 The Meaning of Meaning
2.2 Infosets
2.3 XML Schema
2.4 XML Data
2.5 Exercises
References
Index
Preface
Introduction to Ontologies
1 Hierarchies and Relationships
1.1 Traditional Record Structures
The actual records are considerably longer. It should be apparent that one
cannot have any understanding of the meaning of the records without some
explanation such as the following:
NAME LABEL
instudy Date of randomization into study
bmi Body Mass Index. Weight(kgs)/height(m)**2
The explanation of what the fields mean is called metadata. In general, meta-
data are any “data about data,” such as the names of the fields, the kind of
values that are allowed, the range of values, and explanations of what the
fields mean.
In this case each field has a fixed number of characters, and each record
has a fixed total number of characters. This is called the fixed-width format
or fixed-column format. This format simplifies the processing of the file, but it
limits what can be said within each field. If the text that should be in a field
does not fit, then it must be abbreviated or truncated. There are other file
formats that eliminate these limitations. One commonly used format is to use
commas or tabs to delimit the fields. This allows the fields to have varying
size. However, it complicates processing when the delimiting character (i.e.,
the comma or tab) must be used within a field.
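For instance, a comma-delimited version of part of such a record might look
like this (the field names and values are illustrative):
instudy,bmi,comment
12-MAR-1998,18.7,"gained weight, then stabilized"
Here the comment field itself contains a comma, so some convention (such as
the quoting shown) is needed to keep that comma from being read as a field
separator.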
The information in the record is often highly redundant. For example, the
obesity and ovrwt fields are unnecessary because they can be computed from
the bmi field. Similarly, the bmi field can be computed from the Height and
Weight fields. Another common feature of flat files is that the field formats are
often inappropriate. For example, the obesity field can only have the values
“yes” or “no,” but it is represented using numbers.
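For instance (with illustrative numbers), a subject with Weight 57.2 kg and
Height 1.75 m has bmi = 57.2/(1.75)**2 ≈ 18.7, and the ovrwt and obesity
values then follow by comparing the bmi value with fixed thresholds.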
Each field of a flat file is defined by features such as its name, format,
description, and so on. A database is a collection of flat files (called tables)
with auxiliary structures (e.g., indexes) that improve performance for certain
commonly used operations. The description of the fields of one or more flat
files is called the schema.
A database schema is an example of an ontology. In general, whenever
data are structured, the description of their structure is the ontology for the
data. A glance at the example record makes it clear that the raw data record
is completely useless without the ontology. The ontology is what gives the
raw data their meaning. The same is true for any kind of data, whether
they be electronic data used by a computer or audiovisual data sensed by a
person. Ontologies are the means by which a person or some other agent
understands its world, as well as the means by which a person or agent com-
municates with others.
Summary
• A flat file is a collection of records.
• Each record in a flat file has the same number and kinds of fields as any
other record in the same file.
• The schema of a flat file describes the structure (i.e., the kinds of fields) of
each record.
1.2 The eXtensible Markup Language
Flat files are simple and easy to process. A typical program using and pro-
ducing flat files simply performs the same operation on each record. How-
ever, flat files are limited to relatively simple forms of data. They are not
well suited to the complex information of genomics, proteomics, and so on.
Accordingly, a new approach is necessary.
The eXtensible Markup Language (XML) is a powerful and flexible mech-
anism that can be used to represent bioinformatic data and facilitates com-
munication. Unlike flat files, an XML document is self-describing: the name of
each attribute is specified in addition to the value of the attribute. The health
study record shown above could be written like this in XML:
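A sketch of such an element, using the field names mentioned above (the
element name and the attribute values are illustrative), might be:
<record instudy="12-MAR-1998" Height="1.75" Weight="57.2" bmi="18.7"/>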
The attributes can appear in any order, and an attribute that is not needed by
an element is not written at all.
An attribute in general is a property or characteristic of an entity. Linguis-
tically, attributes are adjectives that describe entities. For example, a person
may be overweight or obese, and the BMI attribute makes the description
quantitative rather than qualitative. The notion of attribute represents two
somewhat different concepts: the attribute in general and the attribute of a
specific entity. BMI is an example of an attribute, but one would also speak
of a BMI equal to 18.66 for a specific person as being an attribute. To avoid
confusion we will refer to the former as the attribute name, while the latter is
an attribute value.
<!ATTLIST molecule
title CDATA #IMPLIED
id CDATA #IMPLIED
convention CDATA "CML"
dictRef CDATA #IMPLIED
count CDATA #REQUIRED
>
Figure 1.1 Part of the Chemical Markup Language DTD. This defines the attribute
names that are allowed in a molecule element.
It is also used to show where an element ends. To include the left angle
bracket in ordinary text, write it as "&lt;". Writing a special character like
the left angle bracket as "&lt;" is called escaping. The ampersand character
(&) is also reserved by XML, and it must be written as "&amp;".
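For instance, an element whose text mentions both reserved characters could
be written as follows (the element name and text are illustrative):
<note>bmi &lt; 25 &amp; weight stable</note>
An XML processor reading this element recovers the text "bmi < 25 & weight
stable".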
Figure 1.2 Data entry screen for the molecule element of the Chemical Markup
Language.
Summary
• XML is a format for representing data.
1.3 Hierarchical Organization
Modern biology and medicine, like much of society, are currently faced with
overwhelming amounts of raw data being produced by new information-
gathering techniques. In a relatively short period of time information has
gone from being relatively scarce and expensive to being plentiful and inex-
pensive. As a consequence, the traditional methods for dealing with infor-
mation are overwhelmed by the sheer volume of information available. The
traditional methods were developed when information was scarce, and they
cannot handle the enormous scale of information.
The first and most natural reaction by people to this situation is to attempt
to categorize and classify. People are especially good at this task.
Figure 1.3 A BioML document showing some of the information about the human
insulin gene. Boxes were drawn around each XML element so that the hierarchi-
cal structure is more apparent. XML documents normally indicate the hierarchical
structure by successive indentation, as in this example.
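In outline, the nesting in such a document looks roughly like this (the element
names follow BioML loosely, and the attribute values are illustrative):
<organism name="Homo sapiens">
  <chromosome name="11">
    <locus name="insulin locus">
      <gene name="insulin">
        <note>annotation text goes here</note>
      </gene>
    </locus>
  </chromosome>
</organism>
Each element is contained in its parent element, which is how the hierarchy
in the figure is expressed.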
Summary
• XML elements are hierarchical: each element can contain other elements,
that in turn can contain other elements, and so on.
1.4 Creating and Updating XML
Figure 1.4 File management vs. XML document management. The image on the
left used the Windows file manager. It shows disk drives, folders, and files on a PC.
The image on the right used the Xerlin XML document editor. It shows the elements
of a single XML document.
Figure 1.5 Data entry screen for an element of an XML document. The window on
the left shows the hierarchical structure of the XML document in the same manner
as a file manager. A gene element is highlighted, indicating that this is the currently
open element. The attributes for the gene element are shown in the right window.
The window on the right acts like a data entry screen for viewing and updating the
attributes of the element.
>
<!ELEMENT atom EMPTY>
<!ATTLIST atom
elementType CDATA #IMPLIED
title CDATA #IMPLIED
id CDATA #IMPLIED
convention CDATA "CML"
dictRef CDATA #IMPLIED
count CDATA "1"
>
<!ELEMENT bondArray (bond+)>
<!ATTLIST bondArray
title CDATA #IMPLIED
id CDATA #IMPLIED
convention CDATA "CML"
>
<!ELEMENT bond EMPTY>
<!ATTLIST bond
title CDATA #IMPLIED
id CDATA #IMPLIED
convention CDATA "CML"
dictRef CDATA #IMPLIED
atomRefs CDATA #IMPLIED
>
Figure 1.6 Part of the Chemical Markup Language DTD. This part defines the con-
tent and some of the attributes of a molecule element as well as the content and
some of the attributes of elements that can be contained in a molecule element.
<!ENTITY % title_id_conv ’
title CDATA #IMPLIED
id CDATA #IMPLIED
convention CDATA "CML" ’>
<!ENTITY % title_id_conv_dict ’
%title_id_conv;
dictRef CDATA #IMPLIED ’>
Figure 1.7 Part of the Chemical Markup Language DTD. This DTD uses entities to
simplify the DTD in figure 1.6.
• Entities can be used to build a large DTD from smaller files. The entities
in this case refer to the files being incorporated rather than to the actual
value of the entity. Such an entity would be defined like this:
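For instance, a parameter entity named dtd1 that refers to the file ml.dtd
can be declared as follows:
<!ENTITY % dtd1 SYSTEM "ml.dtd">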
To include the contents of the file ml.dtd, one writes %dtd1; in the DTD.
One can use a URL instead of a filename, in which case the DTD informa-
tion will be obtained from an external source.
<?xml version="1.0"?>
<!DOCTYPE ExperimentSet SYSTEM "experiment.dtd"
[
<!ENTITY experiment1 SYSTEM "experiment1.xml">
<!ENTITY experiment2 SYSTEM "experiment2.xml">
<!ENTITY experiment3 SYSTEM "experiment3.xml">
<!ENTITY experiment4 SYSTEM "experiment4.xml">
<!ENTITY experiment5 SYSTEM "experiment5.xml">
]>
<ExperimentSet>
&experiment1;
&experiment2;
&experiment3;
&experiment4;
&experiment5;
</ExperimentSet>
Note that entities used within documents use the ampersand rather than
the percent sign. This example is considered again in section 11.6 where
it is discussed in more detail.
When one is editing an XML document, the DTD assists one to identify
the attributes and elements that need to be provided. Figure 1.5 shows the
BioML insulin gene document. The “directory” structure is on the left, and
the attributes are on the right. In this case a gene element is open, and so the
attributes for the gene element are displayed. To enter or update an attribute,
click on the appropriate attribute and use the keyboard to enter or modify the
attribute’s value. When an attribute has only a limited list of possible values,
then one chooses the desired value from a menu. Attributes are specified
in the same manner as one specifies fields of a record in a database using a
traditional “data entry” screen. An XML document is effectively an entire
database with one table for every kind of element.
In addition to attributes, an XML element can have text. This is often re-
ferred to as its text content to distinguish it from the elements it can contain.
In an XML editor, the text content is shown as if it were another child ele-
ment, but labeled with #text. It is also shown as if it were another attribute,
also labeled with #text.
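For example, in the following element (the content is illustrative), the text
between the start tag and the end tag is its text content:
<note>sequence verified manually</note>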
Figure 1.8 The process of adding a new element to an XML document. The menus
shown were obtained by right-clicking on the gene element and then selecting the
Add choice. The menu containing dna, name, and so on shows the elements that are
allowed in the gene element.
Summary
• XML documents are examined and updated by taking advantage of the
hierarchical structure.
• The XML DTD assists in updating a document by giving clues about what
attributes need to be entered as well as what elements need to be added.
Figure 1.9 Adding a new element to an XML document. A note element has been
chosen to be added to a gene element.
1.5 The Meaning of a Hierarchy
Shared concepts are part of what makes it
possible for people to communicate with each other. Individuals must have a
shared conceptual framework in order to communicate, but communication
requires more than just a shared conceptualization; it is also necessary for
the concepts to have names, and these names must be known to the two
individuals who are communicating.
Biochemistry has a rich set of concepts ranging from very generic notions
such as chemical to exquisitely precise notions such as Tumor necrosis factor
alpha-induced protein 3. Concepts are typically organized into hierarchies to
capture at least some of the relationships between them. XML document
hierarchies are a means by which one can represent such hierarchical organi-
zations of knowledge.
Figure 1.10 Result of adding a new element to an XML document. A note element
has been added to a gene element. The active element is now the note element, and
its attributes appear in the window on the right side.
Aristotle (384-322 BC) was the first to understand the difficulty of cat-
egorizing living organisms into classes according to their anatomical and
physiological characteristics (Asimov 1964). Since then, this tradition of clas-
sification has been one of the major themes in science. Figure 1.11 illustrates
a hierarchy of chemicals taken from EcoCyc (EcoCyc 2003). For example,
protein is more specific than chemical, and enzyme is more specific than pro-
tein. Classifications that organize concepts according to whether concepts
are more general or more specific are called taxonomies by analogy with bio-
logical classifications into species, genera, families, and so on.
Hierarchies are traditionally obtained by starting with a single all-inclu-
sive class, such as “living being,” and then subdividing into more specific
subclasses based on one or more common characteristics shared by the mem-
bers of a subclass. These subclasses are, in turn, subdivided into still more
specialized classes, and so on, until the most specific subclasses are identi-
fied. We use this technique when we use an outline to organize a task: the
most general topic appears first, at the top of the hierarchy, with the more
specialized topics below it. Constructing a hierarchy by subdivision is often
called a “top-down” classification.
An alternative to the top-down technique is to start with the most specific
classes. Collections of the classes that have features in common are grouped
together to form larger, more general, classes. This is continued until one
collects all of the classes together into a single, most general, class. This ap-
proach is called “bottom-up” classification. This is the approach that has
been used in the classification of genes (see figure 1.12). Whether one uses a
top-down or a bottom-up approach, the assumption that a category is defined
by characteristics shared by all of its members was not ques-
tioned until relatively recently, and is still commonly accepted. By the middle
of the nineteenth century, scholars began to question the implicit assump-
tions underlying taxonomic classification. Whewell, for example, discussed
classification in science, and observed that categories are not usually speci-
fiable by shared characteristics, but rather by resemblance to what he called
“paradigms” (Whewell 1847). This theory of categorization is now called
“prototype theory.” A prototype is an ideal representative of a category from
which other members of the category may be derived by some form of modi-
fication. One can see this idea in the classification of genes, since they evolve
via mutation, duplication, and translocation (see figure 1.13). Wittgenstein
further elaborated on this idea, pointing out that various items included in a
category may not have one set of characteristics shared by all, yet given any
two items in the category one can easily see their common characteristics and
understand why they belong to the same category (Wittgenstein 1953). Witt-
genstein referred to such common characteristics as “family resemblances,”
because in a family any two members will have some resemblance, such as
the nose or the eyes, so that it is easy to see that they are related, but there
may be no one feature that is shared by all members of the family. Such a cat-
egorization is neither top-down nor bottom-up, but rather starts somewhere
in the middle and goes up and down from there.
This is especially evident in modern genetics. Genes are classified both
by function and by sequence. The two approaches interact with one another
in complex ways, and the classification is continually changing as more is
learned about gene function. Figure 1.12 shows some examples of the clas-
sification of genes into families and superfamilies. The superfamily is used
to describe a group of gene families whose members have a common evolu-
tionary origin but differ with respect to other features between families. A
gene family is a group of related genes encoding proteins differing at fewer
than half their amino acid positions. Within each family there is a structure
that indicates how closely related the genes are to one another. For exam-
ple figure 1.13 shows the evolutionary structure of the nuclear receptor gene
family. The relationships among the various concepts are complex, including
evolution, duplication, and translocation.
The hierarchies shown in figures 1.11, 1.12, and 1.13 are very different from
one another due to the variety of purposes represented in each case. The
chemical hierarchy in figure 1.11 is a specialization/generalization hierarchy.
The relationship here is called subclass because mathematically it represents
a subset relationship between the two concepts.
Figure 1.12 Some gene families. The first row below Gene in this classification con-
sists of superfamilies. The row below that contains families. Below the families are
some individual genes. See (Cooper 1999), Chapter 4.
The gene families and su-
perfamilies in figure 1.12 are also related by the subclass relationship, but the
individual genes shown in the diagram are members (also called instances)
of their respective families rather than being subsets. However, the nuclear
receptor gene diagram in figure 1.13 illustrates that the distinction between
subclass and instance is not very clear-cut, as the entire superfamily evolved
from a single ancestral gene. In any case, the relationships in this last dia-
gram are neither subclass nor instance relationships but rather more complex
relationships such as: evolves by mutation, duplicates, and translocates.
Although hierarchical classification is an important method for organiz-
ing complex information, it is not the only one in common use. Two other
techniques are partitioning and self-organizing maps. Both of these can be re-
garded as classification using attribute values rather than hierarchical struc-
tures. In partitioning, a set of entities is split into a specified number of subsets
(MacQueen 1967). A self-organizing map is mainly used when a large num-
Figure 1.13 The human nuclear receptor gene superfamily. A common ancestor
evolved into the three gene families. Unlabeled arrows represent evolution over time.
Labeled arrows indicate translocation between families or subfamilies. See (Cooper
1999), Figure 4.28.
Summary
• Classifications can be constructed top-down, bottom-up, or from the mid-
dle.
1.6 Relationships
Of course, most titles are like this, and the abstract quickly clears up the
confusion. However, it does point out how important such connecting
phrases can be to the meaning of a document. These are called relationships,
and they are the subject of this section.
The organization of concepts into hierarchies can capture at least some
of the relationships between them, and such a hierarchy can be represented
using an XML document hierarchy. The relationship in an XML document
between a parent element and one of its child elements is called containment
because elements contain each other in the document. However, the actual
relationship between the parent element and child element need not be a
containment. For example, it is reasonable to regard a chromosome as con-
taining a set of locus elements because a real chromosome actually does
contain loci. Similarly, a gene really does contain exons, introns, and do-
mains. However, the relationship between a gene and a reference is not
one of containment, but rather the referral or citation relationship.
One of the disadvantages of XML is that containment is the only way to
relate one element to another explicitly. The problem is that all the various
kinds of hierarchy and various forms of relationship have to be represented
using containment. The hierarchy in figure 1.13 does not use any relation-
ships that could reasonably be regarded as being containment. Yet, one must
use the containment relationship to represent this hierarchy. The actual rela-
tionship is therefore necessarily implicit, and some auxiliary, informal tech-
nique must be used to elucidate which relationship is intended.
Unfortunately, this is not a small problem. One could not communicate
very much if all one had were concepts and a single kind of relationship.
Relating concepts to each other is fundamental. Linguistically, concepts are
usually represented by nouns and relationships by verbs. Because relation-
ships relate concepts to concepts, the linguistic notion of a simple sentence,
with its subject, predicate, and object, represents a basic fact. The subject and
object are the concepts and the predicate is the relationship that links them.
One can specify relationships in XML, but there are two rather different
ways that this can be done, and neither one is completely satisfactory. The
first technique is to add another “layer” between elements that specifies the
relationship. This is called striping. A BioML document could be represented
using striping, as in figure 1.14. If one consistently inserts a relationship
element between parent and child concept elements, then one can unam-
biguously distinguish the concept elements from the relationship elements.
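A fragment in the spirit of figure 1.14 might look like this, with a relationship
element (here given the illustrative name has_reference) inserted between a
gene and the reference it cites:
<gene name="insulin">
  <has_reference>
    <reference label="citation 1"/>
  </has_reference>
</gene>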
Figure 1.14 Using striping to represent relationships involving the human insulin
gene. The shaded elements in the figure are the relationships that link a parent ele-
ment to its child elements.
Figure 1.15 The use of references to specify a bond between two atoms in a
molecule. The arrows show the atoms that are being referenced by the bond element.
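For example, in CML a bond element does not contain its atoms; it refers to
them through its atomRefs attribute, roughly as follows (a sketch based on
figure 1.6; the atomArray element and the id values are assumed here for
illustration):
<molecule id="m1" count="1">
  <atomArray>
    <atom id="a1" elementType="N"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs="a1 a2"/>
  </bondArray>
</molecule>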
Summary
• Relationships connect concepts to each other.
• RDF and languages based on it allow one to use either striping or refer-
ences interchangeably.
1.7 Namespaces
So far, all of the examples of XML documents used a single DTD. It is becom-
ing much more common to use several DTDs in a single document. This has
the important advantage that markup vocabulary that is already available
can be reused rather than being invented again. However, simply merging
the vocabularies of multiple DTDs can have undesirable consequences, such
as:
• The same term can be used in different ways. For example, “locus” is an
attribute in the Bioinformatic Sequence Markup Language (BSML), but it
is an element in BioML.
• The same term can have different meanings. This is especially true of
commonly occurring terms such as “value” and “label.”
• The same term might have the same use and meaning, but it may be con-
strained differently. For example, the “Sequence” element occurs in sev-
eral DTDs and has the same meaning, but the content and attributes that
are allowed will vary.
XML namespaces address these problems. Each markup vocabulary is identified
by a URI, and a prefix is associated with that URI by a namespace declaration
such as the following:
xmlns:cmlr="http://www.xml-cml.org/schema/cml2/react"
xmlns:sbml="http://www.sbml.org/sbml/level2"
These declarations are attributes that can be added to any element, but
they are most commonly added to the root element. Once the prefixes have
been declared, one can use the prefixes for elements and for attributes. For
example, the following document mixes CML, BioML and SBML terminol-
ogy:
<bioml:organism
xmlns:cml="http://www.xml-cml.org/schema/cml2/core"
xmlns:cmlr="http://www.xml-cml.org/schema/cml2/react"
xmlns:bioml="http://xml.coverpages.org/bioMLDTD-19990324.txt"
xmlns:sbml="http://www.sbml.org/sbml/level2"
>
<bioml:species>Homo sapiens</bioml:species>
<sbml:reaction sbml:id="reaction_1" sbml:reversible="false">
<sbml:listOfReactants>
<sbml:speciesReference sbml:species="X0"/>
</sbml:listOfReactants>
<sbml:listOfProducts>
<sbml:speciesReference sbml:species="S1"/>
</sbml:listOfProducts>
</sbml:reaction>
<cmlr:reaction>
<cmlr:reactantList>
<cml:molecule cml:id="r1"/>
</cmlr:reactantList>
<cmlr:productList>
<cml:molecule cml:id="p1"/>
</cmlr:productList>
</cmlr:reaction>
...
</bioml:organism>
The same document can also be written by making BioML the default namespace,
so that unprefixed element names refer to the BioML vocabulary:
<organism
xmlns="http://xml.coverpages.org/bioMLDTD-19990324.txt"
xmlns:sbml="http://www.sbml.org/sbml/level2"
xmlns:cml="http://www.xml-cml.org/schema/cml2/core"
xmlns:cmlr="http://www.xml-cml.org/schema/cml2/react"
>
<species>Homo sapiens</species>
<sbml:reaction sbml:id="reaction_1" sbml:reversible="false">
<sbml:listOfReactants>
<sbml:speciesReference sbml:species="X0"/>
</sbml:listOfReactants>
<sbml:listOfProducts>
<sbml:speciesReference sbml:species="S1"/>
</sbml:listOfProducts>
</sbml:reaction>
...
</organism>
Summary
• Namespaces organize multiple vocabularies so that they may be used at
the same time.
1.8 Exercises
1. Consider the following spreadsheet of sequence records:
element_id,sequence_id,organism_name,seq_length,type
U83302,MICR83302,Colaptes rupicola,1047,DNA
U83303,HSU83303,Homo sapiens,3460,DNA
U83304,MMU83304,Mus musculus,51,RNA
U83305,MIASSU833,Accipiter striatus,1143,DNA
Show how these records would be written as XML elements using the
bio_sequence tag.
2. For the spreadsheet in exercise 1.1 above, show the corresponding XML
DTD. The element_id attribute is a unique key for the element. As-
sume that all attributes are optional. The molecule type is restricted to the
biologically significant types of biopolymer.
3. A physical unit is, in general, composed of several factors. This was en-
coded in the relational table by using several records, one for each factor.
The microF_per_mm2 unit, for example, is the ratio of microfarads to
square millimeters.
This relational database table illustrates how several distinct concepts can
be encoded in a single relational table. In general, information in a re-
lational database about a single concept can be spread around several
records, and a single record can include information about several con-
cepts. This can make it difficult to understand the meaning of a relational
table, even when the relational schema is available.
Show how to design an XML document so that the information about the
two concepts (i.e., the physical units and the factors) in the table above is
separated.
4. This next relational database table defines some of the variables used in
the Fitzhugh-Nagumo model (Fitzhugh 1961; Nagumo 1962) for the trans-
mission of signals between nerve axons:
The physical units are the ones defined in exercise 3 above. Extend the
solution of that exercise to include the data in the table above.
5. Use an XML editor (such as Xerlin or XML Spy) to construct the examples
in the previous two exercises. Follow these steps:
<?xml version="1.0"?>
<!DOCTYPE model [
<!ELEMENT model (physical_unit*,component*)>
<!ELEMENT physical_unit (factor)*>
<!ATTLIST physical_unit name ID #REQUIRED>
<!ELEMENT factor EMPTY>
<!ATTLIST factor
prefix CDATA #IMPLIED
unit CDATA #REQUIRED
exponent CDATA "1">
<!ELEMENT component (variable)*>
<!ATTLIST component name ID #REQUIRED>
<!ELEMENT variable EMPTY>
<!ATTLIST variable
name CDATA #REQUIRED
initial CDATA #IMPLIED
physical_unit IDREF "dimensionless"
interface (in|out) #IMPLIED>
]>
<model/>
2 XML Semantics
2.1 The Meaning of Meaning
A well-known biomedical terminology system is the Unified
Medical Language System (UMLS). In the UMLS, “spinal tap” has concept
identifier C0553794. All terms with this same concept identifier are synony-
mous.
An ontology is a means by which the language of a domain can be formal-
ized (Heflin et al. 1999; Opdahl and Barbier 2000; Heflin et al. 2000; McGuin-
ness et al. 2000). As such, an ontology is a context within which the semantics
of terminology and of statements using the terminology are defined. On-
tologies define the syntax and semantics of concepts and of relationships
between concepts. Concepts are used to define the vocabulary of the do-
main, and relationships are used to construct statements using the vocabu-
lary. Such statements express known or at least possible knowledge whose
meaning can be understood by individuals in the domain. Representing
knowledge is therefore one of the fundamental purposes of an ontology.
Classic ontologies in philosophy are informally described in natural lan-
guage. Modern ontologies differ in having the ability to express knowledge
in machine-readable form. Expressing knowledge in this way requires that
it be represented as data. So it is not surprising that ontology languages and
data languages have much in common, and both kinds of language have
borrowed concepts from each other. As we saw in section 1.1, a database
schema can be regarded as a kind of ontology. Modern ontology languages
were derived from corresponding notions in philosophy. See the classic work
(Bunge 1977, 1979), as well as more recent work such as (Wand 1989; Guarino
and Giaretta 1995; Uschold and Gruninger 1996). Ontologies are fundamen-
tal for communication between individuals in a community. They make it
possible for individuals to share information in a meaningful way. Formal
ontologies adapt this idea to automated entities (such as programs, agents,
or databases). Formal ontologies are useful even for people, because infor-
mal and implicit assumptions often result in misunderstandings. Sharing
of information between disparate entities (whether people or programs) is
another fundamental purpose of an ontology.
It would be nice if there were just one way to define ontologies, but at the
present time there is not yet a universal ontology language. Perhaps there
will be one someday, but in the meantime, one must accept that there will be
some diversity of approaches. In this chapter and in chapter 4, we introduce
the diverse mechanisms that are currently available, and we compare their
features. The ontology languages discussed in chapter 4 make use of logic
and rules, so we introduce them in chapter 3.
Two examples are used throughout this chapter as well as chapter 4. The
first one is a simplified Medline document, and the second is the specification
for nitrous oxide using CML. The document-type definitions were highly
simplified in both cases. The simplified Medline document is figure 2.1. The
original Medline citation is to (Kuter 1999). The DTD being used is given
in figure 2.2. The nitrous oxide document is figure 2.3. The simplified CML
DTD being used is given in figure 1.6.
Figure 2.1 Example of part of a Medline citation using the Medline DTD.
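A citation of this kind, consistent with the DTD in figure 2.2 and using the
values that reappear in section 2.4, looks roughly like this:
<MedlineCitation Owner="NLM" Status="Completed">
  <MedlineID>99405456</MedlineID>
  <PMID>10476541</PMID>
  <DateCreated>
    <Year>1999</Year>
    <Month>10</Month>
    <Day>21</Day>
  </DateCreated>
  <ArticleTitle>Breast cancer highlights.</ArticleTitle>
</MedlineCitation>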
<!ELEMENT MedlineCitation
(MedlineID, PMID, DateCreated, ArticleTitle?)>
<!ATTLIST MedlineCitation
Owner CDATA "NLM"
Status (Incomplete|Completed) #REQUIRED>
<!ELEMENT MedlineID (#PCDATA)>
<!ELEMENT PMID (#PCDATA)>
<!ELEMENT DateCreated (Year, Month, Day)>
<!ELEMENT Year (#PCDATA)>
<!ELEMENT Month (#PCDATA)>
<!ELEMENT Day (#PCDATA)>
<!ELEMENT ArticleTitle (#PCDATA)>
2.2 Infosets
Although XML is not usually regarded as being an ontology language, it is
formally defined, so it certainly can be used to define ontologies. In fact, it
is currently the most commonly used and supported approach to ontologies
among all of the approaches considered in this book.
The syntax for XML is defined in (W3C 2001b). The structure of a docu-
ment is specified using a DTD as discussed in section 1.2. A DTD can be re-
garded as being an ontology. A DTD defines concepts (using element types)
and relationships (using the parent-child relationship and attributes). The
concept of a DTD was originally introduced in 1971 at IBM as a means of
specifying the structure of technical documents, and for two decades it was
seldom used for any other purpose. However, when XML was introduced,
there was considerable interest in using it for other kinds of data, and XML
has now become the preferred interchange format for any kind of data.
The formal semantics for XML documents is defined in (W3C 2004b). The
mathematical model is called an infoset. The mathematical model for the
XML document in figure 2.1 is shown in figure 2.4. The infoset model con-
sists of nodes (shown as rectangles or ovals) and relationship links (shown as
arrows). There are various types of nodes, but the two most common types
are element nodes and text nodes. There are two kinds of relationship link:
parent-child link and attribute link. Every infoset model has a root node. For
an XML document, the root node has exactly one child node, but infosets
in general can have more than one child node of the root, as, for example,
when the infoset represents a fragment of an XML document or the result of
a query.
Figure 2.4 XML data model for a typical Medline citation. Element nodes are shown
using rectangles, text nodes are shown using ovals, child links are unlabeled, and
attributes are labeled with the attribute name.
The order of child elements is significant in XML. Suppose, for example, that
the MedlineID and PMID elements in figure 2.1 were written in the opposite
order. The corresponding infoset is shown in figure 2.5 and differs from that in
figure 2.4 in only one way: the MedlineID and PMID child nodes have been
reversed. These two infosets are different.
By contrast, the attribute links can be in any order. For example, suppose
that the attributes of the MedlineCitation element were reversed as follows:
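With the attribute values used above, the start tag would then read:
<MedlineCitation Status="Completed" Owner="NLM">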
The corresponding infoset is shown in figure 2.6 and differs from that in
figure 2.4 in only one way: the owner and status links have been reversed.
These two infosets are the same.
Figure 2.5 XML data model for a Medline citation in which the MedlineID and
PMID nodes are in the opposite order.
This example illustrates that the semantics of XML does not always cor-
rectly capture the semantics of the domain. In this case, the XML documents
in which the PMID and MedlineID elements have been reversed have a dif-
ferent meaning in XML but are obviously conveying the same information
from the point of view of a bibliographic citation. One can deal with this
problem by specifying in the DTD that these two elements must always ap-
pear in one order. In this case, the MedlineID element must occur before the
PMID element.
Figure 2.6 XML data model for a Medline citation in which the status and owner
attributes are in the opposite order.
The infoset for the nitrous oxide document in figure 2.3 is shown in fig-
ure 2.7. If the first two atom elements were reversed the infoset would be
as in figure 2.8. These two infosets are different. However, from a chemical
point of view, the molecules are the same. This is another example of a clash
between the semantics of XML and the semantics of the domain. Unlike the
previous example, there is no mechanism in XML for dealing with this ex-
ample because all of the child elements have the same name (i.e., they are
all atom elements). So one cannot specify in the DTD that they must be in a
particular order. One also cannot specify that the order does not matter.
Summary
• An XML DTD can be regarded as an ontology language.
Figure 2.7 XML data model for a molecule represented using CML.
2.3 XML Schema
One limitation of DTDs is that there are only a few data types. Because of this
limitation, nearly all attributes are defined to have type CDATA, that is, ordinary
text, more commonly known as strings. Important types such as numbers, times,
and dates cannot be specified.
Figure 2.8 XML data model for the same molecule as in figure 2.7 except that the
two atoms have been reversed.
XML Schema (XSD), by contrast, allows one to define new simple data types for
specialized needs. For example, one could define a DNA sequence to be text
containing only the letters A, C, G, and T.
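A sketch of such a datatype in XSD, using the pattern facet (the type name is
illustrative):
<simpleType name='dnaSequence'>
  <restriction base='string'>
    <pattern value='[ACGT]*'/>
  </restriction>
</simpleType>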
There is a tool written in Perl, called dtd2xsd.pl (W3C 2001a) that trans-
lates DTDs to XML schemas. However, one must be cautious when using
this tool. It does not support all of the features of XML DTDs. For example, con-
ditional sections are not supported. As one of the authors pointed out, “It
is worth pointing out that this tool does not produce terribly high quality
schemas, but it is a decent starting point if you have existing DTDs.” When
one is using this tool one must manually check that the translation is correct.
One can then enhance the schema to improve the semantics using features of
XSD that are not available in DTDs.
Applying the dtd2xsd.pl program to figure 2.2 gives the XML schema
shown below. The XML schema is considerably longer than the DTD. We
leave it as an exercise to do the same for the molecule DTD in figure 1.6.
<schema
xmlns=’http://www.w3.org/2000/10/XMLSchema’
targetNamespace=’http://www.w3.org/namespace/’
xmlns:t=’http://www.w3.org/namespace/’>
<element name=’MedlineCitation’>
<complexType>
<sequence>
<element ref=’t:MedlineID’/>
<element ref=’t:PMID’/>
<element ref=’t:DateCreated’/>
<element ref=’t:ArticleTitle’
minOccurs=’0’ maxOccurs=’1’/>
</sequence>
<attribute name=’Owner’ type=’string’
use=’default’ value=’NLM’/>
<attribute name=’Status’ use=’required’>
<simpleType>
<restriction base=’string’>
<enumeration value=’Incomplete’/>
<enumeration value=’Completed’/>
</restriction>
</simpleType>
</attribute>
</complexType>
</element>
<element name=’MedlineID’>
<complexType mixed=’true’>
</complexType>
</element>
<element name=’PMID’>
<complexType mixed=’true’>
</complexType>
</element>
<element name=’DateCreated’>
<complexType>
<sequence>
<element ref=’t:Year’/>
<element ref=’t:Month’/>
<element ref=’t:Day’/>
</sequence>
</complexType>
</element>
<element name=’Year’>
<complexType mixed=’true’>
</complexType>
</element>
<element name=’Month’>
<complexType mixed=’true’>
</complexType>
</element>
<element name=’Day’>
<complexType mixed=’true’>
</complexType>
</element>
<element name=’ArticleTitle’>
<complexType mixed=’true’>
</complexType>
</element>
</schema>
The XML schema shown above has exactly the same meaning as the DTD.
Having translated this DTD to XSD, one can make use of features of XSD that
are not available in a DTD. Some examples of these features are shown in the
next section.
Abstract Syntax Notation One (ASN.1) is another mechanism for encoding
hierarchically structured data. The development of ASN.1 goes back to 1984,
and it was a mature standard by 1987. It is mainly used in telecommunica-
tions, but it is also being used in other areas, including biomedical databases.
ASN.1 and XSD have similar capabilities and semantics. The main difference
is that ASN.1 allows for much more efficient encoding than XML. XER (the
XML Encoding Rules) provides an XML representation of ASN.1 data, and the
xsdasn1 script translates from XSD to ASN.1. Both XER and xsdasn1 are
available at asn1.elibel.tm.fr.
Summary
• XSD adds additional data-type and data-structuring features to XML.
2.4 XML Data
XSD makes it possible to constrain elements and attributes more precisely than
a DTD can. For example, the content of the Day element can be restricted to
be a positive integer that is at most 31:
<element name=’Day’>
<simpleType>
<xsd:restriction base=’xsd:positiveInteger’>
<xsd:maxInclusive value=’31’/>
</xsd:restriction>
</simpleType>
</element>
One can make similar restrictions for the Year and Month elements. How-
ever, this still does not entirely capture all possible restrictions. For example,
it would allow February to have 31 days. As it happens, there is an XML
datatype for a date which includes all restrictions required for an arbitrary
calendar date. To use this datatype, replace the Year, Month, and Day ele-
ments with the following:
<element name=’DateCreated’ type=’xsd:date’/>
Using this approach, the Medline citation in figure 2.1 would look like this:
<MedlineCitation Owner="NLM" Status="Completed">
<MedlineID>99405456</MedlineID>
<PMID>10476541</PMID>
<DateCreated>1999-10-21</DateCreated>
<ArticleTitle>Breast cancer highlights.</ArticleTitle>
</MedlineCitation>
The semantics of an XML datatype is given in three parts:
1. The lexical space is the set of strings that are allowed by the datatype. In
other words, the kind of text that can appear in an attribute or element
that has this type.
2. The value space is the set of abstract values being represented by the strings.
Each string represents exactly one value, but one value may be repre-
sented by more than one string. For example, 6.3200 and 6.32 are different
strings but they represent the same value. In other words, two strings have
the same meaning when they represent the same value.
3. A set of facets that determine what operations can be performed on the
datatype. For example, a set of values can be sorted only if the datatype
has the ordered facet.
For some datatypes, the lexical space and value space coincide, so what one
sees is what it means. However, for most datatypes there will be multiple
representations of the same value. When this is the case, each value will
have a canonical representation. Since values and canonical representations
correspond exactly to each other, in a one-to-one fashion, it is reasonable to
think of the canonical representation as being the meaning.
XSD includes over 40 built-in datatypes. In addition, one can construct new
datatypes based on the built-in ones. The built-in datatypes that are the most
useful in bioinformatics applications include strings (string), numbers (such as
integer and decimal), dates and times (date, dateTime), and resource
references (anyURI).
There are three ways to construct a new datatype from other datatypes:
1. Restriction. One can restrict the lexical space or the value space of an
existing datatype, as was done for the Day element above.
2. Union. One can combine the set of values of several datatypes. This is
handy for adding special cases to another datatype. For example, the following
allows the DateCreated element to contain either a date or the text N/A:
<element name=’DateCreated’>
<simpleType>
<xsd:union memberTypes=’xsd:date’>
<xsd:simpleType>
<xsd:restriction base=’xsd:string’>
<enumeration value=’N/A’/>
</xsd:restriction>
</xsd:simpleType>
</xsd:union>
</simpleType>
</element>
3. List. One can define a datatype whose values are whitespace-separated
lists of values of another datatype.
Summary
• XSD provides built-in datatypes for the most commonly used purposes,
such as strings, numbers, dates, times, and resource references (URIs).
2.5 Exercises
1. Convert the molecule DTD shown in figure 1.6 to an XML schema.
2. Revise the molecule schema in exercise 2.1 above so that the elementType
attribute can only be one of the standard abbreviations of the 118 currently
known elements in the periodic table.
3. Define a simple datatype for a single DNA base. Hint: Use an enumera-
tion as in exercise 2.2 above.
The sequence is divided into groups of 60 bases, and these groups are
divided into subgroups of 10 bases. A number follows each group of 60
bases. The letter n is used when a base is not known.
3 Rules and Inference
Rules and inference are important for tackling the challenges in bioinformat-
ics. For example, consider the Biomolecular Interaction Network Database (BIND).
The problem of defining interactions is very complex, and interactions must
be obtained from several sources, such as the Protein Data Bank (PDB), met-
abolic/regulatory pathways, or networks. Rules can be used to model and
query these interaction networks.
Every rule has two parts:
1. The pattern, also called the antecedent or the hypothesis. This part of the rule
specifies the match condition.
2. The action, consequent, or conclusion. This part of the rule specifies the
effect that is exerted when the match condition holds.
A rule can be regarded as a logical statement of the form “if the match con-
dition holds, then perform the action.” When considered from this point of
view, the match condition is a “hypothesis,” and the action is a “conclusion.”
Just as in organisms, match conditions range from being very precise to being
very generic.
The condition of a rule is a Boolean combination of elementary facts, each
of which may include constants as well as one or more variables. A query is
essentially a rule with no conclusion, just a condition. At the other extreme, a
fact is a rule with no condition, just a conclusion. The result of a query is the
set of assignments to the variables that cause the rule to fire. From the point
of view of relational databases, a query can be regarded as a combination
of selections, projections, and joins. The variables in a rule engine query
correspond to the output variables (i.e., the projection) and join conditions of
a relational query. The constants occurring in a rule engine query correspond
to the selection criteria of a relational query. Both rule engines and relational
databases support complex Boolean selection criteria.
When the match condition of a rule is found to hold and the consequent
action is performed, the rule is said to have been “invoked” or “fired.” The
firing of a rule affects the environment, and this can result in the firing of
other rules. The resulting cascade of rule firings is what gives rule-based
systems their power. By contrast, the most common programming style (the
so-called procedural programming or imperative style) does not typically
have such a cascading effect.
Rule-based inferencing has another benefit. Rules express the meaning
of a program in a manner that can be much easier to understand. Each rule
should stand by itself, expressing exactly the action that should be performed
in a particular situation. In principle, each rule can be developed and verified
independently, and the overall system will function correctly provided only
that it covers all situations. Unfortunately, rules can interact in unexpected
ways, so that building a rule-based system is not as simple as one might sup-
pose. The same is true in organisms, and it is one of the reasons why it is so
difficult to understand how they function.
Rules have been used as the basis for computer software development for
a long time. Rule-based systems have gone by many names over the years.
About a decade ago they were called “expert systems,” and they attracted
a great deal of interest. While expert systems are still in use, they are no
longer as popular today. The concept is certainly a good one, but the field
suffered from an excess of hubris. The extravagantly optimistic promises led
to equally extreme disappointment when the promises could not be fulfilled.
Today it is recognized that rules are only one part of any knowledge-based
system, and it is important to integrate rules with many other techniques.
The idea that rules can do everything is simply unreasonable.
The process of using rules to deduce facts is called inference or reason-
ing, although these terms have many other meanings. Systems that claim
to use reasoning can use precise (i.e., logical) reasoning or various degrees
of imprecise reasoning (such as "heuristic" reasoning, case-based reasoning,
probabilistic reasoning, and many others). This chapter focuses on logical
reasoning. In chapter 13 logical inference is compared and contrasted with
scientific inference.
Logical reasoners act upon a collection of facts and logical constraints (usu-
ally called axioms) stored in a knowledge base. Rules cause additional facts to
be inferred and stored in the knowledge base. Storing a new fact in the know-
ledge base is called assertion. The most common action of a rule is to assert
one or more facts, but any other action can be performed.
Many kinds of systems attempt automated human reasoning. A system
that evaluates and fires rules is called a rule engine, but there are many other
kinds of automated reasoning systems, among them:
3. theorem provers,
4. constraint solvers,
7. translators,
8. miscellaneous systems.
These are not mutually exclusive categories, and some systems support more
than one style of reasoning. We now discuss each of these categories in detail,
and then give a list of some of the available software for automated reason-
ing.
Summary
• Rule-based programming is a distinct style from the more common pro-
cedural programming style.
• Rule engines logically infer facts from other facts, and so are a form of
automated reasoning system.
• There are many other kinds of reasoning system such as theorem provers,
constraint solvers, and business rule systems.
there must be a mechanism to prevent rules from firing endlessly on the same
facts. A rule is normally only invoked once on a particular set of facts that
match the rule. When the rule engine finds that no new facts can be inferred,
it stops. At that point one can query the knowledge base.
Backward-chaining rule engines are much harder to understand. They
maintain a knowledge base of facts, but they do not perform all possible
inferences that a forward-chaining rule engine would perform. Rather, a
backward-chaining engine starts with the query to be answered. The engine
then tries to determine whether it is already known (i.e., it can be answered
with known facts in the knowledge base). If so, then it simply retrieves the
facts. If the query cannot be answered with known facts, then it examines
the rules to determine whether any one of them could be used to deduce the
answer to the query. If there are some, then it tries each one. For each such
rule, the rule engine tries to determine whether the hypothesis of the rule is
true. It does this the same way as it does for answering any query: the en-
gine first looks in the knowledge base and then the engine tries to deduce it
by using a rule.
Thus a backward-chaining rule engine is arguing backward from the de-
sired conclusion (sometimes called the “goal”) to the known facts in the
knowledge base. In contrast with the forward-chaining technique that match-
es the hypothesis and then performs the corresponding action, a backward-
chaining engine will match the conclusion and then proceed backward to the
hypothesis. Actions are performed only if the hypothesis is eventually veri-
fied. Rules are invoked only if they are relevant to the goal. Thus actions that
would be performed by a forward-chaining engine might not be performed
by a backward-chaining engine. On the other hand, actions that would be
performed just once by a forward-chaining engine could be performed more
than once by a backward-chaining engine.
The best-known example of a backward-chaining rule engine is the Prolog
programming language (Clocksin et al. 2003). However, there are many oth-
ers, especially commercial business rule engines, which are discussed later
in this chapter.
Backward chainers have some nice features. Because of their strong focus
on a goal, they only consider relevant rules. This can make them very fast.
However, they also have disadvantages. They are much more prone to infi-
nite loops than forward-chaining engines, and it is difficult to support some
forms of reasoning such as paramodulation, which is needed by OWL on-
tologies (see section 4.4). Programming in backward-chaining mode is also
counterintuitive. As a result it takes considerable skill to do it well com-
pared with programming a forward-chaining engine.
Summary
• Both forward- and backward-chaining rule engines require a set of rules
and an initial knowledge base of facts.
tionships. Most biological concepts can be defined using DLs, and they allow
limited forms of reasoning about biological knowledge (Baker et al. 1999).
However, not all concepts can be defined using DLs, and many forms of rea-
soning cannot be expressed in this framework. Database joins, for example,
cannot be expressed using DLs. A DL reasoner will be very efficient, but the
limitations of DL reasoners can be too severe for many application domains.
This efficiency leads to another problem: it is difficult to extract the reasons
for conclusions made by the reasoner. Consequently, DL reasoners provide
little feedback in tasks such as consistency checking.
The Web Ontology Language (OWL) was discussed in section 1.6 and will
be covered in detail in section 4.4. OWL has three language levels, depending
on what features are supported. The lowest level, OWL Lite, has the mini-
mum number of features that are necessary for specifying ontologies. The
intermediate level, OWL-DL, has more features than OWL Lite, but still has
some restrictions. The restrictions were chosen so that OWL-DL ontologies
could be processed using a DL reasoner. The highest level, OWL Full, has no
restrictions. The OWL Full level cannot be processed by a DL reasoner, and
one must use a theorem prover.
Business rule systems can be classified as rule engines (and some of them
are excellent in this regard). However, they tend to emphasize ease of use via
graphical user interfaces (GUIs) rather than support for underlying function-
ality. They are intended to be used by individuals who do not have a back-
ground in logic or reasoning systems. Business rule systems are nearly al-
ways proprietary, and their performance is usually relatively poor, although
there are exceptions. Typically the rule system is only part of a larger sys-
tem, so the poor performance is effectively masked by the other activities
occurring at the same time. Web portal servers often contain a business rule
system. Some business rule systems have full support for ontologies, most
commonly ontologies expressed in RDF or OWL.
Many systems simply translate from one language to another one, perform
the reasoning using a different system, and then translate back to the original
language. The advantage is flexibility. The disadvantage is that they can be
much less efficient than systems that are optimized for the target language.
Translators are commonly used for processing ontologies.
Many other kinds of reasoning system exist, such as Boolean constraint
solvers and decision support systems. These may be regarded as optimized
reasoners (just as a DL reasoner is an optimized specialization of a theorem
prover). However, such reasoners are generally much too limited for pro-
cessing ontologies.
Summary
• Theorem provers prove theorems.
3.4 Performance of Automated Reasoners
1. Constraint solvers
2. Description logic
3. Business rule systems
support for dynamic knowledge bases, but not as much as a Rete forward
chainer. The Rete-based systems are especially well suited for knowledge
bases that both add data (and rules) and retract them.
It is important to bear in mind that a system in one class can be used to
perform reasoning that is normally associated with another class. Prolog, for
example, is a general-purpose programming language, so one can, in prin-
ciple, implement a theorem prover or a DL reasoning system in Prolog (and
this is commonly done). However, by itself, Prolog is not a theorem prover.
Summary
• Automated reasoners use specialized indexes to improve performance.
4 The Semantic Web and Bioinformatics Applications
Many people have had the experience of suddenly realizing that two of their
acquaintances are actually the same person, although it usually is not as dra-
matic as it was for the main characters in the movie You’ve Got Mail. The
other kind of identity confusion is considerably more sinister: two persons
having the same identity. This is a serious problem, known as identity theft.
The issue of whether two entities are the same or different is fundamental to
semantics.
Addressing logical issues such as whether two entities are the same re-
quires substantially more powerful reasoning capabilities than XML DTDs
or schemas provide. Someday, automated reasoners and expert systems may
be ubiquitous on the web, but at the moment they are uncommon. The web
is a powerful medium, but it does not yet have any mechanism for rules
and inference. Tim Berners-Lee, the director of the World Wide Web Consor-
tium, has proposed a new layer above the web which would make all of this
possible. He calls this the Semantic Web (Miller et al. 2001).
The World Wide Web is defined by a language, the Hypertext Markup Lan-
guage (HTML), and an Internet protocol for using the language, the Hyper-
text Transfer Protocol (HTTP). In the same way, the Semantic Web is defined
by languages and protocols. In this chapter, we introduce the languages of
the Semantic Web and explain what they mean.
Biologists use the web heavily, but the web is geared much more toward hu-
man interaction than automated processing. While the web gives biologists
access to information, it does not allow users to easily integrate different
data sources or to incorporate additional analysis tools. The Semantic Web
is intended to make this kind of integration possible.
As another example, consider a biologist who has just found a novel DNA
sequence from an Anopheles clone which may be important in the devel-
opmental process. To find related sequences, the biologist runs a blastn
search based on a set of requirements (e.g., the sequence identities must be
over 60% and the E-value must be less than 10^-10). These requirements can
be captured as rules and constraints which could be taken into account by
an online Semantic Web–enabled sequence comparison service. If the re-
searcher found a number of significantly similar sequences in Drosophila, the
scientist could then obtain gene expression data for the relevant genes from
a Semantic Web–enabled gene expression database. Rules can then be spec-
ified which capture the interesting expression profiles, such as genes which
are highly expressed at specified time points in the developmental process.
In both of these examples, the activities can, in principle, be carried out
manually by the researcher. The researcher reads material, selects the rele-
vant data, copies and pastes from the web browser, and then struggles with
diverse formats, protocols, and applications. Constraints and rules can be
enforced informally by manually selecting the desired data. All of this is
tedious and error-prone, and the amount of data that can be processed this
way is limited. The Semantic Web offers the prospect of addressing these
problems.
Summary
The Semantic Web addresses two important problems in bioinformatics:
1. The dramatic increase of bioinformatics data available in web-based sys-
tems and databases calls for novel processing methods.
4.2 The Resource Description Framework
Hypertext links are one of the most important features that the World
Wide Web adds to the underlying Internet. If one regards hypertext links
as defining relationships between resources, then the World Wide Web was
responsible for adding relationships to the resources that were already avail-
able on the Internet prior to the introduction of the web. Indeed, the name
“World Wide Web” was chosen because its purpose was to link together the
resources of the Internet into an enormous web of knowledge. However, as
we discussed in section 1.6, for relationships to be meaningful, they must be
explicit. As stated by Wittgenstein in Proposition 3.3 of (Wittgenstein 1922),
“Only the proposition has sense; only in the context of a proposition has a
name meaning.” Unfortunately, hypertext links by themselves do not con-
vey any meaning. They do not explicitly specify the relationship between the
two resources that are linked.
The Semantic Web is a layer above the World Wide Web that adds meaning
to hypertext links. In other words, the Semantic Web makes hypertext links
into ontological relationships. The Semantic Web is a means for introducing
formal semantics to the World Wide Web. All reasoning in the Semantic Web
is formal and rigorous. The Semantic Web is defined by a series of progres-
sively more expressive languages and recommendations of the World Wide
Web Consortium. The first of these is the Resource Description Framework
(RDF) (Lassila and Swick 1999) which is introduced in this section. RDF is
developing quickly (Decker et al. 1998), and there are now many tools and
products that can process RDF. In section 4.4 we introduce the Web Ontology
Language (OWL) which adds many new semantic features to RDF.
As the name suggests, RDF is a language for representing information
about resources in the World Wide Web. It is particularly intended for repre-
senting annotations about web resources, such as the title, author, and mod-
ification date of a webpage. However, RDF can also be used to represent
information about anything that can be identified on the web, even when
it cannot be directly retrieved. Thus one could use URIs to represent dis-
eases, genes, universities, and hospitals, even though none of these are web
resources in the original sense.
The following is the beginning and end of the GO database, as expressed
in RDF:
<go:go
xmlns:go="http://www.geneontology.org/dtds/go.dtd#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:RDF>
<go:term rdf:about="http://www.geneontology.org/go#GO:0003673"
n_associations="149784">
<go:accession>GO:0003673</go:accession>
<go:name>Gene_Ontology</go:name>
</go:term>
<go:term rdf:about="http://www.geneontology.org/go#GO:0003674"
n_associations="101079">
<go:accession>GO:0003674</go:accession>
<go:name>molecular_function</go:name>
<go:definition>Elemental activities, such as catalysis or
binding, describing the actions of a gene product at the
molecular level. A given gene product may exhibit one or
more molecular functions.
</go:definition>
<go:part_of
rdf:resource="http://www.geneontology.org/go#GO:0003673"/>
</go:term>
...
</rdf:RDF>
</go:go>
The entire GO database is currently over 350 MB. The root element (named
go:go) defines the two namespaces that are used by the database: RDF
and the GO. RDF statements are always contained in an element named
rdf:RDF. Within the rdf:RDF element, elements look like ordinary XML elements,
except that they are organized in alternating layers, or stripes, as discussed in
section 1.6. The first layer defines instances belonging to classes. In this case,
the GO database defines two instances of type go:term. The second layer
makes statements about these instances, such as the go:accession identi-
fier and the go:name. The rdf:about attribute is special: it gives the re-
source identifier (URI) of the resource about which one is making statements.
The rdf:resource attribute is also special: it refers to another resource.
Such a reference is analogous to a web link used for navigating from one
page to another page on the web. If there were a third layer, then it would
define instances, and so on. For an example of deeper layers, see figure 1.14.
XML, especially when using XML Schema (XSD), is certainly capable of ex-
pressing annotations about URIs, so it is natural to wonder what RDF adds
to XSD. Tim Berners-Lee wrote an article in 1998 attempting to answer this
question (Berners-Lee 2000b). The essence of the article is that RDF seman-
tics can be closer to the semantics of the domain being represented. As we
discussed in section 2.2, there are many features of the semantics of a do-
main that are difficult to capture using DTDs or XML schemas. Another way
of putting this is that XML documents will make distinctions (such as the
order of child elements) that are semantically irrelevant to the information being represented.
</isStoredIn>
</gene>
</contains>
</locus>
The corresponding RDF graph is shown in figure 4.2. The element names
alternate between names of classes and names of properties, depending on
the “striping” level. Thus locus is a class, contains is a property, gene
is a class, and so on. Attributes are always names of properties. The nodes
with no label (i.e., the empty ovals in the graph) are called blank or anonymous
resources. They are important for conveying meaning, but they do not have
explicit URIs. RDF processors generate URIs for blank nodes, but these gen-
erated URIs have no significance. The use of blank nodes in RDF complicates
query processing, compared with XML. However, high-performance graph-
matching systems have been developed that are efficient and scalable. This
will be discussed in section 6.6.
Figure 4.2 RDF graph for an XML document. Resources are represented using
ovals, and rectangles contain data values.
Every link in an RDF graph has three components: the two resources be-
ing linked and the property that links them. Properties are themselves re-
sources, so a link consists of three resources. The two resources being linked
are called the subject and object, while the property that does the linking is
called the predicate. Together the three resources form a statement, analogous
to a statement in natural language. It is a good practice to use verbs for the
names of predicates so that each RDF statement looks just like a sentence,
and means essentially the same. RDF statements are also called triples. Some
of the triples of the RDF graph in figure 4.2 include the following:
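(A reconstructed sketch; the bioml: class and property names are assumed from the fragment above, and _:1, _:2, _:3 stand for nodes of the graph.)
_:1 rdf:type bioml:locus
_:1 bioml:name "HUMINS locus"
_:1 bioml:contains _:2
_:2 rdf:type bioml:gene
_:2 bioml:isStoredIn _:3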
The underscore means that the resource is a blank node so it does not have a
URI. The other resources are part of either the BioML ontology or are part of
the RDF language. When expressed in English the triples above might look
like the following:
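(Continuing the reconstructed sketch above.) There is a locus whose name is "HUMINS locus." The locus contains a gene. The gene is stored in another, unnamed, resource.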
Simple data values, such as the text string “HUMINS locus,” are formally
defined by XSD datatypes as in section 2.4.
Unlike the conversion from DTDs to XSD, it is not possible to automate the
conversion from DTDs to RDF. The problem is that relationships are not ex-
plicitly represented in either DTDs or XSD. In the Medline DTD shown in
figure 2.2, some of the elements correspond to RDF classes while others cor-
respond to RDF properties. A person who is familiar with the terminology
can usually recognize the distinction, but because the necessary information
is not available in the DTD or schema, the conversion cannot be automated.
The MedlineCitation element, for example, probably corresponds to an
RDF class, and each particular Medline citation is an instance of this RDF
class. After a little thought, it seems likely that all of the other elements in
the Medline DTD correspond to RDF properties. However, these choices are
speculative, and one could certainly make other choices, all of which would
result in a consistent conversion to RDF. Converting from a DTD to RDF is
further complicated by implicit classes. When converting the Medline DTD
to RDF, it is necessary to introduce an RDF class for the date, yet there is no
such element in the Medline DTD. In general, XML element types can cor-
respond to either RDF classes or RDF properties, and both RDF classes and
RDF properties can be implicit. In other words, XML DTDs and schemas are
missing important information about the concepts being represented.
One specifies an RDF ontology using RDF itself. The fact that a resource is
an RDF class, for example, is stated using an ordinary RDF statement. One
possibility for the classes and properties of the RDF ontology corresponding
to the Medline DTD is shown in figure 4.3. Two namespaces are used: the
RDF namespace itself and RDF Schema (RDFS). The RDF
namespace is sufficient for specifying ordinary facts, while RDFS is necessary
for specifying an RDF ontology.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdfs:Class rdf:ID="MedlineCitation"/>
<rdf:Property rdf:ID="Owner"/>
<rdf:Property rdf:ID="Status"/>
<rdf:Property rdf:ID="MedlineID"/>
<rdf:Property rdf:ID="PMID"/>
<rdf:Property rdf:ID="DateCreated"/>
<rdfs:Class rdf:ID="Date"/>
<rdf:Property rdf:ID="Year"/>
<rdf:Property rdf:ID="Month"/>
<rdf:Property rdf:ID="Day"/>
<rdf:Property rdf:ID="ArticleTitle"/>
</rdf:RDF>
Figure 4.3 One possible way to represent the Medline DTD of figure 2.2 using an
RDF ontology.
The Medline citation in figure 2.1 is already almost in a form that is com-
patible with RDF. All that is needed is to add a Date element as shown in
figure 4.4. However, RDF gives one the freedom to represent the same infor-
mation in many other ways. The document shown in figure 4.5 is equivalent.
Both representations have the same RDF graph, shown in figure 4.6.
Figure 4.5 Part of a Medline citation written using RDF. Although it looks different,
the information is the same as that in figure 4.4.
Figure 4.6 RDF graph for a typical Medline citation. Resource nodes are shown
using ovals, text nodes are shown using rectangles. All links are labeled with the
property. The ml: prefix stands for the Medline ontology.
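As a rough illustration of this freedom (a hypothetical fragment using the ml: prefix, not the actual contents of figures 4.4 and 4.5), the following two documents produce the same RDF graph:
<!-- Nested form -->
<ml:MedlineCitation rdf:ID="cit1">
<ml:DateCreated>
<ml:Date rdf:ID="d1">
<ml:Year>1999</ml:Year>
</ml:Date>
</ml:DateCreated>
</ml:MedlineCitation>
<!-- Equivalent form: the date is described separately and referenced -->
<ml:MedlineCitation rdf:ID="cit1">
<ml:DateCreated rdf:resource="#d1"/>
</ml:MedlineCitation>
<ml:Date rdf:ID="d1">
<ml:Year>1999</ml:Year>
</ml:Date>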
<rdf:Property rdf:ID="DateCreated">
<rdfs:domain rdf:resource="#MedlineCitation"/>
<rdfs:range rdf:resource="#Date"/>
</rdf:Property>
The rdf:ID attribute is used for defining a resource. Each resource is de-
fined exactly once. At this point one can also annotate the resource with
additional property values, as was done above. The rdf:about attribute is
used for annotating a resource. Use this when one is adding property values
to a resource that has been defined elsewhere. The rdf:resource attribute
is used for referring to a resource. In terms of statements, use rdf:ID and
rdf:about for a resource that is to be the subject of the statement, and use
rdf:resource when the resource is to be the object of the statement. We
leave it as an exercise to the reader to restate the molecule DTD as an RDF
ontology and to write the nitrous oxide molecule document in RDF.
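A minimal sketch putting the three attributes together (the insulin and chromosome11 identifiers are hypothetical):
<!-- rdf:ID defines a new resource -->
<Gene rdf:ID="insulin"/>
<!-- rdf:about adds property values to a resource defined elsewhere -->
<Gene rdf:about="#insulin">
<!-- rdf:resource refers to the resource that is the object of the statement -->
<isStoredIn rdf:resource="#chromosome11"/>
</Gene>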
The last important feature that distinguishes RDF from XML is its incor-
poration of built-in inference rules. The most important built-in rule is the
subClass rule because this is the rule that implements inheritance and taxo-
nomic classification of concepts. Although there are many notions of hier-
archy, as discussed in section 1.5, the most commonly used is the notion of
taxonomy which is based on the mathematical notion of set containment.
<Class rdf:about="#Protein">
<subClassOf rdf:resource="#Macromolecule"/>
</Class>
<Class rdf:about="#Macromolecule">
<subClassOf rdf:resource="#Chemical"/>
</Class>
<Protein rdf:ID="rhodopsin"/>
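Under the subclass rule, the statements above entail additional type statements. Written as triples (a sketch), the stated and inferred facts are:
rhodopsin rdf:type Protein        (stated)
rhodopsin rdf:type Macromolecule  (inferred)
rhodopsin rdf:type Chemical       (inferred)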
1. It is actually a cycle. The order is significant, but one can start with any one of the enzymes.
However, now the order is lost. The fact that the statements are in the right
order does not matter. An RDF processor will not maintain this order, and
one cannot make any use of it. Fortunately, there are two mechanisms for
retaining the ordering. The older method is to place the enzymes in a sequence
container as follows:
<Pathway>
<usesEnzyme>
<rdf:Seq>
<!-- ... the preceding enzymes of the cycle ... -->
<rdf:li>
<Protein name="Fumarase"/>
</rdf:li>
<rdf:li>
<Protein name="Malate dehydrogenase"/>
</rdf:li>
</rdf:Seq>
</usesEnzyme>
</Pathway>
The sequence is itself a resource as well as being the container of the other
resources. Notice the use of the rdf:li property for the members of the
container. This name was borrowed from HTML where it is used for the
members of lists. There are three kinds of container: rdf:Seq (an ordered sequence), rdf:Bag (an unordered collection), and rdf:Alt (a set of alternatives).
More recently, a second mechanism for ordered lists was added to RDF,
called a collection. The Krebs cycle can now be expressed as follows:
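A sketch of the collection form (the earlier enzymes of the cycle are again elided):
<Pathway>
<usesEnzyme rdf:parseType="Collection">
<!-- ... the preceding enzymes of the cycle ... -->
<Protein name="Fumarase"/>
<Protein name="Malate dehydrogenase"/>
</usesEnzyme>
</Pathway>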
To get this kind of list, one only needs to specify that the rdf:parseType of the
property value is Collection. This is much simpler than using a container.
Summary
• RDF is a framework for representing explicit many-to-many relationships
(called properties) between web-based resources and data.
• Inference is a powerful feature, but one must be careful when using it.
<topic id="PMID10476541">
<instanceOf><topicRef xlink:href="#MedlineCitation"/></instanceOf>
<baseName>
<baseNameString>Breast cancer highlights</baseNameString>
</baseName>
<occurrence>
<instanceOf><topicRef xlink:href="#html-format"/></instanceOf>
<resourceRef
xlink:href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?..."/>
</occurrence>
</topic>
<association>
<instanceOf>
<topicRef xlink:href="#citation-attributes"/>
</instanceOf>
<member>
<roleSpec><topicRef xlink:href="#owner"/></roleSpec>
<topicRef xlink:href="#NLM"/>
</member>
<member>
<roleSpec><topicRef xlink:href="#status"/></roleSpec>
<topicRef xlink:href="#completed"/>
</member>
<member>
<roleSpec><topicRef xlink:href="#date-created"/></roleSpec>
<topicRef xlink:href="#date991021"/>
</member>
</association>
<topic id="date991021">
<baseName>
<scope>
<topicRef
xlink:href="http://kmi.open.ac.uk/psi/datatypes.xtm#date"/>
</scope>
<baseNameString>1999-10-21</baseNameString>
</baseName>
</topic>
Except for some syntactic details such as striping and built-in attributes in
the RDF namespace, RDF documents can be very similar to general XML
documents. As the example above illustrates, XTM documents have no such
advantage.
XTM is a graph-based language that has much in common with RDF. Both
of them are intended to be a mechanism for annotating web resources. The
web resources that are being annotated occur within documents which fur-
nish the “primary structure” defining the resources. The annotations are a
“secondary structure” known as metadata or “data about data.”
Although XTM and RDF have many similarities, they also differ in some
important respects:
• XTM has a notion of scope or context that the RDF languages lack.
• The RDF languages have a formal semantics. XTM only has a formal
metamodel.
• XTM makes a clear distinction between metadata and data, while RDF
does not. In RDF one can annotate anything, including annotations.
4.4 The Web Ontology Language
<owl:Class rdf:ID="ICE-Symptoms">
<owl:oneOf parseType="Collection">
<Symptom name="corneal endothelium
proliferation and migration"/>
<Symptom name="iris atrophy"/>
<Symptom name="corneal oedema"/>
<Symptom name="pigmentary iris nevi"/>
</owl:oneOf>
</owl:Class>
This defines a class of symptoms consisting of exactly the ones specified. One
can then define the ICE syndrome as the subclass of disease for which at least
one of these four symptoms occurs:
<owl:Class rdf:ID="ICE-Syndrome">
<owl:intersectionOf parseType="Collection">
<owl:Class rdf:about="#Disease"/>
<owl:Restriction>
<owl:onProperty rdf:resource="#has-symptom"/>
<owl:someValuesFrom
rdf:resource="#ICE-Symptoms"/>
</owl:Restriction>
</owl:intersectionOf>
</owl:Class>
The statements above specify the ICE-Syndrome class as being the intersec-
tion of two sets:
1. The set of all diseases
2. The set of things that have at least one of the four ICE symptoms
3. owl:hasValue. This constructor defines the set of resources for which the
property takes the specified value.
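For instance, such a restriction might look like the following (a sketch that reuses the has-symptom property; the iris-atrophy identifier is hypothetical):
<owl:Restriction>
<owl:onProperty rdf:resource="#has-symptom"/>
<owl:hasValue rdf:resource="#iris-atrophy"/>
</owl:Restriction>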
<owl:ObjectProperty rdf:ID="stores">
<owl:inverseOf rdf:resource="#isStoredIn"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:ID="cites">
<owl:inverseOf rdf:resource="#isCitedBy"/>
</owl:ObjectProperty>
There are no other ways to construct a property in OWL. However, there are
some property constraints:
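For example, a property can be declared to be functional, transitive, or symmetric. A sketch (the isPartOf and interactsWith properties are hypothetical; occursIn is declared functional in the example below):
<owl:FunctionalProperty rdf:about="#occursIn"/>
<owl:TransitiveProperty rdf:about="#isPartOf"/>
<owl:SymmetricProperty rdf:about="#interactsWith"/>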
<owl:Class rdf:ID="Gene">
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#occursIn"/>
<owl:someValuesFrom rdf:resource="#Species"/>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
<Gene rdf:ID="hbae3">
<rdfs:label>hemoglobin alpha embryonic-3</rdfs:label>
</Gene>
<FunctionalProperty rdf:about="#occursIn"/>
If one has not specified that the hbae3 gene occurs in any species, then one
would infer that there is exactly one, as yet unknown, species where this
gene occurs. This is shown in figure 4.7.
Now suppose that one specifies that the hbae3 gene occurs in two species:
<Gene rdf:about="#hbae3">
<occursIn rdf:resource="#D.rerio"/>
<occursIn rdf:resource="#D.danglia"/>
</Gene>
<Species rdf:about="#D.rerio">
<owl:differentFrom rdf:resource="#D.danglia"/>
</Species>
Figure 4.8 An example in which two resources are inferred to be the same. In this
case the ontology allows a gene to belong to at most one species. As a result, if a gene
is linked to more than one species, then all of them must be the same. The inferred
relationship is shown in gray.
occurring as a result of other facts and rules. In general, one can reduce
spurious inferences in two ways:
Specifying that resources are different can get very tedious if there is a
large number of them. To deal with this problem, OWL has a mechanism for
specifying that a list of resources are all different from one another. For exam-
ple, the two species above could have been declared to be different by using
the owl:AllDifferent resource and owl:distinctMembers property
as follows:
<owl:AllDifferent>
<owl:distinctMembers rdf:parseType="Collection">
<Species rdf:about="#D.rerio"/>
<Species rdf:about="#D.danglia"/>
</owl:distinctMembers>
</owl:AllDifferent>
Summary
• OWL is based on RDF and has three increasingly general levels:
OWL Lite, OWL-DL, and OWL Full.
• An OWL document defines a theory of the world. States of the world that
are consistent with the theory are called models of the theory.
• A fact that is true in every model is said to be entailed by the theory. OWL
inference is defined by entailment.
• OWL is especially well suited for defining concepts in terms of other con-
cepts using class constructors.
• OWL has only one property constructor, but it has some property con-
straints.
• OWL inference is monotonic, which can limit inferences, but careful de-
sign can reduce this problem.
4.5 Exercises
1. Restate the molecule schema in figure 1.6 as an RDF ontology. There will
not be a single correct answer to this exercise.
2. Define the nitrous oxide molecule in figure 2.3 using RDF. The answer will
depend on the RDF ontology.
5 Survey of Ontologies in Bioinformatics
There are a large number of biomedical ontologies and databases that are cur-
rently available, and more continue to be developed. There is even a site that
tracks the publicly available sources. Ontologies have emerged because of
the need for a common language to develop effective human and computer
communication across scattered, personal sources of data and knowledge.
In this chapter, we provide a survey of ontologies and databases used in
the bioinformatics community. In the first section we focus on human com-
munication. The ontologies in this section are concerned with medical and
biological terminology and with ontologies for organizing other ontologies.
The rest of the chapter shifts the focus to computer communication. In sec-
tion 5.2 we survey the main XML-based ontologies for bioinformatics. The
remaining sections consider some of the many databases that have been de-
veloped for biomedical purposes. Each database has its own structure and
therefore can be regarded as defining an ontology. However, the focus is on
the data contained in the database rather than on the language used for rep-
resenting the data. These databases differ markedly from one another with
respect to how the data are specified and whether they are compatible with
the ontologies in the first two sections. Many of the databases are available
in several formats. Only databases that can be downloaded were included in
the survey.
5.1 Bio-Ontologies
This section begins with two ontologies for medical and biological terminol-
ogy. The first one was originally focused on medical terminology but now
also includes many other biomedical vocabularies; it has grown to be impres-
sively large, but is sometimes incoherent as a result. The second ontology
focuses exclusively on terminology for genomics. As a result of its narrow
focus, it is very coherent, and a wide variety of tools have been developed
that make use of it. Finally, we consider ontologies that organize other on-
tologies. The number of biomedical ontologies and databases has grown so
large that it is necessary to have a framework for organizing them.
tions. MetaMap can also be used for constructing a list of ranking concepts
by applying the MetaMap indexing ranking function to each UMLS META
concept. The UMLS Knowledge Source Server (UMLSKS) umlsks.nlm.
nih.gov is a web server that provides access to the knowledge sources and
other related resources made available by developers using the UMLS.
The UMLS is a rich source of knowledge in the biomedical domain. The
UMLS is used for research and development in a range of different applica-
tions, including natural language processing (Baclawski et al. 2000; McCray
et al. 2001). UMLS browsers are discussed in section 6.3. Search engines
based on the UMLS use it either as a source of keywords or as a means of gen-
erating knowledge representations. An example of the keyword approach is
the Medical World Search at www.mwsearch.com which is a search engine
for medical information in selected medical sites. An example of the know-
ledge representation approach is the Semantic Knowledge Indexing Platform
(SKIP), shown in section 6.6.
The terms within each of the three GO ontologies may be related to other
terms in two ways: by the is-a relationship, in which one term is a subcategory of another, and by the part-of relationship, in which one term is a component of another.
Figure 5.1 The GO hierarchy for inositol lipid-mediated signaling. The parentheses
show the total number of terms in the category at that level.
web-based tool for rapidly listing genes in GO categories (Dennis, Jr. et al.
2003).
GOTM genereg.ornl.gov/gotm
The GOTree Machine is a web-based platform for interpreting microarray
data or other interesting gene sets using GO (Zhang et al. 2004).
Figure 5.2 A GO network graph generated using the NetAffx Gene Ontology Min-
ing Tool.
Figure 5.3 A gene expression profiling study of preterm delivery (PTD) of eight
mothers with PTDs and six mothers with term deliveries. In this study, 159 genes
were found to be significantly associated with the “Response to External Stimulus” GO
term (P < .0001).
A number of efforts are underway to enhance and extend GO. The Gene
Ontology Annotation (GOA), run by the European Bioinformatics Institute
(EBI), is providing assignments of terms from the GO resource to gene products.
$structures.goff ; ZFIN:0000000
<001_Zygote\:1-cell\,embryo ; ZFIN:0000004
<001_Zygote\:1-cell\,blastomere ; ZFIN:0000001
<001_Zygote\:1-cell\,yolk ; ZFIN:0000012
<001_Zygote\:1-cell\,extraembryonic ; ZFIN:0000005
<001_Zygote\:1-cell\,chorion ; ZFIN:0000002
<002_Cleavage\:2-cell\,embryo ; ZFIN:0000017
<002_Cleavage\:2-cell\,blastomeres ; ZFIN:0000013
<002_Cleavage\:2-cell\,yolk ; ZFIN:0000025
<002_Cleavage\:2-cell\,extraembryonic ; ZFIN:0000018
<002_Cleavage\:2-cell\,chorion ; ZFIN:0000014
<003_Cleavage\:4-cell\,embryo ; ZFIN:0000030
<003_Cleavage\:4-cell\,blastomeres ; ZFIN:0000026
<003_Cleavage\:4-cell\,yolk ; ZFIN:0000038
<003_Cleavage\:4-cell\,extraembryonic ; ZFIN:0000031
<003_Cleavage\:4-cell\,chorion ; ZFIN:0000027
<004_Cleavage\:8-cell\,embryo ; ZFIN:0000043
<004_Cleavage\:8-cell\,blastomeres ; ZFIN:0000039
<004_Cleavage\:8-cell\,yolk ; ZFIN:0000051
<004_Cleavage\:8-cell\,extraembryonic ; ZFIN:0000044
BSML www.bsml.org
The Bioinformatic Sequence Markup Language (BSML) is a language that
encodes biological sequence information, which encompasses graphical rep-
resentations of biologically meaningful objects such as nucleotide or protein
sequences. The current version (released in 2002) is BSML v3.1. BSML takes
advantage of XML features for encoding hierarchically organized informa-
tion to provide a representation of knowledge about biological sequences.
BSML is useful in capturing the semantics of biological objects (e.g., com-
plete genome, chromosome, regulatory region, gene, transcript, gene prod-
uct, etc.). BSML can be rendered in the Genomic XML viewer, which greatly
facilitates communications among biologists, since biologists are accustomed
to visualizing biological objects and to communicating graphically about
these objects and their annotations.
The root element for a BSML document is tagged with Bsml. Conse-
quently, a BSML document should look like the following:
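A minimal skeleton, with the document content elided, would be:
<?xml version="1.0"?>
<Bsml>
<!-- sequence data and associated annotations go here -->
</Bsml>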
1. Sequence data. The primary sequence data of the molecule of interest are
contained within the sequence element; the information of the sequence
is represented using attributes and their associated values, defined in the
BSML DTD. Figure 5.5 shows an example of using BSML to represent the
amino acid sequence of human tumor suppressor p53.
Figure 5.5 The BSML representation for the SWISS-PROT entry P04637.
BioML www.rdcormia.com/COIN78/files/XML_Finals/
BIOML/Pages/BIOML.htm
The Biopolymer Markup Language provides an extensible framework for an-
notating experimental information about molecular entities, such as proteins
and genes. Many examples of BioML documents were shown in chapter 1.
The four chemical letters of DNA, G, C, A, and T, have their normal mean-
ings as individual nucleotides (case-insensitive). White space (e.g., spaces,
tabs, carriage returns) is ignored by the parser and can be freely added to
aid the flow and readability of the file. The parser also ignores any character
that cannot be a nucleotide residue, allowing the author to include numbers
and other symbols that make reading the file easier. The kinds of element for
DNA, RNA, and protein in BioML are presented in table 5.1.
The BioML ontology can also be used to refer to public database entries.
For example, one can refer to the GenBank entry for the DNA sequence en-
coding the human δ-aminolevulinate dehydratase as follows:
<bioml>
<reference>
<db_entry format="GENBANK" entry="X64467"/>
</reference>
</bioml>
Table 5.1 The elements for DNA, RNA, and protein in BioML
MAGE-ML www.mged.org
The MicroArray Gene Expression Markup Language is an XML ontology
for microarray data. MAGE-ML aims to create a common data format so
that data can be shared easily between projects (Stoeckert, Jr. et al. 2002).
The predecessor of MAGE-ML is the Gene Expression Markup Language
(GEML), initially developed by Rosetta Inpharmatics (Kohane et al. 2003).
MAGE-ML is a data-exchange syntax for microarray data recently created
by the microarray gene expression data group (MGED) (MAGE-ML 2003).
In order to standardize the information concerning microarray data, MGED
initially introduced the minimal information for the annotation of a microar-
ray experiment (MIAME). MIAME describes the minimum information re-
quired to ensure that microarray data can be easily interpreted and that re-
sults derived from their analysis can be independently verified (Brazma et al.
2001). Practically speaking, MIAME is a checklist of what should be supplied
for publication. MIAME-compliant conceptualization of microarray experi-
ments is then modeled using the UML-based microarray gene expression
object model (MAGE-OM). MAGE-OM is then translated into an XML-based
data format, MAGE-ML, to facilitate the exchange of data (Spellman et al.
2002).
There is a close relationship between the MAGE-ML and MGED ontolo-
gies. The MGED ontology, being developed by the Ontology Working Group
of the MGED ontology project, is providing standard controlled vocabularies
for microarrays. The goal of the MGED ontology is to create a framework of
microarray concepts that reflects the MIAME guidelines and MAGE struc-
ture. Therefore, the MGED ontology project has the practical aim of developing
standards and reducing nonuniform usage of annotation in microarray ex-
periments. Concepts for which existing controlled vocabularies and ontolo-
gies can be identified are specified by reference to those external resources,
and no new ontologies will be created. Concepts that are microarray-based
or tractable (such as experimental conditions) are specified within the MGED
ontology. MAGE-ML provides a standard XML format, which supersedes
the MicroArray Markup Language (MAML) format, for reporting microar-
ray data and its associated information.
CellML www.cellml.org
The CellML ontology is being developed by Physiome Sciences Inc. in Prince-
ton, New Jersey, in conjunction with the Bioengineering Institute at the University of Auckland.
RNAML www-lbit.iro.umontreal.ca/rnaml
RNAML provides a standard syntax that allows for the storage and exchange
of information about RNA sequence as well as secondary and tertiary struc-
tures. The syntax permits the description of higher-level information about
the data, including, but not restricted to, base pairs, base triples, and pseu-
doknots (Waugh et al. 2002).
Because of the hierarchical nature of XML, RNAML is a valuable method
for structuring the knowledge related to RNA molecules into a nested-struc-
tured text document. For example, in RNAML, a “molecule” is an element
consisting of the following three lower-level elements: identity (which con-
tains two nested elements, name and taxonomy), sequence (which contains
three nested elements, numbering-system, seq-data, and seq-annotation), and
structure (which contains one nested element, model). To ensure compatibil-
ity with other existing standards of RNA nomenclature, RNAML incorporates
existing formats such as the International Union of Pure and Applied Chem-
istry (IUPAC) lettering and PDB ATOM records. If RNAML needs to depict
multiple interacting RNA molecules, the interactions of RNA molecules are
presented as character data in an interaction element.
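The nesting described above might be sketched as follows (element content elided):
<molecule>
<identity>
<name>...</name>
<taxonomy>...</taxonomy>
</identity>
<sequence>
<numbering-system>...</numbering-system>
<seq-data>...</seq-data>
<seq-annotation>...</seq-annotation>
</sequence>
<structure>
<model>...</model>
</structure>
</molecule>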
AGAVE www.animorphics.net/lifesci.html
The Architecture for Genomic Annotation, Visualization and Exchange is an
XML language created by DoubleTwist, Inc., for representing genomic an-
notation data. AGAVE uses XML Schema (XSD) for describing the syntactic
structure of the data. A bioperl script can be used to convert data in the Euro-
pean Molecular Biology Laboratory (EMBL) or Genome Annotation Markup
Elements (GAME) format into the AGAVE format. The XML EMBL (XEMBL)
project of EBI is building a service tool that employs Common Object Re-
quest Broker Architecture (CORBA) servers to access EMBL data. The data
can then be distributed in XML format via a number of mechanisms (Wang
et al. 2002).
CML www.xml-cml.org
The Chemical Markup Language was discussed in chapter 1. The purpose of
CML is to manage chemical information (e.g., atomic, molecular, crystallo-
graphic information). CML is supported by tools such as the popular Jumbo
browser. CMLCore retains most of the chemical functionality of CML 1.0,
and extends it by adding handlers for chemical substances, extended bond-
ing models, and names (Murray-Rust and Rzepa 2003).
CytometryML
The Cytometry Markup Language is designed for the representation and ex-
change of cytometry data. CytometryML provides an open, standard XML
format, which may replace the Flow Cytometry Standard (Leif et al. 2003).
GAME www.fruitfly.org/comparative
GAME is an XML language for curation of DNA, RNA, or protein sequences.
GAME uses an XML DTD to specify the syntactic structure of the content of
a GAME document. GAME is extensively used within the FlyBase/Berkeley
Drosophila Genome Project (BDGP). For example, genomic regions for Rho-
dopsin 1 (ninaE), Rhodopsin 2 (Rh2), Rhodopsin 3 (Rh3), Rhodopsin 4 (Rh4),
apterous (ap), even-skipped (eve), fushi-tarazu (ftz) and twist (twi) have
been annotated in GAME format (Bergman et al. 2002; GAME 2002) in four
Drosophila species (D. erecta, D. pseudoobscura, D. willistoni, and D. littoralis)
covering over 500 kb of the D. melanogaster genome.
MML
The Medical Markup Language provides the XML-based standard for medi-
cal data exchange/storage (Guo et al. 2003).
MotifML motifml.org
MotifML is a language for representing the computationally predicted DNA
motifs (often in the regulatory region such as promoters) generated by the
Gibbs motif sampler, AlignACE, BioProspector, and CONSENSUS. MotifML
was created by the authors of this book and two collaborators (Sui Huang
and Jerzy Letkowski). MotifML uses the Web Ontology Language (OWL) to
specify the data structure of a MotifML document. MotifML is supported by
Java-based visualization tools such as MotifML viewers.
NeuroML www.neuroml.org/main.html
The Neural Open Markup Language is an XML language for describing mod-
els, methods, and literature for neuroscience. NeuroML uses XSD to specify
the syntactic requirements for the model descriptions (Goddard et al. 2001).
ProML
The Protein Markup Language is an open XML standard for specifying protein
sequences, structures, and families. ProML allows machine-
readable representations of key protein features (Hanisch et al. 2002).
TML
Taxonomic Markup Language is mainly an XML format for representing the
topology of a phylogeny, but also includes a representation for statistical
metadata (e.g., branch length, retention index, and consistency index) de-
scribing the phylogeny (Gilmour 2000). It is notable that for TML, the hier-
archical nature of a phylogeny is readily represented by XML.
GenBank www.ncbi.nlm.nih.gov/Genbank
GenBank is a comprehensive database that contains publicly available DNA
sequences for more than 140,000 named organisms. The sequences are pri-
marily obtained through submissions from individual laboratories and batch
submissions from large-scale sequencing projects (Benson et al. 2004). As of
February 2004, GenBank contained over 37 billion bases in over 32 million
sequence records. GenBank uses its own non-XML text format.
Most submissions to GenBank are made using the BankIt web service or
Sequin program and accession numbers are assigned by GenBank staff upon
receipt. Daily data exchange with the EMBL data library in the U.K. and the
DNA data bank of Japan (DDBJ) helps ensure worldwide coverage. Gen-
Bank is accessible through NCBI’s retrieval system, Entrez, which integrates
data from the major DNA and protein sequence databases along with taxon-
omy, genome mapping, protein structure, and domain information, and the
biomedical journal literature via PubMed.
EMBL www.ebi.ac.uk/embl
The EMBL Nucleotide Sequence Database, maintained at the European Bioin-
formatics Institute (EBI), incorporates, organizes, and distributes nucleotide
sequences from public sources (Kulikova et al. 2004). The database is a part of
an international collaboration with DDBJ and GenBank. Data are exchanged
between the collaborating databases on a daily basis. The Webin web service
is the preferred system for individual submission of nucleotide sequences,
including third party annotation (TPA) and alignment data. Automatic sub-
mission procedures are used for submission of data from large-scale genome
sequencing centers and from the European Patent Office. Database releases
are produced quarterly.
EMBL uses its own non-XML text format, but the XEMBL project has made
it possible to obtain EMBL data in the AGAVE XML format (Wang et al. 2002).
The latest EMBL data collection can be accessed via ftp, email, and web
interfaces. The EBI’s Sequence Retrieval System (SRS) integrates and links
the main nucleotide and protein databases as well as many other specialist
molecular biology databases. For sequence similarity searching, a variety of
tools (e.g., FASTA and BLAST) are available that allow users to compare their
own sequences against the data in EMBL and other databases.
DDBJ www.ddbj.nig.ac.jp
DDBJ is maintained at the National Institute of Genetics in Japan (Miyazaki
SWISS-PROT au.expasy.org/sprot
SWISS-PROT is the most widely used publicly available protein sequence
database. This database aims to be nonredundant, fully annotated, and highly
cross-referenced (Jung et al. 2001). SWISS-PROT also includes information
on many types of protein modifications. The database is available in both
FASTA and XML formats. The XML format is defined both as a DTD and
using XSD. The XSD schema is at www.uniprot.org/support/docs/
uniprot.xsd. The database itself is available at ftp://ftp.ebi.ac.uk/
pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz.
Both SWISS-PROT and TrEMBL are available at this site in a variety of for-
mats.
NDB ndbserver.rutgers.edu
The most prominent nucleotide structure database is the Nucleic Acid Data-
base. NDB was established in 1991 as a resource to assemble and distribute
structural information about nucleic acids (both DNA and RNA) (Berman
et al. 1992). The core of the NDB has been its relational database of nucleic
acid-containing crystal structures. The primary data include the crystallo-
graphic coordinate data, structure factors, and information about the exper-
iments used to determine the structures, such as crystallization information,
data collection, and refinement statistics. Derived information from experi-
mental data, including valency geometry, torsion angles, and intermolecular
contacts, is calculated and stored in the database. Database entries are fur-
ther annotated to include information about the overall structural features,
including conformational classes, special structural features, biological func-
tions, and crystal-packing classifications. The NDB has been used to analyze
characteristics of nucleic acids alone as well as complexed with proteins. The
NDB database is available in the PDB and mmCIF formats.
BLOCKS blocks.fhcrc.org
Blocks are defined as ungapped multiple alignments corresponding to the
most conserved regions of proteins. Blocks contain “multiple alignment” in-
formation, and the use of the BLOCKS database can improve the detection of
sequence similarities in searches of sequence databases. The BLOCKS data-
base was introduced to aid in the family classification of proteins (Henikoff
and Henikoff 1991). This database turns out to be a very important database,
because hits to BLOCKS database entries pinpoint the location of conserved
motifs, which are important for further functional characterization (Henikoff
et al. 2000). Furthermore, the BLOCKS database can be used for detecting
distant relationships (Henikoff et al. 1998). The BLOCKS database is the ba-
sis for the BLOSUM substitution tables that are used in amino acid sequence
similarity searching, as explained in section 7.1.
The BLOCKS database contains more than 24,294 blocks from nearly 5000
different protein groups (Henikoff et al. 2000). There are a variety of for-
mats for blocks, including the Blocks, FASTA, and Clustal formats. All of the
more flexible and less stable than those in a crystal. Indeed, solution struc-
tures determined from NMR data are slightly different from crystal struc-
tures. Therefore, NMR is often used to study small and peculiar proteins.
Protein glycosylation is probably the most common and complex type
of co- and post-translational modification encountered in proteins (Lutteke
et al. 2004). Inspection of the protein databases reveals that 70% of all pro-
teins have potential N-glycosylation sites (Asn-X-Ser/Thr, where X is not
Pro) (Mellquist et al. 1998). O-glycosylation is even more ubiquitous (Berman
et al. 2000). Consequently, PDB entries contain not only protein structures
but also pure carbohydrate structures. However, to date, there is no standard
nomenclature for carbohydrate residues within the PDB files (Westbrook and
Bourne 2000). For example, although many monosaccharide residues are de-
fined in the PDB Het Group Dictionary pdb.rutgers.edu/het_dictio
nary.txt, there is no distinction between the α- and the β-forms. Thus, it
is difficult for glycobiologists to find relevant carbohydrate structures from
PDB.
The PDB database has two non-XML formats, PDB and mmCIF, that are in
use by many other molecular structure databases. Recently an XSD format,
PDBML, has been introduced in PDB and automated generation of XML files
is driven by the data dictionary infrastructure in use at the PDB. The current
XML schema file is located at deposit.pdb.org/pdbML/pdbx-v1.000.
xsd, and on the PDB mmCIF resource page at deposit.pdb.org/mmcif/.
SCOP scop.mrc-lmb.cam.ac.uk/scop
The Structural Classification of Proteins database classifies proteins by do-
mains that have a common ancestor based on sequence, structural, and func-
tional evidence (Murzin et al. 1995; Andreeva et al. 2004). In order to under-
stand how multidomain proteins function, it is important to know how they
are created during evolution. Duplication is one of the main sources for cre-
ating new genes and new domains (Lynch and Conery 2000). For examples
of this, see section 1.5. In fact, 98% of human protein domains are duplicates
(Gough et al. 2001; Madera et al. 2004; Muller et al. 2002). Once a domain or
protein has duplicated, it can evolve a new or modified function.
Access to SCOP requires a license. It is available in a non-XML text format.
CATH www.biochem.ucl.ac.uk/bsm/cath_new
This database contains domain structures classified into superfamilies and
sequence families (Orengo et al. 1997, 2003). Its name stands for Class/
Architecture/Topology/Homology. Each structural family is expanded with
domain sequence relatives recruited from GenBank using a variety of ef-
MIPS mips.gsf.de
The Munich Information Center for Protein Sequences provides protein se-
quence-related information based on whole-genome analysis (Mewes et al.
2004). The main focus of the work is directed toward the systematic organi-
zation of sequence-related attributes as gathered by a variety of algorithms
and primary information from experimental data together with information
compiled from the scientific literature.
DIP dip.doe-mbi.ucla.edu
The Database of Interacting Proteins is a research tool for studying cellu-
lar networks of protein interactions (Salwinski et al. 2004). The DIP aims
to integrate the diverse body of experimental evidence on protein-protein
interactions into a single, easily accessible online database. Because the re-
liability of experimental evidence varies widely, methods of quality assess-
ment have been developed and utilized to identify the most reliable subset
of the interactions. This core set can be used as a reference when evaluating
the reliability of high-throughput protein-protein interaction data sets for de-
velopment of prediction methods, as well as in studies of the properties of
protein interaction networks.
Obtaining the DIP database requires registration. The database is available
in an XSD format called XIN, as well as in tab-delimited flat files and other
formats.
SpiD http://genome.jouy.inra.fr/cgi-bin/spid/index.cgi
The Subtilis Protein interaction Database is a protein-protein interaction net-
work database centered on the replication machinery of the gram-positive
bacterium Bacillus subtilis (Hoebeke et al. 2001). This network was found by
using genome-wide yeast two-hybrid screening experiments and systematic
specificity assays (Noirot-Gros et al. 2002).
MINT http://160.80.34.4/mint/
The Molecular INTeraction database is a relational database containing inter-
action data between biological molecules (Zanzoni et al. 2002). At present,
MINT centers on storing experimentally verified protein-protein interactions
with special emphasis on proteomes of mammalian organisms. MINT con-
sists of entries obtained from data mining of the scientific literature. The
database is available in either a text format or in XML.
HPID http://wilab.inha.ac.kr/hpid/
The Human Protein Interaction Database was designed for the following
purposes (Han et al. 2004):
A set of online software tools has been developed to visualize and analyze
protein interaction networks.
TRANSFAC transfac.gbf.de
The most complete transcription factor database is TRANSFAC (Wingender
et al. 1996). This database is concerned with eukaryotic transcription regula-
tion. It contains data on transcription factors, their target genes, and regula-
tory binding sites. The TRANSFAC database requires a license and fee, even
for noncommercial use. It uses a flat file format which can be browsed but
cannot be downloaded.
TRRD www.bionet.nsc.ru/trrd
The Transcription Regulatory Regions Database is a resource containing an
integrated description of gene transcription regulation. Each entry of the
database is concerned with one gene and contains data on localization and
functions of the transcription regulatory regions as well as gene expression
patterns (Kolchanov et al. 2002). TRRD contains only experimental data ob-
tained from annotations in scientific publications. TRRD release 6.0 contains
information on 1167 genes, 5537 transcription factor binding sites, 1714 reg-
ulatory regions, 14 locus control regions and 5335 expression patterns ob-
tained from 3898 scientific papers.
The TRRD is arranged in seven databases: TRRDGENES (general gene
description), TRRDLCR (locus control regions), TRRDUNITS (regulatory re-
gions: promoters, enhancers, silencers, etc.), TRRDSITES (transcription fac-
tor binding sites), TRRDFACTORS (transcription factors), TRRDEXP (expres-
sion patterns), and TRRDBIB (experimental publications). All of them are
relational databases, and the schema consists of a large number of table def-
initions. SRS is used as a basic tool for navigating and searching TRRD and
integrating it with external database and software resources.
COMPEL compel.bionet.nsc.ru
COMPEL is a database of composite regulatory elements, the basic struc-
tures of combinatorial regulation. Composite regulatory elements are two
closely situated binding sites for distinct transcription factors and represent
minimal functional units providing combinatorial transcriptional regulation.
Both specific factor-DNA and factor-factor interactions contribute to the func-
tion of composite elements (CEs). Information about the structure of known
CEs and specific gene regulation achieved through such CEs appears to be
extremely useful for promoter prediction, for gene function prediction, and
for applied gene engineering as well.
Access to COMPEL requires registration, but it is free for noncommercial
use. The database consists of three relational database tables.
ooTFD www.ifti.org/ootfd
The purpose of ooTFD (object-oriented Transcription Factors Database) is to
capture information regarding the polypeptide interactions which constitute
and define the properties of transcription factors (Ghosh 2000). ooTFD is an
object-oriented successor to TFD (Ghosh 1993). The database is currently im-
plemented using ozone, a Java-based object-oriented database system. The
schema consists of nine primary Java data structures.
SGD www.yeastgenome.org
The Saccharomyces Genome Database is a database of the molecular biol-
ogy and genetics of the budding yeast Saccharomyces cerevisiae (Dwight et al.
2004). This database collects and organizes biological information about
genes and proteins of this yeast from the scientific literature, and presents
this information on individual Locus pages for each yeast gene. The Pathway
Tools software (Karp et al. 2002a) and the MetaCyc Database of metabolic
reactions (Karp et al. 2002b) were used to generate the metabolic pathway
information for S. cerevisiae. Metabolic pathways are illustrated in graphical
format and the information can be viewed at multiple levels, ranging from
general summaries to detailed diagrams showing each compound’s chem-
ical structure. Enzymatic activities of the proteins shown in each pathway
diagram are linked to the corresponding SGD Locus pages.
FlyBase flybase.bio.indiana.edu
The fruit fly, Drosophila melanogaster, is one of the most studied eukaryotic or-
ganisms and a central model for the Human Genome Project (FlyBase 2002).
FlyBase is a comprehensive database containing information on the genetics
and molecular biology of Drosophila. It includes data from the Drosophila ge-
nome projects and data curated from the literature. FlyBase is a joint project
with the Berkeley Drosophila Genome Project.
FlyBase is one of the founding participants in the GO consortium. As an
example of how FlyBase is related to GO, consider the D. melanogaster gene
p53 (FlyBase ID: FBgn0039044). Through FlyBase GO annotations, we can
learn that p53 is classified by the organization principles as follows:
1. GO:Molecular function: The p53 gene encodes a DNA-binding protein prod-
uct which functions as a transcription factor for RNA polymerase II.
clues regarding gene location, phenotype, and function. Synteny maps are
built based on the identification and mapping of conserved human-mouse
synteny regions. Comparative mapping is used to pinpoint unknown hu-
man homologs of known, mapped mouse genes.
GDB gdbwww.gdb.org
The GDB Human Genome Database is the main repository for all published
mapping information generated by the Human Genome Project. This data-
base is specific to Homo sapiens. The information stored in GDB includes
genetic maps, physical maps (clone, Sequence Tagged Site (STS), and Flu-
orescence In Situ Hybridization (FISH)-based), cytogenetic maps, physical
mapping reagents (clones, STSs), polymorphism information, and citations.
Pathbase www.pathbase.net
Pathbase is a mutant mouse pathology database that stores images of the
abnormal histology associated with spontaneous and induced mutations of
both embryonic and adult mice (Schofield et al. 2004). The database and the
images are publicly accessible and linked by anatomical site, gene, and other
identifiers to relevant databases. The database is structured around a novel
mouse pathology ontology, called MPATH, and provides high-resolution im-
ages of normal and diseased tissues that are searchable through orthogo-
nal taxonomies for pathology, developmental stage, anatomy, and gene at-
tributes. The database is annotated with GO terms, controlled vocabularies
for type of genetic manipulation or mutation, genotype, and free text for
mouse strain and additional attributes. The MPATH ontology is available in
DAG-Edit format.
ORDB senselab.med.yale.edu/senselab/ordb
The Olfactory Receptor Database is a central repository of olfactory recep-
tor (OR) and olfactory receptor-like gene and protein sequences (Crasto et al.
2002). The 2004 Nobel Prize in Physiology or Medicine was awarded jointly
to Richard Axel and Linda B. Buck for their discoveries of “odorant recep-
tors and the organization of the olfactory system.” Humans detect odorants
through ORs, which are located on the olfactory sensory neurons in the ol-
factory epithelium of the nose (Buck and Axel 1991; Buck 2000).
In building ORDB, relevant HTML files from GenBank and SWISS-PROT
and user-supplied text files are parsed to extract relevant data. Upon filter-
ing, an XML-encoded file is then built that is entered into the database via an
HTML submission form. The ORDB can be downloaded as an HTML file.
RiboWeb smi-web.stanford.edu/projects/helix/riboweb.html
RiboWeb is a relational database containing a representation of the primary
3D data relevant to the structure of the prokaryotic 30S ribo-
somal subunit, which initiates the translation of messenger RNA (mRNA)
into protein and is the site of action of numerous antibiotics (Chen et al.
1997). The project has since been expanded to include structural data per-
taining to the entire ribosome of prokaryotes (but primarily Escherichia coli).
The project includes computational modules for constructing and studying
structural models.
for transcriptional expression. These databases can be used for both hypoth-
esis testing and knowledge discovery.
NCBI’s dbEST Database www.ncbi.nlm.nih.gov/dbEST/
The GeneCards Database
bioinformatics.weizmann.ac.il/cards
Kidney Development Gene Expression Database
organogenesis.ucsd.edu
Gene Expression in Tooth bite-it.helsinki.fi
Mouse Gene Expression Database www.informatics.jax.org
The Cardiac Gene Expression Knowledgebase
www.cage.wbmei.jhu.edu
Gene Expression Atlas expression.gnf.org/cgi-bin/index.cgi
NCBI’s Gene Expression Omnibus
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=geo
Cancer Gene Expression Database
cged.hgc.jp/cgi-bin/input.cgi
Saccharomyces Genome Database www.yeastgenome.org
The Nematode Expression Pattern DataBase
nematode.lab.nig.ac.jp
WormBase www.wormbase.org
The Plasmodium Genome Resource plasmodb.org
The Zebrafish Information Network zfin.org
HGVbase hgvbase.cgb.ki.se/
The objective of the Human Genome Variation Database is to provide an ac-
curate, high-utility, and ultimately fully comprehensive catalog of normal
human gene and genome variation, useful as a research tool to help de-
fine the genetic component of human phenotypic variation. All records are
highly curated and annotated, ensuring maximal utility and data accuracy.
SNP500Cancer is a component of the Cancer Genome Anatomy Project (Packer et al. 2004) of the Na-
tional Cancer Institute (NCI). SNP500Cancer provides bidirectional sequenc-
ing information on a set of control DNA samples derived from anonymized
subjects (102 Coriell samples representing four self-described ethnic groups:
African/African-American, White, Hispanic, and Pacific Rim). All SNPs are
chosen from public databases and reports, and the choice of genes includes
a bias toward nonsynonymous and promoter SNPs in genes that have been
implicated in one or more cancers. The website is searchable by gene, chro-
mosome, gene ontology pathway, and by known dbSNP ID. For each ana-
lyzed SNP, the database includes the gene location and over 200 bp of sur-
rounding annotated sequence (including nearby SNPs). Other information is
also provided such as frequency information in total and per subpopulation
and calculation of the Hardy-Weinberg equilibrium for each subpopulation.
Sequence validated SNPs with minor allele frequency greater than 5% are en-
tered into a high-throughput pipeline for genotyping analysis to determine
concordance for the same 102 samples. The website provides the conditions
for validated genotyping assays.
SeattleSNPs Database pga.mbt.washington.edu
SeattleSNPs is a collaboration between the University of Washington and
the Fred Hutchinson Cancer Research Center, funded as part of the National
Heart Lung and Blood Institute’s (NHLBI) Programs for Genomic Applica-
tions (PGA). The goal of SeattleSNPs is to discover and model the associa-
tions between single nucleotide sequence differences in the genes and path-
ways that underlie inflammatory responses in humans. In addition to SNP
data (location, allele frequency, and function for coding SNPs), haplotypes
are presented graphically on the SeattleSNPs website. Haplotype tagging
SNPs (htSNPs) information is also provided that will allow fewer SNPs to be
genotyped per gene, thereby reducing cost and improving throughput. Data
are available in tab-delimited text files.
GeneSNPs www.genome.utah.edu/genesnps
The GeneSNPs database is sponsored by the National Institute of Environ-
mental Health Sciences and is being developed by the University of Utah
Genome Center. GeneSNPs is a component of the Environmental Genome
Project which integrates gene, sequence, and polymorphism data into indi-
vidually annotated gene models. The human genes included are related to
DNA repair, cell cycle control, cell signaling, cell division, homeostasis and
metabolism, and are thought to play a role in susceptibility to environmen-
tal exposure. Data are available in HTML, FASTA, and XML formats. The
XML format does not use a DTD, and most of the information is encoded as
FASTA text within element content.
The SNP Consortium snp.cshl.org
The SNP Consortium (TSC) was established in 1999 as a collaboration of sev-
eral companies and institutions to produce a public resource of SNPs in the
human genome (Thorisson and Stein 2003). The initial goal was to discover
300,000 SNPs in 2 years, but the final results exceeded this. For example, at
the end of 2001, as many as 1.4 million SNPs had been released into the pub-
lic domain (ISMWG 2001). The database now contains over 1.8 million SNPs.
The data are stored in a relational database and are available in tab-delimited
flat files.
International HapMap Project www.hapmap.org
The International HapMap project is charting the haplotype structure across
the entire human genome in major human ethnic groups (IHMC 2003). The
haplotype data of this project are available in XML. The format is specified
using XSD in www.hapmap.org/xml-schema/2003-11-04/hapmap.xsd.
PART II
This part addresses how ontologies are constructed and used. One uses on-
tologies far more frequently than one creates them, and it is a good idea to
have some experience with how ontologies are used before attempting to
design new ontologies. Accordingly, this part begins with the many uses for
ontologies, and it ends with how one constructs them.
One of the most common uses of ontologies is for querying and retrieval.
The first three chapters discuss how query processing works and how to for-
mulate effective queries. Because ontologies have deductive capabilities, the
result of a query makes use of inferred information as well as explicitly spec-
ified information. There are two main points of view that one can take with
respect to retrieval. The first point of view is based on imprecise queries,
while the second point of view is based on precise, logical queries. Imprecise
bioinformatics queries can be expressed in two ways: natural language or
biological sequences. Chapter 6 considers natural language queries, while
chapter 7 deals with biological sequence queries. Chapter 8 introduces com-
puter languages for unambiguous queries.
After information retrieval, the most common activity involving ontolo-
gies is transformation. The process whereby information is transformed from
one format to another is surveyed in chapter 9. Such processes can have
many steps and involve many groups of individuals. It is helpful to under-
stand the entire transformation process so that the individual steps can serve
the overall process better.
The individual transformation steps use a variety of programming lan-
guages and tools. One of the most common is Perl. While Perl is especially
well suited for data transformations involving unstructured files, it can also
be used for structured data. Chapter 10 is an introduction to Perl that em-
phasizes its use for data transformations. While Perl can be used effectively
on XML documents, there is now a language specifically designed for trans-
forming XML. This language is called XSLT, and it is introduced in chap-
ter 11. As bioinformatics data migrate from flat files to XML structured files,
one can expect that XSLT will play an increasing role.
This part ends with a detailed treatment of the process whereby ontolo-
gies are built. The ontologies and databases that were surveyed in chapter 5
were substantial endeavors involving many individuals and requiring the
agreement of the community being served. While ontologies certainly can
be developed in this way, it is also possible for ontologies to serve smaller
communities for more limited purposes. Chapter 12 is a practical guide for
developing ontologies in a systematic manner, whether the ontology will be
used by a large community, a small community, or even a single individual.
6 Information Retrieval
Summary
• Online search engines are based on the standard model for information
retrieval.
• In the standard model, a query is matched against a corpus and the most
relevant documents are retrieved.
The simplest search technique is to look for documents that contain the words
specified in a query. From this point of view a document is simply a set of
words, and the same is true of a query. Search consists of finding the docu-
ments that contain the words of the query. Many retrieval systems use this
basic technique, but this is only effective for relatively small repositories. The
problem is that the number of matches to a query can be very large, so some
mechanism must be provided that selects among the matching documents
or arranges the documents so that the best matches appear first.
Simply arranging the matching documents by the number of matching
words is not very effective because words differ in their selectivity. A word
such as “the” in English has little use in search by word matching because
nearly every document that uses English will have this word. For example,
PubMed (NIH 2004b) is a very large corpus containing titles, abstracts, and
other information about medical research articles. Table 6.1 gives the number
of times that the most common words occur in PubMed. The second column
of this table gives the number of times that the word occurs in the text parts
of the PubMed citations. The third column gives the number of documents
that contain the word. Note that “of” occurs in more documents than “the,”
although the latter occurs more often.
One can deal with the varying selectivity of words in several ways. One
could ignore the most commonly occurring words. The list of ignored words
is called the “stop word list.” One can also weight the matches so that more
commonly occurring words have a smaller effect on the choice of documents
to be returned. When this technique is used, the documents are arranged in
order by how well the documents match the query. Many algorithms have
been proposed for how one should rank the selected documents, but the one
that has been the most effective is vector space retrieval, also called the vector
space model. This method was pioneered by Salton and his collaborators (Salton et al. 1983; Salton 1989).
In this model, each document and query is represented by a vector in a very
high-dimensional vector space. The components of the vector (i.e., the axes
or dimensions of the vector space) are all the words that can occur in a doc-
ument or query and that can be used for searching. Such words are called
terms. Terms normally do not include stop words, and one commonly maps
synonymous words (such as words that differ only by upper- or lower-case
distinctions) to the same term.
The vector of a document or query will be very sparse: nearly all entries
will be zero for a particular document or query. The entry for a particular
term in the vector is a number called the term weight. Term weights can be
based on many criteria, but the two most important are the following (Salton
and McGill 1986):
1. Term Frequency. The number of times that the term occurs in a document. The assumption is that if a term occurs more frequently in the document, then it must be more important for that document.
2. Inverse Document Frequency. The logarithm of the inverse of the fraction of documents that contain the term. The assumption is that a term occurring in fewer documents is more selective and therefore more important when it does occur. The product of these two criteria is the TFIDF weight, illustrated in the sketch below.
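The following is a minimal sketch of these two weighting criteria in Python, assuming a toy in-memory corpus of tokenized documents; it is meant only to illustrate the idea, not the weighting scheme of any particular retrieval engine.

import math
from collections import Counter

# Toy corpus: each document is a list of terms (stop words already removed).
corpus = [
    ["obesity", "body", "mass", "index", "study"],
    ["gene", "expression", "study", "zebrafish"],
    ["obesity", "gene", "variant", "study"],
]

def tfidf(document, corpus):
    """Return a dict mapping each term of `document` to its TFIDF weight."""
    n_docs = len(corpus)
    tf = Counter(document)                   # term frequency within this document
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for doc in corpus if term in doc)   # number of documents with the term
        idf = math.log(n_docs / df)          # logarithm of the inverse document fraction
        weights[term] = freq * idf
    return weights

print(tfidf(corpus[0], corpus))
# "study" occurs in every document, so its weight is 0; rarer terms get larger weights.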
The cost factors are shown diagrammatically in figure 6.1. This same dia-
gram applies to any situation in which a statistical decision must be made.
Figure 6.1 Types of errors that can occur during document retrieval.
In practice, one does not explicitly specify either c1 or c2 or even their ra-
tio. Rather, one attempts to arrange the documents in descending order by
the ratio Pr(Relevant|D)/Pr(Irrelevant|D). The person requesting the query
can then examine the documents in this order until the documents are no longer relevant. In other words, the ratio c1/c2 is implicitly
determined by the researcher during examination of the document list.
The conditional probabilities Pr(Relevant|D) and Pr(Irrelevant|D) can be
“reversed” by applying Bayes’ law. Thus
Pr(Relevant|D) = Pr(D|Relevant) Pr(Relevant) / Pr(D)
and similarly for the probability of irrelevance. In the ratio of these two, the term Pr(D) cancels, and we obtain the following expression:
Pr(Relevant|D) / Pr(Irrelevant|D) = [Pr(D|Relevant) / Pr(D|Irrelevant)] × [Pr(Relevant) / Pr(Irrelevant)]
The last factor in the equation above is a ratio that depends only on the
query Q, not on the document D. Consequently, arranging the documents in
descending order by the ratio Pr(D|Relevant)/Pr(D|Irrelevant) will produce
exactly the same order as using the ratio Pr(Relevant|D)/Pr(Irrelevant|D).
This is fortunate because the probabilities in the former ratio are much easier
to compute.
To estimate the ratio Pr(D|Relevant)/Pr(D|Irrelevant), first consider the
denominator. In a large corpus such as the web, with billions of pages, or
Medline with over 12 million citations, one will rarely be interested in more
than a very small fraction of all documents. Thus nearly all documents will
be irrelevant. As a result, it is reasonable to assume that Pr(D|Irrelevant) is
the same as Pr(D).
To estimate Pr(D|Relevant)/Pr(D) it is common to assume that the docu-
ments and queries can be decomposed into statistically independent terms.
We will discuss how to deal with statistical dependencies later. Statistical in-
dependence implies that Pr(D|Relevant) is the product of Pr(T|Relevant) for
all terms T in the document D, and Pr(D) is the product of the unconditional
probabilities Pr(T). Because queries can also be decomposed into indepen-
dent terms, there are two possibilities for a term T in a document D. It is
either part of the query Q or it is not. If T is in the query Q, then by defini-
tion the term T is relevant, so Pr(T|Relevant) = 1. If T is not in the query Q,
then the occurrence of T is independent of any relevance determination, so
Pr(T|Relevant) = Pr(T). The ratio Pr(D|Relevant)/Pr(D) is then the product of
two kinds of factor: 1/Pr(T) when T is in the query Q and Pr(T)/Pr(T) when
T is not in the query Q. So all that matters are the terms in D that are also in
Q.
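As a rough sketch of this argument, and under the independence assumptions above, a document can be ranked by the sum of log(1/Pr(T)) over the query terms that it contains. The following Python fragment estimates Pr(T) from document frequencies in a toy corpus; the corpus and the query are invented for illustration.

import math

corpus = [
    ["obesity", "body", "mass", "index", "study"],
    ["gene", "expression", "study", "zebrafish"],
    ["obesity", "gene", "variant", "study"],
]

def term_probability(term, corpus):
    """Estimate Pr(T) as the fraction of documents that contain the term."""
    return sum(1 for doc in corpus if term in doc) / len(corpus)

def relevance_score(document, query, corpus):
    """Logarithm of Pr(D|Relevant)/Pr(D): only query terms present in D contribute."""
    score = 0.0
    for term in set(query):
        if term in document:
            score += math.log(1.0 / term_probability(term, corpus))
    return score

query = ["obesity", "gene"]
for doc in sorted(corpus, key=lambda d: relevance_score(d, query, corpus), reverse=True):
    print(round(relevance_score(doc, query, corpus), 3), doc)

Note that a term such as "study," which occurs in every document, contributes log(1) = 0, which is the probabilistic counterpart of treating it as a stop word.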
3. Queries are highly specific. In other words, the set of relevant documents is
relatively small compared with the entire collection of documents. This
holds when the queries are small (i.e., have very few terms), but it is less
accurate when queries are large (e.g., when one compares documents with
other documents). However, modern corpora (such as Medline or the
World Wide Web) are becoming so immense that even very large docu-
ments are small compared with the corpus.
The dot product has a nice geometric interpretation. If the two vectors
have unit length, then the dot product is the cosine of the angle between the
two vectors. For any nonzero vector v there is exactly one vector that has unit
length and has the same direction as v. This vector is obtained by dividing v
by its length: v/|v|. Thus the cosine of the angle between vectors v and w is given by (v·w)/(|v||w|). The length |v| of a vector is also called its norm, hence v/|v| is called the nor-
malization of v. Some systems normalize the vectors of documents so that all
documents have the same “size” with respect to information retrieval, and so
that the dot product is the cosine of the angle between vectors. Normaliza-
tion does not have a probabilistic interpretation, so it is not appropriate for
information retrieval using a query. However, it is useful when documents
are compared with one another. In this case, the cosine of the angle between
the document vectors is a measure of similarity that varies between 0 and
1. A value of 0 means that the documents are unrelated, while a value of
1 means that the documents use the same terms with the same relative fre-
quencies. One can use similarity functions such as the cosine as a means of
classifying documents by looking for clusters of documents that are near one
another. All of the clustering algorithms mentioned in section 1.5 can be used
to cluster documents either hierarchically or by using some other organizing
principle. Clustering techniques based on similarity functions are still in use,
but they have been superseded to some extent by citation-based techniques,
to be discussed in section 6.4.
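A minimal sketch of this similarity computation, assuming that documents are already represented as sparse term-weight dictionaries (for example, the TFIDF weights sketched earlier):

import math

def cosine_similarity(v, w):
    """Cosine of the angle between two sparse vectors given as term-to-weight dicts."""
    dot = sum(weight * w.get(term, 0.0) for term, weight in v.items())
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    norm_w = math.sqrt(sum(weight * weight for weight in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0
    return dot / (norm_v * norm_w)

doc1 = {"obesity": 1.1, "gene": 0.4}
doc2 = {"obesity": 0.5, "variant": 1.1}
print(cosine_similarity(doc1, doc2))   # a value between 0 (unrelated) and 1 (same direction)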
In spite of the logical elegance of the vector space model, it has several
deficiencies.
2. Many languages, including English, also vary the form of a word for
grammatical purposes. This is known as inflection. English words, for example, can be singular or plural: while “normetanephrines” occurs in only five PubMed citations, the singular form “normetanephrine” occurs in 1207 citations. Although the singular and plural forms have dif-
ferent meanings, such distinctions are rarely important during a search.
3. The vector space model treats the document as just a collection of un-
connected and unrelated terms. There is no meaning beyond the terms
themselves.
4. It presumes that the terms are statistically independent, both in the collec-
tion as a whole and in each document. The vector space model in general
allows for terms that are correlated, but it is computationally difficult even
to find correlations between pairs of terms, let alone sets of three or more
terms, so very few retrieval engines attempt to find or to make use of such
correlations.
Summary
• Words have different degrees of selectivity.
• The most common term weight is the TFIDF weight, which is the product of the number of times that the word occurs in the document and the logarithm of the inverse of the fraction of documents that contain the word.
• In spite of its elegance and geometric appeal, the vector space model de-
pends on many assumptions and has a number of limitations.
1. One can use more general concepts, when more specific concepts do not
find the desired information. This is known as “broadening” the query.
2. One can use more specific concepts, when more general concepts find too
much information. This is known as “narrowing” the query.
3. One can use concepts that are related in ways that are nonhierarchical.
For example, a nucleolus is a part of the nucleus of a cell. This is a query
modification which neither broadens nor narrows the query.
Summary
• Ontologies are an important source of terminology that can be used to
formulate queries.
• Biological and medical ontologies can be so large and complex that spe-
cialized browsing and retrieval tools are necessary.
• Several browsers are now available for the UMLS.
• One can use ontologies as a means of query modification when a query
does not return satisfactory results.
D E F
D 0 1 1
E 0 0 0
F 0 1 0
For example, the 1 in the first row and second column of the matrix indi-
cates that document D refers to document E. Note that the rows and columns
have been labeled for ease in understanding the meaning of the entries. This
matrix is called the adjacency matrix of the graph. It is usually designated by
the letter A. We now compute the matrix products AT A and AAT , where the
superscript T means that the matrix has been transposed. The following are
these two matrices:
The matrix AT A:
  D E F
D 0 0 0
E 0 2 1
F 0 1 1
The matrix AAT:
  D E F
D 2 0 1
E 0 0 0
F 1 0 1
The original matrix A will not be symmetric in general, but both of the
products will be. In fact, both matrices are positive semidefinite. In other
words, the eigenvalues will be nonnegative. The largest eigenvalue is called
the principal eigenvalue, and its eigenvectors are called the principal eigen-
vectors. While it is difficult in general to compute eigenvalues and eigen-
vectors of large matrices, it is relatively easy to find a principal eigenvector.
The space of principal eigenvectors is called the principal component. Prin-
cipal components analysis (PCA) is a commonly used statistical technique
for accounting for the variance in data.
In the case of graphs, each entry in a principal eigenvector measures the relative importance of the corresponding node with respect to the links. Each
of the two matrices has a different interpretation. The matrix AT A is the au-
thority matrix. The principal eigenvector ranks the documents according to
how much they are referred to by other documents. In this case the principal
eigenvector is (0, 1, 0.618). Document D is not referred to by any other doc-
ument in this set, so it is no surprise that it is not an authority. Document E
is referred to by two other documents, and document F is referred to by just
one other document. Thus E is more of an authority than F.
It is interesting to compare the Kleinberg algorithm with what one would
obtain using simple citation counts, as is often done in the research literature.
Since E has twice as many citations as F, one would expect that E would be
twice as authoritative as F. However, the principal eigenvector adjusts the
authoritativeness of each citation so that the authority weights are consis-
tent. In effect, the algorithm is implicitly assigning a level of “quality” to
the citations. In other words, being cited by a more authoritative document
counts more than being cited by a less authoritative source.
The matrix AAT is the hub matrix. A hub or central source is a document
that refers to a large number of other documents in the same set. In the re-
search literature, a survey article in a field would be a hub, and it might not
be an authority. This would be the case shortly after the survey article has
been published and before it has been cited by other articles. The princi-
pal eigenvector ranks the documents according to how much of a hub each one is
for the particular query. In the example, the principal eigenvector is (1, 0,
0.618). Since document E does not refer to any other documents in this set, it
is not a hub. Document D is the main hub, since it refers to two other doc-
uments, while document F is much less of a hub since it only refers to one
other document.
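The authority and hub scores of this example can be reproduced with a short power-iteration sketch using NumPy; scaling each iterate so that its largest entry is 1 is done only for readability.

import numpy as np

# Adjacency matrix of the three-document example: A[i][j] = 1 if document i cites document j.
A = np.array([[0, 1, 1],   # D cites E and F
              [0, 0, 0],   # E cites nothing
              [0, 1, 0]])  # F cites E

def principal_eigenvector(M, iterations=100):
    """Power iteration: return the principal eigenvector, scaled so the largest entry is 1."""
    v = np.ones(M.shape[0])
    for _ in range(iterations):
        v = M @ v
        if v.max() > 0:
            v = v / v.max()
    return v

authority = principal_eigenvector(A.T @ A)   # how much each document is cited
hub = principal_eigenvector(A @ A.T)         # how much each document cites others

print("authority:", authority.round(3))      # approximately [0.  1.  0.618] for (D, E, F)
print("hub:      ", hub.round(3))            # approximately [1.  0.  0.618] for (D, E, F)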
One interesting and useful feature of the Kleinberg algorithm is that it is
possible for a document to have considerable importance either as an au-
thority or a hub even when the document does not match any of the terms
in the query. As a result, the Kleinberg algorithm improves both selectivity
and coverage of retrieval. It improves selectivity by improving the ordering
of documents so that the most relevant documents are more likely to occur at
the beginning of the list. It improves coverage by retrieving documents that
do not match the query but which are cited by documents that do match.
However, the Kleinberg algorithm does have its weaknesses. Because it
focuses on the principal eigenvector, the other eigenvectors are ignored even
when they may represent the actual focus of interest of the researcher. This
is the case when a relatively small community uses the same terminology as
a much larger community. For this reason, commercial search engines like
Google that are based on the Kleinberg algorithm do not implement it in its
original form.
Google, for example, uses a formula which differs from the Kleinberg al-
gorithm in several ways:
3. The original adjacency matrix is used rather than either the authority or
hub matrix. Thus the algorithm is measuring a form of popularity rather
than whether the document is authoritative or a central source.
Current search engines have another weakness. The original set of candi-
date documents is obtained using simple word-matching strategies that do
not incorporate any of the meaning of the words. As a simple example, try
running these two queries with Google: “spinal tap” and “spinal taps.” From
almost any point of view these two have essentially the same meaning. Yet,
the documents displayed by Google have completely different rankings in
these two cases. Among the first ten documents of each query there is only
one document in common. Although the spinal tap query is problematic
because there is a popular movie by that name, one can easily create many
more such examples by just varying the inflection of the words in the query
or by substituting synonymous words or phrases.
One obvious way to deal with this shortcoming of Google would be to in-
dex using concepts rather than character strings. This leads to the possibility
of search based on the meaning of the documents. Many search engines, in-
cluding Google, are starting to incorporate semantics in their algorithms. We
discuss this in section 6.6.
Summary
• Citations (such as hypertext links) can be used to rank documents relevant
to a query according to various criteria:
1. Authoritativeness
2. Central source
3. Popularity
1. Current systems are based on matching words in the query with words
in documents and do not consider the meaning of the words.
2. Only the principal eigenvector is used, so smaller communities will be
masked by larger ones.
One of the main assumptions of the vector space model is that documents
are composed of collections of terms. While some systems attempt to take
advantage of correlations between terms, such correlations are difficult to
determine accurately, and the number of correlations that must be computed
is huge. In any case, the terms are still disjoint from one another. Knowledge
representations change this situation. Terms can now be complex concept
combinations that are built from simpler terms. Thus a term like “flu vac-
cine” contains both “flu” and “vaccine” as well as the complex relationship
between these two concepts which expresses the effect of the vaccine on the
influenza virus as well as the derivation of the vaccine from the virus and
in response to it. In the UMLS, all three of these are concepts, and they are
related to one another.
To see how natural, as well as how subtle, concept combinations can be,
try juxtaposing two commonly used terms in different orders. For example,
“test drug” and “drug test.” Although these two have completely different
meanings, most search engines give essentially the same answer for both.
Indeed, “test drug” can be interpreted in two ways depending on whether
“test” is a verb or adjective. The term “drug test” also has several meanings.
As an exercise, try some other pairs of terms to see how many meanings
you can extract from them. Concept combination could be a powerful in-
formation retrieval mechanism, provided it is properly interpreted. With a
relatively small number of basic concepts along with a small number of con-
ventional relationships, one can construct a very large number of concept
combinations.
Concept combination, also called conceptual blending and conceptual inte-
gration, is an active area of research in linguistics. The meaning of a concept
combination requires a deeper understanding of the relationship between
words and the phenomena in the world that they signify. Based on the earlier
work of Peirce, de Saussure, and others in the field of semiotics, Fauconnier
and Turner (1998, 2002) have developed a theory of conceptual blending that
explains how concepts can be blended. However, this theory is informal.
Goguen has now developed a formal basis for conceptual blending (Goguen
1999; Goguen and Harrell 2004). Furthermore, Goguen and his students have
developed software that automates the blending of concepts, and their sys-
tem has been used to understand and even to create poetry and other narra-
tives. Concept combination is closely connected with human categorization
and metaphor. For an entertaining account of these topics, see Lakoff’s book
with the intriguing title Women, Fire and Dangerous Things: What Categories
Reveal about the Mind (Lakoff 1987).
The tool developed by Goguen and his students, mentioned above, is ca-
pable of finding a wealth of concept combinations even when the concepts
are relatively simple. For the words “house” and “boat,” their tool finds
48 complete blends and 736 partial blends. Two of these have become so
common that they are considered single words; namely, “houseboat” and
“boathouse.” Others are less obvious, but still make sense, such as a boat
used for transporting houses, an amphibious recreational vehicle, or a boat
used permanently on land as a house.
As one might imagine, the combinatorial possibilities for combinations be-
come enormous when there are more than two words being combined. A
typical title for a biomedical research article can have a dozen words. Un-
derstanding the meaning of such a title can be a formidable undertaking if
one is not familiar with the subject matter of the article, as we pointed out in
section 1.6. Goguen and Harrell (2004) pointed out that conceptual blending alone is not sufficient for understanding entire narratives that involve many such blends. They introduced the notion of struc-
tural blending, also called structural integration, to account for the meaning of
whole documents.
Having introduced concept combinations, one still has the problem of how
Summary
• Concepts can be combined in many ways which are much deeper than
just the juxtaposition of the words used.
• The vector space model can be extended to deal with concept combina-
tions, but it is still subject to deficiencies because it does not deal with the
meaning of words.
times the retrieval will be useless. These engines use a variety of mechanisms
for overcoming this limitation, but they can never completely eliminate it.
This is in striking contrast with relational database queries which always
return all of the items specified and no others. Using the terminology of
information retrieval, relational queries always have 100% coverage and se-
lectivity. It is natural to imagine that one could try to achieve the same cover-
age and selectivity with information retrieval. To do so one must overcome
several difficult problems:
2. Query the documents with natural language. This means that NLP tech-
niques must be used to extract the knowledge representation of the query.
The query knowledge representation can then be matched against the
knowledge representations of the documents.
Figure 6.3 Top part of the result screen for the SKIP retrieval system showing the
knowledge representation of the query and the document that matches the query the
best.
The first two matching documents match the best. They contain all of the
concepts in the query, and all but two of them are related as in the query.
The only relationship that was not found in these documents was between
“plasma membrane” and “lymphocytes.” The knowledge representation to
the left of the document link shows the part of the query that was found in
the document.
Scrolling down the result screen gives figure 6.4 which shows a number
of other documents that contain fewer concepts and relationships than the
documents that match the best. Continuing to scroll the results screen will
show many more matching documents, but these match less and less of the
original query.
This approach to retrieval has some advantages that were already dis-
cussed above. Another advantage specific to SKIP is that the retrieved doc-
uments are arranged in groups and labeled by how they match using an
intuitive and visually appealing graphical structure.
Figure 6.4 Other documents that match a query. The knowledge representations
shown on the left show the part of the query that occurs in the documents on the
right.
Summary
• Translating natural language text to a representation language that cap-
tures meaning remains an unsolved problem, but reasonably good know-
ledge representations are possible.
Information retrieval can take many forms, and does not have to be based
on natural language. In bioinformatics, it is very common to base queries on
biological sequences, the biochemical language of cells. Indeed, most predic-
tions of biological function are obtained by comparing new sequence data
(for which little is known) with existing data (for which there is prior know-
ledge). The comparison is performed by using the new sequence data as a
query to retrieve similar sequence data in a corpus of such data. Such com-
parisons are of fundamental importance in computational biology. Similar
sequences are referred to as being homologous.
In this chapter we present the basic concepts necessary for sequence sim-
ilarity and the main approaches and tools for sequence similarity search-
ing. The most commonly used sequence similarity searching tools in com-
putational biology are FASTA, Basic Local Alignment Search Tool (BLAST),
and the many variations of BLAST. All these algorithms search a sequence
database for the closest matches to a query sequence. It should be noted that
all three algorithms are database search heuristics, which may completely
miss some significant matches and may produce nonoptimal matches. Of
these three tools, BLAST is the most heavily used sequence analysis tool
available in the public domain.
a way of lining up the residues in the query sequence with part of a sequence
in the corpus. Such a lining up is called an alignment. In an alignment, the
match can fail to be an exact match in two ways: aligned residues can be
different and there may be gaps in one sequence relative to the other. For
each alignment one can compute a similarity measure or score based on the
residues that match or fail to match and the sizes of the gaps. Matches gen-
erally contribute positively to the overall score while mismatches and gaps
contribute negatively. The scoring matrix specifies the contribution to the
overall score of each possible match and mismatch. This contribution can
be dependent on the position of a residue in the query sequence, in which
case the scoring matrix is called a position-specific scoring matrix (PSSM). Such
matrices are also called “profiles” or “motifs.” If the contributions do not
depend on positions, then the scoring matrix specifies the score associated
with a substitution of one type of residue for another. Such a scoring matrix
is called a substitution matrix. The gap penalties specify the effect of gaps on
the score. The objective of a sequence similarity matching tool is to find the
alignments with the best overall score.
There are a number of ways to compute the alignment score. The pri-
mary distinction is between nucleotide sequences and amino acid sequences.
The scoring for amino acid sequence similarity is more complicated because
there are more kinds of amino acids and because amino acid properties are
more complicated than nucleotide properties. For example, chemical struc-
tures and amino acid frequencies can both be taken into consideration. If two
aligned residues have a very low probability of being homologous, a heavy
penalty score is given for such a mismatch. Protein evolution is believed to be subject to stronger selective forces than DNA evolution, so some amino acid substitutions (such as those that result in Mendelian disorders) are much less well tolerated functionally than others because natural selection acts against them.
The two most commonly used substitution matrices for amino acids are
the point accepted mutation (PAM) (Dayhoff et al. 1978) and the blocks sub-
stitution matrix (BLOSUM) (Henikoff and Henikoff 1992). BLOSUM is more
popular than PAM. In both cases, the entries in the matrix have the form
s_ij = C · log_C(r_ij), where C determines the units by which the entries are scaled (usually 2 for BLOSUM and 10 for PAM) and r_ij is the ratio of the
estimated frequency with which the amino acids i and j are substituted due
to evolutionary descent, to the frequency with which they would be substi-
tuted by chance. The numerator of this ratio is computed by using a sample
of known alignments. This formula is known more succinctly as the log-odds
formula. Logarithms are used so that total scores can be computed by adding
the scores for individual residues in the alignment. Vector space retrieval for
text databases uses the same technique. For convenience, s_ij is often rounded
to the nearest integer.
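A minimal numerical sketch of the log-odds formula, using made-up frequencies rather than values from an actual PAM or BLOSUM calculation:

import math

def log_odds_score(observed_freq, expected_freq, C=2):
    """s_ij = C * log_C(r_ij), where r_ij is the observed/expected substitution frequency ratio."""
    r = observed_freq / expected_freq
    return C * math.log(r, C)

# Hypothetical example: a substitution observed 4 times as often as expected by chance,
# scored in half-bit units (C = 2) and rounded as in published matrices.
s = log_odds_score(observed_freq=0.04, expected_freq=0.01, C=2)
print(round(s))   # 4 half-bits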
BLOSUM matrices were based on data derived from the BLOCKS database
(Henikoff and Henikoff 1991, 1994), which is a set of ungapped alignments
of protein families (i.e., structurally and functionally related proteins). Using
about 2000 blocks of such aligned sequence segments, the sequences of each
block are sorted into closely related clusters, and the probability of a mean-
ingful amino acid substitution is calculated based on the frequencies of sub-
stitutions among these clusters within a family. The number associated with
a BLOSUM matrix (such as BLOSUM62) indicates the cut-off value for per-
centage sequence identity that defines the clusters. In particular, BLOSUM62
scores alignments with sequence identity at most 62%. Note that a lower
cut-off value would allow more diverse sequences into the clusters, and the
corresponding matrices are therefore appropriate for examining more distant
relationships.
The PAM matrices are based on taking sets of high-confidence alignments
of many homologous proteins and assessing the frequencies of all substi-
tutions. The PAM matrices were calculated based on a certain model of
evolutionary distance from alignments of closely related sequences (about
85% identical) from 34 “superfamilies” grouped into 71 evolutionary trees
and containing 1572 point mutations. Phylogenetic trees were reconstructed
based on these sequences to determine the ancestral sequence for each align-
ment. Substitutions were tallied by type, normalized over usage frequencies,
and then converted to log-odds scores. The value in a PAM1 matrix repre-
sents the probability that 1 out of 100 amino acids will undergo substitution.
Multiplying PAM1 by itself generates PAM2, and more generally (PAM1)^n is a scoring matrix for amino acid sequences that have undergone n multiple and independent steps of mutation. Thus, the PAM250 matrix corresponds to 130 more steps of mutation than the PAM120 matrix. Hence, for aligning closely related amino acid sequences, the PAM120 matrix is a good choice; for aligning more distantly related amino acid sequences, the PAM250 matrix is
a more appropriate choice. It should be noted that errors can be amplified
during the multiplication process, and thus higher-order PAM matrices are
more error-prone. By comparison, in a BLOSUM62 matrix, each value is cal-
culated by dividing the frequency of occurrence of the amino acid pair in the
BLOCKS database, “clustered” at the 62% level, by the probability that the
same amino acid pair aligns purely by chance. PAM matrices are scaled in
10·log_10 units, which is roughly the same as third-bit units. BLOSUM matrices are usually scaled in half-bit units. In either type of scoring matrix, a score of 0 means that the alignment of the amino acid pair is no more likely than chance; a positive score means that the pair is aligned more often than would be expected by chance; and a negative score means that the pair is aligned less often than would be expected by chance.
The NCBI BLAST tool allows one to choose from a variety of scoring matri-
ces, including PAM30, PAM70, BLOSUM45, BLOSUM62, and BLOSUM80. A
more complete roster of scoring matrices (PAM10–PAM500 and BLOSUM30–BLOSUM100) is available at the following FTP site: ftp://ftp.ncbi.nlm.nih.gov/blast/matrices.
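The way higher-order PAM matrices arise by repeated matrix multiplication can be sketched with a toy example. The two-letter "alphabet", the mutation probabilities, and the background frequencies below are all invented for illustration; the real calculation uses 20 amino acids and the Dayhoff data.

import numpy as np

# Hypothetical PAM1-style mutation probability matrix for a two-letter alphabet:
# entry [i, j] is the probability that residue i is replaced by residue j over
# one PAM unit of evolutionary distance (each row sums to 1).
pam1 = np.array([[0.99, 0.01],
                 [0.02, 0.98]])

# (PAM1)^n models n independent steps of mutation.
pam250 = np.linalg.matrix_power(pam1, 250)

# Convert probabilities to log-odds scores in 10*log10 units, relative to
# hypothetical background frequencies (broadcasting divides each column j by background[j]).
background = np.array([0.6, 0.4])
scores = 10 * np.log10(pam250 / background)
print(scores.round(1))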
Mutational events include not only substitutions but also insertions and
deletions. Consequently one must also consider the possibility of alignment
gaps. However, gaps are a form of sequence mismatch, so they affect the
score negatively. During the process of alignment, the initiation of a new
gap adds a penalty called an opening gap penalty, while the widening of
an existing gap adds an extension gap penalty. For amino acid sequences,
it is common to set extension gap penalties to be lower than opening gap
penalties because certain protein domains evolve as a unit, rather than as
single residues.
Summary
• Sequence similarity search is a process whereby a query sequence is com-
pared with sequences in a database to find the best matches.
• The score depends on the scoring matrix and the gap penalties.
• The most commonly used substitution matrices are PAM and BLOSUM.
The first algorithm that was used for sequence matching was a dynamic
programming algorithm, called the Needleman-Wunsch algorithm (Needle-
man and Wunsch 1970). A dynamic programming algorithm finds an op-
timal solution by breaking the original composite problem recursively into
smaller and smaller problems until the smallest problems have trivial solu-
tions. The smaller solutions are then used to construct the solutions for the
larger and larger parts of the original problem until the original problem has
been solved. In this case, the composite problem is to determine the optimal
alignment of the two sequences at their full lengths. This alignment prob-
lem is split by breaking down the two sequences into smaller segments. The
splitting continues recursively until the subproblem consists of comparing
two residues. At this point the score is obtained from the scoring matrix. The
resulting alignment is guaranteed to be globally optimal. Smith and Waterman (1981) modified the Needleman-Wunsch algorithm to make it run faster, but it guarantees only that the alignment is locally optimal.
Although the exact dynamic programming algorithms are guaranteed to
find the optimal match (either global or local), they can be very slow. This
is especially true for a full search of the very large sequence databases such
as GenBank for nucleotide sequences and SWISS-PROT for amino acid se-
quences that are commonly used today. To deal with this problem, a number
of heuristic techniques have been introduced, such as FASTA and BLAST,
that give up the guarantee of optimality for the sake of improved speed.
In practice, the effect on optimality is small, so the improvement in perfor-
mance is worth the compromise. These new algorithms search for the best
local alignment rather than the best global alignment.
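A minimal sketch of global alignment scoring by dynamic programming, with an invented match/mismatch scheme and a single linear gap penalty instead of separate opening and extension penalties:

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Fill the dynamic programming matrix and return the optimal global alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):              # aligning a prefix of a against an empty string
        F[i][0] = i * gap
    for j in range(1, cols):              # aligning a prefix of b against an empty string
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = F[i - 1][j] + gap        # gap in b
            left = F[i][j - 1] + gap      # gap in a
            F[i][j] = max(diag, up, left)
    return F[-1][-1]

print(needleman_wunsch_score("GATTACA", "GCATGCA"))

The Smith-Waterman local variant differs mainly in that negative intermediate scores are reset to zero and the best score may occur anywhere in the matrix rather than in the final cell.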
Summary
• The earliest sequence similarity searching algorithms applied exact dy-
namic programming either globally or locally.
• Current algorithms are heuristic methods that still use dynamic program-
ming but apply approximations to improve performance.
7.3 FASTA
By default, ktup is 2 for amino acid sequences and 6 for nucleotide sequences. The next step is to
extend the matches of length ktup to obtain the highest scoring ungapped
regions. In the third step, these ungapped regions are assessed to determine
whether they could be joined together with gaps, taking into account the
gap penalties. Finally the highest scoring candidates of the third step are re-
aligned using the full Smith-Waterman algorithm, but confining the dynamic
programming matrix to a subregion around the candidates. The trade-off
between speed and sensitivity is determined by the value of the ktup param-
eter. Higher values of ktup, which represent higher “word” sizes, will give
rise to a smaller number of exact hits and hence a lower sensitivity, but will
result in a faster search. For the purpose of tuning, the ktup parameter will
generally be either 1 or 2 for amino acid sequences and can range from 4 to 6
for nucleotide sequences.
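The word-matching first step can be sketched as follows; this is a simplified illustration of the idea, not the actual FASTA implementation, and the sequences are invented.

from collections import defaultdict

def ktup_matches(query, target, ktup=6):
    """Return (query_position, target_position) pairs where words of length ktup match exactly."""
    index = defaultdict(list)
    for j in range(len(target) - ktup + 1):
        index[target[j:j + ktup]].append(j)        # every overlapping word in the target
    matches = []
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            matches.append((i, j))
    return matches

print(ktup_matches("GATTACAGATTACA", "TTACAGATT", ktup=6))
# [(2, 0), (3, 1), (4, 2), (5, 3)]: a run of matches along one diagonal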
A sequence file in FASTA format can contain several sequences. Each se-
quence in FASTA format begins with a single-line description, followed by
lines of sequence data. The description line must begin with a greater-than
symbol (>) in the first column. An example sequence in FASTA format is
shown in figure 7.1.
Figure 7.1 FASTA format of a 718-bp DNA sequence (GenBank accession number
AF200505.1) encoding exon 4 of Pongo pygmaeus apolipoprotein E (ApoE) gene.
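Reading such a file is straightforward. The following is a minimal sketch of a FASTA parser; the file name in the commented usage example is hypothetical.

def read_fasta(path):
    """Return a dict mapping each description line (without the '>') to its sequence."""
    sequences = {}
    description = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):          # a new record begins
                description = line[1:]
                sequences[description] = []
            elif description is not None:
                sequences[description].append(line)
    return {desc: "".join(parts) for desc, parts in sequences.items()}

# Hypothetical usage:
# for desc, seq in read_fasta("apoe_exon4.fasta").items():
#     print(desc, len(seq))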
Summary
• FASTA is a set of sequence similarity search programs.
• FASTA is also a sequence format, and this is currently the main use for
FASTA.
7.4 BLAST
The most widely used tool for sequence alignment is BLAST (McGinnis and
Madden 2004), and it plays an important role in genome annotation (Muller
et al. 1999). BLAST uses a heuristic approach to construct alignments based
on optimizing a measure of local similarity (Altschul et al. 1990, 1997). Be-
cause of its heuristic nature, BLAST searches much faster than the main
dynamic programming methods: the Needleman-Wunsch (Needleman and
Wunsch 1970) and Smith-Waterman (Smith and Waterman 1981) algorithms.
In this section we begin by explaining the BLAST algorithm. The algorithm
is then used for a number of types of search, as presented in subsection 7.4.2.
The result of a BLAST search is a collection of matching sequences (or “hits”).
Each hit is given a number of scores that attempt to measure how well the
hit matches the query. These scores are explained in subsection 7.4.3. We end
the section with some variations on the BLAST algorithm.
FASTA differs from BLAST primarily in that FASTA strives to get exact
“word” matches, whereas BLAST uses a scoring matrix (such as the de-
fault BLOSUM62 for amino acid sequences) to search for words that may
not match exactly, but are high-scoring nevertheless. FASTA does not have a
preprocessing step as in BLAST, and FASTA does not use the BLAST strategy
of extending seeds using sophisticated dynamic programming. Both FASTA
and BLAST have a word generation step which does not allow gaps, fol-
lowed by a Smith-Waterman alignment step that can introduce gaps.
Summary
Figure 7.2 Illustration of blastx (version 2.2.10) output using a 718-bp DNA se-
quence (GenBank accession number AF200505.1) encoding exon 4 of Pongo pygmaeus
ApoE gene.
Summary
• There are publicly available BLAST web services for searches done with
one sequence at a time.
Figure 7.3 Comparison of the extreme value distribution with the normal distribu-
tion.
distinct HSPs that would have that score or higher entirely by chance. The
expectation value is written E and is approximated by a Poisson distribution
(Karlin and Altschul 1990; Altschul 1991). In terms of the normalized score
S′, the expectation value E is given by E = mn·2^(−S′), where m is the size of the query and n is the size of the database. The expectation value is probably the most useful quantity in the BLAST output. The threshold for significance is usually
set at either 10% or 5%. In other words, when E is less than 0.1 or E is less
than 0.05, then the HSP is considered to be statistically significant (Altschul
et al. 1997).
Strictly speaking, the E-value is not a probability, so it should not be used
to determine statistical significance. However, it is easy to convert E to a
probability by using the formula P = 1 − e^(−E). The P-value is the probability that a search with a random query would produce at least one HSP at the same score or higher. Table 7.1 shows the relationship between E and P. For E-values below 0.01, there is essentially no difference between E and P. The reason for this is that the Taylor expansion of e^x is 1 + x + x^2/2! + x^3/3! + ..., so for x close to 0, e^x is approximately equal to 1 + x; thus, when E is close to 0, P = 1 − e^(−E) is approximately equal to 1 − (1 − E) = E.
The usual way to use BLAST is to find those sequences in a database that
are homologous to a given query sequence. This process compares sequences
in the database with the query sequence, but it does not compare the data-
base sequences with each other. If one wishes to learn about the evolution of
Table 7.1 The relationship between the E-value and the P-value.
E-value   P-value
10        0.9999546
5         0.9932621
1         0.6321206
0.1       0.0951626
0.05      0.0487706
0.001     0.0009995 (about 0.001)
0.0001    0.0001000
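The numbers in table 7.1, and the relation between the normalized score and the E-value, can be reproduced with a few lines of Python; this is a sketch of the formulas quoted above, not of the full Karlin-Altschul statistics, and the query and database sizes at the end are hypothetical.

import math

def expectation_value(bit_score, m, n):
    """E = m * n * 2**(-S'), where m and n are the query and database sizes."""
    return m * n * 2.0 ** (-bit_score)

def p_value(E):
    """P = 1 - exp(-E): probability of at least one HSP with that score or better."""
    return 1.0 - math.exp(-E)

for E in (10, 5, 1, 0.1, 0.05, 0.001, 0.0001):
    print(E, round(p_value(E), 7))           # reproduces the rows of table 7.1

print(expectation_value(bit_score=40, m=100, n=120_000_000))   # about 0.011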
Summary
• The raw score S for a search can be normalized so that the results of dif-
ferent searches can be compared.
• The expectation value E can be used to test whether the result of a search
has statistical significance.
PHI-BLAST bioinfo.bgu.ac.il/blast/psiblast_cs.html
PHI-BLAST, the pattern-hit initiated BLAST program, is a hybrid strategy
that addresses a question frequently asked by researchers; namely, whether
a particular pattern seen in a protein of interest is likely to be functionally
relevant or occurs simply by chance (Zhang et al. 1998). This question is
addressed by combining a pattern search with a search for statistically sig-
nificant sequence similarity. The input to PHI-BLAST consists of an amino
acid or DNA sequence, along with a specific pattern occurring at least once
within the sequence. The pattern consists of a sequence of residues or sets of
residues, with “wild cards” and variable spacing allowed. PHI-BLAST helps
to ascertain the biological relevance of patterns detected within sequences,
and in some cases to detect subtle similarities that escape a regular BLAST
that have been conserved by evolution (Kelley et al. 2004). The basic method
searches for high-scoring alignments between pairs of protein interaction
paths, for which proteins of the first path are paired with putative orthologs
occurring in the same order in the second path.
BLAT genome.ucsc.edu/cgi-bin/hgBlat
The BLAST-Like Alignment Tool is a very fast DNA/amino acid sequence
alignment tool written by Jim Kent at the University of California, Santa Cruz
(Kent 2002). It is designed to quickly find sequences of 95% and greater sim-
ilarity of length 40 bases or more. It will find perfect sequence matches of 33
bases, and sometimes find them down to 22 bases. BLAT on proteins finds
sequences of 80% and greater similarity of length 20 amino acids or more. In
practice DNA BLAT works well on primates, and protein BLAT on land ver-
tebrates. It is noted that BLAT may miss more divergent or shorter sequence
alignments.
BLAT is similar in many ways to BLAST. The program rapidly scans for
relatively short matches (hits), and extends these into HSPs. However, BLAT
differs from BLAST in some significant ways. For instance, where BLAST re-
turns each area of homology between two sequences as separate alignments,
BLAT stitches them together into a larger alignment. BLAT has a special code
to handle introns in RNA/DNA alignments. Therefore, whereas BLAST de-
livers a list of exons sorted by exon size, with alignments extending slightly
beyond the edge of each exon, BLAT effectively “unsplices” mRNA onto the
genome giving a single alignment that uses each base of the mRNA only
once, and which correctly positions splice sites. BLAT is more accurate and
500 times faster than popular existing tools for mRNA/DNA alignments and
50 times faster for amino acid alignments at sensitivity settings typically used
when comparing vertebrate sequences.
BLAT’s speed stems from an index of all nonoverlapping sequences of
fixed length in the sequence database. DNA BLAT maintains an index of
all nonoverlapping sequences of length 11 in the genome, except for those
heavily involved in repeats. The index takes up a bit less than a gigabyte
of RAM. The genome itself is not kept in memory, allowing BLAT to deliver
high performance on a reasonably priced computer. The index is used to
find areas of probable homology, which are then loaded into memory for a
detailed alignment analysis. Protein BLAT works in a similar manner, ex-
cept with sequences of length 4. The protein index takes a little more than 2
gigabytes.
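A toy sketch of the nonoverlapping fixed-length index: the "genome" here is just a short invented string, and there is no filtering of repeat-heavy words.

def build_nonoverlapping_index(genome, k=11):
    """Map each nonoverlapping k-mer tile to the list of positions where it starts."""
    index = {}
    for pos in range(0, len(genome) - k + 1, k):    # step by k: nonoverlapping tiles
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

def probable_hits(query, index, k=11):
    """Look up every overlapping k-mer of the query in the genome index."""
    hits = []
    for i in range(len(query) - k + 1):
        for pos in index.get(query[i:i + k], []):
            hits.append((i, pos))
    return hits

genome = "ACGTACGTTGCAACGTTGCAGGT" * 3               # hypothetical miniature genome
index = build_nonoverlapping_index(genome, k=11)
print(probable_hits("CAACGTTGCAGGTA", index, k=11))  # [(0, 33), (1, 11)]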
BLAT has several major stages. It uses the index to find regions in the
Summary
2. Iterated BLAST:
7.5 Exercises
1. If m = 100 and n = 120,000,000, what normalized bit score S′ is necessary to achieve an E-value of 0.01? If the E-value threshold is lowered by a factor of 200 (i.e., lowered to 5×10^(−5)), what normalized bit score is necessary?
2. The probability of finding exactly k HSPs with a raw score S that is at least S_0 follows a Poisson distribution. Suppose that the expected number of HSPs with a raw score S ≥ S_0 is 0.01. What is the probability of finding no HSPs with score at least S_0? What is the probability of finding at least 2 HSPs with score at least S_0?
8 Query Languages
For relational databases the standard query language is SQL. The main pur-
pose of SQL is to select records from one (or more) tables according to se-
lection criteria. Having selected the relevant records, one can then extract
the required information from the fields of the relevant records. Because of
the success and popularity of SQL, it was natural to imitate SQL when a lan-
guage was developed for querying XML. However, XML documents have a
hierarchical structure that relational databases do not possess. Consequently,
XML querying involves three kinds of operation:
1. Navigation. This is the process of locating an element or attribute within
the hierarchical structure of an XML document.
2. Selection. Having located desirable elements and attributes, one selects
the relevant ones.
3. Extraction. The last operation is to extract required information from the
relevant elements and attributes.
The first kind of operation is unique to XML querying, while the other two
are similar to what one does in SQL.
Navigation is so important to XML that a separate language was devel-
oped to deal with it, called XPath (W3C 1999). This language is introduced
in section 8.1. XPath has been incorporated into many other languages and
tools, and so it is widely available. One such language is XQuery (W3C 2004c)
which is the standard query language for XML, covered in section 8.2. If one
has some experience with relational databases and SQL, it will look famil-
iar, although there are a few differences. The main difference is that XQuery
supports navigation using XPath. Indeed, a query using XQuery can consist
of nothing more than an XPath expression, and in many cases that is all one
needs.
//Chemical/*
//Interview/@*
One can navigate from one step to another in several ways. XPath calls this
the axis of navigation. The most common are the following:
1. child element. This is the normal way to navigate from one node to another. If no other axis of navigation is explicitly specified, then the path navigates to the child element.
2. descendant element. One can go down any number of levels by using a double slash (//). For example, //Chemical selects every Chemical element, no matter how deeply it is nested in the document.
3. parent element. One can go up one level by using a double dot. To obtain
the PubMed identifier (PMID) of every Chemical node, use this path:
//Chemical/../../PMID
4. ancestor element. One can go up any number of levels by using the ancestor:: axis. For example, to obtain the PMID of every Chemical node by first going up to its enclosing MedlineCitation element, no matter how many levels up it is located, use this path:
//Chemical/ancestor::MedlineCitation/PMID
5. root element. A slash at the start of a path means to start at the highest
level of the document. If a slash is not specified at the start of a path,
then the path starts at the current element. This depends on the context in
which XPath is used.
While directory paths and XML paths are very similar, there is a distinction
mentioned in table 1.1 which is important for navigation; namely, there can
be many child elements with the same name. A molecule, for example, can
have many atoms (see figure 1.6). To select a particular atom, such as the first
one, use this path:
/molecule/atomArray/atom[position()=1]
This path will select the first atom of every molecule. One can abbreviate the
path above to the following:
/molecule/atomArray/atom[1]
which makes it look like the array notation used in programming languages,
except that child numbering begins with 1, while programming languages
usually start numbering at 0. However, this notation is an abbreviation that
can be used in this case only, and it should not be used in more complicated
selection expressions.
XPath brackets are a versatile mechanism for selecting nodes. In addition
to selection by position, one can also select by attribute and node values. For
example, to select the nitrogen atoms in nitrous oxide, use this path:
/molecule[@name='nitrous oxide']//atom[@elementType='N']
XPath has many numerical and string operations. Some of these are shown in
table 8.1. Selection criteria can be combined, using the Boolean operators. For
example, if one would like the carbon and oxygen atoms in hydroquinone,
then use this path:
/molecule[@name='hydroquinone']
//atom[@elementType='C' or @elementType='O']
The XPath query above should have been on a single line, but it was shown
on two lines for typographical purposes. Other XPath queries in this chapter
have also been split to fit in the space available.
When using more complicated expressions, one cannot use abbreviations.
For example, if one would like the last atom of each molecule, but only if it
is a carbon atom, then use this path:
/molecule
//atom[@elementType='C' and position()=last()]
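Such paths can also be evaluated programmatically. The following is a minimal sketch using the third-party lxml library; the file name molecules.xml is hypothetical, and the element and attribute names follow the examples above.

from lxml import etree

tree = etree.parse("molecules.xml")

# Select the carbon and oxygen atoms of hydroquinone, as in the path above.
atoms = tree.xpath(
    "/molecule[@name='hydroquinone']"
    "//atom[@elementType='C' or @elementType='O']"
)
for atom in atoms:
    print(atom.attrib)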
Summary
• XPath is a language for navigating the hierarchical structure of an XML
document.
• Navigation uses paths that are similar to the ones used to find files in a
directory hierarchy.
• An axis can specify directions such as: down one level (child), down any
number of levels (descendant), up one level (parent), up any number of
levels (ancestor), and the top of the hierarchy (root).
• One can select nodes using a variety of criteria which can be combined
using Boolean operators.
document("healthstudy.xml")//Interview
will return all of the interview records in the health study database.
Alternatively, one can specify a collection of XML documents for which
one will perform a series of queries. Such a collection is called the “database”
or corpus. The specification of the corpus will vary from one XQuery engine
to another. Once the corpus is ready, the queries do not have to mention any
documents.
So far we have only considered XPath expressions. Queries can be far
more elaborate. A general XQuery query may have four kinds of clause, as
follows:
1. for clause. This specifies a loop or iteration over a collection. It says that
a variable is to take on each value in the collection, one value at a time.
For example,
for $bmi in
document("healthstudy.xml")//Interview/@BMI
will set the $bmi variable to each BMI attribute. All variables in XQuery
start with a dollar sign. This clause corresponds to the FROM clause in
SQL queries, except that in SQL one has only one FROM clause, while an
XQuery expression can have any number of for clauses. Most program-
ming languages (including Perl, C, C++, and Java) use “for” to indicate
an iteration process, and the meaning is the same.
2. where clause. This restricts which values are to be included in the result
of the query. This clause corresponds to the where clause in SQL queries.
For example, if one were only interested in BMI values larger than 30, then
the query would be
for $bmi in
document("healthstudy.xml")//Interview/@BMI
where $bmi > 30
for $bmi in
document("healthstudy.xml")//Interview/@BMI
where $bmi > 30
return $bmi
let $bmilist :=
document("healthstudy.xml")//Interview/@BMI
for $bmi in $bmilist
where $bmi > 30
return $bmi
The $bmilist variable is set to the whole collection of BMI values. The
for clause then sets $bmi to each of the values in this collection, one at a
time.
XQuery uses variables in ways that are different from how they are used
in programming languages such as Perl. In Perl, the dollar sign is used to
indicate that a variable is a scalar. A different symbol, the at-sign (@), is used
to indicate variables that can have an array of values. In XQuery, there is no
distinction between scalars and arrays: a variable can be either one. More
significantly, Perl variables can be assigned to a value any number of times.
XQuery will only assign a variable to different values in a for clause. A
variable can only be given a value once by a let clause. Subsequent lets
for the same variable are not allowed.
One can build up more complicated XQuery expressions by combining a
series of for and let clauses. These are followed by an optional where
clause. The return clause is always the last clause. For example, suppose
that one wants to obtain the major topics of the Medline articles in a corpus
of Medline citations. One would use a query like this:
for $desc in
document("medline.xml")//MeshHeading/DescriptorName
let $cite := $desc/ancestor::MedlineCitation
where $desc/@MajorTopicYN = "Y"
return ($cite, $desc/text())
Summary
• XQuery is the standard query language for processing XML documents.
1. A for clause scans the result of an XPath expression, one node at a time.
2. A where clause selects which of the nodes scanned by the for clauses
are to be used.
3. A return clause specifies the output of the query.
4. A let clause sets a variable to an intermediate result. This is an op-
tional convenience so that a complicated expression does not have to
be written more than once.
8.3 Semantic Web Queries
Unlike XML, which has standard navigation and query languages, the Semantic Web languages do not yet have a standard query language. Some suggestions have been made, but it is still unclear what the standard language will eventually look like. Several contenders for a Semantic Web query language have been proposed.
8.4 Exercises
1. Using the health study database in section 1.2, find all interviews in the
year 2000 for which the study subject had a BMI greater than 30.
2. Given a BioML document as in figure 1.3, find all literature references for
the insulin gene.
3. In the PubMed database, find all citations dealing with the therapeutic use
of glutethimide. More precisely, find the citations that have “glutethimide”
as a major topic descriptor, qualified by “therapeutic use.”
4. Perform the same task as in exercise 8.3, but further restrict the citations
to be within the last 6 months.
5. For the health study database in section 1.2, the subject identifier is a field
named SID. Find all subjects in the database for which the BMI of the
subject increased by more than 4.5 during a period of less than 2 years.
specifications link together third party and local resources using web ser-
vice protocols. Taverna is a GUI used for assembling, adapting and running
workflows. Workflows that execute remote or local web services and Java
applications are the chief mechanism for forming experiments. Legacy appli-
cations are incorporated using myGrid wrapper tools. In addition to services
and applications, databases may be integrated using a query processor de-
veloped jointly with the UK OGSA-DAI project. An example of a myGrid
workflow is shown in figure 9.3. The software can be freely downloaded.
Summary
• Biology experiments and statistical analyses are transformation processes.
Figure 9.4 The transformation process for constructing a website. The process sep-
arates the content, logical structure, and presentation. Each of the steps can be devel-
oped and maintained by different individuals or groups.
The first step in the process converts the source information to a single format: XML. If the original source data are already in XML, then this step simply reads the file. Here is what it might look like:
</Source>
The next step in the process selects and transforms the relevant source in-
formation. For example, if one is producing a webpage for a project, then
only information relevant to the project is extracted, such as its title, descrip-
tion, personnel, and so on. Here is what a project might look like after the
source information for one project has been extracted from the source:
This step is mainly concerned with selecting relevant source information and
rearranging it appropriately. For the most part it will not modify the source
information significantly.
The last step is to transform the selected information to a presentation for-
mat such as HTML or PDF. Unlike the previous step, this can result in a
substantially different format. The PDF format is completely different from
XML. Here is what the example above might look like in HTML:
<HTML>
<HEAD>
<META http-equiv="Content-Type"
content="text/html; charset=UTF-8">
<TITLE>
Harvard Medical School Bioinformatics Web Site
</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
Summary
• Transformation is an effective means for controlling how data are pre-
sented.
...
<Protein id="Mas375">
<Substrate id="Sub89032">
<BindingStrength>5.67</BindingStrength>
<Concentration unit="nm">43</Concentration>
</Substrate>
<Substrate id="Sub8933">
<BindingStrength>4.37</BindingStrength>
<Concentration unit="nm">75</Concentration>
</Substrate>
...
</Protein>
<Protein id="Mtr245">
<Substrate id="Sub89032">
<BindingStrength>0.65</BindingStrength>
<Concentration unit="um">0.53</Concentration>
</Substrate>
<Substrate id="Sub8933">
<BindingStrength>8.87</BindingStrength>
<Concentration unit="nm">8.4</Concentration>
</Substrate>
...
</Protein>
...
This certainly represents the data well, but one could equally well have cho-
sen to take the point of view of the substrates instead of the proteins, as
follows:
...
<Substrate id="Sub89032">
<Protein id="Mas375">
<BindingStrength>5.67</BindingStrength>
<Concentration unit="nm">43</Concentration>
</Protein>
<Protein id="Mtr245">
<BindingStrength>0.65</BindingStrength>
<Concentration unit="um">0.53</Concentration>
</Protein>
...
</Substrate>
<Substrate id="Sub8933">
<Protein id="Mas375">
<BindingStrength>4.37</BindingStrength>
<Concentration unit="nm">75</Concentration>
</Protein>
<Protein id="Mtr245">
<BindingStrength>8.87</BindingStrength>
<Concentration unit="nm">8.4</Concentration>
</Protein>
...
</Substrate>
...
The data are the same, but the point of view has changed. The point of view
depends strongly on the purpose for which the data were collected. Chang-
ing the purpose generally requires that one also change the point of view.
The point of view is especially important when information is being dis-
played. There are many examples of this. One can present a list of research
papers organized in many different ways: by topic, by author, by publication
date.
The point of view is also important when data are being processed. The
processing program expects to receive the data in a particular way. Even
when the source document has all of the necessary data, the data can easily
be in the wrong form. Indeed, unless the source document and program
were developed together (or they conform to the same standard), it is very
unlikely that they will be compatible.
The process of changing the point of view of an XML document is an exam-
ple of a transformation which is also called “repackaging” or “repurposing.”
Transformations in general are sometimes called “stylesheets” because they
were first used as a means of specifying the style (visual appearance) of a
document. Separating display characteristics from the content of a docu-
ment was one of the original motivations for the development of XML and
its predecessors.
Summary
• Transformation is the means by which information in one format and for
one purpose is adapted to another format for another purpose.
ment by simply changing one line of the document. In a similar way, one can
specify the style of an XML document using a stylesheet as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xml" href="transform.xsl"?>
...
Summary
• A transformation step is performed using one of three main approaches:
1. Event-based parsing
2. Tree-based processing
3. Rule-based transformation
There are many names for the process of discovering transformations. Rec-
onciling differing terminology in various ontologies is called ontology medi-
ation. For relational databases, the problem is called schema integration for
which there is a large literature. See, for example, (Rahm and Bernstein
2001) for a survey of schema integration tools. Similar structures and con-
cepts that appear in multiple schemas are called “integration points” (Berga-
maschi et al. 1999). When the data from a variety of sources are transformed
to a single target database, then the process is called data warehousing. Data
warehousing for relational databases is an entire industry, and many data
warehousing companies now also support XML. If a query using one vocab-
ulary is rewritten so as to retrieve data from various sources, each of which
uses its own vocabulary, then it is called virtual data integration. Another
name for this process is query discovery (Embley et al. 2001; Li and Clifton
2000; Miller et al. 2000).
Ontology mediation and transformation depend on identifying semanti-
cally corresponding elements in a set of schemas (Do and Rahm 2002; Madhavan et al. 2001; Rahm and Bernstein 2001). This is a difficult problem to
solve because terminology for the same entities from different sources may
use very different structural and naming conventions. The same name can be
used for elements having totally different meanings, such as different units,
precision, resolution, measurement protocol, and so on. It is usually nec-
essary to annotate an ontology with auxiliary information to assist one in
determining the meaning of elements, but the ontology mediation and trans-
formation is difficult to automate even with this additional information.
For example, in ecology, the species density is the ratio of the number of species to the area. In one schema one might have a species density element, while in another, there might be elements for both the species count and the area. As another example, in the health study example in section 9.1, the BMI attribute is the ratio of the weight to the square of the height. Another
database might have only the weight and height, and these attributes might
use different units than in the first database. Consequently, a single element
in one schema may correspond to multiple elements in another. In general,
the correspondence between elements is many-to-many: many elements cor-
respond to many elements.
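To make the BMI example concrete, here is a small Perl fragment (Perl is covered in the next chapter; the field names and values shown are an added illustration rather than part of the health study schema) that derives a BMI value from a source that records weight in pounds and height in inches:
# Hypothetical source fields (pounds and inches).
$weight_lb = 145;
$height_in = 64;
# Convert to the units of the target schema (kilograms and meters).
$weight_kg = $weight_lb * 0.45359237;
$height_m = $height_in * 0.0254;
# BMI is the weight divided by the square of the height.
$bmi = $weight_kg / ($height_m ** 2);
printf("BMI: %4.1f\n", $bmi);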
Many tools for automating ontology mediation have been proposed and
some research prototypes exist. There are also some commercial products for
relational schema integration in the data warehousing industry. However,
these tools mainly help discover simple one-to-one matches, and they do
not consider the meaning of the data or how the transformation will be used.
Using such a tool requires significant manual effort to correct wrong matches
and add missing matches. In practice, schema matching is done manually
by domain experts, and is very time-consuming when there are many data
sources or when schemas are large or complex.
Automated ontology mediation systems are designed to reduce manual
effort. However, such a system itself requires time to prepare its input as well as to guide the matching process. This time can be substantial, and may easily swamp the amount of time saved by using the system. Unfortunately, existing schema-matching sys-
tems focus on measuring accuracy and completeness rather than on whether
they provide a net gain. Schema-matching systems have now been proposed
(Wang et al. 2004) that address this issue. However, such systems are not yet
available. The best that one can hope for from current systems is that they
can help one to record and to manage the schema matches that have been
detected, by whatever means.
One example of a schema integration tool is COMA, developed at the Uni-
versity of Leipzig (Do and Rahm 2002; Do et al. 2002), but there are many
others. See (Rahm and Bernstein 2001) for a survey of these tools. Some of
these tools also deal with XML DTDs (Nam et al. 2002). Unfortunately, they
are only research prototypes and do not seem to be available for download-
ing.
There are many ontology mediation projects, and some have developed
prototypes, such as PROMPT (Noy and Musen 2000) from the Stanford Med-
ical Informatics laboratory and the Semantic Knowledge Articulation Tool
(SKAT), also from Stanford (Mitra et al. 1999), but as with schema integra-
tion, none seem to be available for public use, either via open source software
or commercial software.
Summary
• Reconciling differing terminology has many names depending on the par-
ticular context where it is done, such as: ontology mediation, schema inte-
gration, data warehousing, virtual data integration, query discovery, and
schema matching.
There are many programming languages, but the one that has been especially
popular in bioinformatics is Perl. It is designed to “make easy jobs easy with-
out making the hard jobs impossible” (Wall et al. 1996). On the other hand,
this does not mean that hard jobs are best done with Perl. It is still the case
that programming languages such as Java and C++ are better suited for ma-
jor system development than Perl. It is likely that there will always be a need
for a variety of programming languages. Indeed, this is perfectly compatible
with the Perl slogan, “There’s More Than One Way to Do It” (TMTOWTDI).
Perl is especially well suited to data transformation tasks, and this is what
will be emphasized here. Perl is much too large a language for complete cov-
erage in even several books, let alone a chapter in just one book. However,
the coverage should be adequate for most transformation tasks.
In keeping with the TMTOWTDI philosophy of Perl, there are many ways
to approach any given transformation task. There are also many kinds of
transformation tasks. This chapter is organized first around the kind of trans-
formation task, and then for each kind of transformation, a number of ap-
proaches are given, arranged from the simpler to the more complex. When
one is facing a particular task, whether you think of it as transformation,
conversion, or reformatting, first look to the main classification to choose the
section for your task. Within a section, all of the techniques accomplish the
same basic task. The earlier ones are simple and work well in easy cases, but
get tedious for the harder tasks of this kind. The later ones require a more
careful design, but the result is a smaller program that is easier to maintain.
Accordingly, just scan through the possibilities until you reach one that is
sufficient for your needs.
Another aspect of TMTOWTDI is that one can omit punctuation and variables if Perl can understand what is being said. This enormously increases the possibilities for what a program can look like. It can also make some Perl programs difficult for a person to read even when the Perl interpreter has no difficulty with them. Except for some common Perl motifs and an example in section 10.1 below, the examples in this chapter will try to use a programming style that emphasizes readability over cleverness as much as possible.
Some of the most common programming tasks can be classified as being
transformations. Even statistical computations are a form of data transfor-
mation. To organize the transformation tasks, the world of data will be di-
vided into XML and text files. The text file category includes flat files as well
as the text produced by many bioinformatics tools. This lumps together a lot
of very different formats, but it is convenient for classification purposes. The
many file formats (such as PDF, Word, spreadsheet formats, etc.) that require
specialized software for their interpretation will not be considered unless the
format can be converted to either an XML file or a text file.
The first section of the chapter deals with non-XML text processing, and
the second section of the chapter deals with XML processing. Many tech-
niques from the first part reappear in the second, but some new notions are
also required.
while (<>) {
$month = substr($_, 0, 2) + 0;
$day = substr($_, 2, 2) + 0;
$yr = substr($_, 4, 2) + 0;
$year = 1900 + $yr;
$year = 2000 + $yr if $yr < 20;
$bmi = substr($_, 6, 8) + 0;
$status = "normal";
$status = "obese"
if substr($_, 14, 3) + 0 > 0;
$status = "overweight"
if substr($_, 17, 3) + 0 > 0;
$height = substr($_, 20, 3) + 0;
$wtkgs = substr($_, 23, 8) + 0;
$wtlbs = substr($_, 31, 3) + 0;
print("$month/$day/$year $bmi $status");
print(" $height cm $wtkgs kg $wtlbs lb\n");
}
The dollar signs in front of the variable names are not part of the names but rather indicate that the variables are scalars, that is, numbers or strings
(ordinary text). The line that was just read is available in the variable whose
name is an underscore character. One extracts parts of a string by using the
substr function (short for “substring”). Scalars have a kind of “split per-
sonality” since they can be either numbers or strings. The substr function
produces a string, but all of the substrings being extracted in this program
are supposed to be numbers. One can change a scalar to a number by adding
0 to it. If the scalar is already a number this does nothing. If the scalar is a
string, then this will find some way to interpret the string as being a number.
Perl is very flexible in how it interprets strings as numbers. If there is some text in the string that could not be part of a number, it (and everything after it) is simply ignored. For example, “123 kgs” will be interpreted
as the number 123, and “Hello 123” will be interpreted as the number 0.
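The following two lines (an added illustration) show this coercion directly:
$a = "123 kgs" + 0;    # $a is now the number 123
$b = "Hello 123" + 0;  # $b is now the number 0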
The computation of the year is somewhat problematic because there are
only two digits in the original file, but the full year number is expected in
the report. This is handled by adding conditions after the statements that
compute the full four-digit year. The assumption is that all years are between
1921 and 2020.
Program 10.1 is certainly not the only way to perform this task in Perl.
The style of programming was chosen to make it as easy as possible to read
this program. The use of angle brackets for obtaining the next line of the file
which is then represented using an underscore is a bit obscure, but it is a
commonly used motif in Perl. It is relatively easy to get used to it.
To illustrate some of the variations that are possible in Perl, program 10.1
could also have been written as in program 10.2. This program avoids the
use of parentheses as much as possible, and when it does use them, it does
so differently than the first program. In general, one can omit parentheses in
functions, and it is only necessary to include them when Perl would misinter-
pret your intentions. If the parentheses were omitted in the tests for obesity
and overweight, then Perl would have compared 0 with 3 rather than with
the number extracted from the original file. Notice also that the semicolon
after the last statement can be omitted because it occurs immediately before
a right brace.
The next task to consider is the computation of summary information. One
common use of data from a study is to compute the mean and variance. Pro-
gram 10.3 computes the mean, variance, and standard deviation of the BMI
column of the health study. Running this program on the four records at the
beginning of this section produces this report:
while (<>) {
$month = 0 + substr $_, 0, 2;
$day = 0 + substr $_, 2, 2;
$yr = 0 + substr $_, 4, 2;
$year = $yr < 20 ? 2000 + $yr : 1900 + $yr;
$bmi = 0 + substr $_, 6, 8;
$status = "normal";
$status = "obese" if (substr $_, 14, 3) > 0;
$status = "overweight" if (substr $_, 17, 3) > 0;
$height = 0 + substr $_, 20, 3;
$wtkgs = 0 + substr $_, 23, 8;
$wtlbs = 0 + substr $_, 31, 3;
print "$month/$day/$year $bmi $status";
print " $height cm $wtkgs kg $wtlbs lb\n"
}
Number of records: 4
Average BMI: 24.23
BMI Variance: 59.9052666666668
BMI Standard Deviation: 7.73984926640479
This program uses all three parts of a typical program. The first part prints
the title of the report as before, and the body processes the records in the
health study file, but now there is also a concluding part that prints the
statistics. The processing of the records has some additional computations.
The count variable has the number of records, the bmisum variable has the
sum of the BMI values for all records, and the bmisumsq has the sum of the
squares of the BMI values. These are set to 0 in the introductory part of the
while (<>) {
$month = substr($_, 0, 2) + 0;
$day = substr($_, 2, 2) + 0;
$yr = substr($_, 4, 2) + 0;
$year = 1900 + $yr;
$year = 2000 + $yr if $yr < 20;
$bmi = substr($_, 6, 8) + 0;
$status = "normal";
$status = "obese" if substr($_, 14, 3) + 0 > 0;
$status = "overweight" if substr($_, 17, 3) + 0 > 0;
$height = substr($_, 20, 3) + 0;
$wtkgs = substr($_, 23, 8) + 0;
$wtlbs = substr($_, 31, 3) + 0;
print("$month/$day/$year $bmi $status");
print(" $height cm $wtkgs kg $wtlbs lb\n");
$count = $count + 1;
$bmisum = $bmisum + $bmi;
$bmisumsq = $bmisumsq + $bmi ** 2;
}
print("\n");
print("Number of records: $count\n");
$bmimean = $bmisum / $count;
print("Average BMI: $bmimean\n");
$bmivar =
($bmisumsq - $count * $bmimean ** 2) / ($count - 1);
print("BMI Variance: $bmivar\n");
$bmisd = $bmivar ** 0.5;
print("BMI Standard Deviation: $bmisd\n");
program, and modified for each record in the health study file. The mean,
variance, and standard deviation are computed from these three values by
well-known formulas.
It is not actually necessary to initialize the three variables to 0. In other
words these three lines in the first part of the program could have been omit-
ted:
$count = 0;
$bmisum = 0;
$bmisumsq = 0;
Perl will automatically set any variable to a standard default value the first
time it is used. For numbers the default initial value is 0. For strings it is the
empty string.
Many commonly occurring statements can be abbreviated. For example, the statement
$bmisum = $bmisum + $bmi;
can be abbreviated to
$bmisum += $bmi;
Similar abbreviations are available for all the arithmetic operations such as subtraction, multiplication, and division, as well as for other kinds of operation. Incrementing a variable (i.e., increasing it by 1) is so common that it has its own special operator. As a result, one can abbreviate
$count = $count + 1;
to
$count++;
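A few of the other compound assignment operators are shown below (these particular lines are added illustrations and do not occur in the health study programs):
$bmisumsq -= $bmi ** 2;  # subtraction
$weight *= 2.2;          # multiplication
$total /= $count;        # division
$line .= "\n";           # string concatenation also has a compound form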
while (<>) {
chomp;
@record = split(" ", $_);
@date = split("/", $record[0]);
$month = $date[0] + 0;
$day = $date[1] + 0;
$yr = $date[2] + 0;
if ($yr < 20) { $year = 2000 + $yr; }
elsif ($yr < 1000) { $year = 1900 + $yr; }
else { $year = $yr; }
$bmi = $record[1];
if ($record[2] + 0 > 0) { $status = "obese"; }
elsif ($record[3] + 0 > 0) { $status = "overweight"; }
else { $status = "normal"; }
$height = $record[4] + 0;
$wtkgs = $record[5] + 0;
$wtlbs = $record[6] + 0;
print("$month/$day/$year $bmi $status");
print(" $height cm $wtkgs kg $wtlbs lb\n");
}
Program 10.4 does the same transformation as program 10.1, except that
it assumes that the data file uses variable-length fields separated by spaces,
and that the month, day, and year in a date are separated from one another
by using the forward slash (“/”) character. This program will transform the
following data file:
to the following:
The first step is to remove the newline character at the end of the line. This is done with the chomp function. This was not necessary in previ-
ous programs because the fields were in fixed locations in the line. It is a good
idea to use chomp whenever the input has variable-length fields. The next
step is to split the record into fields using the split operator. This produces
the @record array. The values in this array are denoted by $record[0],
$record[1], $record[2],... The number in brackets is called the in-
dex or position of the value in the array. If one uses a negative index, then it
specifies a position starting from the last value (i.e., starting from the other
end of the array). Notice that the array as a whole uses @ but the individual
values use $. Remember that in Perl, the initial character on a variable (and
there are more than just $ and @) denotes the kind of value. It is not part of
the name of the variable.
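For example (an added illustration using a made-up record), values can be selected from either end of the array:
@record = ("1/15/2000", "27.3", "0", "1");
print("$record[0]\n");   # first value: 1/15/2000
print("$record[1]\n");   # second value: 27.3
print("$record[-1]\n");  # the last value, counted from the end: 1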
The next step is to split the first field (i.e., the date) into its parts. The parts
are separated by slashes. The rest of the program is the same as the first
program except that array values are used instead of substrings. Another
difference is that the conditional statements are written with the if-conditions
first rather than after the statement. The first condition is indicated by if.
Subsequent conditions (except the last one) are indicated using elsif which
is short for “else if”. The last case is indicated by else which is used for
those cases not handled any other case. Putting the conditional after the
statement as in the first program is best if you think of the statement as being
subject to a condition. Putting a conditional before the statement is best if
you are thinking in terms of a series of cases, such that only one of them
applies to each record.
The split statement
@record = split(" ", $_);
can be abbreviated to
@record = split(" ");
The default for splitting is to split up the value of $_. One can even abbreviate further to
@record = split;
This is actually better than the previous form because it will treat all forms
of “white space” (such as tab characters) as being the same as spaces. Fi-
nally, one can abbreviate all the way to split; except that now the array
containing the fields of the line is @_ instead of @record.
The opposite of split is join. One can put the split array back together after splitting by using
join(" ", @record);
This can be handy if one would like to separate the fields with a character other than a space. For example,
join(",", @record);
while (<>) {
chomp;
@record = split(" ", $_);
@date = split("/", $record[0]);
$month = $date[0] + 0;
$day = $date[1] + 0;
$yr = $date[2] + 0;
if ($yr < 20) { $year = 2000 + $yr; }
elsif ($yr < 1000) { $year = 1900 + $yr; }
else { $year = $yr; }
$bmi = $record[1];
if ($record[2] + 0 > 0) { $status = "obese"; }
elsif ($record[3] + 0 > 0) { $status = "overweight"; }
else { $status = "normal"; }
$height = $record[4] + 0;
$wtkgs = $record[5] + 0;
$wtlbs = $record[6] + 0;
print("$month/$day/$year $bmi $status");
print(" $height cm $wtkgs kg $wtlbs lb\n");
$m = "$month/$year";
$count{$m}++;
$bmisum{$m} += $bmi;
$bmisumsq{$m} += $bmi ** 2;
}
foreach $m (sort(keys(%count))) {
print("\nStatistics for $m\n");
print("Number of records: $count{$m}\n");
$bmimean = $bmisum{$m} / $count{$m};
print("Average BMI: $bmimean\n");
$bmivar = ($bmisumsq{$m} - $count{$m} * $bmimean ** 2)
/ ($count{$m} - 1);
print("BMI Variance: $bmivar\n");
$bmisd = $bmivar ** 0.5;
print("BMI Standard Deviation: $bmisd\n");
}
The hashes are used in each of the three parts of the program. In the intro-
ductory part, the three hashes are declared and initialized to empty hashes.
As in the case of scalars, it is not necessary to declare and initialize hashes. If
they are not declared and initialized, then they will be set to empty hashes.
Arrays also do not have to be initialized. By default, arrays are initially
empty.
In the main body of the program, the hashes are used at the end to compute
the three statistics: count, sum, and sum of squares. First, the month and
year are combined into a single string. This string is then called the key for
the hash value. It is analogous to the index of an array, but the index for
array values can only be an integer, while a hash key can be any scalar. This
includes integers, other numbers, and strings. The value corresponding to
a hash key is specified by using braces as, for example, in the expression
$bmisum{$m}. By contrast, arrays use brackets to specify a value of an array.
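The following small example (added for illustration; the values are made up) contrasts the two kinds of selection:
@bmis = (27.3, 18.2);                     # an array, indexed by integers
%count = ("1/2000" => 2, "2/2000" => 2);  # a hash, keyed by arbitrary scalars
print($bmis[1], "\n");                    # brackets select an array value: 18.2
print($count{"1/2000"}, "\n");            # braces select a hash value: 2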
The final and concluding part of the program introduces a new kind of
statement: the foreach statement. This statement is used for performing some
action on every element of a list. This is called iteration or looping because the
same action is done repeatedly, differing each time only by which element is
being processed. In this case the iteration is to be over all of the month-year
combinations that are in the hashes. The body of the iteration should be per-
formed once for each month-year combination. Each time it is performed $m
will be a different month-year combination. The month-year combinations
are the keys of any one of the three hashes. The program uses %count, but
any one of the three hashes could have been used. The keys function gets
the list of all keys of a hash. This list can be in any order, so one usually sorts
the keys to get output that looks better. If the order does not matter, then one
can omit using the sort function. The rest of the computation is nearly the
same as before except that the values in the hashes are used instead of simple
scalar statistics. Applying this program to the simple four-record example
data file will print the following:
Number of records: 2
Average BMI: 25.665
BMI Variance: 137.28245
BMI Standard Deviation: 11.7167593642611
The sorting can also be specified explicitly. The statement
foreach $m (sort(keys(%count))) {
can be written
foreach $m (sort { $a cmp $b } keys(%count)) {
Many other sorting orders can be used by varying the comparison specified in the braces after the sort function. The default order is dictionary ordering, which uses the cmp operator.
Summary
• While programs run, they store data in variables. As the program runs
the data stored in each variable will change. The simplest kind of variable
in Perl is a scalar, which holds a single string or number.
• A program that transforms a file record by record typically has three parts:
1. The introduction prints the title and sets variables to initial values.
2. The body reads each line, extracts data from it and prints the data in
the required format.
3. The conclusion computes summary information and prints it.
• Perl has two kinds of variables for holding collections of data items:
1. An array holds a sequence of data items. Arrays are also called lists.
2. A hash maps keys to associated information.
@lines = <>;
$size = scalar(@lines);
print("The input file has $size lines.\n");
The scalar function converts an array to a single scalar value: the number of items in the array. As one might expect, this is used frequently in Perl programs, although it does not necessarily appear explicitly. In fact, in this case it can be omitted because
assigning an array to a scalar tells Perl that the array is to be converted to a
scalar. In the case of hashes, scalar gives information about the structure
of the hash that is usually not very useful. To get the size of a hash use
scalar(keys(%h)). This gives the number of keys in the hash.
One might think that one can simplify the last two lines of program 10.5 to the one statement
print("The input file has scalar(@lines) lines.\n");
but this does not work. In a quoted string, the special meaning of scalar is
lost. The string “scalar” will be printed verbatim, and the scalar function
will not be invoked. Quoted strings know how to deal with variables, but
they do not understand computations in general. One can get around this
restriction in two ways:
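The discussion of these workarounds is not reproduced here; the following sketch shows two common ones, on the assumption that they are what was intended:
# 1. Compute the value into a scalar variable first,
#    then interpolate the variable in the quoted string.
$size = scalar(@lines);
print("The input file has $size lines.\n");

# 2. Pass the computed value to print as a separate argument,
#    outside the quoted string.
print("The input file has ", scalar(@lines), " lines.\n");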
Arrays are versatile: they can be used for lists of values as well as for representing the mathematical notion of a vector. It is natural to consider how to represent other mathematical structures such as matrices and
while (<>) {
chomp;
push(@table, [split]);
}
$size = @table;
print("Table size is $size\n");
tables. Once one has a concept of an array, it is easy to represent these other
mathematical structures. A matrix, for example, is just a vector of vectors, so
to represent it in Perl, one simply constructs an array whose items are them-
selves arrays.
Consider the task of reading all of the fields of all the records in an input
file. The array will have one item for each record of the data file. Each item,
in turn, will be an array that has one item for each field of the record. In
other words, the data will be represented as an array of arrays, also called a
two-dimensional array or database table. It is very easy to create such a table in
Perl. In program 10.6, the array is constructed, and its size is printed.
The push procedure adds new items to the end of a list. In this case, it
adds a new record to the table array. Each record is obtained by splitting the
current line. Recall that split by itself splits the current line into fields that
were separated by spaces.
The opposite of push is pop. It removes one item from the end of a list.
There are also procedures for adding and removing items from the beginning
of a list. The shift procedure removes the first item from a list. Unlike push
and pop, the shift procedure changes the positions of all the items in the
list (e.g., the one in position 1 now has position 0). The opposite of shift is
unshift, which adds items to the beginning of a list.
The brackets around split tell Perl to maintain the integrity of the record.
Without the brackets, the fields of the record would be pushed individually
onto the array resulting in a very large one-dimensional array with all of the
fields of all of the records.
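The difference between pushing the fields individually and pushing a single reference can be seen in this small added example:
@fields = ("a", "b", "c");
push(@flat, @fields);      # @flat now contains three separate items
push(@nested, [@fields]);  # @nested contains one item: a reference to an array
print(scalar(@flat), "\n");    # prints 3
print(scalar(@nested), "\n");  # prints 1
print($nested[0][1], "\n");    # prints b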
Brackets around an array tell Perl that the array is to be considered a single
unit rather than a collection of values. The term for this in Perl is reference. It
is similar to the distinction between a company and the employees of a com-
pany. The company is a legal entity by itself, with its own tax identification
number and legal obligations, almost as if it were a person. There are similar
situations in biology as well. Multicellular organs and organisms are living
entities that are made up of cells but which act as if they were single units.
Perl arrays are made into single entities (scalars) by using brackets. For ex-
ample, the array @lines could be made into a scalar by writing [@lines],
and one can assign such an entity to a scalar variable as in
$var = [@lines];
One can put any number of arrays and scalars in brackets, and the result is
a (reference to a) single array. Hashes are made into single entities by using
braces. One can combine hashes by putting more than one in braces, and
one can add additional keys as in
$var = {
name => "George",
id => "123456",
%otherData,
};
In this case $var refers to a hash that maps “name” to “George” and “id” to
“123456,” in addition to all of the other mappings in %otherData.
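The parts of such a reference are selected with the -> operator, as in this added example (the data are hypothetical):
%otherData = (dept => "Genetics");
$var = {
  name => "George",
  id => "123456",
  %otherData,
};
print($var->{name}, "\n");  # prints George
print($var->{dept}, "\n");  # prints Genetics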
Now consider the same statistical task as in program 10.3; namely, com-
pute the mean and standard deviation of the BMI. However, this time the statistics are computed from a database table rather than while the file is being read, as in program 10.7.
The statistical computation is done in the for statement. This is similar to
the foreach statement. Whereas the foreach statement iterates over the
items of a list, the for statement iterates over the numbers in a sequence.
The statement specifies the three parts of any such sequence: where to start
it, how to end it, and how to go from one number in the sequence to the next
one. The three parts in this case specify:
1. Where to start: $i = 0 means start the sequence at 0.
2. How to end: $i < $count means to stop just before the number of
records in the database. The reason for ending just before the number
of records rather than at the number of records is that numbering starts at
0.
3. How to go from one number to the next: $i++ means increment the num-
ber to get the next one. In other words, the sequence is consecutive.
while (<>) {
chomp;
push(@table, [split]);
}
$count = @table;
for ($i = 0; $i < $count; $i++) {
$bmisum = $bmisum + $table[$i][1];
$bmisumsq = $bmisumsq + $table[$i][1] ** 2;
}
Summary
• Braces are used for selecting the value associated with a key in a hash.
• The for statement is used to perform some action for each number in a
sequence.
sub stats {
my $count = @table;
my $column = $_[0];
my $sum = 0;
my $sumsq = 0;
for (my $i = 0; $i < $count; $i++) {
my $field = $table[$i][$column];
$sum += $field;
$sumsq += $field * $field;
}
my $mean = $sum / $count;
my $var = ($sumsq - $count * $mean ** 2)
/ ($count - 1);
return ($mean, $var);
}
This procedure introduces some new notation. The most noticeable change
from the previous Perl programs is the use of my at the beginning of most
lines. The variables that have been used so far are known as global variables.
They are accessible everywhere in the program. In particular, they can be
used in any of the procedures. The @table variable, for example, is used
in the first line of this procedure. A my variable, on the other hand, belongs
only to that part of the program in which it was declared. All but one of the
my variables in this procedure belong to the procedure. The two exceptions
are $i and $field. Both of these belong to the for statement. It is not
necessary to declare that such variables are my variables, but it is okay to do
so.
The advantage of my variables is that they prevent any confusion in case
the same variable name is used in more than one procedure. While one can
certainly use ordinary global variables for computation done in procedures,
it is risky. To be safe, it is best for all variables that are only used within a
procedure to be my variables.
The stats procedure is invoked by specifying the position of the field that
is to be computed. For example, stats(4) computes the mean and variance
of the column with index 4. This is actually the fifth column because Perl
array indexes start at 0. The number 4 in stats(4) is called a parameter of
the procedure. The parameters given to a procedure are available within the
procedure as the @_ array. In particular, the first parameter is $_[0], and
this explains the second line of the procedure:
my $column = $_[0];
which sets $column to the first parameter given to the procedure when it is
invoked.
The return statement has two purposes. It tells Perl that the procedure
is finished with its computation. In addition, it specifies the end result of
the computation. This is what the program that invoked the procedure will
receive. Note that a list of two values is produced by this procedure. One can
use this list like any other. For example, the following program will print the
statistics for two of the columns of the database:
while (<>) {
chomp;
push(@table, [split]);
}
($mean, $var) = stats(1);
print("Statistics for column 1:");
print(" mean $mean variance $var\n");
($mean, $var) = stats(4);
print("Statistics for column 4:");
print(" mean $mean variance $var\n");
Of course, the program above has yet another opportunity for a procedure;
namely, one that prints the statistics:
sub printstats {
my $column = $_[0];
my ($mean, $var) = stats($column);
print("Statistics for column $column:");
print(" mean $mean variance $var\n");
}
One cannot help but notice that the scalar $_ and the array @_ are used
frequently in Perl. Because of this it is a good idea to assign the parameters
of a procedure to various my variables belonging to the procedure as soon as
possible. It also makes it much easier for a person to understand what a pro-
cedure is supposed to do. In this case, $column is a lot more understandable
than $_[0].
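A common idiom for doing this (an added aside, not part of the original programs) is to unpack all of the parameters in a single my statement at the top of the procedure:
sub stats {
    my ($column) = @_;   # equivalent to: my $column = $_[0];
    ...
}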
Putting all of this together, one obtains the solution to the task in pro-
gram 10.8.
The procedure definitions can go either before the main program or after it.
Summary
while (<>) {
chomp;
push(@table, [split]);
}
printstats(1);
printstats(4);
sub stats {
my $count = @table;
my $column = $_[0];
my $sum = 0;
my $sumsq = 0;
for (my $i = 0; $i < $count; $i++) {
my $field = $table[$i][$column];
$sum += $field;
$sumsq += $field * $field;
}
my $mean = $sum / $count;
my $var = ($sumsq - $count * $mean ** 2)
/ ($count - 1);
return ($mean, $var);
}
sub printstats {
my $column = $_[0];
my ($mean, $var) = stats($column);
print("Statistics for column $column:");
print(" mean $mean variance $var\n");
}
• The return statement marks the end of the computation and specifies
the value produced by the procedure.
Consider the following file that was produced by BioProspector (Liu et al. 2001). Some
parts of the file were omitted to save space.
****************************************
* *
* BioProspector Search Result *
* *
****************************************
Motif #1:
******************************
[1 line omitted]
while (<>) {
chomp;
if (/Motif #1:/) {
print "The first motif has been found!\n";
}
}
Program 10.9 Using pattern matching to find one piece of data in a file
et al. 2000; Roth et al. 1998), CONSENSUS (Stormo and Hartzell III 1989;
Hertz et al. 1990; Hertz and Stormo 1999), and Gibbs sampler (Lawrence et al.
1993; Liu et al. 1995), and all of them use their own output formats. No
doubt many more formats already exist for motifs, and many more will be
used in the future. A similar situation exists for virtually every other kind of
bioinformatics information. Many tools are available for similar tasks, and
each one uses its own input and output formats.
To process information such as the BioProspector file above, we make use
of the pattern-matching features of Perl. Pattern matching is one of the most
powerful features of Perl, and it is one of the reasons why Perl has become
so popular.
Consider the task of extracting just the information about the first motif. A
motif is defined as a sequence of probability distributions on the four DNA
bases. We will do this in a series of steps. First we need to read the Bio-
Prospector file and find where the information about the desired motif is
located, as shown in program 10.9.
Each motif description begins with a title containing “Motif #” followed
by a number and ending with a colon. The condition /Motif #1:/ is re-
sponsible for detecting such a title. The text between the forward slashes is
the pattern to be matched. A pattern can be as simple as just some text that is
to be matched, as in this case.
If one wanted the line that contained exactly this text, one would use the
condition $_ eq "Motif #1:\n". Note that string comparison uses eq,
not the equal-to sign. Also note that every line ends with the newline char-
acter. In practice it is usually easier to use a pattern match condition than a
test for equality. The pattern match will handle more cases, and one does not
have to worry about whether or not the newline character might be in the
line.
while (<>) {
chomp;
if (/Motif #[0-9]+:/) {
print "A motif has been found!\n";
}
}
Program 10.10 Using pattern matching to find all data of one kind in a file
The next task is to find where every motif begins, not just the first one.
This is done by modifying the pattern so that it matches any number rather
than just the number 1 as in program 10.10.
The pattern now has [0-9]+ where it used to have the number 1. Bracketed expressions in a pattern define character classes. This character class will match any one character between 0 and 9. The plus sign after the character class means that the pattern matches one or more consecutive characters in this class. Any character or character class can be followed by a quantifier:
• An asterisk (*) means that the character or class may occur any number of times, including not at all.
• A plus sign (+) means that it must occur at least once.
• A question mark (?) means that it may occur at most once; in other words, it is optional.
Quantifiers can also be used to specify exactly how many times a character must occur as well as a range of occurrences. This is done by placing the number of times or the range in braces after the character or character class.
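For example (added illustrations):
/[0-9]{4}/    # matches exactly four consecutive digits
/[0-9]{2,4}/  # matches between two and four consecutive digits
/[ACGT]{6,}/  # matches six or more DNA bases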
Character classes and quantifiers are specified using characters (such as
brackets, braces, etc.) just as everything is specified in Perl. However, this
means that these characters are special within a pattern. They are called
metacharacters. When used within a pattern they do not match themselves.
The metacharacters are backward slash, vertical bar, parentheses, brackets,
braces, circumflex, dollar sign, asterisk, plus sign, question mark, and period.
If a pattern should match one of the metacharacters, then precede it with a backward slash. For example, \? means match the question mark character rather than
quantify the preceding character or character class.
The next task is to obtain the motif number. In principle, one could get
this number by using the split and substr functions, but there is a much
easier way. When Perl matches a pattern, it keeps track of what succeeded
in matching those parts of the pattern that are in parentheses. In this case, the parentheses are placed around [0-9]+, so the number that matched is available in the variable $1, as the following program shows.
while (<>) {
chomp;
if (/Motif #([0-9]+):/) {
print "The motif $1 has been found!\n";
}
}
while (<>) {
chomp;
if (/Motif #([0-9]+):/) {
print "Probability distributions for motif $1\n";
} elsif (/^[0-9]+ /) {
split;
print "A $_[1] G $_[2] C $_[3] T $_[4]\n";
}
}
Program 10.12 Extracting an array of data from a file using pattern matching
Summary
• A pattern specifies the text that a string must have in order to match the
pattern.
• When a pattern matches, Perl extracts the text that matches the whole
pattern as well as text that matches each subpattern.
This excerpt shows the probability distributions for one motif (labeled
“MATRIX 2”). There are two ways in which this file differs from what is
necessary for the task. First, the distributions are given in terms of frequen-
cies rather than probabilities. Second, the frequencies are listed by DNA base
rather than by position in the motif. The first difference is easy to fix: one can
just divide by the total number of sequences. The second difference is not so
easily handled because the information has the wrong arrangement.
To rearrange information obtained from an input file, it is necessary to
store information from several lines before printing it. This would be easy
if the information consisted of a few scalars, but it gets much more compli-
cated when substantial amounts of data must be organized. The technique
for doing this in programming languages is called a data structure. Some data
structures have already been used; namely, arrays and hashes. These are the
simplest data structures. One constructs more complex data structures by us-
ing a technique called nesting. A nested data structure is a data structure whose
items are themselves data structures. For example, one can have an array of
hashes, or a hash of arrays, or a hash of hashes of arrays, and so on. There is
no limit to how deeply nested a data structure can be. The special case of an
array of arrays was already developed in subsection 10.1.2. Data structures
extend the concept of a multidimensional array to allow for dimensions that are indexed by arbitrary keys rather than just consecutive integers.
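For example, a hash of arrays (an added illustration with made-up data and a hypothetical variable name) can record a list of BMI values for each subject identifier:
push(@{ $bmiBySubject{"S001"} }, 24.2);
push(@{ $bmiBySubject{"S001"} }, 26.8);
push(@{ $bmiBySubject{"S002"} }, 31.0);
print($bmiBySubject{"S001"}[1], "\n");            # prints 26.8
print(scalar(@{ $bmiBySubject{"S001"} }), "\n");  # prints 2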
while (<>) {
chomp;
if (/^MATRIX ([0-9]+)$/) {
$label = $1;
} elsif (/^number of sequences = ([0-9]+)$/) {
$numberOfSequences = $1 + 0;
} elsif (/^[ACGT] [|]/) {
@record = split;
for ($i = 2; $i < scalar(@record); $i++) {
$motifs{$label}{$i-2}{$record[0]} =
$record[$i] / $numberOfSequences;
}
}
}
foreach $label (sort(keys(%motifs))) {
print "Probability distributions for motif $label\n";
%motif = %{ $motifs{$label} };
foreach $position (sort(keys(%motif))) {
foreach $base ("A", "C", "T", "G") {
print("$base $motif{$position}{$base} ");
}
print("\n");
}
}
Program 10.13 Extracting data structures from a file using pattern matching
the frequency for the first motif position is the third field on the line, the frequency for the second position is the fourth field, and so on. So it is necessary to subtract 2 from the field position to get the position within the motif. Finally, an item in the third dimension is the
probability for one of the four DNA bases. This is obtained by dividing the
frequency by the number of sequences.
Having extracted the motifs, the next step is to print them. Since the motifs
are in a 3D data structure, the most natural way to use the structure is with
three nested loops. The first loop processes the motifs. The labels are the
keys of the motifs hash, and it is customary to sort the keys of a hash so
that they are printed in a reasonable order.
Given the label for a motif, one can obtain the motif by using the label
as the key: $motifs{$label}. However, this is a scalar, not the hash of
DNA positions. This is the trickiest part of the program. To get the hash of
DNA positions, one must use the expression %{$motifs{$label}}. This
may seem mysterious at first, but it all makes sense when one finds out that
every use of the prefixes $, %, and @ is actually supposed to look like this.
Omitting the braces is an abbreviation that one can use for simple variable
names.
Once the hash for one motif has been obtained, one just loops over the
positions and then over the four bases. The program explicitly writes out the
DNA bases, because it is printing them in an order that is not alphabetical.
After printing the probability distribution, a newline is printed to end the
line. The output of the program will look something like this:
Perl will always print everything that it knows about a number. In many
cases the numbers will have far more decimal places than are merited by the data. To specify the exact number of decimal places that should be printed, one should use the printf statement. It would look like this:
printf("%s %5.3f ", $base, $motif{$position}{$base});
The first parameter of the printf statement is called the format. Its purpose
is to specify what kinds of data are to be printed as well as the precise format
to use for each one. Each format specification begins with a percent sign.
This use of the percent sign has no connection with the notion of a Perl hash.
The %s format means that the variable is to be printed verbatim. The s stands
for “string.” The %5.3f format means that the variable is to be printed as a
number with three digits after the decimal point and five characters in all (in-
cluding the decimal point). The f stands for “floating-point number.” Using
this format, the output of the program would look like this:
Summary
• One can represent complex data structures by nesting arrays and hashes, for example, by constructing an array of hashes or a hash of arrays.
• When printing numbers, one can specify how much precision will be used
by using the formatted print statement, printf.
Arrays and hashes group entities together, but they are rather more like tissues than organs or organisms, because all of the entities
that make up the collection are the same kind of entity. With modules and
objects, the grouping includes entities that are dissimilar. Thus one can group
together scalars, arrays, hashes, procedures, and so on, all in a single unit.
Modules are mainly used for publishing programs. One person or group
of persons constructs a module for a specialized purpose. The module is then
published, usually at the Comprehensive Perl Archive Network (CPAN) lo-
cated at cpan.org. The modules can then be downloaded and installed by
other people. If you have installed your own personal Perl library, then you
can look for and install modules by running the cpan command. If you have
Perl, but do not have cpan, then try the following command:
perl -MCPAN -e shell
If you don’t have Perl, then you will need to install it.
The cpan command (or its equivalent) presumes that you know which
modules you want to install. If you do not know which ones you would like,
then use one of the CPAN search engines, such as search.cpan.org. There
are over 100 packages that mention bioinformatics, plus there are many oth-
ers related to biology and medicine.
Once a module has been installed, the most common way for it to be used
is to construct a module object. Programs that use modules typically look
something like this:
use moduleName;
$p = new moduleName;
...
The use statement tells Perl that the program will be using a module. One
can use any number of modules. The new statement constructs an object. An
object is a reference to a collection of scalars, arrays, hashes, procedures, and
other objects, all of which have been grouped together in a single unit. The
parts of an object are obtained by using a special operator, written ->. For
example, if one of the parts of the module object $p is a procedure named
computeAverage, then the procedure is invoked by using the statement
$p->computeAverage;
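For example, the CPAN module Statistics::Descriptive could be used for the kind of statistics computed in section 10.1. The following lines are an added illustration, not part of the original text, and the module is chosen only as an example; consult its documentation for the full interface:
use Statistics::Descriptive;
$stat = new Statistics::Descriptive::Full;   # construct a module object
$stat->add_data(18.2, 27.3, 20.1, 31.3);     # methods are invoked with ->
print($stat->mean(), "\n");
print($stat->standard_deviation(), "\n");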
Perl modules for processing XML include the following:
1. XML::Parser provides one with the ability to process XML one element
at a time. It is analogous to reading a file one line at a time. However,
because elements can contain other elements, it is important to know not
only when one starts reading an element but also when an element is fin-
ished. This process is similar to the pattern-matching programs in subsec-
tion 10.1.4 such as program 10.12. The XML parser looks for the patterns
that indicate when an element begins and when an element ends.
Summary
• The cpan command, or its equivalent, can be used to install Perl modules
that have been published on the CPAN website.
• Perl modules are available for processing and querying XML documents.
use XML::Parser;
$p = new XML::Parser(Handlers => { Start => \&start });
$p->parsefile($ARGV[0]);
sub start {
$tag = $_[1];
%attributes = @_;
if ($tag eq "Interview") {
print("Weight $attributes{Weight}\n");
}
}
The handlers style of processing is provided by the module called XML::Parser. Suppose that we would like to obtain the Weight at-
tribute of every Interview in an XML document that looks like this:
<HealthStudy>
<Interview Date=’2000-1-15’ Weight=’46.27’.../>
<Interview Date=’2000-1-15’ Weight=’68.95’.../>
<Interview Date=’2000-2-1’ Weight=’92.53’.../>
<Interview Date=’2000-2-1’ Weight=’50.35’.../>
</HealthStudy>
This task can be accomplished by using program 10.14. The use statement
imports the XML::Parser module. If this statement fails, then this module
has not yet been installed. You can install it by using the cpan command or
its equivalent as described in subsection 10.2.1.
The next two statements of the program construct the XML parser and
parse the document. There are several styles for parsing. The style for pro-
cessing the XML document one element at a time is called the “handlers”
style. Handlers are Perl procedures that are invoked as various kinds of data
are encountered in the XML document. A Start handler is invoked whenever
an XML element is first encountered. There are many kinds of handler that
will be discussed later. In this case the Start handler is a procedure called
start.
The initial \& in front of start is telling Perl that one is passing the start
procedure to the parser as a parameter. Without this Perl would simply in-
voke the start procedure at this place in the program. In this case we never
explicitly invoke start in our own program. We want the parser to do this
instead. It will be invoked five times for the sample document: once for the
HealthStudy element and once each for the four Interview elements.
The parsing is actually done when the parsefile procedure is invoked.
This procedure belongs to the parser and is not one of your own procedures
(such as the start procedure), so it is invoked by using the -> operator on
the module object. Procedures that belong to a module are called methods.
The parameter is the name of the file to be parsed. In this case, the name of
the XML file will be specified on the command line. The file names on the
command line are in the ARGV array. Program 10.14 will be run by typing
this line to the computer:
perl printweights.perl healthstudy.xml
The start procedure will be invoked with two kinds of information: the
name of the element and the attributes of the element. The name of the el-
ement (also called its “tag”) is the second parameter. The first statement of
start sets the tag variable to the element name for later use. The rest of
the parameters are the attributes of the element. The simplest way to use
these parameters is to convert them to a hash and then look up the attributes
that are needed. The second statement converts the parameters to a hash
named attributes. The program prints the Weight attribute of every
Interview element. The output of the program is
Weight 46.27
Weight 68.95
Weight 92.53
Weight 50.35
The XML::Parser handlers all have a first parameter that is a reference to
an internal parsing procedure. This is used only if one wishes to get access
to low-level parsing information.
One might be curious about what those => symbols mean in this program.
As it happens, they are just another way of writing a comma. In other words,
one could equally well have constructed the parser using this statement:
$p = new XML::Parser(Handlers, { Start, \&start });
The purpose of the => symbols is to make the program easier to understand.
It is very common to specify parameters in pairs, where each pair consists of
the name of the parameter and the value of the parameter. The => symbols
are suggestive of this way of using parameters. This style for designing pro-
cedures is analogous to the attributes in an XML element. One first gives the
attribute name and then the attribute value. In XML the attribute name and
attribute value are separated by an equal-to sign. In Perl they are separated
by => symbols.
Program 10.14 can only process information that is in XML attributes. XML
content requires additional handlers. Consider the task of parsing the output
of program 10.19 of subsection 10.2.4. The XML document in this case has no
XML attributes at all, and all of the data are in XML content. Program 10.15
will accomplish the task. Just as a story has a beginning, a middle, and an
end, there are now three handlers: one for when an element starts, one for the
content, and the last one for when an element ends. The weightElement
variable is nonzero exactly when one is parsing a Weight element. This
ensures that the char procedure will print the content only for Weight ele-
ments. In general, the char procedure will be invoked several times within
a single element. It will usually be called once for each line of the content.
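If the content needs to be treated as a single string, it can be accumulated across the Char calls and used in the End handler. The following sketch is an added variation on program 10.15, not part of the original text:
sub start {
    $weightElement = 1 if $_[1] eq "Weight";  # only set the flag here
}
sub char {
    $buffer .= $_[1] if $weightElement;       # collect each piece of the content
}
sub end {
    if ($weightElement) {
        print("Weight $buffer\n");            # print the accumulated content
        $buffer = "";
        $weightElement = 0;
    }
}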
One of the most useful resources for general biomedical information is
PubMed. This is a repository of citations to biomedical publications. More
than half of the citations include abstracts. There are over 15 million citations
available online using PubMed. These citations are available as XML docu-
ments. The following is what part of a typical PubMed citation looks like.
The actual citation is over 130 lines long.
use XML::Parser;
$p = new XML::Parser(Handlers =>
    { Start => \&start, End => \&end, Char => \&char });
$p->parsefile($ARGV[0]);
sub start {
$tag = $_[1];
if ($tag eq "Weight") {
print("Weight ");
$weightElement = 1;
}
}
sub char {
if ($weightElement) {
print($_[1]);
}
}
sub end {
if ($weightElement) {
print("\n");
$weightElement = 0;
}
}
<Journal>
<ISSN>1083-7159</ISSN>
<JournalIssue>
<Volume>4</Volume>
<Issue>4</Issue>
<PubDate>
<Year>1999</Year>
</PubDate>
</JournalIssue>
</Journal>
<ArticleTitle>Breast cancer highlights.</ArticleTitle>
<Pagination>
<MedlinePgn>299-308</MedlinePgn>
</Pagination>
<Affiliation>Massachusetts General Hospital,
Boston, Massachusetts 02114-2617, USA.
Kuter.Irene@MGH.Harvard.edu</Affiliation>
<AuthorList CompleteYN="Y">
<Author>
<LastName>Kuter</LastName>
<ForeName>I</ForeName>
<Initials>I</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType>Congresses</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>UNITED STATES</Country>
<MedlineTA>Oncologist</MedlineTA>
<NlmUniqueID>9607837</NlmUniqueID>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Antineoplastic Agents, Hormonal</NameOfSubstance>
</Chemical>
...
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
...
<MeshHeading>
<DescriptorName MajorTopicYN="N">Piperidines</DescriptorName>
<QualifierName MajorTopicYN="N">therapeutic use</QualifierName>
</MeshHeading>
...
</MeshHeadingList>
</MedlineCitation>
An XML document would contain this citation as one of its elements. Con-
sider the task of extracting the title of the article together with the list of all
MeSH descriptors. The program for parsing an XML document to extract
this information is shown in program 10.16. It prints the PubMed identifier,
the article title, and each MeSH descriptor, one item per line.
use XML::Parser;
$p = new XML::Parser(Handlers =>
    { Start => \&start, Char => \&char });
$p->parsefile($ARGV[0]);
sub clear {
$pmidElement = 0;
$titleElement = 0;
$descElement = 0;
}
sub start {
if ($_[1] eq "PMID") {
$pmidElement = 1;
} elsif ($_[1] eq "ArticleTitle") {
$titleElement = 1;
} elsif ($_[1] eq "DescriptorName") {
$descElement = 1;
}
}
sub char {
if ($pmidElement) {
print("PubMed ID: $_[1]\n");
} elsif ($titleElement) {
print("Title: $_[1]\n");
} elsif ($descElement) {
print("Descriptor: $_[1]\n");
}
clear;
}
The handlers style of processing XML documents is the most efficient way
to process XML. In fact, all other styles are based on the handlers style. How-
ever, the handlers style is difficult to use when one needs to do more compli-
cated processing of the document. Subsection 10.2.3 presents another style
that is better suited to more complex tasks.
Summary
• One way to process XML documents is to parse the document one element
at a time. This is called the handlers style.
• In the handlers style, one specifies procedures that are invoked by the
parser. Most commonly one specifies procedures to be invoked at the
start of each element, for the text content of the element, and at the end of
the element.
• The handlers style for parsing XML documents is efficient and fast but is
only appropriate when the processing to be done is relatively simple.
use XML::DOM;
$p = new XML::DOM::Parser;
$doc = $p->parsefile($ARGV[0]);
$weights = $doc->getElementsByTagName("Weight");
for ($i = 0; $i < $weights->getLength; $i++) {
$weight =
$weights->item($i)->getFirstChild->getNodeValue;
print("Weight $weight\n");
}
Program 10.17 Converting an entire XML document using a Perl data structure
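The whole document style works equally well when the data are stored in XML attributes rather than in content. The following is a minimal sketch, assuming the attribute-based form of the health study document used earlier in this chapter.

use XML::DOM;

$p = new XML::DOM::Parser;
$doc = $p->parsefile($ARGV[0]);

# Extract the Weight attribute of every Interview element
$interviews = $doc->getElementsByTagName("Interview");
for ($i = 0; $i < $interviews->getLength; $i++) {
    $weight = $interviews->item($i)->getAttribute("Weight");
    print("Weight $weight\n");
}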
Summary
• The whole document style of XML processing reads the entire document
into a single Perl data structure.
• DOM methods are used to extract information from an XML document.
• The entities that occur in an XML document are represented by DOM
nodes.
• DOM lists are used for holding a collection of DOM nodes.
print("<HealthStudy>\n");
while (<>) {
$month = substr($_, 0, 2) + 0;
$day = substr($_, 2, 2) + 0;
$yr = substr($_, 4, 2) + 0;
$year = 1900 + $yr;
$year = 2000 + $yr if $yr < 20;
$bmi = substr($_, 6, 8) + 0;
$status = "normal";
$status = "obese" if substr($_, 14, 3) + 0 > 0;
$status = "overweight" if substr($_, 17, 3) + 0 > 0;
$height = substr($_, 20, 3) + 0;
$weight = substr($_, 23, 8) + 0;
print("<Interview Date=’$year-$month-$day’");
print(" BMI=’$bmi’ Status=’$status’");
print(" Height=’$height’ Weight=’$weight’/>\n");
}
print("</HealthStudy>\n");
<HealthStudy>
<Interview Date='2000-1-15' BMI='18.66' .../>
<Interview Date='2000-1-15' BMI='26.93' .../>
<Interview Date='2000-2-1' BMI='33.95' .../>
<Interview Date='2000-2-1' BMI='17.38' .../>
</HealthStudy>
The solution is shown in program 10.18. This program stores all information
in XML attributes. Another way to store information is to use XML content
instead. For the health study example, the XML document would then look
like this:
<HealthStudy>
<Interview>
<Date>2000-1-15</Date>
<BMI>18.66</BMI>
<Status>normal</Status>
<Height>62</Height>
<Weight>46.27</Weight>
</Interview>
<Interview>
<Date>2000-1-15</Date>
<BMI>26.93</BMI>
<Status>overweight</Status>
<Height>63</Height>
<Weight>68.95</Weight>
</Interview>
<Interview>
<Date>2000-2-1</Date>
<BMI>33.95</BMI>
<Status>obese</Status>
<Height>65</Height>
<Weight>92.53</Weight>
</Interview>
<Interview>
<Date>2000-2-1</Date>
<BMI>17.38</BMI>
<Status>normal</Status>
<Height>67</Height>
<Weight>50.35</Weight>
</Interview>
</HealthStudy>
print("<HealthStudy>\n");
while (<>) {
$month = substr($_, 0, 2) + 0;
$day = substr($_, 2, 2) + 0;
$yr = substr($_, 4, 2) + 0;
$year = 1900 + $yr;
$year = 2000 + $yr if $yr < 20;
$bmi = substr($_, 6, 8) + 0;
$status = "normal";
$status = "obese" if substr($_, 14, 3) + 0 > 0;
$status = "overweight" if substr($_, 17, 3) + 0 > 0;
$height = substr($_, 20, 3) + 0;
$weight = substr($_, 23, 8) + 0;
print("<Interview>\n");
print(" <Date>$year-$month-$day</Date>\n");
print(" <BMI>$bmi</BMI>\n");
print(" <Status>$status</Status>\n");
print(" <Height>$height</Height>\n");
print(" <Weight>$weight</Weight>\n");
print("</Interview>\n");
}
print("</HealthStudy>\n");
Consider the task of producing an XML document consisting of
two elements, one inside the other, that looks like this:
<Main>
<Part id='p1'>XML Example</Part>
</Main>
except that the Part id and the content of the Part element are obtained from
an input file:
p1:XML Example
print("<Main>\n");
while (<>) {
chomp;
@fields = split(/:/);
print(" <Part id='$fields[0]'>$fields[1]</Part>\n");
}
print("</Main>\n");
<Main>
<Part id='[% name %]'>[% content %]</Part>
</Main>
It is difficult to see the structure of the output document in such a
program because the XML text is spread throughout the Perl code. The Perl
template is shown in template 10.1.
Notice how the Perl template looks much more like the output that is to be
produced than the Perl program. The parts of the template in bracketed per-
cent signs are the variable parts of the template. The rest of the template is
the constant part. The constant part is printed exactly as shown. The variable
parts are instantiated with the values of what look like variables. However,
the names id and content are actually hash keys, not variables. The tem-
plate is used from program 10.21.
The first line of the program tells Perl that the Template Toolkit package is
being used. The data are obtained by reading the first line of the input file
and extracting the data to be used in the template. The last part of the pro-
gram invokes the template package. The first statement constructs the tem-
plate processor using the Template Toolkit package. The second statement
constructs a hash that tells the template processor the data that should be
used for instantiating the template. As noted earlier, what look like variables
in the template are actually hash keys. The third statement actually does the
processing. The template processor needs two parameters: the name of the
template file and the hash containing the data to be used for instantiation of
the template.
use Template;
while (<>) {
chomp;
@fields = split(/:/);
$name = $fields[0];
$content = $fields[1];
}
$tt = new Template;
$vars = {
name => $name,
content => $content,
};
$tt->process('part.tt', $vars);
Now consider a more interesting transformation task: the first task of this
chapter. To use a Perl template, the data extracted from the input file must
be organized into a data structure to be used by the template processor for
instantiating the template as in program 10.22. The while statement con-
structs an array of hashes. Each hash gives the information about one inter-
view of the health study. In other words, each hash represents one record of
the health study database. The template processor is given this array in the
same way as in the earlier program, except that now there is just one hash
key: HealthStudyInterviews. This will be the name of the array within
the template. The template is shown in template 10.2. Notice that one iter-
ates over the elements of the array in almost the same way as in Perl. The
Template Toolkit, however, uses a more simplified notation than Perl:
1. Variables usually have no initial character such as $, @, or %. The Template
Toolkit does use the $, but only when one is substituting the value of a
variable within an expression. For example, if one has a variable named
status whose value is "obese," then the expression i.status could have
two different meanings. Should it mean $i{status} or should it mean
$i{obese}? In the Template Toolkit one specifies the second meaning by
writing i.$status. The $ prefix in the Template Toolkit means "substitute
the value of the variable here."
use Template;
while (<>) {
$interviews[$i]{month} = substr($_, 0, 2) + 0;
$interviews[$i]{day} = substr($_, 2, 2) + 0;
$interviews[$i]{year} = 2000 + substr($_, 4, 2);
$interviews[$i]{bmi} = substr($_, 6, 8) + 0;
$status = 'normal';
if (substr($_, 14, 3) + 0 > 0) { $status = 'obese'; }
if (substr($_, 17, 3) + 0 > 0)
{ $status = 'overweight'; }
$interviews[$i]{status} = $status;
$interviews[$i]{height} = substr($_, 20, 3) + 0;
$interviews[$i]{weight} = substr($_, 23, 8) + 0;
$i++;
}
$tt = new Template;
$vars = {
HealthStudyInterviews => [@interviews],
};
$tt->process("health.tt", $vars);
<HealthStudy>
[% FOREACH i IN HealthStudyInterviews %]
<Interview Date='[% i.year %]-[% i.month %]-[% i.day %]'
BMI='[% i.bmi %]' Status='[% i.status %]'
Height='[% i.height %]' Weight='[% i.weight %]'/>
[% END %]
</HealthStudy>
2. An item of a hash or an array is selected by writing a period. For example,
i.status in the template plays the role of $i{status} in Perl.
3. Keywords such as FOREACH are written using all capital letters in the
Template Toolkit, but using lowercase letters in Perl.
The Template Toolkit can simplify its notation because it supports a very
limited range of features compared with Perl.
Next consider a more difficult transformation such as transforming the
output produced by BioProspector as in subsection 10.1.4. The Perl program
for extracting the motifs must be modified so that the information is kept in a
Perl data structure which is given to the template in the usual way, as shown
in program 10.23. The corresponding template is shown in template 10.3.
Running this program on the BioProspector file produces output that begins
like this:
<MotifData>
<Motif id='1'>
<DNA>
<A>0.00</A>
<C>0.21</C>
<T>0.59</T>
<G>0.21</G>
</DNA>
...
The extra blank lines come from the FOREACH and END directives. These do
not produce any text by themselves, so they show up as blank lines in the
output. To get rid of the unnecessary blank lines and other spaces, just add
dashes at the end of each directive, as shown in template 10.4.
Summary
• To convert non-XML data to the XML format, one can use the same tech-
niques that apply to any kind of processing of text data. The XML docu-
ment is just another kind of output format.
use Template;
while (<>) {
chomp;
if (/Motif #([0-9]+):/) {
$label = $1;
$i = 0;
} elsif ($label && /^[0-9]+/) {
@fields = split;
$motifs{$label}[$i]{A} = $fields[1];
$motifs{$label}[$i]{G} = $fields[2];
$motifs{$label}[$i]{C} = $fields[3];
$motifs{$label}[$i]{T} = $fields[4];
$i++;
}
}
$tt = Template->new();
$vars = {
MotifData => {%motifs},
};
$tt->process("motif.tt", $vars);
Program 10.23 Using pattern matching to extract data and then formatting it with
Perl templates
• The Perl Template Toolkit has its own language for iteration and selecting
an item of a hash or array. The Template Toolkit language is much simpler
than Perl because it has fewer features.
<MotifData>
[% FOREACH label IN MotifData.keys.sort %]
<Motif id='[% label %]'>
[% FOREACH position IN MotifData.$label %]
<DNA>
<A>[% position.A %]</A>
<C>[% position.C %]</C>
<T>[% position.T %]</T>
<G>[% position.G %]</G>
</DNA>
[% END %]
</Motif>
[% END %]
</MotifData>
Template 10.3 Perl template for formatting Perl hashes and arrays
<MotifData>
[% FOREACH label IN MotifData.keys.sort -%]
<Motif id='[% label %]'>
[% FOREACH position IN MotifData.$label -%]
<DNA>
<A>[% position.A %]</A>
<C>[% position.C %]</C>
<T>[% position.T %]</T>
<G>[% position.G %]</G>
</DNA>
[% END -%]
</Motif>
[% END -%]
</MotifData>
use XML::Parser;
$p = new XML::Parser(Handlers => { Start => \&start });
print("<HealthStudyUS>\n");
$p->parsefile($ARGV[0]);
print("</HealthStudyUS>\n");
sub start {
$tag = $_[1];
%attributes = @_;
if ($tag eq "Interview") {
print(" <Interview");
print(" Date=’$attributes{Date}’");
$WeightUS = $attributes{Weight} * 2.2;
print(" Weight=’$WeightUS’");
$HeightUS = $attributes{Height} * 0.39;
print(" Height=’$HeightUS’");
print("/>\n");
}
}
<HealthStudy>
<Interview Date='2000-1-15' BMI='18.66' .../>
<Interview Date='2000-1-15' BMI='26.93' .../>
...
The output document should contain only the date, height, and weight at-
tributes of each interview. The result of running program 10.24 using the
example database is
<HealthStudyUS>
<Interview Date='2000-1-15' Weight='101.794' Height='24.18'/>
<Interview Date='2000-1-15' Weight='151.69' Height='24.57'/>
<Interview Date='2000-2-1' Weight='203.566' Height='25.35'/>
<Interview Date='2000-2-1' Weight='110.77' Height='26.13'/>
</HealthStudyUS>
<WeightList>
<Weight>46.27</Weight>
<Weight>68.95</Weight>
<Weight>92.53</Weight>
<Weight>50.35</Weight>
</WeightList>
use XML::Parser;
$p = new XML::Parser(Handlers =>
    { Start => \&start, End => \&end, Char => \&char });
print("<WeightList>\n");
$p->parsefile($ARGV[0]);
print("</WeightList>\n");
sub start {
$tag = $_[1];
if ($tag eq "Weight") {
print(" <Weight>");
$weightElement = 1;
}
}
sub char {
if ($weightElement) {
print($_[1]);
}
}
sub end {
if ($weightElement) {
print("</Weight>\n");
$weightElement = 0;
}
}
use XML::Parser;
$p = new XML::Parser(Handlers =>
{ Start => \&start, End => \&end, Char => \&char });
$p->parsefile($ARGV[0]);
sub start {
$tag = $_[1];
if ($tag eq "HealthStudy") {
print("<HealthStudy>\n");
}
elsif ($tag eq "Interview") {
print("<Interview");
}
elsif ($tag eq "Date") {
print(" Date=’");
$printContent = 1;
}
...
}
sub char {
if ($printContent) {
print($_[1]);
}
}
sub end {
$tag = $_[1];
if ($tag eq "HealthStudy") {
print("</HealthStudy>\n");
}
elsif ($tag eq "Interview") {
print("/>\n");
}
elsif ($printContent) {
print("’");
$printContent = 0;
}
}
The other parsing styles can help simplify the program, but all of them have disadvan-
tages. A better approach that has become very popular is to use the XML
Transformation Language that is introduced in the next chapter.
Summary
• Transformation from XML to XML using Perl can be done using any of
the parsing styles.
10.3 Exercises
In the following exercises, write a Perl program that determines the specified
information. The solutions to these exercises are available online at the book
website ontobio.org. Additional exercises are also available at this site.
1. Using the health study database in section 1.1, find all interviews in the
year 2000 for which the study subject had a BMI greater than 30. Print the
information for each such interview using tab-delimited fields. Compare
your answer with your solution to exercise 10.1.
2. Perform the same task as in exercise 10.1, but using a database in XML
format as in section 1.2. Write your program first by using patterns to
extract the information, and then by using the XML::Parser module.
3. Generalize exercise 10.2 to extract interviews for any year and any mini-
mum BMI value. Write your program as a Perl procedure which has two
parameters.
5. As in exercise 10.3, find all PubMed citations dealing with the therapeu-
tic use of glutethimide. For each citation print one line containing the
MedlineID, the title, and the date of publication in tab-delimited format.
6. For the health study database in section 1.1, the subject identifier is a field
named SID. Find all subjects in the database for which the BMI of the
subject increased by more than 4.5 during any period of time. For each
subject, print the subject identifier, the amount that the BMI increased,
and the period of time. Print the results in XML format. If this condition
is satisfied more than once by a subject, then print the maximum increase
in the BMI for this subject. Hint: Collect information about each subject
in a hash or array.
8. A file contains BioML data as in figure 1.3. For each gene in this file,
compute the total length of all exons that it contains.
11 The XML Transformation Language
The XML Transformation Language (XSLT) (W3C 2001d) is one of the most
popular, as well as the most commonly available, transformation languages
for XML documents. Although this language was originally intended for use
by the XML Stylesheet Language (XSL), one can use XSLT for many other
useful transformations, including data transformations for bioinformatics.
In fact, today XSLT is used mostly for general data transformation rather than for stylesheets. While there are many
XML transformation languages, XSLT has the advantage of being rule-based
and being itself written in XML. This chapter introduces this style of pro-
gramming.
XSLT is very different from the procedural style of programming that dom-
inates mainstream programming languages. XSLT is rule-based. An XSLT
rule is called a template, and an XSLT program is just a set of templates. The
templates are separate from one another (i.e., one template can never contain
another), and the order in which they appear in the program does not matter.
The whole XSLT program is called a transformation program or a transform.
Consider the document in figure 11.1 that shows some protein interaction
data from a microarray experiment. Suppose that one would like to change
the names (tags) of some of the elements. Specifically, suppose that instead of
Protein we want to use P, and instead of Substrate, use S. Transform 11.1
shows the XSLT program for doing this task. To understand how this pro-
gram functions, consider how enzymes digest molecules such as proteins.
Proteins are long chains of amino acids, and each enzyme is capable of split-
ting the chain at one or more specific points in the chain, which match the
active site of the enzyme. This process is shown symbolically in figure 11.2.
<Array>
<Protein id="Mas375">
<interaction substrate="Sub89032">
<BindingStrength>5.67</BindingStrength>
<Concentration unit="nm">43</Concentration>
</interaction>
<interaction substrate="Sub89033">
<BindingStrength>4.37</BindingStrength>
<Concentration unit="nm">75</Concentration>
</interaction>
</Protein>
<Protein id="Mtr245">
<interaction substrate="Sub89032">
<BindingStrength>0.65</BindingStrength>
<Concentration unit="um">0.53</Concentration>
</interaction>
<interaction substrate="Sub80933">
<BindingStrength>8.87</BindingStrength>
<Concentration unit="nm">8.4</Concentration>
</interaction>
</Protein>
<Substrate id="Sub89032"/>
<Substrate id="Sub89033"/>
</Array>
Each template acts like an enzyme that acts upon one or more kinds of
elements in the XML document. The kinds of elements that the template
can “attack” is specified by the match attribute. Most commonly, the match
condition is either the tag of the elements that the template can attack or a
“wild card” that allows the template to attack any element. If there are both
specific and generic templates, then the specific ones take precedence.
Since elements and attributes can have the same names, XSLT distinguishes
them by prefixing attribute names with an @ sign. Thus chromosome is the
name of an element, but @start is the name of an attribute. The wild card
notation for elements is node(), and the wild card notation for attributes is
@*. The templates in transform 11.1 use both of the wild card notations.
Enzymes can only attack locations on a protein chain that are “exposed.”
In the same way, templates only attack the highest-level elements that can
be matched. Lower-level elements become exposed only when the contain-
<?xml version="1.0"?>
<xsl:transform version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
</xsl:transform>
Transform 11.1 An XML transformation program that changes the name of protein
elements from “Protein” to “P”, and similarly changes “Substrate” to “S”. All other
elements are unchanged.
ing elements have been “digested.” Digestion and the subsequent expos-
ing of child elements to attack by other templates is accomplished by using
the xsl:apply-templates command. One can be selective about exactly
which of the child elements will be exposed by using a select criterion.
Figure 11.3 illustrates how the hierarchical structure relates to the templates.
Note that the context changes as a result of the “digestion” of an element.
The last template in transform 11.1 is saying: “by default, copy all elements
and attributes, and then apply appropriate templates to the attributes and
child elements that are in each element.” This template is a handy one to
include in any XSLT program that is modifying some of the features of an
XML document, but which is leaving most of the features unchanged.
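The templates that make up the body of transform 11.1 are not reproduced above. Based on this description, they presumably consist of two renaming templates together with the default copying template (the same copying template appears again in the exercises at the end of this chapter); a sketch:

<!-- Change all occurrences of Protein to P -->
<xsl:template match="Protein">
<P>
<xsl:apply-templates select="@*|node()"/>
</P>
</xsl:template>
<!-- Change all occurrences of Substrate to S -->
<xsl:template match="Substrate">
<S>
<xsl:apply-templates select="@*|node()"/>
</S>
</xsl:template>
<!-- By default, copy all elements and attributes -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>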
Figure 11.2 Abstract depiction of the process of digestion. The original chain is
shown in A. An enzyme (dark gray region) attacks the chain (B, C, D) in two locations,
splitting the chain each time. A second enzyme (light gray region) attacks two of the
subchains (E, F, G). The end result is five subchains (H).
Summary
• An XSLT program consists of templates.
Figure 11.3 The digestion process during XML transformation. The first template
digests a chromosome element and then releases the locus child elements to the
second template. The corresponding action on the hierarchy is to change the context
from the chromosome element to the locus as shown in the screen image.
...
<!-- Process the contents of the Array in order of id -->
<xsl:template match="Array">
<Array>
<xsl:apply-templates select="@*|node()">
<xsl:sort select="@id"/>
</xsl:apply-templates>
</Array>
</xsl:template>
<!-- Change all occurrences of Protein to P -->
<xsl:template match="Protein">
<P>
<xsl:apply-templates select="@*|node()"/>
</P>
</xsl:template>
Transform 11.2 A modification of the program in transform 11.1 in which the pro-
teins and substrates have been sorted by their ids
Normally, the elements of the source document are processed in the same order
in which they appear in the document, just as one reads a novel. This order is
called the document order. However, the order in
which elements are selected during the transformation can be changed by
using an xsl:sort element. In transform 11.2 a transformation is performed
that not only changes some element names but also changes the order of
those elements.
The apply-templates command serves to change the context of the
transformation from one element or attribute to another one. The for-each
is another command that accomplishes the same effect. The only difference
between them is that apply-templates causes another template to be-
come active in a new context while the for-each command stays inside
the same template. This is illustrated in transform 11.3 which changes the
tag of interaction elements within Protein elements to I.
Although apply-templates and for-each can produce the same result, there
are differences in practice. The for-each command is a traditional technique for
controlling the actions performed by a computer program, and those who
have programming experience will find it a familiar command. By contrast,
apply-templates is a rule-based command that uses a matching or “lock-
and-key” mechanism which is much more flexible and powerful.
The power of the rule-based apply-templates command can be seen by
comparing transform 11.2 with transform 11.3. In transform 11.3, child elements of a Protein other than
...
<xsl:template match="Protein">
<P>
<xsl:apply-templates select="@*"/>
<xsl:for-each select="interaction">
<I>
<xsl:apply-templates select="@*|node()"/>
</I>
</xsl:for-each>
</P>
</xsl:template>
...
</xsl:transform>
interaction elements would be lost. This would not occur if the trans-
formation of the interaction elements were done using another template.
The only interaction elements that will be transformed by the for-each
command are the ones that are child elements of a Protein element.
Nevertheless, the for-each command is useful, especially when one is
performing numerical calculations. This is the topic of the next section.
Summary
• A transformation action occurs in a context: the element or attribute being
transformed.
• The context is normally chosen in the same order in which the elements
or attributes appear in the document, but this order can be changed by
using a sort command.
...
<xsl:template match="Protein">
<P>
<xsl:attribute name="averageBindingStrength">
<xsl:value-of
select="sum(interaction/BindingStrength) div
count(interaction/BindingStrength)"/>
</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</P>
</xsl:template>
..
Transform 11.4 A modification of the program in transform 11.1 to compute the av-
erage binding strength of all interactions with a protein. The average binding strength
is shown as another attribute in each P element.
programming languages such as Perl, but XSLT adds a new feature to com-
putation: navigation.
Navigation is the process of conducting vehicles from one place to another.
The original meaning was concerned with ships on the sea. Nowadays it is
more commonly applied to the directions for driving a car from one place to
another. In the case of XML documents, one navigates from one element to
another. Instead of streets one navigates over elements, and instead of turn-
ing from one street to another, one traverses either “down” from an element
to a child element, or “up” from an element to its parent element.
The template in transform 11.4 shows how to perform both navigation and
computation. The objective is to compute the average binding strength of all
interactions of a protein. The value-of command evaluates the expression
in its select attribute. The interaction/BindingStrength part of this
expression is the navigation using XPath as in section 8.1. It specifies that one
should select all interaction elements in the context and then select all
BindingStrength elements within the interaction elements. The slash
means that one navigates from a parent element to a child element. This
notation emulates the notation used for navigating among directories and
files (except that in Windows, a backward slash is used instead of a forward
slash).
An attribute command inserts an attribute into the current element (in
this case a P element). The sum is the numerical sum of all matching el-
ements, and the count is the number of all matching elements. The div
...
<xsl:template match="interaction">
<interaction>
<xsl:attribute name="protein">
<xsl:value-of select="../@id"/>
</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</interaction>
</xsl:template>
..
Transform 11.5 A template that adds the id of the containing element as a new
attribute
operator is short for “division.” Programming languages often use the slash
to denote division. Obviously one cannot use the same notation because that
would conflict with the use of slash to denote navigation.
Navigating from a child to a parent uses the same notation as in directories.
In transform 11.5, an attribute is added to the interaction element that
has the identifier of the corresponding Protein element.
The XSLT language inherits all of the operators that are available in XPath,
such as the ones in table 8.1. Two operators that seem to be missing are the
maximum and minimum operators. In fact, both of these can be computed
by using the xsl:sort command. This is explained in the next section.
Summary
• XSLT navigation is the process of traveling from one element or attribute
to another one in the document.
11.4 Conditionals
Conditionals are used to define special cases. For example, in section 1.1
the health study record defined normal weight, overweight, and obesity in
terms of ranges for the body mass index (BMI). In XSLT these ranges would
be written like this:
<xsl:choose>
<xsl:when test="@bmi<25">
Normal
</xsl:when>
<xsl:when test="@bmi<30">
Overweight
</xsl:when>
<xsl:otherwise>
Obese
</xsl:otherwise>
</xsl:choose>
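For instance, such a conditional could be used inside a template that computes a Status element for each interview. The following sketch assumes the attribute-based form of the health study document, with the body mass index stored in a bmi attribute, and relies on the default copying template of transform 11.1 to copy the other attributes:

<xsl:template match="Interview">
<Interview>
<xsl:apply-templates select="@*"/>
<Status>
<xsl:choose>
<xsl:when test="@bmi &lt; 25">Normal</xsl:when>
<xsl:when test="@bmi &lt; 30">Overweight</xsl:when>
<xsl:otherwise>Obese</xsl:otherwise>
</xsl:choose>
</Status>
</Interview>
</xsl:template>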
By using sorting and conditionals one can compute the maximum and
minimum. Here is the computation of a maximum:
<xsl:for-each select="interaction/BindingStrength">
<xsl:sort data-type="number" select="."/>
<xsl:if test="position()=last()">
<xsl:value-of select="."/>
</xsl:if>
</xsl:for-each>
This computation sorts all the binding strengths in increasing numerical or-
der. It then selects just the last (largest) one. Note the use of the “.” to denote
the current element. Alternatively, one could have sorted in descending or-
der and selected the first one as follows:
<xsl:for-each select="interaction/BindingStrength">
<xsl:sort data-type="number"
order="descending" select="."/>
<xsl:if test="position()=1">
<xsl:value-of select="."/>
</xsl:if>
</xsl:for-each>
<xsl:value-of select="BindingStrength[position()=1]"/>
will select just the first BindingStrength element. One can abbreviate the
test above as
<xsl:value-of select="BindingStrength[1]"/>
but this should only be used in simple cases like this one. Do not expect such
abbreviations to work for more complicated expressions.
Summary
• Conditionals are used for special cases.
11.5 Precise Formatting
The format-number function allows one to specify exactly how a number will be formatted. For example,
<xsl:value-of
select="format-number(3674.9806, '#,##0.0##')"/>
will print 3,674.981. The # symbol represents a digit that will be omitted
if it is insignificant. A zero in the pattern represents a digit position that will always be
printed even if it is insignificant. As another example,
<xsl:value-of
select="format-number(3674.9805, '#,##0.000')"/>
will print the number with exactly three digits after the decimal point.
If your output file is not an XML document, then you may want to exer-
cise more precise control over the output formatting by using the xsl:text
element. Consider these two templates:
<xsl:template match="Protein">
Protein information:
<xsl:apply-templates select="@*|node()"/>
</xsl:template>
<xsl:template match="Protein">
<xsl:text>Protein information:</xsl:text>
<xsl:apply-templates select="@*|node()"/>
</xsl:template>
The first template would produce generous amounts of space before and af-
ter the Protein information: text in the output file, while the second
would write nothing more than just the Protein information: text.
Since XSLT is designed to produce XML documents, it automatically escapes
the XML special characters: the left angle bracket < is written as &lt;, and
the ampersand & is written as &amp;. These two characters have a special
meaning in XML documents. If XSLT is being used to produce a non-XML
document, then one may want these two characters to be left alone. To force
XSLT to write left angle brackets and ampersands verbatim, use the
disable-output-escaping attribute in each xsl:text or xsl:value-of
element where this behavior is desired.
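For example, a transformation that produces program source code rather than XML might write a literal less-than sign and ampersands like this (a small sketch):

<xsl:text disable-output-escaping="yes">if (a &lt; b &amp;&amp; b &lt; c)</xsl:text>

With escaping disabled, the output is if (a < b && b < c) rather than the escaped form.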
Summary
• The format-number function allows one to specify the format of a num-
ber.
• The xsl:output element tells XSLT the kind of document that is being
produced so it can format the output document appropriately.
• The xsl:text element is used for controlling the amount of space in the
output document and also for informing XSLT whether or not to escape
the XML special characters.
11.6 Multiple Source Documents
An XSLT program can draw its input from more than one source document.
There are two main strategies for doing this:
1. The collection of files is a single XML document that was split into pieces
for convenience. In this strategy, all of the pieces must form a document
that conforms to a single DTD.
2. The collection of files consists of separate XML documents. In this strategy,
each additional document is read explicitly by using the document function.
Suppose that one has performed five experiments and that the data are
stored in five separate files, called experiment1.xml through experiment5.
xml. The experiment1.xml file might look like this:
<Experiment date="2003-09-01">
<Observation id="A23">
...
</Experiment>
The first strategy can be accomplished by using the notion of an XML en-
tity as discussed in section 1.4. A separate “main” file is created that looks
like this:
<?xml version="1.0"?>
<!DOCTYPE ExperimentSet SYSTEM "experiment.dtd"
[
<!ENTITY experiment1 SYSTEM "experiment1.xml">
<!ENTITY experiment2 SYSTEM "experiment2.xml">
<!ENTITY experiment3 SYSTEM "experiment3.xml">
<!ENTITY experiment4 SYSTEM "experiment4.xml">
<!ENTITY experiment5 SYSTEM "experiment5.xml">
]>
<ExperimentSet>
&experiment1;
&experiment2;
&experiment3;
&experiment4;
&experiment5;
</ExperimentSet>
The five files will automatically be incorporated into the main file. This is
done by the XML processor, not by XSLT, and there is nothing in the XSLT
transformation program that mentions anything about these files. Note that
only the main file mentions the DOCTYPE. This strategy requires that the files
being combined form an XML document that conforms to the overall DTD.
To accomplish the second strategy use the document function. For exam-
ple,
<xsl:for-each select="document(’experiment1.xml’)">
<xsl:apply-templates/>
</xsl:for-each>
<xsl:for-each select="document(’experiment2.xml’)">
<xsl:apply-templates/>
</xsl:for-each>
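In practice these loops would normally appear inside a template that supplies a single root element for the combined output; a sketch:

<xsl:template match="/">
<ExperimentSet>
<xsl:for-each select="document('experiment1.xml')">
<xsl:apply-templates/>
</xsl:for-each>
<xsl:for-each select="document('experiment2.xml')">
<xsl:apply-templates/>
</xsl:for-each>
</ExperimentSet>
</xsl:template>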
Summary
• XSLT can process multiple input source files by using XML entities to include
one file in another.
• Alternatively, the document function can be used to read another XML
document explicitly from within the transformation program.
11.7 Procedural Programming
Although XSLT is a rule-based language, one can also program in XSLT us-
ing the traditional procedural style. In particular, this means that one can
declare and use variables and procedures, and one can pass parameters to
procedures.
A variable is declared using the xsl:variable command. For example,
<xsl:variable name="x" select="BindingStrength[1]"/>
will set the variable x to the first BindingStrength element in the current
context. This command has approximately the same meaning as
$x = $BindingStrength[0];
in Perl. Note that XSLT starts counting at 1 while Perl normally starts count-
ing at 0.
An XSLT variable is used (evaluated) by writing the $ character before
the variable name. This convention is almost the same as in Perl, except
that Perl variables are not declared so they always appear with a preceding
character such as $. Another difference is that Perl distinguishes between
variables that represent collections of values from variables that represent
single values (called “scalars” in Perl). XSLT makes no such distinction.
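For example, having declared the variable x as above, one could write the following to output its value:

<xsl:value-of select="$x"/>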
Procedures in XSLT are just templates that have a name. They are called
by using the xsl:call-template command. The following template com-
putes the average of all BindingStrength elements in the current con-
text:
<xsl:template name="BindingStrengthAverage">
<xsl:value-of select="sum(BindingStrength) div
count(BindingStrength)"/>
</xsl:template>
<xsl:call-template name="BindingStrengthAverage"/>
Procedures often have parameters, and these are specified in XSLT by us-
ing xsl:param in the procedure. For example, the following will compute
the average of any set of elements:
<xsl:template name="average">
<xsl:param name="elements"/>
<xsl:value-of
select="sum($elements) div count($elements)"/>
</xsl:template>
<xsl:call-template name="average">
<xsl:with-param name="elements"
select="BindingStrength"/>
</xsl:call-template>
2. The procedure body. This is the part that performs the actual computa-
tion. It usually consists of a conditional element having two parts:
<xsl:template name="variance">
<xsl:param name="elements"/>
<xsl:param name="ssq"/>
<xsl:param name="i"/>
<xsl:choose>
<xsl:when test="$i > count($elements)">
<!-- The final computation goes here. -->
</xsl:when>
<xsl:otherwise>
<!-- The computation on each subelement goes here. -->
</xsl:otherwise>
</xsl:choose>
Since the iterator starts at 1, the computation is complete when the iterator
exceeds the total number of elements to be processed. It does not matter
whether the final computation is written first or second. So it could also be
written this way:
<xsl:choose>
<xsl:when test="$i <= count($elements)">
<!-- The computation on each subelement goes here. -->
</xsl:when>
<xsl:otherwise>
<!-- The final computation goes here. -->
</xsl:otherwise>
</xsl:choose>
<xsl:call-template name="variance">
<xsl:with-param name="elements" select="$elements"/>
<xsl:with-param name="ssq"
select="$ssq + $elements[position()=$i] *
$elements[position()=$i]"/>
<xsl:with-param name="i" select="$i + 1"/>
</xsl:call-template>
The first xsl:with-param command adds the square of the next element
to the accumulator. The second command increases the iterator by 1. The
call to the procedure continues the computation. The two commands can
be written in either order as they take effect only after the computation is
continued. So the following program does the same computation:
<xsl:call-template name="variance">
<xsl:with-param name="i" select="$i + 1"/>
<xsl:with-param name="ssq"
select="$ssq + $elements[position()=$i] *
$elements[position()=$i]"/>
<xsl:with-param name="elements" select="$elements"/>
</xsl:call-template>
The final computation divides the sum of squares by the number of ele-
ments and subtracts the square of the average:
<xsl:variable name="avg"
select="sum($elements) div count($elements)"/>
<xsl:value-of
select="$ssq div count($elements) - $avg * $avg"/>
Putting these together gives the following procedure for computing the
variance:
<xsl:template name="variance">
<xsl:param name="elements"/>
<xsl:param name="ssq"/>
<xsl:param name="i"/>
<xsl:choose>
<xsl:when test="$i > count($elements)">
<xsl:variable name="avg"
select="sum($elements) div count($elements)"/>
<xsl:value-of
select="$ssq div count($elements) - $avg * $avg"/>
</xsl:when>
<xsl:otherwise>
<xsl:call-template name="variance">
<xsl:with-param name="elements" select="$elements"/>
<xsl:with-param name="ssq"
select="$ssq + $elements[position()=$i] *
$elements[position()=$i]"/>
<xsl:with-param name="i" select="$i + 1"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
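To start the recursion, the procedure is called with the accumulator at zero and the iterator at 1. A sketch of such a call, assuming that the elements to be processed are the BindingStrength elements of the interactions in the current context:

<xsl:call-template name="variance">
<xsl:with-param name="elements"
select="interaction/BindingStrength"/>
<xsl:with-param name="ssq" select="0"/>
<xsl:with-param name="i" select="1"/>
</xsl:call-template>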
Summary
• XSLT can be used for traditional procedural programming.
• Variables are declared by using an xsl:variable element.
• Procedures are templates that have a name. The parameters of a proce-
dure are declared by using xsl:param elements.
• Procedures are called by using an xsl:call-template element. Pa-
rameters are passed to the procedure by using xsl:with-param ele-
ments.
• Although one could implement complex numerical algorithms in XSLT, it
is probably easier to use programming languages and tools that are de-
signed for such algorithms.
11.8 Exercises
The following exercises use the BioML example in figure 1.3. Each exercise
is solved with one or two templates that transform the kinds of elements
mentioned in the exercise. Each of the solutions is an XSLT program having
the following form:
<xsl:transform version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<!--
This template copies all elements and attributes
that do not appear in the template(s) above.
-->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
12 Building Bioinformatics Ontologies
12.1 Purpose of Ontology Development
Before developing an ontology, one should understand its purpose. This involves answering questions such as the following:
1. Why the ontology is being developed. One of the most common reasons
for building a formal ontology is to make shared information more usable.
However, there are other reasons why one would build a formal ontology.
It can be very useful for managing information used by small groups of
people or even by a single individual. This book, for example, was writ-
ten in XML, using an ontology that was built specifically for the needs of
this project. Yet another reason why one might build a formal ontology
is to analyze a domain, making explicit the assumptions being made by
the community. In this case, the very act of formalizing the domain can
be valuable irrespective of any other uses of the ontology. Finally, ontolo-
gies are often needed as part of a larger project, as in the example at the
beginning of the chapter.
2. What will be covered by the ontology. This is also called its scope. A clear
definition of the scope will prevent the ontology development effort from
expanding unnecessarily. Many ontologies have already been developed.
If an existing ontology has overlapping scope, then one should consider
reusing it. When an ontology is being developed as part of a larger project,
the scope will be dependent on the project scope.
3. Who will be using the ontology. If an ontology will only be used by a few
persons, possibly only one person, then its design will be very different
from an ontology that will be used by a much larger community. Indeed,
if an ontology will only be used by one person for a short time, then it is
possible to avoid writing it down explicitly. The authors of this book built
a formal ontology to help with the writing of the book. This ontology was
very useful even though it was used by only two persons.
4. When and for how long the ontology will be used. An ontology that will be
used for a few weeks will generally have a much different design than one
that is intended to be used for decades. Generally speaking, the longer an
ontology will be used, the more effort one should invest in its design.
5. How the ontology is intended to be used. An ontology intended for in-
formation retrieval may be different from one intended to be used for
scientific experimentation.
When a design choice is made, it is helpful to document the rationale for
the choice and to refer back to the original purpose of the ontology. A design
rationale should include the alternatives that were considered as well as the
reason for the choice that was made. When an ontology development project
involves a substantial amount of effort, then the statement of purpose will
take the form of a statement of project requirements. Such a statement can be
regarded as the contract which the developers have agreed to fulfill.
In this chapter we will use a medical chart ontology as an example of on-
tology development. Another example is developed in the exercises at the
end of the chapter. The purpose of an ontology has a significant influence
on how it should be developed. We begin by giving an informal descrip-
tion of the purpose of this ontology: A hospital would like to make its medical
chart information more easily available in its medical information system. The plan
is to develop an ontology that will be useful for the next decade. The medical chart
information will be used only by medical personnel who have permission to access
the information. The information will be used both for immediate diagnostic deci-
sions and for statistical data mining to detect long-term trends. The ontology must
cover medically relevant events for patients and must also allow for personnel to
make notes about patients and events. Events include tests, drug prescriptions, and
operations performed. All events must be categorized using standard categories.
This statement of purpose answers each of the five questions listed above: why, what, who, when, and how the ontology will be used.
Requirements for software development are often expressed using use case
diagrams. Use case diagrams are part of the Unified Modeling Language
(UML) (UML 2004). Although use case diagrams are intended for devel-
oping software, the technique can also be used for ontology development.
Use case diagrams are primarily useful for specifying who will be using the
ontology and how it will be used. They are not an effective way to spec-
ify why the ontology is being developed, what will be covered, and how
long the ontology will be used. A use case diagram shows the relationships
among actors and use cases. Actors represent anything that interacts with the
system. An actor is usually a role played by a person, but it can also repre-
sent an organization or another computer system. A use case represents an
interaction with the system. An ontology is not a computer system, but one
can identify the actors that interact with it, as well as the components with
which the actors interact. The requirements for the medical chart ontology
could be represented diagrammatically as in figure 12.1. This diagram was
created using the ArgoUML tool (Tigris 2004).
Summary
• Before developing an ontology, one should understand its purpose.
Figure 12.1 Use case diagram for the medical chart ontology, involving the Chart Ontology, Authorization, the Chart Database, and the Ontologist and Medical Personnel actors.
• Use case diagrams can specify who will use the ontology and how it will
be used.
• OWL Lite. This limited form of OWL was intended more for develop-
ers than for serious use. It allows a developer a first step on the way to
supporting the more substantial OWL-DL and OWL Full languages.
• OWL Full. This is the richest and most flexible of the web-based ontology
languages. It is also the least supported. Inference using OWL Full can be
slow or can fail entirely. This is not a flaw of existing tools, but rather is a
fundamental aspect of this language.
The major ontology languages can be divided into three main groups: basic
XML (XML DTDs and XSD), XML Topic Maps, and the Semantic Web languages (RDF and OWL).
Ontologies within a single group are mostly compatible with one another.
XSD has more features than XML DTD, and it is easy to convert from a DTD
to a schema. Similarly RDF and the OWL languages differ from one another
mainly in what features are supported. Converting ontologies from one of
these language groups to another can be difficult. Converting from the first
group to one of the other two is especially problematic. Topic Maps, RDF,
and OWL require that all relationships be explicit, while XML relationships
are mostly implicit. As noted in the list above, there is an approach that com-
bines the first and third groups. Developing an ontology using this technique
is relatively easy, but it has the disadvantage that one is making no use of the
expressiveness of RDF and OWL.
Note that in the discussion of ontology languages above, the concern was
with conversion of ontologies from one ontology language to another, not
transformation of data from one ontology to another. Data transformation,
which we discussed at length in chapters 9 through 11, can involve trans-
forming data within the same ontology language group as well as between
language groups. Transformation can also involve data that are not web-
based or data that are not based on any formal ontology. While making a
good choice of an ontology language can make the transformation task eas-
ier, developing correct transformation programs can still be difficult.
Summary
• The major ontology languages used today can be classified as follows:
– Basic XML
* XML DTD
* XSD
– XML Topic Maps
– Semantic Web
* RDF
* OWL
1. OWL Lite
2. OWL-DL
3. OWL Full
• It is possible to use an approach that is compatible with XML DTD, XSD,
RDF, and the OWL languages.
...
If the DTD generated by this tool is not exactly what one had in mind, then
it is easy to modify it. The most common modification is to relax some of
the constraints. For example, one might change some of the mandatory
(#REQUIRED) attributes to optional (#IMPLIED) attributes.
3. XML editor. There are many XML editors, and some of them allow one
to create DTDs and XML schemas. For a survey of these tools, see (XML
2004).
4. RDF editor. Many RDF editors are now available. For a survey of the
RDF editors that were available as of 2002, see (Denny 2002a,b).
5. OWL editor. There are very few of these. The few that do exist were
originally developed for another ontology language and were adapted for
OWL. The best known OWL editor is Protégé-2000 from Stanford Medi-
cal Informatics (Noy et al. 2003). Protégé is an open source ontology and knowledge base editor.
Summary
The following are the main groups of approaches and tools for ontology de-
velopment:
• XML editor
• RDF editor
• OWL editor
George is a patient.
George is in the infectious disease ward.
George was admitted on 2 September 2004.
Dr. Lenz noted that George was experiencing nausea.
George’s temperature was 38.9 degrees C.
Nausea is classified using code S00034.
Summary
• Ontologies are based on domain knowledge.
• The following are the main sources of domain knowledge for ontology
development:
12.5 Reusing Existing Ontologies
Reusing an existing ontology can save the effort of developing one anew.
However, there are risks involved that must be balanced against the
advantages. Here are some of the reasons why existing ontologies might not
be appropriate:
1. The ontology may have inappropriate features or the wrong level of de-
tail. This could happen because the ontology was constructed for a differ-
ent purpose or in a different context.
1. Include. This is nearly the same as cutting and pasting except that it oc-
curs every time that the document is processed. The inclusion is speci-
fied by giving the URL of the ontology to be included. The ontology is
downloaded and substituted like any other included document into the
place where the inclusion was requested. An example of this is shown in
section 1.4 where five XML documents containing experimental data are
merged to form a single XML document. (Of course, one should be careful
to ensure that doing this does not violate the copyright.) The merger occurs each time the
main XML document is processed. In particular, this means that changing
one of the five XML documents would also change the merged document.
By contrast, if one constructed a new document by cutting and pasting,
then changes to any of the five XML documents would not be reflected in
the new document.
Each ontology language has its own special way to include and to import:
1. XML. One can only include into an XML document. There is no XML
import mechanism. The mechanism for inclusion is called an entity, and
there are two main kinds of entity, distinguished by the character that sig-
nals the inclusion. One can use either an ampersand or a percent sign. The
percent sign is used for including material into a DTD. The ampersand is
used for including material into the XML document. This was discussed
in section 1.4.
2. XSD. Both include and import are supported by XSD. The include mech-
anism uses the include element. Because included elements are in the
same namespace, there is the possibility of the same name being used for
two different purposes. The redefine element can be used to eliminate
such an ambiguity. Alternatively, one can choose to use the import ele-
ment which prevents ambiguities by keeping resources in different names-
paces.
4. OWL. One can import an OWL ontology into another one by using a prop-
erty named owl:imports. In addition, it is possible to declare that a re-
source in one namespace is the same as a resource in another. This allows
one to introduce concepts from one namespace into another namespace.
This is similar to the redefine element of XSD.
<xsd:import
namespace=
"http://www.w3.org/1998/Math/MathML"
schemaLocation=
"http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"/>
Note that the URI of the MathML namespace is not the same as the URL of its schema document (the schemaLocation).
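Similarly, an OWL ontology imports another one by giving its URI in the ontology header. The following is a small sketch; the URI is hypothetical:

<owl:Ontology rdf:about="">
<owl:imports rdf:resource="http://www.example.org/chart"/>
</owl:Ontology>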
Summary
• Reusing existing ontologies can save time and improve quality.
and one can attach notes to events. Notes are made by medical personnel.
The following are the content models:
<!--
A chart consists of one patient
and a sequence of events.
-->
<!ELEMENT Chart (Patient,Event*)>
<!--
An event is an admission, test, prescription
or operation. It also has at least one category
and may have any number of notes.
-->
<!ELEMENT Event
((Admission|Test|Prescription|Operation),
Category+, Note*)
>
<rdfs:Class rdf:ID="Chart"/>
<rdfs:Class rdf:ID="Patient"/>
<rdfs:Class rdf:ID="Event"/>
<rdfs:Class rdf:ID="Admission">
<rdfs:subClassOf rdf:resource="#Event"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Test">
<rdfs:subClassOf rdf:resource="#Event"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Prescription">
<rdfs:subClassOf rdf:resource="#Event"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Operation">
<rdfs:subClassOf rdf:resource="#Event"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Category"/>
<rdfs:Class rdf:ID="Note"/>
To help understand large hierarchies one should try to make them as uniform
as possible. While uniformity is a subjective notion, there are some objective
criteria that one can use to help make a taxonomy more uniform:
2. Every class that has subclasses should be subdivided into at least two and
no more than a dozen subclasses. Subdividing into a single class suggests
that the ontology is either incomplete or that the subclass is superfluous.
Subdividing into a large number of subclasses makes it difficult for a per-
son to understand or to navigate the taxonomy.
Unfortunately, these two criteria can conflict with each other. The taxon-
omy of living beings is a good example of this. The most general concept is
subdivided into domains, which are subdivided into kingdoms, which are
subdivided into phyla, continuing until one reaches individual species. The
notion of a phylum, for example, serves to identify a level in the taxonomy,
and every phylum represents the same level of generality throughout the hi-
erarchy. However, the price that one pays for this uniformity is that some
subclassifications consist of a single subclass while others consist of a large
number of subclasses.
When the number of subclasses is large, one can introduce new levels into
the hierarchy. In the taxonomy of living beings, additional levels are some-
times used to reduce the number of classes in a subclassification, such as
“subphyla,” “superphyla,” “suborders,” “superfamilies,” and so on. Unfor-
tunately, there is no easy way to deal with classes that have only one subclass.
In the case of the taxonomy of living beings, one can argue that the single
subclass is the only one that is currently known, leaving open the possibility
that others may have existed in the past or may be discovered in the future.
The species H. sapiens is the only species in the genus Homo. However, there
were other species in this genus in the past.
how the measurement was performed (e.g., orally, rectally, etc.). But one
might not stop there. Body temperature normally fluctuates with a circadian
rhythm, so the time of day should also be considered. One could continue
this elaboration forever.
As the example suggests, there is no limit to the degree of detail for any
concept. Aside from the additional development cost and effort that results
from scope creep, larger ontologies are harder to understand and to use. In
addition, overelaboration can result in overlapping scope with other ontolo-
gies. This is not a problem in itself, but it can become a problem when the
designs of the overlapping concepts are significantly different and it is nec-
essary to make use of both ontologies.
All ontological commitments should be documented with a rationale for
why the commitment was made. Documenting such commitments is much
harder than it seems. The problem is that one may not be aware of the as-
sumptions that are being made. Realizing that one is making implicit as-
sumptions can be a difficult exercise. The best way to discover such assump-
tions is to have a well-stated purpose and scope for the ontology. Ontological
commitments most commonly occur at the “boundaries” of the project scope.
It is best to keep the ontology as simple as possible and to elaborate all con-
cepts only as required. Staying within the scope not only limits the amount
of work required, it also furnishes a good rationale for ontological commit-
ments.
As more was learned about them, the taxonomy continued to change, but the
disjointness condition was maintained.
Ontology languages differ from one another with respect to how disjoint-
ness is specified and whether it is implicitly assumed. In some ontology
languages, subclasses are necessarily disjoint, unless one specifies otherwise.
Other ontology languages presume that subclasses may overlap unless one
specifies that they are disjoint. XML DTDs do not have a mechanism for al-
lowing a particular element to belong to more than one type of element. Each
particular element has exactly one tag. Thus XML DTDs do not allow any
overlap among element types. By contrast, RDF and OWL allow instances to
belong to more than one class, as long as the classes have not been explicitly
specified to be disjoint (which can be specified in OWL, but not in RDF).
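For example, in OWL one could state that the Admission and Test classes of the medical chart ontology have no instances in common; a sketch:

<owl:Class rdf:about="#Admission">
<owl:disjointWith rdf:resource="#Test"/>
</owl:Class>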
Summary
• XML hierarchies are concerned with the structure of the document.
• RDF and OWL hierarchies are concerned with the subclass relationships.
muscles, organs, skin, and so on. While one can have an ontology consisting
of just a class hierarchy, such as the many classic taxonomies, they are just as
lifeless as a skeleton by itself. Properties are essential for giving an ontology
real meaning.
Properties can be classified in several ways:
2. Data vs. resource. XSD is in two parts. The first part deals with data struc-
tures (made up of XML elements) and the second deals with datatypes
(such as numbers and dates, which do not involve XML elements). In
XML, data structures are built using child elements. For example, a Med-
line citation such as figure 2.1 is an elaborate data structure using many
elements. A simple datatype value, on the other hand, can be represented
in XML using either XML attributes or XML elements. For example, the
fact that George’s height is 185 can be expressed either as an attribute:
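<Person name="George" height="185"/>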
or as a child element:
<Person name="George">
<height>185</height>
</Person>
moment, one can only state this in the informal description of the prop-
erty. However, it is useful to classify properties this way since it affects
the design of the ontology. We will see examples in the rest of this section.
<Event eventType="Test">
...
</Event>
<!ATTLIST Event
eventType
(Admission|Test|Prescription|Operation) #REQUIRED>
<owl:ObjectProperty rdf:ID="eventType">
<rdfs:range>
<owl:Class rdf:ID="EventType">
<owl:oneOf rdf:parseType="Collection">
<EventType rdf:ID="Admission"/>
<EventType rdf:ID="Test"/>
<EventType rdf:ID="Prescription"/>
<EventType rdf:ID="Operation"/>
</owl:oneOf>
</owl:Class>
</rdfs:range>
</owl:ObjectProperty>
3. Subclasses have new properties. When a subclass will have additional fea-
tures, such as an additional property, then it is better to use subclassing
rather than property values. For example, in the Event classification of the
medical chart ontology, one would expect that each of the subclasses will
have properties unique to the subclass. For example, a prescription in-
stance will have a drug and administration schedule, which other events,
such as an admission, would not have. However, this criterion is not as
Possibly the most subtle issue concerning subclasses is the issue of how
the instances act or are acted upon. The classic example of the resulting con-
fusion is the question of whether square is a subclass of rectangle. From a
logical point of view, it seems obvious that squares are a proper subset (and
therefore subclass) of rectangles. However, according to one view of cog-
nition and concepts, objects can only be defined by specifying the possible
ways of acting on them (Indurkhya 1992). For instance, Piaget showed that
the child constructs the notion of object permanence in terms of his or her
own actions (Piaget 1971).
Accordingly, the concept square, when defined to be the set of all squares
without any actions on them, is not the same as the concept square in which
the objects are allowed to be shrunk or stretched. In fact, it has been found
that children’s concepts of square and rectangle undergo several transfor-
mations as the child’s repertoire of operations increases (Piaget and Inhelder
1967; Piaget et al. 1981). This suggests that one should model “squareness” as
a property value of the Rectangle class, called something like isSquare,
which can be either true or false.
In general, concepts in the real world, which ontologies attempt to model,
do not come in neatly packaged, mind-independent hierarchies. There are
many actions that can potentially be performed on or by objects. The ones
that are relevant to the purpose of the ontology can have a strong effect on
how the ontology should be designed (Baclawski 1997b). For still more ex-
amples of how complex our everyday concept hierarchies can be, see (In-
durkhya 2002; Lakoff 1987; Rosch and Lloyd 1978).
The domain of a property is the set of entities that are allowed to have that
property. For example, the supervisor property applies only to people.
The range of a property is the set of entities that may be values of the prop-
erty. For example, a height is a nonnegative number. When designing an
ontology it is useful to choose appropriate domains and ranges for proper-
ties. They should be neither too specific nor too general. If a domain or range
is too limiting, then acceptable statements may be disallowed. If a domain
or range is too general, then meaningless statements will be allowed.
A more subtle ontology design issue is to ensure that the property is at-
tached to the right set in the first place. For example, it may seem obvious
that the body temperature is a property of a person. However, this fails to
consider the fact that a person’s body temperature varies with time. This
may be important when one is recording more than one temperature mea-
surement as in the medical chart ontology. As a result, it would be more
appropriate for the domain of the body temperature to be an event rather
than a person.
In XML and XSD, the domain of an attribute is the set of elements that use
the attribute. In the BioML DTD, for example, virtually every element can
have a name attribute, but not every element can have a start attribute.
XML DTDs have only a limited capability for specifying ranges of attributes.
The most commonly used ranges are CDATA (arbitrary text) and NMTOKEN
(which limits the attribute to names using only letters, digits, and a few other
characters such as underscores). XSD has a much more elaborate capability
for specifying attribute ranges, as discussed in section 2.4.
In RDF, the domain and range of a property are specified using domain
and range statements. For example, the height of a person would be declared
as follows:
<rdf:Property rdf:ID="personHeight">
<rdfs:domain rdf:resource="#Person"/>
<rdfs:range rdf:resource="&xsd;decimal"/>
</rdf:Property>
<rdf:Property rdf:ID="supervisor">
<rdfs:domain rdf:resource="#Person"/>
<rdfs:range rdf:resource="#Person"/>
</rdf:Property>
In OWL, one can define domains and ranges in the same way as in RDF.
For example, in the medical chart ontology, each event may be authorized
by a member of the staff. This is specified in OWL as follows:
<owl:ObjectProperty rdf:ID="authorizedBy">
<rdfs:domain rdf:resource="#Event"/>
<rdfs:range rdf:resource="#Staff"/>
</owl:ObjectProperty>
In addition, OWL has the ability to specify local ranges relative to a domain
by means of owl:allValuesFrom. For example, suppose that admissions may
only be authorized by a doctor. In other words, when an event is in the
Admission subclass of Event, then the range of authorizedBy is the sub-
class Doctor of Staff. This is specified in OWL as follows:
<owl:Class rdf:about="#Admission">
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#authorizedBy"/>
<owl:allValuesFrom rdf:resource="#Doctor"/>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
Note that this does not require that admissions be authorized. It only states
that when an admission has been authorized, then it must be authorized by
a doctor. To require authorization, one must impose a cardinality constraint.
This is covered in the next subsection.
Many design methodologies treat classes as the most important design no-
tion, and relegate properties to a subsidiary role in which properties belong
to their domain classes. For example, the methodology in (Noy and McGuin-
ness 2001) regards properties as “slots” belonging to a class. XML and XSD,
as well as most software engineering methodologies, take this point of view.
OWL and RDF use an alternative point of view in which classes and proper-
ties have the same status (Baclawski et al. 2001). This design methodology is
called aspect-oriented modeling, and it is supported by the most recent version
of UML (UML 2004).
The last column shows the symbol used in XML DTDs and in Perl patterns.
One can impose cardinality constraints with numbers other than 0 or 1, but
this is rarely done. The last row in the table is used by properties that are
defined on a general domain, but which do not apply for some subsets. One
cannot specify this in an XML DTD or Perl pattern. The various ontology
languages differ a great deal with respect to their ability to specify cardinality
constraints.
XML DTD. An attribute can only appear once in an element. In other
words, every attribute has maximum cardinality equal to 1. If an attribute is
#IMPLIED, then the attribute is optional. One can specify that an attribute is
#REQUIRED. This is the same as requiring that the cardinality be equal to 1.
The number of occurrences of a child element is specified using the character
in the last column of the table above.
XSD. As in XML DTDs, an attribute can only appear once in an element.
For child elements, the number of occurrences is specified using minOccurs
and maxOccurs.
RDF. In this case, properties can have any number of values, and one can-
not impose any cardinality constraints.
OWL. This language has the most elaborate cardinality constraints, and
they can be either global (i.e., applying to the property no matter how it
is used) or local (i.e., applying only to specific uses of the property). The
global cardinality constraints are owl:FunctionalProperty and
owl:InverseFunctionalProperty. If a property is declared to be an
owl:FunctionalProperty, then it is mathematically a partial function, that is,
it can take at most one value on each domain element. This is the same as
stating that this property has a maximum cardinality equal to 1. If a property
is declared to be an owl:InverseFunctionalProperty, then its inverse
property is a partial function.
The local cardinality constraints are:
<owl:Class rdf:about="#Admission">
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#authorizedBy"/>
<owl:someValuesFrom rdf:resource="#Doctor"/>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
<owl:Class rdf:about="#Admission">
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#authorizedBy"/>
<owl:minCardinality
rdf:datatype="&xsd;nonNegativeInteger"
>1</owl:minCardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
<owl:Class rdf:about="#Admission">
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#authorizedBy"/>
<owl:cardinality
rdf:datatype="&xsd;nonNegativeInteger"
>1</owl:cardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
Summary
• Properties can be classified in several ways:
• One should specify the domain and range of every property. They should
be neither too general nor too specific.
• Cardinality constraints are important for ensuring the integrity of the know-
ledge base.
3. Create examples from the ontology and check that they are meaningful.
Creating examples from the ontology is the reverse of the process above.
Instead of starting with meaningful usage examples and expressing them,
one expresses examples and checks that they are meaningful. The examples
can be either specific or generic. Some of the issues one should consider are
the following:
3. Dynamics. Can a property value change? This tests whether the property
is intrinsic or extrinsic and can affect the class hierarchy as discussed in
subsection 12.7.1. For example, can a patient change his or her name?
Consistency checkers vary with respect to how they explain the problems
that are found. In addition to finding inconsistencies, most of the tools also
give advice about situations that are not inconsistencies but which could be
indicative of an error. Such a situation is called a symptom by analogy with
the medical notion of a symptom of a disease (Baclawski et al. 2004). The
ConsVISor consistency checker is unique in having the ability to produce
output that itself conforms to an ontology.
When flaws in the ontology design are revealed during validation, the on-
tology must be modified. Ontologies are also modified after they are pub-
lished. This can happen because new concepts have been introduced, exist-
ing concepts change their meaning, or concepts can be related in new ways.
When concepts and relationships change, it is tempting to modify the ontol-
ogy to reflect those changes. However, the danger is that programs and data
that depend on the ontology will no longer be compatible. Ontology modifi-
cation is also called ontology evolution. Certain modifications are relatively
benign and are unlikely to have much effect on programs. Adding new at-
tributes and relaxing constraints are usually innocuous. Some of the more
substantial modifications include:
• Reification. This is the process whereby concepts that are not classes are
given class status. For example, a relationship can be reified to become a
class. Reifying a relationship will replace the relationship with a class and
two relationships.
Figure 12.3 Medical chart ontology evolution. The temperature property is now
connected to the Test class.
Figure 12.4 Another medical chart ontology modification. The temperature prop-
erty is now connected to the Event class.
Summary
• Ontology validation consists of the following activities:
12.9 Exercises
Most of the exercises are based on the development of an ontology for single
nucleotide polymorphism (SNP).
1. The informal description of the purpose of the SNP ontology is the follow-
ing: A small group of researchers would like to formalize their understanding of
single nucleotide polymorphisms. The ontology will only be used for a few weeks.
The ontology is only concerned with giving a high-level view of SNPs, which
does not deal with the details.
Summarize the purpose succinctly in a table as in section 12.1.
2. Add consistency checking to the use case diagram in figure 12.1 for the
medical chart ontology.
3. Choose an ontology language for the SNP ontology, and give a design
rationale for your choice.
4. There already exist ontologies that deal with SNPs. For example, the SNP
database (SNPdb) ontology in (Niu et al. 2003) is written in OWL and
gives detailed information about the methods for finding SNPs. Is it ap-
propriate to reuse SNPdb by importing it?
The formal ontologies and languages developed in the first two parts of the
book are based on deductive reasoning and deterministic algorithms. There
is no room for uncertainty. Reality, unfortunately, is quite different, and any
endeavor that attempts to model reality must deal with this fact. This part
of the book compares and contrasts deductive and inductive reasoning, and
then proposes how they can be reconciled.
The first chapter compares deductive reasoning with inductive reasoning,
taking a high level point of view. There are several approaches to reasoning
about uncertainty, and these are surveyed. The most successful approach to
uncertainty is known as Bayesian analysis, and the rest of the book takes this
point of view. Bayesian networks are a popular and effective mechanism
for expressing complex joint probability distributions and for performing
probabilistic inference. The second chapter covers Bayesian networks and
stochastic inference.
Combining information from different sources is an important activity in
many professions, and it is especially important in the life sciences. One can
give a precise mathematical formulation of the process whereby probabilistic
information is combined. This process is known as “meta-analysis,” and it is
a large subject in its own right. The third chapter gives a brief introduction
to this subject.
The book ends by proposing a means by which inductive reasoning can
be supported by the World Wide Web. Because Bayesian networks express
reasoning with uncertainty, we refer to the inductive layer of the web as the
Bayesian Web. Although this proposal is speculative, it is realistic. It has the
advantage of allowing uncertainty to be formally represented in a web-based
form. It offers the prospect of assisting scientists in some important tasks
such as propagating uncertainty through a chain of reasoning, performing
stochastic inference based on observations, and combining information from
different sources.
13 Inductive vs. Deductive
Reasoning
Deductive reasoning, also known more briefly as “logic,” is the process by
which facts can be deduced from other facts in a completely unambiguous
manner using axioms and rules. Modern digital computer programs are
fundamentally logical. They function in a manner that is unambiguously
deterministic.
Reality is unlike a computer in many respects. It is much larger and far
more complex than any computer program could ever be. Furthermore, most
of what takes place is governed by rules that are either unknown or only
imperfectly known. The lack of full knowledge about reality manifests itself
as ambiguity and nondeterminism. There is no reason to suppose that reality
is actually ambiguous or nondeterministic. Despite this situation, people
manage to function effectively in the world.
There are two important mechanisms that people use to function in the
world. The first is the ability to restrict attention to a small part of all of
reality. The second is to accept that information is uncertain. These two
mechanisms are related to one another. In theory, if one were capable of
omniscience, then reality would be as unambiguous and deterministic as a
computer program. However, since people have no such capacity, we are
forced to suppress nearly all of what occurs in the world. The
suppressed details manifest themselves in the form of uncertainties, ambi-
guities, and nondeterminism in the details that we do choose to observe.
The former mechanism is called by various names such as “abstraction” and
“relevance.” Ontologies are fundamental to specifying what is relevant. The
latter mechanism is fundamental to scientific reasoning. The combination of
these two mechanisms is the subject of this part of the book.
When scientific reasoning is relatively simple, it is easy to ignore the role
of the ontology, leaving it informal or implicit. However, medicine and bi-
ology are becoming too complex for informal descriptions of the context in
which reasoning is taking place. This is especially important for combining
inferences made in different contexts.
perception of the world. To address these issues, one must not only ac-
cept that our knowledge about the world is uncertain, it is also necessary
to quantify and formalize the notion of uncertainty so that statements about
the world will have more than just a vague, informal meaning.
Many mathematical theories have been introduced to give a formal seman-
tics to uncertainty. One can classify these approaches into two main classes
(Perez and Jirousek 1985):
Summary
• There are many sources of uncertainty, such as measurements, unmodeled
variables, and subjectivity.
One theory of uncertainty that has achieved some degree of popularity is the
theory of fuzzy logic (Zadeh 1965, 1981). Fuzzy logic is an extensional
approach to uncertainty. In fuzzy logic one associates a number between 0
and 1 with each statement. This number is called a truth-value or possibility to
distinguish it from probabilities used in probability theory. Truth-values are
either given a priori as ground facts, or they are computed. Statements may
be combined with other statements using the operations AND, OR, and NOT,
just as in classic Boolean logic. Fuzzy logic is a generalization of Boolean
logic in that if all statements are either fully true or fully false (i.e., their
truth-values are either 1 or 0), then combining the statements using Boolean
operations will always produce the same result as in Boolean logic. State-
ments that are entirely true or false are called crisp statements, and Boolean
logic is called crisp logic. The truth-value of general statements combined us-
ing the Boolean operations is determined by a function called the t-norm. The
t-norm is a function from a pair of truth-values to a single truth-value. It is
the function that computes the truth-value of the AND of two fuzzy state-
ments. The most commonly used t-norm is the minimum, also called the
Gödel t-norm.
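As a small illustration (a minimal sketch in Python, not part of the original text), the following shows fuzzy connectives based on the minimum (Gödel) t-norm. The use of max for OR and 1 - x for NOT is the usual companion choice and is an assumption here, since the text only specifies the t-norm.

# A minimal sketch of fuzzy connectives based on the Goedel (minimum) t-norm.
# The max t-conorm and the 1 - x negation are assumptions here, not
# prescribed by the text.

def fuzzy_and(a, b):
    return min(a, b)          # the Goedel t-norm

def fuzzy_or(a, b):
    return max(a, b)          # the matching t-conorm

def fuzzy_not(a):
    return 1.0 - a

# With crisp truth-values (0 or 1) these reduce to ordinary Boolean logic.
print(fuzzy_and(1.0, 0.0))    # 0.0
print(fuzzy_and(0.7, 0.4))    # 0.4
print(fuzzy_not(0.7))         # approximately 0.3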
Because fuzzy logic depends on the choice of a t-norm, there are many
different kinds of fuzzy logic. Truth-values computed using one t-norm are
not compatible with truth-values computed using a different t-norm. One
can define rules for fuzzy logic, and these rules can be fuzzy in the sense that
each rule is assigned a strength between 0 and 1. The strength specifies the
degree of confidence in the rule.
In rule-based systems, one begins with a collection of known facts and
rules. The rule engine then infers new facts using the rules. It can do this ei-
ther in a forward-chaining manner where all facts are inferred or a backward-
chaining manner in which one infers only the facts needed to answer a par-
ticular query (see chapter 3 for how this works). Fuzzy logic, as in other ex-
tensional systems, is similar except that it is the truth-values that propagate,
not the facts. As with rule-based systems, one can use either forward-chaining
or backward-chaining.
Note that the term “fuzzy” is often used for any notion of uncertainty, not
just for the specific class of theories due to Zadeh. For example, there is a
notion of “fuzzy Bayesian network,” which is unrelated to fuzzy logic.
There are many other extensional approaches to uncertainty. MYCIN (Short-
liffe 1976) is an expert system that was developed for the purpose of medical
diagnosis. Like fuzzy logic, MYCIN propagates uncertainty using rules, each
of which has a strength (called the “credibility” of the rule). It differs from
fuzzy logic primarily in the formulas used for propagating certainty levels.
Summary
• Fuzzy logic associates a generalized truth-value between 0 and 1 to each
statement.
• There are many fuzzy logics, one for each choice of a t-norm.
The probabilities of these events define the joint probability distribution (JPD)
of the random variables L and T . A stochastic model is another name for a
collection of random variables. The probabilistic structure of the stochastic
model is the JPD of the collection of random variables. One could give a
strong argument that the stochastic model is the fundamental construct, and
that the probability space is secondary. However, it is convenient to treat the
probability space as fundamental and the random variables as derived from
it (as measurable functions on the probability space).
Given two events A and B, a conditional probability of A given B is any
number c between 0 and 1 (inclusive) such that Pr(A and B) = c Pr(B).
Pr(A | B) = Pr(A and B) / Pr(B).
The left-hand side of this equation is the same as the left-hand side of the
defining formula for the conditional probability of A given B. Therefore
know how often the symptom occurs when a person has the disease, then
one knows Pr(B | A). Bayes’ law then gives the probability that a person
has the disease when the symptom is observed. In other words, Bayes’ law
gives important information which can be used for the diagnosis of diseases
based on symptoms. Specific examples of the use of Bayes’ law for diagnosis
are given in section 14.2.
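To make this concrete, here is a minimal sketch in Python (not part of the original text; the prevalence and symptom rates below are invented purely for illustration) of Bayes’ law applied to diagnosis.

# Hypothetical illustration of Bayes' law for diagnosis:
# Pr(disease | symptom) = Pr(symptom | disease) Pr(disease) / Pr(symptom)

p_disease = 0.0001                # assumed prevalence of the disease
p_symptom_given_disease = 0.90    # assumed symptom rate among the diseased
p_symptom_given_healthy = 0.01    # assumed symptom rate among the healthy

# Total probability of observing the symptom
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# Bayes' law
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(p_disease_given_symptom)    # about 0.009: the symptom alone is weak evidence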
As we discussed in section 13.1, the applicability and interpretation of
probability theory have been the subject of both philosophical and mathe-
matical analysis since the theory was first formulated. More recently,
other ways of expressing uncertainty have emerged, as we discussed in sec-
tion 13.2. The question is which of these approaches to uncertainty is the
best.
The standard methodology for comparing such approaches to uncertainty
was introduced by De Finetti (De Finetti 1937), who formulated subjective
probability in terms of betting against an adversary. This formulation is
called the Dutch book argument. This made it possible to extend the appli-
cability of probability to questions such as “Will the sun rise tomorrow?” or
“Was there life on Mars?” It also can be used to prove the power of proba-
bility theory in general, and Bayesian analysis in particular. The argument
is that if one knows that an agent consistently follows a non-Bayesian belief
system in a known way, then one can arrange the bets so that the Bayesian
always wins (not just on average). If the departure from Bayesian analysis is
inconsistent, then the Bayesian can only win on average.
Although stated in financial terms, the Dutch book argument applies equal-
ly well to any activity which involves some form of utility, whether it is
financial or not, and the associated risk in trying to increase this utility. It fol-
lows that Bayesian analysis is a minimal requirement for rational inference
in experimental science.
There are significant advantages to probability theory as a mechanism for
expressing uncertainty. It is the only approach that is empirically grounded,
and it can be used either empirically or subjectively. Furthermore, Bayesian
analysis will always win over a non-Bayesian analysis whenever one quan-
tifies the risks associated with decisions based on the events in question.
However, probability theory has significant disadvantages. It is much
more computationally complex than the extensional approaches. Specify-
ing a general JPD is a formidable task as the number of random variables
increases. Even for random variables that can take only two values, if there
are 20 random variables, then a joint probability distribution has over 10^6 (namely 2^20)
probabilities. Accordingly, it is very common to assume that the random
Summary
• Probability theory is the dominant intensional approach to uncertainty.
• Bayes’ Law is the basis for diagnostic inference and subjective probabili-
ties.
• The Dutch book argument shows that Bayesian analysis is always better
than non-Bayesian analysis.
Stochastic modeling has a long history, and it is the basis for the empiri-
cal methodology that has been used with great success by modern scien-
tific disciplines. Stochastic models have traditionally been expressed using
mathematical notation that was developed long before computers and GUIs
became commonly available. A Bayesian network (BN) is a graphical mecha-
nism for specifying the joint probability distribution (JPD) of a set of random
variables (Pearl 1998). As such, BNs are a fundamental probabilistic repre-
sentation mechanism for stochastic models. The use of graphs provides an
intuitive and visually appealing interface whereby humans can express com-
plex stochastic models. This graphical structure has other consequences. It is
the basis for an interchange format for stochastic models, and it can be used
in the design of efficient algorithms for data mining, learning, and inference.
The range of potential applicability of BNs is large, and their popularity
has been growing rapidly. BNs have been especially popular in biomedical
applications where they have been used for diagnosing diseases (Jaakkola
and Jordan 1999) and studying complex cellular networks (Friedman 2004),
among many other applications.
This chapter divides the subject of BNs into three sections. The sections
answer three questions: What BNs are, How BNs are used, and How BNs
are constructed. The chapter begins with the definition of the notion of a BN
(section 14.1). BNs are primarily used for stochastic inference, as discussed
in section 14.2. BNs are named after Bayes because of the fundamental im-
portance of Bayes’ law for stochastic inference. Because BNs require one to
specify probability distributions as part of the structure, statistical methods
will be needed as part of the task of constructing a BN. Section 14.3 gives an
overview of the statistical techniques needed for constructing and evaluating
BNs.
3. Perceives Fever (PF), meaning that the patient perceives that he or she has
a fever.
not(PF) PF
not(Flu) and not(Cold) 0.99 0.01
not(Flu) and (Cold) 0.90 0.10
(Flu) and not(Cold) 0.10 0.90
(Flu) and (Cold) 0.05 0.95
The CPD for the Temperature (T) node has two incoming edges, so its CPD
will have four entries as in the case above, but because T is continuous, it
must be specified using some technique other than a table. For example, one
could model it as a normal distribution for each of the four cases as follows:
Mean Std Dev
not(Flu) and not(Cold) 37.0 0.5
not(Flu) and (Cold) 37.5 1.0
(Flu) and not(Cold) 39.0 1.5
(Flu) and (Cold) 39.2 1.6
As an example of one term of the JPD, consider the probability of the event
(Flu) and not(Cold) and (PF) and (T ≤ 39.0). This will be the product of
the four probabilities: Pr(Flu), Pr(not(Cold)) = 1 - Pr(Cold), Pr(PF | Flu and
not(Cold)), and Pr(T ≤ 39.0 | Flu and not(Cold)). Multiplying these gives
(0.0001)(0.99)(0.90)(0.5) = 0.00004455.
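The same term can be computed directly. The following Python sketch (not part of the original text) uses the prior probabilities Pr(Flu) = 0.0001 and Pr(Cold) = 0.01 that appear in this example, together with the CPD entries quoted above.

import math

def normal_cdf(x, mean, std):
    # CDF of a normal distribution, expressed via the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

p_flu, p_cold = 0.0001, 0.01            # prior probabilities used in the example
p_pf = 0.90                             # Pr(PF | Flu and not(Cold))
p_t = normal_cdf(39.0, 39.0, 1.5)       # Pr(T <= 39.0 | Flu and not(Cold)) = 0.5

term = p_flu * (1 - p_cold) * p_pf * p_t
print(term)                             # about 4.455e-05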
Although the BN example above has no directed cycles, it does have undi-
rected cycles. It is much harder to process BNs that have undirected cycles
than those that do not. Some BN tools do not allow undirected cycles be-
cause of this.
Many of the classic stochastic models are special cases of this general graph-
ical model formalism. Although this formalism goes by the name of Bayesian
network, it is a general framework for specifying JPDs, and it need not in-
volve any applications of Bayes’ law. Bayes’ law becomes important only
when one performs inference in a BN, as discussed below. Examples of the
classic models subsumed by BNs include mixture models, factor analysis,
hidden Markov models (HMMs), Kalman filters, and Ising models, to name
a few.
BNs have a number of other names. One of these, belief networks, hap-
pens to have the same initialism. BNs are also called probabilistic networks,
directed graphical models, causal networks, and “generative” models. The
last two of these names arise from the fact that the edges can be interpreted
as specifying how causes generate effects. One of the motivations for intro-
ducing BNs was to give a solid mathematical foundation for the notion of
Summary
• A BN is a graphical mechanism for specifying JPDs.
Event A Pr(A)
PF and not(Flu) and not(Cold) (0.9999)(0.99)(0.01) = 0.0099
PF and not(Flu) and (Cold) (0.9999)(0.01)(0.10) = 0.0010
(PF and Flu) and not(Cold) (0.0001)(0.99)(0.90) = 0.0001
(PF and Flu) and (Cold) (0.0001)(0.01)(0.95) = 0.0000
not(PF) and not(Flu) and not(Cold) (0.9999)(0.99)(0.99) = 0.9800
not(PF) and not(Flu) and (Cold) (0.9999)(0.01)(0.90) = 0.0090
(not(PF) and Flu) and not(Cold) (0.0001)(0.99)(0.10) = 0.0000
(not(PF) and Flu) and (Cold) (0.0001)(0.01)(0.05) = 0.0000
Figure 14.2 Example of diagnostic inference using a BN. The evidence for diagnosis
is the perception of a fever by the patient. The question to be answered is whether
the patient has influenza.
that instead of selecting the terms of the JPD that satisfy the evidence, one
multiplies the terms by the probability that the evidential event has occurred.
In effect, one is weighting the terms by the evidence. The probabilistic basis
for this process is given in chapter 15. We leave it as an exercise to compute
the probability of the flu as well as the probability of a cold given only that
there is a 30% chance of the patient complaining of a fever.
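For the simpler case of hard evidence, in which the patient definitely reports a fever, the marginalization can be sketched in Python directly from the terms tabulated in figure 14.2. This sketch is not part of the original text, and the 30% soft-evidence variant is left to the exercise above.

# JPD terms from figure 14.2, keyed by (PF, Flu, Cold)
jpd = {
    (True,  False, False): 0.9999 * 0.99 * 0.01,
    (True,  False, True):  0.9999 * 0.01 * 0.10,
    (True,  True,  False): 0.0001 * 0.99 * 0.90,
    (True,  True,  True):  0.0001 * 0.01 * 0.95,
    (False, False, False): 0.9999 * 0.99 * 0.99,
    (False, False, True):  0.9999 * 0.01 * 0.90,
    (False, True,  False): 0.0001 * 0.99 * 0.10,
    (False, True,  True):  0.0001 * 0.01 * 0.05,
}

# Keep only the terms consistent with the evidence PF = True, then normalize.
consistent = {k: v for k, v in jpd.items() if k[0]}
p_flu_given_pf = (sum(v for k, v in consistent.items() if k[1])
                  / sum(consistent.values()))
print(p_flu_given_pf)   # roughly 0.008: a reported fever alone is weak evidence of flu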
BN inference is substantially more complex when the evidence involves
a continuous random variable. We will consider this problem later. Not
surprisingly, many BN tools are limited to discrete random variables because
of this added complexity.
In principle, there is nothing special about any particular node in the pro-
cess of BN inference. Once one has the JPD, one can assert evidence on any
nodes and compute the marginal distribution of any other nodes. However,
BN algorithms can take advantage of the structure of the BN to compute the
answer more efficiently in many cases. As a result, the pattern of inference
does affect performance. The various types of inference are shown in fig-
ure 14.3. Generally speaking, it is easier to infer in the direction of the edges
of the BN than against them. Inferring in the direction of the edges is called
causal inference. Inferring against the direction of the edges is called diagnostic
inference. Other forms of inference are called mixed inference.
So far we have considered only discrete nodes. Continuous nodes add
some additional complexity to the process. There are several ways to deal
with such nodes:
Figure 14.3 Various types of inference. Although information about any of the
nodes (random variables) can be used as evidence, and any nodes can be queried,
the pattern of inference determines how easy it is to compute the inferred probability
distribution.
The techniques above are concerned with the specification of PDs. A CPD
is a function from the possible values of the parent nodes to PDs on the node.
If there are only a few possible values of the parent nodes (as in the diagnostic
example in figure 14.1), then explicitly listing all of the PDs is feasible. Many
BN tools have no other mechanism for specifying CPDs. When the number
of possible values of the parent nodes is large or even infinite, then the CPD
may be much better specified using a function. In the infinite case, one has
no choice but to use this technique. Curve-fitting techniques such as least-
squares analysis can be used to choose the function based on the available
data.
A BN with both discrete and continuous nodes is called a hybrid BN. The
diagnostic BN example above is a hybrid BN. When continuous nodes are
dependent on discrete nodes, inference will produce a compound (mixed)
Gaussian distribution. Such a distribution is the result of a compound pro-
cess in which one of a finite set of Gaussians is selected according to a PD, and
then a value is chosen based on the particular Gaussian that was selected.
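A minimal Python sketch of sampling from such a compound (mixed) Gaussian follows; the component weights, means, and standard deviations are hypothetical, chosen only for illustration.

import random

def sample_mixture(weights, means, stds):
    # Pick a component according to the discrete distribution, then sample
    # from the corresponding Gaussian.
    i = random.choices(range(len(weights)), weights=weights, k=1)[0]
    return random.gauss(means[i], stds[i])

# Hypothetical two-component mixture
print([sample_mixture([0.3, 0.7], [37.0, 39.0], [0.5, 1.5]) for _ in range(5)])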
If a discrete node is dependent on continuous nodes, then the discrete node
can be regarded as defining a classifier since it takes continuous inputs and
produces a discrete output which classifies the inputs. The CPDs for this situ-
ation are usually chosen to be logistic/softmax distributions. Connectionist
networks (also called neural networks) are an example of this.
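As a sketch of such a CPD (not part of the original text), the following computes a softmax of linear scores of the continuous parent values; the weights, biases, and inputs are hypothetical.

import math

def softmax_cpd(x, weights, biases):
    # Probability of each discrete value given continuous parent values x,
    # computed as a softmax of linear scores.
    scores = [sum(w * xi for w, xi in zip(ws, x)) + b
              for ws, b in zip(weights, biases)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two continuous parents, three possible discrete values
print(softmax_cpd([38.5, 1.0],
                  weights=[[0.0, 0.0], [1.0, -2.0], [2.0, 0.5]],
                  biases=[0.0, -37.0, -76.0]))   # the third value is most probable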
BNs are not the only graphical representation for stochastic models. Undi-
rected graphical models, also called Markov random fields (MRFs) or Mar-
kov networks, are also used, especially in the physics and vision communi-
ties.
One application of BNs is to assist in decision making. To make a decision
based on evidence one must quantify the risk associated with the various
choices. This is done by using a utility function. It is possible to model
some utility functions by adding value nodes (also called utility nodes) to a
BN and linking them with dependency edges to ordinary BN nodes and to
other utility nodes. The result of a decision is an action that is performed,
and these can also be represented graphically by adding decision nodes and
edges to a BN augmented with utility nodes. A BN augmented with utility
and action nodes is called an influence diagram (also called a relevance diagram)
(Howard and Matheson 1981). An influence diagram can, in principle, be
used to determine the optimal actions to perform so as to maximize expected
utility.
Summary
• The main use of BNs is for stochastic inference.
• Evidence can be given for any nodes, and any nodes can be queried.
• BNs can be augmented with other kinds of nodes, and used for making
decisions based on stochastic inference.
4. Evaluate.
2. Machine learning. The PDs and CPDs are most commonly found by us-
ing statistical methods. There are a large number of such techniques.
4. Ontologies. Ontologies can be used as the basis for the graph structure of
the BN.
14.3.1 BN Requirements
Before embarking on any development project, it is important to have an
understanding of its purpose. We saw this already in section 12.1 for the
development of ontologies. The purpose of the BN should include the fol-
lowing:
1. Why the BN is being developed. One of the most common reasons for
building a BN is to support diagnostic inference. However, BNs can also
be used for combining information from different sources at different times.
Yet another reason why one might build a BN is to analyze a domain,
making independence assumptions more explicit. This allows these as-
sumptions to be tested.
2. What will be covered by the BN. This is also called its scope. A clear def-
inition of the scope will prevent the development effort from expanding
unnecessarily.
3. Who will be using the BN. As with ontology development, this will affect
the amount of effort that should be devoted to the design of the BN.
Summary
Figure 14.4 Bayesian network for the result of a research study of body mass index
(BMI) as a function of age and sex.
The estimation techniques discussed above assume that data about all of
the relevant nodes were available. This is not always the case. When one
or more nodes are not directly measurable, one can either remove them from
the BN or attempt to estimate them indirectly. The latter can be done by using
BN inference iteratively. One treats the unobservable nodes as query nodes
and the observable nodes as evidence nodes in a BN inference process. One
then computes the expectations of the unobservable nodes and uses these
values as if they were actually observed. One can then use ML or MAP as
above. This whole process is then repeated until it converges. This technique
is known as expectation maximization (EM).
It is possible to use machine learning techniques to learn the structure of
the BN graph as well as to learn the CPDs. These tend to have very high
computational complexity, so they can only be used for small BNs. In prac-
tice, it is much better to start with a carefully designed BN and then modify
it in response to an evaluation of the quality of its results.
Connectionist networks are a class of BNs that are designed for efficient
machine learning. Such BNs are most commonly known as “neural net-
works” because they have a superficial resemblance to the networks of neu-
rons found in vertebrates, even though neurons have very different behavior
than the nodes in connectionist networks. Many kinds of connectionist net-
work support incremental machine learning. In other words, they continu-
ally learn as new training data are made available.
Connectionist networks constitute a large research area, and there are many
software tools available that support them. There is an extensive frequently
asked questions list (FAQ) for neural networks, including lists of both com-
mercial and free software (Sarle 2002). Although connectionist networks are
a special kind of BN, the specification of a connectionist network is very dif-
ferent from the specification of a BN. Consequently, techniques for machine
learning of connectionist networks may not apply directly to BNs or vice
versa. However, BNs are being used for connectionist networks (MacKay
2004) and some connectionist network structures are being incorporated into
BNs, as in (Murphy 1998).
Summary
1. Complex objects can be assigned to classes which can share CPDs. Reusing
CPDs greatly simplifies the task of constructing a BN.
2. Classes can inherit from other classes which allows for still more possibil-
ities for reuse.
Summary
an entity belongs to the class. Edges are introduced when two classes are
related. The most common relationship is the subclass relationship which
means that one class is contained in another. Obviously this will result in a
stochastic dependency. Other kinds of relationship can be expressed in terms
of classes. For example, the age of a person (in years) gives rise to a collection
of disjoint subclasses of the person class, one for each possible value of the
age of a person.
Although this technique does seem to be a natural way to “add probabili-
ties” to ontologies, it does not seem to produce BNs that are especially useful.
The most peculiar feature of these BNs is that all of the classes are ultimately
subclasses of a single universal class (called the Thing class), and the random
variable for a class represents the probability that a randomly chosen thing
is a member of the class. While this might make sense for some class hier-
archies, the hierarchies of ontologies often contain a wide variety of types
of entity. For example, a biomedical ontology would contain classes for re-
search papers, journals, lists of authors, drugs, addresses of institutions, and
so on. It is hard to see what kind of experiment would sometimes produce a
drug, other times produce a list of authors, and still other times produce an
address.
On the other hand, this technique can be the starting point for BN devel-
opment, especially for diagnostic BNs. An example of this is discussed in
subsection 14.3.6, where the ontology is used as the background for the de-
velopment of a BN. The disadvantage of developing BNs by using ontologies
in this way is that whatever formal connection exists between the ontology
and the BN is quickly lost as the BN is modified. As a result, one cannot
use any logical consequences entailed by the ontology during BN inference.
Indeed, the ontology ultimately furnishes no more than informal documen-
tation for the BN.
Summary
• Such BNs are seldom useful in their original form, but can be used as the
starting point for developing realistic BNs.
Other authors have mentioned patterns that may be regarded as being de-
sign patterns, but in a much more informal manner. For example in (Murphy
1998) quite a variety of patterns are shown such as the BNs reproduced in
figure 14.6. In each of the patterns, the rectangles represent discrete nodes
and the ovals represent Gaussian nodes. The shaded nodes are visible (ob-
servable) while the unshaded nodes are hidden. Inference typically involves
specifying some (or all) of the visible nodes and querying some (or all) of the
hidden nodes.
A number of design idioms for BNs were introduced in (Neil et al. 2000).
The definitional/synthesis idiom models the synthesis or combination of
many nodes into one node. It also models deterministic definitions. The
cause-consequence idiom models an uncertain causal process whose conse-
quences are observable. The measurement idiom models the uncertainty of
a measuring instrument. The induction idiom models inductive reasoning
based on populations of similar or exchangeable members. Lastly, the reconciliation
idiom models the reconciliation of results from competing measurement or
prediction systems.
Figure 14.6 Various informal patterns for BNs. These examples are taken from
(Murphy 1998).
Summary
• Many BN design patterns have been identified, but most are only infor-
mally specified.
1. Test cases. When there are special cases whose answer is known, one can
perform the inference and check that the result is close to the expected
answer.
section 14.3.4. The authors have studied the use of their methodology within
the domain of esophageal cancer.
The Helsper-van der Gaag methodology uses ontologies more as a back-
ground for the design process than as a formal specification for the BN struc-
ture. This is in contrast with the OOBN technique in subsection 14.3.3 in
which the design not only specifies the BN completely but also affects the
inference algorithm. In the Helsper methodology the ontology is used to pro-
vide an initial design for the BN in a manner similar to the way that this is
done in subsection 14.3.4. This step in the methodology is called translating.
However, this initial design is modified in a series of steps based on domain
knowledge. Some of the modifications use the ontology, but most of them
must be elicited from domain experts. The ontology “serves to document
the elicited domain knowledge.”
What makes the Helsper-van der Gaag methodology interesting are the
systematic modification techniques that are employed. The methodology
refers to this phase as improving and optimizing. The modifications must fol-
low a set of guidelines, but these guidelines are only explained by examples
in the articles.
One example of a modification operation used by Helsper and van der
Gaag is shown in figure 14.7. In this operation, a node that depends on two
(or more) other nodes is eliminated. This would be done if the node being
eliminated is not observable or if it is difficult to observe the node. There
are techniques for determining the CPDs for unobservable nodes such as
the EM algorithm discussed in subsection 14.3.2. However, this algorithm is
time-consuming. Furthermore, there is virtually no limit to what one could
potentially model, as discussed in section 13.3. One must make choices about
what variables are relevant, even when they could be observed, in order to
make the model tractable.
Figure 14.7 Modifying a BN by eliminating a node that other nodes depend on. The
result is that the parent nodes become dependent on each other.
When a node is dependent on other nodes, the other nodes (which may
otherwise be independent) become implicitly dependent on each other via
the dependent node. In statistics this is known as Berkson’s paradox, or “se-
lection bias.” The result of dropping a node is to make the parent nodes
explicitly dependent on each other. This dependency can be specified in ei-
ther direction, whichever is convenient and maintains the acyclicity of the
BN.
The modification operation shown in figure 14.7 changes the JPD of the
BN because one of the variables is being deleted. Furthermore, the new JPD
need not be the same as the distribution obtained by computing the marginal
distribution to remove the deleted variable, although it is approximately the
same.
It is a general fact that the direction of a directed edge in a BN is proba-
bilistically arbitrary. If one knows the JPD of two random variables, then one
can choose either one as the parent node and then compute the CPD for the
child node by conditioning. In practice, of course, the specification works
the other way: the JPD is determined by specifying the CPD. For a particu-
lar modeling problem, the direction of the edge will usually be quite clear,
especially when one is using a design pattern.
However, sometimes the direction of the dependency is ambiguous, and
one of the modification operations is to reverse the direction. In this case
the JPD is not changed by the operation. This situation occurs, for example,
when two variables are Boolean, and one of them subsumes the other. In
other words, if one of the variables is true, then the other one is necessarily
true also (but not vice versa). Suppose that X and Y are two Boolean random
variables such that X implies Y. Then we know that Pr(Y = true | X = true) = 1.
This gives one half of the CPD of one of the variables with respect to the
other, and the dependency can go either way. This is shown in figure 14.8.
Summary
• It is important to test and validate BNs to ensure that they satisfy the
requirements.
3. uncertainty analysis,
4. consistency checking.
14.4 Exercises
1. In the diagnostic BN in figure 14.1, one can use either a temperature mea-
surement or a patient’s perception of a fever to diagnose influenza. Al-
though these two measurements are a priori independent, they become
dependent when one observes that the patient has the flu or a cold. In
statistics this is known as Berkson’s paradox, or “selection bias.” It has
the effect that a high temperature can reduce the likelihood that a patient
reports being feverish and vice versa. Compute the JPD of the PF and T
nodes in this BN given the observation that the patient has influenza.
temperature, obtaining 30.2◦ ±0.3◦ C. One now has two independent normal
distributions. Combining the two measurements is the same as combining
the two distributions.
In this chapter the process of meta-analysis is formally defined and proven.
Combining discrete distributions is covered first, followed by the continuous
case. Stochastic inference is a special case of meta-analysis. More gener-
ally, one can combine two Bayesian networks (BNs). Conversely, the meta-
analysis process can itself be expressed in terms of BNs. This is shown in
section 15.3. The temperature measurement example above is an example
of the combination of observations that are continuous distributions. PDs
are not only a means of expressing the uncertainty of an observation, they
can themselves be observations. In other words, PDs can have PDs. A large
number of statistical tests are based on this idea, which is discussed in sec-
tion 15.4. The last section introduces an interesting variation on information
combination, called Dempster-Shafer theory.
We first consider the case of combining two discrete PDs. That means we
have two independent random variables X and Y, whose values are discrete
rather than continuous. For example, a patient might seek multiple inde-
pendent opinions from practitioners, each of which gives the patient their
estimates of the probabilities of the possible diagnoses. Combining these two
discrete random variables into a single random variable is done as follows:
Pr(Z = v) = Pr(X = v | X = Y) = Pr(X = v and X = Y) / Pr(X = Y).
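A minimal Python sketch of this combination rule follows (not part of the original text); the diagnosis names and probabilities are invented for illustration.

def combine_discrete(p, q):
    # Combine two independent discrete distributions over the same outcomes:
    # multiply pointwise and renormalize, as in the formula above.
    unnormalized = {v: p[v] * q[v] for v in p}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("inconsistent observations: no outcome has nonzero "
                         "probability in both distributions")
    return {v: w / total for v, w in unnormalized.items()}

# Two hypothetical diagnostic opinions over the same three diagnoses
doctor1 = {"tumor": 0.1, "concussion": 0.8, "migraine": 0.1}
doctor2 = {"tumor": 0.2, "concussion": 0.6, "migraine": 0.2}
print(combine_discrete(doctor1, doctor2))   # concussion dominates after combining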
criteria for making diagnoses. As a result, the diagnoses would not usually
be independent.
For a more extreme example, suppose that the first doctor concludes that
the probabilities are 0.9, 0.0, and 0.1; and the second doctor gives the prob-
abilities as 0.0, 0.9, and 0.1. The combined distribution will conclude that
the tumor has probability 1.0, while the other two diagnoses are impossible
(Zadeh 1984). This seems to be wrong. However, it makes perfectly good
sense. In the words of Sherlock Holmes in “The Blanched Soldier”, “When
you have eliminated all which is impossible, then whatever remains, how-
ever improbable, must be the truth.” Each of the doctors has concluded that
one of the diagnoses is impossible, so the third possibility must be the truth.
On the other hand, one can question whether such totally different diagnoses
would happen independently. In other words, it is unlikely that the doctors
are independently observing the same phenomenon. Such observations are
said to be incompatible.
However, there are other circumstances for which observations that are
incompatible are actually independent and therefore fusable. For example,
if one distribution represents the probability of occurrence of a rare disease,
and another distribution represents the observation that a particular patient
definitely has the disease, then the combination of the two distributions is
simple: the patient has the disease.
An even more extreme example would be two observations in which all of
the possibilities have been declared to be impossible in one or the other ob-
servation. The discrete information combination theorem gives no combined
distribution in this case because the hypotheses are not satisfied. One says
that such observations are inconsistent.
Summary
• The discrete information combination theorem gives the formula for fus-
ing independent discrete random variables that measure the same phe-
nomenon.
• Incompatible PDs can be combined but care must be taken to interpret the
combined distribution properly.
One can also combine continuous random variables. The only difference
is that one must be careful to ensure that the combined distribution can be
rescaled to be a PD.
h(x) = f(x) g(x) / ∫ f(y) g(y) dy.
Proof The proof proceeds as in the discrete case except that one must check
that ∫ f(x)g(x) dx converges. Now f(x) was assumed to be bounded. Let
B be an upper bound of this function. Then f(x)g(x) ≤ B g(x) for every x.
Since ∫ g(x) dx converges, it follows that ∫ f(y)g(y) dy also converges. The
result then follows as in the discrete case.
n and variances v, w, respectively, then the combined random variable has mean

(wm + vn) / (v + w) = (m/v + n/w) / (1/v + 1/w)

and variance

vw / (v + w) = 1 / (1/v + 1/w).
This result is easily extended to the combination of any number of in-
dependent normal distributions. The means are combined by means of a
weighted average, using weights that are proportional to the inverse vari-
ances.
We can now combine the two temperature measurements 30.5◦ ±0.4◦ C
and 30.2◦ ±0.3◦ C mentioned earlier. The variances are 0.16 and 0.09, so the
combined mean is 30.3◦ ±0.24◦ C. The combined mean is closer to 30.2◦ C
than to 30.5◦ C because the former measurement is more accurate.
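A minimal Python sketch of this inverse-variance weighting (not part of the original text), checked against the two temperature measurements:

def combine_normals(measurements):
    # measurements is a list of (mean, standard deviation) pairs for
    # independent normal observations of the same quantity.
    weights = [1.0 / (std * std) for _, std in measurements]
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, measurements)) / total
    return mean, (1.0 / total) ** 0.5   # combined mean and standard deviation

print(combine_normals([(30.5, 0.4), (30.2, 0.3)]))   # approximately (30.31, 0.24)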
The formula for combining normal distributions applies equally well to
multivariate normal distributions. The only differences are that the mean is
a vector and the variance is a symmetric matrix (often called the covariance).
This formula is the basis for the Kalman filter (Maybeck 1979) in which a
sequence of estimates is successively updated by independent observations.
The Kalman filter update formula is usually derived by using an optimiza-
tion criterion such as least squares. However, nothing more than elementary
probability theory is necessary.
Information combination is commonly formulated in terms of a priori and
a posteriori distributions. The a priori or prior distribution is one of the
two distributions being combined, while the experiment or observation is
the other one. The a posteriori distribution is the combined distribution. Al-
though the formulation in terms of a priori and a posteriori distributions is
equivalent to information combination, it can be somewhat misleading, as
it suggests that the two distributions play different roles in the process. In
fact, information combination is symmetric: the two distributions being com-
bined play exactly the same role. One of the two distributions will generally
have more effect on the result, but this is due to it having more accuracy, not
because it is the prior distribution or the observation.
Another example of information combination is stochastic inference in a
BN, as presented in section 14.2. The evidence is combined with the BN, and
the distributions of the query nodes are obtained by computing the marginal
distributions of the combined JPD. Since the evidence usually specifies in-
formation about only some of the nodes, a full JPD is constructed by using
Summary
• The continuous information combination theorem gives the formula for
fusing independent continuous random variables that measure the same
phenomenon.
• The derivation of an a posteriori distribution from an a priori distribu-
tion and an observation is a special case of the information combination
theorems.
• Stochastic inference in a BN is another special case of the information com-
bination theorems.
Figure 15.2 The conditional probability distributions that define the BN for combin-
ing two independent observations of the same phenomenon. The prior probability
distribution on the node Z is the uniform distribution.
Summary
Figure 15.3 Examples of information combination processes. The process on the left
side combines two random variables that have a mutual dependency. The process on
the right side combines random variables that are not directly observable.
When two populations are compared, one can compare them in a large va-
riety of ways. Their means can be compared with a t-test, and their variances
can be compared with either a chi-square test (to determine whether the dif-
ference of the variances is small) or an F-test (to determine whether the ratio
of the variances is close to 1).
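As an illustration of such a comparison, here is a minimal sketch assuming the SciPy library is available; the sample size and population parameters mirror the experiment described next.

import random
from scipy import stats

random.seed(0)
a = [random.gauss(10, 4) for _ in range(100)]   # variance 16, so standard deviation 4
b = [random.gauss(10, 4) for _ in range(100)]

# Compare the means of the two samples with a t-test
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)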
It is easy to experiment with these concepts either by using real data or
by generating the data using a random number generator. In the follow-
ing, a random number generator was used to generate two independent
random samples of size 100 from a normal population with mean 10 and
variance 16. For such a large sample size, the t and chi-square distributions
are very close to being normal distributions. The estimates for the distribu-
tions (mean, variance) were (9.31, 13.81) and (10.55, 16.63). Now forget what
these are measuring, and just think of them as two independent measure-
ments. The uncertainty of each measurement is approximately normally dis-
tributed. The mean of the first measurement is the measurement itself, and
the variance matrix is the diagonal matrix diag(0.138, 3.86). The variance
matrix of the second measurement is diag(0.166, 5.59). The off-diagonal
terms are zero because the two measurements are independent, and hence
uncorrelated. Combining these two measurements can be done in two ways.
The first is to apply the continuous information combination theorem. The
combined distribution has mean (9.87, 14.97) and variance matrix
diag(0.075, 2.28). The second
way to combine the two measurements is to treat them as a single combined
Summary
• PDs can be measured.
\[
  Z(C) = \frac{M(C)}{\sum_{\text{nonempty } D} M(D)},
\]
The only complicated entry in the computation above is the value of M(con-
cussion). This probability is the sum of two products: P(concussion)Q(con-
cussion) and P(concussion-meningitis)Q(concussion). The rationale for in-
cluding both of these in the combined probability for concussion is that both
concussion and concussion-meningitis contribute to the evidence (or belief)
in concussion because they both contain concussion.
There is some question about the role played by the empty entity. It is
sometimes interpreted as representing the degree to which one is unsure
about the overall observation. However, Dempster’s rule of combination
explicitly excludes the empty entity from any combined distribution. As a
result, the only effect in D-S theory of a nonzero probability for the empty
entity is to allow distributions to be unnormalized. The information com-
bination theorems also apply to unnormalized distributions, as we noted in
the discussion after the information combination theorems.
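A minimal Python sketch of Dempster's rule as described here, using frozensets of diagnoses as the entities; the mass values and the function name are ours, chosen purely for illustration, and are not the book's example numbers.

from itertools import product

def dempster_combine(m1, m2):
    # Multiply the masses of every pair of entities, assign the product to
    # their intersection, then drop the empty entity and renormalize.
    raw = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = a & b
        raw[c] = raw.get(c, 0.0) + wa * wb
    total = sum(w for c, w in raw.items() if c)   # nonempty entities only
    return {c: w / total for c, w in raw.items() if c}

# Hypothetical mass assignments over sets of diagnoses.
P = {frozenset({"concussion"}): 0.4,
     frozenset({"concussion", "meningitis"}): 0.6}
Q = {frozenset({"concussion"}): 0.7,
     frozenset({"meningitis"}): 0.3}

print(dempster_combine(P, Q))

Note that the combined mass for concussion collects both the product of the two concussion masses and the product of the concussion-meningitis mass with the concussion mass, exactly as in the computation discussed above.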
Summary
• D-S theory introduces a probabilistic form of concept combination.
16.1 Introduction
The Semantic Web is an extension of the World Wide Web in which infor-
mation is given a well-defined meaning, so that computers and people may
more easily work in cooperation. This is done by introducing a formal logical
layer to the web in which one can perform rigorous logical inference. How-
ever, the Semantic Web does not include a mechanism for empirical, scientific
reasoning which is based on stochastic inference. Bayesian networks (BNs)
are a popular mechanism for modeling uncertainty and performing stochas-
tic inference in biomedical situations. They are a fundamental probabilistic
representation mechanism that subsumes a great variety of other probabilis-
tic modeling methods, such as hidden Markov models and stochastic dy-
namic systems. In this chapter we propose an extension to the Semantic Web
which we call the Bayesian Web (BW) that supports BNs and that integrates
stochastic inference with logical inference. Within the BW, one can perform
both logical inference and stochastic inference, as well as make statistical
decisions.
Although very large BNs are now being developed, each BN is constructed
in isolation. Interoperability of BNs is possible only if there is a framework
for one to identify common variables. The BW would make it possible to
perform operations such as:
2. The Web Ontology Language (OWL) layer expands on the RDF layer by
adding more constructs and richer formal semantics.
3. The Logic layer adds inference. At this layer one can have both resources
and links that have been inferred. However, the inference is limited by
the formal semantics specified by RDF and OWL.
4. The Proof layer adds rules. Rules can take many forms such as logical
rules as in the Logic layer, search rules for finding documents that match
a query, and domain-specific heuristic rules.
The proposed BW consists of a collection of ontologies that formalize the
notion of a BN together with stochastic inference rules. The BW resides pri-
marily on two of the Semantic Web layers: the Web Ontology layer and the
Proof layer. The BW ontologies are expressed in OWL on the Web Ontol-
ogy layer, and the algorithms for the stochastic operations are located on the
Proof layer. By splitting the BW into two layers, one ensures that BW in-
formation can be processed using generic Semantic Web tools which have
no understanding of probability or statistics. The result of processing at the
OWL layer is to obtain authenticated and syntactically consistent BNs. The
probabilistic and statistical semantics is specified on the Proof layer which
requires engines that understand probability and statistics.
<?xml version="1.0"?>
<!DOCTYPE ANALYSISNOTEBOOK SYSTEM "xbn.dtd">
<ANALYSISNOTEBOOK
NAME="Diagnostic Bayesian Network Example"
ROOT="InfluenzaDiagnosis">
<BNMODEL NAME="InfluenzaDiagnosis">
<STATICPROPERTIES>
<FORMAT VALUE="MSR DTAS XML"/>
<VERSION VALUE="1.0"/>
<CREATOR VALUE="Ken Baclawski"/>
</STATICPROPERTIES>
<VARIABLES>
<VAR NAME="Flu" TYPE="discrete">
<DESCRIPTION>Patient has influenza</DESCRIPTION>
<STATENAME>Absent</STATENAME>
<STATENAME>Present</STATENAME>
</VAR>
<DIST TYPE="discrete">
<CONDSET>
<CONDELEM NAME="Flu"/>
<CONDELEM NAME="Cold"/>
</CONDSET>
<PRIVATE NAME="PerceivesFever"/>
<DPIS>
<DPI INDEXES="0 0">0.99 0.01</DPI>
<DPI INDEXES="0 1">0.90 0.10</DPI>
<DPI INDEXES="1 0">0.10 0.90</DPI>
<DPI INDEXES="1 1">0.05 0.95</DPI>
</DPIS>
</DIST>
<DIST TYPE="gaussian">
<CONDSET>
<CONDELEM NAME="Flu"/>
<CONDELEM NAME="Cold"/>
</CONDSET>
<PRIVATE NAME="Temperature"/>
<DPIS>
<DPI INDEXES="0 0" MEAN="37" VARIANCE="0.25">
<DPI INDEXES="0 1" MEAN="37.5" VARIANCE="1.0">
<DPI INDEXES="1 0" MEAN="39" VARIANCE="2.25">
<DPI INDEXES="1 1" MEAN="39.2" VARIANCE="2.56">
</DPIS>
</DIST>
</DISTRIBUTIONS>
</BNMODEL>
</ANALYSISNOTEBOOK>
The CPDs are the most complex elements. In general, a CPD is a list of PDs.
The list is contained in a DPIS element. PDs are specified by DPI elements.
If a node has no incoming edges, then its CPD is a PD and there is only a
single DPI element. Nodes with incoming edges must specify several PDs.
The published DTD for XBN does not support continuous random variables,
so it was necessary to add two attributes to the DPI element: the MEAN and
VARIANCE.
The XBN format has a number of limitations as the basis for the BW. In its
current published form, it only supports random variables with a finite num-
ber of values. It does not support continuous random variables. It should be
possible to specify a wide variety of types of PD. Another significant lim-
itation is its lack of a mechanism for referring to external resources or for
external documents to refer to the BN. This makes it difficult to use this
mechanism to satisfy the requirement for common variables, and there is
only limited support for annotation.
These considerations suggest that a better choice of language for the BW
is OWL. We now present a series of three OWL ontologies that satisfy the
requirements for the BW. We present them in top-down fashion, starting with
high-level concepts and successively elaborating them:
While one would think that the notion of a random variable is unambigu-
ous, in fact it is a combination of two different concepts. First, there is the
phenomenon that is being observed or measured, such as one toss of a coin
or the measurement of a person’s blood pressure. The second concept is the
PD of the phenomenon. It is the combination of these two notions which is
the concept of a random variable. The relationship between the phenomenon
and its PD is many-to-many. Many phenomena have the same PD, and the
same phenomenon can be distributed in many ways. The reason why a phe-
nomenon does not uniquely determine its PD is due to the notion of con-
ditioning. As one observes related events, the distribution of a phenomenon
changes. The phenomenon is the same; what changes is the knowledge about
it.
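In symbols (our notation), observing related evidence E changes the distribution describing the phenomenon X from P(X) to the conditional distribution given by Bayes' rule, while X itself is unchanged:
\[
  P(X \mid E) = \frac{P(E \mid X)\, P(X)}{P(E)}.
\]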
The top-level concept of the BW is the BN which is used to model net-
works of more elementary phenomena (see figure 16.2). A BN consists of
a collection of nodes, each of which represents one elementary phenomenon.
Think of a node as a random variable whose PD has not yet been specified. A
node has a range of values. For example, the height of a person is a positive
real number. A Node can depend on other Nodes. A dependency is called a
dependency arc. It is convenient to order the dependencies of a single node,
so in figure 16.2, a Node can depend on a NodeList, which consists of a
sequence of Nodes. The order of the dependencies is used when the con-
ditional probabilities are specified. A BN can import another BN. The nodes
and dependencies of an imported BN become part of the importing BN.
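A hedged Python sketch of this structure follows; the class and field names are our reading of figure 16.2, not a published API.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    value_range: str                          # e.g., "positive real number"
    depends_on: Optional["NodeList"] = None   # ordered dependencies

@dataclass
class NodeList:
    nodes: List[Node] = field(default_factory=list)

@dataclass
class BayesianNetwork:
    nodes: List[Node] = field(default_factory=list)
    imports: List["BayesianNetwork"] = field(default_factory=list)

    def all_nodes(self) -> List[Node]:
        # Nodes and dependencies of imported BNs become part of the importer.
        collected = list(self.nodes)
        for bn in self.imports:
            collected.extend(bn.all_nodes())
        return collected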
The most complex part of a BN is its joint probability distribution (JPD)
which is specified using a collection of conditional and unconditional PDs.
Since a BN can have more than one PD, the notion of a BN distribution (BND)
is separated from that of the BN. There is a one-to-many relationship between
the concepts of BN and BND. A BND consists of a collection of distributions,
one for each node in the BN. A node distribution (ND) relates one node to its
conditional distribution.
The notion of a conditional distribution is the main concept in the condi-
tional probability ontology, as shown in figure 16.3. A conditional distribu-
tion has three special cases. It can be a CPD table (CPT), a general stochastic
function (SF), or an unconditional PD. The CPT is used in the case of phenom-
ena with a small number of possible values (called states in this case). Most
current BN tools support only this kind of conditional probability specifica-
tion.
A CPT is defined recursively, with one level for each dependency. There is
one conditional probability entry (CPE) for each value of the first parent node.
Each CPE specifies a weight and a CPT for the remaining parent nodes.
Weights are nonnegative real numbers. They need not be normalized. At
the last level one uses an unconditional PD.
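A minimal Python sketch of such a recursive table; the representation is ours, and only the recursion over parents, the unnormalized weights, and the unconditional PD at the last level follow the text.

# A CPT over a list of parents is a list of conditional probability entries
# (CPEs), one per value of the first parent.  Each CPE carries a weight
# (which need not be normalized) and a CPT over the remaining parents; at
# the last level the "CPT" is an unconditional PD over the node's own states.
def make_cpt(parent_states, leaf_pd):
    if not parent_states:
        return dict(leaf_pd)                  # unconditional PD
    first, rest = parent_states[0], parent_states[1:]
    return [{"value": v, "weight": 1.0, "cpt": make_cpt(rest, leaf_pd)}
            for v in first]

# Example: two binary parents and a node with states Absent/Present.
cpt = make_cpt([["Absent", "Present"], ["Absent", "Present"]],
               {"Absent": 0.5, "Present": 0.5})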
A SF is also defined recursively, but instead of using an explicit collection
of CPEs, it uses one or more functions that specify the parameter(s) of the
remaining distributions. The most common function is a linear function, and
it is the only one shown in the diagram. Functions are necessary to spec-
ify dependencies on continuous phenomena. More general functions can
be specified by using the Mathematical Markup Language (MathML) (W3C
2003).
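For instance, a linear stochastic function for a node Y with continuous parents X1, ..., Xk might parameterize the mean of a normal distribution as follows (our notation for a conditional linear Gaussian form, consistent with the description above):
\[
  Y \mid X_1 = x_1, \dots, X_k = x_k \;\sim\;
  \mathcal{N}\!\Bigl(a_0 + \sum_{i=1}^{k} a_i x_i,\; \sigma^2\Bigr).
\]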
PDs are classified in the PD ontology shown in figure 16.4. This ontology
is a hierarchy of the most commonly used PDs. The main classification is
between discrete and continuous distributions. Discrete distributions may
either be defined by a formula (as in the Poisson and binomial distributions)
or explicitly for each value (state). Every continuous distribution can be al-
tered by changing its scale or by translating it (or both). The most commonly
used continuous distributions are the uniform and normal (Gaussian) dis-
tributions. The uniform distribution is on the unit interval and the normal
has mean 0 and variance 1. Other uniform and normal distributions can
be obtained by scaling and translating the standard ones. Other commonly
used distributions are the exponential and chi-square distributions as well as
Gosset’s t distribution, and Fisher’s F distribution.
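In symbols (standard facts, not specific to the BW ontologies): if Z has the standard normal distribution and U is uniform on the unit interval, then scaling and translation give the general forms
\[
  \mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2),
  \qquad
  a + (b - a)\, U \sim \mathrm{Uniform}(a, b).
\]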
ANSWER TO EXERCISE 1.1
<bio_sequence element_id="U83302" sequence_id="MICR83302"
organism_name="Colaptes rupicola" seq_length="1047" type="DNA"/>
<bio_sequence element_id="U83303" sequence_id="HSU83303"
organism_name="Homo sapiens" seq_length="3460" type="DNA"/>
<bio_sequence element_id="U83304" sequence_id="MMU83304"
organism_name="Mus musculus" seq_length="51" type="RNA"/>
<bio_sequence element_id="U83305" sequence_id="MIASSU833"
organism_name="Accipiter striatus" seq_length="1143" type="DNA"/>
ANSWER TO EXERCISE 1.2
<!ATTLIST bio_sequence
element_id ID #IMPLIED
sequence_id CDATA #IMPLIED
organism_name CDATA #IMPLIED
seq_length CDATA #IMPLIED
molecule_type (DNA | mRNA | rRNA | tRNA | cDNA | AA)
#IMPLIED>
This example was taken from the AGAVE DTD (AGAVE 2002). The actual
element has some additional attributes, and it differs in a few other ways as
well. For example, some of the attributes are restricted to NMTOKEN rather
than just CDATA. NMTOKEN specifies text that starts with a letter (and a few
other characters, such as an underscore), and is followed by letters and digits.
Programming languages such as Perl restrict the names of variables and pro-
cedures in this way, and many genomics databases use this same convention
for their accession numbers and other identifiers.
ANSWER TO EXERCISE 1.3
<physical_unit name="millisecond">
<factor prefix="milli" unit="second"/>
</physical_unit>
<physical_unit name="per_millisecond">
<factor prefix="milli" unit="second" exponent="-1"/>
</physical_unit>
<physical_unit name="millivolt">
<factor prefix="milli" unit="volt"/>
</physical_unit>
<physical_unit name="microA_per_mm2">
<factor prefix="micro" unit="ampere"/>
<factor prefix="milli" unit="mitre" exponent="-2"/>
</physical_unit>
<physical_unit name="microF_per_mm2">
<factor prefix="micro" unit="farad"/>
<factor prefix="milli" unit="mitre" exponent="-2"/>
</physical_unit>
</component>
<component name="ionic_current">
<variable name="I_ion" interface="out"
physical_unit="microA_per_mm2"/>
<variable name="v" interface="in"/>
<variable name="Vth" interface="in"
physical_unit="millivolt"/>
</component>
IDREF means that the attribute refers to an element elsewhere in the document,
namely the one whose ID attribute has the matching value. In this case it is
referring to a physical unit definition in exercise 1.3.
ANSWER TO EXERCISE 2.1
The XML schema can be obtained by translating the molecule DTD in fig-
ure 1.6 using dtd2xsd.pl (W3C 2001a). The answer is the following:
<schema
xmlns=’http://www.w3.org/2000/10/XMLSchema’
targetNamespace=’http://www.w3.org/namespace/’
xmlns:t=’http://www.w3.org/namespace/’>
<element name=’molecule’>
<complexType>
<sequence>
<element ref=’t:atomArray’/>
<element ref=’t:bondArray’/>
</sequence>
<attribute name=’title’ type=’string’ use=’optional’/>
<attribute name=’id’ type=’string’ use=’optional’/>
<attribute name=’convention’ type=’string’ use=’default’ value=’CML’/>
<attribute name=’dictRef’ type=’string’ use=’optional’/>
<attribute name=’count’ type=’string’ use=’default’ value=’1’/>
</complexType>
</element>
<element name=’atomArray’>
<complexType>
<sequence>
<element ref=’t:atom’ maxOccurs=’unbounded’/>
</sequence>
<attribute name=’title’ type=’string’ use=’optional’/>
<attribute name=’id’ type=’string’ use=’optional’/>
<attribute name=’convention’ type=’string’ use=’default’ value=’CML’/>
</complexType>
</element>
<element name=’atom’>
<complexType>
<attribute name=’elementType’ type=’string’ use=’optional’/>
<attribute name=’title’ type=’string’ use=’optional’/>
<attribute name=’id’ type=’string’ use=’optional’/>
<attribute name=’convention’ type=’string’ use=’default’ value=’CML’/>
<attribute name=’dictRef’ type=’string’ use=’optional’/>
<attribute name=’count’ type=’string’ use=’default’ value=’1’/>
</complexType>
</element>
<element name=’bondArray’>
<complexType>
<sequence>
<element ref=’t:bond’ maxOccurs=’unbounded’/>
</sequence>
<attribute name=’title’ type=’string’ use=’optional’/>
<attribute name=’id’ type=’string’ use=’optional’/>
<attribute name=’convention’ type=’string’ use=’default’ value=’CML’/>
</complexType>
</element>
<element name=’bond’>
<complexType>
<attribute name=’title’ type=’string’ use=’optional’/>
<attribute name=’id’ type=’string’ use=’optional’/>
<attribute name=’convention’ type=’string’ use=’default’ value=’CML’/>
<attribute name=’dictRef’ type=’string’ use=’optional’/>
<attribute name=’atomRefs’ type=’string’ use=’optional’/>
</complexType>
</element>
</schema>
ANSWER TO EXERCISE 2.2
Change the line

<attribute name=’elementType’ type=’string’ use=’optional’/>

to

<attribute name=’elementType’
type=’elementTypeType’ use=’optional’/>

and add the following type definition:
<xsd:simpleType name="elementTypeType">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="Ac"/>
<xsd:enumeration value="Al"/>
<xsd:enumeration value="Ag"/>
...
<xsd:enumeration value="Zn"/>
<xsd:enumeration value="Zr"/>
</xsd:restriction>
</xsd:simpleType>
ANSWER TO EXERCISE 2.3
One possible answer uses an enumeration of cases:
<xsd:simpleType name="DNABase">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="A"/>
<xsd:enumeration value="C"/>
<xsd:enumeration value="G"/>
<xsd:enumeration value="T"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="DNAbase">
<xsd:restriction base="xsd:string">
<xsd:pattern value="[ACGT]"/>
</xsd:restriction>
</xsd:simpleType>
ANSWER TO EXERCISE 2.4
A DNA sequence could be defined using a list of bases as in
<simpleType name="DNASequence">
<list itemType="DNABase"/>
</simpleType>
Using this definition, the TATA sequence would be written with whitespace
between the bases, as T A T A, rather than as TATA, because the items of an
XML Schema list type are separated by whitespace.
ANSWER TO EXERCISE 4.1
In the following answer, it was presumed that the concepts of atomArray
and bondArray were artifacts of the design of the XML DTD and schema
and were not fundamental to the meaning of a molecule. Other assumptions
would lead to many other designs.
<Class rdf:ID="Molecule"/>
<Class rdf:ID="Atom"/>
<Class rdf:ID="Bond"/>
<Property rdf:ID="atom">
<domain rdf:resource="#Molecule"/>
<range rdf:resource="#Atom"/>
</Property>
<Property rdf:ID="bond">
<domain rdf:resource="#Molecule"/>
<range rdf:resource="#Bond"/>
</Property>
<Property rdf:ID="title"/>
<Property rdf:ID="convention"/>
<Property rdf:ID="dictRef"/>
<Property rdf:ID="count">
<range rdf:resource=
"http://www.w3.org/2000/10/XMLSchema#positiveInteger"/>
</Property>
<Property rdf:ID="elementType">
<domain rdf:resource="#Atom"/>
<range rdf:resource=
"http://ontobio.org/molecule.xsd#elementTypeType"/>
</Property>
<Property rdf:ID="atomRef">
<domain rdf:resource="#Bond"/>
<range rdf:resource="#Atom"/>
</Property>
ANSWER TO EXERCISE 4.2
Using the ontology in the sample answer above, nitrous oxide would be the
following:
<owl:Class rdf:ID="bio_sequence"/>
<owl:ObjectProperty rdf:ID="sequence_id">
<rdfs:domain rdf:about="#bio_sequence"/>
</owl:ObjectProperty>
<owl:DatatypeProperty rdf:ID="organism_name">
<rdfs:domain rdf:about="#bio_sequence"/>
<rdfs:range rdf:about=
"http://www.w3.org/2000/10/XMLSchema#string"/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="organism_name">
<rdfs:domain rdf:about="#bio_sequence"/>
<rdfs:range rdf:about=
"http://www.w3.org/2000/10/XMLSchema#string"/>
</owl:DatatypeProperty>
<owl:DatatypeProperty rdf:ID="seq_length">
<rdfs:domain rdf:about="#bio_sequence"/>
<rdfs:range rdf:about=
"http://www.w3.org/2000/10/XMLSchema#nonNegativeInteger"/>
</owl:DatatypeProperty>
<owl:ObjectProperty rdf:ID="molecule_type">
<rdfs:domain rdf:about="#bio_sequence"/>
<rdfs:range rdf:about="#MoleculeTypes"/>
</owl:ObjectProperty>
<owl:Class rdf:ID="MoleculeTypes">
<owl:oneOf rdf:parseType="Collection">
<MoleculeTypes rdf:ID="DNA"/>
<MoleculeTypes rdf:ID="mRNA"/>
<MoleculeTypes rdf:ID="rRNA"/>
<MoleculeTypes rdf:ID="tRNA"/>
<MoleculeTypes rdf:ID="cDNA"/>
<MoleculeTypes rdf:ID="AA"/>
</owl:oneOf>
</owl:Class>
ANSWER TO EXERCISE 8.1
To solve this exercise, extract all Interview elements that have attributes with
the desired characteristics. Dates always start with the year, so the following
query gives the desired results:
document("healthstudy.xml")
//Interview[starts-with(@Date,"2000") and @BMI>30]
Alternatively, one can use the XQuery function that extracts the year from
a date as in the following query:
document("healthstudy.xml")
//Interview[year-from-dateTime(@Date)=2000 and @BMI>30]
ANSWER TO EXERCISE 8.2
First find the insulin gene locus. Then within this locus find all literature
entries. An entry is a literature reference if the reference element containing
the entry is named “Literature references.” Note the use of “..” to obtain
the name attribute of the parent element of the entry.
for $citation in
document("pubmed.xml")//MedlineCitation,
$heading in
$citation//MeshHeading
where $heading/DescriptorName/@MajorTopicYN="Y"
and $heading/DescriptorName="Glutethimide"
and $heading/QualifierName="therapeutic use"
return $citation
In the query above, if a citation has more than one MeSH heading that
satisfies the criteria, then the citation will be returned more than once. One
can avoid this problem by using a “nested” subquery as in the following
query. For each citation, this query runs a separate subsidiary query that
finds all headings within the citation that satisfy the criteria. If the nested
subquery has one or more results, then the citation is returned.
for $i in document("healthstudy.xml")//Interview,
$j in document("healthstudy.xml")//Interview
where $i/@SID = $j/@SID
and $j/@BMI - $i/@BMI > 4.5
and $j/@Date - $i/@Date < "P2Y"
return $i/@SID
This query can return the same subject identifier more than once when a
subject satisfies the criteria multiple times.
ANSWER TO EXERCISE 8.6
The number of associations is specified by the n_associations attribute of
the go:term element. The term number is specified by the go:accession
element. The GO namespace must be declared prior to using it in the query.
declare namespace
go="http://www.geneontology.org/dtds/go.dtd#";
document("go.xml")
//go:term[go:accession="GO:0003673"]/@n_associations
<xsl:template match="gene">
<xsl:copy>
<xsl:attribute name="locus">
<xsl:value-of select="../@name"/>
</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
This template applies only to gene elements. All other elements are copied
exactly. For gene elements, the element itself is copied, then a new attribute
is added named locus, having a value equal to the name attribute of its
parent element.
ANSWER TO EXERCISE 11.2
<xsl:template match="locus">
<xsl:apply-templates select="gene"/>
</xsl:template>
<xsl:template match="gene">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
<xsl:apply-templates select="../reference"/>
</xsl:copy>
</xsl:template>
The first template removes the locus element, along with all of its child
elements, except for the gene element. The second template copies all gene
elements and adds the reference elements that were removed by the first
template.
ANSWER TO EXERCISE 11.3
<xsl:template match="organism">
<organism>
<xsl:apply-templates select="@*"/>
<contains>
<xsl:apply-templates select="node()"/>
</contains>
</organism>
</xsl:template>
Note that the attributes of organism remain in the same element, but the
child elements of organism are made child elements of the new contains
element.
ANSWER TO EXERCISE 11.4
<xsl:template match="reference">
<xsl:choose>
<xsl:when test="@name=’Sequence databases’">
<isStoredIn>
<xsl:apply-templates select="@*|node()"/>
</isStoredIn>
</xsl:when>
<xsl:when test="@name=’Literature references’">
<isCitedBy>
<xsl:apply-templates select="@*|node()"/>
</isCitedBy>
</xsl:when>
</xsl:choose>
</xsl:template>
ANSWER TO EXERCISE 11.5
<xsl:template match="gene">
<gene>
<xsl:attribute name="embl">
<xsl:value-of select=
"../reference[@name=’Sequence databases’]
/db_entry[@name=’EMBL sequence’]/@entry"/>
</xsl:attribute>
<xsl:attribute name="organism">
<xsl:value-of select="../../../../organism/@name"/>
</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</gene>
</xsl:template>
ANSWER TO EXERCISE 11.6
<xsl:template match="gene">
<gene>
<xsl:attribute name="totalExonLength">
<xsl:value-of
select="sum(exon/@end)-sum(exon/@start)+count(exon)"/>
</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</gene>
</xsl:template>
ANSWER TO EXERCISE 12.1
Why: Assist research
What: High-level view
Who: Researchers
When: A few weeks
How: Help understanding
ANSWER TO EXERCISE 12.2
Consistency checking uses a software tool. This is analogous to the data-
mining tool in the diagram. So it should be modeled as an actor. The consistency-
checking tool can check the ontology for consistency, and it can also check
that the chart database is consistent with the chart ontology. The modified
diagram is in figure 17.1.
[Figure 17.1 diagram labels: Chart Ontology, Ontologist, Consistency Checker, Authorization, Medical Personnel, Chart Database.]
Figure 17.1 Modified use case diagram for the medical chart ontology. The diagram
now includes the consistency checking tool.
ANSWER TO EXERCISE 12.3
For this project, one would like to make use of more advanced modeling
constructs. This is a good argument in favor of the OWL languages. In-
compatibility with the other major web-based ontology language groups is
References
AGAVE, 2002. The Architecture for Genomic Annotation, Visualization and Ex-
change. www.animorphics.net/lifesci.html.
Al-Shahrour, F., R. Diaz-Uriarte, and J. Dopazo. 2004. FatiGO: a web tool for finding
significant associations of Gene Ontology terms with groups of genes. Bioinformat-
ics 20:578–580.
Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic
perspective. J. Mol. Biol. 219:555–565.
Altschul, S.F., and W. Gish. 1996. Local alignment statistics. Methods Enzymol. 266:
460–480.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local
alignment search tool. J. Mol. Biol. 215:403–410.
Altschul, S.F., T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lip-
man. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res. 25:3389–3402.
Andreeva, A., D. Howorth, S.E. Brenner, T.J. Hubbard, C. Chothia, and A.G. Murzin.
2004. SCOP database in 2004: refinements integrate structure and sequence family
data. Nucleic Acids Res. 32:D226–D229. Database issue.
Aronson, A.R. 2001. Effective mapping of biomedical text to the UMLS Metathe-
saurus: the MetaMap program. In Proc. AMIA Symp., pp. 17–21.
Asimov, I. 1964. A Short History of Biology. London: Thomas Nelson & Sons.
Attwood, T.K., P. Bradley, D.R. Flower, A. Gaulton, N. Maudling, A.L. Mitchell,
G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin, and C. Zygouri. 2003. PRINTS
and its automatic supplement, prePRINTS. Nucleic Acids Res. 31:400–402.
Attwood, T.K., D.R. Flower, A.P. Lewis, J.E. Mabey, S.R. Morgan, P. Scordis, J.N. Selley,
and W. Wright. 1999. PRINTS prepares for the new millennium. Nucleic Acids Res.
27:220–225.
Baclawski, K., 1997a. Distributed computer database system and method. United
States Patent No. 5,694,593. Assigned to Northeastern University, Boston.
Baclawski, K. 1997b. Long time, no see: categorization in information science. In
S. Hecker and G.C. Rota (eds.), Essays on the Future. In Honor of the 80th Birthday of
Nick Metropolis, pp. 11–26. Cambridge, MA: Birkhauser.
Baclawski, K. 2003. Ontology development. Keynote address in International Workshop
on Software Methodologies, Tools and Techniques, pp. 3–26.
Baclawski, K., J. Cigna, M.M. Kokar, P. Mager, and B. Indurkhya. 2000. Knowledge
representation and indexing using the Unified Medical Language System. In Pacific
Symposium on Biocomputing, vol. 5, pp. 490–501.
Baclawski, K., M. Kokar, P. Kogut, L. Hart, J. Smith, W. Holmes, J. Letkowski, and
M. Aronson. 2001. Extending UML to support ontology engineering for the Se-
mantic Web. In M. Gogolla and C. Kobryn (eds.), Fourth International Conference on
the Unified Modeling Language, vol. 2185, pp. 342–360. Berlin: Springer-Verlag.
Baclawski, K., C. Matheus, M. Kokar, and J. Letkowski. 2004. Toward a symptom on-
tology for Semantic Web applications. In ISWC’04, vol. 3298, pp. 650–667. Springer-
Verlag, Berlin.
Bader, G.D., D. Betel, and C.W. Hogue. 2003. BIND: the biomolecular interaction
network database. Nucleic Acids Res. 31:248–250.
Bairoch, A. 1991. PROSITE: a dictionary of sites and patterns in proteins. Nucleic
Acids Res. 19:2241–2245.
Baker, P.G., A. Brass, S. Bechhofer, C. Goble, N. Paton, and R. Stevens. 1998. TAMBIS–
transparent access to multiple bioinformatics information sources. In Proc. Int.
Conf. Intell. Syst. Mol. Biol., vol. 6, pp. 25–34.
Baker, P.G., C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens, and A. Brass. 1999. An
ontology for bioinformatics applications. Bioinformatics 15:510–520.
Bateman, A., L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna,
M. Marshall, S. Moxon, E.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy.
2004. The Pfam protein families database. Nucleic Acids Res. 32:D138–D141. Data-
base issue.
Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2004.
GenBank: update. Nucleic Acids Res. 32:D23–D26. Database issue.
Bergamaschi, S., S. Castano, and M. Vincini. 1999. Semantic integration of semistruc-
tured and structured data sources. SIGMOD Rec. 28:54–59.
Bergman, C.M., B.D. Pfeiffer, D. Rincon-Limas, R.A. Hoskins, A. Gnirke, C.J. Mungall,
A.M. Wang, B. Kronmiller, J. Pacleb, S. Park, M. Stapleton, K. Wan, R.A. George,
P.J. de Jong, J. Botas, G.M. Rubin, and S.E. Celniker. 2002. Assessing the impact of
comparative genomic sequence data on the functional annotation of the Drosophila
genome. Genome Biol. 3:RESEARCH0086.
Berman, H.M., T.N. Bhat, P.E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and J. West-
brook. 2000. The Protein Data Bank and the challenge of structural genomics. Nat.
Struct. Biol. 7:957–959.
Berman, H.M., W.K. Olson, D.L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S.H.
Hsieh, A.R. Srinivasan, and B. Schneider. 1992. The nucleic acid database. A com-
prehensive relational database of three-dimensional structures of nucleic acids. Bio-
phys. J. 63:751–759.
Berners-Lee, T., 2000a. Semantic Web - XML2000. www.w3.org/2000/Talks/
1206-xml2k-tbl.
Berners-Lee, T., 2000b. Why RDF model is different from the XML model. www.w3.
org/DesignIssues/RDF-XML.html.
BioML, 2003. Biopolymer Markup Language website. www.rdcormia.com/
COIN78/files/XML_Finals/BIOML/Pages/BIOML.htm.
Birney, E., T.D. Andrews, P. Bevan, M. Caccamo, Y. Chen, L. Clarke, G. Coates,
J. Cuff, V. Curwen, T. Cutts, T. Down, E. Eyras, X.M. Fernandez-Suarez, P. Gane,
B. Gibbins, J. Gilbert, M. Hammond, H.R. Hotz, V. Iyer, K. Jekosch, A. Kahari,
A. Kasprzyk, D. Keefe, S. Keenan, H. Lehvaslaiho, G. McVicker, C. Melsopp,
P. Meidl, E. Mongin, R. Pettett, S. Potter, G. Proctor, M. Rae, S. Searle, G. Slater,
D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey, A. Ureta-Vidal,
K.C. Woodwark, G. Cameron, R. Durbin, A. Cox, T. Hubbard, and M. Clamp. 2004.
An overview of Ensembl. Genome Res. 14:925–928.
Bodenreider, O. 2004. The Unified Medical Language System (UMLS): integrating
biomedical terminology. Nucleic Acids Res. 32:D267–D270. Database issue.
Bodenreider, O., S.J. Nelson, W.T. Hole, and H.F. Chang. 1998. Beyond synonymy:
exploiting the UMLS semantics in mapping vocabularies. In Proc. AMIA Symp., pp.
815–819.
Brachman, R., and J. Schmolze. 1985. An overview of the KL-ONE knowledge repre-
sentation system. Cognitive Sci 9:171–216.
Brazma, A., P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert,
J. Aach, W. Ansorge, C.A. Ball, H.C. Causton, T. Gaasterland, P. Glenisson, F.C. Hol-
stege, I.F. Kim, V. Markowitz, J.C. Matese, H. Parkinson, A. Robinson, U. Sarkans,
S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron. 2001. Minimum
information about a microarray experiment (MIAME)–toward standards for mi-
croarray data. Nat. Genet. 29:365–371.
Buck, L. 2000. The molecular architecture of odor and pheromone sensing in mam-
mals. Cell 100:611–618.
Buck, L., and R. Axel. 1991. A novel multigene family may encode odorant receptors:
a molecular basis for odor recognition. Cell 65:175–187.
Bunge, M. 1977. Treatise on Basic Philosophy. III: Ontology: The Furniture of the World.
Dordrecht, Netherlands: Reidel.
Bunge, M. 1979. Treatise on Basic Philosophy. IV: Ontology: A World of Systems. Dor-
drecht, Netherlands: Reidel.
Camon, E., M. Magrane, D. Barrell, D. Binns, W. Fleischmann, P. Kersey, N. Mulder,
T. Oinn, J. Maslen, A. Cox, and R. Apweiler. 2003. The Gene Ontology Annotation
(GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Ge-
nome Res. 13:662–672.
Celis, J.E., M. Ostergaard, N.A. Jensen, I. Gromova, H.H. Rasmussen, and P. Gro-
mov. 1998. Human and mouse proteomic databases: novel resources in the protein
universe. FEBS Lett. 430:64–72.
CellML, 2003. CellML website. www.cellml.org.
Chakrabarti, S., B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan.
1998. Automatic resource list compilation by analyzing hyperlink structure and
associated text. In Proc. 7th Int. World Wide Web Conf.
Chen, R.O., R. Felciano, and R.B. Altman. 1997. RIBOWEB: linking structural com-
putations to a knowledge base of published experimental data. In Proc. Int. Conf.
Intell. Syst. Mol. Biol., vol. 5, pp. 84–87.
Cheng, J., S. Sun, A. Tracy, E. Hubbell, J. Morris, V. Valmeekam, A. Kimbrough, M.S.
Cline, G. Liu, R. Shigeta, D. Kulp, and M.A. Siani-Rose. 2004. NetAffx Gene On-
tology Mining Tool: a visual approach for microarray data analysis. Bioinformatics
20:1462–1463.
Cleverdon, C., and E. Keen. 1966. Factors determining the performance of indexing
systems. Vol. 1: Design, Vol. 2: Results. Technical report, Aslib Cranfield Research
Project, Cranfield, UK.
Clocksin, W., C. Mellish, and W. Clocksin. 2003. Programming in PROLOG. New York:
Springer-Verlag.
CML, 2003. Chemical Markup Language website. www.xml-cml.org.
Conde, L., J.M. Vaquerizas, J. Santoyo, F. Al-Shahrour, S. Ruiz-Llorente, M. Robledo,
and J. Dopazo. 2004. PupaSNP Finder: a web tool for finding SNPs with putative
effect at transcriptional level. Nucleic Acids Res. 32:W242–W248. Web server issue.
Cooper, D.N. 1999. Human Gene Evolution. San Diego: Academic Press.
Crasto, C., L. Marenco, P. Miller, and G. Shepherd. 2002. Olfactory Receptor Data-
base: a metadata-driven automated population from sources of gene and protein
sequences. Nucleic Acids Res. 30:354–360.
Dayhoff, M.O., R.M. Schwartz, and B.C. Orcutt. 1978. A model of evolutionary
change in proteins. In M.O. Dayhoff (ed.), Atlas of Protein Sequence and Structure,
vol. 5, pp. 345–352. Washington, DC: National Biomedical Research Foundation.
De Finetti, B. 1937. La prévision: ses lois logiques, ses sources subjectives. Ann. Inst.
Henri Poincaré 7:1–68.
Decker, S., D. Brickley, J. Saarela, and J. Angele. 1998. A query and inference service
for RDF. In QL’98 - The Query Language Workshop.
Dennis, G., Jr., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, and R.A.
Lempicki. 2003. DAVID: Database for annotation, visualization, and integrated
discovery. Genome Biol. 4:P3.
Denny, J.C., J.D. Smithers, and R.A. Miller. 2003. “Understanding” medical school
curriculum content using KnowledgeMap. J. Am. Med. Inf. Assoc. 10:351–362.
Denny, M., 2002a. Ontology building: a survey of editing tools. www.xml.com/pub/
a/2002/11/06/ontologies.html.
Denny, M., 2002b. Ontology editor survey results. www.xml.com/2002/11/06/
Ontology_Editor_Survey.html.
Ding, Z., and Y. Peng. 2004. A probabilistic extension to ontology language OWL. In
Proc. 37th Hawaii Int. Conf. on Systems Science.
Do, H., S. Melnik, and E. Rahm. 2002. Comparison of schema matching evaluations.
In Proc. GI-Workshop “Web and Databases,” vol. 2593, Erfurt, Germany. Springer-
Verlag.
Do, H., and E. Rahm. 2002. COMA - a system for flexible combination of schema
matching approaches. In Proc. VLDB.
Dodd, I.B., and J.B. Egan. 1990. Improved detection of helix-turn-helix DNA-binding
motifs in protein sequences. Nucleic Acids Res. 18:5019–5026.
Draghici, S., P. Khatri, P. Bhavsar, A. Shah, S.A. Krawetz, and M.A. Tainsky. 2003.
Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare,
Onto-Design and Onto-Translate. Nucleic Acids Res. 31:3775–3781.
DUET, 2002. DAML UML enhanced tool (DUET). grcinet.grci.com/maria/
www/CodipSite/Tools/Tools.html.
Dwight, S.S., R. Balakrishnan, K.R. Christie, M.C. Costanzo, K. Dolinski, S.R. En-
gel, B. Feierbach, D.G. Fisk, J. Hirschman, E.L. Hong, L. Issel-Tarver, R.S. Nash,
A. Sethuraman, B. Starr, C.L. Theesfeld, R. Andrada, G. Binkley, Q. Dong, C. Lane,
M. Schroeder, S. Weng, D. Botstein, and J.M. Cherry. 2004. Saccharomyces genome
database: underlying principles and organisation. Brief Bioinform. 5:9–22.
EcoCyc, 2003. Encyclopedia of Escherichia coli Genes and Metabolism. www.ecocyc.
org.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755–763.
Embley, D., et al. 2001. Multifaceted exploitation of metadata for attribute match
discovery in information integration. In International Workshop on Information Inte-
gration on the Web.
Heflin, J., J. Hendler, and S. Luke. 2000. SHOE: a knowledge representation lan-
guage for Internet applications. Technical Report www.cs.umd.edu/projects/plus/SHOE,
Institute for Advanced Studies, University of Maryland, College Park.
Helsper, E., and L. van der Gaag. 2001. Ontologies for probabilistic networks: A case
study in oesophageal cancer. In B. Kröse, M. de Rijke, G. Schreiber, and M. van
Someren (eds.), Proc. 13th Belgium-Netherlands Conference on Artificial Intelligence,
Amsterdam, pp. 125–132.
Helsper, E., and L. van der Gaag. 2002. A case study in ontologies for probabilistic
networks. In M. Bramer, F. Coenen, and A. Preece (eds.), Research and Development
in Intelligent Systems XVIII, pp. 229–242. London: Springer-Verlag.
Henikoff, J.G., E.A. Greene, S. Pietrokovski, and S. Henikoff. 2000. Increased coverage
of protein families with the blocks database servers. Nucleic Acids Res. 28:228–230.
Henikoff, S., and J.G. Henikoff. 1991. Automated assembly of protein blocks for
database searching. Nucleic Acids Res. 19:6565–6572.
Henikoff, S., and J.G. Henikoff. 1992. Amino acid substitution matrices from protein
blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915–10919.
Henikoff, S., and J.G. Henikoff. 1994. Protein family classification based on searching
a database of blocks. Genomics 19:97–107.
Henikoff, S., S. Pietrokovski, and J.G. Henikoff. 1998. Superior performance in protein
homology detection with the Blocks database servers. Nucleic Acids Res. 26:309–312.
Henrion, M., M. Pradhan, B. del Favero, K. Huang, G. Provan, and P. O’Rorke. 1996.
Why is diagnosis using belief networks insensitive to imprecision in probabilities?
In Proc. 12th Conf. Uncertainty in Artificial Intelligence, pp. 307–314.
Hertz, G.Z., G.W. Hartzell III, and G.D. Stormo. 1990. Identification of consensus
patterns in unaligned DNA sequences known to be functionally related. Comput.
Appl. Biosci. 6:81–92.
Hertz, G.Z., and G.D. Stormo. 1999. Identifying DNA and protein patterns with
statistically significant alignments of multiple sequences. Bioinformatics 15:563–577.
Hoebeke, M., H. Chiapello, P. Noirot, and P. Bessieres. 2001. SPiD: a subtilis protein
interaction database. Bioinformatics 17:1209–1212.
Holm, L., C. Ouzounis, C. Sander, G. Tuparev, and G. Vriend. 1992. A database of
protein structure families with common folding motifs. Protein Sci. 1:1691–1698.
Holm, L., and C. Sander. 1998. Touring protein fold space with Dali/FSSP. Nucleic
Acids Res. 26:316–319.
Howard, R., and J. Matheson. 1981. Influence diagrams. In R. Howard and J. Mathe-
son (eds.), Readings on the Principles and Applications of Decision Analysis, vol. 2, pp.
721–762. Menlo Park, CA: Strategic Decisions Group.
Kanehisa, M., and S. Goto. 2000. KEGG: Kyoto encyclopedia of genes and genomes.
Nucleic Acids Res. 28:27–30.
Kanehisa, M., S. Goto, S. Kawashima, and A. Nakaya. 2002. The KEGG databases at
GenomeNet. Nucleic Acids Res. 30:42–46.
Karlin, S., and S.F. Altschul. 1990. Methods for assessing the statistical significance
of molecular sequence features by using general scoring schemes. Proc. Natl. Acad.
Sci. U.S.A. 87:2264–2268.
Karlin, S., and S.F. Altschul. 1993. Applications and statistics for multiple high-
scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U.S.A. 90:5873–5877.
Karp, P.D., S. Paley, and P. Romero. 2002a. The Pathway Tools software. Bioinformatics
18:S225–S232.
Karp, P.D., M. Riley, S.M. Paley, and A. Pellegrini-Toole. 2002b. The MetaCyc data-
base. Nucleic Acids Res. 30:59–61.
Karp, P.D., M. Riley, M. Saier, I.T. Paulsen, J. Collado-Vides, S.M. Paley, A. Pellegrini-
Toole, C. Bonavides, and S. Gama-Castro. 2002c. The EcoCyc database. Nucleic
Acids Res. 30:56–58.
Kelley, B.P., B. Yuan, F. Lewitter, R. Sharan, B.R. Stockwell, and T. Ideker. 2004. Path-
BLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res. 32:
W83–W88. Web server issue.
Kent, W.J. 2002. BLAT–the BLAST-like alignment tool. Genome Res. 12:656–664.
King, O.D., R.E. Foulger, S.S. Dwight, J.V. White, and F.P. Roth. 2003. Predicting gene
function from patterns of annotation. Genome Res. 13:896–904.
Kleinberg, J. 1998. Authoritative sources in a hyperlinked environment. In Proc.
ACM-SIAM Symp. on Discrete Algorithms.
Know-Me, 2004. Know-Me website. www.nbirn.net/Resources/Users/
Applications/KnowMe/Know-ME.htm.
Kogut, P., S. Cranefield, L. Hart, M. Dutra, K. Baclawski, M. Kokar, and J. Smith. 2002.
UML for ontology development. Knowledge Eng. Rev. 17:61–64.
Kohane, I.S., A.T. Kho, and A.J. Butte. 2003. Microarrays for an Integrative Genomics.
Cambridge, MA: MIT Press.
Kohonen, T. 1997. Self-Organizing Maps. New York: Springer-Verlag.
Kokar, M., J. Letkowski, K. Baclawski, and J. Smith, 2001. The ConsVISor consistency
checking tool. www.vistology.com/consvisor/.
Kolchanov, N.A., E.V. Ignatieva, E.A. Ananko, O.A. Podkolodnaya, I.L. Stepanenko,
T.I. Merkulova, M.A. Pozdnyakov, N.L. Podkolodny, A.N. Naumochkin, and A.G.
Romashchenko. 2002. Transcription Regulatory Regions Database (TRRD): its sta-
tus in 2002. Nucleic Acids Res. 30:312–317.
Koller, D., A. Levy, and A. Pfeffer. 1997. P-Classic: a tractable probabilistic description
logic. In Proc. 14th National Conf. on Artificial Intelligence, Providence, RI, pp. 390–
397.
Koller, D., and A. Pfeffer. 1997. Object-oriented Bayesian networks. In Proc. 13th Ann.
Conf. on Uncertainty in Artificial Intelligence, Providence, RI, pp. 302–313.
Korf, I., and W. Gish. 2000. MPBLAST : improved BLAST performance with multi-
plexed queries. Bioinformatics 16:1052–1053.
Krishnan, V.G., and D.R. Westhead. 2003. A comparative study of machine-learning
methods to predict the effects of single nucleotide polymorphisms on protein func-
tion. Bioinformatics 19:2199–2209.
Kulikova, T., P. Aldebert, N. Althorpe, W. Baker, K. Bates, P. Browne, A. van den
Broek, G. Cochrane, K. Duggan, R. Eberhardt, N. Faruque, M. Garcia-Pastor,
N. Harte, C. Kanz, R. Leinonen, Q. Lin, V. Lombard, R. Lopez, R. Mancuso,
M. McHale, F. Nardone, V. Silventoinen, P. Stoehr, G. Stoesser, M.A. Tuli, K. Tzou-
vara, R. Vaughan, D. Wu, W. Zhu, and R. Apweiler. 2004. The EMBL nucleotide
sequence database. Nucleic Acids Res. 32:D27–D30. Database issue.
Kuter, I. 1999. Breast cancer highlights. Oncologist 4:299–308.
Lakoff, G. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the
Mind. Chicago: University of Chicago Press.
Lassila, O., and R. Swick, 1999. Resource description framework (RDF) model and
syntax specification. www.w3.org/TR/REC-rdf-syntax.
Lawrence, C., S. Altschul, M. Boguski, J. Liu, A. Neuwald, and J. Wootton. 1993. De-
tecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Science 262:208–214.
Leibniz, G. 1998. Monadology. In G.W. Leibniz Philosophical Texts (1714), pp. 267–
281. Translated and edited by R. Woolhouse and R. Francks. New York: Oxford
University Press.
Leif, R.C., S.B. Leif, and S.H. Leif. 2003. CytometryML, an XML format based on
DICOM and FCS for analytical cytology data. Cytometry 54A:56–65.
Letunic, I., R.R. Copley, S. Schmidt, F.D. Ciccarelli, T. Doerks, J. Schultz, C.P. Ponting,
and P. Bork. 2004. SMART 4.0: towards genomic data integration. Nucleic Acids
Res. 32:D142–D144. Database issue.
Leung, Y.F., and C.P. Pang. 2002. EYE on bioinformatics: dissecting complex disease
traits in silico. Appl. Bioinformatics 1:69–80.
Li, W., and C. Clifton. 2000. Semint: a tool for identifying attribute correspondences
in heterogeneous databases using neural network. Data and Knowledge Engineering
33:49–84.
Lindberg, D.A., B.L. Humphreys, and A.T. McCray. 1993. The Unified Medical Lan-
guage System. Methods Inf. Med. 32:281–291.
Liu, J.S., A.F. Neuwald, and C.E. Lawrence. 1995. Bayesian models for multiple
local sequence alignment and Gibbs sampling strategies. J. Am. Statis. Assoc. 90:
1156–1170.
Liu, X., D.L. Brutlag, and J.S. Liu. 2001. BioProspector: discovering conserved DNA
motifs in upstream regulatory regions of co-expressed genes. In Pac. Symp. Biocom-
put., pp. 127–138.
Lutteke, T., M. Frank, and C.W. von der Lieth. 2004. Data mining the protein data
bank: automatic detection and assignment of carbohydrate structures. Carbohydr.
Res. 339:1015–1020.
Lynch, M., and J.S. Conery. 2000. The evolutionary fate and consequences of duplicate
genes. Science 290:1151–1155.
MacKay, D., 2004. Bayesian methods for neural networks - FAQ. www.inference.
phy.cam.ac.uk/mackay/Bayes_FAQ.html.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate
observations. In L. Le Cam and J. Neyman (eds.), Proc. Fifth Berkeley Symp. Math.
Statis. and Prob., vol. 1, pp. 281–297, Berkeley, CA. University of California Press.
Madera, M., C. Vogel, S.K. Kummerfeld, C. Chothia, and J. Gough. 2004. The SU-
PERFAMILY database in 2004: additions and improvements. Nucleic Acids Res. 32:
D235–D239. Database issue.
Madhavan, J., P. Bernstein, and E. Rahm. 2001. Generic schema matching with Cupid.
In Proc. VLDB.
MAGE-ML, 2003. MicroArray Gene Expression Markup Language website. www.
mged.org.
Marchler-Bauer, A., A.R. Panchenko, B.A. Shoemaker, P.A. Thiessen, L.Y. Geer, and
S.H. Bryant. 2002. CDD: a database of conserved domain alignments with links to
domain three-dimensional structure. Nucleic Acids Res. 30:281–283.
Maybeck, P. 1979. Stochastic models, estimation and control, vol. 1. New York: Academic
Press.
McCray, A.T., O. Bodenreider, J.D. Malley, and A.C. Browne. 2001. Evaluating UMLS
strings for natural language processing. In Proc. AMIA Symp., pp. 448–452.
McGinnis, S., and T.L. Madden. 2004. BLAST: at the core of a powerful and diverse
set of sequence analysis tools. Nucleic Acids Res. 32:W20–W25. Web server issue.
McGuinness, D., R. Fikes, J. Rice, and S. Wilder. 2000. An environment for merging
and testing large ontologies. In Proceedings of the 7th International Conference on
Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, CO.
Mellquist, J.L., L. Kasturi, S.L. Spitalnik, and S.H. Shakin-Eshleman. 1998. The amino
acid following an asn-X-Ser/Thr sequon is an important determinant of N-linked
core glycosylation efficiency. Biochemistry 37:6833–6837.
Mewes, H.W., C. Amid, R. Arnold, D. Frishman, U. Guldener, G. Mannhaupt,
M. Munsterkotter, P. Pagel, N. Strack, V. Stumpflen, J. Warfsmann, and A. Ruepp.
2004. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids
Res. 32:D41–D44. Database issue.
Miller, E., R. Swick, D. Brickley, and B. McBride, 2001. Semantic Web activity page.
www.w3.org/2001/sw/.
Miller, R., L. Haas, and M. Hernandez. 2000. Schema mapping as query discovery. In
Proc. VLDB, pp. 77–88.
Mitra, P., G. Wiederhold, and J. Jannink. 1999. Semi-automatic integration of know-
ledge sources. In Proc. 2nd International Conf. on Information Fusion.
Miyazaki, S., H. Sugawara, K. Ikeo, T. Gojobori, and Y. Tateno. 2004. DDBJ in the
stream of various biological data. Nucleic Acids Res. 32:D31–D34. Database issue.
Mulder, N.J., R. Apweiler, T.K. Attwood, A. Bairoch, D. Barrell, A. Bateman, D. Binns,
M. Biswas, P. Bradley, P. Bork, P. Bucher, R.R. Copley, E. Courcelle, U. Das,
R. Durbin, L. Falquet, W. Fleischmann, S. Griffiths-Jones, D. Haft, N. Harte,
N. Hulo, D. Kahn, A. Kanapin, M. Krestyaninova, R. Lopez, I. Letunic, D. Lons-
dale, V. Silventoinen, S.E. Orchard, M. Pagni, D. Peyruc, C.P. Ponting, J.D. Selengut,
F. Servant, C.J. Sigrist, R. Vaughan, and E.M. Zdobnov. 2003. The InterPro database,
2003 brings increased coverage and new features. Nucleic Acids Res 31:315–318.
Muller, A., R.M. MacCallum, and M.J. Sternberg. 1999. Benchmarking PSI-BLAST in
genome annotation. J. Mol. Biol. 293:1257–1271.
Muller, A., R.M. MacCallum, and M.J. Sternberg. 2002. Structural characterization of
the human proteome. Genome Res. 12:1625–1641.
Murphy, K., 1998. A brief introduction to graphical models and Bayesian networks.
www.ai.mit.edu/~murphyk/Bayes/bnintro.html.
Murray-Rust, P., and H.S. Rzepa. 2003. Chemical Markup, XML, and the World Wide
Web. 4. CML Schema. J. Chem. Inf. Comput. Sci. 43:757–772.
Murzin, A.G., S.E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural
classification of proteins database for the investigation of sequences and structures.
J. Mol. Biol. 247:536–540.
Nagumo, J. 1962. An active pulse transmission line simulating nerve axon. Proc. Inst.
Radio Eng. 50:2061–2070.
Nam, Y., J. Goguen, and G. Wang. 2002. A metadata integration assistant generator
for heterogeneous distributed databases. In Proc. Int. Conf. Ontologies, Databases,
and Applications of Semantics for Large Scale Information Systems, vol. 2519, pp. 1332–
1344. Springer-Verlag, New York.
Orengo, C.A., A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, and J.M. Thornton.
1997. CATH–a hierarchic classification of protein domain structures. Structure 5:
1093–1108.
Orengo, C.A., F.M. Pearl, and J.M. Thornton. 2003. The CATH domain structure
database. Methods Biochem. Anal. 44:249–271.
Packer, B.R., M. Yeager, B. Staats, R. Welch, A. Crenshaw, M. Kiley, A. Eckert, M. Beer-
man, E. Miller, A. Bergen, N. Rothman, R. Strausberg, and S.J. Chanock. 2004.
SNP500Cancer: a public resource for sequence validation and assay development
for genetic variation in candidate genes. Nucleic Acids Res. 32:D528–D532. Database
issue.
Page, L., and S. Brin, 2004. Google page rank algorithm. www.google.com/
technology.
Pandey, A., and F. Lewitter. 1999. Nucleotide sequence databases: a gold mine for
biologists. Trends Biochem. Sci. 24:276–280.
Patel-Schneider, P., P. Hayes, and I. Horrocks, 2004. OWL web ontology language
semantics and abstract syntax. www.w3.org/TR/owl-semantics/.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Infer-
ence. San Francisco: Morgan Kaufmann.
Pearl, J. 1998. Graphical models for probabilistic and causal reasoning. In D. Gabbay
and P. Smets (eds.), Handbook of Defeasible Reasoning and Uncertainty Management
Systems, Volume 1: Quantified Representation of Uncertainty and Imprecision, pp. 367–
389. Dordrecht, Netherlands: Kluwer Academic.
Pearl, J. 2000. Causality: Models, Reasoning and Inference. Cambridge, UK: Cambridge
University Press.
Pearson, W.R., and D.J. Lipman. 1988. Improved tools for biological sequence com-
parison. Proc. Natl. Acad. Sci. U.S.A. 85:2444–2448.
Pellet, 2003. Pellet OWL reasoner. www.mindswap.org/2003/pellet/.
Perez, A., and R. Jirousek. 1985. Constructing an intensional expert system (INES).
In Medical Decision Making. Amsterdam: Elsevier.
Peri, S., J.D. Navarro, T.Z. Kristiansen, R. Amanchy, V. Surendranath, B. Muthusamy,
T.K. Gandhi, K.N. Chandrika, N. Deshpande, S. Suresh, B.P. Rashmi, K. Shanker,
N. Padma, V. Niranjan, H.C. Harsha, N. Talreja, B.M. Vrushabendra, M.A. Ramya,
A.J. Yatish, M. Joy, H.N. Shivashankar, M.P. Kavitha, M. Menezes, D.R. Choudhury,
N. Ghosh, R. Saravana, S. Chandran, S. Mohan, C.K. Jonnalagadda, C.K. Prasad,
C. Kumar-Sinha, K.S. Deshpande, and A. Pandey. 2004. Human protein reference
database as a discovery resource for proteomics. Nucleic Acids Res. 32:D497–D501.
Database issue.
Piaget, J. 1971. The Construction of Reality in the Child. New York: Ballantine Books.
Piaget, J., and B. Inhelder. 1967. The Child’s Conception of Space. New York: Norton.
Piaget, J., B. Inhelder, and A. Szeminska. 1981. The Child’s Conception of Geometry.
New York, NY: Norton.
Pingoud, A., and A. Jeltsch. 2001. Structure and function of type II restriction en-
donucleases. Nucleic Acids Res. 29:3705–3727.
Pradhan, M., M. Henrion, G. Provan, B. del Favero, and K. Huang. 1996. The sensi-
tivity of belief networks to imprecise probabilities: an experimental investigation.
Artif. Intell. 85:363–397.
Pradhan, M., G. Provan, B. Middleton, and M. Henrion. 1994. Knowledge engineer-
ing for large belief networks. In Proc. Tenth Annual Conf. on Uncertainty in Artificial
Intelligence (UAI–94), pp. 484–490, San Mateo, CA. Morgan Kaufmann.
Rahm, E., and P. Bernstein. 2001. On matching schemas automatically. Technical
report, Dept. of Computer Science, University of Leipzig. dol.uni-leipzig.
de/pub/2001-5/en.
Ramensky, V., P. Bork, and S. Sunyaev. 2002. Human non-synonymous SNPs: server
and survey. Nucleic Acids Res. 30:3894–3900.
Roberts, R.J., T. Vincze, J. Posfai, and D. Macelis. 2003. REBASE: restriction enzymes
and methyltransferases. Nucleic Acids Res. 31:418–420.
Rosch, E., and B. Lloyd (eds.). 1978. Cognition and Categorization. Hillsdale, NJ:
Lawrence Erlbaum.
Roth, F.R., J.D. Hughes, P.E. Estep, and G.M. Church. 1998. Finding DNA regulatory
motifs within unaligned non-coding sequences clustered by whole-genome mRNA
quantitation. Nat. Biotechnol. 16:939–945.
Salton, G. 1989. Automatic Text Processing. Reading, MA: Addison-Wesley.
Salton, G., E. Fox, and H. Wu. 1983. Extended boolean information retrieval. Comm.
ACM 26:1022–1036.
Salton, G., and M. McGill. 1986. Introduction to Modern Information Retrieval. New
York: McGraw-Hill.
Salwinski, L., C.S. Miller, A.J. Smith, F.K. Pettit, J.U. Bowie, and D. Eisenberg. 2004.
The database of interacting proteins: 2004 update. Nucleic Acids Res. 32:D449–D451.
Database issue.
Saracevic, T. 1975. Relevance: a review of and a framework for the thinking on the
notion in information science. J. Am. Soc. Info. Sci. 26:321–343.
Sarle, W., 2002. Neural network FAQ. www.faqs.org/faqs/ai-faq/
neural-nets.
SBML, 2003. The Systems Biology Markup Language website. www.sbw-sbml.org.
Schaffer, A.A., L. Aravind, T.L. Madden, S. Shavirin, J.L. Spouge, Y.I. Wolf, E.V.
Koonin, and S.F. Altschul. 2001. Improving the accuracy of PSI-BLAST protein
database searches with composition-based statistics and other refinements. Nucleic
Acids Res. 29:2994–3005.
Schofield, P.N., J.B. Bard, C. Booth, J. Boniver, V. Covelli, P. Delvenne, M. Ellender,
W. Engstrom, W. Goessner, M. Gruenberger, H. Hoefler, J. Hopewell, M. Mancuso,
C. Mothersill, C.S. Potten, L. Quintanilla-Fend, B. Rozell, H. Sariola, J.P. Sundberg,
and A. Ward. 2004. Pathbase: a database of mutant mouse pathology. Nucleic Acids
Res. 32:D512–D515. Database issue.
Servant, F., C. Bru, S. Carrere, E. Courcelle, J. Gouzy, D. Peyruc, and D. Kahn. 2002.
ProDom: automated clustering of homologous domains. Brief Bioinform. 3:246–251.
Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton, NJ: Princeton University
Press.
Sherry, S.T., M.H. Ward, M. Kholodov, J. Baker, L. Phan, E.M. Smigielski, and
K. Sirotkin. 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids
Res. 29:308–311.
Shipley, B. 2000. Cause and Correlation in Biology. Cambridge, UK: Cambridge Uni-
versity Press.
Shortliffe, E. 1976. Computer-Based Medical Consultation: MYCIN. New York: Elsevier.
Sigrist, C.J., L. Cerutti, N. Hulo, A. Gattiker, L. Falquet, M. Pagni, A. Bairoch, and
P. Bucher. 2002. PROSITE: a documented database using patterns and profiles as
motif descriptors. Brief Bioinform. 3:265–274.
Smith, T.F., and M.S. Waterman. 1981. Identification of common molecular subse-
quences. J. Mol. Biol. 147:195–197.
Software, Hit, 2004. Hit Software XML utilities. www.hitsw.com/xml_utilites/.
Spellman, P.T., M. Miller, J. Stewart, C. Troup, U. Sarkans, S. Chervitz, D. Bernhart,
G. Sherlock, C. Ball, M. Lepage, M. Swiatek, W.L. Marks, J. Goncalves, S. Markel,
D. Iordan, M. Shojatalab, A. Pizarro, J. White, R. Hubley, E. Deutsch, M. Senger, B.J.
Aronow, A. Robinson, D. Bassett, C.J. Stoeckert, Jr., and A. Brazma. 2002. Design
and implementation of microarray gene expression markup language (MAGE-
ML). Genome Biol. 3:RESEARCH0046.
Spinoza, B. 1998. The Ethics (1677). Translated by R. Elwes. McLean, VA: IndyPub-
lish.com.
Spirtes, P., C. Glymour, and R. Scheines. 2001. Causation, Prediction and Search. Cam-
bridge, MA: MIT Press.
States, D.J., and W. Gish. 1994. Combined use of sequence similarity and codon bias
for coding region identification. J. Comput. Biol. 1:39–50.
Steinberg, A., C. Bowman, and F. White. 1999. Revisions to the JDL data fusion model.
In SPIE Conf. Sensor Fusion: Architectures, Algorithms and Applications III, vol. 3719,
pp. 430–441.
Stock, A., and J. Stock. 1987. Purification and characterization of the CheZ protein of
bacterial chemotaxis. J. Bacteriol. 169:3301–3311.
Stoeckert, C.J., Jr., H.C. Causton, and C.A. Ball. 2002. Microarray databases: standards
and ontologies. Nat. Genet. 32 (Suppl):469–473.
Stormo, G.D., and G.W. Hartzell III. 1989. Identifying protein-binding sites from
unaligned DNA fragments. Proc. Natl. Acad. Sci. U.S.A. 86:1183–1187.
Strausberg, R.L. 2001. The Cancer Genome Anatomy Project: new resources for
reading the molecular signatures of cancer. J. Pathol. 195:31–40.
Strausberg, R.L., S.F. Greenhut, L.H. Grouse, C.F. Schaefer, and K.H. Buetow. 2001. In
silico analysis of cancer through the Cancer Genome Anatomy Project. Trends Cell
Biol. 11:S66–S71.
Tatusov, R.L., M.Y. Galperin, D.A. Natale, and E.V. Koonin. 2000. The COG database:
a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids
Res. 28:33–36.
Tatusova, T.A., and T.L. Madden. 1999. BLAST 2 sequences, a new tool for comparing
protein and nucleotide sequences. FEMS Microbiol. Lett. 174:247–250.
Taylor, W.R. 1986. Identification of protein sequence homology by consensus tem-
plate alignment. J. Mol. Biol. 188:233–258.
Thompson, J.D., D.G. Higgins, and T.J. Gibson. 1994. CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence weight-
ing, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:
4673–4680.
Thorisson, G.A., and L.D. Stein. 2003. The SNP Consortium website: past, present
and future. Nucleic Acids Res. 31:124–127.
Tigris, 2004. ArgoUML website. argouml.tigris.org/.
Tuttle, M.S., D. Sheretz, M. Erlbaum, N. Olson, and S.J. Nelson. 1989. Implementing
Meta-1: the first version of the UMLS Metathesaurus. In L.C. Kingsland (ed.), Proc.
13th Annual Symp. Comput. App. Med. Care, Washington, DC, pp. 483–487. New
York: IEEE Computer Society Press.
UML, 2004. Introduction to OMG’s Unified Modeling Language. www.omg.org/
gettingstarted/what_is_uml.htm.
Uschold, M., and M. Gruninger. 1996. Ontologies: principles, methods and applica-
tions. Knowledge Eng. Rev. 11:93–155.
van Harmelen, F., J. Hendler, I. Horrocks, D. McGuinness, P. Patel-Schneider, and
L. Stein, 2003. OWL web ontology language reference. www.w3.org/TR/
owl-ref/.
Villanueva, J., J. Philip, D. Entenberg, C.A. Chaparro, M.K. Tanwar, E.C. Holland, and
P. Tempst. 2004. Serum peptide profiling by magnetic particle-assisted, automated
sample processing and MALDI-TOF mass spectrometry. Anal. Chem. 76:1560–1570.
Volinia, S., R. Evangelisti, F. Francioso, D. Arcelli, M. Carella, and P. Gasparini. 2004.
GOAL: automated Gene Ontology analysis of expression profiles. Nucleic Acids
Res. 32:W492–W499. Web server issue.
vOWLidator, 2003. BBN OWL validator. owl.bbn.com/validator/.
W3C, 1999. XML Path language. www.w3.org/TR/xpath.
W3C, 2001a. A conversion tool from DTD to XML Schema. www.w3.org/2000/04/
schema_hack/.
W3C, 2001b. eXtensible Markup Language website. www.w3.org/XML/.
W3C, 2001c. XML Schema website. www.w3.org/XML/Schema.
W3C, 2001d. XML Stylesheet Language website. www.w3.org/Style/XSL.
W3C, 2003. W3C Math Home. w3c.org/Math.
W3C, 2004a. Resource description framework (RDF): concepts and abstract syntax.
www.w3.org/TR/rdf-concepts/.
W3C, 2004b. XML information set (second edition). www.w3.org/TR/2004/
REC-xml-infoset-20040204.
W3C, 2004c. XML Query (XQuery) website. www.w3.org/XML/Query.
Wain, H.M., E.A. Bruford, R.C. Lovering, M.J. Lush, M.W. Wright, and S. Povey. 2002.
Guidelines for human gene nomenclature. Genomics 79:464–470.
Wall, L., T. Christiansen, and R. Schwartz. 1996. Programming Perl. Sebastopol, CA:
O’Reilly & Associates.
Wand, Y. 1989. A proposal for a formal model of objects. In W. Kim and F. Lochovsky
(eds.), Object-Oriented Concepts, Databases and Applications, pp. 537–559. Reading,
MA: Addison-Wesley.
Wang, G., J. Goguen, Y. Nam, and K. Lin. 2004. Data, schema and ontology integra-
tion. In CombLog’04 Workshop, Lisbon.
Wang, L., J.J. Riethoven, and A. Robinson. 2002. XEMBL: distributing EMBL data in
XML format. Bioinformatics 18:1147–1148.
Waugh, A., P. Gendron, R. Altman, J.W. Brown, D. Case, D. Gautheret, S.C. Harvey,
N. Leontis, J. Westbrook, E. Westhof, M. Zuker, and F. Major. 2002. RNAML: a
standard syntax for exchanging RNA information. RNA 8:707–717.
Westbrook, J.D., and P.E. Bourne. 2000. STAR/mmCIF: an ontology for macromolec-
ular structure. Bioinformatics 16:159–168.
Whewell, W. 1847. The Philosophy of the Inductive Sciences, 2nd ed. London: Parker.
CDATA, 6
changing attribute names, 197
changing attributes to elements, 198
changing element names, 197
changing elements to attributes, 198
child element, 9, 66
combining element information, 198
content, 11
content model, 11, 297
default value, 6
DOCTYPE, 274
ELEMENT, 11
element, 5, 304
entering data, 6, 10
ENTITY, 11
entity, 295
fragment, 38
hierarchy, 9
IDREF, 381
implicit class, 70
merging documents, 198
NMTOKEN, 379
order of attributes, 40
order of elements, 39, 66, 74
parent element, 9, 66
root, 9
sibling elements, 9
special character, 7
splitting documents, 198
syntax, 38
text content, 16
updating data, 6, 10
viewing data, 10
XML::DOM, 236
XML::Parser, 236
XML::XPath, 236
XML Belief Network format, 370
XML editor, 34, 289, 290
XML Schema, 42, 286
  bounds, 48
  canonical, 47
  complex data type, 42
  date, 47
  facets, 47
  ordered, 47
  simple data type, 42
XML Spy, 34, 289
XML Stylesheet Language, 261
XML Topic Maps, 77, 286
  association, 77
  scope, 77
  topic, 77
XML Transformation Language, 261
XPath, 175
  ancestor element, 177
  attribute, 177
  axis, 177
  child element, 177
  descendant element, 177
  element, 176
  node, 176
  numerical operations, 178
  parent element, 177
  root element, 177
  step, 176
  string operations, 178
  text, 177
XQuery, 175, 180, 183
  corpus, 181
  database, 181
  document, 180
  for, 181, 182
  let, 182
  return, 181
  where, 181
XSD, 42
xsdasn1, 45
XSL, 261
XSLT, 180, 261
  accumulator, 276
  and, 179