
GO for Gene Documents

Xin Ying Qiu
Management Sciences Department
Tippie College of Business
The University of Iowa
xin-qiu@uiowa.edu

Padmini Srinivasan
Management Sciences Department &
School of Library and Information Science
The University of Iowa
padmini-srinivasan@uiowa.edu

ABSTRACT

Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. Our goal, based on this approach, is to develop automatic annotation methods that could supplement the expensive manual annotation processes currently in place. Using a set of Support Vector Machines (SVM) classifiers we were able to achieve Fscores of 0.48, 0.4 and 0.32 for codes of the molecular function, cellular component and biological process GO hierarchies respectively. We explore thresholding of SVM scores, and the relationship of performance to hierarchy level and to the number of positives in the training sets. We find that hierarchy level is important, especially for the molecular function and biological process hierarchies, and that the cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research also exploits the hierarchical structures by defining and testing a relaxed criterion for classification correctness.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; I.7 [Document and Text Processing]: Miscellaneous

General Terms

Experimentation, Performance

Keywords

Automatic document annotation, Gene Ontology, Hierarchy structures

1. INTRODUCTION

Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. Our goal, based on this approach, is to develop automatic annotation methods that could supplement the expensive manual annotation processes currently in place.

The importance of this GO annotation problem and the value of computational methods for solving it are well recognized. In the 2004 BioCreAtIvE challenge, a set of tasks was designed to assess the performance of current systems in supporting GO annotations for specific proteins. In particular, the second task, identifying specific text passages that provide the evidence for annotation, most closely resembles the manual process of GO annotation [7]. The participating systems showed a variety of approaches (from heuristics to Support Vector Machines classification) exploring different levels of text analysis (such as sentences or paragraphs) [2]. In Rice et al. [12], Support Vector Machines (SVM) classification was applied to the relevant documents for each GO code. Features from the documents were selected and conflated as sets of synonymous terms. Their methods worked better when a substantial set of relevant documents was available. In Ray et al. [11], statistical methods were first applied to identify informative n-gram terms from the relevant documents of each GO term. These term models provided hypothesized annotation models which could be applied to the test documents. In Chiang et al. [5], a hybrid method combining sentence-level classification and pattern matching achieved higher precision with fewer true positive documents.

In some of these previous studies, the hierarchy was explored, but to a limited extent and primarily to add information to the classification models. When working on GO annotation one may certainly draw from the general hierarchical text classification literature (e.g. [4], [6], [18]). We may also learn from hierarchical efforts with MeSH [14]. However, GO may have special characteristics that could be exploited beneficially, or properties that must be considered by automatic annotation systems in order to be effective.

Our goal in this research is to gain a better understanding of the GO annotation problem using Support Vector Machines classification algorithms. We study several open issues in the GO context. One is the effect of hierarchical level on performance. Another is the effect of skewed distributions, where the negative examples tend to overwhelm the positives in the training data. We also study differences between hierarchies built predominantly upon is-a relationships and those that significantly include part-of relationships as well. Although both relations are asymmetric and transitive, their semantics are very different. Looking beyond achieving good performance, our aim in this research is to contribute to an understanding of the problem itself. The annotation of genes and their products is an important contribution to developments in bioinformatics. As new genes are discovered and as new functions of genes are identified, these annotations serve as key mechanisms for organizing and providing access to the accumulated knowledge.
2. DATA SOURCES AND APPROACH

2.1 Gene Ontology

Gene Ontology (GO)^1 provides a structured vocabulary that is used to annotate gene products in order to succinctly indicate their molecular functions, biological processes, and cellular components [1]. Although different subsets of GO may be used to annotate different species, the intent is to provide a common annotation infrastructure. Molecular function describes activities performed by individual gene products or complexes of gene products; examples are arbutin transporter activity and retinoic acid receptor binding. A biological process is made of several steps accomplished by sequences of molecular functions; examples include lipoprotein transport and phage assembly. Cellular components are, for example, the nucleus, NADPH oxidase complex, and chromosome. There are three hierarchies in GO corresponding to these major dimensions. Each hierarchy is a directed acyclic graph (DAG). The molecular function hierarchy consists almost completely of is-a links. About a fifth of the links in the biological process hierarchy are part-of links and the rest are is-a links. The cellular component hierarchy is about evenly balanced between the two types of links.

^1 Downloaded on May 16 2006 from the GO Consortium: http://www.geneontology.org
2.2 Annotations

We began with the August 2005 download of LocusLink and extracted the entries for Homo sapiens, limited to those with locus type "gene with protein product, function known or inferred". There are 77,759 annotation entries for 16,630 locus ids. Considering only annotations that used documents for evidence we have 29,501 entries. These entries are then limited to those having TAS (Traceable Author Statement) or IDA (Inferred from Direct Assay) as evidence types, yielding 20,869 entries^2. These entries comprise 9,577 annotations for biological process (BP), 5,195 for cellular component (CC) and 6,097 for molecular function (MF). Together these 20,869 annotations reference 8,744 unique documents.

^2 http://www.geneontology.org/GO.evidence.shtml

We looked at the distribution of the GO codes in our dataset in terms of the number of documents associated with each. The range is 1 to 333 for MF, 1 to 789 for CC and 1 to 579 for BP. Limiting ourselves to those codes that had at least 5 (unique) documents associated, we get 283 unique codes for BP, 93 for CC and 214 for MF. We used 5 as the threshold given the 5-fold cross validation design of our experiments, ensuring that each code has at least 1 evidence document in each split. Interestingly, some code-pmid combinations occur more than once. This happens when the same document offers two different kinds of evidence, say TAS as well as IDA, for annotation. Limiting these combinations to unique occurrences gives us 7,200 annotations for BP, 4,391 for CC and 3,877 for MF. These data were used in our experiments.

The data for each hierarchy were randomly divided into 5 splits such that each code appears in each split with a near equal number of evidence documents. The overall cross validation strategy is to iteratively take 4 splits as training data and test the trained model on the remaining fifth split. For example, for split 1 we take splits 2-5 as training data and split 1 as testing. This ensures that there are at least 4 relevant documents for a code on the training side and at least 1 on the test side.
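As an illustration of this split construction, the following is a minimal sketch (the function and names are ours, not from the paper) of distributing each code's evidence documents over 5 splits so that per-split counts stay nearly equal:

```python
import random
from collections import defaultdict

def make_splits(code_to_pmids, n_splits=5, seed=0):
    """Assign each GO code's evidence documents to n_splits splits so that
    the code appears in every split with a near equal number of documents."""
    rng = random.Random(seed)
    splits = [defaultdict(list) for _ in range(n_splits)]  # split -> code -> pmids
    for code, pmids in code_to_pmids.items():
        shuffled = list(pmids)
        rng.shuffle(shuffled)
        # Round-robin assignment keeps per-split counts within one of each other;
        # with >= 5 documents per retained code, every split receives >= 1.
        for i, pmid in enumerate(shuffled):
            splits[i % n_splits][code].append(pmid)
    return splits
```

Iterating over the 5 splits, each is used once for testing while the remaining 4 form the training set, which guarantees at least 4 positives per code on the training side.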
2.3 Document Representation

In information retrieval research, the most widely used document representation method is the "bag of words" approach, where all the terms are used to form a vector representation. Functional or connective words are treated as stop words and are generally removed since they are assumed to have no information content. The terms may be weighted, for example, with TF×IDF or boolean weights. Alternative methods of defining terms have been explored, but with little significant improvement in text classification performance. Recent research by Moschitti and Basili [10] suggests that the elementary textual representation based on words, applied to SVM models, is very effective in text classification; more complex linguistic features such as part-of-speech information and word senses did not contribute to the predictive accuracy of SVMs.

For this research, we use vector representations of documents produced using the SMART system [15], with stemmed terms after removing stop words. The "atc" [17] construction of the TF×IDF weighting scheme was applied to the terms. This representation has worked well in our previous research ([9]). We used the title, abstract, RN and MeSH fields of the MEDLINE records.
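Under the usual SMART reading of "atc" (augmented term frequency, idf, cosine normalization), the weighting can be sketched as follows; this is our illustrative reconstruction, and the paper's exact SMART configuration may differ:

```python
import math
from collections import Counter

def atc_vector(doc_terms, df, n_docs):
    """Sketch of SMART 'atc' weighting for one document.
    doc_terms: list of (stemmed, stopped) terms; df: term -> document frequency."""
    tf = Counter(doc_terms)
    max_tf = max(tf.values())
    weights = {}
    for term, f in tf.items():
        aug_tf = 0.5 + 0.5 * f / max_tf       # 'a': augmented term frequency
        idf = math.log(n_docs / df[term])     # 't': inverse document frequency
        weights[term] = aug_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # 'c': cosine normalization
    return {term: w / norm for term, w in weights.items()}
```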
2.4 Overall Approach

Genes (or more strictly their products) are annotated with GO codes. Our interest is in predicting annotations from the literature, specifically from MEDLINE records. This is in contrast to other annotation methods such as those involving sequence homology and protein domain analysis (e.g. [19]). We approach the MEDLINE-based annotation problem in three phases. In the first phase we find documents that are relevant to the gene. In the second phase we determine which codes should be assigned to each document. In the third phase we decide which codes should be assigned to a gene/gene product based on its classified documents. In recently completed work we studied phase 1, the problem of retrieving MEDLINE records for genes [16]; in it we consider the special challenges of dealing with gene name and symbol ambiguity. In this research we focus mainly on phase 2. That is, given a document (relevant for a gene or a gene product) we ask: what GO codes should be assigned to it? We also close this paper with preliminary results for phase 3 using a very simple strategy: a gene is assigned a code if the code is assigned to any of its relevant documents. More sophisticated strategies for phase 3 are left to future research.
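This simple phase 3 strategy amounts to taking a union over a gene's classified documents; a minimal sketch (names ours):

```python
def annotate_gene(relevant_pmids, predicted_codes):
    """Phase 3, simplest strategy: a gene receives every GO code that the
    classifiers assigned to any of its relevant documents.
    predicted_codes: pmid -> set of predicted GO codes."""
    gene_codes = set()
    for pmid in relevant_pmids:
        gene_codes |= predicted_codes.get(pmid, set())
    return gene_codes
```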
The document annotation or classification problem of phase 2 is interesting in that the codes themselves are structured hierarchically. Similar hierarchical classification problems have been addressed before ([4], [6], [18]), including by our own group ([13], [14]). A key aspect of GO based research is that we have three hierarchies with different properties. Moreover, with GO, document classification is not the end point but a step toward the goal, which is gene/gene product annotation (i.e., phase 3).

We adopt a classifier-based machine learning approach using the open source software SVMlight^3. In all experiments parameters are set at their default values. The positive instances for a GO code are those records associated with it in the LocusLink dataset. The negative instances are the records assigned to all the other GO codes.

We present a sequence of experiments within the Support Vector Machines classifier framework. These also explore the effect of hierarchy level and of the number of positives available for each code during model building. We also explore a more relaxed definition of classification correctness. Our overall aim is to contribute a better understanding of phase 2 of the GO annotation problem.

^3 http://svmlight.joachims.org/
3. CODE SPECIFIC SVM CLASSIFIERS

Support Vector Machines were designed for binary (2-class) classification problems. A common solution, adopted here, is to transform an N-class problem into N binary problems. Thus we build a distinct classifier for each code (class), where the classifier decides whether a document belongs to the code's class or not. The hierarchy within each GO dimension is not used at this point. The only connection among the codes is that they share a common dataset, albeit with different positive and negative instances. Unfortunately, this approach yields extremely poor results, as shown in table 1. We found that most of the scores calculated by SVM are negative, mainly due to the highly skewed nature of the training data for most codes. As observed by several others, this problem may be fixed with judicious thresholding [3]. So in the next experiment we calculated an optimal threshold for the SVM scores from the training data.

Hierarchy  Recall  Precision  Fscore
MF         0.0419  0.0944     0.052
CC         0.0599  0.1461     0.0764
BP         0.0234  0.064      0.0398

Table 1: Results: Single Classifier for each GO Code
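The one-vs-rest setup can be sketched as below. The paper uses SVMlight with default parameters; purely for illustration we use scikit-learn's LinearSVC here, which is our substitution and not the authors' tooling:

```python
from sklearn.svm import LinearSVC

def train_code_classifiers(X, doc_code_sets, all_codes):
    """One binary classifier per GO code. X is the document-term matrix;
    doc_code_sets[i] is the set of GO codes annotating document i.
    Positives for a code are its documents; negatives are all other documents."""
    classifiers = {}
    for code in all_codes:
        y = [1 if code in codes else -1 for codes in doc_code_sets]
        clf = LinearSVC()  # stand-in for SVMlight with default parameters
        clf.fit(X, y)
        classifiers[code] = clf
    return classifiers
```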
4. SVM SCORE THRESHOLDS

Our goal is to determine a single threshold score for each hierarchy, such that documents with scores assigned by the SVM classifier above this threshold are declared positive. We select the best threshold from the training data for each split. In particular, we take the training dataset of a split and divide it into 4 parts. (We call these 'folds' in order to maintain a distinction from the higher level 'splits'.) Cross validation over these four folds is done to generate a single best threshold, which is then applied to the test side of the split. The single best threshold was the average of the best thresholds in the four folds [3].

Results are presented in table 2. The table shows, for each hierarchy, the threshold selected for each split as well as the recall, precision and Fscore values achieved on both the training and test sets. Averages across the splits are also provided. First we observe that the thresholds selected fall within a small range, from -0.87 to -0.82, across all hierarchies. Molecular function has the smallest spread of threshold values (-0.86 to -0.84). We also observe that molecular function offers a relatively easier problem than cellular component, with biological process being the hardest to solve. Finally, the test set scores are actually better than the training set scores, indicating that we have avoided over training our models in each case, as these are able to generalize to the unseen test cases. Thus we see that setting the thresholds appropriately for these SVM classifiers offers enormous benefits in performance (compare with the results in table 1).
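Selecting the per-fold best threshold can be sketched as a grid search maximizing Fscore over the fold's decision values (the candidate grid below is our assumption; the paper does not specify one):

```python
import numpy as np

def best_threshold(scores, labels, candidates=None):
    """Return the threshold on SVM decision values that maximizes Fscore.
    scores: array of decision values; labels: array of 1/-1 gold labels."""
    if candidates is None:
        candidates = np.arange(-1.5, 0.51, 0.01)  # assumed search grid
    best_t, best_f = candidates[0], -1.0
    n_pos = np.sum(labels == 1)
    for t in candidates:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        if tp == 0:
            continue
        precision = tp / np.sum(pred)
        recall = tp / n_pos
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_t, best_f = t, f
    return best_t

# The hierarchy-wide threshold for a split is then the average of the
# best thresholds found on its four training folds.
```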
5. CODE SPECIFIC THRESHOLDS

In the previous experiment a single threshold score was set for each hierarchy. In this experiment thresholds are set specific to individual GO codes. This strategy is reasonable to explore as it may well be that, although the averages fall within a small range, the optimal threshold varies considerably across the codes. The overall structure of the experiment is the same as before. Code specific thresholds are set using a 4-fold cross validation experiment on each training set. The selected threshold is the average of the best thresholds for the code across the 4 folds.

Results are presented in table 3. Interestingly, this time the Fscores achieved on the training runs are considerably higher than the Fscores achieved in the test runs of the single threshold experiment (compare with table 2). However, the penalty is clearly paid on the test side, indicating that this code specific strategy over-trains and fails to generalize effectively to new data. The one exception is CC, where the Fscores are about the same in both cases. However, performance for MF and BP drops significantly, by 10.4% and 17.5% respectively. Thus a single threshold over all codes of a hierarchy is superior to code specific thresholding. We also find a pattern similar to the previous experiment in that molecular function is easier to work with than cellular component, which in turn is less challenging than biological process.
Hierarchy  Split  Threshold  Train-Recall  Train-Precision  Train-Fscore  Test-Recall  Test-Precision  Test-Fscore
MF 1 -0.84 0.5624 0.4136 0.4504 0.5992 0.4258 0.4684
MF 2 -0.86 0.5923 0.3835 0.4390 0.6775 0.4073 0.4769
MF 3 -0.86 0.5954 0.3734 0.4328 0.6817 0.3874 0.4684
MF 4 -0.84 0.5713 0.4046 0.449 0.6857 0.4487 0.5134
MF 5 -0.85 0.5921 0.4076 0.4541 0.6772 0.3945 0.4727
MF Average na 0.5827 0.3965 0.4451 0.6643 0.4128 0.48
CC 1 -0.82 0.4799 0.3185 0.3627 0.5301 0.3531 0.3986
CC 2 -0.82 0.4823 0.3214 0.3665 0.5359 0.3516 0.4006
CC 3 -0.86 0.5287 0.2976 0.3590 0.6553 0.3895 0.4571
CC 4 -0.85 0.5122 0.2997 0.3571 0.5703 0.2976 0.3715
CC 5 -0.85 0.5222 0.315 0.3714 0.599 0.29 0.3767
CC Average na 0.5051 0.3104 0.3633 0.5781 0.3364 0.4009
BP 1 -0.87 0.4304 0.2378 0.2847 0.4722 0.2585 0.3079
BP 2 -0.87 0.4377 0.2442 0.2908 0.5259 0.2713 0.3362
BP 3 -0.85 0.4019 0.2615 0.2948 0.4908 0.2884 0.3392
BP 4 -0.84 0.3706 0.2556 0.2794 0.4854 0.2966 0.3484
BP 5 -0.87 0.4519 0.2600 0.3069 0.4608 0.2220 0.2791
BP Average na 0.4185 0.2518 0.2913 0.4870 0.2674 0.3222

Table 2: Results: Using a Common SVM Score Threshold

6. ANALYSIS OF RESULTS

We now analyze the best results obtained thus far, namely those obtained using a single threshold score for all codes of a given hierarchy. Our goal is to gain further insight into the factors influencing the results.

6.1 Recall versus Precision

It is well understood that the same Fscore may be obtained from different combinations of recall and precision. In this regard a key point to note from table 2 (and table 3) is that recall is always considerably higher than precision. Although recall could also be improved, our results indicate that the more serious problem for us lies with precision. Although in general we are making the correct decisions, we are making too many false positive declarations. In other words we need to tighten the constraints and apply some filtering criteria to the positive decisions declared. This angle will be pursued in future research.
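For reference, the Fscore reported throughout appears to be the balanced F-measure, the harmonic mean of precision P and recall R (the paper does not state a different weighting):

```latex
F = \frac{2PR}{P + R}, \qquad
P = \frac{\text{correct positive decisions}}{\text{all positive decisions}}, \qquad
R = \frac{\text{correct positive decisions}}{\text{known correct pairs}}
```

Note that because Fscores are computed per code and then averaged, the reported average Fscore will generally differ from the F computed from the averaged recall and precision.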

6.2 Hierarchical Level & Performance


Table 4 presents the performance achieved at each level of the hierarchies. Note that levels increase with the depth of the tree; more specific codes have higher level numbers. The table identifies the number of codes at each level as well as the average scores. For molecular function, ignoring level 1 which has very few codes, we find that levels 2 and 3 are the most challenging; the remaining MF levels achieve Fscores in the range of 0.4728 to 0.6667. With the cellular component hierarchy, however, Fscore decreases as the level increases (barring level 1, which has only 1 code). Finally, with biological process, after level 2 we observe somewhat stable performance between levels 3 and 6 (0.31-0.32 Fscore). Higher levels, especially level 7, show better performance.

It seems that with the MF and BP hierarchies the difficult decisions are closer to the upper levels. This is contrary to common intuition, which suggests that classifying into more general categories (such as animal or plant) should be easier than classifying into more specific categories (such as hawk or eagle). CC is different in that the decisions become more challenging as we descend the hierarchy. The difference between MF and BP on the one hand and CC on the other could be because of differences in the underlying semantics of the links. As mentioned before, CC links are about evenly split between is-a and part-of, whereas about 75% of BP links are is-a and MF is almost exclusively is-a. These performance differences observed across the levels of the hierarchies have important implications for the design of automated annotation systems for GO.

Hierarchy  Split    Train-Fscore  Test-Recall  Test-Precision  Test-Fscore
MF         1        0.6221        0.4499       0.3989          0.3852
MF         2        0.615         0.5364       0.4402          0.44351
MF         3        0.6128        0.5295       0.3892          0.4133
MF         4        0.6298        0.5793       0.4394          0.452
MF         5        0.6371        0.5467       0.4264          0.4451
MF         average  0.6234        0.5284       0.4188          0.4278
CC         1        0.5541        0.4679       0.3774          0.3842
CC         2        0.5052        0.5029       0.3435          0.3626
CC         3        0.5131        0.5632       0.3806          0.4239
CC         4        0.5554        0.5134       0.3273          0.3727
CC         5        0.5796        0.5148       0.3875          0.4201
CC         average  0.5415        0.5125       0.3632          0.3927
BP         1        0.4469        0.3994       0.2463          0.2554
BP         2        0.4472        0.4017       0.2727          0.2793
BP         3        0.4378        0.3951       0.2531          0.2589
BP         4        0.4248        0.4309       0.2654          0.2804
BP         5        0.4518        0.3710       0.2434          0.2543
BP         average  0.4417        0.3996       0.2562          0.2657

Table 3: Results: Using Dynamic Thresholds for SVM Scores
Hierarchy  Level  # of Codes  Recall  Precision  FScore
MF         1      4           0.3176  0.1786     0.2205
MF         2      26          0.4846  0.2666     0.3176
MF         3      41          0.5261  0.3145     0.3695
MF         4      50          0.6780  0.4449     0.5066
MF         5      57          0.7799  0.4936     0.5732
MF         6      17          0.8937  0.5548     0.6505
MF         7      11          0.6961  0.3876     0.4728
MF         8      4           0.675   0.475      0.5233
MF         9      2           0.8     0.6        0.6667
CC         1      1           0.3171  0.2235     0.2537
CC         2      20          0.6476  0.4017     0.4675
CC         3      25          0.6062  0.349      0.4089
CC         4      26          0.5547  0.306      0.3741
CC         5      14          0.5502  0.2832     0.3622
CC         6      6           0.3955  0.2717     0.3135
CC         7      1           1       0.7917     0.8667
BP         1      3           0.1354  0.0481     0.0704
BP         2      10          0.3327  0.1767     0.2174
BP         3      34          0.5164  0.2517     0.3179
BP         4      54          0.4849  0.2563     0.3119
BP         5      49          0.4681  0.2516     0.3093
BP         6      52          0.4555  0.2734     0.3139
BP         7      51          0.5863  0.3251     0.3921
BP         8      21          0.4677  0.2834     0.3301
BP         9      8           0.4698  0.2840     0.3316

Table 4: Performance by Level

Training size  # MF-codes  MF-Fscore  # CC-codes  CC-Fscore  # BP-codes  BP-Fscore
5              2           0.25       34          0.4067     128         0.2695
6-10           3           0.0833     25          0.3650     65          0.3875
11-15          9           0.4373     7           0.4528     22          0.3716
16-20          37          0.5645     4           0.4550     15          0.3306
21-25          39          0.544      3           0.4762     9           0.2588
26-30          31          0.5566     4           0.3687     4           0.3007
31-35          6           0.4663     3           0.5651     8           0.3566
36-40          7           0.5275     0           0          6           0.3579
41-45          10          0.4124     1           0.2009     5           0.3484
46-50          11          0.4276     1           0.2861     2           0.2553
51-75          18          0.3912     2           0.3430     12          0.3060
76-100         12          0.3936     1           0.2681     6           0.2726
101-125        5           0.4273     2           0.4089     0           0
126-150        4           0.4767     2           0.3226     0           0
151-last       20          0.3511     4           0.4586     1           0.2822

Table 5: Performance by Number of Positives for Training
6.3 Number of Positives for Training & Performance

Table 5 presents average scores for different ranges of the number of positive examples in the training sets. Intuitively, since we are using supervised SVM classifiers, we expect less skewed training data to provide better results. Interestingly, we observe that this does not necessarily hold. For example, with molecular function, higher numbers of true positives do not necessarily yield better Fscores. Limiting our attention to those ranges with at least 10 codes, we find, for example, that having more than 150 examples is significantly worse than having just 16 to 20 positive examples. Observe that, as the total size of the training set is the same for each code, having fewer positives implies that there are more negatives in the sample. With the cellular component hierarchy we restrict our attention to the first 2 rows, as the other cells have too few codes in them. Again we see that fewer examples yield better results. With the BP hierarchy we again see a similar tendency for performance to drop with increasing numbers of positive examples. The exception is the first row, which has a significantly lower Fscore than the next few ranges. These observations are interesting especially because they run counter to the generally accepted notion that with a supervised approach we may expect better results with more positive data.

6.4 Correlations between Level and Number of Positives for Training

Taking this analysis a step further, we explore the relationship between level, positive set size and performance for each code. Table 6 presents the computed correlations.

We find a moderate and significant negative correlation between level and size in the case of MF and BP, but interestingly not in the case of CC. So with MF and BP more specific codes tend to have fewer positives in the training data, but this is not the case with CC. There is also a moderate and significant positive correlation between level and FScore in the case of MF and BP, but again not for CC. That is, we tend to get better Fscores with more specific codes in the MF and BP hierarchies, but not so with CC. Thus with MF and BP we need to pay closer attention to the higher level codes. Once again our results indicate that CC is a hierarchy that might require classification methods different from those appropriate for MF and BP. Again this may be due to the underlying differences in link semantics.

Hier  Level vs Size  Level vs FScore  Size vs FScore
MF    -0.2705*       0.3361*          -0.1146
CC    -0.0123        -0.1051          0.0904
BP    -0.2155*       0.1622*          -0.0191

Table 6: Correlations. * - significant (0.01 significance level)

A second observation may be made from the correlations between performance and the other two variables. Specifically, level is far more important than the number of positives available for training, at least in the case of MF and BP. Thus, in seeking improvements in performance, it would be prudent to develop methods capable of exploiting the level information of the GO codes. The size of the training set, on the other hand, does not correlate with performance. As mentioned before, this is a surprising observation given the commonly accepted notion that larger amounts of (positive) training data tend to yield better performance scores.
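The paper does not name the correlation coefficient used for Table 6; assuming a Pearson correlation with a significance test, the computation could be sketched as:

```python
from scipy.stats import pearsonr

def table6_correlations(per_code_stats):
    """per_code_stats: list of (level, n_positives, fscore) triples, one per
    GO code in a hierarchy. Returns r and a 0.01-level significance flag
    for each pair of variables, as in Table 6."""
    levels = [s[0] for s in per_code_stats]
    sizes = [s[1] for s in per_code_stats]
    fscores = [s[2] for s in per_code_stats]
    pairs = {
        "level_vs_size": (levels, sizes),
        "level_vs_fscore": (levels, fscores),
        "size_vs_fscore": (sizes, fscores),
    }
    results = {}
    for name, (x, y) in pairs.items():
        r, p = pearsonr(x, y)
        results[name] = (r, p < 0.01)
    return results
```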
7. LEVEL SPECIFIC THRESHOLDS

To explore the effect of level further, we adopt a simple strategy of setting the threshold by level. Table 7 shows the effect of this strategy for the MF and BP hierarchies, focusing only on levels 2 and 3. We do not apply this strategy to CC as there was no correlation between level and performance for that hierarchy. Also, we consider only levels 2 and 3, as level 1 has too few codes and these are the levels where we seek improvements.

Interestingly, we find improvements at level 2 for both MF and BP (+7.4% and +4.6% in Fscore respectively). However, the strategy does not work for level 3 in either case. In future research we will consider a different approach, one that involves including examples from the neighborhood of the code, optionally weighted by distance to the code.
Hier.  Split  Level  Original Fscore  Threshold  Final Fscore
MF     1      2      0.3299           -0.8       0.3665
MF     2      2      0.2782           -0.83      0.2973
MF     3      2      0.3298           -0.78      0.373
MF     4      2      0.3484           -0.81      0.373
MF     5      2      0.3016           -0.78      0.3263
MF     avg    2      0.3176           na         0.341 (+7.4%)
MF     1      3      0.3347           -0.87      0.3063
MF     2      3      0.3178           -0.84      0.3301
MF     3      3      0.4243           -0.88      0.3760
MF     4      3      0.4263           -0.87      0.3823
MF     5      3      0.3444           -0.86      0.3464
MF     avg    3      0.3695           na         0.3482 (-5.8%)
BP     1      2      0.2542           -0.87      0.2542
BP     2      2      0.2951           -0.89      0.2989
BP     3      2      0.2261           -0.89      0.2027
BP     4      2      0.1609           -0.87      0.2319
BP     5      2      0.1507           -0.88      0.1494
BP     avg    2      0.2174           na         0.2274 (+4.6%)
BP     1      3      0.2916           -0.86      0.3020
BP     2      3      0.3145           -0.83      0.3455
BP     3      3      0.3496           -0.82      0.3030
BP     4      3      0.3128           -0.83      0.3164
BP     5      3      0.3209           -0.83      0.3529
BP     avg    3      0.3179           na         0.324 (+1.9%)

Table 7: Performance: Level Specific Thresholds
8. RELAXING THE CORRECTNESS CRITERIA

Thus far we have not utilized the hierarchical structure in any way. There are at least two major directions in which the hierarchy may be utilized. One is to use the hierarchy during model building. For example, a node's training data may be augmented with training data from its neighbors ([11]). Alternatively, a top down approach for model building may be employed, with examples that filter through higher level nodes participating in lower level decisions ([4]). Many variations on these themes have been explored in the general machine learning literature. In this research we explore a second direction that has recently attracted the attention of researchers, especially in the context of bioinformatics problems (e.g. [8]). Specifically, we use the hierarchy to relax the criterion for correctness of a classification decision during evaluation. Essentially, we assume that when a document is assigned a GO code it is implicitly assigned the ancestor GO codes as well. This is reasonable since the GO hierarchies encode is-a and part-of semantics along the parent-child links, and these are transitive relationships. With this assumption we relax the calculation of recall and precision, and therefore also of FScore, as follows.

Recall = A/B, where B is, as usual, the number of known correct code-pmid pairs in the dataset. The relaxation is applied to the calculation of A. Consider a code-pmid pair (C, P) which is known to be correct. If our classifiers assign code C to P then A is increased by 1. Otherwise, if our classifiers assign a code C' to P where C' is an ancestor of C, then again A is increased by 1.

Precision = E/F, where F is, as usual, the number of positive decisions declared by the classifiers. The relaxation is applied to the calculation of E. Consider a code-pmid pair (C, P) which is declared positive by our classifiers. If code C is correctly assigned to P then E is increased by 1. Otherwise, if there exists a code C'' which is known to be assigned to P and C is an ancestor of C'', then E is increased by 1.

Note that our relaxed evaluation accepts as correct those decisions that are more general than the correct code, but not those that are more specific. Thus if the target code is glucoside transport, we will accept as correct a classification with the higher level (general) carbohydrate transport code, but not a classification with the lower level (specific) alpha-glucoside transport or beta-glucoside transport codes.

The definition of 'ancestor' can of course be varied depending upon how far up the tree one considers. This is formalized by ANCESTOR LEVEL, a parameter that can be varied systematically. For example, when set to 1, ancestors are limited to parents. Table 8 presents our results using this relaxed evaluation scheme with ANCESTOR LEVEL varying from 1 to 5. Unfortunately, the results indicate that we do not achieve improvements in FScore even when we consider ancestors 5 levels up the hierarchies. But all is not lost, as we see next!
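Before turning to that, the relaxed counting just defined can be sketched directly from the definitions (names are ours; parents is assumed to map each GO code to its parent codes in the DAG):

```python
def ancestors_within(code, parents, max_up):
    """All ancestors of a GO code reachable in at most max_up steps
    (the ANCESTOR LEVEL parameter)."""
    frontier, found = {code}, set()
    for _ in range(max_up):
        frontier = {p for c in frontier for p in parents.get(c, set())}
        found |= frontier
    return found

def relaxed_counts(gold, predicted, parents, max_up):
    """gold and predicted are sets of (code, pmid) pairs.
    Returns the relaxed numerators A (recall) and E (precision)."""
    A = sum(1 for (c, p) in gold
            if (c, p) in predicted
            or any((anc, p) in predicted
                   for anc in ancestors_within(c, parents, max_up)))
    E = sum(1 for (c, p) in predicted
            if (c, p) in gold
            or any(c in ancestors_within(c2, parents, max_up)
                   for (c2, p2) in gold if p2 == p))
    return A, E  # Recall = A / len(gold); Precision = E / len(predicted)
```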
Table 9 takes a different perspective on assessing performance within the context of this experiment. Note first that thus far results have been obtained from averages of scores for each GO code. To explain: we have 5 splits in our experiment design (see section 2), and each GO code appears in each split with a roughly equal number of positive examples. Within a split we first calculate the FScore for each code and then average these FScores. Tables 3 and 4 show such averages for each split as well as the global average. This approach to evaluation reflects a 'code' perspective, with all codes considered equally important. A different way to summarize performance is to consider each code-pmid combination as an independent decision that has to be made. Each combination needs to be declared positive or negative by our classifiers. Thus, given N codes and M pmids, N×M decisions are to be made. Averages may then be computed across the set of decisions in a split. In table 9, results are presented from this perspective of individual decisions. Observe first that we have new baselines identified for each hierarchy. Note also that from the decision perspective CC is the easiest hierarchy, followed by MF and then BP. When compared to these baselines we find steady improvements as the definition of ancestor changes. Using ancestors up to 3 levels above gives improvements of 7.2%, 7.6% and 4.5% for MF, CC and BP respectively. With level 5 we have 7.4%, 8.8% and 6.1% respectively. These improvements indicate that, from a decision perspective, we perform better if we accept decisions that are approximately in the correct vicinity of the target code.

Is the decision perspective useful? The answer is yes. Averaging by code (as done in the previous experiments) tells us which codes are more challenging than others. While designing annotation systems, we need to know code level differences that may lead to tailored strategies. For example, the classifier system may differ by code level in the hierarchies. So the "code perspective" is certainly important. However, the decision perspective is more indicative of performance in terms of our end goal: annotation at the gene product level. The decision perspective implies that each annotation decision, irrespective of code, is equally important.
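As we read them, the two perspectives correspond to macro versus micro averaging in the text classification literature; a minimal sketch of the contrast (names ours):

```python
def code_perspective_fscore(per_code_fscores):
    """'Code perspective' (macro): average the Fscore computed per GO code."""
    return sum(per_code_fscores.values()) / len(per_code_fscores)

def decision_perspective_fscore(gold, predicted):
    """'Decision perspective' (micro): pool all code-pmid decisions and
    compute one Fscore. gold, predicted: sets of (code, pmid) pairs."""
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under micro averaging, codes with many decisions weigh more heavily, which is why the two perspectives can rank the hierarchies differently.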

ANC LEVEL    Recall   Precision  Fscore
MF baseline  0.6643   0.4128     0.4800
MF 1         0.6643   0.419      0.4847
MF 2         0.6650   0.4229     0.4880
MF 3         0.6650   0.4243     0.4888
MF 4         0.6650   0.4245     0.4890
MF 5         0.6650   0.4245     0.4880
CC baseline  0.5781   0.3364     0.4009
CC 1         0.5781   0.3471     0.4082
CC 2         0.5784   0.3509     0.4113
CC 3         0.5784   0.3536     0.4132
CC 4         0.5784   0.3540     0.4136
BP baseline  0.4870   0.2674     0.3222
BP 1         0.4887   0.2724     0.3265
BP 2         0.4887   0.2746     0.3285
BP 3         0.4890   0.2773     0.3301
BP 4         0.4890   0.2776     0.3305
BP 5         0.4890   0.2778     0.3306

Table 8: Performance: Common SVM Score Threshold Runs with Relaxed Correctness Criteria (Code Perspective)

ANC LEVEL    Recall   Precision  Fscore
MF baseline  0.6639   0.3076     0.4100
MF 1         0.6639   0.3195     0.4309
MF 2         0.6652   0.326      0.4370
MF 3         0.6652   0.3288     0.4396
MF 4         0.6652   0.3296     0.4403
MF 5         0.6652   0.3297     0.4404
CC baseline  0.7442   0.3163     0.4432
CC 1         0.7442   0.3321     0.4586
CC 2         0.7458   0.3512     0.4769
CC 3         0.74583  0.3551     0.4804
CC 4         0.7458   0.357      0.4822
CC 5         0.7458   0.3572     0.4823
BP baseline  0.5578   0.2286     0.3236
BP 1         0.5583   0.2360     0.3311
BP 2         0.5583   0.2432     0.3381
BP 3         0.5586   0.2466     0.3415
BP 4         0.5586   0.2482     0.3430
BP 5         0.5586   0.2485     0.3433

Table 9: Performance: Common SVM Score Threshold Runs with Relaxed Correctness Criteria (Decision Perspective)
Finally, we consider the annotation of the gene/gene product (i.e., the locus id) itself. We test a simple strategy of annotating a gene with a code if the code is assigned by our system of classifiers to a document that is relevant to the gene. Using this strategy we obtain for MF an Fscore of 0.31 (recall = 0.35 and precision = 0.28), for CC an Fscore of 0.36 (recall = 0.47 and precision = 0.29) and for BP an Fscore of 0.22 (recall = 0.26 and precision = 0.191). These scores are on the low side, indicating that on the whole the problem of annotation is hard and one that offers many challenges. We observe that the order of difficulty for the hierarchies at the gene product annotation level has CC being easier than MF and then BP. This parallels the order observed with the decision perspective (see table 9). We view these phase 3 results (of the gene annotation problem, see section 2.4) as preliminary. Our focus in this paper is on gaining a better understanding of phase 2, which is document classification with GO codes.
9. CONCLUSIONS

We presented a series of experiments designed to explore the value of Support Vector Machine based classifiers for assigning Gene Ontology codes to MEDLINE documents. We find that by using thresholds selected for each hierarchy, Fscores of 0.48, 0.4 and 0.32 are obtained for the MF, CC and BP hierarchies respectively. This is with a system of SVM classifiers that does not yet capitalize on the hierarchical organization of the codes. Interestingly, threshold selection at the individual code level (as opposed to the full hierarchy) decreases performance due to over training. We explored performance by level and by the number of positives in the training set. The former appears more important, especially for MF and BP. CC in general differs from the other two hierarchies. This may be due to differences in link semantics, as almost 50% of the links in CC are part-of; in contrast, only a fifth of the links in BP are part-of and there is only 1 such link in MF. Setting level specific thresholds for the second highest level of MF and BP leads to appreciable improvements in Fscore, but this was not the case for level 3. Finally, we explored a more relaxed evaluation criterion where classification with a code more general than the target code is considered correct. This yielded appreciable improvements when a decision perspective was taken during evaluation.

From this study we conclude that the hierarchies are different and that hierarchical level is important. Counter to common intuition, more general codes in MF and BP are actually more challenging for classification. Also counter to common intuition, it is not necessarily the case that having more positives in our training data yields better performance.

There are several other ways in which we will exploit the hierarchical structure in future work. For example, we plan to try an ensemble of classifiers where the ensembles are defined through the hierarchy. We also plan to focus more on the codes that have extremely few positive examples (1 to 4). In this study we employed (global) feature weighting in lieu of feature selection. In future work we will explore feature selection, both global and local (code specific), more directly. Finally, we plan to explore other strategies for phase 3 of the annotation problem, which is to determine the codes for a gene/gene product after these codes have been assigned to their relevant documents. The current study has given us a better understanding of the problem of classifying documents with GO codes and prepares us for future work in this direction.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 0312356 awarded to P. Srinivasan. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
10. REFERENCES

[1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25-29, 2000.
[2] C. Blaschke, E. A. Leon, M. Krallinger, and A. Valencia. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 6(Suppl 1)(S16):291-301, May 2005.
[3] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. Training text classifiers with SVM on very few positive examples. Microsoft Corporation Technical Report, MSR-TR-2003-34, 2003.
[4] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1997.
[5] J.-H. Chiang and H.-C. Yu. Extracting functional annotations of proteins based on hybrid text mining approaches. In Proceedings of the BioCreAtIvE Challenge Evaluation Workshop 2004, 2004.
[6] S. Dumais and H. Chen. Hierarchical classification of web content. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR) 2000, pages 256-263, 2000.
[7] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl 1)(S1):795-825, May 2005.
[8] S. Kiritchenko, S. Matwin, and A. Famili. Functional annotation of genes using hierarchical text categorization. In Proceedings of BioLINK SIG: Linking Literature, Information and Knowledge for Biology, 2005.
[9] M. Light, X. Y. Qiu, and P. Srinivasan. The language of bioscience: Facts, speculations and statements in between. In Proceedings of the BioLink 2004 Workshop on Linking Biological Literature, Ontologies and Databases, 2004.
[10] A. Moschitti and R. Basili. Complex linguistic features for text classification: A comprehensive study. In Proceedings of the 26th European Conference on Information Retrieval (ECIR), pages 181-196, 2004.
[11] S. Ray and M. Craven. Learning statistical models for annotating proteins with function information using biomedical text. BMC Bioinformatics, 6(Suppl 1)(S18):291-301, May 2005.
[12] S. B. Rice, G. Nenadic, and B. J. Stapley. Mining protein function from text using term-based support vector machines. BMC Bioinformatics, 6(Suppl 1)(S22):291-301, May 2005.
[13] M. Ruiz and P. Srinivasan. Hybrid hierarchical classifiers for categorization of medical documents. In Proceedings of the American Society for Information Science and Technology, 2003.
[14] M. E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87-118, 2002.
[15] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[16] A. K. Sehgal and P. Srinivasan. Retrieval with gene queries. BMC Bioinformatics, 7(220), April 2006.
[17] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-29, 1996.
[18] W. Wibowo and H. Williams. Minimising errors in hierarchical web categorisation. In Proceedings of the International Conference on Information and Knowledge Management (CIKM) 2002, pages 525-531, 2002.
[19] H. Xie, A. Wasserman, Z. Levine, A. Novik, V. Grebinshy, A. Shoshan, and L. Mintz. Large scale protein annotation through gene ontology. Genome Research, 12:785-794, 2002.
