Hierarchy  ANC LEVEL  Recall  Precision  Fscore
MF         baseline   0.6643  0.4128     0.4800
MF         1          0.6643  0.419      0.4847
MF         2          0.6650  0.4229     0.4880
MF         3          0.6650  0.4243     0.4888
MF         4          0.6650  0.4245     0.4890
MF         5          0.6650  0.4245     0.4880
CC         baseline   0.5781  0.3364     0.4009
CC         1          0.5781  0.3471     0.4082
CC         2          0.5784  0.3509     0.4113
CC         3          0.5784  0.3536     0.4132
CC         4          0.5784  0.3540     0.4136
BP         baseline   0.4870  0.2674     0.3222
BP         1          0.4887  0.2724     0.3265
BP         2          0.4887  0.2746     0.3285
BP         3          0.4890  0.2773     0.3301
BP         4          0.4890  0.2776     0.3305
BP         5          0.4890  0.2778     0.3306

Table 8: Performance: Common SVM Score Threshold Runs with Relaxed Correctness Criteria (Code Perspective)

Finally, we consider the annotation of the gene/gene product (i.e., the locus id) itself. We test a simple strategy of annotating a gene with a code if the code is assigned by our system of classifiers to a document that is relevant to the gene. Using this strategy we obtain for MF an Fscore of 0.31 (recall = 0.35 and precision = 0.28), for CC an Fscore of 0.36 (recall = 0.47 and precision = 0.29) and for BP an Fscore of 0.22 (recall = 0.26 and precision = 0.191). These scores are on the low side, indicating that on the whole the problem of annotation is hard and one that offers many challenges. We observe that the order of difficulty for the hierarchies at the gene product annotation level has CC being easier than MF and then BP. This parallels the order observed with the decision perspective (see table 9). We view these phase 3 (of the gene annotation problem, see section 2.3) results as preliminary. Our focus in this paper is on gaining a better understanding of phase 2, which is document classification with GO codes.

9. CONCLUSIONS
We presented a series of experiments designed to explore the value of Support Vector Machine based classifiers for assigning Gene Ontology codes to MEDLINE documents. We find that by using thresholds selected for each hierarchy, Fscores of 0.48, 0.4 and 0.32 are obtained for the MF, CC and BP hierarchies respectively. This is with a system of SVM classifiers that do not yet capitalize on the hierarchical organization of the codes. Interestingly, threshold selection at the individual code level (as opposed to the full hierarchy) decreases performance due to overtraining. We explored performance by level and by the number of positives in the training set. The former appears more important, especially for MF and BP. CC in general differs from the other two hierarchies. This may be due to differences in link semantics, as almost 50% of the links in CC are "part of" links. In contrast, only a fifth of the links in BP are "part of" links and there is only 1 such link in MF. Setting level specific thresholds for the second highest level of MF and BP led to appreciable improvements in Fscore, but this was not the case for level 3. Finally, we explored a more relaxed evaluation criterion where classification with a more general code compared to the target code is considered correct. This yielded appreciable improvements when a decision perspective was taken during evaluation.

From this study we conclude that the hierarchies are different. Also, hierarchical level is important. Counter to common intuition, more general codes in MF and BP are actually more challenging for classification. Also counter to common intuition, it is not necessarily the case that having more positives in our training data yields better performance.
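To make the relaxed evaluation criterion concrete, the following is a minimal sketch and not the evaluation code used in this study. It assumes that the ANC LEVEL setting bounds how many levels above the target code a predicted code may lie and still be counted as correct, with 0 corresponding to exact matching (the baseline). The function names, the `parents` mapping, and the toy codes are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

Parents = Dict[str, Set[str]]  # hypothetical: code -> set of direct parent codes


def ancestors_within(code: str, parents: Parents, max_levels: int) -> Set[str]:
    """Ancestors of `code` reachable in at most `max_levels` upward steps."""
    found: Set[str] = set()
    frontier = {code}
    for _ in range(max_levels):
        # Move one level up; skip codes already collected (the hierarchy is a DAG).
        frontier = {p for c in frontier for p in parents.get(c, set())} - found
        if not frontier:
            break
        found |= frontier
    return found


def is_relaxed_correct(predicted: str, target: str,
                       parents: Parents, max_levels: int) -> bool:
    """Assumed rule: exact match, or predicted is a close enough ancestor."""
    return predicted == target or predicted in ancestors_within(
        target, parents, max_levels)


def relaxed_prf(predicted: Set[str], targets: Set[str],
                parents: Parents, max_levels: int) -> Tuple[float, float, float]:
    """Precision, recall and F-score for one document under the relaxed rule."""
    # A prediction is correct if it matches at least one target code.
    good_predictions = sum(
        any(is_relaxed_correct(p, t, parents, max_levels) for t in targets)
        for p in predicted)
    # A target is recovered if at least one prediction matches it.
    recovered_targets = sum(
        any(is_relaxed_correct(p, t, parents, max_levels) for p in predicted)
        for t in targets)
    precision = good_predictions / len(predicted) if predicted else 0.0
    recall = recovered_targets / len(targets) if targets else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


if __name__ == "__main__":
    # Toy three-level chain; these are not real GO codes.
    toy_parents: Parents = {"leaf": {"mid"}, "mid": {"top"}, "top": set()}
    for level in range(3):
        print(level, relaxed_prf({"mid"}, {"leaf"}, toy_parents, level))
```

Under this assumed reading, raising the level can only turn incorrect decisions into correct ones, which is consistent with the non-decreasing precision seen across most rows of Table 8.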
There are several other ways in which we will exploit the hierarchical structure in future work. For example, we plan to try an ensemble of classifiers where ensembles are defined through the hierarchy. We also plan to focus more on the codes that have extremely few positive examples (1 to 4). In this study we employed (global) feature weighting in lieu of feature selection. In future work we will explore feature selection, both global and local (code specific) strategies, more directly. Finally, we plan on exploring other strategies for phase 3 of the annotation problem, which is to determine the codes for a gene/gene product after these codes have been assigned to their relevant documents. The current study has given us a better understanding of the problem of classifying documents with GO codes and prepares us for future work in this direction.

Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. 0312356 awarded to P. Srinivasan. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.