DOI: 10.1145/1854776.1854820
research-article

Unsupervised mapping of sentences to biomedical concepts based on integrated information retrieval model and clustering

Published: 02 August 2010

Abstract

Structured information obtained by manually annotating disease descriptions with UMLS Metathesaurus concepts can provide a high-quality, reliable data source for the research community. Although progress has been made in both extent and annotation, only a limited range of diseases has been annotated, largely because of the human resources required. Because annotating text is time-consuming and variation in disease descriptions makes the task difficult, it is useful to develop systems that automatically map biomedical sentences into an ontology. Our goal is to automatically map biomedical sentences to UMLS disease concepts. Previous methods, including statistical methods, still perform worse than simple dictionary-based matching. As an alternative to both, we show how the mapping problem can be viewed as a document retrieval problem: under this perspective, the mapping integrates information from a language model, document frequency, and distance measures. Our improvements rest on a three-step method that uses information retrieval and clustering. In the first step, we retrieve the 10 top-ranked relevant UMLS concept entries using an integrated information retrieval model. In the second step, we cluster the retrieved concept entries according to shared words. In the final step, we select one answer for each cluster using a threshold. Our experimental results are promising: on typical data the method achieves a precision of 73.28%, a recall of 77.51%, and an F-measure of 75.34%, significantly outperforming previous methods based on statistics, dictionaries, and MetaMap by 6.95 to 9.95 percent.
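
The three steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the integrated retrieval model, so the score below is a stand-in (IDF-weighted word overlap normalized by the concept entry), the clustering is a simple single-link grouping on shared words, and the threshold value, the tie-breaking rule, the concept IDs (D1 to D6), and all function names are assumptions made for illustration. The last helper recomputes the F-measure as the harmonic mean of the reported precision and recall.

```python
import math
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(sentence, concepts, k=10):
    """Step 1: rank concept entries against the sentence and keep the top k.

    Stand-in score (an assumption, not the paper's model): IDF mass of the
    words shared with the sentence divided by the IDF mass of the entry.
    """
    n = len(concepts)
    df = {}
    for name in concepts.values():
        for w in tokens(name):
            df[w] = df.get(w, 0) + 1

    def idf(w):
        return math.log(1 + n / df[w])

    query = tokens(sentence)
    scored = []
    for cid, name in concepts.items():
        words = tokens(name)
        shared = query & words
        if shared:
            score = sum(idf(w) for w in shared) / sum(idf(w) for w in words)
            scored.append((score, cid))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]


def cluster(hits, concepts):
    """Step 2: group retrieved entries that share at least one word."""
    clusters = []  # list of (word set, [(score, cid), ...])
    for score, cid in hits:
        words = tokens(concepts[cid])
        merged_words, merged_members = set(words), [(score, cid)]
        rest = []
        for cluster_words, members in clusters:
            if cluster_words & words:
                merged_words |= cluster_words
                merged_members += members
            else:
                rest.append((cluster_words, members))
        clusters = rest + [(merged_words, merged_members)]
    return [members for _, members in clusters]


def select(clusters, concepts, threshold=0.5):
    """Step 3: keep the best entry of each cluster if it passes the threshold.

    Ties go to the entry with more words (assumed to be more specific).
    """
    answers = []
    for members in clusters:
        best = max(members, key=lambda t: (t[0], len(tokens(concepts[t[1]]))))
        if best[0] >= threshold:
            answers.append(best[1])
    return sorted(answers)


def map_sentence(sentence, concepts, k=10, threshold=0.5):
    hits = retrieve(sentence, concepts, k)
    return select(cluster(hits, concepts), concepts, threshold)


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy concept table with made-up IDs (not real UMLS identifiers).
CONCEPTS = {
    "D1": "breast cancer",
    "D2": "cancer",
    "D3": "lung cancer",
    "D4": "diabetes mellitus",
    "D5": "type 2 diabetes mellitus",
    "D6": "heart failure",
}
SENTENCE = "Patients with breast cancer often also have diabetes mellitus."

print(map_sentence(SENTENCE, CONCEPTS))       # ['D1', 'D4']
print(round(f_measure(73.28, 77.51), 2))      # 75.34
```

In this toy run the retrieved entries fall into two clusters, one around "cancer" and one around "diabetes mellitus", and one answer is returned for each. The reported F-measure of 75.34% is consistent with the reported precision and recall under the balanced F-measure formula.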

Cited By

  • (2020) UMLS at 30 years: How it is used and published (Preprint). JMIR Medical Informatics. DOI: 10.2196/20675. Online publication date: 25-May-2020

Published In

BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology
August 2010
705 pages
ISBN:9781450304382
DOI:10.1145/1854776

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bioinformatics
  2. information retrieval
  3. mapping of biomedical terms
  4. text mining

Qualifiers

  • Research-article

Conference

BCB'10

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%
