
A Grid Infrastructure for Text Mining of Full Text Articles and Creation of a Knowledge Base of Gene Relations

  • Conference paper
Biological and Medical Data Analysis (ISBMDA 2005)

Abstract

We demonstrate the application of a grid infrastructure for conducting text mining over distributed data and computational resources. The approach is based on using LexiQuest Mine, a text mining workbench, in a grid computing environment. We describe our architecture and approach and provide an illustrative example of mining full-text journal articles to create a knowledge base of gene relations. The number of patterns found per full-text article increased from 0.74 for a corpus of 1000 articles to 0.83 when the corpus contained 5000 articles. However, mining a corpus of 5000 full-text articles took 26 hours on a single computer, whereas the process was completed in less than 2.5 hours on a grid comprising 20 computers. Thus, whilst increasing the size of the corpus improved the efficiency of the text-mining process, a grid infrastructure was required to complete the task in a timely manner.
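The figures quoted in the abstract can be turned into a rough back-of-envelope check. The sketch below is not part of the paper: the function names are hypothetical, the grid time is treated as exactly 2.5 hours even though the abstract only gives it as an upper bound (so the speedup and efficiency are lower bounds), and the pattern totals are approximations recovered from rounded per-article rates.

```python
# Back-of-envelope arithmetic on the numbers reported in the abstract.
# All helper names are illustrative assumptions, not taken from the paper.

def speedup(serial_hours: float, parallel_hours: float) -> float:
    """Ratio of single-machine runtime to grid runtime."""
    return serial_hours / parallel_hours


def parallel_efficiency(serial_hours: float, parallel_hours: float, nodes: int) -> float:
    """Speedup divided by the number of grid nodes (1.0 would be ideal scaling)."""
    return speedup(serial_hours, parallel_hours) / nodes


def approx_total_patterns(patterns_per_article: float, articles: int) -> int:
    """Approximate pattern count implied by a rounded per-article rate."""
    return round(patterns_per_article * articles)


SERIAL_HOURS = 26.0            # 5000 articles on a single computer
GRID_HOURS_UPPER_BOUND = 2.5   # "less than 2.5 hours" on the grid
GRID_NODES = 20

min_speedup = speedup(SERIAL_HOURS, GRID_HOURS_UPPER_BOUND)
min_efficiency = parallel_efficiency(SERIAL_HOURS, GRID_HOURS_UPPER_BOUND, GRID_NODES)

patterns_1000 = approx_total_patterns(0.74, 1000)
patterns_5000 = approx_total_patterns(0.83, 5000)

print(f"speedup of at least {min_speedup:.1f}x on {GRID_NODES} nodes")
print(f"parallel efficiency of at least {min_efficiency:.0%}")
print(f"roughly {patterns_1000} patterns from 1000 articles")
print(f"roughly {patterns_5000} patterns from 5000 articles")
```

Under these assumptions the grid run is at least about ten times faster than the single machine, which is roughly half of the ideal 20-fold speedup; the abstract does not say where the remaining time goes.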




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Natarajan, J., Mulay, N., DeSesa, C., Hack, C.J., Dubitzky, W., Bremer, E.G. (2005). A Grid Infrastructure for Text Mining of Full Text Articles and Creation of a Knowledge Base of Gene Relations. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds) Biological and Medical Data Analysis. ISBMDA 2005. Lecture Notes in Computer Science, vol 3745. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573067_11

  • DOI: https://doi.org/10.1007/11573067_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29674-4

  • Online ISBN: 978-3-540-31658-9

  • eBook Packages: Computer Science, Computer Science (R0)
