Abstract
We demonstrate the application of a grid infrastructure for conducting text mining over distributed data and computational resources. The approach is based on using LexiQuest Mine, a text mining workbench, in a grid computing environment. We describe our architecture and approach and provide an illustrative example of mining full-text journal articles to create a knowledge base of gene relations. The number of patterns found per full-text article increased from 0.74 for a corpus of 1000 articles to 0.83 for a corpus of 5000 articles. However, mining a corpus of 5000 full-text articles took 26 hours on a single computer, whereas the process completed in less than 2.5 hours on a grid comprising 20 computers. Thus, whilst increasing the size of the corpus improved the efficiency of the text-mining process, a grid infrastructure was required to complete the task in a timely manner.
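The timing figures above imply a lower bound on the speedup obtained from the grid. The following is a minimal sketch of that arithmetic, not part of the original paper: the function and variable names are illustrative assumptions, and because the abstract reports only "less than 2.5 hours" for the grid run, the results are bounds rather than measured values. The pattern totals are likewise only approximate, since the per-article rates are rounded.

```python
def speedup(serial_hours: float, parallel_hours: float) -> float:
    """Ratio of single-machine run time to grid run time."""
    return serial_hours / parallel_hours


def parallel_efficiency(serial_hours: float, parallel_hours: float, workers: int) -> float:
    """Speedup divided by the number of machines (1.0 would be ideal scaling)."""
    return speedup(serial_hours, parallel_hours) / workers


# Figures taken from the abstract (5000-article corpus).
SERIAL_HOURS = 26.0            # single computer
GRID_HOURS_UPPER_BOUND = 2.5   # grid run took *less than* this
GRID_NODES = 20

# Because the grid time is an upper bound, these are lower bounds.
min_speedup = speedup(SERIAL_HOURS, GRID_HOURS_UPPER_BOUND)
min_efficiency = parallel_efficiency(SERIAL_HOURS, GRID_HOURS_UPPER_BOUND, GRID_NODES)

# Approximate pattern totals implied by the rounded per-article rates.
patterns_1000 = round(0.74 * 1000)
patterns_5000 = round(0.83 * 5000)

print(f"speedup >= {min_speedup:.1f}x on {GRID_NODES} nodes")
print(f"parallel efficiency >= {min_efficiency:.0%}")
print(f"approx. patterns: {patterns_1000} (1000 articles), {patterns_5000} (5000 articles)")
```

Under these assumptions the grid delivers a speedup of at least about 10.4 on 20 machines, which corresponds to a parallel efficiency of at least about 52%. The abstract does not say where the remaining time goes.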
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Natarajan, J., Mulay, N., DeSesa, C., Hack, C.J., Dubitzky, W., Bremer, E.G. (2005). A Grid Infrastructure for Text Mining of Full Text Articles and Creation of a Knowledge Base of Gene Relations. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds) Biological and Medical Data Analysis. ISBMDA 2005. Lecture Notes in Computer Science, vol 3745. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573067_11
DOI: https://doi.org/10.1007/11573067_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29674-4
Online ISBN: 978-3-540-31658-9
eBook Packages: Computer Science (R0)