Porting a lexicalized-grammar parser to the biomedical domain

Published: 01 October 2009

Abstract

This paper introduces a state-of-the-art, linguistically motivated statistical parser to the biomedical text mining community, and proposes a method of adapting it to the biomedical domain requiring only limited resources for data annotation. The parser was originally developed using the Penn Treebank and is therefore tuned to newspaper text. Our approach takes advantage of a lexicalized grammar formalism, Combinatory Categorial Grammar (CCG), to train the parser at a lower level of representation than full syntactic derivations. The CCG parser uses three levels of representation: a first level consisting of part-of-speech (POS) tags; a second level consisting of more fine-grained CCG lexical categories; and a third, hierarchical level consisting of CCG derivations. We find that simply retraining the POS tagger on biomedical data leads to a large improvement in parsing performance, and that using annotated data at the intermediate lexical category level of representation improves parsing accuracy further. We describe the procedure involved in evaluating the parser, and obtain accuracies for biomedical data in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical resource on which we evaluate. Our conclusion is that porting newspaper parsers to the biomedical domain, at least for parsers that use lexicalized grammars, may not be as difficult as first thought.
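
The three levels of representation named in the abstract can be illustrated with a toy sketch. This is not the parser described in the paper: the lexicons `POS_LEXICON` and `CATEGORY_LEXICON`, the helper names, and the example sentence are all hypothetical, lookups stand in for the statistical taggers, and only forward and backward application are implemented. Bare nouns such as "DNA" are given the category NP directly, which is a simplifying assumption.

```python
# Toy illustration of three levels: POS tags -> CCG lexical categories -> derivation.
# All lexicons and names below are hypothetical and exist only for this sketch.

# Level 1: part-of-speech tags (a lookup stands in for a trained POS tagger).
POS_LEXICON = {"the": "DT", "protein": "NN", "binds": "VBZ", "DNA": "NNP"}

# Level 2: CCG lexical categories, keyed by (word, POS tag).
# An atomic category is a string; a complex category is (result, slash, argument).
S_BACK_NP = ("S", "\\", "NP")                      # S\NP
CATEGORY_LEXICON = {
    ("the", "DT"): ("NP", "/", "N"),               # NP/N
    ("protein", "NN"): "N",
    ("binds", "VBZ"): (S_BACK_NP, "/", "NP"),      # (S\NP)/NP
    ("DNA", "NNP"): "NP",                          # simplification: bare noun as NP
}


def combine(left, right):
    """Apply forward (X/Y Y => X) or backward (Y X\\Y => X) application."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def parse(words):
    """Level 3: CKY-style chart combination; returns categories spanning the sentence."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        tag = POS_LEXICON[word]                    # level 1
        chart[(i, i + 1)] = {CATEGORY_LEXICON[(word, tag)]}  # level 2
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        result = combine(left, right)
                        if result is not None:
                            cell.add(result)
            chart[(i, j)] = cell
    return chart[(0, n)]


print(parse("the protein binds DNA".split()))      # {'S'}
```

In this sketch, changing an entry at level 1 or level 2 changes what level 3 can build, which is the dependency the abstract relies on when it reports gains from retraining only the lower levels on biomedical data.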

Publisher

Elsevier Science

San Diego, CA, United States

Author Tags

  1. Annotation
  2. Evaluation
  3. Porting
  4. Statistical parsing
  5. Tagging
  6. CCG

Qualifiers

  • Article
