Porting a lexicalized-grammar parser to the biomedical domain

Authors:

Stephen Clark

Journal of Biomedical Informatics, Volume 42, Issue 5

Pages 852 - 865

https://doi.org/10.1016/j.jbi.2008.12.004

Published: 01 October 2009

Abstract

This paper introduces a state-of-the-art, linguistically motivated statistical parser to the biomedical text mining community, and proposes a method of adapting it to the biomedical domain requiring only limited resources for data annotation. The parser was originally developed using the Penn Treebank and is therefore tuned to newspaper text. Our approach takes advantage of a lexicalized grammar formalism, Combinatory Categorial Grammar (CCG), to train the parser at a lower level of representation than full syntactic derivations. The CCG parser uses three levels of representation: a first level consisting of part-of-speech (POS) tags; a second level consisting of more fine-grained CCG lexical categories; and a third, hierarchical level consisting of CCG derivations. We find that simply retraining the POS tagger on biomedical data leads to a large improvement in parsing performance, and that using annotated data at the intermediate lexical category level of representation improves parsing accuracy further. We describe the procedure involved in evaluating the parser, and obtain accuracies for biomedical data in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical resource on which we evaluate. Our conclusion is that porting newspaper parsers to the biomedical domain, at least for parsers which use lexicalized grammars, may not be as difficult as first thought.
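The three levels of representation named in the abstract can be illustrated with a toy example. The sketch below is not the parser described in the paper: the sentence, the POS tags, the category assignments, and the names `Atom`, `Func`, `forward`, and `backward` are all assumptions chosen for illustration. It only shows how POS tags and CCG lexical categories are per-word labels, while a derivation combines those categories hierarchically using the standard CCG application rules.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """An atomic CCG category such as S or NP."""
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Func:
    """A complex category: result/arg seeks its argument to the right,
    result\\arg seeks it to the left."""
    result: "Cat"
    slash: str
    arg: "Cat"

    def __str__(self):
        return f"({self.result}{self.slash}{self.arg})"


Cat = Union[Atom, Func]


def forward(left, right):
    # Forward application: X/Y followed by Y yields X.
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward(left, right):
    # Backward application: Y followed by X\Y yields X.
    if isinstance(right, Func) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


S, NP = Atom("S"), Atom("NP")
TRANSITIVE_VERB = Func(Func(S, "\\", NP), "/", NP)  # (S\NP)/NP

# Hypothetical toy sentence; tags and categories are assumed, not from the paper.
words = ["IL-2", "activates", "T-cells"]
pos_tags = ["NN", "VBZ", "NNS"]           # level 1: POS tags
lex_cats = [NP, TRANSITIVE_VERB, NP]      # level 2: CCG lexical categories

# Level 3: a derivation built by combining adjacent categories.
verb_phrase = forward(lex_cats[1], lex_cats[2])   # verb + object
sentence = backward(lex_cats[0], verb_phrase)     # subject + verb phrase

for word, tag, cat in zip(words, pos_tags, lex_cats):
    print(word, tag, cat)
print("derivation root:", sentence)
```

Because the lexical categories already encode most of the combinatory behaviour of each word, annotating data at that intermediate level is plausibly cheaper than annotating full derivations, which is the property the abstract says the approach exploits.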

Copyright © 2008 Elsevier Inc.

Publisher

Elsevier Science

San Diego, CA, United States
