Learning to detect English and Hungarian light verb constructions

Published: 21 June 2013

Abstract

Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost its meaning to some degree. They are syntactically flexible, and their meaning can only be partially computed from the meaning of their parts, so they require special treatment in natural language processing. The first step toward such treatment is to identify light verb constructions.
In this study, we present FXTagger, our conditional random fields-based tool for identifying light verb constructions. The flexibility of the tool is demonstrated on two typologically different languages, English and Hungarian. Because earlier studies labeled different linguistic phenomena as light verb constructions, we first present a linguistics-based classification of light verb constructions and then show that FXTagger is able to identify different classes of light verb constructions in both languages.
Different types of texts may contain different types of light verb constructions; moreover, the frequency of light verb constructions may differ from domain to domain. Hence we focus on the portability of models trained on different corpora, and we also investigate how simple domain adaptation techniques can reduce the gap between domains. Our results show that, in spite of domain specificities, out-of-domain data can also contribute to successful light verb construction (LVC) detection in all domains.
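To make the sequence-labeling framing concrete, here is a minimal sketch of how annotated light verb constructions could be turned into per-token labels of the kind a conditional random fields tagger is trained on. The abstract does not say which label scheme FXTagger uses; the BIO scheme, the `B-LVC`/`I-LVC` label names, and the helper functions `bio_encode` and `bio_decode` are all assumptions made for illustration. The sketch also treats each construction as a contiguous token span, which is a simplification given that these constructions are syntactically flexible.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_encode(tokens: List[str], spans: List[Span]) -> List[str]:
    """Return one BIO label per token (hypothetical label names)."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-LVC"          # first token of the construction
        for i in range(start + 1, end):
            labels[i] = "I-LVC"          # remaining tokens of the construction
    return labels


def bio_decode(labels: List[str]) -> List[Span]:
    """Recover contiguous spans from a BIO label sequence."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == "B-LVC":
            if start is not None:        # close a span that directly precedes
                spans.append((start, i))
            start = i
        elif label == "I-LVC" and start is not None:
            continue                     # still inside the current span
        else:
            if start is not None:        # an "O" (or stray "I") ends the span
                spans.append((start, i))
            start = None
    if start is not None:                # span that runs to the end
        spans.append((start, len(labels)))
    return spans


tokens = ["She", "made", "a", "decision", "yesterday"]
labels = bio_encode(tokens, [(1, 4)])    # "made a decision" as one construction
decoded = bio_decode(labels)
```

A tagger trained on such labels predicts one label per token, and decoding the predicted sequence yields the detected constructions. Handling non-adjacent verb and noun components would need a richer scheme than the one assumed here.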

    Published In

    ACM Transactions on Speech and Language Processing, Volume 10, Issue 2
    Special issue on multiword expressions: From theory to practice and use, part 1
    June 2013
    91 pages
    ISSN: 1550-4875
    EISSN: 1550-4883
    DOI: 10.1145/2483691
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 June 2013
    Accepted: 01 February 2013
    Revised: 01 October 2012
    Received: 01 June 2012
    Published in TSLP Volume 10, Issue 2

    Author Tags

    1. Conditional random fields
    2. English
    3. Hungarian
    4. corpora
    5. domain adaptation
    6. light verb constructions
    7. multiword expressions

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) Determining sentiment views of verbal multiword expressions using linguistic features. Natural Language Engineering, 1-38. DOI: 10.1017/S1351324923000153. Online publication date: 15-May-2023
    • (2022) The Relation Dimension in the Identification and Classification of Lexically Restricted Word Co-Occurrences in Text Corpora. Mathematics 10:20 (3831). DOI: 10.3390/math10203831. Online publication date: 17-Oct-2022
    • (2020) Using automatic constructed thesauri instead of dictionaries in the verbal phraseological units validation task. Journal of Intelligent & Fuzzy Systems, 1-10. DOI: 10.3233/JIFS-179872. Online publication date: 12-Jun-2020
    • (2019) An unsupervised method for automatic validation of verbal phraseological units. Journal of Intelligent & Fuzzy Systems, 1-7. DOI: 10.3233/JIFS-179009. Online publication date: 8-Apr-2019
    • (2019) Detecting light verb constructions across languages. Natural Language Engineering, 1-30. DOI: 10.1017/S1351324919000330. Online publication date: 15-Jul-2019
    • (2014) Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut. Transactions of the Association for Computational Linguistics 2, 193-206. DOI: 10.1162/tacl_a_00176. Online publication date: Dec-2014
