A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing

Published: 01 November 2014

    Abstract

    Sag et al. (2002) highlighted a long-underappreciated problem in natural language processing (NLP): how to handle idiosyncratic multiword expressions (MWEs) such as idioms, quasi-idioms, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings. Since then, many attempts have been made to extract these expressions from corpora and to construct lexicons of them, but no extensive, reliable solution has yet been realized. This paper presents an overview of a comprehensive lexicon of Japanese multiword expressions (Japanese MWE Lexicon: JMWEL), compiled to enable linguistically precise, wide-coverage Japanese natural language processing systems. The JMWEL is characterized by significant notational, syntactic, and semantic diversity, as well as by detailed descriptions of the syntactic functions, structures, and flexibilities of MWEs. The lexicon contains about 111,000 header entries written in kana (phonetic characters) and almost 820,000 variants of them written in kana and kanji (ideographic characters). The paper demonstrates the JMWEL's validity mainly by comparing the lexicon with a large-scale Japanese N-gram frequency dataset, LDC2009T08, generated by Google Inc. (Kudo and Kazawa, 2009). The present work attempts to provide a tentative answer for Japanese, from outside statistical empiricism, to the question posed by Church (2011): "How many multiword expressions do people know?"

    References

    [1]
    IPADIC Version 2.7.0 User's Manual. NAIST, Information Science Division.
    [2]
    Multiword expressions: some problems for Japanese NLP. In: Proceedings of the 8th Annual Meeting of the Association for Natural Language Processing, pp. 379-382.
    [3]
    A statistical approach to the semantics of verb-particles. In: Proceedings of ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 65-72.
    [4]
    A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In: Proceedings of ACL 2007 Workshop on A Broader Perspective on Multiword Expressions, pp. 1-8.
    [5]
    Frozen sentences of Portuguese: formal descriptions for NLP. In: Proceedings of ACL 2004 Workshop on Multiword Expressions: Integrating Processing, pp. 72-79.
    [6]
    Bouma, G., Villada, B., 2002. Corpus-based acquisition of collocational prepositional phrases. In: Theune, M., Nijholt, A., Hondorp, H. (Eds.), CLIN, Selected Papers from the Twelfth CLIN Meeting, Rodopi, Amsterdam, New York, pp. 23-37.
    [7]
    Corpus-based extraction of Japanese compound verbs. In: Proceedings of the 2009 Australasian Language Technology Workshop (ALTW 2009), pp. 35-43.
    [8]
    Towards best practice for multiword expressions in computational lexicons. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pp. 1934-1940.
    [9]
    How many multiword expressions do people know? In: Proceedings of ACL 2011 Workshop on Multiword Expressions: From Parsing and Generation to the Real World, pp. 137-144.
    [10]
    Multiword expressions: linguistic precision and reusability. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pp. 1941-1947.
    [11]
    Automatically constructing a lexicon of verb phrase idiomatic combinations. In: Proceedings of the 11th Conference of the European Chapter of the ACL (EACL 2006), pp. 337-344.
    [12]
    . In: Fellbaum, C. (Ed.), WordNet. An Electronic Lexical Database, MIT Press, Cambridge, MA.
    [13]
    Corpus-based studies of German idioms and light verbs. International Journal of Lexicography. v19 i4. 349-360.
    [14]
    Combining resources for MWE-token classification. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics (SEM 2012), pp. 100-104.
    [15]
    Have as a function word. Language Learning. v1 i3. 4-8.
    [16]
    Lexicon-grammar. The representation of compound words. In: Proceedings of the 11th International Conference on Computational Linguistics (COLING'86), pp. 1-6.
    [17]
    Compilation of an idiom example database for supervised idiom identification. Language Resources and Evaluation. v43 i4. 355-384.
    [18]
    Standardizing complex functional expressions in Japanese predicates: applying theoretically-based paraphrasing rules. In: Proceedings of COLING 2010 Workshop on Multiword Expressions: From Theory to Applications, pp. 63-71.
    [19]
    The Architecture of the Language Faculty. MIT Press, Cambridge, MA.
    [20]
    A formalism for dependency grammar based on tree adjoining grammar. In: Proceedings of the Conference on Meaning-Text Theory, pp. 207-216.
    [21]
    Multi-word expressions as discourse relation markers (DRMs). In: Proceedings of COLING 2010 Workshop on Multiword Expressions: From Theory to Applications, Invited Talk, p. 89.
    [22]
    Shogakusei no Manga Kanyouku Jiten. Shogakukan, Tokyo.
    [23]
    Shogakukan Gakushu Kokugo Shin Jiten Zentei Dainihan. 2nd ed. Shogakukan.
    [24]
    Shin Reinbo Shogaku Kokugo Jiten. Gakken, Tokyo.
    [25]
    Large scale collocation data and their application to Japanese word processor technology. In: Proceedings of the 17th International Conference on Computational Linguistics (COLING'98), pp. 694-698.
    [26]
    Japanese Web N-gram Version 1. Linguistic Data Consortium, Philadelphia.
    [27]
    SAID: A Syntactically Annotated Idiom Dataset. Linguistic Data Consortium 2003T10.
    [28]
    An electronic dictionary of French multiword adverbs. In: Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions, pp. 31-34.
    [29]
    Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
    [30]
    Compilation of a dictionary of Japanese functional expressions with hierarchical organization. In: Proceedings of the 21st International Conference on Computer Processing of Oriental Languages (ICCPOL 2006), pp. 395-402.
    [31]
    Detecting a continuum of compositionality in phrasal verbs. In: Proceedings of ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 73-80.
    [32]
    Usage and Semantics of Idioms. Meiji Shoin, Tokyo.
    [33]
    Seigorin - Koji Kotowaza Kanyouku Jiten. Obunsha, Tokyo.
    [34]
    A statistical corpus-based term extractor. In: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pp. 36-46.
    [35]
    Synonymy in collocation extraction. In: Proceedings of NAACL 2001 Workshop: WordNet and Other Lexical Resources: Applications, Extensions and Customizations, pp. 41-46.
    [36]
    A machine learning approach to multiword expression extraction. In: Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions, pp. 54-57.
    [37]
    Multiword expressions: a pain in the neck for NLP. In: Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2002), pp. 1-15.
    [38]
    Compilation of a comparative list of basic Japanese idioms from five sources. In: IPSJ 2007-NL-178, pp. 1-6.
    [39]
    Exploring vector space models to predict the compositionality of German noun-noun compounds. In: Proceedings of the Second Joint Conference on Lexical and Computational Semantics (SEM 2013), pp. 255-265.
    [40]
    Studies on Japanese language processing by a bunsetsu-phrase structural model. The Bulletin of the Institute for Advanced Research of Fukuoka University. v45. 1-119.
    [41]
    Morphological aspect of Japanese language processing. In: Proceedings of the 8th International Conference on Computational Linguistics (COLING'80), pp. 1-8.
    [42]
    MWEs as non-propositional content indicators. In: Proceedings of ACL 2004 Workshop on Multiword Expressions: Integrating Processing, pp. 31-39.
    [43]
    A comprehensive dictionary of multiword expressions. In: Proceedings of the 49th Annual Meeting of Association for Computational Linguistics (ACL 2011), pp. 161-170.
    [44]
    Unsupervised metaphor paraphrasing using a vector space model. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pp. 1121-1130.
    [45]
    Licensing complex prepositions via lexical constraints. In: Proceedings of ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 97-104.
    [46]
    A disambiguation of compound verbs. In: Proceedings of ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 81-88.
    [47]
    Lexical encoding of MWEs. In: Proceedings of ACL 2004 Workshop on Multiword Expressions: Integrating Processing, pp. 80-87.
    [48]
    Construction of a Chinese idiom knowledge base and its applications. In: Proceedings of COLING 2010 Workshop on Multiword Expressions: From Theory to Applications, pp. 10-17.
    [49]
    Nihongo Kanyouku Jiten. Tokyo-do Shuppan, Tokyo.

    Published In

    Computer Speech and Language, Volume 28, Issue 6
    November 2014
    111 pages

    Publisher

    Academic Press Ltd.
    United Kingdom

    Author Tags

    1. Dependency structure
    2. Internal modification
    3. Lexicon
    4. Linguistic idiosyncrasy
    5. Multiword expression (MWE)
    6. Natural language processing
    7. Non-compositionality