Valia Kordoni

2023

pdf bib abs
A corpus of metaphors as register markers
Markus Egg | Valia Kordoni
Findings of the Association for Computational Linguistics: EACL 2023

The paper presents our work on corpus annotation for metaphor in German. Metaphors denote entities that are similar to their literal referent, e.g., when *Licht* ‘light’ is used in the sense of ‘hope’. We are interested in the relation between metaphor and register, hence, the corpus includes material from different registers. We focussed on metaphors that can serve as register markers and can also be reliably identified for annotation. Our results show huge differences between registers in metaphor usage, which we interpret in terms of specific properties of the registers.

2022

pdf bib abs
A Gentle Introduction to Deep Nets and Opportunities for the Future
Kenneth Church | Valia Kordoni | Gary Marcus | Ernest Davis | Yanjun Ma | Zeyu Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The first half of this tutorial will make deep nets more accessible to a broader audience, following “Deep Nets for Poets” and “A Gentle Introduction to Fine-Tuning.” We will also introduce GFT (general fine tuning), a little language for fine tuning deep nets with short (one line) programs that are as easy to code as regression in statistics packages such as R using glm (general linear models). Based on the success of these methods on a number of benchmarks, one might come away with the impression that deep nets are all we need. However, we believe the glass is half-full: while there is much that can be done with deep nets, there is always more to do. The second half of this tutorial will discuss some of these opportunities.

pdf bib abs
Metaphor annotation for German
Markus Egg | Valia Kordoni
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The paper presents current work on a German corpus annotated for metaphor. Metaphors denote entities or situations that are in some sense similar to the literal referent, e.g., when “Handschrift” ‘signature’ is used in the sense of ‘distinguishing mark’ or the suppression of hopes is introduced by the verb “verschütten” ‘bury’. The corpus is part of a project on register, hence, includes material from different registers that represent register variation along a number of important dimensions, but we believe that it is of interest to research on metaphor in general. The corpus extends previous annotation initiatives in that it not only annotates the metaphoric expressions themselves but also their respective relevant contexts that trigger a metaphorical interpretation of the expressions. For the corpus, we developed extended annotation guidelines, which specifically focus not only on the identification of these metaphoric contexts but also analyse in detail specific linguistic challenges for metaphor annotation that emerge due to the grammar of German.

2021

pdf bib
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

pdf bib abs
Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Where have we been, and where are we going? It is easier to talk about the past than the future. These days, benchmarks evolve more bottom up (such as papers with code). There used to be more top-down leadership from government (and industry, in the case of systems, with benchmarks such as SPEC). Going forward, there may be more top-down leadership from organizations like MLPerf and/or influencers like David Ferrucci, who was responsible for IBM’s success with Jeopardy, and has recently written a paper suggesting how the community should think about benchmarking for machine comprehension. Tasks such as reading comprehension become even more interesting as we move beyond English. Multilinguality introduces many challenges, and even more opportunities.

2018

pdf bib abs
Beyond Multiword Expressions: Processing Idioms and Metaphors
Valia Kordoni
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Idioms and metaphors are characteristic to all areas of human activity and to all types of discourse. Their processing is a rapidly growing area in NLP, since they have become a big challenge for NLP systems. Their omnipresence in language has been established in a number of corpus studies and the role they play in human reasoning has also been confirmed in psychological experiments. This makes idioms and metaphors an important research area for computational and cognitive linguistics, and their automatic identification and interpretation indispensable for any semantics-oriented NLP application. This tutorial aims to provide attendees with a clear notion of the linguistic characteristics of idioms and metaphors, computational models of idioms and metaphors using state-of-the-art NLP techniques, their relevance for the intersection of deep learning and natural language processing, what methods and resources are available to support their use, and what more could be done in the future. Our target audience are researchers and practitioners in machine learning, parsing (syntactic and semantic) and language technology, not necessarily experts in idioms and metaphors, who are interested in tasks that involve or could benefit from considering idioms and metaphors as a pervasive phenomenon in human language and communication.

2017

pdf bib abs
Beyond Words: Deep Learning for Multiword Expressions and Collocations
Valia Kordoni
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Deep learning has recently shown much promise for NLP applications. Traditionally, in most NLP approaches, documents or sentences are represented by a sparse bag-of-words representation. There is now a lot of work which goes beyond this by adopting a distributed representation of words, by constructing a so-called “neural embedding” or vector space representation of each word or document. The aim of this tutorial is to go beyond the learning of word vectors and present methods for learning vector representations for Multiword Expressions and bilingual phrase pairs, all of which are useful for various NLP applications. This tutorial aims to provide attendees with a clear notion of the linguistic and distributional characteristics of Multiword Expressions (MWEs), their relevance for the intersection of deep learning and natural language processing, what methods and resources are available to support their use, and what more could be done in the future. Our target audience are researchers and practitioners in machine learning, parsing (syntactic and semantic) and language technology, not necessarily experts in MWEs, who are interested in tasks that involve or could benefit from considering MWEs as a pervasive phenomenon in human language and communication.

2016

pdf bib
Proceedings of the 12th Workshop on Multiword Expressions
Valia Kordoni | Kostadin Cholakov | Markus Egg | Stella Markantonatou | Preslav Nakov
Proceedings of the 12th Workshop on Multiword Expressions

pdf bib
Using Word Embeddings for Improving Statistical Machine Translation of Phrasal Verbs
Kostadin Cholakov | Valia Kordoni
Proceedings of the 12th Workshop on Multiword Expressions

pdf bib abs
Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian
Maja Popović | Kostadin Cholakov | Valia Kordoni | Nikola Ljubešić
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

Massive Open Online Courses have been growing rapidly in size and impact. Yet the language barrier constitutes a major growth impediment in reaching out all people and educating all citizens. A vast majority of educational material is available only in English, and state-of-the-art machine translation systems still have not been tailored for this peculiar genre. In addition, a mere collection of appropriate in-domain training material is a challenging task. In this work, we investigate statistical machine translation of lecture subtitles from English into Croatian, which is morphologically rich and generally weakly supported, especially for the educational domain. We show that results comparable with publicly available systems trained on much larger data can be achieved if a small in-domain training set is used in combination with additional in-domain corpus originating from the closely related Serbian language.

The present work is an overview of the TraMOOC (Translation for Massive Open Online Courses) research and innovation project, a machine translation approach for online educational content. More specifically, videolectures, assignments, and MOOC forum text is automatically translated from English into eleven European and BRIC languages. Unlike previous approaches to machine translation, the output quality in TraMOOC relies on a multimodal evaluation schema that involves crowdsourcing, error type markup, an error taxonomy for translation model comparison, and implicit evaluation via text mining, i.e. entity recognition and its performance comparison between the source and the translated text, and sentiment analysis on the students’ forum posts. Finally, the evaluation output will result in more and better quality in-domain parallel data that will be fed back to the translation engine for higher quality output. The translation service will be incorporated into the Iversity MOOC platform and into the VideoLectures.net digital library portal.

2015

bib abs
Robust Semantic Analysis of Multiword Expressions with FrameNet
Miriam R. L. Petruck | Valia Kordoni
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial will give participants a solid understanding of the linguistic features of multiword expressions (MWEs), focusing on the semantics of such expressions and their importance for natural language processing and language technology, with particular attention to the way that FrameNet (framenet.icsi.berkeley.edu) handles this widespread phenomenon. Our target audience includes researchers and practitioners of language technology, not necessarily experts in MWEs or knowledgeable about FrameNet, who are interested in NLP tasks that involve or could benefit from considering MWEs as a pervasive phenomenon in human language and communication. NLP research has been interested in automatic processing of multiword expressions, with reports on and tasks relating to such efforts presented at workshops and conferences for at least ten years (e.g. ACL 2003, LREC 2008, COLING 2010, EACL 2014). Overcoming the challenge of automatically processing MWEs remains elusive in part because of the difficulty in recognizing, acquiring, and interpreting such forms. Indeed the phenomenon manifests in a range of linguistic forms (as Sag et al. (2001), among many others, have documented), including: noun + noun compounds (e.g. fish knife, health hazard etc.); adjective + noun compounds (e.g. political agenda, national interest, etc.); particle verbs (shut up, take out, etc.); prepositional verbs (e.g. look into, talk into, etc.); VP idioms, such as kick the bucket, and pull someone’s leg, along with less obviously idiomatic forms like answer the door, mention someone’s name, etc.; expressions that have their own mini-grammars, such as names with honorifics and terms of address (e.g. Rabbi Lord Jonathan Sacks), kinship terms (e.g. second cousin once removed), and time expressions (e.g. January 9, 2015); support verb constructions (e.g. verbs: take a bath, make a promise, etc.; and prepositions: in doubt, under review, etc.).
Linguists address issues of polysemy, compositionality, idiomaticity, and continuity for each type included here. While native speakers use these forms with ease, the treatment and interpretation of MWEs in computational systems requires considerable effort due to the very issues that concern linguists.

2014

pdf bib abs
Multiword Expressions in Machine Translation
Valia Kordoni | Iliana Simova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This work describes an experimental evaluation of the significance of phrasal verb treatment for obtaining better quality statistical machine translation (SMT) results. The importance of the detection and special treatment of phrasal verbs is measured in the context of SMT, where the word-for-word translation of these units often produces incoherent results. Two ways of integrating phrasal verb information in a phrase-based SMT system are presented. Automatic and manual evaluations of the results reveal improvements in the translation quality in both experiments.

pdf bib
Proceedings of the 10th Workshop on Multiword Expressions (MWE)
Valia Kordoni | Markus Egg | Agata Savary | Eric Wehrli | Stefan Evert
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

pdf bib
Better Statistical Machine Translation through Linguistic Treatment of Phrasal Verbs
Kostadin Cholakov | Valia Kordoni
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Subcategorisation Acquisition from Raw Text for a Free Word-Order Language
Will Roberts | Markus Egg | Valia Kordoni
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf bib
Robust Automated Natural Language Processing with Multiword Expressions and Collocations
Valia Kordoni | Markus Egg
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Tutorials)

pdf bib
Proceedings of the 9th Workshop on Multiword Expressions
Valia Kordoni | Carlos Ramisch | Aline Villavicencio
Proceedings of the 9th Workshop on Multiword Expressions

pdf bib
Improving English-Bulgarian statistical machine translation by phrasal verb treatment
Iliana Simova | Valia Kordoni
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technologies

2012

pdf bib abs
Task-Driven Linguistic Analysis based on an Underspecified Features Representation
Stasinos Konstantopoulos | Valia Kordoni | Nicola Cancedda | Vangelis Karkaletsis | Dietrich Klakow | Jean-Michel Renders
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we explore a task-driven approach to interfacing NLP components, where language processing is guided by the end-task that each application requires. The core idea is to generalize feature values into feature value distributions, representing under-specified feature values, and to fit linguistic pipelines with a back-channel of specification requests through which subsequent components can declare to preceding ones the importance of narrowing the value distribution of particular features that are critical for the current task.

pdf bib abs
Using Verb Subcategorization for Word Sense Disambiguation
Will Roberts | Valia Kordoni
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We develop a model for predicting verb sense from subcategorization information and integrate it into SSI-Dijkstra, a wide-coverage knowledge-based WSD algorithm. Adding syntactic knowledge in this way should correct the current poor performance of WSD systems on verbs. This paper also presents, for the first time, an evaluation of SSI-Dijkstra on a standard data set which enables a comparison of this algorithm with other knowledge-based WSD systems. Our results show that our system is competitive with current graph-based WSD algorithms, and that the subcategorization model can be used to achieve better verb sense disambiguation performance.

2011

pdf bib
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
Valia Kordoni | Carlos Ramisch | Aline Villavicencio
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf bib
An Empirical Comparison of Unknown Word Prediction Methods
Kostadin Cholakov | Gertjan van Noord | Valia Kordoni | Yi Zhang
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Adaptability of Lexical Acquisition for Large-scale Grammars
Kostadin Cholakov | Gertjan van Noord | Valia Kordoni | Yi Zhang
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

pdf bib
Discourse Structure: Theory, Practice and Use
Bonnie Webber | Markus Egg | Valia Kordoni
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
Chart Mining-based Lexical Acquisition with Precision Grammars
Yi Zhang | Timothy Baldwin | Valia Kordoni | David Martinez | Jeremy Nicholson
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Discriminant Ranking for Efficient Treebanking
Yi Zhang | Valia Kordoni
Coling 2010: Posters

pdf bib abs
Mapping between Dependency Structures and Compositional Semantic Representations
Max Jakob | Markéta Lopatková | Valia Kordoni
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper investigates the mapping between two semantic formalisms, namely the tectogrammatical layer of the Prague Dependency Treebank 2.0 (PDT) and (Robust) Minimal Recursion Semantics ((R)MRS). It is a first attempt to relate the dependency-based annotation scheme of PDT to a compositional semantics approach like (R)MRS. A mapping algorithm that converts PDT trees to (R)MRS structures is developed, associating (R)MRSs to each node on the dependency tree. Furthermore, composition rules are formulated and the relation between dependency in PDT and semantic heads in (R)MRS is analyzed. It turns out that structure and dependencies, morphological categories and some coreferences can be preserved in the target structures. Moreover, valency and free modifications are distinguished using the valency dictionary of PDT as an additional resource. The validation results show that systematically correct underspecified target representations can be obtained by a rule-based mapping approach, which is an indicator that (R)MRS is indeed robust in relation to the formal representation of Czech data. This finding is novel, for Czech, with its free word order and rich morphology, is typologically different than languages analyzed with (R)MRS to date.

pdf bib abs
Semantic Feature Engineering for Enhancing Disambiguation Performance in Deep Linguistic Processing
Danielle Ben-Gera | Yi Zhang | Valia Kordoni
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The task of parse disambiguation has gained in importance over the last decade as the complexity of grammars used in deep linguistic processing has been increasing. In this paper we propose to employ the fine-grained HPSG formalism in order to investigate the contribution of deeper linguistic knowledge to the task of ranking the different trees the parser outputs. In particular, we focus on the incorporation of semantic features in the disambiguation component and the stability of our model across domains. Our work is carried out within DELPH-IN (http://www.delph-in.net), using the LinGo Redwoods and the WeScience corpora, parsed with the English Resource Grammar and the PET parser.

pdf bib abs
Disambiguating Compound Nouns for a Dynamic HPSG Treebank of Wall Street Journal Texts
Valia Kordoni | Yi Zhang
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The aim of this paper is twofold. We focus, on the one hand, on the task of dynamically annotating English compound nouns, and on the other hand we propose disambiguation methods and techniques which facilitate the annotation task. Both the aforementioned are part of a larger on-going effort which aims to create HPSG annotation for the texts from the Wall Street Journal (henceforward WSJ) sections of the Penn Treebank (henceforward PTB) with the help of a hand-written large-scale and wide-coverage grammar of English, the English Resource Grammar (henceforward ERG; Flickinger (2002)). As we show in this paper, such annotations are very rich linguistically, since apart from syntax they also incorporate semantics, which does not only ensure that the treebank is guaranteed to be a truly sharable, re-usable and multi-functional linguistic resource, but also calls for the necessity of a better disambiguation of the internal (syntactic) structure of larger units of words, such as compound nouns, since this has an impact on the representation of their meaning, which is of utmost interest if the linguistic annotation of a given corpus is to be further understood as the practice of adding interpretative linguistic information of the highest quality in order to give added value to the corpus.

2009

pdf bib
Prepositions in Applications: A Survey and Introduction to the Special Issue
Timothy Baldwin | Valia Kordoni | Aline Villavicencio
Computational Linguistics, Volume 35, Number 2, June 2009 - Special Issue on Prepositions

pdf bib
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?
Timothy Baldwin | Valia Kordoni
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?

pdf bib
Annotating Wall Street Journal Texts Using a Hand-Crafted Deep Linguistic Grammar
Valia Kordoni | Yi Zhang
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf bib
Using Treebanking Discriminants as Parse Disambiguation Features
Md. Faisal Mahbub Chowdhury | Yi Zhang | Valia Kordoni
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Enabling Adaptation of Lexicalised Grammars to New Domains
Valia Kordoni | Yi Zhang
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains

2008

pdf bib
Towards Domain-Independent Deep Linguistic Processing: Ensuring Portability and Re-Usability of Lexicalised Grammars
Kostadin Cholakov | Valia Kordoni | Yi Zhang
Coling 2008: Proceedings of the workshop on Grammar Engineering Across Frameworks

pdf bib
Enhancing Performance of Lexicalised Grammars
Rebecca Dridan | Valia Kordoni | Jeremy Nicholson
Proceedings of ACL-08: HLT

pdf bib
Mapping between Compositional Semantic Representations and Lexical Semantic Resources: Towards Accurate Deep Semantic Parsing
Sergio Roa | Valia Kordoni | Yi Zhang
Proceedings of ACL-08: HLT, Short Papers

pdf bib abs
Evaluating and Extending the Coverage of HPSG Grammars: A Case Study for German
Jeremy Nicholson | Valia Kordoni | Yi Zhang | Timothy Baldwin | Rebecca Dridan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this work, we examine and attempt to extend the coverage of a German HPSG grammar. We use the grammar to parse a corpus of newspaper text and evaluate the proportion of sentences which have a correct attested parse, and analyse the cause of errors in terms of lexical or constructional gaps which prevent parsing. Then, using a maximum entropy model, we evaluate prediction of lexical types in the HPSG type hierarchy for unseen lexemes. By automatically adding entries to the lexicon, we observe that we can increase coverage without substantially decreasing precision.

pdf bib abs
Robust Parsing with a Large HPSG Grammar
Yi Zhang | Valia Kordoni
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we propose a partial parsing model which achieves robust parsing with a large HPSG grammar. Constraint-based precision grammars, like the HPSG grammar we are using for the experiments reported in this paper, typically lack robustness, especially when applied to real world texts. To maximally recover the linguistic knowledge from an unsuccessful parse, a proper selection model must be used. Also, the efficiency challenges usually presented by the selection model must be answered. Building on the work reported in (Zhang et al., 2007), we further propose a new partial parsing model that splits the parsing process into two stages, both of which use the bottom-up chart-based parsing algorithm. The algorithm is implemented and a preliminary experiment shows promising results.

2007

pdf bib
Partial Parse Selection for Robust Deep Processing
Yi Zhang | Valia Kordoni | Erin Fitzgerald
ACL 2007 Workshop on Deep Linguistic Processing

pdf bib
The Corpus and the Lexicon: Standardising Deep Lexical Acquisition Evaluation
Yi Zhang | Timothy Baldwin | Valia Kordoni
ACL 2007 Workshop on Deep Linguistic Processing

pdf bib
Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering
Aline Villavicencio | Valia Kordoni | Yi Zhang | Marco Idiart | Carlos Ramisch
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib abs
Automated Deep Lexical Acquisition for Robust Open Texts Processing
Yi Zhang | Valia Kordoni
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we report on methods to detect and repair lexical errors for deep grammars. The lack of coverage has for long been the major problem for deep processing. The existence of various errors in the hand-crafted large grammars prevents their usage in real applications. The manual detection and repair of errors requires a significant amount of human effort. An experiment with the British National Corpus shows about 70% of the sentences contain unknown word(s) for the English Resource Grammar. With the help of error mining methods, many lexical errors are discovered, which cause a large part of the parsing failures. Moreover, with a lexical type predictor based on a maximum entropy model, new lexical entries are automatically generated. The contribution of various features for the model is evaluated. With the disambiguated full parsing results, the precision of the predictor is enhanced significantly.

pdf bib
Automated Multiword Expression Prediction for Grammar Engineering
Yi Zhang | Valia Kordoni | Aline Villavicencio | Marco Idiart
Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties