In this work we present an empirical analysis of Italian nominal multiword expressions (MWEs) of the form [noun + adjective]. The analysis studies their syntactic and semantic features quantitatively in order to improve their automatic identification and collection. Three indices are proposed that measure the syntactic and semantic frozenness of the expressions on an empirical basis, using a corpus of about 1.8 million words composed of Italian texts from the domain of physics. The three indices can be combined into a global measure, which we call the Prototypicality Index (PI) and which appears to be useful in the automatic extraction of terminological MWEs. The performance of PI at extracting true positives from a candidate list is compared with that of the well-known statistical association measures Log-likelihood and Pointwise Mutual Information. Our results show how the performance of PI can be comparable to those of association measures, althoug...
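For readers unfamiliar with the two baseline measures named above, the following is a minimal sketch of how Pointwise Mutual Information and the log-likelihood ratio are commonly computed for adjacent word pairs from a 2x2 contingency table. It is not the authors' implementation and does not compute PI, whose three indices are not defined in this abstract. The function name `association_scores` and the toy token list are assumptions made for illustration.

```python
import math
from collections import Counter


def association_scores(tokens):
    """Score each adjacent word pair with PMI and the log-likelihood ratio (G2).

    Hypothetical helper for illustration; counts are taken over adjacent pairs only.
    """
    pairs = list(zip(tokens, tokens[1:]))
    pair_freq = Counter(pairs)
    first_freq = Counter(w1 for w1, _ in pairs)    # pairs starting with w1
    second_freq = Counter(w2 for _, w2 in pairs)   # pairs ending with w2
    n = len(pairs)

    scores = {}
    for (w1, w2), o11 in pair_freq.items():
        r1, c1 = first_freq[w1], second_freq[w2]
        # Observed cells of the 2x2 table: (w1,w2), (w1,not w2), (not w1,w2), (neither)
        observed = [o11, r1 - o11, c1 - o11, n - r1 - c1 + o11]
        # Expected cells under independence: row total * column total / n
        expected = [
            r1 * c1 / n,
            r1 * (n - c1) / n,
            (n - r1) * c1 / n,
            (n - r1) * (n - c1) / n,
        ]
        pmi = math.log2(o11 / expected[0])
        # Empty cells contribute nothing to G2, so they are skipped
        g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
        scores[(w1, w2)] = (pmi, g2)
    return scores


# Toy example (invented tokens, not taken from the physics corpus)
toy_tokens = "il buco nero attrae la massa e il buco nero cresce".split()
for pair, (pmi, g2) in sorted(association_scores(toy_tokens).items()):
    print(pair, round(pmi, 3), round(g2, 3))
```

In practice, candidate [noun + adjective] pairs would be selected with part-of-speech information before scoring; the sketch scores every adjacent pair to stay self-contained.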
In contemporary linguistics the definition of the entities referred to as multiword expressions (MWEs) remains controversial. It is intuitively clear that some words, when they appear together, have a “special bond” in terms of meaning (e.g. black hole, mountain chain) or lexical choice (e.g. strong tea, to fill a form), in contrast to free combinations. Nevertheless, the great variety of features and anomalous behaviours that these expressions exhibit makes it difficult to organize them into categories, and has given rise to a large amount of different and sometimes overlapping terminology. So far, most approaches in corpus linguistics have focused on automatically extracting MWEs from corpora by using statistical association measures, while theoretical aspects related to their definition, typology and behaviour, as they emerge from quantitative corpus-based studies, have not been widely explored, especially for languages with rich morphology and relatively free word order, such as Italian. I show that a systematic analysis of the empirical behaviour of Italian MWEs in large corpora, with respect to several parameters such as syntactic and semantic variation, is useful for outlining a subcategorization of the expressions into homogeneous sets, which approximately correspond to what is intuitively known as multiword units (“polirematiche” in the Italian lexicographic tradition) and lexical collocations. These results can be obtained with a purpose-built tool (whose methodology is fully explained in my work) that automatically investigates the empirical features of MWEs once a large corpus and a list of expressions are provided.