Computational Linguistics in the Netherlands, 2017
Recent years have seen the emergence of a number of corpus-based graded lexical resources, which include lexical entries graded along a particular learning or difficulty scale. We argue that these graded lexicons are a step towards rendering the inherent complexity of words more apparent – contrary to traditional (single-level) frequency-based lexicons – and could thus find their utility in the field of automatic text simplification, to name but one. However, until now, this type of resource has only been made available for a few languages, including French as a first and second (L2) language (Lété et al., 2004; François et al., 2014) and Swedish L2 (François et al., 2016). The goal of our current work is therefore to expand upon these previous developments by presenting a similar resource for Dutch as a foreign language. Our presentation will be twofold. On the one hand, we will present the alpha version of the NT2Lex resource. We will describe the common methodology used for collecting a corpus of readers and textbooks graded per level of the Common European Framework of Reference (CEFR) and for extracting and refining the per-level lexical frequencies from the collected texts. On the other hand, we will present a concrete application of the resource (and of graded lexicons in general), which is linked to the task of complex word identification and prediction. In particular, we will address the concrete challenges we are faced with when using a predictive model of vocabulary knowledge at a given CEFR level.
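As a minimal illustration of the kind of per-level frequency extraction and vocabulary-knowledge prediction described above, the following Python sketch derives CEFR-level relative frequencies from a graded corpus and predicts the first level at which a word is encountered. All names and the threshold-based decision rule are illustrative assumptions, not the actual NT2Lex pipeline.

```python
from collections import Counter

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def per_level_frequencies(corpus_by_level):
    """Count lemma frequencies separately per CEFR level.

    `corpus_by_level` maps a CEFR level to a list of tokenized,
    lemmatized texts (each text being a list of lemmas)."""
    freqs = {}
    for level, texts in corpus_by_level.items():
        counts = Counter(lemma for text in texts for lemma in text)
        total = sum(counts.values())
        # Normalize to relative frequencies so levels with more
        # material remain comparable.
        freqs[level] = {lemma: n / total for lemma, n in counts.items()}
    return freqs

def first_level(lemma, freqs, threshold=0.0):
    """Predict the CEFR level at which a word is first encountered:
    the lowest level whose relative frequency exceeds `threshold`."""
    for level in CEFR_LEVELS:
        if freqs.get(level, {}).get(lemma, 0.0) > threshold:
            return level
    return None  # never observed: a candidate complex word
```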
Recent years have seen the emergence of a number of graded lexical resources to further research on first and second language lexical complexity (Dürlich & François, 2018; François et al., 2014; François et al., 2016; Lété et al., 2004). These lexical resources describe the frequency distributions of lexemes graded along a particular learning or difficulty scale (e.g. the Common European Framework of Reference – CEFR – scale) and are corpus-based, machine-readable and openly licensed. The lexical frequencies included in these resources are commonly estimated on a corpus of L1 and L2 learning materials, comprising either textbooks and simplified readers (receptive lexicons) or learner texts (productive lexicons), and they can hence easily be used for pedagogical purposes. Furthermore, the resources are not only available via an online query engine for teachers and/or researchers, but can also be used as components of a readability-driven learning platform (Pilán, Volodina, & Borin, 2016) or an automated essay grading system (Pilán, Volodina, & Zesch, 2016). Until now, these CEFR-graded lexical resources have only been made available for a few European languages, including French, Swedish and English. The rationale of our current research is to expand upon these previous developments with NT2Lex, a new resource for Dutch as a foreign language. Moreover, we also aim to address one of the shortcomings of the previously developed resources. Although these graded lexicons are a step towards rendering the inherent complexity of words more apparent – contrary to traditional frequency-based lexicons – we argue that they still lack information about word sense complexity, since their lexical entries are only disambiguated per lemma and part of speech. Our principal aim is therefore to advance a new type of graded lexicon: a word-sense disambiguated (WSD) graded lexicon linked to Open Dutch WordNet (Postma et al., 2016). The objectives of our study are twofold. First, we will present the final version of the NT2Lex resource, which includes single- and multi-word lexical entries part-of-speech tagged with Frog (van den Bosch et al., 2007) and word-sense disambiguated with a Dutch WSD tool. Second, we will present work in progress on the use of word sense complexity features obtained with NT2Lex in L2 readability research, comparing them to more traditional indices of lexical complexity such as lexical sophistication and hypernymy (Crossley & Salsbury, 2010).
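To illustrate the hypernymy index mentioned above as a traditional lexical complexity measure (Crossley & Salsbury, 2010), here is a hedged Python sketch using Princeton WordNet via NLTK as a stand-in for Open Dutch WordNet; the depth-of-first-sense heuristic is an assumption for illustration, not the paper's method.

```python
from statistics import mean
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def hypernymy_index(lemmas, pos=wn.NOUN):
    """Average hypernym-path depth of the words in a text: deeper,
    more specific senses are taken as lexically more sophisticated."""
    depths = []
    for lemma in lemmas:
        synsets = wn.synsets(lemma, pos=pos)
        if synsets:
            # Depth of the first (most frequent) sense in WordNet.
            depths.append(synsets[0].max_depth())
    return mean(depths) if depths else 0.0
```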
Background: Detecting Alzheimer's disease (AD) before the onset of symptoms is crucial for the development of effective treatment (1). As biomarkers are costly and/or invasive, there is a need for cost-efficient and non-invasive methods that can indicate which individuals need further examination. We present an ongoing project that aims at developing a screening tool applicable to the general population by automatically analyzing the history of electronic messages such as WhatsApp and e-mails over a period of several years prior to the analysis. Method: In the first phase of the project (model construction), 30 prodromal or mildly demented AD patients (with biomarker-confirmed diagnoses) and 30 clinically normal elderly volunteers are recruited to donate their electronic message histories. Data are gathered and processed according to the General Data Protection Regulation, and only sent messages are considered, which are codified to hide personal information. Automatic linguistic analyses ...
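As a rough illustration of what an automatic linguistic analysis of a message history might compute, the sketch below derives a few generic features (mean message length, type-token ratio, vocabulary size). The project's actual feature set is not specified in the abstract, so these features are purely hypothetical.

```python
import re
from statistics import mean

def message_features(messages):
    """Toy linguistic profile of a non-empty sent-message history.
    These illustrative features stand in for the project's (unspecified)
    automatic linguistic analyses."""
    tokens = [t.lower() for m in messages for t in re.findall(r"\w+", m)]
    types = set(tokens)
    return {
        "mean_message_length": mean(len(re.findall(r"\w+", m)) for m in messages),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "vocabulary_size": len(types),
    }
```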
L'objectif de la recherche rapportee ici est de developper une technique d'extraction d&#... more L'objectif de la recherche rapportee ici est de developper une technique d'extraction d'information permettant de determiner automatiquement la valence affective de phrases qui mentionnent des noms de personnalites ou de societes. Pour ce faire un extracteur d'entites nommees est associe a un programme d'analyse lexicale faisant appel a des dictionnaires de valence affective. Un corpus de reference est etabli pour mesurer les performances du systeme propose en les comparant a des jugements humains.
The literature frequently addresses the differences between receptive and productive vocabulary, but grammar is often left unacknowledged in second language acquisition studies. In this paper, we used two corpora to investigate the divergences in the behavior of pedagogically relevant grammatical structures in reception and production texts. We further refined the divergence scores observed in this investigation by assigning them a polarity that indicates whether a grammatical structure is overused or underused by language learners. This led to the compilation of a language profile that was later combined with vocabulary and readability features for classifying reception and production texts into three classes: beginner, intermediate, and advanced. The results of the automatic classification task in both production (F-measure of 0.872) and reception (F-measure of 0.942) were comparable to the current state of the art. We also attempted to automatically attribute a score to texts p...
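One plausible way to implement such a signed divergence score is a smoothed log-ratio of relative frequencies, positive for learner overuse and negative for underuse; the paper's exact measure may differ, so treat this as a sketch under that assumption.

```python
import math

def signed_divergence(struct, reception_counts, production_counts):
    """Log-ratio of a grammatical structure's relative frequency in
    production vs. reception corpora. Positive values indicate learner
    overuse, negative values underuse. Add-one smoothing avoids log(0)."""
    r_total = sum(reception_counts.values())
    p_total = sum(production_counts.values())
    r = (reception_counts.get(struct, 0) + 1) / (r_total + 1)
    p = (production_counts.get(struct, 0) + 1) / (p_total + 1)
    return math.log(p / r)
```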
This article presents the AMesure platform, which aims to assist writers of administrative texts in simplifying their documents. The platform not only includes the first readability formula specialized for administrative texts, but also draws on various natural language processing (NLP) techniques: it performs a detailed analysis of the text that highlights several phenomena considered harder to read. This study reports the methodology put in place to carry out this analysis, as well as an evaluation of its results. These are at the state-of-the-art level for the detection of syntactic structures, but less satisfactory with respect to the detection of abbreviations. Finally, the AMesure interface is briefly presented, in particular its feature that offers plain-writing advice to authors.
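The specialized readability formula itself is not reproduced in the abstract; the sketch below only illustrates the general linear form such formulas usually take, with made-up features and weights rather than AMesure's actual coefficients.

```python
def readability_score(features, weights, intercept=0.0):
    """Generic linear readability formula: a weighted sum of text
    features (e.g., mean sentence length, proportion of rare words).
    Purely illustrative; not the AMesure formula."""
    return intercept + sum(weights[name] * value
                           for name, value in features.items())

# Illustrative call with invented feature values and weights:
score = readability_score(
    {"mean_sentence_length": 22.4, "pct_rare_words": 0.18},
    {"mean_sentence_length": 0.3, "pct_rare_words": 12.0},
    intercept=1.5,
)
```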
This paper describes the architecture of an encoding system intended to be implemented as a coding aid at the Cliniques universitaires Saint-Luc, a hospital in Brussels. The paper focuses on machine learning methods, and more specifically on the appropriate set of attributes to be chosen in order to optimize the results of these methods. A series of four experiments was conducted on a baseline method, Naive Bayes, with varying sets of attributes. These experiments showed that a first step consisting in the extraction of the information to be coded (such as diseases, procedures, aggravating factors, etc.) is essential. They also demonstrated the importance of stemming features. Restricting the classes to categories resulted in a recall of 81.1%.
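A minimal sketch of a Naive Bayes baseline evaluated on one candidate attribute set, using scikit-learn as a stand-in for whatever toolkit the paper used, and reporting recall, the metric quoted above. The bag-of-words attributes are an illustrative choice, not the paper's attribute sets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import recall_score

def evaluate_attribute_set(train_texts, train_codes, test_texts, test_codes):
    """Train a Naive Bayes baseline on one candidate attribute set
    (here: bag-of-words counts) and report macro-averaged recall."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_codes)
    predicted = model.predict(test_texts)
    return recall_score(test_codes, predicted, average="macro")
```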
Parsers are essential tools for several NLP applications. Here we introduce PassPort, a model for the dependency parsing of Portuguese trained with the Stanford Parser. For developing PassPort, we observed which approach performed best in several setups using different existing parsing algorithms and combinations of linguistic information. PassPort achieved a UAS of 87.55 and a LAS of 85.21 on the Universal Dependencies corpus. We also evaluated the model's performance against another model and on different corpora covering three genres. For that, we annotated random sentences from these corpora using PassPort and the PALAVRAS parsing system. We then carried out a manual evaluation and comparison of both models. They achieved very similar results for dependency parsing, with a LAS of 85.02 for PassPort against 84.36 for PALAVRAS. In addition, the results of the analysis showed that better performance in part-of-speech tagging could improve our LAS.
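UAS and LAS are standard attachment scores; for readers unfamiliar with them, this small Python function shows how they are computed from aligned gold and predicted (head, label) pairs.

```python
def uas_las(gold, predicted):
    """Compute Unlabeled and Labeled Attachment Scores.

    `gold` and `predicted` are aligned lists of (head_index, label)
    pairs, one per token. UAS counts correct heads; LAS additionally
    requires the correct dependency label."""
    assert len(gold) == len(predicted)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100 * correct_heads / n, 100 * correct_both / n
```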
This paper presents a rule-based method for the detection and normalization of medical entities using SNOMED-CT which, although based on knowledge stored in terminological resources, allows some flexibility in order to account for the language variation typical of medical texts. Our system is based on the Unitex software and is one of the few to code French medical texts with SNOMED-CT concept identifiers. Our evaluation quantifies the benefits of such a flexible approach, but also highlights the shortcomings of terminological resources for the processing of medical reports written in French. Finally, our methodology is an interesting alternative to supervised training, as the extraction rules require limited development effort.
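A toy stand-in for the flexible matching described above: approximate string matching against a hypothetical fragment of the SNOMED-CT term inventory. The real system uses Unitex-based rules rather than difflib, so this is only a sketch of the idea of tolerating surface variation.

```python
import difflib

# Hypothetical fragment of a SNOMED-CT term-to-concept mapping.
SNOMED = {"infarctus du myocarde": "22298006",
          "diabète de type 2": "44054006"}

def normalize(mention, cutoff=0.85):
    """Map a textual mention to a SNOMED-CT concept identifier,
    tolerating minor surface variation (inflection, typos) through
    approximate string matching."""
    match = difflib.get_close_matches(mention.lower(), list(SNOMED),
                                      n=1, cutoff=cutoff)
    return SNOMED[match[0]] if match else None
```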