Bastien Kindt
  • Institut orientaliste
    Collège Érasme
    Place Blaise Pascal, 1
    1348 Louvain-la-Neuve
    bte L3.03.32
    Belgique
  • (+32 0)10 47 49 10
  • Scientific collaborator at UCLouvain (Louvain-la-Neuve, Belgium); NLP Resources & Digital Corpus Developer at Peeters ...
Bibliographie du projet GREgORI 1990-...
The colophons of Armenian manuscripts constitute a large textual corpus spanning a millennium of written culture. These texts are highly diverse and rich in linguistic variation. This poses a challenge to NLP tools, especially since linguistic resources designed or suited for Armenian are still scarce. In this paper, we deal with a sub-corpus of colophons written to commemorate the rescue of a manuscript and dating from 1286 to ca. 1450, a thematic group distinguished by a particularly high concentration of words exhibiting linguistic variation. The text is processed (lemmatization, POS-tagging, and inflectional tagging) using the tools of the GREgORI Project and evaluated. Through a selection of examples, we show how variation is handled at each linguistic level (phonology, orthography, inflection, vocabulary, syntax). Complex variation, at the level of tokens or lemmata, is considered as well. The results of this work are used to enrich and refine the linguistic resources of the GREgORI project, which in turn benefits the processing of other texts.
The aim of this paper is to evaluate a lexical analysis (mainly lemmatization and POS-tagging) of a sample of the ancient Armenian version of the Adversus Haereses by Irenaeus of Lyons (2nd c.) using a hybrid approach based on digital dictionaries on the one hand and on a Recurrent Neural Network (RNN) on the other. The quality of the results is checked by comparing the data obtained with each of these two methods against manually checked data. In the present case, 98.37% of the results are correct with the first (lexical) approach, and 74.64% with the second (RNN). In practice, both methods have advantages and disadvantages, which argues for the hybrid method. The linguistic resources implemented here are jointly developed and tested by GREgORI and Calfa.
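As a rough illustration of the evaluation described in this abstract, the sketch below scores a dictionary lookup, a fallback model, and their combination against a manually checked gold standard. It is not the GREgORI or Calfa implementation: the function names `hybrid_lemmatize` and `accuracy`, the placeholder forms and lemmas, and the stand-in `toy_model` (which replaces the RNN) are all assumptions made for the example.

```python
from typing import Callable, Optional


def hybrid_lemmatize(
    token: str,
    lexicon: dict[str, str],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the lexicon lemma when the form is known, else ask the fallback."""
    if token in lexicon:
        return lexicon[token]
    return fallback(token)


def accuracy(predicted: list[Optional[str]], gold: list[str]) -> float:
    """Percentage of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Placeholder data (hypothetical, not taken from the paper).
tokens = ["form_a", "form_b", "form_c", "form_d"]
gold = ["lemma_a", "lemma_b", "lemma_c", "lemma_d"]

# The dictionary covers three of the four forms.
lexicon = {"form_a": "lemma_a", "form_b": "lemma_b", "form_c": "lemma_c"}


def toy_model(token: str) -> Optional[str]:
    """Stand-in for a neural lemmatizer: handles the unseen form, errs elsewhere."""
    return {"form_d": "lemma_d", "form_a": "lemma_x"}.get(token)


lexical_only = [lexicon.get(t) for t in tokens]
model_only = [toy_model(t) for t in tokens]
hybrid = [hybrid_lemmatize(t, lexicon, toy_model) for t in tokens]

print(accuracy(lexical_only, gold))  # dictionary lookup alone
print(accuracy(model_only, gold))    # fallback model alone
print(accuracy(hybrid, gold))        # dictionary first, model for unknown forms
```

The toy figures only show the mechanism: the dictionary is reliable on forms it knows, while the model supplies an answer for forms the dictionary lacks.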
Creating a digital corpus enriched with full linguistic annotations traditionally involves several manual steps of acquisition, processing, and data display. Processing presupposes the existence of dedicated and specialised analysis tools, adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method is based on a chain of applications pre-trained by Calfa and GREgORI that enables the complete processing of texts, from their automated input to their linguistic analysis and data display. We provide an assessment of this methodology and of the benefits of model specialisation, based on digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).
The DTC corpus brings together historical texts written in Greek during the Byzantine period. These texts were analyzed semi-automatically (lemmatization and POS-tagging) using the computer tools and linguistic resources of the GREgORI project (UCLouvain, Louvain-la-Neuve, Belgium), which specializes in the NLP of Greek and the languages of the Christian East. A second analysis was carried out in collaboration with the company Calfa (Paris, France), which develops NLP tools for Armenian and implements approaches based on artificial intelligence. This second analysis is performed by a neural network. This study compares and evaluates the results produced by the two methods and proposes a hybrid approach for the processing of the languages concerned.
Lemmatisation automatique des sources en géorgien ancien
The GREgORI project provides scholars with lemmatized corpora of texts written in Greek and in the main languages of the Christian East. Attested word-forms are linked with lemma, POS-, and inflectional tags. This work makes it possible to produce lemmatized indexes and concordances, and to disseminate these corpora by using web-based interfaces. This paper gives an overview of the goals of the GREgORI project and lists the morphosyntactic and inflectional tags used for the analysis of the ancient Armenian language.
The twenty-three publications collected in this volume are devoted to the LXX of Jeremiah and to Baruch, his secretary. Most of them aim to show that the received Hebrew text (MT, long text) is a reworking of a Hebrew model that was translated into Greek and preserved in the Septuagint (LXX, short text placed under the responsibility of Baruch). To this end, P.-M. Bogaert applied a "differential" exegesis that compares, analyzes, and explains the divergences between the two forms of the book. The volume opens with more general contributions; those that follow deal with passages chosen for their interest: differences in content and differences in order. A final, previously unpublished contribution offers a provisional synthesis aiming to characterize the short text in its own right. The collection is introduced by a preface in English by Professor Emanuel Tov of the Hebrew University of Jerusalem (Publisher's blurb, Peeters Publishers, 2020).
Bibliographie du projet GREgORI (ordre chronologique)