- Institut orientaliste
Collège Érasme
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
bte L3.03.32
Belgique - (+32 0)10 47 49 10
- Greek Lexicology, Greek Lexicography, Natural Language Processing, Byzantine Hagiography, Ancient Greek Language, Byzantine Studies, Lexicology, Ancient Greek, Kartvelian Languages, Lemmatization, Greek Literature, Lexicography, Greek Language, Late Antiquity, Greek Grammar, Dictionaries, Lexicografía Griega, Diccionarios De Griego Antiguo, Ancient Greek Vocabulary, Liddell Scott Jones, Byzantine Archaeology, Classics, Early Christianity, Roman History, Ancient Greek History, and Roman Archaeology
- Scientific collaborator at UCLouvain (Louvain-la-Neuve, Belgium)
NLP Resources & Digital Corpus Developer at Peeters Publishers (Leuven, Belgium).
I graduated in both Classics and Oriental Philology and History at UCLouvain. I am a Hellenist with a major interest in the languages of the Christian East, on the one hand, and in the automatic processing of natural languages, on the other.
At present, I am the coordinator of the GREgORI project directed by Professor Bernard Coulie at UCLouvain (see https://uclouvain.be/fr/instituts-recherche/incal/ciol/gregori-project.html).
The project aims to provide researchers with tagged corpora of texts written in classical and oriental languages (mainly Greek, Armenian, Georgian, Arabic, and Syriac). This includes preparing lemmatized indexes and concordances and designing online user interfaces and search capabilities for Greek and oriental texts. Samples of tagged corpora are available free of charge on the interfaces of the GREgORI Project (see https://www.v2.gregoriproject.com).
Processed texts are turned into corpora enriched with lexical (lemma), part-of-speech (noun, adjective, verb, etc.), and grammatical tags (case, gender, tense, person, number, etc.). By handling such data with the appropriate query engines, users can search for words or expressions in tagged corpora and gather linguistic materials in order to automatically create indexes, concordances, and other lexicographical tools (frequency indexes, inverse indexes, etc.), paving the way for linguistic, philological, or historical studies.
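As a rough illustration of the kind of tagged data described above, the following is a minimal Python sketch, not the GREgORI tools or their data format. The `Token` structure, the tag notation, and the function names are all hypothetical and chosen only for this example; the sample is the opening of John 1:1, tagged by hand. It shows how a frequency index, a lemma index, and a small keyword-in-context concordance can be derived once each word carries a lemma, a part of speech, and grammatical tags.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """One tagged word (hypothetical structure, for illustration only)."""
    form: str   # word as it appears in the text
    lemma: str  # dictionary headword
    pos: str    # part-of-speech tag
    morph: str  # grammatical tags, in an ad hoc notation


# Hand-tagged sample: the opening words of John 1:1.
CORPUS = [
    Token("ἐν", "ἐν", "PREP", ""),
    Token("ἀρχῇ", "ἀρχή", "NOUN", "dat.sg.f"),
    Token("ἦν", "εἰμί", "VERB", "impf.ind.3sg"),
    Token("ὁ", "ὁ", "ART", "nom.sg.m"),
    Token("λόγος", "λόγος", "NOUN", "nom.sg.m"),
    Token("καὶ", "καί", "CONJ", ""),
    Token("ὁ", "ὁ", "ART", "nom.sg.m"),
    Token("λόγος", "λόγος", "NOUN", "nom.sg.m"),
    Token("ἦν", "εἰμί", "VERB", "impf.ind.3sg"),
    Token("πρὸς", "πρός", "PREP", ""),
    Token("τὸν", "ὁ", "ART", "acc.sg.m"),
    Token("θεόν", "θεός", "NOUN", "acc.sg.m"),
]


def lemma_frequencies(tokens):
    """Frequency index: how often each lemma occurs."""
    return Counter(t.lemma for t in tokens)


def lemma_index(tokens):
    """Lemmatized index: lemma -> positions of its occurrences."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lemma].append(position)
    return dict(index)


def concordance(tokens, lemma, window=2):
    """Keyword-in-context lines for every occurrence of a lemma."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lemma == lemma:
            left = " ".join(t.form for t in tokens[max(0, i - window):i])
            right = " ".join(t.form for t in tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token.form}] {right}".strip())
    return lines


def search(tokens, pos, morph_part=""):
    """Find word forms by part of speech and a fragment of the grammatical tag."""
    return [t.form for t in tokens if t.pos == pos and morph_part in t.morph]


if __name__ == "__main__":
    print(lemma_frequencies(CORPUS).most_common(3))
    for line in concordance(CORPUS, CORPUS[4].lemma):
        print(line)
    print(search(CORPUS, "NOUN", "acc"))
```

Because the queries run on lemmas and tags rather than on surface forms, the inflected form τὸν is counted under the article ὁ, and a search for accusative nouns does not need to know any particular spelling.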
These developments are carried out in cooperation with scholars of the Oriental Institute of UCLouvain, who contribute their linguistic expertise, with Calfa (see https://calfa.fr), which handles IT development, and with other researchers and academic teams, both in Belgium and abroad.
Examples of ongoing projects are the creation of digital versions of the Corpus Scriptorum Christianorum Orientalium series, in collaboration with Peeters Publishers (Leuven, Belgium), and the processing of the Syriac texts published by the project "Florilegia Syriaca. The Intercultural Dissemination of Greek Christian Thought in Syriac and Arabic in the First Millennium CE" (Venice) (see https://cordis.europa.eu/project/rcn/212198).
Bibliographie du projet GREgORI 1990-...
Research Interests: Patristics, Georgian Language, Armenian Language, Byzantine Historiography, Ancient Greek Language, POS Tagging, Natural Language Processing (NLP), Lemmatization, Syriac Language, Ancient Greek, Concordances, Lexicology, Greek (Byzantine) Texts, Keyword in Context (KWIC), and Recursive Neural Network (RNN)
The colophons of Armenian manuscripts constitute a large textual corpus spanning a millennium of written culture. These texts are highly diverse and rich in linguistic variation. This poses a challenge to NLP tools, especially since linguistic resources designed or suited for Armenian are still scarce. In this paper, we deal with a sub-corpus of colophons written to commemorate the rescue of a manuscript and dating from 1286 to ca. 1450, a thematic group distinguished by a particularly high concentration of words exhibiting linguistic variation. The text is processed (lemmatization, POS-tagging, and inflectional tagging) using the tools of the GREgORI Project, and the results are evaluated. Through a selection of examples, we show how variation is dealt with at each linguistic level (phonology, orthography, inflection, vocabulary, syntax). Complex variation, at the level of tokens or lemmata, is considered as well. The results of this work are used to enrich and refine the linguistic resources of the GREgORI project, which in turn benefits the processing of other texts.
Creating a digital corpus enriched by full linguistic annotations is a task that traditionally involves several manual steps of acquisition, processing, and data display. Processing presupposes the existence of dedicated and specialised analysis tools, adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents.
This method is based on a chain of applications pre-trained by Calfa and GREgORI that enables the complete processing of texts, from their automated input to their linguistic analysis and data display. We provide an assessment of this methodology and of the benefits of model specialisation, based on digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).
The DTC corpus brings together historical texts written in Greek during the Byzantine period. These texts were analyzed semi-automatically (lemmatization and POS-tagging) using the computer tools and linguistic resources of the GREgORI project (UCLouvain, Louvain-la-Neuve, Belgium), which specializes in the NLP of Greek and the languages of the Christian East. A second analysis was carried out in collaboration with the company Calfa (Paris, France), which develops NLP tools for Armenian and implements approaches based on artificial intelligence. This second analysis is performed by a neural network. This study compares and evaluates the results produced by the two methods and proposes a hybrid approach for the processing of the languages concerned.
Lemmatisation automatique des sources en géorgien ancien
The twenty-three publications collected in this volume are devoted to the LXX of Jeremiah and to Baruch, his secretary. Most of them aim to show that the received Hebrew text (MT, long text) is a reworking of a Hebrew model that was translated into Greek and is preserved in the Septuagint (LXX, short text placed under the responsibility of Baruch). To this end, P.-M. Bogaert applied a "differential" exegesis that compares, analyzes, and explains the divergences between the two forms of the book. The volume opens with more general contributions; those that follow deal with passages chosen for their interest: differences in content and differences in order. A final, previously unpublished contribution offers a provisional synthesis that seeks to characterize the short text in its own right. The collection is introduced by a preface in English by Professor Emanuel Tov of the Hebrew University of Jerusalem (Publisher's blurb, Peeters Publishers, 2020).
Bibliographie du projet GREgORI (ordre chronologique)