Up until today, research in various educational and linguistic domains, such as learner corpus research, writing research, or second language acquisition, has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions, combined with domain-inherent obstacles to data sharing, has so far hampered the comparability, reusability and reproducibility of data and research results. In this article, we present our work in creating a digital infrastructure for L1 and L2 learner corpora and populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing on technical solutions and frameworks from research data management, among them the FAIR guiding principles for data stewardship. We share our experiences from integrating several L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with ...
This paper presents the DiDi Corpus, a corpus of South Tyrolean Data of Computer-mediated Communication (CMC). The corpus comprises around 650,000 tokens from Facebook wall posts, comments on wall posts, and private messages, as well as socio-demographic data about the participants. All data was automatically annotated with language information (de, it, en, and others) and manually normalised and anonymised. Furthermore, semi-automatic token-level annotations include part-of-speech tags and CMC phenomena (e.g. emoticons, emojis, and iteration of graphemes and punctuation). The anonymised corpus without the private messages is freely available for researchers; the complete, anonymised corpus is available after signing a non-disclosure agreement.
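Phenomena like emoticons and grapheme or punctuation iteration lend themselves to pattern-based pre-annotation. A minimal, illustrative sketch (the patterns and labels below are our own assumptions, not the DiDi annotation scheme):

```python
import re

# Hypothetical, minimal patterns -- the DiDi project used its own, richer
# annotation scheme; these regexes only illustrate the idea.
EMOTICON = re.compile(r"[:;=8][\-o*']?[)(\]\[DPp3]")   # e.g. :-)  ;P  =D
CHAR_ITERATION = re.compile(r"(\w)\1{2,}")             # e.g. sooo, jaaa
PUNCT_ITERATION = re.compile(r"([!?.])\1{1,}")         # e.g. !!!, ...

def cmc_tags(token: str) -> list[str]:
    """Return illustrative CMC phenomenon labels for a single token."""
    tags = []
    if EMOTICON.fullmatch(token):
        tags.append("emoticon")
    if CHAR_ITERATION.search(token):
        tags.append("grapheme-iteration")
    if PUNCT_ITERATION.search(token):
        tags.append("punctuation-iteration")
    return tags
```

Such patterns can only pre-annotate candidates; token-level decisions in the corpus were made semi-automatically, with manual checking.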
This paper describes the needs and technological preconditions of the CLARIN ERIC infrastructure. It introduces how containerization using Docker can help to meet these requirements and fleshes out the build and deployment workflow that CLARIN ERIC employs to ensure that all the goals of its infrastructure are met in an efficient and sustainable way. In a second step, it also shows how these same workflows can help researchers, especially in the fields of computational and corpus linguistics, to produce more easily reproducible research: a virtual environment can provide the specific versions of data, programs and algorithms used for certain research questions, and make sure that exactly the same versions can still be used at a later stage to reproduce the results.
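The reproducibility idea can be sketched as a container definition that pins every version it depends on. The image tag, package versions and script name below are illustrative assumptions, not CLARIN ERIC's actual configuration:

```dockerfile
# Illustrative only: pin every layer so the analysis environment can be
# rebuilt identically later. Versions and file names are hypothetical.
FROM python:3.11.9-slim

# Pin the exact versions of the libraries the analysis depends on.
RUN pip install --no-cache-dir pandas==2.2.2 scikit-learn==1.4.2

# Copy a fixed snapshot of data and code into the image.
COPY data/ /work/data/
COPY analysis.py /work/

WORKDIR /work
CMD ["python", "analysis.py"]
```

Because every input is pinned, rebuilding the image years later reproduces the same environment, which is the property the paper argues for.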
Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, language identification is needed. This study compares the performance of human annotators with automatic language identification on a multilingual (mainly German-Italian-English) social media corpus collected in South Tyrol, Italy. Our results indicate that humans and Natural Language Processing (NLP) systems each follow their own techniques when making decisions about multilingual text messages. This results in low agreement when different annotators or NLP systems execute the same task. In general, annotators agree with each other more than NLP systems do. However, there is also variation in human agreement, depending on whether guidelines for the annotation task were established beforehand.
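Automatic language identification is commonly approached with character n-gram profiles. A toy sketch under that assumption (the tiny training snippets are invented stand-ins for real corpora; this is not one of the systems compared in the study):

```python
from collections import Counter

# Invented toy snippets; real systems build profiles from large corpora.
SAMPLES = {
    "de": "ich und du wir haben das nicht gesehen heute morgen schon wieder",
    "it": "io e te abbiamo visto che non era ancora arrivato questa mattina",
    "en": "you and i have seen that it was not there this morning again",
}

def trigrams(text: str) -> Counter:
    """Character trigram counts, the classic profile for language ID."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

PROFILES = {lang: trigrams(s) for lang, s in SAMPLES.items()}

def identify(text: str) -> str:
    """Pick the language whose trigram profile overlaps the text most."""
    grams = trigrams(text)
    def overlap(profile: Counter) -> int:
        return sum(min(c, profile[g]) for g, c in grams.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))
```

Short, code-switched messages give such profile-based systems very little signal, which is one reason agreement between systems (and with humans) drops on social media data.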
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though there are still many open issues related to building and annotating corpora of this type, there already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.
Part-of-speech (PoS) tagging constitutes a common task in Natural Language Processing (NLP) given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS-tagging have changed. The majority of existing Italian data for this task originates from standard texts, where language use is far from the multifaceted informal situations of real life. Automatic PoS-tagging models trained on such data do not perform reliably on non-standard language, such as social media content or language learners' texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use Italian texts from a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard language. With ...
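Evaluating a tagger against such a gold standard typically comes down to token-level accuracy over UD UPOS tags. A minimal sketch, with invented example tags rather than actual LEONIDE annotations:

```python
# Token-level accuracy of predicted UD UPOS tags against a gold standard.
# The example sentence and tags are invented for illustration.
def upos_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of tokens whose predicted UPOS tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# "io mangio una mela" -> PRON VERB DET NOUN (invented gold annotation)
gold = ["PRON", "VERB", "DET", "NOUN"]
pred = ["PRON", "VERB", "DET", "PROPN"]  # tagger confuses NOUN/PROPN
```

On non-standard learner text, errors concentrate on exactly such confusable pairs, which is why a dedicated gold standard is needed.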
In this article we provide an overview of first-hand experiences and vantage points for best practices from projects in seven European countries dedicated to learner corpus research (LCR) and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, as are careful preparation and easy retrieval and reusability of corpora and tools. But due to the lack of commonly agreed solutions for many aspects of LCR, interoperability between learner corpora and the exchange of data between different learner corpus projects remain a challenge. We show how concepts like metadata, anonymization, error taxonomies and linguistic annotations, as well as tools, toolchains and data formats, can be individually challenging, and how these challenges can be solved.
The Live Memories corpus is an Italian corpus annotated for anaphoric relations. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and the semantic class of the NPs. For the annotation of anaphoric links, the corpus takes into account phenomena specific to the Italian language, like incorporated clitics and phonetically non-realized pronouns. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. It is planned to add a set of articles from local newspapers.
The Portale della Ricerca Umanistica Trentina (Humanities Research Portal) will be a one-stop search facility for repositories of articles on Humanities subjects concerning the History, History of Art, and Archaeology of Trentino. It relies on automatically extracted entity, spatial and temporal metadata to provide entity-based, spatially-based and temporally-based access to the articles. In this article we discuss the aims of this project and the current state of work.
Michael Beißwenger, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Ljubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer and Ciara Wigham, "Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects", CLARIN Annual Conference, 27–28 October 2016, Aix-en-Provence, France
Papers by Egon W Stemle