2024
Automatic sentence segmentation of clinical record narratives in real-world data
Dongfang Xu | Davy Weissenbacher | Karen O’Connor | Siddharth Rawal | Graciela Gonzalez Hernandez
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Sentence segmentation is a linguistic task and is widely used as a pre-processing step in many NLP applications. The need for sentence segmentation is particularly pronounced in clinical notes, where ungrammatical and fragmented texts are common. We propose a straightforward and effective sequence labeling classifier to predict sentence spans using a dynamic sliding window based on the prediction of each input sequence. This sliding window algorithm allows our approach to segment long text sequences on the fly. To evaluate our approach, we annotated 90 clinical notes from the MIMIC-III dataset. Additionally, we tested our approach on five other datasets to assess its generalizability and compared its performance against state-of-the-art systems on these datasets. Our approach outperformed all the systems, achieving an F1 score that is 15% higher than the next best-performing system on the clinical dataset.
Overview of the 9th Social Media Mining for Health Applications (#SMM4H) Shared Tasks at ACL 2024 – Large Language Models and Generalizability for Social Media NLP
Dongfang Xu | Guillermo Garcia | Lisa Raithel | Philippe Thomas | Roland Roller | Eiji Aramaki | Shoko Wakamiya | Shuntaro Yada | Pierre Zweigenbaum | Karen O’Connor | Sai Samineni | Sophia Hernandez | Yao Ge | Swati Rajwal | Sudeshna Das | Abeed Sarker | Ari Klein | Ana Schmidt | Vishakha Sharma | Raul Rodriguez-Esteban | Juan Banda | Ivan Amaro | Davy Weissenbacher | Graciela Gonzalez-Hernandez
Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks
For the past nine years, the Social Media Mining for Health Applications (#SMM4H) shared tasks have promoted community-driven development and evaluation of advanced natural language processing systems to detect, extract, and normalize health-related information in publicly available user-generated content. This year, #SMM4H included seven shared tasks in English, Japanese, German, French, and Spanish from Twitter, Reddit, and health forums. A total of 84 teams from 22 countries registered for #SMM4H, and 45 teams participated in at least one task. This represents a growth of 180% and 160% in registration and participation, respectively, compared to the last iteration. This paper provides an overview of the tasks and participating systems. The data sets remain available upon request, and new systems can be evaluated through the post-evaluation phase on CodaLab.
2022
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
Graciela Gonzalez-Hernandez | Davy Weissenbacher
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
Overview of the Seventh Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2022
Davy Weissenbacher | Juan Banda | Vera Davydova | Darryl Estrada Zavala | Luis Gasco Sánchez | Yao Ge | Yuting Guo | Ari Klein | Martin Krallinger | Mathias Leddin | Arjun Magge | Raul Rodriguez-Esteban | Abeed Sarker | Lucia Schmidt | Elena Tutubalina | Graciela Gonzalez-Hernandez
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
For the past seven years, the Social Media Mining for Health Applications (#SMM4H) shared tasks have promoted the community-driven development and evaluation of advanced natural language processing systems to detect, extract, and normalize health-related information in public, user-generated content. This seventh iteration consists of ten tasks that include English and Spanish posts on Twitter, Reddit, and WebMD. Interest in the #SMM4H shared tasks continues to grow, with 117 teams that registered and 54 teams that participated in at least one task—a 17.5% and 35% increase in registration and participation, respectively, over the last iteration. This paper provides an overview of the tasks and participants’ systems. The data sets remain available upon request, and new systems can be evaluated through the post-evaluation phase on CodaLab.
2021
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
Arjun Magge | Ari Klein | Antonio Miranda-Escalada | Mohammed Ali Al-garadi | Ilseyar Alimova | Zulfat Miftahutdinov | Eulalia Farre-Maduell | Salvador Lima Lopez | Ivan Flores | Karen O'Connor | Davy Weissenbacher | Elena Tutubalina | Abeed Sarker | Juan M Banda | Martin Krallinger | Graciela Gonzalez-Hernandez
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
Overview of the Sixth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at NAACL 2021
Arjun Magge | Ari Klein | Antonio Miranda-Escalada | Mohammed Ali Al-Garadi | Ilseyar Alimova | Zulfat Miftahutdinov | Eulalia Farre | Salvador Lima López | Ivan Flores | Karen O’Connor | Davy Weissenbacher | Elena Tutubalina | Abeed Sarker | Juan Banda | Martin Krallinger | Graciela Gonzalez-Hernandez
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
The global growth of social media usage over the past decade has opened research avenues for mining health-related information that can ultimately be used to improve public health. The Social Media Mining for Health Applications (#SMM4H) shared tasks, in their sixth iteration, sought to advance the use of social media texts such as Twitter posts for pharmacovigilance, disease tracking, and patient-centered outcomes. #SMM4H 2021 hosted a total of eight tasks that included reruns of adverse drug effect extraction in English and Russian and newer tasks such as detecting medication non-adherence from Twitter and the WebMD forum, detecting self-reported adverse pregnancy outcomes, detecting cases and symptoms of COVID-19, identifying occupations mentioned in Spanish by Twitter users, and detecting self-reported breast cancer diagnosis. The eight tasks included a total of 12 individual subtasks spanning three languages and requiring methods for binary classification, multi-class classification, named entity recognition, and entity normalization. With a total of 97 registering teams and 40 teams submitting predictions, interest in the shared tasks grew by 70% and participation grew by 38% compared to the previous iteration.
2020
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task
Graciela Gonzalez-Hernandez | Ari Z. Klein | Ivan Flores | Davy Weissenbacher | Arjun Magge | Karen O'Connor | Abeed Sarker | Anne-Lyse Minard | Elena Tutubalina | Zulfat Miftahutdinov | Ilseyar Alimova
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task
Overview of the Fifth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2020
Ari Klein | Ilseyar Alimova | Ivan Flores | Arjun Magge | Zulfat Miftahutdinov | Anne-Lyse Minard | Karen O’Connor | Abeed Sarker | Elena Tutubalina | Davy Weissenbacher | Graciela Gonzalez-Hernandez
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task
The vast amount of data on social media presents significant opportunities and challenges for its use as a resource for health informatics. The fifth iteration of the Social Media Mining for Health Applications (#SMM4H) shared tasks sought to advance the use of Twitter data (tweets) for pharmacovigilance, toxicovigilance, and epidemiology of birth defects. In addition to re-runs of three tasks, #SMM4H 2020 included new tasks for detecting adverse effects of medications in French and Russian tweets, characterizing chatter related to prescription medication abuse, and detecting self-reports of birth defect pregnancy outcomes. The five tasks required methods for binary classification, multi-class classification, and named entity recognition (NER). With 29 teams and a total of 130 system submissions, participation in the #SMM4H shared tasks continues to grow.
2019
SemEval-2019 Task 12: Toponym Resolution in Scientific Papers
Davy Weissenbacher | Arjun Magge | Karen O’Connor | Matthew Scotch | Graciela Gonzalez-Hernandez
Proceedings of the 13th International Workshop on Semantic Evaluation
We present the SemEval-2019 Task 12 which focuses on toponym resolution in scientific articles. Given an article from PubMed, the task consists of detecting mentions of names of places, or toponyms, and mapping the mentions to their corresponding entries in GeoNames.org, a database of geospatial locations. We proposed three subtasks. In Subtask 1, we asked participants to detect all toponyms in an article. In Subtask 2, given toponym mentions as input, we asked participants to disambiguate them by linking them to entries in GeoNames. In Subtask 3, we asked participants to perform both the detection and the disambiguation steps for all toponyms. A total of 29 teams registered, and 8 teams submitted a system run. We summarize the corpus and the tools created for the challenge. They are freely available at https://competitions.codalab.org/competitions/19948. We also analyze the methods, the results and the errors made by the competing systems with a focus on toponym disambiguation.
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
Davy Weissenbacher | Graciela Gonzalez-Hernandez
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
Overview of the Fourth Social Media Mining for Health (SMM4H) Shared Tasks at ACL 2019
Davy Weissenbacher | Abeed Sarker | Arjun Magge | Ashlynn Daughton | Karen O’Connor | Michael J. Paul | Graciela Gonzalez-Hernandez
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
The number of users of social media continues to grow, with nearly half of adults worldwide and two-thirds of all American adults using social networking. Advances in automated data processing, machine learning and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this medium. We present the Social Media Mining for Health Shared Tasks co-located with ACL in Florence in 2019, which address these challenges for health monitoring and surveillance, utilizing state-of-the-art techniques for processing noisy, real-world, and substantially creative language expressions from social media users. For the fourth execution of this challenge, we proposed four different tasks. Task 1 asked participants to distinguish tweets reporting an adverse drug reaction (ADR) from those that do not. Task 2, a follow-up to Task 1, asked participants to identify the span of text in tweets reporting ADRs. Task 3 is an end-to-end task where the goal was to first detect tweets mentioning an ADR and then map the extracted colloquial mentions of ADRs in the tweets to their corresponding standard concept IDs in the MedDRA vocabulary. Finally, Task 4 asked participants to classify whether a tweet contains a personal mention of one’s health, a more general discussion of the health issue, or is an unrelated mention. A total of 34 teams from around the world registered and 19 teams from 12 countries submitted a system run. We summarize here the corpora for this challenge, which are freely available at https://competitions.codalab.org/competitions/22521, and present an overview of the methods and the results of the competing systems.
2018
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task
Graciela Gonzalez-Hernandez | Davy Weissenbacher | Abeed Sarker | Michael Paul
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task
Overview of the Third Social Media Mining for Health (SMM4H) Shared Tasks at EMNLP 2018
Davy Weissenbacher | Abeed Sarker | Michael J. Paul | Graciela Gonzalez-Hernandez
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task
The goals of the SMM4H shared tasks are to release annotated, social media-based, health-related datasets to the research community, and to compare the performance of natural language processing and machine learning systems on tasks involving these datasets. The third execution of the SMM4H shared tasks, co-hosted with EMNLP-2018, comprised four subtasks. These subtasks involve annotated user posts from Twitter (tweets) and focus on the (i) automatic classification of tweets mentioning a drug name, (ii) automatic classification of tweets containing reports of first-person medication intake, (iii) automatic classification of tweets presenting self-reports of adverse drug reactions (ADRs), and (iv) automatic classification of vaccine behavior mentions in tweets. A total of 14 teams participated and 78 system runs were submitted (23 for task 1, 20 for task 2, 18 for task 3, 17 for task 4).
Dealing with Medication Non-Adherence Expressions in Twitter
Takeshi Onishi | Davy Weissenbacher | Ari Klein | Karen O’Connor | Graciela Gonzalez-Hernandez
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task
Through a semi-automatic analysis of tweets, we show that Twitter users express not only Medication Non-Adherence (MNA) in social media but also their reasons for not complying. Further research is necessary to extract and analyze this information fully automatically, in order to facilitate the use of this data in epidemiological studies.
2016
Automatic Prediction of Linguistic Decline in Writings of Subjects with Degenerative Dementia
Davy Weissenbacher | Travis A. Johnson | Laura Wojtulewicz | Amylou Dueck | Dona Locke | Richard Caselli | Graciela Gonzalez
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
2015
DIEGOLab: An Approach for Message-level Sentiment Classification in Twitter
Abeed Sarker | Azadeh Nikfarjam | Davy Weissenbacher | Graciela Gonzalez
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2014
Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses
Tasnia Tahsin | Robert Rivera | Rachel Beard | Rob Lauder | Davy Weissenbacher | Matthew Scotch | Garrick Wallstrom | Graciela Gonzalez
Proceedings of BioNLP 2014
2011
Comprendre les effets des erreurs d’annotations des plateformes de TAL, une étude sur la résolution des anaphores pronominales [Understand the effects of erroneous annotations produced by NLP pipelines, a case study on the pronominal anaphora resolution]
Davy Weissenbacher | Adeline Nazarenko
Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]
2009
ASSIST : un moteur de recherche spécialisé pour l’analyse des cadres d’expériences [ASSIST: a specialized search engine for frame analysis]
Davy Weissenbacher | Elisa Pieri | Sophia Ananiadou | Brian Rea | Farida Vis | Yuwei Lin | Rob Procter | Peter Halfpenny
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
Qualitative data analysis requires the sociologist to do substantial work selecting and interpreting documents. To ease this work, the community has adopted software tools, but their functionality remains limited. The ASSIST project is an exploratory study to determine which natural language processing (NLP) modules can assist the sociologist in this analysis. We present the search engine that was built and justify the choice of NLP components integrated into the prototype.
2007
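Identifier les pronoms anaphoriques et trouver leurs antécédents : l’intérêt de la classification bayésienne [Identifying anaphoric pronouns and finding their antecedents: the value of Bayesian classification]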
Davy Weissenbacher | Adeline Nazarenko
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
In NLP, systems based on linguistic knowledge are often contrasted with those that rely on surface cues. Each approach has its limits and its advantages. In this article we propose a new approach based on Bayesian networks that combines these two heterogeneous and complementary types of information within a single representation. We justify the value of our approach by comparing the performance of the Bayesian network with that of state-of-the-art systems on a difficult NLP problem, anaphora resolution.
2006
Bayesian Network, a Model for NLP?
Davy Weissenbacher
Demonstrations
The ALVIS Format for Linguistically Annotated Documents
A. Nazarenko | E. Alphonse | J. Derivière | T. Hamon | G. Vauvert | D. Weissenbacher
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The paper describes the ALVIS annotation format and discusses the problems that we encountered in indexing large collections of documents for topic-specific search engines. The paper draws its examples from the biological domain and MedLine abstracts, as developing a specialized search engine for biologists is one of the ALVIS case studies. The ALVIS principle for linguistic annotations is based on existing work and proposed standards. We chose stand-off annotations rather than inserted mark-up, and annotations are encoded as XML elements which form the linguistic subsection of the document record.
2004
La relation de synonymie en génomique [The synonymy relation in genomics]
Davy Weissenbacher
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (Posters)
Accessing the content of genomics texts is now an important challenge. This first requires identifying the names of biological entities such as genes or proteins. The question of how these names vary then arises. This question is particularly important in genomics, where gene names are subject to many variations, notably synonymy. Starting from a corpus study showing that synonymy is a stable and linguistically marked relation, this article proposes a model of synonymy and an extraction method specifically adapted to this relation. Based on our first experiments, this method appears more promising than the generic approaches used to extract this relation.
Event-Based Information Extraction for the Biomedical Domain: the Caderige Project
Erick Alphonse | Sophie Aubin | Philippe Bessières | Gilles Bisson | Thierry Hamon | Sandrine Lagarrigue | Adeline Nazarenko | Alain-Pierre Manine | Claire Nédellec | Mohamed Ould Abdel Vetah | Thierry Poibeau | Davy Weissenbacher
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)