Katrin Tomanek

2024

pdf bib abs
Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics
Tyler A. Chang | Katrin Tomanek | Jessica Hoffmann | Nithum Thain | Erin MacMurray van Liemt | Kathleen Meier-Hellstern | Lucas Dixon
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia’s Neutral Point of View (NPOV) principle: acknowledge the absence of a single true answer and surface multiple perspectives. We frame this as retrieval augmented generation, where perspectives are retrieved from a knowledge base and the LLM is tasked with generating a fluent and faithful response from the given perspectives. As a starting point, we use a deterministic retrieval system and then focus on common LLM failure modes that arise during this approach to text generation, namely hallucination and coverage errors. We propose and evaluate three methods to detect such errors based on (1) word-overlap, (2) salience, and (3) LLM-based classifiers. Our results demonstrate that LLM-based classifiers, even when trained only on synthetic errors, achieve high error detection performance, with ROC AUC scores of 95.3% for hallucination and 90.5% for coverage error detection on unambiguous error cases. We show that when no training data is available, our other methods still yield good results on hallucination (84.0%) and coverage error (85.2%) detection.

2023

Using pre-trained language models to implement classifiers from small to modest amounts of training data is an area of active research. The ability of large language models to generalize from few-shot examples and to produce strong classifiers is extended using the engineering approach of parameter-efficient tuning. Using the Explainable Detection of Online Sexism (EDOS) training data and a small number of trainable weights to create a tuned prompt vector, a competitive model for this task was built, which was top-ranked in Subtask B.

Text-based safety classifiers are widely used for content moderation and increasingly to tune generative language model behavior - a topic of growing concern for the safety of digital assistants and chatbots. However, different policies require different classifiers, and safety policies themselves improve from iteration and adaptation. This paper introduces and evaluates methods for agile text classification, whereby classifiers are trained using small, targeted datasets that can be quickly developed for a particular policy. Experimenting with 7 datasets from three safety-related domains, comprising 15 annotation schemes, led to our key finding: prompt-tuning large language models, like PaLM 62B, with a labeled dataset of as few as 80 examples can achieve state-of-the-art performance. We argue that this enables a paradigm shift for text classification, especially for models supporting safer online discourse. Instead of collecting millions of examples to attempt to create universal safety classifiers over months or years, classifiers could be tuned using small datasets, created by individuals or small organizations, tailored for specific use cases, and iterated on and adapted in the time-span of a day.

2022

pdf bib abs
Context-Aware Abbreviation Expansion Using Large Language Models
Shanqing Cai | Subhashini Venugopalan | Katrin Tomanek | Ajit Narayanan | Meredith Morris | Michael Brenner
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Motivated by the need for accelerating text entry in augmentative and alternative communication (AAC) for people with severe motor impairments, we propose a paradigm in which phrases are abbreviated aggressively as primarily word-initial letters. Our approach is to expand the abbreviations into full-phrase options by leveraging conversation context with the power of pretrained large language models (LLMs). Through zero-shot, few-shot, and fine-tuning experiments on four public conversation datasets, we show that for replies to the initial turn of a dialog, an LLM with 64B parameters is able to exactly expand over 70% of phrases with abbreviation length up to 10, leading to an effective keystroke saving rate of up to about 77% on these exact expansions. Including a small amount of context in the form of a single conversation turn more than doubles abbreviation expansion accuracies compared to having no context, an effect that is more pronounced for longer phrases. Additionally, the robustness of models against typo noise can be enhanced through fine-tuning on noisy data.

2021

pdf bib abs
Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech
Katrin Tomanek | Vicky Zayats | Dirk Padfield | Kara Vaillancourt | Fadi Biadsy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapter, we can achieve similar adaptation gains compared to model fine-tuning, while only updating a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.

We describe the re-annotation of selected types of named entities (persons, organizations, locations) from the Muc7 corpus. The focus of this annotation initiative is on recording the time needed for the linguistic process of named entity annotation. Annotation times are measured on two basic annotation units -- sentences vs. complex noun phrases. We gathered evidence that decision times are non-uniformly distributed over the annotation units, while they do not substantially deviate among annotators. This data seems to support the hypothesis that annotation times very much depend on the inherent ""hardness"" of each single annotation decision. We further show how such time-stamped information can be used for empirically grounded studies of selective sampling techniques, such as Active Learning. We directly compare Active Learning costs on the basis of token-based vs. time-based measurements. The data reveals that Active Learning keeps its competitive advantage over random sampling in both scenarios though the difference is less marked for the time metric than for the token metric.

The production of gold standard corpora is time-consuming and costly. We propose an alternative: the âsilver standard corpus (SSC), a corpus that has been generated by the harmonisation of the annotations that have been delivered from a selection of annotation systems. The systems have to share the type system for the annotations and the harmonisation solution has use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630.324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand, how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.

2009

pdf bib
An Intrinsic Stopping Criterion for Committee-Based Active Learning
Fredrik Olsson | Katrin Tomanek
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

pdf bib
How Feasible and Robust is the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and Real-Life Conditions
Udo Hahn | Katrin Tomanek | Ekaterina Buyko | Jung-jae Kim | Dietrich Rebholz-Schuhmann
Proceedings of the BioNLP 2009 Workshop

pdf bib
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
Eric Ringger | Robbie Haertel | Katrin Tomanek
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf bib
On Proper Unit Selection in Active Learning: Co-Selection Effects for Named Entity Recognition
Katrin Tomanek | Florian Laws | Udo Hahn | Hinrich Schütze
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf bib
A Web Survey on the Use of Active Learning to Support Annotation of Text Data
Katrin Tomanek | Fredrik Olsson
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf bib
Timed Annotations — Enhancing MUC7 Metadata by the Time It Takes to Annotate Named Entities
Katrin Tomanek | Udo Hahn
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf bib
Semi-Supervised Active Learning for Sequence Labeling
Katrin Tomanek | Udo Hahn
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
Multi-Task Active Learning for Linguistic Annotations
Roi Reichart | Katrin Tomanek | Udo Hahn | Ari Rappoport
Proceedings of ACL-08: HLT

pdf bib abs
Approximating Learning Curves for Active-Learning-Driven Annotation
Katrin Tomanek | Udo Hahn
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks. A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts. This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated. While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which - in realistic annotation settings, at least - is often unavailable. We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members. This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain. Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions.

pdf bib abs
Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Michael Poprat | Katrin Tomanek | Joachim Wermter
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large amount of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference relations in particular.