
Showing 1–13 of 13 results for author: Karpinska, M

  1. arXiv:2407.19884  [pdf, other]

    cs.CL

    Preliminary WMT24 Ranking of General MT Systems and LLMs

    Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovic, Mariya Shmatova, Steinþór Steingrímsson, Vilém Zouhar

    Abstract: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings but only provide preliminary results to the participants of the General MT task that may be useful during the writing of the system submissio…

    Submitted 29 July, 2024; originally announced July 2024.

  2. arXiv:2406.17761  [pdf, other]

    cs.CL cs.AI cs.LG

    CaLMQA: Exploring culturally specific long-form question answering across 23 languages

    Authors: Shane Arora, Marzena Karpinska, Hung-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, Eunsol Choi

    Abstract: Large language models (LLMs) are used for long-form question answering (LFQA), which requires them to generate paragraph-length answers to complex questions. While LFQA has been well-studied in English, this research has not been extended to other languages. To bridge this gap, we introduce CaLMQA, a collection of 1.5K complex culturally specific questions spanning 23 languages and 51 culturally a…

    Submitted 3 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 39 pages, 17 figures. Code and data available at https://github.com/2015aroras/CaLMQA. Revised argument in section 4, results unchanged

  3. arXiv:2406.16264  [pdf, other]

    cs.CL cs.AI

    One Thousand and One Pairs: A "novel" challenge for long-context language models

    Authors: Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, wr…

    Submitted 18 July, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

    Comments: preprint, 37 pages

  4. arXiv:2406.11580  [pdf, other]

    cs.CL

    Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation

    Authors: Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, Mariya Shmatova

    Abstract: High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA)…

    Submitted 17 June, 2024; originally announced June 2024.

  5. arXiv:2404.01261  [pdf, other]

    cs.CL cs.AI

    FABLES: Evaluating faithfulness and content selection in book-length summarization

    Authors: Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study miti…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: preprint - 39 pages

  6. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, where…

    Submitted 23 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  7. arXiv:2304.03245  [pdf, other]

    cs.CL

    Large language models effectively leverage document-level context for literary translation, but critical errors persist

    Authors: Marzena Karpinska, Mohit Iyyer

    Abstract: Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragrap…

    Submitted 22 May, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

    Comments: preprint (31 pages)

  8. arXiv:2303.13408  [pdf, other]

    cs.CL cs.CR cs.LG

    Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

    Authors: Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer

    Abstract: The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build a 11B…

    Submitted 17 October, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: NeurIPS 2023 camera ready (32 pages). Code, models, data available in https://github.com/martiansideofthemoon/ai-detection-paraphrases

  9. arXiv:2210.14250  [pdf, other]

    cs.CL

    Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature

    Authors: Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Bill Ray, Moira Inghilleri, John Wieting, Mohit Iyyer

    Abstract: Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than m…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  10. arXiv:2210.13746  [pdf, other]

    cs.CL

    DEMETR: Diagnosing Evaluation Metrics for Translation

    Authors: Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

    Abstract: While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlati…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 22 pages, EMNLP 2022 (camera ready)

  11. arXiv:2210.07188  [pdf, other]

    cs.CL

    ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution

    Authors: Ankita Gupta, Marzena Karpinska, Wenlong Zhao, Kalpesh Krishna, Jack Merullo, Luke Yeh, Mohit Iyyer, Brendan O'Connor

    Abstract: Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreferences and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with var…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: preprint (19 pages), code in https://github.com/gnkitaa/ezCoref

  12. arXiv:2109.06835  [pdf, other]

    cs.CL

    The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

    Authors: Marzena Karpinska, Nader Akoury, Mohit Iyyer

    Abstract: Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT).…

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 (20 pages)

  13. arXiv:1908.11443  [pdf, other]

    cs.CL

    NarrativeTime: Dense Temporal Annotation on a Timeline

    Authors: Anna Rogers, Marzena Karpinska, Ankita Gupta, Vladislav Lialin, Gregory Smelkov, Anna Rumshisky

    Abstract: For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLinks. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of TimeBankDense corpus, which shows comparable agreement with…

    Submitted 22 December, 2022; v1 submitted 29 August, 2019; originally announced August 2019.