Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Showing 1–48 of 48 results for author: Hessel, J

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.12043  [pdf, other

    cs.CL cs.AI cs.HC

    The Art of Saying No: Contextual Noncompliance in Language Models

    Authors: Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2407.01942  [pdf, other

    cs.AI cs.CL cs.CV

    Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness

    Authors: Khyathi Raghavi Chandu, Linjie Li, Anas Awadalla, Ximing Lu, Jae Sung Park, Jack Hessel, Lijuan Wang, Yejin Choi

    Abstract: The ability to acknowledge the inevitable uncertainty in their knowledge and reasoning is a prerequisite for AI systems to be truly truthful and reliable. In this paper, we present a taxonomy of uncertainty specific to vision-language AI systems, distinguishing between epistemic uncertainty (arising from a lack of information) and aleatoric uncertainty (due to inherent unpredictability), and furth… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 26 pages

  3. arXiv:2407.00369  [pdf, other

    cs.CL

    How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models

    Authors: Jaeyoung Lee, Ximing Lu, Jack Hessel, Faeze Brahman, Youngjae Yu, Yonatan Bisk, Yejin Choi, Saadia Gabriel

    Abstract: Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-check… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  4. arXiv:2406.18925  [pdf, other

    cs.CL cs.CV

    Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

    Authors: Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu

    Abstract: Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 12 pages, 5 figures

  5. arXiv:2405.01470  [pdf, other

    cs.CL

    WildChat: 1M ChatGPT Interaction Logs in the Wild

    Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng

    Abstract: Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request head… ▽ More

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: accepted by ICLR 2024

  6. arXiv:2402.15610  [pdf, other

    cs.CL

    Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

    Authors: Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu

    Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to… ▽ More

    Submitted 12 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL Findings 2024

  7. arXiv:2402.09052  [pdf, other

    cs.AI

    L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects

    Authors: Yutaro Yamada, Khyathi Chandu, Yuchen Lin, Jack Hessel, Ilker Yildirim, Yejin Choi

    Abstract: Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, thereby out-of-distribution descriptions, such as "a chair with… ▽ More

    Submitted 14 February, 2024; originally announced February 2024.

  8. arXiv:2402.03284  [pdf, other

    cs.CL cs.AI cs.LG

    Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models

    Authors: Anthony Sicilia, Hyunwoo Kim, Khyathi Raghavi Chandu, Malihe Alikhani, Jack Hessel

    Abstract: Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing "conversation forecasting" task: instead of just accuracy, evaluation is… ▽ More

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 2 Figures; 7 Tables; 27 pages

  9. arXiv:2402.00838  [pdf, other

    cs.CL

    OLMo: Accelerating the Science of Language Models

    Authors: Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam , et al. (18 additional authors not shown)

    Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models… ▽ More

    Submitted 7 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  10. arXiv:2312.05979  [pdf, other

    cs.CL

    NovaCOMET: Open Commonsense Foundation Models with Symbolic Knowledge Distillation

    Authors: Peter West, Ronan Le Bras, Taylor Sorensen, Bill Yuchen Lin, Liwei Jiang, Ximing Lu, Khyathi Chandu, Jack Hessel, Ashutosh Baheti, Chandra Bhagavatula, Yejin Choi

    Abstract: We present NovaCOMET, an open commonsense knowledge model, that combines the best aspects of knowledge and general task models. Compared to previous knowledge models, NovaCOMET allows open-format relations enabling direct application to reasoning tasks; compared to general task models like Flan-T5, it explicitly centers knowledge, enabling superior performance for commonsense reasoning. NovaCOME… ▽ More

    Submitted 10 December, 2023; originally announced December 2023.

  11. arXiv:2312.04837  [pdf, other

    cs.AI cs.CL cs.CV

    Localized Symbolic Knowledge Distillation for Visual Commonsense Models

    Authors: Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi

    Abstract: Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also, for practical applica… ▽ More

    Submitted 12 December, 2023; v1 submitted 8 December, 2023; originally announced December 2023.

    Comments: Neurips 2023

  12. arXiv:2311.08469  [pdf, other

    cs.CL

    UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations

    Authors: Wenting Zhao, Justin T Chiu, Jena D. Hwang, Faeze Brahman, Jack Hessel, Sanjiban Choudhury, Yejin Choi, Xiang Lorraine Li, Alane Suhr

    Abstract: Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexp… ▽ More

    Submitted 1 May, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: accepted at NAACL'24

  13. arXiv:2311.02805  [pdf, other

    cs.CL

    Tailoring Self-Rationalizers with Multi-Reward Distillation

    Authors: Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

    Abstract: Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In thi… ▽ More

    Submitted 22 May, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

    Journal ref: The Twelfth International Conference on Learning Representations, 2024

  14. arXiv:2310.19785  [pdf, other

    cs.CL cs.CV cs.LG

    What's "up" with vision-language models? Investigating their struggle with spatial reasoning

    Authors: Amita Kamath, Jack Hessel, Kai-Wei Chang

    Abstract: Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping th… ▽ More

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  15. arXiv:2310.11564  [pdf, other

    cs.CL

    Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

    Authors: Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu

    Abstract: While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi… ▽ More

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Preprint

  16. arXiv:2310.10418  [pdf, other

    cs.LG cs.AI

    Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

    Authors: Seungju Han, Junhyeok Kim, Jack Hessel, Liwei Jiang, Jiwan Chung, Yejin Son, Yejin Choi, Youngjae Yu

    Abstract: Commonsense norms are defeasible by context: reading books is usually great, but not when driving a car. While contexts can be explicitly described in language, in embodied scenarios, contexts are often provided visually. This type of visually grounded reasoning about defeasible commonsense norms is generally easy for humans, but (as we show) poses a challenge for machines, as it necessitates both… ▽ More

    Submitted 11 November, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: Published as a conference paper at EMNLP 2023 (long)

  17. arXiv:2308.06595  [pdf, other

    cs.CL cs.AI cs.CV

    VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

    Authors: Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt

    Abstract: We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and c… ▽ More

    Submitted 26 December, 2023; v1 submitted 12 August, 2023; originally announced August 2023.

    Comments: Accepted to NeurIPS 2023, Datasets and Benchmarks. Website: https://visit-bench.github.io/

  18. arXiv:2308.01390  [pdf, other

    cs.CV cs.AI cs.LG

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Authors: Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt

    Abstract: We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperpar… ▽ More

    Submitted 7 August, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

  19. arXiv:2306.14899  [pdf, other

    cs.CV cs.AI cs.CL cs.MM

    FunQA: Towards Surprising Video Comprehension

    Authors: Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu

    Abstract: Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed t… ▽ More

    Submitted 22 March, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

    Comments: Project Page: https://funqa-benchmark.github.io/ Codebase: https://github.com/Jingkang50/FunQA

  20. arXiv:2306.14050  [pdf, other

    cs.CL

    Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

    Authors: Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi

    Abstract: Chain-of-thought prompting (e.g., "Let's think step-by-step") primes large language models to verbalize rationalization for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M -- 1.3B parameters) can still benefit from chain-of-th… ▽ More

    Submitted 15 April, 2024; v1 submitted 24 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  21. arXiv:2306.04751  [pdf, other

    cs.CL

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    Authors: Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

    Abstract: In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a la… ▽ More

    Submitted 30 October, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: 18 pages, 6 figure, 10 tables. NeurIPS 2023 Datasets and Benchmarks Track Camera Ready

  22. arXiv:2305.14897  [pdf, other

    cs.CL cs.CV cs.LG

    Text encoders bottleneck compositionality in contrastive vision-language models

    Authors: Amita Kamath, Jack Hessel, Kai-Wei Chang

    Abstract: Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that ai… ▽ More

    Submitted 30 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  23. arXiv:2304.06939  [pdf, other

    cs.CV cs.CL

    Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

    Authors: Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi

    Abstract: In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also, more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs… ▽ More

    Submitted 28 October, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

    Comments: NeurIPS D&B 2023. Project homepage: https://github.com/allenai/mmc4

  24. arXiv:2303.09713  [pdf, other

    cs.CL cs.AI

    CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

    Authors: Seungju Han, Jack Hessel, Nouha Dziri, Yejin Choi, Youngjae Yu

    Abstract: Visual information is central to conversation: body gestures and physical behaviour, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of… ▽ More

    Submitted 16 August, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

    Comments: ICCV 2023, Project page: https://seungjuhan.me/champagne

  25. arXiv:2303.07274  [pdf, other

    cs.CV cs.AI cs.CL

    Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

    Authors: Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz

    Abstract: Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconvent… ▽ More

    Submitted 12 August, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023. Website: whoops-benchmark.github.io

  26. arXiv:2212.10465  [pdf, other

    cs.CL

    SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

    Authors: Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, Yejin Choi

    Abstract: Data scarcity has been a long standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human eva… ▽ More

    Submitted 23 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: EMNLP 2023. Dataset, model, and code can be found at https://hyunw.kim/sodaverse

  27. arXiv:2210.01241  [pdf, other

    cs.CL cs.LG

    Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

    Authors: Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, Yejin Choi

    Abstract: We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of… ▽ More

    Submitted 28 February, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: In Proceedings of ICLR 2023. Code found at https://github.com/allenai/rl4lms and Project website at https://rl4lms.apps.allenai.org/

  28. arXiv:2209.06293  [pdf, other

    cs.CL cs.CV

    Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest

    Authors: Jack Hessel, Ana Marasović, Jena D. Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, Yejin Choi

    Abstract: Large neural networks can now generate jokes, but do they really "understand" humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny. These tasks encapsulate progressively more sophisticated aspects of "understanding" a cartoon; key elements are th… ▽ More

    Submitted 6 July, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

    Journal ref: ACL 2023

  29. arXiv:2205.13636  [pdf, other

    cs.CL cs.LG

    Quark: Controllable Text Generation with Reinforced Unlearning

    Authors: Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi

    Abstract: Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning… ▽ More

    Submitted 16 November, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

    Journal ref: NeurIPS 2022 (Oral Selection)

  30. arXiv:2205.12630  [pdf, other

    cs.CL cs.CV

    Multimodal Knowledge Alignment with Reinforcement Learning

    Authors: Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, JaeSung Park, Ximing Lu, Prithviraj Ammanabrolu, Rowan Zellers, Ronan Le Bras, Gunhee Kim, Yejin Choi

    Abstract: Large language models readily adapt to novel settings, even without task-specific training data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we propose ESPER which extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generatio… ▽ More

    Submitted 25 May, 2022; originally announced May 2022.

    ACM Class: I.2.7; I.4.9

  31. arXiv:2202.04800  [pdf, other

    cs.CV cs.CL

    The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

    Authors: Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

    Abstract: Humans have remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might as… ▽ More

    Submitted 25 July, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: code, data, models at http://visualabduction.com/

    Journal ref: ECCV 2022

  32. arXiv:2201.02639  [pdf, other

    cs.CV cs.CL cs.LG cs.SD eess.AS

    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

    Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

    Abstract: As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Ou… ▽ More

    Submitted 13 May, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Project page at https://rowanzellers.com/merlotreserve

  33. arXiv:2112.08995  [pdf, other

    cs.SD cs.CL cs.CV eess.AS

    Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

    Authors: Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi

    Abstract: Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces \textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text data. Our key idea is t… ▽ More

    Submitted 2 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted to NAACL 2022. Our code is available at https://github.com/zhaoyanpeng/vipant

  34. arXiv:2112.08674  [pdf, other

    cs.CL

    Reframing Human-AI Collaboration for Generating Free-Text Explanations

    Authors: Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, Yejin Choi

    Abstract: Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using human-written examples in a few-shot manner. We find that (1) authoring higher quality prompts results in higher quality generations; and… ▽ More

    Submitted 4 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: NAACL 2022 Camera-ready. 13 pages main + references, 14 pages appendix

  35. arXiv:2110.07178  [pdf, other

    cs.CL

    Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

    Authors: Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi

    Abstract: The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowl… ▽ More

    Submitted 28 November, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

  36. arXiv:2106.02636  [pdf, other

    cs.CV cs.CL cs.LG

    MERLOT: Multimodal Neural Script Knowledge Models

    Authors: Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

    Abstract: As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (s… ▽ More

    Submitted 21 October, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

    Comments: project page at https://rowanzellers.com/merlot; NeurIPS 2021 camera ready

  37. arXiv:2104.08718  [pdf, other

    cs.CV cs.CL

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi

    Abstract: Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs f… ▽ More

    Submitted 23 March, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Journal ref: EMNLP 2021

  38. arXiv:2010.16363  [pdf, other

    cs.CL cs.CV

    Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents

    Authors: Gregory Yauney, Jack Hessel, David Mimno

    Abstract: Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant l… ▽ More

    Submitted 30 October, 2020; originally announced October 2020.

    Journal ref: Published in EMNLP 2020

  39. arXiv:2010.06572  [pdf, other

    cs.CL cs.CV

    Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!

    Authors: Jack Hessel, Lillian Lee

    Abstract: Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performanc… ▽ More

    Submitted 13 October, 2020; originally announced October 2020.

    Journal ref: Published in EMNLP 2020

  40. arXiv:2004.14338  [pdf, other

    cs.CL cs.CV

    Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube

    Authors: Jack Hessel, Zhenhai Zhu, Bo Pang, Radu Soricut

    Abstract: Pretraining from unlabelled web videos has quickly become the de-facto means of achieving high performance on many video understanding tasks. Features are learned via prediction of grounded relationships between visual content and automatic speech recognition (ASR) tokens. However, prior pretraining work has been limited to only instructional videos; a priori, we expect this domain to be relativel… ▽ More

    Submitted 16 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: 11 pages including supplementary materials

    Journal ref: Published in EMNLP 2020

  41. arXiv:1910.02930  [pdf, other

    cs.CL

    A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

    Authors: Jack Hessel, Bo Pang, Zhenhai Zhu, Radu Soricut

    Abstract: Instructional videos get high-traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance… ▽ More

    Submitted 7 October, 2019; originally announced October 2019.

    Journal ref: Published in The SIGNLL Conference on Computational Natural Language Learning (CoNLL) 2019

  42. arXiv:1904.07826  [pdf, other

    cs.CL cs.CV

    Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents

    Authors: Jack Hessel, Lillian Lee, David Mimno

    Abstract: Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present. We present algorithms that discover image-sentence relationships without relying on explicit multimodal annotation in training. We experiment on seven datasets of varying difficulty, ranging from documents consisting of groups of images capt… ▽ More

    Submitted 31 August, 2019; v1 submitted 16 April, 2019; originally announced April 2019.

    Comments: Code and data available at http://www.cs.cornell.edu/~jhessel/multiretrieval/multiretrieval.html

    Journal ref: EMNLP 2019

  43. arXiv:1904.07372  [pdf, other

    cs.CL cs.SI physics.soc-ph

    Something's Brewing! Early Prediction of Controversy-causing Posts from Discussion Features

    Authors: Jack Hessel, Lillian Lee

    Abstract: Controversial posts are those that split the preferences of a community, receiving both significant positive and significant negative feedback. Our inclusion of the word "community" here is deliberate: what is controversial to some audiences may not be so to others. Using data from several different communities on reddit.com, we predict the ultimate controversiality of posts, leveraging features d… ▽ More

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: Accepted at NAACL 2019 as a long paper

  44. arXiv:1804.06786  [pdf, other

    cs.CL cs.CV cs.IR

    Quantifying the visual concreteness of words and topics in multimodal datasets

    Authors: Jack Hessel, David Mimno, Lillian Lee

    Abstract: Multimodal machine learning algorithms aim to learn visual-textual correspondences. Previous work suggests that concepts with concrete visual manifestations may be easier to learn than concepts with abstract ones. We give an algorithm for automatically computing the visual concreteness of words and topics within multimodal datasets. We apply the approach in four settings, ranging from image captio… ▽ More

    Submitted 23 May, 2018; v1 submitted 18 April, 2018; originally announced April 2018.

    Comments: NAACL HLT 2018, 14 pages, 6 figures, data available at http://www.cs.cornell.edu/~jhessel/concreteness/concreteness.html

    Journal ref: 2018 North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT)

  45. arXiv:1703.01725  [pdf, other

    cs.SI cs.CL cs.CV physics.soc-ph

    Cats and Captions vs. Creators and the Clock: Comparing Multimodal Content to Context in Predicting Relative Popularity

    Authors: Jack Hessel, Lillian Lee, David Mimno

    Abstract: The content of today's social media is becoming more and more rich, increasingly mixing text, images, videos, and audio. It is an intriguing research question to model the interplay between these different modes in attracting user attention and engagement. But in order to pursue this study of multimodal content, we must also account for context: timing effects, community preferences, and social fa… ▽ More

    Submitted 5 March, 2017; originally announced March 2017.

    Comments: 10 pages, data and models available at http://www.cs.cornell.edu/~jhessel/cats/cats.html, Proceedings of WWW 2017

  46. arXiv:1612.07487  [pdf, other

    cs.SI physics.soc-ph

    Science, AskScience, and BadScience: On the Coexistence of Highly Related Communities

    Authors: Jack Hessel, Chenhao Tan, Lillian Lee

    Abstract: When large social-media platforms allow users to easily form and self-organize into interest groups, highly related communities can arise. For example, the Reddit site hosts not just a group called food, but also HealthyFood, foodhacks, foodporn, and cooking, among others. Are these highly related communities created for similar classes of reasons (e.g., to focus on a subtopic, to create a place f… ▽ More

    Submitted 22 December, 2016; originally announced December 2016.

    Comments: Appeared in In Tenth International AAAI Conference on Web and Social Media (2016)

  47. arXiv:1511.03371  [pdf, ps, other

    cs.SI physics.soc-ph

    What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks

    Authors: Jack Hessel, Alexandra Schofield, Lillian Lee, David Mimno

    Abstract: Most social network analysis works at the level of interactions between users. But the vast growth in size and complexity of social networks enables us to examine interactions at larger scale. In this work we use a dataset of 76M submissions to the social network Reddit, which is organized into distinct sub-communities called subreddits. We measure the similarity between entire subreddits both in… ▽ More

    Submitted 24 November, 2015; v1 submitted 10 November, 2015; originally announced November 2015.

    Comments: NIPS 2015 Network Workshop

  48. arXiv:1508.02091  [pdf, other

    cs.CL cs.CV

    Image Representations and New Domains in Neural Image Captioning

    Authors: Jack Hessel, Nicolas Savva, Michael J. Wilber

    Abstract: We examine the possibility that recent promising results in automatic caption generation are due primarily to language models. By varying image representation quality produced by a convolutional neural network, we find that a state-of-the-art neural captioning algorithm is able to produce quality captions even when provided with surprisingly poor image representations. We replicate this result in… ▽ More

    Submitted 9 August, 2015; originally announced August 2015.

    Comments: 11 Pages, 5 Images, To appear at EMNLP 2015's Vision + Learning workshop