Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Chen, Zhanwen; Wang, Tianchun; Wang, Yizhou; Kosinski, Michal; Zhang, Xiang; Fu, Yun; Li, Sheng

arXiv:2406.13763 (cs)

[Submitted on 19 Jun 2024]

Abstract: Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline that enables multimodal LLMs to perform ToM reasoning over video and text. We also enable explicit ToM reasoning by retrieving the key frames used to answer a ToM question, which reveals how multimodal LLMs reason about ToM.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.13763 [cs.CV]
	(or arXiv:2406.13763v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.13763

Submission history

From: Zhanwen Chen
[v1] Wed, 19 Jun 2024 18:24:31 UTC (13,370 KB)
