Evaluating Large Language Models in Theory of Mind Tasks

Kosinski, Michal

doi:10.1073/pnas.2405460121

arXiv:2302.02083 (cs)

[Submitted on 4 Feb 2023 (v1), last revised 4 Nov 2024 (this version, v7)]

Title: Evaluating Large Language Models in Theory of Mind Tasks

Authors: Michal Kosinski

Abstract: Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks, considered a gold standard in testing Theory of Mind (ToM) in humans. The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of six-year-old children observed in past studies. We explore the potential interpretation of these findings, including the intriguing possibility that ToM, previously considered exclusive to humans, may have spontaneously emerged as a byproduct of LLMs' improving language skills.
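
The all-or-nothing scoring rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the study's own code: the function names, the dictionary layout, and the demo data are hypothetical, and the split of two prompts per scenario is an assumption inferred from 16 prompts being spread across eight scenarios.

```python
from typing import Dict, List

# Battery structure as stated in the abstract.
TASKS = 40
SCENARIOS_PER_TASK = 8    # 1 false-belief + 3 true-belief controls, plus reversed versions of all four
PROMPTS_PER_TASK = 16     # prompts a model must answer correctly to solve one task

# Assumption: prompts are spread evenly, i.e. two per scenario.
PROMPTS_PER_SCENARIO = PROMPTS_PER_TASK // SCENARIOS_PER_TASK


def task_solved(prompt_results: List[bool]) -> bool:
    """A task counts as solved only if every one of its 16 prompts is answered correctly."""
    if len(prompt_results) != PROMPTS_PER_TASK:
        raise ValueError(f"expected {PROMPTS_PER_TASK} results, got {len(prompt_results)}")
    return all(prompt_results)


def fraction_solved(results: Dict[str, List[bool]]) -> float:
    """Share of tasks solved, given per-task lists of prompt outcomes (hypothetical layout)."""
    if not results:
        raise ValueError("no task results supplied")
    solved = sum(task_solved(r) for r in results.values())
    return solved / len(results)


# Made-up outcomes for illustration only; these are not results from the study.
demo = {
    "task_a": [True] * 16,             # all prompts correct: solved
    "task_b": [True] * 15 + [False],   # a single miss: not solved
}

print(TASKS * PROMPTS_PER_TASK)   # total prompts in the battery
print(fraction_solved(demo))      # share of demo tasks solved
```

Under this rule a single wrong answer among the 16 prompts fails the whole task, which is why per-task scores are much stricter than per-prompt accuracy would be.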

Comments:	Try running ToM experiments on your own: the code and tasks used in this study are available at Colab (this https URL). You should be able to run this code with minimal or no Python skills, or you can copy and paste the tasks into ChatGPT's web interface. Proceedings of the National Academy of Sciences (PNAS) 2024
Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Cite as:	arXiv:2302.02083 [cs.CL]
	(or arXiv:2302.02083v7 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.02083
Related DOI:	https://doi.org/10.1073/pnas.2405460121

Submission history

From: Michal Kosinski
[v1] Sat, 4 Feb 2023 03:50:01 UTC (539 KB)
[v2] Fri, 10 Feb 2023 19:01:49 UTC (538 KB)
[v3] Tue, 14 Mar 2023 18:49:26 UTC (604 KB)
[v4] Tue, 29 Aug 2023 14:55:37 UTC (515 KB)
[v5] Sat, 11 Nov 2023 23:05:44 UTC (524 KB)
[v6] Sat, 17 Feb 2024 02:05:32 UTC (530 KB)
[v7] Mon, 4 Nov 2024 19:51:53 UTC (811 KB)
