GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R.

Computer Science > Artificial Intelligence

arXiv:2311.12022 (cs)

[Submitted on 20 Nov 2023]

Title:GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Authors:David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

View PDF

Abstract:We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

Comments:	28 pages, 5 figures, 7 tables
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2311.12022 [cs.AI]
	(or arXiv:2311.12022v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2311.12022

Submission history

From: David Rein [view email]
[v1] Mon, 20 Nov 2023 18:57:34 UTC (7,790 KB)

Computer Science > Artificial Intelligence

Title:GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Artificial Intelligence

Title:GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators