PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models

Zhang, Junlei; He, Hongliang; Song, Nirui; He, Shuyuan; Zhang, Shuai; Qiu, Huachuan; Li, Anqi; Ma, Lizhi; Lan, Zhenzhong

arXiv:2311.09861v2 (cs)

[Submitted on 16 Nov 2023 (v1), revised 17 Nov 2023 (this version, v2), latest version 16 Jun 2024 (v4)]

Abstract: As Large Language Models (LLMs) become prevalent in various fields, there is an urgent need for improved NLP benchmarks that encompass all the necessary knowledge of an individual discipline. Many contemporary benchmarks for foundation models cover a broad range of subjects but often fail to include all the critical subjects or the professional knowledge each one requires. This shortfall leads to skewed results, because LLMs perform differently across subjects and knowledge areas. To address this issue, we present PsyBench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. PsyBench offers an in-depth evaluation of a model's strengths and weaknesses in psychology through multiple-choice questions. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in test sets is not balanced. Notably, only the ChatGPT model reaches an average accuracy above $70\%$, indicating that there is still plenty of room for improvement. We expect that PsyBench will support thorough evaluations of base models' strengths and weaknesses and assist practical applications in the field of psychology.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2311.09861 [cs.CL]
	(or arXiv:2311.09861v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.09861

Submission history

From: Junlei Zhang
[v1] Thu, 16 Nov 2023 12:43:18 UTC (1,432 KB)
[v2] Fri, 17 Nov 2023 03:17:05 UTC (1,432 KB)
[v3] Thu, 13 Jun 2024 13:56:20 UTC (9,461 KB)
[v4] Sun, 16 Jun 2024 11:33:03 UTC (8,303 KB)
