GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

Zhang, Jiaxin; Li, Zhongzhi; Zhang, Mingliang; Yin, Fei; Liu, Chenglin; Moshfeghi, Yashar

Computer Science > Artificial Intelligence

arXiv:2402.10104 (cs)

[Submitted on 15 Feb 2024 (v1), last revised 17 May 2024 (this version, v2)]

Title:GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

Authors:Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, Yashar Moshfeghi

View PDF HTML (experimental)

Abstract:Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750 problems subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67\% accuracy rate on the main subset but only a 6.00\% accuracy on the hard subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.

Comments:	Accepted in ACL 2024 Findings
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2402.10104 [cs.AI]
	(or arXiv:2402.10104v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2402.10104

Submission history

From: Yashar Moshfeghi [view email]
[v1] Thu, 15 Feb 2024 16:59:41 UTC (12,526 KB)
[v2] Fri, 17 May 2024 11:42:09 UTC (5,634 KB)

Computer Science > Artificial Intelligence

Title:GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Artificial Intelligence

Title:GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators