
Mathematical capabilities of ChatGPT

Published: 10 December 2023

Abstract

We investigate the mathematical capabilities of two versions of ChatGPT (released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on publicly available datasets, as well as hand-crafted ones, using a novel evaluation scheme. In contrast to formal mathematics, where large databases of formal proofs are available (e.g., mathlib, the Lean Mathematical Library), current datasets of natural-language mathematics used to benchmark language models either cover only elementary mathematics or are very small. We address this by publicly releasing two new datasets: GHOSTS and miniGHOSTS. These are the first natural-language datasets curated by working researchers in mathematics that (1) aim to cover graduate-level mathematics, (2) provide a holistic overview of the mathematical capabilities of language models, and (3) distinguish multiple dimensions of mathematical reasoning. These datasets test, by using 1636 human expert evaluations, whether ChatGPT and GPT-4 can be helpful assistants to professional mathematicians by emulating use cases that arise in the daily professional activities of mathematicians. We benchmark the models on a range of fine-grained performance metrics. For advanced mathematics, this is the most detailed evaluation effort to date. We find that ChatGPT and GPT-4 can be used most successfully as mathematical assistants for querying facts, acting as mathematical search engines and knowledge base interfaces. GPT-4 can additionally be used for undergraduate-level mathematics but fails on graduate-level difficulty. Contrary to many positive reports in the media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of selection bias), their overall mathematical performance is well below the level of a graduate student. Hence, if you aim to use ChatGPT to pass a graduate-level math exam, you would be better off copying from your average peer!
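The evaluation workflow the abstract describes amounts to sending each dataset prompt to the model and collecting the reply for subsequent expert rating. The sketch below shows one way this could look against the OpenAI chat API; it is not the authors' code, and the dataset file name, its "prompt" field, and the query_model helper are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' evaluation code): sending prompts from a
# GHOSTS-style JSONL file to a chat model and collecting the replies for
# later human expert rating. The file name, the "prompt" field, and the
# model id are assumptions for illustration.
import json

from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

def query_model(prompt: str, model: str = "gpt-4") -> str:
    """Send one mathematical prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

with open("ghosts_prompts.jsonl") as f:  # hypothetical dataset file
    for line in f:
        item = json.loads(line)
        reply = query_model(item["prompt"])
        # In the paper, outputs like this were rated by human experts
        # (1636 expert evaluations in total); the rating step itself is manual.
        print(json.dumps({"prompt": item["prompt"], "output": reply}))
```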


Published In

NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, December 2023, 80772 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States
