A Survey of Deep Learning For Mathematical Reasoning
Pan Lu¹, Liang Qiu¹, Wenhao Yu², Sean Welleck³*, Kai-Wei Chang¹*
¹UCLA, ²University of Notre Dame, ³University of Washington
https://github.com/lupantech/dl4math
Abstract
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineer-
arXiv:2212.10535v1 [cs.AI] 20 Dec 2022
and Metamath (Megill and Wheeler, 2019). To prove a theorem in an ITP, the theorem is stated in the ITP’s programming language, then simplified by generating “proof steps” until it is reduced to known facts. The result is a sequence of steps that constitutes a verified proof.
Data sources for neural theorem proving in ITPs include interactive learning environments that interface with ITPs, and datasets derived from proofs in ITP libraries. For example, CoqGym (Yang and Deng, 2019) provides an interactive environment and 71K human-written proofs for the Coq ITP. For Isabelle, PISA (Jiang et al., 2021) enables interaction and provides a dataset of 183K proofs mined from the Isabelle standard library and the Archive of Formal Proofs. For Lean, LeanStep (Han et al., 2022) provides a dataset of proof steps from Lean’s mathematical library along with auxiliary tasks, while Lean-Gym (Polu et al., 2022) provides an interactive REPL. The miniF2F benchmark (Zheng et al., 2022) aims to provide a shared benchmark across ITPs, consisting of 488 problem statements sourced from mathematical competitions.
Other resources provide proxy environments or tasks. For example, INT (Wu et al., 2021c) provides a synthetic proving environment to measure six different types of generalization. Li et al. construct IsarStep using the Isabelle Archive of Formal Proofs, and propose a task of filling in a missing intermediate proposition. Early applications of deep learning for formal theorem proving focus on selecting relevant premises (Alemi et al., 2016).
Informal theorem proving presents an alternative medium for theorem proving, in which statements and proofs are written in a mixture of natural language and symbols used in “standard” mathematics (e.g., in LaTeX), and are checked for correctness by humans. Early work focuses on selecting relevant premises (Ferreira and Freitas, 2020b,a). Welleck et al. (2021) develop NaturalProofs, a large-scale dataset of 32K informal mathematical theorems, definitions, and proofs, and provide a benchmark for premise selection via retrieval and generation tasks. Welleck et al. (2022a) adapt NaturalProofs for full proof generation, and provide a human evaluation protocol and proxy automatic metrics.
An emerging area of research aims to combine elements of informal and formal theorem proving. For example, Wu et al. (2022b) explore translating informal statements into formal statements, while Jiang et al. (2022b) release a new version of the miniF2F benchmark augmented with informal statements and proofs, which we refer to as miniF2F+informal. Jiang et al. (2022b) explore translating provided (or generated) informal proofs into formal proofs.
2.3 Geometry Problem Solving
Automated geometry problem solving (GPS) is also a long-standing AI task in mathematical reasoning research (Gelernter et al., 1960; Wen-Tsun, 1986; Chou et al., 1996; Ye et al., 2008) and has attracted much attention in recent years. Unlike a math word problem, a geometry problem consists of a textual description in natural language and a geometric diagram. As shown in Figure 2, the multimodal inputs describe the entities, attributes, and relationships of geometric elements, and the goal is to find the numeric solution to an unknown variable. GPS is a challenging task for deep learning methods due to the complex skills it requires. It involves the ability to parse multimodal information, perform symbolic abstraction, utilize theorem knowledge, and conduct quantitative reasoning.
Figure 2: An example of a geometry problem in the Geometry3K (Lu et al., 2021a) dataset.
Some early datasets were proposed to facilitate research in this domain (Seo et al., 2015; Alvin et al., 2017; Sachan et al., 2017; Sachan and Xing, 2017). However, these datasets are relatively small or not publicly available, which limits the development of deep learning methods. In response to this limitation, Lu et al. create the Geometry3K dataset, which consists of 3,002 multi-choice geometry problems with unified logic form annotations for the multimodal inputs. More recently, larger-scale datasets such as GeoQA (Chen et al., 2021a), GeoQA+ (Cao and Xiao, 2022), and UniGeo (Chen et al., 2022a) have been introduced and are annotated with programs that can be learned by neural solvers and executed to obtain the final answers.
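To make the idea of program annotations that are "executed to obtain the final answers" concrete, the following is a minimal sketch of a program executor. It is an illustration under stated assumptions, not the annotation format of any dataset named above: the operator names, the `N_i` (i-th number in the problem) and `V_i` (result of step i) operand tokens, and the helper names `resolve`, `execute`, and `pick_option` are all hypothetical.

```python
import math

# Hypothetical operator table: name -> (number of operands, function).
# Real datasets define their own operator vocabularies.
OPERATORS = {
    "add": (2, lambda a, b: a + b),
    "minus": (2, lambda a, b: a - b),
    "multiply": (2, lambda a, b: a * b),
    "divide": (2, lambda a, b: a / b),
    "half": (1, lambda a: a / 2),
    "sqrt": (1, math.sqrt),
}


def resolve(token, numbers, results):
    """Turn an operand token into a value."""
    if token.startswith("N_"):   # i-th number mentioned in the problem
        return numbers[int(token[2:])]
    if token.startswith("V_"):   # result of the i-th earlier step
        return results[int(token[2:])]
    return float(token)          # literal constant such as "180"


def execute(program, numbers):
    """Run a flat program in which each operator is followed by its operands."""
    results = []
    pos = 0
    while pos < len(program):
        arity, fn = OPERATORS[program[pos]]
        operands = program[pos + 1 : pos + 1 + arity]
        results.append(fn(*(resolve(t, numbers, results) for t in operands)))
        pos += 1 + arity
    return results[-1]


def pick_option(value, options):
    """Return the index of the answer option closest to the computed value."""
    return min(range(len(options)), key=lambda i: abs(options[i] - value))


# Two angles of a triangle are 50 and 60 degrees; find the third angle.
program = ["add", "N_0", "N_1", "minus", "180", "V_0"]
answer = execute(program, [50, 60])
choice = pick_option(answer, [60, 70, 80, 110])
```

In this setup a neural solver would only need to predict the token sequence `program`; the arithmetic itself is delegated to the deterministic executor, and a multi-choice problem is answered by matching the executed value against the options.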
2.4 Math Question Answering
Numerical reasoning is a core ability within human intelligence and plays an important role in many NLP tasks. Aside from theorem proving and grade-level math word problem solving, there is a wide range of question answering (QA) benchmarks that center around mathematical reasoning. In this work, we refer to these tasks as math question answering (MathQA). A large number of datasets have been presented recently. For example, QuaRel (Tafjord et al., 2019) is a dataset of diverse story questions that involve 19 different types of quantities. McTaco (Zhou et al., 2019) studies temporal commonsense problems, while Fermi (Kalyan et al., 2021) studies Fermi problems whose answers can only be approximately estimated.
Recent studies have shown that state-of-the-art mathematical reasoning systems might suffer from brittleness in reasoning, in that the models rely on spurious signals and plug-and-chug calculations in the specific dataset to achieve “satisfactory” performance (Hendrycks et al., 2021; Mishra et al., 2022b). To address this issue, new benchmarks have been proposed from various aspects. The Mathematics dataset (Saxton et al., 2020) consists of many different types of mathematics problems, covering arithmetic, algebra, probability, and calculus. The dataset allows for measuring the algebraic generalization ability of a model. Similarly, MATH (Hendrycks et al., 2021) consists of challenging competition mathematics to measure the problem-solving ability of models in complex scenarios.
Some work incorporates tabular contexts in the question inputs. For example, FinQA (Chen et al., 2021c), TAT-QA (Zhu et al., 2021), and MultiHiertt (Zhao et al., 2022) collect questions that require both table understanding and numeric reasoning to answer. Others, instead, present large-scale unified benchmarks for mathematical reasoning. NumGLUE (Mishra et al., 2022b) is a multi-task benchmark with the goal of evaluating the performance of models on eight different tasks. Mishra et al. (2022a) push this direction further and present Lila, which consists of 23 mathematical reasoning tasks, spanning a wide range of mathematics topics, linguistic complexity, question formats, and background knowledge requirements.
2.5 Other Quantitative Problems
Numbers are an integral part of our daily lives, and we humans reason with numbers in a variety of tasks, such as understanding news, reports, elections, and markets. This has led many in the community to question whether AI systems can effectively perform quantitative reasoning in everyday scenarios. To this end, various benchmarks have been developed to evaluate the capabilities of AI systems in this area.
Diagrams, such as figures, charts, and plots, are essential media that convey large amounts of information in a concise way. FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), MNS (Zhang et al., 2020c), PGDP5K (Hao et al., 2022), and GeoRE (Yu et al., 2021a) are released to investigate models’ abilities to reason about quantitative relationships among entities grounded in diagrams. NumerSense (Lin et al., 2020), instead, examines whether and to what extent existing pre-trained language models can induce numerical commonsense knowledge. EQUATE (Ravichander et al., 2019) formalizes aspects of quantitative reasoning in a natural language inference framework. Quantitative reasoning can appear frequently in specific domains like finance, science, and programming. For instance, ConvFinQA (Chen et al., 2022c) targets numerical reasoning over financial reports in a conversational question answering format. ScienceQA (Lu et al., 2022a) involves numerical reasoning in scientific domains, while P3 (Schuster et al., 2021) studies the function inference ability of deep learning models to find a valid input which makes the given program return True.
3 Neural Networks for Mathematical Reasoning
3.1 Seq2Seq Networks for Math
Sequence-to-sequence (Seq2Seq) (Sutskever et al., 2014) neural networks have been successfully applied to mathematical reasoning tasks, such as math word problem solving (Wang et al., 2017), theorem proving (Yang and Deng, 2019), geometry problem solving (Robaidek et al., 2018), and math question answering (Tafjord et al., 2019). A Seq2Seq model uses an encoder-decoder architecture and usually formalizes mathematical reasoning as a sequence generation task. The basic idea behind this approach is to map an input sequence (e.g., a mathematical problem) to an output sequence (e.g., an equation, program, or proof). Common encoders and decoders include Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997), Gated Recurrent Unit (GRU) (Cho et al.,
Dataset | Task | Size | Input | Output | Rationale | Domain
Verb395 (2014) | MWP | 395 | Question | Number | Equation | Math
Alg514 (2014) | MWP | 514 | Question | Number | Equation | Math
IL (2015) | MWP | - | Question | Number | Equation | Math
SingleEQ (2015) | MWP | 508 | Question | Number | Equation | Math
DRAW (2015) | MWP | 1,000 | Question | Number | Equation | Math
Dolphin1878 (2015) | MWP | 1,878 | Question | Number | Equation | Math
Dolphin18K (2016) | MWP | 18,460 | Question | Number | Equation | Math
MAWPS (2016) | MWP | 3,320 | Question | Number | Equation | Math
AllArith (2017) | MWP | 831 | Question | Number | Equation | Math
DRAW-1K (2017) | MWP | 1,000 | Question | Number | Equation | Math
Math23K (2017) | MWP | 23,162 | Question | Number | Equation | Math
AQuA (2017) | MWP | 100,000 | Question | Option | Natural language | Math
Aggregate (2018) | MWP | 1,492 | Question | Number | Equation | Math
MathQA (2019) | MWP | 37,297 | Question | Number | Program | Math
ASDiv (2020) | MWP | 2,305 | Question | Number | Equation | Math
HMWP (2020) | MWP | 5,470 | Question | Number | Equation | Math
Ape210K (2020) | MWP | 210,488 | Question | Number | Equation | Math
SVAMP (2021) | MWP | 1,000 | Question | Number | Equation | Math
GSM8K (2021) | MWP | 8,792 | Question | Number | Natural language | Math
IconQA (2021b) | MWP | 107,439 | Figure+Question | Option+Text span | ✗ | Math
MathQA-Python (2021) | MWP | 23,914 | Question | Number | Python program | Math
ArMATH (2022) | MWP | 6,000 | Question | Number | Equation | Math
TabMWP (2022b) | MWP | 38,431 | Table+Question | Option+Number | Natural language | Math
MML (2015) | TP | 57,882 | Statement | Proof steps | ✗ | Math
HolStep (2017) | TP | 2,209,076 | Statement | Proof steps | ✗ | Math
Gamepad (2019) | TP | - | Statement | Proof steps | ✗ | Math
CoqGym (2019) | TP | 71,000 | Statement | Proof steps | ✗ | Math
HOList (2019) | TP | 29,462 | Statement | Proof steps | ✗ | Math
IsarStep (2021) | TP | 860,000 | Statement | Proof steps | ✗ | Math
PISA (2021) | TP | 183,000 | Statement | Proof steps | ✗ | Math
INT (2021c) | TP | - | Statement | Proof steps | ✗ | Math
NaturalProofs (2021) | TP | 32,000 | Statement | Proof steps | ✗ | Math
NaturalProofs-Gen (2022a) | TP | 14,500 | Statement | Proof steps | ✗ | Math
miniF2F (2022) | TP | 488 | Statement | Proof steps | ✗ | Math
miniF2F+informal (2022b) | TP | 488 | Statement | Proof steps | ✗ | Math
LeanStep (2022) | TP | 21,606,000 | Statement | Proof steps | ✗ | Math
GEOS (2015) | GPS | 186 | Figure+Question | Option | ✗ | Geometry
GeoShader (2017) | GPS | 102 | Figure+Question | Number | ✗ | Geometry
GEOS++ (2017) | GPS | 1,406 | Figure+Question | Number | ✗ | Geometry
GEOS-OS (2017) | GPS | 2,235 | Figure+Question | Option | Demonstration | Geometry
Geometry3K (2021a) | GPS | 3,002 | Figure+Question | Option | Logical form | Geometry
GeoQA (2021a) | GPS | 4,998 | Figure+Question | Option | Program | Geometry
GeoQA+ (2022) | GPS | 12,054 | Figure+Question | Option | Program | Geometry
UniGeo (2022a) | GPS/TP | 14,541 | Figure+Question | Option | Program | Geometry
QuaRel (2019) | MathQA | 2,771 | Question | Option | Logical form | Math
McTaco (2019) | MathQA | 13,225 | Text+Question | Option | ✗ | Time
DROP (2019) | MathQA | 96,567 | Passage+Question | Number+Text span | ✗ | Math
Mathematics (2020) | MathQA | 2,010,000 | Question | Free-form | Number | Math
FinQA (2021c) | MathQA | 8,281 | Text+Table+Q | Number | Program | Finance
Fermi (2021) | MathQA | 11,000 | Question | Number | Program+Fact | Math
MATH (2021) | MathQA | 12,500 | Question | Number | Natural language | Math
TAT-QA (2021) | MathQA | 16,552 | Text+Table+Q | Number+Text span | ✗ | Finance
AMPS (2021) | MathQA | 5,000,000 | Question | - | LaTeX | Math
MultiHiertt (2022) | MathQA | 10,440 | Text+Table+Q | Number+Text span | Expression | Finance
NumGLUE (2022b) | MathQA | 101,835 | Text+Question | Number+Text span | ✗ | Math
Lila (2022a) | MathQA | 134,000 | Text+Question | Free-form | Python program | Math
FigureQA (2017) | VQA | 1,000,000+ | Figure+Question | Binary | ✗ | Math
DVQA (2018) | VQA | 3,487,194 | Figure+Question | Text span | Number+Text span | Math
DREAM (2019) | ConvQA | 10,197 | Dialog+Question | Option | ✗ | Math
EQUATE (2019) | NLI | - | Premise+Hypothesis | Binary | ✗ | Math
NumerSense (2020) | Filling | 13,600 | Masked question | Word | ✗ | Math
MNS (2020c) | IQ Test | - | Figure | Number | ✗ | Math
P3 (2021) | Puzzle | 397 | Text | Program | ✗ | Math
NOAHQA (2021) | ConvQA | 21,347 | Dialog+Question | Text span | Reasoning graph | Math
ConvFinQA (2022c) | ConvQA | 3,892 | Report+Dialog+Q | Number | Expression | Math
PGDP5K (2022) | Parsing | 5,000 | Figure+Question | Number | ✗ | Geometry
GeoRE (2022a) | Parsing | 12,901 | Figure+Question | Number | ✗ | Geometry
ScienceQA (2022a) | VQA | 21,208 | Context+Question | Option | Natural language | Science
2014), and their bidirectional variants: BiLSTM and BiGRU. DNS (Wang et al., 2017) is the first work that uses a Seq2Seq model to transform sentences in word problems into mathematical equa-
Paper | Task | Problem | Network | Encod | Decod | ATT | Description
DNS (Wang et al., 2017) | MWP | Generation | Seq2Seq | GRU | LSTM | ✗ | The first deep MWP solver
AnsRat (Ling et al., 2017) | MWP | Generation | Seq2Seq | LSTM | LSTM | ✗ | Trained with staged back-propagation
Math-EN (Wang et al., 2018a) | MWP | Generation | Seq2Seq | BiLSTM | LSTM | ✓ | A standard Seq2Seq model with attention
CASS (Huang et al., 2018) | MWP | Generation | Seq2Seq | BiGRU | BiGRU | ✓ | Copy and alignment with RL
S-Aligned (Chiang and Chen, 2019) | MWP | Generation | Seq2Seq | BiLSTM | LSTM | ✓ | Operating symbols
T-RNN (Wang et al., 2019) | MWP | Generation | Seq2Seq | BiLSTM | BiLSTM | ✓ | Predicting a tree-structure math template
GROUP-ATT (Li et al., 2019) | MWP | Generation | Seq2Seq | BiLSTM | LSTM | ✓ | Group attention
SMART (Hong et al., 2021b) | MWP | Generation | Seq2Seq | - | - | ✗ | Explicitly incorporating values
SelfAtt (Robaidek et al., 2018) | GPS | Classification | Seq2Seq | BiLSTM | - | ✓ | Multi-hop self-attention
QuaSP+ (Tafjord et al., 2019) | MathQA | Generation | Seq2Seq | BiLSTM | LSTM | ✗ | Adopting attributed grammar
AST-Dec (Liu et al., 2019a) | MWP | Generation | Seq2Tree | BiLSTM | Tree | ✓ | Using prefix order decoding
GTS (Xie and Sun, 2019) | MWP | Generation | Seq2Tree | BiGRU | Tree | ✓ | A goal-driven tree-structured approach
KA-S2T (Wu et al., 2020) | MWP | Generation | Seq2Tree | BiLSTM | Tree | ✓ | A knowledge-aware method
TSN-MD (Zhang et al., 2020a) | MWP | Generation | Seq2Tree | BiGRU | Tree | ✓ | A teacher-student network
T-LSTM (Zaporojets et al., 2021) | MWP | Generation | Seq2Tree | BiLSTM | Tree | ✗ | A child-sum tree-LSTM model
NT-LSTM (Zaporojets et al., 2021) | MWP | Generation | Seq2Tree | BiLSTM | Tree | ✗ | An N-ary tree-LSTM model
NS-Solver (Qin et al., 2021) | MWP | Generation | Seq2Tree | BiGRU | Tree | ✓ | A neural-symbolic solver with programs
NumS2T (Wu et al., 2021b) | MWP | Generation | Seq2Tree | BiLSTM | Tree | ✓ | Explicitly incorporating values
HMS (Lin et al., 2021) | MWP | Generation | Seq2Tree | GRU | Tree | ✓ | A word-clause-problem encoder
LBF (Hong et al., 2021a) | MWP | Generation | Seq2Tree | BiGRU | Tree | ✓ | A learning-by-fixing (LBF) framework
Seq2DAG (Cao et al., 2021) | MWP | Generation | Seq2Graph | GRU | Graph | ✗ | A directed acyclic graph (DAG) structure
Graph2Tree (Zhang et al., 2020b) | MWP | Generation | Graph2Tree | Graph | Tree | ✗ | Generating better solution expressions
Multi-E/D (Shen and Jin, 2020) | MWP | Generation | Graph2Tree | Graph | Tree | ✓ | A graph encoder and a tree-based decoder
Graph2Tree (Li et al., 2020b) | MWP | Generation | Graph2Tree | Graph | Tree | ✓ | A graph-to-tree neural network
EEH-G2T (Wu et al., 2021a) | MWP | Generation | Graph2Tree | Graph | Tree | ✗ | A hierarchical graph-to-tree model
ASTactic (Yang and Deng, 2019) | TP | Generation | Tree2Seq | TreeLSTM | GRU | ✓ | Generating tactics as programs
MathDQN (Wang et al., 2018b) | MWP | Search | DQN | - | - | ✗ | RL with a deep Q-network (DQN)
DDT (Meng and Rumshisky, 2019) | MWP | Generation | Transformer | Trm | Trm | ✓ | A Transformer-based model
DeepMath (Irving et al., 2016) | TP | Classification | CNN | CNN | - | ✗ | The first deep large-scale theorem prover
Holophrasm (Whalen, 2016) | TP | Classification | BiGRU | BiGRU | - | ✗ | A neural prover for higher-order logic
CNNTP (Loos et al., 2017) | TP | Classification | CNN | CNN | - | ✗ | A CNN-based theorem prover
WaveNetTP (Loos et al., 2017) | TP | Classification | WaveNet | WaveNet | - | ✗ | A WaveNet-based theorem prover
DeepHOL (Bansal et al., 2019) | TP | Generation | WaveNet | WaveNet | - | ✗ | A neural theorem prover with RL
NGS (Chen et al., 2021a) | GPS | Generation | VQA | LSTM* | LSTM | ✓ | The first deep geometry solver
PGDPNet (Zhang et al., 2022a) | Parsing | Generation | GNN | - | - | ✗ | A neural diagram parser with GNN
Table 3: A summary of deep neural network models for mathematical reasoning. Encod: encoder, Decod: decoder, ATT: attention. LSTM*: ResNet + LSTM, Trm: Transformer.
tions. A large amount of work has shown the performance advantage of Seq2Seq models over previous statistical learning approaches (Ling et al., 2017; Wang et al., 2018a; Huang et al., 2018; Chiang and Chen, 2019; Wang et al., 2019; Li et al., 2019).
3.2 Graph-based Networks for Math
Seq2Seq approaches have the advantage of generating mathematical expressions without relying on hand-crafted features. Mathematical expressions can be transformed into a tree-based structure, e.g., an abstract syntax tree (AST), or a graph-based structure, which describes structured information in the expressions. However, this important information is not explicitly modeled by Seq2Seq methods. To solve this issue, graph-based neural networks are developed to explicitly model the structure in expressions.
Sequence-to-tree (Seq2Tree) models explicitly model the tree structure when encoding the output sequences (Liu et al., 2019a; Xie and Sun, 2019; Wu et al., 2020; Zhang et al., 2020a; Zaporojets et al., 2021; Qin et al., 2021; Wu et al., 2021b; Lin et al., 2021; Hong et al., 2021a). For example, Liu et al. (2019a) devise a Seq2Tree model to better use information from an equation’s AST. Seq2DAG (Cao et al., 2021), instead, applies a sequence-to-graph (Seq2Graph) framework when generating the equations since the graph decoder is able to extract complex relationships among multiple variables. The graph-based information can also be embedded when encoding the input mathematical sequences (Zhang et al., 2020b; Shen and Jin, 2020; Li et al., 2020b; Wu et al., 2021a). For example, ASTactic (Yang and Deng, 2019) applies TreeLSTM (Tai et al., 2015) on ASTs to represent the input goal and premises for theorem proving.
3.3 Attention-based Networks for Math
The attention mechanism has been successfully applied to natural language processing (Bahdanau et al., 2014) and computer vision problems (Xu et al., 2015; Woo et al., 2018), taking into account the hidden vectors of the inputs during the decoding process. Recently, researchers have been exploring its usefulness in mathematical reasoning tasks, as it can be used to identify the most important relationships between mathematical concepts. For instance, Math-EN (Wang et al., 2018a) is a math word problem solver which benefits from long-distance dependency information learned by self-attention. Attention-based methods have also been applied to other mathematical reasoning tasks such as geometry problem solving (Robaidek et al., 2018; Chen et al., 2021a) and theorem proving (Yang and Deng, 2019). Various attention mechanisms have been studied to extract better representations, such as Group-ATT (Li et al., 2019), which uses different multi-head attention to extract various types of MWP features, and graph attention, which is applied to extract knowledge-aware information in Wu et al. (2020).
3.4 Other Neural Networks for Math
Deep learning approaches to mathematical reasoning tasks can also make use of other neural networks, such as convolutional neural networks (CNN) and multimodal networks. Some work encodes the input text using a convolutional neural network architecture, giving the model the ability to capture long-term relationships between symbols in the input (Gehring et al., 2017; Wang et al., 2018a; Robaidek et al., 2018; Irving et al., 2016; Loos et al., 2017). For example, the first application of deep neural networks for theorem proving is proposed in Irving et al. (2016), which relies on convolutional networks for premise selection in large theories.
Multimodal mathematical reasoning tasks, such as geometry problem solving and diagram-based mathematical reasoning, are formalized as visual question answering (VQA) problems (Kafle et al., 2018; Chen et al., 2021a; Lu et al., 2021b). In this domain, visual inputs are encoded using ResNet (He et al., 2016) or Faster-RCNN (Ren et al., 2015), while textual representations are obtained via GRU or LSTM. Subsequently, the joint representation is learned using multimodal fusion models, such as BAN (Kim et al., 2018), FiLM (Perez et al., 2018), and DAFA (Gao et al., 2019).
Other deep neural network structures can also be used in mathematical reasoning. A Graph Neural Network (GNN) is employed for geometry problem parsing in Zhang et al. (2022a), taking advantage of its success in spatial reasoning. WaveNet has been applied to theorem proving (Loos et al., 2017; Bansal et al., 2019), due to its ability to address longitudinal time-series data. Furthermore, Transformers are found to outperform GRU in generating mathematical equations in DDT (Meng and Rumshisky, 2019). Finally, MathDQN (Wang et al., 2018b) is the first work to explore reinforcement learning for math word problem solving, taking advantage of its strong search capabilities.
4 Pre-trained Language Models for Mathematical Reasoning
Pre-trained language models (e.g., Devlin et al. (2018); Radford et al. (2020); Brown et al. (2020)) have demonstrated remarkable performance gains on a wide range of NLP tasks (Qiu et al., 2020). By pre-training on a large corpus of text, the models learn valuable world knowledge (Guu et al., 2020), which could be applied to downstream tasks such as question answering (Khashabi et al., 2020), text classification (Minaee et al., 2021), and dialogue generation (Zhang et al., 2019; Qiu et al., 2022a,b). Similar ideas can be applied to math-related problems, and previous work has shown promising performance of pre-trained language models in answering math word problems (Kim et al., 2020; Shen et al., 2021; Yu et al., 2021b; Cobbe et al., 2021; Li et al., 2022b; Jie et al., 2022; Ni et al., 2022), assisting with theorem proving (Polu and Sutskever, 2020; Han et al., 2022; Wu et al., 2022b; Jiang et al., 2022b; Welleck et al., 2022a), as well as other mathematical tasks (Lu et al., 2021a; Chen et al., 2022a; Cao and Xiao, 2022; Clark et al., 2020; Chen et al., 2021c; Zhu et al., 2021; Hendrycks et al., 2021; Zhao et al., 2022; Nye et al., 2021; Charton, 2021).
However, though large language models excel in modeling natural language, there are several challenges to using them for mathematical reasoning. First, pre-trained language models are not specifically trained on mathematical data. This likely contributes to them being less proficient in math-related tasks compared to natural language tasks. There is also less mathematical or scientific data available for large-scale pre-training compared to text data. Second, the size of pre-trained models continues to grow, making it expensive to train the entire model from scratch for specific downstream tasks. Additionally, downstream tasks may deal with different input formats or modalities, such as
Paper | Backbone | Size | Corpus | Pre-training task
GPT-f (Polu and Sutskever, 2020) | Transformer | 774M | Math | Causal language modeling
LISA (Jiang et al., 2021) | Transformer | 163M | Math | Causal language modeling
MATH-PLM (Hendrycks et al., 2021) | GPT-2 | 1.5B | Math | Causal language modeling
MWP-BERT (Liang et al., 2022b) | RoBERTa | 123M | Math | 8 numeracy-augmented tasks
TaPEx (Liu et al., 2022b) | BART | 406M | SQL | Query result generation
HTPS (Lample et al., 2022) | Transformer | 600M | Math | Masked Seq2Seq modeling
Thor (Jiang et al., 2022a) | Transformer | 700M | GitHub, arXiv | Causal language modeling
PACT (Han et al., 2022) | Transformer | 837M | Math | Masked/Causal language modeling
Minerva (Lewkowycz et al., 2022) | PaLM | 540B | Science & Math | Causal language modeling
GenBERT (Geva et al., 2020) | BERT | 110M | Number, Text | Masked/Causal language modeling
NF-NSM (Feng et al., 2021) | RoBERTa | 110M | Number | Number prediction
LIME (Wu et al., 2021d) | Transformer | 11B | Math | Causal language modeling
Set (Wu et al., 2022c) | T5 | 60M | Math | Unique token generation
Table 4: Comparison of model backbone, size, pre-training corpus, and pre-training tasks of language models for mathematical reasoning.
structured tables (Zhao et al., 2022; Chen et al., 2021c; Zhu et al., 2021) or diagrams (Lu et al., 2021a; Chen et al., 2022a; Lu et al., 2021b). To address these challenges, researchers have to adjust pre-trained models by finetuning them on downstream tasks or adapting the neural architectures. Lastly, though pre-trained language models can encode substantial amounts of linguistic information, it may be difficult for models to learn numerical representation or high-level reasoning skills just from the language modeling objective (Lin et al., 2020; Kalyan et al., 2021). Taking this into consideration, there are recent studies investigating the injection of mathematics-related skills with a curriculum starting from basics (Geva et al., 2020; Feng et al., 2021; Wu et al., 2021d).
4.1 Self-Supervised Learning for Math
Self-supervised learning is a machine learning approach in which an algorithm learns to perform a task without being explicitly provided with labeled training data. An example of self-supervised learning is next-token prediction, which allows a language model to learn the relationships between words and understand the meaning of the text from large-scale unlabeled data. Table 4 provides a list of language models pre-trained with self-supervised tasks for mathematical reasoning.
Model scale. There is a clear trend that pre-trained language models have become increasingly larger in the past few years (Devlin et al., 2018; Lewis et al., 2020; Raffel et al., 2020; Radford et al., 2020; Brown et al., 2020). A recent study (Liang et al., 2022a) shows that model scale within a model family reliably predicts model accuracy. The study also mentions an interesting thresholding effect: “all models that win head-to-head model comparisons for accuracy at a rate well above chance are at least 50B parameters”. A similar size-growing trend can be observed in the field of mathematical reasoning with pre-trained language models. For example, MWP-BERT (Liang et al., 2022b) uses a backbone of BERT (110M) (Devlin et al., 2018) and RoBERTa (123M) (Liu et al., 2019b) for Math Word Problems. TaPEx (Liu et al., 2022b) pre-trains its model based on BART-large, which has 406M parameters. Most recently, Minerva (Lewkowycz et al., 2022), based on the PaLM (Chowdhery et al., 2022) pre-trained language model, has a variable size with up to 540B parameters.
Pre-training corpus. There are generally two types of pre-training corpus for mathematical language models. (i) Curated datasets from openly accessible sources. For example, Hendrycks et al. (2021) present the first large-scale mathematics pre-training dataset with step-by-step solutions in natural language and LaTeX, called the Auxiliary Mathematics Problems and Solutions (AMPS). AMPS consists of Khan Academy and Mathematica data. Minerva (Lewkowycz et al., 2022) collects a high-quality dataset containing scientific and mathematical data, which contains 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server. Thor (Jiang et al., 2022a) pre-trains a language model on the GitHub + arXiv subsets of The Pile (Gao et al., 2020). (ii) Synthetic datasets based on templates or interaction with engines. Recent work (Wu et al., 2021d; Krishna et al., 2021; Ri and Tsuruoka, 2022; Anderson and Farrell, 2022;
Wu et al., 2022c) shows that pre-training on data that is fully synthetically generated (synthetic pre-training) can actually provide substantial gains. Representative work includes TaPEx (Liu et al., 2022b), which obtains a pre-training corpus by automatically synthesizing executable SQL queries and their execution outputs. LISA (Jiang et al., 2021) extracts lemmas and theorems by interacting with the Isabelle standard library and the Archive of Formal Proofs. GenBERT (Geva et al., 2020) generates numerical and textual pre-training datasets based on manually crafted and extracted templates.
Pre-training tasks. General pre-trained language models have two typical self-supervised learning tasks: (i) Masked Language Modeling (MLM), where a portion of the words in each sequence is randomly masked and the model is trained to predict the masked words; (ii) Causal Language Modeling (CLM), where the model is trained to predict the next token in a sequence of tokens. Following the same paradigm, researchers pre-train language models with MLM and CLM tasks on mathematical/scientific corpora for downstream tasks (Polu and Sutskever, 2020; Jiang et al., 2021; Hendrycks et al., 2021; Han et al., 2022; Lewkowycz et al., 2022; Jiang et al., 2022a).
There is also recent work that designs customized tasks to inject mathematical reasoning capabilities into language models. For instance, Liang et al. (2022b) pre-train language models with a suite of 8 numeracy-augmented tasks with consideration of reasoning logic and numerical properties. LIME (Wu et al., 2021d) proposes synthetic pre-training tasks to learn three reasoning primitives (deduction, induction, and abduction) before learning more complex reasoning skills, which can also be regarded as a form of curriculum learning. A follow-up work (Wu et al., 2022c) finds that pre-training on a simple and generic synthetic task of predicting unique tokens in their original order (Set) achieves similar performance to LIME. Geva et al. (2020) train their language models on a numerical data generation task followed by a text data generation task. The first task teaches models numerical operations and the second task teaches models to comprehend how numerical operations are expressed in text.
Besides knowledge injection, there are also studies that probe whether pre-trained language models have captured numerical commonsense knowledge, i.e., commonsense knowledge that provides an understanding of the numeric relation between entities (Lin et al., 2020). Zhang et al. (2020d) evaluate the language model embeddings via scalar probing, and Berg-Kirkpatrick and Spokoyny (2020) carry out a large-scale empirical investigation of masked number prediction and numerical anomaly detection in text.
Paper | Backbone | Task
EPT (2020) | ALBERT | MWP
GenerateRank (2021) | BART | MWP
RPKHS (2021b) | RoBERTa | MWP
PatchTRM (2021b) | ResNet+BERT | MWP
GSM8K-PLM (2021) | GPT-3 | MWP
BERT-TD+CL (2022b) | BERT | MWP
DeductReasoner (2022) | RoBERTa | MWP
Self-Sampling (2022) | GPT-Neo | MWP
Bhaskara (2022a) | GPT-Neo | MWP
miniF2F-PLM (2022) | GPT-f | TP
NaturalProver (2022a) | GPT-3 | TP
Inter-GPS (2021a) | BART | GPS
UniGeo (2022a) | VL-T5 | GPS
DPE-NGS (2022) | RoBERTa | GPS
Aristo (2020) | RoBERTa | MathQA
FinQANet (2021c) | RoBERTa | MathQA
TagOp (2021) | RoBERTa | MathQA
MATH-PLM (2021) | GPT-3 | MathQA
MT2Net (2022) | RoBERTa | MathQA
Scratchpad (2021) | Transformer | Mixed
LAMT (2021) | Transformer | Mixed
Table 5: Finetuned pre-trained language models for downstream mathematical reasoning tasks.
4.2 Task-specific Fine-tuning for Math
Task-specific fine-tuning is a technique to improve the performance of a pre-trained language model on a specific task. This is also a common practice when there is not enough data for training the large models from scratch. As shown in Table 5, existing work fine-tunes pre-trained language models on a variety of downstream tasks, such as Math Word Problems (Kim et al., 2020; Shen et al., 2021; Yu et al., 2021b; Lu et al., 2021b; Cobbe et al., 2021; Li et al., 2022b; Jie et al., 2022; Ni et al., 2022; Mishra et al., 2022a; Welleck et al., 2022b), MathQA over financial tabular data (Zhao et al., 2022; Chen et al., 2021c; Zhu et al., 2021), Geometry (Lu et al., 2021a; Chen et al., 2022a; Cao and Xiao, 2022), Linear Algebra (Charton, 2021), and informal theorem proving (Welleck et al., 2022a). Apart from fine-tuning the model parameters, much work also uses pre-trained language models as encoders and ensembles them with other modules for downstream tasks, e.g., IconQA (Lu et al., 2021b) proposes to combine the ResNet (He et al., 2016)
and BERT for diagram recognition and text understanding, respectively.

5 In-context Learning for Mathematical Reasoning

Large language models (LLMs), such as GPT-3 (Brown et al., 2020), have recently revolutionized the field of natural language processing (NLP), especially on account of their powerful few-shot in-context learning capabilities (Brown et al., 2020). In-context Learning (ICL) enables LLMs to perform target tasks by providing some task examples as conditions at inference time, without updating model parameters (Radford et al., 2020; Brown et al., 2020). ICL allows users to quickly build models for new use cases without worrying about fine-tuning and storing a large amount of new parameters for each task, so it is widely used in few-shot settings nowadays (Min et al., 2022).

An in-context example typically contains an input-output pair with some prompt words, e.g., "Please select the largest number from the list. Input: [2, 4, 1, 5, 8]. Output: 8", and few-shot prompting works by giving multiple examples, and then a final input example, where the model is expected to predict the output. However, such standard few-shot prompting, in which the LLM is given in-context examples of input–output pairs in front of test-time examples, has not yet proved sufficient to achieve high performance on challenging tasks such as mathematical reasoning (Rae et al., 2021).

Chain-of-thought prompting (CoT) (Wei et al., 2022) leverages intermediate natural language rationales as prompts to enable LLMs to first generate reasoning chains and then predict an answer for an input question. For example, a CoT prompt for solving the math word problem could be:

    Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. Then, how many tennis balls does Roger have now?
    Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each are 6 tennis balls. 5 + 6 = 11. The answer is 11.

Apart from Kojima et al. (2022) showing that LLMs are decent zero-shot reasoners when given the “Let’s think step by step!” prompt, most of the recent work has focused on how to improve chain-of-thought reasoning under the few-shot setting. This work is mainly divided into two parts, (i) selecting better in-context examples (Fu et al., 2022; Lu et al., 2022b; Zhang et al., 2022b) and (ii) creating better reasoning chains (Zhou et al., 2022; Wang et al., 2022; Li et al., 2022a).

5.1 In-context Example Selection

Early chain-of-thought work randomly or heuristically selects in-context examples. However, recent studies have shown that this type of few-shot learning can be highly unstable across different selections of in-context examples (Rubin et al., 2022; Liu et al., 2022a). Therefore, which in-context reasoning examples make the most effective prompts is still an open problem in the literature.

To address the limitation, recent work has investigated various methods to optimize the in-context example selection process (Rubin et al., 2022; Zhang et al., 2022b; Lu et al., 2022b; Yu et al., 2022; Fu et al., 2022). For example, Rubin et al. (2022) attempt to address this issue by retrieving semantically similar examples. However, this approach has been shown to work poorly on mathematical reasoning problems (Zhang et al., 2022b), and it is sometimes hard to measure the similarity if structured information (e.g., tables) is contained (Lu et al., 2022b). In addition, Fu et al. (2022) propose complexity-based prompting, which chooses examples with complex reasoning chains, i.e., chains with more reasoning steps, as the prompt. Lu et al. (2022b) propose a method for selecting in-context examples via reinforcement learning (RL). Specifically, an agent learns to find optimal in-context examples from a candidate pool, with the goal of maximizing the prediction rewards on given training examples when interacting with the GPT-3 environment. In addition, Zhang et al. (2022b) find that diversifying demonstration questions could also improve model performance. They propose a two-step approach to construct in-context demonstrations: first, partitioning questions of a given dataset into a few clusters; second, selecting a representative question from each cluster and generating its reasoning chain using a zero-shot chain-of-thought with simple heuristics.

5.2 High-quality Reasoning Chains

Early chain-of-thought work (e.g., Wei et al. (2022)) mainly relies on a single human-annotated reasoning chain as a prompt. However, manually creating reasoning chains has two disadvantages. First, as tasks become more complex, current models may not be sufficient to learn to perform all necessary
Models | Engine (best performed) | ICL source | Rationale type | Rationale source | Post method
Table 6: In-context learning with large language models for mathematical reasoning. For GPT-3, all papers use the text-davinci-002 version; for Codex, all papers use the code-davinci-002 version. RL is short for reinforcement learning.
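As a concrete reading of the selection strategies in Section 5.1, the following minimal Python sketch picks demonstrations by reasoning-chain length, in the spirit of complexity-based prompting, and assembles a few-shot chain-of-thought prompt from them. It is an illustration under stated assumptions, not the implementation of Fu et al. (2022): the `Demo` class, the helper names, the rule of counting one reasoning step per rationale line, the "What is 2 + 3?" demonstration, and the farmer test question are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One annotated demonstration (hypothetical container for this sketch)."""
    question: str
    rationale: str  # assumption: one reasoning step per line
    answer: str


def num_steps(demo: Demo) -> int:
    # Count non-empty rationale lines as reasoning steps (a simplifying assumption).
    return sum(1 for line in demo.rationale.splitlines() if line.strip())


def select_by_complexity(pool: list[Demo], k: int) -> list[Demo]:
    # Keep the k demonstrations whose rationales have the most steps.
    return sorted(pool, key=num_steps, reverse=True)[:k]


def build_cot_prompt(demos: list[Demo], test_question: str) -> str:
    # Each demonstration shows its rationale before the answer; the test
    # question is appended with an open "Answer:" for the model to complete.
    blocks = []
    for d in demos:
        rationale = " ".join(d.rationale.split())
        blocks.append(
            f"Question: {d.question}\nAnswer: {rationale} The answer is {d.answer}."
        )
    blocks.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(blocks)


pool = [
    Demo("What is 2 + 3?", "2 + 3 = 5.", "5"),
    Demo(
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does Roger have now?",
        "Roger started with 5 balls.\n"
        "2 cans of 3 tennis balls each are 6 tennis balls.\n"
        "5 + 6 = 11.",
        "11",
    ),
]
chosen = select_by_complexity(pool, k=1)
prompt = build_cot_prompt(
    chosen, "A farmer has 4 pens with 6 sheep each. How many sheep are there?"
)
```

The resulting `prompt` string would then be sent to an LLM of choice; the model call itself is left out because it depends on the engine being used.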
reasoning steps and cannot easily generalize to different tasks. Second, a single decoding process is vulnerable to incorrect inference steps, leading to an incorrect prediction as the final answer. To address this limitation, recent studies mainly focus on two aspects, (i) hand-crafting more complex demonstrations, which we refer to as process-based approaches (Zhou et al., 2022; Chen et al., 2022b), and (ii) leveraging ensemble-like methods, which we refer to as outcome-based approaches (Wang et al., 2022; Li et al., 2022a).

Process-based approaches aim to improve the chain-of-thought reasoning quality, especially for complex reasoning tasks. In least-to-most prompting (Zhou et al., 2022), the problem-solving process is implemented through two-stage prompting: (i) reducing a complex problem into a list of sub-problems; (ii) solving these sub-problems sequentially, so that solving a given sub-problem is facilitated by the answers to previously solved sub-problems. Similarly, Khot et al. (2022) leverage diverse decomposition structures and use different prompts to answer each sub-question. Apart from these multi-step reasoning methods, Chen et al. (2022b) and Gao et al. (2022) propose program-of-thoughts (PoT), an alternative solution that uses large language models to express the reasoning process as a program. The computation is then delegated to an external computer, which executes the generated programs to derive the answer.

Outcome-based approaches acknowledge the potential incorrectness of an individual reasoning path, and instead use multiple reasoning paths (Wang et al., 2022; Li et al., 2022a). Self-consistency (Wang et al., 2022) generates a set of reasoning paths by sampling from the language model, and marginalizes out the reasoning paths by choosing the most common answer. In addition to using sampling with a single prompt to produce multiple reasoning paths, Li et al. (2022a) propose to introduce diverse prompts through “self teaching”, as a complementary solution to produce a higher degree of diversity.

6 Discussion

6.1 Analysis of Benchmarks

Multi-modal setting. Most existing benchmarks for mathematical reasoning have targeted the textual-only modality. However, visual elements can provide a rich source of quantitative information, making multi-modal datasets beneficial for reasoning over quantitative relations in natural images (Lu et al., 2022a), abstract diagrams (Lu et al., 2021b), figures (Kahou et al., 2017), and charts (Kafle et al., 2018). Tables, which are commonly found in daily documents and contain hierarchically structured information, have also been the focus of tasks that require quantitative reasoning over textual and tabular context (Chen et al., 2021c; Zhu et al., 2021; Zhao et al., 2022; Lu et al., 2022b). In addition, recent datasets have been developed for mathematical reasoning grounded on conversations (Sun et al., 2019; Zhang et al., 2021; Chen et al., 2022c), as well as reports (Chen et al., 2022c).

Low-resource setting. Despite the creation of various datasets, mathematical reasoning in low-resource settings remains largely under-explored. Pioneering research has developed mathematical reasoning benchmarks for financial (Chen et al., 2021c; Zhu et al., 2021; Zhao et al., 2022) and scientific domains (Lu et al., 2022a). Additionally, there have been attempts to build non-English datasets for Chinese (Wang et al., 2017; Qin et al., 2020; Yu et al., 2021a) and Arabic (Alghamdi et al., 2022) for mathematical reasoning.

Rationale annotations. Complex reasoning usually involves multiple steps to arrive at the final answer. To bridge this gap, datasets annotated with intermediate rationales such as logic forms (Tafjord et al., 2019; Lu et al., 2021a), programs (Amini et al., 2019; Chen et al., 2021c,a; Cao and Xiao, 2022; Chen et al., 2022a), and reasoning graphs (Zhang et al., 2021) have been proposed to train models for complex reasoning tasks. Python programs are used as reasoning annotations by Austin et al. (2021) and Mishra et al. (2022a) due to their enhanced accessibility and readability. To imitate the reasoning process of a human, a more recent trend is to annotate solutions in natural language (Ling et al., 2017; Cobbe et al., 2021; Lu et al., 2022b; Hendrycks et al., 2021; Lu et al., 2022a).

6.2 Analysis of Deep Learning Methods

Is the current representation of numeracy sufficient? While neural networks and language models have achieved impressive results, their ability to represent and comprehend numbers is still not ideal. The standard practice for deep learning techniques is to treat numbers in the same way as words. Early neural network methods create a vocabulary that maps input words and numbers to token IDs, resulting in less frequent numbers being collapsed into an “UNK” token. On the other hand, pre-trained language models (such as BERT) and newer large language models (such as GPT-3) use subword tokenization techniques (Wu et al., 2016; Sennrich et al., 2016) to split numbers into atomic tokens. Recent studies have shown that these tokenization approaches are suboptimal (Wallace et al., 2019; Lin et al., 2020; Zhang et al., 2020e; Thawani et al., 2022). Two numbers that are close on the number line can have surface forms with no tokens in common. For example, a number like 1598 is tokenized as “15” and “98” in GPT-3, while another format like 1,598 is split as three different tokens: “1”, “,”, and “598”. This lack of consistent representation can make it difficult for deep learning models to effectively process numbers, especially when compared to pure text. The insufficient representations of numbers can lead to out-of-distribution (OOD) problems. Table 7 provides examples of where language models tend to struggle with large numbers. Although increasing model scales could help, even the state-of-the-art large language model GPT-3 performs poorly when reasoning over large numbers. Some recent work suggests that using scientific notation (Zhang et al., 2020e) and digit-level decomposition (Geva et al., 2020) may be helpful in improving numeracy representation, but this remains an open problem in the field.

Problems | T5 (Large) | UnifiedQA (Large) | GPT-3 (davinci-002) | GPT-3 (davinci-003)
3 balls + 5 balls = | ✗ | 5 balls | 8 balls | 8 balls
23 balls + 145 balls = | ✗ | ✗ | 58 balls | 168 balls
23 balls + 1,855 balls = | ✗ | ✗ | 2,878 balls | 2,988 balls
Table 7: Language models struggle with large numbers.

Are deep learning methods consistent for mathematical reasoning? Recent developments in deep learning have led to impressive results on various mathematical reasoning tasks. The zero-shot-CoT Minerva 540B achieves a score of 75.0% on the MMLU-STEM benchmark (Hendrycks et al., 2020a), which assesses multitask reasoning ability in the fields of science, technology, engineering, and mathematics (STEM) at both high school and college levels. Similarly, few-shot-CoT GPT-3 175B achieves a high accuracy of 93.0% on the MultiArith task. However, the question remains as to whether these methods are sufficiently advanced to tackle more complex problems.

There is strong evidence that deep learning methods for mathematical reasoning are not robust and are susceptible to adversarial attacks (Lin et al., 2020; Patel et al., 2021; Mishra et al., 2022b,a; Welleck et al., 2022c). The SVAMP (Patel et al., 2021) dataset is a collection of one-unknown arithmetic word problems up to grade 4, with slight word variations from previous datasets. It is surprising that current state-of-the-art (SOTA) methods perform poorly on this dataset, with Graph2Tree achieving only 43.8% accuracy and zero-shot-CoT GPT-3 (175B) only reaching 63.7%, which is just above an “F” grade. Table 8 also shows the inconsistent performance of the zero-shot GPT-3 model in scenarios with slightly different descriptions, while human performance remains unchanged. This indicates a lack of consistency in the mathematical reasoning ability of SOTA large language models.

Problems | GPT-3 (text-davinci-002)
John had 8 balls and he gave 3 to Mary. How many balls does John have now? | John has 5 balls.
John had 3 apples. John had 8 balls and he gave 3 to Mary. How many balls does Mary have now? | Mary has 5 balls.
John had 8 balls and he gave 3 to Mary. Who has more balls now? | John has more balls.
John had 8 balls and he gave 3 to Mary. Does John have more balls now? | No, John has 5 balls now.
John had 8 balls and he gave 4 to Mary. Does John have more balls now? | No, John has 4 balls now.
John had 8 balls and he gave 4 to Mary. Who has more balls now? | John has more balls.
Table 8: Examples where large language models are not consistent for mathematical reasoning.

rections for this include: (i) using language models to provide evidence, such as theorems, to support the reasoning process; (ii) incorporating a mechanism that makes a judgment when the model is unsure of the answer; and (iii) using a model itself or another module to detect and locate mistakes in a model’s reasoning.
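The kind of inconsistency illustrated in Table 8 can be measured mechanically by posing several rephrasings of one problem and checking how often the expected answer comes back. The Python sketch below is only an illustration under stated assumptions: `consistency_rate` and `naive_solver` are hypothetical names, `naive_solver` is a deliberately brittle rule-based stand-in for a language model call, and the second variant (an irrelevant sentence prepended to the first problem) is written in the style of Table 8 rather than copied from it.

```python
import re
from typing import Callable, Sequence


def naive_solver(question: str) -> str:
    # Toy stand-in for a model: subtract the second number in the text
    # from the first, ignoring what the question actually asks.
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    return str(numbers[0] - numbers[1])


def consistency_rate(
    solve: Callable[[str], str],
    variants: Sequence[tuple[str, str]],
) -> float:
    """Fraction of (question, expected answer) variants that `solve` gets right."""
    if not variants:
        raise ValueError("need at least one variant")
    correct = sum(solve(q).strip() == expected for q, expected in variants)
    return correct / len(variants)


variants = [
    ("John had 8 balls and he gave 3 to Mary. "
     "How many balls does John have now?", "5"),
    # Same problem with an irrelevant sentence in front of it.
    ("John had 3 apples. John had 8 balls and he gave 3 to Mary. "
     "How many balls does John have now?", "5"),
]
rate = consistency_rate(naive_solver, variants)
```

With this toy solver the first variant is answered correctly and the distractor variant is not, so `rate` is 0.5. Replacing `naive_solver` with a function that queries a real model would give a rough per-problem consistency score.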
Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. 2022a. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. In The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. 2021a. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021b. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022b. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.
Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. 2021c. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3697–3711.
Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022c. Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849.
Ting-Rui Chiang and Yun-Nung Chen. 2019. Semantically-aligned equation generation for solving and reasoning math word problems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2656–2668.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Peter Clark, Oren Etzioni, Tushar Khot, Daniel Khashabi, Bhavana Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, et al. 2020. From ‘f’ to ‘a’ on the ny regents science exams: An overview of the aristo project. AI Magazine, 41(4):39–53.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.
Edward A Feigenbaum et al. 1963. Computers and thought. McGraw-Hill.
Yu Feng, Jing Zhang, Xiaokang Zhang, Lemao Liu, Cuiping Li, and Hong Chen. 2021. Injecting numerical reasoning skills into knowledge base question answering models. arXiv preprint arXiv:2112.06109.
Deborah Ferreira and André Freitas. 2020a. Natural language premise selection: Finding supporting statements for mathematical text. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2175–2182, Marseille, France. European Language Resources Association.
Deborah Ferreira and André Freitas. 2020b. Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7365–7374, Online. Association for Computational Linguistics.
Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435.
Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. 2019. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6639–6648.
Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ramana Kumar, and Michael Norrish. 2021. TacticToe: Learning to Prove with Tactics. Journal of Automated Reasoning.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In International conference on machine learning, pages 1243–1252. PMLR.
Herbert Gelernter, James R Hansen, and Donald W Loveland. 1960. Empirical explorations of the geometry theorem machine. In Papers presented at the May 3-5, 1960, western joint IRE-AIEE-ACM computer conference, pages 143–149.
Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. arXiv preprint arXiv:2004.04487.
Kevin Gimpel, Dipanjan Das, and Noah A Smith. 2010. Distributed asynchronous online learning for natural language processing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 213–222.
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
Adam Grabowski, Artur Korniłowicz, and Adam Naumowicz. 2015. Four decades of mizar. Journal of Automated Reasoning, 55(3):191–198.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.
Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. 2022. Proof artifact co-training for theorem proving with language models. In International Conference on Learning Representations (ICLR).
Yihan Hao, Mingliang Zhang, Fei Yin, and Linlin Huang. 2022. Pgdp5k: A diagram parsing dataset for plane geometry problems. arXiv preprint arXiv:2205.09947.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020a. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. In 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks.
Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020b. Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2744–2751.
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Mueller, Francesco Piccinno, and Julian Eisenschlos. 2020. Tapas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4320–4333.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Yining Hong, Qing Li, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu. 2021a. Learning by fixing: Solving math word problems with weak supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4959–4967.
Yining Hong, Qing Li, Ran Gong, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu. 2021b. Smart: A situation model for algebra story problems via attributed grammar. In AAAI, pages 13009–13017.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. 2019. Gamepad: A learning environment for theorem proving. In ICLR.
Danqing Huang, Jing Liu, Chin-Yew Lin, and Jian Yin. 2018. Neural math word problem solver with reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 213–223.
Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 805–814.
Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 887–896.
Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. 2016. Deepmath-deep sequence models for premise selection. Advances in neural information processing systems, 29.
Albert Q Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. 2022a. Thor: Wielding hammers to integrate language models and automated theorem provers. arXiv preprint arXiv:2205.10893.
Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. 2022b. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In Submitted to The Eleventh International Conference on Learning Representations.
Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. 2021. Lisa: Language models of isabelle proofs. In 6th Conference on Artificial Intelligence and Theorem Proving (AITP 2021).
Zhanming Jie, Jierui Li, and Wei Lu. 2022. Learning to reason deductively: Math word problem solving as complex relation extraction. arXiv preprint arXiv:2203.10316.
Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. 2022. Maieutic prompting: Logically consistent reasoning with recursive explanations. ArXiv, abs/2205.11822.
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5648–5656.
Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. 2017. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300.
Cezary Kaliszyk, François Chollet, and Christian Szegedy. 2017. Holstep: A machine learning dataset for higher-order logic theorem proving. In ICLR.
Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, and Peter Clark. 2021. How much coffee was consumed during emnlp 2019? fermi problems: A new reasoning challenge for ai. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7318–7328.
Nikhil Kandpal, H. Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2022. Large language models struggle to learn long-tail knowledge. ArXiv, abs/2211.08411.
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. Unifiedqa: Crossing format boundaries with a single qa system. In Findings of the Association for Computational Linguistics (EMNLP), pages 1896–1907.
Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406.
Bugeun Kim, Kyung Seo Ki, Donggeon Lee, and Gahgene Gweon. 2020. Point to the expression: Solving algebraic word problems using the expression-pointer transformer model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3768–3779.
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1571–1581.
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5583–5594.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
Rik Koncel-K., Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. Mawps: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1152–1157.
Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish of the 57th Annual Meeting of the Association for
Sabharwal, Oren Etzioni, and Siena Dumas Ang. Computational Linguistics, pages 6162–6167.
2015. Parsing algebraic word problems into equa-
tions. Transactions of the Association for Computa- Jiwei Li, Alexander H Miller, Sumit Chopra,
tional Linguistics (TACL), 3:585–597. Marc’Aurelio Ranzato, and Jason Weston. 2016.
Dialogue learning with human-in-the-loop. arXiv
preprint arXiv:1611.09823.

Kundan Krishna, Jeffrey Bigham, and Zachary C Lipton. 2021. Does pretraining for summarization require knowledge transfer? arXiv preprint arXiv:2109.04953.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 271–281.

Guillaume Lample and François Charton. 2020. Deep learning for symbolic mathematics. In International Conference on Learning Representations (ICLR).

Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. 2022. Hypertree proof search for neural theorem proving. arXiv preprint arXiv:2205.11491.

Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. 2022. Mwptoolkit: an open-source framework for deep learning-based math word problem solvers. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13188–13190.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics (ACL).

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858.

Jierui Li, Lei Wang, Jipeng Zhang, Yan Wang, Bing Tian Dai, and Dongxiang Zhang. 2019. Modeling intra-relation in math word problems with different functional multi-head attentions. In Proceedings

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2020a. What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5265–5275.

Shucheng Li, Lingfei Wu, Shiwei Feng, Fangli Xu, Fengyuan Xu, and Sheng Zhong. 2020b. Graph-to-tree neural networks for learning structured input-output translation with applications to semantic parsing and math word problem. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2841–2852.

Wenda Li, Lei Yu, Yuhuai Wu, and Lawrence C Paulson. 2021. Isarstep: a benchmark for high-level mathematical reasoning. In International Conference on Learning Representations (ICLR).

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2022a. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336.

Zhongli Li, Wenxuan Zhang, Chao Yan, Qingyu Zhou, Chao Li, Hongzhi Liu, and Yunbo Cao. 2022b. Seeking patterns, not just memorizing procedures: Contrastive learning for solving math word problems. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2486–2496.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022a. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Percy Liang and Dan Klein. 2009. Online em for unsupervised models. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics, pages 611–619.

Zhenwen Liang, Jipeng Zhang, Lei Wang, Wei Qin, Yunshi Lan, Jie Shao, and Xiangliang Zhang. 2022b. Mwp-bert: Numeracy-augmented pre-training for math word problem solving. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 997–1009.

Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. 2020. Birds have four legs?! numersense: Probing numerical commonsense knowledge of pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6862–6868.
Xin Lin, Zhenya Huang, Hongke Zhao, Enhong Chen, Qi Liu, Hao Wang, and Shijin Wang. 2021. Hms: A hierarchical solver with dependency-enhanced understanding for math word problem. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4232–4240.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 158–167.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. 2022a. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114.

Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. 2022b. TAPEX: Table pre-training via learning a neural SQL executor. In International Conference on Learning Representations.

Qianying Liu, Wenyu Guan, Sujian Li, Fei Cheng, Daisuke Kawahara, and Sadao Kurohashi. 2020. Reverse operation based data augmentation for solving math word problems. arXiv preprint arXiv:2010.01556.

Qianying Liu, Wenyv Guan, Sujian Li, and Daisuke Kawahara. 2019a. Tree-structured decoding for solving math word problems. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2370–2379.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. 2017. Deep network guided proof search. arXiv preprint arXiv:1701.06972.

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. 2021a. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL).

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022a. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2022b. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610.

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. 2021b. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In The 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022c. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 8086–8098.

The mathlib Community. 2020. The lean mathematical library. In CPP 2020 - Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, co-located with POPL 2020.

Norman D. Megill and David A. Wheeler. 2019. Metamath: A Computer Language for Mathematical Proofs. Lulu Press, Morrisville, North Carolina. http://us.metamath.org/downloads/metamath.pdf.

Yuanliang Meng and Anna Rumshisky. 2019. Solving math word problems with double-decoder transformer. arXiv preprint arXiv:1908.10924.

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep learning–based text classification: a comprehensive review. ACM Computing Surveys (CSUR), 54(3):1–40.

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. 2022a. Lila: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. 2022b. Numglue: A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3505–3523.

Eric Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn, and Christopher D. Manning. 2022. Enhancing self-consistency and performance of pretrained language models with nli. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The lean theorem prover (system description). In International Conference on Automated Deduction, pages 378–388. Springer.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

A. Newell, J. C. Shaw, and H. A. Simon. 1957. Empirical explorations of the logic theory machine: A case study in heuristic. In Proceedings of the Western Joint Computer Conference, IRE-AIEE-ACM 1957.

Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Oleksandr Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. 2022. Learning from self-sampled correct and partially-correct programs. arXiv preprint arXiv:2205.14318.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 2080–2094.

Lawrence C. Paulson. 1994. Isabelle - A Generic Theorem Prover (with a contribution by T. Nipkow), volume 828 of Lecture Notes in Computer Science. Springer.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence.

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. 2022. Formal mathematics statement curriculum learning. ArXiv, abs/2202.01344.

Stanislas Polu and Ilya Sutskever. 2020. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393.

Jinghui Qin, Xiaodan Liang, Yining Hong, Jianheng Tang, and Liang Lin. 2021. Neural-symbolic solver for math word problems with auxiliary tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5870–5881.

Jinghui Qin, Lihui Lin, Xiaodan Liang, Rumin Zhang, and Liang Lin. 2020. Semantically-aligned universal tree-structured solver for math word problems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3780–3789.

Liang Qiu, Yizhou Zhao, Jinchao Li, Pan Lu, Baolin Peng, Jianfeng Gao, and Song-Chun Zhu. 2022a. Valuenet: A new dataset for human value driven dialogue system. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 44(5):2468–2484.

Liang Qiu, Yizhou Zhao, Yuan Liang, Pan Lu, Weiyan Shi, Zhou Yu, and Song-chun Zhu. 2022b. Towards socially intelligent agents with mental state transition and human value. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 146–158.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2020. Language models are unsupervised multitask learners. OpenAI Blog.

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21:1–67.

Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. 2019. Equate: A benchmark evaluation framework for quantitative reasoning in natural language inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 349–361.

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. ArXiv, abs/2202.07206.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.

Ryokan Ri and Yoshimasa Tsuruoka. 2022. Pretraining with artificial language: Studying transferable knowledge in language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7302–7315.

Benjamin Robaidek, Rik Koncel-Kedziorski, and Hannaneh Hajishirzi. 2018. Data-driven methods for solving algebra word problems. arXiv preprint arXiv:1804.10718.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.

Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics (TACL), 6:159–172.

Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics (TACL), 3:1–13.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. North American Chapter of the Association for Computational Linguistics (NAACL).

Mrinmaya Sachan, Kumar Dubey, and Eric Xing. 2017. From textbooks to knowledge: A case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 773–784.

Mrinmaya Sachan and Eric Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics, pages 251–261.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2020. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations (ICLR).

Tal Schuster, Ashwin Kalyan, Alex Polozov, and Adam Tauman Kalai. 2021. Programming puzzles. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 1466–1476.

Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & rank: A multi-task framework for math word problems. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2269–2279.

Yibin Shen and Cheqing Jin. 2020. Solving math word problems with multi-encoders and multi-decoders. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2924–2934.

Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP), pages 1132–1142.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. Dream: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.
Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. 2019. Quarel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7063–7071.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566.

Avijit Thawani, Jay Pujara, and Ashwin Kalyan. 2022. Estimating numbers without regression. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Workshop on MATH-AI.

Shyam Upadhyay and Ming-Wei Chang. 2015. Draw: A challenging and diverse algebra word problem set. Technical report, Citeseer.

Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (ACL), pages 494–504.

Josef Urban. 2006. Mptp 0.2: Design, implementation, and initial experiments. Journal of Automated Reasoning, 37(1):21–43.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do nlp models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315.

Lei Wang, Yan Wang, Deng Cai, Dongxiang Zhang, and Xiaojiang Liu. 2018a. Translating a math word problem to a expression tree. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1064–1069.

Lei Wang, Dongxiang Zhang, Lianli Gao, Jingkuan Song, Long Guo, and Heng Tao Shen. 2018b. Mathdqn: Solving arithmetic word problems via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.

Lei Wang, Dongxiang Zhang, Jipeng Zhang, Xing Xu, Lianli Gao, Bing Tian Dai, and Heng Tao Shen. 2019. Template-based math word problem solvers with recursive neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7144–7151.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 845–854.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. 2021. Naturalproofs: Mathematical theorem proving in natural language. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).

Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. 2022a. Naturalprover: Grounded mathematical proof generation with language models. arXiv preprint arXiv:2205.12910.

Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022b. Generating sequences by learning to self-correct. ArXiv, abs/2211.00053.

Sean Welleck, Peter West, Jize Cao, and Yejin Choi. 2022c. Symbolic brittleness in sequence models: on systematic generalization in symbolic mathematics. In AAAI.

Wu Wen-Tsun. 1986. Basic principles of mechanical theorem proving in elementary geometries. Journal of automated Reasoning, 2(3):221–252.

Daniel Whalen. 2016. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644.

Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19.

Qinzhuo Wu, Qi Zhang, Jinlan Fu, and Xuan-Jing Huang. 2020. A knowledge-aware sequence-to-tree network for math word problem solving. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7137–7146.

Qinzhuo Wu, Qi Zhang, and Zhongyu Wei. 2021a. An edge-enhanced hierarchical graph-to-tree network for math word problem solving. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1473–1482.
Qinzhuo Wu, Qi Zhang, Zhongyu Wei, and Xuan-Jing Huang. 2021b. Math word problem solving with explicit numerical values. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5859–5869.

Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, and Liang He. 2022a. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Baker Grosse. 2021c. Int: An inequality benchmark for evaluating generalization in theorem proving. In International Conference on Learning Representations (ICLR).

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Norman Rabe, Charles E Staats, Mateja Jamnik, and Christian Szegedy. 2022b. Autoformalization with large language models. In Advances in Neural Information Processing Systems.

Wei Yu, Mengzhu Wang, Xiaodong Wang, Xun Zhou, Yongfu Zha, Yongjian Zhang, Shuyu Miao, and Jingdong Liu. 2021a. Geore: A relation extraction dataset for chinese geometry problems. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Workshop on Math AI for Education (MATHAI4ED).

Weijiang Yu, Yingpeng Wen, Fudan Zheng, and Nong Xiao. 2021b. Improving math word problems with pre-trained knowledge and hierarchical reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3384–3394.

Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2022. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063.

Klim Zaporojets, Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2021. Solving arithmetic word problems by scoring equations with recursive neural networks. Expert Systems with Applications, 174:114704.

Jipeng Zhang, Roy Ka-Wei Lee, Ee-Peng Lim, Wei Qin, Lei Wang, Jie Shao, and Qianru Sun. 2020a. Teacher-student networks with multiple decoders for solving math word problem. In IJCAI.