
Reasoning and Prompting


Chain of Thought Prompting for

Large Language Model Reasoning


Zihan Ding and Zixu Zhang

COS 597G - Fall 2022


Hard Language Tasks: Reasoning

2
Reasoning Problems

Arithmetic Reasoning (AR) (+ − × ÷ …)

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: The answer is 5.

Symbolic Reasoning (SR)

Q: Take the last letters of the words in "Elon Musk" and concatenate them.

A: The answer is nk.

Commonsense Reasoning (CR)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer is (c).

3
Reasoning Problems

Fine-tune GPT-3 on GSM8K (arithmetic): (Cobbe et al. 2021)

Conjecture: to achieve > 80% accuracy, the 175B model would need 100 times more fine-tuning data.

4
Reasoning Problems

GSM8K (arithmetic):

Few-shot standard prompting with an even larger model (PaLM 540B) also does not work well.

5
Reasoning Problems

Scaling up language model size does not efficiently achieve high performance on Arithmetic Reasoning (AR), Commonsense Reasoning (CR), and Symbolic Reasoning (SR) tasks.

6
Reasoning Problems

Scaling up language model size does not efficiently achieve high performance on Arithmetic Reasoning (AR), Commonsense Reasoning (CR), and Symbolic Reasoning (SR) tasks.

Proposed solution: chain of thought prompting

7
Chain of Thought Prompting

8
Chain of Thought (CoT)

Few-Shot CoT

9
Chain of Thought (CoT)

Few-Shot CoT

Both papers will appear in NeurIPS'22!

Zero-Shot CoT

10
Chain of Thought (CoT)
Definition:

A chain of thought is a series of intermediate natural language reasoning steps that lead
to the final output.

11
Chain of Thought (CoT)
Definition:

A chain of thought is a series of intermediate natural language reasoning steps that lead
to the final output.

Intuition (from neural-symbolic computing):

use <input, intermediate results, output> triples, rather than simple <input, output> pairs

12
Chain of Thought (CoT)
Definition:

A chain of thought is a series of intermediate natural language reasoning steps that lead
to the final output.

Intuition (from neural-symbolic computing):

use <input, intermediate results, output> triples, rather than simple <input, output> pairs

Benefits:
● Decomposition -> easier intermediate problems
● Interpretable
● More general than neural-symbolic computing
● Leverages prompting of LLMs
13
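To make the definition concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled, assuming a generic text-completion function `generate(prompt)`; the exemplar text is from Wei et al. (2022), while the helper names are hypothetical, not the papers' released code.

# Minimal few-shot CoT prompt assembly (sketch; `generate` is a stand-in
# for any text-completion API, not a real library call).
COT_EXEMPLARS = [
    ("If there are 3 cars in the parking lot and 2 more cars arrive, "
     "how many cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. "
     "The answer is 5."),
]

def build_few_shot_cot_prompt(question: str) -> str:
    # Each exemplar is an <input, intermediate steps, output> triple,
    # rendered as "Q: ...\nA: <reasoning> The answer is <output>."
    parts = [f"Q: {q}\nA: {a}" for q, a in COT_EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_cot_prompt(
    "Olivia has $23. She bought five bagels for $3 each. "
    "How much money does she have left?")
# completion = generate(prompt)  # model continues: reasoning + final answer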
Chain of Thought (CoT)

Examples

14
Chain of Thought (CoT)
(Wei et al., 2022)

[Figure: standard prompt examples become CoT examples, which elicit a step-by-step answer]

15
Chain of Thought (CoT)
(Wei et al., 2022)

[Figure: examples → CoT examples → step-by-step answer]

(Kojima et al., 2022)

[Figure: two-stage prompting → step-by-step answer]

17
Zero-Shot Chain of Thought (CoT)
For zero-shot CoT, a two-stage prompting is applied:

Stage 1 (reasoning extraction): Question + Trigger1 → Reasoning Path

18
Zero-Shot Chain of Thought (CoT)
For zero-shot CoT, a two-stage prompting is applied:

Stage 1 (reasoning extraction): Question + Trigger1 → Reasoning Path

Stage 2 (answer extraction): Question + Trigger1 + Reasoning Path + Trigger2 → Answer

19
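A minimal sketch of this two-stage pipeline, again assuming a generic `generate(prompt)` stand-in; the two trigger sentences below are the defaults reported by Kojima et al. (2022).

# Zero-shot CoT via two-stage prompting (sketch; `generate` is a
# stand-in for any text-completion API).
TRIGGER1 = "Let's think step by step."    # stage 1: reasoning extraction
TRIGGER2 = "Therefore, the answer is"     # stage 2: answer extraction

def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit a reasoning path.
    stage1 = f"Q: {question}\nA: {TRIGGER1}"
    reasoning = generate(stage1)
    # Stage 2: append the reasoning path and ask for the final answer.
    stage2 = f"{stage1} {reasoning}\n{TRIGGER2}"
    return generate(stage2)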
Experiments

20
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

21
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)
○ Not your familiar GPT-3 (Brown et al., 2020)
○ Fine-tuned with human feedback
○ Stay tuned for the lecture on Nov. 14!!

22
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

● PaLM (8B, 62B, 540B) (Chowdhery et al., 2022)


○ Only accessible to Googlers 😞.

23
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

● PaLM (8B, 62B, 540B) (Chowdhery et al., 2022)

● LaMDA (422M, 2B, 8B, 68B, 137B) (Thoppilan et al., 2022)


○ Dialogue-oriented LM.
○ Fine-tuned on human-annotated data.

24
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

● PaLM (8B, 62B, 540B) (Chowdhery et al., 2022)

● LaMDA (422M, 2B, 8B, 68B, 137B) (Thoppilan et al., 2022)

● GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, davinci 175B)

● GPT-2 (1.5B)

● GPT-Neo (2.7B), GPT-J (6B), T0 (11B) (Sanh et al., 2022), OPT (13B) (Zhang et al., 2022)

25
Prior Best – Fine-tuning + Verification

GPT-3

1. Fine-tune the generator for 2 epochs on the training set.

2. Sample 100 solutions from the generator for each training problem and label each solution as correct or incorrect.

3. Train a verifier for a single epoch on this dataset to predict whether each solution is correct or incorrect.

(Cobbe et al. 2021) 26
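At test time, this pipeline samples many candidate solutions and returns the one the verifier scores highest. A minimal sketch, assuming hypothetical `generator.sample` and `verifier.score` interfaces rather than the released code:

# Best-of-N reranking with a trained verifier (sketch; `generator` and
# `verifier` are hypothetical stand-ins for the fine-tuned models).
def solve_with_verifier(problem: str, generator, verifier, n: int = 100) -> str:
    candidates = [generator.sample(problem) for _ in range(n)]
    # Return the candidate solution with the highest verifier score.
    return max(candidates, key=lambda sol: verifier.score(problem, sol))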
Experiments
Arithmetic Reasoning

27
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

● Manually composed 8 exemplars

28
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

● Manually composed 8 exemplars
● All contain equations with flexible formats

29
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

Free Response

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.

(Wei et al., 2022)

You can have one or more equations.

30
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

Free Response

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.

(Wei et al., 2022)

Equations can be incomplete and combine math with words.

31
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

● Manually composed 8 exemplars
● All contain equations with flexible formats
● Benchmarked on:
  ○ GSM8K (Cobbe et al. 2021)
  ○ SVAMP (Patel et al., 2021)
  ○ MAWPS (Koncel-Kedziorski et al., 2016)

32
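Because free-response CoT completions end with a sentence like "The answer is 8.", evaluation needs a small answer-extraction step. A minimal sketch; the regex is an assumption, not the papers' published extraction code.

import re

# Extract the final numeric answer from a free-response CoT completion
# (sketch; the pattern is an assumption, not the papers' released code).
def extract_numeric_answer(completion: str):
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", completion)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

assert extract_numeric_answer(
    "5 x 3 = 15 dollars. 23 - 15 is 8. The answer is 8.") == 8.0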
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

● 4 exemplars, whose questions, intermediate reasoning, and answers are from AQuA-RAT's training set

33
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

Multiple Choice

Q: If a / b = 3/4 and 8a + 5b = 22, then find the value of a.
Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2

A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).

The exemplars have various formats.


34
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

● 4 exemplars, whose questions, intermediate reasoning, and answers are from AQuA-RAT's training set
● Exemplars have flexible formats

35
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

● 4 exemplars, whose questions, intermediate reasoning, and answers are from the training set
● Exemplars have flexible formats
● Benchmarked on AQuA-RAT (Ling et al., 2017)

36
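For multiple-choice tasks, the extractor looks for an option letter instead of a number. A minimal sketch under the same caveat (the pattern is an assumption, not the papers' released code):

import re

# Extract the chosen option letter, e.g. "(e)", from a CoT completion
# (sketch; the pattern is an assumption, not the papers' released code).
def extract_choice(completion: str):
    match = re.search(r"[Tt]he answer is\s*\(?([a-e])\)?", completion)
    return match.group(1) if match else None

assert extract_choice("20 km/hr * 2.5 hrs = 50 km. The answer is (e).") == "e"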
Arithmetic Reasoning - Results

GSM8K
Josh decides to try flipping a
house. He buys a house for
$80,000 and then puts in $50,000
in repairs. This increased the
value of the house by 150%.
How much profit did he make?

SVAMP
Each pack of dvds costs 76
dollars. If there is a discount of
25 dollars on each pack. How
much do you have to pay to buy
each pack?

37
Arithmetic Reasoning - Results

GSM8K
Josh decides to try flipping a
house. He buys a house for
$80,000 and then puts in $50,000
in repairs. This increased the
value of the house by 150%.
How much profit did he make?

Fine-tuning + Verification

SVAMP
Each pack of dvds costs 76 dollars. If there is a discount of 25 dollars on each pack. How much do you have to pay to buy each pack?

38
Arithmetic Reasoning - Results

MAWPS - MultiArith
The school cafeteria ordered 42
red apples and 7 green apples for
students lunches. But, if only 9
students wanted fruit, how many
extra did the cafeteria end up
with?

AQuA-RAT
A person is traveling at 20 km/hr
and reached his destiny in 2.5 hr
then find the distance?
Answer Choices: (a) 53 km (b)
55 km (c) 52 km (d) 60 km (e) 50
km

39
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

41
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

● They do not positively impact performance for small models, but start to yield performance gains when used with models of more than ∼100B parameters.

42
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

● They do not positively impact performance for small models, but start to yield performance gains when used with models of more than ∼100B parameters.

● Few-shot CoT achieves better performance on LLMs than zero-shot CoT.

43
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

● They do not positively impact performance for small models, but start to yield performance gains when used with models of more than ∼100B parameters.

● Few-shot CoT achieves better performance on LLMs than zero-shot CoT.

● Instruct GPT-3 text-davinci-002 achieves similar performance to the PaLM 540B model.

44
Experiments
Symbolic Reasoning

45
Symbolic Reasoning - Last Letter Concatenation
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

● Generate full names by randomly concatenating names from the top one-thousand first and last names from name census data

46
Symbolic Reasoning - Last Letter Concatenation
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

● Generate full names by randomly concatenating names from the top one-thousand first and last names from name census data
● 4 exemplars with strict format

47
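A minimal sketch of this data generation, with two short hypothetical name lists standing in for the census data:

import random

# Build a last-letter-concatenation question with its gold answer (sketch;
# FIRST_NAMES/LAST_NAMES stand in for the top-1000 census name lists).
FIRST_NAMES = ["Elon", "Bill", "Larry"]
LAST_NAMES = ["Musk", "Gates", "Page"]

def make_last_letter_example(rng: random.Random):
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    question = (f'Take the last letters of the words in "{name}" '
                "and concatenate them")
    gold = "".join(word[-1] for word in name.split())  # "Elon Musk" -> "nk"
    return question, gold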
Symbolic Reasoning - Last Letter Concatenation
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

Last letter concatenation

Q: Take the last letters of the words in "Bill Gates" and concatenate them

A: The last letter of "Bill" is "l". The last letter of "Gates" is "s". Concatenating them is "ls". The answer is ls.

48
Symbolic Reasoning - Coin Flip
Coin Flip

Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

A: The coin was flipped by no one. So the coin was flipped 0 times. The coin started heads up, and it was not flipped, so it is still heads up. So the answer is yes.

Coin Flip

Q: A coin is heads up. Jamey flips the coin. Teressa flips the coin. Is the coin still heads up?

A: The coin was flipped by Jamey and Teressa. So the coin was flipped 2 times, which is an even number. The coin started heads up, so after an even number of flips, it will still be heads up. So the answer is yes.

8 exemplars with strict format.

49
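For Coin Flip, the gold answer is determined purely by the parity of the flip count, which makes instance generation equally simple. A minimal sketch (the name list is a hypothetical placeholder):

import random

# Generate a Coin Flip question whose gold label is flip-count parity
# (sketch; NAMES is a hypothetical placeholder list).
NAMES = ["Tom", "Mike", "Jamey", "Teressa"]

def make_coin_flip_example(rng: random.Random, n_people: int = 2):
    people = rng.sample(NAMES, n_people)
    actions = [(p, rng.random() < 0.5) for p in people]  # True = flips
    steps = " ".join(
        f"{p} flips the coin." if flips else f"{p} does not flip the coin."
        for p, flips in actions)
    question = f"A coin is heads up. {steps} Is the coin still heads up?"
    n_flips = sum(flips for _, flips in actions)
    gold = "yes" if n_flips % 2 == 0 else "no"  # even flips -> still heads up
    return question, gold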
Symbolic Reasoning - In & Out-of-domain Test
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

Coin Flip

Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

A: The coin was flipped by no one. So the coin was flipped 0 times. The coin started heads up, and it was not flipped, so it is still heads up. So the answer is yes.

● In-domain test set: examples had the same number of steps as the few-shot exemplars
● Out-of-domain (OOD) test set: examples had more steps than those in the exemplars.
50
Symbolic Reasoning - Last Letter Concatenation

In-Domain

Take the last letters of the words in "Elon Musk" and concatenate them.

Out-of-Domain

Take the last letters of the words in "Johann Sebastian Bach" and concatenate them.

51
*Zero-Shot results use Instruct-GPT-3 175B text-davinci-002 model.
Symbolic Reasoning - Coin Flip

In-Domain

A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

Out-of-Domain

A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Jake flips the coin. Is the coin still heads up?

52
*Zero-Shot results use Instruct-GPT-3 175B text-davinci-002 model.
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

53
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

● Both zero-shot and few-shot CoT prompting are emergent abilities of model scale.

54
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

● Few-shot CoT prompting is an emergent ability of model scale.

● CoT does not positively impact performance for small models, but starts to yield performance gains when using models of more than ∼100B parameters, for both in-domain and out-of-domain tests.

55
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

● Few-shot CoT prompting is an emergent ability of model scale.

● CoT does not positively impact performance for small models, but starts to yield performance gains when using models of more than ∼100B parameters, for both in-domain and out-of-domain tests.

● Zero-shot CoT using Instruct-GPT-3 175B achieves similar performance to few-shot CoT with the 540B PaLM model on both tasks.

56
Pre-Lecture Question 2
Q2: Wei et al., 2022 showed that CoT can improve out-of-domain performance. Can you
state their results and why do you think it is the case (i.e., adding intermediate steps can
improve robustness)?

57
Pre-Lecture Question 2
Q2: Wei et al., 2022 showed that CoT can improve out-of-domain performance. Can you
state their results and why do you think it is the case (i.e., adding intermediate steps can
improve robustness)?

While standard prompting fails out-of-domain tests for both tasks, large models with either zero-shot or few-shot CoT improve performance on both in-domain and out-of-domain tests. In these symbolic reasoning tasks, the CoT prompt guides the LM through the process that maps input to output. Even when questions are OOD in the sense of "how many words are in the name" or "how many states to track", the process for producing the output stays the same and can be learned from the exemplars. However, it is still unclear whether CoT will improve other OOD scenarios with more complex reasoning processes.

58
Experiments
CommonSense Reasoning

59
Commonsense Reasoning - Toy Problems
CSQA (Talmor et al., 2019)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer is (c).

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: The answer is yes.

Sport Understanding

Q: Is the following sentence plausible? "Jamel Murray was perfect from the line."

A: The answer is yes.

Date Understanding

Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?

A: So the answer is 01/05/2015.


60
Commonsense Reasoning - Toy Problems
SayCan Robot Planning

Locations = [counter, table, user, trash, bowl].

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice chips,
orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].

The robot can pick up items with pick(object) and put down items with put(object) as well as find
objects or locations with find(). The robot can only understand the explicit locations and objects
listed.

Human: How would you throw away a redbull?

Plan: 1. find(redbull), 2. pick(redbull), 3. find(trash), 4. put(redbull), 5. done().

61
Commonsense Reasoning - Toy Problems
SayCan Robot Planning

Locations = [counter, table, user, trash, bowl].

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice chips,
orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].

The robot can pick up items with pick(object) and put down items with put(object) as well as find
objects or locations with find(). The robot can only understand the explicit locations and objects
listed.

Human: How would you throw away a redbull?

Plan: 1. find(redbull), 2. pick(redbull), 3. find(trash), 4. put(redbull), 5. done().

These tasks not only require multi-step reasoning, but also need prior knowledge to understand complex semantics. 62
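A minimal sketch of how a generated plan could be checked against the allowed primitives; this validator is an illustration, not the SayCan implementation.

# Validate a SayCan-style plan against the allowed primitives (sketch;
# this checker is an assumption, not the SayCan implementation).
LOCATIONS = {"counter", "table", "user", "trash", "bowl"}
OBJECTS = {"7up", "apple", "redbull", "sponge", "water"}  # abbreviated list
PRIMITIVES = {"find", "pick", "put", "done"}

def is_valid_step(step: str) -> bool:
    name, _, rest = step.partition("(")
    arg = rest.rstrip(")")
    if name not in PRIMITIVES:
        return False
    # done() takes no argument; the others take a known object or location.
    return arg == "" if name == "done" else arg in OBJECTS | LOCATIONS

plan = ["find(redbull)", "pick(redbull)", "find(trash)",
        "put(redbull)", "done()"]
assert all(is_valid_step(s) for s in plan)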
Commonsense Reasoning - Results

[Figures: standard vs. CoT prompting results on the commonsense benchmarks across model scales]

65
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

66
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

● CoT prompting led to further gains, with improvements appearing to be largest for PaLM
540B.

67
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

● CoT prompting led to further gains, with improvements appearing to be largest for PaLM
540B.

● CoT has minimal benefits on CSQA and StrategyQA tasks.

68
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

● CoT prompting led to further gains, with improvements appearing to be largest for PaLM
540B.

● CoT has minimal benefits on CSQA and StrategyQA tasks.

● Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on the CSQA and StrategyQA tasks, but zero-shot CoT shows a significant improvement on the Date Understanding task.

69
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Equation only:
"5 + 6 = 11. The answer is 11."

70
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Equation only:
"5 + 6 = 11. The answer is 11."

Natural language in reasoning matters.


(Wei et al., 2022) 71
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Variable compute only:
"……………………… The answer is 11."

72
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Variable compute only:
"……………………… The answer is 11."

More intermediate computation does not help with the final answer.
(Wei et al., 2022)
73
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Reasoning after answer:
"The answer is 11. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."

74
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Reasoning after answer:
"The answer is 11. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."

CoT is not just activating knowledge seen in pre-training.


(Wei et al., 2022)
75
Ablation Study - Robustness (Style of Exemplar)
Change the style of exemplar in few-shot CoT:

(Wei et al., 2022)


Results for few-shot LaMDA 137B on two AR tasks: results have variance, but CoT still outperforms standard prompting; it is robust to linguistic styles and different exemplars.
76
Ablation Study - Robustness (Trigger Sentence)
Change the template (trigger sentence) in
zero-shot CoT:

(Kojima et al., 2022)

Results for zero-shot GPT-3 (davinci-002) 175B on the MultiArith AR task: different templates encourage the model to express reasoning quite differently.

77
Ablation Study - Model Size
Change the model sizes in CoT prompting:
(Kojima et al., 2022)

Results on the MultiArith AR task with different model sizes:

● Larger model, better reasoning
● CoT is effective only for larger models
● Few-shot is better than zero-shot
● Instruct GPT-3 is much better than the original GPT-3
78
Pre-Lecture Question 1
Q1: Describe how the two approaches from (Wei et al., 2022) and (Kojima et al., 2022) are
different. Which one do you think is a more viable solution in terms of cost, performance
and stability?

Wei's work uses a few-shot setting in which several demonstration examples are required and a CoT annotation must be provided for each example, while Kojima's work uses the LLM itself to generate the CoT via two-stage prompting and no longer requires annotated examples.

For most of the benchmarks we have seen, few-shot CoT achieves better performance than zero-shot CoT, while zero-shot CoT does not require human annotations, which can be costly.

Although there is no direct comparison between zero-shot and few-shot CoT on stability, few-shot CoT seems more robust, as its performance does not vary significantly when the prompt annotations change. On the other hand, zero-shot CoT shows significant performance variance across different trigger sentences.

79
More Advances - Self-Consistency
Change greedy decoding (single-path) to self-consistency (multi-path) in few-shot CoT:

Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171
(2022). 80
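A minimal sketch of self-consistency decoding: sample several reasoning paths at a nonzero temperature, extract each final answer, and take a majority vote. `generate` and `extract` are the hypothetical helpers sketched earlier, not the paper's code.

from collections import Counter

# Self-consistency (sketch): marginalize over sampled reasoning paths by
# majority-voting the extracted final answers.
def self_consistency(prompt: str, generate, extract, n: int = 40):
    answers = []
    for _ in range(n):
        completion = generate(prompt, temperature=0.7)  # sampled, not greedy
        answer = extract(completion)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None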
More Advances - Self-Consistency
Showcase results on AR, CR tasks:

Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171
(2022). 81
More Advances - Input-Rationale Ensemble
Use model-generated rationale in few-shot CoT:

Wang, Xuezhi, et al. "Rationale-Augmented Ensembles in Language Models." arXiv preprint arXiv:2207.00747 (2022).
82
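A minimal sketch of the input-rationale ensemble idea, heavily simplified from Wang et al. (2022): sample a model-generated rationale for each exemplar, build one few-shot prompt per sampled set, and vote over the resulting answers (same hypothetical helpers as above).

from collections import Counter

# Input-rationale ensemble (sketch, simplified from Wang et al., 2022):
# model-generated rationales replace human-written ones, and answers from
# several such prompts are aggregated by majority vote.
def rationale_ensemble(question, exemplar_qas, generate, extract, k: int = 5):
    votes = []
    for _ in range(k):
        parts = []
        for q, gold in exemplar_qas:
            rationale = generate(f"Q: {q}\nA: Let's think step by step.",
                                 temperature=0.7)
            parts.append(f"Q: {q}\nA: {rationale} The answer is {gold}.")
        parts.append(f"Q: {question}\nA:")
        answer = extract(generate("\n\n".join(parts)))
        if answer is not None:
            votes.append(answer)
    return Counter(votes).most_common(1)[0][0] if votes else None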
More Advances - Input-Rationale Ensemble
Showcase performance for AR reasoning tasks (PaLM-540B):

GSM8K (accuracy %)

Standard prompting                       17.9
Few-shot CoT (Wei et al. 2022)           56.5
Zero-shot CoT (Kojima et al. 2022)       43.0
Self-consistency (Wang et al. 2022)      74.4
Prompt-order ensemble                    75.4
Input-rationale ensemble                 73.8

The performance improvement on reasoning over previous CoT prompting is large, but not significant compared with self-consistency. 83
Pre-Lecture Question 3 and Discussion

Q3: Do you think the CoT method can be useful to other NLP tasks that we have seen in the
previous lectures (standard NLP tasks that are beyond the arithmetic/logic reasoning tasks that
these papers evaluated on)? Do you have any ideas about how we can collect the CoT data?

84
Reference
1. Wei, Jason, et al. "Chain of thought prompting elicits reasoning in large language models." arXiv preprint
arXiv:2201.11903 (2022).
2. Kojima, Takeshi, et al. "Large Language Models are Zero-Shot Reasoners." arXiv preprint arXiv:2205.11916 (2022).
3. Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
4. Patel, Arkil, Satwik Bhattamishra, and Navin Goyal. "Are NLP Models really able to Solve Simple Math Word
Problems?." NAACL 2021.
5. Miao, Shen-Yun, Chao-Chun Liang, and Keh-Yih Su. "A diverse corpus for evaluating and developing English math
word problem solvers." ACL 2020 (2020).
6. Koncel-Kedziorski, Rik, et al. "MAWPS: A math word problem repository." NAACL 2016.
7. Ling, Wang, et al. "Program induction by rationale generation: Learning to solve and explain algebraic word problems."
arXiv preprint arXiv:1705.04146 (2017).
8. Talmor, Alon, et al. "CommonsenseQA: A question answering challenge targeting commonsense knowledge." NAACL 2019.
9. Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint
arXiv:2204.01691 (2022).
10. Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint
arXiv:2203.11171 (2022).
11. Wang, Xuezhi, et al. "Rationale-Augmented Ensembles in Language Models." arXiv preprint arXiv:2207.00747 (2022).

85
Discussion
(This part will not be in presentation)

86
Summary of Arithmetic Reasoning Benchmark

Summary of math arithmetic reasoning benchmarks. N: number of evaluation examples (Wei et al. 2022)
87
Arithmetic Reasoning - Few-Shot CoT MAWPS Results

[Figures: MAWPS results for the LaMDA, GPT, and PaLM model families]

88
Arithmetic Reasoning - Zero-Shot CoT Additional Results

89
Prior Best – Fine-tuning + Verification

GPT-3

1. Fine-tune for 2 epochs on the training set.

90
Prior Best – Fine-tuning + Verification

GPT-3

Finding the one with the highest score

1. Fine-tune for 2 epochs on the training set.

91
Prior Best – Fine-tuning + Verification

GPT-3

1. Fine-tune for 2 epochs on the training set.

2. Sample 100 solutions from the generator for each training problem and label each solution as correct or incorrect.
Correct or Incorrect
(Cobbe et al. 2021) 92
Arithmetic Reasoning - Results

MAWPS - SingleEq
If there are 7 bottle caps in a box
and Linda puts 7 more bottle
caps inside, how many bottle
caps are in the box?

MAWPS - AddSub
There were 6 roses in the vase.
Mary cut some roses from her
flower garden. There are now 16
roses in the vase. How many
roses did she cut?

93
Commonsense Reasoning - CSQA CoT Prompt

CSQA (Talmor et al., 2019)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer must require cable. Of the above choices, only television requires cable. The answer is (c).

● 7 exemplars from training dataset

94
Commonsense Reasoning - CSQA CoT Prompt

CSQA (Talmor et al., 2019)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer must require cable. Of the above choices, only television requires cable. The answer is (c).

● 7 exemplars from training dataset

● Manually composed intermediate reasoning with strict format:
  "The answer must ___. Of the above choices, only ___."

95
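A small sketch of how the strict CSQA rationale template above could be instantiated programmatically (the helper names are hypothetical, not the papers' code):

# Fill the strict CSQA rationale template (sketch; hypothetical helper).
CSQA_TEMPLATE = ("The answer must {requirement}. Of the above choices, "
                 "only {support}. The answer is ({label}).")

rationale = CSQA_TEMPLATE.format(
    requirement="require cable",
    support="television requires cable",
    label="c")
# -> "The answer must require cable. Of the above choices, only television
#     requires cable. The answer is (c)."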
Commonsense Reasoning - Strategy QA CoT Prompt

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. The answer is yes.

● 6 exemplars from training dataset

96
Commonsense Reasoning - Strategy QA CoT Prompt

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. The answer is yes.

● 6 exemplars from training dataset

● Manually composed intermediate reasoning with flexible format

97
Commonsense Reasoning - Strategy QA CoT Prompt

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. The answer is yes.

StrategyQA (Geva et al., 2021)

Q: Yes or no: Is it common to see frost during some college commencements?

A: College commencement ceremonies can happen in December, May, and June. December is in the winter, so there can be frost. Thus, there could be frost at some commencements. The answer is yes.

98
Commonsense Reasoning - CoT Prompt

Date Understanding

Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?

A: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014. So one week from today will be 01/05/2015. So the answer is 01/05/2015.

Sport Understanding

Q: Is the following sentence plausible? "Jamel Murray was perfect from the line."

A: Jamal Murray is a basketball player. Being perfect from the line is part of basketball. The answer is yes.

6 exemplars for Date Understanding; 8 exemplars for Sport Understanding


99
Commonsense Reasoning - CoT Prompt
SayCan Robot Planning (Ahn et al., 2022)

Locations = [counter, table, user, trash, bowl].

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice
chips, orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].

The robot can pick up items with pick(object) and put down items with put(object) as well as find
objects or locations with find(). The robot can only understand the explicit locations and objects
listed.

Human: How would you throw away a redbull?

Explanation: The user has asked me to throw away the redbull, I will move it to the trash.

Plan: 1. find(redbull), 2. pick(redbull), 3. find(trash), 4. put(redbull), 5. done().

7 exemplars for SayCan 100


Ablation Study - Robustness (Example Distribution)
Change the examples in few-shot CoT:

Change examples from in-domain to out-of-domain!

Results for few-shot prompting on two AR tasks with exemplars from a CR task (CommonsenseQA): cross-domain exemplars with the same format cause only minor performance degradation. 101
