Reasoning and Prompting
Reasoning Problems

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: The answer is 5.

Q: Take the last letters of the words in "Elon Musk" and concatenate them.
A: The answer is nk.

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet
A: The answer is (c).

GSM8K (arithmetic): [example figure]

Scaling up language model size alone does not efficiently achieve high performance on Arithmetic Reasoning (AR), Commonsense Reasoning (CR), and Symbolic Reasoning (SR) tasks.
Chain of Thought Prompting

Chain of Thought (CoT)
● Few-Shot CoT
● Zero-Shot CoT
Chain of Thought (CoT)
Definition:
A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output: use <input, intermediate results, output> triples rather than simple <input, output> pairs.

Benefits:
● Decomposition -> easier intermediate problems
● Interpretable
● More general than neural symbolic computing
● Leverages prompting of LLMs
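To make the triple format concrete, here is a minimal sketch (in Python; not from the papers) contrasting a standard <input, output> exemplar with a CoT <input, intermediate results, output> exemplar, using the parking-lot example from earlier:

```python
# A minimal sketch (illustrative, not from the papers) of the two exemplar
# formats. The question and rationale are the parking-lot example above.

STANDARD_EXEMPLAR = (  # <input, output> pair
    "Q: If there are 3 cars in the parking lot and 2 more cars arrive, "
    "how many cars are in the parking lot?\n"
    "A: The answer is 5."
)

COT_EXEMPLAR = (  # <input, intermediate results, output> triple
    "Q: If there are 3 cars in the parking lot and 2 more cars arrive, "
    "how many cars are in the parking lot?\n"
    "A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. "  # intermediate steps
    "The answer is 5."
)
```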
Chain of Thought (CoT) - Examples
[Figures (Wei et al., 2022): step-by-step answers in few-shot CoT, and two-stage prompting with step-by-step answers in zero-shot CoT]
Zero-Shot Chain of Thought (CoT)
For zero-shot CoT, two-stage prompting is applied:
● Stage 1: Question + Trigger1 -> Reasoning Path
● Stage 2: Question + Trigger1 + Reasoning Path + Trigger2 -> Answer
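A minimal Python sketch of this two-stage pipeline, assuming a hypothetical `complete` function standing in for any LLM completion API; the trigger sentences shown are the defaults reported by Kojima et al. (2022):

```python
# A minimal sketch of two-stage zero-shot CoT (Kojima et al., 2022).
# `complete` is a hypothetical stand-in for any LLM completion API.

def complete(prompt: str) -> str:
    """Placeholder: call an LLM of your choice and return its completion."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    trigger1 = "Let's think step by step."  # reasoning-extraction trigger
    trigger2 = "Therefore, the answer is"   # answer-extraction trigger

    # Stage 1: elicit a reasoning path.
    stage1_prompt = f"Q: {question}\nA: {trigger1}"
    reasoning_path = complete(stage1_prompt)

    # Stage 2: feed the reasoning path back and extract the final answer.
    stage2_prompt = f"{stage1_prompt} {reasoning_path}\n{trigger2}"
    return complete(stage2_prompt)
```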
Experiments
Models
Pre-trained LLMs:
● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)
○ Not your familiar GPT-3 (Brown et al., 2020)
○ Fine-tuned with human feedback
○ Stay tuned for the lecture on Nov. 14!!
● GPT-2 (1.5B)
● GPT-Neo (2.7B), GPT-J (6B), T0 (11B) (Sanh et al., 2022), OPT (13B) (Zhang et al., 2022)
Prior Best - Fine-tuning + Verification
[Figure: GPT-3 fine-tuning + verification results (Cobbe et al., 2021)]
Free Response - Few-Shot CoT Prompt Exemplar

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

● Manually composed 8 exemplars
● All contain equations with flexible formats
● Benchmarked on:
○ GSM8K (Cobbe et al., 2021)
○ SVAMP (Patel et al., 2021)
○ MAWPS (Koncel-Kedziorski et al., 2016)
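As a sketch of how such a prompt is assembled and scored, the snippet below builds a few-shot CoT prompt from exemplars like the one above and parses the "The answer is ..." convention; the exemplar list is truncated and the regex in `extract_answer` is an illustrative assumption, not the authors' code:

```python
import re

# Assemble a few-shot CoT prompt from (question, rationale) exemplars and
# parse the final answer. Illustrative sketch, not the paper's code.

EXEMPLARS = [
    ("If there are 3 cars in the parking lot and 2 more cars arrive, "
     "how many cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. "
     "The answer is 5."),
    # ... the other 7 manually composed exemplars from Wei et al. (2022)
]

def build_prompt(question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nA:"

def extract_answer(completion: str) -> str | None:
    # Take whatever follows the last "The answer is" (number or choice letter).
    matches = re.findall(r"The answer is\s*\(?([-\d.,a-e]+)\)?", completion)
    return matches[-1].rstrip(".") if matches else None
```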
Multiple Choice - Few-Shot CoT Prompt Exemplar

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

● 4 exemplars, whose questions, intermediate reasoning, and answers are from AQuA-RAT's training set
● Exemplars have flexible formats
● Benchmarked on AQuA-RAT (Ling et al., 2017)
Arithmetic Reasoning - Results

GSM8K: Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?

SVAMP: Each pack of dvds costs 76 dollars. If there is a discount of 25 dollars on each pack. How much do you have to pay to buy each pack?

[Results figures; "Fine-tuning + Verification" marks the prior best baseline]
Arithmetic Reasoning - Results

MAWPS - MultiArith: The school cafeteria ordered 42 red apples and 7 green apples for students lunches. But, if only 9 students wanted fruit, how many extra did the cafeteria end up with?

AQuA-RAT: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance? Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km
Arithmetic Reasoning - Observations
[Results figures across model scales]
Experiments
Symbolic Reasoning
Symbolic Reasoning - Last Letter Concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them
A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

Q: Take the last letters of the words in "Bill Gates" and concatenate them
A: The last letter of "Bill" is "l". The last letter of "Gates" is "s". Concatenating them is "ls". The answer is ls.

● Generate full names by randomly concatenating names from the top one thousand first and last names from name census data
● 4 exemplars with strict format
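A minimal sketch of this data generation, with tiny illustrative name lists standing in for the top-1,000 census names; the ground-truth answer falls out of the construction:

```python
import random

# Sketch of the last-letter-concatenation generator described above.
# FIRST_NAMES / LAST_NAMES are tiny stand-ins for the top-1000 census lists.

FIRST_NAMES = ["Elon", "Bill", "Ada", "Grace"]
LAST_NAMES = ["Musk", "Gates", "Lovelace", "Hopper"]

def make_example(num_words: int = 2) -> tuple[str, str]:
    # In-domain tests use the same number of words as the exemplars (2);
    # out-of-domain tests use longer names (3 or 4 words).
    words = [random.choice(FIRST_NAMES)] + random.choices(LAST_NAMES, k=num_words - 1)
    name = " ".join(words)
    question = f'Take the last letters of the words in "{name}" and concatenate them'
    gold = "".join(w[-1] for w in words)  # ground-truth answer, e.g. "nk"
    return question, gold
```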
Symbolic Reasoning - Coin Flip

Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?
Q: A coin is heads up. Jamey flips the coin. Teressa flips the coin. Is the coin still heads up?
Symbolic Reasoning - In & Out-of-domain Test
Last letter concatenation Coin Flip
Q: Take the last letters of the words Q: A coin is heads up. Tom does not
in "Elon Musk" and concatenate flip the coin. Mike does not flip the
them coin. Is the coin still heads up?
● In-domain test set: examples had the same number of steps as the few-shot exemplars
● Out-of-domain (OOD) test set: examples had more steps than those in the exemplars.
50
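A sketch of a coin-flip generator with a controllable number of steps, which is exactly the knob that separates the in-domain split from the OOD split; the names and flip probability are illustrative:

```python
import random

# Sketch of the coin-flip task with a controllable step count, mirroring
# the in-domain (same step count as exemplars) vs. OOD (more steps) split.

PEOPLE = ["Tom", "Mike", "Jamey", "Teressa"]

def make_coin_flip(num_steps: int) -> tuple[str, str]:
    heads_up = True
    steps = []
    for person in random.sample(PEOPLE, num_steps):
        flips = random.random() < 0.5
        steps.append(f"{person} {'flips' if flips else 'does not flip'} the coin.")
        heads_up ^= flips  # each real flip toggles the coin's state
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    return question, "yes" if heads_up else "no"

# In-domain: make_coin_flip(2); OOD: make_coin_flip(4)  (step counts illustrative)
```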
Symbolic Reasoning - Last Letter Concatenation
[In-domain and out-of-domain results figures]

Symbolic Reasoning - Coin Flip
[In-domain and out-of-domain results figures]

*Zero-shot results use the Instruct-GPT-3 175B text-davinci-002 model.
Symbolic Reasoning - Observations
● Both zero-shot and few-shot CoT prompting are emergent abilities of model scale.
● CoT does not positively impact performance for small models, but starts to yield performance gains with models of more than ~100B parameters, for both in-domain and out-of-domain tests.
● Zero-shot CoT with Instruct-GPT-3 175B achieves performance similar to few-shot CoT with the 540B PaLM model on both tasks.
Pre-Lecture Question 2
Q2: Wei et al. (2022) showed that CoT can improve out-of-domain performance. Can you state their results, and why do you think this is the case (i.e., why can adding intermediate steps improve robustness)?

While standard prompting fails the out-of-domain tests for both tasks, large models with both zero-shot and few-shot CoT improve performance on both the in-domain and out-of-domain tests. In these symbolic reasoning tasks, the CoT prompt guides the LM to reason through the process of mapping the input to the output. Even when questions are OOD in the sense of "how many words are in the name" or "how many states to track", the process of producing the output is the same and can be reproduced from the exemplars. However, it remains unclear whether CoT would improve other OOD scenarios with more complex reasoning processes.
Experiments
Commonsense Reasoning
Commonsense Reasoning - Toy Problems

● CSQA (Talmor et al., 2019)
● StrategyQA (Geva et al., 2021)
● SayCan Robot Planning (Ahn et al., 2022):
Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice chips, orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].
The robot can pick up items with pick(object) and put down items with put(object), as well as find objects or locations with find(). The robot can only understand the explicit locations and objects listed.

These tasks not only require multi-step reasoning, but also need prior knowledge to understand complex semantics.
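A sketch of how a SayCan-style planning prompt might be assembled from the object list and robot API above; the exemplar request and the plan line are illustrative additions, not taken from the paper's actual prompts:

```python
# Sketch of a planning prompt built from the slide's object list and
# pick/put/find API. The exemplar plan and user request are illustrative.

OBJECTS = ["7up", "apple", "kettle chips", "tea", "multigrain chips", "coke",
           "lime soda", "jalapeno chips", "rice chips", "orange",
           "grapefruit soda", "pepsi", "redbull", "energy bar", "sponge", "water"]

PROMPT_TEMPLATE = """Objects = [{objects}].
The robot can pick up items with pick(object) and put down items with
put(object) as well as find objects or locations with find().

Human: Please throw away the redbull.
Explanation: The user has asked me to throw away the redbull, I will move it to the trash.
Plan: find(redbull), pick(redbull), find(trash), put(redbull)

Human: {request}
Explanation:"""

def build_plan_prompt(request: str) -> str:
    return PROMPT_TEMPLATE.format(objects=", ".join(OBJECTS), request=request)
```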
Commonsense Reasoning - Results
[Results figures]
Commonsense Reasoning - Observations
● For all tasks, scaling up model size improved the performance of standard prompting.
● CoT prompting led to further gains, with improvements appearing to be largest for PaLM 540B.
● Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on the CSQA and StrategyQA tasks, but zero-shot CoT shows a significant improvement on the Date Understanding task.
Ablation Study - Variations of Few-Shot CoT
Change the type of CoT:
● Equation only: "5 + 6 = 11. The answer is 11."
● More intermediate computation does not help with the final answer (Wei et al., 2022).

Results for zero-shot GPT-3 (davinci-002, 175B) on the MultiArith AR task: different trigger templates encourage the model to express its reasoning quite differently.
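A sketch of running this trigger-template ablation, assuming the `zero_shot_cot` helper from earlier extended with a trigger argument (a hypothetical signature); the template list mirrors the kind of variants Kojima et al. (2022) test but should be treated as illustrative:

```python
# Swap in different Trigger1 sentences and compare accuracy. Illustrative
# sketch; `zero_shot_cot(question, trigger1=...)` is a hypothetical
# extension of the two-stage helper sketched earlier.

TRIGGERS = [
    "Let's think step by step.",
    "Let's think about this logically.",
    "First,",
    "Don't think. Just feel.",  # an intentionally unhelpful control
]

def accuracy(trigger: str, dataset: list[tuple[str, str]]) -> float:
    correct = 0
    for question, gold in dataset:
        answer = zero_shot_cot(question, trigger1=trigger)  # hypothetical signature
        correct += answer.strip() == gold
    return correct / len(dataset)
```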
Ablation Study - Model Size
Change the model sizes in CoT prompting (Kojima et al., 2022):

Wei et al.'s work uses a few-shot setting, which requires several demonstration examples with a CoT annotation provided for each, while Kojima et al.'s work uses the LLM itself to generate the CoT via two-stage prompting and no longer requires annotated examples.

For most of the benchmarks we have seen, few-shot CoT performs better than zero-shot CoT, but zero-shot CoT does not require human annotations, which can be costly.

Although there is no direct comparison between zero-shot and few-shot CoT on stability, few-shot CoT appears more robust, as its performance does not vary significantly when the prompt annotations change. Zero-shot CoT, on the other hand, shows significant performance variance across different trigger sentences.
More Advances - Self-Consistency
Change greedy decoding (single-path) to self-consistency (multi-path) in few-shot CoT (Wang et al., 2022, arXiv:2203.11171).
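A minimal sketch of self-consistency decoding, reusing the hypothetical `complete` and `extract_answer` helpers from the earlier sketches; the path count and sampling setup are illustrative:

```python
from collections import Counter

# Sketch of self-consistency (Wang et al., 2022): sample several reasoning
# paths with temperature > 0 instead of one greedy path, then take a
# majority vote over the final answers. `complete` and `extract_answer`
# are the hypothetical helpers sketched earlier.

def self_consistency(prompt: str, num_paths: int = 40) -> str:
    votes = Counter()
    for _ in range(num_paths):
        # Each call samples a different chain of thought (e.g. temperature=0.7).
        completion = complete(prompt)
        answer = extract_answer(completion)
        if answer is not None:
            votes[answer] += 1
    # The most frequent final answer wins, marginalizing over reasoning paths.
    return votes.most_common(1)[0][0]
```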
More Advances - Self-Consistency
Showcase results on AR and CR tasks (Wang et al., 2022, arXiv:2203.11171): [results figures]
More Advances - Input-Rationale Ensembles
Use model-generated rationales in few-shot CoT (Wang et al., 2022, arXiv:2207.00747).
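A sketch of one way such an ensemble can work: sample fresh model-generated rationales for the exemplar questions each round, build a different few-shot prompt from them, and vote over the answers. `generate_rationale` is a hypothetical helper, and this recipe is a simplification in the spirit of the paper, not its exact method:

```python
from collections import Counter

# Simplified sketch of a rationale-augmented ensemble: vary the exemplar
# rationales (rather than relying on one fixed human-written set), then
# aggregate answers. `complete` and `extract_answer` are the hypothetical
# helpers from the earlier sketches.

def generate_rationale(question: str) -> str:
    """Hypothetical helper: e.g. sample a zero-shot CoT reasoning path for
    `question` and append 'The answer is ...'."""
    raise NotImplementedError

def rationale_ensemble(question: str, train_questions: list[str],
                       num_prompts: int = 5) -> str:
    votes = Counter()
    for _ in range(num_prompts):
        # Sample a fresh rationale per exemplar, giving a new prompt each round.
        exemplars = [(q, generate_rationale(q)) for q in train_questions]
        shots = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in exemplars)
        completion = complete(f"{shots}\n\nQ: {question}\nA:")
        answer = extract_answer(completion)
        if answer is not None:
            votes[answer] += 1
    return votes.most_common(1)[0][0]
```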
More Advances - Input-Rationale Ensembles
Showcase performance on AR reasoning tasks (PaLM-540B): [results table; e.g., standard prompting scores 17.9 on GSM8K]
The improvement on reasoning over the earlier CoT is large, but not significant compared with self-consistency.
Pre-Lecture Question 3 and Discussion
Q3: Do you think the CoT method can be useful for other NLP tasks that we have seen in previous lectures (standard NLP tasks beyond the arithmetic/logic reasoning tasks that these papers evaluated on)? Do you have any ideas about how we could collect the CoT data?
References
1. Wei, Jason, et al. "Chain of thought prompting elicits reasoning in large language models." arXiv preprint arXiv:2201.11903 (2022).
2. Kojima, Takeshi, et al. "Large Language Models are Zero-Shot Reasoners." arXiv preprint arXiv:2205.11916 (2022).
3. Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
4. Patel, Arkil, Satwik Bhattamishra, and Navin Goyal. "Are NLP Models really able to Solve Simple Math Word Problems?" NAACL 2021.
5. Miao, Shen-Yun, Chao-Chun Liang, and Keh-Yih Su. "A diverse corpus for evaluating and developing English math word problem solvers." ACL 2020.
6. Koncel-Kedziorski, Rik, et al. "MAWPS: A math word problem repository." NAACL 2016.
7. Ling, Wang, et al. "Program induction by rationale generation: Learning to solve and explain algebraic word problems." arXiv preprint arXiv:1705.04146 (2017).
8. Talmor, Alon, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. "CommonsenseQA: A question answering challenge targeting commonsense knowledge." NAACL 2019.
9. Ahn, Michael, et al. "Do as I can, not as I say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).
10. Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171 (2022).
11. Wang, Xuezhi, et al. "Rationale-Augmented Ensembles in Language Models." arXiv preprint arXiv:2207.00747 (2022).
Discussion
(This part will not be in the presentation)
Summary of Arithmetic Reasoning Benchmarks
[Table: summary of math arithmetic reasoning benchmarks; N = number of evaluation examples (Wei et al., 2022)]
Arithmetic Reasoning - Few-Shot CoT MAWPS Results
[Results figures]

Arithmetic Reasoning - Zero-Shot CoT Additional Results
[Results figures]
Prior Best - Fine-tuning + Verification
[GPT-3 results figures]

MAWPS - SingleEq: If there are 7 bottle caps in a box and Linda puts 7 more bottle caps inside, how many bottle caps are in the box?

MAWPS - AddSub: There were 6 roses in the vase. Mary cut some roses from her flower garden. There are now 16 roses in the vase. How many roses did she cut?
Commonsense Reasoning - CSQA CoT Prompt
[Prompt figure]

Commonsense Reasoning - StrategyQA CoT Prompt
[Prompt figure]
Commonsense Reasoning - CoT Prompt

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice chips, orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].
The robot can pick up items with pick(object) and put down items with put(object), as well as find objects or locations with find(). The robot can only understand the explicit locations and objects listed.
Explanation: The user has asked me to throw away the redbull, I will move it to the trash.

Results for few-shot prompting on two AR tasks with exemplars from a CR task (CommonsenseQA): cross-domain exemplars with the same format cause only minor performance degradation.