
Reasoning and Prompting


Chain of Thought Prompting for

Large Language Model Reasoning


Zihan Ding and Zixu Zhang

COS 597G - Fall 2022


Hard Language Tasks: Reasoning

2
Reasoning Problems

Arithmetic Reasoning (AR) (+ − × ÷ …)

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: The answer is 5.

Symbolic Reasoning (SR)

Q: Take the last letters of the words in "Elon Musk" and concatenate them.

A: The answer is nk.

Commonsense Reasoning (CR)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer is (c).

3
Reasoning Problems

Fine-tune GPT-3 on GSM8K (arithmetic): (Cobbe et al. 2021)

Conjecture: to achieve > 80% accuracy, the 175B model would need 100 times more fine-tuning data.

4
Reasoning Problems

GSM8K (arithmetic):

Few-shot standard prompting with an even larger model (PaLM 540B) also does not work well.

5
Reasoning Problems

Scaling up language model size does not efficiently achieve high performance on Arithmetic Reasoning (AR), Commonsense Reasoning (CR), and Symbolic Reasoning (SR) tasks.

6
Reasoning Problems

Scaling up language model size does not efficiently achieve high performance on Arithmetic Reasoning (AR), Commonsense Reasoning (CR), and Symbolic Reasoning (SR) tasks.

Proposed solution: chain of thought prompting

7
Chain of Thought Prompting

8
Chain of Thought (CoT)

Few-Shot CoT

9
Chain of Thought (CoT)

Few-Shot CoT

Both papers will appear in NeurIPS'22!

Zero-Shot CoT

10
Chain of Thought (CoT)
Definition:

A chain of thought is a series of intermediate natural language reasoning steps that lead
to the final output.

11
Chain of Thought (CoT)
Definition:

A chain of thought is a series of intermediate natural language reasoning steps that lead
to the final output.

Intuition (from neural-symbolic computing):

use <input, intermediate results, output> triples, rather than simple <input, output> pairs

12
Chain of Thought (CoT)
Definition:

A chain of thought is a series of intermediate natural language reasoning steps that lead
to the final output.

Intuition (from neural-symbolic computing):

use <input, intermediate results, output> triples, rather than simple <input, output> pairs

Benefits:
● Decomposition -> easier intermediate problems
● Interpretable
● More general than neural-symbolic computing
● Leverages prompting of LLMs
13
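To make the definition concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled, assuming a generic text-completion function `generate(prompt)`; the exemplar text is from Wei et al. (2022), while the helper names are hypothetical, not the papers' released code.

# Minimal few-shot CoT prompt assembly (sketch; `generate` is a stand-in
# for any text-completion API, not a real library call).
COT_EXEMPLARS = [
    ("If there are 3 cars in the parking lot and 2 more cars arrive, "
     "how many cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. "
     "The answer is 5."),
]

def build_few_shot_cot_prompt(question: str) -> str:
    # Each exemplar is an <input, intermediate steps, output> triple,
    # rendered as "Q: ...\nA: <reasoning> The answer is <output>."
    parts = [f"Q: {q}\nA: {a}" for q, a in COT_EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_cot_prompt(
    "Olivia has $23. She bought five bagels for $3 each. "
    "How much money does she have left?")
# completion = generate(prompt)  # model continues: reasoning + final answer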
Chain of Thought (CoT)

Examples

14
Chain of Thought (CoT)
(Wei et al., 2022)

[Figure: standard prompt examples become CoT examples, which elicit a step-by-step answer]

15
Chain of Thought (CoT)
(Wei et al., 2022)

[Figure: examples → CoT examples → step-by-step answer]

(Kojima et al., 2022)

[Figure: two-stage prompting → step-by-step answer]

17
Zero-Shot Chain of Thought (CoT)
For zero-shot CoT, a two-stage prompting is applied:

Stage 1 (reasoning extraction): Question + Trigger1 → Reasoning Path

18
Zero-Shot Chain of Thought (CoT)
For zero-shot CoT, a two-stage prompting is applied:

Stage 1 (reasoning extraction): Question + Trigger1 → Reasoning Path

Stage 2 (answer extraction): Question + Trigger1 + Reasoning Path + Trigger2 → Answer

19
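A minimal sketch of this two-stage pipeline, again assuming a generic `generate(prompt)` stand-in; the two trigger sentences below are the defaults reported by Kojima et al. (2022).

# Zero-shot CoT via two-stage prompting (sketch; `generate` is a
# stand-in for any text-completion API).
TRIGGER1 = "Let's think step by step."    # stage 1: reasoning extraction
TRIGGER2 = "Therefore, the answer is"     # stage 2: answer extraction

def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit a reasoning path.
    stage1 = f"Q: {question}\nA: {TRIGGER1}"
    reasoning = generate(stage1)
    # Stage 2: append the reasoning path and ask for the final answer.
    stage2 = f"{stage1} {reasoning}\n{TRIGGER2}"
    return generate(stage2)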
Experiments

20
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

21
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)
○ Not your familiar GPT-3 (Brown et al., 2020)
○ Fine-tuned with human feedback
○ Stay tuned for the lecture on Nov. 14!!

22
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

● PaLM (8B, 62B, 540B) (Chowdhery et al., 2022)


○ Only accessible to Googlers 😞.

23
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

● PaLM (8B, 62B, 540B) (Chowdhery et al., 2022)

● LaMDA (422M, 2B, 8B, 68B, 137B) (Thoppilan et al., 2022)


○ Dialogue-oriented LM.
○ Fine-tuned on human-annotated data.

24
Models
Pre-trained LLMs:

● Instruct GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, and davinci 175B) (Ouyang et al., 2022)

● PaLM (8B, 62B, 540B) (Chowdhery et al., 2022)

● LaMDA (422M, 2B, 8B, 68B, 137B) (Thoppilan et al., 2022)

● GPT-3 (ada 350M, babbage 1.3B, curie 6.7B, davinci 175B)

● GPT-2 (1.5B)

● GPT-Neo (2.7B), GPT-J (6B), T0 (11B) (Sanh et al., 2022), OPT (13B) (Zhang et al., 2022)

25
Prior Best – Fine-tuning + Verification

GPT-3

1. Fine-tune the generator for 2 epochs on the training set.

2. Sample 100 solutions from the generator for each training problem and label each solution as correct or incorrect.

3. Train a verifier for a single epoch on this dataset to predict whether each solution is correct or incorrect.

(Cobbe et al. 2021) 26
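At test time, this pipeline samples many candidate solutions and returns the one the verifier scores highest. A minimal sketch, assuming hypothetical `generator.sample` and `verifier.score` interfaces rather than the released code:

# Best-of-N reranking with a trained verifier (sketch; `generator` and
# `verifier` are hypothetical stand-ins for the fine-tuned models).
def solve_with_verifier(problem: str, generator, verifier, n: int = 100) -> str:
    candidates = [generator.sample(problem) for _ in range(n)]
    # Return the candidate solution with the highest verifier score.
    return max(candidates, key=lambda sol: verifier.score(problem, sol))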
Experiments
Arithmetic Reasoning

27
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

● Manually composed 8 exemplars

28
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

● Manually composed 8 exemplars
● All contain equations with flexible formats

29
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

Free Response

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.

(Wei et al., 2022)

You can have one or more equations.

30
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

Free Response

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.

(Wei et al., 2022)

Equations can be incomplete and combine math with words.

31
Free Response - Few-Shot CoT Prompt Exemplar

Free Response

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?

A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

(Wei et al., 2022)

● Manually composed 8 exemplars
● All contain equations with flexible formats
● Benchmarked on:
  ○ GSM8K (Cobbe et al. 2021)
  ○ SVAMP (Patel et al., 2021)
  ○ MAWPS (Koncel-Kedziorski et al., 2016)

32
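Because free-response CoT completions end with a sentence like "The answer is 8.", evaluation needs a small answer-extraction step. A minimal sketch; the regex is an assumption, not the papers' published extraction code.

import re

# Extract the final numeric answer from a free-response CoT completion
# (sketch; the pattern is an assumption, not the papers' released code).
def extract_numeric_answer(completion: str):
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", completion)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

assert extract_numeric_answer(
    "5 x 3 = 15 dollars. 23 - 15 is 8. The answer is 8.") == 8.0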
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

● 4 exemplars, whose questions, intermediate reasoning, and answers are from AQuA-RAT's training set

33
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

Multiple Choice

Q: If a / b = 3/4 and 8a + 5b = 22, then find the value of a.
Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2

A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).

The exemplars have various formats.


34
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

● 4 exemplars, whose questions, intermediate reasoning, and answers are from AQuA-RAT's training set
● Exemplars have flexible formats

35
Multiple Choice - Few-Shot CoT Prompt Exemplar

Multiple Choice

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?
Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km

A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

● 4 exemplars, whose questions, intermediate reasoning, and answers are from the training set
● Exemplars have flexible formats
● Benchmarked on AQuA-RAT (Ling et al., 2017)

36
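For multiple-choice tasks, the extractor looks for an option letter instead of a number. A minimal sketch under the same caveat (the pattern is an assumption, not the papers' released code):

import re

# Extract the chosen option letter, e.g. "(e)", from a CoT completion
# (sketch; the pattern is an assumption, not the papers' released code).
def extract_choice(completion: str):
    match = re.search(r"[Tt]he answer is\s*\(?([a-e])\)?", completion)
    return match.group(1) if match else None

assert extract_choice("20 km/hr * 2.5 hrs = 50 km. The answer is (e).") == "e"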
Arithmetic Reasoning - Results

GSM8K
Josh decides to try flipping a
house. He buys a house for
$80,000 and then puts in $50,000
in repairs. This increased the
value of the house by 150%.
How much profit did he make?

SVAMP
Each pack of dvds costs 76
dollars. If there is a discount of
25 dollars on each pack. How
much do you have to pay to buy
each pack?

37
Arithmetic Reasoning - Results

GSM8K
Josh decides to try flipping a
house. He buys a house for
$80,000 and then puts in $50,000
in repairs. This increased the
value of the house by 150%.
How much profit did he make?

Fine-tuning + Verification

SVAMP
Each pack of dvds costs 76 dollars. If there is a discount of 25 dollars on each pack. How much do you have to pay to buy each pack?

38
Arithmetic Reasoning - Results

MAWPS - MultiArith
The school cafeteria ordered 42
red apples and 7 green apples for
students lunches. But, if only 9
students wanted fruit, how many
extra did the cafeteria end up
with?

AQuA-RAT
A person is traveling at 20 km/hr
and reached his destiny in 2.5 hr
then find the distance?
Answer Choices: (a) 53 km (b)
55 km (c) 52 km (d) 60 km (e) 50
km

39
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

41
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

● They do not positively impact performance for small models, but start to yield performance gains when used with models of more than ∼100B parameters.

42
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

● They do not positively impact performance for small models, but start to yield performance gains when used with models of more than ∼100B parameters.

● Few-shot CoT achieves better performance on LLMs than zero-shot CoT.

43
Arithmetic Reasoning - Observations

● Both zero-shot and few-shot chain-of-thought prompting are emergent abilities of model scale.

● They do not positively impact performance for small models, but start to yield performance gains when used with models of more than ∼100B parameters.

● Few-shot CoT achieves better performance on LLMs than zero-shot CoT.

● Instruct GPT-3 text-davinci-002 achieves similar performance to the PaLM 540B model.

44
Experiments
Symbolic Reasoning

45
Symbolic Reasoning - Last Letter Concatenation
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

● Generate full names by randomly concatenating names from the top one-thousand first and last names from name census data

46
Symbolic Reasoning - Last Letter Concatenation
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

● Generate full names by randomly concatenating names from the top one-thousand first and last names from name census data
● 4 exemplars with strict format

47
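A minimal sketch of this data generation, with two short hypothetical name lists standing in for the census data:

import random

# Build a last-letter-concatenation question with its gold answer (sketch;
# FIRST_NAMES/LAST_NAMES stand in for the top-1000 census name lists).
FIRST_NAMES = ["Elon", "Bill", "Larry"]
LAST_NAMES = ["Musk", "Gates", "Page"]

def make_last_letter_example(rng: random.Random):
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    question = (f'Take the last letters of the words in "{name}" '
                "and concatenate them")
    gold = "".join(word[-1] for word in name.split())  # "Elon Musk" -> "nk"
    return question, gold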
Symbolic Reasoning - Last Letter Concatenation
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

Last letter concatenation

Q: Take the last letters of the words in "Bill Gates" and concatenate them

A: The last letter of "Bill" is "l". The last letter of "Gates" is "s". Concatenating them is "ls". The answer is ls.

48
Symbolic Reasoning - Coin Flip
Coin Flip

Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

A: The coin was flipped by no one. So the coin was flipped 0 times. The coin started heads up, and it was not flipped, so it is still heads up. So the answer is yes.

Coin Flip

Q: A coin is heads up. Jamey flips the coin. Teressa flips the coin. Is the coin still heads up?

A: The coin was flipped by Jamey and Teressa. So the coin was flipped 2 times, which is an even number. The coin started heads up, so after an even number of flips, it will still be heads up. So the answer is yes.

8 exemplars with strict format.

49
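For Coin Flip, the gold answer is determined purely by the parity of the flip count, which makes instance generation equally simple. A minimal sketch (the name list is a hypothetical placeholder):

import random

# Generate a Coin Flip question whose gold label is flip-count parity
# (sketch; NAMES is a hypothetical placeholder list).
NAMES = ["Tom", "Mike", "Jamey", "Teressa"]

def make_coin_flip_example(rng: random.Random, n_people: int = 2):
    people = rng.sample(NAMES, n_people)
    actions = [(p, rng.random() < 0.5) for p in people]  # True = flips
    steps = " ".join(
        f"{p} flips the coin." if flips else f"{p} does not flip the coin."
        for p, flips in actions)
    question = f"A coin is heads up. {steps} Is the coin still heads up?"
    n_flips = sum(flips for _, flips in actions)
    gold = "yes" if n_flips % 2 == 0 else "no"  # even flips -> still heads up
    return question, gold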
Symbolic Reasoning - In & Out-of-domain Test
Last letter concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them

A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

Coin Flip

Q: A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

A: The coin was flipped by no one. So the coin was flipped 0 times. The coin started heads up, and it was not flipped, so it is still heads up. So the answer is yes.

● In-domain test set: examples had the same number of steps as the few-shot exemplars
● Out-of-domain (OOD) test set: examples had more steps than those in the exemplars.
50
Symbolic Reasoning - Last Letter Concatenation

In-Domain

Take the last letters of the words in "Elon Musk" and concatenate them.

Out-of-Domain

Take the last letters of the words in "Johann Sebastian Bach" and concatenate them.

51
*Zero-Shot results use Instruct-GPT-3 175B text-davinci-002 model.
Symbolic Reasoning - Coin Flip

In-Domain

A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Is the coin still heads up?

Out-of-Domain

A coin is heads up. Tom does not flip the coin. Mike does not flip the coin. Jake flips the coin. Is the coin still heads up?

52
*Zero-Shot results use Instruct-GPT-3 175B text-davinci-002 model.
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

53
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

● Both zero-shot and few-shot CoT prompting are emergent abilities of model scale.

54
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

● Few-shot CoT prompting is an emergent ability of model scale.

● CoT does not positively impact performance for small models, but starts to yield performance gains when using models of more than ∼100B parameters, for both in-domain and out-of-domain tests.

55
Symbolic Reasoning - Observations

● Standard prompting fails out-of-domain tests for both tasks.

● Few-shot CoT prompting is an emergent ability of model scale.

● CoT does not positively impact performance for small models, but starts to yield performance gains when using models of more than ∼100B parameters, for both in-domain and out-of-domain tests.

● Zero-shot CoT using Instruct-GPT-3 175B achieves similar performance to few-shot CoT with the 540B PaLM model on both tasks.

56
Pre-Lecture Question 2
Q2: Wei et al., 2022 showed that CoT can improve out-of-domain performance. Can you
state their results and why do you think it is the case (i.e., adding intermediate steps can
improve robustness)?

57
Pre-Lecture Question 2
Q2: Wei et al., 2022 showed that CoT can improve out-of-domain performance. Can you
state their results and why do you think it is the case (i.e., adding intermediate steps can
improve robustness)?

While standard prompting fails out-of-domain tests for both tasks, large models with either zero-shot or few-shot CoT improve performance on both in-domain and out-of-domain tests. In these symbolic reasoning tasks, the CoT prompt guides the LM through the process that maps input to output. Even when questions are OOD in the sense of "how many words are in the name" or "how many states to track", the process for producing the output stays the same and can be learned from the exemplars. However, it is still unclear whether CoT will improve other OOD scenarios with more complex reasoning processes.

58
Experiments
CommonSense Reasoning

59
Commonsense Reasoning - Toy Problems
CSQA (Talmor et al., 2019)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer is (c).

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: The answer is yes.

Sport Understanding

Q: Is the following sentence plausible? "Jamel Murray was perfect from the line."

A: The answer is yes.

Date Understanding

Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?

A: So the answer is 01/05/2015.


60
Commonsense Reasoning - Toy Problems
SayCan Robot Planning

Locations = [counter, table, user, trash, bowl].

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice chips,
orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].

The robot can pick up items with pick(object) and put down items with put(object) as well as find
objects or locations with find(). The robot can only understand the explicit locations and objects
listed.

Human: How would you throw away a redbull?

Plan: 1. find(redbull), 2. pick(redbull), 3. find(trash), 4. put(redbull), 5. done().

61
Commonsense Reasoning - Toy Problems
SayCan Robot Planning

Locations = [counter, table, user, trash, bowl].

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice chips,
orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].

The robot can pick up items with pick(object) and put down items with put(object) as well as find
objects or locations with find(). The robot can only understand the explicit locations and objects
listed.

Human: How would you throw away a redbull?

Plan: 1. find(redbull), 2. pick(redbull), 3. find(trash), 4. put(redbull), 5. done().

These tasks not only require multi-step reasoning, but also need prior knowledge to understand complex semantics. 62
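A minimal sketch of how a generated plan could be checked against the allowed primitives; this validator is an illustration, not the SayCan implementation.

# Validate a SayCan-style plan against the allowed primitives (sketch;
# this checker is an assumption, not the SayCan implementation).
LOCATIONS = {"counter", "table", "user", "trash", "bowl"}
OBJECTS = {"7up", "apple", "redbull", "sponge", "water"}  # abbreviated list
PRIMITIVES = {"find", "pick", "put", "done"}

def is_valid_step(step: str) -> bool:
    name, _, rest = step.partition("(")
    arg = rest.rstrip(")")
    if name not in PRIMITIVES:
        return False
    # done() takes no argument; the others take a known object or location.
    return arg == "" if name == "done" else arg in OBJECTS | LOCATIONS

plan = ["find(redbull)", "pick(redbull)", "find(trash)",
        "put(redbull)", "done()"]
assert all(is_valid_step(s) for s in plan)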
Commonsense Reasoning - Results

[Figures: standard vs. CoT prompting results on the commonsense benchmarks across model scales]

65
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

66
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

● CoT prompting led to further gains, with improvements appearing to be largest for PaLM
540B.

67
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

● CoT prompting led to further gains, with improvements appearing to be largest for PaLM
540B.

● CoT has minimal benefits on CSQA and StrategyQA tasks.

68
Commonsense Reasoning - Observations

● For all tasks, scaling up model size improved the performance of standard prompting.

● CoT prompting led to further gains, with improvements appearing to be largest for PaLM
540B.

● CoT has minimal benefits on CSQA and StrategyQA tasks.

● Few-shot CoT achieves better performance than zero-shot CoT with the 175B GPT-3 model on the CSQA and StrategyQA tasks, but zero-shot CoT shows a significant improvement on the Date Understanding task.

69
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Equation only:
"5 + 6 = 11. The answer is 11."

70
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Equation only:
"5 + 6 = 11. The answer is 11."

Natural language in reasoning matters.


(Wei et al., 2022) 71
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Variable compute only:
"……………………… The answer is 11."

72
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Variable compute only:
"……………………… The answer is 11."

More intermediate computation does not help with the final answer.
(Wei et al., 2022)
73
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Reasoning after answer:
"The answer is 11. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."

74
Ablation Study - Variations of Few-Shot CoT
Change the types of CoT:

Reasoning after answer:
"The answer is 11. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."

CoT is not just activating knowledge seen in pre-training.


(Wei et al., 2022)
75
Ablation Study - Robustness (Style of Exemplar)
Change the style of exemplar in few-shot CoT:

(Wei et al., 2022)


Results for few-shot LaMDA 137B on two AR tasks: results have variance, but CoT still outperforms standard prompting; it is robust to linguistic styles and different exemplars.
76
Ablation Study - Robustness (Trigger Sentence)
Change the template (trigger sentence) in
zero-shot CoT:

(Kojima et al., 2022)

Results for zero-shot GPT-3 (davinci-002) 175B on the MultiArith AR task: different templates encourage the model to express reasoning quite differently.

77
Ablation Study - Model Size
Change the model sizes in CoT prompting:
(Kojima et al., 2022)

Results on the MultiArith AR task with different model sizes:

● Larger model, better reasoning
● CoT is effective only for larger models
● Few-shot is better than zero-shot
● Instruct GPT-3 is much better than the original GPT-3
78
Pre-Lecture Question 1
Q1: Describe how the two approaches from (Wei et al., 2022) and (Kojima et al., 2022) are
different. Which one do you think is a more viable solution in terms of cost, performance
and stability?

Wei's work uses a few-shot setting in which several demonstration examples are required and a CoT annotation must be provided for each example, while Kojima's work uses the LLM itself to generate the CoT via two-stage prompting and no longer requires annotated examples.

For most of the benchmarks we have seen, few-shot CoT achieves better performance than zero-shot CoT, while zero-shot CoT does not require human annotations, which can be costly.

Although there is no direct comparison between zero-shot and few-shot CoT on stability, few-shot CoT seems more robust, as its performance does not vary significantly when the prompt annotations change. On the other hand, zero-shot CoT shows significant performance variance across different trigger sentences.

79
More Advances - Self-Consistency
Change greedy decoding (single-path) to self-consistency (multi-path) in few-shot CoT:

Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171
(2022). 80
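A minimal sketch of self-consistency decoding: sample several reasoning paths at a nonzero temperature, extract each final answer, and take a majority vote. `generate` and `extract` are the hypothetical helpers sketched earlier, not the paper's code.

from collections import Counter

# Self-consistency (sketch): marginalize over sampled reasoning paths by
# majority-voting the extracted final answers.
def self_consistency(prompt: str, generate, extract, n: int = 40):
    answers = []
    for _ in range(n):
        completion = generate(prompt, temperature=0.7)  # sampled, not greedy
        answer = extract(completion)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None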
More Advances - Self-Consistency
Showcase results on AR, CR tasks:

Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint arXiv:2203.11171
(2022). 81
More Advances - Input-Rationale Ensemble
Use model-generated rationale in few-shot CoT:

Wang, Xuezhi, et al. "Rationale-Augmented Ensembles in Language Models." arXiv preprint arXiv:2207.00747 (2022).
82
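A minimal sketch of the input-rationale ensemble idea, heavily simplified from Wang et al. (2022): sample a model-generated rationale for each exemplar, build one few-shot prompt per sampled set, and vote over the resulting answers (same hypothetical helpers as above).

from collections import Counter

# Input-rationale ensemble (sketch, simplified from Wang et al., 2022):
# model-generated rationales replace human-written ones, and answers from
# several such prompts are aggregated by majority vote.
def rationale_ensemble(question, exemplar_qas, generate, extract, k: int = 5):
    votes = []
    for _ in range(k):
        parts = []
        for q, gold in exemplar_qas:
            rationale = generate(f"Q: {q}\nA: Let's think step by step.",
                                 temperature=0.7)
            parts.append(f"Q: {q}\nA: {rationale} The answer is {gold}.")
        parts.append(f"Q: {question}\nA:")
        answer = extract(generate("\n\n".join(parts)))
        if answer is not None:
            votes.append(answer)
    return Counter(votes).most_common(1)[0][0] if votes else None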
More Advances - Input-Rationale Ensemble
Showcase performance for AR reasoning tasks (PaLM-540B):

GSM8K (accuracy %)

Standard prompting                       17.9
Few-shot CoT (Wei et al. 2022)           56.5
Zero-shot CoT (Kojima et al. 2022)       43.0
Self-consistency (Wang et al. 2022)      74.4
Prompt-order ensemble                    75.4
Input-rationale ensemble                 73.8

The performance improvement on reasoning over previous CoT prompting is large, but not significant compared with self-consistency. 83
Pre-Lecture Question 3 and Discussion

Q3: Do you think the CoT method can be useful to other NLP tasks that we have seen in the
previous lectures (standard NLP tasks that are beyond the arithmetic/logic reasoning tasks that
these papers evaluated on)? Do you have any ideas about how we can collect the CoT data?

84
Reference
1. Wei, Jason, et al. "Chain of thought prompting elicits reasoning in large language models." arXiv preprint
arXiv:2201.11903 (2022).
2. Kojima, Takeshi, et al. "Large Language Models are Zero-Shot Reasoners." arXiv preprint arXiv:2205.11916 (2022).
3. Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
4. Patel, Arkil, Satwik Bhattamishra, and Navin Goyal. "Are NLP Models really able to Solve Simple Math Word
Problems?." NAACL 2021.
5. Miao, Shen-Yun, Chao-Chun Liang, and Keh-Yih Su. "A diverse corpus for evaluating and developing English math
word problem solvers." ACL 2020 (2020).
6. Koncel-Kedziorski, Rik, et al. "MAWPS: A math word problem repository." NAACL 2016.
7. Ling, Wang, et al. "Program induction by rationale generation: Learning to solve and explain algebraic word problems."
arXiv preprint arXiv:1705.04146 (2017).
8. Talmor, Alon, et al. "CommonsenseQA: A question answering challenge targeting commonsense knowledge." NAACL 2019.
9. Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint
arXiv:2204.01691 (2022).
10. Wang, Xuezhi, et al. "Self-consistency improves chain of thought reasoning in language models." arXiv preprint
arXiv:2203.11171 (2022).
11. Wang, Xuezhi, et al. "Rationale-Augmented Ensembles in Language Models." arXiv preprint arXiv:2207.00747 (2022).

85
Discussion
(This part will not be in presentation)

86
Summary of Arithmetic Reasoning Benchmark

Summary of math arithmetic reasoning benchmarks. N: number of evaluation examples (Wei et al. 2022)
87
Arithmetic Reasoning - Few-Shot CoT MAWPS Results

[Figures: MAWPS results for the LaMDA, GPT, and PaLM model families]

88
Arithmetic Reasoning - Zero-Shot CoT Additional Results

89
Prior Best – Fine-tuning + Verification

GPT-3

1. Fine-tune for 2 epochs on the training set.

90
Prior Best – Fine-tuning + Verification

GPT-3

Finding the one with the highest score

1. Fine-tune for 2 epochs on the training set.

91
Prior Best – Fine-tuning + Verification

GPT-3

1. Fine-tune for 2 epochs on the training set.

2. Sample 100 solutions from the generator for each training problem and label each solution as correct or incorrect.
Correct or Incorrect
(Cobbe et al. 2021) 92
Arithmetic Reasoning - Results

MAWPS - SingleEq
If there are 7 bottle caps in a box
and Linda puts 7 more bottle
caps inside, how many bottle
caps are in the box?

MAWPS - AddSub
There were 6 roses in the vase.
Mary cut some roses from her
flower garden. There are now 16
roses in the vase. How many
roses did she cut?

93
Commonsense Reasoning - CSQA CoT Prompt

CSQA (Talmor et al., 2019)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer must require cable. Of the above choices, only television requires cable. The answer is (c).

● 7 exemplars from training dataset

94
Commonsense Reasoning - CSQA CoT Prompt

CSQA (Talmor et al., 2019)

Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet

A: The answer must require cable. Of the above choices, only television requires cable. The answer is (c).

● 7 exemplars from training dataset

● Manually composed intermediate reasoning with strict format:
  "The answer must ___. Of the above choices, only ___."

95
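A small sketch of how the strict CSQA rationale template above could be instantiated programmatically (the helper names are hypothetical, not the papers' code):

# Fill the strict CSQA rationale template (sketch; hypothetical helper).
CSQA_TEMPLATE = ("The answer must {requirement}. Of the above choices, "
                 "only {support}. The answer is ({label}).")

rationale = CSQA_TEMPLATE.format(
    requirement="require cable",
    support="television requires cable",
    label="c")
# -> "The answer must require cable. Of the above choices, only television
#     requires cable. The answer is (c)."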
Commonsense Reasoning - Strategy QA CoT Prompt

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. The answer is yes.

● 6 exemplars from training dataset

96
Commonsense Reasoning - Strategy QA CoT Prompt

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. The answer is yes.

● 6 exemplars from training dataset

● Manually composed intermediate reasoning with flexible format

97
Commonsense Reasoning - Strategy QA CoT Prompt

StrategyQA (Geva et al., 2021)

Q: Could Brooke Shields succeed at University of Pennsylvania?

A: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. The answer is yes.

StrategyQA (Geva et al., 2021)

Q: Yes or no: Is it common to see frost during some college commencements?

A: College commencement ceremonies can happen in December, May, and June. December is in the winter, so there can be frost. Thus, there could be frost at some commencements. The answer is yes.

98
Commonsense Reasoning - CoT Prompt

Date Understanding

Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?

A: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014. So one week from today will be 01/05/2015. So the answer is 01/05/2015.

Sport Understanding

Q: Is the following sentence plausible? "Jamel Murray was perfect from the line."

A: Jamal Murray is a basketball player. Being perfect from the line is part of basketball. The answer is yes.

6 exemplars for Date Understanding; 8 exemplars for Sport Understanding


99
Commonsense Reasoning - CoT Prompt
SayCan Robot Planning (Ahn et al., 2022)

Locations = [counter, table, user, trash, bowl].

Objects = [7up, apple, kettle chips, tea, multigrain chips, coke, lime soda, jalapeno chips, rice
chips, orange, grapefruit soda, pepsi, redbull, energy bar, sponge, water].

The robot can pick up items with pick(object) and put down items with put(object) as well as find
objects or locations with find(). The robot can only understand the explicit locations and objects
listed.

Human: How would you throw away a redbull?

Explanation: The user has asked me to throw away the redbull, I will move it to the trash.

Plan: 1. find(redbull), 2. pick(redbull), 3. find(trash), 4. put(redbull), 5. done().

7 exemplars for SayCan 100


Ablation Study - Robustness (Example Distribution)
Change the examples in few-shot CoT:

Change examples from in-domain to out-of-domain!

Results for few-shot prompting on two AR tasks with exemplars from a CR task (CommonsenseQA): cross-domain exemplars with the same format cause only minor performance degradation. 101
