Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models

Abstract

Combining different forms of prompts with pre-trained large language models has yielded remarkable results on reasoning tasks (e.g., Chain-of-Thought prompting). However, when tested on more complex reasoning, these methods also expose problems such as invalid reasoning and fictional reasoning paths. In this paper, we develop Hypothesis Testing Prompting, which adds conclusion assumptions, backward reasoning, and fact verification during intermediate reasoning steps. Hypothesis Testing Prompting makes multiple assumptions about the conclusion and validates them in reverse, leading to a unique correct answer. Experiments on two challenging deductive reasoning datasets, ProofWriter and RuleTaker, show that Hypothesis Testing Prompting not only significantly improves accuracy but also generates a more reasonable and standardized reasoning process.

Keywords: Deductive Reasoning, Large Language Models, Prompt

Yitian Li¹,², Jidong Tian¹,², Hao He¹,², Yaohui Jin¹,²

¹MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

²State Key Lab of Advanced Optical Communication System and Network

{yitian_li, frank92, hehao, jinyh}@sjtu.edu.cn

1. Introduction

The release of large language models (LLMs) has recently revolutionized the NLP landscape Thoppilan et al. (2022); Kaplan et al. (2020); Chowdhery et al. (2022). Scaling up the size of language models and applying diverse prompting methods have become mainstream Liu et al. (2023c); Wei et al. (2022a); Yang et al. (2023). Given in-context learning or Chain-of-Thought prompts, LLMs have already achieved high performance on challenging tasks such as commonsense, arithmetic, and symbolic reasoning Imani et al. (2023); Lee et al. (2021); Kojima et al. (2022). Logical reasoning is one of the most important and long-standing problems in NLP Hirschberg and Manning (2015); Russell and Norvig (2010), and integrating this ability into natural language understanding systems has long been a goal Du et al. (2022).

Nevertheless, scaling has been shown to offer limited advantages in resolving complex logical reasoning problems Kazemi et al. (2022). For example, Saparov and He (2022) show that Chain-of-Thought prompting struggles with proof planning for more complex logical reasoning problems. Additionally, performance suffers greatly on recently released and out-of-distribution logical reasoning datasets Liu et al. (2023a). Although many works have explored variants of Chain-of-Thought prompts to facilitate LLM inference Zelikman et al. (2022); Zheng et al. (2023), we find that current prompts for logical reasoning tasks place excessive emphasis on the reasoning process while ignoring the origin, purpose, and effectiveness of reasoning Creswell et al. (2022); Xi et al. (2023). As the examples in Figure 1 show, the difficulty in judging logical problems arises not only from the process of reasoning but also from the choice of facts and rules to use as a starting point. With prompts designed in the previous way, even a demonstrated thought process for some problems would not be very beneficial for others.

Figure 1: Questions in RuleTaker involve logical reasoning with facts and rules.

In this paper, we propose Hypothesis Testing Prompting, a new and more deliberate prompt template design. Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics and is often used by scientists to test specific predictions Bevans (2022). We draw inspiration from it to introduce a process of conclusion assumptions, backward reasoning, and fact verification. Experiments on RuleTaker Clark et al. (2020) and ProofWriter Tafjord et al. (2021) show the effectiveness of our novel prompting paradigm as a strategy for promoting deductive reasoning in large language models. Further analyses show that Hypothesis Testing Prompting generates more desirable intermediate processes and significantly improves accuracy on the "Unknown" label.

2. Related Work

2.1. Few-Shot Prompting

Brown et al. (2020) propose in-context learning as a few-shot prompting approach to elicit model abilities. Chain-of-Thought (CoT) prompting Wei et al. (2022b) is one of the most well-known works; it decomposes the problem into intermediate steps and further improves the ability of large language models. Several follow-up works have since been carried out, including Zero-shot-CoT (simply adding "Let’s think step by step" before each answer) Kojima et al. (2022), Self-consistency Wang et al. (2022), complexity-based prompting Fu et al. (2022), and other prompting work Liu et al. (2023b); Jung et al. (2022); Zhou et al. (2022); Saparov and He (2022). While these methods enhance inference performance by focusing on demonstrations of the reasoning process, they often overlook aspects such as identifying the root cause of the problem, establishing efficient reasoning strategies, and determining the direction of logical reasoning.

2.2. Deductive Reasoning

Deductive reasoning is defined as the application of general concepts to particular circumstances Johnson-Laird (2010). It starts from logical premises and bases a conclusion on them: the deduction begins with a general rule and then applies it to a specific real-world situation. Given the premises "All men are mortal." and "Socrates is a man.", for example, we can draw the conclusion that "Socrates is mortal." Johnson-Laird (1999).

3. Hypothesis Testing Prompting

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics, and it is used by scientists to test specific predictions that arise from theories Bevans (2022); La et al. (2012). There are five main steps in hypothesis testing:

1. State your research hypothesis;
2. Collect data in a way designed to test the hypothesis;
3. Perform an appropriate statistical test;
4. Decide whether to reject or fail to reject your null hypothesis;
5. Present the findings in your results and discussion section.

When completing a challenging reasoning activity, such as a multi-step deductive reasoning problem, one does not reason at random to obtain all possible intermediate results. Instead, we first make assumptions about the conclusion to be judged and then choose the relevant conditions for verification, for example: "First assume the conclusion is True and start from … Then assume the conclusion is False and start from … because the rules state that … So the conclusion …". The goal of this study is to give language models the capacity to produce a similar process, which we call Hypothesis Testing Prompting. We will show that large language models can generate more appropriate thoughts and more accurate results if demonstrations of Hypothesis Testing Prompting are provided in the exemplars for few-shot prompting. Figure 2 shows an example of a model producing a hypothesis testing thought to solve a deductive reasoning problem.
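
The assume-True, then assume-False pattern described above can be mirrored by a small symbolic procedure. The following is only an illustrative sketch, not the prompting method itself (which operates on natural language inside the model): the string encoding of facts, the `not ` prefix for negation, and the names `prove` and `hypothesis_test` are all hypothetical choices made for this example.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical encoding: a rule is (list of premises, conclusion),
# and negation is marked by a leading "not ".
Rule = Tuple[List[str], str]


def negate(statement: str) -> str:
    """Toggle the leading 'not ' marker on a statement."""
    return statement[4:] if statement.startswith("not ") else "not " + statement


def prove(goal: str, facts: Set[str], rules: List[Rule],
          seen: FrozenSet[str] = frozenset()) -> bool:
    """Backward reasoning: start from the goal and search for supporting facts."""
    if goal in facts:
        return True
    if goal in seen:  # guard against cyclic rules
        return False
    seen = seen | {goal}
    return any(
        conclusion == goal and all(prove(p, facts, rules, seen) for p in premises)
        for premises, conclusion in rules
    )


def hypothesis_test(conclusion: str, facts: Set[str], rules: List[Rule]) -> str:
    # First assume the conclusion is True and try to verify it.
    if prove(conclusion, facts, rules):
        return "True"
    # Then assume the conclusion is False and try to verify its negation.
    if prove(negate(conclusion), facts, rules):
        return "False"
    # Neither assumption can be verified.
    return "Unknown"


facts = {"Anne is kind"}
rules = [
    (["Anne is kind"], "Anne is nice"),
    (["Anne is nice"], "not Anne is rough"),
]
for query in ("Anne is nice", "Anne is rough", "Anne is green"):
    print(query, "->", hypothesis_test(query, facts, rules))
```

The sketch ignores features such as negation-as-failure in rule premises; it is meant only to make the two-assumption control flow concrete.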

4. Experiment

4.1. Experimental Setup

We explore Hypothesis Testing Prompting for ChatGPT (GPT-3.5-Turbo in the OpenAI API) on multiple logical reasoning benchmarks.

Benchmarks. For first-order logic (FOL) reasoning in question answering systems, there are two world assumptions Reiter (1981) that result in different objectives. One is the closed world assumption (CWA), under which whatever is not currently known to be entailed is treated as a contradiction. The other is the open world assumption (OWA), under which the objective is to distinguish false propositions from uncertain ones. Because the world assumptions differ, our analysis and solutions differ as well.
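
The difference between the two assumptions can be stated as a small labeling rule. The sketch below is an illustration under assumed names (`assign_label`, and the flags `proved` and `disproved`, are hypothetical); it takes as given whether the statement and its negation are derivable from the facts and rules.

```python
def assign_label(proved: bool, disproved: bool, world: str) -> str:
    """Map derivability of a statement to a label under CWA or OWA.

    proved:    the statement is derivable from the facts and rules
    disproved: the negation of the statement is derivable
    world:     "CWA" or "OWA"
    """
    if proved:
        return "True"
    if world == "CWA":
        # Closed world: anything not known to be entailed counts as False.
        return "False"
    if world == "OWA":
        # Open world: False only if the negation is derivable, else uncertain.
        return "False" if disproved else "Unknown"
    raise ValueError(f"unknown world assumption: {world!r}")


# A statement that can be neither proved nor disproved:
print(assign_label(False, False, "CWA"))  # False
print(assign_label(False, False, "OWA"))  # Unknown
```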

We consider the following two deductive reasoning benchmarks: (1) the RuleTaker Clark et al. (2020) benchmark under the CWA; (2) the ProofWriter Tafjord et al. (2021) benchmark under the OWA. Both datasets are divided into five parts, requiring 0, $\leq$ 1, $\leq$ 2, $\leq$ 3, and $\leq$ 5 hops of reasoning, respectively. We conducted comparison tests on the test sets of the two datasets for all five hop settings.

Standard prompting. As one of the baselines, we consider standard few-shot prompting, made popular by Brown et al. (2020), in which a language model is provided with in-context examples of input-output pairs before producing a prediction for a test-time example. Examples are presented in the form of questions and answers. As seen in Figure 2(above), the model directly answers the question.

Chain-of-Thought prompting. We also compare with Chain-of-Thought prompting, which has achieved encouraging results on complex reasoning tasks Wei et al. (2022b). As seen in Figure 2(middle), the model not only provides the final answer but also produces intermediate reasoning steps.

Hypothesis Testing Prompting. Our proposed approach is to augment each exemplar in few-shot prompting with a hypothesis testing thought for the associated answer, as illustrated in Figure 2(below). We show one exemplar (Example: Judge the following conclusion ’<Conclusion>’ is true, false, or unknown, based on the following facts and rules: <Facts> … <Rules> …).
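
As a rough illustration of how such few-shot prompts might be assembled, the sketch below fills the question template quoted above and appends a hypothesis testing thought to each exemplar. The `Q:`/`A:` layout, the function names, and the exemplar text are assumptions made for this example; they are not taken from the prompts used in the experiments.

```python
from typing import Dict, List

# Question wording follows the exemplar format quoted in the text.
QUESTION_TEMPLATE = (
    "Judge the following conclusion '{conclusion}' is true, false, or unknown, "
    "based on the following facts and rules: {facts} {rules}"
)


def format_question(conclusion: str, facts: List[str], rules: List[str]) -> str:
    return QUESTION_TEMPLATE.format(
        conclusion=conclusion, facts=" ".join(facts), rules=" ".join(rules)
    )


def build_prompt(exemplars: List[Dict], query: Dict) -> str:
    """Concatenate exemplars (question plus thought) and the unanswered query."""
    blocks = []
    for ex in exemplars:
        question = format_question(ex["conclusion"], ex["facts"], ex["rules"])
        blocks.append(f"Q: {question}\nA: {ex['thought']}")
    question = format_question(query["conclusion"], query["facts"], query["rules"])
    blocks.append(f"Q: {question}\nA:")  # the model completes this answer
    return "\n\n".join(blocks)


# Hypothetical exemplar written for illustration only.
exemplar = {
    "conclusion": "Anne is nice.",
    "facts": ["Anne is kind."],
    "rules": ["If someone is kind then they are nice."],
    "thought": (
        "First assume the conclusion is True and start from 'Anne is nice'. "
        "The rules state that if someone is kind then they are nice, and the "
        "facts state that Anne is kind. So the conclusion is True."
    ),
}
query = {
    "conclusion": "Bob is nice.",
    "facts": ["Bob is kind."],
    "rules": ["If someone is kind then they are nice."],
}
print(build_prompt([exemplar], query))
```

The resulting string can be sent as the user message to a chat model; swapping the `thought` field for a plain answer or a forward chain of steps would give the standard and Chain-of-Thought baselines with the same scaffolding.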

4.2. Experimental Results

The results for Hypothesis Testing Prompting and the baselines on the RuleTaker dataset are provided in Figure 3(a), and ProofWriter results are shown in Figure 3(b). We observe that our method significantly outperforms the other two baselines, especially on ProofWriter. Figure 3(a) shows that while CoT performs well at low hop counts, Hypothesis Testing Prompting performs better as the hop count increases on RuleTaker. On ProofWriter, our approach leads throughout (accuracy improves by over 4% on all hops). Comparing the two datasets, the latter distinguishes between "False" and "Unknown", which demands a greater level of logic. The analyzed results show a weakness in all methods in handling "Unknown" labels. This is because the OWA requires ruling out both positive and negative findings to validate the "Unknown" label. The advantages of our strategy are illustrated by the comparison of the model outputs in Figure 2. The content "First assume the conclusion is True … Then assume the conclusion is False … So … is Unknown." generated by the model after learning from Hypothesis Testing Prompting is more in line with human thinking. We present further analysis in Section 4.3.

4.3. Further Analysis

We carry out the following thorough analysis to better comprehend the thought process:

Proof Accuracy. We randomly sampled 100 examples from depth-5 of ProofWriter and asked five students to manually evaluate the intermediate reasoning. Proof accuracy is the proportion of examples with a correctly predicted label whose inference process is judged to be reasonable. We compare the results of Chain-of-Thought and Hypothesis Testing Prompting and report them in Figure 4(a). While Hypothesis Testing Prompting mostly produced the correct intermediate reasoning process when the predicted label was correct, CoT generated the correct chain for only 26% of the examples. This result is in line with other research showing that LMs rely on spurious correlations when solving logical problems from beginning to end. Additionally, our approach makes the reasoning more rational. In processing the "Unknown" label, Hypothesis Testing Prompting performs noticeably better than Chain-of-Thought.

"Unknown" accuracy. On the ProofWriter dataset, we separately computed the accuracy on the "Unknown" label, shown in Figure 4(b). The results point to a flaw in the Chain-of-Thought strategy’s handling of "Unknown" labels (only 0.3 accuracy). In contrast, Hypothesis Testing Prompting significantly increases the reliability of judging this label (up to 0.65). This further illustrates the value of holding multiple assumptions, as well as of the reverse confirmation of conclusions.

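The two metrics in this section can be made concrete with a short sketch. The record fields (`predicted`, `gold`, `proof_valid`) and function names are hypothetical, and `proof_valid` stands in for the manual judgment of the annotators; how individual judgments are aggregated is not specified in the text and is not modeled here.

```python
from typing import Dict, List


def proof_accuracy(records: List[Dict]) -> float:
    """Share of correctly labeled predictions whose reasoning was judged valid."""
    correct = [r for r in records if r["predicted"] == r["gold"]]
    if not correct:
        return 0.0
    return sum(1 for r in correct if r["proof_valid"]) / len(correct)


def label_accuracy(predictions: List[str], golds: List[str], label: str) -> float:
    """Accuracy restricted to examples whose gold label equals `label`."""
    subset = [p for p, g in zip(predictions, golds) if g == label]
    if not subset:
        return 0.0
    return sum(1 for p in subset if p == label) / len(subset)


# Toy data for illustration only; these are not results from the paper.
records = [
    {"predicted": "True", "gold": "True", "proof_valid": True},
    {"predicted": "False", "gold": "False", "proof_valid": False},
    {"predicted": "Unknown", "gold": "Unknown", "proof_valid": True},
    {"predicted": "True", "gold": "False", "proof_valid": False},
]
print(proof_accuracy(records))
print(label_accuracy(
    [r["predicted"] for r in records], [r["gold"] for r in records], "Unknown"
))
```

Note that proof accuracy is conditioned on the label being correct, so a method can score high on label accuracy and still low on proof accuracy if its correct answers rest on invalid chains.
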
5. Conclusion

We have investigated Hypothesis Testing prompting as a straightforward and widely applicable technique for improving deductive reasoning in large language models. Multiple assumptions are made during hypothesis testing, and conclusions are reverse-validated to arrive at the one and only accurate answer. Through experiments on two logical reasoning datasets, we find that Hypothesis Testing prompting allows large language models to construct reasoning more reasonably and accurately. We anticipate that additional research on language-based reasoning approaches will be stimulated by our novel prompting design strategy.

6. References

Aho and Ullman (1972) Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation and Compiling, volume 1. Prentice-Hall, Englewood Cliffs, NJ.
American Psychological Association (1983) American Psychological Association. 1983. Publications Manual. American Psychological Association, Washington, DC.
Ando and Zhang (2005) Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.
Andrew and Gao (2007) Galen Andrew and Jianfeng Gao. 2007. Scalable training of $L_{1}$ -regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40.
Bevans (2022) Rebecca Bevans. 2022. Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.
BSI (1973a) BSI. 1973a. Natural Fibre Twines, 3rd edition. British Standards Institution, London. BS 2570.
BSI (1973b) BSI. 1973b. Natural fibre twines. BS 2570, British Standards Institution, London. 3rd. edn.
Castor and Pollux (1992) A. Castor and L. E. Pollux. 1992. The use of user modelling to guide inference and learning. Applied Intelligence, 2(1):37–53.
Chandra et al. (1981) Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. Alternation. Journal of the Association for Computing Machinery, 28(1):114–133.
Chercheur (1994) J.L. Chercheur. 1994. Case-Based Reasoning, 2nd edition. Morgan Kaufman Publishers, San Mateo, CA.
Choi (2022) Yejin Choi. 2022. The curious case of commonsense intelligence. Daedalus.
Chomsky (1973) N. Chomsky. 1973. Conditions on transformations. In A festschrift for Morris Halle, New York. Holt, Rinehart & Winston.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. CoRR.
Clark et al. (2020) Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. In IJCAI.
Cooley and Tukey (1965) James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301.
Creswell et al. (2022) Antonia Creswell, Murray Shanahan, and Irina Higgins. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning. CoRR.
Du et al. (2022) Yilun Du, Shuang Li, Joshua B. Tenenbaum, and Igor Mordatch. 2022. Learning iterative reasoning through energy minimization. In ICML.
Eco (1990) Umberto Eco. 1990. The Limits of Interpretation. Indian University Press.
Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. CoRR.
Gusfield (1997) Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK.
Hirschberg and Manning (2015) Julia Hirschberg and Christopher D. Manning. 2015. Advances in natural language processing. Science.
Hoel (1971a) Paul Gerhard Hoel. 1971a. Elementary Statistics, 3rd edition. Wiley series in probability and mathematical statistics. Wiley, New York, Chichester. ISBN 0 471 40300.
Hoel (1971b) Paul Gerhard Hoel. 1971b. Elementary Statistics, 3rd edition, Wiley series in probability and mathematical statistics, pages 19–33. Wiley, New York, Chichester. ISBN 0 471 40300.
Imani et al. (2023) Shima Imani, Liang Du, and Harsh Shrivastava. 2023. Mathprompter: Mathematical reasoning using large language models. CoRR.
Jespersen (1922) Otto Jespersen. 1922. Language: Its Nature, Development, and Origin. Allen and Unwin.
Johnson-Laird (2010) Phil Johnson-Laird. 2010. Deductive reasoning. Wiley Interdisciplinary Reviews: Cognitive Science.
Johnson-Laird (1999) Philip N Johnson-Laird. 1999. Deductive reasoning. Annual review of psychology.
Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. 2022. Maieutic prompting: Logically consistent reasoning with recursive explanations. In EMNLP.
Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. CoRR.
Kazemi et al. (2022) Seyed Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, and Deepak Ramachandran. 2022. LAMBADA: backward chaining for automated reasoning in natural language. CoRR.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
La et al. (2012) Rosa Patricio S. La, Brooks J. Paul, Deych Elena, Edward L. Boone, David J. Edwards, Wang Qin, Sodergren Erica, Weinstock George, William D. Shannon, and Ethan P. White. 2012. Hypothesis testing and power calculations for taxonomic-based human microbiome data. Plos One.
Lee et al. (2021) Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. 2021. Dialogue state tracking with a language model using schema-driven prompting. In EMNLP. Association for Computational Linguistics.
Liu et al. (2023a) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023a. Evaluating the logical reasoning ability of chatgpt and GPT-4. CoRR.
Liu et al. (2023b) Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. 2023b. Logicot: Logical chain-of-thought instruction-tuning data collection with GPT-4. CoRR.
Liu et al. (2023c) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023c. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv.
Rasooli and Tetreault (2015) Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. Computing Research Repository, arXiv:1503.06733. Version 2.
Reiter (1981) Raymond Reiter. 1981. On closed world data bases. In Readings in Artificial Intelligence.
Russell and Norvig (2010) Stuart J. Russell and Peter Norvig. 2010. Artificial Intelligence - A Modern Approach, Third International Edition. Pearson Education.
Saparov and He (2022) Abulhair Saparov and He He. 2022. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. CoRR.
Singer et al. (1954–58) Charles Joseph Singer, E. J. Holmyard, and A. R. Hall, editors. 1954–58. A history of technology. Oxford University Press, London. 5 vol.
Strötgen and Gertz (2012) Jannik Strötgen and Michael Gertz. 2012. Temporal tagging on different domains: Challenges, strategies, and gold standards. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), pages 3746–3753, Istanbul, Turkey. European Language Resource Association (ELRA).
Superman et al. (2000) S. Superman, B. Batman, C. Catwoman, and S. Spiderman. 2000. Superheroes experiences with books, 20th edition. The Phantom Editors Associates, Gotham City.
Tafjord et al. (2021) Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Findings of ACL.
Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. 2022. Lamda: Language models for dialog applications. CoRR.
Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. CoRR.
Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. Trans. Mach. Learn. Res.
Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
Xi et al. (2023) Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023. Self-polish: Enhance reasoning in large language models via problem refinement.
Yang et al. (2023) Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. Harnessing the power of llms in practice: A survey on chatgpt and beyond. CoRR.
Zelikman et al. (2022) Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. Star: Bootstrapping reasoning with reasoning. In NeurIPS.
Zheng et al. (2023) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models.
Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed H. Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. CoRR.