Researchers from the University of Tokyo & Google Propose Zero-shot-CoT: a Single Zero-Shot Prompt That Elicits a Chain of Thought From Large Language Models Across a Variety of Reasoning Tasks

This article is written as a summary by Marktechpost staff based on the research paper 'Large Language Models are Zero-Shot Reasoners'. All credit for this research goes to the researchers of this project. Check out the paper and reference blog.


Scaling up language models has driven many recent advances in natural language processing (NLP). The effectiveness of large language models (LLMs) is frequently attributed to few-shot or zero-shot learning: they can tackle a variety of problems simply by being conditioned on a few examples or on instructions describing the task. This way of conditioning the language model is known as "prompting," and the manual design of prompts has become a hot topic in NLP.
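As a concrete illustration, the minimal Python sketch below contrasts a zero-shot prompt with a few-shot prompt for the same question; the question, the solved exemplar, and the `call_llm` stub are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch of zero-shot vs. few-shot prompting (illustrative only).

def call_llm(prompt: str) -> str:
    """Placeholder for a large-language-model completion call (e.g., an API request)."""
    raise NotImplementedError("Replace with a real model call.")

question = "A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?"

# Zero-shot: only the question (and optionally an instruction), no solved examples.
zero_shot_prompt = f"Q: {question}\nA:"

# Few-shot: a handful of solved examples conditions the model on the task and its answer format.
few_shot_prompt = (
    "Q: There are 3 cars and 2 trucks in the lot. How many vehicles are there?\n"
    "A: The answer is 5.\n\n"
    f"Q: {question}\nA:"
)
```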

In contrast to the excellent performance of LLMs on intuitive, single-step system-1 tasks with task-specific few-shot or zero-shot prompting, even language models with 100B+ parameters have struggled on system-2 tasks that demand slow, multi-step reasoning. To remedy this deficiency, chain-of-thought (CoT) prompting provides LLMs with step-by-step reasoning examples instead of standard question-and-answer examples. Such chain-of-thought demonstrations enable models to produce a reasoning path that decomposes a complex problem into simpler intermediate steps. With CoT in particular, reasoning performance follows scaling laws and improves with model size. When paired with the 540B-parameter PaLM model, for instance, chain-of-thought prompting significantly outperforms ordinary few-shot prompting on several benchmark reasoning problems, such as GSM8K (17.9% → 58.1%).
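Concretely, a Few-shot-CoT exemplar differs from a standard few-shot exemplar only in that its answer spells out the intermediate reasoning before the final answer, as in this rough sketch (the worked example follows the style of the CoT paper's exemplars; the exact prompt text here is illustrative):

```python
# Few-shot-CoT: the exemplar's answer contains intermediate reasoning steps,
# so the model imitates the step-by-step format on the new question.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

new_question = "Q: A parking lot has 3 rows with 7 cars in each row. How many cars are there?\nA:"
few_shot_cot_prompt = cot_exemplar + new_question  # prepend the exemplar(s) to the new question
```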

The authors show that LLMs are decent zero-shot reasoners when a single simple prompt, "Let's think step by step," is added to elicit step-by-step thinking before each answer. Zero-shot-CoT develops a plausible reasoning path in a purely zero-shot manner and reaches the correct solution in cases where the standard zero-shot approach fails. Importantly, unlike most prior prompt engineering, which is task-specific and takes the form of hand-crafted examples or templates, Zero-shot-CoT is versatile and task-agnostic: it elicits step-by-step answers across diverse reasoning tasks, including arithmetic, symbolic reasoning (Last Letter and Coin Flip), commonsense reasoning, and other logical reasoning tasks (Date Understanding and Tracking Shuffled Objects from BIG-bench), without modifying the prompt per task.
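In the paper, Zero-shot-CoT is implemented as two prompting stages: a reasoning-extraction stage that appends the fixed trigger "Let's think step by step," and an answer-extraction stage that feeds the generated reasoning back to the model together with a phrase such as "Therefore, the answer (arabic numerals) is". The sketch below illustrates this flow; the `call_llm` stub stands in for any text-completion API and is an assumption, not the paper's code.

```python
def zero_shot_cot(question: str, call_llm) -> str:
    """Two-stage Zero-shot-CoT prompting (sketch).

    Stage 1 elicits a reasoning path with a fixed trigger sentence;
    stage 2 feeds that reasoning back and extracts the final answer.
    `call_llm` is a placeholder for any text-completion API.
    """
    # Stage 1: reasoning extraction.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_llm(reasoning_prompt)

    # Stage 2: answer extraction (the extraction phrase varies with the expected answer format).
    answer_prompt = (
        reasoning_prompt + " " + reasoning +
        "\nTherefore, the answer (arabic numerals) is"
    )
    return call_llm(answer_prompt)
```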

Figure 1 compares Zero-shot-CoT with other prompting baselines. While Zero-shot-CoT underperforms Few-shot-CoT with its carefully crafted, task-specific step-by-step examples, it achieves substantial gains over the zero-shot baseline, e.g., from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K. Significantly, with a single fixed prompt, zero-shot LLMs show a scaling curve comparable to the few-shot CoT baseline. Moreover, Few-shot-CoT not only requires human engineering of multi-step reasoning exemplars, but its performance also degrades when the question type of the exemplars and the task question are mismatched, indicating a strong sensitivity to per-task prompt design. In contrast, the versatility of this single prompt across many reasoning tasks hints at latent and understudied zero-shot capabilities of LLMs, such as higher-level, broad cognitive abilities like generic logical reasoning.

Results

Zero-shot-CoT vs. Zero-shot – For each dataset, Table 1 compares the accuracy of Zero-shot-CoT and standard zero-shot prompting (Zero-shot). Zero-shot-CoT outperforms Zero-shot by a significant margin on four of six arithmetic reasoning tasks (MultiArith, GSM8K, AQUA, SVAMP), on all symbolic reasoning tasks, and on all other logical reasoning tasks (from BIG-bench). For example, Zero-shot-CoT raises accuracy from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K. It performs comparably on the remaining two arithmetic reasoning tasks (SingleEq and AddSub), which is expected given that these tasks do not require multi-step reasoning. Zero-shot-CoT provides no performance gains on tasks requiring commonsense reasoning. Few-shot-CoT likewise does not improve performance with LaMDA (135B), but it does improve StrategyQA when combined with the much larger PaLM (540B) model, which may also hold for Zero-shot-CoT. Notably, many of the generated chains of thought are logically correct or contain only human-understandable mistakes (see Table 4), suggesting that Zero-shot-CoT elicits better commonsense reasoning even when the task metrics do not directly reflect it. Appendix B contains examples generated by Zero-shot-CoT for each dataset.

Comparison to other baselines – Table 2 compares Zero-shot-CoT and baselines on two arithmetic reasoning benchmarks (MultiArith and GSM8K). The large gap between standard prompting (top block) and chain-of-thought prompting (middle block) suggests that these tasks are difficult without eliciting multi-step reasoning. While Zero-shot-CoT naturally underperforms Few-shot-CoT, it outperforms standard few-shot prompting by a large margin even when the latter uses eight examples per task. On GSM8K, Zero-shot-CoT with InstructGPT-3 (175B) also beats both fine-tuned GPT-3 and standard few-shot prompting with a large model (PaLM, 540B).

Error Examination

To understand the behavior of Zero-shot-CoT, randomly selected examples generated by InstructGPT-3 with Zero-shot-CoT prompting were examined. (1) In commonsense reasoning, Zero-shot-CoT frequently produces a flexible and plausible chain of reasoning even when the final conclusion is incorrect, and it often generates several answer options when the model finds it hard to commit to just one (see Table 4 for examples). (2) In arithmetic reasoning (MultiArith), the error patterns of Zero-shot-CoT and Few-shot-CoT differ substantially. Zero-shot-CoT tends to produce unnecessary additional reasoning steps after reaching a correct result, which turns it into an incorrect prediction; it also sometimes does not begin reasoning at all and merely rephrases the input question. In contrast, Few-shot-CoT tends to fail when the generated chain of thought contains ternary operations, e.g., (3 + 2) * 4.

