Code, model input/output and cached evaluation results for our ACL-23 paper "Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters" by Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun.
While Chain-of-Thought (CoT) prompting can improve reasoning in large LMs, there is little understanding of what makes it effective. We perform a series of ablation studies on two representative benchmarks where CoT brings large improvements, which reveal the impact of different aspects of CoT demonstrations. We find that
- CoT reasoning is possible with invalid demonstrations - prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference.
- Other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning.
Overall, these findings raise new questions about LLMs' capability to learn to reason in context, and prompt reflection on how few-shot reasoning is benchmarked.
If you find our code or paper useful, please cite the paper:
@inproceedings{wang2023towards,
title={Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters},
author={Wang, Boshi and Min, Sewon and Deng, Xiang and Shen, Jiaming and Wu, You and Zettlemoyer, Luke and Sun, Huan},
booktitle={The 61st Annual Meeting of the Association for Computational Linguistics},
year={2023}
}
.
├── grade-school-math/ # GSM8K dataset, from https://github.com/openai/grade-school-math
├── indices_800.json # Indices for the 800 GSM8K test examples used for evaluation
├── Bamboogle Prerelease - Sheet1.csv # Bamboogle dataset, from https://github.com/ofirpress/self-ask
├── Bamboogle Prerelease - Sheet1_inter.csv # Annotated intermediate bridging entities for Bamboogle
├── utils.py # Helper functions
├── prompts_*/ # Full prompts for all settings in our experiments
├── main_*.py # Scripts for getting model predictions via OpenAI API
├── eval_*.ipynb # Evaluation scripts, including cached evaluation results
└── result_*/ # Cached model prediction results
First put your OpenAI API key in a file named api_key.txt.
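As a minimal sketch of how a script might read that key, assuming api_key.txt holds the key as plain text on a single line: the helper name `load_api_key` is hypothetical and is not taken from utils.py or main_*.py.

```python
from pathlib import Path


def load_api_key(path="api_key.txt"):
    """Return the API key stored in `path` (hypothetical helper, not from this repo).

    Assumes the file contains only the key, possibly followed by a newline.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty; expected an OpenAI API key")
    return key
```

Stripping whitespace avoids authentication failures caused by a trailing newline that many editors append on save.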
Details can be found in the parameter descriptions in main_*.py. For example, to run the invalid reasoning setting on GSM8K and Bamboogle:
python main_gsm8k.py --prompt_dir prompts_arithmetic/invalid_reasoning.txt --eng text-davinci-002 --num_test 800 --seed 1357 --temp 0.0 --test_ind indices_800.json
python main_bamboogle.py --prompt_dir prompts_bamboogle/invalid_reasoning.txt --eng text-davinci-002 --num_test -1 --seed 1357 --temp 0.0
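The sketch below illustrates one plausible reading of how `--num_test`, `--seed`, and `--test_ind` could interact when choosing test examples. It is an assumption-laden illustration, not the logic in main_*.py: it assumes the index file is a JSON list of integer positions into the test set, that `--num_test -1` means "use every example", and that the seed only matters when sampling. The function name `select_test_indices` is hypothetical.

```python
import json
import random


def select_test_indices(num_examples, num_test=-1, seed=1357, test_ind_path=None):
    """Pick which test examples to evaluate (hypothetical helper).

    Assumptions, none confirmed by the repository:
    - test_ind_path points to a JSON list of integer indices;
    - num_test < 0 means "use all examples";
    - otherwise num_test examples are sampled reproducibly using seed.
    """
    if test_ind_path is not None:
        # A fixed index file takes priority, so every run sees the same subset.
        with open(test_ind_path, encoding="utf-8") as f:
            indices = json.load(f)
        return indices[:num_test] if num_test >= 0 else indices

    if num_test < 0 or num_test >= num_examples:
        return list(range(num_examples))

    # A dedicated Random instance keeps the sample independent of global state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_examples), num_test))
```

Under these assumptions, the GSM8K command above would evaluate the 800 examples listed in indices_800.json, while the Bamboogle command would evaluate the full dataset.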
eval_*.ipynb contains the evaluation scripts and cached evaluation results.