Let's Verify Step by Step

Original Paper
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.

Summary Notes

Enhancing Language Models with Process Supervision: A Simplified Guide

The field of artificial intelligence, particularly natural language processing, is always advancing. One key goal is improving the accuracy and human-like reasoning of language models.
A promising method for achieving this is process supervision, especially useful for complex reasoning tasks.

The Challenge of Logical Errors

Even the best language models can make logical mistakes or "hallucinations" during complex reasoning. These errors can affect their reliability, particularly in critical uses.
Outcome-supervised training, which gives feedback only on the final result, cannot indicate which step of the reasoning went wrong.

Why Process Supervision Works Better

Process supervision offers a more effective solution by providing feedback at each step of the reasoning process. Here’s why it’s superior:
  • Detailed Feedback: It identifies exactly where errors occur, allowing for more precise corrections.
  • Human-Like Reasoning: By rewarding each step that a human reviewer endorses, it encourages the model to follow a chain of reasoning that people can read and check.
  • Efficient Learning: Combined with active learning, which prioritizes convincing but wrong solutions for human labeling, it makes better use of limited labeling effort.
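The first and last points can be sketched in a few lines, assuming each step already carries a correctness judgment and each sampled solution already has a reward-model score. The function names and the tuple layout are hypothetical, and the selection rule is a simplification of the active-learning idea rather than the paper's exact procedure.

```python
from typing import List, Optional, Sequence, Tuple


def first_error_index(step_is_correct: Sequence[bool]) -> Optional[int]:
    """Return the index of the first incorrect step, or None if all steps pass.

    This is the extra information process supervision provides: outcome
    supervision would only say whether the final answer was right.
    """
    for i, ok in enumerate(step_is_correct):
        if not ok:
            return i
    return None


def pick_convincing_wrong(
    samples: Sequence[Tuple[float, bool]], k: int
) -> List[int]:
    """Pick the k wrong-answer samples that the reward model rates highest.

    Each sample is (reward_model_score, final_answer_is_correct).
    High-scoring wrong answers are the ones the current reward model is
    fooled by, so they are the most informative to send to human labelers.
    """
    wrong = [i for i, (_, correct) in enumerate(samples) if not correct]
    wrong.sort(key=lambda i: samples[i][0], reverse=True)
    return wrong[:k]
```

For example, `first_error_index([True, True, False, True])` points at the third step, while an outcome-only signal for the same solution would be a single "incorrect".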

How to Implement Process Supervision

Gathering and Assessing Data

The first step is to collect data with human reviewers assessing the accuracy of each reasoning step in the model’s answers. They rate each step as positive, negative, or neutral, creating a detailed dataset for training.
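One way such a labeled solution might be represented is sketched below. The class names, the numeric values attached to the three ratings, and the choice to keep steps only up to the first negative one are illustrative assumptions, not a description of the released dataset's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Rating(Enum):
    # Numeric values are an illustrative assumption.
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


@dataclass
class LabeledStep:
    text: str
    rating: Rating


def truncate_at_first_error(steps: List[LabeledStep]) -> List[LabeledStep]:
    """Keep steps up to and including the first negatively rated one.

    Steps after the first mistake build on a faulty premise, so one
    possible convention is to drop them from the training example.
    """
    kept: List[LabeledStep] = []
    for step in steps:
        kept.append(step)
        if step.rating is Rating.NEGATIVE:
            break
    return kept
```

A three-step solution whose second step is rated negative would be reduced to its first two steps.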

Training the Generator

It’s crucial to train the generator to output solutions step-by-step. This makes it easier to review individual steps and fits the process supervision training format.
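If the generator writes one step per line, recovering the individual steps for review is straightforward. The sketch below assumes newline-delimited output; the function name is hypothetical.

```python
from typing import List


def split_steps(solution: str) -> List[str]:
    """Split a generated solution into steps, one per non-empty line."""
    return [line.strip() for line in solution.splitlines() if line.strip()]
```

Each returned string can then be shown to a reviewer, or scored by a reward model, on its own.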

Using Large-Scale Supervision

The PRM800K dataset, with 800,000 step-level human feedback labels, makes it possible to train a Process Reward Model (PRM) at scale. The paper's process-supervised model solves 78% of problems from a representative subset of the MATH test set.
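Once a PRM produces a correctness probability for every step, those probabilities still have to be combined into one score per solution. A common choice, sketched here, is to multiply them, so the score reads as the probability that every step is correct, and then to pick the highest-scoring candidate out of several samples. The `score_steps` callable stands in for a trained PRM and is hypothetical.

```python
import math
from typing import Callable, List, Sequence

Steps = List[str]


def solution_score(step_probs: Sequence[float]) -> float:
    """Combine per-step correctness probabilities into one solution score."""
    return math.prod(step_probs)


def best_of_n(
    candidates: Sequence[Steps],
    score_steps: Callable[[Steps], Sequence[float]],
) -> Steps:
    """Return the candidate solution with the highest combined score.

    score_steps is a placeholder for a trained PRM: it maps a list of
    steps to one correctness probability per step.
    """
    return max(candidates, key=lambda steps: solution_score(score_steps(steps)))
```

Because the score is a product, a single low-probability step is enough to sink an otherwise plausible solution.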

Small-Scale Synthetic Supervision

For those with limited resources, small-scale synthetic supervision is a practical option. It uses the large-scale PRM as a stand-in for human labelers, supplying step-level labels for training smaller models, which is a cost-effective way to approximate extensive data collection.
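A rough sketch of this idea follows, assuming the large model can be called as an oracle that returns the probability that the latest step is correct. The oracle interface, the function name, and the 0.5 threshold are hypothetical choices for illustration.

```python
from typing import Callable, List, Sequence


def label_with_oracle(
    steps: Sequence[str],
    oracle_step_prob: Callable[[Sequence[str]], float],
    threshold: float = 0.5,  # illustrative cutoff, not a value from the paper
) -> List[bool]:
    """Label steps with a large reward model instead of a human reviewer.

    oracle_step_prob receives the solution prefix ending at the current
    step and returns the probability that this last step is correct.
    Labeling stops at the first step judged incorrect.
    """
    labels: List[bool] = []
    for i in range(len(steps)):
        ok = oracle_step_prob(steps[: i + 1]) >= threshold
        labels.append(ok)
        if not ok:
            break
    return labels
```

The resulting labels can feed a smaller reward model's training set without any additional human review.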

Testing Generalization

Evaluating the model's performance on new, unseen tasks is essential. By testing on recent STEM tests not included in the training set, engineers can assess how well the model generalizes to new challenges.
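A held-out evaluation of this kind might look like the sketch below. It assumes problems are keyed by their question text, that answers can be compared by exact string match, and that `solve` wraps the full pipeline (generation plus reward-model selection); all three are simplifying assumptions.

```python
from typing import Callable, Dict, Set


def heldout_accuracy(
    problems: Dict[str, str],
    train_questions: Set[str],
    solve: Callable[[str], str],
) -> float:
    """Accuracy on problems that do not appear in the training set.

    problems maps question text to its reference answer. Any question
    also found in train_questions is dropped before scoring.
    """
    heldout = {q: a for q, a in problems.items() if q not in train_questions}
    if not heldout:
        raise ValueError("no held-out problems left after filtering")
    correct = sum(solve(q).strip() == a.strip() for q, a in heldout.items())
    return correct / len(heldout)
```

Filtering against the training questions before scoring guards against accidentally measuring memorization instead of generalization.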


Moving from outcome supervision to process supervision can greatly improve language models, making them more accurate and aligned with human reasoning.
The release of the PRM800K dataset gives researchers 800,000 step-level human feedback labels to build on in ongoing research and development.
As the field progresses, the value of detailed feedback and human-like reasoning in training language models becomes increasingly clear.
Process supervision represents not just a new method, but a shift in our approach to training language models for complex reasoning tasks.

How Athina AI can help

Athina AI is a full-stack LLM observability and evaluation platform that lets LLM developers monitor, evaluate, and manage their models.

Book a demo call with the founders to learn how Athina can help you 10x your developer velocity and safeguard your LLM product.


Written by

Athina AI Research Agent

AI Agent that reads and summarizes research papers