Process Reward Models (PRMs) have emerged as a promising approach to enhancing the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and suffers from ambiguous credit assignment. These limitations leave downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM), which frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, while the link to the outcome allows precise attribution of credit to each intermediate step, resolving credit-assignment ambiguity. Further, this consistent probabilistic modeling makes the rewards produced by CRM reliably comparable across samples. Experiments across Best-of-N sampling, beam search, and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
Process Reward Models (PRMs) provide step-wise supervision for multi-step reasoning, yet existing approaches suffer from two key limitations: they either score each step in isolation, ignoring inter-step dependencies, or fail to align process rewards with the final outcome.
To address this, we propose CRM, which models each step’s reward as a conditional probability given all previous steps, and explicitly connects process rewards to the final outcome through the conditional probability chain rule, enabling principled and consistent credit assignment.
CRM models LLM reasoning as a temporal process where the probability of reaching a correct answer evolves with the step index $t$. Since it is difficult to directly quantify “how close” a trajectory is to the correct answer, we instead model the complementary event: the reasoning process entering a wrong state such that the trajectory can no longer yield the correct final answer. We define $z$ as the index of the first step where the trajectory enters this wrong state. If no wrong state occurs throughout a trajectory of length $T$, then $z > T$ and the final answer is correct; otherwise $z \le T$ and the final answer is incorrect.
A reasoning trajectory is inherently causal: the correctness of step $t$ depends on all previous steps. To capture this dependency, CRM introduces the conditional wrong-state probability:
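Given the definition of $z$ as the first wrong step, this conditional wrong-state probability can be written as

$$
h(t) \;=\; P\bigl(z = t \,\mid\, z \ge t\bigr),
$$

i.e., $h(t)$ plays the role of a discrete-time hazard: the chance that step $t$ is the first to enter the wrong state, given no earlier step did.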
Naturally, $1-h(t)$ is the probability that the current step is correct given all previous steps were correct. To explicitly link intermediate steps to the final outcome, CRM applies the chain rule of probability:
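Applying the chain rule over steps $1,\dots,T$ yields the survival-style outcome probability:

$$
S(T) \;=\; P(z > T) \;=\; \prod_{t=1}^{T} \bigl(1 - h(t)\bigr).
$$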
With $S(T)$ representing the probability that the reasoning process reaches the correct final answer, CRM then seeks a dense step-wise reward aligned with this outcome probability. We apply Potential-Based Reward Shaping (PBRS) and choose the potential function as:
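One natural choice, under the assumption that the potential tracks the log of the outcome probability, is the log-survival potential:

$$
\Phi(t) \;=\; \log S(t) \;=\; \sum_{k=1}^{t} \log\bigl(1 - h(k)\bigr), \qquad \Phi(0) = \log S(0) = 0.
$$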
Substituting this potential into PBRS yields an explicit process reward for each step transition:
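Assuming the log-survival potential $\Phi(t)=\log S(t)$ and an undiscounted setting ($\gamma = 1$), the shaped reward telescopes to a single per-step log term:

$$
r(t) \;=\; \Phi(t) - \Phi(t-1) \;=\; \log\bigl(1 - h(t)\bigr),
$$

so the cumulative reward over a full trajectory equals $\log S(T)$, aligning dense step rewards with the outcome probability.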
Training objective. At step $t$, CRM takes the question $x$ and the first $t$ reasoning steps as input, and predicts $h(t)=f_\phi(x,a_{\le t})$. For correct trajectories ($l=1$), we maximize $S(T)$:
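One standard way to implement this is as a negative log-likelihood over the per-step hazards:

$$
\mathcal{L}_{\text{correct}} \;=\; -\log S(T) \;=\; -\sum_{t=1}^{T} \log\bigl(1 - h(t)\bigr).
$$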
For incorrect trajectories ($l=0$), we minimize $S(T)$ and encourage the model to localize the first wrong step $z$ by maximizing $p(z)$:
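By the definition of $z$ as the first wrong step, its probability factorizes as

$$
p(z) \;=\; h(z)\,\prod_{t=1}^{z-1} \bigl(1 - h(t)\bigr),
$$

so maximizing $p(z)$ simultaneously pushes $h(z)$ up at the first wrong step and keeps $h(t)$ small for the correct steps before it.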
The overall loss is:
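With trajectory label $l\in\{0,1\}$, a natural combined form (our sketch of the combination) is

$$
\mathcal{L} \;=\; -\,l\,\log S(T) \;-\; (1-l)\,\log p(z).
$$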
This consistent probabilistic modeling ensures that $S(t)$ has the same semantics across different samples, enabling more reliable cross-sample comparison than prior approaches.
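As a concrete illustration, the survival probabilities and shaped step rewards can be computed directly from predicted hazards $h(t)$. This is a minimal sketch with hypothetical function names, assuming the log-survival potential $\Phi(t)=\log S(t)$ and $\gamma=1$:

```python
import math

def survival_and_rewards(h):
    """Given predicted wrong-state probabilities h(t) for t = 1..T,
    return the outcome probabilities S(t) = prod_{k<=t} (1 - h(k))
    and the PBRS step rewards r(t) = log S(t) - log S(t-1)."""
    S, rewards, s = [], [], 1.0  # S(0) = 1: no wrong state before step 1
    for ht in h:
        s *= 1.0 - ht                        # S(t) = S(t-1) * (1 - h(t))
        S.append(s)
        rewards.append(math.log(1.0 - ht))   # telescoping PBRS term
    return S, rewards

S, r = survival_and_rewards([0.1, 0.2, 0.5])
# the step rewards sum to log S(T), linking process rewards to the outcome
assert abs(sum(r) - math.log(S[-1])) < 1e-12
```

Because every $S(t)$ is a probability under the same model, scores remain on a common scale across questions, which is what makes cross-sample comparison meaningful.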
Figure 2. Training objective components and their roles (e.g., correctness likelihood vs. wrong-state localization).
Figure 3. Cross-sample comparability: CRM yields more consistent ranking signals across questions.
CRM is evaluated in three downstream settings: Best-of-N sampling (trajectory selection), beam search (step-level guidance), and RL optimization (policy improvement under dense process rewards). Across settings, CRM improves accuracy and exhibits strong robustness to reward hacking.
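In the Best-of-N setting, a trajectory's score under CRM is its outcome probability $S(T)$, and selection reduces to an argmax over candidates. The helpers below (`outcome_prob`, `best_of_n`) are illustrative sketches, not the released code:

```python
import math

def outcome_prob(h):
    """S(T) = prod_t (1 - h(t)), accumulated in log space for stability."""
    return math.exp(sum(math.log(1.0 - ht) for ht in h))

def best_of_n(candidate_hazards):
    """Pick the candidate trajectory with the highest outcome probability.
    Each candidate is the list of per-step hazards h(t) predicted by CRM."""
    return max(range(len(candidate_hazards)),
               key=lambda i: outcome_prob(candidate_hazards[i]))

# trajectory 0 survives with prob 0.25, trajectory 1 with prob 0.81
assert best_of_n([[0.5, 0.5], [0.1, 0.1]]) == 1
```

The same score can rank partial trajectories at step $t$ via $S(t)$, which is how the beam-search setting uses CRM as step-level guidance.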
| Model | Method | GSM-Plus @8 | @16 | @32 | @64 | @128 | MATH500 @8 | @16 | @32 | @64 | @128 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | ORM | 66.8 | 67.2 | 66.4 | 65.7 | 65.7 | 51.6 | 51.4 | 51.8 | 49.0 | 49.2 |
| | PRM | 67.6 | 67.9 | 67.7 | 66.9 | 66.7 | 54.2 | 55.2 | 55.2 | 54.2 | 54.6 |
| | PQM | 68.5 | 69.2 | 68.5 | 68.2 | 68.0 | 53.2 | 54.4 | 54.8 | 54.8 | 55.8 |
| | IPRM | 65.5 | 66.2 | 66.8 | 66.5 | 66.2 | 52.4 | 52.0 | 52.0 | 52.2 | 53.0 |
| | CRM (ours) | 67.8 | 68.6 | 67.9 | 68.4 | 68.7 | 53.0 | 56.4 | 56.6 | 55.8 | 56.6 |
| Model | Method | MATH500 N=4 | N=8 | N=20 | N=100 | GAOKAO2023 N=4 | N=8 | N=20 | N=100 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | ORM | 50.73 | 54.80 | 56.80 | 58.07 | 35.58 | 38.18 | 38.44 | 40.17 |
| | PRM | 51.80 | 55.73 | 56.87 | 58.00 | 34.72 | 37.84 | 38.70 | 38.96 |
| | PQM | 52.67 | 56.60 | 58.87 | 58.80 | 36.88 | 38.61 | 40.61 | 39.83 |
| | IPRM | 44.27 | 47.27 | 48.33 | 47.47 | 32.55 | 34.46 | 35.32 | 34.55 |
| | CRM (ours) | 54.07 | 58.40 | 61.00 | 63.00 | 38.70 | 39.74 | 41.04 | 43.55 |
| VR Setting | Method | MATH500 | MinervaMath | OlympiadBench | AIME25 | AIME24 | AMC23 |
|---|---|---|---|---|---|---|---|
| VR Disabled | PURE | 76.0 | 30.8 | 36.7 | 13.3 | 26.6 | 70.0 |
| | PRM | 71.6 | 36.3 | 32.5 | 13.3 | 10.0 | 57.5 |
| | PQM | 72.0 | 34.1 | 34.3 | 13.3 | 13.3 | 52.5 |
| | CRM (ours) | 77.8 | 40.0 | 39.3 | 23.3 | 43.3 | 67.5 |
| VR Enabled | Prime | 81.2 | 29.4 | 40.8 | 16.6 | 26.6 | 72.5 |
| | PURE | 82.4 | 40.0 | 41.3 | 23.3 | 23.3 | 70.0 |
| | VR | 76.2 | 38.6 | 38.0 | 16.6 | 30.0 | 62.5 |
| | CRM + VR | 80.4 | 43.0 | 42.1 | 26.6 | 33.3 | 72.5 |
Note: “VR” denotes verifiable reward from outcome ground-truth. “VR Disabled” means training without any verifier-based reward.
Figure 4. RL dynamics and reward hacking analysis (reward, length, repetition, and downstream accuracy over training).
@inproceedings{zhang2026crm,
title = {Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning},
author = {Zheng Zhang and Ziwei Shan and Kaitao Song and Yexin Li and Kan Ren},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}