Process Reward Models (PRMs) have emerged as a promising approach to enhancing the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and suffers from ambiguous credit assignment. These limitations leave downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM), which frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, while the link to the outcome allows precise attribution of credit to each intermediate step, resolving credit-assignment ambiguity. Further, this consistent probabilistic modeling makes the rewards produced by CRM reliably comparable across samples. Experiments across Best-of-N sampling, beam search, and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
Process Reward Models (PRMs) provide step-wise supervision for multi-step reasoning, yet existing approaches suffer from two key limitations: they either score each step in isolation, ignoring inter-step dependencies, or fail to align process rewards with the final outcome.
To address this, we propose CRM, which models each step’s reward as a conditional probability given all previous steps, and explicitly connects process rewards to the final outcome through the conditional probability chain rule, enabling principled and consistent credit assignment.
CRM models LLM reasoning as a temporal process where the probability of reaching a correct answer evolves with the step index $t$. Since it is difficult to directly quantify “how close” a trajectory is to the correct answer, we instead model the complementary event: the reasoning process entering a wrong state such that the trajectory can no longer yield the correct final answer. We define $z$ as the index of the first step where the trajectory enters this wrong state. If no wrong state occurs throughout a trajectory of length $T$, then $z > T$ and the final answer is correct; otherwise $z \le T$ and the final answer is incorrect.
A reasoning trajectory is inherently causal: the correctness of step $t$ depends on all previous steps. To capture this dependency, CRM introduces the conditional wrong-state probability:
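Given the definition of $z$ as the first wrong step, this conditional wrong-state probability can be written as

$$
h(t) \;=\; P\bigl(z = t \,\mid\, z \ge t\bigr),
$$

i.e., $h(t)$ plays the role of a discrete-time hazard: the chance that step $t$ is the first to enter the wrong state, given no earlier step did.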
Naturally, $1-h(t)$ is the probability that the current step is correct given all previous steps were correct. To explicitly link intermediate steps to the final outcome, CRM applies the chain rule of probability:
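Applying the chain rule over steps $1,\dots,T$ yields the survival-style outcome probability:

$$
S(T) \;=\; P(z > T) \;=\; \prod_{t=1}^{T} \bigl(1 - h(t)\bigr).
$$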
With $S(T)$ representing the probability that the reasoning process reaches the correct final answer, CRM then seeks a dense step-wise reward aligned with this outcome probability. We apply Potential-Based Reward Shaping (PBRS) and choose the potential function as:
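One natural choice, under the assumption that the potential tracks the log of the outcome probability, is the log-survival potential:

$$
\Phi(t) \;=\; \log S(t) \;=\; \sum_{k=1}^{t} \log\bigl(1 - h(k)\bigr), \qquad \Phi(0) = \log S(0) = 0.
$$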
Substituting this potential into PBRS yields an explicit process reward for each step transition:
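Assuming the log-survival potential $\Phi(t)=\log S(t)$ and an undiscounted setting ($\gamma = 1$), the shaped reward telescopes to a single per-step log term:

$$
r(t) \;=\; \Phi(t) - \Phi(t-1) \;=\; \log\bigl(1 - h(t)\bigr),
$$

so the cumulative reward over a full trajectory equals $\log S(T)$, aligning dense step rewards with the outcome probability.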
Training objective. At step $t$, CRM takes the question $x$ and the first $t$ reasoning steps as input, and predicts $h(t)=f_\phi(x,a_{\le t})$. For correct trajectories ($l=1$), we maximize $S(T)$:
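One standard way to implement this is as a negative log-likelihood over the per-step hazards:

$$
\mathcal{L}_{\text{correct}} \;=\; -\log S(T) \;=\; -\sum_{t=1}^{T} \log\bigl(1 - h(t)\bigr).
$$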
For incorrect trajectories ($l=0$), we minimize $S(T)$ and encourage the model to localize the first wrong step $z$ by maximizing $p(z)$:
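By the definition of $z$ as the first wrong step, its probability factorizes as

$$
p(z) \;=\; h(z)\,\prod_{t=1}^{z-1} \bigl(1 - h(t)\bigr),
$$

so maximizing $p(z)$ simultaneously pushes $h(z)$ up at the first wrong step and keeps $h(t)$ small for the correct steps before it.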
The overall loss is:
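With trajectory label $l\in\{0,1\}$, a natural combined form (our sketch of the combination) is

$$
\mathcal{L} \;=\; -\,l\,\log S(T) \;-\; (1-l)\,\log p(z).
$$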
This consistent probabilistic modeling ensures that $S(t)$ has the same semantics across different samples, enabling more reliable cross-sample comparison than prior approaches.
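As a concrete illustration, the survival probabilities and shaped step rewards can be computed directly from predicted hazards $h(t)$. This is a minimal sketch with hypothetical function names, assuming the log-survival potential $\Phi(t)=\log S(t)$ and $\gamma=1$:

```python
import math

def survival_and_rewards(h):
    """Given predicted wrong-state probabilities h(t) for t = 1..T,
    return the outcome probabilities S(t) = prod_{k<=t} (1 - h(k))
    and the PBRS step rewards r(t) = log S(t) - log S(t-1)."""
    S, rewards, s = [], [], 1.0  # S(0) = 1: no wrong state before step 1
    for ht in h:
        s *= 1.0 - ht                        # S(t) = S(t-1) * (1 - h(t))
        S.append(s)
        rewards.append(math.log(1.0 - ht))   # telescoping PBRS term
    return S, rewards

S, r = survival_and_rewards([0.1, 0.2, 0.5])
# the step rewards sum to log S(T), linking process rewards to the outcome
assert abs(sum(r) - math.log(S[-1])) < 1e-12
```

Because every $S(t)$ is a probability under the same model, scores remain on a common scale across questions, which is what makes cross-sample comparison meaningful.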
Figure 2. Training objective components and their roles (e.g., correctness likelihood vs. wrong-state localization).
Figure 3. Cross-sample comparability: CRM yields more consistent ranking signals across questions.
CRM is evaluated in three downstream settings: Best-of-N sampling (trajectory selection), beam search (step-level guidance), and RL optimization (policy improvement under dense process rewards). Across settings, CRM improves accuracy and exhibits strong robustness to reward hacking.
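In the Best-of-N setting, a trajectory's score under CRM is its outcome probability $S(T)$, and selection reduces to an argmax over candidates. The helpers below (`outcome_prob`, `best_of_n`) are illustrative sketches, not the released code:

```python
import math

def outcome_prob(h):
    """S(T) = prod_t (1 - h(t)), accumulated in log space for stability."""
    return math.exp(sum(math.log(1.0 - ht) for ht in h))

def best_of_n(candidate_hazards):
    """Pick the candidate trajectory with the highest outcome probability.
    Each candidate is the list of per-step hazards h(t) predicted by CRM."""
    return max(range(len(candidate_hazards)),
               key=lambda i: outcome_prob(candidate_hazards[i]))

# trajectory 0 survives with prob 0.25, trajectory 1 with prob 0.81
assert best_of_n([[0.5, 0.5], [0.1, 0.1]]) == 1
```

The same score can rank partial trajectories at step $t$ via $S(t)$, which is how the beam-search setting uses CRM as step-level guidance.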
| Model | Method | GSM-Plus @8 | @16 | @32 | @64 | @128 | MATH500 @8 | @16 | @32 | @64 | @128 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | ORM | 66.8 | 67.2 | 66.4 | 65.7 | 65.7 | 51.6 | 51.4 | 51.8 | 49.0 | 49.2 |
| | PRM | 67.6 | 67.9 | 67.7 | 66.9 | 66.7 | 54.2 | 55.2 | 55.2 | 54.2 | 54.6 |
| | PQM | 68.5 | 69.2 | 68.5 | 68.2 | 68.0 | 53.2 | 54.4 | 54.8 | 54.8 | 55.8 |
| | IPRM | 65.5 | 66.2 | 66.8 | 66.5 | 66.2 | 52.4 | 52.0 | 52.0 | 52.2 | 53.0 |
| | CRM (ours) | 67.8 | 68.6 | 67.9 | 68.4 | 68.7 | 53.0 | 56.4 | 56.6 | 55.8 | 56.6 |
| Model | Method | MATH500 N=4 | N=8 | N=20 | N=100 | GAOKAO2023 N=4 | N=8 | N=20 | N=100 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | ORM | 50.73 | 54.80 | 56.80 | 58.07 | 35.58 | 38.18 | 38.44 | 40.17 |
| | PRM | 51.80 | 55.73 | 56.87 | 58.00 | 34.72 | 37.84 | 38.70 | 38.96 |
| | PQM | 52.67 | 56.60 | 58.87 | 58.80 | 36.88 | 38.61 | 40.61 | 39.83 |
| | IPRM | 44.27 | 47.27 | 48.33 | 47.47 | 32.55 | 34.46 | 35.32 | 34.55 |
| | CRM (ours) | 54.07 | 58.40 | 61.00 | 63.00 | 38.70 | 39.74 | 41.04 | 43.55 |
| VR Setting | Method | MATH500 | MinervaMath | OlympiadBench | AIME25 | AIME24 | AMC23 |
|---|---|---|---|---|---|---|---|
| VR Disabled | PURE | 76.0 | 30.8 | 36.7 | 13.3 | 26.6 | 70.0 |
| | PRM | 71.6 | 36.3 | 32.5 | 13.3 | 10.0 | 57.5 |
| | PQM | 72.0 | 34.1 | 34.3 | 13.3 | 13.3 | 52.5 |
| | CRM (ours) | 77.8 | 40.0 | 39.3 | 23.3 | 43.3 | 67.5 |
| VR Enabled | Prime | 81.2 | 29.4 | 40.8 | 16.6 | 26.6 | 72.5 |
| | PURE | 82.4 | 40.0 | 41.3 | 23.3 | 23.3 | 70.0 |
| | VR | 76.2 | 38.6 | 38.0 | 16.6 | 30.0 | 62.5 |
| | CRM + VR | 80.4 | 43.0 | 42.1 | 26.6 | 33.3 | 72.5 |
Note: “VR” denotes verifiable reward from outcome ground-truth. “VR Disabled” means training without any verifier-based reward.
Figure 4. RL dynamics and reward hacking analysis (reward, length, repetition, and downstream accuracy over training).
@inproceedings{zhang2026crm,
title = {Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning},
author = {Zheng Zhang and Ziwei Shan and Kaitao Song and Yexin Li and Kan Ren},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}