Given vocabulary $\mathcal{V}$, a ground-truth distribution $p$ over contexts $c \in \mathcal{V}^{*}$ and completions $y \in \mathcal{V}^{G}$ of length $G$, and a language model $p_{\theta}$, we introduce a feature-matching loss that directly measures how well the model's rollout distribution matches the ground-truth distribution in a learned embedding space:
Feature-matching loss
$$
\mathcal{L}_{\mathrm{FM}}(\theta)\;:=\;\mathbb{E}_{c \sim p}\Big[\big\| \mathbb{E}_{\hat{y} \sim p_{\theta}(\cdot|c)}[\phi(c\!:\!\hat{y})] \! - \! \mathbb{E}_{y \sim p(\cdot|c)}[\phi(c\!:\!y)] \big\|^2\Big],
$$
where $c\!:\!y$ denotes concatenation and $\phi : \mathcal{V}^{*} \to \mathbb{R}^d$ is a feature map constructed by extracting intermediate activations from a frozen copy of the pre-trained model. We use the shorthand $\phi_c(y) \triangleq \phi(c\!:\!y)$. Instead of asking “did we predict the next token correctly?”, we ask: do the model's generated sequences match the statistics of real completions in feature space?
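To make the definition concrete, the loss can be computed exactly in a tiny discrete setting where every completion can be enumerated. This is a hypothetical toy sketch, not the paper's implementation: the single-context setup, per-position-independent model, and random feature table `phi` are all illustrative stand-ins.

```python
import numpy as np
from itertools import product

# Toy setup (all names illustrative): vocab of size V, completions of
# length G for a single fixed context, and a random feature map phi
# assigning one d-dimensional vector to each whole completion c:y.
rng = np.random.default_rng(0)
V, G, d = 3, 2, 4
seqs = list(product(range(V), repeat=G))          # all V**G completions
phi = rng.normal(size=(len(seqs), d))             # phi(c:y), one row per y

def seq_probs(logits):
    """Sequence probabilities from per-position logits (positions
    treated as independent, purely for simplicity of the toy)."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([np.prod([p[t, y[t]] for t in range(G)]) for y in seqs])

def fm_loss(theta_logits, true_logits):
    """Exact feature-matching loss for one context: squared distance
    between the model's and the data's mean feature embeddings."""
    mu_model = seq_probs(theta_logits) @ phi      # E_{y_hat ~ p_theta}[phi]
    mu_data = seq_probs(true_logits) @ phi        # E_{y ~ p}[phi]
    return float(np.sum((mu_model - mu_data) ** 2))

true_logits = rng.normal(size=(G, V))
print(fm_loss(true_logits, true_logits))          # 0.0 at the true distribution
```

The toy makes the key property visible: the loss vanishes exactly when the model's mean embedding matches the data's, which is what "matching the statistics of real completions in feature space" means here.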
Under a sufficiently rich feature map (one whose mean embedding uniquely identifies the distribution), $\mathcal{L}_{\mathrm{FM}}$ is a strictly proper scoring rule: it is minimized only by the true conditional distribution. Moreover, it shares its minimizer with cross-entropy, so there is no inherent tension between the two objectives. In our experiments, optimizing $\mathcal{L}_{\mathrm{FM}}$ improves both.
Since $\mathcal{L}_{\mathrm{FM}}$ depends on the unknown data moment $\mathbb{E}_{y \sim p(\cdot|c)}[\phi_c(y)]$, it cannot be directly estimated from ground-truth pairs $(c,y)$. A bias–variance decomposition yields the conditional feature-matching loss, which differs from $\mathcal{L}_{\mathrm{FM}}$ only by a $\theta$-independent variance term and can therefore be optimized as a surrogate:
Conditional feature-matching loss
$$
\mathcal{L}_{\mathrm{CFM}}(\theta)\;:=\;\mathbb{E}_{c \sim p,\, y \sim p(\cdot|c)}\Big[\big\| \mathbb{E}_{\hat{y} \sim p_{\theta}(\cdot|c)}[\phi_c(\hat{y})] \! - \! \phi_c(y) \big\|^2\Big],
$$
which replaces the unknown population moment with a single ground-truth sample.
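To make the decomposition explicit: expanding the square and conditioning on $c$ gives $\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathcal{L}_{\mathrm{FM}}(\theta) + \mathbb{E}_{c \sim p}\big[\operatorname{tr}\operatorname{Cov}_{y \sim p(\cdot|c)}(\phi_c(y))\big]$, and the second term does not depend on $\theta$. The identity can be checked numerically in a small enumerable setting; the setup below is a hypothetical toy (single context, per-position-independent distributions, random feature table), not the paper's implementation.

```python
import numpy as np
from itertools import product

# Numerical check that L_CFM - L_FM is the same for every theta,
# i.e. the gap is the theta-independent variance term (toy setup).
rng = np.random.default_rng(1)
V, G, d = 3, 2, 4
seqs = list(product(range(V), repeat=G))          # all V**G completions
phi = rng.normal(size=(len(seqs), d))             # phi_c(y), one row per y

def seq_probs(logits):
    """Sequence probabilities from per-position logits."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([np.prod([p[t, y[t]] for t in range(G)]) for y in seqs])

def losses(theta_logits, true_logits):
    """Exact L_FM and L_CFM for one context."""
    q, p = seq_probs(theta_logits), seq_probs(true_logits)
    mu_q, mu_p = q @ phi, p @ phi
    l_fm = float(np.sum((mu_q - mu_p) ** 2))
    # L_CFM = E_{y~p} || mu_q - phi_c(y) ||^2
    l_cfm = float(p @ np.sum((mu_q[None, :] - phi) ** 2, axis=1))
    return l_fm, l_cfm

true = rng.normal(size=(G, V))
gaps = []
for _ in range(3):                                # three arbitrary models
    l_fm, l_cfm = losses(rng.normal(size=(G, V)), true)
    gaps.append(l_cfm - l_fm)
assert np.allclose(gaps, gaps[0])                 # gap is theta-independent
```

Because the gap is constant in $\theta$, minimizing $\mathcal{L}_{\mathrm{CFM}}$ from single ground-truth samples drives $\mathcal{L}_{\mathrm{FM}}$ to its minimum as well.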
The figure below shows what happens when we actually optimize this loss. EBFT achieves the lowest feature-matching loss across all completion lengths, despite training with rollouts of only 8 tokens. The gains are especially pronounced near the rollout horizon used during fine-tuning but extend well beyond it, suggesting the objective captures genuine distributional calibration rather than overfitting to the training completion length. RLVR, by contrast, worsens this loss relative to the base model.
EBFT achieves the lowest feature-matching loss across all completion lengths. Conditional feature-matching loss versus completion length for Qwen2.5-1.5B fine-tuned on OpenCodeInstruct. EBFT is lower than the base model, SFT, and RLVR across all lengths, with larger gains near the rollout horizon (completion length 8).