Kempner Institute, Harvard University · Microsoft Research

Matching Features, Not Tokens:
Energy-Based Fine-Tuning
of Language Models

* Equal contribution

TL;DR. Language models train on ground-truth text but must generate from their own outputs. This mismatch compounds with sequence length: errors drift the model off-distribution, and neither SFT (which never sees its own rollouts) nor RLVR (which provides only a scalar correctness signal) directly addresses the resulting distributional shift. Energy-Based Fine-Tuning (EBFT) introduces a feature-matching objective that measures whether the statistics of model-generated completions match those of ground-truth completions in the activation space of a frozen pre-trained model. EBFT matches or exceeds RLVR on downstream accuracy, achieves lower validation cross-entropy than SFT despite not directly optimizing it, and requires no task-specific reward or verifier.

Energy-Based Fine-Tuning (EBFT) overview. For each prompt $c$, the generator $p_\theta$ samples $n$ on-policy completions (partial rollouts). A frozen feature network $\phi$, initialized from the generator, embeds the concatenated prompt–completion sequences, producing feature vectors for both sampled and ground-truth completions. EBFT assigns each sample a feature-matching reward that encourages alignment with the ground-truth feature moment while discouraging collapse via a leave-one-out diversity term. The generator is updated with a REINFORCE estimator using an RLOO baseline, yielding a dense sequence-level learning signal under the model's own rollouts.

The Compounding Error Problem for LLM Inference

Language models are trained on ground-truth text but must generate from their own outputs. This mismatch is subtle during training—the model never conditions on its own predictions—but it compounds at inference time. An early error shifts the conditioning context, the next token is sampled from a distribution the model was rarely trained on, and the problem snowballs. Next-token cross-entropy loss under teacher forcing is the standard training objective:

Teacher-forced next-token objective
$$ \mathcal{L}_{\mathrm{CE}}(\theta)\;=\;-\mathbb{E}_{(c,y)\sim p}\;\log p_{\theta}(y\mid c) $$

which provides a dense, stable learning signal that scales efficiently. In principle, if optimized to zero, it recovers the true data distribution and the compounding problem disappears. However, in practice we are never in that regime: every real model carries residual error, and teacher forcing provides no mechanism to correct for how that error propagates across multiple generation steps.
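To make the teacher-forcing setup concrete, here is a minimal numpy sketch of the objective above for a toy bigram "model": the loss always conditions on the *true* previous token, never on the model's own prediction. The vocabulary size, logits, and sequence are illustrative stand-ins, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4                                   # toy vocabulary size
logits = rng.normal(size=(V, V))        # logits[prev] -> next-token scores

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def teacher_forced_ce(tokens):
    """Mean -log p(y_t | y_{t-1}) over a ground-truth sequence.
    The model conditions on the true previous token at every step."""
    probs = softmax(logits)
    nll = [-np.log(probs[tokens[t - 1], tokens[t]])
           for t in range(1, len(tokens))]
    return float(np.mean(nll))

seq = [0, 2, 1, 3, 2]                   # a "ground-truth" token sequence
loss = teacher_forced_ce(seq)
print(loss)
```

Note that nothing in this computation ever feeds a sampled token back into the conditioning context, which is exactly the gap the rest of the post addresses.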

Braverman et al. (2019) made this concrete by measuring the conditional entropy of the $k$-th generated token as a function of $k$. For a perfectly calibrated model—one whose generations are distributionally indistinguishable from real text—this quantity should be flat. However, they found it grows steadily, revealing that the model's uncertainty compounds as it conditions on increasingly off-distribution text.

We observe the same phenomenon through a different lens. The figure below shows a feature-matching loss—measuring how well the statistics of model rollouts match those of ground-truth completions in a learned embedding space—as a function of completion length.

Feature-matching loss grows with completion length

Feature-matching loss grows with completion length. Conditional feature-matching loss (lower is better) for Qwen2.5-1.5B fine-tuned with SFT on OpenCodeInstruct.

The steady increase reflects a genuine calibration failure: SFT does not account for how the model's distribution drifts over its own generations. This is the gap we set out to close. Rather than treating this divergence as an inevitable byproduct of teacher forcing, we ask: can we define a training objective that directly targets these long-range statistics and optimize it?

Why Existing Solutions Fall Short

Two dominant fine-tuning paradigms exist, and each addresses only part of the problem.

Supervised fine-tuning (SFT) optimizes next-token cross-entropy under teacher forcing. It never evaluates the model on its own generations, so it cannot directly optimize the sequence-level statistics that matter at inference time.

Reinforcement learning with verifiable rewards (RLVR) operates on full rollouts and optimizes sequence-level correctness. But that signal is a single scalar reward that tells the model whether an output is correct, not how its distribution deviates from the target. In practice, RLVR can improve downstream accuracy while substantially degrading validation cross-entropy and feature-matching losses.

What is missing is a training signal that is both sequence-level and distributional—one that optimizes not for pointwise correctness of individual outputs, but for alignment between the distribution of model-generated completions and that of ground-truth completions.

| | Trains on rollouts? | Sequence-level signal? | Preserves calibration? |
| --- | --- | --- | --- |
| SFT | No, teacher forcing only | No, token-level loss | Yes |
| RLVR | Yes | Scalar reward only | No, CE often degrades |
| EBFT (ours) | Yes | Dense, feature-level | Yes, CE improves |

A New Objective: Matching Long-Range Statistics

Given vocabulary $\mathcal{V}$, a ground-truth distribution $p$ over contexts $c \in \mathcal{V}^{*}$ and completions $y \in \mathcal{V}^{G}$ of length $G$, and a language model $p_{\theta}$, we introduce a feature-matching loss that directly measures how well the model's rollout distribution matches the ground-truth distribution in a learned embedding space:

Feature-matching loss
$$ \mathcal{L}_{\mathrm{FM}}(\theta)\;:=\;\mathbb{E}_{c \sim p}\Big[\big\| \mathbb{E}_{\hat{y} \sim p_{\theta}(\cdot|c)}[\phi(c\!:\!\hat{y})] \! - \! \mathbb{E}_{y \sim p(\cdot|c)}[\phi(c\!:\!y)] \big\|^2\Big], $$

where $c\!:\!y$ denotes concatenation and $\phi : \mathcal{V}^{*} \to \mathbb{R}^d$ is a feature map constructed by extracting intermediate activations from a frozen copy of the pre-trained model. We use the shorthand $\phi_c(y) \triangleq \phi(c\!:\!y)$. Instead of asking “did we predict the next token correctly?”, we ask: do the model's generated sequences match the statistics of real completions in feature space?

Under a sufficiently rich feature map, $\mathcal{L}_{\mathrm{FM}}$ is a strictly proper scoring rule—it can only be minimized by the true conditional distribution. Moreover, it shares the same minimizer as cross-entropy, so there is no inherent tension between the two objectives. In our experiments, optimizing $\mathcal{L}_{\mathrm{FM}}$ improves both.

Since $\mathcal{L}_{\mathrm{FM}}$ depends on the unknown data moment $\mathbb{E}_{y \sim p(\cdot|c)}[\phi_c(y)]$, it cannot be directly estimated from ground-truth pairs $(c,y)$. A bias–variance decomposition yields the conditional feature-matching loss, which differs from $\mathcal{L}_{\mathrm{FM}}$ only by a $\theta$-independent variance term and can therefore be optimized as a surrogate:

Conditional feature-matching loss
$$ \mathcal{L}_{\mathrm{CFM}}(\theta)\;:=\;\mathbb{E}_{c \sim p}\Big[\big\| \mathbb{E}_{\hat{y} \sim p_{\theta}(\cdot|c)}[\phi_c(\hat{y})] \! - \! \phi_c(y) \big\|^2\Big], $$

which replaces the unknown population moment with a single ground-truth sample.
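The surrogate above is straightforward to estimate with Monte-Carlo rollouts. The sketch below does this for a single prompt, with a toy stand-in feature map (a normalized token-count histogram) in place of the frozen LM activations the paper uses; all dimensions and the uniform "policy" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, G, n = 5, 8, 16          # toy vocab size, completion length, num rollouts

def phi(completion):
    # stand-in feature map: normalized token-count histogram
    return np.bincount(completion, minlength=V) / len(completion)

def cfm_loss(model_probs, y_true, n_samples=n):
    """|| E_{y_hat ~ p_theta}[phi(y_hat)] - phi(y_true) ||^2, estimated
    by averaging phi over n on-policy rollouts."""
    feats = []
    for _ in range(n_samples):
        y_hat = rng.choice(V, size=G, p=model_probs)   # one rollout
        feats.append(phi(y_hat))
    mean_feat = np.mean(feats, axis=0)                 # rollout feature moment
    return float(np.sum((mean_feat - phi(y_true)) ** 2))

p_model = np.full(V, 1.0 / V)           # toy uniform "policy"
y = rng.choice(V, size=G)               # a ground-truth completion
print(cfm_loss(p_model, y))
```

In the real setting, `phi` would be intermediate activations of the frozen pre-trained model on the concatenated prompt–completion sequence, and the expectation over contexts $c$ is taken over the training set.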

The figure below shows what happens when we actually optimize this loss. EBFT achieves the lowest feature-matching loss across all completion lengths, despite training with rollouts of only 8 tokens. The gains are especially pronounced near the rollout horizon used during fine-tuning but extend well beyond it, suggesting the objective captures genuine distributional calibration rather than overfitting to the training completion length. RLVR, by contrast, worsens this loss relative to the base model.

Conditional feature-matching loss versus completion length

EBFT achieves the lowest feature-matching loss across all completion lengths. Conditional feature-matching loss versus completion length for Qwen2.5-1.5B fine-tuned on OpenCodeInstruct. EBFT is lower than the base model, SFT, and RLVR across all lengths, with larger gains near the rollout horizon (completion length 8).

How EBFT Works

EBFT optimizes the feature-matching objective with on-policy rollouts. For each prompt $c$, the current policy samples $n$ completions $\hat y_1,\ldots,\hat y_n\sim p_{\theta}(\cdot\mid c)$. A frozen feature network extracts embeddings for both generated and reference completions. Applying the REINFORCE log-derivative trick to $\mathcal{L}_{\mathrm{CFM}}$ gives a policy-gradient update with per-sample reward:

EBFT reward
$$ r(\hat{y},c) \;=\; \underbrace{2\,\phi_c(\hat{y})^{\!\top} \phi_c(y)}_{\text{alignment}} \;-\; \underbrace{2\,\phi_c(\hat{y})^{\!\top} \mathbb{E}_{\tilde{y} \sim p_{\theta}(\cdot|c)}\!\big[\phi_c(\tilde{y})\big]}_{\text{diversity}}. $$

Intuitively: samples receive higher reward when their features align with those of the ground-truth completion, and are penalized when they collapse toward the mean of other sampled completions.

In practice, we draw $n > 1$ completions and estimate the diversity term with a leave-one-out average, yielding per-sample rewards $r_j = 2\,\phi_c(\hat{y}_j)^{\!\top}\phi_c(y) - \tfrac{2}{n-1}\sum_{j'\neq j}\phi_c(\hat{y}_j)^{\!\top}\phi_c(\hat{y}_{j'})$. Variance is further reduced with an RLOO baseline. The gradient update is optionally combined with a small cross-entropy regularizer $\gamma\,\mathcal{L}_{\mathrm{CE}}$ for stability.
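The reward and baseline computation above reduces to a few matrix operations. Here is a minimal numpy sketch for one prompt, with random vectors standing in for the frozen-network features (the shapes and values are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16
Phi = rng.normal(size=(n, d))      # phi_c(y_hat_j) for the n rollouts
phi_y = rng.normal(size=d)         # phi_c(y) for the ground-truth completion

# Alignment term: 2 * phi_c(y_hat_j)^T phi_c(y)
align = 2.0 * Phi @ phi_y

# Diversity term: leave-one-out mean of pairwise inner products
gram = Phi @ Phi.T
loo = (gram.sum(axis=1) - np.diag(gram)) / (n - 1)

rewards = align - 2.0 * loo

# RLOO baseline: each sample's advantage is its reward minus the mean
# reward of the *other* n-1 samples.
baseline = (rewards.sum() - rewards) / (n - 1)
advantages = rewards - baseline

print(advantages)
```

One useful sanity check on this construction: RLOO advantages always sum to zero across the batch of rollouts, so the update redistributes probability mass among samples rather than uniformly inflating it.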

Training recipe

  1. Sample n on-policy completions for a prompt.
  2. Embed sampled and reference completions with a frozen feature network.
  3. Compute feature-matching rewards with an alignment term and a diversity term.
  4. Update the generator with an RLOO / REINFORCE-style policy-gradient step.
  5. Optionally mix in standard cross-entropy regularization.
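The recipe above can be sketched end-to-end on a toy policy: independent categorical distributions over `V` tokens at each of `G` positions, with a token-histogram stand-in for the frozen feature network. Everything here (the policy class, feature map, and hyperparameters) is an illustrative assumption; real EBFT updates a transformer with REINFORCE through its own sampling. The cross-entropy regularizer of step 5 is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V, G, n, lr, steps = 5, 6, 8, 0.5, 300

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def phi(y):                           # toy feature map: token histogram
    return np.bincount(y, minlength=V) / len(y)

logits = np.zeros((G, V))             # policy: one categorical per position
y_true = rng.choice(V, size=G)        # ground-truth completion
phi_y = phi(y_true)

def cfm(logits):
    # For this toy policy, E[phi] has a closed form: the mean of the
    # per-position token distributions.
    return float(np.sum((softmax(logits).mean(axis=0) - phi_y) ** 2))

initial_loss = cfm(logits)
for _ in range(steps):
    probs = softmax(logits)
    # 1. sample n on-policy completions
    ys = np.stack([[rng.choice(V, p=probs[t]) for t in range(G)]
                   for _ in range(n)])
    # 2.-3. features, alignment term, leave-one-out diversity term
    Phi = np.stack([phi(y) for y in ys])
    gram = Phi @ Phi.T
    loo = (gram.sum(axis=1) - np.diag(gram)) / (n - 1)
    r = 2.0 * Phi @ phi_y - 2.0 * loo
    # 4. RLOO advantages and a REINFORCE step on the logits
    adv = r - (r.sum() - r) / (n - 1)
    grad = np.zeros_like(logits)
    for j in range(n):
        for t in range(G):
            g = -probs[t].copy()
            g[ys[j, t]] += 1.0        # grad of log p(y_jt) w.r.t. logits[t]
            grad[t] += adv[j] * g
    logits += lr * grad / n
final_loss = cfm(logits)
print(initial_loss, final_loss)
```

Run long enough, this toy loop drives the rollout feature moment toward the ground-truth features, which is the behavior the conditional feature-matching loss is designed to reward.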

Main Results

Greedy accuracy, pass@16 accuracy, validation cross-entropy, and conditional feature-matching loss

Downstream greedy accuracy, pass@16 accuracy, validation cross-entropy, and conditional feature-matching loss for EBFT, SFT, and RLVR. EBFT consistently outperforms the other methods on downstream performance while achieving lower cross-entropy loss and conditional feature matching loss, despite SFT explicitly optimizing for the former.

We evaluate EBFT across both verifiable and non-verifiable settings to test the central claim: optimizing sequence-level semantic objectives under rollouts yields improvements over token-level CE training, particularly in tasks with many acceptable surface forms. We compare three methods: (a) standard CE fine-tuning (SFT), (b) RLVR using task-specific verifiers or metrics where available, and (c) EBFT with a frozen feature network. We report two EBFT configurations: EBFT runs two epochs from the base model, while EBFT (ws.) runs one epoch from a warm-started checkpoint (one epoch of SFT). We use Qwen2.5-1.5B for coding tasks and Llama3.2-1B for translation. In general, we observe:

EBFT matches RLVR on accuracy—without a correctness reward signal. EBFT reaches the highest or tied-highest downstream scores across all settings, outperforming SFT and matching or exceeding RLVR—despite never seeing a single unit test or verifier.

EBFT beats SFT at cross-entropy—despite not directly optimizing it. EBFT drives validation CE below SFT, even though SFT directly optimizes this objective. Meanwhile, RLVR's CE rises throughout training, often exceeding the base model.

RLVR degrades distributional quality while EBFT improves it. EBFT steadily improves the feature-matching loss; SFT is roughly flat; RLVR actively degrades it relative to the base model. On unstructured tasks where RL-style verifiers are unavailable, EBFT still yields strong gains—making it applicable to the vast majority of training data where RLVR simply cannot be used.

Below we summarize our results for the three task settings included in the paper.

We train on a 100k-sample subset of OpenCodeInstruct, consisting of natural-language programming prompts paired with reference solutions. Downstream evaluation uses HumanEval, MBPP, and MultiPL-E, reporting pass@1 and pass@k metrics where correctness is determined by passing all provided unit tests. EBFT reaches the best or tied-best downstream coding accuracy while achieving much lower validation cross-entropy than both SFT and RLVR.

| Method | CE Loss | FM Loss | Greedy | pass@1 | pass@4 | pass@16 |
| --- | --- | --- | --- | --- | --- | --- |
| Base | 0.338 | 0.361 | 0.484 | 0.424 | 0.606 | 0.715 |
| SFT | 0.289 | 0.315 | 0.483 | 0.455 | 0.617 | 0.728 |
| EBFT | 0.207 | 0.258 | 0.548 | 0.510 | 0.659 | 0.771 |
| EBFT (ws.) | 0.190 | 0.255 | 0.534 | 0.508 | 0.658 | 0.756 |
| RLVR | 0.774 | 0.442 | 0.535 | 0.510 | 0.660 | 0.752 |
| RLVR (ws.) | 0.389 | 0.402 | 0.524 | 0.529 | 0.662 | 0.749 |

Additional Results

The paper develops a broader theoretical and empirical picture beyond the main results. Below we highlight a few key topics; see the full paper for detailed derivations, proofs, and additional experiments on key hyperparameters.

We formalize the feature-matching loss as a proper scoring rule when the feature map is sufficiently rich: matching conditional feature moments corresponds to matching the rollout distribution itself.

Adding KL regularization to the feature-matching objective—penalizing deviation from a reference distribution $q(\cdot\mid c)$—yields a KL-regularized objective whose solution takes the form of an exponential tilt:

KL-regularized optimal distribution
$$\rho^\star(y\mid c) \;\propto\; q(y\mid c)\,\exp\!\big(-\chi_c^\top \phi_c(y)\big),$$

for a context-dependent vector $\chi_c \in \mathbb{R}^d$. Intuitively, $\chi_c$ is the tilt direction that assigns the most probability to completions actually observed in the data, subject to a size constraint. This is precisely the maximum-likelihood problem for an energy-based model with energy function $E(y,c) = \chi_c^\top \phi_c(y)$, motivating the term energy-based fine-tuning. Importantly, EBFT does not explicitly parameterize or learn $\chi$; instead, it directly optimizes the generator parameters via feature-matching gradients.

This KL-regularized view also provides a calibration interpretation. Given a base distribution $q(\cdot\mid c)$ and a target moment constraint $\mathbb{E}_{p(\cdot\mid c)}[f(y,c)] = m$, the distribution that satisfies the constraint while staying closest to $q$ in KL divergence is an exponential tilt $p_\chi(y\mid c) \propto \exp(\chi_c^\top f(y,c))\,q(y\mid c)$. Braverman et al. (2019) use this principle to correct entropy-rate drift in language model generations, applying a scalar tilt with $f(y,c) = -\log p_\theta(y\mid c)$. EBFT performs the same type of correction, but with $f(y,c) = -\phi_c(y)$, enforcing high-dimensional moment constraints in a semantically rich feature space rather than a single scalar statistic.
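The exponential-tilt principle is easy to verify numerically in one dimension. The sketch below (all distributions and the target moment are illustrative toy values) tilts a base distribution $q$ by a scalar $\chi$ and solves for the $\chi$ that hits a target feature moment; since the moment is monotone increasing in $\chi$, simple bisection suffices.

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])         # toy base distribution over 3 outcomes
f = np.array([0.0, 1.0, 2.0])         # scalar feature per outcome

def tilt(chi):
    """p_chi(y) proportional to exp(chi * f(y)) * q(y)."""
    w = q * np.exp(chi * f)
    return w / w.sum()

def moment(chi):
    return float(tilt(chi) @ f)       # E_{p_chi}[f]

m_target = 1.2                        # desired moment; the base gives 0.7
lo, hi = -10.0, 10.0                  # moment(chi) is increasing in chi
for _ in range(100):                  # bisection for the matching tilt
    mid = 0.5 * (lo + hi)
    if moment(mid) < m_target:
        lo = mid
    else:
        hi = mid
chi_star = 0.5 * (lo + hi)
print(chi_star, moment(chi_star))
```

Braverman et al.'s entropy-rate correction is this picture with a single scalar feature; EBFT's feature map plays the role of `f` in $d$ dimensions, where $\chi_c$ becomes a vector solved implicitly through the generator update rather than explicitly as here.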

Takeaways

EBFT offers a middle ground between supervised fine-tuning and reward-based optimization. It uses on-policy rollouts like RL, but the training signal comes from dense semantic feature matching rather than sparse correctness rewards. This makes it especially attractive for domains with many valid outputs, limited verifier availability, or a desire to improve rollout calibration rather than only final-answer correctness.

More broadly, our results suggest that sequence-level training objectives need not be restricted to scalar rewards. Rich feature-space supervision can provide a scalable alternative for aligning language-model behavior under generation.

BibTeX

@misc{jelassi2026matchingfeaturestokensenergybased,
      title={Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models}, 
      author={Samy Jelassi and Mujin Kwun and Rosie Zhao and Yuanzhi Li and Nicolo Fusi and Yilun Du and Sham M. Kakade and Carles Domingo-Enrich},
      year={2026},
      eprint={2603.12248},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.12248}, 
}