ACL 2026  ·  Main Conference

Empirical Analysis of Decoding Biases
in Masked Diffusion Models

Uncode (UNmasking Calibration for DecOding DEbiasing): a training-free fix for two decoding biases that hold masked diffusion LLMs back.

Pengcheng Huang¹, Tianming Liu¹, Zhenghao Liu¹*, Yukun Yan², Shuo Wang², Tong Xiao¹, Zulong Chen³, Maosong Sun²
¹Northeastern University, China; ²NLP Lab, Tsinghua University; ³Alibaba Group
*Corresponding author

Takeaways

Highlights

>7%: average gain over the strongest decoding baseline
44.7 vs. 45.3: LLaDA-1.5 + Uncode vs. Qwen-2.5-7B (autoregressive)
3 × 7: MDM backbones × reasoning and planning benchmarks
0: extra training; plug-and-play, decoding-side only

The Two Decoding Biases

MDMs decode by iterative unmasking — any-order, multi-token, non-autoregressive. The unmasking order is decisive, and standard MDMs greedily unmask the least-uncertain positions first. That greedy heuristic creates two systematic biases.
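The greedy rule described above can be sketched in a few lines. This is a toy illustration rather than the authors' decoder: `MASK`, `confidence_unmask_step`, and `toy_predict` are hypothetical names, and the probabilities are made up so that a sentence-final period is the easiest token to predict.

```python
MASK = None  # hypothetical sentinel for a still-masked position


def confidence_unmask_step(sequence, predict, k=1):
    """Unmask the k masked positions whose top prediction is most confident.

    `predict(sequence)` is a hypothetical stand-in for the MDM forward pass:
    it returns, for every position, a dict mapping token -> probability.
    """
    probs = predict(sequence)
    candidates = []
    for i, tok in enumerate(sequence):
        if tok is MASK:
            best_tok = max(probs[i], key=probs[i].get)
            candidates.append((probs[i][best_tok], i, best_tok))
    # Greedy rule: commit the least-uncertain positions first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    out = list(sequence)
    for _, i, tok in candidates[:k]:
        out[i] = tok
    return out


def toy_predict(sequence):
    # Made-up distributions: the final period is far easier to predict
    # than any content token.
    return [
        {"x": 0.40, "y": 0.35, "z": 0.25},
        {"=": 0.60, "+": 0.40},
        {"4": 0.45, "5": 0.30, "6": 0.25},
        {".": 0.97, "!": 0.03},
    ]


seq = [MASK, MASK, MASK, MASK]
step1 = confidence_unmask_step(seq, toy_predict, k=1)
step2 = confidence_unmask_step(step1, toy_predict, k=1)
```

In this toy run the trailing period is committed first and the content tokens last, which is the kind of ordering the two biases below describe.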

Illustration of decoding biases during reasoning: rigid boundary bias, trivial token bias, and the ideal decoding trajectory.
The problem in one picture. Uncertainty-based decoders greedily unmask (A) boundary tokens and (B) trivial tokens first, fixing the answer before the rationale exists. (C) The ideal trajectory establishes the reasoning chain, then resolves the answer.
Bias #1

Rigid Boundary Bias

Boundary tokens (BOS/EOS and sentence edges) are consistently decoded first — a positional regularity learned during training — so decoding collapses inward along a fixed “U-shaped” trajectory. The model thus commits to an answer before the reasoning is built, with the order fixed by position rather than the task.

Heatmap of unmasking probability showing a U-shaped boundary-first trajectory, and accuracy comparison across decoding strategies on GSM8K and Sudoku.
A characteristic U-shaped unmasking pattern; its rigidity helps planning (Sudoku) but hurts step-by-step reasoning (GSM8K).
Bias #2

Trivial Token Bias

High-frequency, low-information tokens (punctuation, spaces, fillers) are easy to predict, so they receive low uncertainty and get over-prioritized — nearly 40% of selections vs. ~20% for autoregressive models. Decoding budget is spent on surface structure instead of reasoning content, and suppressing these trivial tokens monotonically improves accuracy.

Trivial token ratio over decoding steps stays far above the autoregressive baseline; suppressing trivial tokens monotonically improves GSM8K accuracy.
Trivial tokens are consistently over-selected; suppressing them monotonically improves GSM8K accuracy.
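One way to track this statistic is to log which tokens are unmasked at each step and measure the share that fall in a trivial set. The sketch below is only illustrative: the `TRIVIAL` set (punctuation, whitespace, a few fillers) and the decoding trace are assumptions, not the definition or data used in the paper.

```python
import string

# Assumed trivial set for illustration: punctuation, whitespace, a few fillers.
TRIVIAL = set(string.punctuation) | {" ", "\n", "the", "a", "of"}


def trivial_ratio(selected_tokens):
    """Fraction of the tokens unmasked at one step that are trivial."""
    if not selected_tokens:
        return 0.0
    hits = sum(1 for tok in selected_tokens if tok in TRIVIAL)
    return hits / len(selected_tokens)


# Hypothetical trace: the tokens unmasked at each of three decoding steps.
trace = [[".", ",", "so"], ["12", "the", "plus"], ["x", "apples", "4"]]
ratios = [trivial_ratio(step) for step in trace]
overall = trivial_ratio([tok for step in trace for tok in step])
```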

Uncode: Unmasking Calibration

A lightweight, training-free recalibration of the unmasking priority: for every masked position $i$ at step $t$, the calibrated score $\tilde{s}^{\,i}_t$ multiplies the raw uncertainty score by two complementary priors:

$$ \tilde{s}^{\,i}_t \;=\; \underbrace{\mathcal{P}^{\,i}}_{\textstyle\text{position}} \;\cdot\; \underbrace{\mathcal{S}^{\,i}_t}_{\textstyle\text{semantics}} \;\cdot\; \underbrace{\mathcal{F}\!\big(p_\theta(\cdot \mid x_t, i)\big)}_{\textstyle\text{raw uncertainty}} $$

Positional Trajectory Prior

$$\mathcal{P}^{\,i} = e^{-\lambda i}$$

Breaks the rigid boundary-first pattern with a position-dependent decay. The coefficient $\lambda$ interpolates between flexible any-order decoding ($\lambda\!\to\!0$) and left-to-right generation (larger $\lambda$), giving explicit control over causal dependency per task — in practice $\lambda\!=\!0$ for Sudoku, $0.25$ for most tasks, and $0.5$ for Countdown.
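To get a feel for the three settings, the decay can be tabulated for the first few positions. The sketch assumes $i$ is a 0-based, unnormalized position index; the page does not say how $i$ is indexed or scaled, so treat the absolute values as illustrative only.

```python
import math


def positional_prior(i, lam):
    """P^i = exp(-lam * i); i is assumed to be a 0-based position index."""
    return math.exp(-lam * i)


# Weights for the first four positions under the three reported settings.
weights_by_lambda = {
    lam: [round(positional_prior(i, lam), 3) for i in range(4)]
    for lam in (0.0, 0.25, 0.5)
}
```

With $\lambda = 0$ every position gets weight 1, so the prior has no effect; larger $\lambda$ shrinks the weight of later positions faster and pushes decoding toward left-to-right order.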

Semantic Informativeness Prior

$$\mathcal{S}^{\,i}_t = \min\!\big(-\log p_{\mathcal{D}'}(\hat{x}^{\,i}_t),\, \alpha\big)$$

Down-weights frequent, low-information tokens via corpus-level self-information (estimated on a 16 GB C4 subset), redirecting decoding toward content-bearing tokens. Clipping at $\alpha$ (set to $10$) keeps rare tokens from dominating.
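A minimal sketch of this prior, assuming a unigram estimate of $p_{\mathcal{D}'}$ and the natural logarithm (the page does not state the log base). The ten-token corpus stands in for the C4 subset, and the `floor` probability for unseen tokens is an added assumption.

```python
import math
from collections import Counter


def build_unigram(corpus_tokens):
    """Relative token frequencies, standing in for the corpus estimate."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def semantic_prior(token, unigram, alpha=10.0, floor=1e-12):
    """S = min(-log p(token), alpha); `floor` for unseen tokens is assumed."""
    p = unigram.get(token, floor)
    return min(-math.log(p), alpha)


# Toy corpus: the period is frequent, "42" is rare.
corpus = [".", ".", ".", ".", "the", "the", "the", "sum", "is", "42"]
unigram = build_unigram(corpus)

s_period = semantic_prior(".", unigram)
s_number = semantic_prior("42", unigram)
s_unseen = semantic_prior("eigenvalue", unigram)
```

The frequent period gets a small weight, the rarer "42" a larger one, and the unseen token is capped at $\alpha$ instead of dominating the score.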

The result: Uncode keeps the parallel, any-order strengths of MDMs while globally reshaping the decoding trajectory and promoting informative content — with no retraining and negligible overhead.
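Putting the two priors together, the calibrated score can change which position is unmasked next. The sketch below assumes $\mathcal{F}$ is the top-token probability (as in the Confidence baseline) and uses invented unigram probabilities and candidates; `uncode_score` is a hypothetical helper, not the released implementation.

```python
import math


def uncode_score(position, token, confidence, unigram, lam=0.25, alpha=10.0):
    """Calibrated priority: positional prior * semantic prior * raw score."""
    positional = math.exp(-lam * position)
    semantic = min(-math.log(unigram.get(token, 1e-12)), alpha)
    return positional * semantic * confidence


# Invented corpus probabilities and (position, predicted token, confidence).
unigram = {"The": 0.05, "total": 0.001, ".": 0.06}
candidates = [(0, "The", 0.50), (1, "total", 0.45), (7, ".", 0.99)]

# Raw confidence would unmask the trailing period first.
raw_pick = max(candidates, key=lambda c: c[2])[0]
# The calibrated score prefers the early, content-bearing token.
uncode_pick = max(
    candidates, key=lambda c: uncode_score(c[0], c[1], c[2], unigram)
)[0]
```

The only change relative to a confidence-based decoder is the scoring function, which is why the method needs no retraining.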

Trajectory Reshaping

The clearest evidence of what Uncode does: it turns the baseline's rigid U-shaped unmasking pattern into a clean, content-first diagonal trajectory.

Two unmasking-probability heatmaps: the Confidence baseline forms a U-shape; UNCODE forms a clean diagonal trajectory.
Unmasking probability over decoding steps. Left: the Confidence baseline collapses from both boundaries inward (U-shape). Right: Uncode produces a smooth, progressive trajectory that builds global context before committing to the answer.
Answer-token entropy distribution and average predictive entropy over decoding steps for Confidence vs UNCODE.
Why it helps. The baseline guesses answer tokens early, while still uncertain (left, blue cluster). Uncode defers answer tokens to later, high-confidence steps (red) and reduces global uncertainty faster (right) — reasoning first, answering second.
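The uncertainty curves in this kind of plot can be computed from the per-position predictive distributions. The sketch assumes Shannon entropy in nats, averaged over the positions that are still masked; the distributions below are invented to contrast an early step with a late one.

```python
import math


def entropy(dist):
    """Shannon entropy (nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mean_entropy(dists):
    """Average entropy over the distributions of all masked positions."""
    return sum(entropy(d) for d in dists) / len(dists)


# Invented distributions for two masked positions, early vs. late in decoding.
early = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
late = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]

early_h = mean_entropy(early)
late_h = mean_entropy(late)
```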

Experimental Results

Three MDM backbones (LLaDA-8B-Instruct, LLaDA-1.5-8B, and Dream-7B), seven benchmarks spanning code, math, science and planning, against eight decoding baselines and autoregressive references. The table below shows the two LLaDA backbones; full Dream-7B results are reported in the paper.

Main results — Uncode is near-best across the board

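| Method | HumanEval | MBPP | GSM8K | MATH500 | GPQA | Countdown | Sudoku | Avg. |
|---|---|---|---|---|---|---|---|---|
| Autoregressive LLMs (reference) | | | | | | | | |
| LLaMA-3.1-8B | 53.1 | 56.7 | 83.9 | 23.8 | 31.0 | 27.0 | 0.0 | 39.4 |
| Mistral-7B | 43.9 | 37.0 | 49.4 | 7.2 | 28.1 | 22.7 | 0.0 | 26.9 |
| Qwen-2.5-7B | 78.1 | 62.8 | 71.9 | 64.2 | 32.8 | 7.7 | 0.0 | 45.3 |
| LLaDA-8B-Instruct | | | | | | | | |
| Uniform | 15.2 | 24.6 | 48.8 | 15.0 | 29.0 | 14.4 | 2.2 | 21.3 |
| Confidence | 27.4 | 42.4 | 59.1 | 20.8 | 27.9 | 34.0 | 23.8 | 33.6 |
| Entropy | 28.1 | 42.2 | 60.9 | 11.2 | 28.4 | 33.8 | 1.6 | 29.4 |
| Margin | 32.3 | 42.4 | 58.3 | 19.8 | 28.4 | 33.9 | 26.6 | 34.5 |
| EB-Sampler | 26.8 | 43.3 | 61.2 | 11.6 | 29.5 | 34.1 | 24.2 | 33.0 |
| Semi-AR | 39.0 | 45.2 | 77.9 | 27.6 | 27.7 | 32.6 | 0.0 | 35.7 |
| Fast-dLLM | 35.4 | 44.7 | 78.2 | 28.4 | 28.6 | 23.6 | 0.0 | 34.1 |
| Uncode | 42.1 | 47.8 | 79.2 | 34.8 | 29.2 | 36.3 | 29.8 | 42.7 |
| LLaDA-1.5-8B | | | | | | | | |
| Uniform | 17.7 | 23.0 | 52.7 | 20.0 | 28.1 | 15.8 | 3.4 | 23.0 |
| Confidence | 28.1 | 43.3 | 60.7 | 22.8 | 28.7 | 33.8 | 24.8 | 34.6 |
| Entropy | 32.9 | 44.0 | 60.3 | 11.2 | 26.6 | 34.7 | 0.2 | 30.0 |
| Margin | 25.0 | 43.3 | 57.5 | 23.2 | 28.4 | 31.8 | 33.6 | 34.7 |
| EB-Sampler | 32.9 | 43.6 | 61.1 | 13.4 | 26.6 | 34.6 | 0.2 | 30.3 |
| Semi-AR | 39.6 | 46.8 | 80.7 | 34.2 | 26.1 | 32.4 | 0.0 | 37.1 |
| Fast-dLLM | 37.2 | 46.1 | 80.8 | 31.2 | 27.9 | 23.6 | 0.0 | 36.7 |
| Uncode | 46.3 | 49.9 | 82.2 | 37.4 | 28.8 | 35.0 | 33.4 | 44.7 |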

Pass@1 for code, accuracy elsewhere. Uncode wins nearly every column on both MDM backbones, and LLaDA-1.5 + Uncode (44.7 average) is comparable to the heavily tuned autoregressive Qwen-2.5-7B (45.3).
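As a quick check on the headline gain, the margin over the strongest baseline can be recomputed from the Avg. column of the two backbones shown here. The numbers are copied from the table above; the dictionary layout is just one convenient way to hold them, and Dream-7B is not included because its results are not listed on this page.

```python
# Avg. column from the main results table, per backbone and decoding method.
avg = {
    "LLaDA-8B-Instruct": {
        "Uniform": 21.3, "Confidence": 33.6, "Entropy": 29.4, "Margin": 34.5,
        "EB-Sampler": 33.0, "Semi-AR": 35.7, "Fast-dLLM": 34.1, "Uncode": 42.7,
    },
    "LLaDA-1.5-8B": {
        "Uniform": 23.0, "Confidence": 34.6, "Entropy": 30.0, "Margin": 34.7,
        "EB-Sampler": 30.3, "Semi-AR": 37.1, "Fast-dLLM": 36.7, "Uncode": 44.7,
    },
}

gains = {}
for backbone, rows in avg.items():
    baselines = {m: v for m, v in rows.items() if m != "Uncode"}
    strongest = max(baselines, key=baselines.get)
    # Round to one decimal to avoid floating-point noise in the difference.
    gains[backbone] = (strongest, round(rows["Uncode"] - baselines[strongest], 1))
```

On both backbones the strongest baseline by average is Semi-AR, and the margin is 7.0 and 7.6 points respectively.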

Ablation — both priors matter

Ablation bar chart: removing the positional prior or the semantic prior both reduce average performance on LLaDA-8B-Instruct and LLaDA-1.5-8B.
Removing the positional trajectory prior causes the largest drop (42.7 → 34.9 and 44.7 → 35.5); the semantic informativeness prior adds a further +2.3 / +2.7. Both are necessary.

Plug-and-play with efficient samplers

Uncode is an unmasking policy, so it drops into existing fast decoders: on LLaDA-8B-Instruct it raises the average score of the three samplers below by 2.8 to 8.8 points while keeping speedups of roughly 2× to 4× (1.99× to 3.99×).

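| Sampler | HumanEval | MBPP | GSM8K | MATH500 | GPQA | Countdown | Sudoku | Avg. | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| τ-leaping | 17.7 | 38.9 | 55.2 | 16.2 | 28.9 | 32.1 | 3.2 | 27.5 | 2.14× |
| + Uncode | 22.6 | 40.5 | 75.4 | 30.8 | 29.5 | 28.4 | 25.4 | 36.1 | 1.99× |
| EB-Sampler | 26.8 | 43.3 | 61.2 | 11.6 | 29.5 | 34.1 | 24.2 | 33.0 | 2.32× |
| + Uncode | 41.5 | 46.6 | 79.3 | 35.2 | 28.4 | 36.2 | 25.6 | 41.8 | 2.28× |
| Fast-dLLM | 35.0 | 45.9 | 77.4 | 25.2 | 27.8 | 24.4 | 0.0 | 33.9 | 4.21× |
| + Uncode | 36.0 | 48.2 | 77.8 | 29.2 | 28.4 | 36.3 | 0.6 | 36.7 | 3.99× |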

Results on LLaDA-8B-Instruct. Speedup is relative to vanilla full-step decoding.
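The quality and speed trade-off in this table can be summarized with a little arithmetic. The values are copied from the Avg. and Speedup columns above; "retained" (speedup with Uncode divided by speedup without) is a convenience metric introduced here, not one reported by the authors.

```python
# (avg, speedup) without Uncode, then (avg, speedup) with Uncode.
samplers = {
    "τ-leaping": ((27.5, 2.14), (36.1, 1.99)),
    "EB-Sampler": ((33.0, 2.32), (41.8, 2.28)),
    "Fast-dLLM": ((33.9, 4.21), (36.7, 3.99)),
}

summary = {}
for name, ((base_avg, base_speed), (unc_avg, unc_speed)) in samplers.items():
    summary[name] = {
        "gain": round(unc_avg - base_avg, 1),             # points of Avg.
        "retained": round(unc_speed / base_speed, 2),     # share of speedup kept
    }

mean_gain = round(sum(s["gain"] for s in summary.values()) / len(summary), 1)
```

Across the three samplers the average-score gain is 8.6, 8.8, and 2.8 points (6.7 on average), and each sampler keeps at least 93% of its original speedup.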

ACL 2026 Poster

UNCODE ACL 2026 poster overview.

BibTeX

@inproceedings{huang-etal-2026-empirical,
    title = "Empirical Analysis of Decoding Biases in Masked Diffusion Models",
    author = "Huang, Pengcheng  and Liu, Tianming  and Liu, Zhenghao  and Yan, Yukun  and Wang, Shuo  and Xiao, Tong  and Chen, Zulong  and Sun, Maosong",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    year = "2026",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.311/",
    pages = "6853--6876",
    ISBN = "979-8-89176-390-6",
}