Uncode

ACL 2026 · Main Conference

Empirical Analysis of Decoding Biases
in Masked Diffusion Models

Uncode — UNmasking Calibration for DecOding DEbiasing: a training-free fix for two decoding biases that hold masked diffusion LLMs back.

Pengcheng Huang¹ Tianming Liu¹ Zhenghao Liu^1,∗ Yukun Yan² Shuo Wang² Tong Xiao¹ Zulong Chen³ Maosong Sun²

¹Northeastern University, China ²NLP Lab, Tsinghua University ³Alibaba Group

^∗Corresponding author

arXiv:2508.13021 ACL Anthology Code Poster BibTeX

Takeaways

1
Two biases, one root cause. Greedy uncertainty-based decoding systematically prefers boundaries and trivial tokens — both stem from "easy" positions being unmasked first.
2
Calibrate, don't retrain. A position prior and a semantics prior, multiplied onto the existing score, are enough to fix it — no fine-tuning, no architecture change.
3
Closing the gap to autoregressive. Decoding-side changes alone bring an 8B MDM to near-parity with a strong 7B autoregressive model on reasoning and planning.
4
Compatible with efficient decoding. Uncode plugs into existing efficient decoding strategies (fast samplers), preserving their 2×–4× speedups while improving accuracy.

Highlights

> 7%

average gain over the strongest decoding baseline

44.7 ≈ 45.3

LLaDA-1.5 + Uncode vs. Qwen-2.5-7B (autoregressive)

3 × 7

MDM backbones × reasoning & planning benchmarks

extra training — plug-and-play, decoding-side only

The Two Decoding Biases

MDMs decode by iterative unmasking — any-order, multi-token, non-autoregressive. The unmasking order is decisive, and standard MDMs greedily unmask the least-uncertain positions first. That greedy heuristic creates two systematic biases.

Bias #1

Rigid Boundary Bias

Boundary tokens (BOS/EOS and sentence edges) are consistently decoded first — a positional regularity learned during training — so decoding collapses inward along a fixed “U-shaped” trajectory. The model thus commits to an answer before the reasoning is built, with the order fixed by position rather than the task.

Heatmap of unmasking probability showing a U-shaped boundary-first trajectory, and accuracy comparison across decoding strategies on GSM8K and Sudoku. — A characteristic U-shaped unmasking pattern; its rigidity helps planning (Sudoku) but hurts step-by-step reasoning (GSM8K).

Bias #2

Trivial Token Bias

High-frequency, low-information tokens (punctuation, spaces, fillers) are easy to predict, so they receive low uncertainty and get over-prioritized — nearly 40% of selections vs. ~20% for autoregressive models. Decoding budget is spent on surface structure instead of reasoning content, and suppressing these trivial tokens monotonically improves accuracy.

Trivial token ratio over decoding steps stays far above the autoregressive baseline; suppressing trivial tokens monotonically improves GSM8K accuracy. — Trivial tokens are consistently over-selected; suppressing them monotonically improves GSM8K accuracy.

Uncode: Unmasking Calibration

A lightweight, training-free recalibration of the unmasking priority $s^{\,i}_t$ for every masked position $i$ at step $t$, multiplying the raw uncertainty score by two complementary priors:

$$ \tilde{s}^{\,i}_t \;=\; \underbrace{\mathcal{P}^{\,i}}_{\textstyle\text{position}} \;\cdot\; \underbrace{\mathcal{S}^{\,i}_t}_{\textstyle\text{semantics}} \;\cdot\; \underbrace{\mathcal{F}\!\big(p_\theta(\cdot \mid x_t, i)\big)}_{\textstyle\text{raw uncertainty}} $$

Positional Trajectory Prior

$$\mathcal{P}^{\,i} = e^{-\lambda i}$$

Breaks the rigid boundary-first pattern with a position-dependent decay. The coefficient $\lambda$ interpolates between flexible any-order decoding ($\lambda\!\to\!0$) and left-to-right generation (larger $\lambda$), giving explicit control over causal dependency per task — in practice $\lambda\!=\!0$ for Sudoku, $0.25$ for most tasks, and $0.5$ for Countdown.

Semantic Informativeness Prior

$$\mathcal{S}^{\,i}_t = \min\!\big(-\log p_{\mathcal{D}'}(\hat{x}^{\,i}_t),\, \alpha\big)$$

Down-weights frequent, low-information tokens via corpus-level self-information (estimated on a 16 GB C4 subset), redirecting decoding toward content-bearing tokens. Clipping at $\alpha$ (set to $10$) keeps rare tokens from dominating.

The result: Uncode keeps the parallel, any-order strengths of MDMs while globally reshaping the decoding trajectory and promoting informative content — with no retraining and negligible overhead.

Trajectory Reshaping

The clearest evidence of what Uncode does: it turns the baseline's rigid U-shaped unmasking pattern into a clean, content-first diagonal trajectory.

Two unmasking-probability heatmaps: the Confidence baseline forms a U-shape; UNCODE forms a clean diagonal trajectory. — Unmasking probability over decoding steps. **Left:** the Confidence baseline collapses from both boundaries inward (U-shape). **Right:** Uncode produces a smooth, progressive trajectory that builds global context before committing to the answer.

Answer-token entropy distribution and average predictive entropy over decoding steps for Confidence vs UNCODE. — **Why it helps.** The baseline guesses answer tokens early, while still uncertain (left, blue cluster). Uncode **defers answer tokens to later, high-confidence steps** (red) and **reduces global uncertainty faster** (right) — reasoning first, answering second.

Experimental Results

Three MDM backbones (LLaDA-8B-Instruct, LLaDA-1.5-8B, and Dream-7B), seven benchmarks spanning code, math, science and planning, against eight decoding baselines and autoregressive references. The table below shows the two LLaDA backbones; full Dream-7B results are reported in the paper.

Main results — Uncode is near-best across the board

Method	HumanEval	MBPP	GSM8K	MATH500	GPQA	Countdown	Sudoku	Avg.
Autoregressive LLMs reference
LLaMA-3.1-8B	53.1	56.7	83.9	23.8	31.0	27.0	0.0	39.4
Mistral-7B	43.9	37.0	49.4	7.2	28.1	22.7	0.0	26.9
Qwen-2.5-7B	78.1	62.8	71.9	64.2	32.8	7.7	0.0	45.3
LLaDA-8B-Instruct
Uniform	15.2	24.6	48.8	15.0	29.0	14.4	2.2	21.3
Confidence	27.4	42.4	59.1	20.8	27.9	34.0	23.8	33.6
Entropy	28.1	42.2	60.9	11.2	28.4	33.8	1.6	29.4
Margin	32.3	42.4	58.3	19.8	28.4	33.9	26.6	34.5
EB-Sampler	26.8	43.3	61.2	11.6	29.5	34.1	24.2	33.0
Semi-AR	39.0	45.2	77.9	27.6	27.7	32.6	0.0	35.7
Fast-dLLM	35.4	44.7	78.2	28.4	28.6	23.6	0.0	34.1
Uncode	42.1	47.8	79.2	34.8	29.2	36.3	29.8	42.7
LLaDA-1.5-8B
Uniform	17.7	23.0	52.7	20.0	28.1	15.8	3.4	23.0
Confidence	28.1	43.3	60.7	22.8	28.7	33.8	24.8	34.6
Entropy	32.9	44.0	60.3	11.2	26.6	34.7	0.2	30.0
Margin	25.0	43.3	57.5	23.2	28.4	31.8	33.6	34.7
EB-Sampler	32.9	43.6	61.1	13.4	26.6	34.6	0.2	30.3
Semi-AR	39.6	46.8	80.7	34.2	26.1	32.4	0.0	37.1
Fast-dLLM	37.2	46.1	80.8	31.2	27.9	23.6	0.0	36.7
Uncode	46.3	49.9	82.2	37.4	28.8	35.0	33.4	44.7

Pass@1 for code, accuracy elsewhere. Bold = best within each model group; shaded rows are Uncode. Uncode wins nearly every column on both MDM backbones, and LLaDA-1.5 + Uncode (44.7) is comparable to the much-tuned autoregressive Qwen-2.5-7B (45.3).

Ablation — both priors matter

Ablation bar chart: removing the positional prior or the semantic prior both reduce average performance on LLaDA-8B-Instruct and LLaDA-1.5-8B. — Removing the **positional trajectory prior** causes the largest drop (42.7 → 34.9 and 44.7 → 35.5); the **semantic informativeness prior** adds a further +2.3 / +2.7. Both are necessary.

Plug-and-play with efficient samplers

Uncode is an unmasking policy, so it drops into existing fast decoders — improving quality by over 3% on average while preserving their 2×–4× speedups.

Sampler	HumanEval	MBPP	GSM8K	MATH500	GPQA	Countdown	Sudoku	Avg.	Speedup
τ-leaping	17.7	38.9	55.2	16.2	28.9	32.1	3.2	27.5	2.14×
+ Uncode	22.6	40.5	75.4	30.8	29.5	28.4	25.4	36.1	1.99×
EB-Sampler	26.8	43.3	61.2	11.6	29.5	34.1	24.2	33.0	2.32×
+ Uncode	41.5	46.6	79.3	35.2	28.4	36.2	25.6	41.8	2.28×
Fast-dLLM	35.0	45.9	77.4	25.2	27.8	24.4	0.0	33.9	4.21×
+ Uncode	36.0	48.2	77.8	29.2	28.4	36.3	0.6	36.7	3.99×

Results on LLaDA-8B-Instruct. Speedup is relative to vanilla full-step decoding.

ACL 2026 Poster

BibTeX

@inproceedings{huang-etal-2026-empirical,
    title = "Empirical Analysis of Decoding Biases in Masked Diffusion Models",
    author = "Huang, Pengcheng  and Liu, Tianming  and Liu, Zhenghao  and Yan, Yukun  and Wang, Shuo  and Xiao, Tong  and Chen, Zulong  and Sun, Maosong",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    year = "2026",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.311/",
    pages = "6853--6876",
    ISBN = "979-8-89176-390-6",
}

Empirical Analysis of Decoding Biasesin Masked Diffusion Models