
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

1Cornell        2Cohere        3Stanford
ICLR 2025 Oral
Paper · Code · HuggingFace

Autoregression: ✅ High quality ✅ Arbitrary-length ✅ KV caching ❌ Not parallelizable

Diffusion: ❌ Lower quality ❌ Fixed-length ❌ No KV caching ✅ Parallelizable

Block Diffusion: ✅ High quality ✅ Arbitrary-length ✅ KV caching ✅ Parallelizable

Abstract

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences.

Autoregressive vs. Diffusion Language Models

In the language modeling task, we have a sequence of \( L \) tokens \( \mathbf{x} = (\mathbf{x}^1, \dots, \mathbf{x}^L ) \) drawn from the data distribution \( q(\mathbf{x}) \). We aim to fit a model \( p_\theta(\mathbf{x}) \) of \( q \).

Autoregressive models define a factorized distribution of the form:

\[ \log p_\theta(\mathbf{x}) = \sum_{\ell=1}^L \log p_\theta(\mathbf{x}^\ell \mid \mathbf{x}^{\lt \ell}) \]

However, the sequential dependencies between tokens mean that AR sampling requires \( L \) sequential steps, which may be slow for long sequences.
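
For concreteness, a minimal PyTorch sketch (illustrative, not the released implementation) of this objective: the AR log-likelihood is a sum of per-token cross-entropy terms under teacher forcing, where `logits` stands in for the output of any causal language model.

    import torch
    import torch.nn.functional as F

    def ar_nll(logits, x):
        # Autoregressive NLL: sum_l -log p_theta(x^l | x^{<l}).
        # logits: (L, V) next-token logits from a causal model under teacher
        # forcing, so logits[l] conditions only on x[:l]; x: (L,) token ids.
        return F.cross_entropy(logits, x, reduction="sum")

    # Toy usage with random logits standing in for a causal LM's output.
    L, V = 8, 50
    print(ar_nll(torch.randn(L, V), torch.randint(V, (L,))))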

Diffusion models overcome this limitation by modeling tokens independently, which admits parallel generation. They instead fit a model to undo a forward corruption process \( q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \text{Cat}(\mathbf{x}_t ; Q_t \mathbf{x}_{t-1} ) \) defined by transition matrices \( Q_t \). D3PM (Austin et al.) parameterizes the reverse process as

\[ p_\theta(\mathbf{x}_s | \mathbf{x}_t) = \prod_{\ell=1}^L p_\theta (\mathbf{x}_s^\ell | \mathbf{x}_t), \qquad p_\theta (\mathbf{x}_s^\ell | \mathbf{x}_t) = \sum_{\mathbf{x}^\ell} q(\mathbf{x}_s^{\ell} | \mathbf{x}_t^\ell, \mathbf{x}^\ell)\, p_\theta(\mathbf{x}^{\ell} | \mathbf{x}_t) \]

where the denoising base model \( p_\theta(\mathbf{x}^\ell | \mathbf{x}_t) \) predicts the clean token \( \mathbf{x}^\ell \) given the noised sequence \( \mathbf{x}_t \). However, the diffusion objective only minimizes a bound on the likelihood. As a result, diffusion models lag behind AR models in likelihood and sample quality. Furthermore, diffusion models are restricted to generating fixed-length sequences.
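
The formulas above allow general transition matrices \( Q_t \). For intuition, here is a minimal sketch of the masking (absorbing-state) special case used by the masked diffusion models referenced below, assuming a linear schedule \( \alpha_t = 1 - t \) and a reserved mask id (both assumptions for illustration, not taken from this page):

    import torch

    V = 50        # toy vocabulary size
    MASK = V      # hypothetical id reserved for the absorbing [MASK] state

    def forward_corrupt(x, t):
        # Marginal corruption q(x_t | x) under alpha_t = 1 - t:
        # each clean token is independently replaced by MASK with probability t.
        drop = torch.rand(x.shape) < t
        return torch.where(drop, torch.full_like(x, MASK), x)

    def reverse_step_probs(x0_probs, s, t):
        # p_theta(x_s^l | x_t) for a currently-masked position (s < t), marginalizing
        # over the denoiser's clean-token prediction x0_probs, shape (..., V + 1)
        # with no mass on MASK: stay masked w.p. s/t, otherwise unmask to a clean token.
        probs = (1 - s / t) * x0_probs
        probs[..., MASK] = s / t
        return probs            # positions that are already unmasked stay fixed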

BD3-LMs: Block Discrete Denoising Diffusion Language Models

We combine modeling paradigms to enjoy better likelihoods & flexible-length generation from autoregressive models, as well as fast & parallel generation from diffusion models.

Block Diffusion Likelihood

We propose a modeling framework that autoregressively models blocks of tokens and performs diffusion within each block. Our likelihood factorizes over \( B \) blocks of length \( L' \) as

\[ \log p_\theta (\mathbf{x}) = \sum_{b=1}^B \log p_\theta (\mathbf{x}^b | \mathbf{x}^{\lt b}) \]

Each \( p_\theta (\mathbf{x}^b | \mathbf{x}^{\lt b}) \) is modeled with a discrete diffusion ELBO over a block of \( L' \) tokens. We obtain a principled learning objective \( \mathcal{L}_\text{BD}(\mathbf{x}, \theta) \) by optimizing the following likelihood bound:

\[ \log p_\theta(\mathbf{x}) \geq \mathcal{L}_\text{BD}(\mathbf{x}, \theta) := \sum_{b=1}^{B} \mathcal{L}_{\text{diffusion}}(\mathbf{x}^b, \mathbf{x}^{\lt b}, \theta) \]

We model the per-block likelihood under a simple discrete diffusion parameterization (Sahoo et al., Shi et al., Ou et al.). Our final objective becomes a sum of weighted cross-entropy terms:

\[ \mathcal{L}_\text{BD}(\mathbf{x}, \theta) := - \sum_{b=1}^{B} \mathbb{E}_{t \sim \mathcal{U}[0, 1]}\, \mathbb{E}_{q} \left[ \frac{1}{t} \log p_\theta(\mathbf{x}^b | \mathbf{x}_{t}^b, \mathbf{x}^{\lt b}) \right] \]
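
To make the objective concrete, a hedged sketch of one Monte-Carlo estimate of \( \mathcal{L}_\text{BD} \) under the masking parameterization. The `denoiser` interface and per-block mask-rate sampling are illustrative assumptions; the released implementation additionally uses the vectorized two-pass scheme described next.

    import torch
    import torch.nn.functional as F

    V = 50
    MASK = V   # hypothetical absorbing-token id, as in the sketch above

    def bd3lm_loss(denoiser, x, block_size, t_min=1e-3, t_max=1.0):
        # One Monte-Carlo sample of L_BD(x, theta) for a single sequence x: (L,).
        # denoiser(x_noisy) -> (L, V + 1) clean-token logits; within block b it may
        # attend to clean tokens x^{<b} and noisy tokens x_t^b (block-causal mask).
        # t_min > 0 avoids a divide-by-zero in the 1/t weight.
        L = x.shape[0]
        x_noisy, t_tok = x.clone(), torch.empty(L)
        for start in range(0, L, block_size):
            end = min(start + block_size, L)
            t = torch.empty(1).uniform_(t_min, t_max)   # one mask rate per block
            drop = torch.rand(end - start) < t
            x_noisy[start:end] = torch.where(drop, torch.full_like(x[start:end], MASK), x[start:end])
            t_tok[start:end] = t
        nll = F.cross_entropy(denoiser(x_noisy), x, reduction="none")   # (L,)
        masked = (x_noisy == MASK).float()
        return (masked / t_tok * nll).sum()    # 1/t-weighted CE on masked tokens

    # Toy usage: a random-logit stand-in for the denoising network.
    x = torch.randint(V, (32,))
    print(bd3lm_loss(lambda x_noisy: torch.randn(x_noisy.shape[0], V + 1), x, block_size=4))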

Efficient Training & Sampling Algorithms

Naively, we would compute the logits by applying \( \mathbf{x}_\theta^b( \mathbf{x}_t^b, \mathbf{K}^{1:b\text{-}1}, \mathbf{V}^{1:b\text{-}1}) \) in a loop over the \( B \) blocks. Instead, we only require two forward passes. The first pass precomputes keys and values \( \mathbf{K}^{1:B}, \mathbf{V}^{1:B} \) for the full clean sequence \( \mathbf{x}\). The second pass computes denoised predictions for all blocks simultaneously using \( \mathbf{x}_\theta^b( \mathbf{x}_t^b, \mathbf{K}^{1:b\text{-}1}, \mathbf{V}^{1:b\text{-}1}) \).
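
The core of this trick is the attention pattern: queries from the noisy tokens attend to the precomputed clean keys/values of strictly earlier blocks and, bidirectionally, to the noisy keys/values of their own block. Below is a single-head, single-layer sketch of just that mask structure (the actual model is a full transformer; shapes and names are illustrative, not the authors' API).

    import torch
    import torch.nn.functional as F

    def block_diffusion_attention(q_noisy, k_clean, v_clean, k_noisy, v_noisy, block_size):
        # q_noisy: queries from the noisy sequence x_t, shape (L, d).
        # k_clean, v_clean: keys/values precomputed from clean x in the first pass.
        # k_noisy, v_noisy: keys/values computed from x_t in the second pass.
        L, d = q_noisy.shape
        blk = torch.arange(L) // block_size
        attend_clean = blk[:, None] > blk[None, :]     # clean context of earlier blocks
        attend_noisy = blk[:, None] == blk[None, :]    # bidirectional within own block
        k = torch.cat([k_clean, k_noisy], dim=0)                  # (2L, d)
        v = torch.cat([v_clean, v_noisy], dim=0)
        mask = torch.cat([attend_clean, attend_noisy], dim=1)     # (L, 2L)
        scores = (q_noisy @ k.T) / d ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v                      # (L, d)

    # Toy usage: one forward over both streams yields predictions for all blocks at once.
    L, d = 16, 8
    out = block_diffusion_attention(*(torch.randn(L, d) for _ in range(5)), block_size=4)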

To sample from BD3-LMs, we generate one block at a time, conditioned on previously sampled blocks. After generating a block, we cache its keys and values, similar to AR. We may use any diffusion sampling procedure \( \text{SAMPLE} ( \mathbf{x}_\theta^b, \mathbf{K}^{1:b\text{-}1}, \mathbf{V}^{1:b\text{-}1}) \) to sample from the conditional distribution \( p_\theta (\mathbf{x}_s^b | \mathbf{x}_t^b, \mathbf{x}^{ < b}) \) over \( T\) sampling steps per block.
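
A toy sketch of this sampling loop, with a random-logit stand-in for \( \mathbf{x}_\theta^b \) and a simple ancestral unmasking sampler under a linear schedule (any diffusion sampler could play the role of SAMPLE; the KV cache is only hinted at via the `context` argument):

    import torch

    V, MASK, BLOCK, T = 50, 50, 4, 8   # toy vocab, mask id, block length L', steps per block

    def toy_denoiser(x_noisy, context):
        # Stand-in for x_theta^b(x_t^b, K^{1:b-1}, V^{1:b-1}): a real model would read
        # cached keys/values for `context` instead of re-encoding it at every step.
        return torch.randn(x_noisy.shape[0], V)      # clean-token logits

    def sample_block(context):
        x = torch.full((BLOCK,), MASK)               # start fully masked
        for step in range(T, 0, -1):
            t, s = step / T, (step - 1) / T
            x0 = torch.distributions.Categorical(logits=toy_denoiser(x, context)).sample()
            unmask = (x == MASK) & (torch.rand(BLOCK) < 1 - s / t)   # linear schedule
            x = torch.where(unmask, x0, x)           # s = 0 on the final step unmasks the rest
        return x

    def sample(num_blocks):
        seq = torch.empty(0, dtype=torch.long)
        for _ in range(num_blocks):                  # arbitrary length: keep appending blocks
            seq = torch.cat([seq, sample_block(seq)])
        return seq

    print(sample(num_blocks=3))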

BD3-LM training and sampling algorithms.

Understanding Likelihood Gaps Between Diffusion and AR Models

Case Study: Single Token Generation

Our block diffusion parameterization is equivalent in expectation to the autoregressive NLL in the limiting case \( L'=1 \). Surprisingly, we find a two-point perplexity gap between our block diffusion model with \( L'=1 \) and AR when training both models on the LM1B dataset. We identify the high training variance of the diffusion objective as the cause of this perplexity gap.
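
To see why the two objectives match in expectation (a sketch, assuming the masking parameterization with a linear schedule): for a single-token block, the token is masked with probability \( t \), a masked input carries no information beyond the context, and the \( 1/t \) weight exactly cancels that probability,

\[ \mathbb{E}_{t \sim \mathcal{U}[0,1]}\, \mathbb{E}_{q} \left[ -\frac{1}{t}\, \mathbf{1}\{\mathbf{x}_t^b = \text{MASK}\} \log p_\theta(\mathbf{x}^b \mid \mathbf{x}_t^b, \mathbf{x}^{\lt b}) \right] = \mathbb{E}_{t} \left[ -\frac{1}{t} \cdot t \cdot \log p_\theta(\mathbf{x}^b \mid \mathbf{x}^{\lt b}) \right] = -\log p_\theta(\mathbf{x}^b \mid \mathbf{x}^{\lt b}) \]

A single Monte-Carlo draw of \( t \), however, carries the weight \( 1/t \), which blows up as \( t \to 0 \); this is one source of the extra estimator variance discussed next.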

Training under the discrete diffusion ELBO suffers from high variance.

Diffusion Gap from High Variance Training

Intuitively, if the sampled masking rate \( t \sim \mathcal{U}[0, 1] \) is too low, reconstructing \( \mathbf{x} \) is easy and provides little learning signal. If we mask everything, the optimal reconstruction is simply the marginal distribution of each token under the data distribution, which is easy to learn and again provides little signal.

We seek to find noise schedules that minimize training variance caused by the diffusion objective and further reduce the perplexity gap.

Data-Driven Noise Schedules for Low-Variance Training

To avoid masking rates that cause high-variance training, we train BD3-LMs under "clipped" masking rates \( t \sim \mathcal{U}[\beta, \omega] \) for \( 0 \leq \beta, \omega \leq 1 \). By reducing the training variance, we improve likelihoods when we evaluate under uniformly sampled mask rates.

As the optimal mask rates may differ depending on the block size \(L'\), we adaptively learn \( \beta, \omega \) during training. In practice, we do so with a grid search at every validation step (every 5K gradient updates), optimizing \(\min_{\beta, \omega} \text{Var}_{\mathbf{X}, t} \left[ \mathcal{L}_{\text{BD}}(\theta, \beta, \omega; \mathbf{X}) \right] \).
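
A hedged sketch of that grid search, assuming a `loss_fn(batch, beta, omega)` that returns one Monte-Carlo estimate of \( \mathcal{L}_\text{BD} \) under clipped mask rates (e.g. the `bd3lm_loss` sketch above); the candidate grid and sample count are illustrative.

    import torch

    def pick_clip_rates(loss_fn, batch, candidates, num_samples=16):
        # Choose (beta, omega) minimizing the empirical variance of the loss
        # estimator on held-out data; evaluation still uses t ~ U[0, 1].
        best, best_var = None, float("inf")
        for beta, omega in candidates:
            losses = torch.stack([loss_fn(batch, beta, omega) for _ in range(num_samples)])
            var = losses.var().item()
            if var < best_var:
                best, best_var = (beta, omega), var
        return best

    # Illustrative candidate grid, matching the ranges compared in the table below.
    grid = [(0.0, 1.0), (0.3, 0.8), (0.45, 0.95)]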

Below, we show that optimizing the noise schedule per block size reduces the variance of the loss estimator and achieves the best perplexities compared to alternative schedules.

Effect of training under different noise schedules on perplexity (PPL ↓) on LM1B. All models are finetuned for 50K steps and are evaluated under uniformly sampled mask rates. For our clipped schedules, we compare the optimized clipping rates for \( L'=4, 16 \).
BD3-LMs | Noise schedule | PPL | Var. ELBO
L' = 4 | Linear, t ~ U[0, 1] | 30.18 | 23.45
L' = 4 | Clipped, t ~ U[0.45, 0.95] | 29.21 | 6.24
L' = 4 | Clipped, t ~ U[0.3, 0.8] | 29.38 | 10.33
L' = 4 | Logarithmic | 30.36 | 23.53
L' = 16 | Linear, t ~ U[0, 1] | 31.72 | 7.62
L' = 16 | Clipped, t ~ U[0.45, 0.95] | 31.42 | 3.60
L' = 16 | Clipped, t ~ U[0.3, 0.8] | 31.12 | 3.58
L' = 16 | Cosine | 31.41 | 13.00

Results

Likelihood Evaluation

BD3-LMs achieve state-of-the-art likelihoods among diffusion models. As shown below, BD3-LMs interpolate between diffusion and autoregressive likelihoods by tuning the block length \( L' \).

Test perplexities (PPL; ↓) on OWT for models trained for 262B tokens.
Model | PPL (↓)
AR | 17.54
SEDD | ≤ 24.10
MDLM | ≤ 22.98
BD3-LMs (L' = 16) | ≤ 22.27
BD3-LMs (L' = 8) | ≤ 21.68
BD3-LMs (L' = 4) | ≤ 20.73

Arbitrary-length sequence generation

One key drawback of many existing diffusion language models is that they cannot generate full-length documents longer than the output context chosen at training time. For example, OpenWebText contains documents of up to 131K tokens, whereas the discrete diffusion model SEDD (Lou et al.) is restricted to generating 1024 tokens. Below, we show that BD3-LMs can generate variable-length documents by decoding an arbitrary number of blocks.

Generation length statistics from sampling 500 documents from models trained on OWT.
Model | Median # tokens | Max # tokens
OWT train set | 717 | 131K
AR | 4008 | 131K
SEDD | 1021 | 1024
BD3-LM (L' = 16) | 798 | 9982

We assess the generation quality of BD3-LMs on variable-length sequences, comparing all methods using the same number of generation steps (NFEs). Below, we measure the generative perplexity of sampled sequences under GPT2-Large. BD3-LMs achieve the best generative perplexities compared to all previous diffusion methods.

Generative perplexity (Gen. PPL; ↓) and number of function evaluations (NFEs; ↓) of 300 variable-length samples. All models are trained on OWT with a context length L = 1024 and use nucleus sampling.
Category | Model | Gen. PPL (↓), L = 1024 | NFEs, L = 1024 | Gen. PPL (↓), L = 2048 | NFEs, L = 2048
Autoregressive | AR | 14.1 | 1K | 13.2 | 2K
Diffusion | SEDD | 52.0 | 1K | - | -
Diffusion | MDLM | 46.8 | 1K | 41.3 | 2K
Block diffusion | SSD-LM (L' = 25) | 37.2 | 40K | 35.3 | 80K
Block diffusion | SSD-LM (L' = 25) | 281.3 | 1K | 261.9 | 2K
Block diffusion | BD3-LMs (L' = 16) | 33.4 | 1K | 31.5 | 2K
Block diffusion | BD3-LMs (L' = 8) | 30.4 | 1K | 28.2 | 2K
Block diffusion | BD3-LMs (L' = 4) | 25.7 | 1K | 23.6 | 2K

For MDLM at L = 2048, we use its block-wise decoding technique, which does not feature block diffusion training as in BD3-LMs. We also compare to SSD-LM (Han et al.), an alternative block-autoregressive method (also known as semi-autoregression) that performs Gaussian diffusion over word embeddings but cannot perform likelihood estimation. Our discrete approach yields samples with improved generative perplexity using an order of magnitude fewer generation steps.

Conclusion

We presented Block Discrete Denoising Diffusion Language Models (BD3-LMs), a new model class that combines the strengths of autoregressive and diffusion approaches while overcoming their limitations. Block diffusion addresses key drawbacks of existing discrete diffusion models: the quality gap to AR models and their inability to generate arbitrary-length sequences or support KV caching. In doing so, BD3-LMs set a new state-of-the-art among discrete diffusion models. Our work is a promising step toward building powerful diffusion language models that are competitive with standard LLMs while offering parallel token generation and improved controllability of samples.

BibTeX


        @inproceedings{
          arriola2025interpolating,
          title={Interpolating Autoregressive and Discrete Denoising Diffusion Language Models},
          author={Marianne Arriola and Aaron Gokaslan and Justin T Chiu and Jiaqi Han and Zhihan Yang and Zhixuan Qi and Subham Sekhar Sahoo and Volodymyr Kuleshov},
          booktitle={The Thirteenth International Conference on Learning Representations},
          year={2025},
          url={https://openreview.net/forum?id=tyEyYT267x}
        }