Scaling up Masked Diffusion Models on Text

Shen Nie1,2∗, Fengqi Zhu1,2, Chao Du3‡, Tianyu Pang3, Qian Liu3, Guangtao Zeng4
 Min Lin3, Chongxuan Li1,2‡†
1Gaoling School of Artificial Intelligence, Renmin University of China
2Beijing Key Laboratory of Big Data Management and Analysis Methods
3Sea AI Lab, Singapore     4Singapore University of Technology and Design
{nieshen, fengqizhu}@ruc.edu.cn; {duchao, tianyupang, liuqian}@sea.com;
zengguangtao98@gmail.com; linmin@sea.com; chongxuanli@ruc.edu.cn
∗Work done during Shen Nie’s internship at Sea AI Lab. ‡Project leaders. †Correspondence to Chongxuan Li.
Abstract

Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In text generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.4 times faster, or achieve higher quality than ARMs at a higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the reverse curse encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B). Our code is available at https://github.com/ML-GSAI/SMDM.

Figure 1: IsoFLOP curves plot optimal model sizes under fixed computation budgets. Panels: (a) ARMs; (b) MDMs. The optimal MDM validation loss exhibits power-law scaling, decreasing at a rate comparable to that of ARMs.

1 Introduction

Autoregressive models (ARMs) have long been regarded as the gold standard in probabilistic language modeling. Their ability to predict the next token, grounded in the chain rule, naturally aligns with the sequential nature of language and scales effectively (Radford, 2018; Radford et al., 2019; Brown, 2020; OpenAI, 2022; Achiam et al., 2023; Touvron et al., 2023a; b; Dubey et al., 2024) when integrated with Transformers (Vaswani, 2017). However, ARMs exhibit inherent limitations, particularly in reasoning tasks that require bidirectional context understanding or handling temporal shifts in data. These shortcomings, widely recognized as the reverse curse (Berglund et al., 2023) and temporal quality degradation (Vela et al., 2022), significantly hinder their applicability in complex language modeling scenarios. Additionally, their linear sampling time growth w.r.t. the output length poses practical challenges for long text generation.

The limitations of ARMs have sparked interest in an alternative approach: masked diffusion models (MDMs) (Austin et al., 2021; Hoogeboom et al., 2021b; a; He et al., 2022; Campbell et al., 2022; Meng et al., 2022; Sun et al., 2022; Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024). MDMs present a promising alternative due to their unique probabilistic framework, which enables flexible bidirectional context modeling by filling in masked positions across a sequence. Recent advances (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024) have shown promise in unconditional text generation and zero-shot perplexity evaluation. Despite recent progress, the scalability of MDMs and their effectiveness in critical language tasks, such as conditional generation and language understanding, remain open questions. Furthermore, it is still unclear whether MDMs can address the inherent limitations of ARMs, such as improving bidirectional reasoning capabilities.

This paper challenges the longstanding dominance of ARMs by presenting a comprehensive study of MDMs regarding key factors in language models: scalability, capabilities for language understanding, and conditional generation. To achieve this, we train a family of MDMs with up to 1.1 billion (B) parameters on a large-scale dataset and establish the first scaling law for MDMs. Building on their unique probabilistic framework, we propose a simple yet effective unsupervised classifier-free guidance (CFG) mechanism that exploits unsupervised data to enhance inference performance in language tasks involving conditional distributions. Notably, unsupervised CFG does not rely on paired data as standard CFG (Ho & Salimans, 2022) does, but it can still benefit from paired data when available, achieving performance that surpasses standard CFG. Supported by the scaling law and unsupervised CFG, our extensive experiments yield the following key findings:

  • Strong scalability. As the IsoFLOP analysis (Hoffmann et al., 2022) scales the compute budget from $6\times 10^{18}$ to $10^{20}$ FLOPs (see Fig. 1), the optimal validation loss of MDMs decreases according to a power law, with a rate matching that of ARMs (see Fig. 2). MDMs maintain a constant compute gap of 16 times relative to ARMs, but this gap is smaller than the factor of 64 observed in continuous diffusion models (Gulrajani & Hashimoto, 2024) and may be narrowed further with future optimizations.

  • Competitive in zero-shot language understanding. Across eight standard zero-shot benchmarks spanning tasks such as commonsense reasoning and reading comprehension, MDMs outperform not only a same-sized ARM with the same pre-training FLOPs but also a larger 1.5B GPT-2 model on four tasks. Furthermore, when scaled up with 16 times more pre-training time, as suggested by the scaling law, MDMs consistently surpass ARMs across all tasks.

  • Flexible trade-off in conditional generation. On the standard MT-Bench, a 1.1B MDM matches the performance of a same-sized ARM while achieving a 1.4 times speedup in sampling time. By increasing sampling steps, MDMs can further improve generation quality at the cost of being 1.4 times slower. Notably, the ARMs are equipped with KV-cache, a technique that speeds up sequential sampling, whereas the MDMs use no system-level optimization.

  • Addressing challenging tasks for ARMs. MDMs effectively relieve temporal quality degradation (Vela et al., 2022) compared to a same-sized ARM and successfully overcome the reverse curse (Berglund et al., 2023) encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B).

2 Masked Diffusion Models on Text

In analogy to continuous diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), MDMs (Austin et al., 2021; Lou et al., 2023; Ou et al., 2024) also introduce a forward process that gradually adds noise to the data and learn a corresponding reverse process to generate samples. Our basic approach is built upon Ou et al. (2024), an advanced MDM suitable for scaling.

Forward process. Let $K$ and $L$ denote the vocabulary size and sentence length, respectively. Given a sentence $\bm{x}_0\in\{0,1,\dots,K-1\}^{L}$ and a noise level $t\in[0,1]$, the forward process in MDMs randomly and independently masks out tokens in the sentence, formulated as follows:

$$q_{t|0}(\bm{x}_t|\bm{x}_0)=\prod_{i=0}^{L-1}q_{t|0}(\bm{x}_t^i|\bm{x}_0^i)\quad\text{and}\quad q_{t|0}(\bm{x}_t^i|\bm{x}_0^i)=\begin{cases}\alpha_t,&\bm{x}_t^i=\bm{x}_0^i,\\ 1-\alpha_t,&\bm{x}_t^i=m,\end{cases}\tag{1}$$

where $\bm{x}^i$ denotes the $i$-th element of $\bm{x}$, $m$ denotes the mask token (Devlin, 2018), $\bm{x}_t$ denotes the noisy data at time $t$, and $q_0(\cdot)$ is the data distribution $p_{\text{data}}(\cdot)$. We set the hyperparameter $\alpha_t$ to $1-t$ for the best empirical performance, as suggested in previous work (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024).
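The forward process in Eq. (1) with $\alpha_t = 1-t$ can be illustrated with a minimal sketch. The mask id `MASK = -1` and the helper name `forward_mask` are hypothetical choices made for this illustration and are not taken from the released code.

```python
import random

MASK = -1  # hypothetical id for the mask token m, chosen outside {0, ..., K-1}


def forward_mask(x0, t, rng):
    """Sample x_t from q_{t|0}(. | x0) in Eq. (1) with alpha_t = 1 - t.

    Each token is kept with probability 1 - t and replaced by MASK with
    probability t, independently across positions.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in x0]


rng = random.Random(0)
x0 = [3, 5, 7, 2]
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(x0, t, rng))
```

At $t=0$ the sentence is returned unchanged and at $t=1$ every position is masked, which matches the two endpoints of the noise schedule.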

Reverse process. The reverse process in MDMs iteratively recovers values for masked tokens, starting from a fully masked sequence $\bm{x}_1$. For $0\leq s<t\leq 1$, the reverse process is characterized by

$$q_{s|t}(\bm{x}_s|\bm{x}_t)=\prod_{i=0}^{L-1}q_{s|t}(\bm{x}_s^i|\bm{x}_t)\quad\text{and}\quad q_{s|t}(\bm{x}_s^i|\bm{x}_t)=\begin{cases}1,&\bm{x}_t^i\neq m,\ \bm{x}_s^i=\bm{x}_t^i,\\ \frac{s}{t},&\bm{x}_t^i=m,\ \bm{x}_s^i=m,\\ \frac{t-s}{t}q_{0|t}(\bm{x}_s^i|\bm{x}_t),&\bm{x}_t^i=m,\ \bm{x}_s^i\neq m,\\ 0,&\text{otherwise}.\end{cases}\tag{2}$$

Here $q_{0|t}(\cdot|\cdot)$ is the data prediction model (Ho et al., 2020) to be learned. Notably, Ou et al. (2024) revealed an intrinsic property of MDMs: $q_{0|t}(\cdot|\cdot)$ can be represented by conditional distributions on clean data $p_{\text{data}}(\cdot|\cdot)$ independently of the time $t$, unlike in other diffusion models. Formally,

$$q_{0|t}(\bm{x}_0^i|\bm{x}_t)=p_{\text{data}}(\bm{x}_0^i|\bm{x}_t^{\text{UM}}),\tag{3}$$

where $\bm{x}_t^{\text{UM}}$ collects all unmasked tokens in the noisy data $\bm{x}_t$ and $p_{\text{data}}(\cdot|\cdot)$ does not depend on $t$. (Footnote 1: For example, if $\bm{x}_t=[3,5,m,2]$, then $\bm{x}_t^{\text{UM}}=[3,5,\cdot,2]$ and $p_{\text{data}}(\cdot|[3,5,\cdot,2])$ does not depend on $t$.)
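One step of the reverse process in Eq. (2) can be sketched as follows. The callable `predict` is a hypothetical stand-in for the learned data prediction model $q_{0|t}$; in line with Eq. (3), it receives only the partially masked sequence and no timestep. The uniform predictor, the mask id, and the equally spaced time grid are assumptions made purely for illustration.

```python
import random

MASK = -1  # hypothetical id for the mask token m


def reverse_step(xt, s, t, predict, rng):
    """Sample x_s from q_{s|t}(. | x_t) in Eq. (2), for 0 <= s < t <= 1.

    predict(xt, i) returns a list of K probabilities for position i and
    stands in for the data prediction model q_{0|t}.
    """
    if not 0.0 <= s < t <= 1.0:
        raise ValueError("require 0 <= s < t <= 1")
    xs = list(xt)
    for i, tok in enumerate(xt):
        if tok != MASK:
            continue  # unmasked tokens are copied with probability 1
        if rng.random() < s / t:
            continue  # the position stays masked with probability s/t
        probs = predict(xt, i)  # otherwise draw a token from q_{0|t}
        xs[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return xs


def sample(length, num_steps, predict, rng):
    """Run the reverse process from the all-mask sequence x_1 down to x_0."""
    x = [MASK] * length
    for k in range(num_steps, 0, -1):
        x = reverse_step(x, (k - 1) / num_steps, k / num_steps, predict, rng)
    return x


def uniform_predict(xt, i, vocab_size=4):
    """Toy predictor that ignores the context (illustration only)."""
    return [1.0 / vocab_size] * vocab_size


print(sample(8, 4, uniform_predict, random.Random(0)))
```

In the final step $s=0$, so every remaining masked position is filled in, and the returned sequence contains no mask tokens.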

Training objective. A distribution $p_{\bm{\theta}}(\bm{x}_0^i|\bm{x}_t)$ parameterized by $\bm{\theta}$ is employed to approximate $p_{\text{data}}(\bm{x}_0^i|\bm{x}_t^{\text{UM}})$, optimizing the following upper bound on the negative log-likelihood (Ou et al., 2024):

$$-\log p_{\bm{\theta}}(\bm{x}_0)\leq\int_0^1\frac{1}{t}\mathbb{E}_{q(\bm{x}_t|\bm{x}_0)}\left[\sum_{\{i|\bm{x}_t^i=m\}}-\log p_{\bm{\theta}}(\bm{x}_0^i|\bm{x}_t)\right]dt\triangleq\mathcal{L}.\tag{4}$$

We emphasize that the formulation is particularly suitable for scaling. First, it is among the best MDMs w.r.t. zero-shot perplexity (Ou et al., 2024). Second, it removes the timestep from input and minimally modifies the original Transformers (see Sec. 3). Third, it enables unsupervised classifier-free guidance, which does not rely on paired data yet is effective in language tasks (see Sec. 4).
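The bound in Eq. (4) is typically estimated by Monte Carlo: draw $t$ uniformly, mask the sequence with probability $t$ per token, and weight the masked-token cross-entropy by $1/t$. The sketch below follows that recipe under stated assumptions: `log_prob` is a hypothetical stand-in for $\log p_{\bm{\theta}}(\bm{x}_0^i|\bm{x}_t)$, and the uniform predictor is a toy model rather than a trained network.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token m


def masked_nll(x0, xt, log_prob):
    """Sum of -log p_theta(x0^i | x_t) over the positions i masked in x_t."""
    return sum(-log_prob(xt, i, x0[i]) for i, tok in enumerate(xt) if tok == MASK)


def mdm_loss_estimate(x0, log_prob, rng, num_samples=1):
    """Monte Carlo estimate of the bound in Eq. (4) for a single sequence x0."""
    total = 0.0
    for _ in range(num_samples):
        t = 1.0 - rng.random()  # uniform on (0, 1], so 1/t is always finite
        xt = [MASK if rng.random() < t else tok for tok in x0]
        total += masked_nll(x0, xt, log_prob) / t
    return total / num_samples


K = 8  # toy vocabulary size


def uniform_log_prob(xt, i, token):
    """Toy model assigning probability 1/K to every token (illustration only)."""
    return -math.log(K)


x0 = [3, 5, 7, 2]
print(mdm_loss_estimate(x0, uniform_log_prob, random.Random(0), num_samples=1000))
```

As a sanity check on the weighting, the expected number of masked tokens at noise level $t$ is $Lt$, which cancels the $1/t$ factor; for the uniform toy model the bound therefore equals $L\log K$ exactly, although individual Monte Carlo estimates can be noisy when $t$ is small.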

3 Scaling Laws for Masked Diffusion Models

Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) characterize the fundamental relationship between model performance and computational resources under constraints, significantly influencing the progress of large ARMs. We introduce the first scaling laws for MDMs and conduct a fair comparison with ARMs. Our results reveal the strong scalability of MDMs, highlighting their potential as a competitive alternative to ARMs in language modeling.

Model. We employ a Transformer decoder for ARMs and the corresponding Transformer encoder for MDMs (note that, by Eq. (3), it is unnecessary to input the timestep $t$). The differences between these architectures are: (1) the encoder has an additional dimension in its embedding layer for the mask token, and (2) the encoder’s self-attention does not use a causal mask. All other architectural settings (e.g., depth, hidden size, and number of heads) remain consistent in both models.

We further enhance both models with several techniques inspired by advanced language models like Llama (Touvron et al., 2023a; b). Specifically, we adopt Pre-LayerNorm with RMSNorm (Zhang & Sennrich, 2019) for better stability, use SwiGLU (Shazeer, 2020) as the activation function to enhance non-linearity, and implement RoPE (Su et al., 2024) for more expressive positional encoding.

Data. The well-known Chinchilla scaling law (Hoffmann et al., 2022) utilizes a large dataset with more data than the number of training tokens. Motivated by it, we employ the open-source SlimPajama dataset (Soboleva et al., 2023), a multi-corpora dataset comprising 627 billion tokens, which is sufficiently large for all of our experiments. For simplicity and fairness, we employ the Llama-2 tokenizer (Touvron et al., 2023b) for both ARMs and MDMs. Additionally, we set the context length to 2048. Further implementation details are provided in Appendix B.2.

IsoFLOP analysis. We conduct a standard IsoFLOP analysis (Hoffmann et al., 2022) to identify the optimal allocation between the non-embedding parameters $N$ and dataset size $D$. Specifically, building on prior studies (Kaplan et al., 2020; Hoffmann et al., 2022), we scale the compute budget $C$ from $6\times 10^{18}$ to $10^{20}$ FLOPs. For a fixed $C$, we train a series of models with varying values of $N$ and $D$ such that $C=6ND$. We fit a quadratic function to capture the relationship between the validation loss $\mathcal{L}$ and the logarithm of the parameter size $\log N$, where the loss $\mathcal{L}$ of MDMs is defined in Eq. (4). This regression allows us to determine the optimal parameter size $N^*_C$, which corresponds to the minimum validation loss $\mathcal{L}^*_C$ for a given compute budget $C$. The IsoFLOP analysis results are visualized in Fig. 1.
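The quadratic fit for one compute budget can be sketched as below. This is an illustrative least-squares implementation under stated assumptions, not the authors' fitting script: the helper names are hypothetical, and the (model size, loss) pairs are synthetic numbers generated from an exact parabola so that the recovered optimum can be checked by eye.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit y ~ a*u^2 + b*u + c with u = x - mean(x).

    Returns (a, b, c, mean). Centering keeps the 3x3 system well conditioned.
    """
    n = len(xs)
    mu = sum(xs) / n
    us = [x - mu for x in xs]
    s = [sum(u ** k for u in us) for k in range(5)]  # s[k] = sum of u^k
    rows = [
        [s[4], s[3], s[2], sum(y * u * u for u, y in zip(us, ys))],
        [s[3], s[2], s[1], sum(y * u for u, y in zip(us, ys))],
        [s[2], s[1], s[0], sum(ys)],
    ]
    # Gaussian elimination with partial pivoting on the normal equations.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[piv] = rows[piv], rows[col]
        for r in range(col + 1, 3):
            f = rows[r][col] / rows[col][col]
            for j in range(col, 4):
                rows[r][j] -= f * rows[col][j]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(rows[r][j] * coef[j] for j in range(r + 1, 3))
        coef[r] = (rows[r][3] - tail) / rows[r][r]
    return coef[0], coef[1], coef[2], mu


def isoflop_optimum(param_counts, losses):
    """Return (N*, L*): the vertex of the parabola fitted in log N."""
    a, b, c, mu = fit_quadratic([math.log(n) for n in param_counts], losses)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    return math.exp(mu - b / (2 * a)), c - b * b / (4 * a)


# Synthetic IsoFLOP curve (illustrative numbers only, not results from the paper).
C = 1e20
sizes = [5e7, 1e8, 2e8, 4e8, 8e8]
losses = [3.0 + 0.05 * (math.log(n) - math.log(2e8)) ** 2 for n in sizes]
n_star, loss_star = isoflop_optimum(sizes, losses)
d_star = C / (6 * n_star)  # token count implied by C = 6 N D
print(n_star, loss_star, d_star)
```

Repeating this fit for every budget yields the pairs $(C, \mathcal{L}^*_C)$ and $(C, N^*_C)$ that the scaling-law regression consumes.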

Scaling laws. After obtaining the optimal validation losses for the compute budgets in $\{C_0,C_1,\dots,C_{n-1}\}$, we fit the following scaling law to model the relationship between them:

$$\min_{\alpha,\beta}\sum_{i=0}^{n-1}\left(\log\mathcal{L}^*_{C_i}-\alpha\log C_i-\beta\right)^2.\tag{5}$$

Let $\alpha^*$ and $\beta^*$ denote the solution of Eq. (5); the validation loss then empirically follows $\mathcal{L}=e^{\beta^*}C^{\alpha^*}$.
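Eq. (5) is an ordinary least-squares line fit in log-log space, so it has a closed-form solution. The sketch below shows that computation; the function name is hypothetical, and the budgets and losses are synthetic values generated from a known power law (with an arbitrarily chosen exponent), not measurements from the paper.

```python
import math


def fit_power_law(budgets, losses):
    """Solve Eq. (5): least-squares fit of log L = alpha * log C + beta."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Synthetic optimal losses generated from L = e^beta * C^alpha (illustration only).
true_alpha, true_beta = -0.05, 3.0
budgets = [6e18, 1e19, 3e19, 6e19, 1e20]
losses = [math.exp(true_beta) * c ** true_alpha for c in budgets]
alpha_star, beta_star = fit_power_law(budgets, losses)
print(alpha_star, beta_star)
```

A negative $\alpha^*$ corresponds to a loss that decreases with compute, and $e^{\beta^*}$ sets the constant offset of the curve.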

Figure 2: Scaling laws for MDMs. Panels: (a) loss-FLOPs curve; (b) parameters-FLOPs curve. Compared to ARMs, MDMs demonstrate competitive scalability, with comparable scaling rates and similar scaling behavior in utilizing the parameter capacity.

As illustrated in Fig. 2(a), the validation loss of MDMs decreases according to a power law as the compute budget increases, following a rate similar to that of ARMs. However, MDMs require approximately 16 times more compute than ARMs to achieve comparable validation losses. There is still potential to narrow this constant, since optimizations for MDMs in model, data, and system remain unexplored. For reference, Gulrajani & Hashimoto (2024) reported that the constant factor between continuous diffusion models (CDMs) and ARMs is 64.
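When two model families follow $\mathcal{L}=e^{\beta}C^{\alpha}$ with the same exponent $\alpha$, the constant compute gap can be read off from the intercepts: solving $e^{\beta_{\text{MDM}}}(kC)^{\alpha}=e^{\beta_{\text{ARM}}}C^{\alpha}$ gives $k=\exp\big((\beta_{\text{ARM}}-\beta_{\text{MDM}})/\alpha\big)$. The sketch below evaluates this expression. The shared exponent is an idealizing assumption (the fitted rates are described as similar, not identical), and the coefficient values are hypothetical, chosen so that the gap comes out at 16.

```python
import math


def compute_gap(alpha, beta_arm, beta_mdm):
    """Factor k with L_mdm(k * C) == L_arm(C), assuming a shared exponent alpha."""
    if alpha >= 0:
        raise ValueError("alpha must be negative for a decreasing loss")
    return math.exp((beta_arm - beta_mdm) / alpha)


# Hypothetical coefficients: the MDM intercept is shifted so that k equals 16.
alpha = -0.05
beta_arm = 3.0
beta_mdm = beta_arm - alpha * math.log(16)
print(compute_gap(alpha, beta_arm, beta_mdm))
```

Under this idealization, the gap is independent of $C$, which is why it appears as a constant vertical offset between two parallel lines on a log-log plot.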

Furthermore, the optimal model size also follows a power-law relationship with the compute budget, as shown in Fig. 2(b). Notably, the optimal size of MDMs is approximately half that of ARMs across compute budgets, reflecting very similar scaling behavior in utilizing the parameter capacity.

In conclusion, the comparable scaling rates and the relatively small constant factors suggest that MDMs have strong scalability and promising potential as an alternative to ARMs on a large scale.

4 Unsupervised Classifier-free Guidance

We propose a surprisingly simple yet effective approach that leverages unlabeled data to boost performance in various language tasks, dubbed unsupervised classifier-free guidance (CFG).

CFG. CFG (Ho & Salimans, 2022) is an effective and versatile technique widely used in both continuous and discrete diffusion models, with applications spanning image (Ho & Salimans, 2022; Chang et al., 2023) and text generation (Lovelace et al., 2024). Rooted in Bayes’ rule, CFG simultaneously trains a conditional and an unconditional diffusion model, introducing a rescaled distribution for inference. Specifically, at a given timestep $t\in[0,1]$, CFG (Chang et al., 2023) is defined as:

$$\tilde{p}_{\bm{\theta}}(\bm{x}_0|\bm{c},\bm{x}_t)\propto\frac{p_{\bm{\theta}}(\bm{x}_0|\bm{c},\bm{x}_t)^{1+w}}{p_{\bm{\theta}}(\bm{x}_0|\bm{x}_t)^{w}},\tag{6}$$

where $\bm{c}$ is the condition, $w$ is a hyperparameter that flexibly controls the strength of $\bm{c}$, and $p_{\bm{\theta}}(\bm{x}_0|\bm{c},\bm{x}_t)$ and $p_{\bm{\theta}}(\bm{x}_0|\bm{x}_t)$ are the conditional and unconditional models, respectively.

Notably, the conditional model seemingly must be trained on paired data before CFG can be applied. Consequently, to the best of our knowledge, all existing work (Ho & Salimans, 2022; Chang et al., 2023; Lovelace et al., 2024) falls into the supervised setting, where paired data are readily available.

Unsupervised CFG. We extend CFG to an unsupervised setting by introducing a new formulation:

$$\tilde{p}_{\bm{\theta}}(\bm{x}_0|\bm{c},\bm{x}_t)\propto\frac{p_{\bm{\theta}}(\bm{x}_0|\bm{c},\bm{x}_t)^{1+w}}{p_{\bm{\theta}}(\bm{x}_0|\bm{m},\bm{x}_t)^{w}},\tag{7}$$

where $\bm{m}$ is a mask sequence of the same length as $\bm{c}$. Compared to Eq. (6), the dummy variable $\bm{m}$ casts the unconditional distribution in a conditional format without adding new information. For simplicity, we continue to refer to $p_{\bm{\theta}}(\bm{x}_0|\bm{m},\bm{x}_t)$ as the unconditional distribution in unsupervised CFG throughout this paper.

The core insight is that an MDM already characterizes both distributions employed in Eq. (7) during unsupervised pretraining. Specifically, in language tasks, both $\bm{c}$ and $\bm{x}$ can be viewed as segments of a whole sequence, following the same distribution as the unsupervised samples used for pretraining. (Footnote 2: E.g., the question “where does the sun rise?” and the answer “from the east.” form a paired sample, but their concatenation “where does the sun rise? from the east.” can be modeled by an MDM with unsupervised training.) After pretraining on large-scale text data, MDMs can capture the joint distribution of the whole sequence, i.e., $p_{\text{data}}(\bm{c},\bm{x})$. Under this formulation, MDMs simultaneously learn all conditional distributions on clean data induced by $p_{\text{data}}(\bm{c},\bm{x})$ according to Eq. (3). In particular, we have:

\[
p_{\bm{\theta}}(\bm{x}_0|\bm{c},\bm{x}_t) \approx p_{\text{data}}(\bm{x}_0|\bm{c},\bm{x}_t^{\text{UM}}) \quad\text{and}\quad p_{\bm{\theta}}(\bm{x}_0|\bm{m},\bm{x}_t) \approx p_{\text{data}}(\bm{x}_0|\bm{x}_t^{\text{UM}}), \tag{8}
\]

where both distributions are factorized as in Eq. (3), and the approximation error is due to the gap between the model distribution and the true data distribution. Notably, Eq. (8) also implies that the unconditional distribution $p_{\bm{\theta}}(\bm{x}_0|\bm{x}_t)$ used in standard CFG and the conditional distribution with a dummy variable $p_{\bm{\theta}}(\bm{x}_0|\bm{m},\bm{x}_t)$ share a similar role.

We have explained why unsupervised CFG works without paired data (see Sec. 5). Moreover, when paired data are available for downstream tasks, simply fine-tuning the conditional distribution in MDMs—similar to the classical approach used for ARMs—not only further improves the performance of unsupervised CFG but also outperforms the standard CFG trained on paired data, demonstrating its superior capability in leveraging large-scale unpaired data (see Sec. 6).

5 Zero-shot Language Understanding

We investigate the capabilities of MDMs in zero-shot language understanding, a critical skill for language models that has been largely overlooked in prior studies (Austin et al., 2021; Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024; Gat et al., 2024). Our results show that MDMs are highly competitive with ARMs of similar model size and compute.

Benchmarks. To provide a comprehensive evaluation, we assess MDMs on eight widely used benchmarks involving commonsense reasoning and reading comprehension: Hellaswag (Zellers et al., 2019), ARC-Easy (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), OpenBookQA (Mihaylov et al., 2018), RACE (Lai et al., 2017), and LAMBADA (Paperno et al., 2016). For a detailed description of these benchmarks, see Appendix D.

On certain challenging benchmarks such as ARC-Challenge (Clark et al., 2018), WinoGrande (Sakaguchi et al., 2021), and MMLU (Hendrycks et al., 2020), both ARMs and MDMs pre-trained in Sec. 3 perform similarly to random guessing. This is consistent with findings from Wei et al. (2022a), which showed that only ARMs with more than $10^{22}$ training FLOPs can surpass random guessing on MMLU, a phenomenon known as the emergence of new capabilities in large language models. We leave the exploration of their potential emergent abilities at a larger scale as future work.

Evaluation. We employ the popular Language Model Evaluation Harness framework (Gao et al., 2024) to evaluate both ARMs and MDMs. For the LAMBADA dataset, given a prompt, we apply greedy sampling to generate responses from each model and calculate the matching accuracy against the ground truth (please refer to Appendix A for the greedy sampling algorithm of MDMs). For other tasks, we report the accuracy of each model that selects the correct answer from the provided options based on the given context. Specifically, we compute the likelihood of each option given the prompt and choose the answer with the highest likelihood.

Fixing the train-test discrepancy. Because MDMs employ a bidirectional Transformer encoder, they face a train-test discrepancy in context lengths, which negatively impacts model performance. Specifically, the training context length is fixed at 2048 tokens, while the testing context length is variable and often shorter. To address this issue, we propose two mitigation strategies: (1) allocate a portion of training data with variable sequence lengths $L \sim \mathcal{U}[1, 2048]$, where $\mathcal{U}[\cdot]$ denotes the uniform distribution; (2) pad sentences with mask tokens to reach 2048 tokens during evaluation.
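The two strategies can be sketched as follows. This is an illustration under stated assumptions rather than the training code: `MASK_ID`, the function names, the default fraction of variable-length samples, and the choice to pad on the right are all hypothetical.

```python
import random

MAX_LEN = 2048  # training context length stated in the text
MASK_ID = -1    # hypothetical id of the mask token

def sample_training_length(rng, variable_fraction=0.01):
    """Strategy (1): with probability `variable_fraction`, draw L ~ U[1, 2048];
    otherwise keep the full context length."""
    if rng.random() < variable_fraction:
        return rng.randint(1, MAX_LEN)  # randint is inclusive on both ends
    return MAX_LEN

def pad_with_masks(token_ids, max_len=MAX_LEN, mask_id=MASK_ID):
    """Strategy (2): pad an evaluation sequence with mask tokens up to max_len.
    Right-padding is an assumption made for this sketch."""
    if len(token_ids) > max_len:
        raise ValueError("sequence is longer than the training context length")
    return list(token_ids) + [mask_id] * (max_len - len(token_ids))
```

Strategy (2) always runs the model on 2048 positions regardless of the input length, which is why it costs more at inference time than strategy (1).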

As presented in Appendix C.1, both strategies effectively reduce the train-test discrepancy, and only a small proportion (e.g., 1%) of variable-length training data is sufficient to activate the capability to handle variable-length inputs. Given its superior inference efficiency (e.g., 20 times faster than method (2) on the Hellaswag dataset), we employ method (1) in subsequent experiments.

Flexible likelihood evaluation. As detailed in Sec. 2, the MDMs model the conditional distribution of clean data, which enables flexible likelihood evaluation. Given a prompt and a sentence $\bm{x}_0$ of length $L$, we can determine the conditional likelihood using the following methods: (1) employ Monte Carlo estimation to establish a lower bound of the log-likelihood based on Eq. (4); (2) utilize the chain rule to compute the likelihood as $\log p_{\bm{\theta}}(\bm{x}_0|\text{prompt}) = \sum_{i=0}^{L-1} \log p_{\bm{\theta}}(\bm{x}_0^{i}|\text{prompt}, \bm{x}_0^{<i}, m)$.
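To make method (2) concrete, the sketch below evaluates the chain-rule sum with one forward pass per answer token and then picks the highest-likelihood option. The `predict` interface, the `MASK` symbol, and the uniform toy model are hypothetical stand-ins for a trained MDM.

```python
import math

MASK = "<mask>"  # hypothetical mask symbol

def chain_rule_log_likelihood(predict, prompt, answer):
    """Sum over i of log p(x0^i | prompt, x0^{<i}, masks), as in method (2).

    predict(tokens) returns one dict per position, mapping token -> probability.
    """
    total = 0.0
    length = len(answer)
    for i in range(length):
        # Reveal the first i answer tokens and keep the remaining ones masked.
        context = list(prompt) + list(answer[:i]) + [MASK] * (length - i)
        probs = predict(context)[len(prompt) + i]
        total += math.log(probs[answer[i]])
    return total

def choose_option(predict, prompt, options):
    """Return the option with the highest conditional log-likelihood."""
    return max(options, key=lambda o: chain_rule_log_likelihood(predict, prompt, o))

def toy_predict(tokens):
    """Toy model: uniform distribution over a four-token vocabulary."""
    vocab = ["a", "b", "c", "d"]
    return [{v: 0.25 for v in vocab} for _ in tokens]

score = chain_rule_log_likelihood(toy_predict, ["q"], ["a", "b"])
```

The chain rule needs $L$ forward passes per option, whereas the cost of the Monte Carlo bound is set by the number of samples drawn.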

We observed that the chain rule for likelihood evaluation results in higher accuracy for OpenBookQA and PIQA, while Monte Carlo estimation yields better accuracy for ARC-Easy, Hellaswag, RACE, and SIQA. Since BoolQ answers consist of a single token (“Yes” or “No”), both methods produce identical results. We adopt the better-performing method for each task in subsequent experiments; please refer to Appendix C.1 for more details and results.

Table 1: Ablation of unsupervised CFG without paired data. Unsupervised CFG significantly improves the performance of MDMs across eight commonsense reasoning or reading comprehension tasks.

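| Model | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|
| MDM w/o CFG | 37.42 | 61.50 | 33.46 | 27.00 | 60.34 | 29.28 | 36.95 | 36.00 |
| MDM w/ CFG | 39.02 | 62.17 | 34.10 | 34.20 | 60.39 | 30.81 | 37.41 | 40.99 |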

Effectiveness of unsupervised CFG without paired data. In this section, we use a default MDM model with 220M parameters and a training budget of $10^{20}$ FLOPs for efficiency. For likelihood evaluation, we use the rescaled conditional distribution defined in Eq. (7) of unsupervised CFG. Since no paired data is available, standard CFG cannot be applied in this scenario. As shown in Table 1, unsupervised CFG significantly enhances the performance of MDMs across all eight widely used benchmarks, demonstrating its strong capability to leverage unpaired data effectively.

Competitive zero-shot language understanding performance. First, we compare the performance of MDMs and ARMs pre-trained on the same SlimPajama dataset for fair and detailed analysis. As shown in Table 2, MDMs outperform ARMs on four out of eight tasks when trained with an equivalent number of pre-training FLOPs. Besides, with approximately 16 times more pre-training time, as suggested by the scaling law in Sec. 3, MDMs surpass ARMs across all tasks.

We further analyze the scaling behavior of MDMs on understanding tasks and observe a clear trend: as the validation loss decreases, the performance on most tasks correspondingly improves, indicating a positive signal for scaling MDMs to achieve even stronger capabilities. We provide the results and more details in Appendix C.1.

Additionally, in Table 3, we benchmark MDMs against the well-known GPT-2 model for a comprehensive comparison with existing literature. When scaled to 1.1B parameters and a training budget of $1.61 \times 10^{21}$ FLOPs (see Appendix B.3 for details), MDMs outperform the official 1.5B GPT-2 model on four out of eight benchmarks, highlighting their competitive performance compared to larger and standard ARMs.

Interestingly, when ARMs and MDMs are matched in size (and computation if known), their relative performance across datasets is consistent in both Table 2 and Table 3: MDMs outperform ARMs on the same four tasks in both tables (BoolQ, OpenBookQA, RACE, and LAMBADA), while ARMs lead on the remaining four. Although the underlying mechanism is not yet fully understood, this observation suggests that MDMs and ARMs may play complementary roles in language understanding tasks. We believe all these results make MDMs a promising alternative to ARMs for language understanding tasks.
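The consistency across the two tables can be checked directly from the reported accuracies. The short script below copies the numbers from Table 2 (both models at $10^{20}$ FLOPs) and Table 3; the variable and function names are ours.

```python
TASKS = ["ARC-Easy", "BoolQ", "Hellaswag", "OpenBookQA",
         "PIQA", "RACE", "SIQA", "LAMBADA"]

# Table 2: 220M models, both pre-trained with 1e20 FLOPs
ARM_220M = [40.49, 60.06, 35.81, 29.60, 65.07, 29.47, 38.38, 26.92]
MDM_220M = [39.02, 62.17, 34.10, 34.20, 60.39, 30.81, 37.41, 40.99]

# Table 3: official GPT-2 (1.5B) versus the 1.1B MDM
GPT2_1P5B = [51.05, 61.77, 50.89, 32.00, 70.51, 33.11, 40.28, 44.61]
MDM_1P1B = [44.44, 62.17, 45.91, 34.40, 64.31, 33.40, 40.02, 44.71]

def mdm_wins(mdm, arm):
    """Return the tasks on which the MDM accuracy exceeds the ARM accuracy."""
    return [task for task, m, a in zip(TASKS, mdm, arm) if m > a]

wins_table2 = mdm_wins(MDM_220M, ARM_220M)
wins_table3 = mdm_wins(MDM_1P1B, GPT2_1P5B)
print(wins_table2)
print(wins_table3)
```

Both calls return the same four tasks, which is the pattern described above.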

Table 2: Comparison between MDM and ARMs pre-trained on the SlimPajama dataset. MDM achieves comparable performance to ARM when pre-trained with equivalent pre-training FLOPs, and surpasses ARM on all benchmarks when pre-trained for approximately 16 times longer.

| Model | FLOPs | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|---|
| ARM (220M) | $6 \times 10^{18}$ | 35.40 | 61.69 | 28.51 | 26.20 | 58.98 | 25.84 | 35.98 | 17.02 |
| ARM (220M) | $10^{20}$ | 40.49 | 60.06 | 35.81 | 29.60 | 65.07 | 29.47 | 38.38 | 26.92 |
| MDM (220M) | $10^{20}$ | 39.02 | 62.17 | 34.10 | 34.20 | 60.39 | 30.81 | 37.41 | 40.99 |

Table 3: Comparison with GPT-2. Our 1.1B MDM outperforms the larger 1.5B GPT-2 on four out of eight tasks. We use the official GPT-2 checkpoint (see link in Tab. 8) and its FLOPs are unknown.

| Model | FLOPs | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|---|
| GPT-2 (1.5B) | - | 51.05 | 61.77 | 50.89 | 32.00 | 70.51 | 33.11 | 40.28 | 44.61 |
| MDM (1.1B) | $1.61 \times 10^{21}$ | 44.44 | 62.17 | 45.91 | 34.40 | 64.31 | 33.40 | 40.02 | 44.71 |

6 Conditional Language Generation

We investigate the capabilities of MDMs in conditional generation, another core language task largely unexplored previously. Our results show that a 1.1B MDM achieves a more flexible and effective quality-efficiency trade-off during inference than a same-sized ARM that utilizes KV cache.

Evaluation. Previous studies (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024; Gat et al., 2024) have commonly employed generative perplexity as a metric to assess unconditional generation quality. However, recent work (Zheng et al., 2024) demonstrated that even low-quality samples can obtain favorable (i.e., low) generative perplexity, suggesting that this metric may not reliably reflect generative quality. Moreover, conditional generation is more widely applicable in real-world scenarios than unconditional generation. Therefore, this paper focuses on conditional generation.

In particular, we employ MT-Bench (Zheng et al., 2023), which uses a strong language model (i.e., GPT-4o (Achiam et al., 2023)) as a judge to score models on open-ended questions. This metric aligns well with human preferences and has become a standard for evaluating large language models.

Supervised fine-tuning. We employ an ARM and an MDM, both pre-trained as described in Sec. 3 with 1.1B parameters each. For a meaningful comparison, we evaluate their inference performance and, guided by the scaling law, extend the MDM’s pre-training time by a factor of 16. Results using equal computation budgets are provided in Appendix C.2. Following a standard process in language models, we fine-tune both models on the ShareGPT dataset (footnote 3: https://sharegpt.com/), a high-quality dialogue corpus containing user prompts and corresponding ChatGPT responses (OpenAI, 2022).

Since ShareGPT samples vary in length, we pad each sample with the |EOS| token to the maximum sequence length within a batch for the MDM. Following the same approach as for ARMs, we mask the loss on prompts, adding noise only to the response tokens (including the padding |EOS|), while keeping the prompts unchanged in the forward process. As a result, the MDM only tunes the conditional distribution of the response given prompt. We set the sequence length to 1024 and remove the |EOS| token from the generated outputs during inference. For the ARM, generation stops when the |EOS| token is produced, with a maximum sequence length set to 1024 (Zheng et al., 2023). For a fair comparison, we use identical optimizer settings for both models and train for 3 epochs as specified in Zheng et al. (2023). Additional training details are provided in Appendix B.4.
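The padding and response-only noising described above can be sketched as follows. This is a simplified illustration, not the fine-tuning code: `EOS_ID`, `MASK_ID`, and the function names are hypothetical, and we assume that each response token is masked independently with probability `t` and that the loss is computed on masked response positions only.

```python
import random

EOS_ID = 0    # hypothetical id of the |EOS| token
MASK_ID = -1  # hypothetical id of the mask token

def pad_response_with_eos(prompt, response, target_len):
    """Append |EOS| to the response so that prompt + response has target_len tokens."""
    pad = target_len - len(prompt) - len(response)
    if pad < 0:
        raise ValueError("sample is longer than target_len")
    return list(response) + [EOS_ID] * pad

def noise_response_only(prompt, response, t, rng):
    """Mask response tokens (including |EOS| padding) with probability t.

    The prompt is left unchanged. Returns the noisy sequence and a boolean
    loss mask that is True only at masked response positions.
    """
    flags = [rng.random() < t for _ in response]
    noisy_response = [MASK_ID if f else tok for f, tok in zip(flags, response)]
    noisy = list(prompt) + noisy_response
    loss_mask = [False] * len(prompt) + flags
    return noisy, loss_mask
```

Because the padding |EOS| tokens are also noised, the model is trained to predict where a response ends, which is consistent with stripping |EOS| from the generated output at inference time.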

Effectiveness of unsupervised CFG against standard CFG. As shown in Table 4, we evaluate the effectiveness of unsupervised CFG by comparing it against several baselines detailed in Appendix B.4. The first baseline fine-tunes only the conditional distribution of the MDM on paired data and samples without CFG. The second fine-tunes both the conditional and unconditional distributions on paired data and samples as in standard CFG. For unsupervised CFG, we fine-tune only its conditional distribution on paired data, because unsupervised CFG already leverages large-scale pre-training data to obtain a strong unconditional model. Notably, our unsupervised CFG outperforms standard CFG, demonstrating its superior ability to leverage large-scale unpaired data, given that the paired data available for fine-tuning are often small in scale.

Better efficiency-quality trade-off. We further compare MDMs and ARMs regarding sample quality and efficiency. Our study significantly extends prior work (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024; Gat et al., 2024) in two key aspects: (1) we focus on the more practical and challenging task of conditional generation rather than unconditional generation, and (2) we measure the running time instead of the NFEs, with ARMs equipped with the KV-cache, a technique that accelerates sampling by caching intermediate features during sequential generation.

Built upon the unsupervised CFG, MDMs demonstrate a more flexible and effective trade-off between efficiency and quality in conditional generation compared to ARMs. As shown in Table 5, a 1.1B MDM matches the performance of a similarly sized ARM while achieving a 1.4 times speedup in sampling time. Conversely, by increasing the number of sampling steps (at the cost of being 1.4 times slower), MDMs can surpass ARMs in generation quality. These results indicate that MDMs hold promise for conditional generation tasks, such as chat-based applications, where the ability to balance speed and quality is critical.
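The 1.4 times figures follow directly from the running times in Table 5. The snippet below reproduces the arithmetic from the reported numbers; the variable names are ours.

```python
# Numbers copied from Table 5 (single NVIDIA A100-40GB GPU, ARM with KV cache).
# MDM entries map NFEs -> (MT-Bench score, sampling time in seconds).
mdm = {64: (1.40, 204), 128: (1.56, 396), 256: (1.60, 780)}
arm_score, arm_seconds = 1.57, 555

for nfe, (score, seconds) in mdm.items():
    ratio = arm_seconds / seconds  # a value above 1 means the MDM is faster
    print(f"NFEs={nfe}: MDM score {score:.2f} vs ARM {arm_score:.2f}, "
          f"ARM time / MDM time = {ratio:.2f}")

speedup_at_128 = arm_seconds / mdm[128][1]   # MDM faster at similar quality
slowdown_at_256 = mdm[256][1] / arm_seconds  # MDM slower at higher quality
```

At 128 NFEs the MDM is within 0.01 of the ARM score while running about 1.4 times faster; at 256 NFEs it scores higher while running about 1.4 times slower.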

Table 4: Ablation of unsupervised CFG. The two “w/ CFG” columns report standard CFG and unsupervised CFG, respectively. We report the results with the optimal scale searched in $\{0.4, 0.6, 0.8, 1\}$ for both CFG approaches.

| | w/o CFG | w/ CFG (standard) | w/ CFG (unsupervised) |
|---|---|---|---|
| Score ↑ | 1.32 | 1.53 | 1.60 |

Table 5: Conditional generation results. All experiments are conducted on a single NVIDIA A100-40GB GPU. ARM utilizes KV cache.

| | MDM | MDM | MDM | ARM |
|---|---|---|---|---|
| Score ↑ | 1.40 | 1.56 | 1.60 | 1.57 |
| NFEs ↓ | 64 | 128 | 256 | 325.94 |
| Time ↓ | 204s | 396s | 780s | 555s |

7 Challenging Tasks for ARMs

We demonstrate that MDMs exhibit distinct advantages over ARMs in tackling two critical challenges: reverse curse (Berglund et al., 2023) and temporal quality degradation (Vela et al., 2022).

7.1 Breaking the Reverse Curse

Berglund et al. (2023) introduced the concept of the reverse curse, which refers to the difficulty of ARMs in generalizing bidirectional relationships. Specifically, this occurs when a model is trained on information in the form “A is B” but fails to infer the reverse relationship “B is A.” For example, a model trained on the fact “Valentina Tereshkova was the first woman to travel to space” may not correctly answer the reverse question “Who was the first woman to travel to space?” This limitation raises concerns about whether large language models genuinely possess logical reasoning capabilities (Berglund et al., 2023).

Setup. We evaluate MDMs on the same reverse curse dataset used by Berglund et al. (2023), which consists of fictitious statements in the format “⟨name⟩ is ⟨description⟩” and the reversals. We fine-tune MDMs on these statements and assess their performance using questions not seen during training. Following the same protocol as Berglund et al. (2023), we generate responses via greedy sampling and report the exact match accuracy. Additionally, we use the BLEU metric (Papineni et al., 2002) to evaluate the quality of name-to-description generation, as suggested by Lv et al. (2023).
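For reference, exact match accuracy can be computed as in the sketch below. The normalization step (lowercasing and collapsing whitespace) is a hypothetical choice for illustration; the protocol of Berglund et al. (2023) may normalize differently.

```python
def normalize(text):
    """Lowercase and collapse whitespace (a hypothetical normalization)."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

# Fictitious names, in the spirit of the dataset's fictitious statements.
accuracy = exact_match_accuracy(["Alice Smith", "bob"], ["alice  smith", "Carol"])
```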

Results. As shown in Table 6, advanced ARMs, including GPT-3 (Brown, 2020) and Llama-2 (Touvron et al., 2023b), achieve zero accuracy and low BLEU scores when prompted with reverse queries. In contrast, MDMs achieve substantially higher scores across both metrics, despite using significantly fewer parameters, a smaller pre-training dataset, and less computation. Specifically, our MDM uses only 10% of the parameters, 10% of the pre-training data, and 1% of the computation of Llama-2. Besides, MDMs perform similarly to ARMs on queries in the same direction. These results indicate the power of MDMs in capturing bidirectional relationships and logical structures.

Table 6: Results on breaking the reverse curse. The performance of GPT-3 and Llama-2 is sourced from Berglund et al. (2023) and Lv et al. (2023), respectively. For MDM, we use a CFG scale of 0.8. While ARMs struggle to handle reverse queries, MDMs effectively overcome the reverse curse and maintain performance in the same direction.

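| Model | DescriptionToName, same direction (Acc. ↑) | DescriptionToName, reverse direction (Acc. ↑) | NameToDescription, same direction (Acc. ↑) | NameToDescription, same direction (BLEU ↑) | NameToDescription, reverse direction (Acc. ↑) | NameToDescription, reverse direction (BLEU ↑) |
|---|---|---|---|---|---|---|
| GPT3 (175B) | 97 | 0 | 50 | - | 0 | - |
| Llama-2 (13B) | 99 | 0 | - | 74 | - | 19 |
| MDM (1.1B) | 97 | 92 | 49 | 76 | 37 | 67 |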

7.2 Relieving the Temporal Quality Degradation

Vela et al. (2022) highlight a common and challenging issue for modern AI models, including language models: model performance is sensitive to the temporal alignment between the training and test data, particularly when new data fall outside the temporal scope of the training set.

Setup. To evaluate the impact of temporal shifts, we train both ARMs and MDMs on the SlimPajama dataset (Soboleva et al., 2023) (see Sec. 3), released in 2023, and test them on the FineWeb dataset (Penedo et al., 2024), which contains samples from February & March and from April of 2024. We extract the first 0.5 billion tokens from each period for evaluation. We use models of equal size (220M parameters) that achieve similar validation losses on SlimPajama. However, it is worth noting that MDMs require 16 times more computation to reach this performance level.

Results. As shown in Table 7, although the MDM achieves slightly higher perplexity on the standard validation set (i.e., SlimPajama), it outperforms the ARM on the newer 2024 data. While the exact mechanism remains unclear, we hypothesize that this advantage arises from MDMs’ ability to simultaneously model all conditional distributions, making them less sensitive to distributional shifts compared to the unidirectional dependencies in ARMs. These results indicate that MDMs are inherently more robust to temporal shifts, making them better suited for evolving data distributions.
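One way to quantify the degradation is the relative increase in perplexity from the training-distribution validation set to the 2024 data. The snippet below computes it from the values reported in Table 7; the dictionary keys and function name are ours.

```python
# Perplexities copied from Table 7.
ppl = {
    "ARM": {"SlimPajama": 17.34, "FineWeb Feb-Mar 2024": 27.01, "FineWeb Apr 2024": 26.93},
    "MDM": {"SlimPajama": 18.02, "FineWeb Feb-Mar 2024": 24.06, "FineWeb Apr 2024": 24.01},
}

def relative_increase(base, new):
    """Fractional increase of `new` over `base`."""
    return (new - base) / base

degradation = {
    model: {
        split: relative_increase(values["SlimPajama"], value)
        for split, value in values.items() if split != "SlimPajama"
    }
    for model, values in ppl.items()
}

for model, splits in degradation.items():
    for split, inc in splits.items():
        print(f"{model} on {split}: +{100 * inc:.1f}% perplexity")
```

On both 2024 splits the ARM's perplexity rises by roughly 55% relative to SlimPajama, while the MDM's rises by roughly 33%.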

Table 7: Perplexity (↓) results on relieving temporal quality degradation. The symbol * indicates the training dataset. MDM demonstrates greater robustness to temporal shifts than ARM.

| Model | SlimPajama* (before Jun. 2023) | Fineweb (Feb. & Mar. 2024) | Fineweb (Apr. 2024) |
|---|---|---|---|
| ARM | 17.34 | 27.01 | 26.93 |
| MDM | 18.02 | 24.06 | 24.01 |

8 Conclusion

In this paper, we demonstrate the strong scalability of MDMs through a comprehensive scaling analysis. Our results show that MDMs can achieve performance comparable or even superior to ARMs in key tasks, such as conditional language generation and language understanding, supported by the scaling law and the unsupervised classifier-free guidance. Furthermore, MDMs effectively address major limitations of ARMs, including breaking the reverse curse and relieving temporal quality degradation, even outperforming much larger models like Llama-2 and GPT-3 in these aspects. These findings highlight MDMs as a promising alternative to ARMs for language modeling at scale.

One of the most important future directions is to scale MDMs to larger sizes, potentially matching advanced ARMs (Achiam et al., 2023; Dubey et al., 2024). This would allow for a thorough investigation into the emergent behaviors (Wei et al., 2022a) and long-range reasoning capabilities (Wei et al., 2022b) of MDMs. By scaling up, we hope that MDMs can fully demonstrate their unique advantages over ARMs in real-world scenarios, offering a competitive alternative. Further, we believe the studies can deepen our understanding of large language models and the role of key factors such as autoregressive formulation in achieving such intelligence.

We also note another line of research focusing on continuous diffusion language models (Li et al., 2022; Gong et al., 2022; Han et al., 2022; Strudel et al., 2022; Chen et al., 2022; Dieleman et al., 2022; Mahabadi et al., 2023; Graves et al., 2023; Xue et al., 2024; Gulrajani & Hashimoto, 2024). However, the experiments in this domain are relatively small in scale and lack evaluation on standard language benchmarks. We hypothesize that MDMs enjoy better scalability than these models due to their alignment with the inherent structure of language and ARMs.

9 Acknowledgments

We thank Jingyang Ou for the insightful discussions on RADD. We also thank Ang Lv for valuable conversations about the reverse curse and Wenkai Yang for discussions on the supervised fine-tuning of ARMs. Additionally, we appreciate Siqi Kou for providing guidance on data processing and evaluation for the conditional generation experiments.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
  • Berglund et al. (2023) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on “a is b” fail to learn “b is a”. arXiv preprint arXiv:2309.12288, 2023.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.
  • Brown (2020) Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Campbell et al. (2022) Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022.
  • Chang et al. (2022) Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11315–11325, 2022.
  • Chang et al. (2023) Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  • Chen et al. (2022) Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • Devlin (2018) Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dieleman et al. (2022) Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
  • Du et al. (2024) Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Gao et al. (2024) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.
  • Gat et al. (2024) Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. arXiv preprint arXiv:2407.15595, 2024.
  • Gong et al. (2022) Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
  • Graves et al. (2023) Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, and Faustino Gomez. Bayesian flow networks. arXiv preprint arXiv:2308.07037, 2023.
  • Gulrajani & Hashimoto (2024) Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Han et al. (2022) Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432, 2022.
  • He et al. (2022) Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. arXiv preprint arXiv:2211.15029, 2022.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  • Hoogeboom et al. (2021a) Emiel Hoogeboom, Alexey A Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021a.
  • Hoogeboom et al. (2021b) Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021b.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Kou et al. (2024) Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. Cllms: Consistency large language models. arXiv preprint arXiv:2403.00835, 2024.
  • Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
  • Li et al. (2022) Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
  • Loshchilov (2017) I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lou et al. (2023) Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
  • Lovelace et al. (2024) Justin Lovelace, Varsha Kishore, Yiwei Chen, and Kilian Q Weinberger. Diffusion guided language modeling. arXiv preprint arXiv:2408.04220, 2024.
  • Lv et al. (2023) Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, and Rui Yan. Are we falling in a middle-intelligence trap? an analysis and mitigation of the reversal curse. arXiv preprint arXiv:2311.07468, 2023.
  • Mahabadi et al. (2023) Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. arXiv preprint arXiv:2305.08379, 2023.
  • Meng et al. (2022) Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532–34545, 2022.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  • OpenAI (2022) OpenAI. ChatGPT: Optimizing Language Models for Dialogue. OpenAI blog, November 2022. URL https://openai.com/blog/chatgpt/.
  • Ou et al. (2024) Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
  • Paperno et al. (2016) Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp.  311–318, 2002.
  • Penedo et al. (2024) Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
  • Radford (2018) Alec Radford. Improving language understanding by generative pre-training, 2018.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Sahoo et al. (2024) Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  • Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4463–4473, 2019.
  • Shazeer (2020) Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  • Shi et al. (2024) Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024.
  • Soboleva et al. (2023) Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 06 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  • Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Strudel et al. (2022) Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.
  • Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  • Sun et al. (2022) Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Vaswani (2017) Ashish Vaswani. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • Vela et al. (2022) Daniel Vela, Andrew Sharp, Richard Zhang, Trang Nguyen, An Hoang, and Oleg S Pianykh. Temporal quality degradation in ai models. Scientific Reports, 12(1):11654, 2022.
  • Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
  • Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
  • Xue et al. (2024) Kaiwen Xue, Yuhao Zhou, Shen Nie, Xu Min, Xiaolu Zhang, Jun Zhou, and Chongxuan Li. Unifying bayesian flow networks and diffusion models through stochastic differential equations. arXiv preprint arXiv:2404.15766, 2024.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  • Zhang & Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
  • Zheng et al. (2024) Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

Appendix A Greedy Sampling method of MDMs

We employ the sampling method of MaskGIT (Chang et al., 2022) as the greedy sampling strategy for MDMs. For completeness, we include the algorithm in Alg. 1 and provide the following intuitive explanation.

Let us first revisit the original sampling method for MDMs as described in Eq. (2). During each sampling step from time $t$ to $s$, if ${\bm{x}}_t^i \neq m$, the token remains unchanged. Otherwise, it retains the masked state with probability $\frac{s}{t}$, or transitions to ${\bm{x}}_0^i \sim p_{\bm{\theta}}({\bm{x}}_0^i | {\bm{x}}_t)$ with probability $1 - \frac{s}{t}$. It is important to note that all masked tokens ${\bm{x}}_t^i$ transition to their corresponding ${\bm{x}}_0^i$ with the same probability $1 - \frac{s}{t}$.
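The sampling step just described can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the mask id `MASK`, the function name `reverse_step`, and the `probs` array standing in for $p_{\bm{\theta}}({\bm{x}}_0^i | {\bm{x}}_t)$ are all assumptions made for the example.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token m


def reverse_step(x_t, probs, t, s, rng):
    """One reverse step from time t to s, with 0 <= s < t <= 1.

    x_t:   (L,) integer array; MASK marks masked positions.
    probs: (L, V) array; row i stands in for p_theta(x_0^i | x_t).
    """
    x_s = x_t.copy()
    for i in np.flatnonzero(x_t == MASK):
        # A masked token stays masked with probability s / t ...
        if rng.random() < s / t:
            continue
        # ... otherwise it is replaced by a sample from the prediction.
        x_s[i] = rng.choice(probs.shape[1], p=probs[i])
    return x_s
```

Every masked position uses the same threshold `s / t`, which is the property the greedy sampler replaces with a confidence-based selection.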

Unlike the original sampling method, MaskGIT (Chang et al., 2022) does not transition all masked tokens to their corresponding ${\bm{x}}_0^i$ with the same probability $1 - \frac{s}{t}$. Instead, it selects the masked tokens with the highest conditional probability $p_{\bm{\theta}}({\bm{x}}_0^i | {\bm{x}}_t)_{{\bm{x}}_0^i}$ for transition to ${\bm{x}}_0^i$.

Algorithm 1 Greedy sampling method of MDMs
Require: A fully masked sequence ${\bm{x}}_1$ of length $L$, sampling steps $N$
1:  for $t = 1, \frac{N-1}{N}, \frac{N-2}{N}, \dots, \frac{1}{N}$ do
2:     $s = t - \frac{1}{N}$
3:     for $i = 0, 1, \dots, L-1$ do
4:        if ${\bm{x}}_t^i \neq m$ then
5:           ${\bm{x}}_0^i = {\bm{x}}_t^i$, $c^i = 1$
6:        else
7:           ${\bm{x}}_0^i = \operatorname*{arg\,max}_j p_{\bm{\theta}}({\bm{x}}_0^i | {\bm{x}}_t)_j$, and denote $c^i = p_{\bm{\theta}}({\bm{x}}_0^i | {\bm{x}}_t)_{{\bm{x}}_0^i}$
8:        end if
9:     end for
10:     $l = \lfloor L(1-s) \rfloor$ # we set the number of unmasked tokens to $l$ in timestep $s$
11:     for $i = 0, 1, \dots, L-1$ do
12:        if $c^i \in \text{top-}l\left(\{c^i\}_{i=0}^{L-1}\right)$ then
13:           ${\bm{x}}_s^i = {\bm{x}}_0^i$
14:        end if
15:     end for
16:  end for
17:  return ${\bm{x}}_0$
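As a hedged companion to Alg. 1, the following NumPy sketch runs the same loop. The callable `predict`, the mask id `MASK`, and the function name are assumptions introduced for illustration; positions outside the top-$l$ set are assumed to stay masked, which Alg. 1 leaves implicit. The step index `k` is used instead of floating-point times so that $l = \lfloor L(1-s) \rfloor$ is computed exactly as `(L * k) // N`.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token m


def greedy_sample(predict, length, num_steps):
    """MaskGIT-style greedy decoding following Alg. 1.

    predict(x_t) must return an (L, V) array whose row i plays the role
    of p_theta(x_0^i | x_t); it is a stand-in for the trained network.
    """
    x_t = np.full(length, MASK, dtype=np.int64)
    for k in range(1, num_steps + 1):  # t = (N-k+1)/N, s = (N-k)/N
        probs = predict(x_t)
        x0 = probs.argmax(axis=1)      # most likely token per position
        conf = probs.max(axis=1)       # its probability, c^i
        decoded = x_t != MASK
        x0[decoded] = x_t[decoded]     # already decoded tokens are kept
        conf[decoded] = 1.0            # and receive confidence c^i = 1
        num_keep = (length * k) // num_steps  # l = floor(L * (1 - s))
        top = np.argsort(-conf, kind="stable")[:num_keep]
        x_s = x_t.copy()               # positions outside top-l stay as they were
        x_s[top] = x0[top]
        x_t = x_s
    return x_t
```

At the last step $s = 0$, so $l = L$ and every position is filled, which is why returning the final state matches returning ${\bm{x}}_0$ in Alg. 1.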

Appendix B Experimental details

B.1 Reproducibility Statement

We implement our experiments based on the TinyLlama (Zhang et al., 2024) codebase. We use the code provided by TinyLlama to preprocess the SlimPajama (Soboleva et al., 2023) dataset. Additionally, we use the code provided by CLLM (Kou et al., 2024) to preprocess the ShareGPT dataset. We employ the fictitious dataset provided by Berglund et al. (2023) and the FineWeb dataset (Penedo et al., 2024) for the reverse curse and temporal quality degradation experiments, respectively. Because these two datasets are simple, we preprocess them ourselves. We employ the lm-eval (Gao et al., 2024) and fast-chat (Zheng et al., 2023) frameworks to evaluate language understanding tasks and conditional generation, respectively. In Sec. 5, the pre-trained GPT-2 model is provided by Hugging Face. The corresponding links are detailed in Tab. 8.

Table 8: Links for code and checkpoints.

B.2 Additional Experimental Details of IsoFLOP Analysis

Training details. We use identical optimizer settings for both MDMs and ARMs during pre-training. Consistent with TinyLlama (Zhang et al., 2024), we use the AdamW optimizer (Loshchilov, 2017) with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and a weight decay of $0.1$. Additionally, we apply a cosine learning rate schedule with a maximum learning rate of $4 \times 10^{-4}$ and a minimum learning rate of $4 \times 10^{-5}$, using $1\%$ of the tokens for linear warmup. If the number of warmup steps is less than $100$, it is set to $100$. The batch size is set to $256$.
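The schedule described above can be written as a small function. This is a sketch under stated assumptions, not the TinyLlama code: the function name and arguments are hypothetical, warmup is assumed to rise linearly from near zero to the maximum, and with a fixed batch size "1% of the tokens" is treated as 1% of the training steps.

```python
import math


def learning_rate(step, total_steps, max_lr=4e-4, min_lr=4e-5,
                  warmup_frac=0.01, min_warmup_steps=100):
    """Cosine schedule with linear warmup (illustrative, zero-based steps)."""
    # Warmup covers 1% of training, but never fewer than 100 steps.
    warmup_steps = max(int(warmup_frac * total_steps), min_warmup_steps)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine
```

For a run of 5,000 steps, 1% would be only 50 steps, so the floor of 100 warmup steps applies.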

Specifically, we pre-train a 1.1B MDM with $1.6 \times 10^{21}$ training FLOPs for the downstream tasks. We use the above pre-training setting for this 1.1B model except for the batch size: because we use 24 GPUs to pre-train this model, we set the batch size to $384$.

Evaluation details. For MDMs, we found that using more Monte Carlo estimation samples (i.e., $128$) when computing the validation loss effectively reduces the number of outliers in Fig. 1(b). This is because increasing the number of Monte Carlo samples reduces the variance of the estimation, leading to a more precise estimation of the validation loss.
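The variance argument can be checked with a toy simulation. The sketch below is only an illustration: the exponential draws are a hypothetical stand-in for single-sample loss estimates, not the MDM loss itself, and the function name is an assumption. For independent draws, the standard deviation of an $n$-sample average shrinks by a factor of $\sqrt{n}$, which is roughly $11$ for $n = 128$.

```python
import numpy as np


def mc_estimate_std(num_samples, num_trials=2000, seed=0):
    """Spread of a Monte Carlo average built from `num_samples` draws."""
    rng = np.random.default_rng(seed)
    # Each row is one independent estimate made of `num_samples` toy losses.
    draws = rng.exponential(scale=1.0, size=(num_trials, num_samples))
    return draws.mean(axis=1).std()
```

Comparing `mc_estimate_std(1)` with `mc_estimate_std(128)` shows the reduced spread that makes outliers in the validation curve less likely.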

Model configs. We list all model configurations in Tab. 9.

Table 9: Model configurations of MDMs and ARMs. $^*$ denotes the number of non-embedding parameters.

| Parameters$^*$ (M) | n_layers | n_heads | n_embed | intermediate_size |
| --- | --- | --- | --- | --- |
| 19 | 8 | 6 | 384 | 1536 |
| 34 | 8 | 8 | 512 | 2048 |
| 48 | 9 | 9 | 576 | 2304 |
| 66 | 10 | 10 | 640 | 2560 |
| 75 | 16 | 8 | 640 | 1600 |
| 85 | 13 | 10 | 640 | 2560 |
| 113 | 12 | 12 | 768 | 3072 |
| 142 | 15 | 12 | 768 | 3072 |
| 170 | 18 | 12 | 768 | 3072 |
| 180 | 14 | 14 | 896 | 3584 |
| 206 | 16 | 14 | 896 | 3584 |
| 231 | 18 | 14 | 896 | 3584 |
| 268 | 16 | 16 | 1024 | 4096 |
| 302 | 18 | 16 | 1024 | 4096 |
| 336 | 20 | 16 | 1024 | 4096 |
| 472 | 18 | 10 | 1280 | 5120 |
| 551 | 21 | 10 | 1280 | 5120 |
| 571 | 18 | 11 | 1408 | 5632 |
| 629 | 24 | 10 | 1280 | 5120 |
| 666 | 21 | 11 | 1408 | 5632 |
| 717 | 19 | 12 | 1536 | 6144 |
| 761 | 24 | 11 | 1408 | 5632 |
| 831 | 22 | 12 | 1536 | 6144 |
| 944 | 25 | 12 | 1536 | 6144 |
| 1028 | 20 | 14 | 1792 | 7168 |
| 1233 | 24 | 14 | 1792 | 7168 |
| 1476 | 22 | 16 | 2048 | 8192 |
| 1678 | 25 | 16 | 2048 | 8192 |
| 2121 | 28 | 17 | 2176 | 8704 |
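The non-embedding parameter counts in Tab. 9 can be approximated from the other columns. The formula below is an assumption inferred from the numbers and from the TinyLlama-style codebase, not something stated in the text: each layer is taken to hold four `n_embed × n_embed` attention projections and a gated MLP with three `n_embed × intermediate_size` matrices, with normalization weights ignored. The function name is hypothetical.

```python
def non_embedding_params(n_layers, n_embed, intermediate_size):
    """Approximate non-embedding parameter count of a decoder stack."""
    attention = 4 * n_embed * n_embed      # query, key, value, output
    mlp = 3 * n_embed * intermediate_size  # gate, up and down projections
    return n_layers * (attention + mlp)


# First row of Tab. 9: 8 layers, n_embed = 384, intermediate_size = 1536.
print(round(non_embedding_params(8, 384, 1536) / 1e6))
```

Under these assumptions the estimate rounds to the tabulated value for the rows checked here (19, 34, 75, 472, 1028 and 2121 million).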

B.3 Additional Experiment Details of Language Understanding

Here, we present more details about the 1.1B model introduced in Sec. 5. First, we pre-train a 1.1B MDM for $1.6 \times 10^{21}$ FLOPs as detailed in Appendix B.2. Due to limited computational resources, we do not retrain this 1.1B-parameter model from scratch with random data lengths. Instead, we allocate a compute budget of $10^{19}$ FLOPs for variable-length fine-tuning on the SlimPajama dataset. Whereas the proportion of random-length data is set to $1\%$ when training from scratch, we empirically increase it to $10\%$ during variable-length fine-tuning, considering the limited fine-tuning FLOPs.

B.4 Additional Experimental Details of Conditional Generation

Setup. We use identical optimizer settings for both MDMs and ARMs during supervised fine-tuning. Similar to our pre-training process, we use the AdamW optimizer (Loshchilov, 2017) with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.95$, and a weight decay of $0.1$. We employ a cosine learning rate schedule starting from a maximum learning rate of $2 \times 10^{-4}$ and decaying to a minimum of $2 \times 10^{-5}$. Additionally, we apply linear warm-up over the first $200$ steps and set the batch size to $256$.

For the preprocessing of the ShareGPT dataset, we use the same method as described in Kou et al. (2024). In addition, in line with Kou et al. (2024), we fine-tune both ARMs and MDMs on the first-turn conversation from the ShareGPT dataset and report the first-turn conversation score. We do not use any annealing sampling method for ARMs and MDMs during generation. The MT-Bench score is obtained via the “gpt-4o-2024-05-13” API provided by OpenAI.

Different CFG strategies. We provide an overview of no CFG, standard CFG, and unsupervised CFG in Tab. 10.

During fine-tuning on labeled data, the standard CFG (Ho & Salimans, 2022) replaces the label with a special token with a probability of $10\%$. This special token represents the unconditional distribution, thereby enabling the simultaneous training of both conditional and unconditional distributions. Specifically, for the implementation of standard CFG in our experiment, we randomly replace the prompt with masked tokens with probability $10\%$.
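The prompt replacement used for standard CFG can be illustrated as follows. This is a minimal sketch with hypothetical names (`MASK`, `maybe_drop_prompt`, `prompt_len`); it assumes each training example is a token array whose first `prompt_len` entries form the prompt.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token


def maybe_drop_prompt(tokens, prompt_len, rng, drop_prob=0.1):
    """With probability drop_prob, replace the whole prompt by mask tokens."""
    out = np.array(tokens, copy=True)
    if rng.random() < drop_prob:
        # The masked prompt plays the role of the unconditional input.
        out[:prompt_len] = MASK
    return out
```

Examples whose prompt is masked train the unconditional distribution, while the remaining examples train the conditional one.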

In contrast to the standard CFG, unsupervised CFG already leverages large-scale pre-training data to obtain a strong unconditional model, so we only enhance its conditional distribution during fine-tuning on paired data.

During inference, both standard CFG and unsupervised CFG employ the rescaled conditional distribution defined in Eq. (7).
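Eq. (7) is defined in the main text and is not reproduced in this appendix. As a hedged illustration only, the sketch below applies the usual classifier-free guidance rescaling, in which the guided distribution is proportional to $p_{\text{cond}}^{1+w} / p_{\text{uncond}}^{w}$; whether this scale convention matches Eq. (7) term by term is an assumption, and the function name is hypothetical.

```python
import numpy as np


def cfg_rescale(logp_cond, logp_uncond, w):
    """Log-probabilities proportional to p_cond^(1+w) / p_uncond^w.

    Both inputs are log-probabilities over the vocabulary (last axis);
    w is the guidance scale, and w = 0 recovers the conditional model.
    """
    scores = (1.0 + w) * logp_cond - w * logp_uncond
    # Normalize in a numerically stable way (log-softmax).
    scores = scores - scores.max(axis=-1, keepdims=True)
    return scores - np.log(np.exp(scores).sum(axis=-1, keepdims=True))
```

With $w > 0$, tokens that the conditional model favors more strongly than the unconditional model gain probability mass.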

Table 10: Overview of different CFG strategies for conditional generation. The standard CFG fine-tunes both the conditional and unconditional distributions on paired data, while unsupervised CFG fine-tunes only the conditional distribution. Unsupervised CFG already leverages large-scale pre-training data to obtain a strong unconditional model, resulting in improved performance compared to standard CFG.

| | Training: conditional | Training: unconditional | Sampling |
| --- | --- | --- | --- |
| No-CFG | ✓ | ✗ | w/o CFG |
| Standard CFG | ✓ | ✓ | w/ CFG (i.e., Eq. (7)) |
| Unsupervised CFG | ✓ | ✗ | w/ CFG (i.e., Eq. (7)) |

B.5 Additional Experimental Details of Reverse Curse

We use the same optimizer settings as in Appendix B.4, except for the batch size, when fine-tuning on the fictitious dataset provided by Berglund et al. (2023). As the fictitious dataset is small (i.e., only 3,600 examples), we use a batch size of 32 for fine-tuning. We train for 10 epochs following Berglund et al. (2023). We also pad each sample with the $|\text{EOS}|$ token to the maximum sequence length within a batch, as detailed in Sec. 6. Following the same approach as Berglund et al. (2023), we do not mask the loss on prompts, adding noise to the prompt and response simultaneously as in Eq. (4).
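The per-batch padding can be sketched as below. The function name and the use of plain Python lists are assumptions made for illustration; `eos_id` stands for the id of the $|\text{EOS}|$ token in whatever tokenizer is used.

```python
def pad_batch(sequences, eos_id):
    """Right-pad every sequence with eos_id to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [eos_id] * (max_len - len(seq)) for seq in sequences]
```

Because the target length is taken per batch, short batches are not padded up to the global maximum sequence length.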

Table 11: Comparison of different methods to address the train-test discrepancy. $1\%$ and $5\%$ denote setting $1\%$ and $5\%$ of the training data to random length, respectively. For simplicity, we employ the chain rule to calculate the conditional likelihood and do not use the unsupervised CFG. Both variable-length training and padding mask tokens significantly improve the performance of MDMs in language understanding tasks.

| | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 30.13 | 55.29 | 29.16 | 26.20 | 56.04 | 28.52 | 35.21 | 16.51 |
| Padding | 38.38 | 59.91 | 31.63 | 27.60 | 60.77 | 28.42 | 37.00 | 31.03 |
| 1% | 37.79 | 61.50 | 31.86 | 27.00 | 60.34 | 29.19 | 36.85 | 36.00 |
| 5% | 37.12 | 51.87 | 32.29 | 26.60 | 58.98 | 29.18 | 36.85 | 32.04 |

Appendix C Additional Results

C.1 Additional Results of Language Understanding

Results of fixing the train-test discrepancy. For efficiency, we run these experiments with a 220M-parameter MDM pre-trained for $10^{20}$ FLOPs. Tab. 11 presents the ablation studies of variable-length training and padding mask tokens, demonstrating that both methods significantly improve the performance of MDMs.

Results of different likelihood evaluation methods. For efficiency, we employ an MDM with 220M parameters, pre-trained for $10^{20}$ FLOPs, and set $1\%$ of the training data to random length. Tab. 12 presents the ablation studies of different likelihood evaluation methods.

Table 12: Comparison of different likelihood evaluation methods. We employed 1024 Monte Carlo samples for the Monte Carlo estimation. All results are reported with the corresponding optimal unsupervised CFG scale. The optimal likelihood evaluation method differs across tasks.

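| | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Monte Carlo | 39.02 | 62.17 | 34.10 | 30.40 | 59.14 | 30.81 | 37.41 |
| Chain rule | 37.88 | 62.17 | 32.20 | 34.20 | 60.39 | 29.67 | 37.10 |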

Scaling behavior of MDMs on language understanding tasks. As shown in Fig. 3, the performance of MDMs on the language understanding tasks shows a scaling behavior with respect to the validation loss, which is consistent with observations in ARMs (Du et al., 2024). For efficiency and simplicity, methods for fixing train-test discrepancies and unsupervised CFG are not applied in this analysis.

C.2 Additional Results of Conditional Generation

More MT-Bench results of MDM. In Sec. 6, we report the MT-Bench results of ARM and MDM with $10^{20}$ and $1.6 \times 10^{21}$ pre-training FLOPs, respectively. Here, we present the MT-Bench result of MDM with $10^{20}$ pre-training FLOPs in Tab. 13.

Table 13: MT-Bench results of MDM with $10^{20}$ pre-training FLOPs.

| | CFG = 0.4 | CFG = 0.6 | CFG = 0.8 |
| --- | --- | --- | --- |
| Score | 1.21 | 1.22 | 1.23 |

Generated answers of MDM on MT-Bench. We present some answers generated by the MDM in Figs. 4 to 6.

C.3 Additional Results of Reverse Curse

Tab. 14 shows the effectiveness of the unsupervised CFG on the reverse curse.

Appendix D Evaluation Metrics

In this section, we provide an overview of the benchmarks used in Sec. 5 and show some cases from these benchmarks in Tab. 15.

ARC-Easy. A subset of the AI2 Reasoning Challenge that focuses on elementary-level science questions to evaluate the model’s reasoning ability through basic scientific concepts.

BoolQ. A yes-or-no question-answering dataset designed to evaluate the model’s ability to answer questions based on a given passage.

HellaSwag. A benchmark that assesses the model’s commonsense reasoning ability by asking it to complete a given sentence with one of four options.

OpenBookQA. A question-answering dataset modeled after open-book exams, designed to assess a model’s understanding of a subject by requiring multi-step reasoning and the integration of additional commonsense knowledge.

PIQA. Physical Interaction Question Answering is a metric that evaluates physical reasoning ability by asking models to select the best solution to a given problem involving everyday physical scenarios.

SIQA. Social Interaction Question Answering is a benchmark for commonsense reasoning that presents scenarios requiring reasoning about social interactions and the motivations behind human behavior.

RACE. ReAding Comprehension Dataset From Examinations was designed to evaluate reading comprehension ability by understanding and interpreting text at a high school level.

LAMBADA. A dataset to evaluate models’ capabilities in text understanding through a final single-word prediction task based on a given context.

Table 14: Effectiveness of unsupervised CFG on reverse curse. The unsupervised CFG enhances the performance of MDM on the reverse queries.

| | DescriptionToName, same direction, Acc. ↑ | DescriptionToName, reverse direction, Acc. ↑ | NameToDescription, same direction, Acc. ↑ | NameToDescription, same direction, BLEU ↑ | NameToDescription, reverse direction, Acc. ↑ | NameToDescription, reverse direction, BLEU ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| w/o CFG | 95 | 85 | 52 | 80 | 28 | 60 |
| w/ CFG | 97 | 92 | 49 | 76 | 37 | 67 |

Figure 3: Scaling properties of MDMs on language understanding tasks. The x-axis represents the validation loss, while the y-axis indicates the accuracy.
Table 15: Examples from language understanding benchmarks.

| Metric | Question | Choices |
| --- | --- | --- |
| ARC-Easy | Which of the following was probably most important in the formation of dark, fertile soil that is good for farming? | A. plant decomposition B. radioactive decay C. water erosion D. wind erosion |
| BoolQ | was the leaning tower of pisa built leaning | Yes / No |
| HellaSwag | A camera pans around a set of stairs and leads into people working out in a class. Several shots are shown of people working out together while a man speaks to the camera. the man | A. continues speaking while more people are shown working out together. B. is seen crashing into a wall several more times while people watch on the side. C. then leads the group on a liquid workout together. D. continues speaking to the camera while more shots are shown of them lifting weights and/or speaking to the camera. |
| OpenBookQA | A man plugs his television into an outlet behind a cabinet. He sees that the television may now be turned on so that he can watch his favorite show. The man knows that by hooking the t.v. cord into the outlet | A. he completed a lap B. he made a good deal C. he invented new circuits D. he completed a circuit |
| PIQA | When boiling butter, when it’s ready, you can | A. Pour it onto a plate B. Pour it into a jar |
| SIQA | Taylor took the poor dog she found on the road to the vet. What will the vet want to do next? | A. pronounce the dog dead B. wraps the dog in a towel C. check to make sure the dogs okay |
| RACE | The article is too long and has been disregarded. According to the passage, which of the following statements is TRUE? | A. There is more petroleum than we can use now. B. Trees are needed for some other things besides making gas. C. We got electricity from ocean tides in the old days. D. Gas wasn’t used to run cars in the Second World War. |
| LAMBADA | “Again, he left that up to you. However, he was adamant in his desire that it remain a private ceremony. He asked me to make sure, for instance, that no information be given to the newspaper regarding his death, not even an obituary. I got the sense that he didn’t want anyone, aside from the three of us, to know that he’d even _. | Not applicable: the LAMBADA metric does not involve choices; instead, it evaluates the model by predicting the final word of a given question. For this question, the answer is “died”. |

Prompt:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: Help me construct a catchy, yet scientifically accurate, headline for an article on the latest discovery in renewable bio-energy, while carefully handling the ethical dilemmas surrounding bio-energy sources. Propose 4 options. ASSISTANT:
Answers:
*Option 1: ” zum Uncovering the Power of Renewable Bio-energy: The Future We Have and Our Right to Utilize”
This article should handle the ethical dilemmas surrounding bio-energy sources, including addressing the need for sustainable alternatives, the choices made by developing renewable bio-energy sources, and the necessary balance the right of honest and technology.
*Option 2: ”Exploring the latest Sustainable Energy for All: Majestic Renewable Bio-Earth Reports and Stories”/”Comprehensive reports and insights on the latest research in renewable bio-energy sources. This may have a focus on renewable energy but should also be sustainable.
*Option 3: ”Unleashing Record Renewable Energy Production: Exploring Beyond Coal Stream Options” Record renewable energy production capacity and conditions for development. This article focuses on the future of renewable energy sources, while addressing the challenges involved with sustainatory alternatives.
*Option 4: ”wissenschaft Handling the Environmental Impact of Renewable Energy: Harnessing the opportun for Renewable Energy Sources”
One of the ethical dilemmas surrounding bio-energy sources are economic and social dilemmas, such as neglecting local, regional, or communities. This article seeks to discuss how to develop renewable energy sources that reduce the environmental impact and promote sustainability.
*Option 5: ”: Exploring the Future of Renewable Bio-energy: Navigating the Environmental Impacts and Source Options”
Discussing decision-making about the practices and consequences of developing renewable energy sources while ensuring a balance between ethical dilemmas and sustainability.
*Option 6: ”Exploring the Renewable Bio-Earth: Unlocking the Future of Renewable Energy Sources”

Figure 4: Text generated by the MDM on MT-Bench (case 1).

Prompt:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: What are some business etiquette norms when doing business in Japan? ASSISTANT:
Answers:
Business etiquette is influenced by Western culture, but Japan has many norms of respect when interacting. Here are some etiquette norms when doing business in Japan:
1. Show respect: Always greet with greets in a respectful manner in a Japanese business: Always greet with respectful and professional lines when working with Japanese service workers. Always greet the customers politely unless they have a reason.
3. Practice etiquette: Always make respectful statement and be polite: Always greet someone in a respectful manner as it is the cultural norm and always respond respectfully and in a polite manner.
4. Always work courteously: Always respect when hideholding in Japanese business: Always respect when dealing with Japanese service workers or officials. Always show respect when asking for something unless it is courteous and building.
5. Use Japanese apparel and respect Japanese culture when doing business in a Japanese company: Do not use Japanese or Western companies or meet Japanese officials unless you are allowed to use their language and respect in Japanese business. While it is common to use monosma garments, it is not in Japan keeping dignity: Always respect and maintain respect: Always respect and maintain your dignity when dealing with Japanese service workers or especially when answering questions or making demonstrations of professional respect or respect
7. Request permission: Alwaysfshow respect when asking permission: Request permission respectfully, fairly and courteously to avoid a request. Always show respect when asking for permission unless it is haing.
9. Show respect: Always respect the customer with respect in Japan business: Always respect the customer with respect. These are a few examples of common business etiquette norms of respect when doing business in Japan. In general, business etiquette norms depend on the specific industry or branch of operation and may vary depending on your understanding and circumstances.

Figure 5: Text generated by the MDM on MT-Bench (case 2).

Prompt:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. ASSISTANT:
Answers:
An enchanted-eyed traveler recently had the pleasure of discover Hawaii, a tropical paradise in the heart of the Pacific. The trip offers a diverse and blend cultural experience, representing the culture and outdoor adventure that Hawaii calls home.
The first thrilling experience is I hadn’t seen a wave breaking before in these green waters. But I experienced the same thing and witnessed the cool water rushing past me and the surf noticing caught my attention while peeking at the tall towering waterfalls.
Another cultural experience I had in my trip was exploring the land and its beaches through snorkeling. It is not just snorkeling, it is discovering beach dunes and sunnaporkeling on the gorgeous sand of Hana Beach, which is more famous than its name. Whether you can surf good, hike, or monke, it was a day of exploration.
My trip to Hawaii was a diverse destination with many aspects to explore and not enough time to see everything. Each culture has its own unique culture, determining the must see attractions. If you can only take one trip, come here and explore the beauty.

Figure 6: Text generated by the MDM on MT-Bench (case 3).