Scaling up Masked Diffusion Models on Text
Abstract
Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In text generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.4 times faster, or achieve higher quality than ARMs at a higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the reverse curse encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B). Our code is available at https://github.com/ML-GSAI/SMDM.
1 Introduction
Autoregressive models (ARMs) have long been regarded as the gold standard in probabilistic language modeling. Their ability to predict the next token, grounded in the chain rule, naturally aligns with the sequential nature of language and scales effectively (Radford, 2018; Radford et al., 2019; Brown, 2020; OpenAI, 2022; Achiam et al., 2023; Touvron et al., 2023a; b; Dubey et al., 2024) when integrated with Transformers (Vaswani, 2017). However, ARMs exhibit inherent limitations, particularly in reasoning tasks that require bidirectional context understanding or handling temporal shifts in data. These shortcomings, widely recognized as the reverse curse (Berglund et al., 2023) and temporal quality degradation (Vela et al., 2022), significantly hinder their applicability in complex language modeling scenarios. Additionally, their linear sampling time growth w.r.t. the output length poses practical challenges for long text generation.
The limitations of ARMs have sparked interest in an alternative approach: masked diffusion models (MDMs) (Austin et al., 2021; Hoogeboom et al., 2021b; a; He et al., 2022; Campbell et al., 2022; Meng et al., 2022; Sun et al., 2022; Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024). MDMs present a promising alternative due to their unique probabilistic framework, which enables flexible bidirectional context modeling by filling in masked positions across a sequence. Recent advances (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024) have shown promise in unconditional text generation and zero-shot perplexity evaluation. Despite recent progress, the scalability of MDMs and their effectiveness in critical language tasks, such as conditional generation and language understanding, remain open questions. Furthermore, it is still unclear whether MDMs can address the inherent limitations of ARMs, such as improving bidirectional reasoning capabilities.
This paper challenges the longstanding dominance of ARMs by presenting a comprehensive study of MDMs regarding key factors in language models: scalability, capabilities for language understanding, and conditional generation. To achieve this, we train a family of MDMs with up to 1.1 billion (B) parameters on a large-scale dataset and establish the first scaling law for MDMs. Building on their unique probabilistic framework, we propose a simple yet effective unsupervised classifier-free guidance (CFG) mechanism that exploits unsupervised data to enhance inference performance in language tasks involving conditional distributions. Notably, unlike standard CFG (Ho & Salimans, 2022), unsupervised CFG does not rely on paired data, yet it can still benefit from paired data when available, achieving performance that surpasses standard CFG. Supported by the scaling law and unsupervised CFG, our extensive experiments yield the following key findings:
• Strong scalability. In an IsoFLOP analysis (Hoffmann et al., 2022) that scales the compute budget over a range of FLOPs (see Fig. 1), the optimal validation loss of MDMs decreases according to a power law, at a rate matching that of ARMs (see Fig. 2). MDMs maintain a constant compute gap of 16 times relative to ARMs; this gap is smaller than the factor of 64 observed for continuous diffusion models (Gulrajani & Hashimoto, 2024) and can likely be reduced further with future optimizations.
• Competitive in zero-shot language understanding. Across eight standard zero-shot benchmarks covering commonsense reasoning and reading comprehension, MDMs outperform not only a same-sized ARM with the same pre-training FLOPs but also a larger 1.5B GPT-2 model on four tasks. Furthermore, when scaled up with 16 times more pre-training time, as suggested by the scaling law, MDMs consistently surpass ARMs across all tasks.
• Flexible trade-off in conditional generation. On the standard MT-Bench, a 1.1B MDM matches the performance of a same-sized ARM while achieving a 1.4 times speedup in sampling time. By increasing the number of sampling steps, MDMs can further improve generation quality at the cost of being 1.4 times slower than the ARM. Notably, the ARM uses a KV-cache, a technique that speeds up sequential sampling, whereas the MDMs use no system-level optimization.
• Addressing challenging tasks for ARMs. MDMs effectively relieve temporal quality degradation (Vela et al., 2022) compared to a same-sized ARM and successfully overcome the reverse curse (Berglund et al., 2023) encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B).
2 Masked Diffusion Models on Text
In analogy to continuous diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), MDMs (Austin et al., 2021; Lou et al., 2023; Ou et al., 2024) also introduce a forward process that gradually adds noise to the data and learn a corresponding reverse process to generate samples. Our basic approach is built upon Ou et al. (2024), an advanced MDM suitable for scaling.
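Forward process. Let $K$ and $L$ denote the vocabulary size and sentence length, respectively. Given a sentence $x_0 \in \{0, 1, \dots, K-1\}^L$ and a noise level $t \in [0, 1]$, the forward process in MDMs randomly and independently masks out tokens in the sentence, formulated as follows:
$$q_{t|0}(x_t \mid x_0) = \prod_{i=0}^{L-1} q_{t|0}(x_t^i \mid x_0^i), \qquad q_{t|0}(x_t^i \mid x_0^i) = \begin{cases} \alpha_t, & x_t^i = x_0^i, \\ 1 - \alpha_t, & x_t^i = m, \end{cases} \tag{1}$$
where $x^i$ denotes the $i$-th element of $x$, $m$ denotes the mask token (Devlin, 2018), $x_t$ denotes the noisy data at time $t$, and $q_0(\cdot)$ is the data distribution $p_{\text{data}}(\cdot)$. We set the hyperparameter $\alpha_t$ as $1 - t$ for the best empirical performance, as suggested in previous work (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024).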
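As a minimal sketch of the forward process, the snippet below masks each token independently with probability $t$. It assumes a linear schedule (a token survives with probability $1 - t$); the `MASK_ID` value and the function name `forward_mask` are hypothetical choices for illustration, not part of the released code.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the mask token


def forward_mask(x0, t, rng):
    """Mask each token of x0 independently with probability t (assumed linear schedule)."""
    x0 = np.asarray(x0)
    is_masked = rng.random(x0.shape) < t  # one independent draw per position
    return np.where(is_masked, MASK_ID, x0)


rng = np.random.default_rng(0)
x0 = np.array([5, 2, 9, 7, 1, 3])
xt = forward_mask(x0, 0.5, rng)  # roughly half of the tokens become MASK_ID
```

At $t = 0$ the sentence is returned unchanged, and at $t = 1$ every position is masked, which matches the two endpoints of the process described above.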
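Reverse process. The reverse process in MDMs iteratively recovers values for masked tokens, starting from a fully masked sequence $x_1$. Let $0 \le s < t \le 1$; the reverse process is characterized by
$$q_{s|t}(x_s \mid x_t) = \prod_{i=0}^{L-1} q_{s|t}(x_s^i \mid x_t), \qquad q_{s|t}(x_s^i \mid x_t) = \begin{cases} 1, & x_t^i \neq m,\ x_s^i = x_t^i, \\ \frac{1 - \alpha_s}{1 - \alpha_t}, & x_t^i = m,\ x_s^i = m, \\ \frac{\alpha_s - \alpha_t}{1 - \alpha_t}\, q_{0|t}(x_s^i \mid x_t), & x_t^i = m,\ x_s^i \neq m, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
Here $q_{0|t}(\cdot \mid x_t)$ is the data prediction model (Ho et al., 2020) to be learned. Notably, Ou et al. (2024) revealed an intrinsic property of MDMs: $q_{0|t}$ can be represented by conditional distributions on clean data, independent of the time $t$, which distinguishes MDMs from other diffusion models. Formally,
$$q_{0|t}(x_s^i \mid x_t) = p_{\text{data}}(x_0^i = x_s^i \mid x_t^{\text{UM}}), \quad \forall i \text{ such that } x_t^i = m, \tag{3}$$
where $x_t^{\text{UM}}$ collects all unmasked tokens in the noisy data $x_t$ and $p_{\text{data}}(\cdot \mid x_t^{\text{UM}})$ is irrelevant to $t$. (Footnote: for example, if $x_t = [a, m, b]$, then $x_t^{\text{UM}} = [a, b]$, and the conditional distribution given $x_t^{\text{UM}}$ is irrelevant to $t$.)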
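One reverse step can be sketched as follows, again assuming the linear schedule, under which a masked position is revealed with probability $(t - s)/t$ when moving from time $t$ to $s < t$. The array `probs` stands in for the output of a learned data prediction model; here it is a hypothetical placeholder rather than a real network.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the mask token


def reverse_step(xt, probs, s, t, rng):
    """One reverse step from time t to s < t.

    probs has shape (L, K): for each position, a predicted distribution over
    clean tokens. Already-unmasked positions are never changed.
    """
    xt = np.asarray(xt)
    xs = xt.copy()
    for i in np.flatnonzero(xt == MASK_ID):
        if rng.random() < (t - s) / t:  # reveal with probability (t - s) / t
            xs[i] = rng.choice(probs.shape[1], p=probs[i])
    return xs


rng = np.random.default_rng(0)
xt = np.array([MASK_ID, 3, MASK_ID])
probs = np.full((3, 4), 0.25)  # toy uniform predictions over K = 4 tokens
xs = reverse_step(xt, probs, s=0.0, t=1.0, rng=rng)
```

With $s = 0$ every remaining mask is revealed in a single step; using a grid of intermediate times instead spreads the reveals over several model calls.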
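Training objective. A distribution $p_\theta(x_0^i \mid x_t)$ parameterized by $\theta$ is employed to approximate $p_{\text{data}}(x_0^i \mid x_t^{\text{UM}})$, optimizing the following upper bound on the negative log-likelihood (Ou et al., 2024):
$$-\log p_\theta(x_0) \le \int_0^1 \frac{1}{t}\, \mathbb{E}_{q_{t|0}(x_t \mid x_0)} \Big[ \sum_{\{i \,:\, x_t^i = m\}} -\log p_\theta(x_0^i \mid x_t) \Big]\, dt =: \mathcal{L}. \tag{4}$$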
We emphasize that the formulation is particularly suitable for scaling. First, it is among the best MDMs w.r.t. zero-shot perplexity (Ou et al., 2024). Second, it removes the timestep from input and minimally modifies the original Transformers (see Sec. 3). Third, it enables unsupervised classifier-free guidance, which does not rely on paired data yet is effective in language tasks (see Sec. 4).
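The training objective can be estimated by Monte Carlo: sample a noise level, mask the sentence, and sum the negative log-probabilities of the masked tokens, weighted by $1/t$. The sketch below assumes the linear schedule and uses a toy uniform model (`uniform_log_probs`) in place of a Transformer; all names are hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the mask token


def uniform_log_probs(xt, vocab_size=8):
    """Toy stand-in for the model: uniform log-probabilities at every position."""
    return np.full((len(xt), vocab_size), -np.log(vocab_size))


def mdm_loss_estimate(x0, log_prob_fn, rng, t=None):
    """Single-sample estimate: (1/t) * sum over masked i of -log p(x0[i] | x_t)."""
    x0 = np.asarray(x0)
    if t is None:
        t = rng.uniform(1e-3, 1.0)  # keep t away from 0, where 1/t diverges
    masked = rng.random(x0.shape) < t
    xt = np.where(masked, MASK_ID, x0)
    log_probs = log_prob_fn(xt)  # shape (L, K); no timestep is passed in
    token_nll = -log_probs[np.arange(len(x0)), x0]
    return float(token_nll[masked].sum() / t)


rng = np.random.default_rng(0)
x0 = np.array([1, 4, 7, 2])
estimate = mdm_loss_estimate(x0, uniform_log_probs, rng)
```

Note that the model receives only the partially masked sequence, which reflects the point made above that the timestep can be dropped from the input.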
3 Scaling Laws for Masked Diffusion Models
Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) characterize the fundamental relationship between model performance and computational resources under constraints, significantly influencing the progress of large ARMs. We introduce the first scaling laws for MDMs and conduct a fair comparison with ARMs. Our results reveal the strong scalability of MDMs, highlighting their potential as a competitive alternative to ARMs in language modeling.
Model. We employ a Transformer decoder for ARMs and the corresponding Transformer encoder for MDMs (note that it is unnecessary to input timestep according to Eq. (3)). The differences between these architectures are: (1) the encoder has an additional dimension in its embedding layer for the mask token, and (2) the encoder’s self-attention does not use a causal mask. All other architectural settings (e.g., depth, hidden size, and number of heads) remain consistent in both models.
We further enhance both models with several techniques inspired by advanced language models like Llama (Touvron et al., 2023a; b). Specifically, we adopt Pre-LayerNorm with RMSNorm (Zhang & Sennrich, 2019) for better stability, use SwiGLU (Shazeer, 2020) as the activation function to enhance non-linearity, and implement RoPE (Su et al., 2024) for more expressive positional encoding.
Data. The well-known Chinchilla scaling law (Hoffmann et al., 2022) utilizes a large dataset with more data than the number of training tokens. Motivated by it, we employ the open-source SlimPajama dataset (Soboleva et al., 2023), a multi-corpora dataset comprising 627 billion tokens, which is sufficiently large for all of our experiments. For simplicity and fairness, we employ the Llama-2 tokenizer (Touvron et al., 2023b) for both ARMs and MDMs. Additionally, we set the context length to 2048. Further implementation details are provided in Appendix B.2.
IsoFLOP analysis. We conduct a standard IsoFLOP analysis (Hoffmann et al., 2022) to identify the optimal allocation between the number of non-embedding parameters $N$ and the dataset size $D$. Specifically, building on prior studies (Kaplan et al., 2020; Hoffmann et al., 2022), we scale the compute budget $C$ across a range of FLOPs values. For a fixed $C$, we train a series of models with varying values of $N$ and $D$ such that $C = 6ND$. We fit a quadratic function to the validation loss as a function of the logarithm of the parameter size, $\log N$, where the loss of MDMs is the one defined in Eq. (4). This regression allows us to determine the optimal parameter size $N^*$, which corresponds to the minimum validation loss $L^*$ for a given compute budget $C$. The IsoFLOP analysis results are visualized in Fig. 1.
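The quadratic fit in the IsoFLOP analysis can be sketched as below. The data points are synthetic values invented for illustration, not results from the paper, and the helper `tokens_for_budget` assumes the common approximation that training compute is about six FLOPs per parameter per token.

```python
import numpy as np


def tokens_for_budget(compute, n_params):
    """Tokens D implied by a budget C and size N, assuming C is about 6 * N * D."""
    return compute / (6.0 * n_params)


def optimal_params(n_params, val_losses):
    """Fit loss = a*(log N)^2 + b*log N + c and return (N*, fitted minimum loss)."""
    log_n = np.log(np.asarray(n_params, dtype=float))
    a, b, c = np.polyfit(log_n, np.asarray(val_losses, dtype=float), deg=2)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    log_n_opt = -b / (2.0 * a)  # vertex of the parabola
    return float(np.exp(log_n_opt)), float(a * log_n_opt**2 + b * log_n_opt + c)


# Synthetic IsoFLOP curve with its minimum placed at N = 2e8 (made-up numbers).
n_grid = np.array([5e7, 1e8, 2e8, 4e8, 8e8])
losses = 0.1 * (np.log(n_grid) - np.log(2e8)) ** 2 + 3.0
n_opt, loss_opt = optimal_params(n_grid, losses)
```

Repeating this fit for each compute budget gives one pair of optimal size and optimal loss per budget, which is the input to the scaling-law fit.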
Scaling laws. After obtaining the optimal validation loss $L^*$ for each compute budget $C$ in the IsoFLOP analysis, we fit the following scaling law to model the relationship between them:
$$\log L^* = \alpha \log C + \beta. \tag{5}$$
Let $\alpha^*$ and $\beta^*$ denote the solution of Eq. (5); the validation loss then empirically follows the power law $L^* = e^{\beta^*} C^{\alpha^*}$.
As illustrated in Fig. 2(a), the validation loss of MDMs decreases according to a power law as the compute budget increases, following a rate similar to that of ARMs. MDMs still require approximately 16 times more computational resources than ARMs to achieve comparable validation losses. There is still potential to narrow this constant since optimizations for MDMs in model, data, and system remain unexplored. Besides, for reference, Gulrajani & Hashimoto (2024) reported that the constant factor between continuous diffusion models (CDMs) and ARMs is 64.
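To make the notion of a constant compute gap concrete, the sketch below fits a pure power law in log-log space and computes the compute multiplier needed for one model family to match another when both share the same exponent. The pure power-law form is an assumption of this sketch, and the numbers are synthetic, chosen so that the gap is 16 by construction.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = alpha * log(compute) + beta."""
    alpha, beta = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(alpha), float(beta)


def compute_gap(alpha, beta_a, beta_b):
    """Compute multiplier family B needs to reach family A's loss (shared slope alpha)."""
    # beta_a + alpha*log(C_a) = beta_b + alpha*log(C_b)  =>  log(C_b/C_a) = (beta_a - beta_b)/alpha
    return float(np.exp((beta_a - beta_b) / alpha))


# Synthetic curves: family B behaves like family A given 16 times less compute.
budgets = np.array([1e18, 1e19, 1e20])
loss_a = 50.0 * budgets ** -0.06
loss_b = 50.0 * (budgets / 16.0) ** -0.06
alpha_a, beta_a = fit_power_law(budgets, loss_a)
alpha_b, beta_b = fit_power_law(budgets, loss_b)
gap = compute_gap(alpha_a, beta_a, beta_b)
```

Because the two fitted lines are parallel in log-log space, the gap is the same at every budget, which is what a constant factor means here.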
Furthermore, the optimal model size also follows a power-law relationship with the compute budget, as shown in Fig. 2(b). Notably, the optimal size of MDMs is approximately half that of ARMs across different computations, reflecting a very similar scaling behavior on utilizing the parameter capacity.
In conclusion, the comparable scaling rates and the relatively small constant factors suggest that MDMs have strong scalability and promising potential as an alternative to ARMs on a large scale.
4 Unsupervised Classifier-free Guidance
We propose a surprisingly simple yet effective approach that leverages unlabeled data to boost performance in various language tasks, dubbed unsupervised classifier-free guidance (CFG).
CFG. CFG (Ho & Salimans, 2022) is an effective and versatile technique widely used in both continuous and discrete diffusion models, with applications spanning image (Ho & Salimans, 2022; Chang et al., 2023) and text generation (Lovelace et al., 2024). Rooted in Bayes’ rule, CFG simultaneously trains a conditional and an unconditional diffusion model, introducing a rescaled distribution for inference. Specifically, at a given timestep $t$, CFG (Chang et al., 2023) is defined as:
$$\tilde{p}_\theta(x_0 \mid c, x_t) \propto \frac{p_\theta(x_0 \mid c, x_t)^{1+w}}{p_\theta(x_0 \mid x_t)^{w}}, \tag{6}$$
where $c$ is the condition, $w$ is a hyperparameter that flexibly controls the strength of $c$, and $p_\theta(x_0 \mid c, x_t)$ and $p_\theta(x_0 \mid x_t)$ are the conditional and unconditional models, respectively.
Notably, it seems that the conditional model must be trained on paired data before applying CFG. Consequently, to the best of our knowledge, all existing work (Ho & Salimans, 2022; Chang et al., 2023; Lovelace et al., 2024) falls into the supervised setting, where paired data are readily available.
Unsupervised CFG. We extend CFG to an unsupervised setting by introducing a new formulation:
$$\tilde{p}_\theta(x_0 \mid c, x_t) \propto \frac{p_\theta(x_0 \mid c, x_t)^{1+w}}{p_\theta(x_0 \mid \mathbf{m}, x_t)^{w}}, \tag{7}$$
where $\mathbf{m}$ is a mask sequence of the same length as $c$. Compared to Eq. (6), the dummy variable $\mathbf{m}$ translates the unconditional distribution into a conditional format without adding new information. For simplicity, we continue to refer to $p_\theta(x_0 \mid \mathbf{m}, x_t)$ as the unconditional distribution in unsupervised CFG throughout this paper.
The core insight is that an MDM already characterizes both distributions employed in Eq. (7) during unsupervised pretraining. Specifically, in language tasks, both $c$ and $x_0$ can be viewed as segments of a whole sequence, following the same distribution as the unsupervised samples used for pretraining. (For example, the question “where does the sun rise?” and the answer “from the east.” form a paired sample, but their concatenation “where does the sun rise? from the east.” can be modeled by an MDM with unsupervised training.) After pretraining on large-scale text data, MDMs can capture the joint distribution of the whole sequence, i.e., $p_{\text{data}}(c, x_0)$. Under this formulation, MDMs simultaneously learn all conditional distributions on clean data induced by this joint distribution according to Eq. (3). In particular, we have:
$$p_\theta(x_0 \mid c, x_t) \approx p_{\text{data}}(x_0 \mid c, x_t^{\text{UM}}), \qquad p_\theta(x_0 \mid \mathbf{m}, x_t) \approx p_{\text{data}}(x_0 \mid x_t^{\text{UM}}), \tag{8}$$
where both distributions are factorized as in Eq. (3), and the approximation error is due to the gap between the model distribution and the true data distribution. Notably, Eq. (8) also implies that the unconditional distribution used in standard CFG and the conditional distribution with a dummy variable share a similar role.
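A minimal sketch of the rescaling used by unsupervised CFG is given below. It assumes the guidance form in which the conditional distribution is raised to the power $1 + w$ and divided by the unconditional one raised to $w$; the unconditional input is built by replacing the prompt with mask tokens of the same length. The helper names and the `MASK_ID` value are hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the mask token


def build_inputs(prompt, xt):
    """Return (conditional input, 'unconditional' input with the prompt masked out)."""
    prompt, xt = np.asarray(prompt), np.asarray(xt)
    cond = np.concatenate([prompt, xt])
    uncond = np.concatenate([np.full(len(prompt), MASK_ID), xt])
    return cond, uncond


def cfg_rescale(logp_cond, logp_uncond, w):
    """Normalized distribution proportional to p_cond^(1+w) / p_uncond^w (last axis)."""
    logits = (1.0 + w) * logp_cond - w * logp_uncond
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


cond_in, uncond_in = build_inputs([3, 4], [MASK_ID, 7])
p_cond = np.array([0.5, 0.5])
p_uncond = np.array([0.8, 0.2])
guided = cfg_rescale(np.log(p_cond), np.log(p_uncond), w=1.0)
```

In this toy case the token that the prompt makes relatively more likely (the second one) is boosted, while setting `w=0` recovers the plain conditional distribution. Both sets of log-probabilities would come from the same pretrained model, evaluated once on each input.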
The analysis above explains why unsupervised CFG works without paired data; Sec. 5 verifies this empirically. Moreover, when paired data are available for downstream tasks, simply fine-tuning the conditional distribution in MDMs, similar to the classical approach used for ARMs, not only further improves the performance of unsupervised CFG but also outperforms standard CFG trained on paired data, demonstrating its superior capability in leveraging large-scale unpaired data (see Sec. 6).
5 Zero-shot Language Understanding
We investigate the capabilities of MDMs in zero-shot language understanding, a critical skill for language models that has been largely overlooked in prior studies (Austin et al., 2021; Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024; Gat et al., 2024). Our results show that MDMs are highly competitive with ARMs of similar model size and compute.
Benchmarks. To provide a comprehensive evaluation, we assess MDMs on eight widely used benchmarks involving commonsense reasoning and reading comprehension: Hellaswag (Zellers et al., 2019), ARC-Easy (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), OpenBookQA (Mihaylov et al., 2018), RACE (Lai et al., 2017), and LAMBADA (Paperno et al., 2016). For a detailed description of these benchmarks, see Appendix D.
On certain challenging benchmarks such as ARC-Challenge (Clark et al., 2018), WinoGrande (Sakaguchi et al., 2021), and MMLU (Hendrycks et al., 2020), both the ARMs and the MDMs pre-trained in Sec. 3 perform similarly to random guessing. This is consistent with findings from Wei et al. (2022a), which showed that only ARMs trained with a sufficiently large number of FLOPs can surpass random guessing on MMLU, a phenomenon known as the emergence of new capabilities in large language models. We leave the exploration of potential emergent abilities at a larger scale as future work.
Evaluation. We employ the popular Language Model Evaluation Harness framework (Gao et al., 2024) to evaluate both ARMs and MDMs. For the LAMBADA dataset, given a prompt, we apply greedy sampling to generate responses from each model and calculate the matching accuracy against the ground truth (please refer to Appendix A for the greedy sampling algorithm of MDMs). For other tasks, we report the accuracy of each model that selects the correct answer from the provided options based on the given context. Specifically, we compute the likelihood of each option given the prompt and choose the answer with the highest likelihood.
Fixing the train-test discrepancy. Because they employ a bidirectional Transformer encoder, MDMs face a train-test discrepancy in context lengths, which negatively impacts model performance. Specifically, the training context length is fixed at 2048 tokens, while the testing context length is variable and often shorter. To address this issue, we propose two mitigation strategies: (1) allocate a portion of the training data to variable-length sequences whose lengths are sampled from a uniform distribution; (2) pad sentences with mask tokens to reach 2048 tokens during evaluation.
As presented in Appendix C.1, both strategies effectively reduce the train-test discrepancy, and only a small proportion of variable-length training data is sufficient to activate the capability to handle variable-length inputs. Given its superior inference efficiency (e.g., 20 times faster than method (2) on the Hellaswag dataset), we employ method (1) in subsequent experiments.
Flexible likelihood evaluation. As detailed in Sec. 2, MDMs model the conditional distribution of clean data, which enables flexible likelihood evaluation. Given a prompt $c$ and a sentence $x_0$ of length $L$, we can determine the conditional likelihood $p_\theta(x_0 \mid c)$ using the following methods: (1) employ Monte Carlo estimation to establish a lower bound of the log-likelihood based on Eq. (4); (2) utilize the chain rule to compute the likelihood as $\prod_{i=0}^{L-1} p_\theta(x_0^i \mid c, x_0^{<i})$.
We observed that the chain rule yields higher accuracy on OpenBookQA and PIQA, while Monte Carlo estimation yields better accuracy on ARC-Easy, Hellaswag, RACE, and SIQA. Since BoolQ answers consist of a single token (“Yes” or “No”), both methods produce identical results. We adopt the better method for each task in subsequent experiments; please refer to Appendix C.1 for more details and results.
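The chain-rule variant of likelihood evaluation can be sketched as follows: answer tokens are revealed left to right, the remaining answer positions stay masked, and each token is scored given what is visible so far. The multiple-choice answer is then the option with the highest score. The toy model `toy_log_probs` and the `MASK_ID` value are hypothetical stand-ins for a trained MDM.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the mask token


def toy_log_probs(seq):
    """Toy stand-in for the model: the same fixed distribution at every position."""
    return np.log(np.tile([0.1, 0.7, 0.2], (len(seq), 1)))


def chain_rule_log_likelihood(prompt, answer, log_prob_fn):
    """Sum over i of log p(answer[i] | prompt, answer[:i]); later positions are masked."""
    prompt, answer = np.asarray(prompt), np.asarray(answer)
    total = 0.0
    for i in range(len(answer)):
        visible = np.concatenate([answer[:i], np.full(len(answer) - i, MASK_ID)])
        seq = np.concatenate([prompt, visible])
        log_probs = log_prob_fn(seq)  # shape (len(seq), K)
        total += log_probs[len(prompt) + i, answer[i]]
    return float(total)


def pick_option(prompt, options, log_prob_fn):
    """Index of the option with the highest conditional log-likelihood."""
    scores = [chain_rule_log_likelihood(prompt, o, log_prob_fn) for o in options]
    return int(np.argmax(scores))


choice = pick_option([0], [[0, 2], [1, 1]], toy_log_probs)
```

This variant needs one model call per answer token, whereas the Monte Carlo estimate trades exactness for a number of calls set by the sample count.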
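Table 1: Zero-shot accuracy of the 220M MDM with and without unsupervised CFG.

| | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|
| MDM w/o CFG | 37.42 | 61.50 | 33.46 | 27.00 | 60.34 | 29.28 | 36.95 | 36.00 |
| MDM w/ CFG | 39.02 | 62.17 | 34.10 | 34.20 | 60.39 | 30.81 | 37.41 | 40.99 |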
Effectiveness of unsupervised CFG without paired data. In this section, we use a default MDM with 220M parameters, trained under a fixed FLOPs budget, for efficiency. For likelihood evaluation, we use the rescaled conditional distribution defined in Eq. (7) of unsupervised CFG. Since no paired data is available, standard CFG cannot be applied in this scenario. As shown in Table 1, unsupervised CFG improves the performance of MDMs on all eight benchmarks, demonstrating its capability to leverage unpaired data effectively.
Competitive zero-shot language understanding performance. First, we compare the performance of MDMs and ARMs pre-trained on the same SlimPajama dataset for fair and detailed analysis. As shown in Table 2, MDMs outperform ARMs on four out of eight tasks when trained with an equivalent number of pre-training FLOPs. Besides, with approximately 16 times more pre-training time, as suggested by the scaling law in Sec. 3, MDMs surpass ARMs across all tasks.
We further analyze the scaling behavior of MDMs on understanding tasks and observe a clear trend: as the validation loss decreases, the performance on most tasks correspondingly improves, indicating a positive signal for scaling MDMs to achieve even stronger capabilities. We provide the results and more details in Appendix C.1.
Additionally, in Table 3, we benchmark MDMs against the well-known GPT-2 model for a comprehensive comparison with existing literature. When scaled to 1.1B parameters (see Appendix B.3 for the training budget and other details), MDMs outperform the official 1.5B GPT-2 model on four out of eight benchmarks, highlighting their competitive performance compared to larger and standard ARMs.
Interestingly, when ARMs and MDMs are matched in size (and computation if known), their relative performance across datasets is consistent in both Table 2 and Table 3: MDMs lead on the same subset of tasks (BoolQ, OpenBookQA, RACE, and LAMBADA) in both tables, while ARMs lead on the remaining tasks. Although the underlying mechanism is not yet fully understood, this observation suggests that MDMs and ARMs may play complementary roles in language understanding tasks. We believe all these results make MDMs a promising alternative to ARMs for language understanding tasks.
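Table 2: Zero-shot accuracy of ARMs and MDMs pre-trained on SlimPajama. The values of the FLOPs column were lost in extraction and are left blank.

| | FLOPs | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|---|
| ARM (220M) | | 35.40 | 61.69 | 28.51 | 26.20 | 58.98 | 25.84 | 35.98 | 17.02 |
| ARM (220M) | | 40.49 | 60.06 | 35.81 | 29.60 | 65.07 | 29.47 | 38.38 | 26.92 |
| MDM (220M) | | 39.02 | 62.17 | 34.10 | 34.20 | 60.39 | 30.81 | 37.41 | 40.99 |

Table 3: Zero-shot accuracy of the 1.1B MDM and the official 1.5B GPT-2. The FLOPs value for the MDM was lost in extraction and is left blank.

| | FLOPs | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|---|
| GPT-2 (1.5B) | - | 51.05 | 61.77 | 50.89 | 32.00 | 70.51 | 33.11 | 40.28 | 44.61 |
| MDM (1.1B) | | 44.44 | 62.17 | 45.91 | 34.40 | 64.31 | 33.40 | 40.02 | 44.71 |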
6 Conditional Language Generation
We investigate the capabilities of MDMs in conditional generation, another core language task largely unexplored previously. Our results show that a 1.1B MDM achieves a more flexible and effective quality-efficiency trade-off during inference than a same-sized ARM that utilizes KV cache.
Evaluation. Previous studies (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024; Gat et al., 2024) have commonly employed generative perplexity as a metric to assess unconditional generation quality. However, recent work (Zheng et al., 2024) demonstrated that even low-quality samples can score well on generative perplexity, suggesting that this metric may not reliably reflect generative quality. Moreover, conditional generation is more widely applicable in real-world scenarios than unconditional generation. Therefore, this paper focuses on conditional generation.
In particular, we employ MT-Bench (Zheng et al., 2023), which uses a strong language model (i.e., GPT-4o (Achiam et al., 2023)) as a judge to score models on open-ended questions. This metric aligns well with human preferences and has become a standard for evaluating large language models.
Supervised fine-tuning. We employ an ARM and an MDM, both pre-trained as described in Sec. 3 with 1.1B parameters each. For a meaningful comparison, we evaluate their inference performance and, guided by the scaling law, extend the MDM’s pre-training time by a factor of 16. Results using equal computation budgets are provided in Appendix C.2. Following a standard process in language models, we fine-tune both models on the ShareGPT dataset (https://sharegpt.com/), a high-quality dialogue corpus containing user prompts and corresponding ChatGPT responses (OpenAI, 2022).
Since ShareGPT samples vary in length, for the MDM we pad each sample with the end-of-sequence (EOS) token to the maximum sequence length within a batch. Following the same approach as for ARMs, we mask the loss on prompts: noise is added only to the response tokens (including the EOS padding), while the prompts are kept unchanged in the forward process. As a result, the MDM only tunes the conditional distribution of the response given the prompt. We set the sequence length to 1024 and remove the EOS tokens from the generated outputs during inference. For the ARM, generation stops when the EOS token is produced, with a maximum sequence length of 1024 (Zheng et al., 2023). For a fair comparison, we use identical optimizer settings for both models and train for 3 epochs as specified in Zheng et al. (2023). Additional training details are provided in Appendix B.4.
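The fine-tuning data preparation described above can be sketched as follows: the response is padded to a fixed length, and only response and padding positions are eligible for masking, so the prompt always stays visible. This sketch assumes that the padding token is an end-of-sequence token and that the linear schedule is used; `eos_id`, `mask_id`, and the function name are hypothetical.

```python
import numpy as np


def sft_noise(prompt, response, total_len, t, rng, mask_id=-1, eos_id=0):
    """Pad the response with eos_id, then mask only response/padding positions.

    Returns (noisy input, clean target, boolean mask of positions that
    contribute to the loss).
    """
    n_pad = total_len - len(prompt) - len(response)
    if n_pad < 0:
        raise ValueError("prompt and response exceed total_len")
    target = np.array(list(prompt) + list(response) + [eos_id] * n_pad)
    in_response = np.arange(total_len) >= len(prompt)  # prompt is never noised
    masked = in_response & (rng.random(total_len) < t)
    noisy = np.where(masked, mask_id, target)
    return noisy, target, masked


rng = np.random.default_rng(0)
noisy, target, loss_mask = sft_noise([5, 6, 7], [8, 9], total_len=8, t=1.0, rng=rng)
```

Because the padding is part of the noised region, the model also learns where a response ends, which is what allows the padding tokens to be stripped from generated outputs.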
Effectiveness of unsupervised CFG against standard CFG. As shown in Table 4, we evaluate the effectiveness of unsupervised CFG by comparing it against several baselines detailed in Appendix B.4. The first baseline fine-tunes only the conditional distribution of the MDM on paired data and samples without CFG. The second fine-tunes both the conditional and unconditional distributions on paired data and samples as in standard CFG. For unsupervised CFG, we fine-tune only the conditional distribution on paired data, because large-scale pre-training already provides a strong unconditional model. Notably, our unsupervised CFG outperforms standard CFG, demonstrating its superior ability to leverage large-scale unpaired data, given that the paired data available for fine-tuning are often small in scale.
Better efficiency-quality trade-off. We further compare MDMs and ARMs regarding sample quality and efficiency. Our study significantly extends prior work (Lou et al., 2023; Sahoo et al., 2024; Shi et al., 2024; Ou et al., 2024; Gat et al., 2024) in two key aspects: (1) we focus on the more practical and challenging task of conditional generation rather than unconditional generation, and (2) we measure the running time instead of the number of function evaluations (NFEs), even when ARMs are equipped with the KV-cache, a technique that accelerates sampling by caching intermediate features during sequential generation.
Built upon the unsupervised CFG, MDMs demonstrate a more flexible and effective trade-off between efficiency and quality in conditional generation compared to ARMs. As shown in Table 5, a 1.1B MDM matches the performance of a similarly sized ARM while achieving a 1.4 times speedup in sampling time. Conversely, by increasing the number of sampling steps (at the cost of being 1.4 times slower), MDMs can surpass ARMs in generation quality. These results indicate that MDMs hold promise for conditional generation tasks, such as chat-based applications, where the ability to balance speed and quality is critical.
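Table 4: MT-Bench scores of the fine-tuned MDM without CFG, with standard CFG (∗), and with unsupervised CFG (†).

| | w/o CFG | w/ CFG∗ | w/ CFG† |
|---|---|---|---|
| Score | 1.32 | 1.53 | 1.60 |

Table 5: MT-Bench score, NFEs, and sampling time of the 1.1B MDM (three sampling-step settings) and the 1.1B ARM.

| | MDM | MDM | MDM | ARM |
|---|---|---|---|---|
| Score | 1.40 | 1.56 | 1.60 | 1.57 |
| NFEs | 64 | 128 | 256 | 325.94 |
| Time | 204s | 396s | 780s | 555s |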
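The 1.4 times figures quoted in the text follow directly from the sampling times reported above. The short check below uses only those reported times; the variable names are chosen for illustration.

```python
# Sampling times in seconds, taken from the table above.
arm_time_s = 555.0
mdm_time_s = {64: 204.0, 128: 396.0, 256: 780.0}  # keyed by number of sampling steps

# MDM with 128 steps (score 1.56) versus the ARM (score 1.57).
speedup_at_128 = arm_time_s / mdm_time_s[128]

# MDM with 256 steps (score 1.60) versus the ARM.
slowdown_at_256 = mdm_time_s[256] / arm_time_s

print(f"speedup at 128 steps: {speedup_at_128:.2f}x")
print(f"slowdown at 256 steps: {slowdown_at_256:.2f}x")
```

Both ratios round to 1.4, so the same factor separates the ARM from the faster, roughly equal-quality MDM setting and from the slower, higher-quality one.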
7 Challenging Tasks for ARMs
We demonstrate that MDMs exhibit distinct advantages over ARMs in tackling two critical challenges: the reverse curse (Berglund et al., 2023) and temporal quality degradation (Vela et al., 2022).
7.1 Breaking the Reverse Curse
Berglund et al. (2023) introduced the concept of the reverse curse, which refers to the difficulty of ARMs in generalizing bidirectional relationships. Specifically, this occurs when a model is trained on information in the form “A is B” but fails to infer the reverse relationship “B is A.” For example, a model trained on the fact “Valentina Tereshkova was the first woman to travel to space” may not correctly answer the reverse question “Who was the first woman to travel to space?” This limitation raises concerns about whether large language models genuinely possess logical reasoning capabilities (Berglund et al., 2023).
Setup. We evaluate MDMs on the same reverse curse dataset used by Berglund et al. (2023), which consists of fictitious statements in the format “⟨name⟩ is ⟨description⟩” and their reversals. We fine-tune MDMs on these statements and assess their performance using questions not seen during training. Following the same protocol as Berglund et al. (2023), we generate responses via greedy sampling and report the exact match accuracy. Additionally, we use the BLEU metric (Papineni et al., 2002) to evaluate the quality of name-to-description generation, as suggested by Lv et al. (2023).
Results. As shown in Table 6, advanced ARMs, including GPT-3 (Brown, 2020) and Llama-2 (Touvron et al., 2023b), achieve zero accuracy and low BLEU scores when prompted with reverse queries. In contrast, MDMs achieve substantially higher scores across both metrics, despite using significantly fewer parameters, a smaller pre-training dataset, and less computation. Specifically, our MDM uses only about 10% of the parameters, 10% of the pre-training data, and 1% of the computation of Llama-2. Besides, MDMs perform similarly to ARMs on queries in the same direction. These results indicate the power of MDMs in capturing bidirectional relationships and logical structures.
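Table 6: Results on the reverse curse dataset. D2N = DescriptionToName, N2D = NameToDescription.

| | D2N, same direction (Acc.) | D2N, reverse direction (Acc.) | N2D, same direction (Acc.) | N2D, same direction (BLEU) | N2D, reverse direction (Acc.) | N2D, reverse direction (BLEU) |
|---|---|---|---|---|---|---|
| GPT3 (175B) | 97 | 0 | 50 | - | 0 | - |
| Llama-2 (13B) | 99 | 0 | - | 74 | - | 19 |
| MDM (1.1B) | 97 | 92 | 49 | 76 | 37 | 67 |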
7.2 Relieving the Temporal Quality Degradation
Vela et al. (2022) highlight a common and challenging issue for modern AI models, including language models: model performance is sensitive to the temporal alignment between the training and test data, particularly when new data fall outside the temporal scope of the training set.
Setup. To evaluate the impact of temporal shifts, we train both ARMs and MDMs on the SlimPajama dataset (Soboleva et al., 2023) (see Sec. 3), released in 2023, and test them on the FineWeb dataset (Penedo et al., 2024), which contains samples from February to March and from April of 2024. We extract the first 0.5 billion tokens from each period for evaluation. We use models of equal size (220M parameters) that achieve similar validation losses on SlimPajama. However, it is worth noting that MDMs require 16 times more computation to reach this performance level.
Results. As shown in Table 7, although the MDM achieves slightly higher perplexity on the standard validation set (i.e., SlimPajama), it outperforms the ARM on the newer 2024 data. While the exact mechanism remains unclear, we hypothesize that this advantage arises from MDMs’ ability to simultaneously model all conditional distributions, making them less sensitive to distributional shifts compared to the unidirectional dependencies in ARMs. These results indicate that MDMs are inherently more robust to temporal shifts, making them better suited for evolving data distributions.
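Table 7: Perplexity on the SlimPajama validation set and on newer FineWeb data.

| | SlimPajama (before Jun. 2023) | Fineweb (Feb. to Mar. 2024) | Fineweb (Apr. 2024) |
|---|---|---|---|
| ARM | 17.34 | 27.01 | 26.93 |
| MDM | 18.02 | 24.06 | 24.01 |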
8 Conclusion
In this paper, we demonstrate the strong scalability of MDMs through a comprehensive scaling analysis. Our results show that MDMs can achieve performance comparable or even superior to that of ARMs in key tasks, such as conditional language generation and language understanding, supported by the scaling law and the unsupervised classifier-free guidance. Furthermore, MDMs effectively address major limitations of ARMs, including breaking the reverse curse and relieving temporal quality degradation, even outperforming much larger models like Llama-2 and GPT-3 in these aspects. These findings highlight MDMs as a promising alternative to ARMs for language modeling at scale.
One of the most important future directions is to scale MDMs to larger sizes, potentially matching advanced ARMs (Achiam et al., 2023; Dubey et al., 2024). This would allow for a thorough investigation into the emergent behaviors (Wei et al., 2022a) and long-range reasoning capabilities (Wei et al., 2022b) of MDMs. By scaling up, we hope that MDMs can fully demonstrate their unique advantages over ARMs in real-world scenarios, offering a competitive alternative. Further, we believe the studies can deepen our understanding of large language models and the role of key factors such as autoregressive formulation in achieving such intelligence.
We also note another line of research focusing on continuous diffusion language models (Li et al., 2022; Gong et al., 2022; Han et al., 2022; Strudel et al., 2022; Chen et al., 2022; Dieleman et al., 2022; Mahabadi et al., 2023; Graves et al., 2023; Xue et al., 2024; Gulrajani & Hashimoto, 2024). However, the experiments in this domain are relatively small in scale and lack evaluation on standard language benchmarks. We hypothesize that MDMs enjoy better scalability than these models due to their alignment with the inherent structure of language and ARMs.
9 Acknowledgments
We thank Jingyang Ou for the insightful discussions on RADD. We also thank Ang Lv for valuable conversations about the reverse curse and Wenkai Yang for discussions on the supervised fine-tuning of ARMs. Additionally, we appreciate Siqi Kou for providing guidance on data processing and evaluation for the conditional generation experiments.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
- Berglund et al. (2023) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on “a is b” fail to learn “b is a”. arXiv preprint arXiv:2309.12288, 2023.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.
- Brown (2020) Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Campbell et al. (2022) Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022.
- Chang et al. (2022) Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325, 2022.
- Chang et al. (2023) Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
- Chen et al. (2022) Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Devlin (2018) Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Dieleman et al. (2022) Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
- Du et al. (2024) Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gao et al. (2024) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.
- Gat et al. (2024) Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. arXiv preprint arXiv:2407.15595, 2024.
- Gong et al. (2022) Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
- Graves et al. (2023) Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, and Faustino Gomez. Bayesian flow networks. arXiv preprint arXiv:2308.07037, 2023.
- Gulrajani & Hashimoto (2024) Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36, 2024.
- Han et al. (2022) Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432, 2022.
- He et al. (2022) Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. arXiv preprint arXiv:2211.15029, 2022.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Hoogeboom et al. (2021a) Emiel Hoogeboom, Alexey A Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021a.
- Hoogeboom et al. (2021b) Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021b.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Kou et al. (2024) Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. Cllms: Consistency large language models. arXiv preprint arXiv:2403.00835, 2024.
- Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
- Li et al. (2022) Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
- Loshchilov (2017) I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Lou et al. (2023) Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
- Lovelace et al. (2024) Justin Lovelace, Varsha Kishore, Yiwei Chen, and Kilian Q Weinberger. Diffusion guided language modeling. arXiv preprint arXiv:2408.04220, 2024.
- Lv et al. (2023) Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, and Rui Yan. Are we falling in a middle-intelligence trap? an analysis and mitigation of the reversal curse. arXiv preprint arXiv:2311.07468, 2023.
- Mahabadi et al. (2023) Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. arXiv preprint arXiv:2305.08379, 2023.
- Meng et al. (2022) Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532–34545, 2022.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- OpenAI (2022) OpenAI. ChatGPT: Optimizing Language Models for Dialogue. OpenAI blog, November 2022. URL https://openai.com/blog/chatgpt/.
- Ou et al. (2024) Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
- Paperno et al. (2016) Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
- Penedo et al. (2024) Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
- Radford (2018) Alec Radford. Improving language understanding by generative pre-training, 2018.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Sahoo et al. (2024) Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463–4473, 2019.
- Shazeer (2020) Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- Shi et al. (2024) Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024.
- Soboleva et al. (2023) Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 06 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015.
- Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Strudel et al. (2022) Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.
- Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- Sun et al. (2022) Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Vaswani (2017) Ashish Vaswani. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
- Vela et al. (2022) Daniel Vela, Andrew Sharp, Richard Zhang, Trang Nguyen, An Hoang, and Oleg S Pianykh. Temporal quality degradation in ai models. Scientific Reports, 12(1):11654, 2022.
- Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
- Xue et al. (2024) Kaiwen Xue, Yuhao Zhou, Shen Nie, Xu Min, Xiaolu Zhang, Jun Zhou, and Chongxuan Li. Unifying bayesian flow networks and diffusion models through stochastic differential equations. arXiv preprint arXiv:2404.15766, 2024.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Zhang & Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
- Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
- Zheng et al. (2024) Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
Appendix A Greedy Sampling Method of MDMs
We employ the sampling method of MaskGIT (Chang et al., 2022) as the greedy sampling strategy for MDMs. For completeness, we include the algorithm in Alg. 1 and provide the following intuitive explanation.
Let us first revisit the original sampling method for MDMs as described in Eq. (2). During each sampling step from time to , if it remains unchanged. Otherwise, it retains the masked state with a probability of , or transitions to with a probability of . It is important to note that for all masked tokens , they transition to corresponding with the same probability of .
Different from the original sampling method, MaskGIT (Chang et al., 2022) does not transition all masked tokens to their corresponding with the same probability of . Instead, it specifically selects masked tokens that exhibit the highest conditional probability for transition to .
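The confidence-based selection described above can be sketched as follows. This is a minimal illustration rather than the paper's Alg. 1: the mask id, the use of the argmax token as the prediction, and the fixed number of positions unmasked per step (`num_to_unmask`) are all assumptions made for the example.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token


def greedy_unmask_step(tokens, probs, num_to_unmask):
    """Unmask the masked positions whose predictions are most confident.

    tokens: (L,) int array in which MASK_ID marks masked positions.
    probs:  (L, V) array with the model's predicted distribution per position.
    """
    tokens = tokens.copy()
    masked = np.flatnonzero(tokens == MASK_ID)
    if masked.size == 0:
        return tokens
    preds = probs[masked].argmax(axis=-1)       # most likely token per masked position
    conf = probs[masked].max(axis=-1)           # probability of that token
    k = min(num_to_unmask, masked.size)
    top = np.argsort(-conf, kind="stable")[:k]  # indices of the k most confident positions
    tokens[masked[top]] = preds[top]            # only these positions leave the masked state
    return tokens


tokens = np.array([5, MASK_ID, MASK_ID, MASK_ID])
probs = np.array([
    [1.00, 0.00, 0.00],
    [0.20, 0.70, 0.10],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
])
step1 = greedy_unmask_step(tokens, probs, num_to_unmask=1)
step2 = greedy_unmask_step(tokens, probs, num_to_unmask=2)
```

In a full sampler the model would be re-evaluated after every step, so that the remaining masked positions are predicted with the newly revealed tokens as context.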
Appendix B Experimental details
B.1 Reproducibility Statement
We implement our experiments based on the TinyLlama (Zhang et al., 2024) codebase. We use the code provided by TinyLlama to preprocess the SlimPajama (Soboleva et al., 2023) dataset. Additionally, we use the code provided by CLLM (Kou et al., 2024) to preprocess the ShareGPT dataset. We employ the fictitious dataset provided by Berglund et al. (2023) and the Fineweb dataset (Penedo et al., 2024) for the reverse curse and temporal quality degradation experiments, respectively. Because these two datasets are simple, we preprocess them ourselves. We employ the lm-eval (Gao et al., 2024) and fast-chat (Zheng et al., 2023) frameworks to evaluate language understanding tasks and conditional generation, respectively. In Sec. 5, the pre-trained GPT-2 model is provided by Hugging Face. The corresponding links are detailed in Tab. 8.

| | Link |
|---|---|
| GPT-2 model | https://huggingface.co/openai-community/gpt2-xl |
| TinyLlama codebase | https://github.com/jzhang38/TinyLlama |
| CLLM codebase | https://github.com/hao-ai-lab/Consistency_LLM |
| SlimPajama dataset | https://huggingface.co/datasets/cerebras/SlimPajama-627B |
| ShareGPT dataset | https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered |
| Reverse curse dataset | https://huggingface.co/datasets/lberglund/reversal_curse |
| Fineweb dataset | https://huggingface.co/datasets/HuggingFaceFW/fineweb |
| Lm-eval framework | https://github.com/EleutherAI/lm-evaluation-harness |
| Fast-chat framework | https://github.com/lm-sys/FastChat |
B.2 Additional Experimental Details of IsoFLOP Analysis
Training details. We use identical optimizer settings for both MDMs and ARMs during pre-training. Consistent with TinyLlama (Zhang et al., 2024), we utilize the AdamW optimizer (Loshchilov, 2017), setting , , and a weight decay of . Additionally, we apply a cosine learning rate schedule with a maximum learning rate of and a minimum learning rate of with of the tokens for linear warmup. Notably, if the number of warmup steps is less than , it is set to . The batch size is set to .
Specifically, we pre-train a 1.1B MDM with training FLOPs for the downstream tasks. We use the above pre-training setting for this 1.1B model except for the batch size. Because we use 24 GPUs to pre-train this model, we set the batch size to .
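The learning rate schedule described in the training details (linear warmup over a fraction of training, a lower bound on the number of warmup steps, then cosine decay to a minimum rate) can be sketched as below. The function names and every numeric value in the usage lines are hypothetical placeholders and are not the settings used in the experiments.

```python
import math


def warmup_steps_for(total_steps, warmup_frac, min_warmup_steps):
    """Warmup length as a fraction of training, never below a fixed floor."""
    return max(min_warmup_steps, int(warmup_frac * total_steps))


def lr_at(step, total_steps, warmup_steps, max_lr, min_lr):
    """Linear warmup to max_lr, followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, chosen only to exercise the functions.
total = 1000
warmup = warmup_steps_for(total, warmup_frac=0.25, min_warmup_steps=100)
peak = lr_at(warmup - 1, total, warmup, max_lr=1e-3, min_lr=1e-4)
final = lr_at(total, total, warmup, max_lr=1e-3, min_lr=1e-4)
```

The floor on the warmup length matters mainly for the smallest compute budgets, where a fixed fraction of the total steps would otherwise give a very short warmup.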
Evaluation details. For MDMs, we found that using more Monte Carlo estimation samples (i.e., ) when computing the validation loss effectively reduces the number of outliers in Fig. 1(b). This is because increasing the number of Monte Carlo samples reduces the variance of the estimation, leading to a more precise estimation of the validation loss.
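The variance argument above can be illustrated with a toy experiment. The snippet does not evaluate an MDM; `toy_loss` is a hypothetical stand-in that returns a fixed true value plus Gaussian noise, playing the role of one random draw of the loss estimator.

```python
import numpy as np


def mc_estimate(per_sample_loss, n_samples, rng):
    """Average n_samples independent noisy evaluations of the loss."""
    return np.mean([per_sample_loss(rng) for _ in range(n_samples)])


def toy_loss(rng):
    # Hypothetical stand-in for a single Monte Carlo draw: true value 3.0 plus noise.
    return 3.0 + rng.normal(scale=1.0)


rng = np.random.default_rng(0)
# Spread of the estimate over 500 repetitions, for different sample counts.
spread = {
    n: float(np.std([mc_estimate(toy_loss, n, rng) for _ in range(500)]))
    for n in (1, 16, 128)
}
```

The spread shrinks roughly as one over the square root of the number of samples, which is why a larger sample count yields fewer outlying validation-loss estimates.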
Model configs. We list all model configurations in Tab. 9.
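| Parameters (M) | n_layers | n_heads | n_embed | intermediate_size |
|---|---|---|---|---|
| 19 | 8 | 6 | 384 | 1536 |
| 34 | 8 | 8 | 512 | 2048 |
| 48 | 9 | 9 | 576 | 2304 |
| 66 | 10 | 10 | 640 | 2560 |
| 75 | 16 | 8 | 640 | 1600 |
| 85 | 13 | 10 | 640 | 2560 |
| 113 | 12 | 12 | 768 | 3072 |
| 142 | 15 | 12 | 768 | 3072 |
| 170 | 18 | 12 | 768 | 3072 |
| 180 | 14 | 14 | 896 | 3584 |
| 206 | 16 | 14 | 896 | 3584 |
| 231 | 18 | 14 | 896 | 3584 |
| 268 | 16 | 16 | 1024 | 4096 |
| 302 | 18 | 16 | 1024 | 4096 |
| 336 | 20 | 16 | 1024 | 4096 |
| 472 | 18 | 10 | 1280 | 5120 |
| 551 | 21 | 10 | 1280 | 5120 |
| 571 | 18 | 11 | 1408 | 5632 |
| 629 | 24 | 10 | 1280 | 5120 |
| 666 | 21 | 11 | 1408 | 5632 |
| 717 | 19 | 12 | 1536 | 6144 |
| 761 | 24 | 11 | 1408 | 5632 |
| 831 | 22 | 12 | 1536 | 6144 |
| 944 | 25 | 12 | 1536 | 6144 |
| 1028 | 20 | 14 | 1792 | 7168 |
| 1233 | 24 | 14 | 1792 | 7168 |
| 1476 | 22 | 16 | 2048 | 8192 |
| 1678 | 25 | 16 | 2048 | 8192 |
| 2121 | 28 | 17 | 2176 | 8704 |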
B.3 Additional Experiment Details of Language Understanding
Here, we present more details about the 1.1B model we introduce in Sec. 5. Firstly, we pre-train a 1.1B MDM for FLOPs as detailed in Appendix B.2. Due to limited computational resources, we do not retrain this 1.1B parameter model from scratch with random data length. Instead, we allocated a compute budget of FLOPs for variable length fine-tuning on the SlimPajama dataset. As the proportion of random length data is set to when training from scratch, we empirically increase it to during variable length fine-tuning, considering the limited fine-tuning FLOPs.
B.4 Additional Experimental Details of Conditional Generation
Setup. We use identical optimizer settings for both MDMs and ARMs during supervised fine-tuning. Similar to our pretraining process, we use the AdamW optimizer (Loshchilov, 2017) with hyperparameters , , and a weight decay of . We employ a cosine learning rate schedule starting from a maximum learning rate of and decaying to a minimum of . Additionally, we apply linear warm-up over the first steps and set the batch size to .
For the preprocessing of the ShareGPT dataset, we use the same method as described in Kou et al. (2024). In addition, in line with Kou et al. (2024), we fine-tune both ARMs and MDMs on the first-turn conversation from the ShareGPT dataset and report the first-turn conversation score. We do not use any annealing sampling method for ARMs and MDMs during generation. The MT-Bench score is obtained via the “gpt-4o-2024-05-13” API provided by OpenAI.
Different CFG strategies. We provide an overview of no CFG, standard CFG, and unsupervised CFG in Tab. 10.
During fine-tuning on labeled data, the standard CFG (Ho & Salimans, 2022) replaces the label with a special token with a probability of . This special token represents the unconditional distribution, thereby enabling the simultaneous training of both conditional and unconditional distributions. Specifically, for the implementation of standard CFG in our experiment, we randomly replace the prompt with the masked tokens with probability .
In contrast to the standard CFG, unsupervised CFG already leverages large-scale pre-training data to obtain a strong unconditional model. Therefore, we only enhance its conditional distribution during fine-tuning on paired data.
During inference, both standard CFG and unsupervised CFG employ the rescaled conditional distribution defined in Eq. (7).
| | Training: conditional | Training: unconditional | Sampling |
|---|---|---|---|
| No-CFG | ✓ | ✗ | w/o CFG |
| Standard CFG | ✓ | ✓ | w/ CFG (i.e., Eq. (7)) |
| Unsupervised CFG | ✓ | ✗ | w/ CFG (i.e., Eq. (7)) |
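The guidance step shared by standard and unsupervised CFG can be sketched as follows. The sketch assumes that the rescaled distribution in Eq. (7) takes the usual classifier-free guidance form, in which the conditional log-probabilities are weighted by 1 + w and the unconditional ones by w; the function names and the guidance scale `w` are illustrative, and the unconditional input is assumed to be the prompt with every token replaced by the mask token.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def cfg_log_probs(cond_logits, uncond_logits, w):
    """Guided log-probabilities: (1 + w) * log p(x | c) - w * log p(x), renormalized.

    Setting w = 0 recovers the conditional model (the no-CFG case).
    """
    mixed = (1.0 + w) * log_softmax(cond_logits) - w * log_softmax(uncond_logits)
    return log_softmax(mixed)


def mask_prompt_for_uncond(prompt_ids, mask_id):
    """Build the unconditional input by replacing every prompt token with the mask token."""
    return np.full_like(prompt_ids, mask_id)


# Toy logits for one position over a two-token vocabulary (hypothetical values).
cond = np.array([[2.0, 0.0]])
uncond = np.array([[0.0, 0.0]])
p_plain = np.exp(cfg_log_probs(cond, uncond, w=0.0))
p_guided = np.exp(cfg_log_probs(cond, uncond, w=1.0))
```

Under this form, tokens that the prompt makes more likely than the unconditional model does are boosted further as w grows. The two CFG variants differ only in how the unconditional branch is trained, not in this inference-time combination.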
B.5 Additional Experimental Details of Reverse Curse
We use the same optimizer settings as in Appendix B.4, except for the batch size, when fine-tuning on the fictitious dataset provided by Berglund et al. (2023). As the fictitious dataset is smaller (i.e., only 3,600 samples), we use a batch size of 32 for fine-tuning. We train for 10 epochs following Berglund et al. (2023). We also pad each sample with the token to the maximum sequence length within a batch as detailed in Sec. 6. Following the same approach as Berglund et al. (2023), we do not mask the loss on prompts, adding noise to the prompt and response simultaneously as in Eq. (4).
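Padding every sample to the longest sequence in its batch can be sketched as below. The helper name and the `pad_id` value are hypothetical; in practice the id would be that of whichever token is used for padding.

```python
def pad_batch(sequences, pad_id):
    """Right-pad each token sequence to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical token ids; 0 plays the role of the padding token here.
batch = pad_batch([[11, 12, 13], [21]], pad_id=0)
```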
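| | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA | LAMBADA |
|---|---|---|---|---|---|---|---|---|
| Original | 30.13 | 55.29 | 29.16 | 26.20 | 56.04 | 28.52 | 35.21 | 16.51 |
| Padding | 38.38 | 59.91 | 31.63 | 27.60 | 60.77 | 28.42 | 37.00 | 31.03 |
| 1% | 37.79 | 61.50 | 31.86 | 27.00 | 60.34 | 29.19 | 36.85 | 36.00 |
| 5% | 37.12 | 51.87 | 32.29 | 26.60 | 58.98 | 29.18 | 36.85 | 32.04 |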
Appendix C Additional Results
C.1 Additional Results of Language Understanding
Results of fixing the training-test discrepancy. For efficiency, we use an MDM with 220M parameters, pre-trained for FLOPs, in this experiment. Tab. 11 presents the ablation studies of variable length training and padding mask tokens, demonstrating that both methods significantly improve the performance of MDMs.
Results of different likelihood evaluation methods. For efficiency, we employ MDM with 220M parameters, pre-trained for FLOPs, and set training data to random length. Tab. 12 presents the ablation studies of different likelihood evaluation methods.
| | ARC-Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | RACE | SIQA |
|---|---|---|---|---|---|---|---|
| Monte Carlo | 39.02 | 62.17 | 34.10 | 30.40 | 59.14 | 30.81 | 37.41 |
| Chain rule | 37.88 | 62.17 | 32.20 | 34.20 | 60.39 | 29.67 | 37.10 |
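For intuition, a chain-rule likelihood evaluation with a mask predictor can be sketched as follows. The sketch assumes a left-to-right factorization, which is only one of the orders a masked model could use, and `log_prob_fn` is a hypothetical interface that maps a partially masked sequence to per-position log-probabilities; neither is taken from the paper's implementation.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token


def chain_rule_log_likelihood(tokens, log_prob_fn):
    """Sum of log p(x_i | x_<i), with positions i and later masked at step i.

    log_prob_fn takes an (L,) int array and returns an (L, V) array of log-probs.
    """
    tokens = np.asarray(tokens)
    x = np.full_like(tokens, MASK_ID)   # start from a fully masked sequence
    total = 0.0
    for i, tok in enumerate(tokens):
        log_probs = log_prob_fn(x)      # one forward pass per position
        total += float(log_probs[i, tok])
        x[i] = tok                      # reveal token i before scoring token i + 1
    return total


def uniform_model(x, vocab_size=4):
    # Toy stand-in for a trained model: uniform prediction at every position.
    return np.full((len(x), vocab_size), np.log(1.0 / vocab_size))


ll = chain_rule_log_likelihood([0, 3, 1], uniform_model)
```

This evaluation is deterministic but needs one forward pass per token, whereas the Monte Carlo estimate trades a fixed number of forward passes for estimation noise.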
Scaling behavior of MDMs on language understanding tasks. As shown in Fig. 3, the performance of MDMs on the language understanding tasks shows a scaling behavior with respect to the validation loss, which is consistent with observations in ARMs (Du et al., 2024). For efficiency and simplicity, methods for fixing train-test discrepancies and unsupervised CFG are not applied in this analysis.
C.2 Additional Results of Conditional Generation
More MT-Bench results of MDM. In Sec. 6, we report the MT-Bench results of ARM and MDM with and pre-training FLOPs, respectively. Here, we present the MT-Bench result of MDM with pre-training FLOPs in Tab. 13.
Score 1.21 1.22 1.23
C.3 Additional Results of Reverse Curse
Tab. 14 shows the effectiveness of the unsupervised CFG on the reverse curse.
Appendix D Evaluation Metrics
In this section, we provide an overview of the benchmarks used in Sec. 5 and show some cases from these benchmarks in Tab. 15.
ARC-Easy. A subset of the AI2 Reasoning Challenge that focuses on elementary-level science questions to evaluate the model’s reasoning ability through basic scientific concepts.
BoolQ. A yes-or-no question-answering dataset designed to evaluate the model’s ability to answer questions based on a given passage.
HellaSwag. A benchmark that assesses the model’s commonsense reasoning ability by asking it to complete a given sentence with one of four options.
OpenBookQA. A question-answering dataset, modeled after open-book exams, designed to assess a model’s understanding of a subject by requiring multi-step reasoning and the integration of additional commonsense knowledge.
PIQA. Physical Interaction Question Answering is a metric that evaluates physical reasoning ability by asking models to select the best solution to a given problem involving everyday physical scenarios.
SIQA. Social Interaction Question Answering is a benchmark for commonsense reasoning and is established by presenting scenarios that require reasoning about social interactions and the motivations behind human behavior.
RACE. ReAding Comprehension Dataset From Examinations was designed to evaluate reading comprehension ability by understanding and interpreting text at a high school level.
LAMBADA. A dataset to evaluate models’ capabilities in text understanding through a final single-word prediction task based on a given context.
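| | DescriptionToName: same direction (Acc.) | DescriptionToName: reverse direction (Acc.) | NameToDescription: same direction (Acc.) | NameToDescription: same direction (BLEU) | NameToDescription: reverse direction (Acc.) | NameToDescription: reverse direction (BLEU) |
|---|---|---|---|---|---|---|
| w/o CFG | 95 | 85 | 52 | 80 | 28 | 60 |
| w/ CFG | 97 | 92 | 49 | 76 | 37 | 67 |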
Metric
Question
Choices
ARC-Easy
Which of the following was probably most important in the formation of dark, fertile soil that is good for farming?
A. plant decomposition
B. radioactive decay
C. water erosion
D. wind erosion
BoolQ
was the leaning tower of pisa built leaning
Yes
No
HellaSwag
A camera pans around a set of stairs and leads into people working out in a class. Several shots are shown of people working out together while a man speaks to the camera. the man
A. continues speaking while more people are shown working out together.
B. is seen crashing into a wall several more times while people watch on the side.
C. then leads the group on a liquid workout together.
D. continues speaking to the camera while more shots are shown of them lifting weights and/or speaking to the camera.
OpenBookQA
A man plugs his television into an outlet behind a cabinet. He sees that the television may now be turned on so that he can watch his favorite show. The man knows that by hooking the t.v. cord into the outlet
A. he completed a lap
B. he made a good deal
C. he invented new circuits
D. he completed a circuit
PIQA
When boiling butter, when it’s ready, you can
A. Pour it onto a plate
B. Pour it into a jar
SIQA
Taylor took the poor dog she found on the road to the vet. What will the vet want to do next?
A. pronounce the dog dead
B. wraps the dog in a towel
C. check to make sure the dogs okay
RACE
The article is too long and has been disregarded.
According to the passage, which of the following statements is TRUE?
A. There is more petroleum than we can use now.
B. Trees are needed for some other things besides making gas.
C. We got electricity from ocean tides in the old days.
D. Gas wasn’t used to run cars in the Second World War.
LAMBADA
“Again, he left that up to you. However, he was adamant in his desire that it remain a private ceremony. He asked me to make sure, for instance, that no information be given to the newspaper regarding his death, not even an obituary. I got the sense that he didn’t want anyone, aside from the three of us, to know that he’d even _.
Not applicable: the LAMBADA metric does not involve choices; instead, it evaluates the model by predicting the final word of a given question. For this question, the answer is “died”.
Prompt:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: Help me construct a catchy, yet scientifically accurate, headline for an article on the latest discovery in renewable bio-energy, while carefully handling the ethical dilemmas surrounding bio-energy sources. Propose 4 options. ASSISTANT:
Answers:
*Option 1: ” zum Uncovering the Power of Renewable Bio-energy: The Future We Have and Our Right to Utilize”
This article should handle the ethical dilemmas surrounding bio-energy sources, including addressing the need for sustainable alternatives, the choices made by developing renewable bio-energy sources, and the necessary balance the right of honest and technology.
*Option 2: ”Exploring the latest Sustainable Energy for All: Majestic Renewable Bio-Earth Reports and Stories”/”Comprehensive reports and insights on the latest research in renewable bio-energy sources. This may have a focus on renewable energy but should also be sustainable.
*Option 3: ”Unleashing Record Renewable Energy Production: Exploring Beyond Coal Stream Options”
Record renewable energy production capacity and conditions for development. This article focuses on the future of renewable energy sources, while addressing the challenges involved with sustainatory alternatives.
*Option 4: ”wissenschaft Handling the Environmental Impact of Renewable Energy: Harnessing the opportun for Renewable Energy Sources”
One of the ethical dilemmas surrounding bio-energy sources are economic and social dilemmas, such as neglecting local, regional, or communities. This article seeks to discuss how to develop renewable energy sources that reduce the environmental impact and promote sustainability.
*Option 5: ”: Exploring the Future of Renewable Bio-energy: Navigating the Environmental Impacts and Source Options”
Discussing decision-making about the practices and consequences of developing renewable energy sources while ensuring a balance between ethical dilemmas and sustainability.
*Option 6: ”Exploring the Renewable Bio-Earth: Unlocking the Future of Renewable Energy Sources”
Prompt:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: What are some business etiquette norms when doing business in Japan? ASSISTANT:
Answers:
Business etiquette is influenced by Western culture, but Japan has many norms of respect when interacting. Here are some etiquette norms when doing business in Japan:
1. Show respect: Always greet with greets in a respectful manner in a Japanese business: Always greet with respectful and professional lines when working with Japanese service workers. Always greet the customers politely unless they have a reason.
3. Practice etiquette: Always make respectful statement and be polite: Always greet someone in a respectful manner as it is the cultural norm and always respond respectfully and in a polite manner.
4. Always work courteously: Always respect when hideholding in Japanese business: Always respect when dealing with Japanese service workers or officials. Always show respect when asking for something unless it is courteous and building.
5. Use Japanese apparel and respect Japanese culture when doing business in a Japanese company: Do not use Japanese or Western companies or meet Japanese officials unless you are allowed to use their language and respect in Japanese business. While it is common to use monosma garments, it is not in Japan
keeping dignity: Always respect and maintain respect: Always respect and maintain your dignity when dealing with Japanese service workers or especially when answering questions or making demonstrations of professional respect or respect
7. Request permission: Alwaysfshow respect when asking permission: Request permission respectfully, fairly and courteously to avoid a request. Always show respect when asking for permission unless it is haing.
9. Show respect: Always respect the customer with respect in Japan business: Always respect the customer with respect. These are a few examples of common business etiquette norms of respect when doing business in Japan. In general, business etiquette norms depend on the specific industry or branch of operation and may vary depending on your understanding and circumstances.
Prompt:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. ASSISTANT:
Answers:
An enchanted-eyed traveler recently had the pleasure of discover Hawaii, a tropical paradise in the heart of the Pacific. The trip offers a diverse and blend cultural experience, representing the culture and outdoor adventure that Hawaii calls home.
The first thrilling experience is I hadn’t seen a wave breaking before in these green waters. But I experienced the same thing and witnessed the cool water rushing past me and the surf noticing caught my attention while peeking at the tall towering waterfalls.
Another cultural experience I had in my trip was exploring the land and its beaches through snorkeling. It is not just snorkeling, it is discovering beach dunes and sunnaporkeling on the gorgeous sand of Hana Beach, which is more famous than its name. Whether you can surf good, hike, or monke, it was a day of exploration.
My trip to Hawaii was a diverse destination with many aspects to explore and not enough time to see everything. Each culture has its own unique culture, determining the must see attractions. If you can only take one trip, come here and explore the beauty.