Is Model Collapse Inevitable? Breaking the Curse of
Recursion by Accumulating Real and Synthetic Data

Matthias Gerstgrasser   , Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Dhruv Pai
Stanford University
{mgerst,rschaef,apd1995,rafailov,dhruvpai}@stanford.edu
\ANDHenry Sleight , John Hughes, Tomasz Korbak, Rajashree Agrawal
Constellation
\ANDAndrey Gromov
University of Maryland, College Park
\ANDDaniel A. Roberts
MIT & Sequoia Capital
\ANDDiyi Yang, David Donoho & Sanmi Koyejo
Stanford University
{diyiy,donoho,sanmi}@stanford.edu
Denotes equal authorship.Harvard & Stanford University.Denotes equal contribution.
Abstract

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation’s synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models’ outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs. Our work provides consistent empirical and theoretical evidence that data accumulation avoids model collapse.

1 Introduction

The advent of large-scale generative models such as GPT-4 (Achiam et al., 2023), DALL-E (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022) has revolutionized the field of artificial intelligence. These models, trained on vast web-scale datasets, exhibit remarkable capabilities in generating text, images, and other media (Brown et al., 2020; Saharia et al., 2022). However, as these models become more widely used, an increasing amount of generated data populates the web. This raises a critical question: what are the consequences of training generative models on datasets containing their own outputs?

Recent studies have investigated this question, revealing that training generative models on their own outputs can cause the performance of such models to progressively degrade with each model-fitting iteration, eventually rendering newer models useless (Hataya et al., 2023; Martínez et al., 2023a; Shumailov et al., 2023; Alemohammad et al., 2023; Martínez et al., 2023b; Bertrand et al., 2023; Briesch et al., 2023; Dohmatob et al., 2024a; b) (see Appendix A for review and discussion of prior work). This phenomenon was consequently labeled model collapse. Model collapse warns that democratizing access to generative models runs the risk of polluting the very data necessary to train future iterations of generative models.

Refer to caption
Figure 1: Two Settings to Study Model Collapse. Model collapse is a phenomenon where sequences of generative models trained on their own outputs progressively degrade until the latest model becomes useless. Left: Many prior works studied model collapse where data are replaced with each model-fitting iteration. Right: We study model collapse where data accumulate with each iteration and demonstrate accumulating data avoids model collapse.

To better understand this phenomenon many prior works have considered a setup that assumes each model’s generated data replaces previous data. In theory, this leads to very natural comparisons across generations as the total number of training points for each model remains fixed. In practice, subsequent generations of LLMs are often trained with increasing data over time – e.g., 1.4 trillion tokens for Llama 1 (Touvron et al., 2023a), 2 trillion for Llama 2 (Touvron et al., 2023b), 15 trillion for Llama 3 – in which presumably both human-generated and machine-generated data are accumulating in training sets collected from the internet. It was noted in some of those works Hataya et al. (2023); Martínez et al. (2023a); Alemohammad et al. (2023); Bertrand et al. (2023); Dohmatob et al. (2024b) that model collapse can be either slowed down or negated by mixing in clean data with the generated data.

To that end, in this work we study the effect of accumulating data on model collapse, rather than replacing data. Our data-accumulating setting is, in some sense, maximally pessimistic: it considers a hypothetical future where synthetic data are uncontrollably dumped on the internet to be vacuumed up for training the next iteration of generative models. Nevertheless, we find that model collapse is avoided when accumulating data.

We begin by studying model collapse experimentally with deep generative models trained on realistic data: transformers on causal language modeling (Sec. 2.1), diffusion models on molecular conformation (Sec. 2.2) and variational autoencoders on images (Sec. 2.3). After confirming that replacing data at every iteration indeed causes test error to increase with the number of iterations, we empirically find that accumulating synthetic data with real data avoids model collapse for all models and for all data modalities we test. To understand why replacing data and accumulating data have different consequences for model collapse, we turn to an analytically tractable framework of a sequence of linear models, each trained on synthetic outputs generated from the previous-iteration’s fitted linear model (Mobahi et al., 2020; Dohmatob et al., 2024a). Within this framework, Dohmatob et al. (2024a) demonstrated that if data are replaced with each model-fitting iteration, the test error increases linearly with the number of iterations n𝑛nitalic_n. We extend Dohmatob et al. (2024a)’s analysis to prove that if data instead accumulate, then the test error has a finite and (to us, surprisingly) well-controlled upper bound independent of the number of model-fitting iterations.111An approach ‘halfway’ between the ‘replace’ and ‘accumulate’ settings replaces the previous dataset with a pure synthetic dataset of size T×i𝑇𝑖T\times iitalic_T × italic_i at the i𝑖iitalic_i-th iteration; in other words, the comparison at each generation is now between models trained on the same number of training points. While this halfway setting has a milder sublinear behavior – explicitly, the test MSE scaling is MSEO(log(n))asymptotically-equals𝑀𝑆𝐸𝑂𝑛MSE\asymp O(\log(n))italic_M italic_S italic_E ≍ italic_O ( roman_log ( italic_n ) ) – the test error does still diverge with iterations. (See Appendix E for more details.) We thank Elvis Dohmatob, Yunzhen Feng, and Julia Kempe for communicating this observation. We also consider the halfway setting as an ablation in our language modeling experiments, as detailed in Appendix C.

Altogether, our work suggests that data accumulation may be robust to model collapse and emphasizes the importance of considering accumulating data and other real-world data dynamics in the analysis of model collapse in generative models trained on web-scale data.

2 Accumulating Data Avoids Model Collapse in Deep
Generative Models

We first investigate model collapse experimentally in several classes of generative models. Here, and for the remainder of this manuscript, the term model collapse refers to notably worsening error over increasing iterations of the model-data loop, while avoiding model collapse refers instead to bounded error over such iterations. To test the effect of accumulating data on model collapse, we compare accumulating data against replacing data. We use three diverse experimental setups of causal transformers, diffusion models, and variational autoencoders trained on real text, molecular conformation, and image datasets, respectively. We find that replacing data yields model collapse for all models and all datasets, whereas accumulating data avoids model collapse.

Refer to caption
Figure 2: Data Accumulation Avoids Model Collapse in Language Modeling. Sequences of causal transformer-based language models are pretrained on TinyStories (Eldan & Li, 2023). Cross-entropy validation loss increases when repeatedly replacing data (left), but not when accumulating data (right). Synthetic data was sampled with temperature =1.0absent1.0=1.0= 1.0.

2.1 Transformer-Based Causal Language Modeling

Experiments

We first train causal transformers (Vaswani et al., 2017) on text data. Specifically, we pretrain 9M parameter GPT-2 (Radford et al., 2019) and 12M, 42M and 125M parameter Llama2 (Touvron et al., 2023b) language models for a single epoch on TinyStories (Eldan & Li, 2023), a 470M token GPT-3.5/4-generated dataset of short stories at a kindergarten reading level. For each model-fitting iteration n2𝑛2n\geq 2italic_n ≥ 2, we sample a new dataset of the same size as TinyStories from the previous iteration’s language model and then either replace or concatenate the previous dataset with the newly generated dataset. In each model-fitting iteration, we then pretrain a newly initialized model on the replaced or concatenated dataset from the previous iteration. We experiment with sampling the new datasets using temperatures 0.30.30.30.3 or 1.01.01.01.0. We chose this combination of architectures, scales, dataset, and sampling because the setup necessitates pretraining multiple iterations of language models – a computationally costly endeavor – but we also wish to study realistic conditions where generative models are high-performing and generative diverse outputs. Because small language models (below 10M parameters) pretrained on TinyStories were shown to be able to generate coherent-albeit-simple English sentences (Eldan & Li, 2023), this choice of architectures, scales, dataset and temperature hopefully strikes a good balance between being representative, being diverse and being computationally feasible.

Refer to caption
Figure 3: Data Accumulation Avoids Model Collapse in Language Modeling. Learning curves for individual model-fitting iterations when repeatedly replacing data (top), and when accumulating data (bottom). Note: Epochs correspond to more gradient steps for accumulate than replace because the number of training data grows for accumulate.
Model Iteration Sample Generation
Llama2 (125M) 3 (A) In the end, the crab found a smooth shell. He took it to a safe place under a tree. The crab put the shell where he found it. Tim and his mom were tired, but they were happy. They had a fun day at the beach. And they lived happily ever after. The end.
3 (R) Henry asked his Mom why the golf sounded so special. His Mom explained that the line of lumber had something special that would help. She said that if you’re not sure, the lumber is special.
8 (R) Friend Stan and Millie laughed together and prepared to spend the morning together. Mamaing Grandma’s possibilitant, twice would measure how much she lovedk. Everyone started to get ready when they started arguing until their mum upset.
GPT2 (9M) 5 (A) Jack was so happy that he took care of the honey. He thought, ”I care about the beautiful garden, because it is nice and clean.” He started to feed the flower every day. The flower grew bigger and taller, and Jack became very happy.
5 (R) After playing, Lily got tired and quickly ran back to playing with her dolls. She opened her eyes and played with her dolls all day long. Her grandma was so happy that she screamed as she watched her look back at her original clothes and laughed.
10 (R) When she finished eating it, she tasted it all up. She said goodbye to her mom and said goodbye. Mommy smiled, feeling very proud of her. It was other. She knew that sharing is always easy to share her meal with her mom.
Table 1: Data Accumulation Avoids Model Collapse in Language Modeling. Both 125M-parameter Llama2 as well as 9M GPT-2 models show decreasing quality when replacing data (R), but maintain high-quality text generations when accumulating data (A).

Results

We found that for all architectures, parameter counts, and sampling temperatures, as the number of model-fitting iterations increased, replacing data led to an increase in test cross entropy (Fig. 2 top). We also found that for all architectures, parameter counts, and sampling temperatures, as the number of model-fitting iterations increased, accumulating data led to equal-or-lower test cross entropy (Fig. 2 bottom). Lower temperature (0.3) led to a faster increase in test error than higher temperature (1.0) (Appendix Fig. 13), but the trend was consistent for both temperatures. Table 1 shows samples of generated texts for GPT2 (9M) and Llama2 (125M) models at model-fitting iterations 3-5 when both accumulating and replacing data, as well as iterations 8-10 (replacing only).

Ablations

We ablate for several additional potential confounds beyond generation temperature. First, when accumulating data, subsequent model iterations are trained on larger datasets than when replacing data. To control for this, we also perform experiments in which data is replaced, but the size of the (fully synthetic) dataset is grown to match the training set size in the accumulation regime. We find that model performance still degrades (albeit at a lower rate). This is shown in Appendix C, Table 2, right-most column. Second, a possible concern could be that degrading performance when replacing data could be due to low model performance in iteration 1 (and thus the quality of the first synthetic dataset). We control for this by varying the amount of training performed in iteration 1 only and find that this has no significant impact. Lastly, we find that our results are also consistent across varying dataset sizes and training epochs. These ablations are discussed in Appendix  F.

2.2 Diffusion Models on Molecular Conformation Data

Refer to caption
Figure 4: Data Accumulation Avoids Model Collapse in Geometric Diffusion Modeling. GeoDiff, a diffusion-based molecular conformation generation model, is trained on a subset of Drugs data containing molecular structures found in drugs. Test loss degrades when replacing data (left) but not when accumulating data (right).

Experiments We next train sequences of diffusion models on molecular conformation data. Specifically, we train GeoDiff (Xu et al., 2022), a geometric diffusion model for molecular conformation generation, on the GEOM-Drugs (Axelrod & Gomez-Bombarelli, 2022) dataset. We down-sample the training split of GEOM-Drugs to 40,0004000040,00040 , 000 molecular conformations, which we use as our initial training set, and perform 50505050 diffusion steps for each prediction. For the loss, we use the standard loss used by GeoDiff: a weighted variational lower bound to the conditional likelihood; for more details, see Xu et al. (2022).

Results Over 8888 model-fitting iterations, we find test loss increases when replacing data, matching our language model experiments, and test loss remains relatively constant when accumulating data (Fig. 4). Unlike with language models, we found that when replacing data, performance worsens significantly in the first model-fitting iteration trained on synthetic data and does not degrade further substantially in subsequent iterations.

2.3 Variational Autoencoders on Image Data

Experiments

We lastly train sequences of variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) on CelebA (Liu et al., 2015), a dataset of 200k images of human faces split between train and test sets, chosen as a balance between being a realistic dataset with many samples, color images and resolution, and computational feasibility of training multiple iterations of models on accumulating data. The loss is the standard VAE loss: reconstruction error plus the KL divergence between the encoder’s output Gaussian and the isotropic Gaussian prior. See Appendix D for more experimental details.

Refer to caption
Figure 5: Data Accumulation Avoids Model Collapse in Variational Autoencoders for Image Generation. Sequences of variational autoencoders (VAEs) are trained on CelebA, a large-scale dataset of human faces. Test loss degrades when replacing data (left) but not when accumulating data (right).
Refer to caption
Refer to caption
Refer to caption
Figure 6: Sampled Images from Left: Replacing data with data generated by the previous iteration’s newly trained VAE yields lower quality and eventually leads to complete mode collapse. Middle: Accumulating data with data generated by the previous iteration’s newly trained VAE preserves the quality and diversity of generated data across iterations. Right: Baseline samples after 100 training epochs on the dataset.

Results

We find that replacing data at each iteration again exhibits model collapse: the test error rises swiftly with each additional iteration, and each iteration yields lower quality and less diverse generated faces until all model generations represent a single mode as shown in the left panel of Figure 6. In contrast, accumulating data at each iteration significantly slows model collapse: the test error increases significantly slower with each additional iteration. While the diversity of generations does go down as compared in the middle and right panel of Fig. 6, it still represents major axes of variation in the dataset, such as gender, but no longer seems to generate other details, along more minor axis of the data manifold, such as glasses and accessories. We discuss further analysis of VAE reconstructions in Appendix D.

Interestingly, unlike language modeling, the test error of accumulating data does increase with the number of iterations (albeit much more slowly than with replacing data). We also note that Martínez et al. (2023a) found slightly contradictory evidence, specifically that a different architecture on a much smaller dataset exhibits fast performance deterioration even with accumulating data. Understanding under what conditions and why these discrepancies exist is an interesting direction we leave for future research.

3 Accumulating Data Avoids Model Collapse in Linear Models

To gain mathematical understanding and intuition, we employ an analytical framework introduced in prior work (Mobahi et al., 2020; Dohmatob et al., 2024a) to understand the difference between data accumulation and data replacement. We will show that it predicts the same types of test error behaviors for these two data-use strategies that were measured empirically. The framework considers a sequence of linear models that are fit to the synthetic data sampled from the linear generative model model based on the previously fit linear models. Within this framework, Dohmatob et al. (2024a) showed that if data are replaced across model-fitting iterations, then the test squared error increases linearly222To echo an earlier footnote, an approach ‘halfway’ between the ‘replace’ and ‘accumulate’ approaches would replace the previous dataset with a pure synthetic dataset of size iT𝑖𝑇iTitalic_i italic_T at the i𝑖iitalic_i-th iteration. Analyzing this goes mostly in parallel, except the 1/i21superscript𝑖21/i^{2}1 / italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT mentioned in running text now becomes 1/i1𝑖1/i1 / italic_i for the ‘halfway’ approach. Consequently, the MSE scaling becomes MSEO(log(n))asymptotically-equals𝑀𝑆𝐸𝑂𝑛MSE\asymp O(\log(n))italic_M italic_S italic_E ≍ italic_O ( roman_log ( italic_n ) ); the ‘halfway’ approach with pure synthetic data but more of it, again has test error growing unboundedly with iterations. Thanks to Elvis Dohmatob, Yunzhen Feng and Julia Kempe for communicating this observation. See Appendix E for an extended discussion. with the number of iterations n𝑛nitalic_n. Here, we extend Dohmatob et al. (2024a)’s argument to show that if data instead accumulate across model-fitting iterations, then the test squared error is upper bounded by a relatively small constant, meaning model collapse is avoided333In this theoretical section, we identify the term model collapse with the situation where test error diverges to infinity (at any rate) as iterations progress. Other authors may employ similar terminology while identifying it with different properties of test error. For example, Alemohammad et al. (2023) use the term MAD to refer to the situation where the distance between the distribution of the original data and that of the subsequent generative models grow farther apart, without necessarily diverging..

The content in this section relies heavily on the framework and pioneering contributions of Dohmatob et al. (2024a). Our contribution is to study a different way to use synthetic data in training, namely accumulate, which seems to better align with certain real-world considerations. We show that our empirical results could have been anticipated on theoretical grounds, by applying the same analysis framework as in Dohmatob et al. (2024a), but instead to this specific training dataset pattern. We use the same framework to analyze some other ways that synthetic data might have be used, such as replace, again the theory aligns with many empirical results.

3.1 Notation and Preliminaries

Original Data Distribution.

We adapt notations from Dohmatob et al. (2024a). Define the distribution PΣ,w,σ2subscript𝑃Σ𝑤superscript𝜎2P_{\Sigma,w,\sigma^{2}}italic_P start_POSTSUBSCRIPT roman_Σ , italic_w , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT on d×superscript𝑑\mathbb{R}^{d}\times\mathbb{R}blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × blackboard_R given by (x,y)PΣ,w,σ2iffsimilar-to𝑥𝑦subscript𝑃Σ𝑤superscript𝜎2iff(x,y)\sim P_{\Sigma,w,\sigma^{2}}\quad\text{iff}\quad( italic_x , italic_y ) ∼ italic_P start_POSTSUBSCRIPT roman_Σ , italic_w , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT iff:

(Input) x𝒩(0,Σ),similar-to𝑥𝒩0Σ\displaystyle x\sim\mathcal{N}(0,\Sigma),italic_x ∼ caligraphic_N ( 0 , roman_Σ ) ,
(Noise) ϵ𝒩(0,σ2), independent of x,similar-toitalic-ϵ𝒩0superscript𝜎2 independent of 𝑥\displaystyle\epsilon\sim\mathcal{N}(0,\sigma^{2}),\text{ independent of }x,italic_ϵ ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , independent of italic_x ,
(Label) y=xw+ϵ.𝑦𝑥superscript𝑤italic-ϵ\displaystyle y=x\cdot w^{*}+\epsilon.italic_y = italic_x ⋅ italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_ϵ .

The positive integer d𝑑ditalic_d is the input-dimension, the matrix Σd×dΣsuperscript𝑑𝑑\Sigma\in\mathbb{R}^{d\times d}roman_Σ ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT is the true covariance structure of the input x𝑥xitalic_x, the vector wsuperscript𝑤w^{*}italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the true linear relationship used to generate the original data and the scalar σ𝜎\sigmaitalic_σ is the level of label noise. We start at iteration n=1𝑛1n=1italic_n = 1 with T𝑇Titalic_T initial independent data points (xi,yi)subscript𝑥𝑖subscript𝑦𝑖(x_{i},y_{i})( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) each following PΣ,w,σ2subscript𝑃Σsuperscript𝑤superscript𝜎2P_{\Sigma,w^{*},\sigma^{2}}italic_P start_POSTSUBSCRIPT roman_Σ , italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, that is, yi=xiw+ϵisubscript𝑦𝑖subscript𝑥𝑖superscript𝑤subscriptitalic-ϵ𝑖y_{i}=x_{i}\cdot w^{*}+\epsilon_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for each i=1,2,,T𝑖12𝑇i=1,2,\cdots,Titalic_i = 1 , 2 , ⋯ , italic_T. We form the design matrix XT×d𝑋superscript𝑇𝑑X\in\mathbb{R}^{T\times d}italic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_d end_POSTSUPERSCRIPT with x1,,xTsuperscriptsubscript𝑥1topsuperscriptsubscript𝑥𝑇topx_{1}^{\top},\cdots,x_{T}^{\top}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , ⋯ , italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT as rows. We also form the vectors Y𝑌Yitalic_Y and E𝐸Eitalic_E with i𝑖iitalic_i-th coordinate yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT respectively. In whatever follows, we will assume that X𝑋Xitalic_X has full column rank, i.e., Td𝑇𝑑T\geq ditalic_T ≥ italic_d, XXsuperscript𝑋top𝑋X^{\top}Xitalic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X is invertible and the model is underparameterized.

Synthetic Data Generation Process.

We generate synthetic data from the following sequence of distributions

PΣ,w,σ2PΣ,w^1,σ2PΣ,w^n,σ2,subscript𝑃Σsuperscript𝑤superscript𝜎2subscript𝑃Σsubscript^𝑤1superscript𝜎2subscript𝑃Σsubscript^𝑤𝑛superscript𝜎2\displaystyle P_{\Sigma,w^{*},\sigma^{2}}\to P_{\Sigma,\hat{w}_{1},\sigma^{2}}% \to\ldots\to P_{\Sigma,\hat{w}_{n},\sigma^{2}},italic_P start_POSTSUBSCRIPT roman_Σ , italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT → italic_P start_POSTSUBSCRIPT roman_Σ , over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT → … → italic_P start_POSTSUBSCRIPT roman_Σ , over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ,

where n𝑛n\in\mathbb{N}italic_n ∈ blackboard_N is the number of iterations. The scheme is outlined as follows.

  • For n=1𝑛1n=1italic_n = 1:

    • Accumulating Covariates/Features: X~1=defXsubscript~𝑋1superscriptdef𝑋\tilde{X}_{1}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}Xover~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_X

    • Accumulating Targets: Y~1=defY^1=defXw+E1subscript~𝑌1superscriptdefsubscript^𝑌1superscriptdef𝑋superscript𝑤subscript𝐸1\tilde{Y}_{1}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}\hat{Y% }_{1}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}Xw^{*}+E_{1}over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, where E1=defE𝒩(0,σ2IT)similar-tosubscript𝐸1superscriptdef𝐸𝒩0superscript𝜎2subscript𝐼𝑇E_{1}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}E\sim\mathcal{% N}(0,\sigma^{2}I_{T})italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_E ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT )

    • Fit linear model: w^1=X~1Y~1subscript^𝑤1superscriptsubscript~𝑋1subscript~𝑌1\hat{w}_{1}=\tilde{X}_{1}^{\dagger}\tilde{Y}_{1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT

    • Sample synthetic data for the next iteration: Y^2=defXw^1+E2subscript^𝑌2superscriptdef𝑋subscript^𝑤1subscript𝐸2\hat{Y}_{2}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}X\hat{w}% _{1}+E_{2}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, where E2𝒩(0,σ2IT)similar-tosubscript𝐸2𝒩0superscript𝜎2subscript𝐼𝑇E_{2}\sim\mathcal{N}(0,\sigma^{2}I_{T})italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT )

  • For n2𝑛2n\geq 2italic_n ≥ 2:

    • Accumulating Covariates/Features: X~n=[X~n1;X]d×nTsuperscriptsubscript~𝑋𝑛topsuperscriptsubscript~𝑋𝑛1topsuperscript𝑋topsuperscript𝑑𝑛𝑇\tilde{X}_{n}^{\top}=[\tilde{X}_{n-1}^{\top};X^{\top}]\in\mathbb{R}^{d\times nT}over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT = [ over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ; italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_n italic_T end_POSTSUPERSCRIPT

    • Accumulating Targets: Y~n=[Y~n1;Y^n]1×nTsuperscriptsubscript~𝑌𝑛topsuperscriptsubscript~𝑌𝑛1topsuperscriptsubscript^𝑌𝑛topsuperscript1𝑛𝑇\tilde{Y}_{n}^{\top}=[\tilde{Y}_{n-1}^{\top};\hat{Y}_{n}^{\top}]\in\mathbb{R}^% {1\times nT}over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT = [ over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ; over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_n italic_T end_POSTSUPERSCRIPT

    • Fit linear model: w^n=defX~nY~nsubscript^𝑤𝑛superscriptdefsuperscriptsubscript~𝑋𝑛subscript~𝑌𝑛\hat{w}_{n}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}\tilde{X% }_{n}^{\dagger}\tilde{Y}_{n}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT

    • Sample synthetic data for the next iteration: Y^n+1=defXw^n+En+1subscript^𝑌𝑛1superscriptdef𝑋subscript^𝑤𝑛subscript𝐸𝑛1\hat{Y}_{n+1}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}X\hat{% w}_{n}+E_{n+1}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT, where En+1𝒩(0,σ2IT)similar-tosubscript𝐸𝑛1𝒩0superscript𝜎2subscript𝐼𝑇E_{n+1}\sim\mathcal{N}(0,\sigma^{2}I_{T})italic_E start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT )

Here, for a matrix A𝐴Aitalic_A with full column rank, A=(AA)1Asuperscript𝐴superscriptsuperscript𝐴top𝐴1superscript𝐴topA^{\dagger}=(A^{\top}A)^{-1}A^{\top}italic_A start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT = ( italic_A start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_A ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_A start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT is the Moore-Penrose pseudo-inverse of A𝐴Aitalic_A. The noise terms E1,E2,,Ensubscript𝐸1subscript𝐸2subscript𝐸𝑛E_{1},E_{2},\ldots,E_{n}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_E start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT are independent of each other and of the covariates/features. Since X𝑋Xitalic_X has full column rank, so does X~nsubscript~𝑋𝑛\tilde{X}_{n}over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT for every n1𝑛1n\geq 1italic_n ≥ 1.

Test Error.

We are interested in the dynamics of the test error Etest(w^n)subscript𝐸testsubscript^𝑤𝑛E_{\text{test}}(\hat{w}_{n})italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) of this sequence of linear model w^1,w^2,subscript^𝑤1subscript^𝑤2\hat{w}_{1},\hat{w}_{2},...over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , …. Note that evaluation of the model is done on the true distribution PΣ,w,σ2subscript𝑃Σsuperscript𝑤superscript𝜎2P_{\Sigma,w^{*},\sigma^{2}}italic_P start_POSTSUBSCRIPT roman_Σ , italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, even though the model is trained on the accumulated synthetic data. For any linear estimator w^^𝑤\hat{w}over^ start_ARG italic_w end_ARG computed from the training data, we measure test error in the standard way:

Etest(w)=def𝔼[(xtestTwytest)2]σ2=𝔼[wwΣ2]subscript𝐸𝑡𝑒𝑠𝑡𝑤superscriptdef𝔼delimited-[]superscriptsuperscriptsubscript𝑥𝑡𝑒𝑠𝑡𝑇𝑤subscript𝑦𝑡𝑒𝑠𝑡2superscript𝜎2𝔼delimited-[]superscriptsubscriptnorm𝑤superscript𝑤Σ2E_{test}(w)\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}\mathbb{% E}\left[(x_{test}^{T}w-y_{test})^{2}\right]-\sigma^{2}=\mathbb{E}[\|w-w^{*}\|_% {\Sigma}^{2}]italic_E start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_w ) start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION blackboard_E [ ( italic_x start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_w - italic_y start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] - italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = blackboard_E [ ∥ italic_w - italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT roman_Σ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (1)

where the expectation is taken over the training data and (xtest,ytest)PΣ,w,σ2similar-tosubscript𝑥𝑡𝑒𝑠𝑡subscript𝑦𝑡𝑒𝑠𝑡subscript𝑃Σsuperscript𝑤superscript𝜎2(x_{test},y_{test})\sim P_{\Sigma,w^{*},\sigma^{2}}( italic_x start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ) ∼ italic_P start_POSTSUBSCRIPT roman_Σ , italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT independent of the training data.

A Note on Extensions to Ridge Regression and Kernel Methods.

To reiterate a comment made previously by Dohmatob et al. (2024a), although we present our results in the context of ordinary linear regression in dsuperscript𝑑\mathbb{R}^{d}blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, our analysis can be readily extended to ridge regression and the kernel setting (Caponnetto & De Vito, 2007; Simon et al., 2021; Cui et al., 2021; Wei et al., 2022). We focus here on a simple useful model for studying model collapse.

Refer to caption
Figure 7: Accumulating Data Avoids Model Collapse in Linear Regression. We consider sequences of linear models recurrently fit to generated targets by previous iterations of models. Top: If each linear model is fit to the generated targets of only the preceding linear model, i.e., data are replaced, then the test error grows linearly with the number of iterations n𝑛nitalic_n. Bottom: If each linear model is instead fit to the generate targets of all the preceding linear models, i.e., data accumulate, then the test error has a finite upper bound independent of the number of iterations. This suggests that data accumulation might be a robust solution for mitigating model collapse. For log test error and higher iterations, see Appendix Fig. 16.

3.2 Precise Test Error Characterization Under Accumulating Data

Our goal is to establish an analytic formula for the test error of the n𝑛nitalic_nth model in the data accumulation setting. We begin by characterizing the relationship between the fitted linear parameters w^nsubscript^𝑤𝑛\hat{w}_{n}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and the true parameters wsuperscript𝑤w^{*}italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. We remind the reader that we assume that X𝑋Xitalic_X has full column rank, i.e., XXsuperscript𝑋top𝑋X^{\top}Xitalic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X is invertible. Proofs are deferred to App. B.

Theorem 1.

In the data accumulation setting, n1for-all𝑛1\forall n\geq 1∀ italic_n ≥ 1, the fitted linear parameters w^nsubscript^𝑤𝑛\hat{w}_{n}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT can be expressed as:

w^n=w+(XX)1X(i=1nEii)subscript^𝑤𝑛superscript𝑤superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑖1𝑛subscript𝐸𝑖𝑖\hat{w}_{n}=w^{*}+(X^{\top}X)^{-1}X^{\top}\left(\sum_{i=1}^{n}\frac{E_{i}}{i}\right)over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG ) (2)

where, recall, wsuperscript𝑤w^{*}italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the true parameter, X𝑋Xitalic_X is the original design matrix, and Eisubscript𝐸𝑖E_{i}italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the extra noise added at the i𝑖iitalic_i’th iteration.

Theorem 2.

For an n𝑛nitalic_n-fold synthetic data generation process with Td+2𝑇𝑑2T\geq d+2italic_T ≥ italic_d + 2 samples per iteration and isotropic features (Σ=defIdΣsuperscriptdefsubscript𝐼𝑑\Sigma\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}I_{d}roman_Σ start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT), the test error for the ridgeless linear predictor w^nsubscript^𝑤𝑛\hat{w}_{n}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT learned on the accumulated data up to iteration n𝑛nitalic_n is given by:

EtestAccum(w^n)=σ2dTd1(i=1n1i2)σ2dTd1×π26superscriptsubscript𝐸testAccumsubscript^𝑤𝑛superscript𝜎2𝑑𝑇𝑑1superscriptsubscript𝑖1𝑛1superscript𝑖2superscript𝜎2𝑑𝑇𝑑1superscript𝜋26E_{\text{test}}^{\text{Accum}}(\hat{w}_{n})=\frac{\sigma^{2}d}{T-d-1}\left(% \sum_{i=1}^{n}\frac{1}{i^{2}}\right)\leq\frac{\sigma^{2}d}{T-d-1}\times\frac{% \pi^{2}}{6}italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT start_POSTSUPERSCRIPT Accum end_POSTSUPERSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) ≤ divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × divide start_ARG italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 6 end_ARG (3)

where, recall, σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT is the noise variance of the fake data generation process, d𝑑ditalic_d is the input dimension, and T𝑇Titalic_T is the number of samples (i.e., data points) added per iteration.

How does test error with accumulating data compare against test error with replacing data? Under otherwise identical assumptions, Dohmatob et al. (2024a) proved in the data-replacing setting that the test error is given by444For notational simplicity, we assume that Dohmatob et al. (2024a)’s T0=defTsubscript𝑇0superscriptdef𝑇T_{0}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}Titalic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_T and σ0=defσsubscript𝜎0superscriptdef𝜎\sigma_{0}\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}\sigmaitalic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_σ.:

EtestReplace(w^n)=σ2dTd1×nsuperscriptsubscript𝐸testReplacesubscript^𝑤𝑛superscript𝜎2𝑑𝑇𝑑1𝑛E_{\text{test}}^{\text{Replace}}(\hat{w}_{n})=\frac{\sigma^{2}d}{T-d-1}\times% \color[rgb]{1,0,0}n\color[rgb]{0,0,0}italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT start_POSTSUPERSCRIPT Replace end_POSTSUPERSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × italic_n (4)

When data are replaced, the test error grows linearly with the number of iterations n𝑛nitalic_n (Fig 7 top), with the rate of growth determined by a noise-to-signal ratio: the amount of noise per dimension σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT times the number of dimensions d𝑑ditalic_d, adjusted by the (per-iteration) sample size T𝑇Titalic_T. In contrast, when data accumulate, Theorem 2 shows the test error is upper bounded regardless of the number of iterations n𝑛nitalic_n:

EtestAccum(w^n)σ2dTd1×π26superscriptsubscript𝐸testAccumsubscript^𝑤𝑛superscript����2𝑑𝑇𝑑1superscript𝜋26E_{\text{test}}^{\text{Accum}}(\hat{w}_{n})\leq\frac{\sigma^{2}d}{T-d-1}\times% \color[rgb]{1,0,0}\frac{\pi^{2}}{6}\color[rgb]{0,0,0}italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT start_POSTSUPERSCRIPT Accum end_POSTSUPERSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ≤ divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × divide start_ARG italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 6 end_ARG

This striking difference can be intuitively explained by the differences in the way data are handled across iterations. In the data replacement setting, because previous data were discarded, the model is more strongly affected by the new noise that each iteration of generated data introduces, and adds that to the effects experienced in earlier iterations. But in the data accumulation setting, because iteration i𝑖iitalic_i contributes fraction 1/i1𝑖1/i1 / italic_i to the training dataset, the additional noise from the i𝑖iitalic_ith iteration of synthetic data has its effect on the model MSE shrunken proportional to 1/i21superscript𝑖21/i^{2}1 / italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (due to squared error). The summability of 1/i21superscript𝑖21/i^{2}1 / italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT prevents the test error from growing indefinitely. This suggests that accumulating generated data with real data can indeed avoid model collapse.

3.3 Numerical Confirmation of Analytical Results

To confirm the analytical results, we numerically simulate the setup. The numerics almost perfectly matched the analytics (Fig. 7): when data are replaced, the test error grows with the number of iterations n𝑛nitalic_n, with the prefactor set by the noise-to-signal ratio σ2d/(Td1)superscript𝜎2𝑑𝑇𝑑1\sigma^{2}d/(T-d-1)italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d / ( italic_T - italic_d - 1 ), but when data accumulate, the test error rapidly plateaus with the prefactor similarly set. For log test error and higher model-fitting iterations, see Appendix Fig. 16.

4 Discussion

This work explored the phenomenon of model collapse, an important concern as AI-generated content permeates the internet and finds its way into future training datasets. Prior work has shown that training on model outputs can lead to degraded performance (Martínez et al., 2023a; b; Shumailov et al., 2023; Alemohammad et al., 2023; Hataya et al., 2023; Bertrand et al., 2023; Briesch et al., 2023; Dohmatob et al., 2024a; b), implying that future model training faces a difficult challenge of ensuring strict training dataset hygiene. For a significantly more thorough discussion of related work, please see Appendix A.

Our findings extend these prior works to show that if data accumulates and models train on a mixture of “real” and synthetic data, model collapse no longer occurs. We show this both experimentally on causal transformers for language modeling, diffusion models for molecule generation, and variational auto-encoders on image data as well as theoretically for linear regression. Together, these results strongly suggest that the “curse of recursion” may not be as dire as had been portrayed – provided we accumulate synthetic data alongside real data, rather than replacing real data by synthetic data only.

Looking to the future, many questions worth investigating remain. For instance, in future work we would like to explore different data generation and accumulation regimes, such as (1) additional “real” data being introduced in each model-fitting iteration and (2) different schedules of how much synthetic data is generated at each iteration and (3) human-filtering of generated data, e.g., as done in RLHF. Additionally, we note that in all our experiments, the synthetic dataset is generated by sampling from the previous model, i.e., with some stochasticity; in future work, we would like to explore also what happens if data is generated deterministically, e.g. with temperature 0 in a typical language model.

Lastly, it is worth noting that “model collapse” – as a term of art – has been used in various ways by various researchers; so care is required in comparing claims across articles. In reviewing the literature, we identified at least four related phenomena: (0) unbounded test error blowup (as here); (1) modal collapse — collapse to one (or a few) modes; (2) collapse to uniformity; and (3) amplification of artifacts introduced by models fit to previous synthetic data. Future work should map out what factors cause which to occur and what preventative strategies are effective at addressing each.

5 Acknowledgements

The content of this paper does not necessarily reflect the position or the policy of any of the funding agencies/entities. No endorsement should be inferred. M.G. acknowledges support through a grant from the Cooperative AI Foundation. R.S. acknowledges support from Stanford Data Science and from OpenAI’s Superalignment Fast Grant Research Fellowship. A.G. acknowledges support from the NSF CAREER grant DMR-2045181, the Sloan Foundation, and by the Laboratory for Physical Sciences through the Condensed Matter Theory Center. D.R. acknowledges support from the National Science Foundation under Cooperative Agreement PHY-2019786 (the NSF AI Institute for Artificial Intelligence and Fundamental Interactions) and appreciates both the sanction and support of Sequoia Capital. S.K. is partially supported by NSF III 2046795, IIS 1909577, CCF 1934986, NIH 1R01MH116226-01A, NIFA award 2020-67021-32799, the Alfred P. Sloan Foundation, and Google Inc.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alemohammad et al. (2023) Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G Baraniuk. Self-consuming generative models go mad. arXiv preprint arXiv:2307.01850, 2023.
  • Axelrod & Gomez-Bombarelli (2022) Simon Axelrod and Rafael Gomez-Bombarelli. Geom, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1):185, 2022.
  • Bertrand et al. (2023) Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data. arXiv preprint arXiv:2310.00429, 2023.
  • Briesch et al. (2023) Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop. arXiv preprint arXiv:2311.16822, 2023.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Caponnetto & De Vito (2007) Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
  • Cui et al. (2021) Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131–10143, 2021.
  • Dohmatob et al. (2024a) Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression. arXiv preprint arXiv:2402.07712, 2024a.
  • Dohmatob et al. (2024b) Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws. arXiv preprint arXiv:2402.07043, 2024b.
  • Eldan & Li (2023) Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023.
  • Hataya et al. (2023) Ryuichiro Hataya, Han Bao, and Hiromi Arai. Will large-scale generative models corrupt future datasets? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  20555–20565, 2023.
  • Jain et al. (2024) Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data. arXiv preprint arXiv:2402.04376, 2024.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp.  611–626, 2023.
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • Marchi et al. (2024) Matteo Marchi, Stefano Soatto, Pratik Chaudhari, and Paulo Tabuada. Heat death of generative models in closed-loop learning, 2024.
  • Martínez et al. (2023a) Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Combining generative artificial intelligence (ai) and the internet: Heading towards evolution or degradation? arXiv preprint arXiv:2303.01255, 2023a.
  • Martínez et al. (2023b) Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the internet. arXiv preprint arXiv:2306.06130, 2023b.
  • Mobahi et al. (2020) Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems, 33:3351–3361, 2020.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278–1286. PMLR, 2014.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Seddik et al. (2024) Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse, 2024.
  • Shumailov et al. (2023) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.
  • Simon et al. (2021) James B Simon, Madeline Dickens, and Michael Deweese. Neural tangent kernel eigenvalues accurately predict generalization. 2021.
  • Taori & Hashimoto (2023) Rohan Taori and Tatsunori Hashimoto. Data feedback loops: Model-driven amplification of dataset biases. In International Conference on Machine Learning, pp. 33883–33920. PMLR, 2023.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wei et al. (2022) Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning, pp. 23549–23588. PMLR, 2022.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Xu et al. (2022) Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.

Appendix A Summarization and Discussion of Prior and Related Work

Prior Empirical Work A growing body of recent work has investigated the phenomenon of iteratively training models on data generated by previous models, e.g., Hataya et al. (2023); Martínez et al. (2023a); Shumailov et al. (2023); Alemohammad et al. (2023); Martínez et al. (2023b); Bertrand et al. (2023); Briesch et al. (2023); Dohmatob et al. (2024a; b) and (in a different context) Taori & Hashimoto (2023). Hataya et al. (2023) and Martínez et al. (2023b) conducted experiments replacing real training data with generated data at each iteration, assuming that the dataset size remains fixed over time. They found that this iterative retraining procedure can lead to model degradation if the proportion of synthetic data becomes too high. Similarly, Shumailov et al. (2023) ran experiments with Gaussian mixture models, VAEs, and language models in which the total number of samples per iteration was held constant, and the samples always originated with the previous model rather than aggregating over time. Building on this work, Alemohammad et al. (2023) considered three treatments of data: fully replacing real data with synthetic data, augmenting a fixed real dataset with additional synthetic data, and mixing new real data with synthetic data at each iteration. In almost all of their experiments, they drew a fixed size dataset from the most recent model at each iteration, without accumulating data. Bertrand et al. (2023) also assumed that dataset size and mixing proportions are constant over time in their theoretical stability analysis and empirical validation.

Prior Theoretical Work

Over the last few years, there has been significant research effort contributing to our theoretical understanding of model behavior when synthetic data are integrated into training. The most closely related works to ours are Dohmatob et al. (2024a) and Dohmatob et al. (2024b); of course, the inspiration for the linear regression model studied in this paper directly comes from Dohmatob et al. (2024a). Dohmatob et al. (2024a) performs an in-depth analysis of high dimensional linear and ridge regression when the training data used per iteration are generated from the previous iteration’s fitted model. They are able to conclude that the test error grows linearly with the iteration count in their setup, as well as derive more interesting and more nuanced results using random matrix theory. They also discuss how to mitigate model collapse through optimal regularization both when the training data are noise-free and noisy versions of the previous model’s synthetic outputs. A related noise-free setup was studied by Mobahi et al. (2020) in the case of self-distillation. Although Mobahi et al. (2020) considers a more general setup with ridge regression as a special case, they use noiseless predictions from the previous model as the training data for the next model, and show that eventually, the predictions shrink to zero. Through this, they highlight that self-distillation induces regularization in the function space, which initially is beneficial for reducing over-fitting, but eventually over-regularization causes underfitting and hence performance decay. Dohmatob et al. (2024b) go beyond the linear model to study model collapse – they study the tails of LLM outputs vs. real data and provide scaling laws that clearly identify regimes of model degradation when synthetic data misses tails present in real data. They identify an interesting phase transition in the test error scaling law depending on the size of the real dataset size in comparison to (a functional of) the chopped-off tail, and conclude that enough real data is able to mitigate model collapse. All these works consider the scenario where the amount of training data available per iteration is fixed (and does not grow with the iteration count), and it is certainly possible that with larger amount of synthetic data (from prediction by the previous model), several of these scalings would improve significantly. For example, in Equation (12) of Dohmatob et al. (2024b), one obtains the linear scaling (with iteration count) of test error simply because the amount of synthetic data generated per iteration is the same. If one generated synthetic data with size proportional to the iteration count, then at iteration n𝑛nitalic_n, the scaling would, instead of n𝑛nitalic_n, be like n1c/(1c)superscript𝑛1𝑐1𝑐n^{1-c}/(1-c)italic_n start_POSTSUPERSCRIPT 1 - italic_c end_POSTSUPERSCRIPT / ( 1 - italic_c ) for c<1𝑐1c<1italic_c < 1. When one does not increase the dataset size, Dohmatob et al. (2024b) points out that increasing the proportion of real data would help one to avoid model collapse altogether. However, even if one did increase the amount of synthetic data with iteration count, Theorem 3.2 coupled with Corollary 3.3 in Dohmatob et al. (2024b) would tell us that the amount of real data was all that mattered – if the amount of real data is large, we overcome model collapse. If one only had synthetic data (and no real data), no matter how large, it would be impossible to regain the original real-data scaling laws. The scenario we study is highly inspired by these pioneering works, but still, in our view, different. We consider the case when we keep augmenting synthetic data (generated by the previous model trained on all the previous data so far) as iterations progress, much akin to how – in our view – the internet evolves. We observe that we can avoid model collapse in this setting. The analysis of previous models in our case is more involved, since the data used for training at iteration n𝑛nitalic_n is not homogeneous – different models from the past impart different statistical aspects to different parts of the training data. We also note a related augmentation model studied by Jain et al. (2024) – they perform risk minimization augmenting real data with synthetic data available from a potentially different independent source. One of their messages is that augmentation of (even) pure noise can be looked upon as adding a ridge penalty and hence, in certain cases, can improve test error. Their setup, however, is different from ours, since the synthetic data in their setup is not obtained by a learning algorithm employed on the real data, and the process is not iterative. However, morally, each iteration of ours involves risk minimization on data statistically composed of an equal mixture of data generated from the previous models, and hence each iteration of ours can be mapped to the general framework developed in Jain et al. (2024), although the dependencies among the various models trained in our setup introduce theoretical complications that do not seem to be too easily addressed by the theory developed in Jain et al. (2024). Shortly after v1 of our manuscript was uploaded to ArXiv, two other manuscripts appeared, dealing with the theoretical aspects in a setting similar to ours. Theorem 1 of Marchi et al. (2024) obtains the same square summability scaling of the variance as us. Seddik et al. (2024) studies collapse in language models in both purely synthetic and partly synthetic regimes and obtains deviation bounds as model iterations progress.

Refer to caption
Figure 8: Clarification of Data Accumulation in Alemohammad et al. (2023). Figure 7 from Alemohammad et al. (2023) (above) shows that linearly accumulating data (“Synthetic augmentation loop”) causes poor behavior to plateau with the number of model-fitting iterations. Alemohammad et al. (2023) write, “Our experiments […] support our main conclusion [that] fixed real training data only delays the inevitable degradation of the quality or diversity of the generative models over generations.” We believe is that our evidence and their evidence is more consistent with the conclusion that accumulating data avoids model collapse and does not merely delay it.

Considering Accumulating Data

The two papers we found that partially considered accumulating data are Martínez et al. (2023a) and Alemohammad et al. (2023). Alemohammad et al. (2023) did so in one-half of one experiment: StyleGAN2 trained on FliqrFaces 128×\times×128 (App. Fig. 8). The authors concluded that accumulating data does not avoid model collapse, but merely slows it down. However, we believe that a closer examination of their results (App. Fig. 8) reveals that accumulating data causes the test error to plateau to a relatively low error with increasing numbers of model-fitting iterations. This result would support our conclusion that accumulating data avoids model collapse and does not merely delay it. The results from Martínez et al. (2023a) are harder to evaluate; model collapse only seems to occur when the amount of synthetic data added per model-fitting iteration is 2×\times× the total amount of accumulated data, and the subsequent work by the authors switched from accumulating data to replacing data (Martínez et al., 2023b). We think understanding what conditions and why these discrepancies exist is an interesting future direction.

Avoiding Model Collapse

Several papers present methods for avoiding or slowing model collapse. Bertrand et al. (2023) shows in the replacing data setting that model collapse will not occur if the initial generative models approximate the data distribution well enough and the proportion of real data is sufficiently large with respect to the synthetic data. Dohmatob et al. (2024b) similarly demonstrates that in the replacing data setting, carefully selecting real data to mix with synthetic data can avoid model collapse. Other solutions may also be possible in various models and under various assumptions. To our knowledge, no paper has claimed an “optimal” strategy to avoid model collapse, and neither has ours.

Appendix B Proofs of Mathematical Results

We point out a lemma useful to prove Theorem 2.

Lemma 3.

Let T𝑇Titalic_T and d𝑑ditalic_d be positive integers with Td+2𝑇𝑑2T\geq d+2italic_T ≥ italic_d + 2, and let XT×d𝑋superscript𝑇𝑑X\in\mathbb{R}^{T\times d}italic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_d end_POSTSUPERSCRIPT be a random matrix with i.i.d. rows from 𝒩(0,Σ)𝒩0Σ\mathcal{N}(0,\Sigma)caligraphic_N ( 0 , roman_Σ ) with ΣΣ\Sigmaroman_Σ positive definite. Then, X𝑋Xitalic_X has full rank a.s. Moreover, it holds that:

𝔼X[(XX)1]=1Td1Σ1.subscript𝔼𝑋delimited-[]superscriptsuperscript𝑋top𝑋11𝑇𝑑1superscriptΣ1\mathbb{E}_{X}[(X^{\top}X)^{-1}]=\frac{1}{T-d-1}\Sigma^{-1}.blackboard_E start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT [ ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ] = divide start_ARG 1 end_ARG start_ARG italic_T - italic_d - 1 end_ARG roman_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT . (5)
Proof.

See Dohmatob et al. (2024a). ∎

Assuming Lemma 5 and Theorem 1, we present the proof of Theorem 2.

Proof of Theorem 2.

From Theorem 1, we have:

w^n=w+(XX)1X(i=1nEii)subscript^𝑤𝑛superscript𝑤superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑖1𝑛subscript𝐸𝑖𝑖\hat{w}_{n}=w^{*}+(X^{\top}X)^{-1}X^{\top}\left(\sum_{i=1}^{n}\frac{E_{i}}{i}\right)over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG ) (6)

where wsuperscript𝑤w^{*}italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the true parameter, X𝑋Xitalic_X is the original data matrix, and Eisubscript𝐸𝑖E_{i}italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are the noise terms at each iteration, with Ei𝒩(0,σ2IT)similar-tosubscript𝐸𝑖𝒩0superscript𝜎2subscript𝐼𝑇E_{i}\sim\mathcal{N}(0,\sigma^{2}I_{T})italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ). The test error is given by:

Etest(w^n)=𝔼[w^nwΣ2]subscript𝐸testsubscript^𝑤𝑛𝔼delimited-[]superscriptsubscriptnormsubscript^𝑤𝑛superscript𝑤Σ2E_{\text{test}}(\hat{w}_{n})=\mathbb{E}[||\hat{w}_{n}-w^{*}||_{\Sigma}^{2}]italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = blackboard_E [ | | over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT | | start_POSTSUBSCRIPT roman_Σ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (7)

where the expectation is taken over all random quantities involved.

Substituting w^nsubscript^𝑤𝑛\hat{w}_{n}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT into the test error expression and using the fact that Σ=defIdΣsuperscriptdefsubscript𝐼𝑑\Sigma\operatorname{\stackrel{{\scriptstyle\text{def}}}{{\;=\;}}}I_{d}roman_Σ start_OPFUNCTION SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_OPFUNCTION italic_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT, we get:

Etest(w^n)subscript𝐸testsubscript^𝑤𝑛\displaystyle E_{\text{test}}(\hat{w}_{n})italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =𝔼[(i=1nEii)X(XX)2X(i=1nEii)]absent𝔼delimited-[]superscriptsuperscriptsubscript𝑖1𝑛subscript𝐸𝑖𝑖top𝑋superscriptsuperscript𝑋top𝑋2superscript𝑋topsuperscriptsubscript𝑖1𝑛subscript𝐸𝑖𝑖\displaystyle=\mathbb{E}\left[\left(\sum_{i=1}^{n}\frac{E_{i}}{i}\right)^{\top% }X(X^{\top}X)^{-2}X^{\top}\left(\sum_{i=1}^{n}\frac{E_{i}}{i}\right)\right]= blackboard_E [ ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG ) ]
=𝔼[i=1nσ2i2tr(X(XX)2X)]absent𝔼delimited-[]superscriptsubscript𝑖1𝑛superscript𝜎2superscript𝑖2tr𝑋superscriptsuperscript𝑋top𝑋2superscript𝑋top\displaystyle=\mathbb{E}\left[\sum_{i=1}^{n}\frac{\sigma^{2}}{i^{2}}\text{tr}(% X(X^{\top}X)^{-2}X^{\top})\right]= blackboard_E [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG tr ( italic_X ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) ]
=i=1nσ2i2𝔼[tr((XX)1)]absentsuperscriptsubscript𝑖1𝑛superscript𝜎2superscript𝑖2𝔼delimited-[]trsuperscriptsuperscript𝑋top𝑋1\displaystyle=\sum_{i=1}^{n}\frac{\sigma^{2}}{i^{2}}\mathbb{E}\left[\text{tr}(% (X^{\top}X)^{-1})\right]= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG blackboard_E [ tr ( ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) ]

Using Lemma 5, we have:

𝔼X[tr((XX)1)]=dTd1subscript𝔼𝑋delimited-[]trsuperscriptsuperscript𝑋top𝑋1𝑑𝑇𝑑1\mathbb{E}_{X}\left[\text{tr}((X^{\top}X)^{-1})\right]=\frac{d}{T-d-1}blackboard_E start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT [ tr ( ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) ] = divide start_ARG italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG (8)

Therefore, the test error for ridgeless regression with isotropic features in the data accumulation setting is:

Etest(w^n)subscript𝐸testsubscript^𝑤𝑛\displaystyle E_{\text{test}}(\hat{w}_{n})italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =i=1nσ2i2dTd1<σ2dTd1(π26)absentsuperscriptsubscript𝑖1𝑛superscript𝜎2superscript𝑖2𝑑𝑇𝑑1superscript𝜎2𝑑𝑇𝑑1superscript𝜋26\displaystyle=\sum_{i=1}^{n}\frac{\sigma^{2}}{i^{2}}\cdot\frac{d}{T-d-1}<\frac% {\sigma^{2}d}{T-d-1}\left(\frac{\pi^{2}}{6}\right)= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_i start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ⋅ divide start_ARG italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG < divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG ( divide start_ARG italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 6 end_ARG )

as i=1ni2<i=1i2=π2/6superscriptsubscript𝑖1𝑛superscript𝑖2superscriptsubscript𝑖1superscript𝑖2superscript𝜋26\sum_{i=1}^{n}i^{-2}<\sum_{i=1}^{\infty}i^{-2}=\pi^{2}/6∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_i start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT < ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_i start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT = italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 6. ∎

Finally, we prove Theorem 1.

Proof of Theorem 1.

We prove this theorem by induction.

Base case: For n=1𝑛1n=1italic_n = 1, we have:

w^1subscript^𝑤1\displaystyle\hat{w}_{1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =X~1Y~1=(XX)1X(Xw+E1)=w+(XX)1XE1absentsuperscriptsubscript~𝑋1subscript~𝑌1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑋superscript𝑤subscript𝐸1superscript𝑤superscriptsuperscript𝑋top𝑋1superscript𝑋topsubscript𝐸1\displaystyle=\tilde{X}_{1}^{\dagger}\tilde{Y}_{1}=(X^{\top}X)^{-1}X^{\top}(Xw% ^{*}+E_{1})=w^{*}+(X^{\top}X)^{-1}X^{\top}E_{1}= over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT

which satisfies the lemma.

Inductive step: Assume that for some n1𝑛1n\geq 1italic_n ≥ 1, we have:

w^n=w+(XX)1X(i=1nEii)subscript^𝑤𝑛superscript𝑤superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑖1𝑛subscript𝐸𝑖𝑖\hat{w}_{n}=w^{*}+(X^{\top}X)^{-1}X^{\top}\left(\sum_{i=1}^{n}\frac{E_{i}}{i}\right)over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG )

Now, consider w^n+1subscript^𝑤𝑛1\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT:

w^n+1subscript^𝑤𝑛1\displaystyle\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT =X~n+1Y~n+1absentsuperscriptsubscript~𝑋𝑛1subscript~𝑌𝑛1\displaystyle=\tilde{X}_{n+1}^{\dagger}\tilde{Y}_{n+1}= over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT
=(X~n+1X~n+1)1X~n+1Y~n+1absentsuperscriptsuperscriptsubscript~𝑋𝑛1topsubscript~𝑋𝑛11superscriptsubscript~𝑋𝑛1topsubscript~𝑌𝑛1\displaystyle=(\tilde{X}_{n+1}^{\top}\tilde{X}_{n+1})^{-1}\tilde{X}_{n+1}^{% \top}\tilde{Y}_{n+1}= ( over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over~ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT
=1n+1(XX)1i=1n+1XY^iabsent1𝑛1superscriptsuperscript𝑋top𝑋1superscriptsubscript𝑖1𝑛1superscript𝑋topsubscript^𝑌𝑖\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\sum_{i=1}^{n+1}X^{\top}\hat{Y}_{i}= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT

Recalling that Y^isubscript^𝑌𝑖\hat{Y}_{i}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT:

Y^i={Xw+E1,i=1Xw^i1+Ei,2in+1subscript^𝑌𝑖cases𝑋superscript𝑤subscript𝐸1𝑖1𝑋subscript^𝑤𝑖1subscript𝐸𝑖2𝑖𝑛1\displaystyle\hat{Y}_{i}=\begin{cases}Xw^{*}+E_{1},&i=1\\ X\hat{w}_{i-1}+E_{i},&2\leq i\leq n+1\end{cases}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , end_CELL start_CELL italic_i = 1 end_CELL end_ROW start_ROW start_CELL italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , end_CELL start_CELL 2 ≤ italic_i ≤ italic_n + 1 end_CELL end_ROW

Substituting this back into the expression for w^n+1subscript^𝑤𝑛1\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT:

w^n+1subscript^𝑤𝑛1\displaystyle\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT =1n+1(XX)1(X(Xw+E1)+i=2n+1X(Xw^i1+Ei))absent1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑋superscript𝑤subscript𝐸1superscriptsubscript𝑖2𝑛1superscript𝑋top𝑋subscript^𝑤𝑖1subscript𝐸𝑖\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\left(X^{\top}(Xw^{*}+E_{1})+\sum_{% i=2}^{n+1}X^{\top}(X\hat{w}_{i-1}+E_{i})\right)= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + ∑ start_POSTSUBSCRIPT italic_i = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) )
=1n+1(XX)1(XXw+XE1+i=2n+1(XXw^i1+XEi))absent1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑋superscript𝑤superscript𝑋topsubscript𝐸1superscriptsubscript𝑖2𝑛1superscript𝑋top𝑋subscript^𝑤𝑖1superscript𝑋topsubscript𝐸𝑖\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\left(X^{\top}Xw^{*}+X^{\top}E_{1}+% \sum_{i=2}^{n+1}(X^{\top}X\hat{w}_{i-1}+X^{\top}E_{i})\right)= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) )
=1n+1(XX)1(XXw+XE1+i=1n(XXw^i+XEi+1))absent1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑋superscript𝑤superscript𝑋topsubscript𝐸1superscriptsubscript𝑖1𝑛superscript𝑋top𝑋subscript^𝑤𝑖superscript𝑋topsubscript𝐸𝑖1\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\left(X^{\top}Xw^{*}+X^{\top}E_{1}+% \sum_{i=1}^{n}(X^{\top}X\hat{w}_{i}+X^{\top}E_{i+1})\right)= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ) )
=1n+1(XX)1(XXw+i=1nXXw^i+i=1n+1XEi)absent1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑋superscript𝑤superscriptsubscript𝑖1𝑛superscript𝑋top𝑋subscript^𝑤𝑖superscriptsubscript𝑖1𝑛1superscript𝑋topsubscript𝐸𝑖\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\left(X^{\top}Xw^{*}+\sum_{i=1}^{n}% X^{\top}X\hat{w}_{i}+\sum_{i=1}^{n+1}X^{\top}E_{i}\right)= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

Now, using the induction hypothesis:

w^n+1subscript^𝑤𝑛1\displaystyle\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT =1n+1(XX)1(XXw+i=1nXX(w+(XX)1Xj=1iEjj)+i=1n+1XEi)absent1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑋superscript𝑤superscriptsubscript𝑖1𝑛superscript𝑋top𝑋superscript𝑤superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑗1𝑖subscript𝐸𝑗𝑗superscriptsubscript𝑖1𝑛1superscript𝑋topsubscript𝐸𝑖\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\left(X^{\top}Xw^{*}+\sum_{i=1}^{n}% X^{\top}X\left(w^{*}+(X^{\top}X)^{-1}X^{\top}\sum_{j=1}^{i}\frac{E_{j}}{j}% \right)+\sum_{i=1}^{n+1}X^{\top}E_{i}\right)= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ( italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_j end_ARG ) + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=1n+1(XX)1((n+1)XXw+i=1nXX(XX)1Xj=1iEjj+i=1n+1XEi)absent1𝑛1superscriptsuperscript𝑋top𝑋1𝑛1superscript𝑋top𝑋superscript𝑤superscriptsubscript𝑖1𝑛superscript𝑋top𝑋superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑗1𝑖subscript𝐸𝑗𝑗superscriptsubscript𝑖1𝑛1superscript𝑋topsubscript𝐸𝑖\displaystyle=\frac{1}{n+1}(X^{\top}X)^{-1}\left((n+1)X^{\top}Xw^{*}+\sum_{i=1% }^{n}X^{\top}X(X^{\top}X)^{-1}X^{\top}\sum_{j=1}^{i}\frac{E_{j}}{j}+\sum_{i=1}% ^{n+1}X^{\top}E_{i}\right)= divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( ( italic_n + 1 ) italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_j end_ARG + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=w+1n+1(XX)1(i=1nXj=1iEjj+i=1n+1XEi)absentsuperscript𝑤1𝑛1superscriptsuperscript𝑋top𝑋1superscriptsubscript𝑖1𝑛superscript𝑋topsuperscriptsubscript𝑗1𝑖subscript𝐸𝑗𝑗superscriptsubscript𝑖1𝑛1superscript𝑋topsubscript𝐸𝑖\displaystyle=w^{*}+\frac{1}{n+1}(X^{\top}X)^{-1}\left(\sum_{i=1}^{n}X^{\top}% \sum_{j=1}^{i}\frac{E_{j}}{j}+\sum_{i=1}^{n+1}X^{\top}E_{i}\right)= italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_j end_ARG + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=w+1n+1(XX)1X(i=1nj=1iEjj+i=1n+1Ei)absentsuperscript𝑤1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑖1𝑛superscriptsubscript𝑗1𝑖subscript𝐸𝑗𝑗superscriptsubscript𝑖1𝑛1subscript𝐸𝑖\displaystyle=w^{*}+\frac{1}{n+1}(X^{\top}X)^{-1}X^{\top}\left(\sum_{i=1}^{n}% \sum_{j=1}^{i}\frac{E_{j}}{j}+\sum_{i=1}^{n+1}E_{i}\right)= italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_j end_ARG + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

Now, we need to simplify the term i=1nj=1iEjj+i=1n+1Eisuperscriptsubscript𝑖1𝑛superscriptsubscript𝑗1𝑖subscript𝐸𝑗𝑗superscriptsubscript𝑖1𝑛1subscript𝐸𝑖\sum_{i=1}^{n}\sum_{j=1}^{i}\frac{E_{j}}{j}+\sum_{i=1}^{n+1}E_{i}∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_j end_ARG + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We can do this by counting the number of times each Eisubscript𝐸𝑖E_{i}italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT appears in the double sum: E1subscript𝐸1E_{1}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT appears n𝑛nitalic_n times in the double sum and once in the single sum, so its coefficient is n+11𝑛11\frac{n+1}{1}divide start_ARG italic_n + 1 end_ARG start_ARG 1 end_ARG. E2subscript𝐸2E_{2}italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT appears n1𝑛1n-1italic_n - 1 times in the double sum and once in the single sum, so its coefficient is n2𝑛2\frac{n}{2}divide start_ARG italic_n end_ARG start_ARG 2 end_ARG. This continues along till we reach Ensubscript𝐸𝑛E_{n}italic_E start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, which appears once in the double sum and once in the single sum, so its coefficient is 2n2𝑛\frac{2}{n}divide start_ARG 2 end_ARG start_ARG italic_n end_ARG. En+1subscript𝐸𝑛1E_{n+1}italic_E start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT appears only once in the single sum, so its coefficient is 1n+11𝑛1\frac{1}{n+1}divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG. Therefore,

i=1nj=1iEjj+i=1n+1Eisuperscriptsubscript𝑖1𝑛superscriptsubscript𝑗1𝑖subscript𝐸𝑗𝑗superscriptsubscript𝑖1𝑛1subscript𝐸𝑖\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{i}\frac{E_{j}}{j}+\sum_{i=1}^{n+1}E_{i}∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_j end_ARG + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =i=1n+1n+2iiEi=(n+1)i=1n+1Eiiabsentsuperscriptsubscript𝑖1𝑛1𝑛2𝑖𝑖subscript𝐸𝑖𝑛1superscriptsubscript𝑖1𝑛1subscript𝐸𝑖𝑖\displaystyle=\sum_{i=1}^{n+1}\frac{n+2-i}{i}E_{i}=(n+1)\sum_{i=1}^{n+1}\frac{% E_{i}}{i}= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT divide start_ARG italic_n + 2 - italic_i end_ARG start_ARG italic_i end_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_n + 1 ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG

Substituting this back into the expression for w^n+1subscript^𝑤𝑛1\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT:

w^n+1subscript^𝑤𝑛1\displaystyle\hat{w}_{n+1}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT =w+1n+1(XX)1X((n+1)i=1n+1Eii)absentsuperscript𝑤1𝑛1superscriptsuperscript𝑋top𝑋1superscript𝑋top𝑛1superscriptsubscript𝑖1𝑛1subscript𝐸𝑖𝑖\displaystyle=w^{*}+\frac{1}{n+1}(X^{\top}X)^{-1}X^{\top}\left((n+1)\sum_{i=1}% ^{n+1}\frac{E_{i}}{i}\right)= italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ( italic_n + 1 ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG )
=w+(XX)1Xi=1n+1Eiiabsentsuperscript𝑤superscriptsuperscript𝑋top𝑋1superscript𝑋topsuperscriptsubscript𝑖1𝑛1subscript𝐸𝑖𝑖\displaystyle=w^{*}+(X^{\top}X)^{-1}X^{\top}\sum_{i=1}^{n+1}\frac{E_{i}}{i}= italic_w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + ( italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT divide start_ARG italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_i end_ARG

Therefore, by mathematical induction, the lemma holds for all n1𝑛1n\geq 1italic_n ≥ 1. ∎

Appendix C Additional Details and Ablations on Language Model Experiments

Implementation Details

Model training was implemented using Huggingface Transformers (Wolf et al., 2019). Dataset generation was implemented using vllm (Kwon et al., 2023).

Additional Plots

In addition to Figure 3 in the main text, Figures 9-12 show learning curves in larger print, with x-axes showing either epochs or gradient steps, and with axes shown in linear-linear or log-log scale, respectively.

Refer to caption
Figure 9: Data Accumulation Avoids Model Collapse in Language Modeling. Learning curves for individual model-fitting iterations when repeatedly replacing data (left), and when accumulating data (right). Note: Epochs correspond to more gradient steps for accumulate than replace because the number of training data grows for accumulate.
Refer to caption
Figure 10: Data Accumulation Avoids Model Collapse in Language Modeling. Learning curves for individual model-fitting iterations when repeatedly replacing data (left), and when accumulating data (right), in log-log scale. Note: Epochs correspond to more gradient steps for accumulate than replace because the number of training data grows for accumulate.
Refer to caption
Figure 11: Data Accumulation Avoids Model Collapse in Language Modeling. Learning curves for individual model-fitting iterations when repeatedly replacing data (left), and when accumulating data (right).
Refer to caption
Figure 12: Data Accumulation Avoids Model Collapse in Language Modeling. Learning curves for individual model-fitting iterations when repeatedly replacing data (left), and when accumulating data (right), in log-log scale.

Ablations

In addition to the experiments shown in the main paper, we conducted several ablation studies.

Controlling for dataset size.

One possible concern is that when accumulating data, the train dataset size will grow at each model-fitting iteration, meaning subsequent models will be trained on more aggregate data than their counterparts in the replacement regime. To control for this, we run experiments controlling for this. In this “replace-multiple” regime, we create a fully synthetic dataset at the end of each model-fitting iteration, but grow the size of this dataset to match that of the accumulated data in the accumulation regime. Table 2 rightmost column shows that in this regime, evaluation loss still increases over model-fitting iterations.

Generation temperature.

Most of our language model experiments were run with sampling temperature 1.01.01.01.0 during generation of new datasets. To ensure that this choice is not critical, we also run one experiment with temperature 0.30.30.30.3, and see that this shows similar results (with even larger increases in validation loss in the replacement regime than temperature 1.01.01.01.0), as shown in Table 2, row 2, and Figure 13.

Dataset size and training epochs.

We similarly vary the size of the initial (and subsequent) training datasets and number of training epochs, and see that this has no qualitative effect on the results (Table 2, rows 3 & 4 show training on 1/5th of the TinyStories dataset for 1 & 3 epochs, respectively).

Model quality after first model-fitting iteration.

Finally, we control specifically for model (and thus synthetic dataset) quality after the first iteration, to rule out an undue influence of a “bad” first synthetic dataset on subsequent training. Figure 14 shows performance in subsequent iterations for different amounts of training in the first iteration, showing no qualitative differences.

Model t=1 t=4 (acc) t=4 (repl) t=10 (repl) t=4 (*)
GPT-2 (9M) 1.82 1.74 (-0.07) 2.39 (+0.58) 2.91 (+1.09) 2.18 (+0.36)
GPT-2 (9M) (temp=0.3) 1.82 1.75 (-0.06) 5.82 (+4.00) 9.85 (+8.04) n/a
GPT-2 (9M) (small dataset) 2.56 2.28 (-0.28) 3.21 (+0.65) 3.72 (+1.16) 2.91 (+0.35)
ibid (+ 3 epochs) 1.99 1.87 (-0.12) 2.62 (+0.63) n/a n/a
Llama-2 (12M) 2.06 1.94 (-0.12) 2.72 (+0.66) n/a n/a
Llama-2 (42M) 1.90 1.76 (-0.14) 2.52 (+0.62) n/a n/a
Llama-2 (126M) 1.71 1.59 (-0.12) 2.23 (+0.53) n/a n/a
Table 2: Evaluation cross-entropy loss for different models at model-fitting iterations 1, 4 and 10 for replacement and accumulation regimes. (*) indicates a replacement regime with growing dataset size to ablate for total train set size.
Refer to caption
Figure 13: Accumulating data shows stable behavior across different generation temperatures for a GPT-2 (9M) model, while replacing data does not.
Refer to caption
Figure 14: Model quality after the first model-fitting iteration does not qualitatively change behavior in subsequent iterations. Columns show differing training amount (as measure by epochs) in first iteration.

Appendix D Additional Details on VAE Experiments

Experiment Details.

As pre-processing, we crop and down-sample the images to 64x64 pixels. We use a standard convolutional architecture for the VAE model consisting of 5 convolutional layers with 32, 64, 128, 256, and 512 channels, respectively, and a similar convolutional decoder structure. The latent space is 128-dimensional isotropic Gaussian, represented by 2 MLP layers. Each data iteration consists of 100 training epochs, after which we generate 163K new training images by sampling latents from the Gaussian prior and the passing them through the generator model.

Analysis of Reconstructions.

Figure 15 shows reconstructions after replacing (left) and accumulating (center) data, compared to baseline (right). Analyzing the reconstruction of test set images also reveals interesting findings - the model trained only on data from the prior iteration has indeed collapsed and cannot represent any other classes besides the single mode it generates. Interestingly, the model trained on aggregated data still maintains it’s capabilities and generates accurate reconstructions, including smaller details such as glasses and hats. We hypothesize that this model maintains it’s generative capabilities, but these details become a more minor sub-manifold in the latent space, which is realigned with the newly-generated data, hence why they appear less often in the generated images, which use samples from the prior.

Refer to caption
Refer to caption
Refer to caption
Figure 15: Data Accumulation Maintains Model Capabilities. Image reconstructions from the test set. Left: Training on prior iterations collapses the model’s capability, and subsequently, it can only represent a single mode. Middle: training on aggregated data preserves model capabilities and leads to little to no degradation in the reconstructed images. Right: Baseline reconstructions after 100 training epochs on the dataset.

Appendix E Linear Regression: Replacing Data with Increasing Sample Size

In the framework of Mobahi et al. (2020) and Dohmatob et al. (2024a), we consider sequences of linear models fit to the previous model’s synthetic outputs. Within this framework, Dohmatob et al. (2024a) proved that if data are replaced with each model fitting iteration and the training data cardinality remains constant, then the test squared error scales linearly with the number of model fitting iterations n𝑛nitalic_n:

EtestReplace(w^n)=σ2dTd1×nsuperscriptsubscript𝐸testReplacesubscript^𝑤𝑛superscript𝜎2𝑑𝑇𝑑1𝑛E_{\text{test}}^{\text{Replace}}(\hat{w}_{n})=\frac{\sigma^{2}d}{T-d-1}\times% \color[rgb]{1,0,0}n\color[rgb]{0,0,0}italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT start_POSTSUPERSCRIPT Replace end_POSTSUPERSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × italic_n (9)

In this work, we lightly adapt the argument of Dohmatob et al. (2024a) to study the effects if data accumulate with each model fitting iteration. We specifically considered the case where the training data cardinality increases by a constant T𝑇Titalic_T with each model-fitting iteration i.e. the i𝑖iitalic_ith model is fit using T×i𝑇𝑖T\times iitalic_T × italic_i data, where T𝑇Titalic_T data are “real” and then each subsequently fit model contributes its own T𝑇Titalic_T synthetic data to the accumulating data. In this setting, the test squared error is upper bounded independent of the number of iterations.

EtestAccumulate(w^n)=σ2dTd1×k=1n1k2σ2dTd1×π26superscriptsubscript𝐸testAccumulatesubscript^𝑤𝑛superscript𝜎2𝑑𝑇𝑑1superscriptsubscript𝑘1𝑛1superscript𝑘2superscript𝜎2𝑑𝑇𝑑1superscript𝜋26E_{\text{test}}^{\text{Accumulate}}(\hat{w}_{n})=\frac{\sigma^{2}d}{T-d-1}% \times\sum_{k=1}^{n}\dfrac{1}{k^{2}}\leq\frac{\sigma^{2}d}{T-d-1}\times\color[% rgb]{1,0,0}\frac{\pi^{2}}{6}\color[rgb]{0,0,0}italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT start_POSTSUPERSCRIPT Accumulate end_POSTSUPERSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_k start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ≤ divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × divide start_ARG italic_π start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 6 end_ARG (10)

In the main text, we focus on the replace and accumulate data settings because prior work focused on replacing data and we wished to study how accumulating data affects model collapse. However, a much richer landscape of outcomes is possible. For instance, and as pointed out in personal correspondence with Dohmatob et al. (2024a), one can consider what we term the “Replace-Multiple” setting, in which one fits the i𝑖iitalic_i-th linear model using T×i𝑇𝑖T\times iitalic_T × italic_i data sampled from the (i1)𝑖1(i-1)( italic_i - 1 )-th linear model. Replace-Multiple is a useful baseline for Accumulate because it matches the amount of training data at each model fitting iteration. Under Replace-Multiple, the test squared error grows logarithmically:

EtestReplace-Multiple(w^n)=σ2dTd1×k=1n1kσ2dTd1×log(n)superscriptsubscript𝐸testReplace-Multiplesubscript^𝑤𝑛superscript𝜎2𝑑𝑇𝑑1superscriptsubscript𝑘1𝑛1𝑘superscript𝜎2𝑑𝑇𝑑1𝑛E_{\text{test}}^{\text{Replace-Multiple}}(\hat{w}_{n})=\frac{\sigma^{2}d}{T-d-% 1}\times\sum_{k=1}^{n}\dfrac{1}{k}\approx\frac{\sigma^{2}d}{T-d-1}\times\color% [rgb]{1,0,0}\log(n)\color[rgb]{0,0,0}italic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT start_POSTSUPERSCRIPT Replace-Multiple end_POSTSUPERSCRIPT ( over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_k end_ARG ≈ divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG italic_T - italic_d - 1 end_ARG × roman_log ( italic_n ) (11)

Replace-Multiple has the drawback of not matching the total amount of compute of Accumulate since each iteration of Replace-Multiple draws T×i𝑇𝑖T\times iitalic_T × italic_i samples from the most recent model, whereas Accumulate draws T𝑇Titalic_T samples from the most recent model. Other baselines are also possible, but we leave these to future work. We focus on accumulating data as we feel real and synthetic data are likely to accumulate in the real world as time progresses.

Appendix F Additional Linear Regression Numerical Results

Refer to caption
Figure 16: Accumulating data across iterations avoids model collapse in linear regression. We consider sequences of linear models recurrently fit to generated targets by previous iterations of models. Replace (Top): If each linear model is fit to the generated targets of only the preceding linear model i.e. data are replaced, then the test squared error grows linearly with the number of model-fitting iterations iterations n𝑛nitalic_n. Replace-Multiple (Middle): If each linear model is fit to T×i𝑇𝑖T\times iitalic_T × italic_i samples from the (i1)𝑖1(i-1)( italic_i - 1 )-th model (i.e. the same amount of data as Accumulate), then the test squared error grows logarithmically with the number of model-fitting iterations; see Appendix E for more details. Accumulate (Bottom): If each linear model is instead fit to the generate targets of all the preceding linear models i.e. data accumulate, then the test squared error has a finite upper bound, independent of the number of iterations. This suggests that data accumulation might be a robust solution for mitigating model collapse. This figure is similar to Figure 7 but displaying log test squared error and more model-fitting iterations for additional clarity.