Ambient Dataloops

Generative Models for Dataset Refinement

Adrian Rodriguez-Munoz, William Daspit, Adam Klivans, Antonio Torralba, Constantinos Daskalakis, Giannis Daras

Abstract


We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process; at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation we treat the synthetically improved samples as noisy, but at a slightly lower noise level than in the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.

Dataset-Model co-evolution

Method

Our method is as follows. First, we annotate each of our datapoints according to its quality, by assigning it a minimum diffusion time at which the low- and high-quality data distributions approximately merge. Second, we train a diffusion model using Ambient Diffusion techniques (Daras et al., 2025c, 2023), which can learn from data of heterogeneous qualities. Third, we *restore* the original dataset using the trained model, by performing posterior sampling on each datapoint starting from the minimum diffusion time we assigned at the beginning, resulting in a better, higher-quality dataset. We call this process an Ambient Dataloop. Successive iterations of this process lead to a co-evolution of both the model and the dataset.
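The three steps above can be sketched numerically. This is a toy illustration, not the paper's implementation: the "model" is just the dataset mean, `assign_min_times` and `toy_posterior_sample` are hypothetical stand-ins for the quality annotation and posterior sampling steps, and the geometric decay of the assigned noise levels is an assumed schedule.

```python
import numpy as np

def assign_min_times(quality, t_max=1.0):
    # Assumption: lower quality maps to a larger minimum diffusion time,
    # i.e. the point is treated as more heavily corrupted.
    return t_max * (1.0 - quality)

def toy_posterior_sample(x_t, t, data_mean):
    # Hypothetical stand-in for posterior sampling with a trained
    # diffusion model: shrink the noisy sample toward the data mean.
    return (1.0 - t) * x_t + t * data_mean

def ambient_dataloop(data, quality, n_loops=2, shrink=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    t = assign_min_times(quality)           # step 1: quality annotation
    for _ in range(n_loops):
        # step 2: "train" on the current dataset (here: estimate the mean).
        data_mean = data.mean(axis=0)
        # step 3: restore each point by noising it to its assigned time
        # and sampling from the (toy) posterior.
        noised = data + rng.normal(size=data.shape) * t[:, None]
        data = toy_posterior_sample(noised, t[:, None], data_mean)
        # Treat restored samples as noisy at a slightly lower level,
        # which is what guards against self-consuming collapse.
        t = t * shrink
    return data, t

data = np.zeros((4, 2))
quality = np.array([0.2, 0.5, 0.8, 1.0])
refined, t_final = ambient_dataloop(data, quality, n_loops=2)
```

The key structural point the sketch preserves is that the assigned noise level shrinks geometrically across loops, so later iterations trust the restored data more.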

Imperfect datasets, imperfect optimization

Insight

The idea of dataset refinement, despite being natural, presents a paradox. First, it seems to violate the data processing inequality; information cannot be created out of thin air, and hence any processing of the original data cannot carry more information about the underlying distribution than the original dataset. While this is true, it is important to consider that the first round of training might be suboptimal due to failures of the optimization process. Hence, dataset refinement can be thought of as a reorganization of the original information in a way that facilitates learning and creates a better optimization landscape. Indeed, our empirical results show that training on imperfect datasets leads to imperfect optimization, even with Ambient Diffusion training. Removing the imperfections, even without external verifiers, results in better training, better optimization, and better results.

Second, several recent works have shown that naive training on synthetic data leads to catastrophic self-consuming loops and mode collapse (Alemohammad et al., 2024a; Shumailov et al., 2024; Hataya et al., 2023; Martinez et al., 2023; Padmakumar & He, 2024; Seddik et al., 2024; Dohmatob et al., 2024). Our approach avoids this scenario by accounting for learning errors at each iteration using Ambient Diffusion, and by running only a small number of Ambient Dataloops.

Figure: FID vs. noise-level ablation. The horizontal axis is the noise level to which we denoise a corrupted CIFAR-10 dataset after \( k \) loops, where \( k \) differs for each line. Annealing too fast or too slow is suboptimal: past a certain point, reducing the noise level further only hurts (the "madness" regime), because the current model has reached its denoising capacity. Decreasing the noise level at the right speed achieves maximum performance with \( k > 1 \) loops. FID is always computed with respect to the original clean CIFAR-10.
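The annealing speed in the ablation can be sketched as a geometric decay of the dataset's noise level across loops. The schedule form and the rates below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def noise_schedule(sigma0, rate, n_loops):
    # Hypothetical geometric annealing: sigma_k = sigma0 * rate**k.
    # A rate near 1 anneals slowly; a rate near 0 anneals aggressively,
    # risking the "madness" regime where the model's denoising capacity
    # is exceeded.
    return sigma0 * rate ** np.arange(n_loops + 1)

slow = noise_schedule(0.4, 0.9, 4)   # conservative annealing
fast = noise_schedule(0.4, 0.1, 4)   # aggressive annealing
```

Both schedules start at the same corruption level; they differ only in how quickly the per-loop noise level approaches zero.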

Text-to-image models with synthetic data

Results

To show the effectiveness of our method, we experiment with text-to-image generative modeling, following the architectural and dataset choices of MicroDiffusion (Sehwag et al., 2025). We refine the DiffusionDB dataset, a collection of synthetic images in the MicroDiffusion training set that is of lower quality than the rest, and obtain stronger results than both the MicroDiffusion baseline that includes this dataset and Ambient Diffusion.

Figure: Refinement of DiffusionDB synthetic data. \(D_0\) shows synthetically generated images from DiffusionDB (Wang et al., 2022), a dataset used for text-to-image generative modeling. These images have artifacts due to learning errors of the underlying model. We train a model \(M_1\) on this dataset and use it to improve its own training set, leading to a "restored" dataset \(D_1\). Successive iterations of this process lead to a co-evolution of both the dataset and the model: see dataset \(D_2\) and model \(M_2\) respectively.
Table 1: Quantitative benefits of Ambient Loops on COCO zero-shot generation (FID-30K, CLIP-FID-30K) and the GenEval benchmark (remaining columns).

| Method | FID-30K \(\downarrow\) | CLIP-FID-30K \(\downarrow\) | Overall | Single obj. | Two obj. | Counting | Colors | Position | Color attr. |
|---|---|---|---|---|---|---|---|---|---|
| Micro-diffusion | 12.37 | 10.07 | 0.44 | 0.97 | 0.33 | 0.35 | 0.82 | 0.06 | 0.14 |
| Ambient-o (L0) | 10.61 | 9.40 | 0.47 | 0.97 | 0.40 | 0.36 | 0.82 | 0.11 | 0.14 |
| Ambient Loops (L1) | 10.06 | 8.83 | 0.47 | 0.97 | 0.38 | 0.35 | 0.78 | 0.11 | 0.19 |

De novo protein design

Results

To further show the scope of our method, we switch modality and target structural protein design. The problem is well-suited for our Ambient Dataloops framework, as techniques for determining the atomistic resolution of molecular protein structures (such as X-ray crystallography) are inherently noisy. We use the same dataset, architecture, and training procedure as in Daras et al. (2025b), but use Ambient Dataloops to iteratively refine the protein structures in the dataset. Just one loop of our procedure is enough to achieve a new Pareto point, as shown below. In particular, we trade a 0.2% decrease in designability for a 14.3% increase in diversity, significantly expanding the creativity boundaries of the baseline Ambient model for the same inference parameters.

Figure: Example of our dataset refinement procedure. An initial low-pLDDT protein, shown in green, is noised to a certain level, giving the shape in cyan. We initialize the reverse process at the cyan sample and draw the red point from the posterior.
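The noise-then-restore step described in the caption can be sketched as follows, assuming a variance-exploding diffusion with noise level \(\sigma(t) = t\) and a probability-flow ODE for the reverse process; `score_fn` stands in for a hypothetical trained model, and the closed-form Gaussian score below is only there so the sketch runs end to end:

```python
import numpy as np

def restore(x0, t, score_fn, n_steps=50, rng=None):
    # Noise the sample to level t (the cyan point in the figure), then
    # integrate the probability-flow ODE from t back to 0 (red point).
    if rng is None:
        rng = np.random.default_rng(0)
    x = x0 + t * rng.normal(size=np.shape(x0))
    sigmas = np.linspace(t, 0.0, n_steps + 1)
    for hi, lo in zip(sigmas[:-1], sigmas[1:]):
        # ODE step as sigma^2 decreases from hi^2 to lo^2.
        x = x + 0.5 * (hi**2 - lo**2) * score_fn(x, hi)
    return x

# Example: for a standard-normal "data" prior the score of the noised
# distribution is known in closed form, -x / (1 + sigma^2).
gaussian_score = lambda x, sigma: -x / (1.0 + sigma**2)
restored = restore(np.array([1.0, -1.0]), 0.5, gaussian_score)
```

In the actual method the starting level \(t\) is the minimum diffusion time assigned to the datapoint, so lower-quality structures are restored from deeper in the noise process.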
Table 2: Designability and diversity for protein structure generation.
| Model | Designability (%, \(\uparrow\)) | Diversity (\(\uparrow\)) |
|---|---|---|
| Ambient Proteins (L0, \(\gamma = 0.35\)) | 99.2 | 0.615 |
| Ambient Loops (L1, \(\gamma = 0.35\)) | 99.0 | 0.703 |
| Proteina (FS, \(\gamma = 0.35\)) | 98.2 | 0.49 |
| Genie2 | 95.2 | 0.59 |
| FoldFlow (base) | 96.6 | 0.20 |
| FoldFlow (stoc.) | 97.0 | 0.25 |
| FoldFlow (OT) | 97.2 | 0.37 |
| FrameFlow | 88.6 | 0.53 |
| RFDiffusion | 94.4 | 0.46 |
| Proteus | 94.2 | 0.22 |

Citation

@article{rodriguez2025ambient,
  title = {Ambient Dataloops: Generative Models for Dataset Refinement},
  author = {Rodriguez-Munoz, A. and Daspit, W. and Klivans, A. and Torralba, A. and Daskalakis, C. and Daras, G.},
  year = {2025},
}

Acknowledgments

Experiments were conducted on the UT Austin Vista cluster. Adrian Rodriguez-Munoz is supported by a DSTA Singapore grant.