Ambient Dataloops
Generative Models for Dataset Refinement
Abstract
We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process: at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation we treat the synthetically improved samples as noisy, but at a slightly lower noise level than in the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.
Dataset-Model co-evolution
Our method is as follows. First, we annotate each datapoint according to its quality, assigning it a minimum diffusion time at which the low- and high-quality data distributions approximately merge. Second, we train a diffusion model using Ambient Diffusion techniques (Daras et al., 2025c, 2023), which can learn from data of heterogeneous quality. Third, we *restore* the original dataset using the trained model, by performing posterior sampling on each datapoint starting from the minimum diffusion time assigned in the first step, which yields a higher-quality dataset. We call this process an Ambient Dataloop. Successive iterations of this process lead to a co-evolution of both the model and the dataset.
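The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_ambient` and `posterior_sample` are hypothetical placeholders for the actual Ambient Diffusion training and posterior-sampling routines, and the `shrink` factor for lowering the noise level between loops is an assumed hyperparameter.

```python
import numpy as np

def ambient_dataloop(dataset, sigma_min, train_ambient, posterior_sample,
                     n_loops=1, shrink=0.5):
    """Sketch of the Ambient Dataloop procedure (illustrative only).

    dataset:          array of datapoints, shape (N, ...)
    sigma_min:        per-sample minimum diffusion time, shape (N,)
    train_ambient:    callable that trains a diffusion model on data
                      annotated with per-sample noise levels
    posterior_sample: callable (model, x, sigma) -> restored datapoint,
                      via posterior sampling from diffusion time sigma
    """
    model = None
    for _ in range(n_loops):
        # Steps 1-2: train on the dataset annotated with per-sample
        # minimum diffusion times, using Ambient Diffusion losses.
        model = train_ambient(dataset, sigma_min)
        # Step 3: restore every datapoint by posterior sampling from
        # its assigned diffusion time.
        dataset = np.stack([posterior_sample(model, x, s)
                            for x, s in zip(dataset, sigma_min)])
        # The improved samples are still treated as noisy, but at a
        # slightly lower noise level than in the previous iteration.
        sigma_min = shrink * sigma_min
    return dataset, model
```

Each pass both retrains the model and rewrites the dataset, so successive loops co-evolve the two, as described above.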
Imperfect datasets, imperfect optimization
The idea of dataset refinement, despite being natural, presents a paradox. First, it seems to violate the data processing inequality: information cannot be created out of thin air, so no processing of the original data can contain more information about the underlying distribution than the original dataset itself. While this is true, it is important to consider that the first round of training may be suboptimal due to failures of the optimization process. Dataset refinement can therefore be thought of as a reorganization of the original information in a way that facilitates learning and creates a better optimization landscape. Indeed, our empirical results show that training on imperfect datasets leads to imperfect optimization, even with Ambient Diffusion training. Removing the imperfections, even without external verifiers, results in better training, better optimization, and better results.
Second, several recent works have shown that naive training on synthetic data leads to catastrophic self-consuming loops and mode collapse (Alemohammad et al., 2024a; Shumailov et al., 2024; Hataya et al., 2023; Martinez et al., 2023; Padmakumar & He, 2024; Seddik et al., 2024; Dohmatob et al., 2024). Our approach avoids this scenario by using Ambient Diffusion to account for model errors at each iteration, and by running only a small number of Ambient Dataloops.
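As a toy illustration of this safeguard (the geometric `shrink` factor below is an assumed hyperparameter, not a value from the paper): the noise level the dataset is annotated with decays across loops but never reaches zero, so synthetic samples are never treated as clean ground truth.

```python
def dataloop_noise_schedule(sigma0, shrink, n_loops):
    """Noise level assigned to the dataset at each Ambient Dataloop.

    The level decreases geometrically but stays strictly positive, so
    the model never fully trusts its own synthetic samples -- the
    mechanism that guards against self-consuming collapse.
    """
    return [sigma0 * shrink ** t for t in range(n_loops + 1)]
```

Running only a small number of loops keeps the accumulated model error bounded alongside this decaying annotation.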
Text-to-image models with synthetic data
To show the effectiveness of our method, we experiment with text-to-image generative modeling, following the architectural and dataset choices of MicroDiffusion (Sehwag et al., 2025). We refine the DiffusionDB dataset, a collection of synthetic images within the MicroDiffusion training set that is of lower quality than the rest, and obtain stronger results than both the baseline trained on the unrefined dataset and Ambient Diffusion.
| Method | COCO FID-30K \(\downarrow\) | COCO Clip-FD-30K \(\downarrow\) | GenEval Overall | Single | Two | Counting | Colors | Position | Color attribution |
|---|---|---|---|---|---|---|---|---|---|
| Micro-diffusion | 12.37 | 10.07 | 0.44 | 0.97 | 0.33 | 0.35 | 0.82 | 0.06 | 0.14 |
| Ambient-o (L0) | 10.61 | 9.40 | 0.47 | 0.97 | 0.40 | 0.36 | 0.82 | 0.11 | 0.14 |
| Ambient Loops (L1) | 10.06 | 8.83 | 0.47 | 0.97 | 0.38 | 0.35 | 0.78 | 0.11 | 0.19 |
De novo protein design
To further demonstrate the scope of our method, we switch modality and target structural protein design. The problem is well suited to our Ambient Dataloops framework, as techniques for determining protein structures at atomic resolution (such as X-ray crystallography) are inherently noisy. We use the same dataset, architecture, and training procedures as in (Daras et al., 2025b), but use Ambient Dataloops to iteratively refine the protein structures in the dataset. Just one loop of our procedure is enough to achieve a new Pareto point, as shown below. In particular, we trade a 0.2% decrease in designability for a 14.3% increase in diversity, significantly expanding the creativity boundaries of the baseline Ambient model for the same inference parameters.
| Model | Designability (%\(\uparrow\)) | Diversity (\(\uparrow\)) |
|---|---|---|
| Ambient Proteins (L0, \(\gamma = 0.35\)) | 99.2 | 0.615 |
| Ambient Loops (L1, \(\gamma=0.35\)) | 99.0 | 0.703 |
| Proteina (FS \(\gamma=0.35\)) | 98.2 | 0.49 |
| Genie2 | 95.2 | 0.59 |
| FoldFlow (base) | 96.6 | 0.20 |
| FoldFlow (stoc.) | 97.0 | 0.25 |
| FoldFlow (OT) | 97.2 | 0.37 |
| FrameFlow | 88.6 | 0.53 |
| RFDiffusion | 94.4 | 0.46 |
| Proteus | 94.2 | 0.22 |
Citation
@article{rodriguez2025ambient,
  title  = {Ambient Dataloops: Generative Models for Dataset Refinement},
  author = {Rodriguez-Munoz, A. and Daspit, W. and Klivans, A. and Torralba, A. and Daskalakis, C. and Daras, G.},
  year   = {2025},
}
Acknowledgments
Experiments were conducted on the UT Austin Vista cluster. Adrian Rodriguez-Munoz is supported by a DSTA Singapore grant.