Finding DoRI: Discovery of Retained Images in Diffusion Models

  • Jun. 24, 2026

  • 6 min 39 sec.

  • SprintML

ICML logo
Accepted at ICML 2026

Finding DoRI DoRI icon: Discovery of
Retained Images in
Diffusion Models

Antoni Kowalczuk1,*, Dominik Hintersdorf2,3,*, Lukas Struppek6,*,†
Kristian Kersting2,3,4,5, Adam Dziedzic1, Franziska Boenisch1

1 CISPA Helmholtz Center for Information Security    2 German Research Center for Artificial Intelligence (DFKI)
3 Technical University of Darmstadt    4 Hessian Center for AI (Hessian.AI)
5 Centre for Cognitive Science, Technical University of Darmstadt    6 Far.AI

* Equal contribution.    Work mainly done at DFKI/Technical University of Darmstadt.

Paper  |  Code  |  Poster  |  Slides  |  Recording  |  BibTeX

On This Page

Overview · Main Finding · Problem · Key Idea · Our Method · Locality Evidence · Results · Mitigation Tradeoffs · Take-Aways · Citation

Overview

We study whether memorized training images in text-to-image diffusion models are actually removed by pruning-based mitigation methods. The central finding is that these methods can suppress replication for the original prompt while leaving alternative text embeddings that still recover the same image.

DoRI searches directly in continuous text-embedding space for such adversarial triggers. The experiments suggest that memorization is not localized in a small set of weights or limited to one prompt or one activation pattern, and that robust mitigation should be evaluated against diverse replication triggers.

Overview diagram showing mitigation, adversarial embedding search, and recovered retained images.
The central failure mode: pruning changes the generation for the original prompt, but DoRI finds adversarial embeddings that retrieve the retained image again.

Main Finding

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their ability to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods.

Our analysis provides multiple indications that memorization is not inherently local in diffusion models: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

Problem: Pruning can hide replication without removing memorization

Verbatim data replication in text-to-image diffusion models creates privacy and intellectual-property risks, especially for open-weight models. Existing mitigation methods such as NeMo and WANDA identify weights associated with memorized prompts and prune them. This can stop the original caption from reproducing the training image, which makes the mitigation appear successful under standard prompt-based evaluation.

The paper asks whether that apparent success means the memorized image has been removed from the model. This distinction matters: deployment filters or single-prompt pruning can reduce one access path, but model release requires confidence that the retained image cannot be recovered through nearby or alternative inputs.

Key Idea: Search beyond natural-language prompts

DoRI treats the text conditioning vector itself as the search variable. Starting from the memorized prompt embedding, it optimizes an adversarial embedding with the diffusion training loss for a known memorized image. Noise and denoising timesteps are resampled during the optimization so that the final embedding is not tied to a single sampling trajectory.

If a pruned model still generates the target training image from this optimized embedding, then the pruning did not erase the memorized content. It only disrupted the original retrieval path.

WANDA mitigation examples where adversarial embeddings recover visually similar memorized images.
After WANDA mitigation, standard generations move away from the original image, while adversarial embeddings can steer generation back toward close replicas.

Our Method: Adversarial embeddings expose retained images

The method assumes a model developer has already identified a memorized image and its prompt. After applying NeMo or WANDA, DoRI optimizes a continuous text embedding against the target image under the pruned model. The optimized embedding is then used as the conditioning input for generation.

The analysis goes beyond showing that we can re-generate memorized images after the pruning methods. We also probe whether memorization is local in text-embedding space, internal activations, and selected weights. Across these views, the evidence points away from a local mechanism of memorization in diffusion models and toward a distributed retention pattern.

Locality Evidence: Retained images are not tied to one local trigger

We test locality from three angles.

In embedding space, optimized triggers for a single memorized image remain widely distributed. Randomly initialized or non-memorized prompt embeddings can be optimized into successful triggers, so retrieval is not confined to the original caption neighborhood. In general, adversarial triggers are widely scattered in text-embedding space.

For activations, triggers for the same retained image do not converge to one consistent activation pattern, weakening activation-localization assumptions. Successful triggers for the same image can produce divergent activations.

Finally, different pruning procedures identify inconsistent memorization weights for the same image, which makes small-subset removal brittle. Low agreement between selected weights undermines local pruning.

View What we check Main implication
Embedding space Randomly initialized or non-memorized prompt embeddings can be optimized into successful triggers. Retrieval is not confined to the original caption neighborhood.
Activations Triggers for the same retained image do not converge to one consistent activation pattern. Activation-localization assumptions are weak.
Weights Different pruning procedures identify inconsistent memorization weights for the same image. Small-subset removal is brittle.

T-SNE visualization showing scattered adversarial embeddings in text embedding space.
Adversarial triggers are widely scattered in text-embedding space.

Activation discrepancy plot for adversarial embeddings that recover the same memorized image.
Successful triggers for the same image can produce divergent activations.

Weight agreement plot showing low agreement between memorization-related weights.
Low agreement between selected weights undermines local pruning.

Key empirical findings

We show that NeMo and WANDA reduce replication under the original memorized prompts, but adversarial embeddings recover a high Memorization Rate (MR) after pruning. In the main table below, NeMo with DoRI reaches MR 0.99 and WANDA with DoRI reaches MR 0.72 for verbatim memorized samples. Additionally, we report SSCDOrig, which is a mean cosine similarity of feature embeddings from the SSCD model between the original image and the image generated by the model. Values above 0.7 indicate replication.

For Stable Diffusion v2.0, the same pattern appears: pruning lowers standard prompt replication, while DoRI adversarial embeddings recover high similarity to the memorized target.

Line chart showing memorization rate versus optimization steps for non-memorized images, NeMo, and WANDA.
Memorization rate rises quickly for adversarial embeddings targeting memorized images, while the non-memorized baseline remains near zero.

Setting ↓ SSCDOrig ↓ MR
Memorized prompts 0.90 ± 0.04 0.98
Non-memorized prompts 0.17 ± 0.05 0.00
Non-memorized prompts + DoRI 0.48 ± 0.06 0.00
NeMo 0.33 ± 0.18 0.20
NeMo + DoRI 0.91 ± 0.03 0.99
WANDA 0.20 ± 0.08 0.00
WANDA + DoRI 0.76 ± 0.05 0.72

Qualitative comparison of DoRI optimization steps for non-memorized and memorized examples.
DoRI optimization does not reproduce non-memorized examples in the same way, but it can recover memorized examples after NeMo or WANDA pruning as optimization steps increase.

Mitigation Tradeoffs: Robust removal needs more than pruning

Simply increasing WANDA sparsity can make DoRI fail, but this is due to large concept-quality degradation on related prompts. In contrast, adversarial fine-tuning updates the full model with diverse adversarial embeddings and surrogate images, reducing replication while preserving general image generation quality more effectively.

Qualitative examples after adversarial fine-tuning, with mitigated and adversarial-embedding generations.
After adversarial fine-tuning, adversarial embeddings no longer recover close replicas in the same way as after pruning.

Qualitative examples showing concept damage after WANDA 10 percent sparsity.
WANDA at 10 percent sparsity can suppress retrieval, but the paper shows visible damage on semantically related concept prompts.

Setting ↓ SSCDOrig ↓ MR Additional signal
ESD + DoRI 0.90 ± 0.04 0.98 Concept-removal transfer still fails under DoRI.
Concept Ablation + DoRI 0.91 ± 0.04 0.97 Also remains vulnerable to adversarial embeddings.
Our mitigation + DoRI 0.36 ± 0.14 0.02 Five-epoch adversarial fine-tuning result in the main paper.
SD v2: NeMo + DoRI 0.87 ± 0.02 1.00 Failure mode generalizes beyond Stable Diffusion v1.4.
SD v2: Our mitigation + DoRI 0.14 ± 0.06 0.06 Adversarial fine-tuning remains effective on SD v2.
WANDA 10% + DoRI 0.40 ± 0.13 0.01 AConcept drops to 0.33; FIDConcept is 80.70.

What Are the Main Take-Aways?

  • Single-prompt evaluation can overstate whether memorization has been removed.
  • Replication triggers can exist outside the original natural-language prompt.
  • Different successful triggers can produce divergent activations for the same retained image.
  • Pruning-based localization does not identify a set of memorization weights.
  • Mitigation should be tested against diverse adversarial retrieval paths, not only the original caption.

BibTeX

@inproceedings{kowalczuk2026findingdori,
  title = {Finding DoRI: Discovery of Retained Images in Diffusion Models},
  author = {Kowalczuk, Antoni and Hintersdorf, Dominik and Struppek, Lukas and Kersting, Kristian and Dziedzic, Adam and Boenisch, Franziska},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year = {2026},
  note = {Accepted at ICML 2026},
  url = {https://arxiv.org/abs/2507.16880}
}

Finding DoRI: Discovery of Retained Images in Diffusion Models. ICML 2026.
Paper  |  Code  |  Poster  |  Slides  |  Recording

Recent Blogs

Finding DoRI: Discovery of Retained Images in Diffusion Models


Read More

Jun. 24, 2026

6 min 39 sec.

Privacy Attacks on Tabular Foundation Models


Read More

May. 16, 2026

5 min 51 sec.

Benchmarking Empirical Privacy Protection for Adaptations of LLMs


Read More

Apr. 28, 2026

0 min 0 sec.