Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Accepted for oral presentation at ICLR 2026.
CISPA Helmholtz Center for Information Security
* Equal contribution

TL;DR: We benchmark empirical privacy risks in DP-adapted LLMs, revealing that distributional closeness to pretraining data and adaptation method critically impact practical privacy protection, even under the same formal guarantee, and that LoRA offers the strongest empirical protection most consistently. We also propose a structured four-stage framework for holistic privacy auditing of the full pretrain-adapt pipeline.


Abstract

Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference (RMIA) and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.



Summary of Contributions and Findings

We provide the first comprehensive benchmark of empirical privacy risks under DP adaptations of LLMs in the pretrain-adapt paradigm, spanning 6 datasets, 4 adaptation methods, and 7 pretrained models of varying sizes and architectures.

  1. Distribution drives leakage: Privacy risk increases as adaptation data becomes distributionally closer to pretraining data. Crucially, IID validation data leaks as much as directly overlapping pretraining data, confirming distributional closeness, not just overlap, as the root cause.
  2. Adaptation method matters: LoRA (DP-LoRA) consistently provides the strongest empirical protection under the DP regime while maintaining competitive utility.
  3. Prefix Tuning reduces pretraining leakage: Private Prefix Tuning can reduce leakage of pretraining data post-adaptation.
  4. Tight privacy regimes are necessary: Moderate budgets (e.g., $\varepsilon = 8$) still leave IID adaptation data exposed. Effective practical protection requires strict DP values.
  5. Holistic audit framework: We formalize a four-stage framework for privacy auditing: (1) pretraining, (2) adaptation, (3) joint pretraining and adaptation, (4) post-adaptation auditing of pretraining, and define the corresponding membership inference games for each stage.


Setup

Figure 1: Setup for privacy auditing of private LLM adaptations. Audits are performed via robust membership inference (RMIA) against the adapted model's outputs, and via data extraction attacks using canary data inserted into the adaptation set.


We evaluate a wide range of private adaptation strategies: Full Fine-Tuning, Last-Layer (Head) Fine-Tuning, parameter-efficient LoRA, and Prefix Tuning, all trained with DP-SGD. For a fair comparison, we ensure similar final validation losses across methods and datasets.
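To make the setup concrete, below is a minimal sketch of how one of these configurations, DP-LoRA on a Pythia base model, could be wired up with Hugging Face PEFT and Opacus. It is illustrative only: the paper's actual training stack, hyperparameters, data pipeline, and DP accounting may differ, and the target epsilon, LoRA rank, and clipping norm shown here are placeholder values.

```python
# Minimal, illustrative DP-LoRA setup (not the paper's exact training code).
# Assumes transformers, peft, and opacus are installed, and that the LoRA-wrapped
# model is compatible with Opacus' per-sample gradient hooks.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from opacus import PrivacyEngine


class TextDataset(Dataset):
    """Wraps tokenized texts so each item is a dict of tensors."""
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, i):
        return {k: v[i] for k, v in self.encodings.items()}


tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.0,
    target_modules=["query_key_value"],  # attention projection in Pythia (GPT-NeoX)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # only the LoRA parameters stay trainable

# Placeholder adaptation data; in the benchmark this would be e.g. a Pile subset or SAMSum.
texts = ["placeholder adaptation sample"] * 64
enc = tokenizer(texts, return_tensors="pt", padding="max_length",
                truncation=True, max_length=64)
train_loader = DataLoader(TextDataset(enc), batch_size=8)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# DP-SGD via Opacus: per-sample clipping plus noise calibrated to a target (eps, delta).
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    target_epsilon=8.0,   # placeholder budget; the benchmark also uses eps = 0.1
    target_delta=1e-5,
    epochs=1,
    max_grad_norm=1.0,    # per-sample gradient clipping norm
)

model.train()
for batch in train_loader:
    optimizer.zero_grad()
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
```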

Models and pretraining data. We primarily benchmark the Pythia family (70M–1.4B) and GPT-Neo (125M, 1.3B), both trained on the Pile dataset. We also include the fully open-source OLMo 1B and OLMo2 1B models. Pythia 1B serves as the default model.

Adaptation datasets. We distinguish three settings:

  • Overlap: We adapt directly on training data from the pretraining dataset.
  • IID: We adapt on validation data from the pretraining dataset; it follows the pretraining distribution but was never seen during pretraining.
  • OOD: We adapt on SAMSum (English dialogue) and GermanWiki, which lie outside the pretraining distribution.

For both the Overlap and IID settings, we use the following Pile subsets: Bookcorpus2, GitHub, and Enron Emails.

Attacks. We rely on the state-of-the-art membership inference attack, RMIA (Robust Membership Inference Attack), and complement it with canary data extraction to evaluate more severe forms of leakage.
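For intuition, the snippet below sketches the core RMIA likelihood-ratio test in a simplified form: the target sample's probability under the audited (adapted) model is calibrated by reference models and compared against population samples. Function and variable names are ours, and details of the full attack (Bayesian calibration of Pr(x), the choice of the threshold gamma, offline approximations) are omitted.

```python
# Simplified RMIA-style membership score (illustrative; names and calibration are ours).
import numpy as np


def rmia_score(target_ll, ref_lls, pop_target_lls, pop_ref_lls, gamma=2.0):
    """Fraction of population samples z that the target sample x "dominates".

    target_ll:      log-likelihood of x under the audited (adapted) model.
    ref_lls:        log-likelihoods of x under reference models -> estimate of Pr(x).
    pop_target_lls: log-likelihoods of population samples z under the audited model.
    pop_ref_lls:    [num_z, num_refs] log-likelihoods of z under the reference models.
    gamma:          threshold on the pairwise likelihood ratio.
    """
    ratio_x = np.exp(target_ll) / np.exp(ref_lls).mean()
    ratio_z = np.exp(pop_target_lls) / np.exp(pop_ref_lls).mean(axis=1)
    return float(np.mean(ratio_x / ratio_z > gamma))


# Toy usage with made-up per-sequence log-likelihoods (higher = more likely under the model).
score = rmia_score(
    target_ll=-2.1,
    ref_lls=np.array([-3.0, -2.9, -3.2]),
    pop_target_lls=np.array([-3.5, -2.8, -4.0, -3.1]),
    pop_ref_lls=np.array([[-3.4, -3.6], [-2.9, -2.7], [-3.9, -4.1], [-3.0, -3.2]]),
)
print(f"RMIA score for the target sample: {score:.2f}")
```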



How Does The Relationship Between Adaptation and Pretraining Data Drive Privacy Leakage?

Table 1: Membership Inference AUC (RMIA, shadow model) for Pythia 1B across datasets and privacy budgets. IID settings consistently yield higher leakage than OOD at the same $\varepsilon$. Overlapping (Train) and non-overlapping (Val) IID data show near-identical leakage, confirming that distributional closeness, even without direct overlap, drives the risk.

Adaptation       | Bookcorpus2 IID      | Bookcorpus2 Overlap  | SAMSum (OOD)         | GermanWiki (OOD)
                 | ε=∞   ε=8   ε=0.1    | ε=∞   ε=8   ε=0.1    | ε=∞   ε=8   ε=0.1    | ε=∞   ε=8   ε=0.1
Prefix Tuning    | 1.00  0.89  0.56     | 1.00  0.90  0.55     | 1.00  0.62  0.63     | 1.00  0.64  0.61
LoRA             | 1.00  0.70  0.52     | 1.00  0.69  0.53     | 0.86  0.69  0.50     | 1.00  0.59  0.66
Full Fine-Tune   | 1.00  0.75  0.77     | 1.00  0.75  0.76     | 1.00  0.82  0.62     | 1.00  0.71  0.55
Head Fine-Tune   | 1.00  0.72  0.73     | 1.00  0.72  0.72     | 1.00  0.98  0.62     | 1.00  0.76  0.70


Core finding: Privacy risk scales with distributional closeness to the pretraining data, regardless of whether there is direct overlap. IID validation data, which was never seen during pretraining, leaks as severely as data drawn directly from the pretraining corpus. This pinpoints distributional similarity, not membership overlap, as the primary risk factor.

Concretely, with RMIA (shadow) at $\varepsilon = 8$, average AUC scores on IID data range from 0.70–0.90, compared to 0.63–0.87 on OOD data. Under no-privacy ($\varepsilon = \infty$) all IID settings achieve AUC = 1.00, confirming that IID data is trivially exposed before any DP protection is even applied. The effect is most pronounced for PEFT methods like LoRA, where the gap between IID and OOD leakage is especially clear.
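The AUC values reported above summarize how well the attack scores separate adaptation-set members from non-members; an AUC of 0.5 corresponds to random guessing and 1.0 to perfect leakage. The minimal sketch below (with made-up scores, assuming scikit-learn) illustrates the standard computation; the benchmark's exact evaluation code may differ.

```python
# How the AUC numbers in Table 1 are obtained conceptually (made-up scores).
import numpy as np
from sklearn.metrics import roc_auc_score

member_scores = np.array([0.91, 0.78, 0.66, 0.84])     # attack scores for adaptation-set members
nonmember_scores = np.array([0.40, 0.55, 0.31, 0.62])  # attack scores for held-out non-members

labels = np.concatenate([np.ones_like(member_scores), np.zeros_like(nonmember_scores)])
scores = np.concatenate([member_scores, nonmember_scores])
print(f"Membership inference AUC: {roc_auc_score(labels, scores):.2f}")  # 0.5 = random, 1.0 = full leakage
```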

This finding empirically confirms the theoretical concern of Tramèr et al. (2022): privately adapting an LLM that was pretrained on similar data does not provide the privacy protection that formal DP guarantees might suggest.



Which Adaptation Method Offers the Best Privacy?

Membership inference. Across privacy regimes and datasets, LoRA consistently achieves the lowest AUC. At $\varepsilon = 0.1$ on IID data, LoRA reaches AUC = 0.52, near random guessing, while Full Fine-Tune and Head Fine-Tune remain above 0.70. This gap persists across datasets and attack configurations, making LoRA the de facto recommended method when empirical privacy is a concern.

The relative ordering of methods depends on the data regime. For OOD data at moderate privacy ($\varepsilon = 8$), Head Fine-Tune becomes the most vulnerable (AUC up to 0.98 on SAMSum), while LoRA stays closest to 0.5. Full Fine-Tune occupies the middle ground.

Data extraction. Against canary extraction attacks, Prefix Tuning is the most vulnerable adaptation method. LoRA and Head Fine-Tune both exhibit strong resistance against extraction regardless of canary type, privacy budget, or data distribution. At $\varepsilon = 0.1$, all methods show exposure close to the random-guessing baseline (≈ 1.44), confirming that tight DP constraints do suppress extraction risk. The adversarial prefix is the dominant source of leakage; the interaction between prefix and the canary sample itself plays a secondary role.
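The extraction results are quantified with a canary exposure metric in the style of Carlini et al. (2019): the exposure measures how many bits more likely the model finds the inserted canary than random candidates from the same template, and the ≈ 1.44 random-guessing baseline cited above matches the expected exposure of a uniformly ranked canary (log2 e ≈ 1.44). A minimal sketch, assuming per-sequence losses have already been computed:

```python
# Sketch of the canary exposure metric; names and exact normalization are ours.
import numpy as np


def canary_exposure(canary_loss: float, candidate_losses: np.ndarray) -> float:
    """Exposure of an inserted canary under the adapted model.

    canary_loss:      the model's loss (negative log-likelihood) on the inserted canary.
    candidate_losses: losses on |R| candidate sequences drawn from the same template.
    Returns log2(|R| + 1) - log2(rank of the canary); values near log2(e) = 1.44
    indicate no memorization, large values indicate extractable memorization.
    """
    all_losses = np.append(candidate_losses, canary_loss)
    rank = 1 + int(np.sum(all_losses < canary_loss))  # 1-based rank; lower loss = more likely
    return float(np.log2(len(all_losses)) - np.log2(rank))


# Toy usage: a canary the model finds far more likely than random candidates.
rng = np.random.default_rng(0)
print(canary_exposure(canary_loss=1.2, candidate_losses=rng.normal(3.0, 0.5, size=2**14)))
```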

Figure 2: Results for the data extraction attack across datasets, adaptation methods, and privacy budgets.


A Holistic Framework for Pretrain-Adapt Privacy Auditing

Figure 3: Our four-stage framework for privacy auditing of private LLM adaptations: (1) auditing pretraining, (2) auditing adaptation, (3) joint auditing, and (4) post-adaptation auditing of pretraining.


Examining pretraining and adaptation privacy in isolation yields a dangerously incomplete picture. The strong interdependence between these stages demands a unified view. We formalize the pretrain-adapt adversarial game and identify four distinct auditing stages, each with its own threat model and membership inference hypothesis:

  1. Auditing pretraining (Stage 1): Standard ML auditing applied to the pretrained model $\theta$. The attacker guesses whether the target sample $x$ was included in the pretraining set $S$.
  2. Auditing adaptation (Stage 2): The attacker observes only the adapted model $\theta'$ and guesses whether $x$ was part of the adaptation set $D$. This is the primary focus of our benchmark.
  3. Joint auditing (Stage 3): The attacker has access to both $\theta$ and $\theta'$ and can leverage knowledge of the pretraining set to attack the adaptation data more effectively. Three sub-cases arise depending on the attacker's prior knowledge about $x$'s membership in $S$.
  4. Post-adaptation auditing of pretraining (Stage 4): Private adaptations introduce noise that may affect the privacy risks of the pretraining data. Here the attacker knows $x \notin D$ and uses the adapted model to infer membership in the pretraining set.

Formalizing these stages enables a systematic framework for privacy measurement across the full pipeline, supports structured reasoning about what privacy risks each method introduces, and motivates future work on joint auditing tools that match the complexity of modern LLM deployments.
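To illustrate the style of these games, the following is a sketch of the Stage 2 (adaptation auditing) game in the notation above; the paper's formal definitions may differ in details such as how the challenge point and neighboring datasets are sampled.

```latex
% Sketch of the Stage 2 (adaptation auditing) membership inference game.
\begin{enumerate}
  \item The challenger pretrains $\theta$ on the pretraining set $S$.
  \item The challenger samples an adaptation set $D$, a challenge point $x$, and a bit $b \in \{0, 1\}$,
        then runs the private adaptation procedure to obtain
        $\theta' \gets \mathsf{Adapt}(\theta, D \cup \{x\})$ if $b = 1$ and
        $\theta' \gets \mathsf{Adapt}(\theta, D)$ if $b = 0$.
  \item The adversary receives $(x, \theta')$ (in Stage 3, also $\theta$) and outputs a guess $\hat{b}$.
  \item The adversary's advantage is $2\,\Pr[\hat{b} = b] - 1$; DP on the adaptation step upper-bounds it.
\end{enumerate}
```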



Practical Guidelines for Deploying Private LLMs

Our benchmark surfaces actionable guidelines for practitioners deploying DP-adapted LLMs in sensitive settings:

  1. Avoid adaptation data that is distributionally close to the pretraining data. Even validation splits of pretraining-domain datasets carry near-identical empirical risk to directly reusing training data. If the private adaptation task naturally falls in-distribution (e.g., adapting a model trained on web text to clinical notes that resemble web text), expect elevated leakage and compensate with a tighter $\varepsilon$ (a simple heuristic check is sketched after this list).
  2. Prefer LoRA for private adaptation. LoRA most consistently achieves the lowest membership inference AUC at comparable utility, or provides similar privacy protection with notably better utility, making it the method of choice when empirical privacy is a priority.
  3. Use high-privacy regimes ($\varepsilon < 0.1$). Moderate budgets (e.g., $\varepsilon = 8$) fail to suppress leakage even for OOD data under strong attacks. Effective practical protection requires operating at low $\varepsilon$, despite the associated utility cost.
  4. Audit before deployment. Use RMIA with a shadow model instantiated from the same pretrained base to estimate empirical leakage. Attackers with access to the public pretrained model, an increasingly common scenario, enjoy a substantial advantage; pre-deployment auditing surfaces this risk before it can be exploited.
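As a concrete, hypothetical way to act on guideline 1 (this heuristic is not from the paper), one can compare the pretrained base model's perplexity on candidate adaptation data against its perplexity on clearly out-of-distribution reference text: markedly lower perplexity on the adaptation data suggests distributional closeness to pretraining and, per the benchmark's findings, elevated empirical risk.

```python
# Hypothetical closeness check (not from the paper): compare the pretrained base model's
# perplexity on candidate adaptation data vs. clearly out-of-distribution reference text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mean_perplexity(model, tokenizer, texts, max_length=512):
    """Mean per-sequence perplexity of `texts` under `model` (rough closeness proxy)."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
            losses.append(model(**enc, labels=enc["input_ids"]).loss.item())
    return math.exp(sum(losses) / len(losses))


tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")

# Placeholder inputs; in practice these would be real adaptation samples and OOD reference text.
ppl_adapt = mean_perplexity(base_model, tokenizer, ["example adaptation sample"])
ppl_ood = mean_perplexity(base_model, tokenizer, ["völlig anderer, fremdsprachiger Referenztext"])
print(f"adaptation ppl = {ppl_adapt:.1f}, OOD reference ppl = {ppl_ood:.1f}")
# Much lower perplexity on the adaptation data suggests distributional closeness to
# pretraining and hence a higher empirical privacy risk at the same formal guarantee.
```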


Conclusions

We benchmark the practical privacy risks that arise under DP adaptations of LLMs within the pretrain-adapt paradigm. Our comprehensive empirical analysis confirms the theoretical concern that pretraining significantly amplifies the privacy risks associated with the adaptation data.

  1. Distribution over overlap: Distributional closeness between adaptation and pretraining data is the primary driver of empirical leakage, even without any data overlap between the two sets.
  2. Method choice matters: LoRA offers the strongest empirical protection under the DP regime. Prefix Tuning can additionally reduce leakage of pretraining data post-adaptation.
  3. Tight DP is non-negotiable: Moderate privacy budgets leave sensitive adaptation data exposed; effective protection requires strict $\varepsilon$ values.
  4. Holistic auditing is essential: Pretraining and adaptation privacy cannot be assessed in isolation. Our four-stage framework and formal membership inference games lay a foundation for comprehensive privacy assessments of future LLM pipelines in the pretrain-adapt paradigm.


BibTeX

@inproceedings{
marek2026benchmarking,
title={Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models},
author={Bart{\l}omiej Marek and Lorenzo Rossi and Vincent Hanke and Xun Wang and Michael Backes and Franziska Boenisch and Adam Dziedzic},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=jY7fAo9rfK}
}