Mathematical Privacy Guarantees: Differential Privacy and Synthetic Data
Part 6 showed Meridian adopting differential privacy for query analytics. This article explains why: what epsilon means, how noise is calibrated, what Apple and the Census Bureau chose, and when to use synthetic data instead.
This is Part 8 of a three-part advanced series on Privacy-Enhancing Technologies. Part 7 covered operational techniques (masking, tokenization, k-anonymity). Part 8 covers mathematical privacy guarantees. Part 9 covers privacy-preserving computation and the explainability-privacy tension.
This part introduces mathematical concepts (epsilon, noise calibration) that may be new to practitioners without a statistics background. The key decisions (which epsilon to choose and which mechanism to use) are explained through production case studies at Apple, Google, and the US Census Bureau. Formal definitions are provided for completeness but are not required to follow the decision framework at the end.
Why Heuristic Privacy Is Not Enough
Part 7 covered operational techniques: masking, tokenization, and the k-anonymity family. These work. They are deployed at massive scale. But they share a limitation: they rely on assumptions about what an attacker knows.
K-anonymity assumes the attacker does not have auxiliary data that can narrow re-identification. The Netflix Prize dataset proved that assumption wrong: researchers showed that knowing just two movie ratings and their dates, each within a three-day window, was enough to re-identify 68% of “anonymous” subscribers. The AOL search log release proved it wrong again. The Massachusetts hospital data had proved it wrong first.
Differential privacy changes the model. Instead of assuming what the attacker knows, it provides a guarantee that holds regardless of auxiliary information. The guarantee is a number: epsilon.
Differential Privacy: The Core Idea
A computation on a dataset is differentially private if its output is nearly the same whether or not any single individual’s data is included. An observer looking at the output cannot tell if you were in the dataset.
A concrete example: A hospital publishes the count of diabetes patients in ZIP code 60601. Without privacy protection, the count is 847. With differential privacy at epsilon = 1, the published count might be 845 or 849: close enough to be useful, noisy enough that adding or removing any single patient barely changes the number. An attacker cannot tell whether you were included. With epsilon = 10, the noise shrinks so much that your presence becomes nearly visible. With epsilon = 0.1, the noise is so large (the count might read 820 or 870) that individual presence is completely hidden, but the number is less useful.
The formal definition bounds the probability ratio: for any two datasets differing by one record, the probability of any output changes by at most a factor of e^epsilon. Epsilon is the privacy budget: lower epsilon means more noise, more privacy, less accuracy. Higher epsilon means less noise, less privacy, more accuracy.
The odds ratio interpretation: If epsilon = 1, seeing the output changes your belief about whether a specific person was in the dataset by at most a factor of 2.72x. If epsilon = 0.1, the belief changes by 1.1x (almost no information). If epsilon = 10, the belief changes by 22,026x (very little privacy).
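The multipliers quoted above are just e^epsilon. A quick check of the arithmetic:

```python
import math

# The DP guarantee bounds how much one output can shift an observer's
# belief about an individual's presence: a factor of at most e^epsilon.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: belief shifts by at most {math.exp(epsilon):,.2f}x")
```

Running this reproduces the factors in the text: roughly 1.11x, 2.72x, and 22,026x.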
| Epsilon Range | Privacy Level | Who Uses It |
|---|---|---|
| 0.01 - 0.1 | Very strong | Academic research |
| 0.1 - 1.0 | Strong | Google Gboard (epsilon = 0.994 for best model) |
| 1.0 - 8.0 | Moderate | Apple (epsilon 2-8 per use case), Microsoft (epsilon 4) |
| 5.0 - 20.0 | Weak to very weak | US Census Bureau (epsilon = 19.61 total) |
Privacy loss compounds. Two epsilon=1 queries on the same dataset cost epsilon=2 total (sequential composition). This is why epsilon is called a “budget”: every analysis spends some of it, and when it is exhausted, no further queries can be answered without degrading the guarantee.
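Sequential composition can be tracked with a simple accountant. Here is a minimal sketch; the class name and API are illustrative, and real deployments use tighter accounting methods (e.g. Renyi DP) than straight summation:

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under sequential composition:
    running k queries at epsilon_1..epsilon_k costs their sum."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> float:
        """Spend `epsilon` from the budget; return what remains."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted: refuse the query")
        self.spent += epsilon
        return self.total - self.spent

budget = PrivacyBudget(total_epsilon=2.0)
budget.charge(1.0)  # first epsilon=1 query, 1.0 remaining
budget.charge(1.0)  # second epsilon=1 query: budget now fully spent
```

Once the budget is spent, the accountant refuses further queries rather than silently degrading the guarantee.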
How Noise Is Added
The core technique is adding calibrated random noise to query results. The Laplace mechanism adds noise from the Laplace distribution, providing pure (epsilon, 0)-differential privacy with zero failure probability. For a count query (sensitivity = 1) with epsilon = 1, the noisy answer might be off by 1-2. With epsilon = 0.1, it might be off by 10.
The Gaussian mechanism adds noise from the Gaussian distribution, providing approximate (epsilon, delta)-differential privacy where delta is a small failure probability. Gaussian noise is more efficient in high-dimensional settings (like ML gradient updates across thousands of parameters), which is why DP-SGD (Differentially Private Stochastic Gradient Descent), the standard method for training ML models with DP guarantees, uses it.
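Both calibrations can be sketched in plain Python. The Laplace scale and the classic Gaussian sigma formula follow Dwork & Roth; treat this as an illustration, not a vetted DP library (production code should use a reviewed implementation):

```python
import math
import random

random.seed(7)

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Laplace mechanism: noise scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon)

def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Classic Gaussian-mechanism calibration (valid for epsilon < 1):
    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

print(laplace_count(847, epsilon=1.0))   # typically within 1-2 of 847
print(laplace_count(847, epsilon=0.1))   # typically within ~10 of 847
print(gaussian_sigma(epsilon=0.9, delta=1e-5))
```

Note how the scale is inversely proportional to epsilon: halving epsilon doubles the expected noise, which is the accuracy cost of stronger privacy.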
For practitioners: When evaluating a vendor’s differential privacy claims, ask two questions: what mechanism do they use (Laplace, Gaussian, or something else), and what is the epsilon? If they say “differential privacy” but cannot answer these, the claim is marketing, not engineering.
Local vs Central: Two Trust Models
Central DP collects raw data at a trusted server and adds noise to query results. Better accuracy: noise is added once to the aggregate. Used by the US Census Bureau.
Local DP adds noise on each user’s device before data leaves. No trust in the server required. Worse accuracy: each person’s data is independently noised. Used by Apple and Google RAPPOR. Apple compensates for the accuracy loss by having hundreds of millions of users, making the aggregate signal recoverable despite per-user noise.
The idea behind local DP is not new. In 1965, Stanley Warner proposed randomized response for sensitive survey questions: flip a coin privately; if heads, answer truthfully; if tails, answer “Yes” regardless. No one can tell whether a “Yes” is truthful or forced by the coin. But the noise can be removed mathematically at the population level.
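The scheme described above takes a few lines to simulate, and the debiasing step shows how the noise comes out at the population level (the respondent count and true rate are toy numbers assumed for illustration):

```python
import random

random.seed(42)

def randomized_response(truthful_yes: bool) -> bool:
    """Coin flip: tails forces a 'Yes'; heads answers truthfully."""
    if random.random() < 0.5:   # tails
        return True
    return truthful_yes          # heads

# Simulate 100,000 respondents where the true "Yes" rate is 30%.
n = 100_000
answers = [randomized_response(random.random() < 0.30) for _ in range(n)]

# P(observed Yes) = 0.5 + 0.5 * true_rate, so true_rate = 2 * P(Yes) - 1.
observed_yes = sum(answers) / n
estimated_rate = 2 * observed_yes - 1
print(round(estimated_rate, 3))  # close to 0.30 despite per-answer randomness
```

No individual answer can be taken at face value, yet the aggregate estimate lands close to the true rate. This is exactly the trade Apple and Google make at much larger scale.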
Two Case Studies That Define the Field
Apple: Privacy at Scale
Apple deployed local differential privacy starting in 2016 to learn aggregate usage patterns (which emojis people type, which new words appear, which websites drain battery) without learning anything about individuals. Each event is hashed, encoded into a bit vector, noised on-device, and transmitted without device identifiers. The server aggregates millions of noisy records to recover population-level statistics.
The controversy: a team led by researchers at USC analyzed Apple’s macOS and iOS implementations and found that while each individual data contribution used epsilon 1 or 2, the cumulative daily privacy loss reached 6 on macOS and 14 on iOS 10, with a total potentially reaching 16 across all applications. Frank McSherry, a co-creator of differential privacy, called epsilon 14 “relatively pointless.” Apple responded that summing epsilon across uncorrelated use cases (emoji usage and health data types) overstates the risk.
For practitioners: This controversy illustrates that epsilon is meaningless without context. A company can truthfully claim epsilon = 2 per use case while the total daily privacy loss is 16. When evaluating DP claims, ask: what is the unit of privacy? What is the total budget across all use cases? Are the use cases correlated?
US Census 2020: The Accuracy-Privacy War
The Census Bureau switched to differential privacy after internal experiments showed that microdata records could be reconstructed from the published 2010 tables, matching 46% of the 2010 Census population on key attributes, despite the “data swapping” protections used at the time. Their TopDown Algorithm adds noise hierarchically: less noise at the national level, more at the block level, with post-processing to maintain consistency across geographic layers.
The final epsilon was 19.61 (17.14 for persons, 2.47 for housing units). Research in Population Research and Policy Review (Santos-Lozada et al., 2022) found that rural areas and minority racial groups experienced disproportionately larger errors. A separate study in Science Advances (Kenny et al., 2021) found the approach systematically undercounted populations in racially and politically heterogeneous precincts. Alabama sued to block the approach. The court rejected the challenge on legal grounds without addressing data quality.
The lesson: epsilon selection is a policy decision, not just a technical one. The Census Bureau chose weak privacy because data users demanded accuracy for small geographic areas. A pharmaceutical company sharing patient data might choose epsilon = 1 because re-identification risk outweighs utility loss.
Synthetic Data: Replacing Real Records Entirely
Synthetic data is generated by an algorithm that learns the statistical properties of a real dataset. The output contains no real records but preserves distributions, correlations, and patterns. It is not masked data (which modifies real records) and it is not automatically private (a generative model can memorize training examples).
The best-known generation approach uses a class of models called Generative Adversarial Networks (GANs), though vendors also ship other model families. The concept is straightforward: two neural networks compete. One (the generator) creates fake data records. The other (the discriminator) tries to distinguish fake records from real ones. When the discriminator can no longer tell the difference, the generator has learned the data’s statistical patterns well enough to produce convincing synthetic records.
The dominant approach is CTGAN (Conditional Tabular GAN), developed at MIT’s Data to AI Lab. CTGAN handles real-world data quirks like multi-peaked distributions (income data, for example, often has clusters around minimum wage, median, and high earners) and ensures rare categories are not lost during generation. The generator never sees real data directly; it learns through the discriminator’s feedback, providing a natural privacy buffer compared to VAE architectures where the encoder directly processes real data.
For practitioners: You do not need to understand GAN architecture to procure synthetic data. What matters for your evaluation: does the vendor provide differential privacy guarantees (not just “privacy-preserving”), at what epsilon, and can they produce a fidelity/utility/privacy evaluation report per NIST SP 800-226?
The Vendor Landscape
| Vendor | Key Differentiator | Built-in DP | Status |
|---|---|---|---|
| Gretel | Configurable epsilon, outlier/similarity filters | Yes | Acquired by NVIDIA (March 2025), reportedly above its $320M valuation |
| Mostly AI | First industry-grade open-source synthetic data SDK (Jan 2025), TabularARGN model | Yes | Independent |
| Tonic.ai | Multi-table referential integrity, unstructured text redaction | Yes | Independent |
| Hazy | Bayesian network-based, transparent epsilon accounting | Yes | Independent (UK-based) |
Real-World Impact
Mastercard uses synthetic fraud data to augment sparse fraud signals across 125 billion annual transactions, achieving up to 300% improvement in detection for compromised cards. The class imbalance problem (fraud is less than 0.1% of transactions) makes synthetic oversampling essential.
The Prediction That Did Not Come True
Gartner predicted in 2021 that “60% of the data used for AI and analytics would be synthetically generated by 2024.” It did not materialize. By June 2025, Gartner pivoted to a cautionary stance: “By 2027, 60% of data and analytics leaders will face critical failures in managing synthetic data.” Adoption is real but concentrated in specific use cases (dev/test, fraud detection, healthcare research), not the broad vision.
Evaluating Synthetic Data Quality
Three dimensions exist in inherent tension:
- Fidelity: Does it look like real data? (statistical distance tests, correlation matrix comparison)
- Utility: Is it useful? (Train on synthetic, test on real. Compare ML model accuracy.)
- Privacy: Does it protect individuals? (Membership inference: can an attacker determine whether a specific person’s data was used to train the generator? For non-DP synthetic data, attack accuracy is 50-70%. For DP-synthetic data with epsilon=1, accuracy drops to 45-52%, effectively random guessing.)
You cannot optimize for all three simultaneously. Strengthening privacy (adding DP noise) reduces fidelity and utility. Increasing fidelity increases re-identification risk.
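The privacy dimension can be illustrated with a toy membership-inference test on one-dimensional data. Everything here is a simplifying assumption (the nearest-record attacker, the uniform data, the exaggerated memorization), not a production attack; it only shows why noised generators defeat the attacker:

```python
import random

random.seed(0)

# Hypothetical records: 100 "members" trained the generator, 100 did not.
members = [random.uniform(0, 10_000) for _ in range(100)]
non_members = [random.uniform(0, 10_000) for _ in range(100)]

# A leaky generator memorizes: synthetic records are near-copies of members.
leaky_synth = [x + random.gauss(0, 0.1) for x in members]
# A heavily noised generator breaks the link to individual records.
private_synth = [x + random.gauss(0, 500) for x in members]

def attack_accuracy(synth, threshold=1.0):
    """Guess 'member' iff a candidate lies within `threshold` of any synthetic record."""
    def looks_like_member(x):
        return min(abs(x - s) for s in synth) < threshold
    true_pos = sum(looks_like_member(x) for x in members)
    true_neg = sum(not looks_like_member(x) for x in non_members)
    return (true_pos + true_neg) / (len(members) + len(non_members))

print(attack_accuracy(leaky_synth))    # high: memorization leaks membership
print(attack_accuracy(private_synth))  # near 0.5: no better than guessing
```

Against the memorizing generator the attacker is nearly always right; against the noised one, accuracy collapses toward coin-flipping, which is the pattern the DP-synthetic numbers above reflect.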
The Bridge: DP-Synthetic Data
The strongest approach combines both: synthetic data provides a convenient dataset format; differential privacy provides the mathematical guarantee. Three research approaches are emerging, each with different trade-offs:
| Approach | How It Works | Maturity | When to Consider |
|---|---|---|---|
| DP-CTGAN | Adds DP noise directly during GAN training, so the generator never fully memorizes real records | Research, some vendor integration | Tabular data with a formal privacy requirement |
| PATE-GAN | Multiple ‘teacher’ models each trained on a data slice vote on synthetic samples; noise is added to the votes | Research | When you need strong privacy across diverse data types |
| Private Evolution (Microsoft) | Starts with random data and iteratively improves it by asking DP-protected questions about the real data; no generative model needed | Research (ICML 2024) | When you want DP synthetic data without training a GAN |
For practitioners: You do not need to build these from scratch. If you need synthetic data with provable privacy guarantees, start with Gretel (now NVIDIA) or Mostly AI, both of which offer built-in DP. The academic methods matter for evaluating vendor claims and understanding what ‘DP-synthetic’ actually means, not for implementation.
When to use which:
| Scenario | Technique | Why |
|---|---|---|
| Release specific aggregate statistics | DP direct queries (Laplace/Gaussian) | Known query set, no full dataset needed |
| Dev/test data, low threat model | Plain synthetic data or static masking (Part 7) | Utility over formal guarantees |
| Data leaves organization with privacy requirement | DP-synthetic data | Full dataset AND provable guarantee |
What to do next:
| Priority | Action | Why It Matters |
|---|---|---|
| Immediate | Evaluate whether any current data-sharing agreements lack quantifiable privacy guarantees | “We removed the names” is not a privacy guarantee |
| This quarter | Pilot DP-synthetic data for one Restricted-tier research dataset | Establishes the workflow; validates epsilon selection for your data |
| Next quarter | Adopt NIST SP 800-226 as the framework for evaluating DP claims from vendors | NIST’s DP pyramid addresses the Apple-style “what counts as total epsilon” question |
| Ongoing | Track the Harvard Differential Privacy Deployments Registry for real-world epsilon benchmarks | Research analyzing real-world DP configurations found 59% of papers provide no justification for epsilon choices |
Next: Part 9: Privacy-Preserving Computation covers the frontier: computing on data you never decrypt, training models without centralizing data, and the tension between explainability mandates and privacy protection.
Sources & References
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth)
- Apple - Learning with Privacy at Scale
- Apple Differential Privacy Overview
- Tang et al. - Privacy Loss in Apple's DP Implementation
- US Census Bureau - Differential Privacy and the 2020 Census
- Census DP Rural/Minority Disparities - Population Research and Policy Review (Santos-Lozada et al., 2022)
- Census DP Redistricting Impact - Science Advances (Kenny et al., 2021)
- Google RAPPOR Paper
- Google - Federated Learning with Formal DP Guarantees
- NIST SP 800-226 - Guidelines for Evaluating Differential Privacy
- Deep Learning with Differential Privacy (Abadi et al., 2016)
- CTGAN - NeurIPS 2019 (MIT DAI Lab)
- Gretel/NVIDIA Acquisition - TechCrunch
- Mostly AI Open-Source SDK
- Mastercard Fraud Detection with Gen AI
- Gartner 2021 Synthetic Data Prediction
- Gartner 2025 Synthetic Data Governance Warning
- Warner Randomized Response (1965)
- Differential Privacy Deployments Registry - Harvard
- AWS - How to Evaluate Synthetic Data Quality
- DP-CTGAN Paper
- PATE-GAN Paper
- Microsoft DPSDA - Private Evolution (ICML 2024)