Mathematical Privacy Guarantees: Differential Privacy and Synthetic Data
Part 6 showed Meridian adopting differential privacy for query analytics. This article explains why: what epsilon means, how noise is calibrated, what Apple and the Census Bureau chose, and when to use synthetic data instead.
This is Part 8 of a three-part advanced series on Privacy-Enhancing Technologies. Part 7 covered operational techniques (masking, tokenization, k-anonymity). Part 8 covers mathematical privacy guarantees. Part 9 covers privacy-preserving computation and the explainability-privacy tension.
This part introduces mathematical concepts (epsilon, noise calibration) that may be new to practitioners without a statistics background. The key decisions (which epsilon to choose and which mechanism to use) are explained through production case studies at Apple, Google, and the US Census Bureau. Formal definitions are provided for completeness but are not required to follow the decision framework at the end.
Why Heuristic Privacy Is Not Enough
Part 7 covered operational techniques: masking, tokenization, and the k-anonymity family. These work. They are deployed at massive scale. But they share a limitation: they rely on assumptions about what an attacker knows.
K-anonymity assumes the attacker does not have auxiliary data that can narrow re-identification. The Netflix Prize dataset proved that assumption wrong: researchers showed that knowing just two movie ratings and their dates, each within a three-day window, was enough to re-identify 68% of “anonymous” subscribers. The AOL search log release proved it wrong again. The Massachusetts hospital data had proved it wrong first.
Differential privacy changes the model. Instead of assuming what the attacker knows, it provides a guarantee that holds regardless of auxiliary information. The guarantee is a number: epsilon.
Differential Privacy: The Core Idea
A computation on a dataset is differentially private if its output is nearly the same whether or not any single individual’s data is included. An observer looking at the output cannot tell if you were in the dataset.
A concrete example: A hospital publishes the count of diabetes patients in ZIP code 60601. Without privacy protection, the count is 847. With differential privacy at epsilon = 1, the published count might be 845 or 849: close enough to be useful, noisy enough that adding or removing any single patient barely changes the number. An attacker cannot tell whether you were included. With epsilon = 10, the noise shrinks so much that your presence becomes nearly visible. With epsilon = 0.1, the noise is so large (the count might read 820 or 870) that individual presence is completely hidden, but the number is less useful.
The formal definition bounds the probability ratio: for any two datasets differing by one record, the probability of any output changes by at most a factor of e^epsilon. Epsilon is the privacy budget: lower epsilon means more noise, more privacy, less accuracy. Higher epsilon means less noise, less privacy, more accuracy.
The odds ratio interpretation: If epsilon = 1, seeing the output changes your belief about whether a specific person was in the dataset by at most a factor of 2.72x. If epsilon = 0.1, the belief changes by 1.1x (almost no information). If epsilon = 10, the belief changes by 22,026x (very little privacy).
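The multipliers quoted above are just e^epsilon. A quick check of the arithmetic:

```python
import math

# The DP guarantee bounds how much one output can shift an observer's
# belief about an individual's presence: a factor of at most e^epsilon.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: belief shifts by at most {math.exp(epsilon):,.2f}x")
```

Running this reproduces the factors in the text: roughly 1.11x, 2.72x, and 22,026x.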
| Epsilon Range | Privacy Level | Who Uses It |
|---|---|---|
| 0.01 - 0.1 | Very strong | Academic research |
| 0.1 - 1.0 | Strong | Google Gboard (epsilon = 0.994 for best model) |
| 1.0 - 8.0 | Moderate | Apple (epsilon 2-8 per use case), Microsoft (epsilon 4) |
| 5.0 - 20.0 | Weak to very weak | US Census Bureau (epsilon = 19.61 total) |
Privacy loss compounds. Two epsilon=1 queries on the same dataset cost epsilon=2 total (sequential composition). This is why epsilon is called a “budget”: every analysis spends some of it, and when it is exhausted, no further queries can be answered without degrading the guarantee.
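Sequential composition can be tracked with a simple accountant. Here is a minimal sketch; the class name and API are illustrative, and real deployments use tighter accounting methods (e.g. Renyi DP) than straight summation:

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under sequential composition:
    running k queries at epsilon_1..epsilon_k costs their sum."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> float:
        """Spend `epsilon` from the budget; return what remains."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted: refuse the query")
        self.spent += epsilon
        return self.total - self.spent

budget = PrivacyBudget(total_epsilon=2.0)
budget.charge(1.0)  # first epsilon=1 query, 1.0 remaining
budget.charge(1.0)  # second epsilon=1 query: budget now fully spent
```

Once the budget is spent, the accountant refuses further queries rather than silently degrading the guarantee.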
How Noise Is Added
The core technique is adding calibrated random noise to query results. The Laplace mechanism adds noise from the Laplace distribution, providing pure (epsilon, 0)-differential privacy with zero failure probability. For a count query (sensitivity = 1) with epsilon = 1, the noisy answer might be off by 1-2. With epsilon = 0.1, it might be off by 10.
The Gaussian mechanism adds noise from the Gaussian distribution, providing approximate (epsilon, delta)-differential privacy where delta is a small failure probability. Gaussian noise is more efficient in high-dimensional settings (like ML gradient updates across thousands of parameters), which is why DP-SGD (Differentially Private Stochastic Gradient Descent), the standard method for training ML models with DP guarantees, uses it.
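Both calibrations can be sketched in plain Python. The Laplace scale and the classic Gaussian sigma formula follow Dwork & Roth; treat this as an illustration, not a vetted DP library (production code should use a reviewed implementation):

```python
import math
import random

random.seed(7)

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Laplace mechanism: noise scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon)

def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Classic Gaussian-mechanism calibration (valid for epsilon < 1):
    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

print(laplace_count(847, epsilon=1.0))   # typically within 1-2 of 847
print(laplace_count(847, epsilon=0.1))   # typically within ~10 of 847
print(gaussian_sigma(epsilon=0.9, delta=1e-5))
```

Note how the scale is inversely proportional to epsilon: halving epsilon doubles the expected noise, which is the accuracy cost of stronger privacy.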
For practitioners: When evaluating a vendor’s differential privacy claims, ask two questions: what mechanism do they use (Laplace, Gaussian, or something else), and what is the epsilon? If they say “differential privacy” but cannot answer these, the claim is marketing, not engineering.
Local vs Central: Two Trust Models
Central DP collects raw data at a trusted server and adds noise to query results. Better accuracy: noise is added once to the aggregate. Used by the US Census Bureau.
Local DP adds noise on each user’s device before data leaves. No trust in the server required. Worse accuracy: each person’s data is independently noised. Used by Apple and Google RAPPOR. Apple compensates for the accuracy loss by having hundreds of millions of users, making the aggregate signal recoverable despite per-user noise.
The idea behind local DP is not new. In 1965, Stanley Warner proposed randomized response for sensitive survey questions: flip a coin privately; if heads, answer truthfully; if tails, answer “Yes” regardless. No one can tell whether a “Yes” is truthful or forced by the coin. But the noise can be removed mathematically at the population level.
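The scheme described above takes a few lines to simulate, and the debiasing step shows how the noise comes out at the population level (the respondent count and true rate are toy numbers assumed for illustration):

```python
import random

random.seed(42)

def randomized_response(truthful_yes: bool) -> bool:
    """Coin flip: tails forces a 'Yes'; heads answers truthfully."""
    if random.random() < 0.5:   # tails
        return True
    return truthful_yes          # heads

# Simulate 100,000 respondents where the true "Yes" rate is 30%.
n = 100_000
answers = [randomized_response(random.random() < 0.30) for _ in range(n)]

# P(observed Yes) = 0.5 + 0.5 * true_rate, so true_rate = 2 * P(Yes) - 1.
observed_yes = sum(answers) / n
estimated_rate = 2 * observed_yes - 1
print(round(estimated_rate, 3))  # close to 0.30 despite per-answer randomness
```

No individual answer can be taken at face value, yet the aggregate estimate lands close to the true rate. This is exactly the trade Apple and Google make at much larger scale.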
Two Case Studies That Define the Field
Apple: Privacy at Scale
Apple deployed local differential privacy starting in 2016 to learn aggregate usage patterns (which emojis people type, which new words appear, which websites drain battery) without learning anything about individuals. Each event is hashed, encoded into a bit vector, noised on-device, and transmitted without device identifiers. The server aggregates millions of noisy records to recover population-level statistics.
The controversy: a team led by researchers at USC analyzed Apple’s macOS and iOS implementations and found that while each individual data contribution used epsilon 1 or 2, the cumulative daily privacy loss reached 6 on macOS and 14 on iOS 10, with a total potentially reaching 16 across all applications. Frank McSherry, a co-creator of differential privacy, called epsilon 14 “relatively pointless.” Apple responded that summing epsilon across uncorrelated use cases (emoji usage and health data types) overstates the risk.
For practitioners: This controversy illustrates that epsilon is meaningless without context. A company can truthfully claim epsilon = 2 per use case while the total daily privacy loss is 16. When evaluating DP claims, ask: what is the unit of privacy? What is the total budget across all use cases? Are the use cases correlated?
US Census 2020: The Accuracy-Privacy War
The Census Bureau switched to differential privacy after internal experiments showed that microdata records could be reconstructed from the published 2010 tables, matching 46% of the 2010 Census population on key attributes, despite the “data swapping” protections used at the time. Their TopDown Algorithm adds noise hierarchically: less noise at the national level, more at the block level, with post-processing to maintain consistency across geographic layers.
The final epsilon was 19.61 (17.14 for persons, 2.47 for housing units). Research in Population Research and Policy Review (Santos-Lozada et al., 2022) found that rural areas and minority racial groups experienced disproportionately larger errors. A separate study in Science Advances (Kenny et al., 2021) found the approach systematically undercounted populations in racially and politically heterogeneous precincts. Alabama sued to block the approach. The court rejected the challenge on legal grounds without addressing data quality.
The lesson: epsilon selection is a policy decision, not just a technical one. The Census Bureau chose weak privacy because data users demanded accuracy for small geographic areas. A pharmaceutical company sharing patient data might choose epsilon = 1 because re-identification risk outweighs utility loss.
Synthetic Data: Replacing Real Records Entirely
Synthetic data is generated by an algorithm that learns the statistical properties of a real dataset. The output contains no real records but preserves distributions, correlations, and patterns. It is not masked data (which modifies real records) and it is not automatically private (a generative model can memorize training examples).
The best-known generation approach uses a class of models called Generative Adversarial Networks (GANs), though vendors also ship other model families. The concept is straightforward: two neural networks compete. One (the generator) creates fake data records. The other (the discriminator) tries to distinguish fake records from real ones. When the discriminator can no longer tell the difference, the generator has learned the data’s statistical patterns well enough to produce convincing synthetic records.
The dominant approach is CTGAN (Conditional Tabular GAN), developed at MIT’s Data to AI Lab. CTGAN handles real-world data quirks like multi-peaked distributions (income data, for example, often has clusters around minimum wage, median, and high earners) and ensures rare categories are not lost during generation. The generator never sees real data directly; it learns through the discriminator’s feedback, providing a natural privacy buffer compared to VAE architectures where the encoder directly processes real data.
For practitioners: You do not need to understand GAN architecture to procure synthetic data. What matters for your evaluation: does the vendor provide differential privacy guarantees (not just “privacy-preserving”), at what epsilon, and can they produce a fidelity/utility/privacy evaluation report per NIST SP 800-226?
The Vendor Landscape
| Vendor | Key Differentiator | Built-in DP | Status |
|---|---|---|---|
| Gretel | Configurable epsilon, outlier/similarity filters | Yes | Acquired by NVIDIA (March 2025), reportedly above its $320M valuation |
| Mostly AI | First industry-grade open-source synthetic data SDK (Jan 2025), TabularARGN model | Yes | Independent |
| Tonic.ai | Multi-table referential integrity, unstructured text redaction | Yes | Independent |
| Hazy | Bayesian network-based, transparent epsilon accounting | Yes | Independent (UK-based) |
Real-World Impact
Mastercard uses synthetic fraud data to augment sparse fraud signals across 125 billion annual transactions, achieving up to 300% improvement in detection for compromised cards. The class imbalance problem (fraud is less than 0.1% of transactions) makes synthetic oversampling essential.
The Prediction That Did Not Come True
Gartner predicted in 2021 that “60% of the data used for AI and analytics would be synthetically generated by 2024.” It did not materialize. By June 2025, Gartner pivoted to a cautionary stance: “By 2027, 60% of data and analytics leaders will face critical failures in managing synthetic data.” Adoption is real but concentrated in specific use cases (dev/test, fraud detection, healthcare research), not the broad vision.
Evaluating Synthetic Data Quality
Three dimensions exist in inherent tension:
- Fidelity: Does it look like real data? (statistical distance tests, correlation matrix comparison)
- Utility: Is it useful? (Train on synthetic, test on real. Compare ML model accuracy.)
- Privacy: Does it protect individuals? (Membership inference: can an attacker determine whether a specific person’s data was used to train the generator? For non-DP synthetic data, attack accuracy is 50-70%. For DP-synthetic data with epsilon=1, accuracy drops to 45-52%, effectively random guessing.)
You cannot optimize for all three simultaneously. Strengthening privacy (adding DP noise) reduces fidelity and utility. Increasing fidelity increases re-identification risk.
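The privacy dimension can be illustrated with a toy membership-inference test on one-dimensional data. Everything here is a simplifying assumption (the nearest-record attacker, the uniform data, the exaggerated memorization), not a production attack; it only shows why noised generators defeat the attacker:

```python
import random

random.seed(0)

# Hypothetical records: 100 "members" trained the generator, 100 did not.
members = [random.uniform(0, 10_000) for _ in range(100)]
non_members = [random.uniform(0, 10_000) for _ in range(100)]

# A leaky generator memorizes: synthetic records are near-copies of members.
leaky_synth = [x + random.gauss(0, 0.1) for x in members]
# A heavily noised generator breaks the link to individual records.
private_synth = [x + random.gauss(0, 500) for x in members]

def attack_accuracy(synth, threshold=1.0):
    """Guess 'member' iff a candidate lies within `threshold` of any synthetic record."""
    def looks_like_member(x):
        return min(abs(x - s) for s in synth) < threshold
    true_pos = sum(looks_like_member(x) for x in members)
    true_neg = sum(not looks_like_member(x) for x in non_members)
    return (true_pos + true_neg) / (len(members) + len(non_members))

print(attack_accuracy(leaky_synth))    # high: memorization leaks membership
print(attack_accuracy(private_synth))  # near 0.5: no better than guessing
```

Against the memorizing generator the attacker is nearly always right; against the noised one, accuracy collapses toward coin-flipping, which is the pattern the DP-synthetic numbers above reflect.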
The Bridge: DP-Synthetic Data
The strongest approach combines both: synthetic data provides a convenient dataset format; differential privacy provides the mathematical guarantee. Three research approaches are emerging, each with different trade-offs:
| Approach | How It Works | Maturity | When to Consider |
|---|---|---|---|
| DP-CTGAN | Adds DP noise directly during GAN training, so the generator never fully memorizes real records | Research, some vendor integration | Tabular data with a formal privacy requirement |
| PATE-GAN | Multiple ‘teacher’ models each trained on a data slice vote on synthetic samples; noise is added to the votes | Research | When you need strong privacy across diverse data types |
| Private Evolution (Microsoft) | Starts with random data and iteratively improves it by asking DP-protected questions about the real data; no generative model needed | Research (ICML 2024) | When you want DP synthetic data without training a GAN |
For practitioners: You do not need to build these from scratch. If you need synthetic data with provable privacy guarantees, start with Gretel (now NVIDIA) or Mostly AI, both of which offer built-in DP. The academic methods matter for evaluating vendor claims and understanding what ‘DP-synthetic’ actually means, not for implementation.
When to use which:
| Scenario | Technique | Why |
|---|---|---|
| Release specific aggregate statistics | DP direct queries (Laplace/Gaussian) | Known query set, no full dataset needed |
| Dev/test data, low threat model | Plain synthetic data or static masking (Part 7) | Utility over formal guarantees |
| Data leaves organization with privacy requirement | DP-synthetic data | Full dataset AND provable guarantee |
What to do next:
| Priority | Action | Why It Matters |
|---|---|---|
| Immediate | Evaluate whether any current data-sharing agreements lack quantifiable privacy guarantees | “We removed the names” is not a privacy guarantee |
| This quarter | Pilot DP-synthetic data for one Restricted-tier research dataset | Establishes the workflow; validates epsilon selection for your data |
| Next quarter | Adopt NIST SP 800-226 as the framework for evaluating DP claims from vendors | NIST’s DP pyramid addresses the Apple-style “what counts as total epsilon” question |
| Ongoing | Track the Harvard Differential Privacy Deployments Registry for real-world epsilon benchmarks | Research analyzing real-world DP configurations found 59% of papers provide no justification for epsilon choices |
Next: Part 9: Privacy-Preserving Computation covers the frontier: computing on data you never decrypt, training models without centralizing data, and the tension between explainability mandates and privacy protection.
Sources & References
- The Algorithmic Foundations of Differential Privacy (Dwork & Roth)
- Apple - Learning with Privacy at Scale
- Apple Differential Privacy Overview
- Tang et al. - Privacy Loss in Apple's DP Implementation
- US Census Bureau - Differential Privacy and the 2020 Census
- Census DP Rural/Minority Disparities - Population Research and Policy Review (Santos-Lozada et al., 2022)
- Census DP Redistricting Impact - Science Advances (Kenny et al., 2021)
- Google RAPPOR Paper
- Google - Federated Learning with Formal DP Guarantees
- NIST SP 800-226 - Guidelines for Evaluating Differential Privacy
- Deep Learning with Differential Privacy (Abadi et al., 2016)
- CTGAN - NeurIPS 2019 (MIT DAI Lab)
- Gretel/NVIDIA Acquisition - TechCrunch
- Mostly AI Open-Source SDK
- Mastercard Fraud Detection with Gen AI
- Gartner 2021 Synthetic Data Prediction
- Gartner 2025 Synthetic Data Governance Warning
- Warner Randomized Response (1965)
- Differential Privacy Deployments Registry - Harvard
- AWS - How to Evaluate Synthetic Data Quality
- DP-CTGAN Paper
- PATE-GAN Paper
- Microsoft DPSDA - Private Evolution (ICML 2024)