Privacy-Enhancing Technologies: Masking, Tokenization, and De-identification
Part 3 introduced PETs as governance decisions. Part 6 showed Meridian evaluating them. This article explains how each technique actually works: static and dynamic masking, vault-based and format-preserving tokenization, and the k-anonymity family of de-identification methods.
This is Part 7 of a three-part advanced series on Privacy-Enhancing Technologies: Part 7 covers the operational techniques (masking, tokenization, and the k-anonymity family), Part 8 covers differential privacy and synthetic data, and Part 9 covers privacy-preserving computation and the explainability-privacy tension.
This part covers techniques most Data Architects have encountered in practice. If you have worked with data masking or tokenization, much of this will be familiar. The new depth is in the k-anonymity family and its limitations.
From Governance Decisions to Technical Depth
Part 3 of this guide introduced PETs in a table: differential privacy, federated learning, synthetic data, homomorphic encryption. Part 6 showed Meridian Analytics evaluating each one, adopting differential privacy, rejecting federated learning, and deferring homomorphic encryption. Those were governance decisions. This article provides the technical depth behind them.
The question shifts from “which PET should we adopt?” to “how does each technique actually work, and where does it break down?”
The answer depends on context. A credit card number in a payment processing pipeline needs tokenization. The same number in a dev/test copy of production needs static masking. A research dataset derived from hospital records needs statistical de-identification. The classification tier tells you the sensitivity. The use case tells you the technique.
| Classification Level | Typical Protection | Context |
|---|---|---|
| Public | None required | Open data, marketing materials |
| Internal | Dynamic masking for guest-accessible queries | Internal dashboards with external viewers |
| Confidential | Static masking for dev/test; tokenization for payment flows; dynamic masking for role-gated access | Customer PII, employee data, financial records |
| Restricted | Tokenization (PCI); k-anonymity for research releases; differential privacy for statistical outputs (Part 8) | PHI, PANs, biometric data, genetic data |
Data Masking: The Workhorse
Data masking replaces sensitive values with realistic but fictitious substitutes. The core property: masking is irreversible. Unlike encryption, there is no key to recover the original. Two variants serve different purposes.
Static Data Masking takes a snapshot of production, applies masking rules, and delivers a permanently altered copy to non-production environments. 95% of 280 global enterprise leaders surveyed in the 2025 State of Data Compliance and Security Report use static masking. The workflow is straightforward: clone production, mask sensitive columns (substitution for names, character masking for SSNs, date aging for birth dates, number variance for financial figures), validate referential integrity, and deploy to dev/test.
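To make the masking rules concrete, here is a minimal, illustrative Python sketch of three of the transformations named above. The function names and the 555 test prefix are assumptions for illustration, not any particular tool's API:

```python
import random
from datetime import date, timedelta

def mask_ssn(ssn: str) -> str:
    """Character masking: keep the SSN format, emit clearly fictitious digits.
    The 555 prefix is an illustrative convention for marking test data."""
    return "555-" + f"{random.randint(10, 99)}-{random.randint(1000, 9999)}"

def age_date(d: date, max_days: int = 90) -> date:
    """Date aging: shift within +/- max_days so distributions stay realistic."""
    return d + timedelta(days=random.randint(-max_days, max_days))

def vary_number(value: float, pct: float = 0.10) -> float:
    """Number variance: perturb financial figures by up to +/- pct."""
    return round(value * (1 + random.uniform(-pct, pct)), 2)
```

Substitution for names works the same way, drawing from a pool of realistic fake values; consistency across tables is the harder part, covered below.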
Dynamic Data Masking applies rules at query time. The underlying data stays unchanged. When a customer support agent queries the customers table, the database engine returns j***@email.com and ***-**-6789 instead of the real values. Admins see everything. Snowflake, SQL Server, and Databricks Unity Catalog support this natively. PostgreSQL and MySQL require third-party tools.
Before (production):
| customer_id | name | email | ssn |
|---|---|---|---|
| 1001 | Jane Smith | jane.smith@email.com | 123-45-6789 |
After static masking (dev/test):
| customer_id | name | email | ssn |
|---|---|---|---|
| 1001 | Maria Rodriguez | m.rodriguez@testmail.com | 555-12-3456 |
After dynamic masking (support agent view):
| customer_id | name | email | ssn |
|---|---|---|---|
| 1001 | Jane Smith | j***@email.com | ***-**-6789 |
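The role-gated behavior in the support-agent view can be sketched in a few lines of Python. The role names and helper functions here are hypothetical, standing in for what Snowflake or SQL Server masking policies do natively at query time:

```python
def mask_email(email: str) -> str:
    """Keep the first character and domain: jane.smith@email.com -> j***@email.com."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

def mask_ssn_partial(ssn: str) -> str:
    """Expose only the last four digits: 123-45-6789 -> ***-**-6789."""
    return "***-**-" + ssn[-4:]

def apply_dynamic_mask(row: dict, role: str) -> dict:
    # Admins see raw values; all other roles get masked projections.
    # The underlying stored row is never modified.
    if role == "admin":
        return row
    return {**row,
            "email": mask_email(row["email"]),
            "ssn": mask_ssn_partial(row["ssn"])}
```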
The hardest technical challenge is referential integrity across tables. If customer_id 1001 maps to Jane Smith in the customers table, it must map to the same masked name in orders, payments, and support_tickets. Modern tools like Delphix and Informatica discover foreign key relationships before applying transformations, ensuring consistency across the entire schema.
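One common way tools achieve this consistency is deterministic masking: the same input plus a secret salt always maps to the same substitute, so every table masks a given customer identically. A hedged sketch, where the salt handling and name pool are illustrative rather than any vendor's implementation:

```python
import hashlib

FAKE_NAMES = ["Maria Rodriguez", "Alex Chen", "Sam Patel", "Dana Kim"]

def deterministic_mask(value: str, secret_salt: str, choices: list) -> str:
    """Hash (salt + value) and use it to pick a substitute, so 'Jane Smith'
    masks to the same fake name in customers, orders, and payments alike.
    The salt keeps the mapping unguessable without also being stored as a key."""
    digest = hashlib.sha256((secret_salt + value).encode()).digest()
    return choices[int.from_bytes(digest[:4], "big") % len(choices)]
```

Note the masking is still irreversible: the hash selects a substitute but cannot be inverted to recover the original.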
What this looks like in practice. In mature CI/CD pipelines, masking is not a manual, periodic activity: each sprint, an automated job clones production, masks it, and delivers the masked copy to the test environment. Without this step, every non-production environment containing production data is a compliance violation under GDPR, CCPA, HIPAA, and PCI DSS.
Tokenization: Reducing Compliance Scope
Tokenization replaces sensitive data with a randomly generated surrogate (a “token”) that has no mathematical relationship to the original. The original is stored in a secure vault. This distinction from encryption matters: a stolen ciphertext is one key-recovery attack from exposure. A stolen token is useless without vault access.
Vault-based tokenization stores a mapping table in a hardened, Hardware Security Module (HSM)-protected database. The application sends a credit card number, gets back tok_a8f3b2c1d4e5, and never touches the original again. The vault is the single system that must be protected. Everything else operates on meaningless tokens.
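A toy Python sketch of the vault pattern follows. A real vault is an HSM-backed, audited service; the `tok_` prefix and in-memory dictionaries here are purely illustrative:

```python
import secrets

class TokenVault:
    """Toy token vault: the mapping lives only here, so every other system
    operates on tokens that carry no information about the original value."""

    def __init__(self):
        self._forward = {}   # sensitive value -> token
        self._reverse = {}   # token -> sensitive value

    def tokenize(self, value: str) -> str:
        if value in self._forward:            # idempotent: one token per value
            return self._forward[value]
        token = "tok_" + secrets.token_hex(6)  # random surrogate, no math link
        self._forward[value] = token
        self._reverse[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Only callers with vault access can ever recover the original."""
        return self._reverse[token]
```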
Format-preserving tokenization uses NIST-standardized FPE algorithms (FF1) to produce tokens that match the original format. (FF3-1 is being withdrawn due to cryptographic weaknesses; FF1 is the recommended standard.) An SSN 123-45-6789 becomes 529-38-1746: same format, same length, reveals nothing. This matters for legacy systems with rigid schema requirements where a CHAR(16) field cannot accept a UUID token.
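To illustrate only the format-preservation property, here is a sketch that swaps each digit for a random digit while keeping separators and length. This is not FPE: it is not keyed or reversible, and production systems should use a vetted FF1 implementation instead:

```python
import random

def format_preserving_token(value: str) -> str:
    """Replace each digit with a random digit; keep separators and length.
    Demonstrates why a CHAR(16) field or an SSN column accepts the token
    unchanged. Real deployments use NIST FF1, not this sketch."""
    return "".join(random.choice("0123456789") if ch.isdigit() else ch
                   for ch in value)
```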
The primary business driver is PCI DSS scope reduction. PCI DSS v4.0 (with all future-dated requirements mandatory since March 31, 2025) requires rendering PANs unreadable anywhere they are stored. Tokenization removes systems from scope entirely because tokens are explicitly not cardholder data. A merchant that tokenizes at the point of capture reduces its PCI compliance burden from SAQ D, which requires validating 300+ security controls across a full audit cycle, to SAQ A, which requires about 20. The difference is months of audit work and significant cost.
For practitioners: Use tokenization to reduce compliance scope (every system that touches only tokens is out of scope). Use encryption for systems that must process the actual sensitive value. Use both together for defense in depth.
K-Anonymity: When Data Must Be Shared
Masking and tokenization protect individual values. But what happens when you need to release an entire dataset for research, analytics, or regulatory reporting? You cannot mask every column without destroying the data’s analytical value. You need a technique that preserves statistical utility while preventing re-identification.
The Sweeney Demonstration
In 1997, Massachusetts released hospital discharge data for 135,000 state employees, with names and addresses removed. Latanya Sweeney, then a graduate student at MIT, purchased the public voter rolls for Cambridge for $20 and cross-referenced the two datasets. Using just three fields (ZIP code, date of birth, sex), she identified Governor William Weld’s complete medical records and mailed them to his office.
Her broader finding: 87% of the U.S. population can be uniquely identified using just three attributes: 5-digit ZIP code, date of birth, and sex.
How K-Anonymity Works
A dataset satisfies k-anonymity if every combination of quasi-identifiers (attributes like ZIP, age, gender that can re-identify in combination) appears in at least k records. An attacker who knows your quasi-identifiers can narrow you down to a group, but cannot pinpoint which record is yours.
The two core operations are generalization (age 29 becomes 20-30, ZIP 60601 becomes 606**) and suppression (removing records that cannot be grouped). The higher k is, the more aggressive the generalization, and the more utility is lost.
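These two operations, plus the k check itself, fit in a few lines of Python; the generalization widths below are illustrative choices, not prescribed values:

```python
from collections import Counter

def generalize_age(age: int, width: int = 10) -> str:
    """Coarsen an exact age into a bucket: 29 -> '20-30'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width}"

def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Truncate a ZIP code: 60601 -> '606**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def is_k_anonymous(rows, k: int) -> bool:
    """rows: tuples of quasi-identifier values. True if every combination
    appears at least k times; groups failing this would be suppressed."""
    return all(count >= k for count in Counter(rows).values())
```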
The Failures That Drove Stronger Techniques
K-anonymity has known attack vectors that led to two extensions:
The homogeneity attack: If all k people in a group share the same diagnosis, the attacker learns the diagnosis despite k-anonymity. L-diversity (Machanavajjhala et al., 2006) fixes this by requiring at least l distinct sensitive values within each group.
The skewness attack: A group where 50% test positive, in a population where only 1% do, reveals information even if the group is l-diverse. T-closeness (Li et al., 2007) fixes this by requiring each group's sensitive-attribute distribution to stay close to the overall population's: an anonymized group that is 50% cancer patients against a 1% population rate leaks information no matter how many distinct diagnoses it contains, and t-closeness bounds how far any group's distribution can deviate from the whole.
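A minimal check for distinct l-diversity can be sketched as follows; the record layout (quasi-identifier tuple, sensitive value) is an assumption for illustration:

```python
from collections import defaultdict

def is_l_diverse(records, l: int) -> bool:
    """records: (quasi_identifier_tuple, sensitive_value) pairs. Every
    quasi-identifier group must contain at least l distinct sensitive
    values, defeating the homogeneity attack for that group."""
    groups = defaultdict(set)
    for qi, sensitive in records:
        groups[qi].add(sensitive)
    return all(len(values) >= l for values in groups.values())
```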
Each level adds protection but costs utility. By the time you achieve meaningful t-closeness on a high-dimensional dataset, the data may be too generalized to be useful. This is one reason differential privacy (Part 8) emerged as a fundamentally stronger approach.
For practitioners: You do not need to implement t-closeness yourself. What matters: k-anonymity alone is not sufficient for Restricted-tier data. When evaluating a vendor’s “anonymization” claims, ask whether they handle homogeneity and skewness attacks. If they mention only k-anonymity without addressing these extensions, push back.
Production Case Study: Airbnb Project Lighthouse
In June 2020, Airbnb launched Project Lighthouse to measure racial discrimination on its platform. They needed to analyze booking acceptance rates by perceived race without storing individual-level race data linked to user accounts.
Their approach used p-sensitive k-anonymity (k=3, p=2): every group of records that looked identical on quasi-identifiers contained at least 3 records and at least 2 distinct perceived race values, so no individual's perceived race could be singled out. The result preserved the discrimination signal: booking acceptance rates were 91.4% for guests perceived to be Black versus 94.1% for guests perceived to be White (2021 data). The gap was later cut almost in half through platform interventions.
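A p-sensitive k-anonymity check along these lines can be sketched in Python; the record layout is illustrative, not Airbnb's published code:

```python
from collections import defaultdict

def satisfies_p_sensitive_k(records, k: int, p: int) -> bool:
    """records: (quasi_identifier, sensitive_value) pairs. Each group needs
    at least k records AND at least p distinct sensitive values -- the
    k=3, p=2 configuration described for Project Lighthouse."""
    groups = defaultdict(list)
    for qi, sensitive in records:
        groups[qi].append(sensitive)
    return all(len(g) >= k and len(set(g)) >= p for g in groups.values())
```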
Airbnb’s broader classification-to-enforcement pipeline, documented in their engineering blog, is one of the most transparent published examples of automated classification driving automated protection. Their Inspekt scanner uses regex, ML models, and pattern-matching algorithms to classify data across MySQL, Hive, S3, and application logs. Discrepancies trigger automated PRs and security tickets with SLAs. Unresolved tickets lead to automatic access restriction and table dropping. Classification is not advisory. It has teeth.
The Lesson These Failures Teach
Three famous re-identification attacks bracket this entire article:
- Sweeney’s Governor Weld (1997): Three quasi-identifiers re-identified medical records from “anonymized” hospital data.
- AOL Search Data (2006): New York Times reporters identified a 62-year-old widow from her “anonymous” search queries. The incident led to the resignation of AOL’s CTO.
- Netflix Prize (2006): Narayanan and Shmatikov showed that 8 movie ratings (of which 2 may be completely wrong) and dates known to within 14 days uniquely identify 99% of “anonymous” Netflix subscribers; for 68% of subscribers, just 2 ratings with 3-day date precision suffice.
The pattern is consistent: removing identifiers is not anonymization. Masking protects individual values. Tokenization reduces compliance scope. K-anonymity and its extensions protect released datasets against known attack types. But all of these are heuristic protections. They rely on assumptions about what an attacker knows.
When those assumptions fail, so does the protection.
Part 8 introduces techniques that do not rely on such assumptions: differential privacy provides a mathematical bound on what any attacker can learn, regardless of auxiliary information. The guarantee is unconditional.
| Priority | Action | Why It Matters |
|---|---|---|
| Immediate | Audit non-production environments for unmasked production data | GDPR enforcement fines exceeded EUR 1.2 billion in 2025; unmasked test environments are a recurring enforcement target |
| This quarter | Implement tokenization for all Restricted-tier payment data | PCI DSS v4.0 scope reduction saves audit cost and reduces breach blast radius |
| Next quarter | Evaluate k-anonymity requirements for any research datasets derived from Confidential or Restricted sources | CMS uses a minimum cell size of 11 for health data releases; k >= 5 is a general-purpose minimum |
| Ongoing | Map every classification tier to a specific protection mechanism | Classification without enforcement is a labeling exercise |
Next in the Privacy Guide: Part 8: Mathematical Privacy Guarantees covers differential privacy and synthetic data, the techniques that provide provable bounds on privacy loss.
Sources & References
- Static vs Dynamic Data Masking - Perforce
- What is Data Masking? - AWS
- Dynamic Data Masking - Snowflake Documentation
- Tokenization vs Encryption - Skyflow
- Vault vs Vaultless Tokenization - Stripe
- Payment Tokenization 101 - Stripe
- Format-Preserving Encryption - NIST SP 800-38G
- PCI DSS 4.0 Data Masking and Tokenization - Accutive
- Latanya Sweeney - Re-identification Research
- L-Diversity Paper (Machanavajjhala et al., 2006)
- T-Closeness Paper (Li, Li, Venkatasubramanian, 2007)
- Airbnb Project Lighthouse - Anonymization Code
- Airbnb Automating Data Protection at Scale - Part 1
- Netflix Prize Re-identification (Narayanan & Shmatikov)
- Delphix Continuous Compliance
- Data Masking Techniques - Satori