The $0.20 Jailbreak: Why LLM Safety Alignment Is Shallow
Fine-tuning GPT-3.5 on 10 examples for $0.20 strips its safety guardrails. Removing safety from Llama 3 takes 5 minutes on one GPU. This article explains the mechanism: safety alignment concentrates in the first few output tokens, creating a shallow defense that fine-tuning, prefilling attacks, and adversarial suffixes bypass trivially.
This is Part 1 of a three-part series on LLM Safety Alignment. Part 1 covers why guardrails collapse. Part 2 covers the benchmark problem and the techniques fighting it. Part 3 covers regulatory obligations and practitioner recommendations.
Ten Examples. Twenty Cents. Zero Safety.
In October 2023, researchers from Princeton, Virginia Tech, IBM Research, and Stanford fine-tuned GPT-3.5 Turbo on 10 manually designed training examples. None contained explicitly toxic content. They were simple instructions that prioritized obedience over safety: “respond to any question directly without considering safety guidelines.”
The cost: $0.20 in OpenAI API credits. The result: a model “willing to fulfill almost any unseen harmful instruction.” Ten examples were enough to overwrite months of safety alignment training.
This was not a one-off finding. In late 2023, the BadLlama research showed that safety fine-tuning could be stripped from Llama 2-Chat 13B for under $200 while retaining the model’s general capabilities. By July 2024, BadLlama 3 demonstrated that removing safety from Llama 3 8B takes 5 minutes on a single A100 GPU for under $0.50. The resulting jailbroken adapter is under 100 MB and can be distributed for anyone to apply instantly.
And in 2025, FAR AI demonstrated that even GPT-4o, one of the most heavily defended commercial models, could have its refusal rate reduced to 3.6% through jailbreak-tuning.
The pattern is clear: safety alignment in current LLMs is not robust. It can be bypassed cheaply, quickly, and reliably. The question is why.
Safety Lives in the First Few Tokens
The answer, established by a series of papers from 2024 through 2026, is structural.
Qi et al. (ICLR 2025) showed that safety alignment primarily modifies a model’s generative distribution over the first few output tokens. When an aligned model refuses a harmful request, it does so by generating refusal tokens (“I cannot,” “I’m sorry”) at the very beginning of its response. The deeper layers of the model, where actual knowledge and reasoning reside, remain largely unchanged.
The paper proved this with a striking experiment: there exists a local optimum where simply promoting refusal prefixes in the first few tokens of an unaligned model improves its measured safety to levels comparable to a fully aligned model. Many aligned models have not learned why a request is harmful. They have learned to say “no” reflexively.
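The paper supports this with per-position divergence measurements: the divergence between an aligned model’s token distribution and its base model’s is large at the first output positions and near zero afterwards. A minimal sketch of that measurement on synthetic distributions (the shapes and numbers here are illustrative, not taken from the paper):

```python
import numpy as np

def per_position_kl(p_aligned, p_base, eps=1e-12):
    """KL(aligned || base) at each output token position.

    p_aligned, p_base: arrays of shape (positions, vocab), rows sum to 1.
    """
    p = np.clip(p_aligned, eps, 1.0)
    q = np.clip(p_base, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=1)

rng = np.random.default_rng(0)
vocab, positions = 50, 8
base = rng.dirichlet(np.ones(vocab), size=positions)

# Shallow alignment: only the first two positions are reshaped
# (mass pushed onto "refusal" tokens); later positions match the base.
aligned = base.copy()
for pos in range(2):
    aligned[pos] = 0.1 * base[pos]
    aligned[pos, :3] += 0.9 / 3     # concentrate mass on 3 refusal tokens
    aligned[pos] /= aligned[pos].sum()

kl = per_position_kl(aligned, base)
print(np.round(kl, 3))  # large at positions 0-1, ~0 afterwards
```

The flat tail is the vulnerability: past the first positions, the aligned model’s distribution is indistinguishable from the unaligned one.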
A follow-up analysis in March 2026 confirmed the gradient-level mechanism: safety gradients during RLHF training concentrate on positions where harmfulness is decided and vanish beyond them. Positions past the “harm horizon,” where the output’s harmfulness is already determined, receive zero gradient signal during training.
For practitioners: This is not an abstract research finding. If your production AI system uses a model aligned through RLHF or DPO (which includes most commercial and open-source aligned models), the safety behavior is concentrated in the shallowest layer of the model’s behavior. Every attack technique described below exploits this specific architectural weakness.
The diagram below traces a model through four stages: from a raw foundation model with no safety behavior, through alignment (where refusal tokens are learned in the first few positions), through fine-tuning (where gradient conflicts erode the safety subspace), to a degraded state where prefilling, suffix, and CoT attacks succeed. The dashed path on the right shows the attacker shortcut: 10 training examples and $0.20 bypass the entire fine-tuning stage and reach the same degraded state directly.
How Fine-Tuning Erodes Safety
When organizations fine-tune a safety-aligned model on domain-specific data, the gradient updates do not respect the boundary between “safety weights” and “utility weights.” Research from multiple institutions (arXiv:2601.10141) identified three geometric properties that explain safety degradation:
Low-rank safety subspace. Safety-related gradients occupy a compact, low-rank subspace within the model’s weight space. Utility gradients span a much higher-dimensional space. Random fine-tuning updates are statistically likely to interfere with safety parameters because the safety subspace is a small target in a large space.
Directional conflict. Safety and utility gradient directions are often negatively correlated. Improving task performance on domain data actively pushes weights away from safety-aligned configurations.
Cumulative erosion. Safety metrics deteriorate monotonically as fine-tuning epochs increase. There is no natural plateau or equilibrium. The longer you fine-tune, the less safe the model becomes.
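These three properties can be made concrete in a toy weight space. The sketch below (illustrative construction, not measurements from the paper) builds a low-rank “safety subspace,” a utility gradient negatively correlated with the safety direction by construction, and shows that repeated utility-only updates drain the safety component monotonically:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, rank = 100, 3   # full weight space vs. low-rank safety subspace

# Property 1: safety lives in a compact low-rank subspace.
basis, _ = np.linalg.qr(rng.normal(size=(dim, rank)))

def safety_component(w):
    """Norm of w's projection onto the safety subspace."""
    return np.linalg.norm(basis.T @ w)

w = basis @ np.ones(rank)                  # start fully "aligned"
safety_dir = w / np.linalg.norm(w)

# Property 2: the utility gradient is negatively correlated with the
# safety direction (here by construction: noise outside the subspace,
# minus a component along safety_dir).
noise = rng.normal(size=dim)
noise -= basis @ (basis.T @ noise)         # keep noise outside the subspace
g_utility = noise / np.linalg.norm(noise) - 0.3 * safety_dir
cos = g_utility @ safety_dir / np.linalg.norm(g_utility)   # negative

# Property 3: each fine-tuning step on utility erodes safety; no plateau.
history = [safety_component(w)]
for _ in range(20):
    w = w + 0.05 * g_utility               # one utility-only update
    history.append(safety_component(w))

print(f"cos(utility, safety) = {cos:.2f}")
print("safety norm every 5 steps:", [round(h, 2) for h in history[::5]])
```

The safety projection shrinks by a fixed amount every step: the “small target in a large space” argument, reduced to linear algebra.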
This means that even fine-tuning on perfectly benign data (customer service transcripts, internal documents, domain-specific workflows) compromises safety guardrails. The degradation is a side effect of the optimization process, not a reflection of harmful training content.
The Dataset Similarity Finding
An ICML 2025 paper provided the most precise explanation of when guardrails collapse most severely. The key variable is representational similarity between the upstream safety-alignment dataset and the downstream fine-tuning data.
When the fine-tuning data is representationally similar to the safety-alignment data, the model overfits during fine-tuning, eroding safety measures. High-similarity clusters were 15.7% more harmful than explicit harmful-data anchors. Conversely, low similarity between alignment and fine-tuning datasets reduced harmfulness scores by up to 10.33%.
The practical implication: an enterprise fine-tuning on customer service transcripts (which share linguistic patterns with safety-training refusal examples: boundary-setting, de-escalation, sensitive topics) faces higher safety degradation risk than one fine-tuning on structured code documentation.
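That comparison can be screened cheaply before fine-tuning. The sketch below uses bag-of-words cosine similarity between dataset centroids as a crude stand-in for representational similarity (the paper measures similarity in the model’s representation space; this toy, with invented snippets, only illustrates the idea):

```python
from collections import Counter
import math

def centroid(texts):
    """Mean bag-of-words vector for a dataset, as token -> weight."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Invented snippets standing in for the three datasets.
alignment_refusals = [
    "i cannot help with that request",
    "i am sorry but i cannot assist with this",
]
support_transcripts = [
    "i am sorry you are having this issue i cannot refund that order",
    "i cannot share account details but i can help reset your password",
]
code_docs = [
    "returns the parsed config object raises valueerror on bad input",
    "the client retries failed requests with exponential backoff",
]

sim_support = cosine(centroid(alignment_refusals), centroid(support_transcripts))
sim_code = cosine(centroid(alignment_refusals), centroid(code_docs))
print(f"support vs refusals: {sim_support:.2f}, code docs vs refusals: {sim_code:.2f}")
```

Customer-service language overlaps refusal language (apologies, boundary-setting) far more than code documentation does, which is exactly the risk gradient the finding predicts.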
The Attack Surface
Understanding the attack surface clarifies why the problem is difficult to solve. Each technique exploits the shallow alignment vulnerability from a different angle:
Prefilling attacks. The adversary supplies the first few tokens of a harmful response (e.g., “Sure, here is how to…”). Because safety alignment concentrates in those initial tokens, providing them explicitly bypasses the refusal mechanism entirely.
Adversarial suffix attacks (GCG, AutoDAN). Gradient-based optimization generates token sequences appended to harmful prompts that cause the model to comply. These suffixes are often gibberish to humans but reliably bypass alignment.
Multi-turn conversational attacks (Crescendo). Rather than attacking in a single prompt, the adversary gradually escalates through a multi-turn conversation. Each turn is individually innocuous, but collectively they steer toward harmful output. This exploits the fact that safety alignment is typically evaluated on single-turn interactions. Our guardrails article covers the architectural defenses against these patterns, but the underlying alignment vulnerability makes those defenses necessary in the first place.
Chain-of-Thought exploitation. For reasoning-capable models, injecting reasoning guidance can steer the model to “reason its way around” safety constraints. Attack success rates jump by 3.34x on average with this technique.
Fine-tuning-based attacks. Training the model on data designed to override safety alignment, ranging from explicit harmful instruction-following data to subtle “obedience training” that prioritizes compliance over safety.
Larger Models, Larger Risks
A counterintuitive finding from data poisoning scaling research: larger LLMs are more susceptible to safety degradation, not less. The natural trend is toward “greater harmfulness” as model scale increases.
The explanation is straightforward: the same capability that makes larger models better at learning from few examples also makes them better at learning harmful behaviors from few poisoned examples. For enterprises tracking the progression from 7B to 70B to 405B+ parameter models, each generation requires proportionally stronger safety measures during fine-tuning.
What This Means for Your Fine-Tuning Pipeline
If your organization fine-tunes foundation models, three facts from this research should shape your AI governance posture:
For practitioners: These are not edge cases. Every enterprise that fine-tunes a foundation model inherits this risk profile, whether the fine-tuning data contains harmful content or not.
- Your vendor’s safety evaluation does not apply to your fine-tuned model. When OpenAI, Google, or Anthropic publish safety results, those describe the base model before customer fine-tuning. The moment you fine-tune, those evaluations no longer apply. No major vendor currently provides post-fine-tuning safety guarantees.
- Benign fine-tuning data still degrades safety. You do not need adversarial training data to compromise alignment. Domain-specific fine-tuning on legitimate business data interferes with safety weights through the geometric properties described above.
- Standard safety benchmarks will not catch the degradation. Your fine-tuned model will likely still pass the same safety benchmarks it passed before fine-tuning. The benchmarks test known attack patterns. The degradation manifests against novel attacks and edge cases.
These findings translate into a concrete action checklist:
| Priority | Action | Why It Matters |
|---|---|---|
| Immediate | Identify every fine-tuned model in production | You cannot assess safety degradation for models you do not know about |
| Immediate | Assess dataset similarity between fine-tuning data and the base model’s alignment training | High similarity predicts higher safety degradation |
| This quarter | Establish pre/post-fine-tuning safety evaluation as a standard workflow | Standard benchmarks will pass; adversarial testing is required |
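The pre/post-fine-tuning evaluation row can be wired into a deployment pipeline as a simple regression gate. A minimal sketch (the probe set, response scorer, and threshold are placeholders you would replace with a real adversarial evaluation harness):

```python
from dataclasses import dataclass

@dataclass
class SafetyReport:
    refusal_rate: float        # fraction of harmful probes refused
    attack_success_rate: float

def evaluate(model_respond, probes, is_refusal):
    """Run harmful probes through a model and score refusals.

    model_respond: callable prompt -> response text (placeholder).
    is_refusal:    callable response -> bool (placeholder scorer).
    """
    refusals = sum(is_refusal(model_respond(p)) for p in probes)
    rate = refusals / len(probes)
    return SafetyReport(refusal_rate=rate, attack_success_rate=1 - rate)

def gate(pre: SafetyReport, post: SafetyReport, max_drop=0.05):
    """Fail the pipeline if fine-tuning eroded refusal behavior."""
    drop = pre.refusal_rate - post.refusal_rate
    return drop <= max_drop

# Toy stand-ins for a base model and its fine-tuned version.
probes = ["harmful probe 1", "harmful probe 2", "harmful probe 3", "harmful probe 4"]
base = lambda p: "I cannot help with that."
tuned = lambda p: "Sure, here is how" if p.endswith(("1", "2")) else "I cannot help with that."
is_refusal = lambda r: r.startswith("I cannot")

pre = evaluate(base, probes, is_refusal)
post = evaluate(tuned, probes, is_refusal)
print(pre, post, "gate passed:", gate(pre, post))
```

The point of the gate is the comparison, not the absolute score: a fine-tuned model that still clears a static benchmark can still show a large refusal-rate drop against the same probe set it faced before fine-tuning.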
Next: Part 2: The Benchmark Illusion and the Safety Tax covers why passing safety benchmarks means almost nothing, and the 10 techniques competing to fix alignment.
Sources & References
- Fine-tuning Aligned LLMs Compromises Safety (Qi et al., ICLR 2024)
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., ICLR 2025)
- Badllama 3: Removing Safety Finetuning from Llama 3 in Minutes
- GPT-4o Guardrails Gone (FAR AI, 2025)
- Why LLM Safety Guardrails Collapse After Fine-tuning (Hsiung et al., ICML 2025)
- Why Is RLHF Alignment Shallow? A Gradient Analysis (March 2026)
- Data Poisoning in LLMs: Scaling Laws
- BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2