From Vibe Coding to Agentic Engineering: What Karpathy's Shift Means for Data Work
Andrej Karpathy hasn't written a line of code since December. His 80/20 flip from manual coding to agent orchestration is not a personal anecdote. It is the clearest signal yet that the value in data and AI work has shifted from execution to judgment.
The Flip
In December 2025, Andrej Karpathy noticed something had changed. For twenty years, he had written code the way most practitioners do: hands on keyboard, thinking in syntax, building line by line. Then, over the span of weeks, his workflow inverted.
“I went from 80/20 to 20/80 of writing code by myself versus just delegating.”
By March 2026, he told Sarah Guo on the No Priors podcast that he had not typed a line of code since December. He was running tmux grids of agents with watcher scripts to keep them looping. His description of the shift was blunt:
“I don’t think a normal person actually realizes that this happened or how dramatic it was.”
This is not a story about one person’s productivity hack. Karpathy is the former director of AI at Tesla, a founding member of OpenAI, and the creator of some of the most widely used AI education materials in the world. When someone with Karpathy’s background changes workflow this dramatically, it is worth treating as a leading indicator.
What he describes has direct implications for everyone building data and AI systems. The shift from writing code to orchestrating agents is not just a change in tooling. It is a change in what makes a practitioner valuable.
Three Eras in Eighteen Months
Drawing on Karpathy’s observations across his posts and this interview, I would frame the progression through six key moments over eighteen months:
2024: Manual coding. Write code yourself. GitHub Copilot and similar tools serve as autocomplete on steroids. The human is the author; the AI is the assistant. Every line passes through your hands.
Early 2025: Vibe coding emerges. Karpathy coins the term. Tools like Cursor and Windsurf launch, crossing a usability threshold that makes natural language code generation practical for daily work. Describe what you want; the AI builds it. The human provides intent; the AI provides implementation.
December 2025: The flip. Agents cross what Karpathy calls a “coherence threshold.” His workflow inverts from 80% manual / 20% agent to 20% manual / 80% agent. This happens in weeks, not months. The shift is not gradual; it is a phase change.
March 2026: Agentic engineering. Karpathy runs tmux grids of agents with watcher scripts. AutoResearch produces ~700 experiments in two days. He describes the new workflow: “Code’s not even the right verb anymore. But I have to express my will to my agents.” The human defines objectives, evaluation criteria, and quality gates. The AI handles execution across parallel workstreams.
Where this is heading. By mid-2026, the pattern points toward agent-to-agent collaboration: specialized agent teams (code agent, test agent, deploy agent, quality agent) coordinating autonomously, with humans overseeing at checkpoints rather than directing every step. By early 2027, self-improving agent workflows will likely move from early adopters to standard practice: agents that optimize their own prompts, search strategies, and evaluation criteria based on accumulated feedback from prior runs.
The distinction between vibe coding and agentic engineering matters. Vibe coding is casual: you describe a feature, the AI builds it, you ship it. Agentic engineering is structured: you define a program spec (Karpathy uses a program.md file), set constraints, establish success metrics, and run agents against those specifications with feedback loops.
The difference is the presence of judgment in the loop. In my view, vibe coding works for prototypes; agentic engineering works for production.
The Karpathy Loop: Hundreds of Experiments, No Manual Edits
The clearest demonstration of agentic engineering is Karpathy’s AutoResearch project: a small repo centered on train.py and program.md that defines the research objective. The agent:
- Reads the training code
- Forms a hypothesis about what to optimize
- Modifies the code
- Runs a 5-minute experiment
- Evaluates the result against a single metric (validation bits per byte)
- Keeps the change if it improved, reverts if it did not
- Repeats
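The loop above is simple enough to sketch in a few lines of Python. This is a toy illustration, not Karpathy's actual code: `run_experiment` stands in for a real 5-minute training run, and the "hypothesis" is reduced to a random parameter tweak. The keep-if-improved / revert-if-not logic is the part that matters.

```python
import random

def run_experiment(params):
    """Toy stand-in for a timed training run: lower score is better.
    In the real loop this step would modify train.py and launch training,
    then report validation bits per byte."""
    return sum(params.values()) + random.uniform(-0.05, 0.05)

def karpathy_loop(params, n_experiments=50, seed=0):
    """One agent, one metric, fixed budget: keep a change if the metric
    improves, revert it if it does not, repeat."""
    random.seed(seed)
    best_score = run_experiment(params)  # baseline measurement
    for _ in range(n_experiments):
        # Form a hypothesis: tweak one knob at random.
        key = random.choice(list(params))
        candidate = dict(params, **{key: params[key] * random.uniform(0.8, 1.2)})
        score = run_experiment(candidate)          # run the timed experiment
        if score < best_score:                     # evaluate against the single metric
            params, best_score = candidate, score  # keep the change
        # otherwise revert: the candidate is simply discarded
    return params, best_score
```

The human's fingerprints are only on the arguments: which parameters are in play, what the metric is, and how many experiments the budget allows.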
In roughly two days, the agent ran ~700 experiments and found ~20 improvements that produced an 11% training speedup when applied to a larger model. The agent discovered architecture tweaks (like reordering QK Norm and RoPE) that Karpathy, with two decades of experience, had not tried.
Analyst Janakiram MSV named this the “Karpathy Loop”: one agent, one file, one metric, fixed time constraint. Karpathy’s vision for scaling it: not a single agent running in a loop, but “a research community” of agents exploring different optimizations simultaneously, with humans contributing “optionally at the periphery.”
The key insight: the loop works because there is a clear, measurable objective. The agent does not need taste or judgment about what to optimize. It needs a metric and permission to experiment. The human’s contribution is defining the metric, setting the constraints, and evaluating whether the results are meaningful in context.
“Everything Feels Like a Skill Issue”
This is the observation from the interview that carries the most weight for practitioners:
“Everything feels like a skill issue. You just haven’t found a way to string it together.”
Karpathy argues that many current agent failures feel like orchestration skill issues, though he also acknowledges the systems are still rough around the edges. The pattern he describes: imprecise prompts, missing context, wrong tool configuration, insufficient evaluation criteria. The model is often capable; the orchestration is not.
This reframes the entire conversation about AI reliability. The AgentDrift study found that across 1,563 contaminated tool-output turns and seven LLMs, no agent ever questioned the reliability of its tool data. The agents were not stupid. They were doing exactly what they were instructed to do: trust the context and reason within it. The failure was in the architecture, not the model.
In the series I published this week, I mapped this problem in detail: the context window is a data pipeline with no standardized quality controls. The architectural gap sits at the boundary between tool results and the context window. And the human capability needed to fill that gap, what I call judgment-in-the-loop, is exactly what Karpathy describes when he says agent failures are skill issues.
The skill is not coding. The skill is knowing what correct looks like, structuring the agent’s context so it can get there, and catching the errors it cannot see.
What This Means for Data Work
Karpathy’s observations are about coding agents. But the pattern extends directly to data and AI work.
Data platform design is becoming agent orchestration. Karpathy predicts that “there may be an overproduction of custom bespoke apps that shouldn’t exist because agents can crumble them up, with everything becoming exposed API endpoints and agents serving as the glue.” Replace “apps” with “dashboards” or “data products” and the implication is clear: the interface layer between humans and data systems will increasingly be agents, not UIs.
Context quality becomes the critical data discipline. If agents are the primary consumers of data platform outputs, then the quality of what enters the agent’s context window is the quality of the decision. Every stale API response, every truncated query result, every contradictory data point that enters the context without validation is a potential failure that no amount of model capability can fix. The six Data Quality dimensions that we built into warehouses and pipelines need equivalents inside the agent architecture.
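One way to make that boundary concrete is a small validation gate that every tool result must pass before it is injected into the context window. The function and field names below are hypothetical, a sketch of the idea rather than any particular framework's API; the three checks mirror the failure modes named above (staleness, truncation, empty results):

```python
from datetime import datetime, timedelta, timezone

def validate_tool_output(result, max_age=timedelta(hours=1), min_rows=1):
    """Quality gate at the tool-result / context-window boundary.
    Returns (ok, issues); the orchestrator decides whether to inject
    the result, retry the tool call, or flag the problem to the agent."""
    issues = []
    ts = result.get("fetched_at")
    if ts is None or datetime.now(timezone.utc) - ts > max_age:
        issues.append("stale_or_missing_timestamp")  # freshness check
    if result.get("truncated"):
        issues.append("truncated_payload")           # completeness check
    if len(result.get("rows", [])) < min_rows:
        issues.append("empty_result")                # validity check
    return (not issues), issues
```

The point is not these particular checks; it is that the checks run outside the model, so a contaminated result never becomes trusted context in the first place.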
The “LLM apps” layer is where data practitioners add value. Karpathy has drawn a distinction (in his 2025 year-in-review and subsequent posts) between LLM labs that “graduate the generally capable college student” and LLM apps that “organize, finetune, and animate teams of them into deployed professionals in specific verticals.” The LLM apps layer is where domain knowledge, Data Governance, Context Engineering, and quality controls live. This is the layer that data architects, data engineers, and governance practitioners should be building toward.
Evaluation is the new execution. AutoResearch works because it has a clear metric. Most data and AI workflows do not. Defining what “correct” means for a given context, establishing evaluation criteria that catch “almost right” outputs, and building feedback loops from production outcomes back to agent configuration: these are the skills that compound in the agentic era.
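Writing “correct” down before the agent runs can be as lightweight as a list of named predicate checks. The check names and output fields below are illustrative, not a real schema; the value is that an “almost right” output fails a specific named check instead of producing a vague feeling:

```python
def evaluate(output, checks):
    """Run named predicate checks against an agent output.
    Returns (passed_all, per_check_report)."""
    report = {name: check(output) for name, check in checks}
    return all(report.values()), report

# Example criteria, written down *before* the agent runs (illustrative names):
checks = [
    ("row_count_positive", lambda o: o["row_count"] > 0),
    ("totals_reconcile",   lambda o: abs(o["total"] - sum(o["parts"])) < 1e-6),
    ("schema_matches",     lambda o: set(o) >= {"row_count", "total", "parts"}),
]
```

The report doubles as a feedback artifact: logged across runs, it shows which criteria agents fail most often, which is exactly the data a feedback loop needs.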
The Long Game
The agents are powerful. They are also unreliable in domain-specific ways that require domain-specific expertise to detect and correct. Karpathy himself has noted in discussions following the interview that working through the issues with agents will take years, not quarters. The work of building quality, governance, and reliability into agent systems is not a product launch. It is a discipline that will develop over a long horizon.
For data practitioners, this is not a threat. It is a sustained opportunity. The organizations that figure out how to apply Data Quality, Data Governance, and domain expertise inside agent architectures will outperform the ones that treat agents as black boxes that magically produce correct output.
The emotional dimension of this shift, what it means for professional identity when your core skill becomes automated, is worth sitting with separately. I wrote about that in When Your Core Skill Becomes Free, a companion to this piece.
The value used to live in writing the query. It now lives in knowing whether the answer is right.
Do Next
If you have not started with agents yet
Most practitioners are here. Your organization may not allow AI tools at work, or you have not found the time. Start on your own terms.
| Priority | Action | Why it matters |
|---|---|---|
| This weekend | Pick one personal project (a script, a side project, a data analysis) and build it entirely by directing an agent. Use Claude Code, Cursor, or Windsurf on a free or personal tier. | You need first-hand experience with the orchestration shift before you can evaluate it. Reading about agentic engineering is not the same as feeling the 80/20 flip yourself. |
| This week | Read Karpathy’s program.md for AutoResearch. Study how he structures agent instructions: objective, constraints, success metric, time limit. | The quality of your agent’s output is directly proportional to the quality of your instructions. This file is the best example of what “directing an agent” looks like in practice. |
| This month | Start a “hoarding” habit. Every time you solve a problem at work (a tricky SQL pattern, a pipeline fix, a governance decision), write it down in a personal note or repo. | When you eventually use agents at work, these notes become seeds the agent can grow into solutions. Domain knowledge you can articulate is domain knowledge an agent can use. Without it, you start from zero every session. |
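The four elements to study in that file (objective, constraints, success metric, time limit) map onto a simple template. The wording below is an illustrative sketch of that structure, not a reproduction of the actual program.md:

```markdown
# Objective
Reduce validation bits per byte on the held-out set.

# Constraints
- Modify train.py only; do not touch the data pipeline.
- One change per experiment; revert any change that does not improve the metric.

# Success metric
Validation bits per byte, measured at the end of each run.

# Time limit
5 minutes per experiment.
```

A file like this is the "expressing my will to my agents" step: everything the agent needs to run unattended, and nothing else.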
If you are using agents at work
| Priority | Action | Why it matters |
|---|---|---|
| This week | Audit one workflow where agent-generated output goes to production without domain expert review. Log what the agent produced and check its accuracy manually. | The “skill issue” framing means your agent failures are likely solvable with better instructions and context. But first you need to see where the failures are. |
| This month | Define evaluation criteria for your most critical agent-assisted workflow. Write down what “correct” means before the agent runs, not after. | Evaluation is the new execution. Without clear criteria, you cannot distinguish “almost right” from right. This is the single highest-leverage investment for any team using agents. |
| This quarter | Build a shared library of agent prompts for your team’s common tasks: schema migrations, quality checks, pipeline debugging, report generation. | Individual productivity gains from agents do not compound unless the patterns are shared. A team library turns one person’s discovery into everyone’s capability. |
Preparing for what is coming
| Priority | Action | Why it matters |
|---|---|---|
| This quarter | Design one workflow where a quality-checking agent validates the output of a code-generating agent before a human reviews it. Even a prototype. | By mid-2026, agent-to-agent collaboration will be standard. Building a two-agent chain now (one generates, one validates) gives you experience with the coordination patterns before they become table stakes. |
| Next quarter | Set up a feedback loop where agent performance data feeds back into agent instructions. Track which prompts produce good results and which produce errors; update the prompts based on the data. | Self-improving workflows are the early-2027 prediction. The foundation is simple: log what works, analyze the patterns, update the instructions. Start the logging now so you have data to learn from when the tooling matures. |
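Both rows above reduce to the same coordination skeleton: one agent produces, another checks, and failures feed back as context for the next attempt. A minimal sketch, with hypothetical `generate` and `validate` callables standing in for real agents:

```python
def two_agent_chain(task, generate, validate, max_rounds=3):
    """Generator agent produces a draft; validator agent checks it.
    Failed validations feed back as context for the next round.
    A human reviews only after the chain converges or gives up."""
    feedback = None
    draft = None
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)   # code-generating agent
        ok, feedback = validate(draft)     # quality-checking agent
        if ok:
            return {"draft": draft, "rounds": round_no, "passed": True}
    return {"draft": draft, "rounds": max_rounds, "passed": False}
```

Logging `rounds`, `passed`, and the feedback strings per prompt is the foundation for the self-improving loop in the second row: the data tells you which instructions converge quickly and which never do.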
This article is related to The Practitioner’s Guide to AI Agents, a nine-part series on building, evaluating, and improving AI agents.
Sources & References
- No Priors Podcast: Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI (2026)
- Fortune: OpenAI cofounder hasn't written a line of code in months (2026)
- Fortune: The Karpathy Loop (2026)
- GitHub: karpathy/autoresearch (2026)
- ShiftMag: Karpathy Admits Software Development Has Changed for Good (2026)
- AgentDrift: Tool-Output Contamination in AI Agents (2026)
- Karpathy: 2025 LLM Year in Review (2025)
- Hacker News: Discussion on Karpathy's March of Nines (2026)
- Chroma Research: Context Window Performance Degradation (2025)