AI Products & Strategy · March 25, 2026 · 15 min read

The Self-Improving Agent: From Static Prompts to Learning Systems

Most AI agents run the same prompt every time. The best ones evolve. This article maps the spectrum from static to self-improving agents, introduces the inner loop / outer loop architecture, and walks through a real system that learns from feedback weekly. Pike's Rules 3-4 set the boundary: start simple, add complexity only when measurement demands it.

By Vikas Pratap Singh
#ai-agents #self-improving-agents #agentic-engineering #learning-systems #judgment-in-the-loop #context-engineering

Part 11 of 12: The Practitioner’s Guide to AI Agents

The earlier articles in this series established what agents are (1), when not to build one (2), the principles that guide their design (3), how to build one (4), why context quality determines output quality (5-6), how to know if your agent works (7), where to put the guardrails (8), how to see what agents actually do (9), and when to go multi-agent (10). This article is about what happens after your agent is running: how it gets better.

Start Simple. Stay Simple. Add Complexity Only When Forced.

Pike’s Rule 3: “Fancy algorithms are slow when n is small; and n is usually small.” Pike’s Rule 4: “Fancy algorithms have big constants. Until you know that n is frequently going to be big, don’t get fancy.”

Applied to self-improving agents, these rules say the same thing: your first learning loop should be embarrassingly simple. One metric. One feedback channel. One file that stores what the agent has learned. If that setup stops being sufficient, you will know, because you will have measured it.

The temptation is to build a sophisticated memory system, a vector database, a multi-agent reflection chain, an autonomous prompt optimizer. Resist it. Every layer of sophistication is a layer that can fail silently. The agent that evolves a bad preference without human review is worse than the agent that never learns at all, because the static agent produces predictable output. The silently drifting agent produces output that degrades in ways you will not notice until something breaks.

The Spectrum: Static to Self-Improving

Not every agent needs to learn. Most should not. Here is the progression; most production agents should stay at Level 1 or 2.

Level 1: Static agent. The prompt is hardcoded. The agent runs the same instructions every time. This is fine for well-defined tasks with stable requirements: format this data, run this query, generate this report. If the task does not change, the agent should not change either.

Level 2: Parameterized agent. The prompt reads from a configuration file. The instructions are fixed, but the parameters (which sources to check, what thresholds to apply, which topics to prioritize) are externalized. A human updates the config when needs change. This is the right default for most production agents. Stable behavior with manual tunability.

Level 3: Self-improving agent. The agent’s configuration evolves based on measured feedback. Not the core instructions (those stay human-authored), but the parameters: search weights, topic priorities, people rankings, format preferences. The agent proposes changes; a human approves them. The key distinction from Level 2 is that the proposals come from data, not from intuition.

| Level | What changes | Who changes it | Risk |
|---|---|---|---|
| Static | Nothing | Nobody | Stale if requirements shift |
| Parameterized | Config values | Human, manually | Human forgets to update |
| Self-improving | Config values | Agent proposes, human approves | Silent drift if approval is rubber-stamped |

The progression is not a maturity ladder. Level 1 is not inferior to Level 3. A static agent running a well-defined task is simpler, more predictable, and easier to debug. Move to Level 3 only when you have evidence that the agent’s output quality varies based on parameters that could be optimized from data.
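Level 2 in practice can be as small as a prompt template plus a JSON file. A minimal sketch; the file name `agent_config.json` and the config keys are illustrative, not from any particular system:

```python
import json
from pathlib import Path

# Defaults keep the agent runnable even before a config file exists.
DEFAULTS = {"sources": ["arxiv", "blogs"], "min_score": 0.5, "topics": []}

def load_config(path: str = "agent_config.json") -> dict:
    """Read externalized parameters, falling back to defaults if the file is absent."""
    p = Path(path)
    config = {**DEFAULTS}
    if p.exists():
        config.update(json.loads(p.read_text()))
    return config

def build_prompt(config: dict) -> str:
    """Fixed instructions, variable parameters: the Level 2 split."""
    return (
        "Summarize new items from these sources: "
        + ", ".join(config["sources"])
        + f". Skip anything scoring below {config['min_score']}."
    )
```

Changing behavior now means editing `agent_config.json`, not the code, which is exactly what makes a later learning loop possible.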

The Karpathy Loop: One Agent, One File, One Metric

The Karpathy Loop, which I analyzed in the agentic engineering article, is the purest expression of Pike’s Rule 3 applied to self-improvement.

Karpathy’s AutoResearch project defines the pattern: a single agent reads a training script, forms a hypothesis, modifies the code, runs a five-minute experiment, evaluates the result against one metric (validation bits per byte, a single number measuring how well the model predicts text), keeps the change if it improved, reverts if it did not, and repeats. In roughly two days, the agent ran ~700 experiments and found ~20 improvements that produced an 11% training speedup on a larger model.

The design is deliberately minimal. One agent. One file (train.py). One metric. A fixed time constraint per experiment. No memory system, no reflection chain, no multi-agent coordination. The agent does not remember what it tried three hundred experiments ago; it just reads the current code and tries something.

That simplicity is the point. ~700 experiments with a complex system would produce ~700 opportunities for compound bugs. The Karpathy Loop works because there is almost nothing that can go wrong silently.
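The shape of the loop fits in a dozen lines. This is a schematic of the pattern as described, not Karpathy's actual code; the toy metric below stands in for validation bits per byte, where lower is better:

```python
import random

def karpathy_loop(score_fn, state: float, iterations: int = 200, seed: int = 0) -> float:
    """Keep-if-better loop: one state, one metric, no memory of past tries."""
    rng = random.Random(seed)
    best = score_fn(state)
    for _ in range(iterations):
        candidate = state + rng.uniform(-1.0, 1.0)  # form a hypothesis
        score = score_fn(candidate)                  # run the experiment
        if score < best:                             # metric improved: keep the change
            state, best = candidate, score
        # else: revert, which here just means leaving state untouched

    return state

# Toy metric standing in for validation bits per byte: distance from an optimum.
final = karpathy_loop(lambda x: abs(x - 3.0), state=0.0)
```

The agent never inspects its history; each iteration sees only the current state and the current score, which is what keeps the failure surface small.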

OpenAI’s harness engineering experiment follows the same pattern at a different scale. Their coding agents write code, custom linters check it, teaching error messages explain what went wrong, and the agent retries with the fix instructions in its context. The linter feedback loop is an inner loop: every iteration improves the output without any persistent state change. The agent does not remember what it got wrong three tasks ago. It just reads the current error message and fixes the current code. Like the Karpathy Loop, the power comes from fast iteration with tight feedback, not from memory or sophistication.
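The retry-with-teaching pattern can be sketched the same way. The `generate` and `lint` stand-ins below are hypothetical; the point is that the error message travels through context within one task, not through persistent state:

```python
def retry_with_teaching(generate, lint, max_attempts: int = 3):
    """Inner loop: retry with the linter's teaching message in context."""
    context = ""
    for _ in range(max_attempts):
        output = generate(context)
        error = lint(output)
        if error is None:
            return output
        context = f"Previous attempt failed: {error}"  # teaching message, this run only
    return None

# Toy stand-ins: the "agent" fixes its output once it sees the lint message.
attempts = []
def generate(ctx):
    attempts.append(ctx)
    return "fixed" if ctx else "broken"
def lint(out):
    return None if out == "fixed" else "use 'fixed' instead of 'broken'"

result = retry_with_teaching(generate, lint)
```

Nothing survives the function call: the next task starts with an empty context again.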

Inner Loop, Outer Loop: The Three-Layer Pattern

When you do need the agent to learn, the simplest architecture that works separates execution from learning into two distinct loops sharing a persistence layer.

The inner loop runs on every execution. It reads the current configuration, performs the task, produces output, and logs metrics. It does not modify anything about its own behavior. It is a pure executor.

The outer loop runs periodically (weekly, biweekly, monthly). It reads the accumulated metrics and feedback, analyzes patterns, and proposes changes to the configuration. It is a pure analyst.

The persistence layer sits between them: configuration files, feedback stores, metric logs, output archives. Both loops read from it. Only the outer loop (after human approval) writes to the configuration.
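In code, the separation is just two functions that never write to the same place. A minimal sketch; the file layout, metric fields, and stand-in task are illustrative:

```python
import json
from pathlib import Path

def inner_loop(config_path: Path, log_path: Path) -> str:
    """Pure executor: read config, do the task, log metrics. Never writes config."""
    config = json.loads(config_path.read_text())
    output = f"briefing covering {len(config['topics'])} topics"  # stand-in for the real task
    with log_path.open("a") as log:
        log.write(json.dumps({"topics": config["topics"], "chars": len(output)}) + "\n")
    return output

def outer_loop(log_path: Path) -> dict:
    """Pure analyst: read accumulated logs, propose a config change. Applies nothing."""
    runs = [json.loads(line) for line in log_path.read_text().splitlines()]
    seen = {t for run in runs for t in run["topics"]}
    return {"proposed_topics": sorted(seen)}  # a human reviews this before it touches config
```

Only a human, after reviewing the outer loop's proposal, writes the configuration file that the inner loop reads.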

Figure: the self-improving agent architecture, showing the inner loop (per-run execution), the outer loop (periodic learning), and the persistence layer connecting them.

This separation matters for three reasons:

  1. Debuggability. When output quality drops, you know where to look. If a single run produced bad output, the inner loop has a bug. If output quality degraded gradually over weeks, the outer loop proposed a bad configuration change.

  2. Rollback. The configuration is a file. You can version it, diff it, and revert it. If the agent’s learned preferences produce worse results, roll back the config to last week’s version. Try doing that with a neural network’s weights.

  3. Human oversight. The outer loop proposes; a human approves. There is no moment where the agent unilaterally changes its own behavior. This is Pike’s Rule 4 in practice: the simpler the change mechanism, the fewer ways it can go wrong.

A Worked Example: This Blog’s Briefing Agent

I run a daily briefing agent on this blog. Every morning at 6:03 AM CT, it searches for new content from people and topics I care about, filters and ranks the results, writes a briefing, deploys it to a preview page, and emails me a link. I read it on my phone during coffee and leave feedback: “expand this,” “skip this person,” “more papers like this one.”

That feedback goes to S3. Every Sunday, a learning step reads the accumulated feedback, counts which people and topics generated engagement (positive mentions, items I expanded into research docs, explicit skip requests), and proposes updates to a preferences.md file. People I consistently engage with get boosted. People I consistently ignore get demoted. Topics that convert into research docs or published articles get higher priority.

The roadmap for this system phases learning capabilities so that each phase has enough data before the next one starts:

| Phase | Capability | Confidence | Prerequisite |
|---|---|---|---|
| 1 | Basic preferences: boost/demote people, topic affinities, explicit rules | 100% | Live (running now) |
| 2 | Engagement scoring: count which people and topics generate feedback | 90-95% | 2-3 weeks of daily briefings with feedback |
| 3 | Trajectory analysis: which briefing items became research docs or published articles | 80% | 30+ briefings, some published articles |
| 4 | Citation discovery: auto-detect new people to follow from who your boosted voices reference | 65% | Boosted people identified from Phase 2 |
| 5 | Query evolution: rewrite underperforming search queries based on engagement data | 60% | 60+ briefings |
| 6 | Format preference: learn whether you engage more with quotes vs. summaries, papers vs. blog posts | 70% | 30+ data points per format type |
| 7 | Editorial flagging: “this looks like it could become a Teardown” | 50% | Substantial published article history |

The confidence levels are honest estimates, not aspirational targets. Phase 7 may never ship. My own judgment after reading the briefing is likely faster and better than any automated suggestion. The point is not to build all seven phases. The point is to know which phase you are ready for and which ones require more data before they deliver value.

The system is deliberately not autonomous. The agent proposes preference changes. I review and approve them. The agent flags candidate people to add to the watch list. I decide whether to add them. The learning step runs on my machine, not in the cloud, using the same Claude Code subscription I use for everything else. There is no API cost, no Lambda function, no database. Just cron, markdown files, and S3.

This is Level 3 on the spectrum, but it is the simplest possible version of Level 3. One learning step, one preferences file, one feedback channel. Pike would approve.

Self-Evolving Agents: What the Research Shows

The research community has been building more sophisticated versions of this pattern. Three projects stand out.

If you are new to agents, skip this section. It covers academic research that informs the patterns above. Return to it later if you want the theoretical foundation.

ACE (Agentic Context Engineering), described in arXiv 2510.04618, formalizes the idea that an agent’s context window needs engineered quality controls. ACE treats context as the agent’s operating system and proposes five quality criteria (relevance, sufficiency, isolation, economy, provenance) that an outer loop can optimize against.

IBM’s Trajectory-Informed Memory (arXiv 2603.10600) takes a different approach: instead of storing raw experiences, it distills successful action sequences into reusable “trajectory memories” that the agent can retrieve in similar future situations. The insight is that what the agent did matters less than what worked.

EvoAgentX (GitHub) is an open-source framework for building agents that evolve their own prompts, tool configurations, and workflow graphs based on task performance. It implements the inner/outer loop pattern with genetic-algorithm-inspired mutation and selection of agent configurations.

All three share a common principle: the learning happens to the configuration, not to the model. The LLM itself does not change. What changes is what the LLM receives as context and instructions. This is why the inner/outer loop architecture works: it treats the agent’s behavior as a function of its configuration, and the configuration is a human-readable, version-controlled artifact.

A Note on Multi-Agent Systems

Most of this series assumes a single agent. That is deliberate. A single agent with well-engineered context handles the majority of tasks people reach for multi-agent architectures to solve. Yet across client engagements I have repeatedly seen teams go multi-agent before proving that a single agent with better context cannot do the job. Multi-agent complexity is easy to add and genuinely hard to remove once teams have built around it.

If you do go multi-agent, keep learning loops per-agent. Shared preference state between agents creates coupling that makes debugging nearly impossible. Each agent should have its own preferences file, its own feedback pipeline, and its own learning cadence. Coordinate at the human review level, not at the agent level.

Where Automation Stops: Judgment-in-the-Loop

Every self-improving system needs a boundary between what the agent decides and what a human decides. I defined judgment-in-the-loop in a dedicated article as the ongoing human responsibility of ensuring AI context is correct, complete, and current. That definition applies directly to self-improving agents: the agent can analyze patterns in feedback data, but it cannot judge whether those patterns reflect genuine preferences or temporary noise.

The boundary is straightforward: the agent proposes, the human approves. This applies to every evolved artifact: preference rankings, search queries, topic weights, format adjustments. The agent’s proposal is data. The human’s approval is judgment.

In my briefing agent, this boundary is enforced structurally. The learning step writes proposed changes to a diff that I review before merging into preferences.md. If the agent proposes boosting someone I find uninteresting, I override it. If the agent proposes demoting a topic because I have not engaged with it this month, I can recognize that the topic is important even though the last two weeks were quiet. The agent sees signal. I see context.

Code Example: A Preference Learning Pipeline

Here is what the learning step looks like in simplified form. Three functions handle the full cycle: loading the current preferences, updating them from feedback, and writing the proposed diff for human review.

Python: preference learning pipeline
import copy
import json
from collections import Counter
from pathlib import Path

def load_preferences(filepath: str) -> dict:
    """Read the current preferences file."""
    path = Path(filepath)
    if not path.exists():
        return {"people": {}, "topics": {}}
    return json.loads(path.read_text())

def update_preferences(preferences: dict, feedback_items: list[dict]) -> dict:
    """Process feedback and adjust scores."""
    person_signals = Counter()
    topic_signals = Counter()
    weights = {"expand": 2, "positive": 1, "skip": -2, "negative": -1}

    for item in feedback_items:
        w = weights.get(item["action"], 0)  # unknown actions carry no signal
        if person := item.get("person"):
            person_signals[person] += w
        if topic := item.get("topic"):
            topic_signals[topic] += w

    updated = copy.deepcopy(preferences)  # deep copy so the original survives for diffing
    for person, score in person_signals.items():
        level = "boost" if score >= 3 else "demote" if score <= -2 else None
        if level:
            updated.setdefault("people", {})[person] = level
    for topic, score in topic_signals.items():
        level = "high" if score >= 3 else "low" if score <= -2 else None
        if level:
            updated.setdefault("topics", {})[topic] = level
    return updated

def save_proposed_diff(preferences: dict, original: dict, output_path: str) -> None:
    """Write only the changed sections for human review."""
    diff = {k: v for k, v in preferences.items() if v != original.get(k)}
    Path(output_path).write_text(json.dumps(diff, indent=2))
    # Human reviews output_path before merging into preferences.json

The pipeline is deliberately simple. No vector embeddings. No LLM call. No similarity matching. It counts signals, applies thresholds, and produces a diff file that a human reviews before it takes effect. If this stops being sufficient (maybe engagement patterns need temporal weighting, maybe topics need hierarchical grouping), the measurement data will tell you when.

The Danger of Over-Automation

The most dangerous self-improving agent is the one that learns without oversight.

Consider a search agent that autonomously tunes its own queries based on click-through rates. If users click on sensational headlines more than nuanced analysis, the agent learns to prioritize sensational content. The metric improved. The output degraded. Nobody noticed because nobody reviewed the query evolution.

Or consider a code agent that autonomously updates its own system prompt based on which prompts produce code that passes tests. If shorter prompts produce passing code (because the tests are insufficiently rigorous), the agent learns to strip out safety instructions, error handling guidance, and documentation requirements. The tests pass. The code quality drops.

The agent has optimized itself into a local minimum that happens to satisfy the measured objective.

Pike’s Rule 3 applies: don’t get fancy until measurement proves you need it. But it has a corollary for self-improving systems: don’t let the agent define what “better” means. The human defines the objective. The agent proposes ways to get there. If the agent is choosing both the direction and the steps, you have an optimization loop with no external constraint. That is how you get agents that are very good at hitting metrics and very bad at doing the actual job.

The safest pattern is the one that is hardest to over-automate: the inner loop executes, the outer loop analyzes, the persistence layer stores, and the human approves. Four components, clear boundaries, explicit handoffs.

If you find yourself adding a fifth component (an “auto-approval” layer, a “confidence threshold” that bypasses human review, a “fast track” for low-risk changes), stop. That fifth component is where silent drift enters.

How do you catch this? Run your eval suite before and after each preference update. If any dimension drops more than 5%, flag the most recent update for review. This connects the evals article (7) to this one: evals are the safety net that prevents self-improvement from becoming self-degradation.
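That gate is a few lines of code. A minimal sketch, assuming eval results arrive as per-dimension scores between 0 and 1; the 5% threshold mirrors the rule above:

```python
def flag_regressions(before: dict[str, float], after: dict[str, float],
                     threshold: float = 0.05) -> list[str]:
    """Return eval dimensions that dropped more than threshold (relative) after an update."""
    flagged = []
    for dim, old_score in before.items():
        new_score = after.get(dim, 0.0)
        if old_score > 0 and (old_score - new_score) / old_score > threshold:
            flagged.append(dim)
    return flagged

# Any flagged dimension sends the latest preference update back to human review.
```

Run it on the eval results from before and after each preference merge; a non-empty list means revert first, investigate second.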

Do Next

| Tier | Priority | Action | Why it matters |
|---|---|---|---|
| Newcomer | This weekend | Build a static agent (Level 1) that does one task well: a daily summary, a data quality check, a report generator. If you built the research agent from Article 4, you already have a Level 1 agent; your next step is to externalize its configuration into a file the agent reads at startup. | You need a working agent before you can make one that learns. The learning is not the hard part; the reliable execution is. |
| Newcomer | This month | Move hardcoded values into a config file the agent reads at startup (Level 2). | This is the prerequisite for any learning. If the agent’s behavior cannot be changed by editing a file, it cannot be changed by a learning loop either. |
| Newcomer | This quarter | Add logging. Record what the agent produced, what inputs it used, and any feedback you have on the output quality. | You cannot improve what you do not measure. The logs are the raw material for every future learning step. |
| Learner | This week | Review the agent’s current configuration. Is anything hardcoded that should be parameterized? Are there values the agent could optimize if it had feedback data? | Most Level 1 agents have implicit preferences (sort order, filter thresholds, source priorities) that could be explicit configuration. Making them explicit is the first step toward making them learnable. |
| Learner | This month | Add a feedback channel. It can be as simple as a form that writes JSON to S3, or a Slack message that gets parsed into structured data. | Feedback is the signal that drives the outer loop. Without it, the agent has nothing to learn from. The format barely matters; the habit of providing feedback matters enormously. |
| Learner | This quarter | Implement a basic outer loop: read the accumulated feedback once a week, count which preferences generated engagement, and produce a proposed diff to the config. Review the diff yourself before applying it. | This is the transition from Level 2 to Level 3. The review step is not optional. The agent proposes; you decide. |
| Practitioner | This week | Study the phasing table in the briefing agent example above. Map your own agent’s learning capabilities to the seven phases. Which phase are you ready for? Which ones need more data? | The phasing table is not specific to one system. Any self-improving agent can be assessed against these phases. The confidence levels tell you where the engineering is straightforward and where it is genuinely hard. |
| Practitioner | This month | Run your eval suite before and after each preference update; diff the results; if any metric regresses, revert the update. | Silent drift is the most dangerous failure mode in self-improving agents. Eval-gated updates catch regression before it compounds. |
| Practitioner | This quarter | Design your persistence layer for rollback. Every configuration change should be versioned; every preference update should be diffable. If the agent learns something wrong, you should be able to revert to last week’s state in one command. | Rollback is the safety net that makes experimentation safe. Without it, every bad learning step is permanent. With it, you can let the agent explore more aggressively because the cost of a mistake is one git revert. |

The Capstone

This series started with a question: what is an AI agent? The answer was a system that uses an LLM to decide which actions to take in a loop. We asked when not to build one and mapped Pike’s five rules onto agent development. We built a real agent and showed that context is the program. We built evals to measure whether the agent works and placed guardrails to catch compound errors.

This article closes the learning loop. Once you have an agent that works, that is measured, and that is guarded, you can make it learn. But the learning follows the same principles that built the agent in the first place: start simple, measure before optimizing, treat complexity as a cost, and never let the data (or the learning loop) run without human judgment in the loop.

The final article in the series puts everything together: a complete implementation walkthrough applying the full framework to a real problem.

Pike wrote his rules in 1989 for C programmers at Bell Labs. Thirty-seven years later, they are the best guide I have found for building AI agents that improve without breaking. The technology changed. The engineering discipline did not.


This is Part 11 of 12 in The Practitioner’s Guide to AI Agents. ← Previous: Multi-Agent Systems · Next: From Problem to Agent →

Sources & References

  1. No Priors Podcast: Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI (2026)
  2. Fortune: The Karpathy Loop (2026)
  3. GitHub: karpathy/autoresearch (2026)
  4. ACE: Agentic Context Engineering (arXiv 2510.04618) (2025)
  5. IBM: Trajectory-Informed Memory for LLM Agents (arXiv 2603.10600) (2026)
  6. EvoAgentX: Self-Evolving AI Agents (GitHub) (2026)
  7. Rob Pike's Rules of Programming (1989)
