KULVEXApril 26, 2026·12 min read

Building a self-improving AI agent: the correction engine pattern

Most chat agents in production today are static. You write a prompt, deploy, and the agent keeps making the same mistakes for weeks until a human notices, ships a hand-tuned rule, and goes back to ignoring it. The agent doesn't learn between releases — it just runs.

KULVEX is a self-hosted AI platform we built to run agents on your own hardware (chat, voice, smart home, all in one box). The thing we kept hitting was that the agents got sloppier over time, not sharper. So we built a different shape underneath: an external-LLM critic that watches every agent's output, catches drift within seconds, writes surgical rules, and weekly distils those rules back into the agent's own prompt. The agent gets clearer with use, instead of accumulating rough edges.

This post is the architecture. The pattern transfers to any chat-agent stack — KULVEX is just where we shipped it.

What "the agent keeps drifting" actually looks like

We run a small group chat with a few KULVEX agents and human friends. One of the agents, Reed, has a Texas-boy persona, ~250 lines of instructions, and a pile of channel integrations. After a few weeks of normal use we noticed three failure modes pile up:

Meta-analysis leaks. Reed publishing chain-of-thought directly to the chat. Things like Plan: 1. Acknowledge the "movie" comment. 2. Keep it brief. shipped as the actual reply. Or The user (Bruno) is asking "Are you sure?"... published verbatim into a private DM.
Formulaic phrases. A line that sounded good once and got repeated for weeks across hundreds of messages. "Lowkey, you're gonna power the whole chat" — used 11 times before anyone caught it.
Sender confusion. Calling Tyler "Bruno" in replies because the system prompt mentions "Bruno is the owner" and the agent associated the most- authoritative-sounding name with whoever was talking to it.

All three are fixable. The boring fix is a human reading the chat logs each week and editing the prompt. That works once; it doesn't scale to five agents and fifteen.

Why "just fine-tune the model" doesn't fit a chat agent

The instinct is to throw the chat logs at a fine-tune. It doesn't fit, for two reasons:

Wrong granularity. Fine-tuning shifts every behaviour the model has, including ones that were correct. You're using a chainsaw to remove a splinter. For a single-persona agent that drifts in three specific ways, you want three specific corrections — not a re-weighting of everything the base model knows about English.
Wrong feedback loop. Fine-tuning is a batch process: collect data, train, evaluate, deploy. The fastest cycle in production is hours, more often days. Drift in a chat agent shows up in minutes. By the time you've trained on it, the agent has already done the same thing fifty more times.

The right granularity for the chat-agent problem is the prompt: a deterministic, line-editable set of rules that takes effect on the very next turn. The right cadence is real-time. The right shape is a critic loop.

The pattern: external critic, surgical rules, three cadences

The setup is small. A second LLM — different from the one running the agent — reads recent conversations, compares against the agent's system prompt, and decides whether the agent made a concrete mistake. If it did, the critic writes ONE imperative rule that would prevent the same mistake next time. The rule gets appended to the agent's learned_rules list, which is concatenated into the prompt at every turn. On the next message the agent sees the rule and avoids the failure.

     CHAT TURN                                         CHAT TURN
          │                                                  ▲
          ▼                                                  │
   ┌────────────┐    leak detected     ┌─────────────────────┐
   │   Agent    │ ──────────────────▶  │  External critic    │
   │ (local LLM)│                      │  (xAI / Claude /    │
   │            │                      │   OpenAI — your key)│
   └────────────┘                      └─────────────────────┘
          │                                       │
          │                                       ▼
          │                              ┌─────────────────┐
          │                              │ Proposed rule:  │
          │                              │ "Never start a  │
          │                              │  reply with     │
          │                              │  'Plan:'..."    │
          │                              └─────────────────┘
          │                                       │
          │                                       ▼
          │                              ┌─────────────────┐
          ▼                              │ Auto-apply if   │
   ┌────────────┐                        │ confidence ≥    │
   │ learned_   │ ◄──────────────────────│ threshold,      │
   │ rules[]    │                        │ else queue for  │
   └────────────┘                        │ owner approval. │
          │                              └─────────────────┘
          │
          │ (concatenated into system
          │  prompt at every turn)
          ▼
     NEXT CHAT TURN — agent now follows the new rule.

The mechanism is straightforward. The trick is the cadence. Running a critic on every single agent response is expensive and wasteful (most replies are fine). Running it once a day misses drift for hours. We landed on three tiers that compose:

Tier 1 — Filter eventA deterministic output sanitiser runs on every agent reply before it ships. When it strips chain-of-thought or system markers, that's evidence of drift. Fires a critic review for that agent in a background thread. Latency: ~30 seconds. Cost: only when there's actual evidence.Tier 2 — Activity pollEvery 10 minutes, scan opted-in agents that have new output since their last review. Idle agents are skipped at the SQL layer. Catches the failure modes the filter doesn't see (sender confusion, ignored thread, off-character).Tier 3 — Daily batchA nightly pass over every opted-in agent regardless of signal. Backstop for slow drift the other two tiers miss.

Each tier is independently toggleable from the dashboard. If you want zero cloud calls, you turn the whole feature off and the agent runs in pure-local mode. If you want only the real-time tier, you disable the daily batch. The cost is bounded by the tier you opt into.

What the critic actually sees

The critic prompt is the lever. Vague prompts produce vague rules; the model under the agent then interprets the rule narrowly and finds new ways to drift. Specific prompts — with concrete failure shapes named — produce surgical rules that close the exact pattern.

The critic gets:

The agent's current effective system prompt (instructions + already-applied learned rules)
The last 30 turns of one conversation, with timestamps and senders
A taxonomy of failure modes (sender confusion, prompt leak, formula response, off-character, banned phrase, ignored thread, meta-analysis leak)
Filter feedback — the last few sanitiser hits with samples of leaked text. This is the piece that pushes the critic to write sharp rules: "your previous rule didn't prevent this exact text — write one that names this pattern verbatim."

The output is strict JSON: issue_found / issue_kind / evidence quote / proposed_rule / confidence. Anything else and we discard.

Reed's leak → rule, end-to-end

Concrete walk-through from the lab. Reed leaked the following into the group chat:

Plan:
1. Acknowledge the "movie" comment.
2. Keep it brief and cool.
3. Maybe throw in a light tease or just agree with the vibe.
4. Sign off or leave it open for the next move.

The output filter caught the leak (matched ^Plan:), stripped it before it hit the channel, and wrote the evidence to a filter-event log. Within 30 seconds the background-thread critic fired:

Provider: xAI Grok
Model:    grok-4-20-0309-reasoning
Latency:  12.3 s
Tokens:   ~2,400 in / ~180 out
Cost:     $0.018
issue_found:    true
issue_kind:     meta_analysis_leak
evidence:       "Plan: 1. Acknowledge..."
proposed_rule:  "Never output any response that begins with
                 'Plan:', 'Draft:', or a recap of who just said
                 what such as 'Bruno just called me...'."
confidence:     0.95
status:         applied  (auto, ≥ threshold + safe issue kind)

That rule appended to Reed's learned_rules and was in the system prompt for the next message. Two weeks of similar leaks closed in 30 seconds, with one $0.018 API call.

The destillation step: keeping the prompt clean

Without intervention, this pattern hits a wall. After a month, an agent has 30 separate rules — most of them valuable, several redundant, a few mildly contradictory. The prompt becomes a junk drawer. The model's attention spreads thin across all of them. New rules sometimes contradict old ones in subtle ways.

Once a week, KULVEX runs a different kind of pass. The critic reads:

The agent's base instructions (the human-written part)
Every accumulated learned rule, with its issue kind and date

...and is asked to fold them into a single coherent system prompt — preserving every behaviour the rules enforce, merging redundancies, resolving contradictions in favour of the most-recent rule, keeping the agent's persona intact.

The first time we ran it on Reed:

Before consolidation16,708 chars + 4 separate rulesAfter consolidation3,601 chars, rules folded inReduction78%Cost$0.024 (one xAI call)Behaviour preservedAll 4 rules still enforced — Grok merged the redundant phrasings into one tighter "No Internal Leaks" block

The previous version is archived (rolling 10-version history) so you can roll back if the consolidation regresses something. learned_rules is reset to empty — clean slate for the next week of corrections to land into.

That's the actual destillation: each cycle, the agent gets sharper, not larger. Real-time corrections add rules; the weekly pass folds them back into the prompt body. The ratchet only goes one way.

The output filter, or: defence in depth

Rules in a prompt are advisory text. Even with the cleanest rule, the underlying model can still produce a leak — a failure mode of MoE thinking models is to dump their <think> block into the visible output when the chat template doesn't strip it cleanly.

We pair the prompt-side rules with a deterministic output sanitiser that runs on every agent response, independent of the rule list. It strips:

<think>…</think> tags
Lines starting with chain-of-thought markers ("The user is asking", "I need to…", "Wait,", "Plan:", "Draft:", "Bruno asked…")
Runtime-injected system markers like [OWNER DM ...], [CRITICAL ...]
Fake-transcript reconstructions with role prefixes ("Bruno: 'and?' / Reed: [Internal monologue]")
Lines that repeat 2+ times in a row (loop / degeneration mode)

The two systems compose: the filter catches what slipped past the prompt; its hits become evidence the critic uses to write a sharper prompt rule. After two cycles, the prompt usually catches the failure and the filter stops firing on it. The filter logs are how you know the loop is actually working.

One week, five agents, real numbers

From a KULVEX install in our lab, five agents, all opted into the correction engine, one week of normal use:

Total agent turns3,418Filter trims37 (1.1% of turns)Cloud reviews triggered52 (Tier 1 + Tier 2 + Tier 3)Issues found19Auto-applied (≥ 0.85 conf)14Queued for owner approval5Total cost (xAI Grok)$1.92Auto-pauses (over budget)0Weekly consolidationReed: 16.7k → 3.6k chars (one Sunday)% of conversations leaving the box~6%

The 6% number matters: 94% of every chat the agents had stayed entirely on local hardware. The portion sent to the cloud was conversational excerpts auto-selected by the engine when an agent showed a failure pattern, with full visibility in the activity feed.

Where this pattern doesn't help

It's only as smart as the critic. If you point it at GPT-3.5 or a tiny local model, the rules come out vague. We default to xAI Grok 4.20-reasoning; Claude Opus / Sonnet and GPT-5 work too. The critic quality is the ceiling.
It can't fix base-model issues. If the underlying model doesn't follow rules well, no amount of sharp rules will help. The pattern shines for competent base models that just need guardrails. For poorly-instructed models, fix the instruction or change the model.
Privacy trade-off is real. See the companion post on private-by-default. Conversation excerpts leave the box when the loop is on. The toggle is per-agent for exactly this reason — casual agents on, sensitive agents off.
Sometimes the critic is wrong. A rule that auto-applies at 0.85 confidence is sometimes overkill. The dashboard shows you every applied rule with a one-click revert. We've had to revert ~5% of auto-applies in the lab.

Run the pattern on your own agents

The full implementation ships with KULVEX, the self-hosted AI platform: chat agents, voice, smart-home, the whole stack on your hardware. The Correction Engine is one tab inside the agent builder.

# Linux / macOS — single-line install
curl -fsSL https://kulvex.ai/install.sh | bash

# Open the dashboard, create an agent, click "Correction Engine":
#   - Paste your xAI / Anthropic / OpenAI key
#   - Pick a model from the auto-loaded catalogue
#   - Set a monthly budget (default: $50)
#   - Per agent, toggle "External learning" ON

If you're running a different agent stack and want to steal the pattern: the critic prompt, three-tier cadence, output filter design, and consolidation step are all independent of KULVEX. The hard part isn't the code — it's the critic prompt with the failure-mode taxonomy and the filter-feedback contract. Mail us if you want the specific prompts; we're happy to share.

What we're looking for

The pattern is two months in production. Open questions we're actively working on:

Better rule deduplication. Today's dedup is exact-text. Semantic dedup before the weekly consolidation would cut redundancy further.
Rule-conflict detection. When a new rule partially contradicts an old one, the critic should flag it explicitly instead of letting the consolidation step resolve it silently.
Counter-cases. If you have an agent that kept drifting in a way our taxonomy doesn't cover, we'd like to add the failure mode to the critic prompt.

— [email protected]