KULVEX + KCode · 7 min read

The output filter that catches what your prompt rules miss

We tell the model "never publish your internal reasoning" and the next morning we find a chat where the model published its internal reasoning. We add a sharper rule — "never start a line with 'Plan:' or 'Draft:'" — and three days later it leaks under "So I need to respond to...". Rules in a system prompt are not enforcement; they're a polite request the model interprets through its own attention.

What works is a second layer: a deterministic filter that runs on every model response after generation, before it ships to a channel. It strips known-bad patterns mechanically — no judgement, no probabilities, just regex. The two systems compose: the prompt rules push the model toward correct output; the filter is the safety net for when the model ignores them anyway. This post is the pattern, with the regex we ship in production at KULVEX and the failure modes we hit.

Why prompt rules alone don't hold the line

Three reasons we kept seeing in production:

  1. Rules compete for attention. A typical chat agent in our system has 30+ rules in the prompt by week three. The model can't weight all of them equally on every turn — some get attended to, some don't. Whichever gets dropped is the one that leaks.
  2. Rules describe; models interpret. "Never output internal plans" looks airtight to a human. To the model, "internal plans" means numbered steps starting with "Plan:". It does not mean "a 200-word recap of who said what before answering", even though that's exactly what we wanted to forbid.
  3. Thinking modes leak. MoE thinking models like Qwen3.6 or DeepSeek-R1 produce a <think> block before the answer. When the chat-template formatter doesn't cleanly separate it, the entire block ends up in the visible output. The model hasn't "broken" the rule — it produced the right shape, the wrapper just delivered both halves.

All three look like the same failure to a user — a leak — but they have nothing in common architecturally. A rule written for one shape doesn't catch the others. Adding more rules fights symptoms; it doesn't hold the line.

The pattern: a regex sanitiser at the egress

The shape we landed on:

        Model output (raw)
                │
                ▼
       ┌────────────────┐
       │ Output filter  │   strip <think> blocks
       │ (deterministic)│   strip CoT line patterns
       │                │   strip system markers
       └────────┬───────┘   collapse repetition loops
                │
                ▼
        Cleaned output
                │
                │   if cleaned >= 5 chars:  ship to channel
                │   else:                   suppress message
                │                           (better silence than leak)
                ▼
          Channel send

One function, called once per response. No model in the loop, no probabilities, no LLM-as-a-judge — those would reintroduce the same non-determinism we're trying to eliminate. The rules-of-thumb for what to strip are stable enough that regex is the right tool.

The four pattern families we strip

Each family came from a real production leak. The patterns are conservative — they'd rather miss a leak than falsely strip valid output.

1. Thinking-tag wrappers

re.compile(r"<think(?:ing)?>.*?</think(?:ing)?>",
           re.DOTALL | re.IGNORECASE)

The simplest. Reasoning models emit these tags. When the wrapper doesn't strip them, regex does. DOTALL so the match can span lines, case-insensitive, and it catches both <think> and <thinking>.
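
A quick usage sketch, with a made-up sample response:

import re

THINK_TAG_RE = re.compile(
    r"<think(?:ing)?>.*?</think(?:ing)?>",
    re.DOTALL | re.IGNORECASE,
)

raw = "<think>\nUser wants a greeting. Keep it short.\n</think>\nHey! Good to see you."

# sub() removes the whole block, tags included; strip() tidies the leading newline.
print(THINK_TAG_RE.sub("", raw).strip())
# -> Hey! Good to see you.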

2. Chain-of-thought line starts

Lines that begin with characteristic CoT openings. We split the input by lines; for each line, if the trimmed line matches one of these patterns at the start, the line is dropped:

^the user (?:\([^)]+\) )?(?:is asking|wants|said)
^(?:so,?\s+)?i need to\b
^(?:so,?\s+)?i should\b
^(?:so,?\s+)?i (?:will|won't|can(?:'t)?|must)\b
^wait,
^actually,
^looking at\b
^based on\b
^let me (?:think|check|see|consider)\b
^plan:
^draft:
^here'?s my (?:response|reply|plan)
^my response:
^considering\b
^given (?:that|the)\b
^the (?:system )?prompt says\b
^the previous turn\b
^that was a hallucination\b
^okay,? (?:so|let)\b
^so,? (?:the user|i (?:just|need|should|will))\b
^reasoning:
^step \d+:

Each pattern came from a leak we observed. The list is additive: every new failure shape becomes one more line. Concrete patterns (a full opening phrase) beat broad ones (a bare verb); the broad version also catches valid messages that merely happen to start with that verb.
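
A minimal sketch of the per-line pass (pattern list abbreviated; the production list is the one above):

import re

_COT_LINE_RE = re.compile(
    "|".join([
        r"^the user (?:\([^)]+\) )?(?:is asking|wants|said)",
        r"^(?:so,?\s+)?i need to\b",
        r"^plan:",
        r"^draft:",
        # ... remaining patterns from the list above
    ]),
    re.IGNORECASE,
)

def drop_cot_lines(text: str) -> str:
    # Each alternative is anchored with ^, so search() on the trimmed line
    # only fires when the pattern sits at the very start of that line.
    kept = [line for line in text.split("\n")
            if not _COT_LINE_RE.search(line.strip())]
    return "\n".join(kept)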

3. Runtime-injected system markers

re.compile(
    r"\[OWNER DM[^\]]*\]"
    r"|\[OWNER[^\]]*\]"
    r"|\[Reply in (?:ENGLISH|SPANISH)\]"
    r"|\[CRITICAL[^\]]*\]"
    r"|\[CAPABILITIES[^\]]*\]"
    r"|\[DIRECT ORDER CHANNEL\]"
)

Every channel runtime injects bracketed instructions around the user's message — system-level annotations meant for the model, not the user. The model occasionally quotes them back verbatim. The list of markers is whatever your runtime injects; ours are domain-specific. Add yours.
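
If the marker list changes often, it can help to build the regex from plain strings rather than hand-writing the alternation. A sketch, assuming markers are either a bracketed prefix with a variable tail or an exact literal (the names below are ours from above; swap in whatever your runtime injects):

import re

MARKER_PREFIXES = ["OWNER DM", "OWNER", "CRITICAL", "CAPABILITIES"]   # matched as "[PREFIX ...]"
MARKER_LITERALS = ["[Reply in ENGLISH]", "[Reply in SPANISH]",
                   "[DIRECT ORDER CHANNEL]"]                          # matched exactly

_SYSTEM_MARKER_RE = re.compile(
    "|".join(
        [r"\[" + re.escape(p) + r"[^\]]*\]" for p in MARKER_PREFIXES]
        + [re.escape(m) for m in MARKER_LITERALS]
    )
)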

4. Fake-transcript reconstructions and loops

A particularly bad failure mode of MoE thinking models: instead of replying to the user, the model writes a fake transcript with role prefixes:

Bruno: "And?"
Reed: [Internal monologue]
Bruno: "And?"
Reed: [Internal monologue]
Bruno: "And?"
Reed: [Internal monologue]
... (repeats 80 times)

Two patterns catch this:

# Lines like "Bruno: ..."  with quoted content or [...] sentinel
^\[?[A-Z][\w' ]{1,30}\]?:\s+\[?(?:internal monologue|thinking|reasoning|response)\]?\s*$
^\[?[A-Z][\w' ]{1,30}\]?:\s+\".+\"\s*$

# Plus loop detection:
last_line = ""
repeat_count = 0
for line in cleaned_lines:
    if line.strip() == last_line:
        repeat_count += 1
        if repeat_count >= 2:
            break              # truncate everything after the loop
    last_line = line.strip()
    repeat_count = 0

The role-prefix patterns match the dialogue lines; the loop detector truncates whatever comes after a line repeats twice. Together they collapse a 28 KB runaway into a clean reply (the fragment of real output before the model went off the rails).
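
Put together, a condensed sketch of this family (the real filter runs the other three families in the same pass):

import re

_ROLE_LINE_RE = re.compile(
    r"^\[?[A-Z][\w' ]{1,30}\]?:\s+\[?(?:internal monologue|thinking|reasoning|response)\]?\s*$"
    r"|^\[?[A-Z][\w' ]{1,30}\]?:\s+\".+\"\s*$",
    re.IGNORECASE,
)

def collapse_transcript(text: str) -> str:
    kept, last, repeat = [], "", 0
    for line in text.split("\n"):
        s = line.strip()
        if _ROLE_LINE_RE.search(s):
            continue                    # drop fake-transcript dialogue lines
        if s and s == last:
            repeat += 1
            if repeat >= 2:
                break                   # truncate everything after the loop
            continue
        last, repeat = s, 0
        kept.append(line)
    return "\n".join(kept).strip()

On the runaway above, every fake-dialogue line matches one of the role-prefix patterns and is dropped; whatever real reply came before the spiral survives, and if nothing did, the result is empty, which the suppression rule in the next section handles.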

When the filter eats the whole response

Sometimes the entire model output is a leak — pure chain-of-thought with no actual reply at the end. After stripping, we're left with nothing.

The wrong moves: ship an empty string, fall back to the unfiltered text, or invent a placeholder. We pick option four: suppress the message entirely. The channel doesn't receive anything. The user sees no response. The agent stays silent for that turn.

That's the right tradeoff. A silent turn is mildly confusing. A leaked thinking dump is a real privacy problem. Silence is safer.

if not cleaned or len(cleaned) < 5:
    logger.warning(
        f"[{agent.name}] meta-leak filter consumed the entire "
        f"response ({len(original)} chars). Suppressing. "
        f"Original prefix: {original[:200]!r}"
    )
    record_filter_event(...)   # for the correction engine
    return ""                  # caller treats this as no response

Filter hits become learning signal

The filter is mechanical, but its hits are valuable. Every trim is a piece of evidence: the prompt rules, whatever they were, didn't prevent this exact text. Recording those hits and feeding them back into a critic loop turns the filter into a teaching signal.

In KULVEX's correction engine (covered in the companion post), every filter event writes to a log collection. The next time an external critic reviews that agent, it sees:

=== FILTER FEEDBACK (output-sanitiser hits, last 24h) ===
- 14:39:52  trimmed=369 chars
    leaked text: "The user (Bruno) is asking 'And?' after I gave
                  a somewhat rambling internal monologue about
                  whether to mention Grok..."

- 22:40:44  trimmed=10954 chars (suppressed entirely)
    leaked text: "Bruno asked 'Are you sure?' about Grok.
                  Then he said 'yes man'.
                  Now he says 'And?'..."

With that context, the critic writes a much sharper rule — something that names the exact phrasing the filter had to remove. The next iteration of the prompt either prevents the leak entirely (the filter stops firing) or shifts it to a new shape (the filter keeps catching, the critic keeps narrowing). Either way the system tightens.
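
The event record itself doesn't need to be fancy. A hypothetical shape, with illustrative field names rather than KULVEX's actual schema; store is whatever append-only collection the critic can query later:

import datetime

def record_filter_event(agent_name: str, original: str, cleaned: str, store) -> None:
    # One row per sanitiser hit: enough for the critic to quote the leak back.
    store.append({
        "agent": agent_name,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "trimmed_chars": len(original) - len(cleaned),
        "suppressed": not cleaned,
        "leaked_prefix": original[:300],   # context for the critic, not a full dump
    })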

One week of filter hits, real install

Total agent turns                   3,418
Filter trims                        37 (1.1%)
Full-response suppressions          3
Avg chars stripped per hit          ~1,200
Largest single trim                 10,954 chars (runaway loop)
Pattern family hit most             CoT line starts (24 hits)
Pattern family hit least            Thinking-tag wrappers (1 hit)
User-visible leaks (post-filter)    0

The 0 is the metric that matters. Before we shipped the filter, an average week saw 4–6 visible leaks reach a channel. After: zero. The 37 hits show the model is still trying to leak — the filter holds the line.

What this filter doesn't do

  • Semantic leaks slip through. If the model paraphrases its own system prompt without using any of the patterns above, regex won't catch it. Those need a smarter check (or just better prompt rules). We accept this limit; the filter is for the sloppy leaks, not the sneaky ones.
  • False positives exist. A user who legitimately writes "The user is asking why X is hard" as the first line of a quoted message would get stripped. We log every trim with the leaked prefix; if a false positive is reported we tighten the specific pattern ("at column 0 only, not in the middle of a quote").
  • Relies on models that produce identifiable shapes. The patterns assume English-style CoT phrasing. Models trained on different distributions might leak in shapes the regex doesn't see. The fix is the same as everywhere else: log it, add the pattern.
  • Not a replacement for prompt rules. If you turn off the prompt-side rules and rely on the filter alone, you'll get noisy output even when the filter strips the worst — the model still emits borderline text the filter doesn't catch. The two layers are complementary; neither is sufficient on its own.

Wire it into your stack

The pattern is small enough to copy. The Python sketch below is roughly what KULVEX ships — the patterns are the interesting part, not the plumbing:

import re

_THINK_TAG_RE = re.compile(
    r"<think(?:ing)?>.*?</think(?:ing)?>",
    re.DOTALL | re.IGNORECASE,
)

_META_LINE_RE = re.compile(
    "|".join([
        r"^the user (?:\([^)]+\) )?(?:is asking|wants|said)",
        r"^(?:so,?\s+)?i need to\b",
        r"^(?:so,?\s+)?i should\b",
        r"^wait,",
        r"^plan:",
        r"^draft:",
        # ... full list above
    ]),
    re.IGNORECASE,
)

_SYSTEM_MARKER_RE = re.compile(
    r"\[OWNER DM[^\]]*\]"
    r"|\[CRITICAL[^\]]*\]"
    r"|\[Reply in (?:ENGLISH|SPANISH)\]"
)

def strip_meta_leak(text: str) -> str:
    if not text:
        return text
    text = _THINK_TAG_RE.sub("", text)
    text = _SYSTEM_MARKER_RE.sub("", text)

    kept, last, repeat = [], "", 0
    for line in text.split("\n"):
        s = line.strip()
        if not s:
            kept.append(line)
            last, repeat = "", 0
            continue
        if _META_LINE_RE.search(s):
            continue
        if s == last:
            repeat += 1
            if repeat >= 2:
                break
            continue
        last, repeat = s, 0
        kept.append(line)

    cleaned = "\n".join(kept).strip()
    return cleaned if len(cleaned) >= 5 else ""
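
The call site is just as small. A sketch, with send_to_channel and the logger standing in for whatever your stack uses:

import logging

logger = logging.getLogger("output-filter")

def send_to_channel(text: str) -> None:
    ...  # placeholder: your channel client goes here

def deliver(agent_name: str, raw_response: str) -> None:
    cleaned = strip_meta_leak(raw_response)

    if cleaned != raw_response.strip():
        # Log every trim; the removed text is the tuning signal.
        logger.warning("[%s] output filter trimmed %d chars",
                       agent_name, len(raw_response) - len(cleaned))

    if not cleaned:
        return               # suppressed: better silence than a leak

    send_to_channel(cleaned)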

Wire it into the spot just before your agent ships its response to a channel. Log every trim. After a week, whichever pattern category has zero hits is one you probably don't need; whichever has dozens is the shape the model loves. Tune from there.

Where this fits

This filter is one piece of a larger system. The companion posts cover the other pieces.

What we're looking for

New leak shapes — patterns we don't cover that you see in production with whatever model stack you run. Each one becomes one more regex. The pattern list is the product; we'll keep it open.

[email protected]