
When the cheap model wins: a SAST verifier ensemble

Grok-4-fast (xAI, $0.20/M input tokens) beats Claude Opus 4.7 ($75/M output) on KCode's verifier benchmark. The cascade of both beats either alone. End-to-end numbers, the routing fix that unblocked the work, and one OWASP corpus quirk we document openly.

We expected Claude Opus 4.7 to win. It's the most capable model in Anthropic's lineup, the obvious choice for a "verify whether this SQL injection is real" question. We benchmarked it against Grok-4-fast-non-reasoning (xAI's $0.20/M-input-tokens budget model) as a sanity check. We did not expect the budget model to come out ahead.

But Grok beat Opus on F1, on cost, on wall time, and on recall — head-to-head on the same OWASP Benchmark v1.2 SQLi subset, same KCode binary, same prompts, same temperature. Then we tried running both as an ensemble and the combined F1 went up again. Numbers below; the rest of this post is what changed in our mental model and what we shipped to make this the default.

The setup

KCode's audit pipeline has two stages: a deterministic pattern scanner that produces candidate findings, then an LLM verifier that decides confirmed / false_positive / needs_context for each one. The verifier is the precision lever — it never adds findings, it only downgrades the noise.
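
In code terms the verifier's contract is small. A sketch of the shape, in TypeScript (the names here are ours for illustration, not KCode's actual internals):

// Illustrative sketch; field names are ours, not KCode's internal types.
type Verdict = "confirmed" | "false_positive" | "needs_context";

interface Candidate {
  file: string;       // e.g. "BenchmarkTest02740.java"
  line: number;
  pattern: string;    // which scanner rule produced the candidate
  snippet: string;    // the surrounding function plus the pattern's verify question
}

interface VerifierResult {
  verdict: Verdict;
  reasoning: string;  // the model's explanation, kept for the audit trail
}

// The verifier only ever downgrades scanner findings; it never adds new ones.
type Verifier = (candidate: Candidate) => Promise<VerifierResult>;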

Until v2.10.405 we shipped one verifier model per run. The question we wanted to answer was: which model? We picked four candidates spanning two providers and three price points:

  • grok-4-fast-non-reasoning (xAI) — $0.20/M input, $0.50/M output. The cheap one.
  • claude-haiku-4-5 (Anthropic) — $1/M + $5/M. The Anthropic budget tier.
  • claude-opus-4-7 (Anthropic) — $15/M + $75/M. The expensive one.
  • grok-4-fast-reasoning (xAI) — same model as above with the reasoning trace enabled.

Corpus: the 504-file SQLi subset of OWASP Benchmark v1.2. Ground truth is published — we didn't write it. Each of the four runs gets the same 1,441 candidates from the scanner, sends each one to its assigned verifier with the same prompt, and writes a JSON report with file-level TP/FP/FN/TN computed against expectedresults-1.2.csv.
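
Scoring is mechanical once the report is written; conceptually it reduces to something like this sketch (not the actual reconciliation script):

// Sketch: file-level scoring against OWASP ground truth.
// groundTruth maps each benchmark file to true (vulnerable) or false (safe),
// as parsed from expectedresults-1.2.csv.
function score(confirmedFiles: Set<string>, groundTruth: Map<string, boolean>) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const [file, isVuln] of groundTruth) {
    const flagged = confirmedFiles.has(file);
    if (flagged && isVuln) tp++;
    else if (flagged && !isVuln) fp++;
    else if (!flagged && isVuln) fn++;
    else tn++;
  }
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, tn, precision, recall, f1 };
}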

A routing bug we had to fix first

First attempt: every smoke test came back with verdict=needs_context and the same error in the verifier reasoning field — HTTP 429 from api.openai.com. We hadn't asked OpenAI anything. Why was xAI Grok and Claude Opus traffic showing up at OpenAI's quota wall?

The bug was in our CLI's default-resolution logic. When you ran kcode audit . -m claude-sonnet-4-6, the CLI looked at ~/.kcode/settings.json for an API key, found a saved sk-proj-… (an OpenAI dashboard key from a previous experiment), inferred defaultBase = openai.com, and dispatched the Claude model name to OpenAI's chat-completions endpoint. The endpoint returned 429 because the key was over quota for that account. Every candidate bucketed as needs_context; the audit silently produced zero confirmed findings.

The fix lives in v2.10.405: makeAuditLlmCallback now resolves the API base from the model-name prefix (claude → Anthropic, grok → xAI, kimi/moonshot → Moonshot, etc.) and reads provider-specific keys from settings (anthropicApiKey, xaiApiKey, kimiApiKey) — never the generic apiKey field. We also added a stderr trace line on every audit run so the routing is observable:

$ kcode audit . -m claude-opus-4-7
[Verifier] claude-opus-4-7 → https://api.anthropic.com/v1 (key: sk-ant-a…)
◆ KCode Audit Engine
  ...
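
Conceptually the new resolution is a model-name prefix match plus a provider-specific key lookup, roughly the sketch below (simplified; the real makeAuditLlmCallback covers more providers and error paths):

// Simplified sketch of prefix-based routing.
interface Settings { anthropicApiKey?: string; xaiApiKey?: string; kimiApiKey?: string }

function resolveProvider(model: string, s: Settings): { base: string; key?: string } {
  if (model.startsWith("claude")) return { base: "https://api.anthropic.com/v1", key: s.anthropicApiKey };
  if (model.startsWith("grok"))   return { base: "https://api.x.ai/v1", key: s.xaiApiKey };
  // kimi/moonshot and locally registered models follow the same pattern.
  // No match means an error, never a silent fall-through to the generic apiKey field.
  throw new Error(`no provider mapping for model "${model}"`);
}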

With the routing fix in place, the four-model benchmark could actually run. We launched the four audits in parallel from the same machine; total wall time was bounded by the slowest model.

Single-model results

Model                               Recall   Precision   F1      Wall time   Cost
grok-4-fast-non-reasoning (xAI)     100%     66.7%       0.800   82 min      $0.50
claude-opus-4-7 (Anthropic)         99.6%    65.9%       0.794   143 min     $39.50
claude-haiku-4-5 (Anthropic)        14.0%    80.9%       0.238   120 min     $2.65

Costs are real prompt + completion tokens for the 1,441-candidate run × public per-provider pricing. Wall time is real elapsed minutes from the parallel launch. The grok-4-fast-reasoning variant is excluded — xAI rate-limited the reasoning endpoint hard during the run, so its wall time isn't comparable.

Three findings broke our priors:

  • Grok-fast beats Opus head-to-head. Higher recall (100% vs 99.6% — Opus missed one), higher precision (66.7% vs 65.9%), 1.7× faster wall time, ~80× cheaper. No metric goes the other way.
  • Haiku is a trap, not a budget option. Recall collapses to 14% because Haiku reads OWASP-style synthetic files ("BenchmarkTest02740.java", obvious test scaffolding) and auto-classifies them as test-only code, refusing to flag the SQL injection. What it does flag is almost always real (precision 80.9%) but it misses 234 of 272 real vulnerabilities. The failure is category-level, not gradient — a budget cut in the wrong direction.
  • Opus and Grok agree more than they disagree. Grok confirms 594 candidates; Opus confirms 415; they overlap on 372 — most of which are real vulns. That's the building block of the ensemble idea.

Our hypothesis on why Grok wins: SAST verification is a bounded binary decision with all the context it needs in the prompt (the function around the match plus the pattern's verify question). Extra reasoning capacity has nowhere productive to go — there's no cross-function dataflow, no library lookup, no ambiguous corpus. Latency dominates. Grok-fast is fast and decisive; Opus is slow and decisive; both are decisive.

Cascade-on-confirmed: the ensemble

If two independent verifiers agree on a finding, the finding is more likely to be real. We tested that hypothesis directly: bucket every candidate by how many of the three models confirm it, then score each bucket against ground truth. The bucket where Grok and Opus both confirm — call it Tier B — was 372 candidates with F1 0.842 against ground truth (a post-hoc offline computation; the real cascade run later landed at F1 0.829, and the gap is LLM nondeterminism at temperature 0.1).
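
The offline bucketing is a vote count over the three single-model reports, something like this sketch (assuming each report gives us the set of candidates that model confirmed):

// Sketch: group candidates by how many models confirmed them (0 to 3 votes).
function agreementTiers(confirmedBy: Map<string, Set<string>>, candidateIds: string[]) {
  const tiers = new Map<number, string[]>();
  for (const id of candidateIds) {
    let votes = 0;
    for (const confirmed of confirmedBy.values()) {
      if (confirmed.has(id)) votes++;
    }
    if (!tiers.has(votes)) tiers.set(votes, []);
    tiers.get(votes)!.push(id);
  }
  return tiers; // score each bucket separately against ground truth
}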

That F1 beat every single model. We turned it into a feature: --cascade-mode on-confirmed (the new default when --fallback-model is set). The semantics:

for each candidate:
    primary_verdict = grok_fast(candidate)
    if primary_verdict == "confirmed":
        fallback_verdict = opus(candidate)
        if fallback_verdict == "confirmed":
            emit confirmed [ensemble ✓ both confirmed]
        else:
            emit false_positive [ensemble ✗ fallback disagreed]
    else:
        emit primary_verdict   # never invoke fallback

Two key properties fall out:

  • Cost stays low. The expensive model only runs on grok-confirmed candidates — 594 of 1,441 in this run, ~41%. At Opus pricing that's $39.50 × (594/1441) ≈ $16.30, plus Grok's $0.50 = ~$17 total, vs $39.50 for Opus alone (and $0.50 for Grok alone, but with worse F1).
  • The disagreement trail is recorded. Every downgraded finding gets a reasoning prefix: [ensemble ✗ fallback false_positive] primary said: … fallback said: … When we audit the audit later, we can see exactly which model caught what. Confirmed findings get [ensemble ✓ both confirmed] for symmetry.
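
In implementation terms the trail is just a prefix on the reasoning string. A sketch of how the cascade and the annotation compose, reusing the Candidate / Verifier / VerifierResult types from the earlier sketch (again ours, not KCode's internals):

// Sketch: cascade-on-confirmed plus the ensemble annotation trail.
async function cascadeVerify(c: Candidate, primary: Verifier, fallback: Verifier): Promise<VerifierResult> {
  const p = await primary(c);
  if (p.verdict !== "confirmed") return p;   // fallback is never invoked
  const f = await fallback(c);
  if (f.verdict === "confirmed") {
    return { verdict: "confirmed", reasoning: `[ensemble ✓ both confirmed] ${f.reasoning}` };
  }
  return {
    verdict: "false_positive",
    reasoning: `[ensemble ✗ fallback ${f.verdict}] primary said: ${p.reasoning} fallback said: ${f.reasoning}`,
  };
}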

End-to-end cascade results

We ran the cascade against the same 504 SQLi files. One audit invocation, two providers, one report:

kcode audit /tmp/owasp-sqli \
  -m grok-4-fast-non-reasoning \
  --fallback-model claude-opus-4-7 \
  --json -o result.md
Files (sqli):         504 (272 vuln, 232 safe)
Candidates raised:    1,441 (none needs_context)
Confirmed (cascade):  386
TP / FP / FN / TN:    272 / 112 / 0 / 120
Precision:            70.8%
Recall:               100.0%
F1:                   0.829
Wall time:            178 min
Real cost:            ~$17
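
The headline metrics follow directly from the file-level counts:

precision = 272 / (272 + 112) = 70.8%
recall    = 272 / (272 + 0)   = 100.0%
F1        = 2 × 0.708 × 1.000 / (0.708 + 1.000) = 0.829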

F1 0.829 with 100% recall beats every single-model run we tested, including Opus alone (0.794). The cascade preserves Grok's recall exactly — no real vulnerability was lost when Opus got veto power. The precision lift comes from the 24 candidates Grok flagged where Opus disagreed; all 24 turned out to be FPs against OWASP ground truth. Grok was wrong; Opus caught the disagreement; the cascade dropped them.

One methodology limitation

We clustered the 132 SQLi false positives Grok-fast produced on its own, and almost all of them shared one structural shape: a helper function with a constant-evaluable if-condition that decides whether user input or a hard-coded constant reaches the SQL string, where the arithmetic guarantees the constant always wins.

private static String doSomething(HttpServletRequest req, String param) {
    String bar;
    int num = 196;
    // (500 / 42) + 196 == 207, never < 200  → the if-branch is NEVER taken
    if ((500 / 42) + num < 200) bar = param;          // user input (dead code)
    else bar = "This should never happen";            // constant (always assigned)
    return bar;
}

OWASP Benchmark v1.2 categorises this as safe, under the convention that a SAST tool should perform constant folding plus dead-code elimination to recognise that only the else-branch can execute: the constant, not param, is what reaches the SQL concatenation. KCode's verifier doesn't fold the condition yet, so it conservatively treats the param branch as reachable and confirms the finding. The "FP" verdict here is the gap between OWASP's convention and an analysis pass we haven't shipped, not a misreading of the dataflow.

Quantitative impact: ~125 of Grok's 132 SQLi FPs fit this shape. Adding constant folding on if-conditions (no symbolic execution required) would lift Grok's projected precision to ~92% and F1 to ~0.96. That work is on the v2.10.41x roadmap.
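
For concreteness, the planned check is shallow: evaluate the integer arithmetic in the if-condition at analysis time, mark the unreachable branch dead, and only then ask whether user input can reach the sink. A sketch of the folding step (ours, not the roadmap implementation):

// Sketch: constant-fold an integer comparison so dead branches can be pruned.
// Enough to see that (500 / 42) + 196 < 200 is always false, so the branch
// that assigns user input to bar can never execute.
type Expr =
  | { kind: "lit"; value: number }
  | { kind: "binop"; op: "+" | "-" | "*" | "/"; left: Expr; right: Expr }
  | { kind: "cmp"; op: "<" | ">" | "<=" | ">="; left: Expr; right: Expr };

function foldInt(e: Expr): number | undefined {
  if (e.kind === "lit") return e.value;
  if (e.kind !== "binop") return undefined;
  const l = foldInt(e.left), r = foldInt(e.right);
  if (l === undefined || r === undefined) return undefined;
  switch (e.op) {
    case "+": return l + r;
    case "-": return l - r;
    case "*": return l * r;
    case "/": return Math.trunc(l / r);   // Java-style integer division
  }
}

function foldCond(e: Expr): boolean | undefined {
  if (e.kind !== "cmp") return undefined;
  const l = foldInt(e.left), r = foldInt(e.right);
  if (l === undefined || r === undefined) return undefined;
  switch (e.op) {
    case "<": return l < r;
    case ">": return l > r;
    case "<=": return l <= r;
    case ">=": return l >= r;
  }
}

// Folding the example's condition yields false: the param assignment is dead
// code, only the constant reaches the SQL string, and the finding can be
// downgraded without any symbolic execution.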

We document this on the benchmarks page in a "Methodology limitations" collapsible because the alternative — quietly tuning the corpus until our numbers improve — is what every SAST vendor does and is exactly the opposite of what makes a benchmark trustworthy.

Routing recommendation for KCode v2.10.406+

Three configurations, three workflows:

  • Fast / cheap. kcode audit . -m grok-4-fast-non-reasoning — F1 0.800, 100% recall, ~$0.50 per OWASP-scale corpus. Single-model. The right default for interactive triage and most CI pipelines where the budget for a verifier is pennies.
  • Balanced. kcode audit . -m grok-4-fast-non-reasoning --fallback-model claude-opus-4-7 — F1 0.829, 100% recall, ~$17. Highest F1 of any configuration we tested. Cascade-on-confirmed is the new default semantics when --fallback-model is set in v2.10.406+.
  • Strict CI. Filter the balanced configuration's output to the unanimous-consensus tier (all three models confirm) for near-zero-FP behaviour at the cost of recall (86% precision, 13.6% recall on this corpus). Currently a post-processing step; will land as --ensemble strict in a follow-up.

For a local-only deploy (no cloud calls at all), point both -m and --fallback-model at two different local models registered in ~/.kcode/models.json. The cascade logic is provider-agnostic — what matters is that the two models disagree often enough to be useful as independent voters.
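
For example (the model names here are hypothetical placeholders for whatever you registered):

kcode audit . -m local-verifier-a --fallback-model local-verifier-b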

Try it

v2.10.406 ships with the cascade as a first-class flag. Single ~120 MB binary, Apache-2.0:

# Linux x64
curl -LO https://kulvex.ai/downloads/kcode/kcode-2.10.406-linux-x64
chmod +x kcode-2.10.406-linux-x64
./kcode-2.10.406-linux-x64 audit . \
  -m grok-4-fast-non-reasoning \
  --fallback-model claude-opus-4-7

Or grab the binary from GitHub Releases with checksums. API keys go in ~/.kcode/settings.json (xaiApiKey and anthropicApiKey for this configuration), or as environment variables (XAI_API_KEY, ANTHROPIC_API_KEY).
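
For this configuration that's a two-field settings file; the keys below are placeholders, and we're assuming top-level fields (adjust to wherever your existing settings.json keeps them):

{
  "xaiApiKey": "xai-…",
  "anthropicApiKey": "sk-ant-…"
}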

Full benchmark methodology, raw numbers, and per-tier breakdowns on kulvex.ai/kcode/benchmarks. The reproduce script clones the OWASP corpus, runs all four verifiers, and reconciles findings against ground truth — same shape as the script behind these numbers.

What we updated in our mental model

Three things:

  1. "Most capable model" doesn't mean "best for the task." SAST verification is a bounded binary decision. Generalised reasoning capacity has nowhere to go when the prompt fully constrains the question. The cheap fast model can win, and on this benchmark it did.
  2. Two independent verifiers compose better than one expensive one. The disagreement signal is real precision lift; the agreement signal is real confidence boost. Cascade lets you spend on quality only when the cheap model already says yes.
  3. Personalities matter. Haiku reading "BenchmarkTest…" filenames as test-only code is an emergent behaviour, not a bug — but it's catastrophic for our use case. "Smaller + faster + cheaper" is the right axis to test, not the right decision rule.

If you're running a verifier-gated SAST tool today and haven't benchmarked cheap fast models against the expensive ones, run that comparison. The result might surprise you the way it surprised us.
