Cutting SAST false positives with a local LLM verifier
Every SAST user we've talked to has the same complaint: the tool drowns you in findings, ~60% of them are noise, and the only way to lift the signal-to-noise ratio above "useless" is hand-tuning queries or ignoring entire rule classes.
The LLM-first wave promised to fix this by throwing the whole codebase at a cloud model and asking it to "find the bugs." In practice it made things worse: same model, same prompt, two runs in a row → wildly different output. Hallucinated vulnerabilities. Missed the obvious ones under paragraphs of plausible-sounding prose. 300k tokens per audit. Privacy issues. Not shippable.
There's a third option that we've been shipping for six months and it works: keep the scanner deterministic, but add a small local LLM as a verifier. The LLM never hunts for bugs — it only downgrades false positives. This post explains why that split works, the architecture, and the numbers on NASA's Input Device Framework.
The false-positive problem in static analysis
Pattern-based static analyzers (Semgrep, CodeQL's simpler queries, linters in general) work by matching regular-expression or AST-shaped patterns against source code. Patterns are cheap to write but they lack semantic context — they don't know whether a caller already validated the input, whether the matched code is reachable in practice, or whether the flagged construct is intentional.
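To make that concrete, here is a toy rule in the same spirit (illustrative only, not one of our 256): a regex that flags every `strcpy` call and has no way to see that the caller already bounded the input.

```cpp
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

int main() {
    // Source under scan: the caller bounds the input on line 1,
    // but the rule below can't know that.
    const std::string source =
        "size_t len = strnlen(raw, sizeof buf);\n"
        "if (len < sizeof buf) strcpy(buf, raw);\n";

    // The whole "rule": a shape match for a strcpy call (CWE-120 candidate).
    const std::regex rule(R"(\bstrcpy\s*\()");

    std::istringstream lines(source);
    std::string line;
    for (int n = 1; std::getline(lines, line); ++n)
        if (std::regex_search(line, rule))
            std::cout << "line " << n << ": candidate: " << line << "\n";
}
```

The match fires even though the surrounding code makes the copy safe. That gap between shape and meaning is the entire false-positive problem.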
The result is the pile of false positives every AppSec engineer has seen:
- A `strcpy` call flagged as a buffer overflow — but the caller just did `strnlen` and allocated exactly the right size (see the sketch after this list).
- A hardcoded password flagged as a credential — but it's in a test fixture that ships with obvious placeholder values.
- A SQL concatenation flagged as injection — but the input is a typed enum from a Zod schema two functions up.
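Here is the first bullet spelled out, with hypothetical names: the safety argument lives one stack frame above the match, outside anything a pattern can see.

```cpp
#include <cstdlib>
#include <cstring> // strnlen is POSIX/C11; present on the usual toolchains

// The callee is all a pattern scanner sees — it flags this strcpy (CWE-120).
static void copy_name(char *dst, const char *src) {
    strcpy(dst, src); // no bounds check visible in this function
}

// The caller holds the semantic context: length measured, buffer sized to fit.
void store_name(const char *raw) {
    size_t len = strnlen(raw, 256);
    if (len == 256) return;                     // over-long or unterminated: reject
    char *dst = static_cast<char *>(malloc(len + 1));
    if (dst == nullptr) return;
    copy_name(dst, raw);                        // fits by construction
    free(dst);
}
```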
Traditional fix: hand-tune each rule with narrower patterns, allow-lists, and per-project annotations. Works, but it's the whole reason SAST tools come with professional services engagements. The tuning never ends.
Why LLM-first audits don't replace it
The natural reaction to the tuning problem is to outsource the semantic judgment to a large model: dump the project into the context, ask "what are the bugs in here," and let the model apply its world knowledge. This is what most of the 2024–2025 "AI security audit" products do under the hood.
It fails for three reasons that compound:
- Non-determinism. Run the same prompt twice and you get different findings. Sometimes the model spots a use-after-free; sometimes it invents one that doesn't exist. You can't ship a security gate that doesn't produce the same output on the same input.
- Token cost. A mid-size codebase (say, 200 files) costs ~300k tokens per complete audit if you give the model enough context to reason across files. That's real dollars per run and real latency per CI job.
- Privacy. Every LLM-first SAST product we looked at sends your source to a hosted inference endpoint. That's fine for open-source, painful for proprietary, impossible for regulated codebases.
The split: deterministic scanner + local LLM verifier
Our architecture separates the two jobs. The scanner finds candidates deterministically. The LLM verifies each candidate in isolation. Nothing else.
```
project/                                      findings
   │                                             ▲
   ▼                                             │
┌───────────────┐       ┌────────────────────┐   │
│    Scanner    │ ───▶  │ Local LLM verifier │───┘
│  (256 regex + │       │  (one candidate    │
│   AST rules)  │       │   at a time)       │
└───────────────┘       └────────────────────┘
  ~4 s, 0 tokens          ~10k tokens, ~46 s
  31 candidates           28 confirmed, 3 FP
```

The scanner is pure pattern matching with comment-awareness and some semantic guards (it won't flag code inside a `/* … */` or a string literal, for instance). It produces a list of candidates: (file, line, pattern ID, matched substring).
The verifier gets one candidate at a time with a narrow, pattern-specific prompt. The key design choice is that the verifier never sees the full codebase — it only sees the function containing the match, plus the pattern's `verify_prompt` field. A typical prompt looks like:
```
You are verifying a candidate match for the pattern
"strcpy without bounds check" (CWE-120).

Code context (lines 42-58 of src/auth/user.c):

void load_username(const char *raw) {
    char buf[64];
    strcpy(buf, raw);   // <-- candidate
    ...
}

Question: Is this strcpy actually triggered with untrusted input
that could exceed 64 bytes? Respond with ONE of:

CONFIRMED: <1-line execution path showing how untrusted input reaches this line>
FALSE_POSITIVE: <1-line reason, e.g. "caller validates length via strnlen">

If you cannot determine from the given context, say:

NEEDS_CONTEXT: <what you need>
```

The model's answer is trivial to parse and maps onto one of three outcomes. Non-determinism still exists, but it's confined to a question where a wrong answer does bounded damage: a true positive the verifier wrongly downgrades stays in the report for human triage, and a false positive it wrongly confirms is no worse than running without the verifier. The verifier can only improve the signal; it never invents a finding.
Why the verifier has to be local
A hosted verifier (OpenAI, Anthropic, etc.) would work technically but defeats the privacy argument. The whole reason to run SAST on-prem is to avoid shipping source to a third party; adding a hosted LLM to the pipeline reintroduces exactly that exposure.
The good news is that verification is a much smaller task than discovery. A 14B to 31B parameter model (Gemma 4, Qwen 2.5 Coder, Llama 3.x) running on a consumer 24 GB GPU is plenty. The verifier doesn't need to know every framework in the world — it just needs to read ~30 lines of surrounding code and answer one narrowly scoped question, with an execution path as evidence.
We ship with mnemo:mark6-mid (Gemma 4 31B Q4_K_M) as the default verifier. It takes ~1.5 seconds per candidate on an RTX 4090. For teams without a GPU, a cloud verifier is configurable (same prompt shape, same parser), with the understanding that you're trading privacy for convenience.
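For teams wiring this up themselves, the verifier call is just an HTTP POST to whatever OpenAI-compatible server is running locally (Ollama and llama.cpp's llama-server both expose one). A minimal sketch that shells out to curl; the endpoint, model name, and JSON handling here are assumptions for illustration, and real code would JSON-escape the prompt and parse the response with a JSON library:

```cpp
#include <array>
#include <cstdio>   // popen/pclose (POSIX)
#include <fstream>
#include <memory>
#include <string>

std::string ask_local_model(const std::string &prompt_json_escaped) {
    // Write the request body to a file to avoid shell-quoting games.
    {
        std::ofstream body("/tmp/verify_body.json");
        body << R"({"model":"qwen2.5-coder:14b","temperature":0,)"
             << R"("messages":[{"role":"user","content":")"
             << prompt_json_escaped << R"("}]})";
    }
    const char *cmd =
        "curl -s http://localhost:11434/v1/chat/completions "
        "-H 'Content-Type: application/json' -d @/tmp/verify_body.json";
    std::unique_ptr<FILE, int (*)(FILE *)> pipe(popen(cmd, "r"), pclose);
    std::string out;
    std::array<char, 4096> buf{};
    while (pipe && fgets(buf.data(), static_cast<int>(buf.size()), pipe.get()))
        out += buf.data();
    return out; // raw JSON; pull choices[0].message.content out with a JSON lib
}
```

Pinning temperature to 0 keeps the verdicts as repeatable as the model allows.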
The numbers: NASA's Input Device Framework
We pointed the pipeline at nasa/IDF — NASA's C++ library for joystick and HID device management in spacecraft simulation. It's real production code, maintained by NASA engineers, not a synthetic benchmark.
Three classes of bug surfaced:
- Pointer arithmetic on a `void*` parameter. `EthernetDevice.cpp:160` had `(&buffer)[bytesTotal]` — which indexes past the local pointer variable on the stack, not into the buffer data. On partial UDP sends it transmitted whatever stack memory happened to sit after the pointer variable. Catching this with a plain regex would have worked — we had the pattern — but it would also have fired on the dozens of other `(&local_array)[N]` expressions that were intentional. The verifier looked at each case and kept only the one where the target was a pointer parameter.
- Unreachable code after return. `lastPacketArrived = std::time(nullptr)` sat after a `return` statement. The regex catches any assignment following any return — hundreds of false positives per project because of switch/case fallthroughs and `goto` targets. The verifier confirmed the one case that was actually dead code.
- 27 USB HID decoder files with unchecked `data[N]` access. `UsbXBox.cpp`, `UsbDualShock3.cpp`, etc. Each one assumed the incoming packet was the expected fixed length. A malformed USB device sending a short packet would trigger an out-of-bounds read. The scanner flagged 30 candidates across 27 files; the verifier confirmed 27 (one was a false positive because the packet length was validated in a caller) and the fixer applied a bounds-check template to each (sketched after this list).
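The bounds-check template itself is nothing exotic. A sketch of the shape of the applied fix, with made-up names and sizes rather than the literal nasa/IDF patch:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical HID decoder. Before the fix, data[5] and data[6] assumed a
// full-length report; a short packet meant an out-of-bounds read.
bool decodeButtons(const uint8_t *data, size_t length, unsigned &buttons) {
    constexpr size_t kExpectedReportLength = 8;
    if (data == nullptr || length < kExpectedReportLength)
        return false; // the inserted guard: reject short or malformed packets
    buttons = static_cast<unsigned>(data[5]) |
              (static_cast<unsigned>(data[6]) << 8); // now provably in range
    return true;
}
```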
The full PR is public: nasa/IDF#107. All 28 patches compile clean; NASA's maintainers reviewed and merged.
What this approach does NOT solve
It's worth being direct about the limits before anyone evaluates this as a Semgrep or CodeQL replacement:
- No cross-function dataflow. The verifier only sees the function containing the candidate. If a taint crosses 15 function boundaries, we'll miss it. CodeQL is still the right tool for that; use them together.
- Not a larger rule catalog. Semgrep ships ~2000 OSS rules; we have 256 curated ones. We bet on depth + verification over breadth. If you want a rule for every CWE, you want Semgrep and some automation for triaging the noise.
- Not a compliance dashboard. We emit SARIF v2.1.0 and Markdown. If you need polished SOC2 / PCI reporting with audit trails, plug the SARIF into Snyk or SonarQube.
- Verifier still hallucinates sometimes. The 3 false positives in the NASA run were all cases where the verifier said CONFIRMED when the candidate was actually safe. The failure mode is benign — it adds noise, it doesn't lose findings — but it means "LLM-verified" is not a guarantee humans can skip review.
Try it on your repo
The pipeline ships as KCode: a single ~100 MB binary, AGPL-3.0, for Linux x64/ARM64, macOS x64/ARM64, and Windows x64.
```
# Linux x64 — full pipeline on a repo
curl -LO https://kulvex.ai/downloads/kcode/kcode-2.10.134-linux-x64
chmod +x kcode-2.10.134-linux-x64
./kcode-2.10.134-linux-x64 audit .
```
Output goes to AUDIT_REPORT.md and AUDIT.sarif. For GitHub Code Scanning, the repo has a one-line Action.
Other platforms and the full pricing / feature comparison are on kulvex.ai/kcode.
What we're looking for
The 256 patterns are the product of six months of curating real bugs and discarding ones that turned out to be unmaintainable. There are obvious gaps (Elixir coverage is thin, no OCaml yet, Android-specific rules are lighter than we'd like). If you're running SAST in production and there's a class of bug your current tool keeps flagging as a false positive despite the tuning, we'd like to hear about it — those are the cases where the verifier design has the most leverage.