Benchmarks
Four-way SAST comparison — KCode vs Semgrep OSS vs Semgrep Pro (the commercial tier with cross-function dataflow) vs CodeQL — on two corpora: (1) our in-house 68-fixture catalog across 10 languages, and (2) the public OWASP Benchmark v1.2 — 2,740 Java test cases with published ground truth.
Corpus 1: in-house fixtures (68 cases, 10 languages)
Hand-built positive/negative pairs covering KCode's full pattern catalog. Note the bias caveat in "Caveats" below — the OWASP corpus (next section) is the harder, more honest test.
| Tool | TP | FN | TN | FP | Skip | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| KCode v2.10.406 | 67 | 1 | 55 | 13 | — | 98.5% | 83.8% | 0.905 |
| Semgrep OSS 1.160 | 21 | 47 | 63 | 5 | — | 30.9% | 80.8% | 0.447 |
| Semgrep Pro 1.160 | 26 | 42 | 61 | 7 | — | 38.2% | 78.8% | 0.515 |
| CodeQL 2.25.2 | 13 | 19 | 25 | 7 | 36 | 40.6% | 65.0% | 0.500 |
Corpus 2: OWASP Benchmark v1.2 (2,740 Java cases, public)
OWASP's open-source SAST evaluation corpus. 11 vulnerability categories (sqli, xss, cmdi, weakrand, pathtraver, ldapi, securecookie, trustbound, crypto, hash, xpathi). Ground truth published in expectedresults-1.2.csv. We didn't write it. The recall and precision numbers below are on a corpus the KCode team had no input into.
| Tool | TP | FN | TN | FP | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|
| KCode v2.10.406 | 1415 | 0 | 34 | 1291 | 100.0% | 52.3% | 0.687 |
| Semgrep OSS 1.160 | 1249 | 166 | 805 | 520 | 88.3% | 70.6% | 0.785 |
| Semgrep Pro 1.160 | 1263 | 152 | 397 | 928 | 89.3% | 57.6% | 0.700 |
| CodeQL 2.25.2 | 1415 | 0 | 296 | 1029 | 100.0% | 57.9% | 0.733 |
What changed between corpora. On the in-house corpus KCode wins on F1 by a wide margin because the fixtures map cleanly to its catalog. On OWASP — a corpus none of these tools' authors designed — the picture is more textured. Semgrep OSS leads on F1 (0.785) because its precision is high (70.6%): when it fires, it's usually right. KCode v2.10.406 hits 100% recall on OWASP — matching CodeQL — at 52.3% precision and F1 0.687. Per-category recall is 100% across all 11 OWASP categories (sqli, xss, cmdi, pathtraver, ldapi, xpathi, hash, crypto, weakrand, securecookie, trustbound). CodeQL's 100% recall comes with 57.9% precision (F1 0.733); KCode trades 6 precision points for the same recall but ships zero-build and local-first. Semgrep Pro lands at F1 0.700 with 89.3% recall. The precision gap to Semgrep OSS is real — that's the next milestone. The point KCode is making with the confidence dial (introduced in v2.10.400) is independent of F1: the engine ships a per-finding confidence score so the user picks where on the precision/recall curve they want to be. The "what the dial does" section below has the full breakdown.
The confidence dial (introduced in v2.10.400, benchmarked here on v2.10.406). KCode is the only engine in this comparison that exposes a per-finding confidence score and a CLI dial to filter on it. `kcode audit . --confidence high|medium|all` lets the operator pick where on the precision/recall curve they want the report. CodeQL flags everything; Semgrep ships one config-driven cut. KCode's bands are derived from six signals — pattern maturity, taint origin (Fix #3 verdict), sanitizer-seen, verifier verdict, prior demotion history, fix support — and the breakdown is in `--json` so the score is auditable: "scored 70 because pattern stable +30, taint tainted +25, no sanitizer +5, verifier skipped +0, no demotions +10".
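As a rough sketch of how such a score could be assembled (the weights mirror the example breakdown just quoted; the signal names and band thresholds are illustrative assumptions, not KCode's documented internals):

```python
# Illustrative only: weights copied from the quoted breakdown, thresholds assumed.
SIGNAL_WEIGHTS = {
    ("pattern", "stable"): 30,       # pattern maturity
    ("taint", "tainted"): 25,        # taint origin verdict
    ("sanitizer", "none_seen"): 5,   # no sanitizer observed on the path
    ("verifier", "confirmed"): 20,   # LLM verifier verdict
    ("verifier", "skipped"): 0,
    ("demotions", "none"): 10,       # prior demotion history
    ("fix", "supported"): 10,        # fix support
}

def confidence_band(signals):
    """Sum the per-signal weights and bucket the total into a band."""
    score = sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals)
    band = "high" if score >= 80 else "medium" if score >= 50 else "low"
    return score, band

# The quoted example: stable pattern, tainted origin, no sanitizer,
# verifier skipped, no demotions -> (70, "medium").
print(confidence_band([("pattern", "stable"), ("taint", "tainted"),
                       ("sanitizer", "none_seen"), ("verifier", "skipped"),
                       ("demotions", "none")]))
```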
What the dial actually does on OWASP. The table below shows the same 2,740-case run sliced three ways. The dial does not change the underlying scan; it gates which findings the report surfaces. Which cut you want depends on the workflow:
| Cut | Raw hits | TP | FP | FN | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|
| Default (--confidence all) | 12462 | 1415 | 1291 | 0 | 100.0% | 52.3% | 0.687 |
| --confidence medium (high + medium bands) | 3608 | 672 | 562 | 743 | 47.5% | 54.5% | 0.507 |
| --confidence high (high band only) | 705 | 446 | 0 | 969 | 31.5% | 100.0% | 0.479 |
Raw hits is the finding count (one finding per pattern match per line); TP/FP/FN are at file level because the OWASP truth file labels per file. A single file can contribute several findings.
- Default — interactive triage, hunting for the bug. Every plausible finding ranked by confidence; FP overhead accepted because the expensive thing is missing a real vulnerability.
- --confidence medium — ranked-triage cut. Drops the broad sink-flagging patterns whose only signal is "the regex matched"; keeps anything backed by taint flow, a stable pattern, or a clean review record. Useful as an interactive priority cut — start at the top of the queue. Recall trades down hard (47.5% vs 100.0% — OWASP skews heavily to broad XSS/SQLi sink shapes tagged experimental on purpose); FP triage queue drops from 1,291 to 562 (≈56% fewer alerts to clear). F1 0.507 — useful for prioritising triage but not yet a CI exit-criteria gate; that's what `--confidence high` is for. On the in-house corpus the same dial retains 175 of 176 findings — F1 0.905 baseline holds because that catalog isn't dominated by experimental patterns.
- --confidence high — zero-FP CI gate on OWASP Benchmark v1.2. The high band only contains findings from categorical patterns — APIs that are wrong every time they appear regardless of surrounding context:
`Math.random()` for security purposes, MD5/SHA-1 for authentication, raw `response.getWriter().write(request.getParameter(...))` without an encoder, broken cipher modes. 446 true positives, 0 false positives on this corpus — every alert is a real vulnerability. Recall 31.5% (the corpus is dominated by sink-flagging cases that aren't categorical); precision is the metric CI gates need, and this cut delivers it on OWASP. Real-world precision will vary — these patterns are chosen to be FP-free by construction (categorical "this API is wrong" matches), but no benchmark is a universal guarantee. Add a verifier run (`-m claude-sonnet-4-6` or a local model) and the band picks up additional findings the verifier confirms while staying near-100% precision on this corpus.
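A minimal sketch of the band filtering and file-level scoring described above, assuming a `--json` report with `file` and `confidence_band` fields (the field names are assumptions about the report shape, not documented schema) scored against the published `expectedresults-1.2.csv`:

```python
import csv
import json

def score_cut(report_path: str, truth_csv: str, allowed_bands: set[str]) -> dict:
    """Collapse findings to flagged files, then score at file level per the note above."""
    with open(report_path) as f:
        findings = json.load(f)["findings"]
    flagged = {fi["file"] for fi in findings if fi["confidence_band"] in allowed_bands}

    truth = {}  # test name -> is the case really vulnerable?
    with open(truth_csv) as f:
        for row in csv.reader(f):
            if row and row[0].startswith("BenchmarkTest"):
                truth[row[0]] = row[2].strip().lower() == "true"

    tp = sum(1 for name, vuln in truth.items() if vuln and any(name in p for p in flagged))
    fp = sum(1 for name, vuln in truth.items() if not vuln and any(name in p for p in flagged))
    fn = sum(1 for vuln in truth.values() if vuln) - tp
    return {"TP": tp, "FP": fp, "FN": fn,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0}

# e.g. the --confidence high cut:
# score_cut("kcode-owasp.json", "expectedresults-1.2.csv", {"high"})
```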
Verifier-ON measurement (sqli subset). Running KCode v2.10.406 on the 504-file sqli subset with -m claude-sonnet-4-6 (LLM verifier active) lifts precision substantially without touching recall:
| Mode | TP | FP | FN | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| --skip-verify (default) | 272 | 210 | 0 | 100.0% | 56.4% | 0.721 |
| + claude-sonnet-4-6 verifier | 272 | 141 | 0 | 100.0% | 65.9% | 0.794 |
Same 504 sqli files, same KCode binary, same patterns. The only delta is the verifier pass: a per-finding LLM call that reads the surrounding code and applies the pattern's verify_prompt. On this subset, the verifier rejected 69 false positives cleanly while confirming all 272 real vulnerabilities — F1 0.721 → 0.794. Cost: ~$8 in Sonnet tokens; ~41 min wall time. Notably, the verifier preserves recall perfectly on var-flow patterns (sqli, cmdi-exec, path-flow, etc.) — these are the high-precision-lift patterns where Sonnet has clean signal to work with.
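The verifier pass itself is one bounded LLM call per finding. A simplified sketch of that loop (the `verify_prompt` argument, the finding shape, and the `call_llm` helper are placeholders for illustration, not KCode's real internals):

```python
def call_llm(model: str, prompt: str) -> str:
    """Placeholder for whichever provider client -m selects (Sonnet, Grok, a local model)."""
    raise NotImplementedError

def verify_finding(finding: dict, verify_prompt: str, source: str, model: str) -> str:
    """Show the model the surrounding code, ask the pattern's verify question,
    and map the answer to a verdict the report can act on."""
    lines = source.splitlines()
    lo, hi = max(0, finding["line"] - 20), finding["line"] + 20
    snippet = "\n".join(lines[lo:hi])
    prompt = (f"{verify_prompt}\n\nFile: {finding['file']}, around line {finding['line']}:\n"
              f"{snippet}\n\nAnswer with exactly one of: confirmed, rejected, needs_context.")
    answer = call_llm(model, prompt).strip().lower()
    return answer if answer in {"confirmed", "rejected", "needs_context"} else "needs_context"
```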
The full-corpus run was attempted and aborted at 19.6% (2,445/12,462 findings, ~10h elapsed, $0.36 spent — Sonnet came in cheaper than expected at ~$0.00015/finding). The throughput cliff: structural patterns (java-028 cookie-flags, java-034 trustbound-setattribute) saturate the verifier with needs_context verdicts — those checks don't have a clean "is this tainted" question for the model to answer. The next round runs the verifier ONLY on var-flow patterns (where the sqli proof point lives) and reports the full-corpus headline; this page updates when those numbers land.
Multi-model verifier comparison (v2.10.405). The Sonnet measurement above raised the obvious follow-up: does Sonnet's lift generalize to other verifiers? Or is it an artifact of one specific model? In v2.10.405 we ran the same 504-file sqli subset against three more verifiers — from xAI and Anthropic, across the price/quality spectrum — using a routing fix that lets KCode dispatch to any registered provider via a single -m <model> flag. Same KCode binary, same patterns, same prompt; only the verifier model changes per row.
| Verifier | TP | FP | Recall | Precision | F1 | Wall time | Cost |
|---|---|---|---|---|---|---|---|
| grok-4-fast-non-reasoning (xAI) | 272 | 136 | 100.0% | 66.7% | 0.800 | 82 min | $0.50 |
| claude-opus-4-7 (Anthropic) | 271 | 140 | 99.6% | 65.9% | 0.794 | 143 min | $39.50 |
| claude-haiku-4-5 (Anthropic) | 38 | 9 | 14.0% | 80.9% | 0.238 | 120 min | $2.65 |
Costs are computed from actual prompt + completion tokens for the full 1,441-candidate run × public per-provider pricing (xAI: $0.20/M input, $0.50/M output; Anthropic Haiku: $1/M + $5/M; Opus: $15/M + $75/M). Wall time is real elapsed minutes — runs were launched in parallel from one machine to the same KCode binary, so each is independent of the others.
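The cost column is just measured token counts times the list prices quoted above; for example (the token counts below are illustrative, not the measured ones):

```python
def run_cost(prompt_tokens: int, completion_tokens: int,
             usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Cost of one verifier run from its token totals and per-million pricing."""
    return prompt_tokens / 1e6 * usd_per_m_in + completion_tokens / 1e6 * usd_per_m_out

# A hypothetical run with 2.0M prompt tokens and 0.2M completion tokens on xAI
# pricing ($0.20/M in, $0.50/M out) comes to $0.50, the scale of the grok-fast row.
print(run_cost(2_000_000, 200_000, 0.20, 0.50))
```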
Three findings that broke our priors:
- grok-4-fast-non-reasoning beats Opus head-to-head. Higher precision (66.7% vs 65.9%), better recall (100% vs 99.6% — Opus misses one), 1.7× faster (82 vs 143 min), and ~80× cheaper ($0.50 vs $39.50). The "more expensive model is better" intuition fails on this task. Our hypothesis: SAST verification is a bounded binary decision that has all the context it needs in the prompt; extra reasoning capacity has nowhere productive to go and latency dominates.
- Haiku is a trap, not a budget option. Recall collapses to 14% because Haiku auto-classifies OWASP-style test files as "test-only code" and rejects them before evaluating the SQL. Precision is high (80.9%) because what it does flag is almost always real, but it misses 234 of 272 vulnerabilities. Cheaper ≠ worse-but-acceptable here — the failure mode is category-level, not gradient.
- Opus and Grok agree more than they disagree — and where they agree the answer is usually right. Grok confirms 594 candidates; Opus confirms 415; they overlap on 372. This overlap is the building block of the ensemble approach in the next section.
Ensemble verifier: Grok + Opus consensus (Tier B)
If two independent verifiers agree on a finding, the finding is more likely to be real. We tested that hypothesis directly: bucket every candidate by how many of the three models confirm it, then score each bucket against the OWASP ground truth.
| Tier | Rule | Findings | Recall | Precision | F1 | Use case |
|---|---|---|---|---|---|---|
| High (3/3 unanimous) | All 3 models confirm | 43 | 13.6% | 86.0% | 0.235 | Strict CI gate, zero-tolerance contexts |
| ★Medium (Grok + Opus) | Grok-fast and Opus both confirm | 386 | 100.0% | 70.8% | 0.829 | Default — verified end-to-end (v2.10.406 cascade) |
| Low (any 1) | Any single model confirms | 447 | 100.0% | 60.9% | 0.757 | Discovery / manual review queue |
The medium tier — Grok and Opus both confirm — is the headline. F1 0.829 with 100% recall beats every individual model tested, including Opus alone. The cost story is the second half of why this matters: run grok-fast as the first pass on every candidate, then run Opus only on grok-confirmed findings (594 of 1,441). Opus-on-subset costs $16.28; total ensemble cost ~$17 — versus $39.50 for Opus alone with worse F1.
A note on numbers. The Tier B values above (F1 0.829, precision 70.8%, 386 confirmed) come from a real kcode audit . -m grok-4-fast-non-reasoning --fallback-model claude-opus-4-7 run on the 504-file sqli subset (KCode v2.10.406, 178 min wall time, 0 needs_context). An offline post-hoc intersection of the two single-model runs predicted F1 0.842 / precision 72.8%; the small gap (~0.013 F1) is the LLM nondeterminism at temperature 0.1. We're publishing the measured numbers, not the predicted ones.
Why Haiku stays in the analysis but out of the consensus default. Haiku confirms only 38 findings, all of which are also confirmed by Grok and Opus. Its only useful contribution is the Tier A boost (3/3 unanimous → 86.0% precision) — but the recall cost there (13.6%) makes Tier A a strict-CI cut, not a triage default. The medium-tier consensus (Grok + Opus) gets 100% recall without paying Haiku's tax.
Routing recommendation for KCode v2.10.405+. Three configurations, three workflows:
- Fast / cheap: `kcode audit . -m grok-4-fast-non-reasoning` — F1 0.800, 100% recall, ~$0.50 per OWASP-scale corpus. Single-model, no ensemble. The right default for interactive triage and most CI pipelines.
- Balanced (recommended default): `kcode audit . -m grok-4-fast-non-reasoning --fallback-model claude-opus-4-7` — Grok runs first; Opus is invoked only on grok-confirmed findings; the report keeps only candidates both models agree on. F1 0.829, 100% recall, ~$17 per OWASP-scale corpus. Highest F1 of any configuration tested. Cascade-on-confirmed is the default semantics when `--fallback-model` is set (v2.10.406+); pass `--cascade-mode on-needs-context` for the legacy escalate-on-ambiguous flow.
- Strict CI: filter the balanced configuration's output to Tier A (3/3 unanimous including Haiku) for zero-FP behavior at the cost of recall. Currently a post-processing step (see the sketch below); will be a CLI flag (`--ensemble strict`) in v2.10.41x.
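Until `--ensemble strict` ships, the Tier A cut is a post-processing step over the per-model `--json` reports. A sketch of that step, assuming each report carries a per-finding verifier verdict (the file names and field names are assumptions, not documented schema):

```python
import json

def confirmed(report_path: str) -> set[tuple[str, int]]:
    """Findings one verifier run confirmed, keyed by (file, line)."""
    with open(report_path) as f:
        findings = json.load(f)["findings"]
    return {(fi["file"], fi["line"]) for fi in findings
            if fi.get("verifier_verdict") == "confirmed"}

grok, opus, haiku = (confirmed(p) for p in
                     ("report-grok.json", "report-opus.json", "report-haiku.json"))

tier_a = grok & opus & haiku   # 3/3 unanimous: strict CI gate
tier_b = grok & opus           # Grok + Opus consensus: recommended default
print(f"Tier A: {len(tier_a)} findings, Tier B: {len(tier_b)} findings")
```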
Methodology limitations: OWASP constant-folding convention (~95% of remaining FPs)
We clustered the 132 SQL-injection FPs reported by grok-4-fast-non-reasoning by structural shape. Almost all share one pattern: a helper function with a constant-evaluable if-condition that always routes user input through the "vulnerable" branch:
```java
private static String doSomething(HttpServletRequest req, String param)
        throws ServletException, IOException {
    String bar;
    int num = 196;
    // (500 / 42) + 196 == 207 > 200 → branch is ALWAYS taken
    if ((500 / 42) + num > 200) bar = param;         // user input
    else bar = "This should never happen";           // dead code
    return bar;
}
```

OWASP Benchmark v1.2 categorises these as safe (expectedresults-1.2.csv: false), under the convention that a SAST tool should perform constant folding + dead-code elimination to recognise the else-branch as the sanitiser. KCode's verifier reads the code correctly: user-controlled `param` flows into a SQL string concatenation, which is exactly what an LLM trained on real-world code learns to flag. The branch with "This should never happen" is dead code — it does not constitute sanitisation. The "FP" verdict is a property of the OWASP convention, not the analyser.
Quantitative impact. ~125 of grok-fast's 132 SQL-injection FPs fit this exact shape. Adding if-condition constant folding (no symbolic execution required) lifts grok-fast's projected precision to ~92% and F1 to ~0.96. Effort estimate: 3–5 days of IR work, scheduled for v2.10.41x.
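A minimal sketch of the idea (in Python for illustration only; KCode's actual pass would work over its own IR, and the condition is shown with Python's `//` standing in for Java's integer division):

```python
import ast

def folds_true(condition_src: str, constants: dict[str, int]):
    """Constant-fold an if-condition like '(500 // 42) + num > 200'.

    Returns True/False when every name resolves to a known constant,
    None when the branch genuinely depends on runtime input."""
    tree = ast.parse(condition_src, mode="eval")
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    if not names <= constants.keys():
        return None  # not constant-evaluable: keep both branches live
    return bool(eval(compile(tree, "<cond>", "eval"), {"__builtins__": {}}, dict(constants)))

# The OWASP shape above: the condition always folds to True, so the else-branch
# ("This should never happen") is dead and must not count as a sanitizer.
print(folds_true("(500 // 42) + num > 200", {"num": 196}))  # True
```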
We document this in the open because we'd rather lose 6 points of headline precision and keep the methodology honest than tune the corpus until the numbers look better. If you find a different cluster of FPs we missed, an issue on the KCode repo gets a fast response.
What's still on the next round. The medium band is the obvious lever now that high is a clean CI gate. Wiring the v2.10.399 lightweight taint flow into the remaining non-SQL Java patterns would lift findings out of low into medium by giving them a real `taint_origin` signal instead of `n/a`. Concrete targets for the v2.10.4xx line:
- `--confidence medium`: recall 65–75%, precision 65–75%, F1 ≥ 0.70. This makes the medium cut a usable triage filter rather than just "everything except the broadest patterns".
- `--confidence high`: today at 100% precision / 31.5% recall. The verifier-ON sqli measurement (above) shows that adding a Sonnet pass takes the same patterns to 65.9% precision / 100% recall (F1 0.794) on var-flow shapes — the question is now extending that throughout the corpus, not whether the verifier moves the dial. Cost on the sqli subset was ~$8; full-corpus extrapolated cost is sub-$10 with Sonnet, sub-$1 with a local 35B model.
- More categorical patterns tagged. 21 patterns are tagged `high_precision` through v2.10.406: 4 from the original empirical-100% sweep (weak-random, MD5/SHA1 for auth, broken crypto, direct XSS write); 8 hardcoded-credentials patterns (Java, Python, JS, generic, OpenAI keys, etc.); and 9 more from the v2.10.406 categorical round — TLS trust-all, XXE in TransformerFactory, TLS verify-off, py-002 shell-injection with dynamic args, py-005 `yaml.unsafe_load`, js-002 innerHTML with dynamic content, inj-002 subprocess `shell=True`, cpp-006 `strcpy`/`sprintf`/`gets`, rb-001 `eval(params)`. Pending review: ECB mode (currently 55% precision — not yet a clean tag), raw deserialization, log4j JNDI lookups.
The dial demonstrably moves precision to 100% at the high band, and the verifier-ON measurement shows it can also lift the default cut without sacrificing recall — exactly the trade-off CI gates need.
Per-category KCode recall on OWASP. v2.10.406 hits 100% recall on all 11 categories — sqli 100%, xss 100%, cmdi 100%, pathtraver 100%, ldapi 100%, xpathi 100%, hash 100%, crypto 100%, weakrand 100%, securecookie 100%, trustbound 100%. The xss outlier from earlier versions (64.2% in v2.10.398, 78.9% in v2.10.402) closed in v2.10.402-404 with the printf alternation, Locale-skip in the sink-arg extractor, format-varargs analysis, header-injection pattern (java-035), and the bound-PrintWriter pattern (java-036). Same recall as CodeQL, on a public corpus, without a build step.
How CodeQL was run. The numbers above are from `java-security-extended.qls` — the standard pure-security pack. We initially ran `java-security-and-quality.qls` (which adds NPE/dead-code/resource-leak rules) and CodeQL saturated, flagging 2,710 of 2,740 files. The security-extended suite, not that saturated quality run, is the right comparison to Semgrep's security-only configs. Build setup: CodeQL needs a real Java compile, so we cloned the corpus fresh, used a user-owned local Maven repo (`-Dmaven.repo.local=/tmp/m2-curly`) to dodge a stale root-owned `~/.m2` from a prior docker run, and ran `mvn -B -DskipTests clean compile` so CodeQL's tracer actually saw javac. The resulting database is built with Java 21 / source-target 8 — same bytecode the corpus was originally pinned to.
The takeaway. Three usable rankings come out of OWASP. If you want every real vulnerability caught no matter the triage cost, CodeQL wins (100% recall, free for OSS, mandatory build step). If you want the best signal-to-noise ratio, Semgrep OSS wins (F1 0.785, fastest of the four). If you want local-first, pattern-engine simplicity with no build step or cloud login, KCode v2.10.406 ties CodeQL on recall (100%) at F1 0.687, with the smallest operational surface. Semgrep Pro, despite the cross-function dataflow, doesn't beat OSS or CodeQL on this corpus. We think these are three honest ways to use a SAST tool, and we'd rather give you the numbers to choose than blur them.
Methodology
Corpus 1 lives at tests/patterns/ in the KCode repository. Each fixture is a directory with two files:
- `positive.*` — code containing the vulnerability the fixture is named after.
- `negative.*` — the same code with the vulnerability fixed (or a superficially-similar safe construction).
The benchmark script (scripts/bench-sast.py) walks all fixture directories and runs each tool against both files. For every fixture, each tool produces one of four outcomes:
- True positive (TP): tool flags `positive.*`.
- False negative (FN): tool misses `positive.*`.
- True negative (TN): tool clean on `negative.*`.
- False positive (FP): tool flags `negative.*`.
Recall measures how many real vulnerabilities the tool catches. Precision measures how many of the tool's findings are real (vs noise). F1 is the harmonic mean — high F1 means both numbers are high.
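For reference, the three metrics reduce to a few lines; this reproduces the KCode row of the Corpus 1 table:

```python
def metrics(tp: int, fn: int, fp: int) -> tuple[float, float, float]:
    """Recall, precision, F1 from the confusion counts used in the tables above."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# KCode on Corpus 1: TP 67, FN 1, FP 13 -> (0.985, 0.838, 0.905)
print(metrics(tp=67, fn=1, fp=13))
```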
Tool configuration
- KCode v2.10.406: `kcode audit <dir> --skip-verify`. LLM verification was disabled — these are the raw scanner hits, no LLM second-pass filter, to keep the comparison apples-to-apples with Semgrep OSS (which has no LLM stage). Enabling verification would lower KCode's FP count further at a cost of seconds per finding.
- Semgrep OSS 1.160: `semgrep --config p/security-audit --config p/owasp-top-ten --config p/cwe-top-25`. The three rulesets are merged. We picked these because they are the rulesets a developer would reach for first when installing Semgrep without paying for the commercial tier.
- Semgrep Pro 1.160: `semgrep scan --pro --config p/security-audit --config p/owasp-top-ten --config p/cwe-top-25` with `SEMGREP_APP_TOKEN` authenticated to the Semgrep Cloud platform. The Pro engine adds cross-function and inter-procedural dataflow analysis on top of the same rules. Each fixture was wrapped in a temporary git repo before scanning because `--pro` skips files that aren't tracked by git.
- CodeQL 2.25.2: GitHub's code-scanning engine, free for open-source projects. We ran the language-specific `*-security-and-quality.qls` suite (the standard "security plus code quality" pack) against each Python, JavaScript, and Ruby fixture. CodeQL skipped 36 of 68 fixtures: Java, Go, C++, Rust, Swift, and Kotlin all need a build step (CodeQL extracts data while a real compiler runs), and our fixtures are standalone snippets — not buildable projects. CodeQL's Corpus 1 numbers are computed only on the 32 fixtures it could evaluate.
All four tools ran on the same 68 fixtures (positive + negative) covering Python (20), JS/TS (9), Java (8), C/C++ (10), Go (6), Ruby (3), PHP (2), and one each of Rust, Swift, Kotlin.
Corpus 2 is the public OWASP-Benchmark/BenchmarkJava v1.2 project. We checked out the released tag, ran each tool over the src/main/java/org/owasp/benchmark/testcode/ directory (2,740 BenchmarkTest*.java files), and matched each tool's findings to the published expectedresults-1.2.csv ground truth. KCode and Semgrep both ran without LLM verification on this corpus. CodeQL was run with java-security-extended.qls (pure security pack, no code-quality noise) — see the CodeQL setup paragraph above the Methodology section for the full run details.
What the numbers say (Corpus 1, in-house)
On the in-house corpus, KCode finds 67 of 68 vulnerabilities (98.5% recall). Semgrep OSS finds 21 (30.9%). Semgrep Pro improves to 26 (38.2%) — a real but modest +7.3-point uplift from cross-function dataflow. CodeQL finds 13 of the 32 fixtures it could evaluate without a build step (40.6% recall on its evaluable subset). All four tools have precision in the 65–84% range.
For the OWASP Benchmark v1.2 numbers (where KCode ties CodeQL on recall and Semgrep wins F1), see the "Corpus 2" table near the top of this page and the discussion immediately below it.
Semgrep Pro vs OSS: Pro adds 5 true positives over OSS — mostly cases where a vulnerability sits behind a function boundary and OSS's intra-procedural matching loses the connection. Pro's dataflow follows it. But Pro also adds 2 false positives (dataflow speculation firing on safe code paths), so its precision is slightly lower than OSS's (78.8% vs 80.8%). The recall lift is real; the cost is a small precision tax. Pro doesn't close the gap to KCode (98.5% vs 38.2%) on this corpus — the missing vulnerabilities Pro doesn't catch are mostly cases where Semgrep simply has no rule, not cases where dataflow would have helped.
CodeQL's 36 skipped fixtures are the languages that need a real compiler invocation to extract — Java, Go, C++, Rust, Swift, Kotlin. CodeQL's strength is in real applications where its build hooks see the whole compilation; standalone snippets can't be analyzed at all. We don't count those skips against CodeQL's recall — but they're a real-world friction point: if your project doesn't compile cleanly under CodeQL's autobuilder, you don't get a finding.
On the 32 evaluable fixtures (Python, JS, Ruby), CodeQL still misses many findings KCode catches. The reason is structural: CodeQL is a taint-flow engine. It looks for paths from network-input sources (request.args, sys.argv, input()) to dangerous sinks (eval, exec, os.system). Our fixtures use bare function parameters (eval(arg) without an explicit upstream source) — which is a real anti-pattern in production code (any internal caller could pass an unsafe value), but doesn't fire CodeQL's queries without a configured taint source. KCode's pattern-based engine flags the dangerous sink directly; CodeQL only flags the path.
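A hypothetical fixture in the shape described above (illustrative only, not an actual file from tests/patterns/):

```python
# positive.py (illustrative shape, not a real fixture)

def run_user_expression(expr):
    # Pattern engine: eval() on a bare function parameter -> flagged directly.
    # Taint engine: no configured source (request.args, sys.argv, input(), ...)
    # reaches expr, so a flow query from source to sink stays silent.
    return eval(expr)

def run_from_request(request):
    # With an explicit source, the source -> sink path exists and a
    # flow-based engine fires too.
    return eval(request.args["expr"])
```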
KCode's thirteen false positives come from patterns that over-trigger on the safe variants. They cluster around three themes:
- Hard-to-distinguish safe variants. The "safe" `yaml.safe_load` and `pickle.loads` examples both still trigger because the pattern matches the call site without dataflow context. Same shape in the go-003 (command injection), java-021 (Spring `@RequestBody Map`), and js-003 (prototype pollution) cases.
- Crypto and TLS: `crypto-007-tls-verify-off` flags both the disable and a re-enable in the same file — we're missing a "did this actually disable verification?" dataflow check.
- Embedded / firmware patterns: `fsw-005b` and `fsw-017` fire on safe array accesses that look syntactically similar to unchecked ones.
Each is a known issue. v2.10.399 introduced lightweight taint flow that suppresses some of these (constant- and sanitizer-flow on the SQL injection path). v2.10.400 additionally exposes a per-finding confidence dial so the user can drop low-band findings without the engine having to make the call. The remaining gaps move into the v2.10.4xx line. The point of running this benchmark is to make these gaps visible and measurable, not to hide them.
Caveats
- Corpus 1 was designed by KCode contributors. Some fixtures map directly to KCode's pattern catalog. This biases recall toward KCode on the in-house numbers — which is exactly why we added the public OWASP corpus on top. On OWASP, none of the tested tools' authors had a hand in writing the fixtures, and the F1 leaderboard reorders.
- The Corpus 1 fixtures lack explicit taint sources for CodeQL. Real applications usually have one (a Flask route, a CLI argv read, an HTTP body). The synthetic fixtures don't, so CodeQL's flow queries don't fire. CodeQL would do substantially better on a CVE-corpus benchmark where every fixture is by definition a real-world flow. We're treating the gap honestly here, not papering over it.
- CodeQL skipped 36 fixtures (Java/Go/C++/Rust/Swift/Kotlin). CodeQL needs a build step for compiled languages that standalone snippets can't satisfy. Adding `build.gradle`/`go.mod` wrappers per fixture is a corpus enhancement we can do. We haven't yet.
- Snyk Code isn't tested yet. On the queue — their IDE-first developer-experience focus makes them the most relevant commercial benchmark.
- Real-world recall is not 100% even on the OWASP corpus. OWASP Benchmark v1.2 is canonical-shape Java security cases, but real applications layer routing frameworks, custom sanitizers, and ORM helpers on top of the patterns tested here. Treat the OWASP numbers as a credible upper bound on any tool's performance on real Java code, not as "this is what you'll see on your repo."
Reproduce
The script is in the public KCode repo. If you have Bun and Docker installed, you can re-run it yourself:
```bash
git clone https://github.com/AstrolexisAI/KCode
cd KCode
bun install
bun run build        # builds the kcode binary into ~/.local/bin

# Optional: install CodeQL (free, open source) into ~/.codeql
curl -sL https://github.com/github/codeql-action/releases/latest/download/codeql-bundle-linux64.tar.gz \
  | tar xz -C ~/.codeql/

# Optional: Semgrep Pro (free trial, requires login)
semgrep login

# Run Corpus 1 — Docker is required for both Semgrep variants
python3 scripts/bench-sast.py                                    # all four tools
python3 scripts/bench-sast.py --tools kcode,semgrep              # OSS only
python3 scripts/bench-sast.py --tools kcode,semgrep,semgrep-pro  # skip CodeQL

# Run Corpus 2 — clones OWASP-Benchmark/BenchmarkJava v1.2 to /tmp on
# first run, then runs each tool against the 2,740 Java fixtures and
# reconciles findings against the published expectedresults-1.2.csv.
python3 scripts/bench-owasp.py                                   # all four
python3 scripts/bench-owasp.py --tools kcode,semgrep,semgrep-pro
```
If your numbers come out different, open an issue with the output and we'll investigate. The corpus is small enough that anomalies usually trace to a specific fixture.
See also: why a local LLM verifier matters · a platform tour of KCode · source on GitHub