AI Vulnerability Detection: Mythos, AIxCC, and the Jagged Frontier¶
Date: 2026-04-14
Context: State of AI-powered security scanning as of April 2026. Practical tooling choices for small development teams (Python + C++ + frontend).
The Jagged Frontier¶
Claude Mythos Preview (Anthropic, Mar-Apr 2026) found thousands of zero-days across major OSes and browsers in weeks. Predecessor Opus 4.6 found ~500 zero-days in production OSS codebases in Feb 2026 with no specialized scaffolding.
AISLE's key finding (Stanislav Fort, ex-Anthropic Frontier Red Team): the FreeBSD NFS buffer-overflow exploit showcased in Mythos's results was reproduced by all 8 models tested, including a 3.6B-parameter model at $0.11/M tokens.
| Task | Mythos analog | Smallest model that matched |
|---|---|---|
| FreeBSD NFS buffer-overflow (Mythos showcase) | All 8 tested models | 3.6B active params, $0.11/M tokens |
| OpenBSD SACK 27-year-old bug | Full chain | 5.1B active open-weights |
| OWASP false-positive discrimination | Inverse scaling - small models often better | GPT-OSS-120b failed Java ArrayList trace |
The moat is the system, not the model. What differentiates Mythos-class results is the scaffolding (iterative triage, targeting, fix verification, maintainer coordination) - not the model weights. A team with good tooling and a 5-10B open model gets ~80% of Mythos results at <1% of the cost.
Jagged capability profile: no single model dominates all tasks. Rankings reshuffle per-task. A model that recovers a 27-year-old kernel bug may fail to trace data flow through a Java ArrayList.
DARPA AIxCC Results (2025)¶
The competition produced four open-source Cyber Reasoning Systems (CRSs). Final-round results:
| Metric | Semifinal | Final |
|---|---|---|
| Synthetic vuln detection | 37% | 86% |
| Synthetic vuln patching | 25% | 68% |
| Real non-synthetic vulns found | - | 18 (6 zero-days) |
| Avg cost per task | - | ~$152 |
Prize allocations: Team Atlanta ($4M, 1st), Trail of Bits Buttercup ($3M, 2nd), Theori ($1.5M, 3rd).
Cost comparison: $152/finding via AI vs $5,000-50,000 for equivalent manual security audit work.
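That comparison is just a ratio; a back-of-envelope check using the figures above (illustrative arithmetic only, not a cost model):

```python
# Back-of-envelope: how much cheaper per finding is AI-assisted detection at
# the AIxCC final-round cost than manual audit work? Figures from the results
# above; real costs vary with triage overhead and finding quality.
AI_COST_PER_FINDING = 152               # USD, AIxCC final round
MANUAL_COST_RANGE = (5_000, 50_000)     # USD, equivalent manual audit work

savings_low = MANUAL_COST_RANGE[0] / AI_COST_PER_FINDING    # ~33x cheaper
savings_high = MANUAL_COST_RANGE[1] / AI_COST_PER_FINDING   # ~329x cheaper
```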
Paper: arxiv:2602.07666 - SoK: DARPA's AIxCC
Tool Landscape (April 2026)¶
Open-Source / Free¶
| Tool | Key Facts | Best For |
|---|---|---|
| Buttercup (Trail of Bits) | AIxCC 2nd place. 28 vulns, 90% accuracy, ~$181/finding. Non-reasoning LLMs only. Laptop-runnable. | C/C++ periodic deep scans |
| CodeQL (GitHub) | Semantic code database. Underlies Copilot Autofix. FP ~68%, slow on large repos. | Free static analysis on repos |
| Semgrep community | Fast signature matching. LLM post-filter drops FP from ~74% to ~6%. CI-native. | Pre-commit hooks, fast patterns |
| Claude Code Opus 4.6 | 500 zero-days in OSS (Feb 2026), no special scaffolding. 1M context. | PR review, auth/crypto modules |
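The Semgrep row's LLM post-filter pattern can be sketched as a thin wrapper: run the scanner, then let a model discard likely false positives before anything alerts. The `classify` stub below stands in for a real model call; the Semgrep flags are standard CLI options, but verify them against your installed version:

```python
# Sketch of the "raw SAST + LLM post-filter" pattern: run Semgrep, then triage
# each finding before alerting. `classify` is a placeholder for a real LLM
# call (e.g. via an Anthropic/OpenAI SDK) that sees the finding plus code.
import json
import subprocess

def run_semgrep(path: str) -> list[dict]:
    """Run Semgrep community rules and return findings as dicts."""
    out = subprocess.run(
        ["semgrep", "scan", "--config=auto", "--json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["results"]

def classify(finding: dict) -> bool:
    """Placeholder triage: keep only higher-severity findings.
    Replace with a model call for the ~74% -> ~6% FP drop cited above."""
    return finding.get("extra", {}).get("severity") in ("ERROR", "WARNING")

def triaged_findings(path: str) -> list[dict]:
    """Findings that survive the post-filter; only these should alert."""
    return [f for f in run_semgrep(path) if classify(f)]
```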
Commercial¶
| Tool | Key Facts | When to Use |
|---|---|---|
| GitHub Copilot Autofix | Fix suggestions on CodeQL findings. 2025: 0.66h avg resolution vs 1.29h manual. Expanding to Shell/Dockerfile/Terraform/PHP Q2 2026. | Already paying for Copilot |
| OpenAI Aardvark (private beta) | GPT-5 agent. 92% recall. 10 real CVEs. | Not available yet |
| Semgrep Pro/AI | Rule marketplace + AI triage. ~$40/dev/month. | 10+ dev team, compliance |
| Snyk DeepCode AI | Enterprise CI-native. SCA+SAST+container+IaC. $25-100/dev/month. | Compliance regime requires it |
| Checkmarx One | Heavy, slow, high FP legacy. Enterprise contract. | Compliance only |
Research / Internal¶
Google Big Sleep: CVE-2025-6965 (SQLite, Jul 2025) - first AI agent to prevent an in-the-wild exploit per Google. 20+ flaws in popular OSS. Internal tool, not available.
Anthropic Mythos Preview: Withheld from general availability. Access only via Project Glasswing partners (AWS, Apple, Google, Microsoft, NVIDIA, etc.). Not accessible to most teams.
Integration Patterns by Phase¶
| Phase | Tools | FP Budget | Blocking? |
|---|---|---|---|
| Pre-commit | Semgrep community (no LLM) | <5% | Yes - fails commit |
| PR-time | CodeQL + Copilot Autofix + Claude Code review (security-sensitive files) | <15% | Advisory |
| Nightly/weekly | Buttercup agentic deep scan (C++ modules) | <30% - triaged before filing | No |
| Quarterly | Full Claude Code pass with fresh context + full file content | N/A - every lead is a ticket | No |
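The phase table can be encoded directly as policy data; a minimal sketch (thresholds from the table above, function and key names are illustrative):

```python
# Phase policy from the integration table: each phase gets a false-positive
# budget and a blocking flag. Structure and names are illustrative.
PHASES = {
    "pre-commit": {"fp_budget": 0.05, "blocking": True},
    "pr":         {"fp_budget": 0.15, "blocking": False},  # advisory
    "nightly":    {"fp_budget": 0.30, "blocking": False},
    "quarterly":  {"fp_budget": 1.00, "blocking": False},  # every lead is a ticket
}

def gate(phase: str, observed_fp_rate: float, finding_is_real: bool) -> str:
    """Decide what to do with one finding under the phase's policy."""
    policy = PHASES[phase]
    if observed_fp_rate > policy["fp_budget"]:
        return "pause-alerts"   # FP budget blown: fix triage, don't page people
    if policy["blocking"] and finding_is_real:
        return "fail-build"
    return "advisory-comment"
```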
Implementation cost estimate (medium Python+C++ repo):
- CodeQL scan (100K LOC): ~free, 5-15 minutes
- Claude Code PR review (~500 LOC diff): <$0.50 via Opus 4.6
- Buttercup deep scan (full repo): $50-200, depends on LLM selection
- Semgrep community scan: free, seconds
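Those per-scan figures combine into a rough monthly cost model; the sketch below assumes a weekly Buttercup cadence and the midpoints of the prices listed, all of which you should replace with your own measurements:

```python
# Rough monthly cost model for the pipeline above. All inputs are
# illustrative defaults taken from the per-scan estimates; plug in your own.
def monthly_cost(prs_per_month: int,
                 days_per_month: int = 30,
                 buttercup_scan_cost: float = 125.0,  # midpoint of $50-200
                 pr_review_cost: float = 0.50) -> float:
    codeql = 0.0        # free on GitHub-hosted repos
    semgrep = 0.0       # community edition, free
    claude_pr = prs_per_month * pr_review_cost
    # assume the Buttercup deep scan runs weekly rather than nightly
    buttercup = (days_per_month / 7) * buttercup_scan_cost
    return codeql + semgrep + claude_pr + buttercup
```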
What AI Scanners Still Miss (April 2026)¶
Blind spots confirmed by jagged-frontier research and community benchmarks:
- Multi-file auth/authz logic - module A issues a token, module B validates it differently across file boundaries
- Business-logic flaws - pricing bypasses, quota exhaustion, state transition races; not CWE-tagged, not pattern-matchable
- Crypto misuse under composition - nonce reuse across modules, KDF on user-controlled input (not single-function obvious)
- Cross-process race conditions, TOCTOU at OS boundary (requires multi-process trace)
- Heavy abstraction data flow - through ORMs, DI containers, dynamic-dispatch C++; even GPT-OSS-120b failed the Java ArrayList trace
- Supply-chain attacks (malicious dependencies, typosquatting) - requires specialized tooling (min-release-age, Socket.dev), not code scanners
AI is reliable for: CWE-common memory/injection patterns (buffer overflows, SQL injection, path traversal), single-function bugs, well-understood vulnerability classes in C/C++.
Structured Reasoning Protocol for Claude Code Security Review¶
For PRs touching auth, file I/O, subprocess invocation, unsafe C++ pointers:
Prompt template:
"Review this code as a security researcher.
Premises (facts you can observe directly):
- [list actual data types, API calls, trust boundaries]
Execution trace:
- [trace exact data flow from input to output]
Conclusions (what follows from trace):
- [specific vulnerability class if exists]
Rejected paths:
- [hypotheses that don't fit the evidence, and why]
Find: CWE Top-25, injection, memory safety, auth bypass, race conditions."
This structured prompt (from arxiv:2603.01896) increases accuracy from ~78% to ~93% on real-world patches vs free-form chain-of-thought.
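When wiring this template into a review script, it helps to assemble the prompt programmatically; a minimal sketch (function and section names are illustrative, not part of any official API):

```python
# Assemble the structured security-review prompt from its four reasoning
# sections, then append the code under review. Names are illustrative.
STRUCTURED_SECTIONS = [
    ("Premises (facts you can observe directly)",
     "list actual data types, API calls, trust boundaries"),
    ("Execution trace", "trace exact data flow from input to output"),
    ("Conclusions (what follows from trace)",
     "specific vulnerability class if exists"),
    ("Rejected paths", "hypotheses that don't fit the evidence, and why"),
]

def build_review_prompt(diff: str) -> str:
    """Wrap a PR diff in the structured-reasoning template."""
    lines = ["Review this code as a security researcher.", ""]
    for heading, instruction in STRUCTURED_SECTIONS:
        lines += [f"{heading}:", f"- [{instruction}]", ""]
    lines.append("Find: CWE Top-25, injection, memory safety, "
                 "auth bypass, race conditions.")
    lines += ["", "Code under review:", diff]
    return "\n".join(lines)
```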
Practical Setup for Small Team¶
Free, no new tools (do this week):
1. Enable GitHub Actions + CodeQL on every private repo (one-time setup)
2. Enable Copilot Autofix if already paying for GitHub Copilot
3. Add REVIEW.md to each repo: list "always check" items for security-sensitive PRs
4. Use Claude Code with structured reasoning for any PR touching auth/crypto/native
Add within a month (free, some setup):
5. Semgrep community on pre-commit (blocks obvious issues before code review)
6. Buttercup nightly cron on C++ code - its AIxCC lineage makes it reliable for memory safety
When to pay:
- >10 developers: Corgea / ZeroPath / Qwiet AI (better FPR than Snyk at lower cost)
- Compliance requirement (SOC2, HIPAA, PCI): Snyk or Checkmarx (auditors accept their reports)
- Do not pay for Snyk/Checkmarx otherwise - Claude Code + CodeQL + Semgrep covers 80% at near-zero cost
Gotchas¶
- Raw SAST without LLM filter is unusable. CodeQL alone produces 68-95% false positive rates. Teams disable alerts when FP exceeds ~15%. Always add LLM triage (Copilot Autofix, Semgrep AI) or manual review step, never alert directly from raw SAST output.
- Capability ranking reshuffles per task. The model that recovered an OpenBSD kernel bug may fail a Java data-flow trace. Don't commit to one model for all scan types. Mix: use cheaper models for common CWE patterns, Opus 4.6 for complex multi-file auth flows.
- Buttercup was designed for C/C++ DARPA harnesses. The open-source version may need configuration to run against Python or web stacks. Its strength is memory safety bugs in native code - that's where it earns its $181/finding stat.
- "AI found thousands of zero-days" does not mean general-purpose AI replaces security engineers. Mythos-class performance requires coordinated disclosure workflows, maintainer relationships, and patch verification that the model cannot do alone. The scaffolding is the product.
- Business-logic and multi-file auth bugs will be missed. No AI scanner as of April 2026 reliably catches bugs that span multiple files with different trust assumptions (e.g., a token issued by module A that module B doesn't verify correctly). These require human architecture review.