
AI Vulnerability Detection: Mythos, AIxCC, and the Jagged Frontier


Date: 2026-04-14
Context: State of AI-powered security scanning as of April 2026. Practical tooling choices for small development teams (Python + C++ + frontend).


The Jagged Frontier

Claude Mythos Preview (Anthropic, Mar-Apr 2026) found thousands of zero-days across major OSes and browsers in weeks. Predecessor Opus 4.6 found ~500 zero-days in production OSS codebases in Feb 2026 with no specialized scaffolding.

AISLE's key finding (Stanislav Fort, ex-Anthropic Frontier Red Team): the same FreeBSD NFS buffer overflow that Mythos was showcased finding was reproduced by 8 of 8 tested models, including a 3.6B-parameter model running at $0.11/M tokens.

| Task | Mythos analog | Smallest model that matched |
|---|---|---|
| FreeBSD NFS buffer-overflow (Mythos showcase) | All 8 tested models | 3.6B active params, $0.11/M tokens |
| OpenBSD SACK 27-year-old bug | Full chain | 5.1B active open-weights |
| OWASP false-positive discrimination | Inverse scaling - small models often better | GPT-OSS-120b failed Java ArrayList trace |

The moat is the system, not the model. What differentiates Mythos-class results is the scaffolding (iterative triage, targeting, fix verification, maintainer coordination) - not the model weights. A team with good tooling and a 5-10B open model gets ~80% of Mythos results at <1% of the cost.
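The scaffolding claim can be made concrete. Below is a minimal sketch of the triage-and-verify loop such a system needs around any model; all names here are illustrative stand-ins, not any vendor's API, and `reproduce` is a placeholder for a sandboxed proof-of-concept run.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    line: int
    cwe: str
    confidence: float  # model-reported confidence, 0..1

def triage(findings, min_confidence=0.7):
    """Deduplicate by (file, line, cwe) and drop low-confidence leads."""
    seen, kept = set(), []
    for f in findings:
        key = (f.file, f.line, f.cwe)
        if key in seen or f.confidence < min_confidence:
            continue
        seen.add(key)
        kept.append(f)
    return kept

def pipeline(findings, reproduce):
    """Only keep deduplicated findings whose exploit path reproduces."""
    return [f for f in triage(findings) if reproduce(f)]
```

The point of the sketch: the model only generates candidate `Finding`s; dedup, confidence gating, and reproduction are cheap classical code, and they are what turn raw model output into filable reports.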

Jagged capability profile: no single model dominates all tasks. Rankings reshuffle per-task. A model that recovers a 27-year-old kernel bug may fail to trace data flow through a Java ArrayList.


DARPA AIxCC Results (2025)

Four open-source Cyber Reasoning Systems (CRSs) were produced. Final round results:

| Metric | Semifinal | Final |
|---|---|---|
| Synthetic vuln detection | 37% | 86% |
| Synthetic vuln patching | 25% | 68% |
| Real non-synthetic vulns found | - | 18 (6 zero-days) |
| Avg cost per task | - | ~$152 |

Prize allocations: Team Atlanta ($4M, 1st), Trail of Bits Buttercup ($3M, 2nd), Theori ($1.5M, 3rd).

Cost comparison: $152/finding via AI vs $5,000-50,000 for equivalent manual security audit work.

Paper: arxiv:2602.07666 - SoK: DARPA's AIxCC


Tool Landscape (April 2026)

Open-Source / Free

| Tool | Key Facts | Best For |
|---|---|---|
| Buttercup (Trail of Bits) | AIxCC 2nd place. 28 vulns, 90% accuracy, ~$181/finding. Non-reasoning LLMs only. Laptop-runnable. | C/C++ periodic deep scans |
| CodeQL (GitHub) | Semantic code database. Underlies Copilot Autofix. FP ~68%, slow on large repos. | Free static analysis on repos |
| Semgrep community | Fast signature matching. LLM post-filter drops FP from ~74% to ~6%. CI-native. | Pre-commit hooks, fast patterns |
| Claude Code Opus 4.6 | 500 zero-days in OSS (Feb 2026), no special scaffolding. 1M context. | PR review, auth/crypto modules |
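The Semgrep-plus-LLM-filter pattern from the table is a post-processing step over `semgrep --json` output. The `results`, `check_id`, `path`, and `extra.message` fields below are real Semgrep JSON structure; `llm_is_true_positive` is a hypothetical stand-in for whatever model call you wire in.

```python
import json

def filter_findings(semgrep_json: str, llm_is_true_positive) -> list:
    """Keep only Semgrep findings the classifier judges exploitable.

    `llm_is_true_positive(check_id, message, path)` is a placeholder
    for an actual LLM call; any callable with that shape works.
    """
    report = json.loads(semgrep_json)
    kept = []
    for r in report.get("results", []):
        if llm_is_true_positive(r["check_id"], r["extra"]["message"], r["path"]):
            kept.append(r)
    return kept
```

Running the classifier only on Semgrep's (fast, noisy) matches keeps LLM cost proportional to finding count rather than repo size, which is what makes the ~74% to ~6% FP drop affordable in CI.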

Commercial

| Tool | Key Facts | When to Use |
|---|---|---|
| GitHub Copilot Autofix | Fix suggestions on CodeQL findings. 2025: 0.66h avg resolution vs 1.29h manual. Expanding to Shell/Dockerfile/Terraform/PHP Q2 2026. | Already paying for Copilot |
| OpenAI Aardvark (private beta) | GPT-5 agent. 92% recall. 10 real CVEs. | Not available yet |
| Semgrep Pro/AI | Rule marketplace + AI triage. ~$40/dev/month. | 10+ dev team, compliance |
| Snyk DeepCode AI | Enterprise CI-native. SCA+SAST+container+IaC. $25-100/dev/month. | Compliance regime requires it |
| Checkmarx One | Heavy, slow, high FP legacy. Enterprise contract. | Compliance only |

Research / Internal

Google Big Sleep: CVE-2025-6965 (SQLite, Jul 2025) - first AI agent to prevent an in-the-wild exploit per Google. 20+ flaws in popular OSS. Internal tool, not available.

Anthropic Mythos Preview: Withheld from general availability. Access only via Project Glasswing partners (AWS, Apple, Google, Microsoft, NVIDIA, etc.). Not accessible to most teams.


Integration Patterns by Phase

| Phase | Tools | FP Budget | Blocking? |
|---|---|---|---|
| Pre-commit | Semgrep community (no LLM) | <5% | Yes - fails commit |
| PR-time | CodeQL + Copilot Autofix + Claude Code review (security-sensitive files) | <15% | Advisory |
| Nightly/weekly | Buttercup agentic deep scan (C++ modules) | <30% - triaged before filing | No |
| Quarterly | Full Claude Code pass with fresh context + full file content | N/A - every lead is a ticket | No |
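One way to encode that phase policy is a small dispatcher that CI scripts can share, mapping each phase to its tools, FP budget, and whether findings fail the build. Tool names and budgets come from the table above; the dispatcher itself is an illustrative sketch.

```python
# Phase policy: which tools run, the tolerated false-positive rate,
# and whether any finding should fail the pipeline.
PHASES = {
    "pre-commit": {"tools": ["semgrep-community"], "fp_budget": 0.05, "blocking": True},
    "pr":         {"tools": ["codeql", "copilot-autofix", "claude-code"],
                   "fp_budget": 0.15, "blocking": False},
    "nightly":    {"tools": ["buttercup"], "fp_budget": 0.30, "blocking": False},
    "quarterly":  {"tools": ["claude-code-full"], "fp_budget": None, "blocking": False},
}

def exit_code(phase: str, finding_count: int) -> int:
    """CI exit code: non-zero only when a blocking phase has findings."""
    cfg = PHASES[phase]
    return 1 if cfg["blocking"] and finding_count > 0 else 0
```

Keeping the policy in one table means "pre-commit blocks, everything else advises" is enforced mechanically rather than re-decided per repo.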

Implementation cost estimate (medium Python+C++ repo):

  • CodeQL scan (100K LOC): ~free, 5-15 minutes
  • Claude Code PR review (~500 LOC diff): <$0.50 via Opus 4.6
  • Buttercup deep scan (full repo): $50-200, depends on LLM selection
  • Semgrep community scan: free, seconds
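The per-PR figure can be sanity-checked with back-of-envelope token math. The tokens-per-line ratio and the dollar rates below are assumptions for illustration, not published pricing.

```python
def review_cost(diff_loc: int, in_rate: float, out_rate: float,
                tokens_per_loc: int = 40, out_tokens: int = 2_000) -> float:
    """Estimate one PR review's LLM cost in dollars.

    Assumes ~40 input tokens per line of diff (code plus surrounding
    context), a fixed output budget, and rates in $/million tokens.
    """
    in_tokens = diff_loc * tokens_per_loc
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

# A 500-LOC diff at assumed $15/M input and $75/M output rates:
cost = review_cost(500, in_rate=15.0, out_rate=75.0)
```

Under these assumed rates a 500-LOC diff lands under $0.50, consistent with the estimate above; the formula makes it easy to re-check when rates or context sizes change.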


What AI Scanners Still Miss (April 2026)

Blind spots confirmed by jagged-frontier research and community benchmarks:

  • Multi-file auth/authz logic - module A issues a token, module B validates it differently across file boundaries
  • Business-logic flaws - pricing bypasses, quota exhaustion, state transition races; not CWE-tagged, not pattern-matchable
  • Crypto misuse under composition - nonce reuse across modules, KDF on user-controlled input (not single-function obvious)
  • Cross-process race conditions, TOCTOU at OS boundary (requires multi-process trace)
  • Heavy abstraction data flow - through ORMs, DI containers, dynamic-dispatch C++; even GPT-OSS-120b failed ArrayList trace
  • Supply-chain attacks (malicious dependencies, typosquatting) - requires specialized tooling (min-release-age, Socket.dev), not code scanners
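The first blind spot is easiest to see in miniature. In the hypothetical two-module example below, the issuer signs tokens with an expiry claim, but the validator (written against an older trust assumption) checks only the signature, so expired tokens pass. Each file looks correct in isolation, which is exactly why per-file scanners miss it.

```python
import hmac
import hashlib
import time

SECRET = b"shared-secret"  # illustrative only; never hardcode in real code

# --- module A (issuer.py): issues tokens with an expiry claim ---
def issue(user: str, ttl: int = 3600, now=None) -> str:
    exp = int((now if now is not None else time.time()) + ttl)
    payload = f"{user}:{exp}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

# --- module B (validator.py): checks the signature but NOT the expiry ---
def validate(token: str) -> bool:
    user, exp, sig = token.rsplit(":", 2)
    payload = f"{user}:{exp}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)  # bug: `exp` is never compared to now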

AI is reliable for: CWE-common memory/injection patterns (buffer overflows, SQL injection, path traversal), single-function bugs, well-understood vulnerability classes in C/C++.


Structured Reasoning Protocol for Claude Code Security Review

For PRs touching auth, file I/O, subprocess invocation, unsafe C++ pointers:

Prompt template:
"Review this code as a security researcher.

Premises (facts you can observe directly):
- [list actual data types, API calls, trust boundaries]

Execution trace:
- [trace exact data flow from input to output]

Conclusions (what follows from trace):
- [specific vulnerability class if exists]

Rejected paths:
- [hypotheses that don't fit the evidence, and why]

Find: CWE Top-25, injection, memory safety, auth bypass, race conditions."

This structured prompt (from arxiv:2603.01896) increases accuracy from ~78% to ~93% on real-world patches vs free-form chain-of-thought.


Practical Setup for a Small Team

Free, no new tools (do this week):

  1. Enable GitHub Actions + CodeQL on every private repo (one-time setup)
  2. Enable Copilot Autofix if already paying for GitHub Copilot
  3. Add REVIEW.md to each repo: list "always check" items for security-sensitive PRs
  4. Use Claude Code with structured reasoning for any PR touching auth/crypto/native

Add within a month (free, some setup):

  5. Semgrep community on pre-commit (blocks obvious issues before code review)
  6. Buttercup nightly cron on C++ code - its AIxCC lineage makes it reliable for memory safety

When to pay:

  • >10 developers: Corgea / ZeroPath / Qwiet AI (better FPR than Snyk at lower cost)
  • Compliance requirement (SOC2, HIPAA, PCI): Snyk or Checkmarx (auditors accept their reports)
  • Do not pay for Snyk/Checkmarx otherwise - Claude Code + CodeQL + Semgrep covers 80% at near-zero cost


Gotchas

  • Raw SAST without an LLM filter is unusable. CodeQL alone produces 68-95% false-positive rates, and teams disable alerts once FP exceeds ~15%. Always add an LLM triage layer (Copilot Autofix, Semgrep AI) or a manual review step; never alert directly from raw SAST output.
  • Capability ranking reshuffles per task. The model that recovered an OpenBSD kernel bug may fail a Java data-flow trace. Don't commit to one model for all scan types. Mix: use cheaper models for common CWE patterns, Opus 4.6 for complex multi-file auth flows.
  • Buttercup was designed for C/C++ DARPA harnesses. The open-source version may need configuration to run against Python or web stacks. Its strength is memory safety bugs in native code - that's where it earns its $181/finding stat.
  • "AI found thousands of zero-days" does not mean general-purpose AI replaces security engineers. Mythos-class performance requires coordinated disclosure workflows, maintainer relationships, and patch verification that the model cannot do alone. The scaffolding is the product.
  • Business-logic and multi-file auth bugs will be missed. No AI scanner as of April 2026 reliably catches bugs that span multiple files with different trust assumptions (e.g., a token issued by module A that module B doesn't verify correctly). These require human architecture review.

See Also