Security v2
How well can each AI model write secure code? 30 expert-level security tasks across 6 domains, with 500+ test cases including adversarial hidden edge cases.
Updated 2026-03-27
| Domain | Tasks | Avg score |
|---|---|---|
| Sanitization | 6 | 94.8 |
| Auth & Session | 7 | 68.8 |
| Access Control | 5 | 94.6 |
| Detection & Analysis | 9 | 72.5 |
| Traffic Protection | 1 | 93.9 |
| Crypto Utils | 2 | 49.5 |
- Visible tests: shown to the model
- Hidden tests: adversarial edge cases
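The legend above implies each model's overall score aggregates its visible and hidden pass rates. As an illustration only, a weighted blend could be computed as below; the `blend_score` helper and the 0.7/0.3 weighting are assumptions for this sketch, not the benchmark's documented method (the published scores do not exactly match this weighting).

```python
def blend_score(visible_pct: float, hidden_pct: float, w_visible: float = 0.7) -> float:
    """Blend visible and hidden pass rates into one score.

    The 0.7/0.3 split is a hypothetical example weighting; the
    leaderboard does not document its actual aggregation formula.
    """
    if not 0.0 <= w_visible <= 1.0:
        raise ValueError("w_visible must be between 0 and 1")
    return w_visible * visible_pct + (1.0 - w_visible) * hidden_pct

# Example: Claude Sonnet 4.6's pass rates from the table below
print(round(blend_score(86.7, 81.8), 1))  # → 85.2
```

Note that this hypothetical blend yields 85.2 for the top row rather than the published 85.3, which is one reason the weighting should be treated as an assumption.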
| Rank | Model | Score | Visible | Hidden |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 (`anthropic/claude-sonnet-4-6`) | 85.3 | 86.7% | 81.8% |
| 2 | Gemini 3.1 Pro (`google/gemini-3.1-pro-preview`) | 85.2 | 85.6% | 82.2% |
| 3 | GPT-5.4 (`openai/gpt-5.4`) | 84.8 | 85.6% | 81.6% |
| 4 | GPT-5.4 Mini (`openai/gpt-5.4-mini`) | 83.2 | 84.4% | 79.3% |
| 5 | GPT-5.4 Nano (`openai/gpt-5.4-nano`) | 81.9 | 83.9% | 77.1% |
| 6 | Claude Opus 4.6 (`openrouter/anthropic/claude-opus-4-6`) | 81.6 | 81.7% | 78.6% |
| 7 | Grok 4.20 Reasoning (`x-ai/grok-4.20-reasoning`) | 78.9 | 80.0% | 74.0% |
| 8 | Grok 4.20 (Non-Reasoning) (`x-ai/grok-4.20`) | 76.3 | 76.7% | 71.7% |
| 9 | MiMo-V2-Pro (`openrouter/xiaomi/mimo-v2-pro`) | 53.2 | 54.4% | 49.0% |
9 models tested · 30 tasks · 500+ test cases