BridgeBench
Security v2

Security

How well can each AI model write secure code? 30 expert-level security tasks across 6 domains, with 500+ test cases including adversarial hidden edge cases.

Updated 2026-03-27

Domain                 Tasks   Avg score
Sanitization           6       94.8
Auth & Session         7       68.8
Access Control         5       94.6
Detection & Analysis   9       72.5
Traffic Protection     1       93.9
Crypto Utils           2       49.5

Visible tests: shown to the model
Hidden tests: adversarial edge cases
Rank   Model                       Identifier                               Score
1      Claude Sonnet 4.6           anthropic/claude-sonnet-4-6              85.3
2      Gemini 3.1 Pro              google/gemini-3.1-pro-preview            85.2
3      GPT-5.4                     openai/gpt-5.4                           84.8
4      GPT-5.4 Mini                openai/gpt-5.4-mini                      83.2
5      GPT-5.4 Nano                openai/gpt-5.4-nano                      81.9
6      Claude Opus 4.6             openrouter/anthropic/claude-opus-4-6     81.6
7      Grok 4.20 Reasoning         x-ai/grok-4.20-reasoning                 78.9
8      Grok 4.20 (Non-Reasoning)   x-ai/grok-4.20                           76.3
9      MiMo-V2-Pro                 openrouter/xiaomi/mimo-v2-pro            53.2

9 models tested · 30 tasks · 500+ test cases