BridgeBench
Security
Model Analysis

Claude Sonnet 4.6

anthropic/claude-sonnet-4-6

Overall score: 85.3 (86.7% visible, 81.8% hidden)

Tasks: 30
Passed: 14
Failed: 16
Avg latency: 28,300 ms
Total cost: $1.1848

AI Commentary

by openai/gpt-5.4-mini

Claude Sonnet 4.6 is strong on security-oriented implementation tasks, with high scores in sanitization (98.8), access control (97.6), and traffic protection (95.6), and it handled most visible tests well (86.7%). Its main gaps are in brittle edge-case handling and output correctness on a few specialized validators: auth/session drops to 80.3 due to OAuth state and password-strength issues, detection/analysis is uneven at 79.6, and crypto-utils is the weakest area at 52.6 with structural failures and incorrect outputs.

Domain Performance

Sanitization · 6 tasks
98.8

Performance is near-ceiling across all six tasks: file upload validation, hostname allowlisting, HTML entity encoding, safe redirect building, and URL sanitization each score 99.5 or higher, and input sanitization is the lowest at 93.7. No meaningful weaknesses appear here, suggesting strong baseline defensive coding and normalization behavior.

Auth & Session · 7 tasks
80.3

This domain is mostly solid, but the OAuth state validator failed catastrophically with null output and a TypeError, indicating a broken code path rather than a logic miss. Password strength scoring also overestimated several weak passwords and missed multiple feedback conditions, so the model is less reliable when rules require nuanced aggregation of multiple weakness signals.

Access Control · 5 tasks
97.6

Access control is a clear strength, with scores of 94.2 or higher on all five tasks: the access-control engine, ABAC rule engine, API key scope checking, permission checking, and tenant isolation. The high average suggests the model handles authorization boundaries and policy evaluation consistently.

Detection & Analysis · 9 tasks
79.6

Results are mixed: CSP parsing, dependency risk classification, insecure config scanning, and vulnerability scanning are strong, but anomaly detection, secret detection, and SSRF detection are weak. The failures point to incomplete pattern matching and parser robustness issues, including truncated anomaly sets, missed secrets, and outright parse errors on SSRF inputs.

Traffic Protection · 1 task
95.6

Rate limiting is strong and appears stable, with no notable weaknesses in the single task. This suggests good handling of thresholding and request-accounting logic.

Crypto Utils · 2 tasks
52.6

This is the weakest domain by a wide margin, with one strong encryption pipeline task but a failing crypto-utils task that returned null and TypeErrors. The failure pattern suggests the implementation is not resilient to expected input shapes and may have broken assumptions about array/string handling.

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The task returned null with a TypeError on reading 'length', which indicates a runtime failure in state handling rather than an incorrect validation decision.
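A failure on reading 'length' usually means the state value was missing or not a string when the comparison ran. The following is a minimal Python sketch of a guarded comparison, not the benchmark's reference solution; the function name `validate_state` and its two-argument shape are assumptions made for illustration.

```python
import hmac


def validate_state(expected, received):
    """Return True only when both state values are non-empty strings that match.

    Hypothetical helper: missing or malformed input is rejected with False
    instead of raising, so the caller always gets a validation decision.
    """
    if not isinstance(expected, str) or not isinstance(received, str):
        return False
    if not expected or not received:
        return False
    # Compare as bytes in constant time to avoid leaking how many
    # leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Checking types before touching the values keeps a missing `state` parameter on the "invalid" path rather than the "crash" path.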

sec-password-strength · 66.9 · Auth & Session

The model mis-scored several weak passwords, omitted required feedback items, and classified some cases as stronger than expected, showing inconsistent rule aggregation and thresholding.
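One way to keep the score and the feedback from drifting apart is to evaluate every rule independently and derive the score from the resulting feedback list. The sketch below assumes a hypothetical rule set, blocklist, and a 20-point penalty per weakness; none of these come from the benchmark's actual specification.

```python
import re

# Hypothetical, deliberately tiny blocklist for illustration only.
COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein"}

# Each rule is (feedback message, predicate returning True when the weakness is present).
RULES = [
    ("use at least 12 characters", lambda p: len(p) < 12),
    ("add a lowercase letter", lambda p: re.search(r"[a-z]", p) is None),
    ("add an uppercase letter", lambda p: re.search(r"[A-Z]", p) is None),
    ("add a digit", lambda p: re.search(r"[0-9]", p) is None),
    ("add a symbol", lambda p: re.search(r"[^A-Za-z0-9]", p) is None),
    ("avoid common passwords", lambda p: p.lower() in COMMON_PASSWORDS),
    ("avoid three or more repeated characters", lambda p: re.search(r"(.)\1\1", p) is not None),
]


def assess(password):
    """Return (score, feedback) with every triggered rule reported."""
    # No early return: all rules run, so no feedback item is dropped.
    feedback = [message for message, is_weak in RULES if is_weak(password)]
    # The score is computed from the same list, so it cannot disagree with the feedback.
    score = max(0, 100 - 20 * len(feedback))
    return score, feedback
```

Because each weakness subtracts from the score, a password that trips several rules cannot end up rated as strong.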

sec-ssrf-detector · 9.5 · Detection & Analysis

All reported failures were parse-level crashes ('Unexpected end of input'), so the detector likely cannot robustly parse or normalize malformed/edge-case URLs before applying SSRF policy checks.
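A detector that treats parsing as fallible can return a verdict for malformed input instead of crashing. The following Python sketch illustrates that idea under stated assumptions: the function name `check_url`, the `(allowed, reason)` return shape, and the policy of rejecting anything unparseable are all hypothetical, and hostnames are not resolved here, which a real detector would need to do.

```python
import ipaddress
from urllib.parse import urlsplit


def check_url(raw):
    """Return (allowed, reason) without raising on malformed input."""
    if not isinstance(raw, str) or not raw.strip():
        return (False, "empty or non-string input")
    try:
        parts = urlsplit(raw.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for inputs such as an unclosed IPv6 bracket.
        return (False, "unparseable URL")
    if parts.scheme not in ("http", "https"):
        return (False, "scheme not allowed")
    if not host:
        return (False, "missing host")
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. This sketch stops here; resolving the name and
        # checking the resolved addresses is left out on purpose.
        return (True, "hostname, not resolved in this sketch")
    if (address.is_private or address.is_loopback
            or address.is_link_local or address.is_reserved
            or address.is_unspecified):
        return (False, "internal or reserved address")
    return (True, "public address")
```

Rejecting unparseable input by default is the conservative choice for SSRF checks, since a URL the detector cannot parse may still be parsed differently by the HTTP client.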

sec-secret-detector · 69.5 · Detection & Analysis

It truncated secret matches and missed at least one AWS secret key entirely, which points to brittle regex extraction and incomplete multi-secret handling.
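Truncated and missed matches often come from patterns that lack a fixed length or from code that stops at the first hit. The sketch below is a hypothetical illustration, not the benchmark's expected detector: it uses the widely documented `AKIA` prefix with 16 following characters for access key IDs, assumes secret keys appear next to an `aws_secret_access_key` label, and collects every match with `finditer`. The sample values used to exercise it are the placeholder keys from AWS documentation.

```python
import re

PATTERNS = {
    # Access key ID: "AKIA" followed by exactly 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
    # Secret access key: 40 base64-style characters after a labelled assignment.
    # The trailing lookahead rejects longer runs instead of truncating them.
    "aws_secret_access_key": re.compile(
        r"(?i:aws_secret_access_key)\s*[=:]\s*[\"']?"
        r"([A-Za-z0-9/+=]{40})(?![A-Za-z0-9/+=])"
    ),
}


def find_secrets(text):
    """Return a list of (kind, value) for every match of every pattern."""
    findings = []
    for kind, pattern in PATTERNS.items():
        # finditer walks the whole input, so several secrets in one file are all reported.
        for match in pattern.finditer(text):
            findings.append((kind, match.group(1)))
    return findings
```

Capturing the full fixed-length group, rather than a greedy or lazy run, is what keeps the reported value from being cut short.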

sec-access-control-engine · 99.5 · Access Control

This is a strong authorization result, indicating the model can correctly encode access rules and preserve tenant or permission boundaries under test.

All Task Results

Task | Domain | Score
sec-crypto-utils | Crypto Utils | 9.5
sec-oauth-state-validator | Auth & Session | 9.5
sec-ssrf-detector | Detection & Analysis | 9.5
sec-auth-log-anomaly-detector | Detection & Analysis | 62.2
sec-password-strength | Auth & Session | 66.9
sec-secret-detector | Detection & Analysis | 69.5
sec-csp-nonce-validator | Detection & Analysis | 88.3
sec-session-fixation-detector | Auth & Session | 91.7
sec-sql-injection-detector | Detection & Analysis | 92.2
sec-input-sanitizer | Sanitization | 93.7
sec-abac-rule-engine | Access Control | 94.2
sec-api-key-scope-checker | Access Control | 95.1
sec-refresh-token-rotation | Auth & Session | 95.3
sec-rate-limit-engine | Traffic Protection | 95.6
sec-encryption-pipeline | Crypto Utils | 95.7
sec-csp-parser | Detection & Analysis | 96.1
sec-access-control-engine | Access Control | 99.5
sec-cookie-policy-validator | Auth & Session | 99.5
sec-csrf-token-manager | Auth & Session | 99.5
sec-dependency-risk-classifier | Detection & Analysis | 99.5
sec-file-upload-validator | Sanitization | 99.5
sec-insecure-config-scanner | Detection & Analysis | 99.5
sec-jwt-validator | Auth & Session | 99.5
sec-permission-checker | Access Control | 99.5
sec-safe-redirect-builder | Sanitization | 99.5
sec-tenant-isolation-checker | Access Control | 99.5
sec-vulnerability-scanner | Detection & Analysis | 99.5
sec-hostname-allowlist-validator | Sanitization | 100.0
sec-html-entity-encoder | Sanitization | 100.0
sec-url-sanitizer | Sanitization | 100.0

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
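The headline numbers can be cross-checked against the per-task scores. The Python sketch below recomputes each domain score and the overall score as unweighted means of the task scores listed above, and counts tasks scoring at least 99.5. Both the unweighted-mean aggregation and the 99.5 threshold are inferences from the figures on this page, not something the page states.

```python
from collections import defaultdict

# (task, domain, score) rows copied from the task results listing.
RESULTS = [
    ("sec-crypto-utils", "Crypto Utils", 9.5),
    ("sec-oauth-state-validator", "Auth & Session", 9.5),
    ("sec-ssrf-detector", "Detection & Analysis", 9.5),
    ("sec-auth-log-anomaly-detector", "Detection & Analysis", 62.2),
    ("sec-password-strength", "Auth & Session", 66.9),
    ("sec-secret-detector", "Detection & Analysis", 69.5),
    ("sec-csp-nonce-validator", "Detection & Analysis", 88.3),
    ("sec-session-fixation-detector", "Auth & Session", 91.7),
    ("sec-sql-injection-detector", "Detection & Analysis", 92.2),
    ("sec-input-sanitizer", "Sanitization", 93.7),
    ("sec-abac-rule-engine", "Access Control", 94.2),
    ("sec-api-key-scope-checker", "Access Control", 95.1),
    ("sec-refresh-token-rotation", "Auth & Session", 95.3),
    ("sec-rate-limit-engine", "Traffic Protection", 95.6),
    ("sec-encryption-pipeline", "Crypto Utils", 95.7),
    ("sec-csp-parser", "Detection & Analysis", 96.1),
    ("sec-access-control-engine", "Access Control", 99.5),
    ("sec-cookie-policy-validator", "Auth & Session", 99.5),
    ("sec-csrf-token-manager", "Auth & Session", 99.5),
    ("sec-dependency-risk-classifier", "Detection & Analysis", 99.5),
    ("sec-file-upload-validator", "Sanitization", 99.5),
    ("sec-insecure-config-scanner", "Detection & Analysis", 99.5),
    ("sec-jwt-validator", "Auth & Session", 99.5),
    ("sec-permission-checker", "Access Control", 99.5),
    ("sec-safe-redirect-builder", "Sanitization", 99.5),
    ("sec-tenant-isolation-checker", "Access Control", 99.5),
    ("sec-vulnerability-scanner", "Detection & Analysis", 99.5),
    ("sec-hostname-allowlist-validator", "Sanitization", 100.0),
    ("sec-html-entity-encoder", "Sanitization", 100.0),
    ("sec-url-sanitizer", "Sanitization", 100.0),
]


def mean(values):
    return sum(values) / len(values)


scores_by_domain = defaultdict(list)
for _task, domain, score in RESULTS:
    scores_by_domain[domain].append(score)

# Assumption: domain and overall scores are unweighted means, rounded to one decimal.
domain_avg = {d: round(mean(s), 1) for d, s in scores_by_domain.items()}
overall = round(mean([score for _, _, score in RESULTS]), 1)

# Assumption: the page's "Passed" count corresponds to tasks scoring at least 99.5.
near_perfect = sum(1 for _, _, score in RESULTS if score >= 99.5)

print(domain_avg)
print(overall, near_perfect)
```

Under these assumptions the recomputed values agree with the page: the six domain scores match, the overall score comes out at 85.3, and 14 tasks score 99.5 or higher, the same number reported as passed.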