BridgeBench
Security
Model Analysis

Claude Opus 4.6

openrouter/anthropic/claude-opus-4-6

81.6

overall score

81.7% visible
78.6% hidden

Tasks

30

Passed

14

Failed

16

Avg latency

29,899 ms

Total cost

$1.9058

AI Commentary

by openai/gpt-5.4-mini

Claude Opus 4.6 is strong on security-oriented validation and policy logic, with excellent scores in sanitization (97.6), access control (97.5), and traffic protection (95.6). Its main weaknesses are stateful authentication edge cases and low-level parsing and formatting tasks: Auth & Session drops to 69.2 because of JWT validation, OAuth state, and password-strength scoring errors, and Crypto Utils is weakest at 54.5, where sec-crypto-utils returned null outputs and formatting failures.

Domain Performance

Sanitization · 6 tasks
97.6

Performance is near-ceiling at 97.6, with scores of 99.5 or higher on file upload, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization; the one softer result is sec-input-sanitizer at 87.3. The model appears reliable when the task is straightforward normalization or validation with clear rules.

Auth & Session · 7 tasks
69.2

This is a mixed domain at 69.2: cookie policy, CSRF token management, refresh token rotation, and session fixation detection all score 99.5, but JWT validation (27.7) and OAuth state handling (9.5) fail badly, and password strength scoring reaches only 49.5. The JWT result suggests brittle parsing or over-rejection of inputs, while the OAuth state task returned null with a TypeError, indicating missing output construction or unchecked access to an undefined value.

Access Control · 5 tasks
97.5

Very strong at 97.5, with correct handling of access control engine logic, API key scopes, permissions, and tenant isolation. No meaningful weakness appears in authorization reasoning or multi-tenant boundary enforcement.

Detection & Analysis · 9 tasks
76.3

At 76.3, the model is competent but inconsistent in security detection tasks. It performs well on CSP parsing, dependency risk classification, SQL injection detection, and vulnerability scanning, but misses or over-adds anomaly labels in auth-log analysis and fails hard on SSRF detection with parse errors, suggesting weaker robustness on structured edge-case inputs.

Traffic Protection · 1 task
95.6

The single rate-limit task scored 95.6, indicating solid handling of throttling logic and request control. There is not enough breadth here to infer broader weaknesses.

Crypto Utils · 2 tasks
54.5

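This is the weakest domain at 54.5, driven entirely by a severe failure in sec-crypto-utils (9.5), where outputs were null or empty and TypeErrors indicate length access on an undefined value. The other task in the domain, sec-encryption-pipeline, scored 99.5, so the weakness looks concentrated in exact byte/string transformations and output formatting in cryptographic helper routines rather than in cryptographic reasoning as a whole.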

Notable Tasks

sec-jwt-validator · 27.7 · Auth & Session

The model rejected all examples as invalid token format and returned null header/payload fields, which points to brittle JWT parsing rather than nuanced validation of alg/claims.
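To illustrate the distinction between structural parsing and claim validation, here is a minimal sketch of a tolerant structural parse in Python. It only splits and decodes the token; it does not verify the signature, `alg`, or any claims. The function names and the result shape (`well_formed`, `header`, `payload`, `error`) are hypothetical and are not taken from the benchmark task.

```python
import base64
import json
from typing import Any, Dict


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the '=' padding that JWTs omit."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def parse_jwt_structure(token: Any) -> Dict[str, Any]:
    """Split a compact JWT and decode header and payload WITHOUT verifying it.

    Hypothetical helper: the result keys are illustrative assumptions.
    """
    result: Dict[str, Any] = {
        "well_formed": False, "header": None, "payload": None, "error": None,
    }
    if not isinstance(token, str):
        result["error"] = "token is not a string"
        return result

    parts = token.split(".")
    if len(parts) != 3:
        result["error"] = "expected three dot-separated segments"
        return result

    try:
        header = json.loads(b64url_decode(parts[0]))
        payload = json.loads(b64url_decode(parts[1]))
    except ValueError:
        # Covers bad base64 padding, invalid UTF-8, and invalid JSON.
        result["error"] = "segment is not base64url-encoded JSON"
        return result

    if not isinstance(header, dict) or not isinstance(payload, dict):
        result["error"] = "header and payload must be JSON objects"
        return result

    result.update(well_formed=True, header=header, payload=payload)
    return result
```

A common cause of "every token is invalid" in parsers like this is forgetting to restore the stripped padding before decoding, which is why that step is isolated in `b64url_decode`. Whether that is what happened in this task is not known from the report.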

sec-oauth-state-validator · 9.5 · Auth & Session

Returning null with a TypeError on length access suggests the implementation failed before producing any structured result, likely due to missing initialization or unsafe handling of absent state values.
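The failure mode described here, touching the length of a value that may be absent, can be avoided by checking for missing inputs before any comparison. The following Python sketch shows one way to do that; the function name, the two-argument signature, and the `valid`/`reason` result shape are assumptions for illustration, not the benchmark's expected interface.

```python
import hmac
from typing import Any, Dict


def validate_oauth_state(expected: Any, received: Any) -> Dict[str, Any]:
    """Compare the stored OAuth state with the one returned on the callback.

    Hypothetical helper: always returns a structured result, even when
    either value is missing, instead of failing on an absent value.
    """
    if not isinstance(expected, str) or not expected:
        return {"valid": False, "reason": "no stored state"}
    if not isinstance(received, str) or not received:
        return {"valid": False, "reason": "missing state parameter"}

    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8")):
        return {"valid": False, "reason": "state mismatch"}

    return {"valid": True, "reason": None}
```

The guard clauses run before any length or content access, so an absent state produces a rejection with a reason rather than an exception.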

sec-ssrf-detector · 9.5 · Detection & Analysis

Unexpected end-of-input errors indicate the detector could not parse the test cases at all, so the failure is structural rather than a simple classification mistake.
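For contrast with a structural failure, the sketch below shows a detector that returns a verdict for every input, including malformed ones. It is written in Python using only the standard library; the function name and the verdict labels (`block`, `allow`, `needs_dns_check`) are hypothetical, and the benchmark's actual input and output format is not shown in this report.

```python
import ipaddress
from typing import Any, Dict
from urllib.parse import urlsplit


def classify_url(url: Any) -> Dict[str, str]:
    """Return a structured SSRF verdict; never raise on malformed input."""
    if not isinstance(url, str) or not url.strip():
        return {"verdict": "block", "reason": "empty or non-string input"}

    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for inputs such as an unclosed IPv6 bracket.
        return {"verdict": "block", "reason": "unparseable URL"}

    if parts.scheme not in ("http", "https"):
        return {"verdict": "block", "reason": "scheme not allowed"}
    if not host:
        return {"verdict": "block", "reason": "missing host"}
    if host == "localhost" or host.endswith(".localhost"):
        return {"verdict": "block", "reason": "loopback hostname"}

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real detector would also resolve the name and
        # re-check the resolved addresses; that step is out of scope here.
        return {"verdict": "needs_dns_check", "reason": "hostname, not an IP literal"}

    if (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified):
        return {"verdict": "block", "reason": "internal or special-purpose address"}
    return {"verdict": "allow", "reason": "public IP literal"}
```

This sketch does not handle alternative IP encodings (for example decimal or octal host forms) or redirects, both of which a production SSRF check would need to consider.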

sec-access-control-engine · 99.5 · Access Control

At 99.5, this task ties for the highest score in a near-perfect Access Control domain, suggesting the model handles policy evaluation and authorization boundaries reliably.

sec-crypto-utils · 9.5 · Crypto Utils

Null outputs, TypeErrors, and an empty string where a transformed value was expected point to broken utility logic and poor resilience to edge-case inputs.
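The report does not show which helpers sec-crypto-utils required, so the following is only a generic Python sketch of the kind of byte/string utility where these symptoms tend to appear. The helper names are hypothetical. The point is that inputs are normalized and validated up front, so a missing value yields an explicit error rather than a null result or a failed length access.

```python
from typing import Union

BytesLike = Union[str, bytes, bytearray]


def to_bytes(value: BytesLike) -> bytes:
    """Normalize text or byte input to bytes; reject anything else explicitly."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    if isinstance(value, str):
        return value.encode("utf-8")
    raise ValueError("input must be str or bytes, got " + type(value).__name__)


def hex_encode(value: BytesLike) -> str:
    """Return the lowercase hex encoding of the input."""
    return to_bytes(value).hex()


def hex_decode(text: str) -> bytes:
    """Decode a hex string, rejecting odd lengths and non-hex characters."""
    if not isinstance(text, str) or len(text) % 2 != 0:
        raise ValueError("hex input must be a string of even length")
    # bytes.fromhex raises ValueError on non-hexadecimal characters.
    return bytes.fromhex(text)
```

For example, `hex_encode("hi")` returns `"6869"`, and `hex_decode("6869")` returns `b"hi"`.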

All Task Results

Task                               Domain                 Score
sec-crypto-utils                   Crypto Utils             9.5
sec-oauth-state-validator          Auth & Session           9.5
sec-ssrf-detector                  Detection & Analysis     9.5
sec-jwt-validator                  Auth & Session          27.7
sec-password-strength              Auth & Session          49.5
sec-auth-log-anomaly-detector      Detection & Analysis    62.2
sec-insecure-config-scanner        Detection & Analysis    66.3
sec-secret-detector                Detection & Analysis    70.0
sec-input-sanitizer                Sanitization            87.3
sec-csp-nonce-validator            Detection & Analysis    87.8
sec-abac-rule-engine               Access Control          94.2
sec-rate-limit-engine              Traffic Protection      95.6
sec-sql-injection-detector         Detection & Analysis    95.6
sec-csp-parser                     Detection & Analysis    96.1
sec-permission-checker             Access Control          96.8
sec-api-key-scope-checker          Access Control          97.3
sec-access-control-engine          Access Control          99.5
sec-cookie-policy-validator        Auth & Session          99.5
sec-csrf-token-manager             Auth & Session          99.5
sec-dependency-risk-classifier     Detection & Analysis    99.5
sec-encryption-pipeline            Crypto Utils            99.5
sec-hostname-allowlist-validator   Sanitization            99.5
sec-refresh-token-rotation         Auth & Session          99.5
sec-safe-redirect-builder          Sanitization            99.5
sec-session-fixation-detector      Auth & Session          99.5
sec-tenant-isolation-checker       Access Control          99.5
sec-url-sanitizer                  Sanitization            99.5
sec-vulnerability-scanner          Detection & Analysis    99.5
sec-file-upload-validator          Sanitization           100.0
sec-html-entity-encoder            Sanitization           100.0

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
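
The domain scores and the overall score reported on this page are consistent with an unweighted mean of the per-task scores listed in the table. That averaging rule is inferred from the numbers and is not a documented BridgeBench formula. A short Python check, with the scores copied from the table:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (task, domain, score) rows copied from the "All Task Results" table.
TASK_SCORES: List[Tuple[str, str, float]] = [
    ("sec-crypto-utils", "Crypto Utils", 9.5),
    ("sec-oauth-state-validator", "Auth & Session", 9.5),
    ("sec-ssrf-detector", "Detection & Analysis", 9.5),
    ("sec-jwt-validator", "Auth & Session", 27.7),
    ("sec-password-strength", "Auth & Session", 49.5),
    ("sec-auth-log-anomaly-detector", "Detection & Analysis", 62.2),
    ("sec-insecure-config-scanner", "Detection & Analysis", 66.3),
    ("sec-secret-detector", "Detection & Analysis", 70.0),
    ("sec-input-sanitizer", "Sanitization", 87.3),
    ("sec-csp-nonce-validator", "Detection & Analysis", 87.8),
    ("sec-abac-rule-engine", "Access Control", 94.2),
    ("sec-rate-limit-engine", "Traffic Protection", 95.6),
    ("sec-sql-injection-detector", "Detection & Analysis", 95.6),
    ("sec-csp-parser", "Detection & Analysis", 96.1),
    ("sec-permission-checker", "Access Control", 96.8),
    ("sec-api-key-scope-checker", "Access Control", 97.3),
    ("sec-access-control-engine", "Access Control", 99.5),
    ("sec-cookie-policy-validator", "Auth & Session", 99.5),
    ("sec-csrf-token-manager", "Auth & Session", 99.5),
    ("sec-dependency-risk-classifier", "Detection & Analysis", 99.5),
    ("sec-encryption-pipeline", "Crypto Utils", 99.5),
    ("sec-hostname-allowlist-validator", "Sanitization", 99.5),
    ("sec-refresh-token-rotation", "Auth & Session", 99.5),
    ("sec-safe-redirect-builder", "Sanitization", 99.5),
    ("sec-session-fixation-detector", "Auth & Session", 99.5),
    ("sec-tenant-isolation-checker", "Access Control", 99.5),
    ("sec-url-sanitizer", "Sanitization", 99.5),
    ("sec-vulnerability-scanner", "Detection & Analysis", 99.5),
    ("sec-file-upload-validator", "Sanitization", 100.0),
    ("sec-html-entity-encoder", "Sanitization", 100.0),
]


def domain_means(rows: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Unweighted mean score per domain, rounded to one decimal place."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for _task, domain, score in rows:
        buckets[domain].append(score)
    return {d: round(sum(s) / len(s), 1) for d, s in buckets.items()}


def overall_mean(rows: List[Tuple[str, str, float]]) -> float:
    """Unweighted mean across all tasks, rounded to one decimal place."""
    return round(sum(score for _t, _d, score in rows) / len(rows), 1)
```

With these rows, `domain_means(TASK_SCORES)` reproduces the six reported domain scores (for example 69.2 for Auth & Session and 54.5 for Crypto Utils), and `overall_mean(TASK_SCORES)` gives 81.6.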