BridgeBench
Security
Model Analysis

GPT-5.4 Nano

openai/gpt-5.4-nano

81.9

overall score

83.9% visible
77.1% hidden

Tasks

30

Passed

9

Failed

21

Avg latency

11721ms

Total cost

$0.0752

AI Commentary

by openai/gpt-5.4-mini

GPT-5.4 Nano is strong on straightforward security transformations and policy checks, with high scores in sanitization (95.2), access control (94.8), and traffic protection (99.5). Its main weaknesses are edge-case-heavy tasks and parsers. Auth & Session drops sharply on OAuth state validation, Detection & Analysis is uneven on SSRF detection and config scanning, and Crypto Utils is brittle enough to throw type errors. These failures are a concern despite the 81.9 overall score and the 77.1% hidden edge-case pass rate.

Domain Performance

Sanitization · 6 tasks
95.2

Performance is excellent overall at 95.2, with correct handling of file uploads, HTML encoding, redirects, and URL sanitization. The main miss is hostname allowlisting, where it rejected wildcard subdomains and punycode cases that should have matched, indicating incomplete wildcard/IDN normalization logic.

Auth & Session · 7 tasks
78.3

This domain is mixed at 78.3: cookie policy, CSRF, JWT, refresh rotation, and fixation detection were solid, but OAuth state validation failed catastrophically with null returns and a TypeError, suggesting an unhandled undefined input path. Password strength scoring was also inconsistent, over-penalizing some weak passwords and misclassifying others, which points to unstable heuristic weighting.

Access Control · 5 tasks
94.8

Access control is a clear strength at 94.8, with the access control engine (99.5) and permission checker (96.8) performing reliably. Every task in the domain scored above 90, so the model appears dependable for rule-based authorization logic.

Detection & Analysis · 9 tasks
72.5

At 72.5, this is the most uneven non-crypto area. It over-reported anomalies in auth logs, underperformed badly on SSRF detection by labeling valid URLs as invalid, and the insecure config scanner used different issue labels than expected, suggesting both semantic drift and brittle output formatting.

Traffic Protection · 1 task
99.5

Traffic protection is effectively solved here at 99.5, with rate limiting handled correctly. It is the highest-scoring domain in the benchmark, though it covers only a single task.

Crypto Utils · 2 tasks
56.0

Crypto utilities are the weakest area at 56.0, driven by a hard failure in sec-crypto-utils where charCodeAt was called on a non-string input. That kind of runtime exception, plus missed outputs in the same task family, indicates poor input type handling and insufficient defensive coding.
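The failure mode described above, calling `charCodeAt` on something that is not a string, can be avoided with an explicit type check at the boundary. The following is a minimal sketch only: `toCodeUnits` and `CodeUnitResult` are hypothetical names, and the report does not show the benchmark's real function signatures.

```typescript
// Hypothetical helper, not the benchmark's API: it only illustrates checking
// the input type before calling a string-only method such as charCodeAt.
type CodeUnitResult =
  | { kind: "ok"; codes: number[] }
  | { kind: "error"; message: string };

function toCodeUnits(input: unknown): CodeUnitResult {
  // Anything that is not a primitive string is reported, not dereferenced.
  if (typeof input !== "string") {
    const actual = input === null ? "null" : typeof input;
    return { kind: "error", message: "expected string, got " + actual };
  }
  const codes: number[] = [];
  for (let i = 0; i < input.length; i++) {
    codes.push(input.charCodeAt(i)); // UTF-16 code unit, 0..65535
  }
  return { kind: "ok", codes: codes };
}
```

Returning a tagged result instead of throwing is one design choice among several; the point is that a number, `null`, or `undefined` input yields a described error in place of an unhandled runtime exception.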

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The model returned null and threw a TypeError on every case, which points to a missing guard for undefined state data rather than a simple logic error.
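A guard for missing state data might look like the sketch below. It assumes a hypothetical input shape (`StoredState` with a `value` and an `expiresAt` timestamp) and hypothetical result labels; the benchmark's actual task specification is not shown in this report.

```typescript
// Hypothetical shapes and labels; the benchmark's real format is not shown.
interface StoredState {
  value: string;
  expiresAt: number; // epoch milliseconds
}

type StateCheck = "valid" | "missing_state" | "expired" | "mismatch";

// Compare two strings without returning early on the first difference.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` maps NaN to 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

function validateOAuthState(
  returned: unknown,
  stored: StoredState | null | undefined,
  now: number
): StateCheck {
  // Check every input before reading its properties, so an undefined
  // session entry produces a labelled result rather than a TypeError.
  if (typeof returned !== "string" || returned.length === 0) return "missing_state";
  if (!stored || typeof stored.value !== "string") return "missing_state";
  if (typeof stored.expiresAt !== "number" || now >= stored.expiresAt) return "expired";
  return constantTimeEqual(returned, stored.value) ? "valid" : "mismatch";
}
```

Under these assumptions, `validateOAuthState(undefined, undefined, 0)` returns `"missing_state"` rather than throwing.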

sec-ssrf-detector · 9.5 · Detection & Analysis

It marked even valid URLs as invalid_url and never normalized hosts, so the failure is in URL parsing/normalization before any SSRF policy decision.
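The ordering this commentary implies (parse first, normalize the host, then apply policy) can be sketched as follows. Only the `invalid_url` label comes from the report; the other verdict names, the function names, and the blocked ranges are illustrative assumptions, and the sketch leans on the standard WHATWG `URL` parser for normalization. It is not a complete SSRF defence (DNS resolution and redirects are out of scope).

```typescript
// "invalid_url" appears in the report; the other labels are hypothetical.
type SsrfVerdict = "invalid_url" | "blocked_scheme" | "blocked_host" | "allowed";

// Loopback, private, link-local, and "this network" IPv4 ranges.
function isInternalIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

function classifyUrl(raw: unknown): SsrfVerdict {
  if (typeof raw !== "string") return "invalid_url";
  let url: URL;
  try {
    url = new URL(raw.trim()); // step 1: parse; only unparseable input is invalid
  } catch (e) {
    return "invalid_url";
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return "blocked_scheme";
  // Step 2: the parser has already lowercased the host and canonicalized
  // alternative IPv4 spellings; drop a trailing dot as well.
  const host = url.hostname.replace(/\.$/, "");
  // Step 3: policy decision on the normalized host.
  if (host === "localhost" || host === "[::1]" || isInternalIPv4(host)) return "blocked_host";
  return "allowed";
}
```

With this ordering, `classifyUrl("https://example.com/path")` is `"allowed"`, `classifyUrl("not a url")` is `"invalid_url"`, and `classifyUrl("http://LOCALHOST/")` is `"blocked_host"`.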

sec-insecure-config-scanner · 31.0 · Detection & Analysis

The scanner produced semantically similar but non-matching issue strings, implying the detection logic may be roughly correct but the output schema and exact taxonomy are not aligned with the benchmark.
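One way to avoid near-miss labels is to declare the issue taxonomy once as a closed set of codes and have every check emit one of them verbatim. The sketch below is illustrative only: the issue codes, config keys, and the `scanConfig` name are all hypothetical, because the report does not list the labels the benchmark expects.

```typescript
// Hypothetical taxonomy: the benchmark's expected labels are not in the report.
type IssueCode = "DEBUG_ENABLED" | "WEAK_TLS_VERSION" | "DEFAULT_CREDENTIALS";

interface Finding {
  code: IssueCode; // the compiler rejects any string outside the taxonomy
  path: string;    // config key that triggered the finding
}

type ConfigMap = { [key: string]: unknown };

function scanConfig(config: ConfigMap): Finding[] {
  const findings: Finding[] = [];
  if (config["debug"] === true) {
    findings.push({ code: "DEBUG_ENABLED", path: "debug" });
  }
  const tls = config["tlsMinVersion"];
  if (tls === "1.0" || tls === "1.1") {
    findings.push({ code: "WEAK_TLS_VERSION", path: "tlsMinVersion" });
  }
  if (config["adminPassword"] === "admin") {
    findings.push({ code: "DEFAULT_CREDENTIALS", path: "adminPassword" });
  }
  return findings;
}
```

Because `IssueCode` is a union of string literals, a check that tried to emit a free-form variant such as `"debug mode on"` would fail to compile, which keeps detection logic and output schema aligned.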

sec-hostname-allowlist-validator · 77.8 · Sanitization

It rejected wildcard subdomains and punycode hostnames that should have matched, so wildcard expansion and internationalized domain handling are incomplete.
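Wildcard and IDN handling of the kind described here could be sketched as below. The function names are hypothetical, and the matching rules are assumptions rather than the benchmark's specification: `*.example.com` is taken to match any depth of subdomain but not the apex, and both hostnames and patterns are normalized to lowercase ASCII (punycode) through the standard WHATWG `URL` parser before comparison.

```typescript
// Normalize a hostname to lowercase ASCII (punycode) via the URL parser.
// Returns null for input that is not a plain hostname.
function normalizeHost(host: string): string | null {
  // Reject characters that would let the input smuggle in a path, port,
  // credentials, or query once it is embedded in a URL.
  if (host.length === 0 || /[\/\\?#@:\s]/.test(host)) return null;
  try {
    return new URL("http://" + host).hostname.replace(/\.$/, "");
  } catch (e) {
    return null;
  }
}

// Assumed rule: "*.example.com" matches subdomains of any depth, not the apex.
function hostMatches(host: string, pattern: string): boolean {
  const h = normalizeHost(host);
  if (h === null) return false;
  if (pattern.slice(0, 2) === "*.") {
    const suffix = normalizeHost(pattern.slice(2));
    if (suffix === null) return false;
    // Require a dot boundary so "evilexample.com" cannot match.
    const tail = "." + suffix;
    return h.length > tail.length && h.slice(-tail.length) === tail;
  }
  return h === normalizeHost(pattern);
}

function isAllowed(host: string, allowlist: string[]): boolean {
  return allowlist.some(function (p) { return hostMatches(host, p); });
}
```

Under these assumptions, `hostMatches("api.example.com", "*.example.com")` is `true`, `hostMatches("evilexample.com", "*.example.com")` is `false`, and the Unicode host `bücher.example` matches the pattern `xn--bcher-kva.example`.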

sec-rate-limit-engine · 99.5 · Traffic Protection

This task was effectively perfect, indicating the model can implement deterministic policy logic cleanly when the state machine is simple and well-specified.
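For context, a simple and well-specified rate-limiting state machine could look like the fixed-window counter below. This is a generic illustration, not the benchmark task: the report does not say which algorithm `sec-rate-limit-engine` requires, and the class and parameter names are hypothetical.

```typescript
// Hypothetical fixed-window limiter; the benchmark's actual spec is not given.
interface WindowState {
  windowStart: number; // epoch milliseconds when the current window opened
  count: number;       // requests admitted in the current window
}

class FixedWindowLimiter {
  private limit: number;
  private windowMs: number;
  // Prototype-free map so keys like "__proto__" are treated as plain data.
  private state: { [key: string]: WindowState } = Object.create(null);

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injected so the behaviour is deterministic and easy to test.
  allow(key: string, now: number): boolean {
    let s = this.state[key];
    if (!s || now - s.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has elapsed.
      s = { windowStart: now, count: 0 };
      this.state[key] = s;
    }
    if (s.count >= this.limit) return false;
    s.count += 1;
    return true;
  }
}
```

The whole state is two numbers per key and one transition rule, which is the kind of deterministic logic the commentary describes.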

All Task Results

| Task | Domain | Score |
| --- | --- | --- |
| sec-oauth-state-validator | Auth & Session | 9.5 |
| sec-ssrf-detector | Detection & Analysis | 9.5 |
| sec-crypto-utils | Crypto Utils | 16.3 |
| sec-insecure-config-scanner | Detection & Analysis | 31.0 |
| sec-password-strength | Auth & Session | 49.0 |
| sec-secret-detector | Detection & Analysis | 69.5 |
| sec-auth-log-anomaly-detector | Detection & Analysis | 72.3 |
| sec-hostname-allowlist-validator | Sanitization | 77.8 |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 |
| sec-api-key-scope-checker | Access Control | 90.7 |
| sec-tenant-isolation-checker | Access Control | 92.6 |
| sec-input-sanitizer | Sanitization | 93.7 |
| sec-abac-rule-engine | Access Control | 94.2 |
| sec-dependency-risk-classifier | Detection & Analysis | 94.8 |
| sec-refresh-token-rotation | Auth & Session | 95.3 |
| sec-session-fixation-detector | Auth & Session | 95.6 |
| sec-vulnerability-scanner | Detection & Analysis | 95.6 |
| sec-encryption-pipeline | Crypto Utils | 95.7 |
| sec-csp-parser | Detection & Analysis | 96.1 |
| sec-sql-injection-detector | Detection & Analysis | 96.1 |
| sec-permission-checker | Access Control | 96.8 |
| sec-access-control-engine | Access Control | 99.5 |
| sec-cookie-policy-validator | Auth & Session | 99.5 |
| sec-csrf-token-manager | Auth & Session | 99.5 |
| sec-jwt-validator | Auth & Session | 99.5 |
| sec-rate-limit-engine | Traffic Protection | 99.5 |
| sec-safe-redirect-builder | Sanitization | 99.5 |
| sec-file-upload-validator | Sanitization | 100.0 |
| sec-html-entity-encoder | Sanitization | 100.0 |
| sec-url-sanitizer | Sanitization | 100.0 |
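The published domain scores and the overall score are consistent with unweighted means of the task scores listed above. That is an inference from the numbers, not a scoring rule documented on this page; the sketch below simply recomputes the averages, with the task scores copied from the table and grouped by domain.

```typescript
// Task scores copied from the results table, grouped by domain.
const scoresByDomain: { [domain: string]: number[] } = {
  "Sanitization": [77.8, 93.7, 99.5, 100.0, 100.0, 100.0],
  "Auth & Session": [9.5, 49.0, 95.3, 95.6, 99.5, 99.5, 99.5],
  "Access Control": [90.7, 92.6, 94.2, 96.8, 99.5],
  "Detection & Analysis": [9.5, 31.0, 69.5, 72.3, 87.8, 94.8, 95.6, 96.1, 96.1],
  "Traffic Protection": [99.5],
  "Crypto Utils": [16.3, 95.7],
};

function mean(xs: number[]): number {
  const sum = xs.reduce(function (acc, x) { return acc + x; }, 0);
  return sum / xs.length;
}

// Round to one decimal place, matching the precision used in the report.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

// Flatten all domains to get the 30 task scores for the overall average.
let allScores: number[] = [];
Object.keys(scoresByDomain).forEach(function (domain) {
  allScores = allScores.concat(scoresByDomain[domain]);
  console.log(domain + ": " + round1(mean(scoresByDomain[domain])).toFixed(1));
});
console.log("Overall: " + round1(mean(allScores)).toFixed(1));
```

Assuming unweighted means, this reproduces the domain figures (for example 56.0 for Crypto Utils, the average of 16.3 and 95.7) and the 81.9 overall score.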

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate