BridgeBench
Security
Model Analysis

GPT-5.4 Mini

openai/gpt-5.4-mini

83.2

overall score

84.4% visible
79.3% hidden

Tasks

30

Passed

12

Failed

18

Avg latency

5660ms

Total cost

$0.0762

AI Commentary

by openai/gpt-5.4-mini

GPT-5.4 Mini is strong on core security primitives, with very high scores in sanitization (97.9) and access control (96.6), and solid visible-test performance overall (84.4%). Its main weaknesses are in edge-case handling and output robustness: hidden pass rate drops to 79.3%, and several failures show brittle parsing/validation logic, especially in SSRF detection, OAuth state handling, and crypto utilities where TypeErrors and null outputs indicate incomplete defensive coding.

Domain Performance

Sanitization · 6 tasks
97.9

This is the model’s best area: file upload validation, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization all scored 100.0, and only sec-input-sanitizer fell short at 87.3. The results suggest it reliably applies standard input-hardening patterns and handles common injection vectors well.

Auth & Session · 7 tasks
77.8

Performance is mixed: the cookie policy, CSRF token, JWT, and refresh-token tasks all scored above 95, but sec-oauth-state-validator scored 9.5, returning null outputs and throwing a TypeError, which indicates missing null/undefined guards. Password-strength scoring (49.5) was also inconsistent, with scores and feedback that did not line up, suggesting weak rubric adherence on nuanced classification tasks.

Access Control · 5 tasks
96.6

Access control is a clear strength, with high scores across engine, permission, and tenant-isolation checks. The model appears reliable at enforcing authorization boundaries and multi-tenant separation without obvious logic gaps.

Detection & Analysis · 9 tasks
76.6

This domain is uneven: CSP parsing, insecure-config scanning, and vulnerability scanning were strong, but anomaly detection and secret detection missed important cases, which suggests incomplete pattern coverage and inconsistent prioritization of indicators. The SSRF detector repeatedly returned 'invalid_url' for distinct cases, pointing to overly strict or malformed URL parsing.

Traffic Protection · 1 task
87.8

The single task in this area scored well, so there is no evidence of a weakness here. However, the sample size is too small to infer robustness beyond the tested rate-limiting behavior.

Crypto Utils · 2 tasks
52.6

This is the weakest domain overall, but the 52.6 average reflects a split rather than uniform weakness: sec-encryption-pipeline scored 95.7, while sec-crypto-utils scored 9.5 after returning null and throwing a TypeError. The error pattern suggests the implementation is not resilient to malformed inputs or missing fields, which is especially problematic for security-sensitive utility code.

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The validator returned null and threw a TypeError on multiple cases, which points to missing defensive checks around undefined state fields rather than a simple logic mistake.
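The kind of defensive check described above can be sketched as follows. This is a minimal illustration, not the benchmark's reference solution: the report does not show the task's interface, so the function name `validate_oauth_state`, the `oauth_state` and `state` keys, and the result shape are all assumptions.

```python
import hmac
from collections.abc import Mapping


def validate_oauth_state(session, params):
    """Compare the stored OAuth state with the one returned on the callback.

    Hypothetical interface: names and result shape are assumed for illustration.
    Every failure path returns a structured result instead of raising.
    """
    # Guard against a missing session or missing callback parameters.
    if not isinstance(session, Mapping) or not isinstance(params, Mapping):
        return {"valid": False, "reason": "missing_input"}

    expected = session.get("oauth_state")
    received = params.get("state")

    # Reject absent, non-string, or empty values before any comparison.
    if not isinstance(expected, str) or not expected:
        return {"valid": False, "reason": "missing_expected_state"}
    if not isinstance(received, str) or not received:
        return {"valid": False, "reason": "missing_received_state"}

    # Constant-time comparison on bytes; "surrogatepass" keeps encoding
    # from raising on lone surrogates in attacker-supplied input.
    a = expected.encode("utf-8", "surrogatepass")
    b = received.encode("utf-8", "surrogatepass")
    if not hmac.compare_digest(a, b):
        return {"valid": False, "reason": "state_mismatch"}
    return {"valid": True, "reason": None}
```

Checking types before comparing means that a missing or malformed field becomes an explicit rejection reason rather than an exception or a null result.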

sec-ssrf-detector · 9.5 · Detection & Analysis

All cases collapsed to 'invalid_url' with null normalizedHost, suggesting the URL parser or normalization path is broken and cannot distinguish protocol, private-IP, and valid-host cases.
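To make the three categories concrete, here is a rough sketch of a classifier that separates malformed URLs, disallowed protocols, private addresses, and ordinary hosts. Only 'invalid_url' and `normalizedHost` come from the report; the function name `classify_url`, the other verdict strings, and the allowed-scheme set are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed policy for illustration: only plain web schemes are allowed.
ALLOWED_SCHEMES = {"http", "https"}


def classify_url(raw):
    """Return a verdict plus the normalized host (hypothetical output shape)."""
    invalid = {"verdict": "invalid_url", "normalizedHost": None}
    if not isinstance(raw, str) or not raw.strip():
        return invalid
    try:
        parts = urlsplit(raw.strip())
        host = parts.hostname  # lowercased, with port and brackets removed
    except ValueError:
        return invalid
    if not parts.scheme:
        return invalid
    # Check the protocol before the host so file: or ftp: URLs are not
    # misreported as malformed.
    if parts.scheme not in ALLOWED_SCHEMES:
        return {"verdict": "blocked_protocol", "normalizedHost": host}
    if not host:
        return invalid
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, so treat it as a hostname
    is_internal = ip is not None and (
        ip.is_private or ip.is_loopback or ip.is_link_local
    )
    if host == "localhost" or is_internal:
        return {"verdict": "private_address", "normalizedHost": host}
    return {"verdict": "allowed", "normalizedHost": host}
```

This sketch does not resolve DNS or decode alternative IP notations (for example a decimal-encoded address), so it is not a complete SSRF defense. It only shows how distinct inputs can map to distinct verdicts instead of a single failure value.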

sec-secret-detector · 66.1 · Detection & Analysis

It missed obvious AWS secrets and inconsistently detected other tokens, indicating incomplete regex coverage and poor handling of multi-secret extraction across lines.
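As an illustration of multi-secret extraction across lines, the sketch below scans each line with several patterns and collects every match instead of stopping at the first. The pattern set is deliberately small and illustrative, and the function name `find_secrets` and the finding shape are assumptions; the benchmark's expected patterns are not given in the report.

```python
import re

# Illustrative patterns only; a real detector would need broader coverage.
SECRET_PATTERNS = {
    # AWS access key IDs: a 4-letter prefix followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # GitHub personal access tokens in the "ghp_" format.
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # PEM private key headers.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text):
    """Return every match on every line, tagged with line number and type."""
    findings = []
    if not isinstance(text, str):
        return findings
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in SECRET_PATTERNS.items():
            # finditer reports all matches, so several secrets on the
            # same line are each recorded.
            for match in pattern.finditer(line):
                findings.append(
                    {"line": line_no, "type": kind, "match": match.group(0)}
                )
    return findings
```

Iterating with `finditer` per line and per pattern is what allows several secrets of different types in one input to be reported together.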

sec-access-control-engine · 99.5 · Access Control

This was one of the strongest tasks in the benchmark, consistent with the domain’s high average and indicating reliable authorization decision logic.

sec-encryption-pipeline · 95.7 · Crypto Utils

The strong result here suggests the model can correctly assemble secure crypto workflows when the task is well-scoped and the expected behavior is explicit.

All Task Results

Task | Domain | Score
sec-crypto-utils | Crypto Utils | 9.5
sec-oauth-state-validator | Auth & Session | 9.5
sec-ssrf-detector | Detection & Analysis | 9.5
sec-password-strength | Auth & Session | 49.5
sec-auth-log-anomaly-detector | Detection & Analysis | 57.9
sec-secret-detector | Detection & Analysis | 66.1
sec-dependency-risk-classifier | Detection & Analysis | 80.4
sec-input-sanitizer | Sanitization | 87.3
sec-csp-nonce-validator | Detection & Analysis | 87.8
sec-rate-limit-engine | Traffic Protection | 87.8
sec-session-fixation-detector | Auth & Session | 91.7
sec-sql-injection-detector | Detection & Analysis | 92.2
sec-api-key-scope-checker | Access Control | 92.9
sec-abac-rule-engine | Access Control | 94.2
sec-refresh-token-rotation | Auth & Session | 95.3
sec-encryption-pipeline | Crypto Utils | 95.7
sec-csp-parser | Detection & Analysis | 96.1
sec-permission-checker | Access Control | 96.8
sec-access-control-engine | Access Control | 99.5
sec-cookie-policy-validator | Auth & Session | 99.5
sec-csrf-token-manager | Auth & Session | 99.5
sec-insecure-config-scanner | Detection & Analysis | 99.5
sec-jwt-validator | Auth & Session | 99.5
sec-tenant-isolation-checker | Access Control | 99.5
sec-vulnerability-scanner | Detection & Analysis | 99.5
sec-file-upload-validator | Sanitization | 100.0
sec-hostname-allowlist-validator | Sanitization | 100.0
sec-html-entity-encoder | Sanitization | 100.0
sec-safe-redirect-builder | Sanitization | 100.0
sec-url-sanitizer | Sanitization | 100.0

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge-case pass rate
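The report does not state how the domain and overall scores are derived. The published figures are consistent with an unweighted mean of the task scores in the table above, which the following sketch recomputes. The aggregation method is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# (task, domain, score) rows copied from the results table.
RESULTS = [
    ("sec-crypto-utils", "Crypto Utils", 9.5),
    ("sec-oauth-state-validator", "Auth & Session", 9.5),
    ("sec-ssrf-detector", "Detection & Analysis", 9.5),
    ("sec-password-strength", "Auth & Session", 49.5),
    ("sec-auth-log-anomaly-detector", "Detection & Analysis", 57.9),
    ("sec-secret-detector", "Detection & Analysis", 66.1),
    ("sec-dependency-risk-classifier", "Detection & Analysis", 80.4),
    ("sec-input-sanitizer", "Sanitization", 87.3),
    ("sec-csp-nonce-validator", "Detection & Analysis", 87.8),
    ("sec-rate-limit-engine", "Traffic Protection", 87.8),
    ("sec-session-fixation-detector", "Auth & Session", 91.7),
    ("sec-sql-injection-detector", "Detection & Analysis", 92.2),
    ("sec-api-key-scope-checker", "Access Control", 92.9),
    ("sec-abac-rule-engine", "Access Control", 94.2),
    ("sec-refresh-token-rotation", "Auth & Session", 95.3),
    ("sec-encryption-pipeline", "Crypto Utils", 95.7),
    ("sec-csp-parser", "Detection & Analysis", 96.1),
    ("sec-permission-checker", "Access Control", 96.8),
    ("sec-access-control-engine", "Access Control", 99.5),
    ("sec-cookie-policy-validator", "Auth & Session", 99.5),
    ("sec-csrf-token-manager", "Auth & Session", 99.5),
    ("sec-insecure-config-scanner", "Detection & Analysis", 99.5),
    ("sec-jwt-validator", "Auth & Session", 99.5),
    ("sec-tenant-isolation-checker", "Access Control", 99.5),
    ("sec-vulnerability-scanner", "Detection & Analysis", 99.5),
    ("sec-file-upload-validator", "Sanitization", 100.0),
    ("sec-hostname-allowlist-validator", "Sanitization", 100.0),
    ("sec-html-entity-encoder", "Sanitization", 100.0),
    ("sec-safe-redirect-builder", "Sanitization", 100.0),
    ("sec-url-sanitizer", "Sanitization", 100.0),
]


def domain_averages(rows):
    """Unweighted mean score per domain, rounded to one decimal place."""
    by_domain = defaultdict(list)
    for _task, domain, score in rows:
        by_domain[domain].append(score)
    return {d: round(sum(s) / len(s), 1) for d, s in by_domain.items()}


def overall_average(rows):
    """Unweighted mean over all tasks, rounded to one decimal place."""
    return round(sum(score for _t, _d, score in rows) / len(rows), 1)


if __name__ == "__main__":
    for domain, avg in sorted(domain_averages(RESULTS).items()):
        print(f"{domain}: {avg}")
    print(f"Overall: {overall_average(RESULTS)}")
```

Under this assumption the recomputed values match the page: 97.9 for Sanitization, 77.8 for Auth & Session, 96.6 for Access Control, 76.6 for Detection & Analysis, 87.8 for Traffic Protection, 52.6 for Crypto Utils, and 83.2 overall.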