BridgeBench
Security
Model Analysis

GPT-5.4

openai/gpt-5.4

84.8

overall score

85.6% visible
81.6% hidden

Tasks

30

Passed

15

Failed

15

Avg latency

15931ms

Total cost

$0.4733

AI Commentary

by openai/gpt-5.4-mini

GPT-5.4 is strong on security-oriented validation and policy logic, with excellent results in sanitization (97.8) and access control (98.4) and mostly strong results in auth and session tasks. Its main weaknesses are edge-case handling and output robustness: OAuth state validation crashed with TypeErrors on missing values, SSRF detection rejected allowed URLs as invalid_url, and crypto utility handling is brittle on type assumptions. These failures drag hidden-task reliability (81.6%) below the visible pass rate (85.6%).

Domain Performance

Sanitization · 6 tasks
97.8

Very strong at 97.8 across 6 tasks. File upload, hostname allowlisting, HTML encoding, and URL sanitization all scored 100.0, and redirect building scored 99.5. The lowest result was input sanitization at 87.3, which is still a solid score.

Auth & Session · 7 tasks
79.4

Good but uneven at 79.4: cookie policy, CSRF, JWT, refresh rotation, and fixation detection all scored 99.5, but OAuth state validation (9.5) failed with TypeErrors when reading the length of undefined values, and password strength scoring (49.0) drifted on nuanced pattern detection. The failures suggest brittle parsing and inconsistent rubric alignment on password feedback.

Access Control · 5 tasks
98.4

Excellent at 98.4 with all five tasks strong, including ABAC, API key scope, permission checks, and tenant isolation. This is the most reliable area of the model, with no visible edge-case regressions.

Detection & Analysis · 9 tasks
78.7

Moderate at 78.7, with strong static-analysis style tasks but weaker dynamic detection tasks. Auth log anomaly detection missed or over-added indicators, secret detection had both false negatives and partial extraction issues, and SSRF detection failed hard by classifying valid URLs as invalid.

Traffic Protection · 1 task
87.8

The single task scored 87.8, but one task is too small a sample to treat as a stable signal. The provided result shows no obvious weakness.

Crypto Utils · 2 tasks
56.0

Weakest domain at 56.0, driven by a brittle crypto utility implementation (16.3) that appears to assume string inputs and crashes on non-string values. The encryption pipeline was strong (95.7), but the utility task failed with TypeErrors and empty outputs, indicating poor defensive handling.
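The kind of defensive handling that was missing can be sketched as follows. This is a minimal illustration, not the benchmark's task or the model's output: the helper names `to_bytes` and `sha256_hex`, the result shape, and the `unsupported_input_type` error code are all assumptions, and Python is used only because the report does not state the task language.

```python
import hashlib
from typing import Any, Optional


def to_bytes(value: Any) -> Optional[bytes]:
    """Normalize supported inputs to bytes; return None for unsupported types."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, (bytearray, memoryview)):
        return bytes(value)
    if isinstance(value, str):
        return value.encode("utf-8")
    return None  # numbers, None, dicts, etc. are rejected rather than coerced


def sha256_hex(value: Any) -> dict:
    """Hash a value, reporting unsupported input as data instead of raising."""
    data = to_bytes(value)
    if data is None:
        # Hypothetical result shape: a structured failure, never a TypeError.
        return {"ok": False, "error": "unsupported_input_type", "digest": None}
    return {"ok": True, "error": None, "digest": hashlib.sha256(data).hexdigest()}
```

Checking the input type once at the boundary keeps every later step free of type assumptions, so a non-string value produces a reportable result rather than a crash or an empty output.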

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The validator crashed with TypeError on multiple cases instead of returning structured validation results, indicating missing null/undefined guards and broken error-path handling.
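A validator that guards its inputs before touching them might look like the sketch below. It is an assumption-laden illustration rather than the benchmark's expected solution: `StateResult`, `validate_state`, and the reason codes are hypothetical names, and Python is used because the report does not state the task language.

```python
import hmac
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StateResult:
    valid: bool
    reason: Optional[str] = None  # hypothetical reason codes, not the benchmark's


def validate_state(received: Any, expected: Any) -> StateResult:
    """Compare the callback's state to the stored one without ever raising."""
    # Guard first: missing or non-string values become structured failures
    # instead of a TypeError when a length or comparison is attempted.
    if not isinstance(expected, str) or not expected:
        return StateResult(False, "missing_expected_state")
    if not isinstance(received, str) or not received:
        return StateResult(False, "missing_received_state")
    # Constant-time comparison on bytes avoids leaking how much matched.
    if not hmac.compare_digest(received.encode(), expected.encode()):
        return StateResult(False, "state_mismatch")
    return StateResult(True)
```

For example, `validate_state(None, "abc")` returns a result with `valid=False` and `reason="missing_received_state"` rather than raising.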

sec-ssrf-detector · 9.5 · Detection & Analysis

It mislabeled even allowed URLs as invalid_url, which points to a flawed URL parser/normalizer rather than a policy mistake.
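The distinction between a parse failure and a policy decision can be sketched as below. This is only an illustration under stated assumptions: the `classify_url` function, the `allowed`/`blocked` labels, and the `ALLOWED_HOSTS` allowlist are hypothetical, only `invalid_url` comes from the report, and the sketch does not resolve DNS, which a real SSRF check would also need to consider.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com"}  # hypothetical allowlist for illustration


def classify_url(raw):
    """Return 'invalid_url', 'blocked', or 'allowed' for a candidate URL."""
    if not isinstance(raw, str):
        return "invalid_url"
    try:
        parts = urlsplit(raw.strip())
        host = parts.hostname  # lowercased by urlsplit, None if absent
    except ValueError:
        return "invalid_url"  # e.g. a malformed bracketed IPv6 host
    # Only a URL that cannot be parsed into scheme + host is "invalid".
    if not parts.scheme or not host:
        return "invalid_url"
    # Everything below is a policy decision on a well-formed URL.
    if parts.scheme not in ("http", "https"):
        return "blocked"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # host is a name, not an IP literal
    if ip is not None and (
        ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
    ):
        return "blocked"
    return "allowed" if host in ALLOWED_HOSTS else "blocked"
```

Keeping the parse step separate from the allowlist step means a well-formed, allowed URL can only ever be rejected by policy, never mislabeled as unparseable.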

sec-auth-log-anomaly-detector · 57.9 · Detection & Analysis

The detector under-reported anomalies in some cases and over-reported in others, suggesting inconsistent rule aggregation and threshold logic.

sec-password-strength · 49.0 · Auth & Session

The scoring and feedback drifted from expected outputs, likely because the model overfit to surface patterns and produced inconsistent strength labels and missing feedback items.

sec-access-control-engine · 99.5 · Access Control

This task was part of a near-perfect access-control suite (98.4 across five tasks, none below 96.8), indicating robust handling of authorization rules and tenant boundaries.

All Task Results

Task                              Domain                Score
sec-oauth-state-validator         Auth & Session          9.5
sec-ssrf-detector                 Detection & Analysis    9.5
sec-crypto-utils                  Crypto Utils           16.3
sec-password-strength             Auth & Session         49.0
sec-auth-log-anomaly-detector     Detection & Analysis   57.9
sec-secret-detector               Detection & Analysis   70.0
sec-input-sanitizer               Sanitization           87.3
sec-csp-nonce-validator           Detection & Analysis   87.8
sec-rate-limit-engine             Traffic Protection     87.8
sec-sql-injection-detector        Detection & Analysis   92.2
sec-vulnerability-scanner         Detection & Analysis   95.6
sec-encryption-pipeline           Crypto Utils           95.7
sec-csp-parser                    Detection & Analysis   96.1
sec-abac-rule-engine              Access Control         96.8
sec-permission-checker            Access Control         96.8
sec-access-control-engine         Access Control         99.5
sec-api-key-scope-checker         Access Control         99.5
sec-cookie-policy-validator       Auth & Session         99.5
sec-csrf-token-manager            Auth & Session         99.5
sec-dependency-risk-classifier    Detection & Analysis   99.5
sec-insecure-config-scanner       Detection & Analysis   99.5
sec-jwt-validator                 Auth & Session         99.5
sec-refresh-token-rotation        Auth & Session         99.5
sec-safe-redirect-builder         Sanitization           99.5
sec-session-fixation-detector     Auth & Session         99.5
sec-tenant-isolation-checker      Access Control         99.5
sec-file-upload-validator         Sanitization          100.0
sec-hostname-allowlist-validator  Sanitization          100.0
sec-html-entity-encoder           Sanitization          100.0
sec-url-sanitizer                 Sanitization          100.0

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
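The report does not state how domain and overall scores are aggregated. A quick check, assuming an unweighted mean of task scores rounded to one decimal place, reproduces the published figures from the task table. The names `SCORES`, `domain_means`, and `overall_mean` are illustrative only.

```python
# Task scores grouped by domain, copied from the task results table.
SCORES = {
    "Sanitization": [87.3, 99.5, 100.0, 100.0, 100.0, 100.0],
    "Auth & Session": [9.5, 49.0, 99.5, 99.5, 99.5, 99.5, 99.5],
    "Access Control": [96.8, 96.8, 99.5, 99.5, 99.5],
    "Detection & Analysis": [9.5, 57.9, 70.0, 87.8, 92.2, 95.6, 96.1, 99.5, 99.5],
    "Traffic Protection": [87.8],
    "Crypto Utils": [16.3, 95.7],
}


def domain_means(scores):
    """Unweighted mean per domain, rounded to one decimal (an assumption)."""
    return {d: round(sum(v) / len(v), 1) for d, v in scores.items()}


def overall_mean(scores):
    """Unweighted mean over all tasks, rounded to one decimal (an assumption)."""
    flat = [s for v in scores.values() for s in v]
    return round(sum(flat) / len(flat), 1)


if __name__ == "__main__":
    for domain, mean in domain_means(SCORES).items():
        print(f"{domain}: {mean}")
    print(f"Overall: {overall_mean(SCORES)}")
```

Under that assumption the six domain means come out to 97.8, 79.4, 98.4, 78.7, 87.8, and 56.0, and the overall mean to 84.8, matching the figures reported above. This also shows how the two 9.5 scores account for most of the gap in Auth & Session and Detection & Analysis.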