Model Analysis

GPT-5.4

openai/gpt-5.4

84.8

overall score

85.6% visible

81.6% hidden

Tasks

Passed

Failed

Avg latency

15931ms

Total cost

$0.4733

AI Commentary

by openai/gpt-5.4-mini

GPT-5.4 is strong on security-oriented validation and policy logic, with excellent results in sanitization (97.8) and access control (98.4), and strong results on most session/auth tasks. Its main weaknesses are edge-case handling and output robustness. OAuth state validation and SSRF detection each scored 9.5 and passed no visible or hidden cases: the former crashed with TypeErrors, the latter rejected allowed URLs as invalid_url. Crypto utility handling is also brittle about input types. Together these failures pull the hidden pass rate (81.6%) below the visible pass rate (85.6%).

Domain Performance

Sanitization · 6 tasks

97.8

Very strong domain performance at 97.8 across 6 tasks. The model handled file upload, hostname allowlisting, HTML encoding, redirect building, and URL sanitization consistently. The only soft spot is sec-input-sanitizer, which scored 87.3 with a 73 hidden pass rate.

Auth & Session · 7 tasks

79.4

Good but uneven at 79.4: cookie policy, CSRF, JWT, refresh rotation, and fixation detection were strong, but OAuth state validation failed with undefined-length TypeErrors and password strength scoring drifted on nuanced pattern detection. The failures suggest brittle parsing and inconsistent rubric alignment on password feedback.

Access Control · 5 tasks

98.4

Excellent at 98.4 with all five tasks strong, including ABAC, API key scope, permission checks, and tenant isolation. This is the most reliable area of the model, with no visible edge-case regressions.

Detection & Analysis · 9 tasks

78.7

Moderate at 78.7, with strong static-analysis style tasks but weaker dynamic detection tasks. Auth log anomaly detection missed or over-added indicators, secret detection had both false negatives and partial extraction issues, and SSRF detection failed hard by classifying valid URLs as invalid.

Traffic Protection · 1 task

87.8

The single task in this domain (sec-rate-limit-engine) scored 87.8, passing all visible cases and 75 of the hidden ones. One task is too small a sample to treat as a stable signal.

Crypto Utils · 2 tasks

56.0

Weakest domain at 56.0, driven by a brittle crypto utility implementation that appears to assume string inputs and crashes on non-string values. The encryption pipeline was strong, but the utility task failed with TypeErrors and empty outputs, indicating poor defensive handling.
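The kind of defensive handling the commentary finds missing can be sketched as follows. This is a minimal illustration, not the benchmark's reference solution: the task's real interface is not shown in this report, so `toBytes`, `toHex`, and the error string are hypothetical names chosen for the example. The idea is to normalize the input type first and report unsupported values in the return value instead of letting a TypeError escape.

```typescript
// Hypothetical helpers: the actual sec-crypto-utils interface is not shown in this report.
interface BytesResult {
  bytes: Uint8Array | null; // null when the input type is unsupported
  error: string | null;     // null on success
}

// Normalize the accepted input types to bytes before any crypto-style processing.
function toBytes(input: unknown): BytesResult {
  if (typeof input === "string") {
    return { bytes: new TextEncoder().encode(input), error: null };
  }
  if (input instanceof Uint8Array) {
    return { bytes: input, error: null };
  }
  if (input instanceof ArrayBuffer) {
    return { bytes: new Uint8Array(input), error: null };
  }
  // null, undefined, numbers, plain objects: describe the problem instead of throwing.
  const kind = input === null ? "null" : typeof input;
  return { bytes: null, error: "unsupported input type: " + kind };
}

// Hex-encode any supported input; returns null for unsupported input.
function toHex(input: unknown): string | null {
  const result = toBytes(input);
  if (result.bytes === null) {
    return null;
  }
  let hex = "";
  for (let i = 0; i < result.bytes.length; i++) {
    const h = result.bytes[i].toString(16);
    hex += h.length === 1 ? "0" + h : h;
  }
  return hex;
}
```

With this shape, `toHex("A")` yields `"41"`, while `toHex(42)` and `toHex(null)` yield `null` rather than crashing.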

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The validator crashed with TypeError on multiple cases instead of returning structured validation results, indicating missing null/undefined guards and broken error-path handling.
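A guard-first validator along the lines this note implies might look like the sketch below. The function name, the two-argument signature, and the reason codes are all assumptions made for illustration; the task's expected result shape is not given in this report. The point is only that every `.length` access happens after a type check, and that each failure path returns a structured result.

```typescript
// Hypothetical result shape and reason codes, not taken from the benchmark.
interface StateValidation {
  valid: boolean;
  reason: string | null;
}

function validateOAuthState(received: unknown, expected: unknown): StateValidation {
  // Check types before touching .length, so null/undefined never throw.
  if (typeof expected !== "string" || expected.length === 0) {
    return { valid: false, reason: "missing_expected_state" };
  }
  if (typeof received !== "string" || received.length === 0) {
    return { valid: false, reason: "missing_state" };
  }
  if (received.length !== expected.length) {
    return { valid: false, reason: "state_mismatch" };
  }
  // Compare every character so timing does not depend on where the first difference is.
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= received.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  if (diff !== 0) {
    return { valid: false, reason: "state_mismatch" };
  }
  return { valid: true, reason: null };
}
```

Calling `validateOAuthState(undefined, "abc")` returns `{ valid: false, reason: "missing_state" }` instead of raising a TypeError.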

sec-ssrf-detector · 9.5 · Detection & Analysis

It mislabeled even allowed URLs as invalid_url, which points to a flawed URL parser/normalizer rather than a policy mistake.
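To illustrate the distinction drawn here between a parsing failure and a policy decision, the sketch below keeps the two steps separate: `invalid_url` is returned only when the standard `URL` constructor rejects the input, and everything after that is policy. The function names, the allowlist parameter, and the `allowed`/`blocked` labels are hypothetical, and the check is deliberately incomplete (no DNS resolution, IPv6, or redirect handling), so it should not be read as a full SSRF defense.

```typescript
// "invalid_url" appears in the report; "allowed" and "blocked" are assumed labels.
type SsrfVerdict = "allowed" | "blocked" | "invalid_url";

// Recognize a few private, loopback, and link-local IPv4 ranges in dotted form.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (m === null) {
    return false;
  }
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

function classifyUrl(raw: unknown, allowedHosts: string[]): SsrfVerdict {
  // Step 1: parsing. Only a genuine parse failure produces "invalid_url".
  if (typeof raw !== "string") {
    return "invalid_url";
  }
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return "invalid_url";
  }
  // Step 2: policy. A well-formed URL is either allowed or blocked, never invalid.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return "blocked";
  }
  if (isPrivateIPv4(url.hostname)) {
    return "blocked";
  }
  return allowedHosts.indexOf(url.hostname) !== -1 ? "allowed" : "blocked";
}
```

Under these assumptions, `classifyUrl("https://api.example.com/v1", ["api.example.com"])` is `"allowed"`, `classifyUrl("http://127.0.0.1/admin", ["api.example.com"])` is `"blocked"`, and only unparseable input such as `"not a url"` is `"invalid_url"`.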

sec-auth-log-anomaly-detector · 57.9 · Detection & Analysis

The detector under-reported anomalies in some cases and over-reported in others, suggesting inconsistent rule aggregation and threshold logic.

sec-password-strength · 49.0 · Auth & Session

The scoring and feedback drifted from expected outputs, likely because the model overfit to surface patterns and produced inconsistent strength labels and missing feedback items.

sec-access-control-engine · 99.5 · Access Control

This was part of a near-perfect access-control suite (98.4 domain average, with every task at 96.8 or above), indicating robust handling of authorization rules and tenant boundaries.

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-oauth-state-validator	Auth & Session	9.5	0	0	15306ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	20805ms
sec-crypto-utils	Crypto Utils	16.3	0	14	16215ms
sec-password-strength	Auth & Session	49.0	67	23	14540ms
sec-auth-log-anomaly-detector	Detection & Analysis	57.9	33	73	18492ms
sec-secret-detector	Detection & Analysis	70.0	67	67	9617ms
sec-input-sanitizer	Sanitization	87.3	100	73	8155ms
sec-csp-nonce-validator	Detection & Analysis	87.8	100	75	10975ms
sec-rate-limit-engine	Traffic Protection	87.8	100	75	25731ms
sec-sql-injection-detector	Detection & Analysis	92.2	100	83	11374ms
sec-vulnerability-scanner	Detection & Analysis	95.6	100	92	19267ms
sec-encryption-pipeline	Crypto Utils	95.7	100	92	21048ms
sec-csp-parser	Detection & Analysis	96.1	100	92	10840ms
sec-abac-rule-engine	Access Control	96.8	100	94	26915ms
sec-permission-checker	Access Control	96.8	100	94	13280ms
sec-access-control-engine	Access Control	99.5	100	100	19320ms
sec-api-key-scope-checker	Access Control	99.5	100	100	20769ms
sec-cookie-policy-validator	Auth & Session	99.5	100	100	13702ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	12003ms
sec-dependency-risk-classifier	Detection & Analysis	99.5	100	100	18731ms
sec-insecure-config-scanner	Detection & Analysis	99.5	100	100	24321ms
sec-jwt-validator	Auth & Session	99.5	100	100	16416ms
sec-refresh-token-rotation	Auth & Session	99.5	100	100	16055ms
sec-safe-redirect-builder	Sanitization	99.5	100	100	12543ms
sec-session-fixation-detector	Auth & Session	99.5	100	100	21782ms
sec-tenant-isolation-checker	Access Control	99.5	100	100	16079ms
sec-file-upload-validator	Sanitization	100.0	100	100	8454ms
sec-hostname-allowlist-validator	Sanitization	100.0	100	100	10464ms
sec-html-entity-encoder	Sanitization	100.0	100	100	10901ms
sec-url-sanitizer	Sanitization	100.0	100	100	13826ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
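The report does not state how domain scores are aggregated, but the figures are consistent with an unweighted mean of the per-task scores rounded to one decimal. The sketch below recomputes two domains from the table under that assumption; `TaskRow` and `domainMean` are names introduced for the example.

```typescript
interface TaskRow {
  task: string;
  domain: string;
  score: number;
}

// Scores copied from the task table for two of the six domains.
const rows: TaskRow[] = [
  { task: "sec-crypto-utils", domain: "Crypto Utils", score: 16.3 },
  { task: "sec-encryption-pipeline", domain: "Crypto Utils", score: 95.7 },
  { task: "sec-oauth-state-validator", domain: "Auth & Session", score: 9.5 },
  { task: "sec-password-strength", domain: "Auth & Session", score: 49.0 },
  { task: "sec-cookie-policy-validator", domain: "Auth & Session", score: 99.5 },
  { task: "sec-csrf-token-manager", domain: "Auth & Session", score: 99.5 },
  { task: "sec-jwt-validator", domain: "Auth & Session", score: 99.5 },
  { task: "sec-refresh-token-rotation", domain: "Auth & Session", score: 99.5 },
  { task: "sec-session-fixation-detector", domain: "Auth & Session", score: 99.5 },
];

// Assumed aggregation: unweighted mean of task scores, rounded to one decimal.
function domainMean(all: TaskRow[], domain: string): number | null {
  const scores = all.filter((r) => r.domain === domain).map((r) => r.score);
  if (scores.length === 0) {
    return null; // no tasks recorded for this domain
  }
  const total = scores.reduce((sum, s) => sum + s, 0);
  return Math.round((total / scores.length) * 10) / 10;
}

console.log(domainMean(rows, "Crypto Utils"));   // 56
console.log(domainMean(rows, "Auth & Session")); // 79.4
```

Both values match the domain scores reported above (56.0 and 79.4), which also shows how a single near-zero task dominates a small domain: in Crypto Utils one task at 16.3 halves the average despite the other scoring 95.7.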