GPT-5.4
openai/gpt-5.4
84.8
overall score
Tasks
30
Passed
15
Failed
15
Avg latency
15931ms
Total cost
$0.4733
AI Commentary
by openai/gpt-5.4-mini
GPT-5.4 is strong on security-oriented validation and policy logic, with excellent results in sanitization (97.8) and access control (98.4), and solid performance in session/auth tasks overall. Its main weaknesses are in edge-case handling and output robustness: OAuth state validation and SSRF detection both failed catastrophically with null/TypeError behavior, and crypto utility handling is brittle on type assumptions, which drags down hidden-task reliability despite a high visible pass rate.
Domain Performance
Sanitization
Very strong domain performance at 97.8 across 6 tasks, with no notable weaknesses. The model handled file upload, hostname allowlisting, HTML encoding, redirect building, and URL sanitization consistently.
Auth & Session
Good but uneven at 79.4: cookie policy, CSRF, JWT, refresh rotation, and fixation detection were strong, but OAuth state validation failed with undefined-length TypeErrors and password strength scoring drifted on nuanced pattern detection. The failures suggest brittle parsing and inconsistent rubric alignment on password feedback.
Access Control
Excellent at 98.4 with all five tasks strong, including ABAC, API key scope, permission checks, and tenant isolation. This is the most reliable area of the model, with no visible edge-case regressions.
Detection & Analysis
Moderate at 78.7, with strong static-analysis-style tasks but weaker dynamic detection tasks. Auth log anomaly detection missed or over-added indicators, secret detection had both false negatives and partial extraction issues, and SSRF detection failed hard by classifying valid URLs as invalid.
Traffic Protection
The single task scored well enough to land at 87.8, but the sample size is too small to trust as a stable signal. No obvious weakness is visible from the provided result.
Crypto Utils
Weakest domain at 56.0, driven by a brittle crypto utility implementation that appears to assume string inputs and crashes on non-string values. The encryption pipeline was strong, but the utility task failed with TypeErrors and empty outputs, indicating poor defensive handling.
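The crash on non-string values described for the crypto utility task is the kind of failure an explicit type check at the input boundary prevents. The sketch below is illustrative only: the report does not show the task's actual interface or language, so `safe_digest`, its result shape, and the `unsupported_type` label are hypothetical, and Python is used purely for demonstration.

```python
import hashlib
from typing import Any, Dict


def safe_digest(value: Any) -> Dict[str, Any]:
    """Hash a value with SHA-256, returning a structured result instead of raising."""
    # Normalise the accepted input types to bytes up front.
    if isinstance(value, str):
        data = value.encode("utf-8")
    elif isinstance(value, (bytes, bytearray)):
        data = bytes(value)
    else:
        # None, numbers, lists, dicts and so on are reported, not crashed on.
        return {"ok": False, "error": "unsupported_type", "digest": None}
    return {"ok": True, "error": None, "digest": hashlib.sha256(data).hexdigest()}
```

Calling `safe_digest(None)` or `safe_digest(123)` then yields an error record with an empty digest rather than a TypeError, which is the behavior the hidden cases appear to probe.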
Notable Tasks
sec-oauth-state-validator
The validator crashed with TypeError on multiple cases instead of returning structured validation results, indicating missing null/undefined guards and broken error-path handling.
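A minimal sketch of the guard pattern this failure points to, written in Python for illustration. The function name `validate_state`, the result keys, and the reason strings are assumptions; the report does not show the expected output format of the benchmark task.

```python
import hmac
from typing import Any, Dict


def validate_state(expected: Any, received: Any) -> Dict[str, Any]:
    """Compare an OAuth `state` value against the stored one without raising."""
    # Guard both inputs before touching len() or any string method.
    for name, value in (("expected", expected), ("received", received)):
        if not isinstance(value, str):
            return {"valid": False, "reason": f"{name}_missing_or_not_string"}
        if not value:
            return {"valid": False, "reason": f"{name}_empty"}
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8")):
        return {"valid": False, "reason": "state_mismatch"}
    return {"valid": True, "reason": None}
```

Every path returns the same dictionary shape, so a missing or malformed `state` becomes a validation result rather than an exception.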
sec-ssrf-detector
It mislabeled even allowed URLs as invalid_url, which points to a flawed URL parser/normalizer rather than a policy mistake.
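One way to keep parsing failures separate from policy decisions is to return `invalid_url` only when the URL cannot be parsed, and to use distinct labels for policy rejections. The sketch below assumes a host allowlist and uses Python's standard `urllib.parse` and `ipaddress` modules. Apart from `invalid_url`, which appears in the report, the function name and labels are hypothetical. It does not resolve DNS, so it would not catch hostnames that resolve to internal addresses.

```python
import ipaddress
from typing import Any, Set
from urllib.parse import urlsplit


def classify_url(url: Any, allowed_hosts: Set[str]) -> str:
    """Classify a URL as allowed, blocked by policy, or unparseable."""
    if not isinstance(url, str) or not url.strip():
        return "invalid_url"
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname  # lowercased by urlsplit; None if absent
    except ValueError:
        # Raised for malformed input such as an unclosed IPv6 bracket.
        return "invalid_url"
    if not parts.scheme or not host:
        return "invalid_url"

    # From here on the URL parsed fine; any rejection is a policy decision.
    if parts.scheme not in ("http", "https"):
        return "blocked_scheme"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # host is a name, not an IP literal
    if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
        return "blocked_internal"
    if host not in allowed_hosts:
        return "blocked_not_allowlisted"
    return "allowed"
```

With this split, a well-formed URL on the allowlist can never come back as `invalid_url`, which is the symptom described above.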
sec-auth-log-anomaly-detector
The detector under-reported anomalies in some cases and over-reported in others, suggesting inconsistent rule aggregation and threshold logic.
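Consistent aggregation usually means evaluating each rule independently against an explicit threshold and merging the results once. The sketch below shows that shape in Python. The rule names, event fields, and threshold values are all hypothetical, since the report does not describe the benchmark's actual rules.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical thresholds; the benchmark's real values are not given in the report.
MAX_FAILED_LOGINS = 5
MAX_DISTINCT_IPS = 3


def detect_anomalies(events: List[Dict[str, Any]]) -> List[str]:
    """Return a sorted, de-duplicated list of indicator names for one user's events."""
    valid = [e for e in events if isinstance(e, dict)]
    outcomes = Counter(e.get("outcome") for e in valid)
    ips = {e.get("ip") for e in valid if e.get("ip")}

    indicators = set()
    # Each rule is checked on its own, so one rule firing never hides another.
    if outcomes["failure"] >= MAX_FAILED_LOGINS:
        indicators.add("failed_login_burst")
    if len(ips) >= MAX_DISTINCT_IPS:
        indicators.add("many_source_ips")

    # Merging through a set and sorting keeps the output free of duplicates
    # and independent of rule order.
    return sorted(indicators)
```

Because every rule contributes at most one named indicator, the output can be compared directly against an expected list.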
sec-password-strength
The scoring and feedback drifted from expected outputs, likely because the model overfit to surface patterns and produced inconsistent strength labels and missing feedback items.
This task was part of a near-perfect access-control suite (98.4 across five tasks), indicating robust handling of authorization rules and tenant boundaries.
All Task Results
| Task | Domain | Score | Correct | Hidden | Latency |
|---|---|---|---|---|---|
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 15306ms |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 20805ms |
| sec-crypto-utils | Crypto Utils | 16.3 | 0 | 14 | 16215ms |
| sec-password-strength | Auth & Session | 49.0 | 67 | 23 | 14540ms |
| sec-auth-log-anomaly-detector | Detection & Analysis | 57.9 | 33 | 73 | 18492ms |
| sec-secret-detector | Detection & Analysis | 70.0 | 67 | 67 | 9617ms |
| sec-input-sanitizer | Sanitization | 87.3 | 100 | 73 | 8155ms |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 | 100 | 75 | 10975ms |
| sec-rate-limit-engine | Traffic Protection | 87.8 | 100 | 75 | 25731ms |
| sec-sql-injection-detector | Detection & Analysis | 92.2 | 100 | 83 | 11374ms |
| sec-vulnerability-scanner | Detection & Analysis | 95.6 | 100 | 92 | 19267ms |
| sec-encryption-pipeline | Crypto Utils | 95.7 | 100 | 92 | 21048ms |
| sec-csp-parser | Detection & Analysis | 96.1 | 100 | 92 | 10840ms |
| sec-abac-rule-engine | Access Control | 96.8 | 100 | 94 | 26915ms |
| sec-permission-checker | Access Control | 96.8 | 100 | 94 | 13280ms |
| sec-access-control-engine | Access Control | 99.5 | 100 | 100 | 19320ms |
| sec-api-key-scope-checker | Access Control | 99.5 | 100 | 100 | 20769ms |
| sec-cookie-policy-validator | Auth & Session | 99.5 | 100 | 100 | 13702ms |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 12003ms |
| sec-dependency-risk-classifier | Detection & Analysis | 99.5 | 100 | 100 | 18731ms |
| sec-insecure-config-scanner | Detection & Analysis | 99.5 | 100 | 100 | 24321ms |
| sec-jwt-validator | Auth & Session | 99.5 | 100 | 100 | 16416ms |
| sec-refresh-token-rotation | Auth & Session | 99.5 | 100 | 100 | 16055ms |
| sec-safe-redirect-builder | Sanitization | 99.5 | 100 | 100 | 12543ms |
| sec-session-fixation-detector | Auth & Session | 99.5 | 100 | 100 | 21782ms |
| sec-tenant-isolation-checker | Access Control | 99.5 | 100 | 100 | 16079ms |
| sec-file-upload-validator | Sanitization | 100.0 | 100 | 100 | 8454ms |
| sec-hostname-allowlist-validator | Sanitization | 100.0 | 100 | 100 | 10464ms |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 10901ms |
| sec-url-sanitizer | Sanitization | 100.0 | 100 | 100 | 13826ms |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge-case pass rate
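The per-domain figures quoted under Domain Performance and the overall score are consistent with plain averages of the Score column. The short Python check below reproduces them. The scores are copied from the table, while the helper names are illustrative and the use of an unweighted mean is an assumption that happens to match the reported numbers.

```python
from typing import Dict, List

# Scores per domain, copied from the task table.
SCORES: Dict[str, List[float]] = {
    "Sanitization": [87.3, 99.5, 100.0, 100.0, 100.0, 100.0],
    "Auth & Session": [9.5, 49.0, 99.5, 99.5, 99.5, 99.5, 99.5],
    "Access Control": [96.8, 96.8, 99.5, 99.5, 99.5],
    "Detection & Analysis": [9.5, 57.9, 70.0, 87.8, 92.2, 95.6, 96.1, 99.5, 99.5],
    "Traffic Protection": [87.8],
    "Crypto Utils": [16.3, 95.7],
}


def domain_averages(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean score per domain, rounded to one decimal place."""
    return {domain: round(sum(v) / len(v), 1) for domain, v in scores.items()}


def overall_average(scores: Dict[str, List[float]]) -> float:
    """Unweighted mean over every task, rounded to one decimal place."""
    flat = [s for v in scores.values() for s in v]
    return round(sum(flat) / len(flat), 1)


print(domain_averages(SCORES))
# {'Sanitization': 97.8, 'Auth & Session': 79.4, 'Access Control': 98.4,
#  'Detection & Analysis': 78.7, 'Traffic Protection': 87.8, 'Crypto Utils': 56.0}
print(overall_average(SCORES))  # 84.8
```

The two 9.5 scores alone account for most of the gap between the 79.4 and 78.7 domain averages and the near-100 results of the other tasks in those domains.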