GPT-5.4 Nano
openai/gpt-5.4-nano
Overall score: 81.9
Tasks: 30
Passed: 9
Failed: 21
Avg latency: 11721ms
Total cost: $0.0752
AI Commentary
by openai/gpt-5.4-mini
GPT-5.4 Nano is strong on straightforward security transformations and policy checks, with high scores in sanitization (95.2), access control (94.8), and traffic protection (99.5). Its main weaknesses are edge-case-heavy tasks and parsers: auth/session drops sharply on OAuth state validation, detection and analysis is uneven on SSRF detection and config scanning, and crypto utilities are brittle enough to throw type errors. These failures are a concern despite the 81.9 average score and the 77.1 hidden edge-case pass rate.
Domain Performance
Sanitization: Performance is excellent at 95.2, with correct handling of file uploads, HTML encoding, redirects, and URL sanitization. The main miss is hostname allowlisting, where the model rejected wildcard subdomains and punycode hostnames that should have matched, indicating incomplete wildcard and IDN normalization logic.
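The wildcard and punycode handling described above can be sketched as follows. This is a minimal illustration, not the benchmark's reference solution: the names `normalizeHost` and `matchesAllowlist` are hypothetical, and it assumes that `*.example.com` matches subdomains at any depth but not the apex domain, which the report does not specify. It relies on the WHATWG `URL` parser to lowercase hosts and convert Unicode labels to punycode.

```typescript
// Hypothetical helpers; the benchmark's real interface is not shown in this report.

// Returns a lowercase, punycode (A-label) hostname without a trailing dot,
// or null when the input cannot be treated as a bare hostname.
function normalizeHost(host: unknown): string | null {
  if (typeof host !== "string" || host.length === 0) return null;
  // Reject anything that is more than a hostname (paths, userinfo, ports, spaces).
  if (/[\s\/\\@:?#]/.test(host)) return null;
  try {
    // The WHATWG URL parser lowercases the host and applies IDNA processing.
    const hostname = new URL("http://" + host).hostname;
    return hostname.replace(/\.$/, "");
  } catch {
    return null; // unparseable host: fail closed
  }
}

// Assumption: "*.example.com" matches any subdomain depth but not "example.com" itself.
function matchesAllowlist(host: unknown, patterns: string[]): boolean {
  const h = normalizeHost(host);
  if (h === null) return false;
  for (const pattern of patterns) {
    const isWildcard = pattern.slice(0, 2) === "*.";
    const base = normalizeHost(isWildcard ? pattern.slice(2) : pattern);
    if (base === null) continue;
    if (!isWildcard) {
      if (h === base) return true;
      continue;
    }
    // Require a dot boundary so "evilexample.com" does not match "*.example.com".
    const suffix = "." + base;
    if (h.length > suffix.length && h.slice(-suffix.length) === suffix) return true;
  }
  return false;
}
```

Normalizing both the candidate host and the allowlist entry through the same function means a Unicode entry and its punycode form compare equal, which is one plausible way to avoid the mismatch the commentary describes.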
Auth & Session: This domain is mixed at 78.3. Cookie policy, CSRF, JWT, refresh rotation, and fixation detection were solid, but OAuth state validation failed catastrophically, returning null and throwing a TypeError, which suggests an unhandled undefined input path. Password strength scoring was also inconsistent, over-penalizing some weak passwords and misclassifying others, which points to unstable heuristic weighting.
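A guard for the undefined input path mentioned above might look like the sketch below. The types `StoredState` and `StateResult`, the function names, and the reason strings are all hypothetical, since the report does not show the task's actual interface. The point is only that missing inputs are checked before any property access, so the function returns a structured rejection instead of throwing.

```typescript
// Hypothetical shapes; the benchmark's actual input and output types are not given.
interface StoredState {
  value: string;
  expiresAt: number; // epoch milliseconds
}

type StateResult = { valid: true } | { valid: false; reason: string };

// Compares every character instead of exiting at the first mismatch.
// This reduces timing differences but is not a hard constant-time guarantee in JavaScript.
function sameString(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

function validateOAuthState(
  received: unknown,
  stored: StoredState | null | undefined,
  now: number
): StateResult {
  // Guard the stored side first: a missing session entry must not reach stored.value.
  if (stored === null || stored === undefined || typeof stored.value !== "string") {
    return { valid: false, reason: "missing_stored_state" };
  }
  // Guard the callback parameter: it may be absent or not a string.
  if (typeof received !== "string" || received.length === 0) {
    return { valid: false, reason: "missing_state_param" };
  }
  if (now >= stored.expiresAt) {
    return { valid: false, reason: "expired" };
  }
  if (!sameString(received, stored.value)) {
    return { valid: false, reason: "mismatch" };
  }
  return { valid: true };
}
```

For example, `validateOAuthState(undefined, undefined, Date.now())` returns a rejection with reason `missing_stored_state` rather than raising a TypeError.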
Access Control: A clear strength at 94.8, with the engine and permission checker performing reliably. No notable weaknesses appeared, so the model looks dependable for rule-based authorization logic.
Detection & Analysis: At 72.5, this is the most uneven area outside crypto. The model over-reported anomalies in auth logs, scored poorly on SSRF detection by labeling valid URLs as invalid, and used different issue labels than expected in the insecure config scanner, suggesting both semantic drift and brittle output formatting.
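The SSRF failure described above comes down to keeping URL parsing separate from the policy decision, so that only genuinely unparseable input is labeled `invalid_url`. The sketch below illustrates that separation under stated assumptions: the `blocked` and `allowed` labels, the function names, and the list of blocked ranges are hypothetical, and a real detector would also need to handle DNS resolution, IPv6 ranges, and redirects.

```typescript
// "invalid_url" appears in the report; "blocked" and "allowed" are hypothetical labels.
type SsrfVerdict = "invalid_url" | "blocked" | "allowed";

// Checks a dotted-quad host against loopback, private, and link-local IPv4 ranges.
function isInternalIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (m === null) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

function classifyUrl(raw: unknown): SsrfVerdict {
  // Step 1: parsing. Only input that fails to parse is "invalid_url".
  if (typeof raw !== "string") return "invalid_url";
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return "invalid_url";
  }
  // Step 2: policy, applied to the normalized host. The WHATWG parser
  // lowercases the host and canonicalizes numeric IPv4 forms.
  if (url.protocol !== "http:" && url.protocol !== "https:") return "blocked";
  const host = url.hostname;
  if (host === "localhost" || host === "[::1]" || isInternalIPv4(host)) return "blocked";
  return "allowed";
}
```

With this split, `https://example.com/path` parses and is allowed, while `http://169.254.169.254/latest/meta-data` parses and is blocked by policy rather than being reported as malformed.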
Traffic Protection: Effectively solved at 99.5, with rate limiting handled correctly. This is the most stable domain in the benchmark, although it covers a single task.
Crypto Utils: The weakest area at 56.0, driven by a hard failure in sec-crypto-utils, where charCodeAt was called on a non-string input. That runtime exception, plus missed outputs in the same task family, indicates poor input type handling and insufficient defensive coding.
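The kind of input type handling that would avoid calling charCodeAt on a non-string can be sketched as below. The helpers `toBytes` and `toHex` are hypothetical and do not reflect the actual sec-crypto-utils interface, which this report does not show. The idea is to convert every accepted input type to bytes at one checked boundary and reject everything else with a clear error.

```typescript
// Hypothetical helpers; the real sec-crypto-utils interface is not shown in this report.

function isByte(x: unknown): x is number {
  return typeof x === "number" && x >= 0 && x <= 255 && Math.floor(x) === x;
}

// Converts supported inputs to bytes at a single checked boundary.
function toBytes(input: unknown): Uint8Array {
  if (input instanceof Uint8Array) return input;
  if (typeof input === "string") {
    // TextEncoder encodes as UTF-8, so non-ASCII characters become multiple bytes.
    return new TextEncoder().encode(input);
  }
  if (Array.isArray(input) && input.every(isByte)) {
    return new Uint8Array(input as number[]);
  }
  // Anything else (undefined, null, numbers, objects) is rejected explicitly
  // instead of failing later inside a string method.
  throw new TypeError("toBytes: expected string, Uint8Array, or byte array, got " + typeof input);
}

// Lowercase hex encoding built on the checked conversion above.
function toHex(input: unknown): string {
  let out = "";
  toBytes(input).forEach((b) => {
    const h = b.toString(16);
    out += h.length === 1 ? "0" + h : h;
  });
  return out;
}
```

Whether unsupported input should throw or return an error value depends on the task contract; throwing a descriptive TypeError at the boundary is one option, and it at least makes the failure deliberate.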
Notable Tasks
sec-oauth-state-validator: The model returned null and threw a TypeError on every case, which points to a missing guard for undefined state data rather than a simple logic error.
sec-ssrf-detector: It marked even valid URLs as invalid_url and never normalized hosts, so the failure is in URL parsing and normalization, before any SSRF policy decision.
sec-insecure-config-scanner: The scanner produced semantically similar but non-matching issue strings, implying the detection logic may be roughly correct but the output schema and exact taxonomy are not aligned with the benchmark.
sec-hostname-allowlist-validator: It rejected wildcard subdomains and punycode hostnames that should have matched, so wildcard expansion and internationalized domain handling are incomplete.
This task was effectively perfect, indicating the model can implement deterministic policy logic cleanly when the state machine is simple and well-specified.
All Task Results
| Task | Domain | Score | Correct | Hidden | Latency |
|---|---|---|---|---|---|
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 9463ms |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 17302ms |
| sec-crypto-utils | Crypto Utils | 16.3 | 0 | 14 | 13503ms |
| sec-insecure-config-scanner | Detection & Analysis | 31.0 | 50 | 0 | 15254ms |
| sec-password-strength | Auth & Session | 49.0 | 67 | 23 | 8939ms |
| sec-secret-detector | Detection & Analysis | 69.5 | 67 | 67 | 9260ms |
| sec-auth-log-anomaly-detector | Detection & Analysis | 72.3 | 67 | 73 | 20395ms |
| sec-hostname-allowlist-validator | Sanitization | 77.8 | 67 | 83 | 6831ms |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 | 100 | 75 | 9476ms |
| sec-api-key-scope-checker | Access Control | 90.7 | 100 | 82 | 16297ms |
| sec-tenant-isolation-checker | Access Control | 92.6 | 100 | 86 | 11353ms |
| sec-input-sanitizer | Sanitization | 93.7 | 100 | 87 | 3937ms |
| sec-abac-rule-engine | Access Control | 94.2 | 100 | 89 | 14010ms |
| sec-dependency-risk-classifier | Detection & Analysis | 94.8 | 100 | 90 | 8248ms |
| sec-refresh-token-rotation | Auth & Session | 95.3 | 100 | 91 | 14013ms |
| sec-session-fixation-detector | Auth & Session | 95.6 | 100 | 92 | 20495ms |
| sec-vulnerability-scanner | Detection & Analysis | 95.6 | 100 | 92 | 18143ms |
| sec-encryption-pipeline | Crypto Utils | 95.7 | 100 | 92 | 14954ms |
| sec-csp-parser | Detection & Analysis | 96.1 | 100 | 92 | 6992ms |
| sec-sql-injection-detector | Detection & Analysis | 96.1 | 100 | 92 | 7415ms |
| sec-permission-checker | Access Control | 96.8 | 100 | 94 | 12533ms |
| sec-access-control-engine | Access Control | 99.5 | 100 | 100 | 8568ms |
| sec-cookie-policy-validator | Auth & Session | 99.5 | 100 | 100 | 9347ms |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 9195ms |
| sec-jwt-validator | Auth & Session | 99.5 | 100 | 100 | 15315ms |
| sec-rate-limit-engine | Traffic Protection | 99.5 | 100 | 100 | 22173ms |
| sec-safe-redirect-builder | Sanitization | 99.5 | 100 | 100 | 7778ms |
| sec-file-upload-validator | Sanitization | 100.0 | 100 | 100 | 5300ms |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 7719ms |
| sec-url-sanitizer | Sanitization | 100.0 | 100 | 100 | 7418ms |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge-case pass rate
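The report does not state how domain scores are aggregated, but the figures are consistent with an unweighted mean of the task scores in the table, rounded to one decimal place. For example, Crypto Utils is (16.3 + 95.7) / 2 = 56.0, and the mean of all 30 task scores is 81.9. The sketch below shows that calculation under this assumption; `TaskResult` and `domainAverages` are hypothetical names, and the sample holds only three rows from the table.

```typescript
// Assumption: domain score = unweighted mean of task scores, rounded to one decimal.
interface TaskResult {
  task: string;
  domain: string;
  score: number;
}

function domainAverages(results: TaskResult[]): { [domain: string]: number } {
  const sums: { [domain: string]: { total: number; count: number } } = {};
  for (const r of results) {
    const entry = sums[r.domain] || { total: 0, count: 0 };
    entry.total += r.score;
    entry.count += 1;
    sums[r.domain] = entry;
  }
  const averages: { [domain: string]: number } = {};
  for (const domain in sums) {
    const mean = sums[domain].total / sums[domain].count;
    averages[domain] = Math.round(mean * 10) / 10; // one decimal place
  }
  return averages;
}

// Three rows copied from the results table.
const sample: TaskResult[] = [
  { task: "sec-crypto-utils", domain: "Crypto Utils", score: 16.3 },
  { task: "sec-encryption-pipeline", domain: "Crypto Utils", score: 95.7 },
  { task: "sec-rate-limit-engine", domain: "Traffic Protection", score: 99.5 },
];

const averages = domainAverages(sample);
```

Here `averages["Crypto Utils"]` evaluates to 56 and `averages["Traffic Protection"]` to 99.5, matching the domain figures quoted in the commentary.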