GPT-5.4 Mini
openai/gpt-5.4-mini
Overall score: 83.2
Tasks: 30
Passed: 12
Failed: 18
Avg latency: 5660ms
Total cost: $0.0762
AI Commentary
by openai/gpt-5.4-mini
GPT-5.4 Mini is strong on core security primitives, with very high scores in sanitization (97.9) and access control (96.6), and solid visible-test performance overall (84.4%). Its main weaknesses are in edge-case handling and output robustness: hidden pass rate drops to 79.3%, and several failures show brittle parsing/validation logic, especially in SSRF detection, OAuth state handling, and crypto utilities where TypeErrors and null outputs indicate incomplete defensive coding.
Domain Performance
Sanitization: This is the model’s best area, with near-perfect performance across file upload, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization. The results suggest it reliably applies standard input-hardening patterns and handles common injection vectors well.
Auth & Session: Performance is mixed. The cookie policy, CSRF token, JWT, and refresh-token tasks were strong, but oauth-state-validator failed outright with null outputs and a TypeError, indicating missing null/undefined guards. Password-strength scoring was also inconsistent, with scores and feedback that did not line up, suggesting weak rubric adherence on nuanced classification tasks.
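One way to keep a password score and its feedback from drifting apart is to derive both from the same list of checks. The following is a minimal Python sketch of that idea, not the benchmark's rubric: the checks, thresholds, labels, and the names `CHECKS`, `StrengthResult`, and `score_password` are all assumptions made for illustration, and the report does not state which language the tasks used.

```python
import re
from typing import Callable, List, NamedTuple, Tuple


class StrengthResult(NamedTuple):
    score: int            # number of checks passed, 0..len(CHECKS)
    label: str            # coarse label derived from the score
    feedback: List[str]   # one message per failed check


# Hypothetical rubric: (predicate, message shown when the predicate fails).
CHECKS: List[Tuple[Callable[[str], bool], str]] = [
    (lambda p: len(p) >= 12, "Use at least 12 characters."),
    (lambda p: re.search(r"[a-z]", p) is not None, "Add a lowercase letter."),
    (lambda p: re.search(r"[A-Z]", p) is not None, "Add an uppercase letter."),
    (lambda p: re.search(r"\d", p) is not None, "Add a digit."),
    (lambda p: re.search(r"[^A-Za-z0-9]", p) is not None, "Add a symbol."),
]


def score_password(password: object) -> StrengthResult:
    """Score a password so that score and feedback come from the same checks."""
    # Guard against None, numbers, and empty strings instead of raising.
    if not isinstance(password, str) or password == "":
        return StrengthResult(0, "invalid", ["Password must be a non-empty string."])
    failed = [message for check, message in CHECKS if not check(password)]
    score = len(CHECKS) - len(failed)
    if score == len(CHECKS):
        label = "strong"
    elif score >= 3:
        label = "fair"
    else:
        label = "weak"
    return StrengthResult(score, label, failed)
```

Because every lost point corresponds to exactly one feedback message, a valid input always satisfies `score + len(feedback) == len(CHECKS)`, which is an easy invariant to test.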
Access Control: This is a clear strength, with high scores across the engine, permission, and tenant-isolation checks. The model appears reliable at enforcing authorization boundaries and multi-tenant separation without obvious logic gaps.
Detection & Analysis: This domain is uneven. CSP parsing, insecure-config scanning, and vulnerability scanning were strong, but anomaly detection and secret detection missed important cases. The failures suggest incomplete pattern coverage and inconsistent prioritization of indicators, while the SSRF detector returned 'invalid_url' for every distinct case, pointing to overly strict or malformed URL parsing.
Traffic Protection: The single task in this area scored well, so there is no evidence of a weakness here. However, the sample size is too small to infer robustness beyond the tested rate-limiting behavior.
Crypto Utils: This is the weakest domain overall, with a low average score of 52.6 and a crypto-utils failure that returned null plus a TypeError. The error pattern suggests the implementation is not resilient to malformed inputs or missing fields, which is especially problematic for security-sensitive utility code.
Notable Tasks
sec-oauth-state-validator: The validator returned null and threw a TypeError on multiple cases, which points to missing defensive checks around undefined state fields rather than a simple logic mistake.
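The kind of defensive check this failure points to can be sketched as follows. This is a hedged Python illustration, not the benchmark's reference solution: the function name `validate_oauth_state`, the stored-state fields `value` and `expires_at`, and the reason strings are assumptions. The point is that every missing or mistyped field yields a structured result instead of an exception or a null.

```python
import hmac
import time
from typing import Any, Dict, Optional


def _fail(reason: str) -> Dict[str, Any]:
    return {"valid": False, "reason": reason}


def validate_oauth_state(
    stored: Any,
    returned_state: Any,
    now: Optional[float] = None,
) -> Dict[str, Any]:
    """Compare the callback's state with the stored one without ever raising."""
    # Guard the container first: a missing session entry is a normal case.
    if not isinstance(stored, dict):
        return _fail("missing_stored_state")
    expected = stored.get("value")        # hypothetical field name
    expires_at = stored.get("expires_at")  # hypothetical field name
    if not isinstance(expected, str) or not expected:
        return _fail("missing_stored_state")
    if not isinstance(returned_state, str) or not returned_state:
        return _fail("missing_returned_state")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(expires_at, bool) or not isinstance(expires_at, (int, float)):
        return _fail("missing_expiry")
    current = time.time() if now is None else now
    if current >= expires_at:
        return _fail("expired")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), returned_state.encode()):
        return _fail("state_mismatch")
    return {"valid": True, "reason": None}
```

Passing `now` explicitly keeps the expiry check deterministic in tests; in production the default `time.time()` applies.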
sec-ssrf-detector: All cases collapsed to 'invalid_url' with a null normalizedHost, suggesting the URL parser or normalization path is broken and cannot distinguish protocol, private-IP, and valid-host cases.
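To illustrate the three outcomes that were being conflated, here is a small Python sketch that separates unparseable input, disallowed protocols, and private or loopback addresses. Only `invalid_url` and `normalizedHost` come from the report; the other verdict strings, the allowed-scheme policy, and the name `classify_url` are assumptions.

```python
import ipaddress
from typing import Any, Dict, Optional
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumed policy, not from the report


def _result(verdict: str, host: Optional[str]) -> Dict[str, Optional[str]]:
    return {"verdict": verdict, "normalizedHost": host}


def classify_url(raw: Any) -> Dict[str, Optional[str]]:
    """Classify a URL as invalid, blocked by protocol, blocked by IP, or allowed."""
    if not isinstance(raw, str) or not raw.strip():
        return _result("invalid_url", None)
    try:
        parts = urlsplit(raw.strip())
        hostname = parts.hostname  # lowercased, without port or brackets
    except ValueError:
        return _result("invalid_url", None)
    if not parts.scheme:
        return _result("invalid_url", None)
    host = (hostname or "").rstrip(".") or None
    # Check the protocol before the host so file: or ftp: URLs get their own verdict.
    if parts.scheme not in ALLOWED_SCHEMES:
        return _result("blocked_protocol", host)
    if host is None:
        return _result("invalid_url", None)
    if host == "localhost" or host.endswith(".localhost"):
        return _result("blocked_private_ip", host)
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treated as a DNS name, which is NOT resolved here.
        return _result("allowed", host)
    blocked = (ip.is_private or ip.is_loopback or ip.is_link_local
               or ip.is_reserved or ip.is_unspecified or ip.is_multicast)
    return _result("blocked_private_ip" if blocked else "allowed", str(ip))
```

This sketch does not resolve DNS names or decode alternative IP encodings (for example a bare decimal host), so it shows the classification structure only and is not a complete SSRF defense.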
sec-secret-detector: It missed obvious AWS secrets and detected other tokens inconsistently, indicating incomplete regex coverage and poor handling of multi-secret extraction across lines.
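Multi-secret extraction across lines can be sketched as a nested loop: every line is scanned with every pattern, and every match is recorded rather than only the first. The Python below is an illustrative assumption about how such a detector might be structured. The three patterns cover well-known public formats (AWS access key ID prefixes, classic GitHub personal access tokens, PEM private key headers) and are nowhere near a complete rule set; `PATTERNS` and `find_secrets` are hypothetical names.

```python
import re
from typing import Any, Dict, List

# Illustrative patterns only; real scanners carry far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text: Any) -> List[Dict[str, Any]]:
    """Return every match of every pattern, with its 1-based line number."""
    if not isinstance(text, str):
        return []  # malformed input yields an empty result, not an exception
    findings: List[Dict[str, Any]] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            # finditer (not search) so several secrets on one line are all reported.
            for match in pattern.finditer(line):
                findings.append(
                    {"type": kind, "line": line_no, "match": match.group(0)}
                )
    return findings
```

Using `finditer` per line is the design choice that matters here: a detector built on a single `search` call over the whole input reports at most one secret per pattern.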
One Access Control task was among the strongest in the benchmark, consistent with the domain’s high average and indicating reliable authorization decision logic.
sec-encryption-pipeline: The strong result here suggests the model can correctly assemble secure crypto workflows when the task is well-scoped and the expected behavior is explicit.
All Task Results
| Task | Domain | Score | Correct (%) | Hidden (%) | Latency |
|---|---|---|---|---|---|
| sec-crypto-utils | Crypto Utils | 9.5 | 0 | 0 | 5971ms |
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 5016ms |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 6308ms |
| sec-password-strength | Auth & Session | 49.5 | 67 | 23 | 5255ms |
| sec-auth-log-anomaly-detector | Detection & Analysis | 57.9 | 33 | 73 | 9279ms |
| sec-secret-detector | Detection & Analysis | 66.1 | 67 | 58 | 4604ms |
| sec-dependency-risk-classifier | Detection & Analysis | 80.4 | 67 | 90 | 5631ms |
| sec-input-sanitizer | Sanitization | 87.3 | 100 | 73 | 2278ms |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 | 100 | 75 | 5302ms |
| sec-rate-limit-engine | Traffic Protection | 87.8 | 100 | 75 | 7678ms |
| sec-session-fixation-detector | Auth & Session | 91.7 | 100 | 83 | 6412ms |
| sec-sql-injection-detector | Detection & Analysis | 92.2 | 100 | 83 | 4314ms |
| sec-api-key-scope-checker | Access Control | 92.9 | 100 | 86 | 6635ms |
| sec-abac-rule-engine | Access Control | 94.2 | 100 | 89 | 8204ms |
| sec-refresh-token-rotation | Auth & Session | 95.3 | 100 | 91 | 5237ms |
| sec-encryption-pipeline | Crypto Utils | 95.7 | 100 | 92 | 6773ms |
| sec-csp-parser | Detection & Analysis | 96.1 | 100 | 92 | 4533ms |
| sec-permission-checker | Access Control | 96.8 | 100 | 94 | 6160ms |
| sec-access-control-engine | Access Control | 99.5 | 100 | 100 | 6387ms |
| sec-cookie-policy-validator | Auth & Session | 99.5 | 100 | 100 | 5825ms |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 5750ms |
| sec-insecure-config-scanner | Detection & Analysis | 99.5 | 100 | 100 | 7595ms |
| sec-jwt-validator | Auth & Session | 99.5 | 100 | 100 | 6479ms |
| sec-tenant-isolation-checker | Access Control | 99.5 | 100 | 100 | 6166ms |
| sec-vulnerability-scanner | Detection & Analysis | 99.5 | 100 | 100 | 7464ms |
| sec-file-upload-validator | Sanitization | 100.0 | 100 | 100 | 3363ms |
| sec-hostname-allowlist-validator | Sanitization | 100.0 | 100 | 100 | 3859ms |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 2962ms |
| sec-safe-redirect-builder | Sanitization | 100.0 | 100 | 100 | 4803ms |
| sec-url-sanitizer | Sanitization | 100.0 | 100 | 100 | 3569ms |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
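The domain averages quoted in the commentary (Sanitization 97.9, Access Control 96.6, Crypto Utils 52.6) and the overall score of 83.2 can be reproduced from the Score column as plain unweighted means. The Python sketch below assumes that is how they were computed, which the page does not state explicitly; the names `SCORES`, `domain_averages`, and `overall_average` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (domain, score) pairs copied from the task table, in table order.
SCORES: List[Tuple[str, float]] = [
    ("Crypto Utils", 9.5), ("Auth & Session", 9.5),
    ("Detection & Analysis", 9.5), ("Auth & Session", 49.5),
    ("Detection & Analysis", 57.9), ("Detection & Analysis", 66.1),
    ("Detection & Analysis", 80.4), ("Sanitization", 87.3),
    ("Detection & Analysis", 87.8), ("Traffic Protection", 87.8),
    ("Auth & Session", 91.7), ("Detection & Analysis", 92.2),
    ("Access Control", 92.9), ("Access Control", 94.2),
    ("Auth & Session", 95.3), ("Crypto Utils", 95.7),
    ("Detection & Analysis", 96.1), ("Access Control", 96.8),
    ("Access Control", 99.5), ("Auth & Session", 99.5),
    ("Auth & Session", 99.5), ("Detection & Analysis", 99.5),
    ("Auth & Session", 99.5), ("Access Control", 99.5),
    ("Detection & Analysis", 99.5), ("Sanitization", 100.0),
    ("Sanitization", 100.0), ("Sanitization", 100.0),
    ("Sanitization", 100.0), ("Sanitization", 100.0),
]


def domain_averages(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Unweighted mean score per domain, rounded to one decimal place."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for domain, score in rows:
        buckets[domain].append(score)
    return {d: round(sum(v) / len(v), 1) for d, v in buckets.items()}


def overall_average(rows: List[Tuple[str, float]]) -> float:
    """Unweighted mean score over all tasks, rounded to one decimal place."""
    return round(sum(score for _, score in rows) / len(rows), 1)


averages = domain_averages(SCORES)
print(averages["Sanitization"], averages["Access Control"], averages["Crypto Utils"])
print(overall_average(SCORES))
```

Under the unweighted-mean assumption, the first line prints `97.9 96.6 52.6` and the second prints `83.2`, matching the figures reported on the page.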