Model Analysis

GPT-5.4 Mini

openai/gpt-5.4-mini

83.2

overall score

84.4% visible

79.3% hidden

Tasks

Passed

Failed

Avg latency

5660ms

Total cost

$0.0762

AI Commentary

by openai/gpt-5.4-mini

GPT-5.4 Mini is strong on core security primitives, with very high scores in sanitization (97.9) and access control (96.6), and solid visible-test performance overall (84.4%). Its main weaknesses are in edge-case handling and output robustness: hidden pass rate drops to 79.3%, and several failures show brittle parsing/validation logic, especially in SSRF detection, OAuth state handling, and crypto utilities where TypeErrors and null outputs indicate incomplete defensive coding.

Domain Performance

Sanitization · 6 tasks

97.9

This is the model’s best area, with near-perfect performance across file upload, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization. The results suggest it reliably applies standard input-hardening patterns and handles common injection vectors well.

Auth & Session · 7 tasks

77.8

Performance is mixed. The cookie policy, CSRF token, JWT, and refresh-token tasks were strong, but sec-oauth-state-validator scored 9.5, returning null outputs and throwing a TypeError, which indicates missing null/undefined guards. Password-strength scoring was also inconsistent (49.5, with a 23% hidden pass rate): scores and feedback were misaligned, suggesting weak rubric adherence on nuanced classification tasks.

Access Control · 5 tasks

96.6

Access control is a clear strength: all five tasks scored between 92.9 and 99.5, including the access-control engine, permission, and tenant-isolation checks. The model appears reliable at enforcing authorization boundaries and multi-tenant separation without obvious logic gaps.

Detection & Analysis · 9 tasks

76.6

This domain is uneven. CSP parsing, insecure-config scanning, and vulnerability scanning were strong, but anomaly detection (57.9) and secret detection (66.1) missed important cases, which suggests incomplete pattern coverage and inconsistent prioritization of indicators. The SSRF detector (9.5) returned 'invalid_url' for every case regardless of input, pointing to overly strict or malformed URL parsing.

Traffic Protection · 1 task

87.8

The single task in this area scored well, so there is no evidence of a weakness here. However, the sample size is too small to infer robustness beyond the tested rate-limiting behavior.

Crypto Utils · 2 tasks

52.6

This is the weakest domain overall, but the 52.6 average hides a split: sec-encryption-pipeline scored 95.7, while sec-crypto-utils scored 9.5 after returning null and throwing a TypeError. The error pattern suggests the implementation is not resilient to malformed inputs or missing fields, which is especially problematic for security-sensitive utility code.
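
The report does not show what sec-crypto-utils asks for, so the following Python sketch is only an illustration of the kind of input resilience the commentary describes. It assumes a hypothetical helper, `verify_hmac_sha256`, that checks a hex-encoded HMAC-SHA256 tag and returns a structured result for every input rather than returning null or raising a TypeError. The function name, argument names, and error strings are all assumptions.

```python
import hashlib
import hmac


def verify_hmac_sha256(key, message, signature_hex):
    """Verify a hex HMAC-SHA256 tag; always return a result dict.

    Hypothetical helper: names and error labels are illustrative only.
    """
    # Reject wrong types up front so later calls cannot raise TypeError.
    if not isinstance(key, (bytes, bytearray)) or len(key) == 0:
        return {"ok": False, "error": "invalid_key"}
    if not isinstance(message, (bytes, bytearray)):
        return {"ok": False, "error": "invalid_message"}
    if not isinstance(signature_hex, str):
        return {"ok": False, "error": "invalid_signature"}

    # Malformed hex is an expected failure mode, not an exception path.
    try:
        provided = bytes.fromhex(signature_hex)
    except ValueError:
        return {"ok": False, "error": "invalid_signature"}

    expected = hmac.new(bytes(key), bytes(message), hashlib.sha256).digest()

    # compare_digest runs in constant time and returns False on length mismatch.
    if not hmac.compare_digest(expected, provided):
        return {"ok": False, "error": "signature_mismatch"}
    return {"ok": True, "error": None}
```

The design choice is that malformed or missing input is treated as an ordinary, reportable outcome, so callers never have to distinguish a crash from a failed verification.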

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The validator returned null and threw a TypeError on multiple cases, which points to missing defensive checks around undefined state fields rather than a simple logic mistake.
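
A minimal Python sketch of the defensive checks this points to, assuming a hypothetical interface in which the stored state is a mapping with `value` and `expires_at` fields and the validator returns a result dict. None of these names come from the benchmark, and the benchmark's implementation language is not stated.

```python
import hmac
import time
from typing import Any, Optional


def validate_oauth_state(stored: Any, received: Any, now: Optional[float] = None) -> dict:
    """Return a result dict for every input instead of raising or returning None.

    Field names ("value", "expires_at") and reason strings are hypothetical.
    """
    now = time.time() if now is None else now

    # Guard 1: the stored record may be missing or not a mapping at all.
    if not isinstance(stored, dict):
        return {"valid": False, "reason": "missing_stored_state"}

    expected = stored.get("value")
    expires_at = stored.get("expires_at")

    # Guard 2: individual fields may be absent or of the wrong type.
    if not isinstance(expected, str) or not expected:
        return {"valid": False, "reason": "missing_stored_state"}
    if not isinstance(received, str) or not received:
        return {"valid": False, "reason": "missing_received_state"}
    if isinstance(expires_at, bool) or not isinstance(expires_at, (int, float)):
        return {"valid": False, "reason": "invalid_expiry"}

    if now >= expires_at:
        return {"valid": False, "reason": "expired"}

    # Constant-time comparison avoids leaking how many characters matched.
    a = expected.encode("utf-8", "surrogatepass")
    b = received.encode("utf-8", "surrogatepass")
    if not hmac.compare_digest(a, b):
        return {"valid": False, "reason": "mismatch"}

    return {"valid": True, "reason": "ok"}
```

Every early return carries a reason, so a missing or undefined field produces a rejected result rather than a null output.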

sec-ssrf-detector · 9.5 · Detection & Analysis

All cases collapsed to 'invalid_url' with null normalizedHost, suggesting the URL parser or normalization path is broken and cannot distinguish protocol, private-IP, and valid-host cases.
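
To illustrate the distinction the detector failed to make, here is a hedged Python sketch that separates the protocol, private-IP, and valid-host cases using only the standard library. Only 'invalid_url' and `normalizedHost` appear in the report. The function name and the other verdict labels are assumptions, and the sketch does not resolve DNS or handle alternative IP encodings, which a real SSRF check would also need.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web schemes are permitted


def classify_url(raw):
    """Classify a URL for SSRF screening; labels besides 'invalid_url' are hypothetical."""
    result = {"verdict": "invalid_url", "normalizedHost": None}
    if not isinstance(raw, str) or not raw.strip():
        return result
    try:
        parts = urlsplit(raw.strip())
        # .hostname is lowercased with brackets and port removed.
        host = (parts.hostname or "").rstrip(".")
    except ValueError:
        return result  # e.g. malformed IPv6 literal
    if not parts.scheme:
        return result

    result["normalizedHost"] = host or None

    # Protocol case: parsed fine, but the scheme is not allowed.
    if parts.scheme not in ALLOWED_SCHEMES:
        result["verdict"] = "blocked_protocol"
        return result
    if not host:
        return result

    # Private-IP case: only applies when the host is an IP literal.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        result["verdict"] = "allowed"  # ordinary hostname
        return result

    internal = (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )
    result["verdict"] = "private_ip" if internal else "allowed"
    return result
```

Because 'invalid_url' is reserved for inputs that genuinely fail to parse, a blocked scheme or an internal address each gets its own verdict with the host still populated.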

sec-secret-detector · 66.1 · Detection & Analysis

It missed obvious AWS secrets and inconsistently detected other tokens, indicating incomplete regex coverage and poor handling of multi-secret extraction across lines.
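
A small Python sketch of multi-secret extraction across lines, assuming three commonly cited token shapes: AWS access key IDs, AWS secret access keys identified by their config key name, and GitHub personal access tokens. The pattern set is illustrative and far from exhaustive, and the benchmark's expected output format is unknown, so the tuple layout is an assumption.

```python
import re

# Illustrative patterns only; real scanners use much larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "aws_secret_access_key": re.compile(
        r"(?i)aws_secret_access_key\s*[=:]\s*[\"']?([A-Za-z0-9/+=]{40})(?![A-Za-z0-9/+=])"
    ),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_secrets(text):
    """Return every match as (line_number, kind, value), scanning all lines."""
    if not isinstance(text, str):
        return []
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            # finditer rather than search: one line can hold several secrets.
            for match in pattern.finditer(line):
                # Use the capture group when the pattern defines one.
                value = match.group(1) if pattern.groups else match.group(0)
                findings.append((line_no, kind, value))
    return findings
```

Iterating over every line and every match, instead of stopping at the first hit, is what allows several secrets in one input to be reported together.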

sec-access-control-engine · 99.5 · Access Control

This was one of the strongest tasks in the benchmark, consistent with the domain’s high average and indicating reliable authorization decision logic.

sec-encryption-pipeline · 95.7 · Crypto Utils

The strong result here suggests the model can correctly assemble secure crypto workflows when the task is well-scoped and the expected behavior is explicit.

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-crypto-utils	Crypto Utils	9.5	0	0	5971ms
sec-oauth-state-validator	Auth & Session	9.5	0	0	5016ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	6308ms
sec-password-strength	Auth & Session	49.5	67	23	5255ms
sec-auth-log-anomaly-detector	Detection & Analysis	57.9	33	73	9279ms
sec-secret-detector	Detection & Analysis	66.1	67	58	4604ms
sec-dependency-risk-classifier	Detection & Analysis	80.4	67	90	5631ms
sec-input-sanitizer	Sanitization	87.3	100	73	2278ms
sec-csp-nonce-validator	Detection & Analysis	87.8	100	75	5302ms
sec-rate-limit-engine	Traffic Protection	87.8	100	75	7678ms
sec-session-fixation-detector	Auth & Session	91.7	100	83	6412ms
sec-sql-injection-detector	Detection & Analysis	92.2	100	83	4314ms
sec-api-key-scope-checker	Access Control	92.9	100	86	6635ms
sec-abac-rule-engine	Access Control	94.2	100	89	8204ms
sec-refresh-token-rotation	Auth & Session	95.3	100	91	5237ms
sec-encryption-pipeline	Crypto Utils	95.7	100	92	6773ms
sec-csp-parser	Detection & Analysis	96.1	100	92	4533ms
sec-permission-checker	Access Control	96.8	100	94	6160ms
sec-access-control-engine	Access Control	99.5	100	100	6387ms
sec-cookie-policy-validator	Auth & Session	99.5	100	100	5825ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	5750ms
sec-insecure-config-scanner	Detection & Analysis	99.5	100	100	7595ms
sec-jwt-validator	Auth & Session	99.5	100	100	6479ms
sec-tenant-isolation-checker	Access Control	99.5	100	100	6166ms
sec-vulnerability-scanner	Detection & Analysis	99.5	100	100	7464ms
sec-file-upload-validator	Sanitization	100.0	100	100	3363ms
sec-hostname-allowlist-validator	Sanitization	100.0	100	100	3859ms
sec-html-entity-encoder	Sanitization	100.0	100	100	2962ms
sec-safe-redirect-builder	Sanitization	100.0	100	100	4803ms
sec-url-sanitizer	Sanitization	100.0	100	100	3569ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
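
The headline figures can be cross-checked against the per-task scores in the table. The Python sketch below assumes, since the report does not state how aggregates are computed, that each domain score and the overall score are unweighted means of the per-task scores.

```python
from statistics import mean

# Per-task scores copied from the results table, grouped by domain.
SCORES = {
    "Sanitization": [87.3, 100.0, 100.0, 100.0, 100.0, 100.0],
    "Auth & Session": [9.5, 49.5, 91.7, 95.3, 99.5, 99.5, 99.5],
    "Access Control": [92.9, 94.2, 96.8, 99.5, 99.5],
    "Detection & Analysis": [9.5, 57.9, 66.1, 80.4, 87.8, 92.2, 96.1, 99.5, 99.5],
    "Traffic Protection": [87.8],
    "Crypto Utils": [9.5, 95.7],
}

# Assumption: aggregates are plain (unweighted) means, rounded to one decimal.
domain_avg = {domain: round(mean(scores), 1) for domain, scores in SCORES.items()}
all_scores = [score for scores in SCORES.values() for score in scores]
overall = round(mean(all_scores), 1)

for domain, avg in domain_avg.items():
    print(f"{domain}: {avg}")
print(f"Overall ({len(all_scores)} tasks): {overall}")
```

Under that assumption the recomputed means round to the reported domain scores and to the 83.2 overall score.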