Model Analysis

GPT-5.4 Nano

openai/gpt-5.4-nano

81.9

overall score

83.9% visible

77.1% hidden

Tasks

Passed

Failed

Avg latency

11721ms

Total cost

$0.0752

AI Commentary

by openai/gpt-5.4-mini

GPT-5.4 Nano is strong on straightforward security transformations and policy checks, with high scores in sanitization (95.2), access control (94.8), and traffic protection (99.5). Its main weaknesses are edge-case-heavy tasks and parsers: Auth & Session drops sharply on OAuth state validation (9.5), Detection & Analysis is uneven on SSRF detection and config scanning, and sec-crypto-utils is brittle enough to throw type errors. These failures are a concern despite the 81.9 overall score and 77.1% hidden edge-case pass rate.

Domain Performance

Sanitization (6 tasks)

95.2

Performance is excellent overall at 95.2, with correct handling of file uploads, HTML encoding, redirects, and URL sanitization. The main miss is hostname allowlisting, where it rejected wildcard subdomains and punycode cases that should have matched, indicating incomplete wildcard/IDN normalization logic.

Auth & Session (7 tasks)

78.3

This domain is mixed at 78.3: cookie policy, CSRF, JWT, refresh rotation, and fixation detection were solid, but OAuth state validation failed catastrophically with null returns and a TypeError, suggesting an unhandled undefined input path. Password strength scoring was also inconsistent, over-penalizing some weak passwords and misclassifying others, which points to unstable heuristic weighting.

Access Control (5 tasks)

94.8

Access control is a clear strength at 94.8, with the engine and permission checker performing reliably. There are no notable weaknesses, so the model appears dependable for rule-based authorization logic.

Detection & Analysis (9 tasks)

72.5

At 72.5, this is the most uneven non-crypto area. It over-reported anomalies in auth logs, underperformed badly on SSRF detection by labeling valid URLs as invalid, and the insecure config scanner used different issue labels than expected, suggesting both semantic drift and brittle output formatting.

Traffic Protection (1 task)

99.5

Traffic protection is effectively solved here at 99.5, with rate limiting handled correctly. It is the highest-scoring domain in the benchmark, though that score rests on a single task.

Crypto Utils (2 tasks)

56.0

Crypto utilities are the weakest area at 56.0, driven by a hard failure in sec-crypto-utils where charCodeAt was called on a non-string input. That kind of runtime exception, plus missed outputs in the same task family, indicates poor input type handling and insufficient defensive coding.
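The charCodeAt failure is a JavaScript symptom, but the defensive pattern it calls for is language-independent. As a minimal Python sketch, assuming hypothetical helpers `to_bytes` and `hex_encode` (neither is part of the benchmark, whose real task interface is not shown in this report), inputs can be normalized and type-checked before any byte-level work:

```python
from typing import Optional


def to_bytes(value: object) -> Optional[bytes]:
    """Normalize text-like input to bytes; return None for anything else.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, bytearray):
        return bytes(value)
    if isinstance(value, str):
        return value.encode("utf-8")
    # None, numbers, lists, dicts: reject explicitly instead of failing later.
    return None


def hex_encode(value: object) -> Optional[str]:
    """Hex-encode the input, returning None for unsupported types."""
    data = to_bytes(value)
    return None if data is None else data.hex()
```

With this shape, a non-string input yields a defined result (`None` here) rather than a runtime exception partway through the computation.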

Notable Tasks

sec-oauth-state-validator · 9.5 · Auth & Session

The model returned null and threw a TypeError on every case, which points to a missing guard for undefined state data rather than a simple logic error.
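A guard of the kind this commentary describes might look like the following Python sketch. The function name, the `stored` record layout (`value`, `expires_at`), and the reason strings are all assumptions made for illustration; the benchmark's actual interface is not shown in this report.

```python
import hmac
from collections.abc import Mapping
from typing import Optional


def validate_oauth_state(
    stored: Optional[Mapping],
    returned_state: Optional[str],
    now: float,
) -> dict:
    """Return a result dict for every input instead of raising."""
    # Guard the undefined/missing paths first.
    if not isinstance(stored, Mapping):
        return {"valid": False, "reason": "missing_stored_state"}
    if not isinstance(returned_state, str) or not returned_state:
        return {"valid": False, "reason": "missing_returned_state"}

    expected = stored.get("value")
    expires_at = stored.get("expires_at")
    if not isinstance(expected, str) or not isinstance(expires_at, (int, float)):
        return {"valid": False, "reason": "malformed_stored_state"}
    if now >= expires_at:
        return {"valid": False, "reason": "expired"}
    # Constant-time comparison avoids leaking how much of the value matched.
    if not hmac.compare_digest(expected.encode(), returned_state.encode()):
        return {"valid": False, "reason": "mismatch"}
    return {"valid": True, "reason": None}
```

The point of the sketch is ordering: every type and presence check runs before any field is dereferenced, so a missing state produces a structured rejection rather than a TypeError.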

sec-ssrf-detector · 9.5 · Detection & Analysis

It marked even valid URLs as invalid_url and never normalized hosts, so the failure is in URL parsing/normalization before any SSRF policy decision.
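To illustrate the parse-then-normalize-then-decide ordering implied here, the following Python sketch uses only the standard library. The function name `classify_url` and the `allowed`/`blocked` labels are assumptions (only `invalid_url` appears in the report), and the sketch deliberately skips DNS resolution and alternate IP encodings such as decimal integers.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_url(url: object) -> str:
    """Classify a URL as "invalid_url", "blocked", or "allowed"."""
    if not isinstance(url, str):
        return "invalid_url"
    try:
        parts = urlsplit(url.strip())
        # .hostname is lowercased and has IPv6 brackets removed.
        host = (parts.hostname or "").rstrip(".")
    except ValueError:
        return "invalid_url"
    if parts.scheme not in ("http", "https") or not host:
        return "invalid_url"

    # Policy decisions happen only after the host is normalized.
    if host == "localhost" or host.endswith(".localhost"):
        return "blocked"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treat as a DNS name (resolution is out of scope).
        return "allowed"
    return "allowed" if addr.is_global else "blocked"
```

For example, `classify_url("https://example.com/path")` yields `"allowed"`, while `classify_url("http://169.254.169.254/latest/meta-data")` yields `"blocked"`. A production detector would also need to resolve DNS names and re-check the resulting addresses.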

sec-insecure-config-scanner · 31.0 · Detection & Analysis

The scanner produced semantically similar but non-matching issue strings, implying the detection logic may be roughly correct but the output schema and exact taxonomy are not aligned with the benchmark.

sec-hostname-allowlist-validator · 77.8 · Sanitization

It rejected wildcard subdomains and punycode hostnames that should have matched, so wildcard expansion and internationalized domain handling are incomplete.
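A rough Python sketch of the two missing pieces, wildcard matching and IDN normalization, is shown below. The names `normalize_host` and `host_allowed` are hypothetical, and the wildcard semantics (subdomains at any depth match, the apex does not) are an assumption rather than the benchmark's stated rule. Python's built-in `idna` codec implements IDNA 2003, which may differ from what the task expects.

```python
from typing import Iterable, Optional


def normalize_host(host: str) -> Optional[str]:
    """Lowercase, drop a trailing dot, and convert Unicode labels to punycode."""
    cleaned = host.strip().rstrip(".").lower()
    if not cleaned:
        return None
    try:
        return cleaned.encode("idna").decode("ascii").lower()
    except UnicodeError:
        # Empty or over-long labels, or characters IDNA rejects.
        return None


def host_allowed(host: str, allowlist: Iterable[str]) -> bool:
    """Check a host against exact and "*." wildcard allowlist entries."""
    candidate = normalize_host(host)
    if candidate is None:
        return False
    for pattern in allowlist:
        if pattern.startswith("*."):
            base = normalize_host(pattern[2:])
            # Assumption: "*.example.com" matches subdomains, not the apex.
            if base and candidate.endswith("." + base):
                return True
        elif normalize_host(pattern) == candidate:
            return True
    return False
```

Normalizing both the candidate and each pattern to the same ASCII form means `bücher.example` and `xn--bcher-kva.example` compare equal, and the leading dot in the suffix check keeps `evilexample.com` from matching `*.example.com`.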

sec-rate-limit-engine · 99.5 · Traffic Protection

This task was effectively perfect, indicating the model can implement deterministic policy logic cleanly when the state machine is simple and well-specified.

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-oauth-state-validator	Auth & Session	9.5	0	0	9463ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	17302ms
sec-crypto-utils	Crypto Utils	16.3	0	14	13503ms
sec-insecure-config-scanner	Detection & Analysis	31.0	50	0	15254ms
sec-password-strength	Auth & Session	49.0	67	23	8939ms
sec-secret-detector	Detection & Analysis	69.5	67	67	9260ms
sec-auth-log-anomaly-detector	Detection & Analysis	72.3	67	73	20395ms
sec-hostname-allowlist-validator	Sanitization	77.8	67	83	6831ms
sec-csp-nonce-validator	Detection & Analysis	87.8	100	75	9476ms
sec-api-key-scope-checker	Access Control	90.7	100	82	16297ms
sec-tenant-isolation-checker	Access Control	92.6	100	86	11353ms
sec-input-sanitizer	Sanitization	93.7	100	87	3937ms
sec-abac-rule-engine	Access Control	94.2	100	89	14010ms
sec-dependency-risk-classifier	Detection & Analysis	94.8	100	90	8248ms
sec-refresh-token-rotation	Auth & Session	95.3	100	91	14013ms
sec-session-fixation-detector	Auth & Session	95.6	100	92	20495ms
sec-vulnerability-scanner	Detection & Analysis	95.6	100	92	18143ms
sec-encryption-pipeline	Crypto Utils	95.7	100	92	14954ms
sec-csp-parser	Detection & Analysis	96.1	100	92	6992ms
sec-sql-injection-detector	Detection & Analysis	96.1	100	92	7415ms
sec-permission-checker	Access Control	96.8	100	94	12533ms
sec-access-control-engine	Access Control	99.5	100	100	8568ms
sec-cookie-policy-validator	Auth & Session	99.5	100	100	9347ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	9195ms
sec-jwt-validator	Auth & Session	99.5	100	100	15315ms
sec-rate-limit-engine	Traffic Protection	99.5	100	100	22173ms
sec-safe-redirect-builder	Sanitization	99.5	100	100	7778ms
sec-file-upload-validator	Sanitization	100.0	100	100	5300ms
sec-html-entity-encoder	Sanitization	100.0	100	100	7719ms
sec-url-sanitizer	Sanitization	100.0	100	100	7418ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
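The headline and per-domain figures appear to be unweighted means of the per-task columns. That is an inference from the numbers, not something the benchmark states. The Python sketch below recomputes them from the table rows; the variable and function names are illustrative.

```python
from collections import defaultdict

# (task, domain, score, correct %, hidden %) copied from the results table.
RESULTS = [
    ("sec-oauth-state-validator", "Auth & Session", 9.5, 0, 0),
    ("sec-ssrf-detector", "Detection & Analysis", 9.5, 0, 0),
    ("sec-crypto-utils", "Crypto Utils", 16.3, 0, 14),
    ("sec-insecure-config-scanner", "Detection & Analysis", 31.0, 50, 0),
    ("sec-password-strength", "Auth & Session", 49.0, 67, 23),
    ("sec-secret-detector", "Detection & Analysis", 69.5, 67, 67),
    ("sec-auth-log-anomaly-detector", "Detection & Analysis", 72.3, 67, 73),
    ("sec-hostname-allowlist-validator", "Sanitization", 77.8, 67, 83),
    ("sec-csp-nonce-validator", "Detection & Analysis", 87.8, 100, 75),
    ("sec-api-key-scope-checker", "Access Control", 90.7, 100, 82),
    ("sec-tenant-isolation-checker", "Access Control", 92.6, 100, 86),
    ("sec-input-sanitizer", "Sanitization", 93.7, 100, 87),
    ("sec-abac-rule-engine", "Access Control", 94.2, 100, 89),
    ("sec-dependency-risk-classifier", "Detection & Analysis", 94.8, 100, 90),
    ("sec-refresh-token-rotation", "Auth & Session", 95.3, 100, 91),
    ("sec-session-fixation-detector", "Auth & Session", 95.6, 100, 92),
    ("sec-vulnerability-scanner", "Detection & Analysis", 95.6, 100, 92),
    ("sec-encryption-pipeline", "Crypto Utils", 95.7, 100, 92),
    ("sec-csp-parser", "Detection & Analysis", 96.1, 100, 92),
    ("sec-sql-injection-detector", "Detection & Analysis", 96.1, 100, 92),
    ("sec-permission-checker", "Access Control", 96.8, 100, 94),
    ("sec-access-control-engine", "Access Control", 99.5, 100, 100),
    ("sec-cookie-policy-validator", "Auth & Session", 99.5, 100, 100),
    ("sec-csrf-token-manager", "Auth & Session", 99.5, 100, 100),
    ("sec-jwt-validator", "Auth & Session", 99.5, 100, 100),
    ("sec-rate-limit-engine", "Traffic Protection", 99.5, 100, 100),
    ("sec-safe-redirect-builder", "Sanitization", 99.5, 100, 100),
    ("sec-file-upload-validator", "Sanitization", 100.0, 100, 100),
    ("sec-html-entity-encoder", "Sanitization", 100.0, 100, 100),
    ("sec-url-sanitizer", "Sanitization", 100.0, 100, 100),
]


def mean(values):
    """Unweighted arithmetic mean, rounded to one decimal place."""
    values = list(values)
    return round(sum(values) / len(values), 1)


def domain_means(rows):
    """Group task scores by domain and average each group."""
    by_domain = defaultdict(list)
    for _task, domain, score, _correct, _hidden in rows:
        by_domain[domain].append(score)
    return {domain: mean(scores) for domain, scores in by_domain.items()}


overall = mean(row[2] for row in RESULTS)   # compare with the overall score
visible = mean(row[3] for row in RESULTS)   # compare with "% visible"
hidden = mean(row[4] for row in RESULTS)    # compare with "% hidden"
per_domain = domain_means(RESULTS)
```

Under that assumption, the recomputed values can be checked against the overall score, the visible and hidden percentages, and the six domain scores reported above.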