Model Analysis

Grok 4.20 Reasoning

x-ai/grok-4.20-reasoning

78.9

overall score

80.0% visible

74.0% hidden

Tasks

30

Passed

11

Failed

19

Avg latency

23615ms

Total cost

$0.2887

AI Commentary

by openai/gpt-5.4-mini

Grok 4.20 Reasoning is strong on straightforward security transformations and policy checks, scoring 92.4 in sanitization, 95.4 in access control, and 95.6 in traffic protection. Its main weaknesses are edge-case-heavy parsing and structured security reasoning: auth/session (68.3), detection-analysis (70.5), and crypto-utils (63.1) all show brittle handling of malformed inputs, incomplete extraction, and over- or under-detection. That brittleness is consistent with the low 36.7% success rate despite a solid 78.9 average score.

Domain Performance

Sanitization · 6 tasks

92.4

Very strong overall at 92.4, with correct behavior on file upload validation, hostname allowlists, HTML entity encoding, safe redirects, and URL sanitization. The main miss is sec-input-sanitizer, where it failed to fully neutralize script-like content and preserve expected spacing, suggesting incomplete token stripping and inconsistent normalization.

Auth & Session · 7 tasks

68.3

Mixed performance at 68.3: it handled cookie policy, CSRF, refresh rotation, and session fixation well, but broke on JWT and OAuth state handling, and sec-password-strength reached only 49.5. The JWT validator appears to reject malformed tokens too early with a generic format error instead of parsing headers and payloads, while the OAuth state validator likely has a null/undefined access bug that causes a runtime failure rather than a structured validation result.

Access Control · 5 tasks

95.4

Excellent at 95.4 with no significant weaknesses, indicating reliable enforcement of scope and tenant boundaries. The lowest result, sec-permission-checker at 88.9, still passed every visible case. This is one of the model's most dependable areas and suggests good rule-based authorization reasoning.

Detection & Analysis · 9 tasks

70.5

Moderate at 70.5, with strong results on CSP parsing, dependency risk classification, insecure config scanning, and SQL injection detection, but weaker behavior on anomaly detection, secret extraction, SSRF, and vulnerability scanning. The failures suggest inconsistent thresholding and pattern matching, plus a tendency to over-flag or under-parse when inputs require precise normalization or multi-signal correlation.
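To illustrate the kind of normalization that SSRF detection depends on, here is a minimal Python sketch. It is not the benchmark's reference solution, and the benchmark's implementation language is not stated in this report. The function name `is_ssrf_risk` and the `BLOCKED_HOSTNAMES` set are assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical hostname blocklist used only for this illustration.
BLOCKED_HOSTNAMES = {"localhost"}


def is_ssrf_risk(url: str) -> bool:
    """Return True when the URL targets a loopback, private, or link-local host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return True  # unparseable URLs are treated as risky
    if parts.scheme not in ("http", "https"):
        return True  # non-web schemes such as file: are treated as risky
    # Normalize before comparing: lowercase (done by urlsplit) and drop a trailing dot.
    host = (parts.hostname or "").rstrip(".")
    if not host or host in BLOCKED_HOSTNAMES:
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # an ordinary hostname; DNS resolution is out of scope here
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
```

The sketch deliberately ignores integer- or hex-encoded hosts and DNS rebinding, which a production check would also need to consider.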

Traffic Protection · 1 task

95.6

Near-perfect at 95.6 on the rate limit engine, indicating robust handling of throttling logic and request policy computation. This domain is a clear strength with no notable edge-case regressions reported.

Crypto Utils · 2 tasks

63.1

Weaker at 63.1, driven entirely by sec-crypto-utils (30.5), which failed on expected true/false outputs and produced empty or incorrect derived values; sec-encryption-pipeline scored 95.7. This points to brittle cryptographic helper logic, likely around key/nonce derivation or validation formatting, rather than a problem in core encryption pipeline behavior.

Notable Tasks

sec-input-sanitizer · 54.2 · Sanitization

It left XSS-like content partially intact and failed to normalize spacing consistently, indicating incomplete sanitization rather than a simple escaping bug.
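A sanitizer that avoids both failure modes might look like the Python sketch below: remove script blocks and tags, collapse whitespace in a single pass, then escape whatever remains. The task's actual specification is not shown in this report, so the function name `sanitize_text` and the exact spacing rules are assumptions.

```python
import html
import re

SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]*>")
WHITESPACE = re.compile(r"\s+")


def sanitize_text(value: str) -> str:
    """Strip script blocks and tags, normalize spacing, then escape leftovers."""
    without_scripts = SCRIPT_BLOCK.sub(" ", value)
    without_tags = ANY_TAG.sub(" ", without_scripts)
    # Collapse every whitespace run to one space so spacing stays consistent.
    collapsed = WHITESPACE.sub(" ", without_tags).strip()
    # Escape last, so fragments left by nested or broken tags cannot become markup.
    return html.escape(collapsed, quote=True)
```

Regex stripping alone is not a complete HTML sanitizer; the final `html.escape` call is what keeps leftover fragments inert in this sketch.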

sec-jwt-validator · 27.7 · Auth & Session

It returned a generic invalid-format error and null header/payload for cases that should have been parsed, suggesting the validator bails out before structured JWT inspection.
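The alternative to bailing out early is to decode whatever segments are present and report problems alongside them. The Python sketch below shows that shape. The result fields (`well_formed`, `header`, `payload`, `errors`) and the function name `inspect_jwt` are assumptions, and signature verification is intentionally out of scope.

```python
import base64
import binascii
import json
from typing import Any, Optional


def _decode_segment(segment: str) -> Optional[dict]:
    """Decode one base64url JSON segment, or return None if it is not valid."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    try:
        decoded = json.loads(base64.urlsafe_b64decode(padded))
    except (binascii.Error, ValueError):
        return None
    return decoded if isinstance(decoded, dict) else None


def inspect_jwt(token: Any) -> dict:
    """Structurally inspect a JWT without verifying its signature."""
    result = {"well_formed": False, "header": None, "payload": None, "errors": []}
    if not isinstance(token, str) or not token:
        result["errors"].append("token must be a non-empty string")
        return result

    parts = token.split(".")
    if len(parts) != 3:
        result["errors"].append(f"expected 3 segments, got {len(parts)}")

    # Decode header and payload even when the segment count is wrong,
    # so the caller still sees what could be parsed.
    result["header"] = _decode_segment(parts[0])
    if result["header"] is None:
        result["errors"].append("header is not base64url-encoded JSON")
    if len(parts) >= 2:
        result["payload"] = _decode_segment(parts[1])
        if result["payload"] is None:
            result["errors"].append("payload is not base64url-encoded JSON")

    result["well_formed"] = not result["errors"]
    return result
```

With this structure, a token missing its signature segment still yields a parsed header and payload plus a specific error, rather than a generic format failure with null fields.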

sec-oauth-state-validator · 9.5 · Auth & Session

The runtime TypeError implies a null/undefined access path, so the model likely produced code that assumes state length exists before checking input presence.
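The ordering problem described here can be shown in a few lines of Python, where calling `len()` on `None` raises a `TypeError` much like a property access on null does elsewhere. In this sketch the presence checks come first, so every input produces a structured result. The names `validate_oauth_state` and `MIN_STATE_LENGTH`, and the reason codes, are hypothetical.

```python
import hmac
from typing import Any

MIN_STATE_LENGTH = 16  # hypothetical minimum; the task's real threshold is not given


def validate_oauth_state(received: Any, expected: Any) -> dict:
    """Compare an OAuth state value with the stored one, never raising on bad input."""
    # Presence and type checks come first, so len() is never called on None.
    if not isinstance(received, str) or not received:
        return {"valid": False, "reason": "missing_state"}
    if not isinstance(expected, str) or not expected:
        return {"valid": False, "reason": "missing_expected_state"}
    if len(received) < MIN_STATE_LENGTH:
        return {"valid": False, "reason": "state_too_short"}
    # Constant-time comparison on bytes avoids leaking how many characters matched.
    if not hmac.compare_digest(received.encode(), expected.encode()):
        return {"valid": False, "reason": "state_mismatch"}
    return {"valid": True, "reason": None}
```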

sec-secret-detector · 70.0 · Detection & Analysis

It truncated AWS-style secrets and missed at least one secret type entirely, which points to overly narrow regexes or premature token length limits.
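Truncation of this kind usually comes from a quantifier shorter than the real token, or from a pattern with nothing to stop it matching a prefix of a longer string. The Python sketch below uses full-length patterns for three widely documented formats. The pattern set and the function name `find_secrets` are assumptions, not the task's expected rule list.

```python
import re

# Full-length patterns; a shorter quantifier would return truncated values.
PATTERNS = {
    # AWS access key IDs: a 4-letter prefix plus 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # AWS secret access keys: 40 base64-style characters after the variable name.
    "aws_secret_access_key": re.compile(
        r"(?i:aws_secret_access_key)\s*[=:]\s*['\"]?"
        r"([A-Za-z0-9/+=]{40})(?![A-Za-z0-9/+=])"
    ),
    # GitHub personal access tokens: "ghp_" plus 36 alphanumerics.
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_secrets(text: str) -> list:
    """Return (kind, value) pairs for every pattern match in the text."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the capture group when the pattern defines one.
            value = match.group(1) if pattern.groups else match.group(0)
            findings.append((kind, value))
    return findings
```

The negative lookahead on the secret-key pattern rejects strings longer than 40 characters instead of silently returning their first 40.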

sec-rate-limit-engine · 95.6 · Traffic Protection

This was a near-clean result (100% visible, 92% hidden) in the traffic protection domain, indicating the model can implement deterministic policy logic accurately when the rules are explicit.
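As an example of explicit, deterministic throttling rules, here is a minimal fixed-window limiter in Python. The report does not say which algorithm the task requires, so the fixed-window choice and the `FixedWindowLimiter` class are assumptions. Passing the timestamp in as an argument keeps the logic reproducible in tests.

```python
from dataclasses import dataclass, field


@dataclass
class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    limit: int
    window_seconds: int
    _counts: dict = field(default_factory=dict)

    def allow(self, client_id: str, now: float) -> bool:
        # Bucket the timestamp into a window index; counts reset per window.
        window = int(now // self.window_seconds)
        key = (client_id, window)
        used = self._counts.get(key, 0)
        if used >= self.limit:
            return False
        self._counts[key] = used + 1
        return True
```

A long-running service would also need to evict counters for past windows, which this sketch omits.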

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-oauth-state-validator	Auth & Session	9.5	0	0	28617ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	25632ms
sec-vulnerability-scanner	Detection & Analysis	13.4	0	8	20965ms
sec-jwt-validator	Auth & Session	27.7	33	8	29113ms
sec-crypto-utils	Crypto Utils	30.5	33	14	17645ms
sec-password-strength	Auth & Session	49.5	67	23	20059ms
sec-input-sanitizer	Sanitization	54.2	67	33	37093ms
sec-auth-log-anomaly-detector	Detection & Analysis	62.2	33	82	44682ms
sec-secret-detector	Detection & Analysis	70.0	67	67	17530ms
sec-csp-nonce-validator	Detection & Analysis	88.3	100	75	21696ms
sec-permission-checker	Access Control	88.9	100	78	15854ms
sec-abac-rule-engine	Access Control	94.2	100	89	22554ms
sec-access-control-engine	Access Control	94.9	100	91	18437ms
sec-refresh-token-rotation	Auth & Session	95.3	100	91	39136ms
sec-rate-limit-engine	Traffic Protection	95.6	100	92	26708ms
sec-encryption-pipeline	Crypto Utils	95.7	100	92	14281ms
sec-csp-parser	Detection & Analysis	96.1	100	92	15783ms
sec-sql-injection-detector	Detection & Analysis	96.1	100	92	21918ms
sec-cookie-policy-validator	Auth & Session	97.1	100	95	22247ms
sec-api-key-scope-checker	Access Control	99.5	100	100	31669ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	17666ms
sec-dependency-risk-classifier	Detection & Analysis	99.5	100	100	18237ms
sec-insecure-config-scanner	Detection & Analysis	99.5	100	100	16627ms
sec-session-fixation-detector	Auth & Session	99.5	100	100	25031ms
sec-tenant-isolation-checker	Access Control	99.5	100	100	19620ms
sec-file-upload-validator	Sanitization	100.0	100	100	20191ms
sec-hostname-allowlist-validator	Sanitization	100.0	100	100	28417ms
sec-html-entity-encoder	Sanitization	100.0	100	100	33120ms
sec-safe-redirect-builder	Sanitization	100.0	100	100	16619ms
sec-url-sanitizer	Sanitization	100.0	100	100	21310ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge-case pass rate
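The headline figures can be reproduced from the table with a short Python script. The rows below are copied from the table in the same order. Treating a task as successful only when its hidden pass rate is 100% is an assumption: the report does not define "success", but that reading matches the 36.7% figure quoted in the commentary.

```python
from collections import defaultdict
from statistics import mean

AUTH, DET, ACCESS = "Auth & Session", "Detection & Analysis", "Access Control"
SAN, CRYPTO, TRAFFIC = "Sanitization", "Crypto Utils", "Traffic Protection"

# (domain, score, hidden %) for each of the 30 tasks, lowest score first.
ROWS = [
    (AUTH, 9.5, 0), (DET, 9.5, 0), (DET, 13.4, 8), (AUTH, 27.7, 8),
    (CRYPTO, 30.5, 14), (AUTH, 49.5, 23), (SAN, 54.2, 33), (DET, 62.2, 82),
    (DET, 70.0, 67), (DET, 88.3, 75), (ACCESS, 88.9, 78), (ACCESS, 94.2, 89),
    (ACCESS, 94.9, 91), (AUTH, 95.3, 91), (TRAFFIC, 95.6, 92), (CRYPTO, 95.7, 92),
    (DET, 96.1, 92), (DET, 96.1, 92), (AUTH, 97.1, 95), (ACCESS, 99.5, 100),
    (AUTH, 99.5, 100), (DET, 99.5, 100), (DET, 99.5, 100), (AUTH, 99.5, 100),
    (ACCESS, 99.5, 100), (SAN, 100.0, 100), (SAN, 100.0, 100), (SAN, 100.0, 100),
    (SAN, 100.0, 100), (SAN, 100.0, 100),
]

# Group scores by domain, then average each group.
scores_by_domain = defaultdict(list)
for domain, score, _hidden in ROWS:
    scores_by_domain[domain].append(score)
domain_means = {d: round(mean(s), 1) for d, s in scores_by_domain.items()}

overall_score = round(mean(score for _domain, score, _hidden in ROWS), 1)
hidden_rate = round(mean(hidden for _domain, _score, hidden in ROWS), 1)

# Assumed definition of success: every hidden edge case passed.
fully_passed = sum(1 for _domain, _score, hidden in ROWS if hidden == 100)
success_rate = round(100 * fully_passed / len(ROWS), 1)

print(domain_means)
print(overall_score, hidden_rate, fully_passed, success_rate)
```

Under that assumption the script gives 11 fully passed tasks out of 30, and the domain means agree with the per-domain scores listed under Domain Performance.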