Model Analysis

Grok 4.20 (Non-Reasoning)

x-ai/grok-4.20

76.3

overall score

76.7% visible

71.7% hidden

Tasks

30

Passed

Failed

Avg latency

25073ms

Total cost

$0.2914

AI Commentary

by openai/gpt-5.4-mini

Grok 4.20 is strong on straightforward security controls, with near-ceiling scores in sanitization (97.8), access control (96.5), and rate limiting (99.5). Its main weaknesses are stateful authentication logic and detection tasks, where it loses points to malformed outputs, missing edge cases, and brittle parsing. The low success rate (36.7%) against a much higher visible pass rate (76.7%) suggests that it handles the easier cases in most tasks but rarely clears every case, and that it degrades on hidden variants (71.7% hidden pass rate).

Domain Performance

Sanitization · 6 tasks

97.8

Performance is excellent across file upload validation, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization. The one soft spot is the input sanitizer (87.3), which passes all visible cases but only 73% of hidden ones; the rest of the domain is consistent under both visible and hidden cases.

Auth & Session · 7 tasks

67.8

This domain is uneven: cookie policy validation, CSRF token handling, refresh-token rotation, and session fixation detection are strong, but JWT validation (27.7), OAuth state handling (9.5), and password strength checking (45.8) are weak. The JWT task appears to reject malformed tokens too early instead of parsing the header and payload, while the OAuth state validator throws a length-related TypeError, indicating a structural bug rather than a logic miss.

Access Control · 5 tasks

96.5

Access control is a clear strength, with near-perfect results on API key scope checking and tenant isolation (both 99.5). The model handled authorization boundaries correctly; the other three tasks pass every visible case and 89-91% of hidden ones, so edge-case regressions are minor.

Detection & Analysis · 9 tasks

60.1

This is the most inconsistent domain: insecure config scanning is strong, but anomaly detection, dependency risk classification, secret detection, SSRF detection, and vulnerability scanning all show different failure modes. The errors range from under-detection and over-detection to runtime exceptions and overly strict URL parsing, which points to weak normalization and inconsistent rule application.

Traffic Protection · 1 task

99.5

Rate limiting is effectively perfect, with no significant weaknesses. The model appears reliable for deterministic traffic-control logic.

Crypto Utils · 2 tasks

52.6

Cryptographic utility handling is weak overall despite one strong encryption pipeline result. The failing crypto-utils task shows runtime errors from undefined length access and an empty-string output where a formatted value was expected, suggesting poor input handling and broken helper logic.

Notable Tasks

sec-jwt-validator · 27.7 · Auth & Session

The model returned "Invalid token format" with null header/payload for cases that required parsing first, so it likely short-circuited on token shape instead of validating JWT structure and claims.
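The parse-first behavior described above can be sketched as follows. This is a minimal illustration in Python, not the task's reference solution: the task language and expected output shape are not shown in this report, so the function name `parse_jwt_structure` and the result keys are assumptions. Signature and claim checks are deliberately left out.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def parse_jwt_structure(token):
    """Hypothetical helper: decode header and payload first, report errors after."""
    result = {"header": None, "payload": None, "errors": []}
    if not isinstance(token, str) or not token:
        result["errors"].append("token must be a non-empty string")
        return result
    parts = token.split(".")
    if len(parts) != 3:
        result["errors"].append("expected 3 dot-separated segments")
    # Decode whatever segments are present instead of bailing out on shape alone.
    for name, segment in zip(("header", "payload"), parts):
        try:
            decoded = json.loads(_b64url_decode(segment))
        except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
            result["errors"].append(f"{name} is not valid base64url JSON")
            continue
        if not isinstance(decoded, dict):
            result["errors"].append(f"{name} is not a JSON object")
            continue
        result[name] = decoded
    return result
```

With this shape, a token that is missing its signature segment still yields a decoded header alongside the error list, rather than a null header/payload pair.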

sec-oauth-state-validator · 9.5 · Auth & Session

A TypeError on reading 'length' indicates the implementation is not guarding against undefined inputs or missing state fields before validation.
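A guard of the kind this failure calls for might look like the sketch below. It is written in Python as an analogue (the reported TypeError suggests the task itself ran in JavaScript, which is an inference). The function name, the result shape, and the `min_length=16` policy are all assumptions made for illustration.

```python
import hmac


def validate_oauth_state(expected, received, min_length=16):
    """Hypothetical validator; min_length is an assumed policy, not from the task."""
    # Guard first: a missing or non-string value must yield a result, not an exception.
    if not isinstance(expected, str) or not isinstance(received, str):
        return {"valid": False, "reason": "missing_state"}
    if len(received) < min_length:
        return {"valid": False, "reason": "state_too_short"}
    # Constant-time comparison avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return {"valid": False, "reason": "state_mismatch"}
    return {"valid": True, "reason": None}
```

The type check runs before any length access, so an absent state parameter becomes an ordinary validation failure.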

sec-ssrf-detector · 9.5 · Detection & Analysis

It labeled both valid and invalid URLs as "invalid_url" and failed to normalize hosts, which suggests the URL parser/normalizer is too strict or incorrectly wired.
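One way to keep "invalid_url" separate from a genuine SSRF block is to normalize the host before classifying it. The sketch below uses only the Python standard library. The function name and the three labels other than "invalid_url" are assumptions, and the task's real label set is not shown in this report. It does not resolve DNS names and does not handle integer- or octal-encoded IP addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_url(url):
    """Hypothetical classifier: returns 'invalid_url', 'blocked', or 'allowed'."""
    if not isinstance(url, str):
        return "invalid_url"
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname  # urlsplit lowercases the host and strips brackets
    except ValueError:
        return "invalid_url"
    if not parts.scheme or not host:
        return "invalid_url"
    if parts.scheme not in ("http", "https"):
        return "blocked"  # parseable, but not a scheme we are willing to fetch
    host = host.rstrip(".")  # treat "localhost." the same as "localhost"
    if host == "localhost" or host.endswith(".localhost"):
        return "blocked"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return "allowed"  # a DNS name; resolution checks are out of scope here
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        return "blocked"
    return "allowed"
```

Only inputs that cannot be parsed into a scheme and host are reported as invalid; everything else is judged on its normalized host.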

sec-secret-detector · 70.0 · Detection & Analysis

The detector truncated AWS key matches and missed at least one secret entirely, indicating weak pattern extraction and poor handling of multi-secret inputs.
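Both symptoms (truncated matches and missed secrets) are typical of taking a fixed-width slice or stopping at the first hit. A minimal Python sketch that returns every full match is shown below. The two patterns are illustrative assumptions rather than the task's rule set, and `find_secrets` is a hypothetical name.

```python
import re

# Assumed patterns for illustration; the task's actual rules are not shown in the report.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_secrets(text):
    """Return every match of every pattern, in order of appearance."""
    findings = []
    for kind, pattern in PATTERNS.items():
        # finditer walks the whole input, so later secrets are not skipped.
        for match in pattern.finditer(text):
            findings.append(
                {"type": kind, "value": match.group(0), "start": match.start()}
            )
    return sorted(findings, key=lambda finding: finding["start"])
```

Reporting `match.group(0)` keeps the full 20-character AWS key ID instead of a prefix, and iterating over all patterns and all matches covers inputs that contain several secrets.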

sec-rate-limit-engine · 99.5 · Traffic Protection

This near-perfect result indicates the model can implement deterministic policy logic accurately when the task is well-specified and state transitions are simple.

All Task Results

Task	Domain	Score	Correct (%)	Hidden (%)	Latency
sec-crypto-utils	Crypto Utils	9.5	0	0	28762ms
sec-oauth-state-validator	Auth & Session	9.5	0	0	23381ms
sec-ssrf-detector	Detection & Analysis	9.5	0	0	23531ms
sec-dependency-risk-classifier	Detection & Analysis	14.2	0	10	22089ms
sec-vulnerability-scanner	Detection & Analysis	17.3	0	17	20688ms
sec-jwt-validator	Auth & Session	27.7	33	8	19711ms
sec-password-strength	Auth & Session	45.8	67	15	20821ms
sec-auth-log-anomaly-detector	Detection & Analysis	62.2	33	82	44243ms
sec-secret-detector	Detection & Analysis	70.0	67	67	22756ms
sec-input-sanitizer	Sanitization	87.3	100	73	39018ms
sec-csp-nonce-validator	Detection & Analysis	87.8	100	75	19342ms
sec-csp-parser	Detection & Analysis	88.3	100	75	14738ms
sec-sql-injection-detector	Detection & Analysis	92.2	100	83	15693ms
sec-abac-rule-engine	Access Control	94.2	100	89	28289ms
sec-permission-checker	Access Control	94.2	100	89	33607ms
sec-access-control-engine	Access Control	94.9	100	91	21158ms
sec-refresh-token-rotation	Auth & Session	95.3	100	91	20861ms
sec-encryption-pipeline	Crypto Utils	95.7	100	92	19477ms
sec-cookie-policy-validator	Auth & Session	97.1	100	95	32947ms
sec-api-key-scope-checker	Access Control	99.5	100	100	21929ms
sec-csrf-token-manager	Auth & Session	99.5	100	100	14038ms
sec-insecure-config-scanner	Detection & Analysis	99.5	100	100	25765ms
sec-rate-limit-engine	Traffic Protection	99.5	100	100	26712ms
sec-safe-redirect-builder	Sanitization	99.5	100	100	30921ms
sec-session-fixation-detector	Auth & Session	99.5	100	100	29523ms
sec-tenant-isolation-checker	Access Control	99.5	100	100	22923ms
sec-file-upload-validator	Sanitization	100.0	100	100	18967ms
sec-hostname-allowlist-validator	Sanitization	100.0	100	100	22811ms
sec-html-entity-encoder	Sanitization	100.0	100	100	34861ms
sec-url-sanitizer	Sanitization	100.0	100	100	32612ms

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
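The headline figures can be reproduced from the table, assuming they are unweighted means over the 30 tasks. The sketch below copies the Score, Correct, and Hidden columns in table order. The report does not define "success rate"; counting tasks that pass every visible and hidden case is an assumption that happens to match the quoted 36.7%.

```python
# (score, correct %, hidden %) for each row of the table, in the order shown.
ROWS = [
    (9.5, 0, 0), (9.5, 0, 0), (9.5, 0, 0), (14.2, 0, 10), (17.3, 0, 17),
    (27.7, 33, 8), (45.8, 67, 15), (62.2, 33, 82), (70.0, 67, 67),
    (87.3, 100, 73), (87.8, 100, 75), (88.3, 100, 75), (92.2, 100, 83),
    (94.2, 100, 89), (94.2, 100, 89), (94.9, 100, 91), (95.3, 100, 91),
    (95.7, 100, 92), (97.1, 100, 95),
    (99.5, 100, 100), (99.5, 100, 100), (99.5, 100, 100), (99.5, 100, 100),
    (99.5, 100, 100), (99.5, 100, 100), (99.5, 100, 100),
    (100.0, 100, 100), (100.0, 100, 100), (100.0, 100, 100), (100.0, 100, 100),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


overall = round(mean(row[0] for row in ROWS), 1)   # 76.3
visible = round(mean(row[1] for row in ROWS), 1)   # 76.7
hidden = round(mean(row[2] for row in ROWS), 1)    # 71.7

# Assumed definition: a task "succeeds" only if both pass rates are 100%.
full_passes = sum(1 for _, correct, hid in ROWS if correct == 100 and hid == 100)
success_rate = round(100 * full_passes / len(ROWS), 1)  # 11 of 30 -> 36.7
```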