Grok 4.20 Reasoning
x-ai/grok-4.20-reasoning
Overall score: 78.9
Tasks: 30
Passed: 11
Failed: 19
Avg latency: 23615ms
Total cost: $0.2887
AI Commentary
by openai/gpt-5.4-mini
Grok 4.20 Reasoning is strong on straightforward security transformations and policy checks, with high scores in sanitization (92.4), access control (95.4), and traffic protection (95.6). Its main weaknesses are in edge-case-heavy parsing and structured security reasoning: auth/session (68.3), detection-analysis (70.5), and crypto-utils (63.1) all show brittle handling of malformed inputs, incomplete extraction, and over- or under-detection. This is consistent with the low 36.7% success rate (11 of 30 tasks passed) despite a solid 78.9 average score.
Domain Performance
Sanitization: Very strong overall at 92.4, with correct behavior on file upload validation, hostname allowlists, HTML entity encoding, safe redirects, and URL sanitization. The main miss is sec-input-sanitizer, where it failed to fully neutralize script-like content and preserve expected spacing, suggesting incomplete token stripping and inconsistent normalization.
Auth & Session: Mixed performance at 68.3. It handled cookie policy, CSRF, refresh rotation, and session fixation well, but broke on JWT and OAuth state handling. The JWT validator appears to reject malformed tokens too early with a generic format error instead of parsing headers and payloads, while the OAuth state validator likely has a null/undefined access bug that causes a runtime failure rather than a structured validation result.
Access Control: Excellent at 95.4 with no significant weaknesses, indicating reliable enforcement of scope and tenant boundaries. This is one of the model's most dependable areas and suggests good rule-based authorization reasoning.
Detection & Analysis: Moderate at 70.5, with strong results on CSP parsing, dependency risk classification, insecure config scanning, and SQL injection detection, but weaker behavior on anomaly detection, secret extraction, SSRF, and vulnerability scanning. The failures suggest inconsistent thresholding and pattern matching, plus a tendency to over-flag or under-parse when inputs require precise normalization or multi-signal correlation.
Traffic Protection: Near-perfect at 95.6 on the rate limit engine, indicating robust handling of throttling logic and request policy computation. This domain is a clear strength with no notable edge-case regressions reported.
Crypto Utils: Weaker at 63.1, driven by sec-crypto-utils failing on expected true/false outputs and producing empty or incorrect derived values. This points to brittle implementation of cryptographic helper logic, likely around key/nonce derivation or validation formatting rather than core encryption pipeline behavior.
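The JWT failure described under Auth & Session (a generic format error with null header and payload) can be contrasted with a validator that keeps inspecting after the first problem. The following is a minimal sketch under stated assumptions, not the benchmark's reference solution: the report does not give the task's expected output shape, so the function name `inspect_jwt`, the result keys, and the error strings are all hypothetical. It only inspects structure and does not verify signatures.

```python
import base64
import json
from typing import Any, Optional


def _decode_segment(segment: str) -> Optional[dict[str, Any]]:
    """Return the JSON object in a base64url segment, or None if unreadable."""
    try:
        padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
        value = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    return value if isinstance(value, dict) else None


def inspect_jwt(token: object) -> dict[str, Any]:
    """Inspect a JWT's structure without verifying its signature.

    Hypothetical result shape: the real task's contract is not in the report.
    """
    result: dict[str, Any] = {"header": None, "payload": None, "errors": []}
    if not isinstance(token, str) or not token:
        result["errors"].append("token must be a non-empty string")
        return result

    parts = token.split(".")
    if len(parts) != 3:
        result["errors"].append(f"expected 3 segments, got {len(parts)}")

    # Decode whatever is present instead of returning on the first problem.
    for name, segment in zip(("header", "payload"), parts):
        result[name] = _decode_segment(segment)
        if result[name] is None:
            result["errors"].append(f"{name} is not base64url-encoded JSON")
    return result
```

With this shape, a token that is missing its signature segment still yields a decoded header and payload alongside a segment-count error, which is the kind of structured result the commentary suggests the failing solution did not produce.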
Notable Tasks
sec-input-sanitizer: It left XSS-like content partially intact and failed to normalize spacing consistently, indicating incomplete sanitization rather than a simple escaping bug.
sec-jwt-validator: It returned a generic invalid-format error and null header/payload for cases that should have been parsed, suggesting the validator bails out before structured JWT inspection.
sec-oauth-state-validator: The runtime TypeError implies a null/undefined access path, so the model likely produced code that assumes the state's length exists before checking that the input is present.
sec-secret-detector: It truncated AWS-style secrets and missed at least one secret type entirely, which points to overly narrow regexes or premature token length limits.
sec-rate-limit-engine: A near-clean result (95.6, with every visible check correct) in the traffic protection domain, indicating the model can implement deterministic policy logic accurately when the rules are explicit.
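The null/undefined access pattern noted for the OAuth state validator is usually avoided by checking presence and type before touching length. The sketch below illustrates that ordering in Python, where `None` stands in for null/undefined. The task's language and contract are not stated in the report, so the function name, the `MIN_STATE_LENGTH` threshold, and the reason codes are assumptions made for illustration.

```python
import hmac
from typing import Any, Optional

MIN_STATE_LENGTH = 16  # hypothetical threshold, not taken from the benchmark


def validate_state(received: Optional[str], expected: Optional[str]) -> dict[str, Any]:
    """Return a structured verdict instead of raising on missing input."""
    # Presence and type checks come first, so len() is never called on None.
    if not isinstance(received, str) or not received:
        return {"valid": False, "reason": "missing_state"}
    if not isinstance(expected, str) or not expected:
        return {"valid": False, "reason": "no_stored_state"}
    if len(received) < MIN_STATE_LENGTH:
        return {"valid": False, "reason": "state_too_short"}
    # Constant-time comparison; bytes avoid compare_digest's ASCII-only str rule.
    if not hmac.compare_digest(received.encode(), expected.encode()):
        return {"valid": False, "reason": "state_mismatch"}
    return {"valid": True, "reason": None}
```

Every path returns the same dictionary shape, so a caller can branch on `reason` without wrapping the call in exception handling.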
All Task Results
| Task | Domain | Score | Correct (%) | Hidden (%) | Latency |
|---|---|---|---|---|---|
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 28617ms |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 25632ms |
| sec-vulnerability-scanner | Detection & Analysis | 13.4 | 0 | 8 | 20965ms |
| sec-jwt-validator | Auth & Session | 27.7 | 33 | 8 | 29113ms |
| sec-crypto-utils | Crypto Utils | 30.5 | 33 | 14 | 17645ms |
| sec-password-strength | Auth & Session | 49.5 | 67 | 23 | 20059ms |
| sec-input-sanitizer | Sanitization | 54.2 | 67 | 33 | 37093ms |
| sec-auth-log-anomaly-detector | Detection & Analysis | 62.2 | 33 | 82 | 44682ms |
| sec-secret-detector | Detection & Analysis | 70.0 | 67 | 67 | 17530ms |
| sec-csp-nonce-validator | Detection & Analysis | 88.3 | 100 | 75 | 21696ms |
| sec-permission-checker | Access Control | 88.9 | 100 | 78 | 15854ms |
| sec-abac-rule-engine | Access Control | 94.2 | 100 | 89 | 22554ms |
| sec-access-control-engine | Access Control | 94.9 | 100 | 91 | 18437ms |
| sec-refresh-token-rotation | Auth & Session | 95.3 | 100 | 91 | 39136ms |
| sec-rate-limit-engine | Traffic Protection | 95.6 | 100 | 92 | 26708ms |
| sec-encryption-pipeline | Crypto Utils | 95.7 | 100 | 92 | 14281ms |
| sec-csp-parser | Detection & Analysis | 96.1 | 100 | 92 | 15783ms |
| sec-sql-injection-detector | Detection & Analysis | 96.1 | 100 | 92 | 21918ms |
| sec-cookie-policy-validator | Auth & Session | 97.1 | 100 | 95 | 22247ms |
| sec-api-key-scope-checker | Access Control | 99.5 | 100 | 100 | 31669ms |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 17666ms |
| sec-dependency-risk-classifier | Detection & Analysis | 99.5 | 100 | 100 | 18237ms |
| sec-insecure-config-scanner | Detection & Analysis | 99.5 | 100 | 100 | 16627ms |
| sec-session-fixation-detector | Auth & Session | 99.5 | 100 | 100 | 25031ms |
| sec-tenant-isolation-checker | Access Control | 99.5 | 100 | 100 | 19620ms |
| sec-file-upload-validator | Sanitization | 100.0 | 100 | 100 | 20191ms |
| sec-hostname-allowlist-validator | Sanitization | 100.0 | 100 | 100 | 28417ms |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 33120ms |
| sec-safe-redirect-builder | Sanitization | 100.0 | 100 | 100 | 16619ms |
| sec-url-sanitizer | Sanitization | 100.0 | 100 | 100 | 21310ms |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
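The headline figures can be cross-checked against the table. The sketch below recomputes the per-domain averages, the overall score, and the pass count from the rows above. One assumption is labeled in the code: the page does not state what counts as "Passed", and the rule used here (both the Correct and Hidden columns at 100) is inferred only because it reproduces the reported count of 11.

```python
from collections import defaultdict

# (domain, score, correct, hidden) copied from the results table, in table order.
ROWS = [
    ("Auth & Session", 9.5, 0, 0), ("Detection & Analysis", 9.5, 0, 0),
    ("Detection & Analysis", 13.4, 0, 8), ("Auth & Session", 27.7, 33, 8),
    ("Crypto Utils", 30.5, 33, 14), ("Auth & Session", 49.5, 67, 23),
    ("Sanitization", 54.2, 67, 33), ("Detection & Analysis", 62.2, 33, 82),
    ("Detection & Analysis", 70.0, 67, 67), ("Detection & Analysis", 88.3, 100, 75),
    ("Access Control", 88.9, 100, 78), ("Access Control", 94.2, 100, 89),
    ("Access Control", 94.9, 100, 91), ("Auth & Session", 95.3, 100, 91),
    ("Traffic Protection", 95.6, 100, 92), ("Crypto Utils", 95.7, 100, 92),
    ("Detection & Analysis", 96.1, 100, 92), ("Detection & Analysis", 96.1, 100, 92),
    ("Auth & Session", 97.1, 100, 95), ("Access Control", 99.5, 100, 100),
    ("Auth & Session", 99.5, 100, 100), ("Detection & Analysis", 99.5, 100, 100),
    ("Detection & Analysis", 99.5, 100, 100), ("Auth & Session", 99.5, 100, 100),
    ("Access Control", 99.5, 100, 100), ("Sanitization", 100.0, 100, 100),
    ("Sanitization", 100.0, 100, 100), ("Sanitization", 100.0, 100, 100),
    ("Sanitization", 100.0, 100, 100), ("Sanitization", 100.0, 100, 100),
]


def summarize(rows):
    """Return (per-domain mean scores, overall mean score, inferred pass count)."""
    by_domain = defaultdict(list)
    for domain, score, _correct, _hidden in rows:
        by_domain[domain].append(score)
    domain_means = {d: round(sum(s) / len(s), 1) for d, s in by_domain.items()}
    overall = round(sum(score for _, score, _, _ in rows) / len(rows), 1)
    # ASSUMPTION: a task passes only when Correct and Hidden are both 100.
    passed = sum(1 for _, _, correct, hidden in rows if correct == 100 and hidden == 100)
    return domain_means, overall, passed


domain_means, overall, passed = summarize(ROWS)
pass_rate = round(100 * passed / len(ROWS), 1)
```

Under that assumed rule the results match the page: six domain averages of 92.4, 68.3, 95.4, 70.5, 95.6, and 63.1, an overall score of 78.9, and 11 passes out of 30 (36.7%). It also shows why the average and the pass rate diverge: ten tasks have every visible check correct but miss at least one hidden edge case.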