Grok 4.20 (Non-Reasoning)
x-ai/grok-4.20
Overall score: 76.3
Tasks: 30
Passed: 11
Failed: 19
Avg latency: 25073ms
Total cost: $0.2914
AI Commentary
by openai/gpt-5.4-mini
Grok 4.20 is strong on straightforward security controls, with near-ceiling performance in sanitization (97.8), access control (96.5), and rate limiting (99.5). Its main weaknesses are in stateful/authentication logic and detection tasks, where it loses points to malformed outputs, missing edge cases, and brittle parsing behavior. The low success rate (36.7%, 11 of 30 tasks passed) against a much higher visible pass rate (76.7%) suggests it overfits easier cases and degrades on hidden variants.
Domain Performance
Sanitization: Performance is excellent across file upload, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization. The one softer spot is the input sanitizer (87.3), which passes every visible case but only 73% of hidden ones; otherwise the domain is robust and consistent under both visible and hidden cases.
Auth & Session: This domain is uneven. Cookie policy, CSRF token handling, refresh-token rotation, and session fixation detection are strong, but JWT validation and OAuth state handling are brittle, and password-strength checking is also weak (45.8). The JWT task appears to reject malformed tokens too early instead of parsing headers/payloads, while the OAuth state validator throws a length-related TypeError, indicating a structural bug rather than a logic miss.
Access Control: Access control is a clear strength, with strong results on API key scope checking and tenant isolation. The model handled authorization boundaries correctly and did not show notable edge-case regressions here.
Detection & Analysis: This is the most inconsistent domain. Insecure config scanning is strong, but anomaly detection, dependency risk classification, secret detection, SSRF detection, and vulnerability scanning all show different failure modes. The errors range from under-detection and over-detection to runtime exceptions and overly strict URL parsing, which points to weak normalization and inconsistent rule application.
Traffic Protection: Rate limiting is effectively perfect, with no significant weaknesses. The model appears reliable for deterministic traffic-control logic.
Crypto Utils: Cryptographic utility handling is weak overall despite one strong encryption pipeline result. The failing crypto-utils task shows runtime errors from undefined length access and an empty-string output where a formatted value was expected, suggesting poor input handling and broken helper logic.
Notable Tasks
The model returned "Invalid token format" with null header/payload for cases that required parsing first, so it likely short-circuited on token shape instead of validating JWT structure and claims.
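One way to avoid the short-circuit described above is to decode the header and payload first and report structural problems afterwards. The Python sketch below illustrates that ordering only. The function name `parse_jwt_unverified`, the result dictionary, and the error strings are assumptions made for illustration, not the benchmark's interface, and the sketch performs no signature or claim verification.

```python
import base64
import binascii
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs omit."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def parse_jwt_unverified(token):
    """Decode header and payload WITHOUT verifying the signature.

    Hypothetical result shape: {"header", "payload", "error"}.
    """
    result = {"header": None, "payload": None, "error": None}
    if not isinstance(token, str) or not token:
        result["error"] = "token must be a non-empty string"
        return result

    parts = token.split(".")
    # Decode whatever segments are present before reporting a shape error,
    # so callers still see the header/payload of a structurally odd token.
    for name, segment in zip(("header", "payload"), parts):
        try:
            decoded = json.loads(_b64url_decode(segment))
        except (ValueError, binascii.Error):
            result["error"] = f"{name} is not valid base64url JSON"
            return result
        if not isinstance(decoded, dict):
            result["error"] = f"{name} is not a JSON object"
            return result
        result[name] = decoded

    if len(parts) != 3:
        result["error"] = f"expected 3 segments, got {len(parts)}"
    return result
```

With this ordering, a token that has only two segments still yields its decoded header and payload alongside the error, which is the behavior the failing cases appear to have expected.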
A TypeError on reading 'length' indicates the implementation is not guarding against undefined inputs or missing state fields before validation.
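The guard this failure points to can be sketched as follows, in Python for illustration. The function name `validate_oauth_state`, its parameters, and the `min_length` threshold of 16 are assumptions, not details taken from the task. The point is only the ordering: check presence and type before reading a length, then compare.

```python
import hmac


def validate_oauth_state(expected, received, min_length=16):
    """Return (ok, reason). Hypothetical signature, not the benchmark's."""
    # Check presence and type first, so a missing value produces a clean
    # rejection instead of an exception when its length is read.
    for label, value in (("expected", expected), ("received", received)):
        if not isinstance(value, str) or not value:
            return False, f"{label} state is missing or not a string"

    if len(received) < min_length:  # assumed minimum, for illustration
        return False, "received state is too short"

    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return False, "state mismatch"
    return True, "ok"
```

Returning a reason string alongside the boolean keeps undefined or malformed inputs on the normal rejection path rather than the exception path.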
It labeled valid and invalid URLs as "invalid_url" and failed to normalize hosts, which suggests the URL parser/normalizer is too strict or incorrectly wired.
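A minimal sketch of the normalize-then-classify order this failure suggests, using Python's standard `urllib.parse` and `ipaddress` modules. The label `"invalid_url"` comes from the report; the function name `classify_url` and the labels `"blocked"` and `"allowed"` are hypothetical. The sketch does not resolve DNS names or handle alternative IP encodings (decimal, octal), so it is not a complete SSRF defense.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_url(url):
    """Return "invalid_url", "blocked", or "allowed" (last two are hypothetical)."""
    if not isinstance(url, str):
        return "invalid_url"
    try:
        parts = urlsplit(url.strip())
        # .hostname is lowercased, with any port and IPv6 brackets removed.
        host = (parts.hostname or "").rstrip(".")  # "example.com." -> "example.com"
    except ValueError:
        return "invalid_url"
    if parts.scheme not in ("http", "https") or not host:
        return "invalid_url"

    if host == "localhost" or host.endswith(".localhost"):
        return "blocked"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a DNS name; resolving it and
        # checking the resulting addresses is out of scope for this sketch.
        return "allowed"
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
        return "blocked"
    return "allowed"
```

Normalizing case, ports, and trailing dots before any comparison means that only inputs with no usable scheme or host fall into `"invalid_url"`.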
The detector truncated AWS key matches and missed at least one secret entirely, indicating weak pattern extraction and poor handling of multi-secret inputs.
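Both symptoms, truncated matches and missed secrets, are consistent with returning a capture sub-group or only the first match. The Python sketch below shows the alternative of collecting every full match. The two patterns are illustrative only, since the benchmark's rule set is not shown in the report, and `find_secrets` and its result fields are hypothetical names.

```python
import re

# Illustrative patterns only; not the benchmark's actual rules.
PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM private key headers, e.g. "-----BEGIN RSA PRIVATE KEY-----".
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return every match of every pattern, ordered by position in the text."""
    if not isinstance(text, str):
        return []
    findings = []
    for kind, pattern in PATTERNS.items():
        # finditer yields all non-overlapping matches, not just the first,
        # and group(0) is the whole match rather than a capture sub-group.
        for match in pattern.finditer(text):
            findings.append(
                {"type": kind, "value": match.group(0), "start": match.start()}
            )
    return sorted(findings, key=lambda finding: finding["start"])
```

Called on a string containing two access key IDs and a PEM header, this returns three findings, each carrying the complete matched value.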
This near-perfect result indicates the model can implement deterministic policy logic accurately when the task is well-specified and state transitions are simple.
All Task Results
| Task | Domain | Score | Correct (%) | Hidden (%) | Latency (ms) |
|---|---|---|---|---|---|
| sec-crypto-utils | Crypto Utils | 9.5 | 0 | 0 | 28762 |
| sec-oauth-state-validator | Auth & Session | 9.5 | 0 | 0 | 23381 |
| sec-ssrf-detector | Detection & Analysis | 9.5 | 0 | 0 | 23531 |
| sec-dependency-risk-classifier | Detection & Analysis | 14.2 | 0 | 10 | 22089 |
| sec-vulnerability-scanner | Detection & Analysis | 17.3 | 0 | 17 | 20688 |
| sec-jwt-validator | Auth & Session | 27.7 | 33 | 8 | 19711 |
| sec-password-strength | Auth & Session | 45.8 | 67 | 15 | 20821 |
| sec-auth-log-anomaly-detector | Detection & Analysis | 62.2 | 33 | 82 | 44243 |
| sec-secret-detector | Detection & Analysis | 70.0 | 67 | 67 | 22756 |
| sec-input-sanitizer | Sanitization | 87.3 | 100 | 73 | 39018 |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 | 100 | 75 | 19342 |
| sec-csp-parser | Detection & Analysis | 88.3 | 100 | 75 | 14738 |
| sec-sql-injection-detector | Detection & Analysis | 92.2 | 100 | 83 | 15693 |
| sec-abac-rule-engine | Access Control | 94.2 | 100 | 89 | 28289 |
| sec-permission-checker | Access Control | 94.2 | 100 | 89 | 33607 |
| sec-access-control-engine | Access Control | 94.9 | 100 | 91 | 21158 |
| sec-refresh-token-rotation | Auth & Session | 95.3 | 100 | 91 | 20861 |
| sec-encryption-pipeline | Crypto Utils | 95.7 | 100 | 92 | 19477 |
| sec-cookie-policy-validator | Auth & Session | 97.1 | 100 | 95 | 32947 |
| sec-api-key-scope-checker | Access Control | 99.5 | 100 | 100 | 21929 |
| sec-csrf-token-manager | Auth & Session | 99.5 | 100 | 100 | 14038 |
| sec-insecure-config-scanner | Detection & Analysis | 99.5 | 100 | 100 | 25765 |
| sec-rate-limit-engine | Traffic Protection | 99.5 | 100 | 100 | 26712 |
| sec-safe-redirect-builder | Sanitization | 99.5 | 100 | 100 | 30921 |
| sec-session-fixation-detector | Auth & Session | 99.5 | 100 | 100 | 29523 |
| sec-tenant-isolation-checker | Access Control | 99.5 | 100 | 100 | 22923 |
| sec-file-upload-validator | Sanitization | 100.0 | 100 | 100 | 18967 |
| sec-hostname-allowlist-validator | Sanitization | 100.0 | 100 | 100 | 22811 |
| sec-html-entity-encoder | Sanitization | 100.0 | 100 | 100 | 34861 |
| sec-url-sanitizer | Sanitization | 100.0 | 100 | 100 | 32612 |
30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
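The headline figures can be reproduced from the table with a few lines of Python. The rows below are copied from the Domain, Score, Correct, and Hidden columns. One assumption is labeled in the code: the report does not state what counts as "passed", and treating a task as passed when its hidden pass rate is 100 happens to reproduce the reported 11.

```python
from collections import defaultdict
from statistics import fmean

SAN, AUTH, ACC = "Sanitization", "Auth & Session", "Access Control"
DET, TRAF, CRY = "Detection & Analysis", "Traffic Protection", "Crypto Utils"

# (domain, score, correct, hidden), in the table's row order.
ROWS = [
    (CRY, 9.5, 0, 0), (AUTH, 9.5, 0, 0), (DET, 9.5, 0, 0),
    (DET, 14.2, 0, 10), (DET, 17.3, 0, 17), (AUTH, 27.7, 33, 8),
    (AUTH, 45.8, 67, 15), (DET, 62.2, 33, 82), (DET, 70.0, 67, 67),
    (SAN, 87.3, 100, 73), (DET, 87.8, 100, 75), (DET, 88.3, 100, 75),
    (DET, 92.2, 100, 83), (ACC, 94.2, 100, 89), (ACC, 94.2, 100, 89),
    (ACC, 94.9, 100, 91), (AUTH, 95.3, 100, 91), (CRY, 95.7, 100, 92),
    (AUTH, 97.1, 100, 95), (ACC, 99.5, 100, 100), (AUTH, 99.5, 100, 100),
    (DET, 99.5, 100, 100), (TRAF, 99.5, 100, 100), (SAN, 99.5, 100, 100),
    (AUTH, 99.5, 100, 100), (ACC, 99.5, 100, 100), (SAN, 100.0, 100, 100),
    (SAN, 100.0, 100, 100), (SAN, 100.0, 100, 100), (SAN, 100.0, 100, 100),
]


def summarize(rows):
    """Compute overall and per-domain averages plus pass-rate figures."""
    by_domain = defaultdict(list)
    for domain, score, _correct, _hidden in rows:
        by_domain[domain].append(score)
    # ASSUMPTION: "passed" means hidden pass rate == 100 (not stated in the report).
    passed = sum(1 for _d, _s, _c, hidden in rows if hidden == 100)
    return {
        "overall": round(fmean(score for _d, score, _c, _h in rows), 1),
        "visible_pass_rate": round(fmean(c for _d, _s, c, _h in rows), 1),
        "passed": passed,
        "success_rate": round(100 * passed / len(rows), 1),
        "domains": {d: round(fmean(s), 1) for d, s in by_domain.items()},
    }


summary = summarize(ROWS)
```

Under that assumption, `summary` matches the reported overall score (76.3), success rate (36.7%), visible pass rate (76.7%), and the sanitization (97.8) and access control (96.5) averages quoted in the commentary.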