BridgeBench
Security
Model Analysis

Grok 4.20 Reasoning

x-ai/grok-4.20-reasoning

Overall score: 78.9

80.0% visible · 74.0% hidden

Tasks: 30
Passed: 11
Failed: 19
Avg latency: 23,615 ms
Total cost: $0.2887

AI Commentary

by openai/gpt-5.4-mini

Grok 4.20 Reasoning is strong on straightforward security transformations and policy checks, with high scores in sanitization (92.4), access control (95.4), and traffic protection (95.6). Its main weaknesses are in edge-case-heavy parsing and structured security reasoning: auth/session (68.3), detection-analysis (70.5), and crypto-utils (63.1) all show brittle handling of malformed inputs, incomplete extraction, and over- or under-detection. That brittleness is consistent with the low 36.7% success rate (11 of 30 tasks passed) despite a solid 78.9 average score.

Domain Performance

Sanitization · 6 tasks
92.4

Very strong overall at 92.4, with correct behavior on file upload validation, hostname allowlists, HTML entity encoding, safe redirects, and URL sanitization. The main miss is sec-input-sanitizer, where it failed to fully neutralize script-like content and preserve expected spacing, suggesting incomplete token stripping and inconsistent normalization.

Auth & Session · 7 tasks
68.3

Mixed performance at 68.3: it handled cookie policy, CSRF, refresh rotation, and session fixation well, but broke on JWT (27.7) and OAuth state handling (9.5), and scored only 49.5 on password strength. The JWT validator appears to reject malformed tokens too early with a generic format error instead of parsing headers and payloads, while the OAuth state validator likely has a null/undefined access bug that causes a runtime failure rather than a structured validation result.

Access Control · 5 tasks
95.4

Excellent at 95.4 with no significant weaknesses, indicating reliable enforcement of scope and tenant boundaries. This is one of the model's most dependable areas and suggests good rule-based authorization reasoning.

Detection & Analysis · 9 tasks
70.5

Moderate at 70.5, with strong results on CSP parsing, dependency risk classification, insecure config scanning, and SQL injection detection, but weaker behavior on anomaly detection, secret extraction, SSRF, and vulnerability scanning. The failures suggest inconsistent thresholding and pattern matching, plus a tendency to over-flag or under-parse when inputs require precise normalization or multi-signal correlation.

Traffic Protection · 1 task
95.6

Near-perfect at 95.6 on the rate limit engine, indicating robust handling of throttling logic and request policy computation. This domain is a clear strength with no notable edge-case regressions reported.

Crypto Utils · 2 tasks
63.1

Weaker at 63.1, driven by sec-crypto-utils failing on expected true/false outputs and producing empty or incorrect derived values. This points to brittle implementation of cryptographic helper logic, likely around key/nonce derivation or validation formatting rather than core encryption pipeline behavior.

Notable Tasks

sec-input-sanitizer · 54.2 · Sanitization

It left XSS-like content partially intact and failed to normalize spacing consistently, indicating incomplete sanitization rather than a simple escaping bug.
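The failure mode described above is easy to reproduce in miniature. The sketch below is illustrative only: `strip_once` and `sanitize_text` are hypothetical names, and the report does not show the output format the benchmark task actually expects. It contrasts a single-pass tag strip, which can splice script-like content back together, with escaping followed by whitespace normalization.

```python
import html
import re

# Matches opening and closing script tags, case-insensitively.
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def strip_once(text: str) -> str:
    """Naive single-pass strip: removing one tag can join fragments into a new tag."""
    return SCRIPT_TAG.sub("", text)


def sanitize_text(text: str) -> str:
    """Escape HTML metacharacters, then collapse every whitespace run to one space."""
    escaped = html.escape(text, quote=True)
    return " ".join(escaped.split())


payload = "<scr<script>ipt>alert(1)</scr</script>ipt>   hello\t world"

# The nested payload survives the single pass as a working script tag.
print(strip_once(payload))
# Escaping leaves no raw angle brackets and normalizes the spacing.
print(sanitize_text(payload))
```

Escaping rather than stripping avoids the reassembly problem entirely, at the cost of showing the escaped markup to the reader. Whether the task wanted stripping or escaping is not stated in this report.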

sec-jwt-validator · 27.7 · Auth & Session

It returned a generic invalid-format error and null header/payload for cases that should have been parsed, suggesting the validator bails out before structured JWT inspection.
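One way to avoid bailing out early is to decode each segment independently and collect errors as they arise. The sketch below assumes a result shape with `well_formed`, `header`, `payload`, and `errors` fields; those names, and the functions `inspect_jwt`, `decode_segment`, and `encode_segment`, are hypothetical and not taken from the benchmark. It performs structural inspection only and does not verify signatures or claims.

```python
import base64
import json
from typing import Any, Optional


def decode_segment(segment: str) -> Optional[dict]:
    """Decode one base64url-encoded JSON object; return None on any failure."""
    try:
        padded = segment + "=" * (-len(segment) % 4)
        value = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers binascii.Error, JSONDecodeError, UnicodeDecodeError
        return None
    return value if isinstance(value, dict) else None


def inspect_jwt(token: Any) -> dict:
    """Report structural problems while still returning whatever could be parsed."""
    if not isinstance(token, str) or not token:
        return {"well_formed": False, "header": None, "payload": None,
                "errors": ["token is missing or not a string"]}
    parts = token.split(".")
    errors = []
    if len(parts) != 3:
        errors.append(f"expected 3 segments, got {len(parts)}")
    header = decode_segment(parts[0])
    payload = decode_segment(parts[1]) if len(parts) > 1 else None
    if header is None:
        errors.append("header is not base64url-encoded JSON")
    if payload is None:
        errors.append("payload is not base64url-encoded JSON")
    # Signature and claim checks (alg, exp, aud, ...) are deliberately omitted.
    return {"well_formed": not errors, "header": header,
            "payload": payload, "errors": errors}


def encode_segment(obj: dict) -> str:
    """Demo helper: JSON object to unpadded base64url text."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# A token missing its signature segment still yields a parsed header and payload.
truncated = (encode_segment({"alg": "HS256", "typ": "JWT"}) + "."
             + encode_segment({"sub": "123"}))
result = inspect_jwt(truncated)
print(result["header"], result["payload"], result["errors"])
```

The design choice here is to treat "wrong number of segments" as one error among several instead of a reason to stop, so callers receive the decoded header and payload alongside the list of problems.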

sec-oauth-state-validator · 9.5 · Auth & Session

The runtime TypeError implies a null/undefined access path, so the model likely produced code that assumes state length exists before checking input presence.
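The ordering problem can be shown with a small sketch: check presence and type first, then length, then equality. Everything here is an assumption for illustration, including the `validate_state` name, the result fields, and the `MIN_STATE_LENGTH` value; the benchmark's real interface is not shown in this report. In Python the equivalent crash would be `len(None)` raising `TypeError`.

```python
import hmac
from typing import Any, Optional, TypedDict


class StateResult(TypedDict):
    valid: bool
    reason: Optional[str]


MIN_STATE_LENGTH = 16  # assumed minimum, chosen only for this example


def validate_state(received: Any, expected: Any) -> StateResult:
    """Return a structured result for every input instead of raising."""
    # Presence and type come first, so len() is never called on None.
    if not isinstance(received, str) or not isinstance(expected, str):
        return {"valid": False, "reason": "state missing or not a string"}
    if len(received) < MIN_STATE_LENGTH:
        return {"valid": False, "reason": "state too short"}
    # Constant-time comparison on bytes avoids leaking how many characters matched.
    if not hmac.compare_digest(received.encode(), expected.encode()):
        return {"valid": False, "reason": "state mismatch"}
    return {"valid": True, "reason": None}


print(validate_state(None, "a" * 32))
print(validate_state("a" * 32, "a" * 32))
```

Both calls return a dictionary, so a missing `state` parameter becomes an ordinary validation failure rather than a runtime error.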

sec-secret-detector · 70.0 · Detection & Analysis

It truncated AWS-style secrets and missed at least one secret type entirely, which points to overly narrow regexes or premature token length limits.
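A short sketch shows how a too-small quantifier produces exactly this kind of truncation. It assumes the commonly documented shape of a long-term AWS access key ID (the prefix `AKIA` followed by 16 uppercase letters or digits) and uses the placeholder key from AWS documentation. The pattern names are hypothetical and the patterns used by the model's solution are not shown in this report.

```python
import re

# Too narrow: captures only 12 characters after the prefix, so matches are cut short.
TOO_NARROW = re.compile(r"AKIA[0-9A-Z]{12}")

# Full length, with word boundaries so the match cannot stop or start mid-token.
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE"

truncated = TOO_NARROW.search(sample).group()
full = ACCESS_KEY_ID.search(sample).group()

print(truncated)  # AKIAIOSFODNN7EXA
print(full)       # AKIAIOSFODNN7EXAMPLE
```

A detector that covers only one key family would also miss other secret types entirely, which matches the second half of the observation above.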

sec-rate-limit-engine · 95.6 · Traffic Protection

This was a clean pass in the traffic protection domain, indicating the model can implement deterministic policy logic accurately when the rules are explicit.

All Task Results

Task | Domain | Score
sec-oauth-state-validator | Auth & Session | 9.5
sec-ssrf-detector | Detection & Analysis | 9.5
sec-vulnerability-scanner | Detection & Analysis | 13.4
sec-jwt-validator | Auth & Session | 27.7
sec-crypto-utils | Crypto Utils | 30.5
sec-password-strength | Auth & Session | 49.5
sec-input-sanitizer | Sanitization | 54.2
sec-auth-log-anomaly-detector | Detection & Analysis | 62.2
sec-secret-detector | Detection & Analysis | 70.0
sec-csp-nonce-validator | Detection & Analysis | 88.3
sec-permission-checker | Access Control | 88.9
sec-abac-rule-engine | Access Control | 94.2
sec-access-control-engine | Access Control | 94.9
sec-refresh-token-rotation | Auth & Session | 95.3
sec-rate-limit-engine | Traffic Protection | 95.6
sec-encryption-pipeline | Crypto Utils | 95.7
sec-csp-parser | Detection & Analysis | 96.1
sec-sql-injection-detector | Detection & Analysis | 96.1
sec-cookie-policy-validator | Auth & Session | 97.1
sec-api-key-scope-checker | Access Control | 99.5
sec-csrf-token-manager | Auth & Session | 99.5
sec-dependency-risk-classifier | Detection & Analysis | 99.5
sec-insecure-config-scanner | Detection & Analysis | 99.5
sec-session-fixation-detector | Auth & Session | 99.5
sec-tenant-isolation-checker | Access Control | 99.5
sec-file-upload-validator | Sanitization | 100.0
sec-hostname-allowlist-validator | Sanitization | 100.0
sec-html-entity-encoder | Sanitization | 100.0
sec-safe-redirect-builder | Sanitization | 100.0
sec-url-sanitizer | Sanitization | 100.0

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
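The domain scores and the overall score reported on this page can be reproduced from the per-task table as plain arithmetic means. The sketch below assumes unweighted averaging, which matches the reported figures but is not confirmed by the page. The pass rate uses the passed and total counts from the summary, since the criterion for a task to count as passed is not stated.

```python
# Per-task scores copied from the table above, grouped by domain.
SCORES = {
    "Sanitization": [54.2, 100.0, 100.0, 100.0, 100.0, 100.0],
    "Auth & Session": [9.5, 27.7, 49.5, 95.3, 97.1, 99.5, 99.5],
    "Access Control": [88.9, 94.2, 94.9, 99.5, 99.5],
    "Detection & Analysis": [9.5, 13.4, 62.2, 70.0, 88.3, 96.1, 96.1, 99.5, 99.5],
    "Traffic Protection": [95.6],
    "Crypto Utils": [30.5, 95.7],
}

PASSED, TOTAL = 11, 30  # counts taken from the summary at the top of the page


def mean(values: list) -> float:
    return sum(values) / len(values)


# Unweighted mean per domain, rounded to one decimal like the page.
domain_avg = {domain: round(mean(vals), 1) for domain, vals in SCORES.items()}

# Overall score as the mean over all 30 tasks, not the mean of domain means.
all_scores = [score for vals in SCORES.values() for score in vals]
overall = round(mean(all_scores), 1)
pass_rate = round(100 * PASSED / TOTAL, 1)

for domain, avg in domain_avg.items():
    print(f"{domain}: {avg}")
print(f"Overall: {overall}")        # Overall: 78.9
print(f"Pass rate: {pass_rate}%")   # Pass rate: 36.7%
```

The gap between a 78.9 mean and a 36.7% pass rate follows from the distribution: a handful of very low scores barely dents the mean, while pass/fail counts every task equally.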