BridgeBench
Security
Model Analysis

Grok 4.20 (Non-Reasoning)

x-ai/grok-4.20

Overall score: 76.3 (76.7% visible, 71.7% hidden)

Tasks: 30
Passed: 11
Failed: 19
Avg latency: 25073 ms
Total cost: $0.2914

AI Commentary

by openai/gpt-5.4-mini

Grok 4.20 is strong on straightforward security controls, with near-ceiling performance in sanitization (97.8), access control (96.5), and rate limiting (99.5). Its main weaknesses are in stateful/authentication logic and detection tasks, where it loses points to malformed outputs, missing edge cases, and brittle parsing behavior. The low task pass rate (11 of 30, or 36.7%) versus a much higher visible pass rate (76.7%) suggests it overfits easier cases and degrades on hidden variants.

Domain Performance

Sanitization · 6 tasks · 97.8

Performance is excellent across file upload, hostname allowlisting, HTML entity encoding, safe redirects, and URL sanitization, with no meaningful weaknesses. This domain looks robust and consistent under both visible and hidden cases.

Auth & Session · 7 tasks · 67.8

This domain is uneven: cookie policy, CSRF token handling, refresh-token rotation, and session fixation detection are strong, but JWT validation and OAuth state handling are brittle. The JWT task appears to reject malformed tokens too early instead of parsing headers/payloads, while the OAuth state validator throws a length-related TypeError, indicating a structural bug rather than a logic miss.

Access Control · 5 tasks · 96.5

Access control is a clear strength, with strong results on API key scope checking and tenant isolation. The model handled authorization boundaries correctly and did not show notable edge-case regressions here.

Detection & Analysis · 9 tasks · 60.1

This is the most inconsistent domain: insecure config scanning is strong, but anomaly detection, dependency risk classification, secret detection, SSRF detection, and vulnerability scanning all show different failure modes. The errors range from under-detection and over-detection to runtime exceptions and overly strict URL parsing, which points to weak normalization and inconsistent rule application.

Traffic Protection · 1 task · 99.5

Rate limiting is effectively perfect, with no significant weaknesses. The model appears reliable for deterministic traffic-control logic.

Crypto Utils · 2 tasks · 52.6

Cryptographic utility handling is weak overall despite one strong encryption pipeline result. The failing crypto-utils task shows runtime errors from undefined length access and an empty-string output where a formatted value was expected, suggesting poor input handling and broken helper logic.

Notable Tasks

sec-jwt-validator · 27.7 · Auth & Session

The model returned "Invalid token format" with null header/payload for cases that required parsing first, so it likely short-circuited on token shape instead of validating JWT structure and claims.
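
A minimal sketch of the "parse first, then judge" behavior described above, written in Python for illustration. The function name `parse_jwt`, the result shape, and the error strings are assumptions, not the benchmark's actual interface, and signature verification is deliberately left out. The point is that header and payload are decoded whenever they can be, even if the token is otherwise malformed.

```python
import base64
import binascii
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def parse_jwt(token):
    """Hypothetical helper: decode what can be decoded and collect problems."""
    result = {"header": None, "payload": None, "errors": []}
    if not isinstance(token, str):
        result["errors"].append("token is not a string")
        return result

    parts = token.split(".")
    if len(parts) != 3:
        # Record the shape problem but keep going, so callers still see
        # whatever header/payload could be recovered.
        result["errors"].append(f"expected 3 segments, got {len(parts)}")

    for name, segment in zip(("header", "payload"), parts):
        try:
            value = json.loads(b64url_decode(segment))
        except (ValueError, binascii.Error):
            result["errors"].append(f"{name} is not valid base64url JSON")
            continue
        if isinstance(value, dict):
            result[name] = value
        else:
            result["errors"].append(f"{name} is not a JSON object")
    return result
```

Claim checks such as `exp` or `aud`, and signature verification, would run on the decoded values afterwards and append to the same error list.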

sec-oauth-state-validator · 9.5 · Auth & Session

A TypeError on reading 'length' indicates the implementation is not guarding against undefined inputs or missing state fields before validation.
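
The missing guard can be sketched as follows. This is an illustrative Python version, not the benchmark's reference solution: `validate_oauth_state`, the `min_length` default, and the reason strings are all hypothetical. Types are checked before any length is read, and the comparison uses a constant-time function from the standard library.

```python
import hmac


def validate_oauth_state(expected, received, min_length=16):
    """Hypothetical validator: never touch len() on a missing value."""
    # Guard first: a missing or non-string state is a normal failure case,
    # not an exception.
    if not isinstance(expected, str) or not isinstance(received, str):
        return {"valid": False, "reason": "missing_state"}
    if len(received) < min_length:
        return {"valid": False, "reason": "state_too_short"}
    # Compare as bytes in constant time to avoid leaking match position.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return {"valid": False, "reason": "state_mismatch"}
    return {"valid": True, "reason": None}
```

Returning a structured failure for absent input keeps the validator total: every call produces a result, so a hidden test that omits the state field cannot crash it.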

sec-ssrf-detector · 9.5 · Detection & Analysis

The solution labeled both valid and invalid URLs as "invalid_url" and failed to normalize hosts, which suggests the URL parser/normalizer is too strict or incorrectly wired.
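
One way to separate "cannot be parsed" from "parses but points somewhere internal" is sketched below in Python. The function name `classify_url` and the three labels are assumptions for illustration; the task's real output format is not shown in this report. The sketch normalizes the host (case, trailing dot) before classifying it, and it does not resolve DNS or decode alternative IP notations, both of which a production check would need.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_url(url):
    """Hypothetical classifier: 'invalid_url', 'blocked', or 'allowed'."""
    if not isinstance(url, str):
        return "invalid_url"
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname  # already lowercased, brackets stripped
    except ValueError:
        return "invalid_url"
    if parts.scheme not in ("http", "https") or not host:
        return "invalid_url"

    host = host.rstrip(".")  # "example.com." and "example.com" are the same
    if host == "localhost" or host.endswith(".localhost"):
        return "blocked"

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real detector would resolve the name and
        # re-check the resulting addresses; this sketch stops here.
        return "allowed"

    if (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_unspecified):
        return "blocked"
    return "allowed"
```

Only inputs that genuinely fail to parse, or that use a non-HTTP scheme, are reported as invalid; everything else is normalized and then judged.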

sec-secret-detector · 70.0 · Detection & Analysis

The detector truncated AWS key matches and missed at least one secret entirely, indicating weak pattern extraction and poor handling of multi-secret inputs.
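
A short Python sketch of extraction that returns every match in full. The pattern set and the `find_secrets` name are illustrative assumptions, not the task's rule list. The AWS pattern relies on the commonly documented shape of access key IDs (a 4-letter prefix such as `AKIA` or `ASIA` followed by 16 uppercase alphanumerics), and `finditer` is used so that inputs with several secrets yield several findings.

```python
import re

# Illustrative patterns only; a real detector would carry many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text):
    """Return all matches, untruncated, ordered by position in the input."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "type": kind,
                "match": match.group(0),  # whole match, not a sliced prefix
                "start": match.start(),
            })
    return sorted(findings, key=lambda f: f["start"])
```

Anchoring the key pattern with `\b` on both sides and a fixed `{16}` count avoids both truncated matches and matches that bleed into neighboring text.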

sec-rate-limit-engine · 99.5 · Traffic Protection

This near-perfect result indicates the model can implement deterministic policy logic accurately when the task is well-specified and state transitions are simple.
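
The report does not say which algorithm the task requires. As an illustration of what "deterministic policy logic with simple state transitions" can look like, here is a fixed-window counter in Python. The class name, its parameters, and the choice of fixed windows are all assumptions. Time is passed in explicitly rather than read from a clock, which keeps every decision reproducible under test.

```python
class FixedWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per key per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # key -> (window_index, count)

    def allow(self, key, now):
        """Return True if the request at time `now` (seconds) is allowed."""
        index = int(now // self.window)
        current, count = self.counters.get(key, (index, 0))
        if current != index:
            # A new window has started, so the counter resets.
            current, count = index, 0
        if count >= self.limit:
            self.counters[key] = (current, count)
            return False
        self.counters[key] = (current, count + 1)
        return True
```

The only state is one `(window_index, count)` pair per key, and each call either increments it, resets it, or rejects the request.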

All Task Results

| Task | Domain | Score |
|---|---|---|
| sec-crypto-utils | Crypto Utils | 9.5 |
| sec-oauth-state-validator | Auth & Session | 9.5 |
| sec-ssrf-detector | Detection & Analysis | 9.5 |
| sec-dependency-risk-classifier | Detection & Analysis | 14.2 |
| sec-vulnerability-scanner | Detection & Analysis | 17.3 |
| sec-jwt-validator | Auth & Session | 27.7 |
| sec-password-strength | Auth & Session | 45.8 |
| sec-auth-log-anomaly-detector | Detection & Analysis | 62.2 |
| sec-secret-detector | Detection & Analysis | 70.0 |
| sec-input-sanitizer | Sanitization | 87.3 |
| sec-csp-nonce-validator | Detection & Analysis | 87.8 |
| sec-csp-parser | Detection & Analysis | 88.3 |
| sec-sql-injection-detector | Detection & Analysis | 92.2 |
| sec-abac-rule-engine | Access Control | 94.2 |
| sec-permission-checker | Access Control | 94.2 |
| sec-access-control-engine | Access Control | 94.9 |
| sec-refresh-token-rotation | Auth & Session | 95.3 |
| sec-encryption-pipeline | Crypto Utils | 95.7 |
| sec-cookie-policy-validator | Auth & Session | 97.1 |
| sec-api-key-scope-checker | Access Control | 99.5 |
| sec-csrf-token-manager | Auth & Session | 99.5 |
| sec-insecure-config-scanner | Detection & Analysis | 99.5 |
| sec-rate-limit-engine | Traffic Protection | 99.5 |
| sec-safe-redirect-builder | Sanitization | 99.5 |
| sec-session-fixation-detector | Auth & Session | 99.5 |
| sec-tenant-isolation-checker | Access Control | 99.5 |
| sec-file-upload-validator | Sanitization | 100.0 |
| sec-hostname-allowlist-validator | Sanitization | 100.0 |
| sec-html-entity-encoder | Sanitization | 100.0 |
| sec-url-sanitizer | Sanitization | 100.0 |

30 tasks · Sorted by score (lowest first) · Hidden = adversarial edge case pass rate
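
The report does not state how task scores are aggregated. The Python sketch below assumes an unweighted mean per domain and overall, which reproduces the published domain scores and the 76.3 overall score from the task list. It also assumes a pass threshold of 99.5, which is only an observation: exactly 11 tasks score 99.5 or higher, matching the reported pass count. The names `RESULTS` and `summarize` are illustrative.

```python
from collections import defaultdict

# (task, domain, score) rows copied from the task results above.
RESULTS = [
    ("sec-crypto-utils", "Crypto Utils", 9.5),
    ("sec-oauth-state-validator", "Auth & Session", 9.5),
    ("sec-ssrf-detector", "Detection & Analysis", 9.5),
    ("sec-dependency-risk-classifier", "Detection & Analysis", 14.2),
    ("sec-vulnerability-scanner", "Detection & Analysis", 17.3),
    ("sec-jwt-validator", "Auth & Session", 27.7),
    ("sec-password-strength", "Auth & Session", 45.8),
    ("sec-auth-log-anomaly-detector", "Detection & Analysis", 62.2),
    ("sec-secret-detector", "Detection & Analysis", 70.0),
    ("sec-input-sanitizer", "Sanitization", 87.3),
    ("sec-csp-nonce-validator", "Detection & Analysis", 87.8),
    ("sec-csp-parser", "Detection & Analysis", 88.3),
    ("sec-sql-injection-detector", "Detection & Analysis", 92.2),
    ("sec-abac-rule-engine", "Access Control", 94.2),
    ("sec-permission-checker", "Access Control", 94.2),
    ("sec-access-control-engine", "Access Control", 94.9),
    ("sec-refresh-token-rotation", "Auth & Session", 95.3),
    ("sec-encryption-pipeline", "Crypto Utils", 95.7),
    ("sec-cookie-policy-validator", "Auth & Session", 97.1),
    ("sec-api-key-scope-checker", "Access Control", 99.5),
    ("sec-csrf-token-manager", "Auth & Session", 99.5),
    ("sec-insecure-config-scanner", "Detection & Analysis", 99.5),
    ("sec-rate-limit-engine", "Traffic Protection", 99.5),
    ("sec-safe-redirect-builder", "Sanitization", 99.5),
    ("sec-session-fixation-detector", "Auth & Session", 99.5),
    ("sec-tenant-isolation-checker", "Access Control", 99.5),
    ("sec-file-upload-validator", "Sanitization", 100.0),
    ("sec-hostname-allowlist-validator", "Sanitization", 100.0),
    ("sec-html-entity-encoder", "Sanitization", 100.0),
    ("sec-url-sanitizer", "Sanitization", 100.0),
]


def summarize(results, pass_threshold=99.5):
    """Return (domain means, overall mean, pass count) under the assumptions above."""
    by_domain = defaultdict(list)
    for _task, domain, score in results:
        by_domain[domain].append(score)
    domain_means = {
        domain: round(sum(scores) / len(scores), 1)
        for domain, scores in by_domain.items()
    }
    overall = round(sum(score for _, _, score in results) / len(results), 1)
    passed = sum(1 for _, _, score in results if score >= pass_threshold)
    return domain_means, overall, passed
```

Under these assumptions, the 11 passing tasks are exactly those at 99.5 or 100.0, so a task with one failing hidden case can score in the 90s and still count as failed.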