BridgeBench
Debugging
Model Analysis

Grok 4.20 (Non-Reasoning)

x-ai/grok-4.20

Overall score · 86.3

100.0% repro · 100.0% regress · 19.3% diagnose

Tasks · 10
Passed · 10
Failed · 0
Avg latency · 25127 ms
Total cost · $0.0272

Cluster Performance

Cluster · Tasks · Score
Test Failure · 2 tasks · 85.2
Runtime Exception · 2 tasks · 87.6
Incorrect Output · 1 task · 88.4
Async Timing · 1 task · 87.3
State Mutation · 2 tasks · 84.4
Regression After Refactor · 2 tasks · 86.3

All Task Results

Task (ID) · Cluster · Score
Deep Clone Shared Reference Bug (debug-deep-clone-v2) · State Mutation · 84.0
Safe Number Parsing Coercion Bug (debug-type-coercion-v2) · Test Failure · 84.3
Default Merge Mutation Bug (debug-object-mutation-v2) · State Mutation · 84.8
LRU Access Order Regression (debug-lru-eviction-v2) · Regression After Refactor · 85.8
JSON Parser Escape and Whitespace Bugs (debug-json-parser-v2) · Runtime Exception · 86.0
Loop Closure Capture Bug (debug-closure-loop-v2) · Test Failure · 86.1
Cycle Detection False Positive Regression (debug-graph-cycle-v2) · Regression After Refactor · 86.8
Broken Batch Promise Chain (debug-promise-chain-v2) · Async Timing · 87.3
Sliding Window Rate Limiter (debug-rate-limiter-v2) · Incorrect Output · 88.4
Flatten Object Null Base Case Bug (debug-recursion-base-case-v2) · Runtime Exception · 89.1

10 tasks · visible repro, hidden bug, and regression scoring
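
The page does not say how cluster scores and the overall score are aggregated. The sketch below checks one plausible reading: that each is an unweighted mean of the per-task scores, rounded to one decimal place. The aggregation rule, the helper names (`cluster_means`, `agrees`), and the tolerance are assumptions made for illustration, not BridgeBench's documented method. The scores are copied from the results above.

```python
from collections import defaultdict

# (task id, cluster, score) copied from the task results on this page.
TASKS = [
    ("debug-deep-clone-v2", "State Mutation", 84.0),
    ("debug-type-coercion-v2", "Test Failure", 84.3),
    ("debug-object-mutation-v2", "State Mutation", 84.8),
    ("debug-lru-eviction-v2", "Regression After Refactor", 85.8),
    ("debug-json-parser-v2", "Runtime Exception", 86.0),
    ("debug-closure-loop-v2", "Test Failure", 86.1),
    ("debug-graph-cycle-v2", "Regression After Refactor", 86.8),
    ("debug-promise-chain-v2", "Async Timing", 87.3),
    ("debug-rate-limiter-v2", "Incorrect Output", 88.4),
    ("debug-recursion-base-case-v2", "Runtime Exception", 89.1),
]

# Cluster scores and overall score as listed on this page.
LISTED = {
    "Test Failure": 85.2,
    "Runtime Exception": 87.6,
    "Incorrect Output": 88.4,
    "Async Timing": 87.3,
    "State Mutation": 84.4,
    "Regression After Refactor": 86.3,
}
LISTED_OVERALL = 86.3


def cluster_means(tasks):
    """Unweighted mean score per cluster (assumed aggregation rule)."""
    by_cluster = defaultdict(list)
    for _task_id, cluster, score in tasks:
        by_cluster[cluster].append(score)
    return {name: sum(vals) / len(vals) for name, vals in by_cluster.items()}


def agrees(computed, listed, tol=0.051):
    """True if `computed` could round to `listed` at one decimal place.

    Listed values carry one decimal, so a gap of up to 0.05 is expected;
    the extra 0.001 absorbs floating-point noise.
    """
    return abs(computed - listed) <= tol


means = cluster_means(TASKS)
overall = sum(score for _, _, score in TASKS) / len(TASKS)

for name, listed in LISTED.items():
    print(f"{name}: mean={means[name]:.2f} listed={listed} "
          f"consistent={agrees(means[name], listed)}")
print(f"Overall: mean={overall:.2f} listed={LISTED_OVERALL} "
      f"consistent={agrees(overall, LISTED_OVERALL)}")
```

Under this assumption every listed cluster score and the overall score are consistent with a plain mean of the task scores. The check does not rule out other weightings that happen to give the same rounded values, and it says nothing about how the repro, regress, and diagnose percentages feed into each task score.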