Model Analysis
Grok 4.20 (Non-Reasoning)
x-ai/grok-4.20
Overall score: 86.3
Repro: 100.0%
Regress: 100.0%
Diagnose: 19.3%
Tasks: 10
Passed: 10
Failed: 0
Avg latency: 25127 ms
Total cost: $0.0272
Cluster Performance
Test Failure (2 tasks): 85.2
Runtime Exception (2 tasks): 87.6
Incorrect Output (1 task): 88.4
Async Timing (1 task): 87.3
State Mutation (2 tasks): 84.4
Regression After Refactor (2 tasks): 86.3
All Task Results
| Task | Task ID | Cluster | Score | Repro | Hidden | Diagnose | Latency (ms) |
|---|---|---|---|---|---|---|---|
| Deep Clone Shared Reference Bug | debug-deep-clone-v2 | State Mutation | 84.0 | 100 | 100 | 10 | 12314 |
| Safe Number Parsing Coercion Bug | debug-type-coercion-v2 | Test Failure | 84.3 | 100 | 100 | 5 | 42718 |
| Default Merge Mutation Bug | debug-object-mutation-v2 | State Mutation | 84.8 | 100 | 100 | 15 | 16277 |
| LRU Access Order Regression | debug-lru-eviction-v2 | Regression After Refactor | 85.8 | 100 | 100 | 5 | 21168 |
| JSON Parser Escape and Whitespace Bugs | debug-json-parser-v2 | Runtime Exception | 86.0 | 100 | 100 | 30 | 79366 |
| Loop Closure Capture Bug | debug-closure-loop-v2 | Test Failure | 86.1 | 100 | 100 | 18 | 11789 |
| Cycle Detection False Positive Regression | debug-graph-cycle-v2 | Regression After Refactor | 86.8 | 100 | 100 | 35 | 16918 |
| Broken Batch Promise Chain | debug-promise-chain-v2 | Async Timing | 87.3 | 100 | 100 | 15 | 12354 |
| Sliding Window Rate Limiter | debug-rate-limiter-v2 | Incorrect Output | 88.4 | 100 | 100 | 23 | 18758 |
| Flatten Object Null Base Case Bug | debug-recursion-base-case-v2 | Runtime Exception | 89.1 | 100 | 100 | 38 | 19612 |
10 tasks · visible repro, hidden bug, and regression scoring
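
The page does not say how the aggregate figures are derived. The reported overall score, cluster scores, and average latency look consistent with plain unweighted means of the per-task rows, and the sketch below checks that under this assumption. The `TASKS` list and the `cluster_means` helper are illustrative names, not part of the benchmark; the values are copied from the task table above.

```python
from collections import defaultdict
from statistics import mean

# (cluster, score, latency_ms) for each row of the task table.
TASKS = [
    ("State Mutation", 84.0, 12314),
    ("Test Failure", 84.3, 42718),
    ("State Mutation", 84.8, 16277),
    ("Regression After Refactor", 85.8, 21168),
    ("Runtime Exception", 86.0, 79366),
    ("Test Failure", 86.1, 11789),
    ("Regression After Refactor", 86.8, 16918),
    ("Async Timing", 87.3, 12354),
    ("Incorrect Output", 88.4, 18758),
    ("Runtime Exception", 89.1, 19612),
]


def cluster_means(tasks):
    """Group task scores by cluster and return the unweighted mean of each.

    Assumption: clusters are averaged without weighting, which is not
    confirmed by the source page.
    """
    by_cluster = defaultdict(list)
    for cluster, score, _latency in tasks:
        by_cluster[cluster].append(score)
    return {cluster: mean(scores) for cluster, scores in by_cluster.items()}


# Unweighted means over all ten tasks.
overall = mean(score for _cluster, score, _latency in TASKS)
avg_latency_ms = mean(latency for _cluster, _score, latency in TASKS)
clusters = cluster_means(TASKS)

if __name__ == "__main__":
    print(f"overall score: {overall:.2f}")
    print(f"avg latency:   {avg_latency_ms:.1f} ms")
    for name, value in sorted(clusters.items()):
        print(f"{name}: {value:.2f}")
```

Comparing the printed values with the summary figures to one decimal place is enough to see whether the unweighted-mean assumption holds. Any small difference may come from the per-task values being rounded for display.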