Grok 4.20 Reasoning
x-ai/grok-4.20-reasoning
Median throughput: 237.7 tok/s
Throughput runs: 9
TTFT runs: 6
Avg TTFT: 1502 ms
Avg throughput: 225.7 tok/s
Total cost: $0.1243
Commentary
by openai/gpt-5.4-mini
Grok 4.20 Reasoning is reliable on BridgeBench, with a 100.0% success rate and no prompt-level failures, but startup latency is relatively high: median TTFT is 1497 ms and average TTFT is 1502 ms. Sustained decode performance is solid at 237.7 tok/s median throughput and 225.7 tok/s average throughput, with moderate variance across prompt types. Total cost for the 15-run set is low at $0.124308.
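The headline figures can be recomputed from the All Runs table. The sketch below is a minimal check, not BridgeBench's own code; it assumes the TTFT summary uses only the six `ttft`-type runs, which is an inference from the "TTFT Runs: 6" card rather than something the page states.

```python
from statistics import mean, median

# Tok/s for the nine throughput runs, copied from the All Runs table.
throughput_tok_s = [258.7, 257.6, 272.6, 215.5, 263.1, 237.7, 190.3, 172.6, 163.3]

# TTFT in ms for the six ttft-type runs.
# Assumption: the throughput runs' TTFT values are excluded from the summary.
ttft_ms = [1442, 1551, 2037, 2129, 932, 922]

median_throughput = median(throughput_tok_s)
avg_throughput = mean(throughput_tok_s)
median_ttft = median(ttft_ms)  # even count: mean of the two middle values
avg_ttft = mean(ttft_ms)

print(f"median throughput: {median_throughput:.1f} tok/s")
print(f"avg throughput:    {avg_throughput:.1f} tok/s")
print(f"median TTFT:       {median_ttft:.1f} ms")
print(f"avg TTFT:          {avg_ttft:.1f} ms")
```

Under that assumption the median TTFT comes out at 1496.5 ms, which matches the reported 1497 ms after rounding up.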
API Design is the strongest throughput case at 258.7 tok/s median with no issues, indicating the model can sustain high decode speed on long-form technical generation. The 2666 average output tokens suggest it holds performance well over extended generations.
Data Structures throughput is still strong at 237.7 tok/s median, matching the overall median, with no failures or anomalies. Output length is shorter than API Design, but the model remains stable and consistent.
Essay is the weakest sustained-throughput prompt at 172.6 tok/s median, a notable drop versus the other throughput tasks. The slower rate on a technical essay suggests prose generation is more decode-limited for this model, even though essay outputs (about 1877 tokens on average) are shorter than the API Design outputs.
Definition TTFT is relatively slow at 1551 ms median, which is worse than the factual TTFT case and close to the overall TTFT average. The very short 71-token outputs mean startup latency dominates the user-visible delay here.
Factual is the best startup case with 932 ms median TTFT, indicating the model can begin responding quickly on short factual prompts. The 13-token average output keeps decode cost minimal, so this prompt is mostly a pure latency test.
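The per-prompt medians and average output lengths quoted above can be reproduced by grouping the All Runs rows by prompt ID. This is an illustrative sketch; the `summarize` helper and the dictionary layout are hypothetical, and only the numbers come from the table.

```python
from statistics import mean, median

# (tok/s, output tokens) for each throughput run, from the All Runs table.
throughput_runs = {
    "throughput-api-design": [(258.7, 2708), (257.6, 2489), (272.6, 2801)],
    "throughput-data-structures": [(215.5, 1851), (263.1, 2101), (237.7, 1950)],
    "throughput-essay": [(190.3, 1844), (172.6, 2011), (163.3, 1777)],
}

# (TTFT in ms, output tokens) for each ttft-type run.
ttft_runs = {
    "ttft-definition": [(1442, 83), (1551, 72), (2037, 58)],
    "ttft-factual": [(2129, 17), (932, 11), (922, 11)],
}


def summarize(runs):
    """Map prompt ID -> (median of the metric, mean output tokens)."""
    return {
        prompt: (
            median([metric for metric, _ in rows]),
            mean([tokens for _, tokens in rows]),
        )
        for prompt, rows in runs.items()
    }


throughput_summary = summarize(throughput_runs)
ttft_summary = summarize(ttft_runs)

for prompt, (med, avg_tokens) in {**throughput_summary, **ttft_summary}.items():
    print(f"{prompt}: median={med}, avg tokens={avg_tokens:.0f}")
```

With three runs per prompt, the median is simply the middle run, so a single outlier such as the 2129 ms factual run does not move it.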
Notable Prompts
API Design: Highest sustained throughput at 258.7 tok/s with no issues, making it the clearest long-generation strength.
Essay: Lowest throughput at 172.6 tok/s, showing the biggest slowdown on extended prose generation.
Factual: Fastest startup at 932 ms median TTFT, so short factual prompts get the best perceived responsiveness.
Definition: Median TTFT of 1551 ms is materially slower than the factual case, indicating startup latency is sensitive to prompt shape.
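A rough estimate shows why startup latency dominates on the short-output prompts. The sketch below models total latency as TTFT plus decode time. It assumes the short outputs decode at the overall median of 237.7 tok/s, which the table does not report for TTFT runs (they show n/a), so the result is an approximation and not a measurement. `ttft_share` is a hypothetical helper.

```python
def ttft_share(ttft_ms: float, output_tokens: float, assumed_tok_s: float) -> float:
    """Fraction of estimated total latency spent waiting for the first token."""
    decode_ms = output_tokens / assumed_tok_s * 1000.0
    return ttft_ms / (ttft_ms + decode_ms)


# Assumption: short outputs decode at the overall median throughput.
ASSUMED_TOK_S = 237.7

# Median TTFT and average output tokens per prompt, from the commentary.
definition_share = ttft_share(1551, 71, ASSUMED_TOK_S)
factual_share = ttft_share(932, 13, ASSUMED_TOK_S)

print(f"definition: {definition_share:.0%} of estimated latency is TTFT")
print(f"factual:    {factual_share:.0%} of estimated latency is TTFT")
```

Under this assumption, TTFT accounts for well over 80% of the estimated delay on both prompts, so any improvement in decode speed would barely change the perceived response time for them.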
All Runs
| Prompt | Type | Tok/s | TTFT | Tokens | Cost |
|---|---|---|---|---|---|
| 1. API Design (throughput-api-design) | throughput | 258.7 | 16000 ms | 2708 | $0.0167 |
| 2. API Design (throughput-api-design) | throughput | 257.6 | 15540 ms | 2489 | $0.0154 |
| 3. API Design (throughput-api-design) | throughput | 272.6 | 16572 ms | 2801 | $0.0172 |
| 1. Data Structures (throughput-data-structures) | throughput | 215.5 | 9373 ms | 1851 | $0.0116 |
| 2. Data Structures (throughput-data-structures) | throughput | 263.1 | 11275 ms | 2101 | $0.0131 |
| 3. Data Structures (throughput-data-structures) | throughput | 237.7 | 10649 ms | 1950 | $0.0122 |
| 1. Essay (throughput-essay) | throughput | 190.3 | 13434 ms | 1844 | $0.0115 |
| 2. Essay (throughput-essay) | throughput | 172.6 | 10228 ms | 2011 | $0.0125 |
| 3. Essay (throughput-essay) | throughput | 163.3 | 10107 ms | 1777 | $0.0111 |
| 1. Definition (ttft-definition) | ttft | n/a | 1442 ms | 83 | $0.0008 |
| 2. Definition (ttft-definition) | ttft | n/a | 1551 ms | 72 | $0.0007 |
| 3. Definition (ttft-definition) | ttft | n/a | 2037 ms | 58 | $0.0006 |
| 1. Factual (ttft-factual) | ttft | n/a | 2129 ms | 17 | $0.0004 |
| 2. Factual (ttft-factual) | ttft | n/a | 932 ms | 11 | $0.0003 |
| 3. Factual (ttft-factual) | ttft | n/a | 922 ms | 11 | $0.0003 |
15 runs · Throughput rows require valid long-output runs · TTFT shown for all successful runs