Model Analysis

Grok 4.20 Reasoning

x-ai/grok-4.20-reasoning

Median throughput: 237.7 tok/s
Median TTFT: 1497 ms
Success rate: 100.0%

Avg TTFT: 1502 ms
Avg throughput: 225.7 tok/s
Total cost: $0.1243

Commentary

by openai/gpt-5.4-mini

Grok 4.20 Reasoning is reliable on BridgeBench, with a 100.0% success rate and no prompt-level failures, but startup latency is relatively high: across the six TTFT runs, median TTFT is 1497 ms and average TTFT is 1502 ms. Sustained decode performance is solid at 237.7 tok/s median throughput and 225.7 tok/s average throughput, with per-prompt medians ranging from 172.6 to 258.7 tok/s; cost is low at $0.124308 for the 15 runs.
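The headline figures can be re-derived from the per-run values in the All Runs table. The sketch below is illustrative only. It assumes the throughput statistics cover the nine throughput-type runs and the TTFT statistics cover only the six ttft-type runs; that split is an inference that matches the reported numbers, not a documented BridgeBench rule. The names `summarize`, `throughput_tok_s`, and `ttft_ms` are made up for this example.

```python
from statistics import mean, median

# Tok/s for the nine throughput-type runs, copied from the All Runs table.
throughput_tok_s = [258.7, 257.6, 272.6, 215.5, 263.1, 237.7, 190.3, 172.6, 163.3]

# TTFT in ms for the six ttft-type runs only (assumption: the headline TTFT
# figures exclude the throughput-type rows).
ttft_ms = [1442, 1551, 2037, 2129, 932, 922]


def summarize(values):
    """Return the median and arithmetic mean of a list of run measurements."""
    return {"median": median(values), "mean": mean(values)}


throughput_summary = summarize(throughput_tok_s)
ttft_summary = summarize(ttft_ms)

print(f"throughput: median {throughput_summary['median']:.1f}, "
      f"mean {throughput_summary['mean']:.1f} tok/s")
print(f"TTFT: median {ttft_summary['median']:.1f}, "
      f"mean {ttft_summary['mean']:.1f} ms")
```

Under these assumptions the throughput values come out at 237.7 and about 225.7 tok/s, and the TTFT values at 1496.5 and about 1502.2 ms, consistent with the rounded figures reported above.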

API Design (throughput)

This is the strongest throughput case at 258.7 tok/s median with no issues, indicating the model can sustain high decode speed on long-form technical generation. The 2666 average output tokens suggest it holds performance well over extended generations.

Data Structures (throughput)

Throughput is still strong at 237.7 tok/s median, matching the overall median, with no failures. Outputs are shorter than API Design (about 1967 tokens on average versus 2666), and the three runs span 215.5 to 263.1 tok/s, the widest spread among the throughput prompts.

Essay (throughput)

This is the weakest sustained-throughput prompt at 172.6 tok/s median, roughly a third below the API Design median. Essay outputs (about 1877 tokens on average) are no longer than those of the other throughput prompts, so the slowdown appears tied to the prose task itself rather than to output length.

Definition (TTFT)

TTFT is relatively slow at 1551 ms median, which is worse than the factual TTFT case and close to the overall TTFT average. The very short 71-token outputs mean startup latency dominates the user-visible delay here.

Factual (TTFT)

This is the best startup case with 932 ms median TTFT, indicating the model can begin responding quickly on short factual prompts, although one of the three runs took 2129 ms. The 13-token average output keeps decode cost minimal, so this prompt is mostly a pure latency test.
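To make the "startup latency dominates" point concrete, the sketch below estimates what share of end-to-end time is spent waiting for the first token. It assumes total time is roughly TTFT plus output tokens divided by decode rate, and it borrows the overall 237.7 tok/s median as the decode rate because the TTFT runs report no tok/s of their own. The function name `ttft_share` is hypothetical.

```python
def ttft_share(ttft_ms, output_tokens, decode_tok_s):
    """Estimate the fraction of total response time spent before the first token.

    Assumes total time = TTFT + output_tokens / decode rate, which ignores
    network jitter and any change in decode speed during the response.
    """
    decode_ms = output_tokens / decode_tok_s * 1000.0
    return ttft_ms / (ttft_ms + decode_ms)


# Assumed decode rate: the overall median throughput from the summary.
ASSUMED_DECODE_TOK_S = 237.7

# Median TTFT and average output tokens per prompt, taken from the commentary.
definition_share = ttft_share(1551, 71, ASSUMED_DECODE_TOK_S)
factual_share = ttft_share(932, 13, ASSUMED_DECODE_TOK_S)

print(f"Definition: {definition_share:.0%} of estimated time is TTFT")
print(f"Factual: {factual_share:.0%} of estimated time is TTFT")
```

With these assumed inputs, TTFT accounts for roughly 84% of the estimated Definition response time and roughly 94% of the Factual one.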

Notable Prompts

API Design (throughput)

Highest sustained throughput at 258.7 tok/s with no issues, making it the clearest long-generation strength.

Essay (throughput)

Lowest throughput at 172.6 tok/s, showing the biggest slowdown on extended prose generation.

Factual (TTFT)

Fastest startup at 932 ms median TTFT, so short factual prompts get the best perceived responsiveness.

Definition (TTFT)

Median TTFT of 1551 ms is materially slower than the 932 ms factual case, suggesting startup latency is sensitive to prompt shape, though each prompt has only three runs.

All Runs

Run	Prompt	Prompt ID	Type	Tok/s	TTFT	Tokens	Cost
1	API Design	throughput-api-design	throughput	258.7	16000ms	2708	$0.0167
2	API Design	throughput-api-design	throughput	257.6	15540ms	2489	$0.0154
3	API Design	throughput-api-design	throughput	272.6	16572ms	2801	$0.0172
1	Data Structures	throughput-data-structures	throughput	215.5	9373ms	1851	$0.0116
2	Data Structures	throughput-data-structures	throughput	263.1	11275ms	2101	$0.0131
3	Data Structures	throughput-data-structures	throughput	237.7	10649ms	1950	$0.0122
1	Essay	throughput-essay	throughput	190.3	13434ms	1844	$0.0115
2	Essay	throughput-essay	throughput	172.6	10228ms	2011	$0.0125
3	Essay	throughput-essay	throughput	163.3	10107ms	1777	$0.0111
1	Definition	ttft-definition	ttft	n/a	1442ms	83	$0.0008
2	Definition	ttft-definition	ttft	n/a	1551ms	72	$0.0007
3	Definition	ttft-definition	ttft	n/a	2037ms	58	$0.0006
1	Factual	ttft-factual	ttft	n/a	2129ms	17	$0.0004
2	Factual	ttft-factual	ttft	n/a	932ms	11	$0.0003
3	Factual	ttft-factual	ttft	n/a	922ms	11	$0.0003

15 runs · Throughput rows require valid long-output runs · TTFT shown for all successful runs
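The per-prompt medians quoted in the commentary can be reproduced by grouping the rows above by prompt. This is a minimal sketch, assuming throughput prompts are summarized by tok/s and ttft prompts by TTFT; the tuple layout and the helper name `per_prompt_medians` are illustrative and not part of BridgeBench.

```python
from collections import defaultdict
from statistics import median

# (prompt, type, tok/s or None, TTFT in ms), copied from the All Runs table.
runs = [
    ("API Design", "throughput", 258.7, 16000),
    ("API Design", "throughput", 257.6, 15540),
    ("API Design", "throughput", 272.6, 16572),
    ("Data Structures", "throughput", 215.5, 9373),
    ("Data Structures", "throughput", 263.1, 11275),
    ("Data Structures", "throughput", 237.7, 10649),
    ("Essay", "throughput", 190.3, 13434),
    ("Essay", "throughput", 172.6, 10228),
    ("Essay", "throughput", 163.3, 10107),
    ("Definition", "ttft", None, 1442),
    ("Definition", "ttft", None, 1551),
    ("Definition", "ttft", None, 2037),
    ("Factual", "ttft", None, 2129),
    ("Factual", "ttft", None, 932),
    ("Factual", "ttft", None, 922),
]


def per_prompt_medians(rows):
    """Median tok/s for throughput prompts, median TTFT (ms) for ttft prompts."""
    grouped = defaultdict(list)
    for prompt, run_type, tok_s, ttft in rows:
        # Pick the metric that matches the run type.
        grouped[(prompt, run_type)].append(tok_s if run_type == "throughput" else ttft)
    return {key: median(values) for key, values in grouped.items()}


medians = per_prompt_medians(runs)
for (prompt, run_type), value in medians.items():
    unit = "tok/s" if run_type == "throughput" else "ms"
    print(f"{prompt} ({run_type}): {value} {unit}")
```

With three runs per prompt, each median is simply the middle run, which is why single slow runs such as the 2129 ms Factual run do not move the reported per-prompt figure.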