BridgeBench
Speed
Model Analysis

Grok 4.20 Reasoning

x-ai/grok-4.20-reasoning

237.7 median tok/s
1497ms TTFT
100.0% success

Throughput Runs: 9
TTFT Runs: 6
Avg TTFT: 1502ms
Avg Throughput: 225.7 tok/s
Total Cost: $0.1243

Commentary

by openai/gpt-5.4-mini

Grok 4.20 Reasoning is reliable on BridgeBench, with a 100.0% success rate and no prompt-level failures, but startup latency is relatively high: median TTFT is 1497 ms and average TTFT is 1502 ms. Sustained decode performance is solid at 237.7 tok/s median throughput and 225.7 tok/s average throughput, with moderate variance across prompt types; cost is low at $0.124308 for the run set.
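The headline figures can be cross-checked against the per-run values in the All Runs table on this page. The snippet below is a minimal sketch of that arithmetic, not BridgeBench's own aggregation code; the variable names are illustrative, and it assumes the summary uses a plain median and arithmetic mean over the 9 throughput runs and the 6 TTFT runs.

```python
from statistics import mean, median

# Per-run values copied from the "All Runs" table on this page.
throughput_tok_s = [258.7, 257.6, 272.6, 215.5, 263.1, 237.7, 190.3, 172.6, 163.3]
ttft_ms = [1442, 1551, 2037, 2129, 932, 922]  # ttft-definition and ttft-factual runs only

# Assumption: plain median and arithmetic mean, no trimming or weighting.
summary = {
    "median_tok_s": median(throughput_tok_s),
    "avg_tok_s": round(mean(throughput_tok_s), 1),
    "median_ttft_ms": median(ttft_ms),
    "avg_ttft_ms": round(mean(ttft_ms)),
}
print(summary)
```

With an even number of TTFT runs, the median is the midpoint of the two middle values (1442 ms and 1551 ms), which is 1496.5 ms; the page reports it as 1497 ms.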

API Design (throughput)

This is the strongest throughput case at 258.7 tok/s median with no issues, indicating the model can sustain high decode speed on long-form technical generation. The 2666 average output tokens suggest it holds performance well over extended generations.

Data Structures (throughput)

Throughput is still strong at 237.7 tok/s median, identical to the overall median, with no failures. Individual runs range from 215.5 to 263.1 tok/s, a wider spread than on API Design (257.6 to 272.6 tok/s). Output length is shorter than API Design.

Essay (throughput)

This is the weakest sustained-throughput prompt at 172.6 tok/s median, a notable drop versus the other throughput tasks. The slower rate on a technical essay suggests longer-form prose generation is more decode-limited for this model.

Definition (TTFT)

TTFT is relatively slow at 1551 ms median, which is worse than the factual TTFT case and close to the overall TTFT average. The very short 71-token outputs mean startup latency dominates the user-visible delay here.

Factual (TTFT)

This is the best startup case with 932 ms median TTFT, indicating the model can begin responding quickly on short factual prompts. The 13-token average output keeps decode cost minimal, so this prompt is mostly a pure latency test.
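To put a rough number on how much startup latency dominates these short responses, the sketch below estimates the share of total response time spent waiting for the first token. It rests on a loud assumption: the TTFT runs report tok/s as n/a, so the decode rate is borrowed from the overall median throughput (237.7 tok/s) rather than measured. The helper `ttft_share` is hypothetical and not part of BridgeBench.

```python
# Assumption: short answers decode at the overall median rate (237.7 tok/s).
# The TTFT runs report tok/s as n/a, so this rate is borrowed, not measured.
ASSUMED_TOK_S = 237.7


def ttft_share(ttft_ms: float, output_tokens: int, tok_s: float = ASSUMED_TOK_S) -> float:
    """Estimated fraction of total response time spent before the first token."""
    decode_ms = output_tokens / tok_s * 1000  # rough decode time for the whole output
    return ttft_ms / (ttft_ms + decode_ms)


definition_share = ttft_share(1551, 71)  # median TTFT, average output tokens
factual_share = ttft_share(932, 13)
print(f"definition: {definition_share:.0%}, factual: {factual_share:.0%}")
```

Under that assumption, startup accounts for roughly 84% of the estimated response time on the definition prompt and roughly 94% on the factual prompt.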

Notable Prompts

API Design (throughput)

Highest sustained throughput at 258.7 tok/s with no issues, making it the clearest long-generation strength.

Essay (throughput)

Lowest throughput at 172.6 tok/s, showing the biggest slowdown on extended prose generation.

Factual (TTFT)

Fastest startup at 932 ms median TTFT, so short factual prompts get the best perceived responsiveness.

Definition (TTFT)

Median TTFT of 1551 ms is materially slower than the factual case, indicating startup latency is sensitive to prompt shape.
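The per-prompt rankings above follow from taking the median of each prompt's three runs. A minimal sketch of that grouping, using the values from the All Runs table; the dictionary layout and variable names are illustrative, not BridgeBench's data format:

```python
from statistics import median

# Per-run values copied from the "All Runs" table; the grouping is illustrative.
tok_s_runs = {
    "api-design": [258.7, 257.6, 272.6],
    "data-structures": [215.5, 263.1, 237.7],
    "essay": [190.3, 172.6, 163.3],
}
ttft_ms_runs = {
    "definition": [1442, 1551, 2037],
    "factual": [2129, 932, 922],
}

tok_s_median = {name: median(runs) for name, runs in tok_s_runs.items()}
ttft_median = {name: median(runs) for name, runs in ttft_ms_runs.items()}

fastest_decode = max(tok_s_median, key=tok_s_median.get)
slowest_decode = min(tok_s_median, key=tok_s_median.get)
quickest_start = min(ttft_median, key=ttft_median.get)

# Relative slowdown of the slowest throughput prompt versus the fastest.
slowdown_pct = (1 - tok_s_median[slowest_decode] / tok_s_median[fastest_decode]) * 100
print(fastest_decode, slowest_decode, quickest_start, round(slowdown_pct, 1))
```

The essay median comes out about a third below the API Design median. Note that the median hides the first factual run (2129 ms), which was slower than every definition run.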

All Runs

Run  Prompt           Prompt ID                   Type        Tok/s  TTFT
1    API Design       throughput-api-design       throughput  258.7  16000ms
2    API Design       throughput-api-design       throughput  257.6  15540ms
3    API Design       throughput-api-design       throughput  272.6  16572ms
1    Data Structures  throughput-data-structures  throughput  215.5  9373ms
2    Data Structures  throughput-data-structures  throughput  263.1  11275ms
3    Data Structures  throughput-data-structures  throughput  237.7  10649ms
1    Essay            throughput-essay            throughput  190.3  13434ms
2    Essay            throughput-essay            throughput  172.6  10228ms
3    Essay            throughput-essay            throughput  163.3  10107ms
1    Definition       ttft-definition             ttft        n/a    1442ms
2    Definition       ttft-definition             ttft        n/a    1551ms
3    Definition       ttft-definition             ttft        n/a    2037ms
1    Factual          ttft-factual                ttft        n/a    2129ms
2    Factual          ttft-factual                ttft        n/a    932ms
3    Factual          ttft-factual                ttft        n/a    922ms

15 runs · Throughput rows require valid long-output runs · TTFT shown for all successful runs