BS Benchmark v2
Do AI models push back on nonsensical premises, or do they confidently invent answers? 100 tasks across 5 domains (finance, legal, medical, physics, software), each seeded with made-up jargon or a reversed relationship. An LLM judge rates each response as pushback, partial, or accepted.
Updated 2026-04-20
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 `anthropic/claude-opus-4-6` | 95.0 |
| 2 | Claude Sonnet 4.6 `anthropic/claude-sonnet-4-6` | 91.5 |
| 3 | GPT-5.4 `openai/gpt-5.4` | 91.5 |
| 4 | Grok 4.20 `x-ai/grok-4.20-beta` | 82.5 |
| 5 | GPT-5.4 Mini `openai/gpt-5.4-mini` | 78.5 |
| 6 | Claude Opus 4.7 `openrouter/anthropic/claude-opus-4.7` | 75.5 |
| 7 | Grok 4.20 Reasoning `x-ai/grok-4.20-reasoning` | 74.0 |
| 8 | Kimi K2.6 `openrouter/moonshotai/kimi-k2.6` | 69.5 |
| 9 | Gemini 3.1 Pro `google/gemini-3.1-pro-preview` | 66.5 |
| 10 | Kimi K2.5 `openrouter/moonshotai/kimi-k2.5` | 65.5 |
| 11 | GLM 5V Turbo `openrouter/z-ai/glm-5v-turbo` | 65.5 |
| 12 | MiniMax M2.7 `minimax/MiniMax-M2.7` | 47.0 |
| 13 | GLM 5.1 `z-ai/glm-5.1` | 36.5 |
13 models tested · 100 tasks · 5 domains · Score = pushback + 0.5 × partial
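The scoring rule above can be sketched in a few lines of Python. This is a hypothetical reconstruction from the stated formula, not the benchmark's actual code: each of the 100 tasks gets one judge label, a full pushback is worth 1 point, a partial is worth 0.5, and an accepted premise is worth 0, so the total is already a score out of 100.

```python
from collections import Counter

def bs_score(labels: list[str]) -> float:
    """Score = (# pushback) + 0.5 * (# partial), over 100 judge labels."""
    counts = Counter(labels)
    return counts["pushback"] + 0.5 * counts["partial"]

# e.g. 90 pushback + 10 partial + 0 accepted gives 95.0,
# matching the top score in the table above.
labels = ["pushback"] * 90 + ["partial"] * 10
print(bs_score(labels))  # → 95.0
```

Under this rule, a model that accepted every false premise would score 0.0, and one that pushed back on all 100 tasks would score 100.0.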