# AI Coding Benchmark
## Model Rankings
17 AI coding models tested across 130+ real-world tasks spanning algorithms, debugging, refactoring, generation, UI, and security — the categories that matter for vibe coding and agentic workflows.
Updated 2026-03-21
| Rank | Model | Version | Overall | Algorithms | Debugging | Refactoring | Generation | UI | Security |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | CLI 130 | 95.5 | 87.6 | 98.9 | 96.4 | 97.9 | 97.0 | 89.7 |
| 2 | GPT-5.4 Mini | CLI 130 | 94.8 | 87.3 | 99.0 | 96.4 | 97.6 | 94.4 | 88.4 |
| 3 | GPT-5.4 Nano | CLI 130 | 92.9 | 84.3 | 97.8 | 96.0 | 98.3 | 90.1 | 85.2 |
| 4 | GPT-4.1 | v1 | 91.8 | 92.1 | 92.7 | 93.8 | 91.9 | 92.4 | 88.9 |
| 5 | Qwen 3.5 35B-A3B | v2 | 91.7 | 88.1 | 94.7 | 96.0 | 87.3 | 93.5 | 86.0 |
| 6 | Claude Sonnet 4.5 | v1 | 90.7 | 87.2 | 89.6 | 92.5 | 93.1 | 90.4 | 90.9 |
| 7 | Qwen 3.5 122B-A10B | v2 | 90.0 | 76.9 | 94.9 | 94.1 | 87.4 | 92.5 | 86.7 |
| 8 | o3-mini | v1 | 89.6 | 91.2 | 90.3 | 91.4 | 89.8 | 88.2 | 86.5 |
| 9 | Qwen 3.5 27B | v2 | 89.5 | 80.1 | 94.5 | 93.2 | 83.0 | 92.2 | 86.9 |
| 10 | Gemini 2.5 Pro | v1 | 88.9 | 86.9 | 89.8 | 90.6 | 88.4 | 89.3 | 89.0 |
| 11 | Qwen 3.5 Flash (02-23) | v2 | 86.9 | 84.2 | 87.0 | 89.5 | 86.5 | 90.8 | 78.6 |
| 12 | Grok 4 | v1 | 86.2 | 83.7 | 87.1 | 85.4 | 84.9 | 87.6 | 88.1 |
| 13 | DeepSeek R1 | v1 | 84.5 | 86.2 | 88.8 | 84.3 | 82.1 | 80.8 | 81.4 |
| 14 | Qwen3 Coder 480B | v1 | 82.7 | 83.6 | 84.9 | 83.4 | 81.2 | 82.0 | 80.3 |
| 15 | Llama 4 Maverick | v1 | 80.4 | 79.2 | 81.1 | 79.7 | 80.2 | 82.3 | 80.8 |
| 16 | Qwen3.5 397B A17B | v1 | 60.1 | 51.6 | 72.0 | 78.7 | 51.3 | 50.5 | 48.1 |
| 17 | Qwen3.5 Plus 2026-02-15 | v1 | 59.2 | 51.4 | 71.9 | 74.8 | 51.2 | 50.3 | 47.2 |
17 models · Ranked by overall score