MODEL WARS
THE GAP IS CLOSING
Open-source models were 28 MMLU points behind the frontier in 2023. They are 6 points behind today. The capability gap is almost gone. The cost gap is permanent. This changes everything about how you build.
THE CAPABILITY GAP — MMLU SCORE
Open source has closed 22 points of the gap in 3 years. The gap was 28 points in 2023. It is 6 points now.
Period | Milestone | Open | Closed | Gap (Δ)
2023 Q1 | Llama 1 vs GPT-4 | 62 | 90 | 28
2023 Q3 | Llama 2 closes gap slightly | 68 | 91 | 23
2024 Q1 | Mixtral 8x7B shocks everyone | 76 | 93 | 17
2024 Q3 | Llama 3.1 405B near-frontier | 82 | 93 | 11
2025 Q1 | DeepSeek R1 matches o1 on reasoning | 87 | 95 | 8
2025 Q3 | Qwen 3.5 runs on MacBook Pro | 89 | 95 | 6
2026 Q1 | Llama 4, Qwen 3.5 122B at near-parity | 91 | 97 | 6
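Each Δ in the timeline is the closed score minus the open score. As a quick check, here is a minimal Python sketch that recomputes the gaps using only the figures listed above. The names `TIMELINE`, `gaps`, and `points_closed` are illustrative and do not come from the source.

```python
# (period, open MMLU, closed MMLU), copied from the timeline above
TIMELINE = [
    ("2023 Q1", 62, 90),
    ("2023 Q3", 68, 91),
    ("2024 Q1", 76, 93),
    ("2024 Q3", 82, 93),
    ("2025 Q1", 87, 95),
    ("2025 Q3", 89, 95),
    ("2026 Q1", 91, 97),
]


def gaps(timeline):
    """Return (period, closed - open) for every row of the timeline."""
    return [(period, closed - open_) for period, open_, closed in timeline]


def points_closed(timeline):
    """Gap in the first period minus gap in the last period."""
    per_period = gaps(timeline)
    return per_period[0][1] - per_period[-1][1]


if __name__ == "__main__":
    for period, gap in gaps(TIMELINE):
        print(f"{period}: gap {gap} pts")
    print(f"Closed {points_closed(TIMELINE)} pts overall")
```

With the listed scores, the gap goes from 28 points to 6 points, so 22 points were closed between 2023 Q1 and 2026 Q1.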
MODEL COMPARISON
Llama 4 Maverick (OPEN, SELF-HOST)
Meta · $0.30 per 1M output tokens
MMLU: 89/100 · SWE-BENCH: 46%
Best open-weight for self-hosting. Run locally, zero API cost.

Qwen 3.5 122B (OPEN, SELF-HOST)
Alibaba · $0.25 per 1M output tokens
MMLU: 84/100 · SWE-BENCH: 42%
Best open-weight coding. Rivals GPT-4o when self-hosted.

Mistral Large (OPEN)
Mistral · $2.00 per 1M output tokens
MMLU: 81/100 · SWE-BENCH: 35%
EU-based, GDPR-native. Available via API and self-host.

DeepSeek V3 (OPEN, SELF-HOST)
DeepSeek · $0.80 per 1M output tokens
MMLU: 88/100 · SWE-BENCH: 49%
Near-frontier coding at open-source cost. No vision.

Claude Sonnet 4.6 (CLOSED)
Anthropic · $15.00 per 1M output tokens
MMLU: 90/100 · SWE-BENCH: 65%
Best closed model for agentic coding and writing.

GPT-5.4 (CLOSED)
OpenAI · $10.00 per 1M output tokens
MMLU: 91/100 · SWE-BENCH: 49%
Default closed model. Strong vision, speed, ecosystem.

Gemini 3.1 Pro (CLOSED)
Google · $7.00 per 1M output tokens
MMLU: 89/100 · SWE-BENCH: 44%
1M context. Best for large document processing.
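To make the cost gap concrete, the following Python sketch turns the listed output prices into a spend estimate for a given token volume. It is a rough illustration only. It assumes "out" on the cards means output tokens, and it ignores input-token pricing and any self-hosting hardware cost. The names `OUTPUT_PRICE_PER_M`, `output_cost`, and `price_ratio`, and the 500M-token monthly volume, are hypothetical and not from the source.

```python
# Output price in USD per 1M tokens, copied from the cards above.
OUTPUT_PRICE_PER_M = {
    "Llama 4 Maverick": 0.30,
    "Qwen 3.5 122B": 0.25,
    "Mistral Large": 2.00,
    "DeepSeek V3": 0.80,
    "Claude Sonnet 4.6": 15.00,
    "GPT-5.4": 10.00,
    "Gemini 3.1 Pro": 7.00,
}


def output_cost(model, output_tokens):
    """USD cost of generating `output_tokens` tokens at the listed price."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def price_ratio(expensive, cheap):
    """How many times more one model's listed output price is than another's."""
    return OUTPUT_PRICE_PER_M[expensive] / OUTPUT_PRICE_PER_M[cheap]


if __name__ == "__main__":
    monthly_tokens = 500_000_000  # hypothetical workload: 500M output tokens
    for model in sorted(OUTPUT_PRICE_PER_M, key=OUTPUT_PRICE_PER_M.get):
        print(f"{model:<18} ${output_cost(model, monthly_tokens):>10,.2f} / month")
    ratio = price_ratio("Claude Sonnet 4.6", "Llama 4 Maverick")
    print(f"Claude Sonnet 4.6 vs Llama 4 Maverick: {ratio:.0f}x")
```

At the listed prices, Claude Sonnet 4.6 output costs 50 times as much as Llama 4 Maverick output (15 / 0.3), while their MMLU scores differ by one point.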