MODEL WARS

THE GAP IS CLOSING

Open source was 28 MMLU points behind the frontier in 2023. It's 6 points behind today. The capability gap is almost gone. The cost gap is permanent. This changes everything about how you build.

● Open Source   ● Closed

THE CAPABILITY GAP — MMLU SCORE

Open source closed 22 points of the gap in 3 years. The gap was 28 points in 2023 Q1. Now it is 6 points.

2023 Q1: Llama 1 vs GPT-4
Open: 62 | Closed: 90 | Δ 28
2023 Q3: Llama 2 closes gap slightly
Open: 68 | Closed: 91 | Δ 23
2024 Q1: Mixtral 8x7B shocks everyone
Open: 76 | Closed: 93 | Δ 17
2024 Q3: Llama 3.1 405B near-frontier
Open: 82 | Closed: 93 | Δ 11
2025 Q1: DeepSeek R1 matches o1 on reasoning
Open: 87 | Closed: 95 | Δ 8
2025 Q3: Qwen 3.5 runs on MacBook Pro
Open: 89 | Closed: 95 | Δ 6
2026 Q1: Llama 4, Qwen 3.5 122B at near-parity
Open: 91 | Closed: 97 | Δ 6
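
The deltas in the timeline can be recomputed directly from the listed scores. The following is a minimal sketch in Python; the scores are copied from the timeline, while the names `TIMELINE`, `gaps`, and `points_closed` are illustrative and not part of the original.

```python
# MMLU scores copied from the timeline: (period, open score, closed score).
TIMELINE = [
    ("2023 Q1", 62, 90),
    ("2023 Q3", 68, 91),
    ("2024 Q1", 76, 93),
    ("2024 Q3", 82, 93),
    ("2025 Q1", 87, 95),
    ("2025 Q3", 89, 95),
    ("2026 Q1", 91, 97),
]


def gaps(timeline):
    """Return (period, closed minus open) for each timeline entry."""
    return [(period, closed - open_) for period, open_, closed in timeline]


def points_closed(timeline):
    """Return how far the gap narrowed between the first and last entry."""
    per_period = gaps(timeline)
    return per_period[0][1] - per_period[-1][1]
```

With the figures as listed, the gap goes from 28 points to 6, so `points_closed(TIMELINE)` comes out at 22.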

MODEL COMPARISON

Llama 4 Maverick [OPEN] [SELF-HOST]
Meta | $0.30/1M output tokens
MMLU: 89/100 | SWE-BENCH: 46%

Best open-weight for self-hosting. Run locally, zero API cost.

Qwen 3.5 122B [OPEN] [SELF-HOST]
Alibaba | $0.25/1M output tokens
MMLU: 84/100 | SWE-BENCH: 42%

Best open-weight coding. Rivals GPT-4o when self-hosted.

Mistral Large [OPEN]
Mistral | $2/1M output tokens
MMLU: 81/100 | SWE-BENCH: 35%

EU-based, GDPR-native. Available via API and self-host.

DeepSeek V3 [OPEN] [SELF-HOST]
DeepSeek | $0.80/1M output tokens
MMLU: 88/100 | SWE-BENCH: 49%

Near-frontier coding at open-source cost. No vision.

Claude Sonnet 4.6 [CLOSED]
Anthropic | $15/1M output tokens
MMLU: 90/100 | SWE-BENCH: 65%

Best closed model for agentic coding and writing.

GPT-5.4 [CLOSED]
OpenAI | $10/1M output tokens
MMLU: 91/100 | SWE-BENCH: 49%

Default closed model. Strong vision, speed, ecosystem.

Gemini 3.1 Pro [CLOSED]
Google | $7/1M output tokens
MMLU: 89/100 | SWE-BENCH: 44%

1M context. Best for large document processing.
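
To see what the listed prices mean at a given volume, here is a rough sketch in Python. The prices are the per-1M-output-token figures from the cards above. The monthly volume in the usage note is a hypothetical number chosen for illustration, and the calculation deliberately ignores input-token pricing and any hardware or operations cost of self-hosting, neither of which the comparison states.

```python
# Output-token prices in USD per 1M tokens, copied from the model cards.
PRICE_PER_M_OUT = {
    "Llama 4 Maverick": 0.30,
    "Qwen 3.5 122B": 0.25,
    "Mistral Large": 2.00,
    "DeepSeek V3": 0.80,
    "Claude Sonnet 4.6": 15.00,
    "GPT-5.4": 10.00,
    "Gemini 3.1 Pro": 7.00,
}


def monthly_cost(model, output_tokens_per_month):
    """Estimated monthly spend in USD on output tokens only (assumption)."""
    return PRICE_PER_M_OUT[model] * output_tokens_per_month / 1_000_000


def cost_ratio(expensive, cheap):
    """How many times more the first model costs per output token."""
    return PRICE_PER_M_OUT[expensive] / PRICE_PER_M_OUT[cheap]
```

For a hypothetical 500M output tokens per month, `monthly_cost("GPT-5.4", 500_000_000)` is about $5,000 against about $150 for Llama 4 Maverick, a ratio of roughly 33x on output tokens alone.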

WHEN TO USE WHICH

Cost-sensitive high volume
Self-hosted Llama 4 at ~$0.30/1M vs GPT-5.4 at $10/1M, roughly 33x cheaper per output token. The maths wins.
Verdict: Open

Data privacy / on-premise
Zero data leaves your infra. Non-negotiable for healthcare, legal, gov.
Verdict: Open

Latest frontier capability
Closed models still lead by 5-10 points on most benchmarks.
Verdict: Closed

Agentic coding pipelines
Claude Sonnet for complex multi-step work. The SWE-bench delta is still real.
Verdict: Closed

EU / GDPR compliance
Mistral + self-hosted open models. No transatlantic data transfer.
Verdict: Open

Fine-tuning on your data
Closed models: limited to vendor-hosted fine-tuning where offered, with no access to weights. Open: fine-tune locally or via API.
Verdict: Open

Consumer-facing apps
Managed uptime, safety filters, content policy: closed handles it.
Verdict: Closed

Research / experimentation
Access weights, probe internals, custom RLHF. Closed is a black box.
Verdict: Open
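
The guide above can be encoded as a simple lookup. The sketch below is one possible reading, not a prescribed method: the keys in `RECOMMENDATION` are hypothetical identifiers for the eight use cases, and the majority-vote rule for combining several requirements (with ties reported as "mixed") is an assumption added for illustration.

```python
from collections import Counter

# Verdicts copied from the use-case guide; key names are illustrative.
RECOMMENDATION = {
    "cost_sensitive_high_volume": "open",
    "data_privacy_on_premise": "open",
    "latest_frontier_capability": "closed",
    "agentic_coding_pipelines": "closed",
    "eu_gdpr_compliance": "open",
    "fine_tuning_on_your_data": "open",
    "consumer_facing_apps": "closed",
    "research_experimentation": "open",
}


def recommend(requirements):
    """Majority vote over the listed verdicts (assumed rule).

    Returns "open", "closed", or "mixed" when the votes are tied.
    Raises KeyError for a requirement that is not in the guide.
    """
    votes = Counter(RECOMMENDATION[r] for r in requirements)
    if votes["open"] > votes["closed"]:
        return "open"
    if votes["closed"] > votes["open"]:
        return "closed"
    return "mixed"
```

For example, `recommend(["data_privacy_on_premise", "fine_tuning_on_your_data"])` returns "open", while a project that needs both GDPR compliance and agentic coding pipelines comes back as "mixed", which in practice means weighing the hard constraint (such as data residency) over the preference.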