MODEL WARS

THE GAP IS CLOSING

Open source was 28 MMLU points behind the frontier in early 2023. It's 6 points behind today. The capability gap is almost gone. The cost gap is permanent. This changes everything about how you build.

● Open Source  ● Closed

THE CAPABILITY GAP — MMLU SCORE

Open source has closed 22 points of the gap in 3 years. The gap was 28 points in 2023 Q1. It is 6 points now.

Period   Milestone                                Open  Closed  Gap
2023 Q1  Llama 1 vs GPT-4                           62      90   28
2023 Q3  Llama 2 closes gap slightly                68      91   23
2024 Q1  Mixtral 8x7B shocks everyone               76      93   17
2024 Q3  Llama 3.1 405B near-frontier               82      93   11
2025 Q1  DeepSeek R1 matches o1 on reasoning        87      95    8
2025 Q3  Qwen 3.5 runs on MacBook Pro               89      95    6
2026 Q1  Llama 4, Qwen 3.5 122B at near-parity      91      97    6
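
To make the gap arithmetic easy to check, here is a minimal sketch that recomputes the gap for each period from the open and closed scores listed above. The variable names are illustrative only, and the scores are copied from this page rather than from an independent benchmark run.

```python
# (period, open_score, closed_score), copied from the timeline above
timeline = [
    ("2023 Q1", 62, 90),
    ("2023 Q3", 68, 91),
    ("2024 Q1", 76, 93),
    ("2024 Q3", 82, 93),
    ("2025 Q1", 87, 95),
    ("2025 Q3", 89, 95),
    ("2026 Q1", 91, 97),
]

# Gap per period = closed score minus open score
gaps = [(period, closed - open_) for period, open_, closed in timeline]

# How far the gap narrowed between the first and last period
points_closed = gaps[0][1] - gaps[-1][1]

# How much each side improved over the same window
open_gain = timeline[-1][1] - timeline[0][1]
closed_gain = timeline[-1][2] - timeline[0][2]

for period, gap in gaps:
    print(f"{period}: gap {gap}")
print(f"gap narrowed by {points_closed} points "
      f"(open +{open_gain}, closed +{closed_gain})")
```

On these figures the gap narrows because open models gained far more than closed models did over the same window, not because closed models stood still.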

MODEL COMPARISON

Llama 4 Maverick  OPEN  SELF-HOST

WHEN TO USE WHICH

Cost-sensitive high volume

Self-hosted Llama 4 at ~$0.30 per 1M tokens vs GPT at $10 per 1M tokens. The maths wins.

Open
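
As a rough illustration of why the maths wins at volume, the sketch below applies the two per-token prices quoted above to a monthly token count. The 5B tokens/month volume is a hypothetical workload, not a figure from this page, and the sketch assumes the ~$0.30 rate already covers hardware and operations for the self-hosted setup.

```python
# Prices in USD per 1M tokens, as quoted in the card above
OPEN_PRICE_PER_M = 0.30     # self-hosted Llama 4 (approximate)
CLOSED_PRICE_PER_M = 10.00  # closed GPT API


def monthly_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Return the monthly spend in USD for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


# Hypothetical workload: 5 billion tokens per month (assumption for illustration)
volume = 5_000_000_000

open_cost = monthly_cost(volume, OPEN_PRICE_PER_M)
closed_cost = monthly_cost(volume, CLOSED_PRICE_PER_M)
price_ratio = CLOSED_PRICE_PER_M / OPEN_PRICE_PER_M

print(f"open:   ${open_cost:,.0f}/month")
print(f"closed: ${closed_cost:,.0f}/month")
print(f"closed costs about {price_ratio:.0f}x more per token")
```

The ratio does not depend on volume, but the absolute saving does, which is why this case is framed as high volume: the same multiplier on a small workload may not justify running your own infrastructure.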

Data privacy / on-premise

Zero data leaves your infra. Non-negotiable for healthcare, legal, gov.

Open

Latest frontier capability

Closed models still lead by 5-10 points on most benchmarks.

Closed

Agentic coding pipelines

Claude Sonnet for complex multi-step tasks. The SWE-bench gap is still real.

Closed

EU / GDPR compliance

Mistral + self-hosted open models. No transatlantic data transfer.

Open

Fine-tuning on your data

Closed models: at most vendor-hosted fine-tuning, with no access to weights. Open: fine-tune locally or via API.

Open

Consumer-facing apps

Managed uptime, safety filters, content policy — closed handles it.

Closed

Research / experimentation

Access weights, probe internals, custom RLHF. Closed is a black box.

Open
