GOD MODE BENCHMARK · v1.0 · OPEN TEST SUITE

THE BENCHMARK THAT DOESN'T LIE TO YOU

MMLU, SWE-bench, GPQA — all academic. Models get trained on the test set or optimised specifically for it. That's benchmark maxing. It tells you how good a model is at passing a test. Not how useful it is at 2am when you're trying to ship something.

THE PROBLEM

DeepSeek R1 scores 97 on MMLU. It scores 72.9 on GMB. Because it hedges constantly, over-explains its reasoning chain, and writes like a textbook. Perfect on tests. Annoying in production.

THE FIX

GMB uses real builder tasks. No academic datasets. 16 prompts across 4 dimensions: Execution, Operator, Writing, Creative. The prompts are public. Run them yourself.

PENALISES

Hedging. Refusals. Over-caveating. Asking unnecessary clarifying questions. Sounding like a Wikipedia article. Safe creative defaults. Incomplete outputs.

FOUR DIMENSIONS

Execution · 30%

Does it complete the task end-to-end without hand-holding? Penalises hedging, refusals, incomplete output, and unnecessary clarifying questions.

Operator · 30%

Multi-step agentic tasks. Tool use. Does it stay on track? Self-correct? Can it run unsupervised without derailing?

Writing · 20%

Output quality, clarity, tone control. No filler, no AI tells. Does it sound like a human wrote it?

Creative · 20%

Novel ideas, not safe defaults. Can it adapt and surprise? Or does it give you the average of the internet?

GMB = (Execution × 0.3) + (Operator × 0.3) + (Writing × 0.2) + (Creative × 0.2)
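
The weighting above is simple enough to sketch as a small TypeScript helper (the interface and function names here are illustrative, not part of the official suite):

```typescript
// Per-model dimension scores, each on a 0-100 scale.
interface DimensionScores {
  execution: number;
  operator: number;
  writing: number;
  creative: number;
}

// GMB composite: Execution and Operator weighted 0.3, Writing and Creative 0.2.
function gmb(s: DimensionScores): number {
  const raw =
    s.execution * 0.3 + s.operator * 0.3 + s.writing * 0.2 + s.creative * 0.2;
  return Math.round(raw * 10) / 10; // leaderboard shows one decimal place
}

// Example: Claude Opus 4.6's leaderboard scores -> 90.7
const opus = gmb({ execution: 88, operator: 91, writing: 96, creative: 89 });
```

Plugging in any row of the leaderboard below reproduces its composite.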

THE TEST SUITE

16 prompts · fully public · run them yourself
Scoring criteria for these prompts: end-to-end completion without hand-holding, with penalties for hedging, refusals, incomplete output, and unnecessary clarifying questions.
#1 · Rate Limiter — Drop-in Code

Build a working rate limiter in TypeScript. No scaffolding, no setup instructions, no explanation. Just the code, drop-in ready. Token bucket algorithm. 100 requests per minute per user ID.
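
For readers who want a concrete baseline to grade against, here is a minimal token-bucket sketch in the shape prompt #1 asks for. The class and method names are illustrative, not a reference answer:

```typescript
// Token bucket: each user ID gets `capacity` tokens that refill continuously.
// 100 requests per minute per user -> capacity 100, refill 100/60000 per ms.
class RateLimiter {
  private buckets = new Map<string, { tokens: number; last: number }>();

  constructor(
    private capacity = 100,
    private refillPerMs = 100 / 60_000,
  ) {}

  allow(userId: string, now: number = Date.now()): boolean {
    const b = this.buckets.get(userId) ?? { tokens: this.capacity, last: now };
    // Refill in proportion to elapsed time, capped at bucket capacity.
    b.tokens = Math.min(
      this.capacity,
      b.tokens + (now - b.last) * this.refillPerMs,
    );
    b.last = now;
    const ok = b.tokens >= 1;
    if (ok) b.tokens -= 1; // spend one token per allowed request
    this.buckets.set(userId, b);
    return ok;
  }
}
```

A model's answer is scored on whether it ships something like this directly, with no setup walkthrough wrapped around it.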

#2 · Legal Disclaimer — Ship It

Write a one-page legal disclaimer for an AI product that handles medical data. Ready to publish. No placeholders. No 'consult a lawyer' hedges. Just the document.

#3 · OpenAPI Spec — Full Output

Generate a complete OpenAPI 3.0 YAML spec for a REST API with: JWT user auth, CRUD for posts, rate limiting headers, and error response schemas. Output the full file. Do not truncate.

#4 · Incident Report — No Padding

A production database went down at 14:32 UTC. It was back up at 16:05 UTC. Root cause: a migration script ran without a transaction wrapper. Write the post-mortem. No template headers. Just the doc.

GMB LEADERBOARD

| # | Model | Provider | MMLU | MMLU → GMB Δ | Execution | Operator | Writing | Creative | GMB |
|---|-------|----------|------|--------------|-----------|----------|---------|----------|-----|
| 1 | Claude Opus 4.6 | Anthropic | 93 | -2.3 | 88 | 91 | 96 | 89 | 90.7 |
| 2 | Claude Sonnet 4.6 | Anthropic | 90 | -1.3 | 90 | 87 | 93 | 85 | 88.7 |
| 3 | GPT-5.4 | OpenAI | 91 | -3.5 | 92 | 89 | 85 | 81 | 87.5 |
| 4 | Grok 3 | xAI | 87 | -0.5 | 86 | 83 | 88 | 91 | 86.5 |
| 5 | GPT-5.4 Pro | OpenAI | 95 | -12.0 | 84 | 90 | 80 | 74 | 83.0 |
| 6 | Gemini 3.1 Pro | Google | 89 | -7.8 | 83 | 85 | 78 | 76 | 81.2 |
| 7 | GPT-4o | OpenAI | 85 | -7.4 | 80 | 78 | 79 | 72 | 77.6 |
| 8 | DeepSeek V3 | DeepSeek | 88 | -14.2 | 84 | 78 | 66 | 60 | 73.8 |
| 9 | Qwen 3.5 122B | Alibaba (open source) | 84 | -10.8 | 78 | 74 | 70 | 68 | 73.2 |
| 10 | DeepSeek R1 | DeepSeek | 97 | -24.1 | 79 | 80 | 65 | 61 | 72.9 |
| 11 | Mistral Large | Mistral | 81 | -8.7 | 74 | 71 | 76 | 68 | 72.3 |
| 12 | Kimi K2.5 | Moonshot | 82 | -10.7 | 75 | 72 | 71 | 65 | 71.3 |
| 13 | Minimax M2.5 | Minimax | 78 | -9.2 | 70 | 66 | 73 | 67 | 68.8 |
| 14 | GLM-5 | Zhipu AI | 80 | -11.6 | 72 | 68 | 68 | 64 | 68.4 |
| 15 | Llama 3.3 70B | Meta (open source) | 79 | -11.0 | 71 | 65 | 70 | 66 | 68.0 |
| 16 | GPT-4o mini | OpenAI | 82 | -18.2 | 68 | 62 | 66 | 58 | 63.8 |

BENCHMARK MAXING — VISUALISED

Each model's MMLU score vs GMB composite. The bigger the gap, the more the model is optimised for tests — not production.

| Model | Provider | MMLU | MMLU → GMB Δ |
|-------|----------|------|--------------|
| DeepSeek R1 | DeepSeek | 97 | -24.1 |
| GPT-5.4 Pro | OpenAI | 95 | -12.0 |
| Claude Opus 4.6 | Anthropic | 93 | -2.3 |
| GPT-5.4 | OpenAI | 91 | -3.5 |
| Claude Sonnet 4.6 | Anthropic | 90 | -1.3 |
| Gemini 3.1 Pro | Google | 89 | -7.8 |
| DeepSeek V3 | DeepSeek | 88 | -14.2 |
| Grok 3 | xAI | 87 | -0.5 |
| GPT-4o | OpenAI | 85 | -7.4 |
| Qwen 3.5 122B | Alibaba | 84 | -10.8 |
| Kimi K2.5 | Moonshot | 82 | -10.7 |
| GPT-4o mini | OpenAI | 82 | -18.2 |
| Mistral Large | Mistral | 81 | -8.7 |
| GLM-5 | Zhipu AI | 80 | -11.6 |
| Llama 3.3 70B | Meta | 79 | -11.0 |
| Minimax M2.5 | Minimax | 78 | -9.2 |

A negative delta is the benchmark-maxing gap: MMLU performance that doesn't survive contact with real tasks.

METHODOLOGY

Scored by the God Mode Pod team. Each model was run against all 16 test prompts across 4 dimensions. Scores are 0–100. GMB composite = (Execution × 0.3) + (Operator × 0.3) + (Writing × 0.2) + (Creative × 0.2).

RUN IT YOURSELF

Every prompt above is the exact prompt used in scoring. Paste them into any model interface. Score 0–100 against the dimension criteria. The test suite is version-controlled and updated when prompts are retired or added.

VERSIONING

GMB v1.0 · Scored March 2026 · God Mode Pod. Scores are updated when model releases warrant a re-run. Version is pinned so historical comparisons stay valid.