THE BENCHMARK THAT DOESN'T LIE TO YOU
MMLU, SWE-bench, GPQA — all academic. Models get trained on the test set or optimised specifically for it. That's benchmark maxing. It tells you how good a model is at passing a test. Not how useful it is at 2am when you're trying to ship something.
DeepSeek R1 scores 97 on MMLU. It scores 72.9 on GMB. Because it hedges constantly, over-explains its reasoning chain, and writes like a textbook. Perfect on tests. Annoying in production.
GMB uses real builder tasks. No academic datasets. 16 prompts across 4 dimensions: Execution, Operator, Writing, Creative. The prompts are public. Run them yourself.
What GMB penalises: Hedging. Refusals. Over-caveating. Asking unnecessary clarifying questions. Sounding like a Wikipedia article. Safe creative defaults. Incomplete outputs.
FOUR DIMENSIONS
Execution — Does it complete the task end-to-end without hand-holding? Penalises hedging, refusals, incomplete output, and unnecessary clarifying questions.
Operator — Multi-step agentic tasks. Tool use. Does it stay on track? Self-correct? Can it run unsupervised without derailing?
Writing — Output quality, clarity, tone control. No filler, no AI tells. Does it sound like a human wrote it?
Creative — Novel ideas, not safe defaults. Can it adapt and surprise? Or does it give you the average of the internet?
THE TEST SUITE
16 prompts · fully public · run them yourself
Build a working rate limiter in TypeScript. No scaffolding, no setup instructions, no explanation. Just the code, drop-in ready. Token bucket algorithm. 100 requests per minute per user ID.
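For reference, here is a minimal sketch of what a passing answer to that prompt might look like — not the graded reference answer, just one reasonable in-memory implementation, assuming a `Map` keyed by user ID and refill proportional to elapsed time:

```typescript
// Token-bucket rate limiter: 100 requests per minute per user ID.
// Sketch only — names (RateLimiter, allow) are illustrative, not from the spec.
type Bucket = { tokens: number; lastRefill: number };

class RateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private capacity = 100,               // bucket size = max burst
    private refillPerMs = 100 / 60_000,   // 100 tokens per minute
  ) {}

  allow(userId: string, now: number = Date.now()): boolean {
    const b = this.buckets.get(userId) ?? { tokens: this.capacity, lastRefill: now };
    // Refill in proportion to elapsed time, capped at capacity.
    b.tokens = Math.min(this.capacity, b.tokens + (now - b.lastRefill) * this.refillPerMs);
    b.lastRefill = now;
    const ok = b.tokens >= 1;
    if (ok) b.tokens -= 1;
    this.buckets.set(userId, b);
    return ok;
  }
}
```

A response like this is what Execution rewards: no setup narration, no caveats, drop-in ready.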
Write a one-page legal disclaimer for an AI product that handles medical data. Ready to publish. No placeholders. No 'consult a lawyer' hedges. Just the document.
Generate a complete OpenAPI 3.0 YAML spec for a REST API with: JWT user auth, CRUD for posts, rate limiting headers, and error response schemas. Output the full file. Do not truncate.
A production database went down at 14:32 UTC. It was back up at 16:05 UTC. Root cause: a migration script ran without a transaction wrapper. Write the post-mortem. No template headers. Just the doc.
GMB LEADERBOARD
BENCHMARK MAXING — VISUALISED
Each model's MMLU score vs GMB composite. The bigger the gap, the more the model is optimised for tests — not production.
Scored by the God Mode Pod team. Each model was run against all 16 test prompts across 4 dimensions. Scores are 0–100. GMB composite = (Execution × 0.3) + (Operator × 0.3) + (Writing × 0.2) + (Creative × 0.2).
Every prompt above is the exact prompt used in scoring. Paste them into any model interface and score each response 0–100 against the dimension criteria. The test suite is version-controlled and updated when prompts are retired or added.
GMB v1.0 · Scored March 2026 · God Mode Pod. Scores are updated when model releases warrant a re-run. Version is pinned so historical comparisons stay valid.