THE BENCHMARK THAT DOESN'T LIE TO YOU
MMLU, SWE-bench, GPQA — all academic. Models get trained on the test set or optimised specifically for it. That's benchmark maxing. It tells you how good a model is at passing a test. Not how useful it is at 2am when you're trying to ship something.
DeepSeek R1 scores 97 on MMLU. It scores 72.9 on GMB. Because it hedges constantly, over-explains its reasoning chain, and writes like a textbook. Perfect on tests. Annoying in production.
GMB uses real builder tasks. No academic datasets. 16 prompts across 4 dimensions: Execution, Operator, Writing, Creative. The prompts are public. Run them yourself.
What GMB penalises: Hedging. Refusals. Over-caveating. Asking unnecessary clarifying questions. Sounding like a Wikipedia article. Safe creative defaults. Incomplete outputs.
FOUR DIMENSIONS
Execution — Does it complete the task end-to-end without hand-holding? Penalises hedging, refusals, incomplete output, and unnecessary clarifying questions.
Operator — Multi-step agentic tasks. Tool use. Does it stay on track? Self-correct? Can it run unsupervised without derailing?
Writing — Output quality, clarity, tone control. No filler, no AI tells. Does it sound like a human wrote it?
Creative — Novel ideas, not safe defaults. Can it adapt and surprise? Or does it give you the average of the internet?
THE TEST SUITE
16 prompts · fully public · run them yourself
Build a working rate limiter in TypeScript. No scaffolding, no setup instructions, no explanation. Just the code, drop-in ready. Token bucket algorithm. 100 requests per minute per user ID.
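For reference, here is a minimal sketch of what a passing answer to that prompt might look like — not the graded reference answer, just one reasonable in-memory implementation, assuming a `Map` keyed by user ID and refill proportional to elapsed time:

```typescript
// Token-bucket rate limiter: 100 requests per minute per user ID.
// Sketch only — names (RateLimiter, allow) are illustrative, not from the spec.
type Bucket = { tokens: number; lastRefill: number };

class RateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private capacity = 100,               // bucket size = max burst
    private refillPerMs = 100 / 60_000,   // 100 tokens per minute
  ) {}

  allow(userId: string, now: number = Date.now()): boolean {
    const b = this.buckets.get(userId) ?? { tokens: this.capacity, lastRefill: now };
    // Refill in proportion to elapsed time, capped at capacity.
    b.tokens = Math.min(this.capacity, b.tokens + (now - b.lastRefill) * this.refillPerMs);
    b.lastRefill = now;
    const ok = b.tokens >= 1;
    if (ok) b.tokens -= 1;
    this.buckets.set(userId, b);
    return ok;
  }
}
```

A response like this is what Execution rewards: no setup narration, no caveats, drop-in ready.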
Write a one-page legal disclaimer for an AI product that handles medical data. Ready to publish. No placeholders. No 'consult a lawyer' hedges. Just the document.
Generate a complete OpenAPI 3.0 YAML spec for a REST API with: JWT user auth, CRUD for posts, rate limiting headers, and error response schemas. Output the full file. Do not truncate.
A production database went down at 14:32 UTC. It was back up at 16:05 UTC. Root cause: a migration script ran without a transaction wrapper. Write the post-mortem. No template headers. Just the doc.
GMB LEADERBOARD
BENCHMARK MAXING — VISUALISED
Each model's MMLU score vs GMB composite. The bigger the gap, the more the model is optimised for tests — not production.
Scored by the God Mode Pod team. Each model was run against all 16 test prompts across 4 dimensions. Scores are 0–100. GMB composite = (Execution × 0.3) + (Operator × 0.3) + (Writing × 0.2) + (Creative × 0.2).
Every prompt above is the exact prompt used in scoring. Paste them into any model interface and score each response 0–100 against the dimension criteria. The test suite is version-controlled and updated when prompts are retired or added.
GMB v1.0 · Scored March 2026 · God Mode Pod. Scores are updated when model releases warrant a re-run. Version is pinned so historical comparisons stay valid.