AI IN YOUR POCKET
What compute is actually possible on a phone today. How fast it is improving. And when the moment arrives that your phone stops needing to call the cloud for anything routine.
Apple A18 Pro
Apple ships a custom ~3B model (2-bit quantised) for Apple Intelligence. Third-party: llama.cpp + llama.rn run Llama 3.2 3B at usable speed. 8GB RAM is the ceiling — no headroom for 7B.
- ChatGPT launches — cloud only, no on-device anything
- Mobile NPUs exist (A15, SD 8 Gen 1) but zero LLM tooling
- On-device AI = image filters, voice commands, face unlock
- Llama 1 + 2 leaked/released — hobbyists port to phones
- llama.cpp makes CPU-only inference possible (very slow)
- Phi-1 / TinyLlama prove sub-2B models can reason
- Gemini Nano ships on Pixel 8 — first OEM on-device LLM
- Apple Intelligence ships: custom 3B model (2-bit) on-device
- Llama 3.2 1B/3B released, explicitly targeting mobile
- MLC-LLM / llama.cpp enable Metal + Vulkan acceleration
- Snapdragon 8 Elite: 75 TOPS, 7B models become feasible
- Gemma 2B ships in Android via MediaPipe
- ~10–15 tok/s on 3B models on flagship hardware
- A19 Pro: 4× FP16 uplift. Neural Accelerators in every GPU core.
- Gemma 3: 270M → 27B, mobile-first architecture
- SmolLM2 135M — a genuinely useful model in well under 1GB RAM
- Phi-4 mini (3.8B) beats many 7B models on reasoning benchmarks
- Qwen3 0.6B at ~40 tok/s on Pixel 8 — fast enough to feel instant
- Real-time local ASR, translation, summarisation on most flagships
- Private Cloud Compute: Apple's hybrid — on-device first, cloud fallback
- LPDDR6 arrives in flagships — 2× bandwidth vs LPDDR5X
- Standard flagship: 12–16GB RAM, 100+ TOPS NPU
- 7B models run comfortably on Android (not just barely)
- On-device multimodal: vision + LLM on same device, no cloud
- Agent pipelines on phone — Siri, Gemini Live, on-device tool calls
- IDC: AI phones ~40% of flagship market
- 16GB RAM standard across mid-range, not just flagship
- 14B models possible on ultra-premium hardware
- HBM for phones? Samsung researching but not imminent
- On-device fine-tuning for personalisation — your phone learns you
- Battery life gap closes: NPU efficiency reaches 5× current iGPU
- AI inference: default to local, cloud only for frontier-class tasks
Most iPhones ship with 8GB. Top Android flagships reach 16GB. A 7B model at Q4 needs ~4–5GB just for weights. Add KV-cache, OS overhead, and you need 12GB minimum to run 7B without thrashing. This is why Android is ahead of iPhone for local LLMs right now.
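The arithmetic above can be sketched as a quick budget calculation. This is a back-of-envelope sketch: the bits-per-weight, KV-cache, and OS-overhead figures are assumptions chosen to match the numbers in the text, not measurements.

```python
def model_ram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB for `params_b` billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def total_budget_gb(params_b: float,
                    bits_per_weight: float = 4.5,  # Q4 formats average ~4.5 bits once scales are counted
                    kv_cache_gb: float = 1.5,      # grows with context length; assumed here
                    os_overhead_gb: float = 4.0):  # assumed OS + app memory floor
    """Rough total RAM needed to run the model without thrashing."""
    return model_ram_gb(params_b, bits_per_weight) + kv_cache_gb + os_overhead_gb

print(round(model_ram_gb(7, 4.5), 1))   # 3.9 -- inside the 4-5 GB range above
print(round(total_budget_gb(7), 1))     # 9.4 -- why an 8GB phone can't hold a 7B model
```

The 9.4GB figure is before leaving any headroom for the foreground app, which is how the practical floor lands nearer 12GB.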
TOPS are not the bottleneck — bandwidth is. You can have 100 TOPS but if you're feeding model weights at 60 GB/s, you'll idle most of that compute. LPDDR5X phones top out at ~100–130 GB/s. Compare: Mac Studio M4 Max has 410 GB/s. This bandwidth gap is why phones can't match desktop inference speed even with similar TOPS.
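The bandwidth argument can be made concrete with a roofline-style ceiling: in memory-bound decoding, every generated token streams roughly all of the weights once, so bandwidth divided by weight bytes caps tokens per second. A sketch using the figures above (the 4GB weight size is the text's Q4 estimate for a 7B model):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode speed when weight streaming dominates compute."""
    return bandwidth_gb_s / weight_gb

weights_gb = 4.0  # 7B model at Q4, per the estimate above

print(decode_ceiling_tok_s(100, weights_gb))  # 25.0  -- LPDDR5X-class phone
print(decode_ceiling_tok_s(410, weights_gb))  # 102.5 -- M4 Max-class desktop
```

Real throughput lands below these ceilings (cache effects, attention compute, scheduling), but the ratio between the two platforms tracks the bandwidth gap, not the TOPS gap.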
A phone running a 7B model at full throughput will thermal throttle within 2–5 minutes; the sustained token rate is often only 50–60% of peak. Apple addresses this in the A19 Pro generation with vapor-chamber cooling, the first iPhone thermal design built for sustained AI workloads.
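The throttling effect can be modelled crudely: full speed until the thermal limit, then the 50–60% sustained fraction quoted above. The 3-minute window and 0.55 factor here are illustrative assumptions, not measurements of any specific device:

```python
def avg_tok_s(peak_tok_s: float, run_min: float,
              peak_window_min: float = 3.0,    # assumed minutes before throttling
              sustained_frac: float = 0.55):   # midpoint of the 50-60% range
    """Average token rate over a run that thermal-throttles partway through."""
    if run_min <= peak_window_min:
        return peak_tok_s
    peak_toks = peak_tok_s * peak_window_min
    slow_toks = peak_tok_s * sustained_frac * (run_min - peak_window_min)
    return (peak_toks + slow_toks) / run_min

print(round(avg_tok_s(15, 1), 1))   # 15.0 -- short bursts never hit the limit
print(round(avg_tok_s(15, 30), 1))  # 8.9  -- long runs settle near the sustained rate
```

This is why burst benchmarks flatter phones: short prompts finish inside the peak window, while agent loops and long generations live at the sustained rate.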
The "on-device LLM" story isn't just about hardware — it's about model efficiency. A 2025 Phi-4 mini (3.8B) beats 2022's 7B models on most benchmarks. Distillation, quantisation, and architecture improvements mean the effective quality of a phone-runnable model is increasing faster than the hardware is improving.
Model efficiency is moving faster than the hardware
NPU TOPS roughly double every 2 years. LPDDR6 will roughly double mobile memory bandwidth by 2026–2027. These are meaningful gains, but constrained by battery, thermals, and phone form factor.
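The doubling cadence above can be turned into a naive extrapolation. This is trend arithmetic only, not a forecast; the 2024 baseline is the Snapdragon 8 Elite figure from the timeline:

```python
def project(value: float, years: float, doubling_years: float = 2.0) -> float:
    """Extrapolate a metric that doubles every `doubling_years` years."""
    return value * 2 ** (years / doubling_years)

npu_tops_2024 = 75.0  # Snapdragon 8 Elite class, per the timeline

print(project(npu_tops_2024, 2))  # 150.0 -- ~2026 on this trend
print(project(npu_tops_2024, 4))  # 300.0 -- ~2028 on this trend
```

Battery, thermals, and form factor are exactly the constraints that tend to bend this curve downward, which is the caveat the paragraph above makes.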
Phi-4 mini at 3.8B beats many 2022-era 7B models. Gemma 3's 270M model is genuinely useful. Distillation, quantisation, and architecture advances mean what runs on a phone today would have needed a server rack in 2022.
With 16GB RAM standard, LPDDR6 bandwidth, 100+ TOPS NPUs, and better-distilled 7B models — the phone becomes the default AI runtime for most routine tasks. Cloud becomes the fallback for frontier complexity.
Whisper small / Parakeet, Neural Engine. Fast, private, offline.
Apple Intelligence, Gemini Nano. Genuinely useful, on-device.
Apple Intelligence 3B handles this well. GPT-quality? No. Useful? Yes.
3B models at 15 tok/s. Not fast enough to feel instant for long answers.
Phi-4 mini on Android works. Limited context window, not IDE-ready.
Gemini Nano multimodal, FastVLM on A18 Pro. Vision + text on-device.
Works with RAG + 3B models. Slow retrieval on-device, improving fast.
~8 tok/s on SD 8 Elite, 16GB Android. Usable but not enjoyable. 2026 problem.
Siri (cloud), Gemini Live (cloud). True on-device agents: 2026+ problem.