COMPUTE RACE

AI IN
YOUR
POCKET

What compute is actually possible on a phone today. How fast it is improving. And how close we are to the moment your phone stops needing to call the cloud for anything routine.

APPLE ON-DEVICE MODEL
~3B params
2-bit quantised, private, always on
FASTEST PHONE INFERENCE
~60 tok/s
A19 Pro · 1B models
MAX RAM ON FLAGSHIP PHONE
16GB
Android (Snapdragon / Dimensity)
MODELS DESIGNED FOR MOBILE
270M–4B
Gemma 3, Phi-4 mini, Qwen3, SmolLM2
COMPARE CHIPS

Apple A18 Pro

iPhone 16 Pro / 16 Pro Max
Privacy-first, always-on Apple Intelligence, best power efficiency
TSMC 3nm · 2024
RAM
8GB LPDDR5
NPU
16-core Neural Engine
AI COMPUTE
35 TOPS
MEM BANDWIDTH
~77 GB/s
INFERENCE SPEED (approx, Q4 quant)
1B model
~40 tok/s
3B model
~15 tok/s
7B model
~5 tok/s
RUNS THESE MODELS
Apple Intelligence 3B (2-bit) · Qwen3 0.6B · Llama 3.2 1B · Phi-4 mini 3.8B

Apple ships a custom ~3B model (2-bit quantised) for Apple Intelligence. Third-party: llama.cpp + llama.rn run Llama 3.2 3B at usable speed. 8GB RAM is the ceiling; no headroom for 7B.
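
For flavour, a minimal sketch of that third-party path using llama.cpp's Python binding (llama-cpp-python) on a desktop; on iPhone the same engine is driven through llama.rn with equivalent parameters. The model filename here is an illustrative assumption.

from llama_cpp import Llama

# Load a Q4-quantised Llama 3.2 3B (filename is a placeholder).
llm = Llama(
    model_path="llama-3.2-3b-instruct-q4_k_m.gguf",
    n_ctx=2048,       # context window; larger costs more KV-cache RAM
    n_gpu_layers=-1,  # offload every layer to Metal/GPU where available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One line: why run LLMs on-device?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])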

AI COMPUTE (TOPS)
Apple A19 Pro · ~140 TOPS (est)
Snapdragon 8 Elite Gen 2 · ~100+ TOPS (est)
Snapdragon 8 Elite · 75 TOPS
Dimensity 9400 · ~50 TOPS
Google Tensor G5 · ~40 TOPS
Apple A18 Pro · 35 TOPS
MEMORY BANDWIDTH
Snapdragon 8 Elite Gen 2 · ~130 GB/s (est)
Dimensity 9400 · ~120 GB/s
Apple A19 Pro · ~100 GB/s
Snapdragon 8 Elite · ~100 GB/s
Google Tensor G5 · ~96 GB/s
Apple A18 Pro · ~77 GB/s
HOW WE GOT HERE — AND WHERE IT'S GOING
2022
PRE-MOBILE ERA
  • ChatGPT launches: cloud only, nothing on-device
  • Mobile NPUs exist (A15, SD 8 Gen 1) but zero LLM tooling
  • On-device AI = image filters, voice commands, face unlock
Nothing useful for LLMs
2023
FIRST EXPERIMENTS
  • Llama 1 leaks, Llama 2 is released: hobbyists port both to phones
  • llama.cpp makes CPU-only inference possible (very slow)
  • Phi-1 / TinyLlama prove sub-2B models can reason
  • Gemini Nano ships on Pixel 8: first OEM on-device LLM
~1–2 tok/s on flagship. Demo-ware, not practical.
2024
APPLE ENTERS
  • Apple Intelligence ships: custom 3B model (2-bit) on-device
  • Llama 3.2 1B/3B released, explicitly targeting mobile
  • MLC-LLM / llama.cpp enable Metal + Vulkan acceleration
  • Snapdragon 8 Elite: 75 TOPS, 7B models become feasible
  • Gemma 2B ships in Android via MediaPipe
  • ~10–15 tok/s on 3B models on flagship hardware
3B models usable. 7B barely. Mostly enthusiast apps, not everyday features.
2025
MODEL EFFICIENCY CATCHES UP
  • A19 Pro: 4× FP16 uplift, Neural Accelerators in every GPU core
  • Gemma 3: 270M → 27B, mobile-first architecture
  • SmolLM2 135M: a genuinely useful model in a few hundred MB of RAM
  • Phi-4 mini (3.8B) beats many 7B models on reasoning benchmarks
  • Qwen3 0.6B at ~40 tok/s on Pixel 8: fast enough to feel instant
  • Real-time local ASR, translation, summarisation on most flagships
  • Private Cloud Compute: Apple's hybrid, on-device first with cloud fallback
3–4B models feel fast. 7B usable. Real apps shipping.
2026
THE RAM WALL STARTS TO BREAK
  • LPDDR6 arrives in flagships: ~2× bandwidth vs LPDDR5X
  • Standard flagship: 12–16GB RAM, 100+ TOPS NPU
  • 7B models run comfortably on Android (not just barely)
  • On-device multimodal: vision + LLM on the same device, no cloud
  • Agent pipelines on phone: Siri, Gemini Live, on-device tool calls
  • IDC: AI phones ~40% of flagship market
7B native on Android flagships. 4B on iPhone. Agents.
2027+
LOCAL AI BECOMES DEFAULT
  • 16GB RAM standard across mid-range, not just flagship
  • 14B models possible on ultra-premium hardware
  • HBM for phones? Samsung is researching it, but it's not imminent
  • On-device fine-tuning for personalisation: your phone learns you
  • Battery life gap closes: NPU efficiency reaches 5× current iGPU
  • AI inference: local by default, cloud only for frontier-class tasks
Local = default. Cloud = fallback for complex tasks only.
WHAT ACTUALLY LIMITS MOBILE AI
RAM ceiling

Most iPhones ship with 8GB. Top Android flagships reach 16GB. A 7B model at Q4 needs ~4–5GB just for weights. Add KV-cache, OS overhead, and you need 12GB minimum to run 7B without thrashing. This is why Android is ahead of iPhone for local LLMs right now.
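
The arithmetic behind those numbers, as a back-of-envelope sketch; layer and dimension figures are ballpark assumptions for a Llama-7B-class model, and real runtimes add allocator and graph overhead on top.

# Back-of-envelope RAM estimate for a quantised LLM plus its KV-cache.
def model_ram_gb(params_b, bits_per_weight, n_layers, kv_dim, ctx_len):
    weights = params_b * 1e9 * bits_per_weight / 8       # bytes for weights
    # KV-cache: K and V per layer, per position, fp16 (2 bytes per value).
    # Modern GQA models shrink this term ~4x; assume none here (worst case).
    kv_cache = 2 * n_layers * ctx_len * kv_dim * 2
    return (weights + kv_cache) / 1e9

# 7B-class model, Q4 (~4.5 bits effective), 4k context
print(f"{model_ram_gb(7, 4.5, n_layers=32, kv_dim=4096, ctx_len=4096):.1f} GB")
# ~6.1 GB before the OS and the app take their share of an 8-16GB phone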

Memory bandwidth

TOPS are not the bottleneck — bandwidth is. You can have 100 TOPS but if you're feeding model weights at 60 GB/s, you'll idle most of that compute. LPDDR5X phones top out at ~100–130 GB/s. Compare: Mac Studio M4 Max has 410 GB/s. This bandwidth gap is why phones can't match desktop inference speed even with similar TOPS.
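
That constraint reduces to one division: each decoded token must stream (roughly) the full weight set from memory, so tok/s is capped at bandwidth divided by model size. A sketch against the A18 Pro figures above; real rates land well below the ceiling once KV-cache traffic and thermals bite.

# Bandwidth roofline: tok/s <= bandwidth / bytes of weights read per token.
def roofline_tok_s(bandwidth_gb_s, params_b, bits_per_weight):
    model_gb = params_b * bits_per_weight / 8   # GB streamed per token
    return bandwidth_gb_s / model_gb

for params in (1, 3, 7):
    cap = roofline_tok_s(77, params, 4.5)       # A18 Pro: ~77 GB/s, Q4
    print(f"{params}B @ Q4: <= {cap:.0f} tok/s")
# Ceilings: ~137 / ~46 / ~20 tok/s -- compare the measured ~40 / ~15 / ~5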

Thermal throttling

A phone running a 7B model at full throughput will thermal throttle within 2–5 minutes; the sustained token rate is often 50–60% of peak. Apple's answer in the A19 Pro generation is vapour chamber cooling, the first iPhone with thermal hardware sized for sustained AI workloads.
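
One way to see throttling rather than guess at it is to track decode rate in a rolling window instead of a one-shot benchmark. A sketch; generate_token is a hypothetical stand-in for a single decode step of whatever runtime you use (llama.cpp, MLC, Core ML, ...).

import time
from collections import deque

# Rolling-window token-rate monitor: exposes peak vs sustained throughput.
def measure_throttle(generate_token, duration_s=300, window_s=10):
    stamps = deque()
    peak = rate = 0.0
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        generate_token()                       # one decode step (stand-in)
        now = time.monotonic()
        stamps.append(now)
        while now - stamps[0] > window_s:      # drop tokens outside window
            stamps.popleft()
        rate = len(stamps) / window_s          # tok/s over the last window
        peak = max(peak, rate)
    return peak, rate  # on phones, the final rate is often 50-60% of peak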

Model size vs quality

The "on-device LLM" story isn't just about hardware — it's about model efficiency. A 2025 Phi-4 mini (3.8B) beats 2022's 7B models on most benchmarks. Distillation, quantisation, and architecture improvements mean the effective quality of a phone-runnable model is increasing faster than the hardware is improving.

THE INSIGHT

Model efficiency is moving faster than the hardware

HARDWARE IS IMPROVING STEADILY

NPU TOPS roughly double every 2 years. LPDDR6 will roughly double mobile memory bandwidth by 2026–2027. These are meaningful gains, but constrained by battery, thermals, and phone form factor.
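
As a toy extrapolation of that doubling curve (pure arithmetic under the stated assumption, not a forecast):

# NPU TOPS doubling every ~2 years from the A18 Pro's 35 TOPS baseline.
for year in (2024, 2026, 2028):
    tops = 35 * 2 ** ((year - 2024) / 2)
    print(year, f"~{tops:.0f} TOPS NPU")
# prints ~35, ~70, ~140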

MODELS ARE IMPROVING FASTER

Phi-4 mini at 3.8B beats many 2022-era 7B models. Gemma 3's 270M model is genuinely useful. Distillation, quantisation, and architecture advances mean what runs on a phone today would have needed a server rack in 2022.

THE CROSSOVER POINT IS 2026–2027

With 16GB RAM standard, LPDDR6 bandwidth, 100+ TOPS NPUs, and better-distilled 7B models — the phone becomes the default AI runtime for most routine tasks. Cloud becomes the fallback for frontier complexity.

WHAT YOU CAN DO ON A FLAGSHIP PHONE RIGHT NOW
Real-time speech transcription · Excellent

Whisper small / Parakeet, Neural Engine. Fast, private, offline.

Text summarisation · Good

Apple Intelligence, Gemini Nano. Genuinely useful, on-device.

Writing assist / rewrite · Good

Apple Intelligence 3B handles this well. GPT-quality? No. Useful? Yes.

Quick Q&A chatbot · Usable

3B models at 15 tok/s. Not fast enough to feel instant for long answers.

Code completion · Early

Phi-4 mini on Android works. Limited context window, not IDE-ready.

Image understanding · Good

Gemini Nano multimodal, FastVLM on A18 Pro. Vision + text on-device.

Private document Q&A · Early

Works with RAG + 3B models. Slow retrieval on-device, improving fast.
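
The shape of that pipeline in a few lines. embed and generate are hypothetical stand-ins for whichever on-device embedding model and 3B LLM you ship.

import numpy as np

# Minimal RAG skeleton: embed chunks, retrieve by cosine similarity,
# stuff the best matches into a small model's prompt.
def answer(question, chunks, embed, generate, k=3):
    doc_vecs = np.array([embed(c) for c in chunks])   # precompute offline
    q = np.asarray(embed(question))
    sims = doc_vecs @ q / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[-k:][::-1]                 # k most similar chunks
    context = "\n".join(chunks[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")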

7B model chat · Barely

~8 tok/s on SD 8 Elite, 16GB Android. Usable but not enjoyable. 2026 problem.

Agentic tool use · Very early

Siri (cloud), Gemini Live (cloud). True on-device agents: 2026+ problem.