AI IN YOUR POCKET
What compute is actually possible on a phone today. How fast it is improving. And when the moment arrives that your phone stops needing to call the cloud for anything routine.
Apple A18 Pro
Apple ships a custom ~3B model (2-bit quantised) for Apple Intelligence. Third-party: llama.cpp + llama.rn run Llama 3.2 3B at usable speed. 8GB RAM is the ceiling — no headroom for 7B.
- ChatGPT launches — cloud only, no on-device anything
- Mobile NPUs exist (A15, SD 8 Gen 1) but zero LLM tooling
- On-device AI = image filters, voice commands, face unlock
- Llama 1 + 2 leaked/released — hobbyists port to phones
- llama.cpp makes CPU-only inference possible (very slow)
- Phi-1 / TinyLlama prove sub-2B models can reason
- Gemini Nano ships on Pixel 8 — first OEM on-device LLM
- Apple Intelligence ships: custom 3B model (2-bit) on-device
- Llama 3.2 1B/3B released, explicitly targeting mobile
- MLC-LLM / llama.cpp enable Metal + Vulkan acceleration
- Snapdragon 8 Elite: 75 TOPS, 7B models become feasible
- Gemma 2B ships in Android via MediaPipe
- ~10–15 tok/s on 3B models on flagship hardware
- A19 Pro: 4× FP16 uplift. Neural Accelerators in every GPU core.
- Gemma 3: 270M → 27B, mobile-first architecture
- SmolLM2 135M — a genuinely useful model in well under 1GB RAM
- Phi-4 mini (3.8B) beats many 7B models on reasoning benchmarks
- Qwen3 0.6B at ~40 tok/s on Pixel 8 — fast enough to feel instant
- Real-time local ASR, translation, summarisation on most flagships
- Private Cloud Compute: Apple's hybrid — on-device first, cloud fallback
- LPDDR6 arrives in flagships — 2× bandwidth vs LPDDR5X
- Standard flagship: 12–16GB RAM, 100+ TOPS NPU
- 7B models run comfortably on Android (not just barely)
- On-device multimodal: vision + LLM on same device, no cloud
- Agent pipelines on phone — Siri, Gemini Live, on-device tool calls
- IDC: AI phones ~40% of flagship market
- 16GB RAM standard across mid-range, not just flagship
- 14B models possible on ultra-premium hardware
- HBM for phones? Samsung researching but not imminent
- On-device fine-tuning for personalisation — your phone learns you
- Battery life gap closes: NPU efficiency reaches 5× current iGPU
- AI inference: default to local, cloud only for frontier-class tasks
Most iPhones ship with 8GB. Top Android flagships reach 16GB. A 7B model at Q4 needs ~4–5GB just for weights. Add KV-cache, OS overhead, and you need 12GB minimum to run 7B without thrashing. This is why Android is ahead of iPhone for local LLMs right now.
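The arithmetic above can be sketched as a quick budget calculation. This is a back-of-envelope sketch: the bits-per-weight, KV-cache, and OS-overhead figures are assumptions chosen to match the numbers in the text, not measurements.

```python
def model_ram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB for `params_b` billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def total_budget_gb(params_b: float,
                    bits_per_weight: float = 4.5,  # Q4 formats average ~4.5 bits once scales are counted
                    kv_cache_gb: float = 1.5,      # grows with context length; assumed here
                    os_overhead_gb: float = 4.0):  # assumed OS + app memory floor
    """Rough total RAM needed to run the model without thrashing."""
    return model_ram_gb(params_b, bits_per_weight) + kv_cache_gb + os_overhead_gb

print(round(model_ram_gb(7, 4.5), 1))   # 3.9 -- inside the 4-5 GB range above
print(round(total_budget_gb(7), 1))     # 9.4 -- why an 8GB phone can't hold a 7B model
```

The 9.4GB figure is before leaving any headroom for the foreground app, which is how the practical floor lands nearer 12GB.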
TOPS are not the bottleneck — bandwidth is. You can have 100 TOPS but if you're feeding model weights at 60 GB/s, you'll idle most of that compute. LPDDR5X phones top out at ~100–130 GB/s. Compare: Mac Studio M4 Max has 410 GB/s. This bandwidth gap is why phones can't match desktop inference speed even with similar TOPS.
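The bandwidth argument can be made concrete with a roofline-style ceiling: in memory-bound decoding, every generated token streams roughly all of the weights once, so bandwidth divided by weight bytes caps tokens per second. A sketch using the figures above (the 4GB weight size is the text's Q4 estimate for a 7B model):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode speed when weight streaming dominates compute."""
    return bandwidth_gb_s / weight_gb

weights_gb = 4.0  # 7B model at Q4, per the estimate above

print(decode_ceiling_tok_s(100, weights_gb))  # 25.0  -- LPDDR5X-class phone
print(decode_ceiling_tok_s(410, weights_gb))  # 102.5 -- M4 Max-class desktop
```

Real throughput lands below these ceilings (cache effects, attention compute, scheduling), but the ratio between the two platforms tracks the bandwidth gap, not the TOPS gap.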
A phone running a 7B model at full throughput will thermal throttle within 2–5 minutes; the sustained token rate is often only 50–60% of peak. Apple addresses this in the A19 Pro generation with vapor-chamber cooling, the first iPhone thermal design built for sustained AI workloads.
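The throttling effect can be modelled crudely: full speed until the thermal limit, then the 50–60% sustained fraction quoted above. The 3-minute window and 0.55 factor here are illustrative assumptions, not measurements of any specific device:

```python
def avg_tok_s(peak_tok_s: float, run_min: float,
              peak_window_min: float = 3.0,    # assumed minutes before throttling
              sustained_frac: float = 0.55):   # midpoint of the 50-60% range
    """Average token rate over a run that thermal-throttles partway through."""
    if run_min <= peak_window_min:
        return peak_tok_s
    peak_toks = peak_tok_s * peak_window_min
    slow_toks = peak_tok_s * sustained_frac * (run_min - peak_window_min)
    return (peak_toks + slow_toks) / run_min

print(round(avg_tok_s(15, 1), 1))   # 15.0 -- short bursts never hit the limit
print(round(avg_tok_s(15, 30), 1))  # 8.9  -- long runs settle near the sustained rate
```

This is why burst benchmarks flatter phones: short prompts finish inside the peak window, while agent loops and long generations live at the sustained rate.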
The "on-device LLM" story isn't just about hardware — it's about model efficiency. A 2025 Phi-4 mini (3.8B) beats 2022's 7B models on most benchmarks. Distillation, quantisation, and architecture improvements mean the effective quality of a phone-runnable model is increasing faster than the hardware is improving.
Model efficiency is moving faster than the hardware
NPU TOPS roughly double every 2 years. LPDDR6 will roughly double mobile memory bandwidth by 2026–2027. These are meaningful gains, but constrained by battery, thermals, and phone form factor.
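The doubling cadence above can be turned into a naive extrapolation. This is trend arithmetic only, not a forecast; the 2024 baseline is the Snapdragon 8 Elite figure from the timeline:

```python
def project(value: float, years: float, doubling_years: float = 2.0) -> float:
    """Extrapolate a metric that doubles every `doubling_years` years."""
    return value * 2 ** (years / doubling_years)

npu_tops_2024 = 75.0  # Snapdragon 8 Elite class, per the timeline

print(project(npu_tops_2024, 2))  # 150.0 -- ~2026 on this trend
print(project(npu_tops_2024, 4))  # 300.0 -- ~2028 on this trend
```

Battery, thermals, and form factor are exactly the constraints that tend to bend this curve downward, which is the caveat the paragraph above makes.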
Phi-4 mini at 3.8B beats many 2022-era 7B models. Gemma 3's 270M model is genuinely useful. Distillation, quantisation, and architecture advances mean what runs on a phone today would have needed a server rack in 2022.
With 16GB RAM standard, LPDDR6 bandwidth, 100+ TOPS NPUs, and better-distilled 7B models — the phone becomes the default AI runtime for most routine tasks. Cloud becomes the fallback for frontier complexity.
Whisper small / Parakeet, Neural Engine. Fast, private, offline.
Apple Intelligence, Gemini Nano. Genuinely useful, on-device.
Apple Intelligence 3B handles this well. GPT-quality? No. Useful? Yes.
3B models at 15 tok/s. Not fast enough to feel instant for long answers.
Phi-4 mini on Android works. Limited context window, not IDE-ready.
Gemini Nano multimodal, FastVLM on A18 Pro. Vision + text on-device.
Works with RAG + 3B models. Slow retrieval on-device, improving fast.
~8 tok/s on SD 8 Elite, 16GB Android. Usable but not enjoyable. 2026 problem.
Siri (cloud), Gemini Live (cloud). True on-device agents: 2026+ problem.