Three small models dominate the on-device iPhone conversation in 2026: Llama 3.2 3B, Phi-3.5 Mini 3.8B, and Gemma 2 2B. All three run on an iPhone 15 Pro. All three are small enough to download over cellular without dying inside. All three are genuinely good. But they're not interchangeable. This post is a head-to-head: which one to install first, and when to switch to one of the others. It extends our longer Llama 3.2 iPhone benchmarks with two extra models.
Short version: install Llama 3.2 3B for general use, Phi-3.5 Mini for reasoning and code, Gemma 2 2B for the fastest generation and multilingual work. PocketLLM bundles all three — join the waitlist.
The contenders
Llama 3.2 3B — Meta, September 2024, Llama Community License, ~2.0 GB at Q4. Pruned and distilled from larger Llama 3.1 models. The default answer to "what should I run on a phone?"
Phi-3.5 Mini (3.8B) — Microsoft, August 2024, MIT license, ~2.4 GB at Q4. Trained on a carefully curated "textbook" dataset designed to maximize reasoning and coding capability at small scale.
Gemma 2 2B — Google DeepMind, July 2024, Gemma Terms, ~1.6 GB at Q4. Distilled from larger Gemma variants with strong safety tuning and multilingual coverage.
Performance on iPhone 15 Pro
All three models were tested in Q4 quantization on an iPhone 15 Pro with 8 GB of RAM, running through llama.cpp via a native iOS app. Numbers below are approximate based on reported benchmarks and community measurements; your mileage will vary with thermal state, background apps, and other factors.
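If you want to sanity-check tokens-per-second figures like these yourself, the arithmetic is just tokens emitted divided by wall-clock time. A minimal harness sketch follows; `measure_tok_s` and its `generate` callable are illustrative stand-ins for whatever inference binding your app actually uses (e.g. a llama.cpp wrapper), not a real API:

```python
import time

def measure_tok_s(generate, prompt: str, n_tokens: int = 128) -> float:
    """Tokens per second for a single generation call.

    `generate` is a placeholder for the real model call; it should
    produce up to `n_tokens` tokens for `prompt` and return the
    number of tokens it actually emitted.
    """
    start = time.perf_counter()
    produced = generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed
```

In practice you would warm the model up first and average several runs, since the first call pays model-load and cache-fill costs that later calls don't.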
| Metric | Llama 3.2 3B | Phi-3.5 Mini | Gemma 2 2B |
|---|---|---|---|
| File size (Q4) | 2.0 GB | 2.4 GB | 1.6 GB |
| RAM in use | ~3.0 GB | ~3.5 GB | ~2.5 GB |
| Tok/s (prompt) | ~120 | ~100 | ~150 |
| Tok/s (generation) | ~30 | ~25 | ~38 |
| First token latency | ~400 ms | ~500 ms | ~300 ms |
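To turn the table into felt latency, a reply takes roughly the first-token latency plus token count divided by generation speed. A quick sketch using the approximate figures above (the numbers are the table's estimates, not guarantees):

```python
def response_seconds(first_token_ms: float, gen_tok_s: float, n_tokens: int) -> float:
    """Rough wall-clock time for a reply: time-to-first-token
    plus steady-state generation time."""
    return first_token_ms / 1000 + n_tokens / gen_tok_s

# (first-token latency in ms, generation tok/s) from the table above
models = {
    "Llama 3.2 3B": (400, 30),
    "Phi-3.5 Mini": (500, 25),
    "Gemma 2 2B":   (300, 38),
}
for name, (ttft, rate) in models.items():
    # 200-token reply: ~7.1 s, ~8.5 s, ~5.6 s respectively
    print(f"{name}: ~{response_seconds(ttft, rate, 200):.1f} s")
```

For a 200-token answer that's roughly a three-second spread between the fastest and slowest model, which is very noticeable in a chat UI.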
Quality across task types
We ran a fixed suite of 20 prompts spanning the eight task categories below through all three models. Scores are qualitative 1-10 ratings averaged across the prompts in each category.
| Task category | Llama 3.2 3B | Phi-3.5 Mini | Gemma 2 2B |
|---|---|---|---|
| General conversation | 9/10 | 7/10 | 8/10 |
| Reasoning puzzles | 7/10 | 9/10 | 7/10 |
| Math word problems | 7/10 | 9/10 | 7/10 |
| Code completion | 7/10 | 9/10 | 6/10 |
| Creative writing | 9/10 | 7/10 | 8/10 |
| Summarization | 9/10 | 8/10 | 8/10 |
| Multilingual (non-English) | 7/10 | 6/10 | 9/10 |
| Safety / refusals | 7/10 | 8/10 | 9/10 |
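One way to read the table: a naive unweighted average over the eight rows puts all three models within about a tenth of a point of each other. This is illustrative arithmetic on the scores above, not an official ranking, and the right weighting depends entirely on what you use the model for:

```python
# Scores copied from the table above, in row order.
scores = {
    "Llama 3.2 3B": [9, 7, 7, 7, 9, 9, 7, 7],
    "Phi-3.5 Mini": [7, 9, 9, 9, 7, 8, 6, 8],
    "Gemma 2 2B":   [8, 7, 7, 6, 8, 8, 9, 9],
}
averages = {m: sum(s) / len(s) for m, s in scores.items()}
# All three land in the 7.75-7.88 range, which is why the
# per-task verdicts below matter more than an overall winner.
```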
The per-task verdict
General chat and creative writing: Llama 3.2 3B wins. The training data diversity shows up in tone, variety, and the ability to hold a freeform conversation.
Math, reasoning, and code: Phi-3.5 Mini wins, and it's not close. The textbook training data gives it a meaningful edge on structured tasks at a small parameter count.
Speed: Gemma 2 2B is the fastest, both on first-token latency and tokens-per-second. If you're building a chat UI where responsiveness matters, it's the pick.
Multilingual work: Gemma 2 2B also wins on non-English content. The training data included substantial multilingual material and it shows.
Safety and refusals: Gemma 2 has the cleanest out-of-the-box safety tuning. It's less likely to produce problematic output without explicit prompting.
Which should you install first?
If you're installing only one: Llama 3.2 3B. It's the most versatile. You'll be happy with it for 80% of what you'd use an AI for.
If you're installing two: add Phi-3.5 Mini. You'll immediately have better reasoning and code help for the 20% where Llama falls short.
If you're installing three: add Gemma 2 2B for the speed boost on quick tasks and for anything you want to do in a non-English language.
On a phone, the real answer is usually "have all three installed and switch between them for the task." That's about 6 GB of downloads total (2.0 + 2.4 + 1.6), which fits comfortably on most iPhones. PocketLLM lets you switch models in one tap, which is the cleanest way to do this in practice.
How this compares to the old Llama 3.2 post
Our earlier Llama 3.2 iPhone benchmarks post focused only on Llama 3.2 1B and 3B. The numbers there (~30 tok/s for the 3B on iPhone 15 Pro) match what we're seeing now. This post adds Phi-3.5 Mini and Gemma 2 2B to the picture so you can compare all three. If you want even more detail on the Llama 3.2 family specifically, that post has the full 1B-vs-3B breakdown.
The quick answer
Llama 3.2 3B is the best default. Phi-3.5 Mini is better for reasoning, math, and code. Gemma 2 2B is the fastest and the best at non-English. Together they cover the main small-model use cases on an iPhone and take up about 6 GB — install all three if you have the space. PocketLLM bundles them as one-tap downloads — join the waitlist.