Three small models dominate the on-device iPhone conversation in 2026: Llama 3.2 3B, Phi-3.5 Mini 3.8B, and Gemma 2 2B. All three run on an iPhone 15 Pro. All three are small enough to download over cellular without dying inside. All three are genuinely good. But they're not interchangeable. This post is a head-to-head: which one to install first, and when to switch to one of the others. It extends our longer Llama 3.2 iPhone benchmarks with two extra models.
Short version: install Llama 3.2 3B for general use, Phi-3.5 Mini for reasoning and code, Gemma 2 2B for the fastest generation and multilingual work. PocketLLM bundles all three — join the waitlist.
The contenders
Llama 3.2 3B — Meta, September 2024, Llama Community License, ~2.0 GB at Q4. Pruned and distilled from larger Llama 3.1 models. The default answer to "what should I run on a phone?"
Phi-3.5 Mini (3.8B) — Microsoft, August 2024, MIT license, ~2.4 GB at Q4. Trained on a carefully curated "textbook" dataset designed to maximize reasoning and coding capability at small scale.
Gemma 2 2B — Google DeepMind, July 2024, Gemma Terms, ~1.6 GB at Q4. Distilled from larger Gemma variants with strong safety tuning and multilingual coverage.
Performance on iPhone 15 Pro
All three models were tested in Q4 quantization on an iPhone 15 Pro with 8 GB of RAM, running through llama.cpp via a native iOS app. Numbers below are approximate based on reported benchmarks and community measurements; your mileage will vary with thermal state, background apps, and other factors.
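If you want to sanity-check tokens-per-second figures like these yourself, the arithmetic is just tokens emitted divided by wall-clock time. A minimal harness sketch follows; `measure_tok_s` and its `generate` callable are illustrative stand-ins for whatever inference binding your app actually uses (e.g. a llama.cpp wrapper), not a real API:

```python
import time

def measure_tok_s(generate, prompt: str, n_tokens: int = 128) -> float:
    """Tokens per second for a single generation call.

    `generate` is a placeholder for the real model call; it should
    produce up to `n_tokens` tokens for `prompt` and return the
    number of tokens it actually emitted.
    """
    start = time.perf_counter()
    produced = generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed
```

In practice you would warm the model up first and average several runs, since the first call pays model-load and cache-fill costs that later calls don't.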
| Metric | Llama 3.2 3B | Phi-3.5 Mini | Gemma 2 2B |
|---|---|---|---|
| File size (Q4) | 2.0 GB | 2.4 GB | 1.6 GB |
| RAM in use | ~3.0 GB | ~3.5 GB | ~2.5 GB |
| Tok/s (prompt) | ~120 | ~100 | ~150 |
| Tok/s (generation) | ~30 | ~25 | ~38 |
| First token latency | ~400 ms | ~500 ms | ~300 ms |
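To turn the table into felt latency, a reply takes roughly the first-token latency plus token count divided by generation speed. A quick sketch using the approximate figures above (the numbers are the table's estimates, not guarantees):

```python
def response_seconds(first_token_ms: float, gen_tok_s: float, n_tokens: int) -> float:
    """Rough wall-clock time for a reply: time-to-first-token
    plus steady-state generation time."""
    return first_token_ms / 1000 + n_tokens / gen_tok_s

# (first-token latency in ms, generation tok/s) from the table above
models = {
    "Llama 3.2 3B": (400, 30),
    "Phi-3.5 Mini": (500, 25),
    "Gemma 2 2B":   (300, 38),
}
for name, (ttft, rate) in models.items():
    # 200-token reply: ~7.1 s, ~8.5 s, ~5.6 s respectively
    print(f"{name}: ~{response_seconds(ttft, rate, 200):.1f} s")
```

For a 200-token answer that's roughly a three-second spread between the fastest and slowest model, which is very noticeable in a chat UI.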
Quality across task types
We ran a fixed suite of 20 prompts spanning the eight task categories below through all three models. Scores are qualitative 1-10 ratings averaged across the prompts in each category.
| Task category | Llama 3.2 3B | Phi-3.5 Mini | Gemma 2 2B |
|---|---|---|---|
| General conversation | 9/10 | 7/10 | 8/10 |
| Reasoning puzzles | 7/10 | 9/10 | 7/10 |
| Math word problems | 7/10 | 9/10 | 7/10 |
| Code completion | 7/10 | 9/10 | 6/10 |
| Creative writing | 9/10 | 7/10 | 8/10 |
| Summarization | 9/10 | 8/10 | 8/10 |
| Multilingual (non-English) | 7/10 | 6/10 | 9/10 |
| Safety / refusals | 7/10 | 8/10 | 9/10 |
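One way to read the table: a naive unweighted average over the eight rows puts all three models within about a tenth of a point of each other. This is illustrative arithmetic on the scores above, not an official ranking, and the right weighting depends entirely on what you use the model for:

```python
# Scores copied from the table above, in row order.
scores = {
    "Llama 3.2 3B": [9, 7, 7, 7, 9, 9, 7, 7],
    "Phi-3.5 Mini": [7, 9, 9, 9, 7, 8, 6, 8],
    "Gemma 2 2B":   [8, 7, 7, 6, 8, 8, 9, 9],
}
averages = {m: sum(s) / len(s) for m, s in scores.items()}
# All three land in the 7.75-7.88 range, which is why the
# per-task verdicts below matter more than an overall winner.
```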
The per-task verdict
General chat and creative writing: Llama 3.2 3B wins. The training data diversity shows up in tone, variety, and the ability to hold a freeform conversation.
Math, reasoning, and code: Phi-3.5 Mini wins, and it's not close. The textbook training data gives it a meaningful edge on structured tasks at a small parameter count.
Speed: Gemma 2 2B is the fastest, both on first-token latency and tokens-per-second. If you're building a chat UI where responsiveness matters, it's the pick.
Multilingual work: Gemma 2 2B also wins on non-English content. The training data included substantial multilingual material and it shows.
Safety and refusals: Gemma 2 has the cleanest out-of-the-box safety tuning. It's less likely to produce problematic output without explicit prompting.
Which should you install first?
If you're installing only one: Llama 3.2 3B. It's the most versatile. You'll be happy with it for 80% of what you'd use an AI for.
If you're installing two: add Phi-3.5 Mini. You'll immediately have better reasoning and code help for the 20% where Llama falls short.
If you're installing three: add Gemma 2 2B for the speed boost on quick tasks and for anything you want to do in a non-English language.
On a phone, the real answer is usually "have all three installed and switch between them for the task." That's about 6 GB of downloads total (2.0 + 2.4 + 1.6), which fits comfortably on most iPhones. PocketLLM lets you switch models in one tap, which is the cleanest way to do this in practice.
How this compares to the old Llama 3.2 post
Our earlier Llama 3.2 iPhone benchmarks post focused only on Llama 3.2 1B and 3B. The numbers there (~30 tok/s for the 3B on iPhone 15 Pro) match what we're seeing now. This post adds Phi-3.5 Mini and Gemma 2 2B to the picture so you can compare all three. If you want even more detail on the Llama 3.2 family specifically, that post has the full 1B-vs-3B breakdown.
The quick answer
Llama 3.2 3B is the best default. Phi-3.5 Mini is better for reasoning, math, and code. Gemma 2 2B is the fastest and the best at non-English. Together they cover the main small-model use cases on an iPhone and take up about 6 GB — install all three if you have the space. PocketLLM bundles them as one-tap downloads — join the waitlist.