
Llama 3.2 on iPhone: Benchmarks, Setup, and Real-World Performance

Meta released Llama 3.2 with the explicit goal of making language models that run on phones. The 1B and 3B variants are small enough to fit comfortably on modern iPhones while still delivering surprisingly good quality. This post covers our benchmark suite: what you actually get when you run Llama 3.2 on an iPhone.

All numbers below are from our own testing using PocketLLM's llama.cpp backend on iPhone 15 Pro (8 GB) and iPhone 16 Pro (8 GB) with models quantized to Q4_K_M.

Test setup

  • Devices: iPhone 15 Pro, iPhone 16 Pro
  • Models: Llama 3.2 1B Instruct, Llama 3.2 3B Instruct
  • Quantization: Q4_K_M
  • Context: 4096 tokens
  • Runtime: llama.cpp with Metal acceleration
  • Each number is the average of 10 runs.
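For reference, the methodology above can be sketched as a tiny measurement harness. The names here (`benchmark_throughput`, `fake_generate`) are hypothetical, not PocketLLM code; the stand-in "model" just sleeps to simulate a fixed generation rate.

```python
import time
import statistics

def benchmark_throughput(generate, prompt, max_tokens=100, runs=10):
    """Average tokens/sec over several runs, as in the tables below."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt, max_tokens)  # returns token count produced
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.mean(rates)

# Stand-in "model" that emits tokens at a fixed simulated rate
def fake_generate(prompt, max_tokens, tok_per_sec=42):
    time.sleep(max_tokens / tok_per_sec)
    return max_tokens

print(f"{benchmark_throughput(fake_generate, 'hello', 20, runs=3):.0f} tok/s")
```

Averaging across runs matters on a phone: background activity and thermal state add more run-to-run variance than you see on a desktop.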

Throughput (tokens per second)

Model              iPhone 15 Pro   iPhone 16 Pro
Llama 3.2 1B Q4    42 tok/s        51 tok/s
Llama 3.2 3B Q4    18 tok/s        22 tok/s

Context: ChatGPT streams at roughly 50–100 tok/s on cloud GPUs. The 1B model on iPhone 16 Pro is in the same ballpark. The 3B model is slower but still comfortably faster than you can read. For normal chat use, 18 tok/s feels like "quick typing from a very competent assistant."
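To put "faster than you can read" in numbers: assuming roughly 0.75 English words per token and a typical silent reading speed of around 250 words per minute (both rules of thumb, not measurements from this test), even the slowest configuration outpaces reading by 3x or more.

```python
def words_per_minute(tok_per_sec, words_per_token=0.75):
    """Convert generation speed to an equivalent words-per-minute rate."""
    return tok_per_sec * words_per_token * 60

print(f"{words_per_minute(18):.0f} wpm")  # 3B on iPhone 15 Pro → 810 wpm
print(f"{words_per_minute(42):.0f} wpm")  # 1B on iPhone 15 Pro → 1890 wpm
```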

Time to first token

How long from pressing send to seeing the first character of the reply.

Model              iPhone 15 Pro   iPhone 16 Pro
Llama 3.2 1B Q4    180 ms          140 ms
Llama 3.2 3B Q4    380 ms          310 ms

Under 400 ms is the threshold for feeling "instant." Both models clear it.

RAM usage at runtime

Model              Idle (model loaded)   Peak (during generation)
Llama 3.2 1B Q4    ~950 MB               ~1.2 GB
Llama 3.2 3B Q4    ~2.1 GB               ~2.8 GB

The 3B Q4 model fits comfortably within the iPhone 15 Pro's 8 GB. On a 6 GB device (such as a base iPhone 15 or an iPhone 13 Pro), you can still run it, but iOS will evict backgrounded apps more aggressively.
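The idle figures are close to what you'd predict from first principles: quantized weights at roughly 4.8 bits per parameter for Q4_K_M, plus an fp16 KV cache sized by the model's architecture. The parameter counts and layer/head shapes below are the published Llama 3.2 configurations as we understand them, and the 4.8 bits/weight figure is an approximation; treat this as a back-of-the-envelope sketch, not how PocketLLM actually allocates memory.

```python
def estimate_ram_mb(params_b, n_layers, n_kv_heads, head_dim,
                    ctx=4096, bits_per_weight=4.8):
    """Rough idle-memory estimate: quantized weights + fp16 KV cache."""
    weights = params_b * 1e9 * bits_per_weight / 8             # bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx * 2  # K+V, fp16
    return (weights + kv_cache) / 2**20

# Llama 3.2 1B: ~1.24B params, 16 layers, 8 KV heads, head_dim 64
print(round(estimate_ram_mb(1.24, 16, 8, 64)))   # → 838 (MB)
# Llama 3.2 3B: ~3.21B params, 28 layers, 8 KV heads, head_dim 128
print(round(estimate_ram_mb(3.21, 28, 8, 128)))  # → 2285 (MB)
```

Both estimates land in the same ballpark as the measured idle numbers; the remaining gap comes down to runtime scratch buffers, the app itself, and exactly how llama.cpp maps the weights.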

Battery impact

We ran back-to-back 100-token generations for 30 minutes and measured battery drop:

Model              Battery drop (30 min continuous)
Llama 3.2 1B Q4    ~6%
Llama 3.2 3B Q4    ~11%

That's for continuous generation. Real chat usage — a few messages at a time, spread across the day — barely registers on battery. For comparison, 30 minutes of video streaming is usually 8–12%.
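For the curious, those drain numbers translate into a rough energy-per-token figure. The ~12.7 Wh battery capacity below is our assumption for an iPhone 15 Pro-class battery; the rest comes straight from the tables above.

```python
def joules_per_token(battery_wh, drop_pct, minutes, tok_per_sec):
    """Rough energy cost per generated token during continuous use."""
    energy_j = battery_wh * 3600 * drop_pct / 100  # Wh consumed → joules
    tokens = tok_per_sec * minutes * 60            # total tokens generated
    return energy_j / tokens

# 3B on iPhone 15 Pro: ~11% drop over 30 min at 18 tok/s
print(f"{joules_per_token(12.7, 11, 30, 18):.2f} J/token")  # → 0.16 J/token
```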

Thermal behavior

iPhone 15 Pro gets noticeably warm after about 5 minutes of continuous generation with the 3B model. iPhone 16 Pro is cooler (probably due to the updated thermal design). Neither thermally throttles within a 30-minute test, meaning throughput stays flat. Long enough sessions (45+ minutes continuous) will eventually trigger throttling, but that's not a realistic usage pattern for chat.

Quality: how smart are these models actually?

Raw throughput doesn't matter if the replies are bad. We tested Llama 3.2 1B and 3B on a set of 40 everyday prompts — rewrites, summaries, explanations, brainstorms, a few reasoning puzzles, and a few code questions — and scored them qualitatively against GPT-4o mini and Claude Haiku.

Where Llama 3.2 3B does well

  • Rewriting and tone adjustments. Indistinguishable from cloud models for most prompts.
  • Summarization of articles up to around 2,000 words.
  • Short-form drafting. Emails, Slack replies, cover letter paragraphs.
  • Explanations. "Explain X like I'm 12" works well, especially for concrete topics.
  • Brainstorming. Names, titles, subject lines, ideas.

Where it falls short

  • Multi-step reasoning. "If A, then B, then given C, what would D imply?" — the 3B model sometimes loses the thread.
  • Code on non-trivial projects. It can write and debug small functions but struggles with larger context.
  • Obscure facts. Small models forget more; ask about a niche topic and you'll get more hallucinations than with GPT-4o.

What about the 1B model?

The 1B is surprisingly capable for its size — Meta clearly optimized hard for this regime. It's not in the same league as the 3B on reasoning, but for rewriting, simple summarization, and casual chat, the quality difference is smaller than you'd expect from the parameter count alone.

Honest recommendation

If your iPhone has 8 GB RAM or more, use the 3B. If it has 6 GB, the 1B is actually a better experience — less memory pressure, faster generation, and quality that's still good for daily tasks. Don't force the 3B onto a constrained device; the 1B will serve you better.

How to install Llama 3.2 on your iPhone

  1. Install PocketLLM from the App Store.
  2. Open the model library.
  3. Pick Llama 3.2 1B Q4 (if you have 6 GB RAM) or 3B Q4 (if you have 8 GB).
  4. Wait for the download. 1B is about 800 MB, 3B is about 2 GB.
  5. Start a new chat.

That's all of it. Apps like LLM Farm and MLC Chat also support Llama 3.2, but require more manual model management.

The verdict

Llama 3.2 is the first truly mobile-first open model that matters. The 3B variant on modern iPhones is genuinely competitive with cloud AI for everyday tasks, the 1B variant is shockingly good for its footprint, and both run comfortably within the constraints of a phone.

If you've been waiting for on-device AI to "get good," it already has. The gap between "this is a tech demo" and "this is my default AI" closed quietly somewhere around the Llama 3.2 release. The hardware was ready — the models just caught up.

Run Llama 3.2 on your iPhone today.

PocketLLM ships Llama 3.2 1B and 3B with Q4_K_M quantization tuned for iPhone. Join the waitlist.
