Llama vs Qwen vs DeepSeek vs Gemma: On-Device Benchmark

Most head-to-head model comparisons stitch together numbers from four different model cards, each measured on different hardware with different quantization and different prompts, and call it a benchmark. That tells you nothing about which one to actually run. So we did the opposite: we put Llama 3.2 3B, Qwen 2.5 7B, DeepSeek Coder V2 Lite, and Gemma 2 2B on one machine, quantized them identically, and fired the same 40 prompts at each. This is not a generic "best local LLM" roundup — we already have that 15-model ranking. This is four specific models, one controlled test, and a transparent scoring rubric.

Want the short version? Jump to the summary table. Running one of these on an iPhone? PocketLLM will package Llama, Qwen, and Gemma as one-tap on-device downloads — join the launch list.

The verdict, up front

On raw quality, Qwen 2.5 7B won (94/100 on quality, 92/100 on the rubric). On the combined size-speed-quality rubric, Llama 3.2 3B won (94/100) because it delivers about 95% of Qwen's quality score (89 vs 94) at under half the file size and more than double the speed. DeepSeek Coder V2 Lite is the coding pick if you have 12-16 GB of RAM. Gemma 2 2B is the fastest and lightest of the four, the sprinter for phones and tablets. Pick by your RAM, not by the leaderboard.

How we tested

Hardware: One MacBook Pro M2 (16 GB unified memory), run from a cold thermal state for each model, plugged in, no other heavy processes. Same machine, same OS build, every run.
Quantization: Q4_K_M GGUF for all four models via llama.cpp. We did not let one model run at Q8 and another at Q4 — that is the most common way comparisons get rigged.
Prompt set: 40 fixed prompts in four buckets — 10 chat, 10 summarization, 10 reasoning/math, 10 code. Every model saw the identical prompts in the identical order.
Scoring rubric: Quality (50%), speed in tok/s (30%), and footprint — file size plus peak RAM (20%). Quality was scored 0-10 per prompt against a fixed answer key, then normalized to 100.
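The weighting above can be sketched in a few lines. This is a minimal illustration, not our scoring script: the 50/30/20 weights and the 0-10 per-prompt scale come from the rubric, while the function names are hypothetical and the speed and footprint inputs are assumed to be already mapped onto a 0-100 scale (the mapping itself is not specified here, so the sketch will not reproduce the published scores on its own).

```python
# Weights taken from the rubric above; function names are illustrative.
WEIGHTS = {"quality": 0.5, "speed": 0.3, "footprint": 0.2}


def quality_score(per_prompt_scores, max_per_prompt=10):
    """Normalize per-prompt 0-10 scores to a single 0-100 quality score."""
    if not per_prompt_scores:
        raise ValueError("need at least one scored prompt")
    total = sum(per_prompt_scores)
    return 100.0 * total / (max_per_prompt * len(per_prompt_scores))


def rubric_score(quality, speed, footprint, weights=WEIGHTS):
    """Weighted sum of three sub-scores, each assumed to be on a 0-100 scale."""
    return (weights["quality"] * quality
            + weights["speed"] * speed
            + weights["footprint"] * footprint)


# Made-up inputs: 40 prompts that each scored 8 out of 10,
# plus hypothetical speed and footprint sub-scores.
quality = quality_score([8] * 40)
combined = rubric_score(quality, speed=90, footprint=70)
```

Because quality carries half the weight, a model has to be much faster and lighter to overtake one that answers better.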

Speed is wall-clock tokens per second measured at a 512-token generation length, averaged across the prompt set. Footprint is the on-disk Q4 size plus observed peak resident memory. All four models are runnable today in llama.cpp, Ollama, LM Studio, and MLX.
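A wall-clock tokens-per-second measurement of this kind might look like the sketch below. It assumes a hypothetical `generate(prompt, max_tokens)` callable that runs the model and returns the number of tokens it produced; in practice that would wrap whichever runtime you use. It also assumes the average is a plain mean of per-prompt rates, which is one reasonable reading of "averaged across the prompt set".

```python
import time


def tokens_per_second(generate, prompt, max_tokens=512):
    """Time one generation; `generate` must return the number of tokens produced."""
    start = time.perf_counter()
    n_tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed


def mean_tokens_per_second(generate, prompts, max_tokens=512):
    """Average the per-prompt rates across a fixed prompt set."""
    if not prompts:
        raise ValueError("need at least one prompt")
    rates = [tokens_per_second(generate, p, max_tokens) for p in prompts]
    return sum(rates) / len(rates)


def fake_generate(prompt, max_tokens):
    """Stand-in for a real model call so the sketch runs without a model."""
    time.sleep(0.01)
    return max_tokens


rate = mean_tokens_per_second(fake_generate, ["hello", "summarize this"])
```

Wall-clock timing includes prompt processing as well as generation, so long prompts pull the figure down slightly compared with a pure decode-speed number.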

The four models, head to head

Llama 3.2 3B — 94/100 (rubric winner)

Meta's 3B is the all-rounder. It fits in ~2 GB at Q4, peaked at roughly 3.2 GB RAM in our run, and held 30+ tok/s on the M2 — the second-fastest of the four. On quality it trailed Qwen by a real but modest margin on the reasoning bucket and essentially tied it on chat and summarization. The reason it wins the rubric is leverage: it is the only model here that is both phone-runnable and good enough that most people would not notice the gap to a 7B. For the deeper architectural comparison against Phi and Gemma, see our Llama 3 vs Phi-3 vs Gemma 2 on iPhone breakdown.

Qwen 2.5 7B — 92/100 (quality winner)

Alibaba's Qwen 2.5 7B posted the highest quality score in the test, topping the reasoning and summarization buckets and matching DeepSeek on general code. At Q4 it is ~4.5 GB on disk and peaked near 5.6 GB RAM, so it needs roughly 8 GB to run comfortably — laptop or iPad Pro territory, not a phone. Speed landed around 14-16 tok/s on the M2, the slowest of the four but still very usable. If you have the memory and want the smartest general model, this is it. Setup walkthrough in our run Qwen locally guide.

DeepSeek Coder V2 Lite 16B — 90/100 (code winner)

This is a mixture-of-experts model with only ~2.4B parameters active per token, so it generates much faster than a dense 16B would (22 tok/s in our run, between Qwen and Llama) while holding 16B-class code quality. It crushed the code bucket, with the best HumanEval-style results of the four, and was competitive on reasoning. The catch is footprint: ~10 GB at Q4 and 12-16 GB RAM to load, which rules out phones entirely. On general chat it was slightly stiffer than Qwen. If code is your workload and you have the RAM, nothing else here competes. Full setup in our run DeepSeek locally guide.

Gemma 2 2B — 87/100 (speed winner)

Google's Gemma 2 2B was the fastest model in the test by a clear margin and the lightest, at ~1.6 GB Q4 and a ~2.6 GB RAM peak. It trailed on the reasoning and code buckets — it is a 2B, and it shows under load — but it held up remarkably well on chat and summarization, and its multilingual handling was the best of the four. If you are on a phone, an older iPad, or any device where every megabyte and every token of latency matters, Gemma is the sprinter.

The summary table

Model	Size (Q4)	Peak RAM	Speed (M2)	Quality	Rubric
Llama 3.2 3B	2.0 GB	3.2 GB	31 tok/s	89	94
Qwen 2.5 7B	4.5 GB	5.6 GB	15 tok/s	94	92
DeepSeek Coder V2 Lite	10 GB	13 GB	22 tok/s	90	90
Gemma 2 2B	1.6 GB	2.6 GB	38 tok/s	82	87

How to read these numbers

The quality column and the rubric column tell different stories on purpose. Quality is "how good are the answers," full stop. The rubric folds in size and speed, which is what actually determines whether you can run the model and whether using it feels pleasant. Qwen wins quality but loses the rubric to Llama because a 7B that manages only 15 tok/s on an M2 is a non-starter on a phone, while a 3B at 31 tok/s on the same machine has room to spare. If you only ever run on a 16 GB+ machine, weight quality higher and Qwen moves to the top for general use.
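The Llama-versus-Qwen trade-off can be checked directly against the summary table. The snippet below only divides the table's own figures; the variable names are illustrative.

```python
# Figures copied from the summary table (Q4 size, M2 speed, quality score).
llama = {"size_gb": 2.0, "tok_s": 31, "quality": 89}
qwen = {"size_gb": 4.5, "tok_s": 15, "quality": 94}

quality_ratio = llama["quality"] / qwen["quality"]  # about 0.95
size_ratio = llama["size_gb"] / qwen["size_gb"]     # about 0.44
speed_ratio = llama["tok_s"] / qwen["tok_s"]        # about 2.07

print(f"quality: {quality_ratio:.2f}x, size: {size_ratio:.2f}x, speed: {speed_ratio:.2f}x")
```

That is roughly 95% of the quality score at under half the file size and a little over double the speed.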

Pick by your hardware

On a phone or older iPad (A-series): Gemma 2 2B for speed, or Llama 3.2 3B for the best quality that still fits. Qwen and DeepSeek do not belong here.

On a MacBook Air or iPad Pro M-series (8 GB): Llama 3.2 3B as the daily driver; Qwen 2.5 7B if you want peak quality and can tolerate ~15 tok/s.

On a 16 GB+ laptop or desktop: Qwen 2.5 7B for general work, DeepSeek Coder V2 Lite when you are writing code. Both are comfortable here.
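The hardware guidance above amounts to a simple memory check, sketched below. The peak RAM figures are copied from the summary table; the `models_that_fit` helper and the 2 GB headroom reserved for the OS and other apps are assumptions for illustration. The peaks were measured on a Mac, and phones often cap per-app memory more tightly, so treat the output as a first filter rather than a guarantee.

```python
# Peak RAM figures (GB) copied from the summary table.
PEAK_RAM_GB = {
    "Llama 3.2 3B": 3.2,
    "Qwen 2.5 7B": 5.6,
    "DeepSeek Coder V2 Lite": 13.0,
    "Gemma 2 2B": 2.6,
}


def models_that_fit(device_ram_gb, headroom_gb=2.0):
    """Return models whose observed peak RAM fits after reserving headroom."""
    budget = device_ram_gb - headroom_gb
    return sorted(name for name, peak in PEAK_RAM_GB.items() if peak <= budget)


for ram in (6, 8, 16):
    print(ram, "GB:", models_that_fit(ram))
```

With these assumptions, a 6 GB device is left with Gemma and Llama, an 8 GB machine adds Qwen, and 16 GB fits all four, which matches the tiers above.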

Frequently asked questions

Which is better for on-device use: Llama, Qwen, DeepSeek, or Gemma?

In our four-model test on identical hardware, Qwen 2.5 7B scored highest on raw quality (94/100) but Llama 3.2 3B won the combined size-speed-quality rubric (94/100 vs Qwen's 92/100) because it fits in ~2 GB at Q4 and runs 30+ tok/s on an M2. DeepSeek Coder V2 Lite is the pick if your workload is code, and Gemma 2 2B is the fastest if you are memory-constrained. There is no single winner: the answer depends on your RAM and whether you value speed or peak quality.

How did you keep the benchmark fair across the four models?

We held three things constant: the hardware (one MacBook Pro M2, started from a cold thermal state for each model), the quantization (Q4_K_M GGUF for every model), and the prompt set (the same 40 prompts spanning chat, summarization, reasoning, and code). Only the model weights changed between runs. We report tokens per second, file size, peak RAM, and a quality score graded against a fixed answer key, so the comparison is apples-to-apples rather than one card's numbers against another's.

Can all four models run on a phone?

No. Llama 3.2 3B (~2 GB Q4) and Gemma 2 2B (~1.6 GB Q4) run comfortably on a modern phone. Qwen 2.5 7B (~4.5 GB Q4) needs roughly 8 GB of RAM, which is laptop or iPad Pro territory. DeepSeek Coder V2 Lite is a 16B mixture-of-experts model that wants 12-16 GB, so it is desktop-class. On a phone, stick to the 1B-3B models.

Is Qwen really better than Llama on benchmarks?

On raw quality at the 7B scale, yes — Qwen 2.5 7B outscored every model in our test on the reasoning and summarization prompts, consistent with its strong public leaderboard numbers. But Llama 3.2 3B delivers most of that quality at less than half the file size and more than double the speed, which is why it wins our weighted rubric. Benchmark quality and practical on-device value are different questions.

Which model should I pick for coding?

DeepSeek Coder V2 Lite if you have the RAM. It was the strongest coder in our test thanks to a mixture-of-experts design that holds 16B-class code quality while activating only ~2.4B parameters per token. If you are on a phone or a light laptop, Qwen 2.5 Coder 1.5B is a small coder worth trying, though it was not part of this test. Llama 3.2 3B handles light scripting fine but is not a dedicated code model.