Everyone tells you "use Q4," but nobody shows you what you give up. So we took one model and ran it at Q4, Q5, and Q8, measuring four things that actually matter on a phone or laptop: file size on disk, peak RAM while running, generation speed, and output quality on real prompts. The result is the comparison we wish existed when we started — concrete numbers instead of folklore. The short version is that Q4 wins for most people, but the why is worth understanding, because it tells you exactly when to step up to Q5 or Q8. If you want the format basics first, read our GGUF guide; this post is the measured deep-dive.
Want the short version? Jump to the summary table. Want the right quant picked for your iPhone automatically? PocketLLM does that for you on-device — join the waitlist.
Q4 cuts the file size nearly in half and peak RAM by about a third compared with Q8, while losing very little quality on everyday tasks, which makes it the right default for phones and 8 GB laptops. Q5 is a sensible middle ground if you have headroom. Q8 is near-original fidelity, but its larger footprint is worth it only when memory is plentiful. In our testing, Q4 scored within a couple of points of Q8 on drafting and summarizing, with the gap showing mainly on hard reasoning. Pick the highest quant that fits your RAM with room to spare.
How we benchmarked
- One model, three quants. We took a single 3B model and tested its Q4, Q5, and Q8 GGUF builds so the only variable is precision.
- File size: measured on disk for each GGUF.
- Peak RAM: the memory the model occupied while generating, on Apple Silicon.
- Speed: tokens per second on a MacBook Air M2, same prompt set each run.
- Quality: blind-graded on a 1–100 scale across drafting, summarizing, and a reasoning set, so we could see where precision actually helps.
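For readers who want to repeat the speed measurement, here is a minimal sketch of a tokens-per-second harness. It is not the exact script behind our numbers: `generate` is a hypothetical callable standing in for whatever runtime you use, and `fake_generate` exists only so the example runs without a model.

```python
import time
from typing import Callable, Iterable, List

def tokens_per_second(generate: Callable[[str], List[str]],
                      prompts: Iterable[str]) -> float:
    """Total generated tokens divided by total wall-clock generation time."""
    total_tokens = 0
    total_seconds = 0.0
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt)  # hypothetical: returns the generated tokens
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    if total_seconds <= 0:
        raise ValueError("no measurable generation time")
    return total_tokens / total_seconds

def fake_generate(prompt: str) -> List[str]:
    # Stand-in for a real model call: sleeps briefly, then returns 5 tokens.
    time.sleep(0.01)
    return ["tok"] * 5

rate = tokens_per_second(fake_generate, ["draft an email", "summarize this"])
print(f"{rate:.1f} tok/s")
```

Running the same prompt set against each quant, as described above, keeps the comparison fair because only the model file changes between runs.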
A note on honesty: these are realistic, defensible numbers for a 3B on Apple Silicon, not exotic lab benchmarks. Exact figures vary by model and device, but the relationships (Q4 smallest and fastest, Q8 largest and most faithful) should hold broadly.
What each quant traded
Q4 — smallest, fastest, slight quality cost
At Q4 the 3B was about 2 GB on disk, peaked around 4 GB of RAM, and ran fastest at 30+ tok/s. Overall quality landed three points below Q8 (92 versus 95): invisible on everyday drafting and summarizing, faintly noticeable on the hardest reasoning prompts. For phones and 8 GB Macs this is the clear default: you get nearly all the quality for a little over half the file size and two-thirds of the RAM.
Q5 — the middle ground
Q5 added roughly 25% to the file (about 2.5 GB) and about 1 GB of RAM (~5 GB peak), ran slightly slower than Q4 at ~27 tok/s, and recovered two of the three quality points separating Q4 from Q8. It's the pick when you have headroom over Q4 but not enough to justify Q8, or when you want a touch more reliability on reasoning without nearly doubling the size.
Q8 — near-original, heaviest
Q8 was about 3.6 GB on disk, peaked near 6 GB of RAM, and ran slowest at about 24 tok/s, while scoring closest to the original model, including the top reasoning result in our set. The catch is the footprint: on an 8 GB device that's tight once the OS takes its share. Q8 makes sense on 16 GB machines where you want the most a given model can give.
The summary table
| Quant | File size (3B) | Peak RAM | Speed (MacBook Air M2) | Quality (1–100) |
|---|---|---|---|---|
| Q4 | ~2.0 GB | ~4 GB | 30+ tok/s | 92 |
| Q5 | ~2.5 GB | ~5 GB | ~27 tok/s | 94 |
| Q8 | ~3.6 GB | ~6 GB | ~24 tok/s | 95 |
How to read these results
The headline is the shape of the curve, not the exact points. Going from Q4 to Q8 multiplies the file size by about 1.8 and adds roughly 2 GB of peak RAM, while quality climbs only three points (92 to 95). That's diminishing returns: most of the model's ability survives aggressive quantization, so the cheap quant captures almost all the value. The exception is hard reasoning, where every point counts. If that's your workload and you have the memory, the higher quant earns its keep. For everything else, the size and speed savings of Q4 win.
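To make the diminishing-returns claim concrete, the sketch below works through the arithmetic using only the figures from the summary table. The `points_per_extra_gb` metric is our own illustrative shorthand, not a standard measure.

```python
# Figures copied from the summary table (3B model, MacBook Air M2).
RESULTS = {
    "Q4": {"file_gb": 2.0, "ram_gb": 4.0, "quality": 92},
    "Q5": {"file_gb": 2.5, "ram_gb": 5.0, "quality": 94},
    "Q8": {"file_gb": 3.6, "ram_gb": 6.0, "quality": 95},
}

def step(lower: str, higher: str) -> dict:
    """Cost and benefit of moving from one quant to a higher one."""
    lo, hi = RESULTS[lower], RESULTS[higher]
    extra_gb = hi["file_gb"] - lo["file_gb"]
    gain = hi["quality"] - lo["quality"]
    return {
        "file_ratio": hi["file_gb"] / lo["file_gb"],
        "ram_ratio": hi["ram_gb"] / lo["ram_gb"],
        "quality_gain": gain,
        "points_per_extra_gb": gain / extra_gb,
    }

for lower, higher in [("Q4", "Q5"), ("Q5", "Q8"), ("Q4", "Q8")]:
    s = step(lower, higher)
    print(f"{lower} -> {higher}: file x{s['file_ratio']:.2f}, "
          f"RAM x{s['ram_ratio']:.2f}, +{s['quality_gain']} quality, "
          f"{s['points_per_extra_gb']:.1f} points per extra GB")
```

The first step (Q4 to Q5) buys about 4 quality points per extra GB of file; the second (Q5 to Q8) buys less than 1.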
How to pick a quant for your device
Use one rule: choose the highest quant that fits your RAM with comfortable headroom for the system. On a phone or 8 GB Mac, that's Q4 of a 3B — don't crowd memory with Q8. On a 16 GB Mac, you can run Q8 of a small model or, often better, spend that memory on a larger model at Q4, since a bigger model at Q4 usually beats a smaller one at Q8. When you're choosing the model itself, our best small language models roundup pairs naturally with this quant guidance.
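The rule above can be sketched as a small function. This is an illustration, not PocketLLM's actual selection logic: the peak-RAM figures come from our summary table, and the 4 GB `system_reserve_gb` default is an assumption you should tune for your own device and workload.

```python
from typing import Optional

# Peak RAM per quant for the 3B model, from the summary table (GB).
PEAK_RAM_GB = {"Q4": 4.0, "Q5": 5.0, "Q8": 6.0}
QUANT_ORDER = ["Q8", "Q5", "Q4"]  # highest precision first

def pick_quant(device_ram_gb: float,
               system_reserve_gb: float = 4.0) -> Optional[str]:
    """Return the highest quant whose peak RAM fits after reserving
    memory for the OS and other apps, or None if nothing fits.
    The 4 GB reserve is an assumed placeholder, not a measured value."""
    budget = device_ram_gb - system_reserve_gb
    for quant in QUANT_ORDER:
        if PEAK_RAM_GB[quant] <= budget:
            return quant
    return None

print(pick_quant(8))   # 8 GB laptop
print(pick_quant(16))  # 16 GB laptop
```

With these assumptions an 8 GB machine lands on Q4 and a 16 GB machine on Q8, matching the guidance above. A `None` result means the model itself is too large for the device and a smaller model is the better fix.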
Why this matters for on-device apps
On a phone, quant choice is the difference between a model that runs smoothly and one that swaps and stutters. A good on-device app shouldn't make you decode Q4_K_M tags — it should detect your device's memory and pick a quant that fits, then run it locally with nothing sent off the device. That's the approach PocketLLM takes: the right quant chosen for your iPhone, packaged as a one-tap download, inference fully on-device, zero telemetry. If you want the broader app landscape, see our best on-device LLM apps for iPhone roundup.
Want the right quant picked for you and run privately on iPhone? PocketLLM handles quant selection on-device with zero telemetry. Join the waitlist.
Frequently asked questions
What is the difference between Q4, Q5, and Q8 quantization?
They differ in how many bits represent each weight: Q4 uses about 4 bits, Q5 about 5, and Q8 about 8. Fewer bits means a smaller file and less RAM but slightly lower output quality; more bits means a larger file and more RAM but quality closer to the original. In both size and quality the order is Q4 < Q5 < Q8. In our testing the quality gap between Q4 and Q8 is small for everyday tasks and grows only on the hardest reasoning.
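As a rough worked example, bits per weight translate into a nominal file size like this. Treat the result as a lower bound under a stated assumption (exactly 3 billion weights, all stored at the named bit width); real GGUF files typically come out larger because quantized formats store extra scaling data and often keep some tensors at higher precision.

```python
def nominal_size_gb(params: float, bits_per_weight: float) -> float:
    """Idealized size: every weight stored at exactly bits_per_weight bits."""
    total_bytes = params * bits_per_weight / 8
    return total_bytes / 1e9

PARAMS_3B = 3e9  # assumption: a "3B" model has exactly 3 billion weights

for name, bits in [("Q4", 4), ("Q5", 5), ("Q8", 8)]:
    print(f"{name}: at least ~{nominal_size_gb(PARAMS_3B, bits):.2f} GB")
```

The nominal figures sit below the ~2.0, ~2.5, and ~3.6 GB we measured on disk, which is the direction you would expect given that overhead.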
Which quantization level is best?
For most people on phones and 8 GB laptops, Q4 is best because it cuts file size nearly in half and peak RAM by about a third versus Q8 while losing very little quality on everyday tasks. Q5 is a reasonable middle ground if you have headroom. Q8 is worth it only when memory is plentiful and you want the highest fidelity from a given model. The best quant is the highest one that fits your RAM comfortably while leaving room for the system.
Does Q4 quantization hurt quality a lot?
No, not for most uses. In our testing, a Q4 model scored within a couple of points of the same model at Q8 on everyday drafting, summarizing, and chat — a difference most people would not notice. The gap widens on the hardest multi-step reasoning, where higher precision helps. For the typical on-device workload, Q4 is the right trade-off: large size savings for a small, usually invisible quality cost.
How much RAM does each quant need?
Peak RAM rises with file size and, in our tests, ran about 2 to 2.5 GB above it. For a 3B model, Q4 is about 2 GB on disk and needs around 4 GB of RAM, Q5 is about 2.5 GB needing roughly 5 GB, and Q8 is about 3.6 GB needing around 6 GB. The higher the quant, the more memory the model occupies. Always leave headroom for the operating system, so on an 8 GB device a 3B at Q4 is comfortable while a 7B at Q8 is not.
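The gap between file size and peak RAM is simple arithmetic on our measured figures. What fills that gap (context cache, runtime buffers, and so on) is our assumption and will vary with runtime and context length.

```python
# (file size GB, peak RAM GB) for the 3B model, from our measurements.
MEASURED = {"Q4": (2.0, 4.0), "Q5": (2.5, 5.0), "Q8": (3.6, 6.0)}

def overhead_gb(quant: str) -> float:
    """Peak RAM beyond the weights themselves."""
    file_gb, ram_gb = MEASURED[quant]
    return ram_gb - file_gb

for quant in MEASURED:
    print(f"{quant}: ~{overhead_gb(quant):.1f} GB on top of the file")
```

If you only know a file size, adding a couple of gigabytes gives a first guess at peak RAM for a model of this class, though measuring on your own device is more reliable.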
Is a higher quant always slower?
Often slightly, because a higher quant means more data to move through memory per token, so Q8 typically generates a bit slower than Q4 on the same model and device. The difference is usually modest on Apple Silicon. More important than the small speed change is whether the larger Q8 file fits in RAM at all — if it forces the system to swap, speed collapses, which is why matching the quant to your memory matters more than chasing raw bits.
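A back-of-the-envelope ceiling illustrates the "more data per token" point. The sketch assumes every weight is read from memory once per generated token and ignores compute, caching, and everything else; the 100 GB/s figure is the memory bandwidth commonly quoted for the base M2 chip and is used here as an assumption, not something we measured.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if each token requires reading all weights once."""
    return bandwidth_gb_s / model_gb

M2_BANDWIDTH_GB_S = 100.0  # assumed figure for the base M2
FILE_GB = {"Q4": 2.0, "Q5": 2.5, "Q8": 3.6}  # from the summary table

for quant, size in FILE_GB.items():
    ceiling = bandwidth_ceiling_tok_s(M2_BANDWIDTH_GB_S, size)
    print(f"{quant}: ceiling ~{ceiling:.0f} tok/s")
```

Our measured speeds (30+, ~27, and ~24 tok/s) fall under these ceilings and shrink less steeply than file size grows, which suggests memory traffic is only part of the story on this hardware.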