What Are GGUF Models? A Guide for iPhone and Mac

Q: What is a GGUF model?

GGUF is a single-file format for storing a quantized language model so it can be loaded and run by llama.cpp and the apps built on it. The file holds the model weights plus metadata like the tokenizer and architecture, all in one place. It is the standard format for running LLMs locally on a phone, Mac, or PC because it loads fast and supports the Q4, Q5, and Q8 quantizations that shrink a model to fit in limited RAM.

Q: What do Q4, Q5, and Q8 mean in GGUF?

They are quantization levels — how many bits represent each weight. Q4 uses about 4 bits per weight, Q5 about 5, and Q8 about 8. Lower numbers mean a smaller file and less RAM but slightly lower quality; higher numbers mean a larger file and more RAM but quality closer to the original. In size and quality the order is Q4 < Q5 < Q8. For most phones and laptops Q4 is the practical default.

Q: Which GGUF quant should I use for my device?

Match the quant to your RAM. On a phone or an 8 GB Mac, use Q4 — a 3B model at Q4 is about 2 GB and a 7B is about 4.5 GB. On a 16 GB Mac you can afford Q5 or Q8 of a small model for a slight quality bump, or move up to a larger model at Q4. The rule of thumb: pick the largest model your RAM allows, then the highest quant that still fits comfortably.

Q: How big is a GGUF file?

It depends on the model's parameter count and the quant level. At Q4, a 1B model is roughly 0.7 to 0.8 GB, a 3B is about 2 GB, and a 7B is about 4.5 GB. Moving from Q4 to Q5 adds roughly 20 to 25 percent, and Q8 is close to double the Q4 size. The file size on disk is also very close to the memory the weights occupy once loaded, so it is a good first estimate of whether the model fits. Leave extra room on top for the context cache and the operating system.

Q: Can I run GGUF models on an iPhone?

Yes. iPhones run GGUF models through apps built on llama.cpp. A 1B to 3B model at Q4 fits comfortably in an iPhone's memory — Llama 3.2 3B at Q4 is about 2 GB. Larger 7B models can load on higher-memory iPhones but run slower. PocketLLM is designed to package compatible models as one-tap downloads and run them entirely on-device with zero telemetry; it is coming soon, with a PocketLLM Android version to follow.

If you've tried to run an LLM locally, you've seen the letters GGUF everywhere — in model filenames, on Hugging Face download pages, in app import screens — usually next to cryptic tags like Q4_K_M. GGUF is the file format that makes local AI practical on a phone or Mac, and the quant tag tells you how big the file is and how good the model will be. This guide explains both in plain terms, with real file sizes we measured and a clear rule for which quant to pick for your device's RAM. No theory you don't need — just what to download and why.

Want the short version? Jump to the summary table. Want models pre-packaged so you never touch a GGUF file by hand? PocketLLM is built to handle the format for you on iPhone — coming soon, join the launch list.

GGUF in one paragraph

GGUF is a single-file format that packs a quantized model — weights plus tokenizer and metadata — into one file that llama.cpp-based apps can load directly. The Q4 / Q5 / Q8 tag is the quantization level: roughly 4, 5, or 8 bits per weight. Lower means smaller file and less RAM at a small quality cost; higher means bigger and more faithful. In both size and quality the order is Q4 < Q5 < Q8. For phones and 8 GB Macs, Q4 is the right default.

PocketLLM is launching soon. Private, on-device AI, starting on iPhone and iPad with more platforms planned. No account, no tracking, no cloud. Join the launch list and be first in.

What you actually need to understand

GGUF is a container. One file holds everything the runtime needs, so you don't assemble pieces. It is portable and fast to memory-map.
Quantization is compression for weights. The original model typically stores each weight in 16 bits. Quantization rounds them to about 4, 5, or 8 bits so the file shrinks and fits in less RAM.
Smaller quant, smaller footprint, slightly worse output. The quality loss from Q4 is modest for most everyday use and the size savings are large — which is why Q4 dominates on-device.
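File size ≈ RAM needed for the weights. The number on the download page is close to what the weights will occupy in memory, so it's your first compatibility check. Leave headroom on top for the context cache and the operating system, which is why the RAM figures in this guide are higher than the file sizes.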
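
To make "quantization is compression for weights" concrete, here is a minimal sketch of round-to-nearest quantization on one small block of weights. It is an illustration only, not the scheme llama.cpp actually uses: the function names, the sample weights, and the choice of one shared scale per block are all assumptions made for the example.

```python
def quantize_block(weights, bits=4):
    """Round a block of floats to signed integer codes that share one scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    # Each weight becomes a small integer; only the codes and the scale are stored.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the stored scale and integer codes."""
    return [scale * c for c in codes]


# Hypothetical sample weights, chosen only for illustration.
block = [0.12, -0.48, 0.31, 0.05, -0.22, 0.40, -0.07, 0.18]

for bits in (4, 8):
    scale, codes = quantize_block(block, bits)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{bits}-bit codes: {codes}  worst rounding error: {worst:.4f}")
```

Running it shows the trade-off described above: the 4-bit codes take half the space of the 8-bit ones, and the worst rounding error is correspondingly larger.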

GGUF quantization, explained with real files

Q4 — the on-device default

Q4 squeezes each weight to about 4 bits. In our file inspection, Llama 3.2 3B at Q4 lands around 2 GB on disk and needs roughly 4 GB of RAM to run comfortably; a 7B like Qwen 2.5 7B at Q4 is about 4.5 GB and wants 8 GB of RAM. Quality is close enough to the original that most users notice nothing on drafting, summarizing, and chat. This is the quant we recommend for phones and 8 GB Macs. For a deeper measured comparison, see our Q4 vs Q5 vs Q8 benchmark.

Q5 — a small quality bump if you have room

Q5 uses about 5 bits per weight, adding roughly 20–25% to the Q4 file size. A 3B model that was ~2 GB at Q4 becomes about 2.5 GB at Q5. The quality gain is real but subtle — you'd see it on tricky reasoning more than on everyday text. Worth it only if the larger file still fits your RAM with headroom.

Q8 — near-original, for when memory is plentiful

Q8 keeps about 8 bits per weight and is close to full-precision quality, at roughly double the Q4 size. A 3B at Q8 is around 3.5–4 GB. On a phone or 8 GB Mac it's usually not worth the memory; on a 16 GB Mac it's a reasonable choice for a small model when you want the best output that model can give.
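
The size ratios in this section follow from simple arithmetic: bytes ≈ parameters × bits per weight ÷ 8. The sketch below uses the nominal 4, 5, and 8 bits and rounded parameter counts, both of which are simplifying assumptions. Real files, like the ones measured above, come out somewhat larger, because quantized formats also store per-block scale data and metadata, and model names often round the true parameter count.

```python
def estimate_gguf_gb(params_billion, bits_per_weight):
    """Approximate file size in GB (10^9 bytes) from parameter count and bits per weight."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bits per weight; treat the results as a floor, not a measurement.
for label, params in [("1B", 1.0), ("3B", 3.0), ("7B", 7.0)]:
    q4 = estimate_gguf_gb(params, 4)
    q5 = estimate_gguf_gb(params, 5)
    q8 = estimate_gguf_gb(params, 8)
    print(f"{label}: Q4 ~{q4:.2f} GB, Q5 ~{q5:.2f} GB (+{(q5 / q4 - 1) * 100:.0f}%), "
          f"Q8 ~{q8:.2f} GB ({q8 / q4:.1f}x Q4)")
```

The nominal ratios line up with the rule of thumb: Q5 is 25% larger than Q4, and Q8 is twice the Q4 size.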

The summary table

Model	Quant	File size	RAM to run	Best for
Llama 3.2 1B	Q4	~0.8 GB	2 GB	Tiny phones, drafts
Llama 3.2 3B	Q4	~2.0 GB	4 GB	Phones, 8 GB Macs
Llama 3.2 3B	Q5	~2.5 GB	5 GB	Slight quality bump
Llama 3.2 3B	Q8	~3.6 GB	6 GB	16 GB Macs
Qwen 2.5 7B	Q4	~4.5 GB	8 GB	16 GB Macs, best quality you can run
Qwen 2.5 7B	Q8	~8 GB	12 GB	High-RAM Macs only

How to pick a quant for your device

The rule that works: choose the largest model your RAM allows, then the highest quant that still fits with headroom. On a phone or 8 GB Mac, that means a 3B at Q4. On a 16 GB Mac, it means a 7B at Q4; step up to Q5 or Q8 of a smaller model only if you specifically want a marginal quality bump and accept the bigger footprint. Don't try to run a 7B at Q8 on 8 GB; it'll swap and crawl. If you're building on top of this, our developer guide to mobile LLMs goes deeper on memory budgeting and runtime choices.
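
The rule can be sketched as a small lookup over the summary table. The catalog rows are copied from that table. The `budget_fraction` of 0.5, meaning the model's "RAM to run" figure may use at most half of the device's RAM, is an assumption chosen here because it reproduces the recommendations above; it is not a measured threshold.

```python
# (model, parameters in billions, quant, RAM to run in GB), from the summary table.
CATALOG = [
    ("Llama 3.2 1B", 1, "Q4", 2),
    ("Llama 3.2 3B", 3, "Q4", 4),
    ("Llama 3.2 3B", 3, "Q5", 5),
    ("Llama 3.2 3B", 3, "Q8", 6),
    ("Qwen 2.5 7B", 7, "Q4", 8),
    ("Qwen 2.5 7B", 7, "Q8", 12),
]
QUANT_RANK = {"Q4": 0, "Q5": 1, "Q8": 2}


def pick_model(device_ram_gb, budget_fraction=0.5):
    """Largest model that fits the RAM budget, then the highest quant of that model."""
    budget = device_ram_gb * budget_fraction   # assumed headroom rule, see lead-in
    fitting = [row for row in CATALOG if row[3] <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda row: (row[1], QUANT_RANK[row[2]]))


for ram in (4, 8, 16, 24):
    print(ram, "GB ->", pick_model(ram))
```

With these assumptions an 8 GB device gets Llama 3.2 3B at Q4 and a 16 GB Mac gets Qwen 2.5 7B at Q4, matching the guidance above. A different headroom fraction would shift the cut-offs.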

Where GGUF files come from and how to load them

Most GGUF files live on Hugging Face, converted from the original weights by the community or the model authors. To run one you need a llama.cpp-based runtime: LM Studio or Ollama on Mac, or an app on iPhone. The friction is that you have to pick the right file, verify it fits your RAM, and point the runtime at it. Apps that bundle the model remove that step entirely: you tap download and they pick a quant that fits your device. That's the approach PocketLLM is built around on iPhone (coming soon), so most users never see a Q4_K_M tag at all.
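
If you do handle files by hand, one quick sanity check before pointing a runtime at a download is to read the fixed-size header at the start of the file. The sketch below assumes a little-endian file in GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts. The function name is hypothetical, and the demo runs on a synthetic 24-byte header rather than a real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count), or raise ValueError."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError("not a GGUF file (missing magic or truncated header)")
    # Assumed layout: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:24])
    return version, tensor_count, kv_count


# Demo on a synthetic header with made-up counts, not a real model file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(tmp.name))
os.remove(tmp.name)
```

A file that fails this check is either not GGUF or was cut off mid-download. Passing it says nothing about whether the model fits your RAM, so the file-size check still applies.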

Want GGUF handled for you, on-device and private? PocketLLM is designed to package compatible models as one-tap downloads on iPhone with zero telemetry. Coming soon — join the launch list.

Frequently asked questions

What do Q4, Q5, and Q8 mean in GGUF?