"Local AI chat" sounds like a feature but it's actually a whole stack — hardware, inference runtimes, quantization formats, model architecture, memory layout, and enough acronyms to fill a dictionary. You don't need to understand all of it to use a local AI chatbot, but understanding most of it makes the difference between "this is slow and weird" and "this is the best AI I've ever used."
This is the long version. If you want the short version, our step-by-step tutorial gets you running in ten minutes. If you want the deep version, keep reading.
What "local" actually means
A local language model is one where the weights (the billions of numbers that make up the model) and the inference engine (the code that multiplies them with your prompt) both run on your device. No network request, no API call, no shared computation with a server. Everything happens on your phone.
Local is distinct from "offline." Offline is a user-facing property: can you use it in airplane mode? Local is a technical property: does the computation happen here? Every local model works offline, but not every "offline" app is local.
How a phone can run an LLM at all
The modern iPhone has three compute units that matter for neural networks, all on the same system-on-chip: the CPU, the GPU, and the Neural Engine (ANE). Each is good at slightly different things.
- The CPU is general-purpose and can run anything. It's the slowest option for LLM inference but the most compatible.
- The GPU is designed for massively parallel floating-point math, which happens to be exactly what LLMs need. Most on-device LLMs run primarily on the GPU.
- The Neural Engine is a dedicated accelerator for specific neural network operations. It's extremely efficient for the workloads it's designed for, and Apple has been pushing it hard for LLMs since iOS 18.
Apple's Core ML framework abstracts over all three, so a model can be compiled once and scheduled across them dynamically. llama.cpp, the most popular cross-platform LLM runtime, runs on the CPU and, via Metal, on the GPU. The best on-device apps use both: Core ML for the models that compile cleanly, llama.cpp for everything else.
Quantization: why a 7B model fits on an 8 GB phone
A "7 billion parameter" model literally has 7 billion numbers that define its behavior. Stored at full precision (16 bits each), that's 14 GB — obviously too much for an iPhone. Quantization is how you shrink it.
The trick is that most of those 16-bit numbers don't actually need 16 bits of precision. If you're more clever about how you store them — using 8 bits, 4 bits, or even 3 bits — you can cut the file size dramatically with surprisingly little quality loss.
| Quantization | Size of 7B model | Quality vs FP16 |
|---|---|---|
| FP16 (no quantization) | ~14 GB | 100% (baseline) |
| Q8 | ~7 GB | ~99% |
| Q5_K_M | ~5 GB | ~97% |
| Q4_K_M | ~4 GB | ~95% |
| Q3_K_M | ~3 GB | ~90% |
| Q2_K | ~2.5 GB | ~80% |
Q4_K_M is the standard recommendation for on-device use: roughly 3.5× smaller than the FP16 original with only a small quality loss. Q2 is tempting because it's tiny, but the quality drop is real: Q2 models make visibly more mistakes, especially on reasoning.
The "K" means "k-quant" (a smarter quantization scheme from llama.cpp), and the "M" means "medium" (the balance preset). You'll also see "S" (small) and "L" (large) variants. Unless you're memory-constrained, always prefer K_M over plain Q4.
Memory: the real constraint on iPhone
The raw model file size is only part of the story. When the model runs, it needs additional RAM for the key-value cache (the running state of the attention mechanism), activations, and the inference runtime itself. A rough rule:
RAM needed ≈ (quantized model size) × 1.2 + (context length in tokens) × ~0.1 MB of KV cache per token
So a 4 GB Q4 model with a 4K context window needs a bit over 5 GB of RAM at runtime. On an iPhone with 6 GB total (and iOS plus background apps using some of that), you're right on the edge. This is why 7B models work fine on an iPhone 15 Pro (8 GB) but stutter on an iPhone 13 Pro (6 GB).
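That rule of thumb is easy to turn into a quick estimator. A sketch, assuming roughly 0.1 MB of KV cache per token — a reasonable figure for recent models with grouped-query attention; older architectures can need several times more:

```python
# Back-of-the-envelope runtime RAM estimate: model weights plus ~20%
# runtime overhead, plus KV cache that grows with context length.
# The 0.1 MB/token figure assumes grouped-query attention.
def runtime_ram_gb(model_size_gb: float, context_tokens: int,
                   kv_mb_per_token: float = 0.1) -> float:
    return model_size_gb * 1.2 + context_tokens * kv_mb_per_token / 1024

print(f"~{runtime_ram_gb(4.0, 4096):.1f} GB")  # 4 GB Q4 model, 4K context
print(f"~{runtime_ram_gb(2.0, 8192):.1f} GB")  # 2 GB 3B model, 8K context
```

The first case comes out around 5 GB, which is exactly why a 6 GB phone is on the edge with a 7B model and an 8 GB phone is comfortable.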
Model selection: what's actually out there
Small models (1B – 3B parameters)
- SmolLM 360M / 1.7B — Hugging Face's surprisingly capable tiny models. Good for rewriting and short replies. They fit on any iPhone.
- Llama 3.2 1B and 3B — Meta's mobile-friendly models. Solid general-purpose performance.
- Phi-3 Mini (3.8B) — Microsoft's small model. Very strong for its size, especially at reasoning and code.
- Qwen 2.5 1.5B / 3B — Alibaba's models. Strong multilingual support.
Mid-sized models (7B – 8B parameters)
- Llama 3.1 8B — the gold standard for mid-range local models.
- Qwen 2.5 7B — very strong on multilingual and coding tasks.
- Mistral 7B — older but still competitive. (Skip the Mixtral variants: they're mixture-of-experts models far too large for a phone.)
Mid-sized models need 8+ GB of iPhone RAM to run comfortably at Q4, which in practice means iPhone 15 Pro or newer.
Context length: the other hidden number
A model's context length is how much text it can "see" at once. A 4K context window means the model can handle about 3,000 words of prompt + reply combined before it starts forgetting the beginning. Modern models advertise 32K, 128K, even 1M tokens — but those numbers assume you have the RAM for the KV cache.
On iPhone, you usually want to cap context at 4K–8K regardless of what the model supports. Going higher burns RAM and slows inference without much practical benefit for chat use.
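The reason capping context matters is that KV-cache memory grows linearly with tokens. A sketch, again assuming ~0.1 MB of cache per token for a model with grouped-query attention:

```python
# KV-cache memory grows linearly with context length. Assumes ~0.1 MB
# per token, typical for a small model with grouped-query attention.
def kv_cache_gb(context_tokens: int, mb_per_token: float = 0.1) -> float:
    return context_tokens * mb_per_token / 1024

for ctx in (4096, 8192, 32768, 131072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

At an advertised 128K window, the cache alone wants close to 13 GB — more RAM than any iPhone has, before the model weights are even loaded.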
Performance: what to actually expect
Rough numbers from an iPhone 15 Pro (8 GB RAM), Llama 3.2 3B Q4, short prompts:
- Time to first token: ~400 ms
- Tokens per second (generation): ~18 tok/s
- RAM usage during generation: ~2.8 GB
- Battery impact: roughly 1% per 2–3 minutes of continuous generation
18 tokens/second feels roughly like watching someone type quickly: noticeably slower than GPT-4o streaming, but fast enough to read comfortably, because generation keeps pace with your reading speed.
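To make those figures concrete, here's how long a reply takes to stream at that rate, using the time-to-first-token and tokens-per-second measured above (the reply lengths are illustrative):

```python
# Time to stream a full reply: time-to-first-token plus decode time.
# Uses the iPhone 15 Pro figures above (~0.4 s TTFT, ~18 tok/s).
def reply_seconds(reply_tokens: int, tok_per_s: float = 18.0,
                  ttft_s: float = 0.4) -> float:
    return ttft_s + reply_tokens / tok_per_s

print(f"short paragraph (150 tok): {reply_seconds(150):.1f} s")
print(f"long answer    (500 tok): {reply_seconds(500):.1f} s")
```

Roughly nine seconds for a short paragraph and under half a minute for a long answer — the reply finishes about as fast as you can read it.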
Battery and thermals
The phone will get warm during long generation. This is not a bug; every token requires dense matrix math across billions of weights. The Neural Engine is extremely efficient, but "extremely efficient" still means watts, and watts turn into heat.
If you run back-to-back 30-second generations for an hour, expect noticeable warmth and maybe 15–20% battery drain. Normal short chats (a few messages at a time) barely register. You don't have to treat this like gaming — it's closer to video playback in terms of impact.
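As a rough sanity check on those numbers, here is the drain from active decoding alone at the ~1%-per-2.5-minutes rate quoted above; real sessions mix decoding with idle reading time, so the observed figure sits between the pure-decode extremes:

```python
# Battery drain from active decoding alone, at ~1% per 2.5 minutes of
# generation (the rate measured above). Idle time between replies is
# ignored here, so these bracket a real session rather than predict it.
def drain_percent(active_minutes: float, pct_per_minute: float = 1 / 2.5) -> float:
    return active_minutes * pct_per_minute

for mins in (5, 30, 60):
    print(f"{mins:2d} min of generation -> ~{drain_percent(mins):.0f}% battery")
```

A one-hour session spending 30–60 minutes actually decoding brackets the 15–20% figure above; a few short chats barely move the needle.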
When to use local, when to use cloud
Local and cloud aren't enemies. They're different tools:
- Use local for: anything sensitive, anything offline, frequent short tasks, rewriting, summarizing, brainstorming, drafting.
- Use cloud for: long multi-step reasoning, cutting-edge code, obscure factual recall, very long documents.
A workflow that uses both is better than one that treats them as alternatives. The mistake is using cloud for everything by default when most of your prompts are things a local 3B model handles perfectly.
The future (not very far off)
iPhone RAM keeps growing. The Neural Engine keeps getting faster. Model architectures are getting more efficient — the same quality at half the parameters. In 2022 you couldn't run anything useful on a phone. In 2024 you could run small models awkwardly. In 2026 you can run mid-size models comfortably. By 2028 or so, "run a GPT-4 class model on your phone" will be a normal sentence.
The part that doesn't change is the privacy. Wherever the quality ceiling lands, the privacy story of "it runs on your phone" stays the same, and that alone is enough for a lot of people.