Shipping an on-device LLM app is very different from integrating a cloud API. You're not calling a REST endpoint. You're loading several gigabytes of weights from disk, compiling them for a specific neural accelerator, managing a KV cache under tight memory pressure, and coordinating all of it with iOS's aggressive memory eviction policies. This post is the version of the playbook we wish had existed when we started building PocketLLM.
## The three-layer model
Conceptually, an on-device LLM app has three layers:
- Model storage. Quantized weights sitting on disk, plus a tokenizer.
- Inference runtime. The code that loads those weights into memory and actually runs forward passes.
- Application layer. Chat UI, session management, streaming output, error handling.
Each layer has its own performance and correctness concerns. Most of the tricky engineering lives in layer 2.
## Picking an inference runtime on iOS
You have roughly three realistic options in 2026:
### 1. CoreML
Apple's native ML framework. It compiles models to run across the CPU, GPU, and Neural Engine. For models that CoreML supports cleanly, this is the fastest and most power-efficient option — especially on newer iPhones where the Neural Engine can take over most of the work.
Downsides: the model conversion pipeline is fragile. Not every model architecture converts cleanly. You're dependent on Apple's coremltools and the ANE's supported operations. New architectures from Meta or Mistral often take months to get clean CoreML conversions.
### 2. llama.cpp (with Metal backend)
The open-source inference runtime that runs on everything. On iOS it uses Metal shaders for GPU acceleration. It's not as power-efficient as CoreML (it doesn't hit the Neural Engine), but it supports essentially every model the community has published, in essentially every quantization format.
This is the sane default for anything that isn't in Apple's blessed list.
### 3. MLC / Apache TVM
MLC compiles models ahead of time for a specific target device. Inference is fast, but the compilation step is heavy and the tooling is research-flavored.
We use a dual backend: CoreML for models where Apple has a high-quality conversion, llama.cpp for everything else. The app picks the backend automatically based on the model. This gives us the best of both: efficiency where it's available, compatibility everywhere else.
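In code, that selection logic is simple. This is an illustrative sketch, not our actual implementation; `ModelDescriptor` and `hasCoreMLConversion` are hypothetical names standing in for whatever metadata your model catalog carries.

```swift
import Foundation

// Hypothetical backend-selection sketch. `ModelDescriptor` and
// `hasCoreMLConversion` are illustrative names, not a real API.
enum InferenceBackend {
    case coreML      // Apple-native: CPU, GPU, and Neural Engine
    case llamaCpp    // Metal-accelerated, runs any GGUF model
}

struct ModelDescriptor {
    let id: String
    let hasCoreMLConversion: Bool  // true only for curated, verified conversions
}

func pickBackend(for model: ModelDescriptor) -> InferenceBackend {
    // Prefer CoreML only when a known-good conversion exists;
    // fall back to llama.cpp for everything else.
    model.hasCoreMLConversion ? .coreML : .llamaCpp
}
```

The important design choice is that the CoreML path is an allowlist, not a heuristic: a model only gets CoreML if someone has verified the conversion.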
## Quantization: pick your poison
The quantization format determines file size, quality, and runtime speed. For iOS you essentially choose between:
- GGUF Q4_K_M — the standard llama.cpp format. Good balance. Every model has one.
- GGUF Q5_K_M — slightly larger, slightly better quality. Use on higher-memory devices.
- CoreML int4 / int8 palettized — Apple's native format. Smaller and faster when available, but conversions are architecture-specific.
- MLX / Apple Silicon native formats — emerging, interesting, not yet mainstream on iOS.
Avoid Q2 and below unless you really need the extra space. The quality drop is noticeable and users will feel it.
## Memory budgeting
The hard constraint on iOS is memory. iOS will kill your app aggressively when it uses too much. A rough budget for a 3B Q4 model on an 8 GB iPhone:
| Component | Memory |
|---|---|
| Model weights (memory-mapped) | ~2 GB |
| KV cache (4K context) | ~400 MB |
| Activations during forward pass | ~200 MB |
| Tokenizer + runtime | ~50 MB |
| App UI + misc | ~150 MB |
| Total | ~2.8 GB |
iOS lets a foreground app use about 60% of total RAM comfortably. On an 8 GB device that's ~4.8 GB, so you have headroom. On a 6 GB device (~3.6 GB usable), a 3B Q4 model is right on the edge — you'll see jetsam terminations under pressure. Drop to a 1B model or use more aggressive quantization (Q3_K_M).
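You can turn this budget into a preflight check before offering a model for download. A minimal sketch, assuming the ~60% usable-RAM heuristic from above (it is a rule of thumb from our testing, not a documented iOS limit), with the table's KV cache, activation, and overhead estimates as defaults:

```swift
import Foundation

// Rough memory-budget check mirroring the table above. The 60% usable
// fraction and the default per-component sizes are heuristics from this
// post, not documented iOS limits.
func fitsInMemory(weightsBytes: UInt64,
                  kvCacheBytes: UInt64 = 400 << 20,      // ~400 MB at 4K context
                  activationBytes: UInt64 = 200 << 20,   // ~200 MB forward pass
                  overheadBytes: UInt64 = 200 << 20)     // tokenizer + UI + misc
                  -> Bool {
    let totalRAM = ProcessInfo.processInfo.physicalMemory
    let usable = UInt64(Double(totalRAM) * 0.6)  // ~60% of RAM before jetsam risk
    let needed = weightsBytes + kvCacheBytes + activationBytes + overheadBytes
    return needed <= usable
}

// e.g. a 3B Q4 model (~2 GB of weights) on the current device:
let canRun3B = fitsInMemory(weightsBytes: 2 << 30)
```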
### Tip: mmap your model files
llama.cpp can memory-map model weights instead of reading them into heap. This lets the kernel page weights in and out as needed, which dramatically reduces your measured working set and makes you much more resilient to memory pressure. Always use mmap on iOS unless you have a specific reason not to.
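Concretely, mmap is controlled by a flag on the model-load parameters. A sketch, assuming llama.cpp's C API is bridged into Swift (e.g. via a module map); the file path is illustrative, and the load function name has shifted across llama.cpp versions, so check the header you build against:

```swift
// Assumes llama.cpp's C API is exposed to Swift via a bridging module.
// `use_mmap` and `use_mlock` are fields on llama_model_params.
var params = llama_model_default_params()
params.use_mmap = true    // page weights from disk instead of copying to heap
params.use_mlock = false  // never try to wire gigabytes of weights on iOS

// Illustrative path; load-function naming varies by llama.cpp version.
let model = llama_model_load_from_file("/path/to/model.gguf", params)
```

`use_mmap` defaults to true in llama.cpp, but setting it explicitly documents the intent and protects you if a wrapper library changes the default.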
## Handling jetsam (iOS memory kills)
iOS will terminate your app if it uses too much memory while backgrounded. For an LLM app this means you need to checkpoint carefully:
- Save chat state on every generation boundary.
- Release the KV cache when backgrounded. It's regeneratable.
- Consider unloading model weights on background if your memory budget is tight. Reloading a Q4 3B model takes ~800 ms — not free, but acceptable.
- Listen for `UIApplication.didReceiveMemoryWarningNotification` and drop the KV cache immediately.
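The last point is the most time-critical, since a memory warning often precedes a jetsam kill by seconds. A sketch of the observer; `InferenceEngine` and `releaseKVCache()` are placeholders for whatever your runtime exposes:

```swift
import UIKit

// Placeholder for your inference runtime; releaseKVCache() would free
// the KV cache buffers, which can be rebuilt from the saved transcript.
final class InferenceEngine {
    func releaseKVCache() { /* free KV cache buffers here */ }
}

let inferenceEngine = InferenceEngine()

// Drop the KV cache the moment iOS warns about memory pressure.
let observer = NotificationCenter.default.addObserver(
    forName: UIApplication.didReceiveMemoryWarningNotification,
    object: nil,
    queue: .main
) { _ in
    inferenceEngine.releaseKVCache()  // regeneratable, so safe to discard
}
```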
## Streaming output to the UI
Users expect token streaming, like ChatGPT. Your inference runtime produces tokens one at a time; you need to pipe those to the UI with good performance.
The pattern that works well on iOS:
- Run inference on a background thread (never on the main thread — generation blocks for multiple seconds).
- Each new token is dispatched to the main thread via
MainActor. - The SwiftUI view observes a
@Observablestring that gets appended to. - Use a text rendering path that handles partial markdown gracefully (users will see "```" appear before the closing triple-backtick).
Avoid rebuilding large view hierarchies on every token. SwiftUI diffing is fast but not that fast at 20+ updates per second for a large transcript. Use stable identity, split long transcripts into chunks, and only update the active assistant message.
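The pattern above fits in a few lines. This is a minimal sketch, not our production code; `generateTokens(prompt:)` is a stub standing in for your runtime's real token stream, and it requires the iOS 17 Observation framework:

```swift
import SwiftUI

// Stub token source; replace with your runtime's real async token stream.
func generateTokens(prompt: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        for token in ["Hello", ", ", "world"] { continuation.yield(token) }
        continuation.finish()
    }
}

// Inference runs off the main thread; each token hops to the MainActor,
// and SwiftUI observes the growing string.
@Observable
final class ChatSession {
    var reply = ""

    func stream(prompt: String) {
        Task.detached(priority: .userInitiated) {
            for await token in generateTokens(prompt: prompt) {
                await MainActor.run { self.reply += token }
            }
        }
    }
}

struct ReplyView: View {
    let session: ChatSession
    var body: some View {
        Text(session.reply)  // re-renders only this message as tokens arrive
    }
}
```

Keeping `ReplyView` scoped to a single message is what gives you the stable-identity property: the transcript above it never diffs on a token update.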
## Model distribution
You cannot ship multi-gigabyte model files inside your app bundle — App Store size limits and user expectations both rule it out. The standard pattern:
- Ship a small default model bundled with the app (~200 MB).
- Let users download larger models on demand via a background URLSession.
- Host the model files on a CDN (Cloudflare R2, AWS CloudFront, Hugging Face).
- Verify checksums after download to catch corruption.
- Store downloaded models in the Caches directory so iOS can evict them if storage pressure is extreme (and so they don't bloat iCloud backups).
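The checksum step is cheap to implement with CryptoKit. A sketch; the expected digest would come from your own model manifest:

```swift
import Foundation
import CryptoKit

// Verify a downloaded model file against a known SHA-256 digest before
// trusting it. The expected hash comes from your own manifest.
func verifyChecksum(fileURL: URL, expectedSHA256: String) throws -> Bool {
    // .mappedIfSafe memory-maps the file, so a multi-GB model
    // isn't copied into the heap just to hash it.
    let data = try Data(contentsOf: fileURL, options: .mappedIfSafe)
    let digest = SHA256.hash(data: data)
    let hex = digest.map { String(format: "%02x", $0) }.joined()
    return hex == expectedSHA256.lowercased()
}
```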
## Battery and thermals
Long generations warm the phone. Best practices:
- Stop generation when the user backgrounds the app unless they explicitly opted into background continuation.
- Expose a visible "stop" button during generation. Users don't want a runaway 2000-token response.
- Consider adaptive quantization — fall back to a smaller model if the device is already thermally throttled.
- Respect Low Power Mode. If `ProcessInfo.processInfo.isLowPowerModeEnabled` is true, consider downgrading the model or reducing max tokens.
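The Low Power Mode check is one line of Foundation. A sketch; `GenerationConfig` and the 256-token cap are illustrative choices, not a recommendation:

```swift
import Foundation

// Illustrative config type; the 256-token cap is an arbitrary example.
struct GenerationConfig {
    var maxTokens: Int
}

// Cap generation length when the user has Low Power Mode enabled.
func adjustedConfig(_ base: GenerationConfig) -> GenerationConfig {
    var config = base
    if ProcessInfo.processInfo.isLowPowerModeEnabled {
        config.maxTokens = min(config.maxTokens, 256)  // shorter replies, less heat
    }
    return config
}
```

You can also observe `NSProcessInfoPowerStateDidChange` to react when the user toggles Low Power Mode mid-session.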
## Privacy engineering
If your selling point is privacy, you have to engineer for it:
- No telemetry on conversations. Ever. Even aggregated. Even "anonymized."
- No network calls during inference. Run a network monitor in debug and make sure nothing surprises you.
- Mark conversation storage with data protection (`.completeUnlessOpen`).
- Exclude model files from iCloud backup if they're re-downloadable.
- Provide a clean "delete all data" that actually deletes all data, including caches and derived indices.
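The two storage settings above are each a few lines of Foundation. A sketch, with illustrative function names (note that `FileProtectionType` takes effect on iOS devices, not in the simulator or on macOS):

```swift
import Foundation

// Encrypt conversation data at rest; while the file is open it stays
// readable, but new opens require the device to have been unlocked.
func protectConversationStore(at url: URL) throws {
    try FileManager.default.setAttributes(
        [.protectionKey: FileProtectionType.completeUnlessOpen],
        ofItemAtPath: url.path
    )
}

// Keep re-downloadable model weights out of iCloud backups.
func excludeFromBackup(_ url: URL) throws {
    var fileURL = url
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try fileURL.setResourceValues(&values)
}
```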
## Lessons from shipping
Things we learned the hard way building PocketLLM:
- Startup time matters more than throughput. Shaving 300 ms off time-to-first-token feels better than gaining 5 tok/s of throughput.
- Memory-map everything. It's not an optimization; it's the only way to survive iOS memory pressure.
- Thermal behavior varies across devices. Test on the oldest supported phone, not just your dev device.
- Apple will reject your first build. Probably for reasons unrelated to the LLM. Leave time.
- Battery fear is real. Users panic the first time they see their phone get warm from AI. UX messaging matters.
## What's next
The on-device AI stack is still moving fast. Things to watch: MLX (Apple's new ML framework aimed partly at on-device inference), better CoreML support for emerging architectures, and the ongoing trend toward smaller models that match previous-generation large models.
If you're building in this space, the main thing to internalize is that the constraints are different from cloud AI. You're not optimizing for throughput at scale — you're optimizing for time-to-first-token, memory headroom, and thermal stability on a handheld device with a battery. Those are harder than they sound, but they're also more fun.