"Self-hosted AI" gets used as if it means one thing, but there are really three different architectures hiding under the term, and they have wildly different costs. You can run a model on your own laptop behind a local server, stand up a dedicated home server with a GPU, or run the model fully on-device on a phone with nothing to administer at all. This guide compares those three approaches by the things that actually decide whether you'll keep using them: setup effort, ongoing maintenance time, RAM required, and how much network exposure each one creates. The headline finding: on-device is the zero-ops way to self-host, and for most people it's the one that survives contact with real life.
Want the short version? Jump to the summary table. Want self-hosted AI with literally no server to run? PocketLLM runs the model on your iPhone — no setup, no maintenance, nothing leaves the device. Join the waitlist.
There are three ways to self-host: a desktop server (a local model runner on your laptop), a home server (a dedicated box with a GPU), and fully on-device (the model runs inside an app on your phone). On-device wins on setup effort and maintenance because there's no server to install, secure, or keep running, and with the network off its exposure is zero. A desktop server is a good middle ground on a Mac with 16 GB of RAM. A home server gives the most raw power for big models but costs the most in time, and once you open it up for remote access it adds an internet-facing surface you must secure. Match the approach to how much ops work you're willing to do.
How we compared the approaches
- Setup effort: Time from zero to a working chat, including installing the runtime, downloading a model, and configuring it.
- Maintenance time: Ongoing work — model updates, OS and driver patches, keeping a service running, restarting after reboots.
- RAM and hardware: What you need to run a useful model. At Q4 quantization, a 3B model needs about 4 GB of RAM; a 7B model needs about 8 GB.
- Network exposure: The size of the attack surface you create. On-device with the network off is zero; a home server reachable from the internet is the largest.
The three ways to self-host AI
1. Fully on-device — the zero-ops self-host
The model and runtime ship inside an app. You tap to download a model and start chatting. There is no server process, no port to open, no service to keep alive, and no OS to patch beyond your normal phone updates. On a phone you run 1B to 3B models; a 3B model's weights take about 2 GB at Q4 (plan on about 4 GB of RAM once runtime overhead is included), and it runs at usable speed on recent hardware. Network exposure is the smallest of any approach: turn off Wi-Fi and cellular and it keeps working, which means nothing can leave the device. The trade-off is a ceiling on model size: you won't run a 12B model on a phone. For most personal use that ceiling is fine. This is the approach PocketLLM takes; if you're choosing between desktop tools and this path, see Ollama alternatives for iPhone and Mac.
2. Desktop server — a local runner on your laptop
Tools that run a model on your Mac or PC and expose a local chat UI or API on localhost. Setup is modest: install the runner, pull a model, go. Maintenance is light but non-zero — you update models yourself and restart the service after reboots. With 16 GB of RAM you can run a 7B model comfortably, which raises the quality ceiling over a phone. Network exposure stays low as long as you keep it bound to localhost; it grows the moment you expose it to your LAN or the internet for remote access. A solid middle ground if you already work at a desk. Our complete guide to local AI chat walks through this setup end to end.
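To make the localhost point concrete, here is a minimal sketch of how a server's listen address determines who can reach it. The helper name `exposure_of` and its labels are hypothetical, made up for illustration; check your runner's own documentation for how it sets its listen address.

```python
import ipaddress


def exposure_of(bind_address: str) -> str:
    """Classify a listen address by who can reach it (illustrative labels)."""
    ip = ipaddress.ip_address(bind_address)
    if ip.is_loopback:
        # 127.0.0.1 or ::1: only programs on this machine can connect.
        return "localhost only"
    if ip.is_unspecified:
        # 0.0.0.0 or ::: listens on every interface, so the LAN can connect.
        return "all interfaces"
    # A specific non-loopback address: reachable on that interface's network.
    return "specific interface"


for addr in ("127.0.0.1", "0.0.0.0", "192.168.1.20"):
    print(addr, "->", exposure_of(addr))
```

If a runner reports a listen address in the second or third category, anything on that network can talk to it, whether or not you intended that.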
3. Home server — a dedicated box with a GPU
A standalone machine — often with a discrete GPU — that runs models and serves them to your other devices. This is the only approach that can comfortably run large 12B+ models and serve several requests at once. It's also the most expensive in time: you own the OS, the drivers, the inference service, uptime, and security. If you expose it beyond your LAN so you can use it from your phone away from home, you've created an internet-facing surface you must harden yourself — that's real network exposure that the other two approaches avoid. Worth it for power users and developers; overkill for someone who just wants private chat. If you're building on top of local models, our developer guide to mobile LLMs covers the integration side.
The summary table
| Approach | Setup effort | Maintenance | Typical RAM | Network exposure |
|---|---|---|---|---|
| Fully on-device | Minimal (one tap) | None | 2–4 GB | None (offline) |
| Desktop server | Low | Light | 8–16 GB | Low (localhost) |
| Home server | High | Ongoing | 16 GB+ / GPU | Highest (if exposed) |
Which approach should you choose?
You want private AI with no ops at all: go fully on-device. It's the only approach with zero maintenance and zero network exposure when offline, and it's the most likely to still be in use a month later because there's nothing to break.
You work at a desk and want bigger models: a desktop server on a 16 GB Mac runs 7B models and keeps your data on the machine, with only light upkeep.
You're a power user who needs 12B+ models or multi-device serving: a home server delivers, but budget real time for maintenance and security — and think hard before exposing it to the internet.
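The three cases above can be sketched as a small decision function. The parameter names are hypothetical and the rules only restate this section; treat it as a summary of the reasoning, not a definitive selector.

```python
def choose_approach(needs_12b_plus: bool,
                    multi_device_serving: bool,
                    wants_7b_at_desk: bool) -> str:
    """Pick a self-hosting approach from the three cases in this guide."""
    if needs_12b_plus or multi_device_serving:
        # Only the home server comfortably handles large models
        # or several clients at once.
        return "home server"
    if wants_7b_at_desk:
        # A 16 GB machine running a local runner covers 7B models.
        return "desktop server"
    # Default: no ops work at all.
    return "fully on-device"


print(choose_approach(needs_12b_plus=False,
                      multi_device_serving=False,
                      wants_7b_at_desk=False))
```

The order of the checks reflects the cost ranking: the function only recommends a heavier setup when a requirement forces it.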
The network-exposure point most guides skip
"Self-hosted" is often treated as automatically private, but privacy is about exposure, not ownership. A home server you open to the internet so you can reach it from your phone is self-hosted and yet has a larger attack surface than many cloud services. The most private self-host is the one that needs no network at all. That's why on-device sits at the top of the privacy axis: with the connection off, there is no surface to attack and nothing to leave the device. When you evaluate a self-hosted setup, the right question isn't "do I own the hardware?" — it's "what can reach this, and from where?"
Getting started this week
The fastest path to a working private setup is on-device: install an app that bundles a model, download a 3B model, and start chatting offline — done in minutes with nothing to maintain. If you want more headroom, install a desktop runner on a 16 GB machine and pull a 7B model. Save the home server for when you've confirmed you actually need 12B+ models or multi-device serving; it's the most capable and the most work. Whichever you pick, the privacy test is the same — turn off the network and see what still answers.
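A complementary check to the network-off test is to see what can reach your setup while the network is on. The sketch below is a generic TCP reachability probe using only the Python standard library; the function name is an assumption for illustration, and the address and port you pass in depend entirely on your own setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Call it with your server's address and port from the machine itself, then from another device on your LAN, then from a machine outside your home network. Each place where it returns `True` is somewhere the service can be reached from.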
Frequently asked questions
What is self-hosted AI?
Self-hosted AI means running the AI model on hardware you control instead of calling a third-party cloud API. It can mean a model running on your laptop behind a local server, a dedicated home server with a GPU, or a model running fully on-device on your phone. The common thread is that you own the hardware the model runs on, so you control where your prompts and responses go.
What is the easiest way to self-host AI?
The easiest way is a fully on-device app where the model and runtime are bundled together, so there is no server to install, configure, or maintain. In our comparison this had the lowest setup effort and essentially zero ongoing maintenance. A desktop server (a local model runner on your laptop) is the next easiest, and a dedicated home server with a GPU is the most involved because you also manage the OS, drivers, and uptime.
How much RAM do I need to self-host an AI model?
It depends on the model size and quantization. A 1B model at Q4 loads in about 2 GB of RAM, a 3B model needs around 4 GB, and a 7B-8B model needs about 8 GB. On a phone you typically run 1B to 3B models; on a laptop or home server with 16 GB or more you can run 7B models comfortably. Match the model to the memory you have rather than the largest model that will technically load.
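As a rough sanity check on those figures, the size of the weights alone can be estimated from parameter count and bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name and the 4.5 bits per weight used for Q4 are assumptions for illustration. Real memory use adds the KV cache, the runtime, and the OS on top of the weights, which is why the RAM figures above are higher than the raw weight size.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed effective rate for a Q4-style
    quantization (4-bit values plus per-block scales); real formats vary.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (1, 3, 7):
    print(f"{size}B at ~Q4: about {estimate_weight_gb(size):.1f} GB of weights")
```

Passing `bits_per_weight=16` gives the unquantized half-precision size, which shows why quantization is what makes these models fit on phones and laptops in the first place.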
Is self-hosted AI more private than cloud AI?
It can be, but only if you control the network exposure. A fully on-device model can run with no network connection at all, so nothing leaves the device. A home server exposed to the internet for remote access re-introduces a network surface you have to secure yourself. The most private setup is the one with the smallest network exposure, and on-device with the connection off is the smallest possible.
Do I need a powerful GPU to self-host AI?
No. Small and mid-size models run well on Apple Silicon and modern phones using the chip's built-in CPU, integrated GPU, or neural engine (which one depends on the runtime), with no discrete GPU required. A 3B model runs at 30+ tokens per second on a MacBook Air M2. A discrete GPU helps if you want to run 12B or larger models or serve many requests at once, but for personal use on small-to-mid models it is optional.