
15 Best Local LLM Models to Run in 2026

Most "best local LLM" articles conflate two different questions: which model has the best benchmark scores, and which model you can actually run on the hardware you already own. This ranking answers the second question. Every model on this list runs on a laptop, a MacBook, or — for the smaller entries — an iPhone or Android phone. We ranked by a weighted combination of output quality, file size, tokens per second on common hardware, and license freedom. If a model looks great on the Hugging Face leaderboard but won't fit in 8 GB of RAM, it's not on this list.

Want the short version? Jump to the summary table. Running one of these on an iPhone? PocketLLM packages the top contenders as one-tap downloads — join the waitlist.

How we ranked

  • Quality (35%): MMLU, HumanEval, and GSM8K as reported by the model's own card, corroborated by independent leaderboards where available. We only used numbers we could trace to a primary source.
  • Runnability (35%): File size at common quantizations, minimum RAM to load, and whether the model has a working llama.cpp, MLX, Core ML, or MLC conversion.
  • Speed (20%): Tokens per second on three reference platforms — MacBook Air M2 (8 GB), MacBook Pro M3 Pro (18 GB), and iPhone 15 Pro (8 GB). All numbers are publicly reported by the runtime projects or our own Llama 3.2 iPhone benchmarks.
  • License (10%): We penalize models whose licenses forbid commercial use, gate weights behind forms, or restrict derivatives. Apache 2.0 and MIT score highest.

Scores are out of 100. Every entry below is currently runnable in llama.cpp, Ollama, LM Studio, MLX, Core ML, or MLC — the six runtimes that matter in 2026. If a model only runs in PyTorch, it's not local in any useful sense.
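A useful rule of thumb behind the runnability scores: a Q4 GGUF file weighs roughly 4.85 bits per weight (the approximate Q4_K_M average), and you need the file plus KV-cache and runtime overhead in RAM. A rough sketch; the constants are approximations, not guarantees, and real files vary by architecture:

```python
def q4_file_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Rough GGUF file size at a Q4_K_M-style quantization.

    bits_per_weight ~4.85 is an approximation for Q4_K_M; actual files
    vary with architecture and embedded tokenizer/metadata.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9, 1)

def min_ram_gb(file_size_gb: float, overhead_gb: float = 1.5) -> float:
    """Loadable-RAM estimate: weights plus KV cache and runtime overhead."""
    return file_size_gb + overhead_gb
```

Plugging in 3B gives about 1.8 GB, close to the ~2 GB quoted for Llama 3.2 3B below; 7B gives about 4.2 GB against Qwen 2.5 7B's ~4.5 GB.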

The 15 best local LLMs in 2026

1. Llama 3.2 3B — 94/100

Meta's 3B release is the default answer to "what should I run on my phone or laptop?" It fits in 2 GB at Q4, runs at 30+ tok/s on a MacBook Air M2, and holds its own against 7B models from 2024 on MMLU. The Llama Community License allows commercial use up to 700M monthly active users, which covers every reader of this post. The 1B sibling is faster but noticeably weaker; the 3B is the sweet spot.

2. Qwen 2.5 7B — 92/100

Alibaba's Qwen 2.5 is the current open-weights champion on most benchmark suites. The 7B variant outperforms Llama 3.1 8B on nearly everything, runs at Q4 in about 4.5 GB, and ships under Apache 2.0 (unlike its 72B sibling, which has a custom license). If you have 16 GB of RAM and want the best quality you can actually run, this is the pick.

3. Phi-3.5 Mini — 91/100

Microsoft's 3.8B Phi-3.5 Mini is trained on a curated "textbook" dataset and punches far above its weight on reasoning and code. MIT license, ~2.4 GB at Q4, and surprisingly good at structured output. The knock on Phi is weaker general knowledge — it knows less pop culture than Llama 3.2 — but for coding, math, and tool use it's an easy top 3.

4. Mistral Nemo 12B — 88/100

Mistral and NVIDIA's joint release is the best "if you have a gaming laptop" model. 12B parameters, 128K context, Apache 2.0, and a quantization-aware design that actually works at Q4. Requires around 8 GB to load at Q4, so it's laptop territory, not phone territory.

5. Gemma 2 2B — 87/100

Google's Gemma 2 2B is the strongest sub-3B model other than Llama 3.2 3B. Slightly weaker on reasoning, slightly better on multilingual tasks. Permissive terms (though not technically OSI-approved), small footprint, and fast on mobile hardware. Also notable for excellent safety tuning out of the box.

6. Llama 3.1 8B — 86/100

The workhorse that launched the current generation of local AI. Still excellent, still well-supported in every runtime, still the best choice if you want a proven option with huge fine-tune ecosystems. The 8B needs ~5 GB at Q4 and runs at 15-20 tok/s on an M2. Now superseded by Qwen 2.5 7B and Llama 3.2 on smaller devices, but the community around it is unmatched.

7. DeepSeek Coder V2 Lite 16B — 85/100

A mixture-of-experts model where only 2.4B parameters are active per token, so it runs at roughly 3B speeds while holding 16B quality. The best open coding model you can run on consumer hardware. Requires 10-12 GB of RAM and is worth every megabyte if code is your use case. See our best local LLMs for coding roundup for full benchmarks.
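Why a 16B MoE runs at roughly 3B-class speeds: per-token decode compute scales with the active parameters, not the total, at roughly 2 FLOPs per active weight per generated token (a common matmul heuristic, ignoring attention and routing overhead). Memory, on the other hand, still has to hold all 16B weights, which is why the RAM requirement stays high:

```python
def decode_flops_per_token(active_params: float) -> float:
    """~2 FLOPs per *active* parameter per generated token (rough heuristic)."""
    return 2.0 * active_params

# Hypothetical comparison: a dense 16B model vs. an MoE with 2.4B active.
dense_16b = decode_flops_per_token(16e9)
moe_lite = decode_flops_per_token(2.4e9)
speedup = dense_16b / moe_lite  # ~6.7x less compute per token
```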

8. Qwen 2.5 Coder 1.5B — 83/100

The smallest model on this list that can actually write code you'd commit. Targets HumanEval 65%+ at 1.5B parameters, fits on a phone, and ships under Apache 2.0. A great first-line completion model that punches far above its weight.

9. SmolLM2 1.7B — 81/100

Hugging Face's in-house small model. Apache 2.0, genuinely small (~1 GB at Q4), and surprisingly coherent for its size. Built specifically for on-device deployment and shows it — lean, fast, and trained on a carefully filtered dataset.

10. Mistral 7B v0.3 — 80/100

The original king of 7B models, now comfortably beaten by Llama 3.1 8B and Qwen 2.5 7B, but still an excellent choice when you want a mature model with every fine-tune under the sun. Apache 2.0, huge ecosystem, easy to run.

11. Gemma 2 9B — 78/100

Google's mid-tier Gemma. Competitive with Llama 3.1 8B on most tasks, slightly better on multilingual. Requires ~5.5 GB at Q4. Held back from higher placement mostly by license terms that are less permissive than pure Apache or MIT.

12. StableLM 2 1.6B — 75/100

Stability AI's small model. Decent quality for its size and fast on mobile, but held back by a non-commercial license (free for research; commercial use requires a Stability membership). Superseded in most benchmarks by SmolLM2 and Qwen 2.5 Coder 1.5B, but worth a mention for multilingual work.

13. TinyLlama 1.1B — 70/100

The OG tiny Llama. Apache 2.0, 700 MB at Q4, runs on basically anything. Quality has been lapped by SmolLM2 and Llama 3.2 1B, but if you need the smallest possible coherent model, TinyLlama is still a valid choice and has the largest community around sub-2B models.

14. Llama 3.2 1B — 68/100

Meta's 1B sibling to our #1. Only 68/100 because it's dramatically weaker than the 3B for just a ~2x speedup, and SmolLM2 1.7B edges it out on most tasks. Still useful as a draft model for speculative decoding or on truly tiny devices.
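The "draft model for speculative decoding" role deserves a quick illustration: a small model cheaply proposes several tokens, and the large model verifies them in one pass, keeping the longest agreeing prefix. A toy sketch with stub callables standing in for real models, using greedy agreement rather than the probabilistic accept/reject used in production runtimes:

```python
def speculative_decode(draft, target, prompt, k=4, max_tokens=12):
    """Toy greedy speculative decoding.

    draft/target: callables mapping a token list to the next token id.
    The draft proposes k tokens; the target checks them (notionally in one
    batched pass), keeps the longest agreeing prefix, and contributes one
    token of its own either way.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        # Draft phase: propose k tokens autoregressively (cheap model).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: accept draft tokens while the target agrees.
        for t in proposal:
            if target(out) == t:
                out.append(t)
            else:
                out.append(target(out))  # target's correction replaces the miss
                break
        else:
            out.append(target(out))  # all k accepted: one bonus token
    return out[len(prompt):][:max_tokens]
```

When the draft agrees often (the Llama 3.2 1B/3B pairing is the typical setup), most tokens cost only a cheap draft pass plus a shared verification pass; the output is identical to decoding with the target alone.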

15. OpenHermes 2.5 Mistral 7B — 65/100

A fine-tune of Mistral 7B v0.2 that was the go-to "smart" model of 2024. We include it because the training data is exceptional and it still produces great conversational output — but the base it's fine-tuned on is now two generations old. Use it if you want character and personality; use Qwen 2.5 7B if you want raw capability.

The summary table

#   Model                    Size (Q4)  Min RAM  License           Score
1   Llama 3.2 3B             2.0 GB     4 GB     Llama Community   94
2   Qwen 2.5 7B              4.5 GB     8 GB     Apache 2.0        92
3   Phi-3.5 Mini             2.4 GB     4 GB     MIT               91
4   Mistral Nemo 12B         7.5 GB     12 GB    Apache 2.0        88
5   Gemma 2 2B               1.6 GB     4 GB     Gemma Terms       87
6   Llama 3.1 8B             5.0 GB     8 GB     Llama Community   86
7   DeepSeek Coder V2 Lite   10 GB      16 GB    DeepSeek License  85
8   Qwen 2.5 Coder 1.5B      0.9 GB     2 GB     Apache 2.0        83
9   SmolLM2 1.7B             1.1 GB     3 GB     Apache 2.0        81
10  Mistral 7B v0.3          4.4 GB     8 GB     Apache 2.0        80
11  Gemma 2 9B               5.5 GB     10 GB    Gemma Terms       78
12  StableLM 2 1.6B          1.0 GB     3 GB     Stability NC      75
13  TinyLlama 1.1B           0.7 GB     2 GB     Apache 2.0        70
14  Llama 3.2 1B             0.8 GB     2 GB     Llama Community   68
15  OpenHermes 2.5           4.4 GB     8 GB     Apache 2.0        65

Which one should you actually run?

On a phone (iPhone or recent Android): Llama 3.2 3B is the default. If you want faster generation at some quality cost, drop to Gemma 2 2B or Qwen 2.5 Coder 1.5B. We walk through the iPhone-specific setup in our how-to-run-AI-offline guide.

On a MacBook Air M1/M2 (8 GB): Llama 3.2 3B, Phi-3.5 Mini, or Gemma 2 2B. Do not try to run 7B models on 8 GB — you can technically load them but the system swaps and everything slows to a crawl.

On a MacBook Pro M3/M4 (16 GB+): Qwen 2.5 7B for general use. DeepSeek Coder V2 Lite if you write code. Mistral Nemo 12B if you need long context.

On a gaming PC with a GPU: Anything on this list, plus fine-tuned variants. Runtime is usually llama.cpp with CUDA or Ollama.
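The decision logic above collapses to a small lookup. A sketch using this article's own picks; the RAM thresholds are the rough ones quoted above, not hard limits:

```python
def pick_model(ram_gb: float, use_case: str = "general") -> str:
    """Map available RAM and use case to this ranking's recommendations."""
    if ram_gb < 4:
        # Phone-class devices: smallest coherent options.
        return "Qwen 2.5 Coder 1.5B" if use_case == "code" else "Llama 3.2 1B"
    if ram_gb <= 8:
        # 8 GB machines swap under 7B models; stay at 3B-class.
        return "Llama 3.2 3B"
    if use_case == "code":
        return "DeepSeek Coder V2 Lite"
    if use_case == "long-context":
        return "Mistral Nemo 12B"
    return "Qwen 2.5 7B"
```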

How to actually run them

You need two things: the model weights and a runtime. For iPhone, the easiest path is an app that bundles both — PocketLLM, Private LLM, or LLM Farm. For Mac, LM Studio and Ollama handle model download and inference in one step. For a Linux box with a GPU, llama.cpp compiled from source is still the best option. If you're comparing the desktop options, we broke down the differences in Ollama vs LM Studio vs PocketLLM.
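On Mac and Linux, Ollama also exposes a local HTTP API (port 11434 by default), which makes scripting against a downloaded model straightforward. A minimal Python sketch against Ollama's /api/generate endpoint; it assumes a running Ollama server and an already-pulled model tag such as llama3.2:3b:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def build_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the request and return the model's text completion."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Usage: after `ollama pull llama3.2:3b`, calling `generate("llama3.2:3b", "Say hello.")` returns the completion as a string.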

The quick answer

If you want one model to run, run Llama 3.2 3B. It has the best quality-per-megabyte ratio in 2026, it's supported in every runtime, and it's small enough to run on your phone. If you have 16 GB of RAM and want the smartest option, run Qwen 2.5 7B. Everything else on this list is a specialization of those two baselines.

Want all of these as one-tap downloads inside a privacy-first iPhone app? PocketLLM packages the top-ranked models and handles Core ML conversion for you. Join the waitlist.

Every model on this list, on your iPhone.

PocketLLM ships the best local LLMs as one-tap downloads, fully on-device, with zero telemetry. Join the waitlist.
