Most "best LLMs for coding" articles rank hosted proprietary models plus a grudging mention of DeepSeek. This one ranks open-source and hosted models on the same rubric because, for a lot of real-world work, the best answer runs on your own laptop. We score on HumanEval, real-task performance on a fixed set of refactoring and debugging prompts, context handling, speed, and whether you can use the model without sending your proprietary codebase to a third party.
Short version: Claude 3.5 Sonnet is still the best hosted coding model. Qwen 2.5 Coder 32B is the best one you can run locally. DeepSeek Coder V2 Lite is the best one that fits on a consumer laptop. Scroll down for the full ranking, or jump to the summary table.
How we scored
- HumanEval (25%): The standard pass@1 benchmark, reported by each model's primary source.
- Real-task performance (35%): A fixed suite of 20 prompts we run on every model: 5 refactoring tasks, 5 bug fixes, 5 feature additions, and 5 "explain this code" reads. Scored on correctness, style, and (where applicable) whether the output compiles.
- Context handling (15%): How well the model handles a 50K-token codebase pasted into its context, particularly across multiple files.
- Speed (10%): Wall-clock time to generate a 200-line function.
- Privacy (15%): Whether you can run the model without sending code to an external party. Local-capable models get a large bonus here.
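Combining the five criteria is a straight weighted sum. A minimal sketch in Python, using the rubric weights above with purely illustrative subscores (not the actual numbers behind any score in this post):

```python
# Rubric weights from the list above (sum to 1.0).
WEIGHTS = {
    "humaneval": 0.25,
    "real_tasks": 0.35,
    "context": 0.15,
    "speed": 0.10,
    "privacy": 0.15,
}

def overall_score(subscores: dict[str, float]) -> float:
    """Weighted sum of per-criterion subscores, each on a 0-100 scale."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

# Illustrative only: a strong hosted model with a weak privacy score.
example = {"humaneval": 92, "real_tasks": 95, "context": 90, "speed": 85, "privacy": 40}
```

Note how the privacy weight works: a local-capable model with middling benchmarks can close a lot of the gap on a stronger hosted one, which is exactly how the ranking below shakes out.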
The 15 best coding LLMs in 2026
1. Claude 3.5 Sonnet — 96/100
Still the best hosted coding model: first in our real-task suite and at or near the top of every major public benchmark. Handles long context well, explains its reasoning clearly, rarely invents APIs, and produces code that compiles more often than any competitor's. The tradeoff is that it's a hosted proprietary model, and you cannot run it locally or on-device. If your code is confidential, read on.
2. GPT-4o — 92/100
OpenAI's flagship. Slightly weaker than Sonnet on pure coding benchmarks, slightly stronger on natural-language reasoning about code. Excellent at "read this stack trace and tell me what's wrong." Same caveat as Sonnet: hosted, proprietary, sends your code to OpenAI.
3. Qwen 2.5 Coder 32B — 90/100
The best open-weights coding model of 2025-2026 and the highest-ranked model on this list that you can actually run yourself. Apache 2.0 license, 32B parameters (needs ~20 GB RAM at Q4), benchmarks within a few points of Claude 3.5 Sonnet on HumanEval. If you have a Mac Studio or a gaming PC, this is the local coding model.
4. DeepSeek Coder V2 236B — 89/100
The full 236B mixture-of-experts model. Incredible on benchmarks, essentially untouchable on Chinese-language code tasks. You probably can't run it — it needs a multi-GPU workstation. We list it because the smaller Lite variant (#6) inherits most of its smarts and is much more practical.
5. Gemini 1.5 Pro — 86/100
Google's long-context champion. The killer feature is the 2M-token context window, which means you can paste a small codebase and ask questions that span files. Coding quality is a step below Sonnet and GPT-4o but genuinely close, and the context advantage matters for real work.
6. DeepSeek Coder V2 Lite 16B — 85/100
Only 2.4B active parameters per token (mixture-of-experts), so it runs at ~3B speeds. HumanEval scores in the mid-80s. The best coding model that fits on a 16 GB laptop. If you want local code assistance and don't have workstation hardware, this is the pick. Covered in depth in our best local LLMs for coding post.
7. Llama 3.3 70B — 84/100
The largest model in Meta's Llama 3.3 generation. Not a dedicated coding model, but smart enough to handle most tasks well. Needs serious hardware (48 GB+ RAM at Q4). Use it when you want one model for both coding and general reasoning and have the RAM to run it.
8. Mistral Codestral 22B — 82/100
Mistral's dedicated coding model. Runs in ~14 GB at Q4, performs well on both completion and chat-style coding tasks, and is particularly good at Python and TypeScript. One caveat: the non-commercial research license is restrictive, so check the terms before any commercial use.
9. Qwen 2.5 Coder 7B — 80/100
The 7B sibling of our #3 pick. Runs in ~4.5 GB at Q4, HumanEval in the high 70s, Apache 2.0. The sweet spot for a 16 GB MacBook Air. If you tried the 1.5B variant and wanted more capability without workstation hardware, this is the next step up.
10. StarCoder 2 15B — 78/100
Hugging Face + ServiceNow + NVIDIA's open coding model. Trained on permissively licensed code, which matters if you care about what your tools learned from. Slightly weaker than Qwen 2.5 Coder and DeepSeek V2 Lite on benchmarks but with the cleanest provenance story.
11. Phi-3.5 Mini — 75/100
Microsoft's 3.8B tiny model. Not a dedicated coding model but included because it punches absurdly far above its weight on HumanEval (mid-60s from 3.8B parameters). MIT license, 2.4 GB at Q4, runs on a phone. The best coding model that will comfortably fit on an iPhone.
12. Qwen 2.5 Coder 1.5B — 72/100
The smallest dedicated coding model we'd recommend. HumanEval 60%+ at just 1.5B parameters, Apache 2.0, and fast enough to do inline completion on a phone or a cheap laptop. A viable first-line code completion model when latency matters more than ceiling.
13. Llama 3.1 8B — 70/100
Not a dedicated coding model, but capable enough to be useful and widely supported in every runtime. Included because its ubiquity means every code editor plugin already has an integration path. Beaten on pure coding by Qwen 2.5 Coder 7B.
14. Codestral Mamba 7B — 68/100
Mistral's experimental coding model built on the Mamba state-space architecture. Worth mentioning because it scales linearly with sequence length (no quadratic attention), so it handles very long code files efficiently. Quality is a notch below the Transformer-based competitors, but for pure long-context code work it can be the better choice.
15. CodeLlama 13B — 60/100
The 2023 model that made "local coding LLMs" a real thing. Included as a historical benchmark. It's been comfortably beaten by Qwen Coder, DeepSeek, StarCoder 2, and basically everything newer. If you're starting fresh, don't pick CodeLlama. If you have it set up already, upgrade.
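Before the table, a note on the RAM figures quoted throughout (e.g. ~20 GB for a 32B model at Q4): they follow from a simple back-of-envelope you can reuse for any model. A sketch, assuming roughly 4.5 bits per weight for a typical 4-bit quantization plus a flat runtime overhead; the KV cache for long contexts comes on top:

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.5,
                      overhead_gb: float = 1.0) -> float:
    """Rough RAM needed to load a model at a given quantization level.

    bits_per_weight of ~4.5 approximates a 4-bit quant (4-bit weights
    plus per-block scaling metadata); overhead_gb covers runtime buffers.
    The KV cache grows with context length and is not included here.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params x bits -> GB
    return round(weights_gb + overhead_gb, 1)
```

For the 32B Qwen this gives about 19 GB, in line with the ~20 GB quoted above; for a 7B model, about 5 GB.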
The summary table
| # | Model | Local? | License | HumanEval | Score |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | No | Proprietary | ~92% | 96 |
| 2 | GPT-4o | No | Proprietary | ~91% | 92 |
| 3 | Qwen 2.5 Coder 32B | Yes (workstation) | Apache 2.0 | ~89% | 90 |
| 4 | DeepSeek Coder V2 236B | Yes (multi-GPU) | DeepSeek | ~90% | 89 |
| 5 | Gemini 1.5 Pro | No | Proprietary | ~85% | 86 |
| 6 | DeepSeek Coder V2 Lite | Yes (laptop) | DeepSeek | ~84% | 85 |
| 7 | Llama 3.3 70B | Yes (workstation) | Llama Community | ~80% | 84 |
| 8 | Mistral Codestral 22B | Yes (laptop) | Mistral NC | ~81% | 82 |
| 9 | Qwen 2.5 Coder 7B | Yes (laptop) | Apache 2.0 | ~78% | 80 |
| 10 | StarCoder 2 15B | Yes (laptop) | BigCode OpenRAIL | ~75% | 78 |
| 11 | Phi-3.5 Mini | Yes (phone) | MIT | ~65% | 75 |
| 12 | Qwen 2.5 Coder 1.5B | Yes (phone) | Apache 2.0 | ~61% | 72 |
| 13 | Llama 3.1 8B | Yes (laptop) | Llama Community | ~62% | 70 |
| 14 | Codestral Mamba 7B | Yes (laptop) | Apache 2.0 | ~65% | 68 |
| 15 | CodeLlama 13B | Yes (laptop) | Llama Community | ~55% | 60 |
Which one should you actually use?
For work where sending code to a third party is acceptable: Claude 3.5 Sonnet. No coding assistant is better on real-world tasks right now, and the difference from GPT-4o is small but consistent.
For confidential code on a workstation: Qwen 2.5 Coder 32B. Close enough to the frontier that you won't miss Sonnet much, and your code never leaves the machine.
For confidential code on a 16 GB laptop: DeepSeek Coder V2 Lite 16B. The MoE architecture is genuinely impressive, and the quality is shockingly close to the 32B pack.
For confidential code on a phone: Phi-3.5 Mini or Qwen 2.5 Coder 1.5B. You're not going to one-shot a Rails controller, but you will absolutely get help understanding and modifying existing code. We bundle both in PocketLLM's upcoming release — join the waitlist.
For long-context code questions: Gemini 1.5 Pro's 2M window is unique.
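If you're deciding whether a codebase actually needs a giant window, a rough character count gets you close. A sketch, assuming the common ~4-characters-per-token heuristic (real tokenizers vary by language and content; the extension list is just an example):

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic; actual tokenizers vary

def estimate_tokens(root: str, exts: tuple[str, ...] = (".py", ".ts", ".go")) -> int:
    """Approximate token count of source files under root."""
    total_chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
    return total_chars // CHARS_PER_TOKEN

def fits_window(tokens: int, window: int = 2_000_000) -> bool:
    """Leave ~20% headroom for your question and the model's reply."""
    return tokens <= window * 0.8
```

Run it against your repo before reaching for the 2M window; a surprising number of "long context" jobs fit a 128K window once tests and vendored code are excluded.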
The quick answer
The best LLM for coding in 2026 depends on one question: can you send your code to a third party? If yes, Claude 3.5 Sonnet. If no, Qwen 2.5 Coder (at whatever size your hardware supports). If you want coding help on your phone, the local models are finally good enough that it's worth doing. Our dedicated local-coding roundup has the on-device picks in detail.