# Running LLMs Locally: A Practical Guide to Models, Hardware, and Getting Started
## Why Run Models Locally
There are legitimate reasons to run language models on your own hardware instead of calling a cloud API. Data stays on your machine. Round-trip latency can drop from hundreds of milliseconds to tens of milliseconds for smaller models. You stop paying per token for repetitive development loops. And for multi-agent systems where agents call each other dozens of times per task, the cost of cloud inference adds up fast.
But local inference comes with tradeoffs. You need specific hardware. Model quality varies dramatically by size. And the ecosystem of tools, quantizations, and runtimes changes every few weeks.
This post is a practical snapshot of what’s available as of late 2025, what hardware you need, and how to get started without wasting time on dead ends.
## The Model Landscape
The open-source LLM ecosystem has settled into a few major families. Each has different strengths, licensing terms, and hardware requirements.
### Models Worth Running Locally
| Model | Parameters | License | Strengths | Best For |
|---|---|---|---|---|
| Llama 3.3 | 70B | Llama 3.3 Community | Strong reasoning, tool calling, instruction following | General-purpose, coding, analysis |
| Qwen 2.5 | 0.5B to 72B | Apache 2.0 | Excellent multilingual, strong coding variants | Code generation, multilingual tasks |
| Qwen 2.5 Coder | 1.5B to 32B | Apache 2.0 | Purpose-built for code, competitive with GPT-4 on coding benchmarks at 32B | Code writing, refactoring, review |
| Mistral | 7B | Apache 2.0 | Fast inference, good quality-to-size ratio | Quick responses, constrained hardware |
| DeepSeek R1 | 1.5B to 671B | MIT | Chain-of-thought reasoning, distilled variants | Math, logic, step-by-step analysis |
| DeepSeek Coder V2 | 16B to 236B | MIT | Code-specialized with fill-in-the-middle | IDE integration, code completion |
| Phi-3 / Phi-4 | 3.8B to 14B | MIT | Microsoft research models, strong for size | Edge deployment, resource-limited setups |
| Gemma 2 | 2B to 27B | Gemma Terms | Google’s open models, good instruction following | General tasks, research |
| CodeLlama | 7B to 70B | Llama 2 Community | Meta’s code-specialized Llama variant | Code generation, infilling |
| StarCoder2 | 3B to 15B | BigCode OpenRAIL-M | Trained on The Stack v2 (600+ languages) | Multi-language code tasks |
This isn’t exhaustive. New models appear weekly. But these families cover the practical use cases for local inference today.
### Where to Get Models
**Ollama** (ollama.com) is the fastest path from zero to inference. It manages model downloads, quantization selection, and the runtime in a single tool. Install it, run `ollama pull qwen2.5:7b`, and you have a runnable model in minutes; larger pulls like `llama3.3:70b` take longer, since the Q4 weights alone are roughly 40 GB.
**Hugging Face** (huggingface.co) hosts the original model weights and community-contributed quantizations. Search for a model name plus “GGUF” to find pre-quantized versions ready for local inference.
**LM Studio** (lmstudio.ai) provides a GUI for browsing, downloading, and running models. Good for exploration when you want to compare models quickly without terminal commands.
## Hardware Requirements
This is where most guides get vague. Here are concrete numbers based on actual testing.
### The Core Constraint: VRAM
Language model inference is memory-bound. The model weights need to fit in memory (GPU VRAM or Apple Silicon unified memory) along with the KV cache for context processing. If the model doesn’t fit, inference either fails or falls back to CPU, which is 10-50x slower.
### How Much Memory Models Actually Need
The memory requirement depends on parameter count and quantization level. Quantization compresses model weights from 16-bit floats to smaller representations (8-bit, 4-bit) with minimal quality loss.
| Parameters | FP16 (full) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) |
|---|---|---|---|---|
| 7B | 14 GB | 7.5 GB | 5.5 GB | 4.5 GB |
| 13B | 26 GB | 14 GB | 10 GB | 8 GB |
| 32B | 64 GB | 34 GB | 24 GB | 18 GB |
| 70B | 140 GB | 75 GB | 52 GB | 42 GB |
These are approximate. Actual memory usage varies by model architecture and context length. Add 1-4 GB for KV cache overhead depending on context window size.
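The table above can be reproduced with a back-of-the-envelope formula: weight memory is roughly parameters × bits per weight ÷ 8, plus a small overhead for quantization scales and runtime buffers. A minimal sketch, assuming an effective 4.5 bits per weight for Q4 and a ~7% overhead factor; both constants are my approximations, not properties of any particular model or runtime:

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float,
                       kv_cache_gb: float = 0.0) -> float:
    """Rough memory estimate for model weights at a given quantization.

    The ~7% overhead factor (an assumption) covers quantization scale
    factors, higher-precision embeddings, and runtime buffers. Pass
    kv_cache_gb=1.0-4.0 to include the KV cache, as noted in the text.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    weight_gb = weight_bytes / 1e9 * 1.07
    return round(weight_gb + kv_cache_gb, 1)

# Q4 is roughly 4.5 effective bits per weight in common GGUF schemes
for params in (7, 13, 32, 70):
    print(f"{params}B @ Q4: ~{estimate_memory_gb(params, 4.5)} GB")
```

The numbers land in the same ballpark as the table (for example, ~42 GB for a 70B model at Q4), which is close enough for deciding whether a model fits your hardware.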
### Hardware Configurations That Work
#### Entry Level: 8-16 GB VRAM (RTX 3060/4060, M1/M2 MacBook)

**What runs well:** 7B models at Q4-Q8 quantization. This covers Mistral 7B, Qwen 2.5 7B, Phi-3, Gemma 2 2B/9B. Response quality is acceptable for code completion, simple Q&A, and basic agent workflows. Expect 15-40 tokens per second depending on the GPU.

**What doesn’t work:** Anything above 13B parameters. The model either won’t load or swaps to CPU and becomes unusably slow.
#### Mid Range: 24-48 GB VRAM (RTX 3090/4090, M2/M3 Pro/Max)

**What runs well:** 13B-32B models at Q4-Q8. This is the sweet spot for local development. Qwen 2.5 32B and DeepSeek Coder V2 16B produce output that competes with cloud APIs for many tasks. Expect 10-25 tokens per second.

**What doesn’t work:** 70B models will partially fit with aggressive quantization, but inference speed drops to 3-5 tokens per second. Usable for batch processing, painful for interactive use.
#### High End: 64-96 GB Unified Memory (M3 Ultra Mac Studio, Multi-GPU setups)

**What runs well:** 70B models at Q4-Q5 quantization. Llama 3.3 70B and DeepSeek R1 70B run at acceptable speed (5-15 tokens per second). This is where local inference starts matching cloud API quality for complex tasks.

**What doesn’t work:** The largest models (DeepSeek R1 671B, Llama 3.1 405B) still need clusters or specialized hardware. These aren’t practical for individual developers.
### Apple Silicon vs NVIDIA
Apple Silicon (M1/M2/M3/M4) uses unified memory shared between CPU and GPU. The advantage: your entire RAM pool is available for model weights. An M3 Ultra with 96 GB can load a 70B Q4 model entirely in memory.
NVIDIA GPUs have dedicated VRAM. An RTX 4090 has 24 GB. For models larger than that, you either need multiple GPUs or accept CPU offloading. The advantage: higher raw throughput for models that do fit, since CUDA kernels are generally faster than Metal for inference.
For most developers running one or two models for daily use, Apple Silicon with 32 GB+ unified memory offers the best experience. For batch processing or serving multiple concurrent requests, NVIDIA GPUs with high VRAM are more cost-effective per token.
## Getting Started in 10 Minutes
### Step 1: Install Ollama

```shell
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download/windows
```
Ollama starts a local server on port 11434. It handles model management, quantization selection, and provides an OpenAI-compatible API endpoint.
### Step 2: Pull a Model

```shell
# Start with a 7B model (fits on most hardware)
ollama pull qwen2.5:7b

# If you have 24 GB+ VRAM, try a 32B model
ollama pull qwen2.5:32b

# If you have 48 GB+, go for 70B
ollama pull llama3.3:70b
```
### Step 3: Test It

```shell
# Interactive chat
ollama run qwen2.5:7b

# API call (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list"}]
  }'
```
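The same endpoint works from any language that can make an HTTP request. A minimal Python sketch using only the standard library; it assumes Ollama is running on its default port, and the helper names (`build_request`, `chat`) are mine, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_request(model, prompt)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the reply under choices[0].message
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("qwen2.5:7b", "Write a Python function to reverse a linked list"))
```

Because the endpoint speaks the OpenAI wire format, existing OpenAI client libraries can also be pointed at it by overriding the base URL.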
### Step 4: Run Multiple Models

Multi-agent systems need multiple models running simultaneously. Ollama handles this automatically: each model loads when called and unloads when idle (configurable via `OLLAMA_KEEP_ALIVE`).

```shell
# In one terminal
ollama run qwen2.5:7b

# In another terminal (second model loads alongside)
ollama run mistral:7b
```
Monitor memory usage with `ollama ps` to see which models are loaded and how much memory they consume.
## Practical Comparison: Model Quality vs Size
I’ve tested these models across coding, reasoning, and multi-agent collaboration tasks. Here’s what I’ve observed:
### Code Generation Quality
| Model | Simple Functions | Multi-file Refactor | Bug Analysis | Tool Calling |
|---|---|---|---|---|
| Mistral 7B | Acceptable | Weak | Weak | Basic |
| Qwen 2.5 7B | Good | Acceptable | Acceptable | Good |
| Qwen 2.5 Coder 32B | Very Good | Good | Good | Very Good |
| Llama 3.3 70B | Very Good | Very Good | Good | Very Good |
| DeepSeek R1 32B | Good | Acceptable | Very Good (CoT) | Limited |
### Multi-Agent Suitability
For multi-agent orchestration (supervisor + worker + reviewer), model size directly affects collaboration quality:
**7B models** can execute tasks but struggle with multi-step reasoning. They echo system prompts, hallucinate agent names, produce repetitive output, and can’t verify each other’s arithmetic. Useful for prototyping agent workflows but not reliable for complex tasks.

**32B models** handle delegation, tool calling, and basic review well. They maintain context across multi-turn exchanges and produce meaningfully different output on retries. This is the practical minimum for production multi-agent systems.

**70B models** excel at decomposition, nuanced review feedback, and maintaining narrative coherence across long pipelines. The supervisor’s task planning is noticeably better at 70B than 32B. But the hardware requirements (48 GB+ VRAM) limit accessibility.
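To make the supervisor/worker/reviewer shape concrete, here is a minimal orchestration skeleton. The model call is injected as a plain callable so the same loop runs against Ollama's API or a stub; the function names and the retry policy are illustrative choices of mine, not a prescribed framework:

```python
from typing import Callable

# A model is just a function from prompt to completion text
ModelFn = Callable[[str], str]

def run_pipeline(task: str, supervisor: ModelFn, worker: ModelFn,
                 reviewer: ModelFn, max_retries: int = 2) -> str:
    """Supervisor decomposes the task, worker executes, reviewer gates."""
    plan = supervisor(f"Break this task into concrete steps:\n{task}")
    draft = worker(f"Complete these steps:\n{plan}")
    for _ in range(max_retries):
        verdict = reviewer(
            f"Review this work. Reply APPROVE or list issues:\n{draft}")
        if verdict.strip().startswith("APPROVE"):
            return draft
        # Reviewer rejected: send the feedback back to the worker
        draft = worker(
            f"Revise to address this feedback:\n{verdict}\n\nWork:\n{draft}")
    return draft  # best effort after exhausting retries

if __name__ == "__main__":
    # Stub models exercise the control flow without a live server
    supervisor = lambda p: "1. do the thing"
    worker = lambda p: "result"
    reviewer = lambda p: "APPROVE"
    print(run_pipeline("demo task", supervisor, worker, reviewer))  # prints "result"
```

In practice each `ModelFn` would wrap an HTTP call to a different local model, which is where the size observations above bite: a 7B reviewer rarely catches a 7B worker's mistakes.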
## Common Pitfalls
**Running models too large for your hardware.** If a 32B model works but is painfully slow (under 5 tokens per second), drop to the 7B variant. Speed matters more than marginal quality gains for development workflows.

**Ignoring quantization options.** The Q4 quantization of a 32B model often outperforms the FP16 version of a 13B model while using less memory. Check the quantization options available for your chosen model before assuming you need more hardware.

**Not setting `OLLAMA_KEEP_ALIVE`.** By default, Ollama unloads models after 5 minutes of inactivity. For multi-agent setups where models are called intermittently, set `OLLAMA_KEEP_ALIVE=30m` to avoid repeated loading delays.

**Expecting cloud API quality from 7B models.** A 7B model is roughly equivalent to GPT-3.5. It will make mistakes that GPT-4 or Claude wouldn’t. Calibrate expectations accordingly and design your system (review loops, guardrails) to compensate.
## Where to Go From Here
The open model ecosystem is moving fast. A few resources worth bookmarking:
- **Ollama Model Library**: ollama.com/library for the catalog of ready-to-run models
- **Hugging Face Open LLM Leaderboard**: huggingface.co/spaces/open-llm-leaderboard for benchmark comparisons
- **LMSYS Chatbot Arena**: chat.lmsys.org for human preference rankings
- **TheBloke’s Quantizations**: Search Hugging Face for “TheBloke” + model name for pre-quantized GGUF files
The practical takeaway: you can run genuinely useful language models on consumer hardware today. A machine with 16 GB of memory and an Ollama install gets you started. From there, model quality scales with the hardware you’re willing to throw at it.
Running LLMs locally went from research experiment to daily development tool in under two years. The models are here, the tooling works, and the hardware requirements have dropped to consumer grade. The remaining question isn’t whether to run locally. It’s which model fits your hardware and your use case.