Why Run Models Locally

There are legitimate reasons to run language models on your own hardware instead of calling a cloud API. Data stays on your machine. Latency drops from 500ms to 50ms for smaller models. You stop paying per-token for repetitive development loops. And for multi-agent systems where agents call each other dozens of times per task, the cost of cloud inference adds up fast.

But local inference comes with tradeoffs. You need specific hardware. Model quality varies dramatically by size. And the ecosystem of tools, quantizations, and runtimes changes every few weeks.

This post is a practical snapshot of what’s available as of late 2025, what hardware you need, and how to get started without wasting time on dead ends.


The Model Landscape

The open-source LLM ecosystem has settled into a few major families. Each has different strengths, licensing terms, and hardware requirements.

Models Worth Running Locally

| Model | Parameters | License | Strengths | Best For |
|---|---|---|---|---|
| Llama 3.3 | 70B | Llama 3.3 Community | Strong reasoning, tool calling, instruction following | General-purpose, coding, analysis |
| Qwen 2.5 | 0.5B to 72B | Apache 2.0 | Excellent multilingual, strong coding variants | Code generation, multilingual tasks |
| Qwen 2.5 Coder | 1.5B to 32B | Apache 2.0 | Purpose-built for code, outperforms GPT-4 on benchmarks at 32B | Code writing, refactoring, review |
| Mistral | 7B | Apache 2.0 | Fast inference, good quality-to-size ratio | Quick responses, constrained hardware |
| DeepSeek R1 | 1.5B to 671B | MIT | Chain-of-thought reasoning, distilled variants | Math, logic, step-by-step analysis |
| DeepSeek Coder V2 | 16B to 236B | MIT | Code-specialized with fill-in-the-middle | IDE integration, code completion |
| Phi-3 / Phi-4 | 3.8B to 14B | MIT | Microsoft research models, strong for size | Edge deployment, resource-limited setups |
| Gemma 2 | 2B to 27B | Gemma Terms | Google’s open models, good instruction following | General tasks, research |
| CodeLlama | 7B to 70B | Llama 2 Community | Meta’s code-specialized Llama variant | Code generation, infilling |
| StarCoder2 | 3B to 15B | BigCode OpenRAIL-M | Trained on The Stack v2 (600+ languages) | Multi-language code tasks |

This isn’t exhaustive. New models appear weekly. But these families cover the practical use cases for local inference today.

Where to Get Models

Ollama (ollama.com) is the fastest path from zero to inference. It manages model downloads, quantization selection, and runtime in a single tool. Install it, run ollama pull llama3.3:70b, and you have a running model in minutes.

Hugging Face (huggingface.co) hosts the original model weights and community-contributed quantizations. Search for a model name + “GGUF” to find pre-quantized versions ready for local inference.

LM Studio (lmstudio.ai) provides a GUI for browsing, downloading, and running models. Good for exploration when you want to quickly compare models without terminal commands.


Hardware Requirements

This is where most guides get vague. Here are concrete numbers based on actual testing.

The Core Constraint: VRAM

Language model inference is memory-bound. The model weights need to fit in memory (GPU VRAM or Apple Silicon unified memory) along with the KV cache for context processing. If the model doesn’t fit, inference either fails or falls back to CPU, which is 10-50x slower.

How Much Memory Models Actually Need

The memory requirement depends on parameter count and quantization level. Quantization compresses model weights from 16-bit floats to smaller representations (8-bit, 4-bit) with minimal quality loss.

| Parameters | FP16 (full) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) |
|---|---|---|---|---|
| 7B | 14 GB | 7.5 GB | 5.5 GB | 4.5 GB |
| 13B | 26 GB | 14 GB | 10 GB | 8 GB |
| 32B | 64 GB | 34 GB | 24 GB | 18 GB |
| 70B | 140 GB | 75 GB | 52 GB | 42 GB |

These are approximate. Actual memory usage varies by model architecture and context length. Add 1-4 GB for KV cache overhead depending on context window size.
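The table’s numbers come from simple arithmetic: bytes ≈ parameters × bits-per-weight ÷ 8, plus runtime overhead. A rough sketch for estimating whether a model fits your hardware (the 10% overhead factor and the ~4.5 effective bits for Q4 quantizations are assumptions, not measured constants):

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: float,
                             overhead_factor: float = 1.1) -> float:
    """Rough memory estimate for loaded model weights.

    bytes = params * bits / 8. The overhead_factor (assumed ~10%) covers
    embeddings and runtime buffers; KV cache is extra (add 1-4 GB).
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8 * overhead_factor
    return bytes_total / 1e9  # decimal GB


# 7B at FP16: ~15 GB with overhead (table lists 14 GB for weights alone)
print(round(estimate_model_memory_gb(7, 16), 1))

# 70B at ~4.5 effective bits (Q4 quantizations store some layers at
# higher precision): in the ballpark of the table's 42 GB
print(round(estimate_model_memory_gb(70, 4.5), 1))
```

Comparing that estimate against your VRAM (or unified memory) before pulling a multi-gigabyte download saves time.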

Hardware Configurations That Work

Entry Level: 8-16 GB VRAM (RTX 3060/4060, M1/M2 MacBook)

What runs well: 7B models at Q4-Q8 quantization. This covers Mistral 7B, Qwen 2.5 7B, Phi-3, Gemma 2 2B/9B. Response quality is acceptable for code completion, simple Q&A, and basic agent workflows. Expect 15-40 tokens per second depending on the GPU.

What doesn’t work: Anything above 13B parameters. The model either won’t load or swaps to CPU and becomes unusably slow.

Mid Range: 24-48 GB VRAM (RTX 3090/4090, M2/M3 Pro/Max)

What runs well: 13B-32B models at Q4-Q8. This is the sweet spot for local development. Qwen 2.5 32B and DeepSeek Coder V2 16B produce output that competes with cloud APIs for many tasks. Expect 10-25 tokens per second.

What doesn’t work: 70B models will partially fit with aggressive quantization but inference speed drops to 3-5 tokens per second. Usable for batch processing, painful for interactive use.

High End: 64-96 GB Unified Memory (M3 Ultra Mac Studio, Multi-GPU setups)

What runs well: 70B models at Q4-Q5 quantization. Llama 3.3 70B and DeepSeek R1 70B run with acceptable speed (5-15 tokens per second). This is where local inference starts matching cloud API quality for complex tasks.

What doesn’t work: The largest models (DeepSeek R1 671B, Llama 3.1 405B) still need clusters or specialized hardware. These aren’t practical for individual developers.

Apple Silicon vs NVIDIA

Apple Silicon (M1/M2/M3/M4) uses unified memory shared between CPU and GPU. The advantage: your entire RAM pool is available for model weights. An M3 Ultra with 96 GB can load a 70B Q4 model entirely in memory.

NVIDIA GPUs have dedicated VRAM. An RTX 4090 has 24 GB. For models larger than 24 GB, you either need multiple GPUs or accept CPU offloading. The advantage: higher throughput per token for models that fit. CUDA acceleration is faster than Metal for raw inference speed.

For most developers running one or two models for daily use, Apple Silicon with 32 GB+ unified memory offers the best experience. For batch processing or serving multiple concurrent requests, NVIDIA GPUs with high VRAM are more cost-effective per token.


Getting Started in 10 Minutes

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download/windows

Ollama starts a local server on port 11434. It handles model management, quantization selection, and provides an OpenAI-compatible API endpoint.

Step 2: Pull a Model

# Start with a 7B model (fits on most hardware)
ollama pull qwen2.5:7b

# If you have 24GB+ VRAM, try a 32B model
ollama pull qwen2.5:32b

# If you have 48GB+, go for 70B
ollama pull llama3.3:70b

Step 3: Test It

# Interactive chat
ollama run qwen2.5:7b

# API call (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list"}]
  }'
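The same call works from Python with only the standard library. A minimal sketch, assuming the Ollama server is running on its default port (the `build_chat_payload` and `chat` helpers are illustrative names, not part of any library):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(model: str, prompt: str) -> str:
    """POST to Ollama's OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
# print(chat("qwen2.5:7b", "Write a Python function to reverse a linked list"))
```

Because the endpoint is OpenAI-compatible, official OpenAI client libraries also work if you point their base URL at `http://localhost:11434/v1`.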

Step 4: Run Multiple Models

Multi-agent systems need multiple models running simultaneously. Ollama handles this automatically. Each model loads when called and unloads when not in use (configurable via OLLAMA_KEEP_ALIVE).

# In one terminal
ollama run qwen2.5:7b

# In another terminal (second model loads alongside)
ollama run mistral:7b

Monitor memory usage with ollama ps to see which models are loaded and how much memory they consume.


Practical Comparison: Model Quality vs Size

I’ve tested these models across coding, reasoning, and multi-agent collaboration tasks. Here’s what I’ve observed:

Code Generation Quality

| Model | Simple Functions | Multi-file Refactor | Bug Analysis | Tool Calling |
|---|---|---|---|---|
| Mistral 7B | Acceptable | Weak | Weak | Basic |
| Qwen 2.5 7B | Good | Acceptable | Acceptable | Good |
| Qwen 2.5 Coder 32B | Very Good | Good | Good | Very Good |
| Llama 3.3 70B | Very Good | Very Good | Good | Very Good |
| DeepSeek R1 32B | Good | Acceptable | Very Good (CoT) | Limited |

Multi-Agent Suitability

For multi-agent orchestration (supervisor + worker + reviewer), model size directly affects collaboration quality:

7B models can execute tasks but struggle with multi-step reasoning. They echo system prompts, hallucinate agent names, produce repetitive output, and can’t verify each other’s arithmetic. Useful for prototyping agent workflows but not reliable for complex tasks.

32B models handle delegation, tool calling, and basic review well. They maintain context across multi-turn exchanges and produce meaningfully different output on retries. This is the practical minimum for production multi-agent systems.

70B models excel at decomposition, nuanced review feedback, and maintaining narrative coherence across long pipelines. The supervisor’s task planning is noticeably better at 70B than 32B. But the hardware requirements (48 GB+ VRAM) limit accessibility.
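The supervisor/worker/reviewer pattern above can be sketched in a few lines. This is an illustration, not a framework: the `chat` callable (model name and prompt in, reply text out) is injected, so any backend works, including Ollama’s OpenAI-compatible endpoint or a stub for testing:

```python
from typing import Callable

Chat = Callable[[str, str], str]  # (model, prompt) -> reply text


def review_loop(chat: Chat, task: str, worker_model: str,
                reviewer_model: str, max_rounds: int = 3) -> str:
    """Worker drafts, reviewer critiques; stop when the reviewer approves."""
    draft = chat(worker_model, f"Complete this task:\n{task}")
    for _ in range(max_rounds):
        verdict = chat(
            reviewer_model,
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            "Reply APPROVED if acceptable, otherwise list required fixes.",
        )
        if verdict.strip().upper().startswith("APPROVED"):
            break
        draft = chat(
            worker_model,
            f"Task:\n{task}\n\nPrevious draft:\n{draft}\n\n"
            f"Reviewer feedback:\n{verdict}\n\nRevise the draft.",
        )
    return draft
```

With local models, the worker and reviewer can be different sizes: a 32B worker paired with a 7B reviewer keeps memory use down, though per the observations above the reviewer’s feedback quality drops with size.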


Common Pitfalls

Running models too large for your hardware. If a 32B model works but is painfully slow (under 5 tokens per second), drop to the 7B variant. Speed matters more than marginal quality gains for development workflows.

Ignoring quantization options. The Q4 quantization of a 32B model often outperforms the FP16 version of a 13B model while using less memory. Check the quantization options available for your chosen model before assuming you need more hardware.

Not setting OLLAMA_KEEP_ALIVE. By default, Ollama unloads models after 5 minutes of inactivity. For multi-agent setups where models are called intermittently, set OLLAMA_KEEP_ALIVE=30m to avoid repeated loading delays.
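For a persistent setup, set the variable where the server starts. A minimal sketch (30m matches the recommendation above; any duration string Ollama accepts works):

```shell
# Keep loaded models resident for 30 minutes of inactivity
export OLLAMA_KEEP_ALIVE=30m
ollama serve
```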

Expecting cloud API quality from 7B models. A 7B model is roughly equivalent to GPT-3.5. It will make mistakes that GPT-4 or Claude wouldn’t. Calibrate expectations accordingly and design your system (review loops, guardrails) to compensate.


Where to Go From Here

The open model ecosystem is moving fast, and any snapshot like this one ages quickly: check current model releases and benchmarks before committing to a setup.

The practical takeaway: you can run genuinely useful language models on consumer hardware today. A machine with 16 GB of memory and an Ollama install gets you started. From there, model quality scales with the memory you’re willing to throw at it.


Running LLMs locally went from research experiment to daily development tool in under two years. The models are here, the tooling works, and the hardware requirements have dropped to consumer grade. The remaining question isn’t whether to run locally. It’s which model fits your hardware and your use case.