Why Run Models Locally

There are legitimate reasons to run language models on your own hardware instead of calling a cloud API. Data stays on your machine. Latency drops from 500ms to 50ms for smaller models. You stop paying per-token for repetitive development loops. And for multi-agent systems where agents call each other dozens of times per task, the cost of cloud inference adds up fast.

But local inference comes with tradeoffs. You need specific hardware. Model quality varies dramatically by size. And the ecosystem of tools, quantizations, and runtimes changes every few weeks.

This post is a practical snapshot of what’s available as of late 2025, what hardware you need, and how to get started without wasting time on dead ends.


The Model Landscape

The open-source LLM ecosystem has settled into a few major families. Each has different strengths, licensing terms, and hardware requirements.

Models Worth Running Locally

| Model | Parameters | License | Strengths | Best For |
|---|---|---|---|---|
| Llama 3.3 | 70B | Llama 3.3 Community | Strong reasoning, tool calling, instruction following | General-purpose, coding, analysis |
| Qwen 2.5 | 0.5B to 72B | Apache 2.0 | Excellent multilingual, strong coding variants | Code generation, multilingual tasks |
| Qwen 2.5 Coder | 1.5B to 32B | Apache 2.0 | Purpose-built for code, outperforms GPT-4 on benchmarks at 32B | Code writing, refactoring, review |
| Mistral | 7B | Apache 2.0 | Fast inference, good quality-to-size ratio | Quick responses, constrained hardware |
| DeepSeek R1 | 1.5B to 671B | MIT | Chain-of-thought reasoning, distilled variants | Math, logic, step-by-step analysis |
| DeepSeek Coder V2 | 16B to 236B | MIT | Code-specialized with fill-in-the-middle | IDE integration, code completion |
| Phi-3 / Phi-4 | 3.8B to 14B | MIT | Microsoft research models, strong for size | Edge deployment, resource-limited setups |
| Gemma 2 | 2B to 27B | Gemma Terms | Google’s open models, good instruction following | General tasks, research |
| CodeLlama | 7B to 70B | Llama 2 Community | Meta’s code-specialized Llama variant | Code generation, infilling |
| StarCoder2 | 3B to 15B | BigCode OpenRAIL-M | Trained on The Stack v2 (600+ languages) | Multi-language code tasks |

This isn’t exhaustive. New models appear weekly. But these families cover the practical use cases for local inference today.

Where to Get Models

Ollama (ollama.com) is the fastest path from zero to inference. It manages model downloads, quantization selection, and runtime in a single tool. Install it, run ollama pull llama3.3:70b, and you have a running model in minutes.

Hugging Face (huggingface.co) hosts the original model weights and community-contributed quantizations. Search for a model name + “GGUF” to find pre-quantized versions ready for local inference.

LM Studio (lmstudio.ai) provides a GUI for browsing, downloading, and running models. Good for exploration when you want to quickly compare models without terminal commands.


Hardware Requirements

This is where most guides get vague. Here are concrete numbers based on actual testing.

The Core Constraint: VRAM

Language model inference is memory-bound. The model weights need to fit in memory (GPU VRAM or Apple Silicon unified memory) along with the KV cache for context processing. If the model doesn’t fit, inference either fails or falls back to CPU, which is 10-50x slower.

How Much Memory Models Actually Need

The memory requirement depends on parameter count and quantization level. Quantization compresses model weights from 16-bit floats to smaller representations (8-bit, 4-bit) with minimal quality loss.

| Parameters | FP16 (full) | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit) |
|---|---|---|---|---|
| 7B | 14 GB | 7.5 GB | 5.5 GB | 4.5 GB |
| 13B | 26 GB | 14 GB | 10 GB | 8 GB |
| 32B | 64 GB | 34 GB | 24 GB | 18 GB |
| 70B | 140 GB | 75 GB | 52 GB | 42 GB |

These are approximate. Actual memory usage varies by model architecture and context length. Add 1-4 GB for KV cache overhead depending on context window size.
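The table’s numbers come from simple arithmetic: bytes ≈ parameters × bits-per-weight ÷ 8, plus runtime overhead. A rough sketch for estimating whether a model fits your hardware (the 10% overhead factor and the ~4.5 effective bits for Q4 quantizations are assumptions, not measured constants):

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: float,
                             overhead_factor: float = 1.1) -> float:
    """Rough memory estimate for loaded model weights.

    bytes = params * bits / 8. The overhead_factor (assumed ~10%) covers
    embeddings and runtime buffers; KV cache is extra (add 1-4 GB).
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8 * overhead_factor
    return bytes_total / 1e9  # decimal GB


# 7B at FP16: ~15 GB with overhead (table lists 14 GB for weights alone)
print(round(estimate_model_memory_gb(7, 16), 1))

# 70B at ~4.5 effective bits (Q4 quantizations store some layers at
# higher precision): in the ballpark of the table's 42 GB
print(round(estimate_model_memory_gb(70, 4.5), 1))
```

Comparing that estimate against your VRAM (or unified memory) before pulling a multi-gigabyte download saves time.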

Hardware Configurations That Work

Entry Level: 8-16 GB VRAM (RTX 3060/4060, M1/M2 MacBook)

What runs well: 7B models at Q4-Q8 quantization. This covers Mistral 7B, Qwen 2.5 7B, Phi-3, Gemma 2 2B/9B. Response quality is acceptable for code completion, simple Q&A, and basic agent workflows. Expect 15-40 tokens per second depending on the GPU.

What doesn’t work: Anything above 13B parameters. The model either won’t load or swaps to CPU and becomes unusably slow.

Mid Range: 24-48 GB VRAM (RTX 3090/4090, M2/M3 Pro/Max)

What runs well: 13B-32B models at Q4-Q8. This is the sweet spot for local development. Qwen 2.5 32B and DeepSeek Coder V2 16B produce output that competes with cloud APIs for many tasks. Expect 10-25 tokens per second.

What doesn’t work: 70B models will partially fit with aggressive quantization but inference speed drops to 3-5 tokens per second. Usable for batch processing, painful for interactive use.

High End: 64-96 GB Unified Memory (M3 Ultra Mac Studio, Multi-GPU setups)

What runs well: 70B models at Q4-Q5 quantization. Llama 3.3 70B and DeepSeek R1 70B run with acceptable speed (5-15 tokens per second). This is where local inference starts matching cloud API quality for complex tasks.

What doesn’t work: The largest models (DeepSeek R1 671B, Llama 3.1 405B) still need clusters or specialized hardware. These aren’t practical for individual developers.

Apple Silicon vs NVIDIA

Apple Silicon (M1/M2/M3/M4) uses unified memory shared between CPU and GPU. The advantage: your entire RAM pool is available for model weights. An M3 Ultra with 96 GB can load a 70B Q4 model entirely in memory.

NVIDIA GPUs have dedicated VRAM. An RTX 4090 has 24 GB. For models larger than 24 GB, you either need multiple GPUs or accept CPU offloading. The advantage: higher throughput per token for models that fit. CUDA acceleration is faster than Metal for raw inference speed.

For most developers running one or two models for daily use, Apple Silicon with 32 GB+ unified memory offers the best experience. For batch processing or serving multiple concurrent requests, NVIDIA GPUs with high VRAM are more cost-effective per token.


Getting Started in 10 Minutes

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download/windows

Ollama starts a local server on port 11434. It handles model management, quantization selection, and provides an OpenAI-compatible API endpoint.

Step 2: Pull a Model

# Start with a 7B model (fits on most hardware)
ollama pull qwen2.5:7b

# If you have 24GB+ VRAM, try a 32B model
ollama pull qwen2.5:32b

# If you have 48GB+, go for 70B
ollama pull llama3.3:70b

Step 3: Test It

# Interactive chat
ollama run qwen2.5:7b

# API call (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Write a Python function to reverse a linked list"}]
  }'
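The same call works from Python with only the standard library. A minimal sketch, assuming the Ollama server is running on its default port (the `build_chat_payload` and `chat` helpers are illustrative names, not part of any library):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(model: str, prompt: str) -> str:
    """POST to Ollama's OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
# print(chat("qwen2.5:7b", "Write a Python function to reverse a linked list"))
```

Because the endpoint is OpenAI-compatible, official OpenAI client libraries also work if you point their base URL at `http://localhost:11434/v1`.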

Step 4: Run Multiple Models

Multi-agent systems need multiple models running simultaneously. Ollama handles this automatically. Each model loads when called and unloads when not in use (configurable via OLLAMA_KEEP_ALIVE).

# In one terminal
ollama run qwen2.5:7b

# In another terminal (second model loads alongside)
ollama run mistral:7b

Monitor memory usage with ollama ps to see which models are loaded and how much memory they consume.


Practical Comparison: Model Quality vs Size

I’ve tested these models across coding, reasoning, and multi-agent collaboration tasks. Here’s what I’ve observed:

Code Generation Quality

| Model | Simple Functions | Multi-file Refactor | Bug Analysis | Tool Calling |
|---|---|---|---|---|
| Mistral 7B | Acceptable | Weak | Weak | Basic |
| Qwen 2.5 7B | Good | Acceptable | Acceptable | Good |
| Qwen 2.5 Coder 32B | Very Good | Good | Good | Very Good |
| Llama 3.3 70B | Very Good | Very Good | Good | Very Good |
| DeepSeek R1 32B | Good | Acceptable | Very Good (CoT) | Limited |

Multi-Agent Suitability

For multi-agent orchestration (supervisor + worker + reviewer), model size directly affects collaboration quality:

7B models can execute tasks but struggle with multi-step reasoning. They echo system prompts, hallucinate agent names, produce repetitive output, and can’t verify each other’s arithmetic. Useful for prototyping agent workflows but not reliable for complex tasks.

32B models handle delegation, tool calling, and basic review well. They maintain context across multi-turn exchanges and produce meaningfully different output on retries. This is the practical minimum for production multi-agent systems.

70B models excel at decomposition, nuanced review feedback, and maintaining narrative coherence across long pipelines. The supervisor’s task planning is noticeably better at 70B than 32B. But the hardware requirements (48 GB+ VRAM) limit accessibility.
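The supervisor/worker/reviewer pattern above can be sketched in a few lines. This is an illustration, not a framework: the `chat` callable (model name and prompt in, reply text out) is injected, so any backend works, including Ollama’s OpenAI-compatible endpoint or a stub for testing:

```python
from typing import Callable

Chat = Callable[[str, str], str]  # (model, prompt) -> reply text


def review_loop(chat: Chat, task: str, worker_model: str,
                reviewer_model: str, max_rounds: int = 3) -> str:
    """Worker drafts, reviewer critiques; stop when the reviewer approves."""
    draft = chat(worker_model, f"Complete this task:\n{task}")
    for _ in range(max_rounds):
        verdict = chat(
            reviewer_model,
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            "Reply APPROVED if acceptable, otherwise list required fixes.",
        )
        if verdict.strip().upper().startswith("APPROVED"):
            break
        draft = chat(
            worker_model,
            f"Task:\n{task}\n\nPrevious draft:\n{draft}\n\n"
            f"Reviewer feedback:\n{verdict}\n\nRevise the draft.",
        )
    return draft
```

With local models, the worker and reviewer can be different sizes: a 32B worker paired with a 7B reviewer keeps memory use down, though per the observations above the reviewer’s feedback quality drops with size.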


Common Pitfalls

Running models too large for your hardware. If a 32B model works but is painfully slow (under 5 tokens per second), drop to the 7B variant. Speed matters more than marginal quality gains for development workflows.

Ignoring quantization options. The Q4 quantization of a 32B model often outperforms the FP16 version of a 13B model while using less memory. Check the quantization options available for your chosen model before assuming you need more hardware.

Not setting OLLAMA_KEEP_ALIVE. By default, Ollama unloads models after 5 minutes of inactivity. For multi-agent setups where models are called intermittently, set OLLAMA_KEEP_ALIVE=30m to avoid repeated loading delays.
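For a persistent setup, set the variable where the server starts. A minimal sketch (30m matches the recommendation above; any duration string Ollama accepts works):

```shell
# Keep loaded models resident for 30 minutes of inactivity
export OLLAMA_KEEP_ALIVE=30m
ollama serve
```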

Expecting cloud API quality from 7B models. A 7B model is roughly equivalent to GPT-3.5. It will make mistakes that GPT-4 or Claude wouldn’t. Calibrate expectations accordingly and design your system (review loops, guardrails) to compensate.


Where to Go From Here

The open model ecosystem is moving fast, and any snapshot like this one ages quickly: check current model releases and benchmarks before committing to a setup.

The practical takeaway: you can run genuinely useful language models on consumer hardware today. A machine with 16 GB of memory and an Ollama install gets you started. From there, model quality scales with the memory you’re willing to throw at it.


Running LLMs locally went from research experiment to daily development tool in under two years. The models are here, the tooling works, and the hardware requirements have dropped to consumer grade. The remaining question isn’t whether to run locally. It’s which model fits your hardware and your use case.