The Experiment Nobody Runs

Every multi-agent AI demo you see online uses cloud APIs. GPT-4 as the supervisor, GPT-4 as the worker, GPT-4 as the reviewer. Unlimited compute, unlimited memory, unlimited budget. The results look impressive because the infrastructure is invisible.

I wanted to know what happens when you strip that away. What does multi-agent orchestration actually look like on hardware you own, with models you host, paying zero per token?

So I ran the same collaborative task on two machines:

  • Mac Studio M3 Ultra with 96GB unified memory, running 4 local LLMs simultaneously
  • Windows PC with 32GB RAM and an NVIDIA GPU, running 2-3 local LLMs

Same task. Same model families. Same orchestration code. The results taught me more about multi-agent systems than any paper I’ve read.


The Hardware

Mac Studio Setup

Component   Specification
Chip        Apple M3 Ultra
Cores       28 CPU cores
Memory      96GB unified (shared CPU/GPU)
Runtime     Ollama on Metal

The M3 Ultra’s unified memory architecture means the entire 96GB pool is available for model weights. No separate VRAM constraint. In theory, this should comfortably hold multiple large models.

Windows PC Setup

Component   Specification
GPU         NVIDIA (consumer tier)
RAM         32GB system + GPU VRAM
Runtime     Ollama on CUDA

A standard developer workstation. Not a server, not a cloud instance. The kind of machine most developers actually have.


What I Ran

The task was deliberately non-technical to remove any coding advantage: agents should ask each other questions, answer them, and have a back-and-forth dialogue for multiple rounds. This forces real inter-agent communication and tests the orchestration layer, not the models’ coding ability.

I tested several configurations:

  • Mac 2-LLM (Mac Studio): supervisor + reviewer; models llama3.3:70b, qwen2.5:32b; task: Q&A collaboration, 5 rounds
  • Mac 4-LLM (Mac Studio): supervisor + 2 workers + reviewer; models llama3.3:70b, qwen2.5:32b, qwen2.5-coder:32b, deepseek-r1:32b; task: same Q&A collaboration, 5 rounds
  • Win 2-LLM (Windows): supervisor + reviewer; models qwen2.5:7b, mistral:7b; task: Q&A collaboration, 3 rounds
  • Win 3-LLM story (Windows): supervisor + 2 reviewers; models qwen2.5:7b, mistral:7b, qwen2.5:7b; task: collaborative 10-line story
  • Win 3-LLM math (Windows): supervisor + worker + reviewer; models qwen2.5:7b, mistral:7b, qwen2.5:7b; task: math word problem

Mac Studio: 4 Models, 0.6GB Free RAM

The Mac’s pre-flight hardware check told the story before any agent started working:

╭─ pre-flight check ───────────────────────────────────╮
│  System RAM: 96GB total, 0.6GB available             │
│                                                      │
│  ■ supervisor  llama3.3:70b  ~42GB                   │
│  ■ writer      qwen2.5:32b   ~4.5GB                  │
│  ■ coder       qwen2.5-coder:32b  ~18GB              │
│  ■ critic      deepseek-r1:32b    ~18GB              │
│                                                      │
│  Total estimated: ~82.5GB                            │
│  ⚠ Only 0.6GB available                              │
╰──────────────────────────────────────────────────────╯

82.5GB of models in 96GB of memory. That leaves 13.5GB for the OS, Ollama runtime, KV caches, and the orchestration process itself. macOS alone uses 8-10GB. The system was running at the edge.

What Happened: The 4-LLM Run

The orchestration submitted a collaborative Q&A task. The supervisor (llama3.3:70b) decomposed it into 3 steps and assigned all of them to the writer (qwen2.5:32b). It never assigned work to the coder (qwen2.5-coder:32b).

The coder agent consumed approximately 18GB of memory and did exactly zero work across the entire run.

The writer produced responses like this:

[writer] To proceed with this task, I need the specific details of the
supervisor's question and the coworkers' answers to identify the issue.
Since these are not provided in your message, I cannot pinpoint the
exact problem or devise a solution without additional information.

The writer asked for more information instead of executing the task. Three times. The structured debate review (proposer, critic, judge using three different models) approved this empty output twice before finally requesting changes on the third round.

Duration: 14 minutes. Output: A hollow “I need more information” response. Memory wasted: 18GB on an idle agent.

What Happened: The 2-LLM Run

Same Mac, same task, but only 2 models: llama3.3:70b (supervisor) and deepseek-r1:32b (reviewer). Total model memory: approximately 60GB, leaving 36GB of headroom.

The supervisor was forced to execute the task itself (no workers). It hallucinated an entire Java debugging scenario, complete with a NullPointerException in a FileUploader class. The user had asked for a quiz. The agent produced Java code.

The reviewer correctly rejected this three times, providing increasingly specific feedback. But the supervisor kept reverting to bug-fix behavior because of a framing issue in the prompt (which I later fixed).

Duration: 8 minutes. Outcome: Total failure after max retries.


Windows PC: Smaller Models, Better Results

The Windows runs used 7B models, roughly 10x smaller than the Mac’s 70B supervisor. I expected worse results. I was wrong about what mattered.

The 2-LLM Collaboration Run

Supervisor (qwen2.5:7b) and reviewer (mistral:7b). Total memory: approximately 9GB. The system had abundant headroom.

● tool lead · ask_coworker [19:02:40]
  calling ask_coworker
  Agent asking reviewer: "How are you doing today?"
└ result [19:02:47]
  [lead] ask_coworker done

● tool lead · ask_coworker [19:02:56]
  calling ask_coworker
  Agent asking reviewer: "What projects are you currently working on?"
└ result [19:03:00]
  [lead] ask_coworker done

Six successful ask_coworker calls. Zero failures. The agents had an actual back-and-forth conversation. The reviewer approved with substantive feedback.

Duration: 3.5 minutes. Outcome: Passed. The agents communicated.

The 3-LLM Story Run (After Fixes)

Three agents building a collaborative 10-line story. The supervisor planned 10 steps, called ask_coworker 18 times across the full pipeline, and the reviewer rejected 3 steps that deviated from the story format. Those rejections were correct and the agent self-corrected.

[review phase=changes_requested] reviewer verdict=CHANGES_REQUESTED
REASON: The output did not follow the requested format. It provides
suggestions for improving the process instead of contributing a
sentence for the story.
SUGGESTIONS: Provide a single sentence to start the story.

The reviewer pushed the agent from “let me suggest a process” to “here is an actual story sentence.” That self-correction loop is precisely why multi-agent systems are valuable.

Duration: 13 minutes. Outcome: Completed. Multi-paragraph story produced through genuine agent collaboration.

The 3-LLM Math Run

This was the most revealing run. Three agents (supervisor, worker, reviewer) tackling a math word problem. The worker (mistral:7b) submitted four attempts:

  1. Hardcoded father=40, son=65 in Python. Rejected: "hypothetical input, not solving the problem."
  2. Hardcoded father=60, son=35. Rejected: "should use algebraic equations."
  3. Attempted algebra, hit syntax errors. Rejected; the reviewer supplied the exact equations F-5=3(S-5) and F+5=2(S+5).
  4. Proper algebraic solution. Approved.

The review loop transformed a guessing worker into an equation solver across four iterations. The reviewer’s feedback quality improved with each rejection, from generic (“use algebra”) to specific (providing the actual equations). I didn’t program this progressive tutoring behavior. It emerged from the retry loop.

The answer was still wrong (the 7B model made arithmetic errors the reviewer couldn’t catch), but the approach evolution was remarkable.
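For reference, the reviewer's equations do pin down a unique answer. Here is a minimal verification sketch of my own (not the worker's generated code), assuming the reviewer's equations correctly model the word problem:

```python
# Solving the reviewer's equations F-5 = 3(S-5) and F+5 = 2(S+5) by elimination.
# This is an illustrative check, not the worker's actual generated code.
def solve_ages():
    # Rearranged: F - 3S = -10  and  F - 2S = 5.
    # Subtracting the first from the second eliminates F: S = 5 - (-10) = 15.
    son = 5 - (-10)           # S = 15
    father = 2 * son + 5      # from F - 2S = 5  =>  F = 35
    assert father - 5 == 3 * (son - 5)  # both sides equal 30
    assert father + 5 == 2 * (son + 5)  # both sides equal 40
    return father, son

print(solve_ages())  # (35, 15): father is 35, son is 15
```

Two lines of elimination, which is exactly the kind of check a stronger reviewer model would have run before approving the worker's arithmetic.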


The Comparison That Matters

Dimension                   Mac 4-LLM (70B+32B models)      Windows 3-LLM (7B models)
Total model memory          82.5GB                          ~14GB
Free RAM                    0.6GB                           18GB+
Agent utilization           2 of 4 agents did work (50%)    3 of 3 agents did work (100%)
ask_coworker success rate   80% (2+ timeouts)               100% (0 failures)
Task completed?             No (both runs failed)           Yes (all runs completed)
Duration                    8-14 minutes to failure         3.5-13 minutes to completion
Best output                 "I need more information"       Multi-paragraph story + algebraic solution

The 7B models on a Windows PC with headroom outperformed 32B-70B models on a Mac Studio under memory pressure. Not because smaller models are better, but because memory pressure degrades everything: inference speed drops, inter-agent communication times out, the OS starts swapping, and model quality deteriorates unpredictably.


Five Things I Learned About Hardware and Multi-Agent Systems

1. Memory Headroom Matters More Than Model Size

Running a 70B model with 0.6GB free RAM produces worse results than running a 7B model with 18GB free. The model needs memory for KV cache (grows with context length), the orchestration process needs memory for state management, and the OS needs memory to not swap. When memory is exhausted, everything degrades.

My rule of thumb after these experiments: reserve at least 20% of total memory as headroom. On a 96GB machine, that means 76GB for models maximum, not 82.5GB.
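The rule is easy to encode as a pre-flight budget check. A minimal sketch, where the function names and per-model size estimates are mine (taken from the pre-flight table above), not the orchestrator's actual code:

```python
# The 20% headroom rule as a tiny budget check. Names and size estimates
# are illustrative; they mirror the pre-flight figures in the article.
def model_budget_gb(total_gb, headroom_fraction=0.2):
    """Max GB to spend on model weights, reserving headroom for OS and KV cache."""
    return total_gb * (1 - headroom_fraction)

def fits_in_budget(total_gb, model_sizes_gb, headroom_fraction=0.2):
    """True if the planned models fit under the headroom-adjusted budget."""
    return sum(model_sizes_gb) <= model_budget_gb(total_gb, headroom_fraction)

# The failed Mac 4-LLM lineup: 42 + 4.5 + 18 + 18 = 82.5GB against a ~76.8GB budget.
print(fits_in_budget(96, [42, 4.5, 18, 18]))  # False
# A 2-model lineup (70B supervisor + 32B reviewer) fits: 60GB under budget.
print(fits_in_budget(96, [42, 18]))           # True
```

Running this before loading anything would have flagged the 4-LLM configuration as over budget before the first token was generated.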

2. Idle Agents Waste Real Resources

The 4-LLM Mac run loaded a qwen2.5-coder:32b model (18GB) that did zero work. The supervisor never assigned it a task. That 18GB of wasted memory could have given the active models more KV cache space, more context window, and better inference stability.

Before spawning an agent, ask: will the supervisor actually assign work to this model? If you have a supervisor and a worker, adding a second specialized worker only helps if the task decomposition will create subtasks matching that specialization.

The practical pattern: start with 2 agents (supervisor + worker) and add more only when you observe the supervisor trying to assign work to agents that don’t exist.

3. Apple Silicon Unified Memory Is a Double-Edged Sword

Unified memory means your entire RAM pool is available for model weights. That is genuinely useful. A 96GB M3 Ultra can load a 70B Q4 model entirely in GPU-accessible memory, something impossible on a 24GB NVIDIA card without multi-GPU setups.

But unified memory also means the OS, the runtime, and the models compete for the same pool. There is no protected VRAM partition. When Ollama loads 82.5GB of models, macOS has to squeeze into whatever remains. On an NVIDIA system, models live in dedicated VRAM and the OS uses separate system RAM. The separation is actually protective.

For multi-agent systems specifically, Apple Silicon is better for running one large model (70B solo). NVIDIA is better for running multiple smaller models concurrently (several 7B-13B agents), because system memory and VRAM are independent pools.

4. Inter-Agent Communication Fails Under Pressure

The ask_coworker tool sends a prompt to another agent’s LLM via a direct API call. When that LLM is under memory pressure, inference slows from seconds to tens of seconds. I observed 30-second timeouts on the Mac runs that never occurred on Windows.

The fix was adding explicit timeouts (60 seconds initial, 120 seconds retry) and one automatic retry with backoff. But the better fix was reducing memory pressure so the timeouts didn’t trigger in the first place.
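The timeout-and-retry pattern is small enough to sketch. Here `send` is a stand-in for whatever transport actually reaches the coworker's LLM; the defaults mirror the numbers above, but this is an illustrative sketch rather than the orchestrator's real code:

```python
import time

def ask_coworker(send, prompt, timeouts=(60, 120), backoff_s=5):
    """Call send(prompt, timeout=...), retrying once with a longer timeout.

    `send` is a hypothetical stand-in for the real transport to the
    coworker's LLM (e.g. an HTTP call to its inference endpoint).
    """
    last_err = None
    for attempt, timeout in enumerate(timeouts):
        try:
            return send(prompt, timeout=timeout)
        except TimeoutError as err:
            last_err = err
            if attempt < len(timeouts) - 1:
                time.sleep(backoff_s)  # brief backoff before the single retry
    raise RuntimeError(f"ask_coworker gave up after {len(timeouts)} attempts") from last_err
```

Injecting the transport also makes the retry logic trivially testable with a fake that times out once, which is how I would verify the backoff path without a loaded model.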

5. The Review Loop Is the Real Value, Not More Agents

Across all runs, the most valuable pattern was the supervisor-worker-reviewer triangle. The worker produces output. The reviewer evaluates it. If rejected, the worker retries with the reviewer’s feedback. This loop produced genuine improvement across all platforms and model sizes.

Adding a fourth or fifth agent didn’t improve this loop. It just consumed memory that the core three agents needed. The math run showed 4 retries producing progressively better output from a 7B model. That’s the multi-agent value proposition, and it works with the minimum viable team of three.
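The triangle reduces to a small loop. A minimal sketch, with `worker` and `reviewer` as stand-in callables rather than the actual agents:

```python
def review_loop(worker, reviewer, task, max_retries=4):
    """Worker drafts; reviewer approves or returns feedback; rejections retry.

    `worker(task, feedback)` returns an output; `reviewer(task, output)` returns
    a (verdict, feedback) pair. Both are illustrative stand-ins for real agents.
    """
    feedback = None
    for attempt in range(1, max_retries + 1):
        output = worker(task, feedback)       # feedback is None on the first try
        verdict, feedback = reviewer(task, output)
        if verdict == "APPROVED":
            return output, attempt
    raise RuntimeError(f"still rejected after {max_retries} attempts: {feedback}")
```

The math run is exactly this loop with four retries: each rejection's feedback becomes part of the worker's next attempt, which is where the progressive-tutoring behavior came from.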


Hardware Recommendations

Based on these experiments, here is what I would actually recommend for local multi-agent orchestration:

If you have 32GB total (most developer machines): Run 2-3 agents with 7B models. This is the sweet spot for prototyping and development. All agents stay responsive, ask_coworker calls complete in seconds, and the review loop works.

If you have 64GB (M2/M3 Max or a workstation): Run 2-3 agents with one 32B model (supervisor) and 7B models for workers and reviewers. The larger supervisor produces better task decomposition. Keep total model memory under 48GB.

If you have 96GB (M3 Ultra Mac Studio): Run 2-3 agents with a 70B supervisor and 7B-13B workers. Do not load 4 large models. The 70B model alone uses 42GB. Leave the remaining 54GB for one worker, one reviewer, the OS, and headroom. Resist the temptation to fill all 96GB with models.

The pattern in all cases: 2-3 agents, one strong model for the supervisor, smaller models for workers and reviewers, and always maintain 20%+ memory headroom.


What I Would Do Differently

If I were starting these experiments over, I would change three things:

First, I would start with 2 agents on every platform before adding more. The 4-LLM Mac run was an avoidable failure. I assumed more agents meant better collaboration. The opposite was true.

Second, I would monitor memory continuously during runs. I learned about the 0.6GB free RAM situation from the pre-flight check, ignored the warning, and ran the experiment anyway. The warning was accurate. Trust the pre-flight check.

Third, I would test the same task with the same number of agents across both platforms before varying anything. Changing hardware, model count, and model size simultaneously made it harder to isolate which variable caused each failure. Controlled experiments need controlled variables.


The hardware doesn’t make the multi-agent system. The orchestration patterns do. A 96GB Mac Studio running 4 models at the edge of its memory produced worse results than a Windows PC running 3 models with room to breathe. The lesson is counterintuitive but consistent: fewer agents with adequate resources outperform more agents under resource pressure. Every time.