You pointed Ollama, DeepSeek, or Llama at an agent framework, asked it to read a monorepo, and watched the UI freeze for three to eight minutes—not a crash, just prefill drowning in tool JSON. Local models on a Mac mini M4 with 16–32 GB unified memory hit a wall when a single turn ships 40k–80k tokens of npm test output, SQL dumps, or ripgrep hits. Cloud APIs bill you for that bloat; local GPUs burn wall-clock on attention over every byte.
Headroom compresses tool outputs, logs, and RAG chunks before they enter the model—published workloads show 92% shrink on SRE traces and 73–92% on code search. This Type A guide wires Headroom in front of Ollama’s OpenAI-compatible API so your agent keeps the same tools but feeds the model a 5k-token view instead of a 50k-token haystack. For cloud-gateway cost control, see our OpenClaw + Headroom proxy article; for zero-API local trading stacks, TradingAgents on Ollama.
Disclosure: MacHTML offers optional cloud Mac mini rental; this runbook works on any local macOS or Linux host with Ollama installed.
Why local agents “hang” on big tool output
Local LLM latency is roughly linear in context length for prefill on Apple Silicon—there is no magic “unlimited context” at 30 tok/s when you dump an entire CI log. Symptoms operators mislabel as freeze:
| Symptom | Likely cause | Headroom lever |
|---|---|---|
Spinner after read_file on 2 MB log | 50k+ tokens in one tool message | SmartCrusher / LogCompressor |
| First turn fast, turn 5 stalls | Context snowball across tools | CCR + IntelligentContext drops |
| RAM pressure, swap on 16 GB Mac mini | KV cache + huge prompt | Shrink prefill tokens 60–95% |
| “It worked on GPT-4o” | Cloud model wider/faster silicon | Local needs compression + smaller effective context |
Quotable: On Mac mini class hardware, routing agent traffic through Headroom on 127.0.0.1:8787 toward Ollama at 11434 can cut tool-heavy prompts by 60–95% and restore sub-minute tool turns that raw logs made impossible.
External: Headroom GitHub, proxy docs, Ollama API.
Compression layer architecture
┌──────────────┐ tool JSON/logs ┌─────────────────────┐ compressed ┌─────────────┐
│ Agent │ ────────────────► │ Headroom proxy │ ─────────────►│ Ollama │
│ (OpenClaw, │ OPENAI_BASE_URL │ :8787 │ chat API │ :11434/v1 │
│ LangGraph, │ = localhost:8787 │ SmartCrusher + CCR │ │ llama3/deep │
│ Aider…) │ │ │ │ seek-r1 │
└──────────────┘ └─────────────────────┘ └─────────────┘
Headroom’s proxy speaks OpenAI-compatible /v1/chat/completions—the same surface Ollama exposes. You aim the agent client at Headroom; Headroom compresses, then forwards to Ollama’s upstream URL.
Step-by-step runbook (Ollama + Headroom)
1. Baseline Ollama on the host
ollama --version
ollama pull llama3.2:3b # or deepseek-r1:7b, qwen2.5-coder, etc.
curl -s http://127.0.0.1:11434/api/tags | python3 -m json.tool
Pick a model that fits RAM: 3B–8B on 16 GB Mac mini for agent loops; 14B+ needs 32 GB+ for comfortable KV headroom.
2. Install Headroom proxy stack
python3 --version # 3.10+
pip install "headroom-ai[proxy]"
headroom --version
3. Start Headroom pointing at Ollama
export OPENAI_API_KEY=ollama-local # Ollama ignores key; some clients require a string
export OPENAI_TARGET_API_URL=http://127.0.0.1:11434/v1
headroom proxy --host 127.0.0.1 --port 8787 \
--log-file ~/.headroom/ollama-agent.jsonl
Verify:
curl -s http://127.0.0.1:8787/health | python3 -m json.tool
4. Point your agent SDK at Headroom, not Ollama directly
export OPENAI_BASE_URL=http://127.0.0.1:8787/v1
export OPENAI_API_KEY=ollama-local
OpenClaw with local model — set in ~/.openclaw/.env alongside any cloud keys; pair with OpenClaw + Ollama webhooks for ingress patterns.
5. Library mode for custom Python agents
from headroom import compress
import ollama
messages = [...] # includes fat tool role
result = compress(messages, model="llama3.2")
response = ollama.chat(model="llama3.2", messages=result.messages)
print(response["message"]["content"])
Library mode avoids proxy port conflicts when you already orchestrate HTTP yourself.
6. Cap tool payload before compression (belt + suspenders)
rg -n "ERROR" logs/ | head -n 200
instead of cat logs/build.log. Headroom recovers sloppy skills—but 200-line caps keep worst-case latency bounded on 16 GB hosts.
7. Measure savings and latency
curl -s http://127.0.0.1:8787/stats | python3 -m json.tool
Log tokens_before, tokens_after, and wall time per turn. Target ≥50% token reduction on first repo audit; 60–90% on JSON/search tool dumps per Headroom benchmarks.
8. Optional MCP for Claude Code + local Ollama sidecar
headroom mcp install
Use MCP headroom_stats while a local model runs in another terminal—useful when mixing Hermes memory compression on cloud and Headroom on local.
9. Persist proxy under launchd (always-on Mac mini)
Create ~/Library/LaunchAgents/ai.headroom.ollama.plist with OPENAI_TARGET_API_URL=http://127.0.0.1:11434/v1, load before your agent gateway. Keep Ollama’s own LaunchAgent separate—start order: Ollama → Headroom → agent.
10. Regression check after model swap
When you ollama pull a new tag, re-run one fat tool fixture. Quantization changes do not remove the need for compression—context length still dominates prefill.
Benchmark pattern on Mac mini
| Fixture | Raw tokens (typical) | After Headroom (typical) | Wall-clock goal |
|---|---|---|---|
npm test stderr (failed suite) | 25k–45k | 3k–8k | <45 s prefill on M4 16 GB |
find . -name '*.ts' listing | 15k–30k | 2k–5k | <30 s |
| Single 500 KB JSON API dump | 40k–70k | 4k–10k | <60 s |
Run each fixture with proxy bypass once (x-headroom-bypass: true header) to document the freeze you are eliminating—store numbers in your team wiki.
Troubleshooting
Agent still hits 11434 directly
Pattern: /stats on Headroom stays at zero requests.
Fix: Grep your process env (ps eww -p $(pgrep -f your-agent)). OPENAI_BASE_URL must be http://127.0.0.1:8787/v1 for the agent process, not only your shell profile.
Ollama 404 model not found through proxy
Pattern: Headroom logs upstream 404.
Fix: Model string in chat request must match ollama list. Pull locally first; Headroom does not manage model files.
Compression removed needed stack line
Pattern: Agent misses line number in audit.
Fix: Enable CCR retrieval in skill instructions (“call headroom_retrieve if stack trace incomplete”) or one bypass turn. Do not disable compression globally.
Mac mini swaps to death
Pattern: memory_pressure critical during agent run.
Fix: Smaller quant (:q4_0), fewer parallel tools, plus Headroom. On 16 GB, avoid 14B models + 80k raw context simultaneously.
FAQ
Does Headroom work only with OpenAI cloud?
No. The proxy compresses then forwards to any OpenAI-compatible upstream—Ollama, local vLLM, or cloud. Set OPENAI_TARGET_API_URL to your local server.
Is 95% compression realistic?
Headroom publishes up to 92% on some agent workloads; 60–80% is common on mixed repos. Measure /stats—do not assume marketing max on every turn.
Will this fix GPU overheating?
It reduces prefill work, which lowers heat versus raw huge prompts—but sustained tool loops still load CPU/GPU. Watch powermetrics on Mac mini.
Headroom vs Hermes trajectory_compressor?
Hermes shrinks long chat memory inside Hermes sessions; Headroom shrinks tool/RAG payloads at HTTP or library layer. Use both on hybrid cloud/local setups.
Do I need Headroom if I use cloud Anthropic?
Optional for cost; for local models it is often the difference between usable and frozen. See OpenClaw Headroom for cloud when you mix providers.
Stage Ollama + Headroom on a cloud Mac mini
Run proxy and Ollama on Apple Silicon with SSH/VNC—validate ports, LaunchAgent order, and fat-tool benchmarks before production agent loops.