AI Frontier

Speed Up Local LLMs: How Headroom Cuts Agent Context by 95% for Snappy Tool Responses

MacHTML Lab2026.06.08 ~11 min read
Headroom compress Ollama local LLM agent latency Mac mini

You pointed Ollama, DeepSeek, or Llama at an agent framework, asked it to read a monorepo, and watched the UI freeze for three to eight minutes—not a crash, just prefill drowning in tool JSON. Local models on a Mac mini M4 with 16–32 GB unified memory hit a wall when a single turn ships 40k–80k tokens of npm test output, SQL dumps, or ripgrep hits. Cloud APIs bill you for that bloat; local GPUs burn wall-clock on attention over every byte.

Headroom compresses tool outputs, logs, and RAG chunks before they enter the model—published workloads show 92% shrink on SRE traces and 73–92% on code search. This Type A guide wires Headroom in front of Ollama’s OpenAI-compatible API so your agent keeps the same tools but feeds the model a 5k-token view instead of a 50k-token haystack. For cloud-gateway cost control, see our OpenClaw + Headroom proxy article; for zero-API local trading stacks, TradingAgents on Ollama.

Disclosure: MacHTML offers optional cloud Mac mini rental; this runbook works on any local macOS or Linux host with Ollama installed.

Why local agents “hang” on big tool output

Local LLM latency is roughly linear in context length for prefill on Apple Silicon—there is no magic “unlimited context” at 30 tok/s when you dump an entire CI log. Symptoms operators mislabel as freeze:

SymptomLikely causeHeadroom lever
Spinner after read_file on 2 MB log50k+ tokens in one tool messageSmartCrusher / LogCompressor
First turn fast, turn 5 stallsContext snowball across toolsCCR + IntelligentContext drops
RAM pressure, swap on 16 GB Mac miniKV cache + huge promptShrink prefill tokens 60–95%
“It worked on GPT-4o”Cloud model wider/faster siliconLocal needs compression + smaller effective context

Quotable: On Mac mini class hardware, routing agent traffic through Headroom on 127.0.0.1:8787 toward Ollama at 11434 can cut tool-heavy prompts by 60–95% and restore sub-minute tool turns that raw logs made impossible.

External: Headroom GitHub, proxy docs, Ollama API.

Compression layer architecture

┌──────────────┐  tool JSON/logs   ┌─────────────────────┐  compressed   ┌─────────────┐
│ Agent        │ ────────────────► │ Headroom proxy      │ ─────────────►│ Ollama      │
│ (OpenClaw,   │  OPENAI_BASE_URL  │ :8787               │  chat API     │ :11434/v1   │
│  LangGraph,  │  = localhost:8787 │ SmartCrusher + CCR  │               │ llama3/deep │
│  Aider…)     │                   │                     │               │ seek-r1     │
└──────────────┘                   └─────────────────────┘               └─────────────┘

Headroom’s proxy speaks OpenAI-compatible /v1/chat/completions—the same surface Ollama exposes. You aim the agent client at Headroom; Headroom compresses, then forwards to Ollama’s upstream URL.

Step-by-step runbook (Ollama + Headroom)

1. Baseline Ollama on the host

ollama --version
ollama pull llama3.2:3b          # or deepseek-r1:7b, qwen2.5-coder, etc.
curl -s http://127.0.0.1:11434/api/tags | python3 -m json.tool

Pick a model that fits RAM: 3B–8B on 16 GB Mac mini for agent loops; 14B+ needs 32 GB+ for comfortable KV headroom.

2. Install Headroom proxy stack

python3 --version   # 3.10+
pip install "headroom-ai[proxy]"
headroom --version

3. Start Headroom pointing at Ollama

export OPENAI_API_KEY=ollama-local   # Ollama ignores key; some clients require a string
export OPENAI_TARGET_API_URL=http://127.0.0.1:11434/v1
headroom proxy --host 127.0.0.1 --port 8787 \
  --log-file ~/.headroom/ollama-agent.jsonl

Verify:

curl -s http://127.0.0.1:8787/health | python3 -m json.tool

4. Point your agent SDK at Headroom, not Ollama directly

export OPENAI_BASE_URL=http://127.0.0.1:8787/v1
export OPENAI_API_KEY=ollama-local

OpenClaw with local model — set in ~/.openclaw/.env alongside any cloud keys; pair with OpenClaw + Ollama webhooks for ingress patterns.

5. Library mode for custom Python agents

from headroom import compress
import ollama

messages = [...]  # includes fat tool role
result = compress(messages, model="llama3.2")
response = ollama.chat(model="llama3.2", messages=result.messages)
print(response["message"]["content"])

Library mode avoids proxy port conflicts when you already orchestrate HTTP yourself.

6. Cap tool payload before compression (belt + suspenders)

rg -n "ERROR" logs/ | head -n 200

instead of cat logs/build.log. Headroom recovers sloppy skills—but 200-line caps keep worst-case latency bounded on 16 GB hosts.

7. Measure savings and latency

curl -s http://127.0.0.1:8787/stats | python3 -m json.tool

Log tokens_before, tokens_after, and wall time per turn. Target ≥50% token reduction on first repo audit; 60–90% on JSON/search tool dumps per Headroom benchmarks.

8. Optional MCP for Claude Code + local Ollama sidecar

headroom mcp install

Use MCP headroom_stats while a local model runs in another terminal—useful when mixing Hermes memory compression on cloud and Headroom on local.

9. Persist proxy under launchd (always-on Mac mini)

Create ~/Library/LaunchAgents/ai.headroom.ollama.plist with OPENAI_TARGET_API_URL=http://127.0.0.1:11434/v1, load before your agent gateway. Keep Ollama’s own LaunchAgent separate—start order: Ollama → Headroom → agent.

10. Regression check after model swap

When you ollama pull a new tag, re-run one fat tool fixture. Quantization changes do not remove the need for compression—context length still dominates prefill.

Benchmark pattern on Mac mini

FixtureRaw tokens (typical)After Headroom (typical)Wall-clock goal
npm test stderr (failed suite)25k–45k3k–8k<45 s prefill on M4 16 GB
find . -name '*.ts' listing15k–30k2k–5k<30 s
Single 500 KB JSON API dump40k–70k4k–10k<60 s

Run each fixture with proxy bypass once (x-headroom-bypass: true header) to document the freeze you are eliminating—store numbers in your team wiki.

Troubleshooting

Agent still hits 11434 directly

Pattern: /stats on Headroom stays at zero requests.

Fix: Grep your process env (ps eww -p $(pgrep -f your-agent)). OPENAI_BASE_URL must be http://127.0.0.1:8787/v1 for the agent process, not only your shell profile.

Ollama 404 model not found through proxy

Pattern: Headroom logs upstream 404.

Fix: Model string in chat request must match ollama list. Pull locally first; Headroom does not manage model files.

Compression removed needed stack line

Pattern: Agent misses line number in audit.

Fix: Enable CCR retrieval in skill instructions (“call headroom_retrieve if stack trace incomplete”) or one bypass turn. Do not disable compression globally.

Mac mini swaps to death

Pattern: memory_pressure critical during agent run.

Fix: Smaller quant (:q4_0), fewer parallel tools, plus Headroom. On 16 GB, avoid 14B models + 80k raw context simultaneously.

FAQ

Does Headroom work only with OpenAI cloud?

No. The proxy compresses then forwards to any OpenAI-compatible upstream—Ollama, local vLLM, or cloud. Set OPENAI_TARGET_API_URL to your local server.

Is 95% compression realistic?

Headroom publishes up to 92% on some agent workloads; 60–80% is common on mixed repos. Measure /stats—do not assume marketing max on every turn.

Will this fix GPU overheating?

It reduces prefill work, which lowers heat versus raw huge prompts—but sustained tool loops still load CPU/GPU. Watch powermetrics on Mac mini.

Headroom vs Hermes trajectory_compressor?

Hermes shrinks long chat memory inside Hermes sessions; Headroom shrinks tool/RAG payloads at HTTP or library layer. Use both on hybrid cloud/local setups.

Do I need Headroom if I use cloud Anthropic?

Optional for cost; for local models it is often the difference between usable and frozen. See OpenClaw Headroom for cloud when you mix providers.

Stage Ollama + Headroom on a cloud Mac mini

Run proxy and Ollama on Apple Silicon with SSH/VNC—validate ports, LaunchAgent order, and fat-tool benchmarks before production agent loops.

Headroom + Ollama
Local LLM latency runbook — no pricing here