OpenClaw gateways feel magical until an agent discovers it can chain fifteen shell probes, re-read a multi-megabyte log on every turn, and stream a 4K-token “thinking” preamble. Finance then asks why a single weekend burned 3× the expected inference budget. This runbook is for teams running OpenClaw on a 24/7 macOS host—often a shared Mac mini—who need concrete throttles, not slogans. Pair it with doctor and gateway diagnostics when errors hide inside channel retries.
You will get a comparison matrix for policy styles, numeric starting points (tokens, concurrency, backoff ceilings), operational steps that survive LaunchAgent restarts, and a FAQ aimed at platform owners—not demo hackers.
Signals that you are under-throttled
Latency climbs linearly while CPU on the gateway host looks idle—that usually means the model provider is queueing you behind rate limits, not that Swift parsers suddenly got slower. Another tell is disk write spikes every few seconds when an agent re-serializes the same workspace tree because tool outputs were not memoized. Finally, watch channel-level duplicates: if users see two identical “working on it” replies within 400 ms, your dedupe layer is missing and retries multiply token usage silently.
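One way to close the duplicate-reply gap is a small suppression window in front of the channel sender. The sketch below is illustrative only: `ReplyDeduper`, its method name, and the choice to key on conversation ID plus reply text are assumptions, not part of OpenClaw, and the 0.4-second default simply mirrors the 400 ms signal described above.

```python
import time
from typing import Callable, Dict, Tuple


class ReplyDeduper:
    """Suppresses identical replies to the same conversation inside a short window.

    Hypothetical helper: the key (conversation_id, text) is an assumption.
    """

    def __init__(self, window_s: float = 0.4,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_send(self, conversation_id: str, text: str) -> bool:
        now = self.clock()
        key = (conversation_id, text)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # same reply seen inside the window: drop it
        self._last_sent[key] = now
        return True
```

The dictionary grows without bound in this form, so a long-running gateway would also need to evict old keys.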
Finance-friendly metrics to export: tokens per successful task, tokens per failed task, tool invocations per conversation, and wall-clock duration per resolved ticket. Without those four series you cannot prove whether a model upgrade helped or whether throttles regressed.
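The four series can be derived from one record per finished task. This is a rough sketch under assumptions: the `TaskRecord` shape is hypothetical, and it treats each record as one conversation and one ticket, which may not match how your gateway groups work.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class TaskRecord:
    # Hypothetical record shape; adapt field names to your own event schema.
    succeeded: bool
    tokens: int
    tool_invocations: int
    wall_clock_s: float


def summarize(records: List[TaskRecord]) -> Dict[str, float]:
    """Returns the four finance-facing averages for a batch of task records."""
    ok = [r for r in records if r.succeeded]
    failed = [r for r in records if not r.succeeded]
    return {
        "tokens_per_successful_task": mean(r.tokens for r in ok) if ok else 0.0,
        "tokens_per_failed_task": mean(r.tokens for r in failed) if failed else 0.0,
        "tool_invocations_per_conversation":
            mean(r.tool_invocations for r in records) if records else 0.0,
        "wall_clock_s_per_resolved_ticket":
            mean(r.wall_clock_s for r in ok) if ok else 0.0,
    }
```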
When incidents strike, freeze feature work: first snapshot ~/.openclaw (redact secrets), then roll back the last policy change. Teams that skip snapshots spend days guessing whether the regression was model routing or a tool sandbox change.
Document “break-glass” credential rotation after incidents: throttles often fail open when auth errors cause clients to retry aggressively.
Support engineers should have a one-page cheat sheet listing which throttle values were live at each release tag; Git history alone is too noisy during a pager storm.
Policy matrix: hard caps vs adaptive queues
| Approach | Best for | Risk | Ops load |
|---|---|---|---|
| Hard max output tokens | Public-facing bots | Answers truncate mid-thought | Low |
| Per-tool latency budget | Filesystem crawlers | Legitimate deep searches fail | Medium |
| Adaptive queue depth | Internal teams with SLOs | Complexity in tuning | High |
| Conversation step caps | Research agents | Users must manually resume | Low |
Most production teams combine hard max output tokens with conversation step caps because they are explainable to finance and easy to audit. Adaptive queues belong after you have six weeks of baseline metrics.
Starting numbers that survive audits
These are conservative defaults for a single-tenant gateway on a Mac mini with 16 GB RAM serving fewer than twenty concurrent operators:
- Max output tokens: 900–1,200 for routine tasks; 1,800 only for code synthesis routes behind a feature flag.
- Parallel tool calls: 1 for shell, 2 for read-only file stats, 0 for network unless explicitly allow-listed.
- Backoff: start at 2 seconds, multiply by 1.8 each retry, cap at 45 seconds, maximum 5 attempts before surfacing a human-readable error.
- Wall-clock per conversation: hard stop at 12 minutes of model time unless an operator types “continue”; this prevents infinite “let me check again” loops.
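To make the backoff numbers above concrete, the sketch below computes the wait schedule. It assumes the first attempt is immediate and that a delay applies only between attempts, so five attempts produce four waits; the function name is illustrative.

```python
from typing import List


def backoff_delays(base_s: float = 2.0, factor: float = 1.8,
                   cap_s: float = 45.0, max_attempts: int = 5) -> List[float]:
    """Seconds to wait before each retry (assumes no wait before attempt 1)."""
    retries = max(max_attempts - 1, 0)
    return [min(base_s * factor ** i, cap_s) for i in range(retries)]


if __name__ == "__main__":
    delays = backoff_delays()
    # Roughly 2.0, 3.6, 6.48, 11.66 seconds: about 23.7 seconds of waiting in total.
    print([round(d, 2) for d in delays], round(sum(delays), 2))
```

With these defaults the 45-second cap is never reached inside five attempts; it only matters if you later raise the attempt limit.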
Tune upward only with percentile evidence: if p95 latency stays under your target when you raise ceilings, document the change in the same commit that adjusts monitoring thresholds.
When agents call browser automation tools, divide token budgets by 2.5 because screenshots inflate prompts; keep DOM-only tools on the cheaper path.
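As a worked example of that rule, a 1,200-token ceiling drops to 480 on a screenshot-heavy route. The helper below is a sketch; the function and parameter names are hypothetical.

```python
def effective_token_budget(base_tokens: int, uses_screenshots: bool,
                           divisor: float = 2.5) -> int:
    """Shrinks the budget for screenshot-based tools; DOM-only tools keep the full budget."""
    if uses_screenshots:
        return int(base_tokens / divisor)
    return base_tokens


if __name__ == "__main__":
    print(effective_token_budget(1200, uses_screenshots=True))   # 480
    print(effective_token_budget(1200, uses_screenshots=False))  # 1200
```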
Always log the effective policy version string with each completion event so Grafana dashboards can split before/after deployments without guessing.
Security note: throttles do not replace sandboxing—pair numeric limits with directory allow lists and command prefixes reviewed quarterly.
macOS scheduling, LaunchAgents, and fork pressure
OpenClaw often coexists with file watchers, log shippers, and occasionally Xcode-derived simulators. Under load, fork-heavy toolchains can push the system into memory pressure even when CPU looks fine. Serialize shell tools when more than three conversations are active; macOS fair scheduling will otherwise interleave processes in ways that inflate wall-clock time per tool call.
LaunchAgents should include ThrottleInterval when you restart the gateway on failure—avoid tight restart loops that hammer the model API at 10 Hz during outages. Pair restarts with a manual page that links to your status channel.
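A minimal sketch of such a LaunchAgent definition, generated with Python's standard `plistlib`. `KeepAlive`, `ThrottleInterval`, `RunAtLoad`, `Label`, and `ProgramArguments` are real launchd keys; the label, the binary path, and the 30-second value are placeholders chosen for illustration, not OpenClaw defaults.

```python
import plistlib

agent = {
    "Label": "com.example.openclaw.gateway",                   # hypothetical label
    "ProgramArguments": ["/usr/local/bin/openclaw-gateway"],   # hypothetical path
    "RunAtLoad": True,
    # Restart only after an unsuccessful (non-zero) exit.
    "KeepAlive": {"SuccessfulExit": False},
    # Minimum seconds between spawns; launchd's documented default is 10.
    "ThrottleInterval": 30,
}

# plistlib.dumps returns XML bytes by default.
plist_xml = plistlib.dumps(agent).decode("utf-8")

if __name__ == "__main__":
    print(plist_xml)
```

Write the output to a file under ~/Library/LaunchAgents and keep it in version control next to the throttle policy so that restart behavior and token limits are reviewed together.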
If you cannot reproduce fork storms locally, rent a cloud Mac mini that mirrors production RAM and macOS minor version. MacHTML pricing commonly lands near $16.9/day for short bursts—cheaper than burning a weekend of senior engineer time on guesswork.
Snapshot the gateway plist and environment files before testing aggressive throttles; rollback should be a single launchctl bootout plus restore, not a reinstall.
Document thermal behavior: fanless minis throttle CPU after sustained all-core bursts, which shifts your latency histogram even when token policies stay constant.
Observability without drowning in logs
Structured JSON logs beat prose paragraphs. Emit one line per tool invocation with fields: conversation_id, tool, duration_ms, exit_code, retry_count, policy_version. Ship those to whatever cheap store you already run—OpenSearch, ClickHouse, or even S3 + Athena if volume is modest.
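The one-line-per-invocation format could look like the sketch below. The field names come from the list above; the function name, the example values, and the policy version string are made up for illustration.

```python
import json
import sys
from typing import TextIO


def log_tool_invocation(conversation_id: str, tool: str, duration_ms: int,
                        exit_code: int, retry_count: int, policy_version: str,
                        stream: TextIO = sys.stdout) -> str:
    """Writes one compact JSON object per line and returns the line for reuse."""
    record = {
        "conversation_id": conversation_id,
        "tool": tool,
        "duration_ms": duration_ms,
        "exit_code": exit_code,
        "retry_count": retry_count,
        "policy_version": policy_version,
    }
    line = json.dumps(record, separators=(",", ":"), sort_keys=True)
    stream.write(line + "\n")
    return line


if __name__ == "__main__":
    log_tool_invocation("conv-123", "shell", 842, 0, 1, "throttle-v3")
```

Sorting keys keeps lines byte-stable across releases, which makes diffs and deduplication in the log store easier.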
Alert when the moving average of tokens per successful resolution rises more than 20% above the trailing seven-day baseline. That catches silent regressions from prompt template edits.
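A sketch of that alert rule, assuming you already have one averaged value per day, ordered oldest first. The three-day moving-average window is an assumption, since the text does not fix one, and the function name is illustrative.

```python
from statistics import mean
from typing import Sequence


def regression_alert(daily_avg_tokens: Sequence[float], window: int = 3,
                     baseline_days: int = 7, threshold: float = 0.20) -> bool:
    """True when the recent moving average exceeds the prior baseline by `threshold`."""
    if len(daily_avg_tokens) < window + baseline_days:
        return False  # not enough history to compare yet
    recent = mean(daily_avg_tokens[-window:])
    # Baseline is the seven days immediately before the recent window.
    baseline = mean(daily_avg_tokens[-(window + baseline_days):-window])
    return recent > baseline * (1 + threshold)
```

Keeping the recent window out of the baseline stops a regression from inflating the number it is being compared against.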
Dashboards should include a stacked area chart of tokens by model route and a heatmap of failures by hour-of-week; marketing launches spike traffic and expose throttling gaps.
Redact secrets at ingestion, not only at display—throttled retries multiply log volume and accidentally leak tokens when engineers copy “debug” bundles.
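A minimal sketch of ingestion-time redaction, applied to each log line before it is stored. The two patterns are illustrative assumptions covering bearer headers and simple key=value secrets; a real deployment needs patterns matched to the credentials it actually handles.

```python
import re

# Illustrative patterns only; extend them for your own secret formats.
_SECRET_PATTERNS = [
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
    re.compile(r"(?i)((?:api[_-]?key|token|secret)\s*[=:]\s*)\S+"),
]


def redact(line: str) -> str:
    """Replaces the secret value and keeps the label so the line stays searchable."""
    for pattern in _SECRET_PATTERNS:
        line = pattern.sub(r"\1[REDACTED]", line)
    return line


if __name__ == "__main__":
    print(redact("retrying with api_key=sk-123 after 429"))
```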
Run a weekly game day: artificially lower max output tokens by 30% for staging only and verify operators still complete golden paths; promote the lesson learned to production configs.
On-call runbooks should list three escalation tiers: (1) lower token ceilings and notify the channel, (2) disable nonessential tools and route traffic to a cold standby gateway, (3) fail closed with a maintenance banner while finance approves an emergency quota increase. Practicing those tiers quarterly prevents improvisation when APIs return 429 for an entire region.
Finally, align product marketing with reality: if the homepage promises “unlimited research,” throttles will always look like bugs to end users. Publish honest limits next to pricing so support tickets drop.
Engineering managers should attach throttle diffs to the same pull request as prompt template changes—when those drift apart, dashboards show “mysterious” token spikes that are actually copy edits masquerading as infrastructure regressions.
FAQ
What is the first knob when bills spike?
Reduce max output tokens and disable parallel tools until logs show the worst offender.
How do I stop loops without killing the gateway?
Add step and wall-clock caps that return explicit errors to channels instead of infinite retries.
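One possible shape for those caps is a per-conversation budget object that raises an explicit error the channel can display. Everything here is a sketch: the class names and the 25-step default are hypothetical, while the 12-minute ceiling follows the starting numbers earlier in this runbook.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised so the channel can show a clear message instead of retrying."""


@dataclass
class ConversationBudget:
    max_steps: int = 25                   # hypothetical default
    max_model_seconds: float = 12 * 60    # 12 minutes of model time
    steps: int = 0
    model_seconds: float = 0.0

    def record_step(self, model_seconds: float) -> None:
        """Call after every model turn; raises once either cap is crossed."""
        self.steps += 1
        self.model_seconds += model_seconds
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} reached")
        if self.model_seconds > self.max_model_seconds:
            raise BudgetExceeded("wall-clock cap reached; reply 'continue' to resume")

    def resume(self) -> None:
        """Resets counters after an operator explicitly asks to continue."""
        self.steps = 0
        self.model_seconds = 0.0
```

Catch `BudgetExceeded` at the channel boundary and post its message once, so the cap does not itself trigger another retry loop.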
Why test on a dedicated Mac mini?
macOS process behavior matches production; Linux stubs hide fork and watcher interactions.
Apple Silicon Mac mini remains the sweet spot for OpenClaw: enough unified memory for local models plus gateway overhead, quiet enough for office racks, and identical to what designers expect when they VNC in to reproduce issues. MacHTML rents Mac mini hosts with SSH/VNC so you can validate throttles, LaunchAgent restarts, and doctor workflows on real hardware—scale up for load tests, tear down when budgets stabilize.