Operators know OpenClaw gateways by their happy-path logs, but production teaches the vocabulary of exit codes. In 2026, macOS launchd still wraps Node runtimes the same way whether you self-host on a studio Mac mini or rent one in the cloud: 137 usually whispers memory pressure, 143 often means a polite SIGTERM during reloads, and rapid respawns point to plist mistakes rather than model quality. This playbook maps codes to signals, shows how to filter unified logs without drowning in noise, ties symptoms back to doctor diagnostics, and explains when to capture an Activity Monitor sample before you widen concurrency again.
Pair this playbook with LaunchAgent recovery patterns for clean restarts, and with memory and context pruning when conversations and tool transcripts grow without bound.
Exit code cheat sheet
| Code | Typical meaning | First checks |
|---|---|---|
| 0 | Clean shutdown | Confirm intentional stop vs watchdog |
| 1 | Generic Node error | Read stderr path; rerun foreground |
| 137 | SIGKILL (128 + 9), usually OOM | Memory pressure, tool output size, model context |
| 143 | SIGTERM (128 + 15) | launchd reload, manual kill, deploy script |
macOS also surfaces Jetsam events for GUI apps; daemons instead leave breadcrumbs in unified logging with reason strings. Learn the exact spelling of your gateway's launchd label (or its bundle id, if it ships as an app bundle) so predicates stay narrow.
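Codes above 128 in the cheat sheet follow the shell convention of 128 plus the signal number. A minimal sketch of that decoding, where `describe_exit` is a hypothetical helper name and `kill -l N` is the POSIX way to turn a signal number into its name:

```shell
# describe_exit STATUS: label a shell-style exit status.
# Statuses above 128 are read as 128 + signal number.
describe_exit() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "clean exit"
  elif [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # kill -l N prints the name of signal N without the SIG prefix.
    name=$(kill -l "$sig" 2>/dev/null || echo "UNKNOWN")
    echo "killed by SIG$name (signal $sig)"
  else
    echo "error exit $status"
  fi
}

describe_exit 137
describe_exit 143
```

The label only names the signal. Whether a SIGKILL came from memory pressure or from a deploy script still has to be confirmed from logs and memory graphs.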
launchd ThrottleInterval and crash loops
launchd throttles respawns: by default it will not start a job more than once every 10 seconds, and ThrottleInterval overrides that floor. If your plist combines KeepAlive with a ThrottleInterval of a second or two and the process dies almost immediately, operators see a wall of near-identical restart entries that hide the first real error line. While debugging, raise ThrottleInterval back to at least the 10-second default, fix the root cause, then restore the tighter value if production responsiveness needs it.
Document whether RunAtLoad is true: false positives happen when engineers manually unload agents during business hours but automation reloads them minutes later, masking the true trigger.
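When a crash loop has already filled the stderr file with the same stack trace many times over, collapsing repeats can help surface the first real error. A minimal sketch, assuming the gateway's stderr lines carry no per-line timestamps (if they do, strip that prefix first); `first_occurrences` is a hypothetical helper name and the path in the usage comment is a placeholder:

```shell
# Print each distinct line once, in order of first appearance.
# Assumption: repeated crashes write byte-identical lines.
first_occurrences() {
  # seen[$0]++ is 0 (false) the first time a line appears, so
  # the negation prints only first occurrences.
  awk '!seen[$0]++' "$@"
}

# Usage (replace the path with your StandardErrorPath):
# first_occurrences /path/to/gateway.err | head -n 40
```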
log show predicates that stay readable
```shell
log show --last 30m --predicate \
  'subsystem == "com.apple.xpc.launchd" AND eventMessage CONTAINS[c] "openclaw"'
```
Narrow further by adding process == "launchd" and your job's label string to the predicate. Export JSON with --style json for postmortems so auditors can grep the file without SSH access.
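One way to keep the narrowed predicate consistent across runbooks is to build it from the label in a small helper. This is a sketch only: `launchd_predicate` is a hypothetical name, `your.label` is a placeholder, and the helper assumes the label contains no double quotes.

```shell
# Build a unified-logging predicate scoped to launchd messages
# that mention one job label.
launchd_predicate() {
  label=$1
  printf 'subsystem == "com.apple.xpc.launchd" AND process == "launchd" AND eventMessage CONTAINS[c] "%s"' "$label"
}

# Usage on macOS, exporting JSON for a postmortem:
# log show --last 30m --style json \
#   --predicate "$(launchd_predicate your.label)" > gateway-exits.json
```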
Console.app workflows
- Create a dedicated “Gateway exits” favorite combining subsystem and message contains filters.
- Start streaming before reproducing; pause immediately after the crash to avoid buffer loss.
- Attach a sysdiagnose only if you suspect a filesystem or kext problem; otherwise keep evidence lightweight.
Memory pressure and tool fan-out
Exit 137 often correlates with parallel tool calls that each buffer multi-megabyte stdout. Cap concurrent tools at three when the host has 8 GB of unified memory, or shrink the model context window until peak usage stays below the spike you saw in Activity Monitor's Memory tab. Memory compression kicks in before outright kills, so rising latency is the leading indicator to watch.
Align pruning policies with the memory article: rotate transcripts, bound JSON depth, and refuse oversized attachments at the gateway instead of letting Node parse them.
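Bounding tool output before anything buffers it is one way to apply that policy. The sketch below truncates at a byte limit rather than refusing the payload, which is a simpler stand-in for the gateway-side check described above. `run_capped` is a hypothetical wrapper, and `head -c` is assumed to be available (it is on macOS and GNU systems).

```shell
# run_capped LIMIT CMD [ARGS...]: run a tool and pass through at
# most LIMIT bytes of its stdout. The tool's own exit status is
# not preserved, because the pipeline reports head's status.
run_capped() {
  limit=$1
  shift
  "$@" | head -c "$limit"
}

# Usage: allow a tool at most 1 MiB of stdout (tool name is a placeholder).
# run_capped 1048576 some-tool --verbose
```

Once `head` has read its limit it closes the pipe, so a tool that keeps writing typically receives SIGPIPE and stops instead of filling memory.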
When to sample the Node process
If CPU pegs at 100% for more than two minutes without progress logs, capture a sample from Activity Monitor and archive it beside the plist version. Samples reveal tight loops in custom middleware that never show as traditional stack traces in stderr.
Plist hygiene: KeepAlive, RunAtLoad
Under KeepAlive, SuccessfulExit set to false restarts the job only after a non-zero exit, while true restarts it after a clean exit of zero. Use true only when a zero exit truly means “unhealthy.” A misconfigured boolean makes launchd restart healthy shutdowns, burning CPU credits on cloud Mac hosts. Validate with launchctl print gui/$UID/your.label and screenshot the output for change management.
Why Linux CI cannot reproduce
CI containers lack macOS's unified memory compressor, launchd job lifecycle, and Keychain prompts. Treat Linux tests as linting and still run smoke tests on macOS before promoting gateway builds. A rented Mac mini closes that gap for roughly $16.90 per day instead of shipping laptops.
Postmortem template
- Timeline from last healthy request id to first crash log line.
- Exit code, signal, and launchd reason string.
- Memory high watermark and concurrent tool count.
- Doctor output hash and config diff since last deploy.
- Follow-up: code change, plist change, or capacity change.
Stderr rotation and disk-full exits
Some gateways exit with code 1 when StandardErrorPath cannot append because the volume filled. macOS unified logging still rotates, but plain files do not. Monitor free space on /private/var and your custom log directory; keep at least 5 GB free on shared Mac mini hosts where multiple agents write verbose tool dumps.
Prefer newsyslog or logrotate-style wrappers documented in your runbook over unbounded single files—parsing a 12 GB stderr tail under stress is how secondary outages happen.
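The free-space check can be scripted with `df`. A minimal sketch, assuming the 5 GB floor mentioned above; `free_kb` and `low_disk` are hypothetical helper names, and the column parsing assumes the filesystem name contains no spaces:

```shell
# free_kb DIR: available space on DIR's volume in 1024-byte blocks.
# -P forces the one-line-per-filesystem POSIX layout; column 4 of
# the second line is the available space.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# low_disk DIR [MIN_GB]: succeed when free space is below MIN_GB
# (default 5).
low_disk() {
  dir=$1
  min_gb=${2:-5}
  [ "$(free_kb "$dir")" -lt $((min_gb * 1024 * 1024)) ]
}

# Usage:
# low_disk /private/var 5 && echo "less than 5 GB free on /private/var" >&2
```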
Signal hygiene during deploy scripts
Blue-green scripts often send SIGTERM, wait 15 s, then escalate to SIGKILL. If your Node process traps SIGTERM to drain HTTP connections but tool calls exceed the grace window, launchd records 137 even though the operator believed the shutdown was clean. Lengthen the grace period or reduce in-flight tool fan-out before swapping binaries.
Expose a /readyz endpoint that flips false before SIGTERM so load balancers stop sending requests immediately—this pattern cuts forced kills more than tweaking Node flags alone.
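The SIGTERM-then-SIGKILL escalation can be sketched as follows. `stop_with_grace` is a hypothetical helper, the 15-second default mirrors the window described above, and the one-second polling interval is an arbitrary choice.

```shell
# stop_with_grace PID [GRACE_SECONDS]
# Returns 0 if the process exited on SIGTERM (or was already gone),
# 1 if the grace window expired and SIGKILL was sent.
stop_with_grace() {
  pid=$1
  grace=${2:-15}
  kill -TERM "$pid" 2>/dev/null || return 0
  waited=0
  while [ "$waited" -lt "$grace" ]; do
    # kill -0 sends no signal; it only checks that the pid exists.
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
    waited=$((waited + 1))
  done
  kill -0 "$pid" 2>/dev/null || return 0
  echo "pid $pid still running after ${grace}s, sending SIGKILL" >&2
  kill -KILL "$pid" 2>/dev/null
  return 1
}
```

Logging the escalation is the point of the sketch: a return value of 1 in the deploy log explains a later 137 without anyone having to suspect memory first.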
Flight recorder: minimal always-on metrics
Export process_start_timestamp_seconds, process_exit_code, and rss_bytes_max to Prometheus even on tiny installs. When exit codes spike, those three series tell you whether you are fighting memory, deploy churn, or flaky configuration without opening a laptop.
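On a tiny install, one low-effort route is to write those three series in the Prometheus text exposition format and let a file-based collector pick them up. This sketch assumes a node_exporter textfile collector is watching a directory of `.prom` files; `emit_gateway_metrics` is a hypothetical helper, and the caller supplies the values (for example from a wrapper script that records start time and exit status).

```shell
# emit_gateway_metrics START_TS EXIT_CODE RSS_MAX_BYTES
# Prints three gauges in Prometheus text exposition format.
emit_gateway_metrics() {
  start_ts=$1
  exit_code=$2
  rss_max=$3
  cat <<EOF
# TYPE process_start_timestamp_seconds gauge
process_start_timestamp_seconds $start_ts
# TYPE process_exit_code gauge
process_exit_code $exit_code
# TYPE rss_bytes_max gauge
rss_bytes_max $rss_max
EOF
}

# Usage: write to a temp file, then rename so the collector never
# reads a half-written file (directory is a placeholder).
# emit_gateway_metrics "$start" "$code" "$rss" > /path/to/textfile/gateway.prom.$$ \
#   && mv /path/to/textfile/gateway.prom.$$ /path/to/textfile/gateway.prom
```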
FAQ
Does 137 always mean OOM?
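Not always. 137 means SIGKILL: usually memory pressure on gateways, but a deploy script that escalates after its grace window produces the same code. Confirm with memory graphs.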
Why 143 after deploy?
launchd terminated the old process during reload.
Where is stderr?
Check StandardErrorPath in the LaunchAgent plist.
When should you rent a Mac mini?
When you need faithful launchd and memory behavior.
Exit-code archaeology is tedious but cheaper than outage minutes. A physical Mac mini with Apple Silicon reproduces launchd backoff, memory compression, and file descriptor defaults that differ from Linux staging. MacHTML rents machines with SSH/VNC so you can keep a diagnostics host online during a release weekend, capture Console streams, and shut everything down Monday—elastic capacity without another CapEx ticket.
Quiet hardware also helps when you are on a video bridge reading logs aloud to a distributed team.