When an OpenClaw gateway suddenly cannot reach a tool host, operators reflexively blame TLS or API keys. In 2026, a quieter class of failures still dominates incident reviews: DNS negative caching, split-horizon corporate resolvers, and retry loops that hammer NXDOMAIN answers for minutes after a typo ships. macOS combines mDNSResponder, per-process caches, and VPN-provided DNS that diverges from what your Linux staging container sees. This playbook explains how to separate NXDOMAIN from timeouts, tune retry budgets, align synthetic health checks with real resolver paths, and rehearse fixes on Apple Silicon before you widen blast radius.
Cross-link with openclaw doctor gateway diagnostics, gateway health monitoring, and remote CLI tunnels when DNS and split routing interact.
Log signals that distinguish NXDOMAIN, SERVFAIL, and timeouts
Structured logs should carry dns_rcode, elapsed resolver milliseconds, and the upstream library name. Without those fields, “connection failed” tickets waste hours. Teach support to look for NXDOMAIN first—if present, no amount of TLS debugging helps until TTL expires or caches flush.
Negative TTL behavior on macOS
Authoritative servers publish SOA minimum TTL that caps negative caching. macOS honors those hints through mDNSResponder; flushing requires intentional commands such as dscacheutil -flushcache combined with process restarts when user-space libraries cache independently. Document the flush runbook because security teams may question repeated sudo usage.
Retry budgets and jitter tables
| Scenario | Attempts | Backoff cap | Jitter |
|---|---|---|---|
| Transient timeout | 3 | 8 s | ±20% |
| NXDOMAIN after deploy typo | 1 | n/a | Stop and fix config |
| SERVFAIL burst | 5 | 32 s | ±35% |
Health probes versus user traffic paths
If synthetic monitors resolve hosts through a data-center resolver while user tools rely on split DNS, health stays green while customers fail. Point probes through the same resolver chain or inject explicit /etc/hosts only in lab environments—never let production monitors diverge silently.
mDNS .local collisions with unicast tools
LAN toolchains advertising *.local can shadow corporate unicast names when Bonjour browsing is enabled. Gateways co-located with design workstations are especially vulnerable. Prefer fully qualified domain names for tools and disable accidental multicast DNS shortcuts in LaunchAgent environments.
VPN and captive portal interactions
Remote engineers often route all TCP through VPN while DNS still points at the office split horizon. When the VPN drops, macOS may briefly use public resolvers that return NXDOMAIN for internal-only tool hosts—negative caching then blocks quick recovery. Add a resolver watchdog that flushes caches when VPN state transitions from connected to disconnected.
Remediation matrix
- NXDOMAIN + recent deploy: roll back config, flush cache, restart gateway process.
- SERVFAIL spikes: escalate to resolver team, lower parallelism temporarily.
- Timeouts only: inspect packet loss, MTU black holes, or TLS middleboxes.
Metrics that prove DNS regressions
Export counters for dns_query_total labeled by rcode, plus a histogram of resolver latency. Alert when NXDOMAIN share exceeds 0.5% of queries for more than 10 minutes—that usually means a bad rollout rather than organic traffic. Pair DNS metrics with http_client_errors_total{reason="connect"} so dashboards show whether failures precede or follow TLS handshakes.
Avoid cardinality explosions by bounding host label values: collapse tool hostnames to service families (vector_db, crm_api) instead of raw FQDNs unless you truly need per-tenant drilldowns.
LaunchAgent restarts and warm resolver caches
Restarting the gateway clears some user-space DNS caches but not the system cache. Document whether your process uses c-ares, getaddrinfo, or a higher-level HTTP client because each layer caches differently. After restart, run a warm-up probe that resolves every critical hostname twice before marking the instance ready—first call may still hit cold caches while the second reflects steady state.
Split-horizon case study template
When internal tools resolve only on corporate DNS, capture a before/after table listing resolver IP, search domains, and measured TTL for both success and negative answers. Store that table beside the postmortem so future engineers do not repeat the same VPN ordering mistake. Include timestamps showing how long negative caching lasted—often 300–900 seconds for internal SOA minimums.
Attach raw dig +trace outputs and redacted resolver PCAP snippets so auditors outside the incident channel can verify conclusions without SSH access to production—this habit alone shortens executive reviews by hours.
IPv6 AAAA records and Happy Eyeballs delays
Gateways that attempt IPv6 first may stall for hundreds of milliseconds when AAAA exists but routing is black-holed. Log whether your HTTP stack uses Happy Eyeballs and what the race timeout is—defaults near 300 ms interact badly with aggressive DNS retries. Consider pinning IPv4-only tool endpoints temporarily during incidents while networking fixes propagate, but document the debt.
TLS session reuse after DNS changes
Clients that keep HTTP connection pools open may continue speaking to an old IP after DNS updates because the socket never reopened. Pair DNS TTL reductions with connection pool idle timeouts below the TTL when you must drain traffic quickly. Mention this interaction in runbooks so DNS and app teams stop blaming each other.
When gateways reuse HTTP/2 connections across many tool hosts behind the same IP, ensure your client library re-evaluates DNS before opening new streams if the control plane rotates upstreams—some stacks opportunistically skip lookups when the socket stays warm.
Captive portals and hotel Wi-Fi
Road warriors running gateways on laptops may hit captive portals that hijack DNS until acceptance pages load. Detect captive states by probing a known-canary host and pause tool fan-out until DNS answers match expected signatures. Log a distinct captive_portal_suspected reason code so support macros route correctly.
Hotel and conference Wi-Fi often inject longer-lived negative answers for internal tool domains until the splash page completes; document separate flush steps for those venues so on-call engineers do not confuse them with genuine outages.
FAQ
Why does NXDOMAIN hurt more than slow DNS?
Negative answers cache and short-circuit retries until TTL expires.
Should OpenClaw bypass system DNS?
Only with explicit architecture approval and pinned resolvers.
How many DNS retries are sane?
Default to three attempts with jitter unless vendors specify more.
Why test on a Mac mini?
macOS resolver behavior matches many production gateways; Linux CI alone lies.
DNS issues are environmental: you need the same resolver stack, VPN client, and sleep policies your operators use daily. Renting an Apple Silicon Mac mini from MacHTML for roughly $16.9 per day gives a faithful rehearsal host without another CapEx approval. You can snapshot failing dscacheutil output, validate LaunchAgent restarts, and tear the machine down once runbooks update—elastic capacity beats leaving a misconfigured resolver wedged in a shared staging cluster.
Quiet thermals also matter when you leave packet captures running for hours; Mac mini stays stable under sustained resolver churn, which keeps repro logs trustworthy.
Rehearse OpenClaw DNS fixes on real macOS
Rent a cloud Mac mini to validate resolver paths, negative cache flush procedures, and health-check alignment before production cutovers.