Hermes Agent (NousResearch/hermes-agent) ships a dedicated trajectory_compressor.py utility to keep long agent runs inside model context windows without throwing away everything the operator just did. On a Mac mini M4 that stays online for gateways, cron jobs, and multi-hour research loops, token pressure shows up as truncated tool output, failed follow-up turns, and surprise billing from re-sending entire transcripts. This explainer walks through upstream defaults, the protected-head / summarized-middle / protected-tail model, the difference between the interactive /compress slash command and the batch CLI, and a six-step runbook you can run on macOS before promoting compression settings to production.
If you are still choosing a harness, read Hermes Agent vs OpenClaw on macOS and Mac mini M4 first—compression matters after you pick where sessions live. Primary upstream references: trajectory_compressor.py on main, the Hermes Agent README, and Apple Mac mini specs for RAM and unified memory planning.
Disclosure: MacHTML offers optional cloud Mac mini rental for always-on agent staging mentioned briefly below.
Why token limits bite on Mac mini
Agent harnesses do not store conversation the way chat UIs do—they replay structured message lists on every model call. A single long day on Telegram plus local tool loops can accumulate tens of thousands of tokens across system prompts, human instructions, assistant replies, and tool JSON. When the provider hard-stops or silently truncates, operators blame the model; often the transcript simply exceeded the budget. A dedicated Mac mini M4 helps because it can run gateways 24/7 without laptop sleep, but it does not magically expand context—compression is an operational control, not a hardware feature.
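To get a feel for how a replayed message list adds up, a rough estimate like the sketch below can help. The four-characters-per-token ratio is a common rule of thumb rather than a tokenizer, and the `from`/`value` message shape is an assumption for illustration, not a confirmed Hermes schema.

```python
import math

# Assumption: roughly 4 characters per token for English text. This is a
# rule of thumb, not a real tokenizer; use the provider's tokenizer for billing.
CHARS_PER_TOKEN = 4


def estimate_tokens(messages):
    """Return a rough token estimate for a list of {'from', 'value'} dicts."""
    total_chars = sum(len(m.get("value", "")) for m in messages)
    return math.ceil(total_chars / CHARS_PER_TOKEN)


def tokens_by_role(messages):
    """Group the rough estimate by role to see what dominates the transcript."""
    totals = {}
    for m in messages:
        role = m.get("from", "unknown")
        totals[role] = totals.get(role, 0) + len(m.get("value", ""))
    return {role: math.ceil(chars / CHARS_PER_TOKEN) for role, chars in totals.items()}
```

Breaking the estimate down by role usually shows whether tool JSON or assistant replies are driving the growth, which tells you whether compression or shorter tool output is the better first fix.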
Hermes addresses this with trajectory_compressor.py, which rewrites a trajectory folder into a shorter equivalent while preserving semantics operators still need: the original system contract, the opening human intent, early tool grounding, a faithful middle summary, and the most recent turns where mistakes are costly. Default budgeting targets 15,250 tokens for the compressed transcript with about 750 tokens reserved for the middle summary itself, while keeping the last four turns verbatim. Those numbers are tunable, but they encode a practical default: compress aggressively in the middle, never amputate the conversation you are actively steering.
Token optimization also intersects cost. Even when you route inference through OpenRouter, re-sending a 40k-token trace for a small follow-up question burns money and latency. Compression is cheaper than swapping to a larger context model for every turn, especially on staging Mac minis where you iterate prompts hourly. Pair compression with the harness comparison above so you do not optimize transcripts on a gateway you plan to replace next week.
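The cost argument can be made concrete with simple arithmetic. The sketch below compares re-sending a 40k-token trace with re-sending a transcript at the default 15,250-token ceiling; the price per million input tokens is a placeholder, not a real OpenRouter tariff, and the one-time cost of the summarization call is deliberately left out.

```python
def resend_savings(raw_tokens, compressed_tokens, usd_per_million_input):
    """Input tokens and dollars saved per follow-up call after compression."""
    saved_tokens = max(raw_tokens - compressed_tokens, 0)
    saved_usd = saved_tokens * usd_per_million_input / 1_000_000
    return saved_tokens, saved_usd


# 40k-token trace from the text vs. the 15,250-token default ceiling.
# The 1.00 USD per million input tokens price is a hypothetical placeholder.
tokens, dollars = resend_savings(40_000, 15_250, usd_per_million_input=1.00)
print(tokens, dollars)
```

With these inputs each follow-up call sends 24,750 fewer input tokens; multiply by the number of follow-ups in a session to judge whether the summarization call pays for itself.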
Finally, memory on Apple silicon is unified: a Mac mini with 16 GB can host Hermes gateways plus light tooling, but parallel subagents, local Ollama sidecars, and huge uncompressed logs compete for the same pool. Compression reduces RAM pressure from gigantic in-memory message arrays during gateway fan-out, even though the primary win is model context. Check Apple’s Mac mini specifications before assuming a base model will hold your uncompressed archive and local models simultaneously.
Core compression model
Upstream trajectory_compressor.py implements a three-zone policy that is easy to reason about when auditing a run after the fact:
- Protected head: keep the first system, human, gpt, and tool messages so persona, safety, and initial tool schema survive compression.
- Summarized middle: everything between head and tail is collapsed via summarization (OpenRouter by default) into roughly summary_target_tokens=750 of narrative plus key facts.
- Protected tail: retain the last protect_last_n_turns=4 turns so recent operator corrections and tool errors remain addressable without re-deriving them from a summary.
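The three zones above can be sketched as an index split. This is an illustration of the policy as described here, not upstream's implementation: it assumes messages are dicts with a `from` role key, treats one message as one turn, and the function name `split_zones` is hypothetical.

```python
HEAD_ROLES = ("system", "human", "gpt", "tool")


def split_zones(messages, protect_last_n_turns=4):
    """Split messages into (head, middle, tail) lists of indices.

    Assumption: each message counts as one turn. Upstream may define a
    turn differently, so treat this as a reading aid only.
    """
    n = len(messages)
    tail_start = max(n - protect_last_n_turns, 0)
    tail = list(range(tail_start, n))

    # Head: the first message of each protected role, found before the tail.
    head, seen = [], set()
    for i in range(tail_start):
        role = messages[i]["from"]
        if role in HEAD_ROLES and role not in seen:
            head.append(i)
            seen.add(role)

    # Middle: everything that is neither head nor tail; this is what a
    # summarizer would collapse into a single summary message.
    protected = set(head) | set(tail)
    middle = [i for i in range(n) if i not in protected]
    return head, middle, tail
```

Returning indices rather than message copies makes it easy to audit a run afterwards: you can print exactly which positions would have been summarized away.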
The default target_max_tokens=15250 acts as a ceiling for the post-compression transcript. Think of it as a guardrail for gateway sessions that must stay inside provider limits even after skills inject extra system text. Operators tuning for a specific model should align target_max_tokens with that model’s usable window minus headroom for tools and completion tokens—not the marketing context length printed on a pricing page.
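One way to derive a ceiling from a model's usable window is to subtract each headroom item explicitly. The sketch below is a hypothetical helper; the window and headroom figures in the usage note are made-up examples, not recommendations for any specific model.

```python
def usable_target(context_window, tool_headroom, completion_headroom, skill_text=0):
    """Derive a target_max_tokens value from a model's usable window.

    All arguments are token counts. skill_text covers extra system text
    that skills inject after compression.
    """
    target = context_window - tool_headroom - completion_headroom - skill_text
    if target <= 0:
        raise ValueError("headroom exceeds the context window")
    return target
```

For a hypothetical 32,768-token window with 4,000 tokens reserved for tools, 4,096 for the completion, and 1,500 for skill text, `usable_target(32_768, 4_000, 4_096, 1_500)` gives 23,172, which would be the value to pass as `--target_max_tokens`.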
Summarization defaults to google/gemini-3-flash-preview through OpenRouter unless you override provider settings in your environment. That choice trades quality for speed and cost during batch replays of old runs. For compliance-sensitive workloads, point summarization at an approved model and log which trajectory folders were compressed, because summaries are lossy by design—they should capture decisions and blockers, not every stack trace line.
Compression is not a substitute for Hermes’s curated memory files or FTS5 session search; it is a per-trajectory emergency brake when a single thread outgrows the window. Use memory for durable facts across weeks; use trajectory compression when tonight’s incident thread is already 30k tokens deep.
Two surfaces: /compress vs batch CLI
Hermes exposes compression in two different ergonomics on purpose:
| Surface | When to use | What it does |
|---|---|---|
| /compress slash command | Inside an active Hermes TUI or gateway-attached session when the operator feels context tightening mid-flight. | Interactive, session-aware compression tied to the live trajectory the agent is executing; best for “we are drowning right now” moments. |
| python trajectory_compressor.py | Offline housekeeping on saved runs under data/, CI fixtures, or pre-migration archives. | Batch utility with explicit flags for sampling percent, token ceilings, and input directories; best for reproducible ops playbooks. |
Do not conflate them: the slash command is an operator control surface; the Python entrypoint is how you retrofit historical folders after a gateway upgrade or before attaching a huge trace to a ticket. Running the CLI against live files while a gateway still appends messages can race—quiesce the session or copy the trajectory to a scratch directory first.
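Copying a trajectory to a scratch directory before batch work can be done with the standard library alone. The helper name `snapshot_trajectory` and the `hermes_traj_` prefix are hypothetical; note that a copy does not quiesce the gateway, so a snapshot taken mid-write may still end on a partial line.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_trajectory(source_dir):
    """Copy a trajectory folder to a fresh scratch directory and return its path."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"not a directory: {source}")
    # mkdtemp creates a new, uniquely named directory on every call.
    scratch_root = Path(tempfile.mkdtemp(prefix="hermes_traj_"))
    target = scratch_root / source.name
    shutil.copytree(source, target)
    return target
```

Point the batch CLI's `--input` at the returned path, and delete the scratch directory once you have diffed the result against the source.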
Batch mode also supports sampling—for example --sample_percent=15 when profiling compression quality across hundreds of runs without paying to summarize every folder during an experiment. Raise to 100% before declaring production defaults.
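If you want to know in advance which runs a 15% experiment will touch, you can pick the subset yourself and feed each folder to the CLI. The sketch below is a hypothetical selection helper and says nothing about how upstream's `--sample_percent` chooses items internally.

```python
import math
import random


def sample_runs(run_names, sample_percent, seed=0):
    """Pick a reproducible subset of runs, at least one when any exist."""
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in (0, 100]")
    ordered = sorted(run_names)
    if not ordered:
        return []
    k = max(1, math.ceil(len(ordered) * sample_percent / 100))
    # A seeded Random instance keeps the selection stable across reruns.
    return sorted(random.Random(seed).sample(ordered, k))
```

A fixed seed means two operators profiling the same archive compare summaries of the same folders, which makes quality reviews easier to reconcile.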
Examples and defaults table
| Parameter | Default | Notes |
|---|---|---|
| target_max_tokens | 15250 | Post-compression ceiling; raise cautiously if your model window allows. |
| summary_target_tokens | 750 | Budget for the middle summary segment. |
| protect_last_n_turns | 4 | Recent turns kept verbatim at the tail. |
| OpenRouter summarizer | google/gemini-3-flash-preview | Override via env/provider config for policy compliance. |
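For an internal runbook, the defaults in the table can be captured in a small record that serializes to JSON. `CompressionDefaults` is a hypothetical container written for this article, not a class from the Hermes codebase.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CompressionDefaults:
    """Hypothetical record of the defaults listed in the table above."""

    target_max_tokens: int = 15250
    summary_target_tokens: int = 750
    protect_last_n_turns: int = 4
    summarizer_model: str = "google/gemini-3-flash-preview"

    def __post_init__(self):
        # Basic sanity checks so a typo in the runbook is caught early.
        if self.summary_target_tokens >= self.target_max_tokens:
            raise ValueError("summary budget must be smaller than the ceiling")
        if self.protect_last_n_turns < 0:
            raise ValueError("protect_last_n_turns cannot be negative")

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Committing the JSON output next to your gateway configuration gives reviewers a diff whenever someone changes a ceiling.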
Representative CLI invocations from upstream:
python trajectory_compressor.py --input=data/my_run
python trajectory_compressor.py --input=data/my_run --sample_percent=15
python trajectory_compressor.py --input=data/my_run --target_max_tokens=16000
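For scripted playbooks, the same invocations can be assembled as an argument list. This sketch only uses the three flags shown above; `build_command` is a hypothetical helper, and it builds the command without running it.

```python
import sys


def build_command(input_dir, sample_percent=None, target_max_tokens=None):
    """Build the argv list for a batch compression run (does not execute it)."""
    argv = [sys.executable, "trajectory_compressor.py", f"--input={input_dir}"]
    if sample_percent is not None:
        argv.append(f"--sample_percent={sample_percent}")
    if target_max_tokens is not None:
        argv.append(f"--target_max_tokens={target_max_tokens}")
    return argv
```

To execute, pass the list to `subprocess.run(argv, cwd=repo_root, check=True)`, where `repo_root` is your Hermes checkout; passing a list rather than a shell string avoids quoting problems with directory names that contain spaces.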
Inside a live session, operators invoke /compress when tool chains lengthen but the task is not finished—especially before attaching another subagent or re-running a failing tool with a fresh model call. Document your team norm: compress after major milestones, not after every single tool success, or summaries will erase nuance you still need.
Six-step runbook
- Clone or update Hermes Agent on your Mac mini and confirm trajectory_compressor.py exists at the repo root.
- Export OpenRouter (or chosen) API keys in the same shell you use for batch compression; summarization will fail fast without them.
- Pick a staging trajectory under data/ with known token bloat, and copy it aside so you can diff before/after JSON or message lists.
- Run a dry batch pass with defaults: python trajectory_compressor.py --input=data/my_run and inspect that head/tail messages remain intact.
- Tune ceilings if your production model needs margin, e.g. --target_max_tokens=16000, and re-run with --sample_percent=100 before promoting settings.
- Validate live behavior by starting a Hermes session on the Mac mini, growing context deliberately, then issuing /compress and confirming the agent still answers with awareness of protected tail turns.
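The first three steps can be checked mechanically before spending any summarization budget. The sketch below is a hypothetical preflight helper; the environment variable name `OPENROUTER_API_KEY` is an assumption, so substitute whatever your provider configuration actually reads.

```python
import os
from pathlib import Path


def preflight(repo_root, input_dir, env=None, key_name="OPENROUTER_API_KEY"):
    """Return a list of human-readable problems; empty means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if not (Path(repo_root) / "trajectory_compressor.py").is_file():
        problems.append("trajectory_compressor.py not found at repo root")
    if not env.get(key_name):
        problems.append(f"{key_name} is not set in this shell")
    if not Path(input_dir).is_dir():
        problems.append(f"input directory missing: {input_dir}")
    return problems
```

Returning a list instead of raising on the first failure lets an operator fix every issue in one pass.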
Record the chosen defaults in your internal runbook alongside gateway ports and LaunchAgent labels so the next operator does not rediscover them via a production outage.
Troubleshooting
Summarization fails or returns empty middle
Verify OpenRouter credentials, model allowlisting, and rate limits. Retry with a smaller input folder to confirm the trajectory parser is not choking on malformed tool payloads. If compliance blocks google/gemini-3-flash-preview, switch to an approved summarizer and update your documentation—do not silently disable compression.
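To rule out malformed payloads before blaming the summarizer, a quick scan of the input folder can help. The sketch assumes trajectories are stored as `.jsonl` files with one JSON object per line, which may not match your layout; `find_malformed_lines` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_malformed_lines(folder):
    """Return (file name, line number, error) for each line that is not valid JSON."""
    bad = []
    for path in sorted(Path(folder).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # blank lines are skipped, not reported
                try:
                    json.loads(line)
                except json.JSONDecodeError as exc:
                    bad.append((path.name, number, exc.msg))
    return bad
```

A truncated final line is a common finding when a folder was copied while a gateway was still appending to it.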
Agent “forgets” recent fixes after compression
Increase protect_last_n_turns temporarily or lower aggressiveness on summary_target_tokens. Confirm operators are not running batch compression on live folders while messages still append. Re-run with a copied trajectory and diff tail messages against the source.
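Diffing the tail against the source can be scripted once both message lists are loaded. The sketch below compares the last N messages position by position; it assumes one message per turn and plain comparable values (dicts or strings), and `tail_mismatches` is a hypothetical name.

```python
def tail_mismatches(original, compressed, protect_last_n_turns=4):
    """Return offsets from the end (1 = last message) where the tails differ."""
    n = min(protect_last_n_turns, len(original))
    mismatches = []
    for offset in range(1, n + 1):
        # A compressed list shorter than the protected tail is also a mismatch.
        if offset > len(compressed) or original[-offset] != compressed[-offset]:
            mismatches.append(offset)
    return mismatches
```

An empty result means the protected tail survived verbatim; any offsets returned point at the exact recent messages that changed.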
FAQ
Does trajectory compression replace Hermes memory files?
No. Compression shortens a single trajectory thread; curated memory and skills remain the long-horizon store. Use both: memory for weeks, compression for tonight’s runaway session.
Should OpenClaw operators use this script?
The utility ships with Hermes Agent. OpenClaw transcripts are a different shape—migrate or export before batching, and keep gateways separate per our Hermes vs OpenClaw guide.
How much does summarization cost?
Batch cost scales with middle-turn count and model tariff on OpenRouter. Sampling at 15% during experiments reduces spend; production passes should use 100% once defaults are trusted.
What Mac mini RAM for compression plus gateway?
16 GB works for compression-only batch jobs; 24 GB is safer if the same Mac mini also runs gateways, local models, and parallel subagents. Confirm the available memory configurations against Apple's Mac mini specifications before ordering.
Stage Hermes compression on a cloud Mac mini
Rent an always-on Mac mini M4 to batch-compress long trajectories, validate /compress behavior, and keep gateway sessions under token budgets before production.