Swarm Management is the Next Real Systems Problem in AI

TL;DR

Spawning a subagent is not swarm management. It is the beginning of the problem. The real question is what happens after the child exists: where does it live, who owns it, can it be addressed, can it be steered, and what survives when the process restarts? A harness lets one agent call tools. A delegation tool lets one agent borrow workers. A swarm manager owns a fleet.

The Architectural Line

Hermes (NousResearch) has a genuinely good delegation primitive. Its delegate_task spawns child AIAgent instances with isolated context, up to 3 running concurrently by default, each with a 600-second timeout before being killed as stuck. They stream progress and return structured summaries back to the parent. Clean. Useful.

But the child lives inside the parent tool call. delegate_task is synchronous. If a user sends a new message while the parent is waiting, all active children are cancelled and their work is discarded. There is no registry, no recovery, no address the system can use to check on a child independently.

OpenClaw takes a different approach. When a parent spawns a child, the child becomes a gateway session with a durable session key (agent:<targetAgentId>:subagent:<uuid>), a run ID, lifecycle records, parent-child lineage, and a push-based completion path back to the requester. The child can be listed, patched, deleted, or linked back to its parent - visible to the entire gateway alongside chat sessions and cron sessions.
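
To make the addressing idea concrete, here is a minimal sketch of building and parsing a key in the agent:<targetAgentId>:subagent:<uuid> shape. The SubagentKey class and both helper functions are hypothetical names for illustration, not OpenClaw's API, and the rule that an agent ID may not contain a colon is an assumption.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubagentKey:
    """Hypothetical value type for a key shaped like agent:<id>:subagent:<uuid>."""
    agent_id: str
    run_uuid: str

    def __str__(self) -> str:
        return f"agent:{self.agent_id}:subagent:{self.run_uuid}"


def new_subagent_key(agent_id: str) -> SubagentKey:
    # Assumption: agent IDs never contain ":" so the key stays unambiguous.
    if not agent_id or ":" in agent_id:
        raise ValueError("agent_id must be non-empty and contain no ':'")
    return SubagentKey(agent_id, str(uuid.uuid4()))


def parse_subagent_key(key: str) -> Optional[SubagentKey]:
    """Return the parsed key, or None if the string is not a subagent key."""
    parts = key.split(":")
    if len(parts) != 4 or parts[0] != "agent" or parts[2] != "subagent":
        return None
    if not parts[1] or not parts[3]:
        return None
    return SubagentKey(parts[1], parts[3])
```

Because the key is a plain string, any component that sees it can tell which agent owns the child without consulting the parent's context.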

That is the architectural line. Delegation asks: how does one agent split work? Swarm management asks: how does a runtime own many agents over time?

Identity and Completion

The first requirement of swarm management is addressability. A swarm manager needs to track: child session key, run ID, requester, spawn depth, role, cleanup policy, timestamps, and outcome. If those answers only live inside a model's context window, the system cannot manage a fleet.

OpenClaw's registry - an in-memory Map persisted to ~/.openclaw/subagents/runs.json - survives process restarts. On startup it resumes pending announcements for completed runs and re-issues agent.wait for in-flight ones. The colon-delimited session key encodes routing context directly, so no separate routing table is needed.
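
A rough sketch of that pattern, under stated assumptions: an in-memory map of run records, written to a JSON file on every change, and a recovery pass that splits runs into "finished but not yet announced" and "still in flight". RunRecord, RunRegistry, and the field names are hypothetical, and the write-then-rename step is one common way to avoid a torn file, not a claim about how OpenClaw does it.

```python
import json
import os
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List, Optional, Tuple


@dataclass
class RunRecord:
    """Hypothetical record; fields follow the tracking list in the text."""
    run_id: str
    child_session_key: str
    requester: str
    spawn_depth: int
    cleanup: str                      # e.g. "keep" or "delete" (assumed values)
    created_at: float
    ended_at: Optional[float] = None
    outcome: Optional[str] = None
    announced: bool = False


class RunRegistry:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.runs: Dict[str, RunRecord] = {}

    def load(self) -> None:
        if self.path.exists():
            raw = json.loads(self.path.read_text())
            self.runs = {rid: RunRecord(**rec) for rid, rec in raw.items()}

    def save(self) -> None:
        # Write to a temp file, then rename over the target in one step.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({k: asdict(v) for k, v in self.runs.items()}))
        os.replace(tmp, self.path)

    def put(self, record: RunRecord) -> None:
        self.runs[record.run_id] = record
        self.save()

    def recover(self) -> Tuple[List[str], List[str]]:
        """Return (run IDs to announce, run IDs to wait on again)."""
        runs = self.runs.values()
        to_announce = [r.run_id for r in runs if r.ended_at is not None and not r.announced]
        to_wait = [r.run_id for r in runs if r.ended_at is None]
        return to_announce, to_wait
```

After a restart, a fresh RunRegistry pointed at the same file can call load() and then recover() to decide what to resume.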

Completion, in a true swarm, is not a return value. It is a routing problem. The parent may be active, idle, restarted, or gone when the child finishes. OpenClaw handles this with a push-based model: sessions_spawn returns acceptance and bookkeeping, not a result. The result arrives later as a task_completion event routed back to the requester session - with delivery policy that can queue, retry, steer an active session, or fall back to direct send. Most delegation systems skip this entirely.
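
The routing decision can be sketched in a few lines. This is an illustration only: the three requester states, the CompletionRouter class, and the returned action names are assumptions, and retry handling is left out to keep the example short.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class RequesterState(Enum):
    ACTIVE = "active"   # mid-run, so the result can be steered in
    IDLE = "idle"       # alive, but no run in flight
    GONE = "gone"       # restarted, deleted, or unknown


@dataclass
class TaskCompletion:
    run_id: str
    requester: str
    summary: str


@dataclass
class CompletionRouter:
    """Hypothetical router: picks a delivery path from the requester's state."""
    states: Dict[str, RequesterState]
    steered: List[TaskCompletion] = field(default_factory=list)
    queued: List[TaskCompletion] = field(default_factory=list)
    direct: List[TaskCompletion] = field(default_factory=list)

    def route(self, event: TaskCompletion) -> str:
        state = self.states.get(event.requester, RequesterState.GONE)
        if state is RequesterState.ACTIVE:
            self.steered.append(event)   # inject into the running session
            return "steer"
        if state is RequesterState.IDLE:
            self.queued.append(event)    # hold for the requester's next turn
            return "queue"
        self.direct.append(event)        # fall back to a direct send
        return "direct"
```

The point of the sketch is that the child never needs to know which path was taken; it reports once, and the runtime owns delivery.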

Queues, Roles, and Cascade

Once agents spawn agents, concurrency becomes a runtime responsibility. A user sends a follow-up while an agent is mid-run. A child finishes while its parent is busy. A steer message arrives while a model is streaming. If all of these are treated as interchangeable messages on a single queue, the system breaks fast. Queue policy - not better prompting - is the answer. OpenClaw's lane architecture separates main, subagent, and background work into distinct lanes, each with its own concurrency limit. SwarmClaw (the multi-gateway layer built on OpenClaw) caps parallel fan-out at 4 branches by default (hard cap 16), with join policies: quorum cancels the remaining branches once N succeed, and first resolves on the first success.
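
A capped fan-out with a quorum join can be sketched with asyncio. The fan_out function and its parameters are hypothetical; only the default of 4 and the hard cap of 16 come from the text. The "first" policy is treated here as a quorum of 1, which is an assumption about how the two relate.

```python
import asyncio
from typing import Any, Awaitable, Callable, List

HARD_CAP = 16  # hard cap on parallel branches, per the text

Branch = Callable[[], Awaitable[Any]]


async def fan_out(branches: List[Branch], *, quorum: int, max_parallel: int = 4) -> List[Any]:
    """Run branches with bounded parallelism; stop once `quorum` have succeeded."""
    sem = asyncio.Semaphore(min(max_parallel, HARD_CAP))

    async def run(branch: Branch) -> Any:
        async with sem:
            return await branch()

    tasks = [asyncio.create_task(run(b)) for b in branches]
    successes: List[Any] = []
    try:
        for next_done in asyncio.as_completed(tasks):
            try:
                successes.append(await next_done)
            except Exception:
                continue  # a failed branch does not count toward the quorum
            if len(successes) >= quorum:
                break
    finally:
        # Cancel whatever is still running, then wait for the cancellations.
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return successes
```

If fewer than quorum branches succeed, the function returns the successes it collected; a real runtime would also need to decide whether that counts as failure.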

Flat swarms do not scale safely. Hermes enforces roles via max_spawn_depth (default 1, max 3) and a global kill switch for orchestrator behavior. OpenClaw goes harder: subagents cannot spawn subagents, full stop - a hard constraint enforced by the runtime, not suggested in a prompt. The model can request a spawn. The runtime decides if it is legal.
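
The "model requests, runtime decides" split is small enough to sketch directly. SpawnPolicy, SpawnDenied, and authorize_spawn are hypothetical names, and the depth convention (the top-level agent is depth 0, its children depth 1) is an assumption. With max_spawn_depth set to 1, a child can never spawn.

```python
from dataclasses import dataclass


class SpawnDenied(Exception):
    """Raised when the runtime refuses a spawn request."""


@dataclass(frozen=True)
class SpawnPolicy:
    max_spawn_depth: int = 1   # 1 means only the top-level agent may spawn
    max_children: int = 3      # concurrent children per requester (illustrative)


def authorize_spawn(policy: SpawnPolicy, requester_depth: int, live_children: int) -> int:
    """Return the depth the child would run at, or raise SpawnDenied."""
    child_depth = requester_depth + 1
    if child_depth > policy.max_spawn_depth:
        raise SpawnDenied(f"depth {child_depth} exceeds limit {policy.max_spawn_depth}")
    if live_children >= policy.max_children:
        raise SpawnDenied(f"requester already has {live_children} live children")
    return child_depth
```

The check runs before any child exists, so a denied request costs nothing and leaves no state to clean up.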

When a child is doing the wrong thing, killing it and losing the session is often the wrong move. Steering is more powerful: in OpenClaw, a steer-restart marks the run, suppresses stale announcements, aborts the in-flight run, clears queues, and remaps the registry to the new run. Kill cascades down the tree. These are control-plane operations. You cannot ask a model to manually track and clean up every live descendant. The runtime has to own the graph.
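
Cascading a kill is a plain tree walk once the runtime holds the parent-child graph. The RunTree class and its status strings below are hypothetical, a sketch of the idea rather than OpenClaw's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class RunTree:
    """Hypothetical parent-child graph of runs, owned by the runtime."""

    def __init__(self) -> None:
        self.children: Dict[str, List[str]] = defaultdict(list)
        self.status: Dict[str, str] = {}

    def add(self, run_id: str, parent: Optional[str] = None) -> None:
        self.status[run_id] = "running"
        if parent is not None:
            self.children[parent].append(run_id)

    def kill(self, run_id: str) -> List[str]:
        """Kill run_id and every live descendant; return the IDs killed."""
        killed: List[str] = []
        stack = [run_id]
        while stack:
            current = stack.pop()
            if self.status.get(current) == "running":
                self.status[current] = "killed"
                killed.append(current)
            # Keep walking even through finished nodes: they may have live children.
            stack.extend(self.children.get(current, []))
        return killed
```

Siblings of the killed run are untouched, which is what distinguishes a scoped cascade from cancelling everything under the parent.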

Recovery and Cleanup

A swarm manager cannot fire-and-forget. It needs to know when children are stuck, orphaned, completed, or completed-but-undelivered. OpenClaw's registry uses two parallel mechanisms: an in-process lifecycle event listener and a cross-process agent.wait RPC. The sweeper - unglamorous but essential - checks runs without live context, reconciles against session state, expires stale records, retries stuck states, and marks delivery failed rather than leaving cleanup half-done indefinitely.
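
One sweep pass might look like the sketch below. The state names, the 600-second staleness threshold, and the retry limit of 3 are illustrative assumptions, as are the TrackedRun type and the sweep function; the text does not give OpenClaw's actual thresholds.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TrackedRun:
    run_id: str
    state: str                 # "running", "completed", "delivered", "expired", "delivery_failed"
    updated_at: float
    session_alive: bool = True
    delivery_attempts: int = 0


def sweep(
    runs: Dict[str, TrackedRun],
    now: float,
    *,
    stale_after: float = 600.0,
    max_attempts: int = 3,
) -> List[Tuple[str, str]]:
    """Reconcile every record once and return the (run_id, action) pairs taken."""
    actions: List[Tuple[str, str]] = []
    for run in runs.values():
        if run.state == "running":
            # Orphaned (no live session) or silent for too long: expire it.
            if not run.session_alive or now - run.updated_at > stale_after:
                run.state = "expired"
                actions.append((run.run_id, "expire"))
        elif run.state == "completed":
            # Finished but the result never reached the requester.
            if run.delivery_attempts >= max_attempts:
                run.state = "delivery_failed"
                actions.append((run.run_id, "give_up"))
            else:
                run.delivery_attempts += 1
                actions.append((run.run_id, "retry_delivery"))
    return actions
```

Every record ends each pass in a definite state, which is the property that keeps cleanup from stalling half-done.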

This is OS-level work. A process table. A reaper. Every swarm manager eventually becomes a cleanup system because subagents create external state - transcripts, browser sessions, MCP runtimes, workspace files, cost metadata - that outlives the model's attention. In bounded delegation, cleanup is local (thread exits, parent gets JSON). In swarm management, cleanup is distributed and stateful. OpenClaw's file-level locking uses atomic creation with 30-second stale-lock eviction and 25ms polling, backed by an in-memory cache with a 45-second TTL. No database, no ORM - flat JSON/JSONL you can cat and jq.
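
A lock built on atomic file creation can be sketched as follows, using the 30-second stale threshold and 25 ms polling interval mentioned above. The file_lock function, the ".lock" suffix, and the timeout parameter are assumptions for illustration. Stale eviction by file age is inherently racy when two processes evict at once, so treat this as a sketch of the technique rather than a hardened lock.

```python
import os
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def file_lock(path: str, *, stale_after: float = 30.0, poll: float = 0.025,
              timeout: float = 5.0) -> Iterator[None]:
    """Hold <path>.lock for the duration of the with-block."""
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists: atomic acquire.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lock_path)
            except FileNotFoundError:
                continue  # holder released between our two calls; retry now
            if age > stale_after:
                try:
                    os.unlink(lock_path)  # evict a lock left by a dead process
                except FileNotFoundError:
                    pass
                continue
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield
    finally:
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
```

A caller would wrap each read-modify-write of the JSON file in `with file_lock(path):` so that concurrent processes serialize on the lock file.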

The Layer Above the Harness

OpenClaw is instructive precisely because it has no magical SwarmManager class. The swarm emerges from ordinary runtime machinery: session keys, lanes, run IDs, registry records, lifecycle events, queue policies, delivery routing, cleanup decisions, recovery sweeps. Boring control-plane primitives that make many agents survivable.

Hermes shows what good delegation looks like. OpenClaw shows what happens when delegation becomes session infrastructure. The 2026 question is not whether an agent can call tools - that was the harness question, answered. The question is: where do agents live, who owns them, how do they report back, how are they stopped, and what survives after restart? That is the layer that turns a single agent harness into a fleet. And it is the next real systems problem in AI.

Sources: SwarmClaw GitHub, Hermes Agent Docs, OpenClaw Sessions Deep Dive, Conductor vs Swarm (Agix).