Mesh Architecture & Coordination
The flow control layer. How agents discover each other, hand off work, recover from failures, and propagate context — without you writing the supervisor.
Flow patterns
- Chain — Sequential handoff from agent A to agent B to agent C, with each agent receiving the previous agent's output as input. Used for linear processes like greeting → classification → response.
- Fan-out (parallel dispatch) — One agent triggers multiple downstream agents concurrently, all working in parallel. Supports parallelism caps to prevent flooding shared resources like database connections or API quotas.
- Fan-in (merge with boolean logic) — Multiple agents converge their outputs into a single downstream agent. Supports full boolean expressions like processor AND (researcher OR fallback_researcher). Handles late arrivals, skipped contributors, and failed contributors with explicit reporting.
- Dual mode — Agent A asks agent B a question, agent B replies back to agent A specifically, and agent A continues from there. Used for consultant patterns where one agent delegates a subquestion and weaves the answer back into its own reasoning.
- Bounded loops — Agents can call themselves or upstream agents, with a per-session rerun budget that prevents runaway retries. The Manager agent intervenes if the budget is exhausted.
- Conditional branching — Every routing decision can carry a Python expression evaluated against the agent's response (e.g., confidence > 0.7). Routes that don't match are silently skipped.
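The fan-in readiness rule above can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: the function name fan_in_ready is hypothetical, and it assumes agent names are valid Python identifiers and that only AND, OR, NOT, and parentheses appear in the expression.

```python
import ast
import re

def fan_in_ready(expression: str, completed: set) -> bool:
    """Return True when the fan-in expression is satisfied by the completed agents.

    Hypothetical helper: an agent name evaluates to True once that agent
    has completed, and the boolean operators combine those values.
    """
    # Translate the mesh-style operators into Python's boolean keywords.
    py_expr = re.sub(r"\bAND\b", "and", expression)
    py_expr = re.sub(r"\bOR\b", "or", py_expr)
    py_expr = re.sub(r"\bNOT\b", "not", py_expr)
    tree = ast.parse(py_expr, mode="eval")

    def evaluate(node):
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.BoolOp):
            results = [evaluate(value) for value in node.values]
            return all(results) if isinstance(node.op, ast.And) else any(results)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not evaluate(node.operand)
        if isinstance(node, ast.Name):
            return node.id in completed
        # Anything else (calls, attribute access, literals) is rejected.
        raise ValueError(f"unsupported syntax in fan-in expression: {ast.dump(node)}")

    return evaluate(tree)
```

Walking the syntax tree instead of calling eval keeps the expression limited to names and boolean operators. With the expression from the list above, a session where processor and fallback_researcher have completed is ready to merge, while one where only researcher has completed is not.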
Dynamic per-call routing (human agents)
YAML sets the default — operator_ids for the logical roster and an operators: map inside each channel block to resolve that roster to per-channel recipient IDs (Slack user IDs, WhatsApp numbers, email addresses). Pre-compose then steers per call by setting _target_operator in input_data — same key drives all configured channels simultaneously. Customers think in terms of people; each channel resolves its own platform-specific IDs.
- Channel-scoped operators — Operators live inside the channel they're reachable on: channels.slack.operators holds Slack IDs, channels.whatsapp.operators holds phone numbers. Adding or removing a channel takes its operator info with it; no orphaned IDs in a separate top-level map.
- Single key, every transport — _target_operator: 'alice' resolves to alice's Slack DM, WhatsApp number, and email address in parallel; each channel pulls from its own operators map. No per-platform code paths in the customer's pre_compose.
- Fail-soft, never drops — If a _target_operator has no entry under a given channel's operators map, the SDK logs a warning and falls back through session continuity → YAML post_channel → webhook outbound_url for that channel. Messages always land somewhere the operator pre-approved.
- Backward compatible — Legacy _operator_ids list override still works on the default inbox. Raw _target_channel_id is the escape hatch when the recipient ID comes from a dynamic source. Agents not using the new keys behave exactly as before; the override layer is purely additive.
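The per-channel resolution and fall-back order described above can be sketched as plain dictionary lookups. Everything here is an assumption for illustration: the CHANNELS layout mirrors the YAML keys named in the text, the recipient IDs are placeholders, and resolve_recipient and resolve_all are hypothetical helpers rather than SDK functions.

```python
import logging

logger = logging.getLogger("mesh.routing")

# Hypothetical config, shaped like the YAML described above (placeholder IDs).
CHANNELS = {
    "slack": {"operators": {"alice": "U012ABCDEF"}, "post_channel": "C0SUPPORT"},
    "whatsapp": {"operators": {"alice": "+15550100"},
                 "outbound_url": "https://example.com/hook"},
    "email": {"operators": {}, "post_channel": "support@example.com"},
}

def resolve_recipient(channel, channel_cfg, target_operator, session_recipient=None):
    """Resolve one channel's recipient, falling back when the operator is unmapped."""
    operators = channel_cfg.get("operators", {})
    if target_operator in operators:
        return operators[target_operator]
    logger.warning("no %r entry under channels.%s.operators; falling back",
                   target_operator, channel)
    # Fall-back order from the text:
    # session continuity -> YAML post_channel -> webhook outbound_url.
    for fallback in (session_recipient,
                     channel_cfg.get("post_channel"),
                     channel_cfg.get("outbound_url")):
        if fallback:
            return fallback
    return None  # nothing configured for this channel at all

def resolve_all(channels, input_data, session_recipients=None):
    """Apply the single _target_operator key to every configured channel."""
    target = input_data.get("_target_operator")
    session_recipients = session_recipients or {}
    return {
        name: resolve_recipient(name, cfg, target, session_recipients.get(name))
        for name, cfg in channels.items()
    }
```

Calling resolve_all(CHANNELS, {"_target_operator": "alice"}) yields alice's Slack and WhatsApp IDs directly, while the email channel, which has no alice entry, logs a warning and lands on its post_channel.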
Per-caller response behavior
An agent's communication_type (dual / chain / execute) is the default for every caller. response_overrides lets that decision vary per upstream caller: the same agent replies back to a customer-facing router (dual) but stays silent when an internal processor calls it (chain). External entry points (mesh_call, event listeners, scheduled wake_up) have no preceding caller and always use the default; overrides are internal-call only by design.
- Default + override — Top-level communication_type is the fallback. Each entry in response_overrides names a specific caller agent and the type the receiving agent uses when called by them. Lookup is O(1) per dispatch.
- List or dict in YAML — Both shapes are accepted: a list of {caller, type} for readability, or a terser dict {caller_name: type}. The AgentConfig validator normalises both into a dict for the runtime.
- Studio-aware UI — Agent inspector restricts the caller picker to agents whose can_call actually targets this one. If nobody calls this agent and no overrides exist, the panel hides; no dead config surface.
- External traffic respects default — mesh_call from outside, event-listener triggers, and scheduled wake_up calls all bypass the override map — they have no called_by, so they use communication_type unchanged.
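A minimal sketch of the normalise-then-lookup behavior described above. The function names normalise_overrides and effective_type are hypothetical; only the keys caller and type, the three communication types, and the called_by concept come from the text.

```python
VALID_TYPES = {"dual", "chain", "execute"}

def normalise_overrides(raw):
    """Accept either YAML shape and return a {caller_name: type} dict."""
    if raw is None:
        return {}
    if isinstance(raw, dict):
        overrides = dict(raw)  # terse shape: {caller_name: type}
    else:
        # readable shape: [{"caller": ..., "type": ...}, ...]
        overrides = {entry["caller"]: entry["type"] for entry in raw}
    for caller, comm_type in overrides.items():
        if comm_type not in VALID_TYPES:
            raise ValueError(f"unknown communication type {comm_type!r} for caller {caller!r}")
    return overrides

def effective_type(default_type, overrides, called_by=None):
    """Pick the communication type for one dispatch."""
    if called_by is None:
        # External entry points have no caller, so the default always applies.
        return default_type
    # A single dict lookup, which is where the O(1) per dispatch comes from.
    return overrides.get(called_by, default_type)
```

For example, an agent whose default is chain and whose overrides map router to dual replies back when router calls it, stays silent for any other internal caller, and uses chain for every external entry point.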
Routing decisions
- Declarative conditions — Each can_call entry takes a condition expression. If the condition does not hold, the target is not dispatched. No if-else trees in code.
- Narration-driven routing — Agents can describe routing intent in plain English. The Manager parses narration and applies it as a routing hint.
- Learned routing — Every dispatch outcome is recorded per agent pair (success, failure, timeout, cancelled). Over time, the platform biases routing toward paths with historically higher success rates. Routing statistics persist across SDK restarts and are multi-process safe.
- Manager arbitration — When multiple can_call targets qualify, the built-in Manager picks based on routing memory, agent health, in-flight load, and confidence.
- Routing memory persistence — Stored in Redis under a hash per upstream agent, surviving SDK restarts, exposed via SDK API for analytics.
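The interplay of declarative conditions and learned routing can be sketched as follows. This is an illustration under stated assumptions, not the Manager's algorithm: RoutingMemory is an in-memory stand-in for the Redis-backed store, the smoothed success rate is one plausible scoring choice, and the sketch ignores agent health and in-flight load. The eval call with builtins removed is for brevity only and is not a security sandbox.

```python
from collections import defaultdict

class RoutingMemory:
    """In-memory stand-in for per-agent-pair dispatch statistics."""

    def __init__(self):
        # (upstream, target) -> outcome name -> count
        self._stats = defaultdict(lambda: defaultdict(int))

    def record(self, upstream, target, outcome):
        self._stats[(upstream, target)][outcome] += 1

    def success_rate(self, upstream, target):
        counts = self._stats.get((upstream, target), {})
        total = sum(counts.values())
        # Add-one smoothing: a pair with no history scores 0.5.
        return (counts.get("success", 0) + 1) / (total + 2)

def matches(condition, response):
    """Evaluate a can_call condition against the agent's response fields."""
    if not condition:
        return True
    try:
        return bool(eval(condition, {"__builtins__": {}}, dict(response)))
    except Exception:
        return False  # a condition that cannot be evaluated never dispatches

def pick_target(upstream, can_call, response, memory):
    """Filter can_call entries by condition, then prefer the best track record."""
    qualified = [entry["agent"] for entry in can_call
                 if matches(entry.get("condition"), response)]
    if not qualified:
        return None
    return max(qualified, key=lambda target: memory.success_rate(upstream, target))
```

With two targets that share the condition confidence > 0.7, the one with more recorded successes wins; when no condition matches, nothing is dispatched.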
Safety nets
- Cycle detector layer 1 — Monotonic depth counter capped at 50 events per session by default. Configurable upward for research and batch fan-out workloads.
- Cycle detector layer 2 — Agent-sequence loop detection on the last 8 agents in a session. Catches A→B→A ping-pong at depth 3 instead of letting it run to 50.
- Loop guard — Detects immediate back-edges where the caller's prior position in history matches the current target. Publishes an INTERVENTION_NEEDED event with full diagnostic context.
- Chain completeness watchdog — Every dispatch schedules an asyncio task that fires after a configurable timeout. If expected agents never completed, the Manager intervenes. Smart-skips fan-in targets, human-type targets, and already-completed agents.
- Pause-and-resume — Sessions can be paused mid-flow with explicit pause types: HITL pending, workflow paused, awaiting human response. Manager defers all interventions while paused. Resume restores OpenTelemetry trace context.
- Status gating — When a session enters outage, stopped, cancelled, or escalated status, all further chain dispatch is halted.
- Auto-reset on new mesh_call — When a customer initiates a new entry-point call on a stale session, status auto-resets to active.
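The two cycle-detector layers can be sketched together. The class name, return values, and exact loop rule are assumptions: the text gives only the default cap of 50, the 8-agent window, and the A→B→A example, so this sketch uses the simplest interpretation that reproduces them.

```python
from collections import deque

class CycleDetector:
    """Hypothetical per-session detector combining both layers."""

    def __init__(self, max_depth=50, window=8):
        self.max_depth = max_depth
        self.depth = 0                      # layer 1: monotonic event counter
        self.recent = deque(maxlen=window)  # layer 2: last N dispatched agents

    def check(self, agent):
        """Record a dispatch; return a reason string if it should be blocked."""
        self.depth += 1
        if self.depth > self.max_depth:
            return "depth_exceeded"
        self.recent.append(agent)
        return self._loop_reason()

    def _loop_reason(self):
        history = list(self.recent)
        # A -> B -> A: the same agent reappears two hops later.
        if len(history) >= 3 and history[-1] == history[-3] != history[-2]:
            return "ping_pong"
        # Longer cycles: the last `period` agents repeat the `period` before them.
        for period in range(3, len(history) // 2 + 1):
            if history[-period:] == history[-2 * period:-period]:
                return "repeated_sequence"
        return None
```

Feeding the sequence A, B, A flags the third dispatch as a ping-pong, well before the depth cap would trigger. A real detector would also need to distinguish a legitimate dual-mode reply from a loop, which this sketch does not attempt.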
Chain context propagation
- Bounded chain history — Every dispatched agent input automatically carries a list of prior hops with agent name, output summary, timestamp.
- Configurable depth — Default 5 hops back, configurable via mesh.chain_context_depth. Set to 0 to disable globally.
- Per-entry character cap — Default 500 characters per hop with truncation marker.
- Hard total byte ceiling — Keeps chain history within LLM prompt limits.
- Per-target snapshot isolation — Each downstream target receives its own snapshot copy.
- Fan-in dedup — Fan-in targets get a fresh top-level snapshot; redundant copies stripped from contributor payloads.
- Universal rendering — LLM, programmatic, human inbox, external connector — all see chain history through their native access pattern.
- Pre-compose override — When customer's pre-compose populates prepared_data, auto-history defers to customer control.
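The depth limit, per-entry cap, and total ceiling can be combined in one small function. This is a sketch under assumptions: build_chain_context is a hypothetical name, the 4096-byte ceiling and the truncation marker text are illustrative values the text does not specify, and dropping the oldest hop first is one possible policy for staying under the ceiling.

```python
import json

def build_chain_context(hops, depth=5, max_chars=500,
                        max_total_bytes=4096, marker="...[truncated]"):
    """Return a fresh, bounded snapshot of the most recent hops."""
    if depth <= 0:
        return []  # depth 0 disables chain history
    context = []
    for hop in hops[-depth:]:
        summary = hop["output_summary"]
        if len(summary) > max_chars:
            summary = summary[:max_chars] + marker
        # New dicts per call, so each downstream target gets its own copy.
        context.append({
            "agent": hop["agent"],
            "output_summary": summary,
            "timestamp": hop["timestamp"],
        })
    # Enforce the total ceiling by dropping the oldest hops first (assumed policy).
    while context and len(json.dumps(context).encode("utf-8")) > max_total_bytes:
        context.pop(0)
    return context
```

Calling the function once per downstream target gives per-target snapshot isolation for free, because no dictionaries are shared between the returned lists or with the original hop records.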
Event sourcing as substrate
Every state transition in the mesh emits a typed event onto Redis Streams. Agent registration, every dispatch decision, every manager intervention, every HITL request, every channel delivery, every tool call, every config change. Streams are append-only, timestamped, and durable.
- Six built-in consumer groups — Agent registry, session manager, mesh communicator, LLM cache, manager decisions, config processors.
- Subscribable from customer code — sdk.event_bus.subscribe(EventType.X, handler) attaches user handlers to any event type.
- Multi-process safe — Consumer groups distribute event handling across worker processes without external coordination.
- Survives restarts — The event log is durable; subscribers resume from their last acknowledged position.
- Powers the rest of the platform — Manager arbitration, learned routing, cost analytics, ADK Studio dashboards, agent feed, and observability all read the same event stream rather than maintaining parallel state.
Event sourcing is what makes the audit story real — every state change is reconstructable from the stream, not patched together from logs.
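The append-only log and resume-from-last-acknowledged behavior can be illustrated without Redis. The sketch below is an in-memory stand-in, not the SDK's event bus: InMemoryEventLog, Event, and the consume method are hypothetical, and of the two EventType members only INTERVENTION_NEEDED is named in the text above (DISPATCHED is invented for the example).

```python
import time
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):
    INTERVENTION_NEEDED = "intervention_needed"
    DISPATCHED = "dispatched"  # hypothetical member for illustration

@dataclass(frozen=True)
class Event:
    offset: int
    type: EventType
    payload: dict
    timestamp: float

class InMemoryEventLog:
    """Append-only log with a per-subscriber acknowledged position."""

    def __init__(self):
        self._events = []
        self._acked = {}  # subscriber name -> next offset to deliver

    def append(self, event_type, payload):
        event = Event(len(self._events), event_type, dict(payload), time.time())
        self._events.append(event)  # events are only ever appended
        return event

    def consume(self, subscriber, event_type, handler):
        """Deliver unseen events of one type, acknowledging each offset passed."""
        delivered = 0
        for event in self._events[self._acked.get(subscriber, 0):]:
            if event.type is event_type:
                handler(event)
                delivered += 1
            self._acked[subscriber] = event.offset + 1
        return delivered

    def replay(self, event_type=None):
        """Rebuild a view from the full log, as an audit or dashboard would."""
        return [e for e in self._events
                if event_type is None or e.type is event_type]
```

A subscriber that consumes, stops, and consumes again later picks up only the events appended in between, while replay always sees the complete history. That separation between per-consumer position and the shared log is what lets several readers derive state from one stream.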