Lesson 5 of 21 · Track 2

Multi-agent communication

Message passing patterns

Direct, pub/sub, and shared state approaches.

How agents actually talk

Once you have more than one agent, you have a transport question: how does work get from agent A to agent B, and how do answers come back? The answer shapes the rest of your architecture more than people expect. Pick the wrong transport and your agents will spend more turns coordinating than thinking.

Three patterns cover almost every system you will build: direct call, shared state, and pub/sub. They differ in coupling, observability, and how they behave when an agent fails.

Direct call

The simplest pattern. Agent A calls agent B as if it were a function.

def code_agent(task: str) -> str:
    messages = [
        {"role": "system", "content": CODE_PROMPT},
        {"role": "user", "content": task},
    ]
    return run_loop(messages, code_tools, code_registry)
 
 
def orchestrator(user_input: str) -> str:
    if needs_code_work(user_input):
        return code_agent(user_input)
    return general_agent(user_input)

That is it. No queue, no broker, no shared memory. Calling agent B is just a function call that happens to involve an LLM.

When direct call fits

  • Synchronous workflows where you wait for a result before doing anything else.
  • Small fanout (one orchestrator, a handful of workers).
  • Short tasks. If a worker takes minutes to finish, the caller is blocked for that whole time and can do nothing else.

Where it breaks

  • Parallelism is awkward. You end up using threads or async to fan out and recollect.
  • The caller waits for the slowest worker.
  • Failures cascade. If worker B raises, the exception propagates straight into orchestrator A, which fails too unless it catches it.

For most early-stage multi-agent systems, direct call is enough. Do not invent a queue before you need one.
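When you do need to fan out from a direct-call orchestrator, it might look something like the sketch below. It assumes thread-based workers from the standard library; `fan_out` and both stub agents are hypothetical names standing in for real agent loops. Each result is wrapped so that one failing worker does not take the orchestrator down with it.

```python
from concurrent.futures import ThreadPoolExecutor


def code_agent(task: str) -> str:
    # Hypothetical stub standing in for a real agent loop.
    return f"code findings for: {task}"


def ops_agent(task: str) -> str:
    # Hypothetical stub that always fails, to show how errors surface.
    raise RuntimeError("deploy API unreachable")


def fan_out(task: str, workers: dict) -> dict:
    """Run every worker on the same task in its own thread and collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in workers.items()}
        for name, future in futures.items():
            try:
                # result() blocks until that worker finishes and re-raises
                # whatever exception the worker raised.
                results[name] = {"ok": True, "value": future.result()}
            except Exception as exc:
                results[name] = {"ok": False, "error": str(exc)}
    return results


results = fan_out("Fix the staging deploy", {"code": code_agent, "ops": ops_agent})
```

Note that the orchestrator still waits for the slowest worker: the loop only returns once every future has resolved.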

Shared state

In this pattern, agents do not call each other directly. They read and write a shared object: a dict, a database row, a file. Coordination happens by inspecting the shared state and reacting.

state = {
    "task": "Fix the staging deploy",
    "code_findings": None,
    "ops_findings": None,
    "final_answer": None,
}
 
 
def code_agent(state):
    if state["code_findings"] is None:
        state["code_findings"] = analyze_code(state["task"])
 
 
def ops_agent(state):
    if state["ops_findings"] is None:
        state["ops_findings"] = check_deploy(state["task"])
 
 
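def synthesizer(state):
    # Compare against None: an empty-but-valid finding ("" or []) should
    # still count as "present".
    if state["code_findings"] is not None and state["ops_findings"] is not None:
        state["final_answer"] = combine(
            state["code_findings"], state["ops_findings"]
        )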

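You can run the workers in parallel. Each one is a small function that checks the state, decides whether it has work to do, does it, and writes the result back. The synthesizer does nothing until both inputs are present, so something (a loop, a scheduler, or a final step) has to call it again once the workers have finished.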
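One way to drive these functions, as a minimal sketch: run the two workers in threads, wait for both, then call the synthesizer once. The `run` driver and the three stubbed helpers (`analyze_code`, `check_deploy`, `combine`) are assumptions added so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_code(task):  # hypothetical stub
    return "code: OK"


def check_deploy(task):  # hypothetical stub
    return "ops: disk full on staging"


def combine(code_findings, ops_findings):  # hypothetical stub
    return f"{code_findings} | {ops_findings}"


def code_agent(state):
    if state["code_findings"] is None:
        state["code_findings"] = analyze_code(state["task"])


def ops_agent(state):
    if state["ops_findings"] is None:
        state["ops_findings"] = check_deploy(state["task"])


def synthesizer(state):
    if state["code_findings"] is not None and state["ops_findings"] is not None:
        state["final_answer"] = combine(
            state["code_findings"], state["ops_findings"]
        )


def run(state):
    # The two workers write different keys, so they can safely overlap here.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(worker, state) for worker in (code_agent, ops_agent)]
        for future in futures:
            future.result()  # wait for the worker and re-raise any exception
    # Both inputs are now present, so a single synthesizer call is enough.
    synthesizer(state)
    return state


state = run({
    "task": "Fix the staging deploy",
    "code_findings": None,
    "ops_findings": None,
    "final_answer": None,
})
```

Dumping `state` at any point shows exactly which fields have been filled in, which is the inspectability this pattern buys you.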

When shared state fits

  • Parallel work that converges into a single answer.
  • Long-running tasks where workers come and go.
  • Workflows you want to be inspectable: dump the state and see exactly where things are.

Where it breaks

  • Concurrency bugs become possible (two agents writing the same field).
  • Without explicit ordering, the state passes through partially filled intermediate shapes (one finding present, the other still missing) that the synthesizer has to tolerate.
  • The state object becomes a god-object. Every change to what a worker produces means a change to the state schema.
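A sketch of one way to blunt the first and third problems, assuming all agents live in one process: put the dict behind a lock, fix the set of fields up front, and let each field be written only once. `SharedState` and `write_once` are hypothetical names, not part of any library.

```python
import threading


class SharedState:
    # The fixed field list doubles as the schema shared by all agents.
    FIELDS = ("task", "code_findings", "ops_findings", "final_answer")

    def __init__(self, task):
        self._lock = threading.Lock()
        self._data = {field: None for field in self.FIELDS}
        self._data["task"] = task

    def read(self, field):
        with self._lock:
            return self._data[field]

    def write_once(self, field, value):
        """Set the field only if it is still None. Return True if this call won."""
        if field not in self.FIELDS:
            raise KeyError(f"unknown field: {field}")
        with self._lock:
            if self._data[field] is not None:
                return False  # another agent already wrote this field
            self._data[field] = value
            return True

    def snapshot(self):
        # Copy under the lock so callers never see a half-updated dict.
        with self._lock:
            return dict(self._data)


state = SharedState("Fix the staging deploy")
first = state.write_once("code_findings", "from agent 1")
second = state.write_once("code_findings", "from agent 2")
```

Here `first` is True and `second` is False: the second writer finds out it lost instead of silently overwriting the first result. A database row with a conditional update would play the same role across processes.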

Shared state shines when the state schema is genuinely the API between agents. We come back to this in Module 4.

Pub/sub

Agents publish events to topics. Other agents subscribe to topics they care about. Nobody calls anyone directly.

from collections import defaultdict
 
 
bus = defaultdict(list)
 
 
def publish(topic, event):
    for handler in bus[topic]:
        handler(event)
 
 
def subscribe(topic, handler):
    bus[topic].append(handler)
 
 
def code_agent_handler(event):
    findings = analyze_code(event["task"])
    publish("findings.code", {"task_id": event["task_id"], "findings": findings})
 
 
def ops_agent_handler(event):
    findings = check_deploy(event["task"])
    publish("findings.ops", {"task_id": event["task_id"], "findings": findings})
 
 
subscribe("task.received", code_agent_handler)
subscribe("task.received", ops_agent_handler)
 
publish("task.received", {"task_id": "t1", "task": "Fix the staging deploy"})

Now the orchestrator does not even know which agents exist. It publishes a task; whoever is listening reacts.
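Someone still has to put the findings back together. A rough sketch of a collector that subscribes to both findings topics and publishes an answer once it has seen both for a given task. The `answer.ready` topic, `make_collector`, and the `pending` and `answers` dicts are assumptions for illustration.

```python
from collections import defaultdict

bus = defaultdict(list)
pending = {}  # task_id -> findings received so far, keyed by source
answers = {}  # task_id -> final answer


def subscribe(topic, handler):
    bus[topic].append(handler)


def publish(topic, event):
    # Same synchronous in-process bus as in the lesson.
    for handler in list(bus.get(topic, [])):
        handler(event)


def make_collector(source):
    def handler(event):
        slot = pending.setdefault(event["task_id"], {})
        slot[source] = event["findings"]
        # Only publish once both halves have arrived for this task.
        if {"code", "ops"} <= slot.keys():
            publish("answer.ready", {
                "task_id": event["task_id"],
                "answer": f"{slot['code']} | {slot['ops']}",
            })
    return handler


def store_answer(event):
    answers[event["task_id"]] = event["answer"]


subscribe("findings.code", make_collector("code"))
subscribe("findings.ops", make_collector("ops"))
subscribe("answer.ready", store_answer)

publish("findings.code", {"task_id": "t1", "findings": "no code changes"})
publish("findings.ops", {"task_id": "t1", "findings": "disk full"})
```

The `task_id` is what makes this work: without a correlation key, the collector cannot tell which findings belong together.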

When pub/sub fits

  • Many independent agents reacting to the same event.
  • Loose coupling matters more than control.
  • You want the ability to add new agents without modifying existing ones.

Where it breaks

  • Hard to reason about who runs when.
  • No dead-letter handling by default: if nobody subscribes to a topic, the event is lost silently.
  • Circular triggers. Agent A publishes X, which triggers B, which publishes Y, which triggers A.
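The last two problems can be guarded against in the bus itself. A minimal sketch, assuming a synchronous in-process bus: events with no subscriber go to a dead-letter list, and a hop counter stops circular triggers. `MAX_HOPS`, `dead_letters`, and the two-argument handler signature are choices made for this example, not a standard API.

```python
from collections import defaultdict

MAX_HOPS = 5  # assumed limit; tune to your deepest legitimate event chain
bus = defaultdict(list)
dead_letters = []  # (topic, event, reason) tuples nobody processed


def subscribe(topic, handler):
    bus[topic].append(handler)


def publish(topic, event, hops=0):
    if hops > MAX_HOPS:
        dead_letters.append((topic, event, "hop limit exceeded"))
        return
    handlers = bus.get(topic, [])
    if not handlers:
        dead_letters.append((topic, event, "no subscribers"))
        return
    for handler in list(handlers):
        # Handlers receive the hop count and pass it on when they publish.
        handler(event, hops + 1)


def agent_b(event, hops):
    publish("y", event, hops)  # B reacts to X by publishing Y


def agent_a(event, hops):
    publish("x", event, hops)  # A reacts to Y by publishing X again


subscribe("x", agent_b)
subscribe("y", agent_a)

publish("x", {"task_id": "t1"})             # would loop forever without the limit
publish("orphan.topic", {"task_id": "t2"})  # nobody is listening
reasons = [reason for _, _, reason in dead_letters]
```

After this runs, `dead_letters` holds two entries: the looping event that hit the hop limit and the orphaned one. Neither was lost silently.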

Pub/sub is the most production-looking of the three, but also the heaviest. Most agent systems do not need it.

Picking between them

| Property | Direct call | Shared state | Pub/sub |
| --- | --- | --- | --- |
| Coupling | High | Medium | Low |
| Parallelism | Hard | Easy | Easy |
| Inspectability | Easy (stack trace) | Easy (dump state) | Hard |
| Failure handling | Cascading | Localized | Localized |
| Learning curve | None | Some | High |
| Right answer for early systems | Almost always | Sometimes | Rarely |

The trap is reading "decoupled" as "good" and reaching for pub/sub on day one. Decoupling is a cost you pay for the ability to evolve independently. If you do not need that yet, the cost is pure overhead.

Start with direct call. Move to shared state when you have parallel work that has to converge. Move to pub/sub only when the shape of the system genuinely requires events you do not own to trigger work you do not control.

Async is not always a feature

Pub/sub over a real broker gives you async by default. (The in-process bus above is a simplification: its publish runs every handler synchronously before returning.) That sounds nice until you realize "async" also means "no return value to wait on" and "no way to know if anyone handled my message." If you need to know when work is done, you have to add request/reply on top, which gives you back the complexity you were avoiding. Use direct call until async is a requirement, not a default.
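To make that cost concrete, here is a rough sketch of request/reply layered on a fire-and-forget bus. It assumes handlers run in background threads; the `request` helper, the `reply_to` field, and the per-request reply topic are illustrative conventions, not a standard.

```python
import queue
import threading
import uuid
from collections import defaultdict

bus = defaultdict(list)


def subscribe(topic, handler):
    bus[topic].append(handler)


def publish(topic, event):
    # Fire-and-forget: each handler runs in its own thread, nothing is returned.
    for handler in list(bus.get(topic, [])):
        threading.Thread(target=handler, args=(event,), daemon=True).start()


def request(topic, payload, timeout=2.0):
    """Publish a request, then block until a reply arrives on a private topic."""
    reply_topic = f"reply.{uuid.uuid4().hex}"  # unique per request
    replies = queue.Queue(maxsize=1)
    subscribe(reply_topic, replies.put)
    try:
        publish(topic, {**payload, "reply_to": reply_topic})
        # Raises queue.Empty if nobody answers within the timeout.
        return replies.get(timeout=timeout)
    finally:
        bus.pop(reply_topic, None)  # always clean up the temporary topic


def code_agent_handler(event):
    publish(event["reply_to"], {"findings": f"analyzed: {event['task']}"})


subscribe("task.received", code_agent_handler)
reply = request("task.received", {"task": "Fix the staging deploy"})
```

That is a correlation id, a temporary subscription, a timeout, and cleanup, all to get back what a plain function call gives you for free.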

What flows in the messages matters

Whatever transport you pick, the content of the messages is its own design problem. A direct call that passes the entire conversation history forward to the next agent has just recreated context pollution at the system level. The next lesson is about message content: handoff protocols, what to summarize, what to keep, what to discard.

Key takeaway

Three patterns cover almost every multi-agent system: direct call (function-style), shared state (read/write a coordination object), and pub/sub (event topics). Coupling drops as you go down the list, and so does control. Start with direct call. Reach for the others only when the shape of your work demands them. The next lesson covers what goes in the messages, which matters more than how they get delivered.
