
Queue-based architectures

Async agent workloads with job queues.


Decoupling the request from the work

A naive agent service has a synchronous request-response shape: user sends a request, the HTTP handler runs the agent loop, returns the answer. Works fine for short interactions. Falls over the moment your loops take more than a few seconds.

The fix is a queue. The request gets accepted and turned into a job; one or more worker processes pull jobs from the queue and run the agent; the user gets the result via polling, webhook, or streaming. This decouples accepting the request from doing the work, which makes scaling, failure handling, and resource management much easier.

This lesson covers when to introduce a queue, what jobs look like, how workers operate, and the patterns specific to agent workloads (idempotency, partial results, long-running sessions).

When you need a queue

Start with synchronous request-response. Move to a queue when:

Latency exceeds patience

If your agent regularly takes 30 seconds, a synchronous HTTP request times out somewhere in the chain (load balancer, browser, etc.). Queues sidestep this entirely.

Concurrency exceeds your capacity

Synchronous requests tie up worker threads while waiting on LLM calls. With 100 concurrent users and 10 seconds per request, you need 100 threads sitting idle. Queues let you decouple worker count from request count.

Some requests need to survive restarts

A 5-minute agent run that gets interrupted by a deploy should resume, not start over. Queues + checkpointing (Track 2 Module 4 lesson 3) make this practical.

Workload has bursts

If your traffic is bursty (10x normal during business hours, near-zero overnight), queues let you scale workers to match. Synchronous services have to be sized for peak even when load is low.

If none of these apply yet, don't introduce a queue. The complexity isn't worth it.

Job shape

A job is a structured object the worker can pick up and act on:

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentJob:
    job_id: str
    user_id: str
    request: dict
    submitted_at: float
    status: str = "pending"   # pending, running, completed, failed
    result: Optional[dict] = None
    error: Optional[str] = None
    attempts: int = 0

The job carries everything the worker needs to start. The job's status reflects where it is in its lifecycle. Workers update status as they work.

Worker shape

A worker process pulls jobs from the queue, runs the agent, writes results back.

async def worker_loop(queue, store):
    while True:
        job = await queue.dequeue(timeout=5)
        if job is None:
            continue                      # queue was empty; poll again
        try:
            await process(job, store)
        except Exception as e:
            await handle_failure(job, e, store)


async def process(job, store):
    job.status = "running"
    job.attempts += 1                     # count every attempt, including retries
    await store.save(job)

    result = await run_agent(job.request)

    job.status = "completed"
    job.result = result
    await store.save(job)

Multiple workers run the same loop. The queue's contention guarantees only one worker gets each job. Workers can be added or removed without coordination.
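That contention guarantee comes from the queue primitive itself. A minimal dequeue sketch on top of Redis, assuming redis-py's asyncio client and a list named "agent-jobs" (both names illustrative):

import json
import redis.asyncio as redis

r = redis.Redis()

async def dequeue(timeout=5):
    # BLPOP blocks until an element is available or the timeout expires.
    # Redis hands each element to exactly one caller, so no two workers
    # can receive the same job.
    item = await r.blpop("agent-jobs", timeout=timeout)
    if item is None:
        return None
    _key, payload = item
    return AgentJob(**json.loads(payload))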

Result delivery

Three patterns for getting the result back to the user:

Polling

The user polls a result endpoint with the job_id. The endpoint returns "pending" until done.

from fastapi import HTTPException

@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    job = await store.load(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="unknown job")
    return {"status": job.status, "result": job.result, "error": job.error}

Simple. Wasteful for short-lived jobs (lots of polls return nothing). Fine for long jobs where polls happen every few seconds.

Webhooks

The job carries a callback URL. When the worker finishes, it POSTs the result to that URL. The user (their service, really) receives the result asynchronously.

async def process(job, store):
    result = await run_agent(job.request)   # status bookkeeping elided here
    if job.callback_url:
        await http.post(job.callback_url, json={"job_id": job.job_id, "result": result})

Good for B2B integrations. Requires the user to have a publicly reachable endpoint. Adds network failure modes (what if the callback fails?).
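One common answer to failed callbacks is bounded retries with backoff, with the polling endpoint as the fallback source of truth. A sketch, assuming httpx and the job fields above:

import asyncio
import httpx

async def deliver_webhook(job, retries=3):
    async with httpx.AsyncClient() as client:
        for attempt in range(retries):
            try:
                resp = await client.post(job.callback_url,
                                         json={"job_id": job.job_id, "result": job.result})
                if resp.status_code < 500:
                    return                       # delivered (or permanently rejected)
            except httpx.TransportError:
                pass                             # network failure; retry
            await asyncio.sleep(2 ** attempt)    # back off between attempts
    # Out of retries: the result is still available via the polling endpoint.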

Streaming (SSE / WebSocket)

The user opens a long-lived connection at submission time. The worker streams events through it as they happen. Best UX for long jobs because the user sees progress.
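A minimal SSE sketch using FastAPI's StreamingResponse. The event_bus here is a hypothetical pub/sub keyed by job_id (Redis pub/sub is a common backing choice); the worker publishes events to it as the agent runs:

import json
from fastapi.responses import StreamingResponse

@app.get("/jobs/{job_id}/events")
async def job_events(job_id: str):
    async def stream():
        # event_bus.subscribe is assumed to yield worker events as dicts
        async for event in event_bus.subscribe(job_id):
            yield f"data: {json.dumps(event)}\n\n"   # SSE wire format
            if event.get("type") == "done":
                break
    return StreamingResponse(stream(), media_type="text/event-stream")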

Pick one or support multiple, but pick the simplest one that meets your latency expectations.

Idempotency

Workers can crash mid-job. The retry semantics depend on whether your work is idempotent.

If a job has side effects (sending an email, charging a card), naive retries duplicate them. Two patterns:

Job-level idempotency keys

Each job has a unique idempotency key. Side-effecting tools check the key before acting. Module 1 lesson 2 of this track covers this for individual tool calls; the same idea applies at the job level.
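A sketch of the job-level version, with hypothetical store.seen / store.record helpers backed by the results store and a hypothetical email_client. The key is derived from the job plus the specific side effect, so it stays stable across retries:

async def send_email_once(store, job, payload):
    key = f"{job.job_id}:send_email"        # stable across retries of this job
    if await store.seen(key):
        return await store.get_result(key)  # already sent on a prior attempt
    result = await email_client.send(payload)
    await store.record(key, result)         # persist before reporting success
    return result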

Checkpointing within the job

The agent loop checkpoints state during the job (Track 2 Module 4 lesson 3). On worker crash, a different worker picks up the same job and resumes from the last checkpoint instead of starting over.

For long jobs (more than a couple of minutes), checkpointing inside the job is essential. Otherwise you risk doing 10 minutes of work, crashing 30 seconds before completion, and starting over.
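A sketch of a checkpointing process loop, assuming hypothetical load_checkpoint / save_checkpoint helpers on the store, an initial_state constructor, and a run_agent_step function that advances the loop by one step and returns a state object with done and result fields:

async def process_with_checkpoints(job, store):
    # Resume from wherever the previous attempt got to, if anywhere.
    state = await store.load_checkpoint(job.job_id)
    if state is None:
        state = initial_state(job.request)
    while not state.done:
        state = await run_agent_step(state)            # one agent-loop iteration
        await store.save_checkpoint(job.job_id, state)
    job.status = "completed"
    job.result = state.result
    await store.save(job)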

Concurrency models

Workers can be:

Process per worker (heavy)

Each worker is its own process. Strong isolation, but every worker carries its own runtime overhead. The right choice when each agent run uses a lot of memory or each worker holds non-shareable state.

Thread per worker (medium)

Multiple worker threads inside one process. Cheaper than processes, but you have to be careful about shared state, GIL contention (in Python), and graceful shutdown.

Async coroutine per worker (light)

For agent workloads where most time is spent waiting on I/O, async is ideal. One process can handle dozens of concurrent agent sessions because none of them are CPU-bound.

import asyncio

async def main():
    # TaskGroup requires Python 3.11+
    async with asyncio.TaskGroup() as tg:
        for _ in range(20):   # 20 concurrent worker coroutines in one process
            tg.create_task(worker_loop(queue, store))

For Python agent services, async coroutines per worker is the default. You scale horizontally by adding more processes (each running 20 coroutines), not by adding processes for each unit of concurrency.

Backpressure

What happens when the queue is full? Two policies:

  • Reject submissions. The submitter gets a 429 / "queue full" response and has to retry. Best when you don't want to silently degrade.
  • Slow submissions down. Accept submissions but with rate limiting. Best when you want to flatten traffic spikes.

Most production systems do both: a hard limit beyond which submissions get rejected, and rate limiting before that.
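A sketch of a submission endpoint combining both policies; SOFT_LIMIT, HARD_LIMIT, and the queue.depth() helper are assumptions, not a real queue API:

import asyncio
import time
from uuid import uuid4
from fastapi import HTTPException

@app.post("/jobs")
async def submit(user_id: str, request: dict):
    depth = await queue.depth()
    if depth >= HARD_LIMIT:
        raise HTTPException(status_code=429, detail="queue full")   # hard reject
    if depth >= SOFT_LIMIT:
        await asyncio.sleep(depth / HARD_LIMIT)   # crude slowdown under load
    job = AgentJob(job_id=uuid4().hex, user_id=user_id,
                   request=request, submitted_at=time.time())
    await queue.enqueue(job)
    return {"job_id": job.job_id, "status": "pending"}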

Queue choice

For most production agent systems, the right choice is one of:

  • Redis (with BLPOP or streams). Pros: simple, fast, ubiquitous. Cons: no durability without RDB/AOF; limited features. Best for: most agent systems, MVPs.
  • AWS SQS / GCP Pub/Sub. Pros: managed, durable, scales infinitely. Cons: latency per call, cost. Best for: cloud-native deployments.
  • Postgres (with SKIP LOCKED). Pros: one less infra component, transactional with results store. Cons: slower at high throughput. Best for: small services where Postgres is already in use.
  • Kafka. Pros: massive throughput, replayable. Cons: heavy operationally. Best for: high-volume systems with complex requirements.

Redis is the typical default. SQS is the typical cloud default. Postgres is fine when scale is moderate. Kafka is for when you genuinely have Kafka-shaped requirements; if you have to ask, you don't.
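For the Postgres option, the core trick is a single UPDATE that claims the oldest pending job; FOR UPDATE SKIP LOCKED makes concurrent workers skip rows another worker has already locked. A sketch assuming asyncpg and a jobs table mirroring the AgentJob fields:

async def dequeue_pg(conn):
    # Atomically claim the oldest pending job, or return None if empty.
    return await conn.fetchrow("""
        UPDATE jobs
           SET status = 'running'
         WHERE job_id = (SELECT job_id FROM jobs
                          WHERE status = 'pending'
                          ORDER BY submitted_at
                            FOR UPDATE SKIP LOCKED
                          LIMIT 1)
        RETURNING job_id, user_id, request
    """)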

Visibility timeouts

When a worker dequeues a job, the queue should hide the job from other workers for some time (the visibility timeout). If the worker doesn't ack the job within that window, it reappears.

This is how you handle worker crashes: the job goes back into the queue and a different worker picks it up. The trick is picking a sensible timeout: long enough to cover normal job duration, short enough to detect failures fast. For agent jobs that average 30 seconds, a visibility timeout of 5 minutes is reasonable. Re-extend the timeout periodically (heartbeat) for unusually long jobs.
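A sketch of the heartbeat, with queue.extend_visibility and queue.ack as assumed methods (on SQS the analogues are ChangeMessageVisibility and DeleteMessage):

import asyncio

async def process_with_heartbeat(job, store):
    async def heartbeat():
        while True:
            await asyncio.sleep(60)
            await queue.extend_visibility(job, seconds=300)   # stay invisible
    hb = asyncio.create_task(heartbeat())
    try:
        await process(job, store)
        await queue.ack(job)   # done for good; never redeliver
    finally:
        hb.cancel()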

Dead-letter queues

A job that fails repeatedly should not be retried forever. Put a max-attempts cap and route exhausted jobs to a dead-letter queue (DLQ).

async def handle_failure(job, error, store):
    # job.attempts was already incremented when process() started this attempt
    if job.attempts >= MAX_ATTEMPTS:
        job.status = "failed"
        job.error = str(error)
        await store.save(job)
        await dead_letter_queue.enqueue(job)
    else:
        job.status = "pending"
        await store.save(job)
        await queue.enqueue(job, delay=backoff(job.attempts))

A human reviews the DLQ. Sometimes jobs get fixed and re-queued; sometimes the user gets an apology; sometimes a class of failures gets investigated and patched.
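The backoff helper used above can be as simple as exponential delay with full jitter, capped so late retries don't drift out to hours:

import random

def backoff(attempts, base=2.0, cap=300.0):
    # Full jitter: random delay in [0, min(cap, base ** attempts)] seconds.
    return random.uniform(0, min(cap, base ** attempts))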

Scaling automatically

With queues, scaling is "more workers." How you trigger scaling depends on your platform:

  • Kubernetes HorizontalPodAutoscaler: scale on queue depth (via custom metrics or KEDA) or CPU.
  • Cloud serverless (AWS Lambda, Cloud Run): scales automatically per request.
  • Manual / scheduled: for predictable workloads, time-based scaling can be enough.

The metric most worth scaling on is queue lag: how far behind we are. CPU and memory are derived metrics; queue lag is the direct one.
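Queue lag is cheap to expose. A sketch, with queue.depth() and queue.oldest_submitted_at() as assumed helpers; both numbers matter, because a deep queue of fast jobs and a shallow queue of stuck jobs are different problems:

import time

async def queue_lag(queue):
    depth = await queue.depth()                    # jobs waiting
    oldest = await queue.oldest_submitted_at()     # epoch seconds, or None
    age = time.time() - oldest if oldest else 0.0  # wait time of oldest job
    return {"depth": depth, "oldest_age_seconds": age}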

Don't let queues hide reliability problems

Queues make your system feel calmer because requests don't fail loudly when workers can't keep up; they just wait. This can mask real problems. Set alerts on queue lag and DLQ depth so a slowly degrading worker pool surfaces as an alert, not as quietly increasing latency.

Key takeaway

Queue-based architectures decouple request acceptance from work execution. Submissions become jobs; workers process them; results are delivered via polling, webhook, or streaming. This unlocks scaling, durable failure handling, and burst tolerance. Use idempotency keys and checkpointing to make jobs safe to retry. Pick a queue based on scale (Redis for most cases, SQS for cloud-native, Postgres for small services). Alert on queue lag, not just CPU. The next lesson covers the financial side of running agents at scale.
