Lesson 11 of 14 · Track 4

Deployment and scaling

Containerizing agent systems

Docker for agent deployments.

Video lesson ~10 min

Video coming soon

Packaging the agent

Once your agent works locally and the eval pipeline gives you confidence, you have to put it somewhere users can reach it. The standard answer in 2026 is the same as it was a decade ago: a container. Containers package the agent's code, dependencies, and runtime into a single image that runs the same way on your laptop, in CI, and in production.

This lesson covers what's specific to agent containerization: image structure, secret handling, sidecar MCP servers, and the tradeoffs that show up because agents have unusual resource and dependency profiles.

A typical agent container

A reasonable structure for an agent service:

FROM python:3.12-slim
 
WORKDIR /app
 
# Install system deps for any tools that need them.
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*
 
# Install Python deps before copying source for layer caching.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
 
# Copy source code.
COPY ./src /app/src
COPY ./prompts /app/prompts
 
# Run as non-root.
RUN useradd -m -u 1001 agent
USER agent
 
# The agent's entrypoint is just a Python service.
CMD ["python", "-m", "src.main"]

Three things to notice:

  1. Layered for cache. Dependencies installed before source copy means a code change doesn't reinstall pip packages.
  2. Non-root user. Agents call tools that may have side effects; the container itself shouldn't have root permissions.
  3. Slim base image. A 100MB base beats a 1GB one. Less surface area, smaller images, faster pulls.

Secrets

Don't bake secrets into the image. The image gets pushed to a registry and may be inspected by anyone with pull access. Instead:

  • Environment variables at runtime, sourced from a secret manager.
  • Mounted secrets for systems that support it (Kubernetes secrets, Docker secrets, Nomad). The secret never appears on disk inside the running container; it's mounted into a tmpfs.
  • Token brokers that the agent calls at startup to fetch fresh credentials.

# kubernetes manifest snippet
env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: agent-secrets
        key: anthropic_api_key

The agent reads the env var; nobody pushed plain-text secrets into the image. This is the basic discipline; treat it as non-negotiable.
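On the application side, fail fast at startup if a required secret is missing, rather than discovering the gap on the first model call. A minimal sketch (the secret names are illustrative):

```python
import os

# Illustrative list; enumerate whatever your agent actually needs.
REQUIRED_SECRETS = ["ANTHROPIC_API_KEY"]

def load_secrets(env=os.environ):
    """Read required secrets from the environment, failing fast if any are missing."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required secrets: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Calling this once at startup turns a misconfigured deployment into an immediate crash loop the orchestrator can surface, instead of a latent runtime failure.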

Image size matters

Agent containers tend to grow:

  • LLM client libraries pull in lots of transitive deps.
  • Tool implementations may need scientific Python (numpy, pandas) or other heavy packages.
  • Multi-language tools sometimes require entire Node or Java runtimes.

Three patterns to keep size in check:

Multi-stage builds

Build artifacts in one stage, copy the minimum into the final stage.

FROM python:3.12 AS build
COPY requirements.txt .
RUN pip install --user -r requirements.txt
 
FROM python:3.12-slim AS runtime
WORKDIR /app
# User-site packages land in /root/.local; copy them to the same path
# so the runtime interpreter finds them. (Add the non-root user from the
# earlier Dockerfile in a real image.)
COPY --from=build /root/.local /root/.local
COPY ./src /app/src
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "-m", "src.main"]

Distroless or alpine bases

If your agent only needs Python and the runtime libs, distroless gives you a base of ~80MB instead of ~150MB for python:slim. Alpine is even smaller, but it uses musl instead of glibc, which breaks or slows some prebuilt wheels.

Shared base images

If you have multiple agent services, give them a common base image with the standard deps. New services build on top quickly.

For a production agent, a final image of 200-400MB is realistic. 1GB+ is a smell that warrants investigation.

MCP servers as sidecars

The previous tracks covered MCP servers extensively. In production, MCP servers come in three deployment shapes:

Inside the agent container

The simplest layout: stdio MCP servers spawned as subprocesses inside the agent. Works for small numbers of small servers; doesn't scale.
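The in-container layout can be sketched as a subprocess the agent spawns and talks to over newline-delimited JSON on stdin/stdout. The stand-in echo server below is illustrative; in production the command would be the real MCP server binary:

```python
import json
import subprocess
import sys

# Stand-in stdio server for illustration: echoes each request's id back.
# In production, replace with the real MCP server command.
SERVER_CMD = [
    sys.executable, "-c",
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    req = json.loads(line)\n"
    "    print(json.dumps({'id': req['id'], 'result': 'ok'}), flush=True)",
]

def spawn_stdio_server(cmd=SERVER_CMD):
    """Spawn an MCP-style stdio server as a child process of the agent."""
    return subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)

def call(proc, request):
    """Send one newline-delimited JSON request and read one response."""
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())
```

The limits are visible in the sketch: every server shares the agent container's filesystem, dependencies, and resource limits, which is why this layout stops working past a handful of small servers.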

Sidecar containers

In Kubernetes / ECS / Nomad, run each MCP server as a sibling container in the same pod. The agent connects via localhost. Each MCP server has its own resource limits and lifecycle.

# kubernetes pod snippet
containers:
  - name: agent
    image: registry/agent:v1.2.3
  - name: mcp-filesystem
    image: registry/mcp-filesystem:v0.4.0
  - name: mcp-github
    image: registry/mcp-github:v0.7.0

This is the right pattern for production agents that talk to multiple servers. Sidecars share the network namespace with the agent, so connection setup is fast; they have their own filesystem, so they don't pollute the agent's image.

Standalone services

Heavy MCP servers (anything with significant compute, state, or shared use across agents) should be deployed as their own services. The agent connects over the network. This is the pattern for vendor-hosted MCP and for shared internal MCP servers.

The dividing line: small + agent-specific = sidecar; large or shared = standalone.

Resource shape

Agents have unusual resource profiles compared to typical web services:

Memory

The agent itself is small (Python service: 100-300MB). The model lives elsewhere (vendor API). Memory mostly grows with concurrent active sessions and any caching you do.

CPU

Mostly idle. The agent's main work is waiting on LLM and tool calls. Per-session CPU is tiny except during the brief windows of post-processing or validation.

I/O

Heavy. Most of the agent's life is sending requests to LLMs and tools, waiting, processing responses. Network I/O is the dominant resource.

File system

Often unused except for logs and checkpoint storage. If you're using local files, prefer mounted volumes you can control independently.

For Kubernetes resource requests, a typical agent service is something like 100m CPU, 512Mi memory. Bursting to higher CPU is fine; sustained high CPU is unusual and worth investigating.

Rollouts and rollbacks

Two-phase rollouts (canary + full) are especially important for agents because regressions often manifest as quality issues, not crashes. Roll out to 1-5% of traffic first, run evals on that traffic, decide whether to proceed.

v1.2.3 -- 100%
v1.2.4 -- canary 5% (run 1 hour; check metrics, eval scores, error rates)
       -- canary 25% (run 1 hour; same checks)
       -- 100% (full rollout)

If anything in the canary regresses meaningfully, roll back. Automated rollbacks tied to dashboards (eval pass rate, error rate, latency) are worth setting up.
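The promotion decision reduces to a small gate comparing canary metrics against the stable baseline. The metric names and thresholds here are illustrative:

```python
def should_promote(baseline: dict, canary: dict,
                   max_error_increase: float = 0.005,
                   max_eval_drop: float = 0.02) -> bool:
    """Return True if the canary's metrics are within tolerance of the baseline."""
    if canary["error_rate"] > baseline["error_rate"] + max_error_increase:
        return False
    if canary["eval_pass_rate"] < baseline["eval_pass_rate"] - max_eval_drop:
        return False
    return True
```

Wiring a gate like this into the rollout pipeline is what makes the rollback automatic rather than a judgment call at 2 a.m.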

Configuration vs code

Where each kind of agent configuration should live:

  • Code: anything that needs review and tested deploys (prompts, tool registries, model selection logic).
  • Config: anything that should be tweakable per-environment or per-tenant (model name, max tokens, timeouts).
  • Runtime policy (lesson 4 of Module 1 of this track): anything that needs to change without redeploy (kill switches, rate limits).

A common mistake: putting prompts in runtime config so they can be edited without deploy. This sounds great until somebody changes the prompt and breaks the eval set. Prompts should be in code, with full review, and changes go through evals before reaching prod. The "edit live" pattern feels agile and is, in practice, a regression machine.
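One way to keep the split honest is to read per-environment knobs from environment variables with defaults in code, so config tweaks never touch source. The variable names are illustrative:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    model_name: str
    max_tokens: int
    timeout_s: float

def load_config(env=os.environ) -> AgentConfig:
    """Per-environment knobs come from env vars; defaults live in code."""
    return AgentConfig(
        model_name=env.get("AGENT_MODEL_NAME", "default-model"),
        max_tokens=int(env.get("AGENT_MAX_TOKENS", "4096")),
        timeout_s=float(env.get("AGENT_TIMEOUT_S", "60")),
    )
```

Note what is absent: no prompt text. Prompts stay in the image, behind review and evals.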

Health checks

The container needs to tell the orchestrator (Kubernetes, ECS) whether it's healthy. Two endpoints:

/healthz (liveness)

Is the process alive and the agent loop functional? If not, restart. Should not test downstream services.

/readyz (readiness)

Is the agent ready to accept new requests? This should test downstream services: model API reachable, MCP servers connected, secrets loaded. If any are not ready, return non-200; the load balancer skips this instance until it is.

@app.get("/healthz")
async def liveness():
    return {"status": "ok"}
 
 
@app.get("/readyz")
async def readiness():
    if not model_client.is_ready():
        raise HTTPException(503, "model client not ready")
    if not all(c.is_connected() for c in mcp_clients.values()):
        raise HTTPException(503, "mcp clients not connected")
    return {"status": "ok"}

Distinguish the two: conflating them means traffic gets routed to broken instances and healthy instances get restarted for transient downstream failures.

Local development parity

The container that runs in production should be reproducible locally. Three practices:

  • docker compose for the full local stack. Agent + MCP servers + any local databases, all spun up with one command.
  • Same image for dev and prod. Don't have a separate dev container; that's how prod-only bugs happen.
  • Mocked external services in dev. A local stub for the model API (replays cached responses) lets you run integration tests without paying for tokens.

Treat the dev environment as a target, not an afterthought. Time spent making it match production saves orders of magnitude more time when debugging.
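A minimal stub for the model API, sketched here as a client that replays cached responses keyed by prompt hash. The class and cache format are illustrative, not a standard interface:

```python
import hashlib

class ReplayModelClient:
    """Stands in for the real model client in dev: looks up responses by prompt hash."""

    def __init__(self, cache: dict[str, str]):
        self._cache = cache

    @staticmethod
    def key(prompt: str) -> str:
        # Hash the prompt so the cache file never stores raw keys twice.
        return hashlib.sha256(prompt.encode()).hexdigest()

    def complete(self, prompt: str) -> str:
        k = self.key(prompt)
        if k not in self._cache:
            raise KeyError(f"no cached response for prompt hash {k[:12]}")
        return self._cache[k]
```

Failing loudly on a cache miss is deliberate: it tells you the test exercised a prompt nobody recorded, rather than silently returning something plausible.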

The container is a small piece of the operational picture

A working container doesn't mean a working production system. Containers solve the packaging and reproducibility problem; they don't solve scheduling, scaling, networking, or secrets. Pair the container with a reasonable orchestrator (Kubernetes, ECS, or even just docker compose for small deployments) and the rest of the patterns from this track. Containers are necessary, not sufficient.

Key takeaway

Agent containers follow standard practices: layered Dockerfiles, non-root users, slim bases, secrets at runtime, multi-stage builds for size. MCP servers run as sidecars (small, agent-specific) or standalone services (large, shared). Agent resource profiles are I/O-heavy, CPU-light. Roll out with canaries; tie automated rollbacks to evals and metrics. Distinguish liveness from readiness. Match dev to prod. The next lesson covers the queue-based architecture that lets you scale agent workloads horizontally.
