Agentic RAG
From static RAG to agentic RAG
Why retrieval needs to be a decision, not a pipeline step.
Two ways to do retrieval
When you bolt retrieval onto an LLM, you have a choice. You can make retrieval part of the pipeline, where the system always retrieves first and then generates. Or you can make retrieval a tool, where the model decides whether and how to retrieve at each step.
The first is static RAG. The second is agentic RAG. They sound similar but produce dramatically different behavior.
This lesson is the bridge between the memory work in Module 5 and the rest of this module. Once you see retrieval as something the agent decides to do rather than something the system always does, the design space opens up.
Static RAG: retrieval as a pipeline step
The classic RAG architecture, popularized in 2023 and still everywhere:
```python
def static_rag(question, vector_store, llm):
    # Step 1: always retrieve
    chunks = vector_store.search(question, k=5)

    # Step 2: stuff retrieved chunks into the prompt
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # Step 3: generate
    return llm.complete(prompt)
```

Three fixed steps. Retrieval happens once, with the user's question as the query. The retrieved chunks go into the prompt. The model produces an answer.
This works for simple questions where:
- The answer is in a single chunk
- The user's question is a good query
- One pass of retrieval is enough
It breaks down when any of those assumptions fail.
Where static RAG falls over
The query isn't the question
A user asks: "How does the system handle authentication failures during a database migration?"
The single best chunk for this question doesn't exist. You need:
- The auth failure handling code
- The migration system code
- The intersection of how they interact
A single retrieval with the original question as the query gets you maybe one of these. The other two stay invisible.
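One mitigation, even before going fully agentic, is to issue several targeted queries instead of one. The sketch below illustrates the idea with hand-written sub-queries (in practice an LLM would generate them) and a toy `KeywordStore` standing in for a real embedding-based vector store:

```python
def multi_query_search(vector_store, sub_queries, k=3):
    """Run one retrieval per sub-query and merge the results, deduplicated."""
    seen, merged = set(), []
    for query in sub_queries:
        for chunk in vector_store.search(query, k=k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

class KeywordStore:
    """Toy stand-in for a vector store: ranks docs by keyword overlap."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query, k=3):
        terms = set(query.lower().split())
        ranked = sorted(self.docs, key=lambda d: -len(terms & set(d.lower().split())))
        return ranked[:k]

store = KeywordStore([
    "auth failure handling retries the login three times",
    "database migration runs in a maintenance window",
    "unrelated frontend styling notes",
])

# Two targeted queries cover both facets of the original question.
chunks = multi_query_search(store, ["auth failure handling", "database migration"], k=1)
```

A single query with the original question would have matched at most one of these facets; the merged result covers both.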
Multi-hop reasoning
A user asks: "Was the bug fixed by the same person who introduced it?"
To answer, you need to:
- Find the bug
- Find who introduced it
- Find the fix
- Compare authors
No single retrieval call can produce that. You need a sequence of retrievals where each one depends on the result of the last.
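The dependency structure is easy to see in code. This toy sketch uses a hard-coded commit log and two hypothetical lookup helpers (not from any real API) to show how each hop consumes the previous hop's result:

```python
# Toy commit log standing in for a searchable repository history.
commits = {
    "fix-1":   {"message": "Fix auth bug", "author": "alice", "fixes": "bug-7"},
    "intro-1": {"message": "Add login flow", "author": "alice", "introduced": "bug-7"},
}

def find_fix():
    # Hop 1: find the commit that fixed the bug
    return next(c for c in commits.values() if "fixes" in c)

def find_introducer(bug_id):
    # Hop 2: find the commit that introduced that same bug
    return next(c for c in commits.values() if c.get("introduced") == bug_id)

fix = find_fix()                                 # depends on the user's question
intro = find_introducer(fix["fixes"])            # depends on hop 1's result
same_person = fix["author"] == intro["author"]   # final comparison
```

The second lookup cannot even be formulated until the first one returns, which is exactly what a single retrieval pass cannot express.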
Conditional retrieval
Some questions don't need retrieval at all. "What's 2 + 2?" is answered by the model directly. Static RAG retrieves anyway, wasting tokens and possibly polluting the context with irrelevant chunks.
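A minimal sketch of a retrieval gate makes the point. Here a trivial keyword heuristic stands in for the decision; in agentic RAG the model itself makes this call:

```python
def needs_retrieval(question, known_topics=("auth", "migration", "config")):
    """Crude stand-in for the model's decision: retrieve only when the
    question touches a topic the document store actually covers."""
    return any(topic in question.lower() for topic in known_topics)
```

With this gate, "What's 2 + 2?" skips retrieval entirely, while "How does auth work?" triggers it.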
Bad retrieval drags down the answer
If retrieval returns five irrelevant chunks, the static system has no recourse. It dumps them into the prompt and asks the model to answer anyway. The model often hallucinates a confident wrong answer rather than admitting it doesn't have what it needs.
Agentic RAG: retrieval as a decision
In agentic RAG, retrieval is a tool the agent can call. The decision is moved into the model:
```python
search_tool = {
    "name": "search_docs",
    "description": "Search the project documentation by keyword. Use when the user asks about features, configuration, or how something works.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def agentic_rag(question, tools, registry):
    return run_agent(
        goal=question,
        tools=tools,  # includes search_tool
        registry=registry,
    )
```

Now the model can:
- Decide whether to retrieve. Skip retrieval for questions it can answer directly.
- Choose its own query. Reformulate the user's question into a query the retriever can actually use.
- Retrieve multiple times. Run a sequence of queries that each build on the last.
- Recover from bad retrieval. Try a different query when the first one returns nothing useful.
It's the same retrieval mechanism. The difference is who's driving.
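Under the hood, "who's driving" is just a loop: the model picks the next action, the system executes it, and the result goes back into the history. A minimal sketch, with a scripted stub in place of a real LLM and a hypothetical `toy_search` in place of a real retriever:

```python
def run_agent(goal, search, model, max_steps=5):
    """Minimal agent loop: the model chooses each step, including whether
    and what to retrieve, until it decides to finish."""
    history = []
    for _ in range(max_steps):
        action = model(goal, history)  # the model decides the next step
        if action["type"] == "search":
            history.append(("obs", search(action["query"])))
        elif action["type"] == "finish":
            return action["answer"]
    return "gave up"

def scripted_model(goal, history):
    # Stub: first reformulate the question into a retriever-friendly query,
    # then answer from the observation. A real system calls an LLM here.
    if not history:
        return {"type": "search", "query": "CSRF token handling login"}
    return {"type": "finish", "answer": f"Based on: {history[-1][1]}"}

def toy_search(query):
    return 'Commit abc123 by alice@... "Fix CSRF token handling on login"'

answer = run_agent("Was the auth bug fixed?", toy_search, scripted_model)
```

Note that the loop itself contains no retrieval logic at all; every "whether, how, and what to retrieve" decision lives in the model.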
A side-by-side example
Question: "Was the auth bug fixed by the same person who introduced it?"
Static RAG:
```
retrieve("Was the auth bug fixed by the same person who introduced it?")
→ returns 5 chunks, mostly irrelevant

LLM: "I don't have enough information to answer." [or hallucinates]
```

Agentic RAG (full trace):
```
Thought: I need to find the bug first, then who introduced it, then the fix.
Action: search_commits[auth bug fix]
Observation: Commit abc123 by alice@... "Fix CSRF token handling on login"

Thought: Now I need to find who introduced the original bug.
Action: search_commits[CSRF token handling login]
Observation: Commit def456 by alice@... "Add CSRF middleware to login flow"

Thought: Both commits are by alice. I have my answer.
Action: finish[Yes, both commits were by alice@...]
```

Two retrievals plus a final comparison, planned by the model, each step building on the last. None of this is possible in static RAG.
The trade-offs
Agentic RAG is more powerful, but it's not a free upgrade.
| Dimension | Static RAG | Agentic RAG |
|---|---|---|
| Latency | Fast (1 retrieval, 1 generation) | Slow (N retrievals, N generations) |
| Cost | Low | Higher (model is called repeatedly) |
| Predictability | Deterministic flow | Variable, depends on model choices |
| Quality on simple Qs | Good | About the same |
| Quality on complex Qs | Often wrong | Often correct |
| Failure mode | Confidently wrong | Sometimes loops or stalls |
Static RAG is the right choice when your questions are simple, latency matters, and you can tune the retriever well. Agentic RAG is the right choice when questions vary in complexity and the cost of being wrong is high.
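If you serve both kinds of questions, one option is a router in front of the two pipelines. The heuristic below is a hypothetical placeholder, assuming that long or multi-part questions are the complex ones; a production router might use a classifier or a cheap LLM call instead:

```python
def route(question):
    """Rough heuristic router: send long, multi-part questions to the
    agentic pipeline and everything else to static RAG."""
    markers = ("same person who", "and then", "compare", "both")
    if len(question.split()) > 12 or any(m in question.lower() for m in markers):
        return "agentic"
    return "static"
```

This keeps the common case fast and cheap while reserving the agent's latency budget for the questions that need it.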
Hybrid: the practical answer
Most production systems use both. A common architecture:
```python
def hybrid_rag(question, vector_store, agent):
    # Step 1: always retrieve a baseline set of chunks
    baseline = vector_store.search(question, k=3)

    # Step 2: hand the baseline to an agent that can retrieve more if needed
    return agent.run(
        goal=question,
        initial_context=baseline,
        tools=[search_tool, lookup_tool],
    )
```

The baseline retrieval gives the agent a starting point so it doesn't waste calls re-retrieving the obvious. The agent's tool access lets it dig deeper when the baseline isn't enough.
Static RAG isn't dead
There's a tendency in agent communities to dismiss static RAG as old-fashioned. It isn't. For high-volume, low-complexity workloads (FAQ bots, documentation search), static RAG is faster and cheaper, and the quality gap with agentic RAG is small. Pick the architecture for the workload, not the trend.
Key takeaway
The shift from static to agentic RAG is the shift from "retrieve, then generate" to "let the model decide when, how, and what to retrieve." The next two lessons go deep on the most important agent skills in this space: planning a sequence of retrievals (query planning) and shaping how the underlying documents are chunked (agent-driven chunking).