Agentic RAG
From static RAG to agentic RAG
Why retrieval needs to be a decision, not a pipeline step.
Two ways to do retrieval
When you bolt retrieval onto an LLM, you have a choice. You can make retrieval part of the pipeline, where the system always retrieves first and then generates. Or you can make retrieval a tool, where the model decides whether and how to retrieve at each step.
The first is static RAG. The second is agentic RAG. They sound similar but produce dramatically different behavior.
This lesson is the bridge between the memory work in Module 5 and the rest of this module. Once you see retrieval as something the agent decides to do rather than something the system always does, the design space opens up.
Static RAG: retrieval as a pipeline step
The classic RAG architecture, popularized in 2023 and still everywhere:
```python
def static_rag(question, vector_store, llm):
    # Step 1: always retrieve
    chunks = vector_store.search(question, k=5)

    # Step 2: stuff retrieved chunks into the prompt
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # Step 3: generate
    return llm.complete(prompt)
```

Three fixed steps. Retrieval happens once, with the user's question as the query. The retrieved chunks go into the prompt. The model produces an answer.
This works for simple questions where:
- The answer is in a single chunk
- The user's question is a good query
- One pass of retrieval is enough
It breaks down when any of those assumptions fail.
Where static RAG falls over
The query isn't the question
A user asks: "How does the system handle authentication failures during a database migration?"
The single best chunk for this question doesn't exist. You need:
- The auth failure handling code
- The migration system code
- The intersection of how they interact
A single retrieval with the original question as the query gets you maybe one of these. The other two stay invisible.
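One mitigation, even before going fully agentic, is to issue several targeted queries instead of one. The sketch below illustrates the idea with hand-written sub-queries (in practice an LLM would generate them) and a toy `KeywordStore` standing in for a real embedding-based vector store:

```python
def multi_query_search(vector_store, sub_queries, k=3):
    """Run one retrieval per sub-query and merge the results, deduplicated."""
    seen, merged = set(), []
    for query in sub_queries:
        for chunk in vector_store.search(query, k=k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

class KeywordStore:
    """Toy stand-in for a vector store: ranks docs by keyword overlap."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query, k=3):
        terms = set(query.lower().split())
        ranked = sorted(self.docs, key=lambda d: -len(terms & set(d.lower().split())))
        return ranked[:k]

store = KeywordStore([
    "auth failure handling retries the login three times",
    "database migration runs in a maintenance window",
    "unrelated frontend styling notes",
])

# Two targeted queries cover both facets of the original question.
chunks = multi_query_search(store, ["auth failure handling", "database migration"], k=1)
```

A single query with the original question would have matched at most one of these facets; the merged result covers both.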
Multi-hop reasoning
A user asks: "Was the bug fixed by the same person who introduced it?"
To answer, you need to:
- Find the bug
- Find who introduced it
- Find the fix
- Compare authors
No single retrieval call can produce that. You need a sequence of retrievals where each one depends on the result of the last.
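The dependency structure is easy to see in code. This toy sketch uses a hard-coded commit log and two hypothetical lookup helpers (not from any real API) to show how each hop consumes the previous hop's result:

```python
# Toy commit log standing in for a searchable repository history.
commits = {
    "fix-1":   {"message": "Fix auth bug", "author": "alice", "fixes": "bug-7"},
    "intro-1": {"message": "Add login flow", "author": "alice", "introduced": "bug-7"},
}

def find_fix():
    # Hop 1: find the commit that fixed the bug
    return next(c for c in commits.values() if "fixes" in c)

def find_introducer(bug_id):
    # Hop 2: find the commit that introduced that same bug
    return next(c for c in commits.values() if c.get("introduced") == bug_id)

fix = find_fix()                                 # depends on the user's question
intro = find_introducer(fix["fixes"])            # depends on hop 1's result
same_person = fix["author"] == intro["author"]   # final comparison
```

The second lookup cannot even be formulated until the first one returns, which is exactly what a single retrieval pass cannot express.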
Conditional retrieval
Some questions don't need retrieval at all. "What's 2 + 2?" is answered by the model directly. Static RAG retrieves anyway, wasting tokens and possibly polluting the context with irrelevant chunks.
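A minimal sketch of a retrieval gate makes the point. Here a trivial keyword heuristic stands in for the decision; in agentic RAG the model itself makes this call:

```python
def needs_retrieval(question, known_topics=("auth", "migration", "config")):
    """Crude stand-in for the model's decision: retrieve only when the
    question touches a topic the document store actually covers."""
    return any(topic in question.lower() for topic in known_topics)
```

With this gate, "What's 2 + 2?" skips retrieval entirely, while "How does auth work?" triggers it.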
Bad retrieval drags down the answer
If retrieval returns five irrelevant chunks, the static system has no recourse. It dumps them into the prompt and asks the model to answer anyway. The model often hallucinates a confident wrong answer rather than admitting it doesn't have what it needs.
Agentic RAG: retrieval as a decision
In agentic RAG, retrieval is a tool the agent can call. The decision is moved into the model:
```python
search_tool = {
    "name": "search_docs",
    "description": "Search the project documentation by keyword. Use when the user asks about features, configuration, or how something works.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def agentic_rag(question, tools, registry):
    return run_agent(
        goal=question,
        tools=tools,  # includes search_tool
        registry=registry,
    )
```

Now the model can:
- Decide whether to retrieve. Skip retrieval for questions it can answer directly.
- Choose its own query. Reformulate the user's question into a query the retriever can actually use.
- Retrieve multiple times. Run a sequence of queries that each build on the last.
- Recover from bad retrieval. Try a different query when the first one returns nothing useful.
It's the same retrieval mechanism. The difference is who's driving.
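Under the hood, "who's driving" is just a loop: the model picks the next action, the system executes it, and the result goes back into the history. A minimal sketch, with a scripted stub in place of a real LLM and a hypothetical `toy_search` in place of a real retriever:

```python
def run_agent(goal, search, model, max_steps=5):
    """Minimal agent loop: the model chooses each step, including whether
    and what to retrieve, until it decides to finish."""
    history = []
    for _ in range(max_steps):
        action = model(goal, history)  # the model decides the next step
        if action["type"] == "search":
            history.append(("obs", search(action["query"])))
        elif action["type"] == "finish":
            return action["answer"]
    return "gave up"

def scripted_model(goal, history):
    # Stub: first reformulate the question into a retriever-friendly query,
    # then answer from the observation. A real system calls an LLM here.
    if not history:
        return {"type": "search", "query": "CSRF token handling login"}
    return {"type": "finish", "answer": f"Based on: {history[-1][1]}"}

def toy_search(query):
    return 'Commit abc123 by alice@... "Fix CSRF token handling on login"'

answer = run_agent("Was the auth bug fixed?", toy_search, scripted_model)
```

Note that the loop itself contains no retrieval logic at all; every "whether, how, and what to retrieve" decision lives in the model.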
A side-by-side example
Question: "Was the auth bug fixed by the same person who introduced it?"
Static RAG:
```
retrieve("Was the auth bug fixed by the same person who introduced it?")
→ returns 5 chunks, mostly irrelevant

LLM: "I don't have enough information to answer." [or hallucinates]
```

Agentic RAG (full trace):
```
Thought: I need to find the bug first, then who introduced it, then the fix.
Action: search_commits[auth bug fix]
Observation: Commit abc123 by alice@... "Fix CSRF token handling on login"

Thought: Now I need to find who introduced the original bug.
Action: search_commits[CSRF token handling login]
Observation: Commit def456 by alice@... "Add CSRF middleware to login flow"

Thought: Both commits are by alice. I have my answer.
Action: finish[Yes, both commits were by alice@...]
```

Two retrievals plus a final comparison, planned by the model, each step building on the last. None of this is possible in static RAG.
The trade-offs
Agentic RAG is more powerful, but it's not a free upgrade.
| Dimension | Static RAG | Agentic RAG |
|---|---|---|
| Latency | Fast (1 retrieval, 1 generation) | Slow (N retrievals, N generations) |
| Cost | Low | Higher (model is called repeatedly) |
| Predictability | Deterministic flow | Variable, depends on model choices |
| Quality on simple Qs | Good | About the same |
| Quality on complex Qs | Often wrong | Often correct |
| Failure mode | Confidently wrong | Sometimes loops or stalls |
Static RAG is the right choice when your questions are simple, latency matters, and you can tune the retriever well. Agentic RAG is the right choice when questions vary in complexity and the cost of being wrong is high.
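If you serve both kinds of questions, one option is a router in front of the two pipelines. The heuristic below is a hypothetical placeholder, assuming that long or multi-part questions are the complex ones; a production router might use a classifier or a cheap LLM call instead:

```python
def route(question):
    """Rough heuristic router: send long, multi-part questions to the
    agentic pipeline and everything else to static RAG."""
    markers = ("same person who", "and then", "compare", "both")
    if len(question.split()) > 12 or any(m in question.lower() for m in markers):
        return "agentic"
    return "static"
```

This keeps the common case fast and cheap while reserving the agent's latency budget for the questions that need it.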
Hybrid: the practical answer
Most production systems use both. A common architecture:
```python
def hybrid_rag(question, vector_store, agent):
    # Step 1: always retrieve a baseline set of chunks
    baseline = vector_store.search(question, k=3)

    # Step 2: hand the baseline to an agent that can retrieve more if needed
    return agent.run(
        goal=question,
        initial_context=baseline,
        tools=[search_tool, lookup_tool],
    )
```

The baseline retrieval gives the agent a starting point so it doesn't waste calls re-retrieving the obvious. The agent's tool access lets it dig deeper when the baseline isn't enough.
Static RAG isn't dead
There's a tendency in agent communities to dismiss static RAG as old-fashioned. It isn't. For high-volume, low-complexity workloads (FAQ bots, documentation search), static RAG is faster and cheaper, and the quality gap with agentic RAG is small. Pick the architecture for the workload, not the trend.
Key takeaway
The shift from static to agentic RAG is the shift from "retrieve, then generate" to "let the model decide when, how, and what to retrieve." The next two lessons go deep on the most important agent skills in this space: planning a sequence of retrievals (query planning) and shaping how the underlying documents are chunked (agent-driven chunking).