Agentic RAG
Query planning and multi-step retrieval
Breaking complex questions into retrieval sub-tasks.
The user's question is a starting point, not a query
Users phrase questions for other humans. "What changed about how we handle webhooks last quarter?" is a question. It's a terrible vector search query.
A good retrieval system has to bridge that gap. In static RAG, you bridge it once: rewrite the question, retrieve, generate. In agentic RAG, you have a richer option: let the model break the question into a sequence of focused sub-queries, retrieve for each one, and synthesize.
This is query planning. It's the most important skill in agentic retrieval.
Sub-query decomposition
Take the webhook question and watch what an agent can do with it:
User: "What changed about how we handle webhooks last quarter?"
Thought: This is a multi-part question. I need to:
1. Find code that handles webhooks
2. Find changes to that code in the last quarter
3. Summarize the changes
Action: search_code[webhook handler]
Observation: webhooks/handler.py, webhooks/dispatch.py, webhooks/retry.py
Thought: Now I need git history for these files in the last quarter.
Action: git_log[webhooks/, since=2025-01-01]
Observation: 14 commits across the three files...
Thought: I'll group the commits by feature area to summarize.
Action: read_commits[webhooks/, since=2025-01-01, top=5]
Observation: [details of the 5 most-changed commits]
Thought: I can now write the summary.
Action: finish[Webhook handling changed in three areas: ...]Three sub-queries, each more specific than the original question, each building on the previous answer. The original "what changed about webhooks" was useless as a search query. The sub-queries are excellent searches.
Two strategies for planning
There are two ways to get this behavior:
Implicit planning (let the agent figure it out)
Don't plan up front. Let ReAct-style reasoning drive the sequence. Each thought decides the next query based on what's been seen so far.
SYSTEM = """You are a research agent. Use the search tool repeatedly with focused queries.
Start broad, then narrow down. Each query should be 2 to 5 specific terms, not full sentences.
"""This works well when the right next query genuinely depends on what you find. It fails when the agent gets stuck on one path and doesn't realize the question has multiple parts.
Explicit planning (plan first, then execute)
Make the model produce a plan before any retrieval happens.
PLAN_PROMPT = """Given the user's question, list 3 to 6 sub-queries that together would let you answer it.
Output one sub-query per line, in the order you'd run them. No commentary.
"""
def plan_queries(question, model="llama3"):
response = ollama.chat(model=model, messages=[
{"role": "system", "content": PLAN_PROMPT},
{"role": "user", "content": question},
])
return [line.strip() for line in response.message.content.splitlines() if line.strip()]Then the agent runs each sub-query in sequence, gathering observations:
```python
def plan_and_retrieve(question, retriever):
    sub_queries = plan_queries(question)
    observations = []
    for q in sub_queries:
        chunks = retriever.search(q, k=3)   # top-3 chunks per sub-query
        observations.append({"query": q, "chunks": chunks})
    return synthesize(question, observations)   # final generation over everything gathered
```

The plan-first approach is more predictable. You can log it, evaluate it, even let the user inspect it. The trade-off is that the plan is fixed upfront, so the agent can't adapt mid-way if a sub-query reveals an unexpected angle.
A hybrid: plan first, but allow the agent to revise the plan after each retrieval.
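One way to sketch that hybrid, building on `plan_queries` and the assumed `synthesize` helper above. The `REVISE_PROMPT` and the queue discipline are illustrative assumptions:

```python
REVISE_PROMPT = """You will see the original question, the remaining sub-queries,
and the result of the last retrieval. Output the revised list of remaining
sub-queries, one per line. The list may be unchanged, shorter, or reordered."""

def hybrid_retrieve(question, retriever):
    queue = plan_queries(question)   # upfront plan
    observations = []
    while queue:
        q = queue.pop(0)
        chunks = retriever.search(q, k=3)
        observations.append({"query": q, "chunks": chunks})
        if queue:   # let the model revise what's left in light of the new result
            response = ollama.chat(model="llama3", messages=[
                {"role": "system", "content": REVISE_PROMPT},
                {"role": "user", "content": f"Question: {question}\nRemaining: {queue}\nLast result: {chunks}"},
            ])
            queue = [l.strip() for l in response.message.content.splitlines() if l.strip()]
    return synthesize(question, observations)
```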
Query rewriting
Even when the agent decides to retrieve once, the query it sends should usually not be the user's original question. A few transformations consistently help:
| User asks | Better query |
|---|---|
| "How do I fix the bug where logins fail on staging?" | "login failure staging environment" |
| "Was that decision documented anywhere?" | "[entity from prior turn] decision documented" |
| "Show me the function" | (depends on context) "function name from prior turn" |
The agent's job is to do this rewrite implicitly when it constructs each search call. You can also do it explicitly with a small "rewrite" prompt:
```python
REWRITE_PROMPT = """Given the user's question and the conversation so far, write a focused search query.
Use 2 to 5 keywords. Drop pronouns. Add any specific names from the conversation.
Output only the query, nothing else.
"""
```

The RAG literature calls this query rewriting or query expansion. A related technique, HyDE (Hypothetical Document Embeddings), goes a step further: instead of rewriting the question, it generates a hypothetical answer document and embeds that for retrieval. The goal is the same in each case: turn a conversational question into something the retriever can actually match.
Multi-hop retrieval
Some questions require chained retrievals where each query is built from the result of the previous one. This is hard for static RAG because there's no place for the result to feed back into the next query. It's natural for agents.
A pattern that works well:
SYSTEM = """For each retrieval, write your query in this format:
SEARCH: <query>
WHY: <one sentence explaining what you're looking for and how it builds on what you already know>
After each search, decide whether to do another search or finalize your answer.
"""The "WHY" line forces the model to articulate the chain. It also gives you a perfect log: scan the WHYs and you can see the model's reasoning trajectory at a glance.
Knowing when to stop retrieving
A planning agent can also retrieve too much. The fix is to give the model an explicit stop condition.
```python
SYSTEM_ADDENDUM = """After each retrieval, ask yourself:
1. Do I have enough to answer?
2. Will another retrieval probably add new information, or just rephrase what I already have?
If the answer to #1 is yes OR the answer to #2 is no, stop searching and answer.
"""
```

Pair this with the convergence detector from the loop control lesson and most over-retrieval problems disappear.
Re-ranking and filtering
After retrieval but before passing chunks to the model, filtering is often the highest-leverage step. The retriever returns the top K by vector similarity, but vector similarity isn't the same as relevance.
Two common filters:
LLM as a relevance judge
```python
def filter_chunks(question, chunks):
    response = ollama.chat(model="llama3", messages=[
        {"role": "system", "content": "For each numbered chunk, output YES if it helps answer the question, NO otherwise."},
        {"role": "user", "content": f"Question: {question}\n\n" + "\n".join(f"{i+1}. {c}" for i, c in enumerate(chunks))},
    ])
    # zip guards against the model emitting more or fewer lines than there are chunks
    verdicts = response.message.content.splitlines()
    return [c for c, line in zip(chunks, verdicts) if "YES" in line.upper()]
```

Costs an extra LLM call. Often cuts irrelevant context by 60% or more.
Cross-encoder rerankers
A small specialized model that scores (query, chunk) pairs. More accurate than vector similarity, much cheaper than an LLM call. Common choices: `cross-encoder/ms-marco-MiniLM-L-6-v2`, Cohere's rerank API. We dig deeper into rerankers in Track 4.
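For instance, with the sentence-transformers library, a sketch using the MiniLM model named above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=3):
    scores = reranker.predict([(query, c) for c in chunks])   # one relevance score per pair
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```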
Plan a few queries, then iterate
A useful default for production: plan 3 to 5 sub-queries upfront, run them in parallel, then let the agent decide whether to do additional one-off retrievals based on what came back. You get the predictability of explicit planning and the flexibility of agentic retrieval. This is also the architecture many "research agent" products use.
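Wiring the first half of that together, building on `plan_queries` and `synthesize` from earlier, and assuming `retriever.search` is thread-safe (worth checking for your vector store):

```python
from concurrent.futures import ThreadPoolExecutor

def plan_parallel_retrieve(question, retriever):
    sub_queries = plan_queries(question)
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda q: retriever.search(q, k=3), sub_queries))
    observations = [{"query": q, "chunks": c} for q, c in zip(sub_queries, results)]
    # Hand off to synthesis, or to a follow-up loop like multi_hop above
    return synthesize(question, observations)
```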
Key takeaway
Query planning is the difference between an agent that does one search and gives up, and one that breaks a complex question into a sequence of focused retrievals. The two strategies are implicit (let ReAct drive) and explicit (plan first, then execute). Either way, the agent's queries should look nothing like the user's question. The next lesson finishes the module by zooming in on the other side of retrieval: the chunks themselves.