Agentic RAG
Agent-driven chunking strategies
Letting the agent control how documents are processed.
The chunk is the unit of retrieval
Everything we've talked about in this module assumes the documents are already broken into chunks. The chunk is what you embed, what you store, and what you retrieve. It is also the level of granularity at which the model sees information.
If chunks are too small, retrieval is precise but each hit lacks context. If chunks are too large, retrieval is fuzzy and the model has to wade through irrelevant text. If chunks split mid-sentence or mid-function, retrieval gets noisy regardless of size.
Static RAG systems use a fixed chunking strategy: 500-token chunks with 50-token overlap, set once at ingestion time. Agentic RAG opens up a richer space: the agent can influence chunking at retrieval time, or even let the chunking strategy itself become a tool the agent calls.
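For concreteness, the fixed-size baseline is only a few lines, which is part of why it's so common. A minimal sketch (the whitespace split stands in for a real tokenizer; a production pipeline would count tokens with the embedding model's own tokenizer):

def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking: split on token count, ignore all structure.
    tokens = text.split()  # stand-in for a real tokenizer
    step = size - overlap
    return [
        " ".join(tokens[i : i + size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]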
The classic problem
Take a Markdown document with sections, code blocks, and tables. The naive chunker splits it every 500 tokens regardless of structure:
Chunk 1: "...# Authentication\n\nWe use JWT tokens. The flow is:\n\n1. User logs in\n2. Server returns token\n3."
Chunk 2: "Client stores token in localStorage\n\n```python\ndef login(...):\n pass\n```\n\n# Authorization..."
Chunk 3: "...is enforced via middleware. Each request..."

Three problems jump out:
- The numbered list is split. The model sees "1. ... 2. ... 3." in chunk 1 and the rest in chunk 2.
- The code block is split. Chunk 2 starts mid-prose and contains a fragment of code.
- The "# Authorization" header lands in the middle of chunk 2 instead of starting its own chunk.
Each of these makes retrieval and reasoning slightly worse. They compound across thousands of documents.
Structure-aware chunking
The first improvement is chunking on structural boundaries, not character counts:
import re

def chunk_markdown(text: str, target_size: int = 800, max_size: int = 1200) -> list[str]:
    # Split before each header line so every section starts with its heading.
    sections = re.split(r"(?=^#+ )", text, flags=re.MULTILINE)
    chunks: list[str] = []
    buf = ""
    for s in sections:
        if len(buf) + len(s) <= target_size:
            buf += s  # section fits in the current chunk
            continue
        if buf:
            chunks.append(buf)
        if len(s) > max_size:
            # Oversized section: fall back to paragraph boundaries.
            buf = ""
            for p in s.split("\n\n"):
                if buf and len(buf) + len(p) > max_size:
                    chunks.append(buf)
                    buf = p
                else:
                    buf = f"{buf}\n\n{p}" if buf else p
        else:
            buf = s
    if buf:
        chunks.append(buf)
    return chunks

The result is chunks that respect header and paragraph boundaries instead of arbitrary character offsets. Same target size, much higher quality. (One caveat: splitting on blank lines can still cut through a fenced code block that contains them, so a production chunker should also treat fenced blocks as atomic.)
For different document types, use different chunkers:
| Document type | Chunk on |
|---|---|
| Markdown / docs | Headers, then paragraphs |
| Source code | Functions, then classes, then files |
| HTML | Block-level elements (sections, articles) |
| Conversation logs | Turn boundaries |
| PDFs | Pages, then headers (after layout extraction) |
This gets you 80% of the win with 20% of the effort.
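In code, this can be as simple as a registry keyed by file extension. A sketch (the registry layout and the chunk_plaintext fallback are illustrative, not a fixed API; chunk_markdown is the function above):

from pathlib import Path
from typing import Callable

def chunk_plaintext(text: str) -> list[str]:
    # Fallback for formats with no recognizable structure: paragraphs only.
    return [p for p in text.split("\n\n") if p.strip()]

# One chunker per document type; extend with code/HTML/PDF chunkers as needed.
CHUNKERS: dict[str, Callable[[str], list[str]]] = {
    ".md": chunk_markdown,
    ".txt": chunk_plaintext,
}

def chunk_document(path: str, text: str) -> list[str]:
    chunker = CHUNKERS.get(Path(path).suffix.lower(), chunk_plaintext)
    return chunker(text)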
What "agent-driven chunking" actually means
The phrase has two related meanings, both useful:
1. Agent-controlled retrieval granularity
The agent decides at retrieval time how big a chunk it wants. The chunker stores documents at multiple granularities; the retriever returns whichever level the agent asked for.
def search_docs(query: str, granularity: str = "section") -> list[dict]:
    """Search project docs.

    granularity:
        'paragraph' for precise answers (small chunks)
        'section'   for context (medium chunks, default)
        'document'  for whole-file overview (large chunks)
    """
    return vector_store.search(query, k=5, granularity=granularity)

The agent can start with section, narrow down to paragraph if the section is too broad, or zoom out to document if it needs the whole picture. This is essentially the small-to-big retrieval pattern, but the agent decides which "size" to ask for.
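Underneath, one simple implementation tags every chunk with its granularity at ingestion time and filters on that tag at query time. A sketch of the store (in-memory, brute-force dot-product scoring over normalized embeddings; embed() stands for whatever embedding call you use, and the whole layout is an assumption, not a required design):

import numpy as np

class MultiGranularityStore:
    def __init__(self):
        # Each row: {"text": ..., "granularity": ..., "vec": ...}
        self.rows: list[dict] = []

    def add(self, text: str, granularity: str) -> None:
        # embed(text) -> np.ndarray is assumed (your embedding model).
        self.rows.append({"text": text, "granularity": granularity, "vec": embed(text)})

    def search(self, query: str, k: int = 5, granularity: str = "section") -> list[dict]:
        q = embed(query)
        candidates = [r for r in self.rows if r["granularity"] == granularity]
        candidates.sort(key=lambda r: -float(np.dot(q, r["vec"])))
        return candidates[:k]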
2. Agent-driven chunk synthesis
The agent doesn't just retrieve fixed chunks; it composes the context window itself by reading what it needs. Tools look more like a filesystem than a search engine:
list_documents() -> [doc_id, title, summary]
list_sections(doc_id) -> [section_id, heading]
read_section(doc_id, section_id) -> str
search(query, scope=doc_id|section_id) -> [hits]

Now the agent navigates the corpus instead of retrieving from it. It looks at the table of contents, picks a likely section, reads it, and decides whether to drill deeper or move on.
This is overkill for most use cases: for casual Q&A it adds tool-call overhead and latency for little gain. But for complex tasks (research, code analysis, audits), it dramatically outperforms search-only retrieval because the agent can use document structure as a guide.
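A minimal sketch of what backs the navigation tools, assuming documents were parsed into titled sections at ingestion time. The corpus layout and IDs here are illustrative; search is omitted because it is the same vector lookup as before, just scoped to one document or section:

# Hypothetical parsed corpus: doc_id -> title, summary, and ordered sections.
CORPUS: dict[str, dict] = {
    "auth-guide": {
        "title": "Authentication Guide",
        "summary": "JWT login flow and middleware-based authorization.",
        "sections": {
            "s1": ("Authentication", "We use JWT tokens. The flow is: ..."),
            "s2": ("Authorization", "Enforced via middleware. Each request ..."),
        },
    },
}

def list_documents() -> list[dict]:
    return [{"doc_id": d, "title": v["title"], "summary": v["summary"]}
            for d, v in CORPUS.items()]

def list_sections(doc_id: str) -> list[dict]:
    return [{"section_id": s, "heading": h}
            for s, (h, _) in CORPUS[doc_id]["sections"].items()]

def read_section(doc_id: str, section_id: str) -> str:
    heading, body = CORPUS[doc_id]["sections"][section_id]
    return f"# {heading}\n\n{body}"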
A practical recipe
For most Track-1-level applications, this stack works well:
- Ingest with structure-aware chunking. Don't do fixed-size unless you have to.
- Store chunks at two granularities. A "fine" level (paragraphs / functions) and a "coarse" level (sections / files), with cross-references between them (see the sketch after this list).
- Default the search tool to coarse granularity. The agent gets context first.
- Provide a "zoom in" tool. Given a coarse chunk, return the fine chunks inside it.
- Provide a "zoom out" tool. Given a fine chunk, return the coarse chunk that contains it.
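A sketch of the storage behind steps 2, 4, and 5 (the ID scheme and in-memory dicts are assumptions; any store that keeps parent/child references works the same way):

from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: str | None = None                        # fine -> enclosing coarse chunk
    child_ids: list[str] = field(default_factory=list)  # coarse -> contained fine chunks

CHUNKS: dict[str, Chunk] = {}

def ingest_section(section_id: str, section_text: str) -> None:
    # Store one coarse chunk plus its fine (paragraph) chunks, cross-referenced.
    coarse = Chunk(section_id, section_text)
    for i, para in enumerate(section_text.split("\n\n")):
        fine_id = f"{section_id}.p{i}"
        CHUNKS[fine_id] = Chunk(fine_id, para, parent_id=section_id)
        coarse.child_ids.append(fine_id)
    CHUNKS[section_id] = coarse

def zoom_in(section_id: str) -> list[str]:
    return [CHUNKS[cid].text for cid in CHUNKS[section_id].child_ids]

def zoom_out(paragraph_id: str) -> str:
    return CHUNKS[CHUNKS[paragraph_id].parent_id].text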
The tool definitions the agent sees:
search_tool = {
    "name": "search",
    "description": "Search documents at section granularity. Use this first to find relevant areas.",
    "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
}
zoom_in_tool = {
    "name": "zoom_in",
    "description": "Get the paragraph-level chunks inside a section. Use when a section is too broad.",
    "parameters": {"type": "object", "properties": {"section_id": {"type": "string"}}, "required": ["section_id"]},
}
zoom_out_tool = {
    "name": "zoom_out",
    "description": "Get the surrounding section for a paragraph. Use when you need more context.",
    "parameters": {"type": "object", "properties": {"paragraph_id": {"type": "string"}}, "required": ["paragraph_id"]},
}

Three tools, two granularities, and an agent that can move between them. This is enough to handle a wide range of retrieval-heavy tasks well.
What to skip
A few popular ideas that often aren't worth it for early-stage projects:
| Idea | Why to skip (initially) |
|---|---|
| Hierarchical summarization at every level | High ingestion cost, diminishing returns |
| Knowledge graphs auto-built from documents | Brittle, expensive to maintain, often outperformed by careful chunking + good retrieval |
| Custom embeddings fine-tuned per project | Fine-tune the retriever only when generic embeddings demonstrably fail on your data |
These can all help in the right setting, but they add complexity that most projects don't need. Spend your effort on chunking strategy and retrieval policy first.
Where Track 4 picks this up
Production retrieval systems add a lot of machinery on top of the basics here: hybrid search (vector + keyword), cross-encoder rerankers, evaluation pipelines, query analytics. Those belong in Track 4. The goal of this lesson is to give you the chunking and granularity primitives the rest of the stack assumes.
Key takeaway
Chunking is the silent variable in every RAG system. Structure-aware chunking is a simple upgrade that improves both retrieval and reasoning. Agent-driven chunking takes it further: the agent picks granularity at retrieval time, or even navigates the corpus tool by tool. Combine that with the query planning from the previous lesson and you have an agent that can answer questions a static RAG pipeline cannot.
This is the last lesson in Track 1. You now have the full set of agent fundamentals: the loop, the tools, the ReAct pattern, memory, and agentic retrieval. Track 2 takes everything you've built and scales it to multi-agent systems.