The ReAct pattern
Reasoning + Acting: the paper and the intuition
Understanding ReAct and why it works.
The paper that started everything
In October 2022, Yao et al. published a paper called "ReAct: Synergizing Reasoning and Acting in Language Models." It is, quietly, one of the most influential papers in the agent era. Almost every agent framework in production today is a descendant of ReAct, even when it isn't called that.
The core idea is one sentence: let the model interleave thinking and acting in plain text.
That's it. No new training, no new architecture. Just a different prompt structure that produces dramatically better results on tasks that require multi-step reasoning with tool use.
The intuition
Before ReAct, there were two camps:
- Chain-of-thought (CoT) prompting. Ask the model to reason step by step before answering. Great for math and logic. Useless for tasks that need fresh information, because CoT only operates on what the model already knows.
- Tool-use prompting. Give the model tools and let it call them. Great for fetching data. Bad for reasoning, because each tool call was treated as the final action.
ReAct combines them. The model produces an explicit thought, then an action, then receives an observation, then produces another thought, and so on. Reasoning and acting are no longer separate phases. They're alternating moves in the same conversation.
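The alternation can be sketched as a plain loop. In this sketch, `generate` stands in for whatever produces the model's next text turn given the transcript so far, and `search` is a stub tool; both names are hypothetical, not from the paper:

```python
import re

def search(query: str) -> str:
    # Hypothetical stand-in for a real retrieval tool (the paper used Wikipedia).
    return f"(stub) results for {query!r}"

TOOLS = {"search": search}

def run_react(question: str, generate, max_steps: int = 10) -> str:
    """Alternate model turns and tool observations until finish[...] or a step cap."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        turn = generate(transcript)           # e.g. "Thought 1: ...\nAction 1: search[query]"
        transcript += turn + "\n"
        m = re.search(r"Action(?: \d+)?:\s*(\w+)\[(.*?)\]", turn)
        if m is None:
            break                             # model produced no action
        tool, arg = m.groups()
        if tool == "finish":
            return arg                        # finish[answer] ends the loop
        observation = TOOLS[tool](arg)        # act, then let the world talk back
        transcript += f"Observation: {observation}\n"
    return transcript                         # ran out of steps without finishing
```

Note that the loop itself is trivial; all the intelligence lives in the transcript the model sees, which accumulates thoughts, actions, and observations in order.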
What a ReAct trace looks like
The classic format from the paper:
```
Question: Aside from Apple Remote, what other devices can control the program
that the Apple Remote was originally designed to interact with?

Thought 1: I need to find what program the Apple Remote was originally
designed to control.
Action 1: search[Apple Remote]
Observation 1: The Apple Remote is a remote control introduced in October 2005
by Apple. It was originally designed to control the Front Row media center program.

Thought 2: Apple Remote was designed to control Front Row. I need to find what
other devices can control Front Row.
Action 2: search[Front Row media center]
Observation 2: Front Row is a discontinued media center software. It was
controlled by Apple Remote and the keyboard function keys.

Thought 3: So besides Apple Remote, the keyboard function keys can also control
Front Row.
Action 3: finish[keyboard function keys]
```

Three things make this work:
- The thought is explicit. The model writes out its reasoning before each action. This is chain-of-thought, but at every step instead of only at the end.
- Actions are tied to thoughts. The model chooses the action because of the thought it just wrote. The thought is its plan; the action is the execution.
- Observations interrupt reasoning. The world talks back. Real data, possibly contradicting what the model assumed, gets injected before the next thought.
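The `Thought:`/`Action:`/`Observation:` markers are exactly what makes a paper-style trace machine-parseable. A minimal sketch of pulling a trace apart into structured steps (the tuple shape is my own choice, not prescribed by the paper):

```python
import re

# Match each labeled step up to the next label or end of trace.
STEP_RE = re.compile(
    r"(Thought|Action|Observation) (\d+):\s*(.+?)"
    r"(?=\n(?:Thought|Action|Observation) \d+:|\Z)",
    re.DOTALL,
)

def parse_trace(trace: str) -> list[tuple[str, int, str]]:
    """Return (kind, step_number, text) tuples in order of appearance."""
    return [
        (kind, int(n), text.strip().replace("\n", " "))
        for kind, n, text in STEP_RE.findall(trace)
    ]
```

In the original ReAct setup, a parser like this is the entire "framework": the harness reads the model's latest `Action`, executes it, and appends the `Observation` line itself.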
Why it works better than the alternatives
The paper measured ReAct on knowledge-intensive tasks (HotpotQA, FEVER) and decision-making tasks (ALFWorld, WebShop). Across the benchmarks, ReAct beat pure tool-use everywhere, and either beat pure CoT outright or, combined with CoT, produced the best results:
| Approach | Knowledge tasks | Decision tasks |
|---|---|---|
| CoT only | OK on stuff the model already knows | Fails (no acting) |
| Acting only | Limited reasoning, often picks wrong action | Decent baseline |
| ReAct | Best on both | Best on both |
The mechanism, as far as anyone can tell, is grounding. The thoughts give the model a place to plan and self-correct. The observations give the model fresh facts to plan against. Without thoughts, the model picks actions impulsively. Without observations, the model hallucinates facts to plan with. ReAct closes both gaps.
ReAct vs modern function calling
If you're paying attention, the ReAct loop sounds a lot like the orchestration loop we built in Module 2. That's not a coincidence. The five-phase loop is a direct descendant of ReAct, with two differences:
| ReAct (paper version) | Modern function calling |
|---|---|
| Model emits "Thought:" and "Action:" as text you parse | Model returns structured tool calls; reasoning is implicit (or in a separate reasoning field) |
| Single tool call per step | Often multiple tool calls per step |
| Loop continues until model writes "finish[answer]" | Loop continues until model returns no tool calls |
Native function calling APIs hide the textual ReAct format, but they implement the same loop. The model still alternates between deciding and acting. You just don't see the "Thought:" prefix in the response.
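Here is a sketch of that same loop against an OpenAI-style chat completions client. The model name and the `execute` helper are placeholders for illustration; check your SDK's actual shapes before relying on this:

```python
def run_agent(client, messages, tools, execute, max_steps=10):
    """ReAct-style loop over native function calling: act until no tool calls."""
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools)
        msg = response.choices[0].message
        if not msg.tool_calls:            # no tool calls = the modern finish[answer]
            return msg.content
        messages.append(msg)              # keep the assistant turn in history
        for call in msg.tool_calls:       # possibly several calls per step
            result = execute(call)        # run the tool; the result is the observation
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})
    return None                           # step cap reached without an answer
```

The structure is identical to the textual loop: decide, act, observe, repeat. Only the parsing has moved from regexes over `Action:` lines into the API's structured `tool_calls` field.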
That said, you can still ask the model to write its thoughts explicitly. This is sometimes worth doing because:
- It's easier to debug. The trace is human-readable.
- It can improve quality, especially on smaller models. Forcing the model to verbalize reasoning before acting reduces impulsive tool calls.
- It works on models without native function calling (which we'll exploit in the next lesson).
```python
SYSTEM_PROMPT = """You are an agent that solves problems by reasoning and using tools.

Format every response as:

Thought: <your reasoning about what to do next>
Action: <the tool to call, or 'finish' if done>

After you call a tool, you will receive an Observation. Use it to inform your next Thought.
"""
```

This is "ReAct prompting" on top of native function calling. It costs you a few tokens per turn and often pays for itself in correctness.
The legacy
Almost everything you'll build in Tracks 2 through 4 is a refinement of ReAct:
- Plan-and-execute agents add a separate planning phase before the ReAct loop starts.
- Reflexion agents add a meta-thought layer after each task to evaluate the outcome.
- Tree-of-thought agents run multiple ReAct branches in parallel and pick the best.
- Multi-agent systems split the thought and action phases across different specialized models.
You don't need to read each of those papers to build agents. You do need to internalize ReAct, because every other pattern is a delta against it.
Read the original
The paper is short and unusually readable for ML literature. If you build agents seriously, spend an hour with it: arXiv 2210.03629. The benchmarks are dated, but the framing of "reasoning + acting as alternating moves" is still the cleanest description of what an agent does.
Key takeaway
ReAct is the idea that an agent should alternate between writing a thought and taking an action, with observations from the world feeding the next thought. It's the design pattern underneath every modern agent framework, even when those frameworks don't mention it. The next lesson builds a ReAct agent from scratch using a model that doesn't have native function calling, so you can see the full mechanism in plain text.