
Agent Loops — ReAct, Reflexion, Tree-of-Thoughts & Plan-and-Execute
Master the four canonical reasoning loops for agents. Learn when each one is the right tool, how much they actually cost, and the bounds every production loop must enforce.
What you will learn
Once you have an LLM that can call tools, the next question is how it should think between calls. The answer is an agent loop: a discipline for interleaving reasoning, action, and observation. Four loops dominate production today: ReAct, Reflexion, Tree-of-Thoughts, and Plan-and-Execute. Each fits a different shape of problem; picking the wrong one can cost many times the tokens and still give worse results.
ReAct — The Default Loop
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., ICLR 2023) is the loop almost every agent framework still uses by default. The insight: don't separate reasoning from acting; interleave them. Each turn, the model emits a thought, then an action, then sees the observation, then thinks again. The original paper reported absolute success-rate gains of 34% on ALFWorld and 10% on WebShop over imitation-learning and reinforcement-learning baselines.
```python
def react_loop(client, user_input, tools, max_iters=10):
    messages = [{"role": "user", "content": user_input}]
    for i in range(max_iters):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return resp.content  # final answer
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        results = [execute_tool(t.name, t.input) for t in tool_uses]
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": t.id, "content": str(r)}
            for t, r in zip(tool_uses, results)
        ]})
    raise RuntimeError("agent exceeded budget")
```
That 20-line loop is the foundation. Add prompt caching, structured logging, parallel tool execution, and a graceful timeout, and you have a production-ready ReAct agent.
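The loop calls `execute_tool` without defining it. One possible shape is sketched below, assuming tools are plain Python callables kept in a dictionary. `TOOL_REGISTRY`, `register_tool`, and the `add` tool are hypothetical names introduced for illustration. Failures are returned as strings so the model sees them as observations instead of crashing the loop.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to plain Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


def execute_tool(name: str, tool_input: Dict[str, Any]) -> str:
    """Run one tool call and always return a string observation."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return f"error: unknown tool {name!r}"
    try:
        return str(fn(**tool_input))
    except Exception as exc:
        # Report the failure to the model as an observation.
        return f"error: {type(exc).__name__}: {exc}"


@register_tool("add")
def add(a: float, b: float) -> float:
    """Toy tool used only to exercise the dispatcher."""
    return a + b
```

Returning errors as text lets the next thought react to the failure, which is the point of interleaving reasoning with observation.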
Reflexion — Verbal Self-Criticism
Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., NeurIPS 2023) layers a self-critique step on top of ReAct. After each attempt, the model writes a textual reflection — what went wrong, what to try next — and that reflection is appended to memory for the next attempt. The original paper hit 91% pass@1 on HumanEval, beating GPT-4's then-baseline of 80%.
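```python
def reflexion(client, task, tools, score_fn, threshold, max_attempts=3):
    reflections = []
    best_result, best_score = None, float("-inf")
    for attempt in range(max_attempts):
        prompt = build_prompt(task, reflections)
        result = react_loop(client, prompt, tools)
        score = score_fn(result)
        if score > best_score:
            best_result, best_score = result, score
        if score >= threshold:
            return result
        if attempt == max_attempts - 1:
            break  # no further attempt would use a new reflection
        # Ask the model to reflect on what went wrong
        reflection = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content":
                f"Attempt: {result}\nScore: {score}\n"
                "Reflect: what went wrong, what to try."}],
        )
        # The response content is a list of blocks; keep only the text
        reflections.append(
            "".join(b.text for b in reflection.content if b.type == "text"))
    return best_result  # best effort
```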
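The Reflexion code relies on a `build_prompt` helper that is not shown. The sketch below is one way it might fold reflections into the next attempt. The prompt wording and the `max_reflections` cap are assumptions, not taken from the paper.

```python
from typing import List


def build_prompt(task: str, reflections: List[str], max_reflections: int = 3) -> str:
    """Combine the task with the most recent reflections (hypothetical format)."""
    if not reflections:
        return task
    # Cap the memory so the prompt stays bounded across attempts.
    recent = reflections[-max_reflections:]
    notes = "\n".join(f"- {r}" for r in recent)
    return f"{task}\n\nLessons from earlier attempts:\n{notes}"
```

Keeping only the latest few reflections is a simple guard against the prompt growing with every attempt.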
Tree of Thoughts — Branching Search
Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., NeurIPS 2023) generalizes chain-of-thought into a search tree. At each step the model proposes multiple candidate next thoughts, an evaluator scores them, and the algorithm explores the most promising branches. The famous result: 74% on Game of 24 vs. 4% for chain-of-thought with the same GPT-4 backbone.
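The propose, evaluate, and explore cycle can be sketched as a breadth-first search that keeps a fixed number of best states per level. In a real agent, `propose` and `evaluate` would be LLM calls. Here they are deterministic toy functions (picking numbers that sum to a target) so the sketch runs offline, and every name is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # toy state: the numbers chosen so far


def tree_of_thoughts_bfs(
    root: State,
    propose: Callable[[State], List[State]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Optional[State]:
    """Breadth-first search that keeps the `beam_width` best states per level."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        for state in candidates:
            if is_goal(state):
                return state
        # Keep only the most promising branches for the next level.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None


TARGET = 7  # toy goal: choose numbers from {1, 2, 3} that sum to 7


def propose(state: State) -> List[State]:
    """Stand-in for an LLM proposing next thoughts."""
    return [state + (n,) for n in (1, 2, 3) if sum(state) + n <= TARGET]


def evaluate(state: State) -> float:
    """Stand-in for an LLM evaluator: closer to the target scores higher."""
    return float(sum(state))


def is_goal(state: State) -> bool:
    return sum(state) == TARGET
```

Calling `tree_of_thoughts_bfs((), propose, evaluate, is_goal)` returns a tuple whose entries sum to 7. Both `beam_width` and `max_depth` are hard caps, which is what keeps the search cost bounded.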
Plan-and-Execute — Decompose Once, Run
Plan-and-Execute builds on ideas from Reasoning with Language Model is Planning with World Model (Hao et al., 2023) and the related "Plan-and-Solve" prompting pattern. It splits the agent into two roles: a planner that drafts a multi-step plan up front, and an executor that runs each step (with its own ReAct loop if needed). The planner uses a strong model; the executor can use a cheaper one.
This pattern shines when (a) the task has a long horizon, (b) early steps don't depend on later observations, and (c) you want a human in the loop to approve the plan before execution. It struggles when reality diverges from the plan: every step that fails forces re-planning.
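The planner and executor split, including the re-planning cost when a step fails, might look like the sketch below. The planner and executor are passed in as callables so the control flow can be shown without any model calls. `toy_planner` and `make_flaky_executor` are hypothetical stand-ins.

```python
from typing import Callable, Iterable, List, Tuple

Done = List[Tuple[str, str]]  # (step, output) pairs completed so far
Planner = Callable[[str, Done], List[str]]
Executor = Callable[[str], Tuple[bool, str]]


def plan_and_execute(
    goal: str,
    plan_fn: Planner,
    execute_fn: Executor,
    max_replans: int = 2,
) -> Done:
    """Run a plan step by step; re-plan from completed work when a step fails."""
    completed: Done = []
    replans = 0
    plan = plan_fn(goal, completed)
    while plan:
        step = plan.pop(0)
        ok, output = execute_fn(step)
        if ok:
            completed.append((step, output))
            continue
        if replans >= max_replans:
            raise RuntimeError(f"step failed after {replans} re-plans: {step}")
        replans += 1
        # Ask the planner for a fresh plan covering only the remaining work.
        plan = plan_fn(goal, completed)
    return completed


def toy_planner(goal: str, completed: Done) -> List[str]:
    """Stand-in for the strong planner model: a fixed plan minus finished steps."""
    finished = {step for step, _ in completed}
    return [s for s in ["fetch", "parse", "summarize"] if s not in finished]


def make_flaky_executor(fail_once: Iterable[str]) -> Executor:
    """Stand-in for the cheaper executor; each listed step fails exactly once."""
    pending = set(fail_once)

    def execute(step: str) -> Tuple[bool, str]:
        if step in pending:
            pending.discard(step)
            return False, "transient error"
        return True, f"{step} done"

    return execute
```

A human approval gate would sit between the first `plan_fn` call and the `while` loop. The `max_replans` cap keeps a plan that keeps going stale from looping forever.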
Choosing the Right Loop
| Loop | Best for | Cost vs. ReAct | Watch out for |
|---|---|---|---|
| ReAct | General tool use, interactive tasks, anything where each step depends on the last observation | 1× (baseline) | Tool-call thrashing on errors; repetitive thought-action loops |
| Reflexion | Code gen, math, anything with cheap automatic scoring | 2-5× (multiple attempts) | Useless without a scoring function; reflection drift |
| Tree of Thoughts | Combinatorial search, planning puzzles, problems with structured intermediate states | 5-20× (breadth × depth) | Cost explosion; need a real evaluator |
| Plan-and-Execute | Long-horizon tasks, multi-step research, anywhere you want HITL approval | 1.2-2× (one planner pass + executor) | Plans go stale fast; re-planning logic is essential |
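To see where the "breadth × depth" multiplier for Tree of Thoughts comes from, the rough estimate below counts model calls for a beam-style search. It assumes one proposal call per frontier state and one evaluation call per candidate. Real implementations may batch these differently, and token cost also depends on how much context each call carries, so treat the number as an order-of-magnitude guide only.

```python
def tot_llm_calls(branching: int, beam_width: int, depth: int) -> int:
    """Estimate model calls for a beam-style tree search (assumed call pattern)."""
    calls = 0
    frontier = 1  # start from a single root state
    for _ in range(depth):
        proposals = frontier * branching
        calls += frontier   # one propose call per state in the frontier
        calls += proposals  # one evaluate call per proposed candidate
        frontier = min(beam_width, proposals)
    return calls
```

Under these assumptions, `tot_llm_calls(branching=3, beam_width=2, depth=3)` gives 20 calls, compared with 3 calls for a three-step ReAct run.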
Bounding the Loop — The Non-Negotiables
Whatever loop you pick, every production agent should enforce these caps in code, not in the prompt:
- Max iterations — typically 5-20. After this, the loop terminates with whatever partial result exists.
- Max wall-clock time — typically 30-300s. Cancel pending tool calls and exit.
- Max input tokens — typically 50k-500k. Truncate or summarize older turns when approaching the limit.
- Max tool calls per turn — typically 3-10. Even with parallel tool use, prevent fan-out blowups.
- Max identical-tool retries — typically 2. Detect `(name, input)` repeats and break out of thrashing.
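The caps above could be enforced with a small budget object that the loop consults on every turn. This is a sketch under stated assumptions: `LoopBudget` and `BudgetExceeded` are hypothetical names, the default limits are picked from inside the ranges listed, and the token cap simply raises here, whereas a fuller version would truncate or summarize older turns first.

```python
import json
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a hard cap is hit; the caller returns the partial result."""


@dataclass
class LoopBudget:
    max_iters: int = 10
    max_seconds: float = 120.0
    max_input_tokens: int = 200_000
    max_tool_calls_per_turn: int = 5
    max_identical_retries: int = 2
    started: float = field(default_factory=time.monotonic)
    iters: int = 0
    seen: Counter = field(default_factory=Counter)

    def start_turn(self, input_tokens: int) -> None:
        """Call once per loop iteration, before the model request."""
        self.iters += 1
        if self.iters > self.max_iters:
            raise BudgetExceeded("max iterations")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("max wall-clock time")
        if input_tokens > self.max_input_tokens:
            raise BudgetExceeded("max input tokens")

    def check_tool_calls(self, calls: List[Tuple[str, Dict[str, Any]]]) -> None:
        """`calls` holds the (name, input) pairs requested in one model turn."""
        if len(calls) > self.max_tool_calls_per_turn:
            raise BudgetExceeded("max tool calls per turn")
        for name, tool_input in calls:
            # Serialize the input so identical calls map to the same key.
            key = (name, json.dumps(tool_input, sort_keys=True))
            self.seen[key] += 1
            # The first call is free; only repeats count as retries.
            if self.seen[key] > 1 + self.max_identical_retries:
                raise BudgetExceeded(f"repeated call to {name}")
```

In the ReAct loop, `start_turn` would run at the top of each iteration and `check_tool_calls` just before the tools are executed, with `BudgetExceeded` caught at the call site to return whatever partial result exists.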
References
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022), arxiv.org
- Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023), arxiv.org
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023), arxiv.org
- Reasoning with Language Model is Planning with World Model (Hao et al., 2023), arxiv.org
- Anthropic, Building Effective Agents, anthropic.com