
Agent Loops — ReAct, Reflexion, Tree-of-Thoughts & Plan-and-Execute

6 min · Intermediate · 1,345 words
Master the four canonical reasoning loops for agents. Learn when each one is the right tool, how much they actually cost, and the bounds every production loop must enforce.

What you will learn

01 ReAct — The Default Loop
02 Reflexion — Verbal Self-Criticism
03 Tree of Thoughts — Branching Search
04 Plan-and-Execute — Decompose Once, Run
05 Choosing the Right Loop
06 Bounding the Loop — The Non-Negotiables

Once you have an LLM that can call tools, the next question is how it should think between calls. The answer is an agent loop: a discipline for interleaving reasoning, action, and observation. Four loops dominate production today: ReAct, Reflexion, Tree-of-Thoughts, and Plan-and-Execute. Each fits a different shape of problem, and the wrong one can burn an order of magnitude more tokens for worse results.

🔑
The loops in one sentence each
ReAct = think, act, observe, repeat. Reflexion = ReAct + verbal self-criticism between attempts. Tree-of-Thoughts = branch into multiple reasoning paths, score them, keep the best. Plan-and-Execute = draft a multi-step plan once, then execute steps. Pick by the shape of your problem, not the popularity of the paper.

ReAct — The Default Loop

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., ICLR 2023) is the loop almost every agent framework still uses by default. The insight: don't separate reasoning from acting; interleave them. Each turn, the model emits a thought, then an action, then sees the observation, then thinks again. The original paper reported absolute success-rate gains of 34% on ALFWorld and 10% on WebShop over imitation-learning and reinforcement-learning baselines.

[Figure: the ReAct cycle. Thought ("I need to find X") → Action (tool_use call) → Observation (tool_result) → repeat.]
ReAct's three nodes. The thought is private monologue; the action is a tool call; the observation is the tool's result. The loop terminates when the model emits a final text response with no tool call.
Python · ReAct loop with bounded budget
def react_loop(client, user_input, tools, max_iters=10):
    # execute_tool(name, input) is a helper you supply; it runs one tool call.
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iters):
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=2048,
            tools=tools, messages=messages,
        )
        # Keep the assistant turn (thoughts + tool calls) in the transcript
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return resp.content   # final answer

        # Run every requested tool and send all results back in one user turn
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        results = [execute_tool(t.name, t.input) for t in tool_uses]
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": t.id, "content": str(r)}
            for t, r in zip(tool_uses, results)
        ]})
    raise RuntimeError("agent exceeded budget")

That roughly 20-line loop is the foundation. Add prompt caching, structured logging, parallel tool execution, and a graceful timeout, and you have the core of a production-ready ReAct agent.
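The loop calls `execute_tool`, which the lesson leaves undefined. Below is a minimal sketch of one way to write it, assuming a hypothetical `TOOL_REGISTRY` dict that maps tool names to plain Python callables (both the registry and the two toy tools are illustrative, not part of any SDK). Failures are returned as text so the model sees them as an observation instead of crashing the loop.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> plain Python callable.
# The two toy tools exist only to make the sketch runnable.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def execute_tool(name: str, tool_input: Dict[str, Any]) -> str:
    """Run one tool call and always return a string observation."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return f"error: unknown tool '{name}'"
    try:
        return str(tool(**tool_input))
    except Exception as exc:
        # Surface the failure to the model rather than raising.
        return f"error: {type(exc).__name__}: {exc}"
```

Returning errors as observations gives the model a chance to correct a bad argument on the next turn; the iteration cap in `react_loop` still stops it from retrying forever.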

Reflexion — Verbal Self-Criticism

Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., NeurIPS 2023) layers a self-critique step on top of ReAct. After each attempt, the model writes a textual reflection — what went wrong, what to try next — and that reflection is appended to memory for the next attempt. The original paper hit 91% pass@1 on HumanEval, beating GPT-4's then-baseline of 80%.

🔄
When Reflexion helps
Reflexion only beats ReAct when (a) you can cheaply score the output (unit tests for code, exact-match for QA) and (b) the task allows multiple attempts. If you can't tell whether the agent succeeded, the reflection is just expensive prose. Use it for code generation, math, and bounded reasoning tasks; skip it for one-shot interactive responses.
Python · Reflexion sketch
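def reflexion(client, task, tools, score_fn, threshold=1.0, max_attempts=3):
    reflections = []
    best_text, best_score = None, float("-inf")
    for attempt in range(max_attempts):
        # build_prompt (a helper you supply) folds earlier reflections into the task
        prompt = build_prompt(task, reflections)
        blocks = react_loop(client, prompt, tools)
        text = "".join(b.text for b in blocks if b.type == "text")
        score = score_fn(text)
        if score > best_score:
            best_text, best_score = text, score
        if score >= threshold:
            return text
        # Ask the model to reflect on what went wrong
        reflection = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=512,
            messages=[{"role": "user", "content": f"Attempt: {text}\nScore: {score}\nReflect: what went wrong, what to try."}],
        )
        reflections.append(reflection.content[0].text)
    return best_text   # best effort: the highest-scoring attempt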
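The sketch leans on two helpers it does not define: `build_prompt` and a `score_fn`. One possible shape for each is shown below, assuming the attempt has been reduced to a string of Python source and that "cheap scoring" means running it against a few test cases. `make_test_scorer` is a hypothetical name, and the bare `exec` is for illustration only; untrusted model output should run in a sandbox.

```python
from typing import Callable, List, Tuple


def build_prompt(task: str, reflections: List[str]) -> str:
    """Append earlier reflections so the next attempt can use them."""
    if not reflections:
        return task
    notes = "\n".join(f"- {r}" for r in reflections)
    return f"{task}\n\nLessons from earlier attempts:\n{notes}"


def make_test_scorer(
    func_name: str, cases: List[Tuple[tuple, object]]
) -> Callable[[str], float]:
    """Return score_fn(source) -> fraction of test cases the candidate passes."""

    def score_fn(source: str) -> float:
        namespace: dict = {}
        try:
            exec(source, namespace)  # illustration only: sandbox real model output
            func = namespace[func_name]
        except Exception:
            return 0.0  # code that does not load scores zero
        passed = 0
        for args, expected in cases:
            try:
                passed += func(*args) == expected
            except Exception:
                pass  # a crashing case counts as a failure
        return passed / len(cases)

    return score_fn


# Usage: a scorer for a two-case spec of add(a, b).
score_add = make_test_scorer("add", [((1, 2), 3), ((0, 0), 0)])
```

A fractional score gives the reflection step something concrete to talk about ("half the cases fail"), which tends to be more useful than a bare pass/fail.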

Tree of Thoughts — Branching Search

Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., NeurIPS 2023) generalizes chain-of-thought into a search tree. At each step the model proposes multiple candidate next thoughts, an evaluator scores them, and the algorithm explores the most promising branches. The famous result: 74% on Game of 24 vs. 4% for chain-of-thought with the same GPT-4 backbone.

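[Figure: a search tree. The root branches into A (0.2), B (0.9), and C (0.5); low-scoring branches are pruned and the best, B, is expanded into B1 (0.6), B2 (0.95, best), and B3 (0.3).]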
ToT explores multiple thoughts per step, scores them with an evaluator (LLM-as-judge or rule-based), and prunes weak branches. The cost grows with breadth × depth, so use it only when the problem's search structure justifies it.
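The propose, score, prune cycle can be sketched as a breadth-limited search. This is a simplified illustration, not the paper's implementation: `propose` and `score` are hypothetical stand-ins for what would be LLM calls in a real agent, and the toy problem (spelling a target string one letter at a time) exists only so the code runs without a model.

```python
from typing import Callable, List, Tuple


def tree_of_thoughts(
    root: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str], float],
    breadth: int = 2,
    depth: int = 3,
) -> Tuple[str, float]:
    """Expand every kept state, score the children, keep the top `breadth`."""
    frontier = [root]
    best_state, best_score = root, score(root)
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        ranked = sorted(candidates, key=score, reverse=True)
        frontier = ranked[:breadth]  # prune weak branches
        top_score = score(frontier[0])
        if top_score > best_score:
            best_state, best_score = frontier[0], top_score
    return best_state, best_score


# Toy stand-ins for the LLM proposer and evaluator.
TARGET = "cab"


def propose(state: str) -> List[str]:
    return [state + ch for ch in "abc"] if len(state) < len(TARGET) else []


def score(state: str) -> float:
    return sum(a == b for a, b in zip(state, TARGET)) / len(TARGET)


result = tree_of_thoughts("", propose, score, breadth=2, depth=3)
```

Each level calls `propose` once per kept state and `score` once or more per candidate, which is where the breadth × depth cost comes from.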

Plan-and-Execute — Decompose Once, Run

Plan-and-Execute builds on planning-first work such as Reasoning with Language Model is Planning with World Model (Hao et al., 2023) and the related "Plan-and-Solve" pattern. It splits the agent into two roles: a planner that drafts a multi-step plan up front, and an executor that runs each step (with its own ReAct loop if needed). The planner typically uses a strong model; the executor can use a cheaper one.

This pattern shines when (a) the task has a long horizon, (b) early steps don't depend on later observations, and (c) you want a human in the loop to approve the plan before execution. It struggles when reality diverges from the plan: every step that fails forces re-planning.

Choosing the Right Loop

| Loop | Best for | Cost vs. ReAct | Watch out for |
| --- | --- | --- | --- |
| ReAct | General tool use, interactive tasks, anything where each step depends on the last observation | 1× (baseline) | Tool-call thrashing on errors; reflection loops |
| Reflexion | Code gen, math, anything with cheap automatic scoring | 2-5× (multiple attempts) | Useless without a scoring function; reflection drift |
| Tree of Thoughts | Combinatorial search, planning puzzles, problems with structured intermediate states | 5-20× (breadth × depth) | Cost explosion; need a real evaluator |
| Plan-and-Execute | Long-horizon tasks, multi-step research, anywhere you want HITL approval | 1.2-2× (one planner pass + executor) | Plans go stale fast; re-planning logic is essential |
⚠️
Don't combine loops naively
It's tempting to run "ReAct + Reflexion + ToT" because more = better. It isn't. Each loop multiplies cost: stacking Reflexion's 2-5× on Tree of Thoughts' 5-20× already puts you at 10-100× the baseline tokens. Use the simplest loop that solves your problem. If a single ReAct attempt with good tools works 90% of the time, that's almost always better than a baroque multi-loop architecture that works 92% of the time at 20× the cost.

Bounding the Loop — The Non-Negotiables

Whatever loop you pick, every production agent enforces these caps in code (not in the prompt):

  • Max iterations — typically 5-20. After this, the loop terminates with whatever partial result exists.
  • Max wall-clock time — typically 30-300s. Cancel pending tool calls and exit.
  • Max input tokens — typically 50k-500k. Truncate or summarize older turns when approaching the limit.
  • Max tool calls per turn — typically 3-10. Even with parallel tool use, prevent fan-out blowups.
  • Max identical-tool retries — typically 2. Detect (name, input) repeats and break out of thrashing.
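Several of these caps can be enforced with a small guard object that the loop consults every turn. The sketch below is one possible design, not a standard API: `LoopBudget` and `BudgetExceeded` are hypothetical names, the defaults are picked from the ranges above, and the input-token cap is left out because it depends on the tokenizer and provider in use.

```python
import json
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when any cap is hit; the caller decides how to wind down."""


@dataclass
class LoopBudget:
    max_iters: int = 10
    max_seconds: float = 120.0
    max_tool_calls_per_turn: int = 5
    max_identical_calls: int = 2  # the third identical call trips the guard
    _start: float = field(default_factory=time.monotonic)
    _iters: int = 0
    _seen: Counter = field(default_factory=Counter)

    def start_turn(self) -> None:
        """Call once at the top of every loop iteration."""
        self._iters += 1
        if self._iters > self.max_iters:
            raise BudgetExceeded("max iterations reached")
        if time.monotonic() - self._start > self.max_seconds:
            raise BudgetExceeded("wall-clock limit reached")

    def check_tool_calls(self, calls: List[Tuple[str, Dict[str, Any]]]) -> None:
        """Call with this turn's (name, input) pairs before executing them."""
        if len(calls) > self.max_tool_calls_per_turn:
            raise BudgetExceeded("too many tool calls in one turn")
        for name, tool_input in calls:
            # Serialize the input so identical calls map to the same key.
            key = (name, json.dumps(tool_input, sort_keys=True))
            self._seen[key] += 1
            if self._seen[key] > self.max_identical_calls:
                raise BudgetExceeded(f"repeated identical call to {name}")
```

In a ReAct loop, `start_turn()` would go at the top of each iteration and `check_tool_calls()` just before the tools run. Catching `BudgetExceeded` in the caller is where the partial result gets returned.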
Quick check
You're building an agent that drafts marketing emails based on a brief. The same agent occasionally produces excellent emails on attempt 1 and mediocre ones on attempt 2 with no obvious cause. Which loop and what scoring function would you reach for?
Answer
Reflexion is the right shape — multiple attempts with verbal reflection between them — but the bottleneck is the scoring function. Marketing copy doesn't have unit tests. Practical answer: use an LLM-as-judge with a rubric (clarity, brand voice, call-to-action quality), score each draft, and keep the highest-scoring one. Bound to 3 attempts max. This is exactly the "evaluator-optimizer" pattern from Anthropic's playbook.
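The "score each draft and keep the best" part of that answer can be sketched as below. Everything here is illustrative: `draft_fn` and `judge_fn` are hypothetical stand-ins for the drafting model and the LLM-as-judge, and the keyword rubric is a toy replacement for a real rubric prompt so the example runs without a model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    draft_fn: Callable[[int], str],
    judge_fn: Callable[[str], float],
    max_attempts: int = 3,
    good_enough: float = 0.9,
) -> Tuple[str, float]:
    """Draft up to max_attempts times and return the highest-scoring draft."""
    best_draft, best_score = "", float("-inf")
    for attempt in range(max_attempts):
        draft = draft_fn(attempt)
        score = judge_fn(draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if best_score >= good_enough:
            break  # stop early once a draft clears the bar
    return best_draft, best_score


# Toy rubric standing in for an LLM judge: fraction of checks passed.
def rubric_judge(draft: str) -> float:
    checks = [
        draft.startswith("Hi"),      # has a greeting
        "demo" in draft.lower(),     # has a call to action
        len(draft) > 20,             # more than a one-liner
    ]
    return sum(checks) / len(checks)


DRAFTS: List[str] = [
    "Buy now",
    "Hi Sam, try our app today. Reply to book a demo.",
    "Hi",
]

winner = best_of_n(lambda i: DRAFTS[i], rubric_judge)
```

Keeping the best draft rather than the last one protects against the case in the question, where a later attempt is worse than an earlier one.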
🔑
Key takeaways
1) Default to ReAct: it covers most agent tasks in about 20 lines.
2) Add Reflexion only when you can cheaply score outputs.
3) Reach for ToT only when the search space is structured and an evaluator exists.
4) Use Plan-and-Execute for long-horizon work and human-in-the-loop approval.
5) Every loop must be bounded in code (iterations, wall-clock time, tokens, retries); don't trust the model to enforce its own budget.
