The Idea

When a student gets an exam question wrong, a good teacher doesn't just say "try again." They ask "what went wrong?" and "what would you do differently?" That reflection is what makes the next attempt better, not just another random guess.

Reflexion applies this to AI agents. After a failed attempt, the AI generates a natural-language reflection — a verbal analysis of what went wrong and what to change. That reflection is stored in memory. On the next attempt, the AI reads its past reflections before acting, so each retry is informed by specific lessons learned. This simple addition took GPT-4's code generation accuracy from 67% to 91%.

Building Blocks

This composition builds on:

ReAct
Check Your Work

Reflexion wraps the ReAct agent loop in an outer learning cycle: attempt, evaluate, reflect on failure, store the lesson, and retry with that context. The self-critique becomes persistent memory.
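That outer cycle can be sketched as a short loop. This is a minimal illustration, not the paper's implementation; attempt, evaluate, and reflect are hypothetical stand-ins for your actor, evaluator, and self-reflection calls:

```python
def reflexion_loop(task, attempt, evaluate, reflect, max_trials=3, memory_cap=3):
    """Run attempt/evaluate/reflect cycles, carrying reflections forward."""
    memory = []  # episodic memory: verbal lessons from failed trials
    result = None
    for trial in range(max_trials):
        result = attempt(task, memory)       # actor reads past reflections
        passed, feedback = evaluate(result)  # clear pass/fail signal
        if passed:
            return result
        memory.append(reflect(task, result, feedback))  # store the lesson
        memory = memory[-memory_cap:]        # keep memory small
    return result  # best effort after max_trials
```

The only state carried between attempts is the small list of verbal reflections; the model itself is never retrained.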

The Four Components

Actor

A ReAct-style agent that attempts the task. On retries, it receives past reflections as additional context — essentially reading its own "lessons learned" before trying again.

Evaluator

Scores the attempt. For code, this is running tests. For games, it's the environment outcome. For reasoning, it could be an LLM judge. The key: clear pass/fail signals.

Self-Reflection

When the evaluator says "fail," this generates a verbal analysis: what went wrong, why, and what to do differently next time. Concrete, actionable insights — not vague "try harder."

Episodic Memory

Stores reflections across attempts. Kept small (1–3 reflections) so it fits in context. Each retry reads the full memory, carrying forward all lessons learned so far.
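As a sketch of how that bounded memory reaches the actor, the retry context might be assembled like this (the prompt wording is illustrative, not from the paper):

```python
def build_retry_prompt(task, reflections, cap=3):
    """Prepend the most recent lessons to the task before the next attempt."""
    lessons = "\n".join(f"- {r}" for r in reflections[-cap:])
    if not lessons:
        return task  # first attempt: no lessons yet
    return f"Lessons from previous failed attempts:\n{lessons}\n\nTask: {task}"
```

Capping at the most recent few reflections keeps the prompt short while still carrying every lesson that matters for the next try.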

See It in Action

Task: "Write a function that finds the second-largest number in a list."

Episode 1 — First Attempt
AI writes a solution
Generated code (simplified)
def second_largest(nums):
    # Sort descending, take the second element
    return sorted(nums, reverse=True)[1]
Test results
second_largest([3, 1, 4, 1, 5]) → 4   PASS
second_largest([5, 5, 5]) → 5   FAIL (expected: None — no second-largest)
second_largest([1]) → IndexError   FAIL
↓ failed — time to reflect
Self-Reflection
AI analyzes what went wrong
Reflection (stored in memory)
"My function failed on two edge cases: (1) when all elements are the same, there is no true second-largest — I need to handle duplicates by working with unique values; (2) when the list has fewer than 2 unique values, I should return None instead of crashing."
↓ retry with reflection in context
Episode 2 — Informed Retry
AI tries again, guided by its reflection
Generated code (with lessons applied)
def second_largest(nums):
    # Use unique values so duplicates don't count twice
    unique = sorted(set(nums), reverse=True)
    if len(unique) < 2:
        return None  # no true second-largest
    return unique[1]
Test results
second_largest([3, 1, 4, 1, 5]) → 4   PASS
second_largest([5, 5, 5]) → None   PASS
second_largest([1]) → None   PASS

The second attempt wasn't just another guess — it was guided by a specific analysis of what went wrong.
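The evaluator in this example is just a small test harness producing the pass/fail signal. A minimal sketch, with the same three test cases as above:

```python
def evaluate(func):
    """Run the second-largest test cases; return (passed, failures)."""
    cases = [([3, 1, 4, 1, 5], 4), ([5, 5, 5], None), ([1], None)]
    failures = []
    for args, expected in cases:
        try:
            got = func(args)
        except Exception as e:
            got = type(e).__name__  # a crash counts as a failure
        if got != expected:
            failures.append((args, expected, got))
    return (not failures, failures)
```

The failure list doubles as the feedback handed to the self-reflection step, so the reflection can name the exact cases that broke.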

The Results

Code (HumanEval):     67% → 91% with Reflexion
Decisions (ALFWorld): 73% → 97% with Reflexion

Adding self-reflection and memory to the same model yields dramatic improvements — no retraining needed.

Why This Works

A naive retry loop repeats the same mistakes because it has no memory of what went wrong. Reflexion gives the agent episodic memory — specific, verbal lessons from past failures that shape future attempts. Each retry starts from a better understanding.

The verbal format is key. Instead of opaque numerical signals, the AI writes reflections in plain language: "I forgot to handle edge cases" or "I searched for the wrong keyword." These are exactly the kind of insights that make the next attempt meaningfully different from the last.
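One way to elicit that verbal format is a short reflection prompt. This template is illustrative, not the paper's exact wording:

```python
REFLECTION_TEMPLATE = (
    "You attempted the task below and failed.\n"
    "Task: {task}\n"
    "Your attempt: {attempt}\n"
    "Evaluator feedback: {feedback}\n"
    "In 2-3 sentences, explain concretely what went wrong and what you "
    "will do differently next time. Avoid vague advice like 'try harder'."
)

def reflection_prompt(task, attempt, feedback):
    # Fill the template; the completion becomes the lesson stored in memory
    return REFLECTION_TEMPLATE.format(task=task, attempt=attempt, feedback=feedback)
```

Asking for concrete, forward-looking advice is what turns a raw failure into the kind of reflection shown in the example above.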

The Composition

Try. Evaluate. If it failed, reflect on why and remember the lesson. Retry with that memory. Each attempt is informed by specific insights from past failures — not just another blind guess.

When to Use This

Use it when you have a clear pass/fail signal (unit tests, environment outcomes, a reliable judge) and can afford multiple attempts.

When to Skip This

Skip it when no reliable evaluator exists to drive reflection, or when the task must succeed on the first try.

How It Relates

Reflexion extends ReAct with an outer learning loop. While ReAct handles a single attempt (think-act-observe), Reflexion wraps multiple attempts with evaluation and reflection between them. It's also a more sophisticated version of Check Your Work — instead of just reviewing and fixing in one pass, it generates lasting insights stored in memory.

More advanced systems build on Reflexion: LATS adds tree search over multiple reasoning paths, evaluating and reflecting across an entire search tree rather than just sequential retries. If Reflexion is learning from your mistakes, LATS is exploring all the paths you could have taken.