The Idea
When a student gets an exam question wrong, a good teacher doesn't just say "try again." They ask "what went wrong?" and "what would you do differently?" That reflection is what makes the next attempt better, not just another random guess.
Reflexion applies this to AI agents. After a failed attempt, the AI generates a natural-language reflection — a verbal analysis of what went wrong and what to change. That reflection is stored in memory. On the next attempt, the AI reads its past reflections before acting, so each retry is informed by specific lessons learned. This simple addition took GPT-4's code generation accuracy from 67% to 91%.
Building Blocks
This composition builds on:
- ReAct
- Check Your Work

Reflexion wraps the ReAct agent loop in an outer learning cycle: attempt, evaluate, reflect on failure, store the lesson, and retry with that context. The self-critique becomes persistent memory.
The Four Components
Actor
A ReAct-style agent that attempts the task. On retries, it receives past reflections as additional context — essentially reading its own "lessons learned" before trying again.
Evaluator
Scores the attempt. For code, this is running tests. For games, it's the environment outcome. For reasoning, it could be an LLM judge. The key: clear pass/fail signals.
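For code tasks, the evaluator can be as simple as executing the candidate against its tests and returning a pass/fail flag plus the failure detail. A minimal sketch (the `evaluate` function and its signature are illustrative, not from the Reflexion paper):

```python
def evaluate(code: str, tests: list[str]) -> tuple[bool, str]:
    """Run candidate code against assert-based tests; return (passed, feedback)."""
    namespace = {}
    try:
        exec(code, namespace)          # define the candidate function
        for test in tests:
            exec(test, namespace)      # each test is an assert statement
    except Exception as e:
        # The exception text becomes feedback for the self-reflection step.
        return False, f"{type(e).__name__}: {e}"
    return True, "all tests passed"

passed, feedback = evaluate(
    "def add(a, b): return a - b",     # buggy candidate
    ["assert add(2, 3) == 5"],
)
# passed is False; feedback names the AssertionError
```

The failure message matters as much as the boolean: it is the raw material the reflection step analyzes.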
Self-Reflection
When the evaluator says "fail," this generates a verbal analysis: what went wrong, why, and what to do differently next time. Concrete, actionable insights — not vague "try harder."
Episodic Memory
Stores reflections across attempts. Kept small (1–3 reflections) so it fits in context. Each retry reads the full memory, carrying forward all lessons learned so far.
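Put together, the four components form a simple outer loop. A minimal sketch, assuming the actor, evaluator, and reflector are passed in as callables (all names here are illustrative):

```python
def reflexion_loop(task, attempt_fn, evaluate_fn, reflect_fn, max_trials=3):
    """Actor -> Evaluator -> Self-Reflection -> Episodic Memory, repeated."""
    memory = []                                  # episodic memory: past reflections
    output = None
    for trial in range(max_trials):
        output = attempt_fn(task, memory)        # Actor reads lessons learned
        passed, feedback = evaluate_fn(output)   # Evaluator: clear pass/fail signal
        if passed:
            return output
        reflection = reflect_fn(task, output, feedback)  # verbal failure analysis
        memory.append(reflection)
        memory = memory[-3:]                     # keep memory small (1-3 reflections)
    return output                                # best effort after max_trials

# Toy run with stubs: the actor "succeeds" only once memory holds a reflection.
result = reflexion_loop(
    task="demo",
    attempt_fn=lambda task, memory: "good" if memory else "bad",
    evaluate_fn=lambda out: (out == "good", "" if out == "good" else "wrong output"),
    reflect_fn=lambda task, out, fb: f"attempt returned {out!r}: {fb}",
)
# result == "good" on the second trial
```

In a real system, `attempt_fn` and `reflect_fn` would be LLM calls and `evaluate_fn` a test runner or environment check; the loop structure stays the same.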
See It in Action
Task: "Write a function that finds the second-largest number in a list."
Attempt 1:
second_largest([5, 5, 5]) → 5  FAIL (expected: None, since there is no second-largest value)
second_largest([1]) → IndexError  FAIL

Attempt 2 (after reflecting on the failures):
second_largest([5, 5, 5]) → None  PASS
second_largest([1]) → None  PASS
The second attempt wasn't just another guess — it was guided by a specific analysis of what went wrong.
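Concretely, the two attempts could look like this (the code and the reflection text are illustrative reconstructions, not an actual model trace):

```python
# Attempt 1 -- fails on duplicates and single-element lists
def second_largest(nums):
    s = sorted(nums)
    return s[-2]                     # IndexError on [1]; returns 5 on [5, 5, 5]

# Reflection stored in episodic memory:
# "I assumed the list has at least two elements and treated duplicates as
#  distinct. Deduplicate first, and return None when no second value exists."

# Attempt 2 -- informed by the reflection
def second_largest(nums):
    distinct = sorted(set(nums))     # deduplicate: [5, 5, 5] -> [5]
    if len(distinct) < 2:
        return None                  # no second-largest value exists
    return distinct[-2]
```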
The Results
Adding self-reflection and memory to the same model yields dramatic improvements — no retraining needed.
Why This Works
A naive retry loop repeats the same mistakes because it has no memory of what went wrong. Reflexion gives the agent episodic memory — specific, verbal lessons from past failures that shape future attempts. Each retry starts from a better understanding.
The verbal format is key. Instead of opaque numerical signals, the AI writes reflections in plain language: "I forgot to handle edge cases" or "I searched for the wrong keyword." These are exactly the kind of insights that make the next attempt meaningfully different from the last.
The Composition
Try. Evaluate. If it failed, reflect on why and remember the lesson. Retry with that memory. Each attempt is informed by specific insights from past failures — not just another blind guess.
When to Use This
- Code generation with executable tests — the ideal use case, with clear pass/fail feedback
- Tasks with measurable success criteria where you know if an attempt worked
- Decision-making and navigation tasks with environment feedback
- Problems that benefit from trial-and-error when you can afford multiple attempts
When to Skip This
- One-shot tasks — if there's no opportunity to retry, reflection has no benefit
- No clear evaluation signal — without reliable pass/fail feedback, the reflection has nothing to learn from
- Time-sensitive applications — multiple episodes multiply latency and cost
- Simple problems — if the first attempt usually succeeds, the reflection overhead isn't worth it
How It Relates
Reflexion extends ReAct with an outer learning loop. While ReAct handles a single attempt (think-act-observe), Reflexion wraps multiple attempts with evaluation and reflection between them. It's also a more sophisticated version of Check Your Work — instead of just reviewing and fixing in one pass, it generates lasting insights stored in memory.
More advanced systems build on Reflexion: LATS adds tree search over multiple reasoning paths, evaluating and reflecting across an entire search tree rather than just sequential retries. If Reflexion is learning from your mistakes, LATS is exploring all the paths you could have taken.