The Idea
Writing good prompts is surprisingly hard. Small changes in wording can swing accuracy by 10% or more. Most people find good prompts through intuition and trial-and-error — testing a few variations, picking what seems to work. But what if you could test fifty variations systematically?
APE (Automatic Prompt Engineer) treats prompt discovery as a search problem. It asks an AI to generate many candidate instructions for a task, tests each one against real examples, and picks the winner. This automated search famously discovered a chain-of-thought prompt that outperformed the best human-designed version — strong evidence that automated search can find prompts humans overlook.
Building Blocks
This composition builds on:
- Ask a Better Question
- Show by Example

APE automates the insight that prompt quality matters enormously. It uses examples to guide candidate generation and systematic evaluation to find the best phrasing.
See It in Action
Task: Find the best prompt for solving math word problems.

First, ask the model to generate 50 candidate instructions:

1. "Solve the following math problem step by step."
2. "Think through this problem carefully before answering."
3. "Let's work this out in a step by step way to be sure we have the right answer."
4. "Break this problem into parts and solve each one."
5. "Calculate the answer, showing your work."
...and 45 more
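The generation step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `llm` stands in for any text-completion callable, and the stub below cycles through canned phrasings so the sketch runs without an API.

```python
def generate_candidates(task_description, llm, n=50):
    """Sample n candidate instructions by asking the model to propose them."""
    meta_prompt = (
        "Write one instruction that would help a model perform this task well.\n"
        f"Task: {task_description}\nInstruction:"
    )
    # With a real model and temperature > 0, each call returns a
    # differently worded instruction.
    return [llm(meta_prompt) for _ in range(n)]

# Stub in place of a real model call, for illustration only.
CANNED = [
    "Solve the following math problem step by step.",
    "Think through this problem carefully before answering.",
    "Calculate the answer, showing your work.",
]
def stub_llm(prompt, _state={"i": 0}):
    reply = CANNED[_state["i"] % len(CANNED)]
    _state["i"] += 1
    return reply

candidates = generate_candidates("solve math word problems", stub_llm, n=5)
```

Swapping `stub_llm` for a real API client is the only change needed; the search logic itself is model-agnostic.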
Next, test each candidate on problems with known answers and rank by measured accuracy:

- "Calculate the answer, showing your work." → 71%
- "Think through this problem carefully..." → 74%
- "Solve the following math problem step by step." → 79%
- "Let's work this out in a step by step way to be sure we have the right answer." → 82%
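Scoring and selection is the same idea in code: run each candidate over examples with known answers and keep the one with the highest measured accuracy. Again a hedged sketch, not the paper's implementation — grading here is exact string match, and `fake_model` is a toy stand-in that only answers correctly under one particular instruction.

```python
def score(instruction, examples, llm):
    """Accuracy of the model on (question, answer) pairs under this instruction."""
    correct = sum(
        llm(f"{instruction}\nQ: {q}\nA:").strip() == a for q, a in examples
    )
    return correct / len(examples)

def select_best(candidates, examples, llm):
    """Return the candidate instruction with the highest measured accuracy."""
    return max(candidates, key=lambda c: score(c, examples, llm))

# Toy stand-in: answers correctly only when prompted with the "winning" phrasing.
WINNER = ("Let's work this out in a step by step way "
          "to be sure we have the right answer.")
def fake_model(prompt):
    return "42" if prompt.startswith(WINNER) else "wrong"

examples = [("What is 6 * 7?", "42"), ("What is 40 + 2?", "42")]
candidates = ["Calculate the answer, showing your work.", WINNER]
best = select_best(candidates, examples, fake_model)
```

In practice you would also hold out a validation set, since picking the argmax over many candidates on a small test set can overfit to noise.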
The Famous Discovery
Human-designed prompt: "Let's think step by step."
78.7% accuracy on math benchmarks

APE-discovered prompt: "Let's work this out in a step by step way to be sure we have the right answer."
82.0% accuracy, a gain of 3.3 percentage points
The difference is subtle: the discovered prompt works the problem out "in a step by step way to be sure we have the right answer" rather than just thinking "step by step". It was a phrasing no human had thought to try, but the automated search found it.
Why This Works
Humans can test maybe 5–10 prompt variations before running out of ideas or patience. APE tests 50–100+ candidates systematically. The search covers phrasing variations that humans wouldn't think to try, and the evaluation is objective — measured accuracy, not subjective judgment.
It works because prompt sensitivity is real: tiny wording changes cause big performance swings. The only way to navigate this landscape reliably is to search it broadly and measure rigorously. APE does both.
The Composition
Generate many candidate prompts. Test each on real examples. Pick the winner. Let AI discover phrasings that humans would never try — and that actually work better.
When to Use This
- You have labeled evaluation data — examples with known correct answers to score against
- Production systems where a few percentage points of accuracy translate to real value
- When you suspect your hand-crafted prompts aren't optimal
- Researching how sensitive your task is to prompt phrasing
When to Skip This
- One-off tasks — the overhead of generating and testing 50+ candidates isn't worth it for a single query
- No evaluation data — without labeled examples, you can't objectively score candidates
- Tight compute budget — testing each candidate requires multiple LLM calls
- Open-ended creative tasks — when there's no measurable "right answer," automated scoring doesn't apply
How It Relates
APE is the foundational idea behind DSPy, which takes prompt optimization much further — automatically compiling entire prompt pipelines, not just single instructions. Think of APE as the simple, powerful core that DSPy builds a full framework around.
It's also related to Directional Stimulus Prompting, which uses a small model to generate hints that steer a large model. Both are about optimizing what you feed to the model, but APE optimizes the instruction while Directional Stimulus optimizes the hints given alongside the question.