The Idea
Writing good prompts is surprisingly hard. Small changes in wording can swing accuracy by 10% or more. Most people find good prompts through intuition and trial-and-error — testing a few variations, picking what seems to work. But what if you could test fifty variations systematically?
APE (Automatic Prompt Engineer) treats prompt discovery as a search problem. It asks an AI to generate many candidate instructions for a task, tests each one against real examples, and picks the winner. This automated search famously discovered a chain-of-thought prompt that outperformed the best human-designed version — strong evidence that automated search can find prompts humans overlook.
Building Blocks
This composition builds on:
- Ask a Better Question
- Show by Example

APE automates the insight that prompt quality matters enormously. It uses examples to guide candidate generation and systematic evaluation to find the best phrasing.
See It in Action
Task: Find the best prompt for solving math word problems.

First, ask the model to generate 50 candidate instructions:

1. "Solve the following math problem step by step."
2. "Think through this problem carefully before answering."
3. "Let's work this out in a step by step way to be sure we have the right answer."
4. "Break this problem into parts and solve each one."
5. "Calculate the answer, showing your work."
...and 45 more
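The generation step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `llm` stands in for any text-completion callable, and the stub below cycles through canned phrasings so the sketch runs without an API.

```python
def generate_candidates(task_description, llm, n=50):
    """Sample n candidate instructions by asking the model to propose them."""
    meta_prompt = (
        "Write one instruction that would help a model perform this task well.\n"
        f"Task: {task_description}\nInstruction:"
    )
    # With a real model and temperature > 0, each call returns a
    # differently worded instruction.
    return [llm(meta_prompt) for _ in range(n)]

# Stub in place of a real model call, for illustration only.
CANNED = [
    "Solve the following math problem step by step.",
    "Think through this problem carefully before answering.",
    "Calculate the answer, showing your work.",
]
def stub_llm(prompt, _state={"i": 0}):
    reply = CANNED[_state["i"] % len(CANNED)]
    _state["i"] += 1
    return reply

candidates = generate_candidates("solve math word problems", stub_llm, n=5)
```

Swapping `stub_llm` for a real API client is the only change needed; the search logic itself is model-agnostic.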
Next, test each candidate on problems with known answers and rank by measured accuracy:

- "Calculate the answer, showing your work." → 71%
- "Think through this problem carefully..." → 74%
- "Solve the following math problem step by step." → 79%
- "Let's work this out in a step by step way to be sure we have the right answer." → 82%
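Scoring and selection is the same idea in code: run each candidate over examples with known answers and keep the one with the highest measured accuracy. Again a hedged sketch, not the paper's implementation — grading here is exact string match, and `fake_model` is a toy stand-in that only answers correctly under one particular instruction.

```python
def score(instruction, examples, llm):
    """Accuracy of the model on (question, answer) pairs under this instruction."""
    correct = sum(
        llm(f"{instruction}\nQ: {q}\nA:").strip() == a for q, a in examples
    )
    return correct / len(examples)

def select_best(candidates, examples, llm):
    """Return the candidate instruction with the highest measured accuracy."""
    return max(candidates, key=lambda c: score(c, examples, llm))

# Toy stand-in: answers correctly only when prompted with the "winning" phrasing.
WINNER = ("Let's work this out in a step by step way "
          "to be sure we have the right answer.")
def fake_model(prompt):
    return "42" if prompt.startswith(WINNER) else "wrong"

examples = [("What is 6 * 7?", "42"), ("What is 40 + 2?", "42")]
candidates = ["Calculate the answer, showing your work.", WINNER]
best = select_best(candidates, examples, fake_model)
```

In practice you would also hold out a validation set, since picking the argmax over many candidates on a small test set can overfit to noise.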
The Famous Discovery
Human-designed prompt: "Let's think step by step."
78.7% accuracy on math benchmarks

APE-discovered prompt: "Let's work this out in a step by step way to be sure we have the right answer."
82.0% accuracy, a gain of 3.3 percentage points
The difference is subtle: the discovered prompt works the problem out "in a step by step way to be sure we have the right answer" rather than just thinking "step by step". It was a phrasing no human had thought to try, but the automated search found it.
Why This Works
Humans can test maybe 5–10 prompt variations before running out of ideas or patience. APE tests 50–100+ candidates systematically. The search covers phrasing variations that humans wouldn't think to try, and the evaluation is objective — measured accuracy, not subjective judgment.
It works because prompt sensitivity is real: tiny wording changes cause big performance swings. The only way to navigate this landscape reliably is to search it broadly and measure rigorously. APE does both.
The Composition
Generate many candidate prompts. Test each on real examples. Pick the winner. Let AI discover phrasings that humans would never try — and that actually work better.
When to Use This
- You have labeled evaluation data — examples with known correct answers to score against
- Production systems where a few percentage points of accuracy translate to real value
- When you suspect your hand-crafted prompts aren't optimal
- Researching how sensitive your task is to prompt phrasing
When to Skip This
- One-off tasks — the overhead of generating and testing 50+ candidates isn't worth it for a single query
- No evaluation data — without labeled examples, you can't objectively score candidates
- Tight compute budget — testing each candidate requires multiple LLM calls
- Open-ended creative tasks — when there's no measurable "right answer," automated scoring doesn't apply
How It Relates
APE is the foundational idea behind DSPy, which takes prompt optimization much further — automatically compiling entire prompt pipelines, not just single instructions. Think of APE as the simple, powerful core that DSPy builds a full framework around.
It's also related to Directional Stimulus Prompting, which uses a small model to generate hints that steer a large model. Both are about optimizing what you feed to the model, but APE optimizes the instruction while Directional Stimulus optimizes the hints given alongside the question.