The Idea
Most AI tool use works through prompting: you tell the model what tools are available and show it examples of when to use them. Toolformer takes a fundamentally different approach — it trains the model to recognize when a tool would help, call it inline as part of its text, and use the result naturally.
The remarkable part: the model teaches itself. No human needs to label when tools should be used. The training process automatically discovers where tool calls reduce the model's uncertainty about what comes next, and only keeps the useful ones. The result? A model 25 times smaller than GPT-3 that outperforms it on tasks where tools help.
Building Blocks
This composition builds on:
- Ask It to Search
- Define Your Tools

Toolformer takes the concepts of tool invocation and tool definition and bakes them directly into the model's learned behavior, rather than relying on prompts to teach tool use at runtime.
How It Learns
Annotate
Try inserting tool calls at many points in the training text. "What is 594 × 832? 494,208." becomes "What is 594 × 832? [Calculator(594 × 832) → 494,208] 494,208."
Filter
Keep only the tool calls that actually help — the ones that make the model more confident about what text comes next. Discard the rest.
Train
Fine-tune the model on this filtered data. It learns the pattern: "when I see this kind of question, calling this tool makes my answer better."
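The filter step above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `loss_fn` is a hypothetical stand-in for the language model's cross-entropy loss on a continuation given a prefix, and the threshold `tau` is an arbitrary example value.

```python
def filter_tool_calls(text, candidates, loss_fn, tau=0.5):
    """Keep only candidate tool calls that make the model more confident
    about the text that follows the insertion point.

    candidates: list of (position, call, result) tuples, e.g.
                (23, "Calculator(594*832)", "494208").
    loss_fn:    hypothetical callable loss_fn(prefix, continuation) -> float,
                standing in for a real language model's loss.
    """
    kept = []
    for pos, call, result in candidates:
        prefix, continuation = text[:pos], text[pos:]
        # Loss on the continuation with no tool call at all.
        loss_plain = loss_fn(prefix, continuation)
        # Loss with the call *and* its result spliced into the prefix.
        augmented = f"{prefix}[{call} -> {result}] "
        loss_with_result = loss_fn(augmented, continuation)
        # Keep the call only if the result reduces loss by at least tau.
        if loss_plain - loss_with_result >= tau:
            kept.append((pos, call, result))
    return kept
```

The model fine-tuned in the next step never sees the discarded calls, which is how it learns to stay silent when a tool would not help.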
See It in Action
After training, the model naturally inserts tool calls when they'd help — and skips them when they wouldn't.
847 times 294 is [Calculator(847 × 294) → 249,018] 249,018.
Alexander Graham Bell invented the telephone.
The model learned that no tool would help here — so it just answers directly.
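At inference time, the surrounding system still has to execute the call the model emits. A minimal sketch of that interception loop, assuming calls appear in a "[Tool(args) → " syntax: the tool registry, regex, and `run_tool_calls` helper below are illustrative names, not part of any published implementation.

```python
import re

# Illustrative tool registry; eval is restricted here but is demo-only.
TOOLS = {
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

# Matches a generated text that ends in an open call, e.g. "[Calculator(847*294) -> "
CALL_PATTERN = re.compile(r"\[(\w+)\(([^)]*)\)\s*->\s*$")

def run_tool_calls(generated: str) -> str:
    """If the generated text ends in an open tool call, execute it and
    splice the result in so decoding can resume after the closing ']'."""
    match = CALL_PATTERN.search(generated)
    if not match:
        return generated  # no tool call: the model answered directly
    name, args = match.groups()
    result = TOOLS[name](args)
    return generated.rstrip() + " " + result + "]"
```

For example, `run_tool_calls("847 times 294 is [Calculator(847*294) -> ")` fills in `249018]`, while the Bell sentence passes through untouched, matching the two behaviors shown above.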
The David vs. Goliath Result
GPT-3 (no tools)
The largest model of its era, relying purely on memorized knowledge. Gets 23% on math benchmarks.
Toolformer (with tools)
25x smaller, but knows when to reach for a calculator or look something up. Gets 44% on the same math benchmarks.
A small model that knows when to use tools can beat a giant model that doesn't.
Why This Works
The genius is in the self-supervised filtering. Instead of humans labeling where tools should be used, the model itself determines usefulness: "Did calling the calculator make me more confident about what comes next?" If yes, keep it. If not, discard it.
This means the model learns nuanced judgment — not just how to use tools, but when to use them and when to rely on its own knowledge instead. It won't reach for a calculator to add 2 + 2, but it will for 847 × 294.
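That judgment comes from a simple keep/discard rule applied during filtering. The sketch below uses made-up loss numbers purely to illustrate the asymmetry; in the real pipeline both losses come from the model itself.

```python
def keep_call(loss_without: float, loss_with: float, tau: float = 0.5) -> bool:
    # Keep the call only if seeing the tool result makes the model
    # at least `tau` more confident about the continuation.
    return loss_without - loss_with >= tau

# "2 + 2 = 4": the model is already confident, so the call is discarded.
keep_call(loss_without=0.3, loss_with=0.25)   # returns False
# "847 × 294 = ?": the result sharply reduces uncertainty, so it is kept.
keep_call(loss_without=4.1, loss_with=0.9)    # returns True
```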
The Composition
Train the model to discover for itself when tools help. It learns to pause mid-generation, call the right tool, and weave the result naturally into its response — no prompting required.
When to Use This
- You control the model and can fine-tune it (not just use it via API)
- High-volume tool-using applications where the training cost pays off
- You want deterministic, learned tool behavior rather than prompt-dependent behavior
- Reducing prompt overhead matters — no need for tool descriptions in every request
When to Skip This
- Using an API-only model — you can't fine-tune models you access through an API; use ReAct or function calling instead
- Tools change frequently — every new tool or change requires retraining
- Prototyping quickly — prompt-based tool use (like ReAct) lets you iterate in minutes, not days
- Flexibility is key — if you need to add tools on the fly, learned tool use is too rigid
How It Relates
Toolformer and TALM represent the trained approach to tool use, while ReAct represents the prompted approach. Most production systems today use the prompted approach because it's more flexible — you can add or change tools without retraining. But the Toolformer insight (that models can learn when tools help, not just how to use them) has deeply influenced how modern function-calling APIs are designed.
Think of it as the difference between teaching someone a skill through practice (Toolformer) versus giving them instructions in the moment (ReAct). Both work; the right choice depends on whether you need flexibility or efficiency.