The Idea

Most AI tool use works through prompting: you tell the model what tools are available and show it examples of when to use them. Toolformer takes a fundamentally different approach — it trains the model to recognize when a tool would help, call it inline as part of its text, and use the result naturally.

The remarkable part: the model teaches itself. No human needs to label when tools should be used. The training process automatically discovers where tool calls reduce the model's uncertainty about what comes next, and only keeps the useful ones. The result? A model 25 times smaller than GPT-3 that outperforms it on tasks where tools help.

Building Blocks

This composition builds on:

- Ask It to Search
- Define Your Tools

Toolformer takes the concepts of tool invocation and tool definition and bakes them directly into the model's learned behavior, rather than relying on prompts to teach tool use at runtime.

How It Learns

1. Annotate: Try inserting tool calls at many points in training text. "What is 594 × 832? 494,208" becomes "What is 594 × 832? [Calculator(594 × 832) → 494,208] 494,208".

2. Filter: Keep only the tool calls that actually help — the ones that make the model more confident about what text comes next. Discard the rest.

3. Train: Fine-tune the model on this filtered data. It learns the pattern: "when I see this kind of question, calling this tool makes my answer better."
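The annotate-and-filter loop can be sketched in a few lines. This is a simplified illustration, not the paper's exact procedure: `lm_loss` is a stand-in for a real language-model scoring function, and the bracket syntax, toy `calculator` tool, and threshold `tau` are assumptions made for the sketch.

```python
# Simplified sketch of Toolformer's annotate-and-filter step.
# `lm_loss(prefix, continuation)` stands in for the model's
# cross-entropy loss on `continuation` given `prefix` (lower means
# more confident); a real system would sum token log-likelihoods.

def calculator(expr: str) -> str:
    # Toy tool: evaluate a basic arithmetic expression.
    return format(eval(expr, {"__builtins__": {}}), ",")

def annotate_and_filter(lm_loss, prefix, continuation, expr, tau=0.5):
    """Insert a candidate Calculator call between `prefix` and
    `continuation`; keep it only if it lowers the model's loss on
    `continuation` by at least the threshold `tau`."""
    call_text = f"[Calculator({expr}) -> {calculator(expr)}] "
    loss_without = lm_loss(prefix, continuation)
    loss_with = lm_loss(prefix + call_text, continuation)
    if loss_without - loss_with >= tau:
        return prefix + call_text + continuation  # annotation kept
    return None  # discarded: the tool call did not help
```

In the full procedure, candidate calls are sampled at many positions in each training text, and only the surviving annotations become the fine-tuning corpus.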

See It in Action

After training, the model naturally inserts tool calls when they'd help — and skips them when they wouldn't.

1. Math question — calls Calculator
You ask: "What is 847 times 294?"
Model generates (with inline tool call): "I need to calculate: [Calculator(847 × 294) → 249,018] 847 times 294 is 249,018."

2. Factual question — calls Q&A
You ask: "Who invented the telephone?"
Model generates (with inline tool call): "Let me check: [QA("inventor of telephone") → Alexander Graham Bell] Alexander Graham Bell invented the telephone."

3. Casual question — no tool needed
You ask: "Tell me a joke about cats."
Model generates (no tool call): "Why don't cats play poker in the jungle? Too many cheetahs!"

The model learned that no tool would help here — so it just answers directly.
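At generation time, a runtime watches for this inline pattern, runs the tool, and splices the result in after the arrow. Here is a minimal post-processing sketch, assuming a hypothetical tool registry and the `Name(args) ->` syntax from the examples above (written in ASCII here); a real deployment would pause decoding at the arrow and resume after inserting the result.

```python
import re

# Hypothetical tool registry for this sketch; QA is a canned lookup
# standing in for a real question-answering system.
TOOLS = {
    "Calculator": lambda expr: format(eval(expr, {"__builtins__": {}}), ","),
    "QA": lambda q: {"inventor of telephone": "Alexander Graham Bell"}.get(q, "?"),
}

# Matches an inline call such as `Calculator(847*294) ->`
CALL = re.compile(r'(\w+)\(([^)]*)\)\s*->')

def execute_tool_calls(text: str) -> str:
    """Run each inline tool call found in `text` and splice its
    result in right after the `->` marker."""
    def run(match):
        name, arg = match.group(1), match.group(2).strip().strip('"')
        if name not in TOOLS:
            return match.group(0)  # unknown tool: leave the text untouched
        return f"{match.group(0)} {TOOLS[name](arg)}"
    return CALL.sub(run, text)
```

For example, `execute_tool_calls("I need to calculate: Calculator(847*294) ->")` returns `"I need to calculate: Calculator(847*294) -> 249,018"`, while text with no tool call passes through unchanged.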

The David vs. Goliath Result

GPT-3 (no tools), 175B parameters: The largest model of its era, relying purely on memorized knowledge. Gets 23% on math benchmarks.

Toolformer (with tools), 6.7B parameters: 25x smaller, but knows when to reach for a calculator or look something up. Gets 44% on the same math benchmarks.

A small model that knows when to use tools can beat a giant model that doesn't.

Why This Works

The genius is in the self-supervised filtering. Instead of humans labeling where tools should be used, the model itself determines usefulness: "Did calling the calculator make me more confident about what comes next?" If yes, keep it. If not, discard it.

This means the model learns nuanced judgment — not just how to use tools, but when to use them and when to rely on its own knowledge instead. It won't reach for a calculator to add 2 + 2, but it will for 847 × 294.

The Composition

Train the model to discover for itself when tools help. It learns to pause mid-generation, call the right tool, and weave the result naturally into its response — no prompting required.


How It Relates

Toolformer and TALM represent the trained approach to tool use, while ReAct represents the prompted approach. Most production systems today use the prompted approach because it's more flexible — you can add or change tools without retraining. But the Toolformer insight (that models can learn when tools help, not just how to use them) has deeply influenced how modern function-calling APIs are designed.

Think of it as the difference between teaching someone a skill through practice (Toolformer) versus giving them instructions in the moment (ReAct). Both work; the right choice depends on whether you need flexibility or efficiency.