The Idea

Chain-of-thought reasoning works brilliantly for text problems, but falls apart when images are involved. Ask an AI to reason step by step about a diagram, and it often skips the visual analysis entirely — jumping straight to an answer that sounds confident but misses what the image actually shows.

Multimodal Chain-of-Thought fixes this by splitting the task into two distinct stages. First, the model looks at the image and generates a rationale — a description of what it sees and what that means. Then, using both the original image and its own rationale, it produces a final answer. This separation of "what I see" from "what I conclude" forces genuine visual analysis before answering.

Building Blocks

This composition builds on:

Think Step by Step
Show It

Multimodal CoT combines chain-of-thought reasoning (step-by-step explanation) with visual input processing, adding a two-stage architecture that forces the model to analyze images before answering.

Two Stages, One Insight

Stage 1: Look

Generate Rationale

The model examines the image alongside the question and produces a detailed description of what it observes and what that implies. No answer yet — just analysis.

Stage 2: Answer

Infer from Rationale

Now the model has three inputs: the original image, the question, and its own rationale. It can verify its analysis against the image before committing to an answer.

The key: Stage 2 can cross-check the rationale against the original image, catching mistakes before they become answers.
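The two stages can be sketched as two prompt templates plus a small driver. This is an illustrative sketch, not the paper's implementation: `call_vlm` is a placeholder for whatever vision-language model call you use, and the prompt wording is an assumption.

```python
def build_rationale_prompt(question: str) -> str:
    # Stage 1: force observation only -- explicitly withhold the answer.
    return (
        f"Question: {question}\n"
        "Examine the image carefully. Describe what you see and what it "
        "implies for the question. Do not give a final answer yet."
    )

def build_answer_prompt(question: str, rationale: str) -> str:
    # Stage 2: the image is passed in again, so the model can cross-check
    # its own rationale against it before committing to an answer.
    return (
        f"Question: {question}\n"
        f"Your analysis of the image so far:\n{rationale}\n"
        "Verify that analysis against the image, then give the final answer."
    )

def multimodal_cot(call_vlm, image, question: str) -> str:
    rationale = call_vlm(build_rationale_prompt(question), image)     # Stage 1: look
    return call_vlm(build_answer_prompt(question, rationale), image)  # Stage 2: answer
```

Note that both calls receive the image: Stage 2 reasons from the rationale but keeps the original pixels available as a reality check.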

See It in Action

A science question about a food web diagram.

Diagram: Food web showing Grass → Rabbit → Fox, with arrows indicating energy flow

Question
"If all the grass dies, what happens to the rabbit population?"
↓ Stage 1: Look at the image
Rationale Generation: what I see
"Looking at the food web diagram, I can identify: Grass is at the bottom as a producer. Arrows show rabbits eat grass directly — it's their primary food source. Foxes eat rabbits, placing them at the top. If grass dies, rabbits lose their only food source shown in this web."
↓ Stage 2: Answer using rationale + image
Answer Inference: final answer
"Based on my analysis of the food web, the rabbit population will decrease. Grass is the rabbits' direct food source, and no alternative food is shown in the diagram. Without grass, rabbits cannot sustain their population."

Without vs. With Multimodal CoT

Without — just guessing

"The rabbit population increases because when grass dies, rabbits might find other food sources and adapt."

Skipped the diagram entirely. Hallucinated an answer that sounds plausible but contradicts the visual evidence.

With — look first, then reason

"The rabbit population decreases. The food web shows grass as the rabbits' only food source. No alternative food path exists in this diagram."

Analyzed the image first. Answer directly references what the diagram shows.

The Small Model That Beat GPT-3.5

MM-CoT (under 1B parameters): 87.5%
GPT-3.5 with CoT (175B parameters): 75.2%

On ScienceQA — a model 200x smaller wins because it actually looks at the images instead of guessing from text alone.

This is one of the most striking results in AI research: structure beats scale. A sub-1-billion parameter model with the right two-stage architecture outperforms a 175-billion parameter model that tries to reason from text alone. The lesson is clear — how you reason matters more than how big you are.

Why This Works

The two-stage separation is the key. When you ask a model to look at an image and answer in one shot, it often takes shortcuts — generating plausible-sounding text without deeply analyzing the visual content. The rationale stage forces genuine observation.

Stage 2 is where the magic compounds. The model doesn't just use its rationale — it also has the original image available. This means it can verify: "Does my description actually match what the diagram shows?" This cross-checking catches hallucinated rationales before they corrupt the final answer.

The Composition

Don't ask the AI to look and answer at the same time. First let it describe what it sees. Then let it reason from its own description — with the image still available as a reality check.
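The composition can be made concrete by contrasting the two prompting styles directly. A minimal sketch, assuming a generic model callable `ask(prompt, image)` — the function names and prompt wording here are illustrative, not from the original paper:

```python
def one_shot(ask, image, question: str) -> str:
    # Look and answer in a single pass -- invites shortcuts past the image.
    return ask(f"{question}\nAnswer directly.", image)

def look_then_reason(ask, image, question: str) -> str:
    # First describe, then conclude, with the image kept as a reality check.
    seen = ask(
        f"{question}\nFirst, just describe the relevant parts of the image.",
        image,
    )
    return ask(
        f"{question}\nYou previously observed: {seen}\n"
        "Check that observation against the image, then answer.",
        image,
    )
```

The only structural difference is the intermediate `seen` string, yet it changes what the model is incentivized to do: the first call cannot shortcut to an answer, and the second call inherits an observation it can verify.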

When to Use This

Use this pattern when the question genuinely depends on visual content: diagrams, charts, screenshots, photos. If getting the answer right requires the model to actually read the image rather than pattern-match from the question text, the two-stage structure earns its cost.

When to Skip This

Skip it for text-only questions, where standard chain-of-thought already works, and for trivial visual lookups that a single pass answers reliably. The second stage means a second model call, so it should buy real accuracy.

How It Relates

Multimodal CoT extends Think Step by Step into the visual domain. Where standard chain-of-thought decomposes text reasoning, this technique decomposes visual reasoning into "observe" and "conclude" phases.

The two-stage architecture shares DNA with Plan-and-Execute (plan first, then act) and Self-Ask (generate intermediate questions before answering). All three techniques benefit from the same insight: separating analysis from conclusion produces better results than doing both at once.