The Idea
No single AI model is good at everything. Some models generate images brilliantly. Others transcribe speech. Others analyze sentiment. JARVIS treats an LLM as a "brain" that orchestrates all of these specialists — much like a project manager who doesn't do every task but knows exactly who to call for each one.
When you ask "Generate an image of a sunset, describe it, and write a poem about it," JARVIS breaks this into three tasks, selects the best model for each (Stable Diffusion for the image, BLIP for the description, GPT for the poem), executes them in the right order, and synthesizes a single coherent response. One request, multiple models, seamless output.
Component Patterns
The controller uses Level 2 compositions at each stage:
Self-Ask decomposes the request into typed tasks. Meta-Prompting selects the best specialist model for each. LLMCompiler/ReWOO handles dependency-aware execution with placeholder resolution.
Four Stages
Task Planning
The LLM analyzes the user request and decomposes it into a structured task graph. Each task has a type, ID, dependencies, and arguments. Tasks can reference outputs of prior tasks using placeholders.
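A typed task graph like the one described can be sketched as a small data structure. The field names below are illustrative, not JARVIS's actual schema; placeholders use a hypothetical `<task-N>` convention.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: int
    type: str                      # e.g. "text-to-image", "image-to-text"
    args: dict                     # string values may contain "<task-N>" placeholders
    deps: list[int] = field(default_factory=list)

# The sunset request decomposed into a three-task graph.
plan = [
    Task(0, "text-to-image", {"prompt": "a sunset over mountains"}),
    Task(1, "image-to-text", {"image": "<task-0>"}, deps=[0]),
    Task(2, "text-generation", {"prompt": "write a poem about: <task-1>"}, deps=[1]),
]
```

Note that the dependency edges fall directly out of the placeholders: a task depends on exactly the tasks whose outputs it references.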
Self-Ask • Least-to-Most
Model Selection
For each task, the LLM selects the best model from a registry based on capability match, quality indicators (like download counts), and specific requirements. Like picking the best specialist for each job.
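The selection step can be approximated with a simple quality-ranked lookup. The registry entries and the download counts below are toy values, and using downloads as the sole quality proxy is a simplification of what the text describes (a real controller would also let the LLM weigh model descriptions).

```python
# Toy registry; model names and download counts are illustrative.
REGISTRY = [
    {"name": "stable-diffusion-v1-5", "task": "text-to-image", "downloads": 1_200_000},
    {"name": "blip-image-captioning-large", "task": "image-to-text", "downloads": 800_000},
    {"name": "vit-gpt2-captioning", "task": "image-to-text", "downloads": 150_000},
    {"name": "gpt-4", "task": "text-generation", "downloads": None},  # hosted model
]

def select_model(task_type: str) -> str:
    """Pick the highest-ranked registered model for a task type."""
    candidates = [m for m in REGISTRY if m["task"] == task_type]
    if not candidates:
        raise ValueError(f"no model registered for {task_type!r}")
    # Hosted models without download stats rank first in this toy scheme.
    return max(candidates, key=lambda m: m["downloads"] or float("inf"))["name"]
```

For example, `select_model("image-to-text")` picks the BLIP captioner over the smaller alternative.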
Meta-Prompting
Task Execution
Tasks are sorted by dependencies. Independent tasks run in parallel. Placeholders are resolved to actual outputs from completed tasks. The execution engine manages the whole flow.
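The execution loop described above can be sketched as wave-based scheduling: each wave collects the tasks whose dependencies are satisfied, resolves their placeholders, and runs them. The `<task-N>` placeholder syntax and the `run_model` callback are assumptions for illustration.

```python
import re

def execute(plan, run_model):
    """Run tasks in dependency order, resolving "<task-N>" placeholders.

    `plan` is a list of dicts with "id", "type", "args", "deps";
    `run_model(type, args)` invokes the selected specialist model.
    """
    results, pending = {}, {t["id"]: t for t in plan}
    while pending:
        wave = [t for t in pending.values() if all(d in results for d in t["deps"])]
        if not wave:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for t in wave:  # a real engine would dispatch a wave concurrently
            args = {k: re.sub(r"<task-(\d+)>",
                              lambda m: str(results[int(m.group(1))]), v)
                    if isinstance(v, str) else v
                    for k, v in t["args"].items()}
            results[t["id"]] = run_model(t["type"], args)
            del pending[t["id"]]
    return results

# Tiny demo with a stub model runner that upper-cases its input.
demo = [
    {"id": 0, "type": "echo", "args": {"x": "hi"}, "deps": []},
    {"id": 1, "type": "echo", "args": {"x": "<task-0>!"}, "deps": [0]},
]
outputs = execute(demo, lambda typ, args: args["x"].upper())
```

Because every task in a wave has all of its inputs already computed, waves are the natural unit of parallelism.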
LLMCompiler • ReWOO
Response Synthesis
The LLM takes all task results and the original request, then produces a coherent, natural response that explains what was done and presents the results.
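The synthesis call is just one more LLM invocation whose prompt bundles the original request with every task output. The prompt wording below is illustrative, not JARVIS's actual template.

```python
def synthesis_prompt(request: str, results: dict) -> str:
    """Build the final LLM prompt that turns raw task outputs into prose."""
    lines = [f"Task {tid}: {out}" for tid, out in sorted(results.items())]
    return (
        "User request: " + request + "\n\n"
        "Results from specialist models:\n" + "\n".join(lines) + "\n\n"
        "Write a single coherent response that explains what was done "
        "and presents the results."
    )
```

The returned string would be sent to the controller LLM as its final call in the four-stage pipeline.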
Chain-of-Thought
See It in Action
Request: "Generate an image of a sunset over mountains, describe what you see, and create a poem about it."
Task 0 text-to-image: generate a sunset over mountains
Task 1 image-to-text: describe [output of Task 0] ← depends on Task 0
Task 2 text-generation: write poem about [output of Task 1] ← depends on Task 1
Task 0 → stable-diffusion (best text-to-image)
Task 1 → blip-image-captioning-large (800K downloads, best image-to-text)
Task 2 → GPT-4 (best creative text generation)
Why This Works
JARVIS leverages one of the most enduring principles in computing: specialization. Rather than relying on one massive model that does everything mediocrely, it uses the best available model for each specific task. A dedicated image generator will consistently outperform a general-purpose model at image generation.
The LLM controller makes this practical. It handles the complex orchestration — understanding the user's intent, breaking it into typed tasks, managing dependencies, resolving cross-task references — so the specialists can focus on what they do best.
The System
Plan the workflow. Pick the best specialist for each step. Execute in dependency order. Synthesize everything into a coherent response. One brain, many hands.
When to Use This
- Multi-modal tasks spanning image, text, audio, and code
- Complex workflows with data dependencies between steps
- Dynamic model selection from a large registry of specialized models
- Workflows that no single model can handle end-to-end
When to Skip This
- Simple text-only queries — no need for multi-model orchestration
- Real-time applications — four LLM calls plus model execution is slow
- Fixed, known toolsets — if you always use the same models, hardcode the pipeline
- Cost-sensitive applications — multiple LLM calls plus model inference adds up
How It Relates
JARVIS handles the Act stage of the Cognitive Loop when multi-model workflows are needed. The Adaptive Pattern Router shares the model-selection concept but operates at the pattern level rather than the model level. AutoGPT/BabyAGI can use JARVIS-style orchestration for tool-heavy tasks.
At Level 4, the Cognitive Operating System uses JARVIS as its multi-modal execution engine, allowing the higher-level system to treat diverse AI models as interchangeable services.