The Idea
No single AI model is good at everything. Some models generate images brilliantly. Others transcribe speech. Others analyze sentiment. JARVIS treats an LLM as a "brain" that orchestrates all of these specialists — much like a project manager who doesn't do every task but knows exactly who to call for each one.
When you ask "Generate an image of a sunset, describe it, and write a poem about it," JARVIS breaks this into three tasks, selects the best model for each (Stable Diffusion for the image, BLIP for the description, GPT for the poem), executes them in the right order, and synthesizes a single coherent response. One request, multiple models, seamless output.
Component Patterns
The controller uses Level 2 compositions at each stage:
Self-Ask decomposes the request into typed tasks. Meta-Prompting selects the best specialist model for each. LLMCompiler/ReWOO handles dependency-aware execution with placeholder resolution.
Four Stages
Task Planning
The LLM analyzes the user request and decomposes it into a structured task graph. Each task has a type, ID, dependencies, and arguments. Tasks can reference outputs of prior tasks using placeholders.
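A typed task graph like the one described can be sketched as a small data structure. The field names below are illustrative, not JARVIS's actual schema; placeholders use a hypothetical `<task-N>` convention.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: int
    type: str                      # e.g. "text-to-image", "image-to-text"
    args: dict                     # string values may contain "<task-N>" placeholders
    deps: list[int] = field(default_factory=list)

# The sunset request decomposed into a three-task graph.
plan = [
    Task(0, "text-to-image", {"prompt": "a sunset over mountains"}),
    Task(1, "image-to-text", {"image": "<task-0>"}, deps=[0]),
    Task(2, "text-generation", {"prompt": "write a poem about: <task-1>"}, deps=[1]),
]
```

Note that the dependency edges fall directly out of the placeholders: a task depends on exactly the tasks whose outputs it references.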
Self-Ask • Least-to-Most
Model Selection
For each task, the LLM selects the best model from a registry based on capability match, quality indicators (like download counts), and specific requirements. Like picking the best specialist for each job.
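The selection step can be approximated with a simple quality-ranked lookup. The registry entries and the download counts below are toy values, and using downloads as the sole quality proxy is a simplification of what the text describes (a real controller would also let the LLM weigh model descriptions).

```python
# Toy registry; model names and download counts are illustrative.
REGISTRY = [
    {"name": "stable-diffusion-v1-5", "task": "text-to-image", "downloads": 1_200_000},
    {"name": "blip-image-captioning-large", "task": "image-to-text", "downloads": 800_000},
    {"name": "vit-gpt2-captioning", "task": "image-to-text", "downloads": 150_000},
    {"name": "gpt-4", "task": "text-generation", "downloads": None},  # hosted model
]

def select_model(task_type: str) -> str:
    """Pick the highest-ranked registered model for a task type."""
    candidates = [m for m in REGISTRY if m["task"] == task_type]
    if not candidates:
        raise ValueError(f"no model registered for {task_type!r}")
    # Hosted models without download stats rank first in this toy scheme.
    return max(candidates, key=lambda m: m["downloads"] or float("inf"))["name"]
```

For example, `select_model("image-to-text")` picks the BLIP captioner over the smaller alternative.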
Meta-Prompting
Task Execution
Tasks are sorted by dependencies. Independent tasks run in parallel. Placeholders are resolved to actual outputs from completed tasks. The execution engine manages the whole flow.
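The execution loop described above can be sketched as wave-based scheduling: each wave collects the tasks whose dependencies are satisfied, resolves their placeholders, and runs them. The `<task-N>` placeholder syntax and the `run_model` callback are assumptions for illustration.

```python
import re

def execute(plan, run_model):
    """Run tasks in dependency order, resolving "<task-N>" placeholders.

    `plan` is a list of dicts with "id", "type", "args", "deps";
    `run_model(type, args)` invokes the selected specialist model.
    """
    results, pending = {}, {t["id"]: t for t in plan}
    while pending:
        wave = [t for t in pending.values() if all(d in results for d in t["deps"])]
        if not wave:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for t in wave:  # a real engine would dispatch a wave concurrently
            args = {k: re.sub(r"<task-(\d+)>",
                              lambda m: str(results[int(m.group(1))]), v)
                    if isinstance(v, str) else v
                    for k, v in t["args"].items()}
            results[t["id"]] = run_model(t["type"], args)
            del pending[t["id"]]
    return results

# Tiny demo with a stub model runner that upper-cases its input.
demo = [
    {"id": 0, "type": "echo", "args": {"x": "hi"}, "deps": []},
    {"id": 1, "type": "echo", "args": {"x": "<task-0>!"}, "deps": [0]},
]
outputs = execute(demo, lambda typ, args: args["x"].upper())
```

Because every task in a wave has all of its inputs already computed, waves are the natural unit of parallelism.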
LLMCompiler • ReWOO
Response Synthesis
The LLM takes all task results and the original request, then produces a coherent, natural response that explains what was done and presents the results.
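The synthesis call is just one more LLM invocation whose prompt bundles the original request with every task output. The prompt wording below is illustrative, not JARVIS's actual template.

```python
def synthesis_prompt(request: str, results: dict) -> str:
    """Build the final LLM prompt that turns raw task outputs into prose."""
    lines = [f"Task {tid}: {out}" for tid, out in sorted(results.items())]
    return (
        "User request: " + request + "\n\n"
        "Results from specialist models:\n" + "\n".join(lines) + "\n\n"
        "Write a single coherent response that explains what was done "
        "and presents the results."
    )
```

The returned string would be sent to the controller LLM as its final call in the four-stage pipeline.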
Chain-of-Thought
See It in Action
Request: "Generate an image of a sunset over mountains, describe what you see, and create a poem about it."
Task 0 text-to-image: generate a sunset over mountains
Task 1 image-to-text: describe [output of Task 0] ← depends on Task 0
Task 2 text-generation: write poem about [output of Task 1] ← depends on Task 1
Task 0 → stable-diffusion (best text-to-image)
Task 1 → blip-image-captioning-large (800K downloads, best image-to-text)
Task 2 → GPT-4 (best creative text generation)
Why This Works
JARVIS leverages one of the most enduring principles in computing: specialization. Rather than relying on one massive model that does everything mediocrely, it uses the best available model for each specific task. A dedicated image generator will consistently outperform a general-purpose model at image generation.
The LLM controller makes this practical. It handles the complex orchestration — understanding the user's intent, breaking it into typed tasks, managing dependencies, resolving cross-task references — so the specialists can focus on what they do best.
The System
Plan the workflow. Pick the best specialist for each step. Execute in dependency order. Synthesize everything into a coherent response. One brain, many hands.
When to Use This
- Multi-modal tasks spanning image, text, audio, and code
- Complex workflows with data dependencies between steps
- Dynamic model selection from a large registry of specialized models
- Workflows that no single model can handle end-to-end
When to Skip This
- Simple text-only queries — no need for multi-model orchestration
- Real-time applications — four LLM calls plus model execution is slow
- Fixed, known toolsets — if you always use the same models, hardcode the pipeline
- Cost-sensitive applications — multiple LLM calls plus model inference adds up
How It Relates
JARVIS handles the Act stage of the Cognitive Loop when multi-model workflows are needed. The Adaptive Pattern Router shares the model-selection concept but operates at the pattern level rather than the model level. AutoGPT/BabyAGI can use JARVIS-style orchestration for tool-heavy tasks.
At Level 4, the Cognitive Operating System uses JARVIS as its multi-modal execution engine, allowing the higher-level system to treat diverse AI models as interchangeable services.