Use this when a product uses (or should use) more than one AI model, or when you need to make a deliberate model selection decision. Helps you match use cases to models based on capability, cost, speed, and quality rather than defaulting to whatever model the team tried first. Also covers routing strategies and cost modeling.

Related skills: For tool-level selection (which AI product to use), see /tool-recommend. Model selection feeds into /ai-product-spec. Compare model quality using /ai-eval-design.

Platform context: Platform routing decisions affect all consumers. Model selection at the platform layer sets the floor -- default models, cost guardrails, approved providers -- while team-level selection refines for specific use cases. Over-centralizing model selection is a common anti-pattern (see AI Platform Anti-Patterns).

Process

Step 1: Map use cases to requirements

Ask the user:

What AI-powered capabilities does the product have (or plan to have)?
What models are you using today? (If any -- what's working, what's not, what's too expensive)

For each capability, define requirements:

Use case	Capability needed	Quality bar	Latency need	Volume	Context size	Data sensitivity
(e.g., chat support)	Conversation, tool use	High -- customer-facing	Real-time (< 3s)	10K/day	Medium (< 32K)	Private, not regulated
(e.g., document summary)	Summarization	Medium -- internal tool	Interactive (< 10s)	500/day	Large (> 100K)	Confidential
(e.g., content classification)	Classification	Must be accurate	Batch OK	50K/day	Small (< 4K)	Public
(e.g., code review)	Code analysis	High -- blocks deploys	Interactive (< 15s)	200/day	Large (> 100K)	Source code

Not every use case needs the same model. That's the point.

Step 2: Evaluate model options per use case

For each use case from Step 1, evaluate candidate models:

Model capability reference (representative as of June 2026 -- verify current model lineups and pricing before committing; the lineup churns every few months):

Model	Best for	Strengths	Limitations	Relative cost
Claude Fable 5	The hardest reasoning, frontier intelligence	Top-tier reasoning, 1M context, adaptive thinking	Highest cost	$$$$
Claude Opus 4.8	Complex reasoning, long-horizon agentic work, long context	Deep analysis, strong instruction following, 1M context, adaptive thinking + effort control	Higher cost, slower	$$$
Claude Sonnet 4.6	Balanced tasks, coding, general use	Strong quality-to-cost ratio, fast, 1M context	Less depth than Opus on the hardest problems	$$
Claude Haiku 4.5	High-volume, simple tasks, classification	Very fast, very cheap, 200K context	Less capable on complex reasoning	$
OpenAI GPT-5.5	Multimodal general purpose, coding, professional work	Strong across tasks, vision, reasoning variants (Pro, Instant)	Cost at scale	$$$
OpenAI GPT-5.5 Instant	Latency-sensitive general tasks	Faster, cheaper tier of the 5.5 family	Less depth than full GPT-5.5	$$
Google Gemini 3.1 Pro	Long context, multimodal, Google ecosystem	Very large context, strong multimodal	Varies by task type	$$
Google Gemini 3.5 Flash	Speed-critical, high volume	Very fast, large context, low cost	Trades depth for speed	$
Open-source (Llama, Mistral, DeepSeek)	Self-hosted, data-sensitive, cost-critical	No per-token cost, full control; DeepSeek and Mistral competitive on reasoning	Requires infrastructure, generally trails the frontier on the hardest tasks	$ (infra cost)
Specialized models	Domain-specific (code, medical, legal)	Tuned for specific domains	Narrow capability	Varies

Reasoning is now a knob, not a tier. Frontier models expose reasoning controls (adaptive or extended thinking, and an effort or reasoning level) that trade depth against cost and latency on the same model. Before reaching for a more expensive model, check whether turning up reasoning effort on the current one closes the gap. For current Claude model IDs, pricing, and the thinking/effort parameters, the /claude-api reference is authoritative; do not write model IDs from memory.

Per-use-case recommendation format:

Use case	Recommended model	Why	Alternative	When to switch
(chat support)	(Claude Sonnet 4.6)	(Good quality, fast enough, reasonable cost)	(GPT-5.5)	(If multilingual quality matters more)
(document summary)	(Gemini 3.1 Pro)	(Handles long documents natively)	(Claude Opus 4.8)	(If summary quality needs to be higher)

Before committing to a model: Run a small benchmark (20-30 representative inputs) through your top 2 candidates and compare quality side-by-side. Use /ai-eval-design for a structured eval if the decision is close.

Step 3: Design the routing strategy

If using multiple models, define how requests get to the right model:

Routing approaches:

Approach	How it works	Best for	Complexity
Static routing	Each feature hardcodes its model	Simple products, few use cases	Low
Gateway/router	Central service (OpenRouter, custom) routes by rules	Multi-model products	Medium
Classifier-based	A small model classifies the request, routes to the right model	Varied input types on a single endpoint	High (adds latency from classification call)
Fallback chain	Try primary model, fall back to secondary on failure/timeout	Reliability-critical features	Medium
Multi-agent orchestration	Multiple agents work in parallel on sub-tasks, coordinated by a conductor	Complex multi-step workflows, large codebases	High (emerging tooling: Maestri, Google Antigravity, Warp)

Emerging pattern -- multi-agent coordination: Tools like Maestri (spatial canvas for terminal agents), Google Antigravity (parallel coding agents in IDE), and Warp (agent platform) represent a new orchestration layer. Instead of routing one request to one model, these tools run multiple agents simultaneously on related sub-tasks. Consider this pattern when the workflow involves parallel, decomposable work (e.g., multiple files in a refactor, simultaneous test writing and implementation).

Design decisions:

Where does routing logic live? (Application code, API gateway, managed router)
What are the fallback rules? (If primary model times out, use X. If rate-limited, use Y.)
How do you handle model version changes? (Pin versions, auto-update, test-then-update)
Do you need A/B testing across models? (Route a % of traffic to each for comparison)

If using a managed router like OpenRouter:

Map each use case to a model preference with fallback
Set cost limits per route
Configure retry and timeout behavior

Managed routing and unified API tools:

Tool	Approach	Best for
OpenRouter	Managed router with model marketplace	Teams wanting simple model switching + discovery
liteLLM	Open-source unified API proxy	Teams needing consistent interface across OpenAI, Anthropic, Gemini, and open-source models with fallback support. Self-hostable.

liteLLM is particularly useful when you want to swap providers without changing application code -- it normalizes the API interface so routing changes are configuration, not code changes.

Step 4: Model the costs

Build a cost projection for the full model strategy:

Use case	Model	Avg input tokens	Avg output tokens	Cost per request	Daily volume	Daily cost	Monthly cost
(chat support)	(Sonnet 4.6)	(2,000)	(500)	($0.009)	(10,000)	($90)	($2,700)
(doc summary)	(Gemini 3.1 Pro)	(50,000)	(2,000)	($0.08)	(500)	($40)	($1,200)
(classification)	(Haiku 4.5)	(500)	(50)	($0.0003)	(50,000)	($15)	($450)
Total						$145	$4,350

Don't forget rate limits: At high volume (10K+ requests/day per model), check your API tier's rate limits. You may need to account for queuing, retries, or tier upgrades in your cost model.

Cost optimization levers:

Prompt caching -- reuse system prompts across requests so you don't pay full input cost on every call (Claude, GPT support this)
Shorter prompts -- trim unnecessary context; measure quality impact
Batch processing -- for non-real-time tasks, batch API calls at lower rates
Model tiering -- use expensive models only where quality demands it; cheap models everywhere else
Output limits -- set max_tokens to prevent runaway generation costs
Caching outputs -- cache identical or near-identical requests

Step 5: Generate the model strategy document

# Multi-Model Strategy: (Product name)

**Generated:** (date)
**Product:** (brief description)
**Models in use:** (list)
**Monthly cost projection:** (total)

## Use Case - Model Mapping
(Table from Step 1 + Step 2 -- use cases, requirements, recommended models)

## Routing Architecture
(Approach from Step 3 -- how requests reach models, fallback behavior)

### Routing rules
- (Rule 1: e.g., "All chat requests -> Claude Sonnet 4.6 via OpenRouter")
- (Rule 2: e.g., "Document processing > 100K tokens -> Gemini 3.1 Pro")
- (Rule 3: e.g., "Classification tasks -> Claude Haiku 4.5, batch mode")

### Fallback chain
- (Primary failure -> secondary model)
- (Rate limit -> queue and retry vs. fallback model)
- (Timeout threshold -> fallback trigger)

## Cost Projections
(Table from Step 4 -- per-use-case and total costs)

## Cost Optimization Plan
(Prioritized list of cost reduction levers with estimated savings)

## Model Governance
- **Version pinning policy:** (Pin major versions, auto-update minor? Or pin everything?)
- **Eval before upgrade:** (Run golden dataset eval before switching model versions)
- **Vendor redundancy:** (Single vendor risk? Fallback vendor for critical paths?)
- **Review cadence:** (How often to re-evaluate model choices -- quarterly? Check `knowledge/ai-market-landscape-reference.md` for market shifts.)

## Migration Plan
(If switching models or adding new ones)
- **Phase 1:** (Shadow mode -- run new model in parallel, compare outputs)
- **Phase 2:** (Canary -- route 5-10% of traffic to new model)
- **Phase 3:** (Gradual rollout -- increase to 50%, then 100%)
- **Rollback trigger:** (What metrics trigger a rollback to the previous model)

## Open Questions
- (Unresolved model decisions)
- (Things that need benchmarking to determine)

Step 6: Review and finalize

Ask the user:

Does the model mapping make sense for each use case?
Is the cost projection within budget?
Is the routing approach practical for your engineering team?
Are there use cases missing from the map?
What's the plan for re-evaluating as new models are released?

Adjust based on feedback.

Output location

Present the strategy as formatted text in the conversation for the user to copy into their docs tool.

Example Output

Input

Product: Meridian Health — clinical documentation assistant for outpatient practices; physicians dictate notes, the product transcribes, structures, and drafts into EHR-ready SOAP notes
Capabilities in use today: (1) real-time transcription cleanup, (2) SOAP note drafting from raw transcript, (3) ICD-10 code suggestions, (4) patient-facing visit summary generation
Current model setup: GPT-4o for everything — team picked it first and never revisited; monthly bill is $34K and climbing
Pain points: Classification (ICD-10) doesn't need frontier-model quality but costs the same as note drafting; long transcripts (sometimes 80K+ tokens) occasionally hit context limits; PHI handling requires SOC 2-compliant providers only

Output (abbreviated)

Multi-Model Strategy: Meridian Health Clinical Documentation Assistant

Generated: 2025-07-14 Product: AI-assisted clinical documentation — transcription cleanup, SOAP drafting, ICD-10 coding, patient summaries Models in use (proposed): Claude Sonnet, Claude Haiku, Gemini Pro 1.5, GPT-4o (retained for one path) Monthly cost projection: ~$11,200 (vs. $34,000 current — ~67% reduction)

Use Case – Model Mapping

Use case	Capability needed	Quality bar	Latency need	Volume	Context size	Data sensitivity
Transcription cleanup	Light editing, formatting	High — physician reviews	Real-time (< 3s)	8,000/day	Small (< 8K)	PHI — regulated
SOAP note drafting	Long-form generation, clinical structure	Very high — clinical accuracy	Interactive (< 12s)	8,000/day	Medium–Large (20–80K)	PHI — regulated
ICD-10 code suggestions	Classification (multi-label)	High accuracy, low recall cost	Batch OK (< 30s)	8,000/day	Small (< 4K)	PHI — regulated
Patient visit summary	Plain-language generation	Medium — reviewed before sending	Interactive (< 15s)	4,000/day	Medium (< 20K)	PHI — regulated

Recommended model per use case:

Use case	Recommended model	Why	Alternative	When to switch
Transcription cleanup	Claude Haiku	Fast, cheap, task is constrained — no complex reasoning needed	GPT-4o mini	If error rate on medical terminology exceeds 2% in eval
SOAP note drafting	Claude Sonnet	Clinical instruction-following, handles 80K context, strong quality-to-cost ratio	Claude Opus	If physician edit rate stays above 25% after tuning
ICD-10 code suggestions	Claude Haiku (batch)	Classification only, batching cuts cost 50%, Haiku accuracy sufficient for assist (not autonomous)	Fine-tuned Llama 3 (self-hosted)	If volume exceeds 50K/day and self-hosting ROI crosses over
Patient visit summary	Claude Sonnet	Plain-language generation, same provider as SOAP (reduces integration surface), SOC 2 compliant	GPT-4o	If patient satisfaction scores on summary clarity trail benchmark

Note on GPT-4o: Retained as fallback provider only. Removed from primary paths to reduce vendor concentration risk and cost. Anthropic (Claude) is the primary vendor; all models are available under Anthropic's HIPAA BAA.

Routing Architecture

Approach: Static routing per feature + fallback chain on primary model failure

Each product feature hardcodes its model assignment. No classifier-based routing needed — use case boundaries are already well-defined at the application layer (separate API calls for each step in the documentation pipeline).

Transcription cleanup   → Claude Haiku         (via liteLLM proxy)
SOAP note drafting      → Claude Sonnet        (via liteLLM proxy)
ICD-10 suggestions      → Claude Haiku, batch  (via liteLLM proxy)
Patient summary         → Claude Sonnet        (via liteLLM proxy)

Why liteLLM: Normalizes the API surface so swapping Haiku → Sonnet (or adding Gemini as a long-context fallback) is a config change, not a code change. Self-hostable within Meridian's existing AWS VPC — keeps PHI off third-party routing infrastructure.

Routing rules

All requests routed through internal liteLLM proxy deployed in HIPAA-compliant AWS environment
SOAP drafts with input tokens > 90K → overflow to Gemini Pro 1.5 (1M context window), flagged for physician review
ICD-10 requests queued in batches of 100, dispatched every 60 seconds during clinic hours

Fallback chain

Claude Haiku timeout (> 5s) → retry once → fall back to Claude Sonnet (same task, higher cost, logged)
Claude Sonnet timeout (> 20s) → fall back to GPT-4o with alert to on-call
Anthropic rate limit → queue with 30s retry; if queue depth > 500 → activate GPT-4o fallback route
Gemini overflow path failure → return partial transcript to physician with "manual review required" flag

Cost Projections

Use case	Model	Avg input tokens	Avg output tokens	Cost/request	Daily volume	Daily cost	Monthly cost
Transcription cleanup	Claude Haiku	3,000	800	$0.0011	8,000	$8.80	$264
SOAP note drafting	Claude Sonnet	35,000	2,500	$0.061	8,000	$488	$14,640
ICD-10 suggestions	Claude Haiku (batch)	600	100	$0.00019	8,000	$1.52	$46
Patient summary	Claude Sonnet	12,000	1,200	$0.022	4,000	$88	$2,640
Gemini overflow (SOAP)	Gemini Pro 1.5	110,000	3,000	$0.135	~200	$27	$810
Total						$613	$18,400

vs. current GPT-4o baseline: Estimated $34,000/month. Projected savings: ~$15,600/month (~46%). Note: The $11,200 figure in the header reflects additional savings from prompt caching (see below). Verify current Anthropic and Google pricing before finalizing budget commitments.

Cost Optimization Plan

Prompt caching on SOAP drafting — System prompt + EHR schema template is ~4,000 tokens, repeated on every call. Claude's prompt caching reduces this to near-zero after first call. Estimated savings: ~$2,800/month.
Batch ICD-10 calls — Already modeled above. Shift from synchronous to async batch API. Estimated savings: $400/month.
Output token limits — SOAP notes: cap at 2,800 tokens. Summaries: cap at 1,500. Add structured output schema to reduce generation verbosity. Estimated savings: $600/month.
Cache patient summaries for re-sends — Same visit summary requested multiple times (e.g., patient portal re-delivery). Cache by visit ID + hash. Estimated savings: $150/month.
Trim transcription cleanup prompts — Current prompt includes 1,200-token style guide injected every call. Move to few-shot examples (6 examples, ~400 tokens). Benchmark quality first. Estimated savings: $80/month.

Total estimated optimizations: ~$4,030/month additional → landing at ~$14,370/month (from $34K baseline, 58% reduction).

Model Governance

Version pinning: Pin to claude-sonnet-3-5 and claude-haiku-3-5 explicitly. No auto-updates. Anthropic model

Run this now

Try /multi-model-strategy on your own input

0/4000

Part of these Playbook topics

Agent Skills AI Maturity Agent Experience

Related AI & Agents skills

Agent Eval Harness Agent Reliability Audit AI Agent Design AI Eval Design AI Guardrails Design AI Health Check AI Product Spec AI Risk Register

Back to Skills Catalog