Use this when a product uses (or should use) more than one AI model, or when you need to make a deliberate model selection decision. Helps you match use cases to models based on capability, cost, speed, and quality rather than defaulting to whatever model the team tried first. Also covers routing strategies and cost modeling.
Related skills: For tool-level selection (which AI product to use), see /tool-recommend. Model selection feeds into /ai-product-spec. Compare model quality using /ai-eval-design.
Platform context: Platform routing decisions affect all consumers. Model selection at the platform layer sets the floor -- default models, cost guardrails, approved providers -- while team-level selection refines for specific use cases. Over-centralizing model selection is a common anti-pattern (see AI Platform Anti-Patterns).
Process
Step 1: Map use cases to requirements
Ask the user:
- What AI-powered capabilities does the product have (or plan to have)?
- What models are you using today? (If any -- what's working, what's not, what's too expensive)
For each capability, define requirements:
| Use case | Capability needed | Quality bar | Latency need | Volume | Context size | Data sensitivity |
|---|---|---|---|---|---|---|
| (e.g., chat support) | Conversation, tool use | High -- customer-facing | Real-time (< 3s) | 10K/day | Medium (< 32K) | Private, not regulated |
| (e.g., document summary) | Summarization | Medium -- internal tool | Interactive (< 10s) | 500/day | Large (> 100K) | Confidential |
| (e.g., content classification) | Classification | Must be accurate | Batch OK | 50K/day | Small (< 4K) | Public |
| (e.g., code review) | Code analysis | High -- blocks deploys | Interactive (< 15s) | 200/day | Large (> 100K) | Source code |
Not every use case needs the same model. That's the point.
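One way to make the requirements table actionable is to encode the hard constraints (latency and context size) and filter candidate models against them before weighing quality and cost. The sketch below is illustrative only: `UseCase`, `ModelSpec`, `shortlist`, and the two model entries with their figures are hypothetical names and numbers, not vendor data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    max_latency_s: float   # latency need from the requirements table
    context_tokens: int    # largest expected input, in tokens
    requests_per_day: int


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_window: int        # hypothetical figure; verify with the provider
    typical_latency_s: float   # hypothetical figure; measure on your own traffic


def shortlist(use_case, models):
    """Return names of models that satisfy the hard constraints of a use case."""
    return [
        m.name
        for m in models
        if m.context_window >= use_case.context_tokens
        and m.typical_latency_s <= use_case.max_latency_s
    ]


# Hypothetical specs for illustration only.
MODELS = [
    ModelSpec("small-fast", context_window=32_000, typical_latency_s=1.0),
    ModelSpec("large-context", context_window=1_000_000, typical_latency_s=8.0),
]

chat = UseCase("chat support", max_latency_s=3.0, context_tokens=32_000,
               requests_per_day=10_000)
summary = UseCase("document summary", max_latency_s=10.0, context_tokens=150_000,
                  requests_per_day=500)

print(shortlist(chat, MODELS))
print(shortlist(summary, MODELS))
```

Quality bar and data sensitivity are deliberately left out of the filter; they usually need an eval or a compliance review rather than a numeric threshold.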
Step 2: Evaluate model options per use case
For each use case from Step 1, evaluate candidate models:
Model capability reference (representative as of early 2026 -- verify current model lineups and pricing before committing):
| Model | Best for | Strengths | Limitations | Relative cost |
|---|---|---|---|---|
| Claude Opus | Complex reasoning, nuance, long context | Deep analysis, instruction following, 200K context | Slower, higher cost | $$$ |
| Claude Sonnet | Balanced tasks, coding, general use | Good quality-to-cost ratio, fast | Less nuanced than Opus for complex tasks | $$ |
| Claude Haiku | High-volume, simple tasks, classification | Very fast, very cheap | Less capable on complex reasoning | $ |
| GPT-4o | Multimodal, general purpose | Strong across tasks, vision | Cost at scale, rate limits | $$$ |
| GPT-4o mini | Budget-friendly general tasks | Good for simple tasks, cheap | Quality drops on complex prompts | $ |
| Gemini Pro | Long context, multimodal | 1M+ token context, Google integration | Varies by task type | $$ |
| Gemini Flash | Speed-critical, high volume | Very fast, large context | Trades quality for speed | $ |
| Open-source (Llama, Mistral) | Self-hosted, data-sensitive, cost-critical | No per-token cost, full control | Requires infrastructure, less capable | $ (infra cost) |
| Specialized models | Domain-specific (code, medical, legal) | Tuned for specific domains | Narrow capability | Varies |
Per-use-case recommendation format:
| Use case | Recommended model | Why | Alternative | When to switch |
|---|---|---|---|---|
| (chat support) | (Claude Sonnet) | (Good quality, fast enough, reasonable cost) | (GPT-4o) | (If multilingual quality matters more) |
| (document summary) | (Gemini Pro) | (Handles long documents natively) | (Claude Opus) | (If summary quality needs to be higher) |
Before committing to a model: Run a small benchmark (20-30 representative inputs) through your top 2 candidates and compare quality side-by-side. Use /ai-eval-design for a structured eval if the decision is close.
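A side-by-side benchmark can be as small as a loop that runs both candidates over the same inputs and tallies which output scores higher. The following is a minimal sketch under stated assumptions: `compare_models`, the stub models, and `exact_trim_score` are hypothetical stand-ins; in practice the callables would wrap real provider clients and the scorer would be a rubric, a human rating, or a grader model.

```python
def compare_models(inputs, model_a, model_b, score):
    """Run both candidates on every input and tally which one scores higher."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for text in inputs:
        score_a = score(text, model_a(text))
        score_b = score(text, model_b(text))
        if score_a > score_b:
            tally["a"] += 1
        elif score_b > score_a:
            tally["b"] += 1
        else:
            tally["tie"] += 1
    return tally


# Stub "models" and scorer so the sketch runs without any API access.
def stub_a(text):
    return text.strip()


def stub_b(text):
    return text


def exact_trim_score(text, output):
    """Toy scorer: 1.0 if the output is the input with whitespace trimmed."""
    return 1.0 if output == text.strip() else 0.0


inputs = ["  hello ", "world", " x"]
print(compare_models(inputs, stub_a, stub_b, exact_trim_score))
```

With 20-30 inputs a tally like this only shows large differences; a close result is the signal to move to a structured eval.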
Step 3: Design the routing strategy
If using multiple models, define how requests get to the right model:
Routing approaches:
| Approach | How it works | Best for | Complexity |
|---|---|---|---|
| Static routing | Each feature hardcodes its model | Simple products, few use cases | Low |
| Gateway/router | Central service (OpenRouter, custom) routes by rules | Multi-model products | Medium |
| Classifier-based | A small model classifies the request, routes to the right model | Varied input types on a single endpoint | High (adds latency from classification call) |
| Fallback chain | Try primary model, fall back to secondary on failure/timeout | Reliability-critical features | Medium |
| Multi-agent orchestration | Multiple agents work in parallel on sub-tasks, coordinated by a conductor | Complex multi-step workflows, large codebases | High (emerging tooling: Maestri, Google Antigravity, Warp) |
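The fallback-chain row above can be sketched in a few lines. This is an illustration, not a production client: `ModelError`, `call_with_fallback`, and the stub callables are hypothetical, and a timeout is modeled simply as an exception raised by the callable (a real client library would enforce the deadline itself).

```python
class ModelError(Exception):
    """Hypothetical error raised by a model callable on failure or timeout."""


def call_with_fallback(prompt, chain, retries_per_model=1):
    """Try each (name, callable) in order; return (name, output) of the first success."""
    errors = []
    for name, call in chain:
        # One initial attempt plus `retries_per_model` retries before moving on.
        for attempt in range(retries_per_model + 1):
            try:
                return name, call(prompt)
            except ModelError as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise ModelError("all models failed: " + "; ".join(errors))


# Stub callables standing in for real provider clients.
def always_times_out(prompt):
    raise ModelError("timeout")


def healthy(prompt):
    return f"ok: {prompt}"


chain = [("primary", always_times_out), ("secondary", healthy)]
used, output = call_with_fallback("hi", chain)
print(used, output)
```

Returning the name of the model that actually answered makes it easy to log how often the fallback fires, which feeds back into the cost model.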
Emerging pattern -- multi-agent coordination: Tools like Maestri (spatial canvas for terminal agents), Google Antigravity (parallel coding agents in IDE), and Warp (agent platform) represent a new orchestration layer. Instead of routing one request to one model, these tools run multiple agents simultaneously on related sub-tasks. Consider this pattern when the workflow involves parallel, decomposable work (e.g., multiple files in a refactor, simultaneous test writing and implementation).
Design decisions:
- Where does routing logic live? (Application code, API gateway, managed router)
- What are the fallback rules? (If primary model times out, use X. If rate-limited, use Y.)
- How do you handle model version changes? (Pin versions, auto-update, test-then-update)
- Do you need A/B testing across models? (Route a % of traffic to each for comparison)
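For the A/B testing question above, one common approach is to hash a stable request or user ID into a bucket so that the same ID always lands on the same model. The sketch below assumes that approach; `assign_variant` and the model names in the usage lines are hypothetical.

```python
import hashlib


def assign_variant(request_id: str, split: dict) -> str:
    """Deterministically map an ID to a model according to traffic shares."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    cumulative = 0.0
    for model, share in split.items():
        cumulative += share
        if bucket < cumulative:
            return model
    return model  # guards against floating-point rounding at the upper edge


split = {"current-model": 0.9, "candidate-model": 0.1}
print(assign_variant("req-1", split))
```

Hashing on a user or session ID rather than a per-request ID keeps one user on one model for the duration of the test, which matters for conversational features.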
If using a managed router like OpenRouter:
- Map each use case to a model preference with fallback
- Set cost limits per route
- Configure retry and timeout behavior
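The three items above amount to a per-route record: a preferred model, a fallback, a spend cap, and retry/timeout settings. The sketch below shows that shape as plain data; it is not OpenRouter's configuration format or API, and `Route`, `BudgetExceeded`, `select_model`, and the model names are hypothetical.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a route has used up its daily spend cap."""


@dataclass
class Route:
    primary: str
    fallback: str
    daily_budget_usd: float
    timeout_s: float = 10.0
    max_retries: int = 1
    spent_today_usd: float = 0.0

    def record_spend(self, cost_usd: float) -> None:
        self.spent_today_usd += cost_usd


def select_model(use_case: str, routes: dict) -> str:
    """Return the primary model for a use case, enforcing the route's cost limit."""
    route = routes[use_case]
    if route.spent_today_usd >= route.daily_budget_usd:
        raise BudgetExceeded(f"{use_case}: daily budget of "
                             f"${route.daily_budget_usd:.2f} reached")
    return route.primary


ROUTES = {
    "chat": Route(primary="model-a", fallback="model-b",
                  daily_budget_usd=100.0, timeout_s=3.0),
}

print(select_model("chat", ROUTES))
```

Whether a route should fail, queue, or downgrade to a cheaper model once the cap is hit is a product decision; the sketch raises so that the choice stays explicit.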
Managed routing and unified API tools:
| Tool | Approach | Best for |
|---|---|---|
| OpenRouter | Managed router with model marketplace | Teams wanting simple model switching + discovery |
| liteLLM | Open-source unified API proxy | Teams needing consistent interface across OpenAI, Anthropic, Gemini, and open-source models with fallback support. Self-hostable. |
liteLLM is particularly useful when you want to swap providers without changing application code -- it normalizes the API interface so routing changes are configuration, not code changes.
Step 4: Model the costs
Build a cost projection for the full model strategy:
| Use case | Model | Avg input tokens | Avg output tokens | Cost per request | Daily volume | Daily cost | Monthly cost |
|---|---|---|---|---|---|---|---|
| (chat support) | (Sonnet) | (2,000) | (500) | ($0.009) | (10,000) | ($90) | ($2,700) |
| (doc summary) | (Gemini Pro) | (50,000) | (2,000) | ($0.08) | (500) | ($40) | ($1,200) |
| (classification) | (Haiku) | (500) | (50) | ($0.0003) | (50,000) | ($15) | ($450) |
| Total | | | | | | $145 | $4,350 |
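The arithmetic behind the table is simple enough to script, which makes it easy to rerun when volumes or prices change. The sketch below reproduces the placeholder rows above, assuming a 30-day month as the table does. `CostLine` and `cost_per_request` are hypothetical helpers, and the per-million-token prices passed to `cost_per_request` in the last line are made-up numbers, not any provider's rates.

```python
from dataclasses import dataclass


def cost_per_request(input_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok):
    """Cost of one request given token counts and prices per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


@dataclass(frozen=True)
class CostLine:
    use_case: str
    cost_per_request: float
    daily_volume: int

    @property
    def daily_cost(self) -> float:
        return self.cost_per_request * self.daily_volume

    @property
    def monthly_cost(self) -> float:
        return self.daily_cost * 30  # the table assumes a 30-day month


lines = [
    CostLine("chat support", 0.009, 10_000),
    CostLine("doc summary", 0.08, 500),
    CostLine("classification", 0.0003, 50_000),
]

total_daily = sum(line.daily_cost for line in lines)
total_monthly = sum(line.monthly_cost for line in lines)
print(f"${total_daily:,.0f}/day, ${total_monthly:,.0f}/month")

# Hypothetical prices, only to show the per-request formula.
print(cost_per_request(2_000, 500, input_price_per_mtok=1.0, output_price_per_mtok=2.0))
```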
Don't forget rate limits: At high volume (10K+ requests/day per model), check your API tier's rate limits. You may need to account for queuing, retries, or tier upgrades in your cost model.
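A quick way to sanity-check a rate limit is to convert daily volume into peak requests per minute. The sketch below assumes traffic is spread over a fixed number of active hours and scaled by a peak multiplier; `required_rpm`, `active_hours`, and `peak_factor` are hypothetical names, and the values used are illustrative rather than measured.

```python
def required_rpm(daily_volume, active_hours=24.0, peak_factor=1.0):
    """Estimate peak requests per minute from daily volume.

    active_hours: hours per day in which traffic actually arrives (assumption).
    peak_factor:  ratio of peak to average traffic within those hours (assumption).
    """
    if active_hours <= 0 or peak_factor < 1:
        raise ValueError("active_hours must be > 0 and peak_factor >= 1")
    average_rpm = daily_volume / (active_hours * 60)
    return average_rpm * peak_factor


def exceeds_limit(daily_volume, rpm_limit, active_hours=24.0, peak_factor=1.0):
    """True if the estimated peak would exceed the tier's requests-per-minute limit."""
    return required_rpm(daily_volume, active_hours, peak_factor) > rpm_limit


# 10K requests/day arriving over an 8-hour window with 3x peaks.
print(required_rpm(10_000, active_hours=8, peak_factor=3))
print(exceeds_limit(10_000, rpm_limit=50, active_hours=8, peak_factor=3))
```

Many providers also cap tokens per minute, so the same estimate is worth repeating with average tokens per request multiplied in.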
Cost optimization levers:
- Prompt caching -- reuse a shared system prompt across requests so repeat calls pay a reduced rate for those tokens instead of the full input cost (supported by Claude and GPT models)
- Shorter prompts -- trim unnecessary context; measure quality impact
- Batch processing -- for non-real-time tasks, batch API calls at lower rates
- Model tiering -- use expensive models only where quality demands it; cheap models everywhere else
- Output limits -- set max_tokens to prevent runaway generation costs
- Caching outputs -- cache responses to identical or near-identical requests
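Of the levers above, output caching is the easiest to prototype. The sketch below keys an in-memory cache on the model name plus a normalized prompt; `ResponseCache` and `fake_model` are hypothetical, and the normalization (lowercasing and collapsing whitespace) only catches trivial near-duplicates and would be wrong for case-sensitive tasks.

```python
import hashlib


class ResponseCache:
    """In-memory cache keyed on model name plus normalized prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Lowercase and collapse whitespace so trivially different prompts match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call):
        """Return a cached output if present; otherwise call the model and store it."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = call(prompt)
        self._store[key] = output
        return output


calls = []


def fake_model(prompt):
    """Stub that records each real call so cache hits are visible."""
    calls.append(prompt)
    return f"summary of: {prompt}"


cache = ResponseCache()
first = cache.get_or_call("model-a", "Summarize  the visit", fake_model)
second = cache.get_or_call("model-a", "summarize the visit", fake_model)
print(cache.hits, cache.misses, len(calls))
```

A production version would add expiry and a shared store, and would need a policy for whether sensitive outputs may be cached at all.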
Step 5: Generate the model strategy document
# Multi-Model Strategy: (Product name)
**Generated:** (date)
**Product:** (brief description)
**Models in use:** (list)
**Monthly cost projection:** (total)
## Use Case - Model Mapping
(Table from Step 1 + Step 2 -- use cases, requirements, recommended models)
## Routing Architecture
(Approach from Step 3 -- how requests reach models, fallback behavior)
### Routing rules
- (Rule 1: e.g., "All chat requests -> Claude Sonnet via OpenRouter")
- (Rule 2: e.g., "Document processing > 100K tokens -> Gemini Pro")
- (Rule 3: e.g., "Classification tasks -> Claude Haiku, batch mode")
### Fallback chain
- (Primary failure -> secondary model)
- (Rate limit -> queue and retry vs. fallback model)
- (Timeout threshold -> fallback trigger)
## Cost Projections
(Table from Step 4 -- per-use-case and total costs)
## Cost Optimization Plan
(Prioritized list of cost reduction levers with estimated savings)
## Model Governance
- **Version pinning policy:** (Pin major versions, auto-update minor? Or pin everything?)
- **Eval before upgrade:** (Run golden dataset eval before switching model versions)
- **Vendor redundancy:** (Single vendor risk? Fallback vendor for critical paths?)
- **Review cadence:** (How often to re-evaluate model choices -- quarterly? Check `knowledge/ai-market-landscape-reference.md` for market shifts.)
## Migration Plan
(If switching models or adding new ones)
- **Phase 1:** (Shadow mode -- run new model in parallel, compare outputs)
- **Phase 2:** (Canary -- route 5-10% of traffic to new model)
- **Phase 3:** (Gradual rollout -- increase to 50%, then 100%)
- **Rollback trigger:** (What metrics trigger a rollback to the previous model)
## Open Questions
- (Unresolved model decisions)
- (Things that need benchmarking to determine)
Step 6: Review and finalize
Ask the user:
- Does the model mapping make sense for each use case?
- Is the cost projection within budget?
- Is the routing approach practical for your engineering team?
- Are there use cases missing from the map?
- What's the plan for re-evaluating as new models are released?
Adjust based on feedback.
Output location
Present the strategy as formatted text in the conversation for the user to copy into their docs tool.
Example Output
Input
- Product: Meridian Health — clinical documentation assistant for outpatient practices; physicians dictate notes, the product transcribes, structures, and drafts into EHR-ready SOAP notes
- Capabilities in use today: (1) real-time transcription cleanup, (2) SOAP note drafting from raw transcript, (3) ICD-10 code suggestions, (4) patient-facing visit summary generation
- Current model setup: GPT-4o for everything — team picked it first and never revisited; monthly bill is $34K and climbing
- Pain points: Classification (ICD-10) doesn't need frontier-model quality but costs the same as note drafting; long transcripts (sometimes 80K+ tokens) occasionally hit context limits; PHI handling requires SOC 2-compliant providers only
Output (abbreviated)
Multi-Model Strategy: Meridian Health Clinical Documentation Assistant
Generated: 2025-07-14
Product: AI-assisted clinical documentation (transcription cleanup, SOAP drafting, ICD-10 coding, patient summaries)
Models in use (proposed): Claude Sonnet, Claude Haiku, Gemini Pro 1.5 (long-context overflow), GPT-4o (fallback only)
Monthly cost projection: ~$14,370 after optimizations (vs. $34,000 current, ~58% reduction)
Use Case – Model Mapping
| Use case | Capability needed | Quality bar | Latency need | Volume | Context size | Data sensitivity |
|---|---|---|---|---|---|---|
| Transcription cleanup | Light editing, formatting | High — physician reviews | Real-time (< 3s) | 8,000/day | Small (< 8K) | PHI — regulated |
| SOAP note drafting | Long-form generation, clinical structure | Very high — clinical accuracy | Interactive (< 12s) | 8,000/day | Medium–Large (20–80K) | PHI — regulated |
| ICD-10 code suggestions | Classification (multi-label) | High accuracy, low recall cost | Batch OK (< 30s) | 8,000/day | Small (< 4K) | PHI — regulated |
| Patient visit summary | Plain-language generation | Medium — reviewed before sending | Interactive (< 15s) | 4,000/day | Medium (< 20K) | PHI — regulated |
Recommended model per use case:
| Use case | Recommended model | Why | Alternative | When to switch |
|---|---|---|---|---|
| Transcription cleanup | Claude Haiku | Fast, cheap, task is constrained — no complex reasoning needed | GPT-4o mini | If error rate on medical terminology exceeds 2% in eval |
| SOAP note drafting | Claude Sonnet | Clinical instruction-following, handles 80K context, strong quality-to-cost ratio | Claude Opus | If physician edit rate stays above 25% after tuning |
| ICD-10 code suggestions | Claude Haiku (batch) | Classification only, batching cuts cost 50%, Haiku accuracy sufficient for assist (not autonomous) | Fine-tuned Llama 3 (self-hosted) | If volume exceeds 50K/day and self-hosting ROI crosses over |
| Patient visit summary | Claude Sonnet | Plain-language generation, same provider as SOAP (reduces integration surface), SOC 2 compliant | GPT-4o | If patient satisfaction scores on summary clarity trail benchmark |
Note on GPT-4o: Retained as fallback provider only. Removed from primary paths to reduce cost. Anthropic (Claude) is the primary vendor; the Claude models on all primary paths are covered under Anthropic's HIPAA BAA.
Routing Architecture
Approach: Static routing per feature + fallback chain on primary model failure
Each product feature hardcodes its model assignment. No classifier-based routing needed — use case boundaries are already well-defined at the application layer (separate API calls for each step in the documentation pipeline).
Transcription cleanup → Claude Haiku (via liteLLM proxy)
SOAP note drafting → Claude Sonnet (via liteLLM proxy)
ICD-10 suggestions → Claude Haiku, batch (via liteLLM proxy)
Patient summary → Claude Sonnet (via liteLLM proxy)
Why liteLLM: Normalizes the API surface so swapping Haiku → Sonnet (or adding Gemini as a long-context fallback) is a config change, not a code change. Self-hostable within Meridian's existing AWS VPC — keeps PHI off third-party routing infrastructure.
Routing rules
- All requests routed through internal liteLLM proxy deployed in HIPAA-compliant AWS environment
- SOAP drafts with input tokens > 90K → overflow to Gemini Pro 1.5 (1M context window), flagged for physician review
- ICD-10 requests queued in batches of 100, dispatched every 60 seconds during clinic hours
Fallback chain
- Claude Haiku timeout (> 5s) → retry once → fall back to Claude Sonnet (same task, higher cost, logged)
- Claude Sonnet timeout (> 20s) → fall back to GPT-4o with alert to on-call
- Anthropic rate limit → queue with 30s retry; if queue depth > 500 → activate GPT-4o fallback route
- Gemini overflow path failure → return partial transcript to physician with "manual review required" flag
Cost Projections
| Use case | Model | Avg input tokens | Avg output tokens | Cost/request | Daily volume | Daily cost | Monthly cost |
|---|---|---|---|---|---|---|---|
| Transcription cleanup | Claude Haiku | 3,000 | 800 | $0.0011 | 8,000 | $8.80 | $264 |
| SOAP note drafting | Claude Sonnet | 35,000 | 2,500 | $0.061 | 8,000 | $488 | $14,640 |
| ICD-10 suggestions | Claude Haiku (batch) | 600 | 100 | $0.00019 | 8,000 | $1.52 | $46 |
| Patient summary | Claude Sonnet | 12,000 | 1,200 | $0.022 | 4,000 | $88 | $2,640 |
| Gemini overflow (SOAP) | Gemini Pro 1.5 | 110,000 | 3,000 | $0.135 | ~200 | $27 | $810 |
| Total | $613 | $18,400 |
vs. current GPT-4o baseline: Estimated $34,000/month. Projected savings before optimizations: ~$15,600/month (~46%). Note: The header's monthly projection is lower than the table total because it also includes the savings from the Cost Optimization Plan below. Verify current Anthropic and Google pricing before finalizing budget commitments.
Cost Optimization Plan
- Prompt caching on SOAP drafting -- System prompt + EHR schema template is ~4,000 tokens, repeated on every call. With Claude's prompt caching, those tokens are billed at a reduced cached-read rate after the first call instead of the full input price. Estimated savings: ~$2,800/month.
- Batch ICD-10 calls — Already modeled above. Shift from synchronous to async batch API. Estimated savings: $400/month.
- Output token limits — SOAP notes: cap at 2,800 tokens. Summaries: cap at 1,500. Add structured output schema to reduce generation verbosity. Estimated savings: $600/month.
- Cache patient summaries for re-sends — Same visit summary requested multiple times (e.g., patient portal re-delivery). Cache by visit ID + hash. Estimated savings: $150/month.
- Trim transcription cleanup prompts — Current prompt includes 1,200-token style guide injected every call. Move to few-shot examples (6 examples, ~400 tokens). Benchmark quality first. Estimated savings: $80/month.
Total estimated optimizations: ~$4,030/month additional → landing at ~$14,370/month (from $34K baseline, 58% reduction).
Model Governance
- Version pinning: Pin to `claude-sonnet-3-5` and `claude-haiku-3-5` explicitly. No auto-updates. Anthropic model