Skip to main content
Engineering/multi-model-strategy

Multi-Model Strategy

You need to choose the right AI model for each job in your product – model mapping, routing, cost modeling, and migration planning.

Use this when a product uses (or should use) more than one AI model, or when you need to make a deliberate model selection decision. Helps you match use cases to models based on capability, cost, speed, and quality rather than defaulting to whatever model the team tried first. Also covers routing strategies and cost modeling.

Related skills: For tool-level selection (which AI product to use), see /tool-recommend. Model selection feeds into /ai-product-spec. Compare model quality using /ai-eval-design.

Platform context: Platform routing decisions affect all consumers. Model selection at the platform layer sets the floor -- default models, cost guardrails, approved providers -- while team-level selection refines for specific use cases. Over-centralizing model selection is a common anti-pattern (see AI Platform Anti-Patterns).

Process

Step 1: Map use cases to requirements

Ask the user:

  • What AI-powered capabilities does the product have (or plan to have)?
  • What models are you using today? (If any -- what's working, what's not, what's too expensive)

For each capability, define requirements:

Use caseCapability neededQuality barLatency needVolumeContext sizeData sensitivity
(e.g., chat support)Conversation, tool useHigh -- customer-facingReal-time (< 3s)10K/dayMedium (< 32K)Private, not regulated
(e.g., document summary)SummarizationMedium -- internal toolInteractive (< 10s)500/dayLarge (> 100K)Confidential
(e.g., content classification)ClassificationMust be accurateBatch OK50K/daySmall (< 4K)Public
(e.g., code review)Code analysisHigh -- blocks deploysInteractive (< 15s)200/dayLarge (> 100K)Source code

Not every use case needs the same model. That's the point.

Step 2: Evaluate model options per use case

For each use case from Step 1, evaluate candidate models:

Model capability reference (representative as of early 2026 -- verify current model lineups and pricing before committing):

ModelBest forStrengthsLimitationsRelative cost
Claude OpusComplex reasoning, nuance, long contextDeep analysis, instruction following, 200K contextSlower, higher cost$$$
Claude SonnetBalanced tasks, coding, general useGood quality-to-cost ratio, fastLess nuanced than Opus for complex tasks$$
Claude HaikuHigh-volume, simple tasks, classificationVery fast, very cheapLess capable on complex reasoning$
GPT-4oMultimodal, general purposeStrong across tasks, visionCost at scale, rate limits$$$
GPT-4o miniBudget-friendly general tasksGood for simple tasks, cheapQuality drops on complex prompts$
Gemini ProLong context, multimodal1M+ token context, Google integrationVaries by task type$$
Gemini FlashSpeed-critical, high volumeVery fast, large contextTrades quality for speed$
Open-source (Llama, Mistral)Self-hosted, data-sensitive, cost-criticalNo per-token cost, full controlRequires infrastructure, less capable$ (infra cost)
Specialized modelsDomain-specific (code, medical, legal)Tuned for specific domainsNarrow capabilityVaries

Per-use-case recommendation format:

Use caseRecommended modelWhyAlternativeWhen to switch
(chat support)(Claude Sonnet)(Good quality, fast enough, reasonable cost)(GPT-4o)(If multilingual quality matters more)
(document summary)(Gemini Pro)(Handles long documents natively)(Claude Opus)(If summary quality needs to be higher)

Before committing to a model: Run a small benchmark (20-30 representative inputs) through your top 2 candidates and compare quality side-by-side. Use /ai-eval-design for a structured eval if the decision is close.

Step 3: Design the routing strategy

If using multiple models, define how requests get to the right model:

Routing approaches:

ApproachHow it worksBest forComplexity
Static routingEach feature hardcodes its modelSimple products, few use casesLow
Gateway/routerCentral service (OpenRouter, custom) routes by rulesMulti-model productsMedium
Classifier-basedA small model classifies the request, routes to the right modelVaried input types on a single endpointHigh (adds latency from classification call)
Fallback chainTry primary model, fall back to secondary on failure/timeoutReliability-critical featuresMedium
Multi-agent orchestrationMultiple agents work in parallel on sub-tasks, coordinated by a conductorComplex multi-step workflows, large codebasesHigh (emerging tooling: Maestri, Google Antigravity, Warp)

Emerging pattern -- multi-agent coordination: Tools like Maestri (spatial canvas for terminal agents), Google Antigravity (parallel coding agents in IDE), and Warp (agent platform) represent a new orchestration layer. Instead of routing one request to one model, these tools run multiple agents simultaneously on related sub-tasks. Consider this pattern when the workflow involves parallel, decomposable work (e.g., multiple files in a refactor, simultaneous test writing and implementation).

Design decisions:

  • Where does routing logic live? (Application code, API gateway, managed router)
  • What are the fallback rules? (If primary model times out, use X. If rate-limited, use Y.)
  • How do you handle model version changes? (Pin versions, auto-update, test-then-update)
  • Do you need A/B testing across models? (Route a % of traffic to each for comparison)

If using a managed router like OpenRouter:

  • Map each use case to a model preference with fallback
  • Set cost limits per route
  • Configure retry and timeout behavior

Managed routing and unified API tools:

ToolApproachBest for
OpenRouterManaged router with model marketplaceTeams wanting simple model switching + discovery
liteLLMOpen-source unified API proxyTeams needing consistent interface across OpenAI, Anthropic, Gemini, and open-source models with fallback support. Self-hostable.

liteLLM is particularly useful when you want to swap providers without changing application code -- it normalizes the API interface so routing changes are configuration, not code changes.

Step 4: Model the costs

Build a cost projection for the full model strategy:

Use caseModelAvg input tokensAvg output tokensCost per requestDaily volumeDaily costMonthly cost
(chat support)(Sonnet)(2,000)(500)($0.009)(10,000)($90)($2,700)
(doc summary)(Gemini Pro)(50,000)(2,000)($0.08)(500)($40)($1,200)
(classification)(Haiku)(500)(50)($0.0003)(50,000)($15)($450)
Total$145$4,350

Don't forget rate limits: At high volume (10K+ requests/day per model), check your API tier's rate limits. You may need to account for queuing, retries, or tier upgrades in your cost model.

Cost optimization levers:

  • Prompt caching -- reuse system prompts across requests so you don't pay full input cost on every call (Claude, GPT support this)
  • Shorter prompts -- trim unnecessary context; measure quality impact
  • Batch processing -- for non-real-time tasks, batch API calls at lower rates
  • Model tiering -- use expensive models only where quality demands it; cheap models everywhere else
  • Output limits -- set max_tokens to prevent runaway generation costs
  • Caching outputs -- cache identical or near-identical requests

Step 5: Generate the model strategy document

# Multi-Model Strategy: (Product name)

**Generated:** (date)
**Product:** (brief description)
**Models in use:** (list)
**Monthly cost projection:** (total)

## Use Case - Model Mapping
(Table from Step 1 + Step 2 -- use cases, requirements, recommended models)

## Routing Architecture
(Approach from Step 3 -- how requests reach models, fallback behavior)

### Routing rules
- (Rule 1: e.g., "All chat requests -> Claude Sonnet via OpenRouter")
- (Rule 2: e.g., "Document processing > 100K tokens -> Gemini Pro")
- (Rule 3: e.g., "Classification tasks -> Claude Haiku, batch mode")

### Fallback chain
- (Primary failure -> secondary model)
- (Rate limit -> queue and retry vs. fallback model)
- (Timeout threshold -> fallback trigger)

## Cost Projections
(Table from Step 4 -- per-use-case and total costs)

## Cost Optimization Plan
(Prioritized list of cost reduction levers with estimated savings)

## Model Governance
- **Version pinning policy:** (Pin major versions, auto-update minor? Or pin everything?)
- **Eval before upgrade:** (Run golden dataset eval before switching model versions)
- **Vendor redundancy:** (Single vendor risk? Fallback vendor for critical paths?)
- **Review cadence:** (How often to re-evaluate model choices -- quarterly? Check `knowledge/ai-market-landscape-reference.md` for market shifts.)

## Migration Plan
(If switching models or adding new ones)
- **Phase 1:** (Shadow mode -- run new model in parallel, compare outputs)
- **Phase 2:** (Canary -- route 5-10% of traffic to new model)
- **Phase 3:** (Gradual rollout -- increase to 50%, then 100%)
- **Rollback trigger:** (What metrics trigger a rollback to the previous model)

## Open Questions
- (Unresolved model decisions)
- (Things that need benchmarking to determine)

Step 6: Review and finalize

Ask the user:

  • Does the model mapping make sense for each use case?
  • Is the cost projection within budget?
  • Is the routing approach practical for your engineering team?
  • Are there use cases missing from the map?
  • What's the plan for re-evaluating as new models are released?

Adjust based on feedback.

Output location

Present the strategy as formatted text in the conversation for the user to copy into their docs tool.

Example Output

Input

  • Product: Meridian Health — clinical documentation assistant for outpatient practices; physicians dictate notes, the product transcribes, structures, and drafts into EHR-ready SOAP notes
  • Capabilities in use today: (1) real-time transcription cleanup, (2) SOAP note drafting from raw transcript, (3) ICD-10 code suggestions, (4) patient-facing visit summary generation
  • Current model setup: GPT-4o for everything — team picked it first and never revisited; monthly bill is $34K and climbing
  • Pain points: Classification (ICD-10) doesn't need frontier-model quality but costs the same as note drafting; long transcripts (sometimes 80K+ tokens) occasionally hit context limits; PHI handling requires SOC 2-compliant providers only

Output (abbreviated)

Multi-Model Strategy: Meridian Health Clinical Documentation Assistant

Generated: 2025-07-14 Product: AI-assisted clinical documentation — transcription cleanup, SOAP drafting, ICD-10 coding, patient summaries Models in use (proposed): Claude Sonnet, Claude Haiku, Gemini Pro 1.5, GPT-4o (retained for one path) Monthly cost projection: ~$11,200 (vs. $34,000 current — ~67% reduction)


Use Case – Model Mapping

Use caseCapability neededQuality barLatency needVolumeContext sizeData sensitivity
Transcription cleanupLight editing, formattingHigh — physician reviewsReal-time (< 3s)8,000/daySmall (< 8K)PHI — regulated
SOAP note draftingLong-form generation, clinical structureVery high — clinical accuracyInteractive (< 12s)8,000/dayMedium–Large (20–80K)PHI — regulated
ICD-10 code suggestionsClassification (multi-label)High accuracy, low recall costBatch OK (< 30s)8,000/daySmall (< 4K)PHI — regulated
Patient visit summaryPlain-language generationMedium — reviewed before sendingInteractive (< 15s)4,000/dayMedium (< 20K)PHI — regulated

Recommended model per use case:

Use caseRecommended modelWhyAlternativeWhen to switch
Transcription cleanupClaude HaikuFast, cheap, task is constrained — no complex reasoning neededGPT-4o miniIf error rate on medical terminology exceeds 2% in eval
SOAP note draftingClaude SonnetClinical instruction-following, handles 80K context, strong quality-to-cost ratioClaude OpusIf physician edit rate stays above 25% after tuning
ICD-10 code suggestionsClaude Haiku (batch)Classification only, batching cuts cost 50%, Haiku accuracy sufficient for assist (not autonomous)Fine-tuned Llama 3 (self-hosted)If volume exceeds 50K/day and self-hosting ROI crosses over
Patient visit summaryClaude SonnetPlain-language generation, same provider as SOAP (reduces integration surface), SOC 2 compliantGPT-4oIf patient satisfaction scores on summary clarity trail benchmark

Note on GPT-4o: Retained as fallback provider only. Removed from primary paths to reduce vendor concentration risk and cost. Anthropic (Claude) is the primary vendor; all models are available under Anthropic's HIPAA BAA.


Routing Architecture

Approach: Static routing per feature + fallback chain on primary model failure

Each product feature hardcodes its model assignment. No classifier-based routing needed — use case boundaries are already well-defined at the application layer (separate API calls for each step in the documentation pipeline).

Transcription cleanup   → Claude Haiku         (via liteLLM proxy)
SOAP note drafting      → Claude Sonnet        (via liteLLM proxy)
ICD-10 suggestions      → Claude Haiku, batch  (via liteLLM proxy)
Patient summary         → Claude Sonnet        (via liteLLM proxy)

Why liteLLM: Normalizes the API surface so swapping Haiku → Sonnet (or adding Gemini as a long-context fallback) is a config change, not a code change. Self-hostable within Meridian's existing AWS VPC — keeps PHI off third-party routing infrastructure.

Routing rules

  • All requests routed through internal liteLLM proxy deployed in HIPAA-compliant AWS environment
  • SOAP drafts with input tokens > 90K → overflow to Gemini Pro 1.5 (1M context window), flagged for physician review
  • ICD-10 requests queued in batches of 100, dispatched every 60 seconds during clinic hours

Fallback chain

  • Claude Haiku timeout (> 5s) → retry once → fall back to Claude Sonnet (same task, higher cost, logged)
  • Claude Sonnet timeout (> 20s) → fall back to GPT-4o with alert to on-call
  • Anthropic rate limit → queue with 30s retry; if queue depth > 500 → activate GPT-4o fallback route
  • Gemini overflow path failure → return partial transcript to physician with "manual review required" flag

Cost Projections

Use caseModelAvg input tokensAvg output tokensCost/requestDaily volumeDaily costMonthly cost
Transcription cleanupClaude Haiku3,000800$0.00118,000$8.80$264
SOAP note draftingClaude Sonnet35,0002,500$0.0618,000$488$14,640
ICD-10 suggestionsClaude Haiku (batch)600100$0.000198,000$1.52$46
Patient summaryClaude Sonnet12,0001,200$0.0224,000$88$2,640
Gemini overflow (SOAP)Gemini Pro 1.5110,0003,000$0.135~200$27$810
Total$613$18,400

vs. current GPT-4o baseline: Estimated $34,000/month. Projected savings: ~$15,600/month (~46%). Note: The $11,200 figure in the header reflects additional savings from prompt caching (see below). Verify current Anthropic and Google pricing before finalizing budget commitments.


Cost Optimization Plan

  1. Prompt caching on SOAP drafting — System prompt + EHR schema template is ~4,000 tokens, repeated on every call. Claude's prompt caching reduces this to near-zero after first call. Estimated savings: ~$2,800/month.
  2. Batch ICD-10 calls — Already modeled above. Shift from synchronous to async batch API. Estimated savings: $400/month.
  3. Output token limits — SOAP notes: cap at 2,800 tokens. Summaries: cap at 1,500. Add structured output schema to reduce generation verbosity. Estimated savings: $600/month.
  4. Cache patient summaries for re-sends — Same visit summary requested multiple times (e.g., patient portal re-delivery). Cache by visit ID + hash. Estimated savings: $150/month.
  5. Trim transcription cleanup prompts — Current prompt includes 1,200-token style guide injected every call. Move to few-shot examples (6 examples, ~400 tokens). Benchmark quality first. Estimated savings: $80/month.

Total estimated optimizations: ~$4,030/month additional → landing at ~$14,370/month (from $34K baseline, 58% reduction).


Model Governance

  • Version pinning: Pin to claude-sonnet-3-5 and claude-haiku-3-5 explicitly. No auto-updates. Anthropic model