LLM Observability Plan

You need to plan monitoring for LLM-powered features in production – quality metrics, cost tracking, drift detection, and alerting.

Use this when you have (or are about to ship) AI-powered features and need to monitor them in production. Covers LLM-specific concerns that traditional observability and SRE monitoring miss: prompt/response quality, token costs, latency per model, hallucination drift, and model version regression. If you're looking to measure user behavior, use /observability-plan. If you're looking to measure system health, use /instrumentation-plan. This skill covers the AI-specific layer between those two.

The distinction: /observability-plan answers "Are users successful?" /instrumentation-plan answers "Is the system healthy?" This skill answers "Is the AI behaving correctly, consistently, and affordably?"

Related skills: Complements /observability-plan (product analytics) and /instrumentation-plan (SRE metrics). Eval criteria from /ai-eval-design become monitoring thresholds here. Quality dimensions from the /ai-health-check CARATS framework inform what to monitor.

Process

Step 1: Identify LLM touchpoints

Ask the user:

  1. What AI features are in production (or about to ship)? (List each feature that calls an LLM)
  2. Which models does each feature use? (Provider, model name, version)
  3. What's the current monitoring? (Any logging, dashboards, or alerts already in place?)
  4. What problems have you seen? (Quality issues, cost surprises, latency spikes, outages)
  5. What's the sensitivity level? (Can you log prompts/responses, or are there PII/compliance constraints?)
  6. What observability tools are in use or available? (Helicone, Langfuse, Datadog, custom, etc.)

Map each LLM touchpoint:

| Feature | Model | Calls/day | Avg latency | Current monitoring | Known issues |
| --- | --- | --- | --- | --- | --- |
| (Chat support) | (Claude Sonnet) | (10,000) | (2.1s) | (None) | (Occasional hallucinations) |
| (Doc summary) | (Gemini Pro) | (500) | (8.3s) | (Basic logging) | (Slow on large docs) |

Step 2: Define quality metrics

Quality monitoring answers: "Is the AI output still good?" Define metrics that catch degradation:

| Metric | Definition | How to measure | Threshold | Alert when |
| --- | --- | --- | --- | --- |
| Consistency score | Same input produces similar outputs over time | Run golden dataset weekly, compare scores | > 3.5 rubric avg | Score drops below 3.0 |
| Hallucination rate | % of outputs containing fabricated information | Automated fact-check or LLM-as-judge sampling | < 5% | Rate exceeds 10% |
| Relevance score | % of outputs that address the user's actual question | LLM-as-judge on sample + user feedback signals | > 85% | Below 75% |
| Tone compliance | Output matches expected voice/style | Tone rubric scoring on sample | > 90% pass | Below 80% |
| Safety incidents | Harmful, biased, or inappropriate outputs | Content filter + human review of flagged items | 0 critical | Any critical incident |
| User satisfaction signal | Thumbs up/down, regeneration rate, copy rate, conversation abandonment rate | In-product feedback + behavioral tracking (did the user accept, edit, or discard the output?) | (baseline) | Drops > 20% from baseline |

Not every feature needs every metric. Match metrics to the feature's risk profile:

  • Customer-facing, high-stakes: All metrics, tight thresholds
  • Internal tool, moderate stakes: Consistency + relevance + cost
  • Batch processing, low stakes: Cost + basic error rate
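The matching of metrics to risk profiles can be expressed as a small lookup plus a threshold check. The following is a minimal sketch, not a prescribed implementation: the names `MetricRule`, `RULES`, `PROFILE_METRICS`, and `breached_metrics` are hypothetical, the alert levels are copied from the quality metrics table above, and percentages are assumed to be stored as fractions (0.10 for 10%). Cost and error-rate checks are assumed to live elsewhere.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MetricRule:
    """Alert bounds for one quality metric (hypothetical helper)."""
    alert_below: Optional[float] = None  # alert when the value drops under this
    alert_above: Optional[float] = None  # alert when the value rises over this

    def breached(self, value: float) -> bool:
        if self.alert_below is not None and value < self.alert_below:
            return True
        if self.alert_above is not None and value > self.alert_above:
            return True
        return False


# Alert levels taken from the "Alert when" column of the quality metrics table.
RULES: Dict[str, MetricRule] = {
    "consistency": MetricRule(alert_below=3.0),
    "hallucination_rate": MetricRule(alert_above=0.10),
    "relevance": MetricRule(alert_below=0.75),
    "tone_compliance": MetricRule(alert_below=0.80),
}

# Which quality metrics each risk profile evaluates (cost is tracked separately).
PROFILE_METRICS: Dict[str, List[str]] = {
    "customer_facing": list(RULES),
    "internal_tool": ["consistency", "relevance"],
    "batch": [],
}


def breached_metrics(profile: str, observed: Dict[str, float]) -> List[str]:
    """Return the names of metrics in this profile whose alert level is crossed."""
    return [
        name
        for name in PROFILE_METRICS[profile]
        if name in observed and RULES[name].breached(observed[name])
    ]
```

For example, calling `breached_metrics("internal_tool", {"consistency": 2.8, "relevance": 0.9})` would flag only the consistency score.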

Step 3: Design the logging strategy

What to capture on every LLM call:

| Field | Type | Purpose | PII concern? |
| --- | --- | --- | --- |
| request_id | string | Unique identifier for the call | No |
| session_id | string | Conversation or session ID (for multi-turn features) | No |
| timestamp | ISO 8601 | When the call was made | No |
| model | string | Model name and version | No |
| feature | string | Which product feature triggered this | No |
| input_tokens | integer | Token count of the prompt | No |
| output_tokens | integer | Token count of the response | No |
| latency_ms | integer | Total response time | No |
| time_to_first_token_ms | integer | Streaming start time | No |
| status | string | Success, error, timeout, rate_limited | No |
| error_type | string | Error category if failed | No |
| prompt_hash | string | Hash of the system prompt template (not content) | No |
| cost | float | Calculated cost of this call | No |
| tool_calls | JSON | Tool/function calls made during the request (if any) | Depends on tool |
| prompt_text | string | Full prompt (if allowed) | Yes -- may contain PII |
| response_text | string | Full response (if allowed) | Yes -- may contain PII |
| user_id | string | Anonymized user identifier | Yes -- handle carefully |
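One way to carry this schema in application code is a plain record type that serializes without content fields unless explicitly asked. This is a sketch under assumptions: the class name `LLMCallRecord`, the helper `hash_prompt_template`, and the choice of SHA-256 for `prompt_hash` are illustrative, not something the schema mandates.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


def hash_prompt_template(template: str) -> str:
    """Hash the system prompt template so changes are trackable without storing it."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


@dataclass
class LLMCallRecord:
    # Metadata fields: safe to log on every call.
    request_id: str
    session_id: str
    timestamp: str  # ISO 8601
    model: str
    feature: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    status: str  # success, error, timeout, rate_limited
    prompt_hash: str
    cost: float
    time_to_first_token_ms: Optional[int] = None
    error_type: Optional[str] = None
    tool_calls: Optional[list] = None
    user_id: Optional[str] = None  # anonymized identifier only
    # Content fields: may contain PII, so they are excluded by default.
    prompt_text: Optional[str] = None
    response_text: Optional[str] = None

    def to_json(self, include_content: bool = False) -> str:
        """Serialize the record; drop raw prompt/response unless explicitly allowed."""
        data = asdict(self)
        if not include_content:
            data.pop("prompt_text", None)
            data.pop("response_text", None)
        return json.dumps(data)
```

Defaulting `include_content` to `False` means a caller has to opt in to writing raw text, which keeps the metadata-only path the easy one.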

PII handling decisions:

  • Can you log full prompts and responses? (Best for debugging, worst for privacy)
  • If not, can you log sanitized versions? (Strip PII, keep structure)
  • If not, can you log metadata only? (Tokens, latency, cost -- no content)
  • What's the data retention policy? (30 days? 90 days? Indefinite?)
  • Who has access to raw logs vs. aggregated dashboards?

Sampling strategy for high-volume features:

  • Log metadata (tokens, latency, cost, status) on 100% of calls
  • Log full prompt/response on 1-10% of calls (configurable)
  • Log full prompt/response on 100% of error/timeout calls
  • Run quality eval (LLM-as-judge) on 1-5% sample
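The sampling rules above can be sketched as a deterministic decision per request. This is an illustrative approach, not the only one: hashing the `request_id` (rather than calling a random number generator) is an assumption made here so that the same request always gets the same decision across services. The function names and default rates are hypothetical, and the sketch assumes only successful calls have an output worth judging.

```python
import hashlib


def _bucket(request_id: str, salt: str) -> float:
    """Map a request ID to a stable value in [0, 1) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def logging_plan(
    request_id: str,
    status: str,
    content_rate: float = 0.05,  # within the 1-10% range for full content
    eval_rate: float = 0.02,     # within the 1-5% range for LLM-as-judge
) -> dict:
    """Decide what to capture for one call."""
    is_failure = status != "success"
    return {
        # Metadata is logged on 100% of calls.
        "log_metadata": True,
        # Full content: every failure, plus a sample of successes.
        "log_content": is_failure or _bucket(request_id, "content") < content_rate,
        # Quality eval: a separate sample of successful calls (assumption).
        "run_quality_eval": (not is_failure) and _bucket(request_id, "eval") < eval_rate,
    }
```

Using different salts for the two decisions keeps the content sample and the eval sample independent of each other.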

Step 4: Plan cost and latency monitoring

Cost monitoring:

| Metric | Formula | Dashboard | Alert when |
| --- | --- | --- | --- |
| Cost per interaction | (input_tokens x input_price + output_tokens x output_price) | Real-time | > 2x baseline |
| Daily cost by feature | Sum of interaction costs per feature per day | Daily | > budget ceiling |
| Monthly cost projection | Month-to-date spend + (avg daily cost x days remaining) | Weekly | > monthly budget |
| Cost per user | Total LLM cost / active users | Monthly | Trending up > 20% MoM |
| Token efficiency | Output quality score / tokens used | Weekly | Efficiency drops > 15% |
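The cost formulas are simple enough to sketch directly. In the example below, the function names are hypothetical and prices are assumed to be quoted per million tokens, so substitute your provider's actual rates and units. The projection assumes "monthly projection" means month-to-date spend plus the average daily rate applied to the days remaining.

```python
def interaction_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Cost of one call: tokens times price, with prices quoted per 1M tokens."""
    return (
        input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    ) / 1_000_000


def monthly_projection(
    month_to_date_cost: float, avg_daily_cost: float, days_remaining: int
) -> float:
    """Projected month-end spend: what is already spent plus the expected remainder."""
    return month_to_date_cost + avg_daily_cost * days_remaining


def exceeds_baseline(cost: float, baseline: float, factor: float = 2.0) -> bool:
    """Alert check for cost per interaction (> 2x baseline by default)."""
    return cost > factor * baseline
```

With placeholder prices of 3.0 and 15.0 per million tokens, a call with 1,000 input and 500 output tokens costs (1,000 x 3.0 + 500 x 15.0) / 1,000,000 = 0.0105.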

Latency monitoring:

| Metric | Target | Alert when |
| --- | --- | --- |
| p50 response time | (target, e.g., < 2s) | > 1.5x target |
| p95 response time | (target, e.g., < 5s) | > 2x target |
| p99 response time | (target, e.g., < 10s) | > 3x target |
| Time to first token (streaming) | (target, e.g., < 500ms) | > 1s |
| Timeout rate | < 1% | > 3% |
| Rate limit hit rate | < 0.1% | > 1% |
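If you compute percentiles yourself rather than relying on a monitoring backend, the latency alert rules can be sketched as below. The nearest-rank percentile method is a choice made for this example (backends often interpolate instead), and the names `percentile`, `ALERT_MULTIPLIERS`, and `latency_alerts` are hypothetical. The multipliers come from the latency table above.

```python
import math
from typing import Dict, List, Sequence

# Alert multipliers per percentile, from the "Alert when" column.
ALERT_MULTIPLIERS: Dict[int, float] = {50: 1.5, 95: 2.0, 99: 3.0}


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alerts(latencies_ms: Sequence[float], targets_ms: Dict[int, float]) -> List[str]:
    """Return the percentiles (e.g. 'p95') whose observed latency crosses the alert level."""
    alerts = []
    for pct, target in targets_ms.items():
        if percentile(latencies_ms, pct) > ALERT_MULTIPLIERS[pct] * target:
            alerts.append(f"p{pct}")
    return alerts
```

A call such as `latency_alerts(window, {50: 2000, 95: 5000, 99: 10000})` would use the example targets from the table, expressed in milliseconds.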

Step 5: Design drift and regression detection

Model behavior changes over time -- from model updates, prompt changes, data shifts, or provider-side changes:

Drift detection approach:

| Signal | What changes | How to detect | Frequency |
| --- | --- | --- | --- |
| Model version change | Provider updates the model | Monitor model version in API responses | Every call |
| Output distribution shift | Average output length, vocabulary, structure changes | Statistical comparison of output properties week over week | Weekly |
| Quality regression | Eval scores drop | Run golden dataset eval, compare to baseline | Weekly |
| Cost drift | Token usage changes without prompt changes | Compare avg tokens per call week over week | Daily |
| Latency drift | Response times change | Compare p50/p95 week over week | Daily |
| Prompt template change | Team modifies system prompts | Track prompt_hash, alert on changes, require eval rerun | On change |
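Several of these signals reduce to comparing an average across two time windows. A minimal sketch of that comparison follows. The function names and the 15% default tolerance are assumptions for illustration, and a production check would likely add a significance test or minimum sample size before alerting.

```python
from statistics import mean
from typing import Sequence


def relative_change(previous: Sequence[float], current: Sequence[float]) -> float:
    """Fractional change in the mean from the previous window to the current one."""
    baseline = mean(previous)
    if baseline == 0:
        raise ValueError("previous window has a zero mean; relative change is undefined")
    return (mean(current) - baseline) / baseline


def drifted(
    previous: Sequence[float], current: Sequence[float], tolerance: float = 0.15
) -> bool:
    """True when the mean moved by more than the tolerance in either direction."""
    return abs(relative_change(previous, current)) > tolerance
```

The same helper applies to output length, tokens per call, or latency: pass last week's values as `previous` and this week's as `current`.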

Regression response playbook:

  1. Alert fires -- quality score dropped or cost spiked
  2. Triage -- is this a model version change, prompt change, or data change?
  3. Compare -- run golden dataset eval on current vs. previous model/prompt
  4. Decide -- revert prompt, switch model version, adjust thresholds, or accept new baseline
  5. Document -- log the incident and resolution for future reference
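The triage step can be partly automated from call metadata. The sketch below assumes each logged call is a dict carrying `model` and `prompt_hash` keys, as in the Step 3 schema. The function name `triage` and the cause labels are hypothetical, and a "data change" verdict here only means the other two causes were ruled out.

```python
from typing import Dict, List


def triage(previous_calls: List[Dict[str, str]], current_calls: List[Dict[str, str]]) -> List[str]:
    """Suggest likely causes by comparing model versions and prompt hashes across windows."""
    causes = []

    # A model string seen now but not before points to a model version change.
    new_models = {c["model"] for c in current_calls} - {c["model"] for c in previous_calls}
    if new_models:
        causes.append("model_version_change")

    # A new prompt hash means someone changed a system prompt template.
    new_prompts = {c["prompt_hash"] for c in current_calls} - {
        c["prompt_hash"] for c in previous_calls
    }
    if new_prompts:
        causes.append("prompt_change")

    # Neither changed: the inputs themselves are the remaining suspect.
    return causes or ["data_change_suspected"]
```

The output narrows where to look first; the golden dataset comparison in step 3 of the playbook is still what confirms the cause.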

Step 6: Choose tooling

| Tool | What it does | Best for | Pricing model |
| --- | --- | --- | --- |
| Helicone | LLM proxy with logging, cost tracking, caching | Teams wanting zero-code setup, cost optimization | Free tier + usage-based |
| Langfuse | Open-source LLM observability, tracing, eval | Teams wanting self-hosted or detailed tracing | Free (self-hosted) or cloud pricing |
| Braintrust | Eval platform with logging and experiments | Teams focused on systematic eval and prompt iteration | Usage-based |
| Datadog LLM Monitoring | Extension of Datadog APM for LLM calls | Teams already on Datadog | Per-host pricing |
| Arize Phoenix | Open-source LLM tracing and evaluation | Teams wanting self-hosted with strong eval integration | Free (self-hosted) or cloud pricing |
| LangSmith | LangChain's observability platform | Teams using LangChain/LangGraph | Free tier + usage-based |
| Custom (OpenTelemetry) | Roll your own with standard instrumentation | Teams with specific requirements or existing infra | Infrastructure cost |

Selection criteria:

  • What's your existing monitoring stack? (Extend it vs. add a new tool)
  • Do you need self-hosted? (Compliance, data sovereignty)
  • What's the budget for monitoring tooling?
  • How many LLM calls per day? (Determines whether free tiers are viable)

Default recommendation for most teams: Start with Langfuse if you don't already have LLM monitoring in your stack. It covers tracing, eval, and cost tracking in one tool, offers both self-hosted and cloud options, and has the strongest open-source community momentum (5.0 rating, 41 reviews on Product Hunt as of March 2026). Move to Datadog LLM Monitoring only if your team already runs Datadog and wants unified APM + LLM observability.

Step 7: Generate the observability plan

Compile into a structured document:

# LLM Observability Plan: (Product name)

**Generated:** (date)
**Product:** (brief description)
**LLM features:** (count and list)
**Models in use:** (list with versions)

## LLM Touchpoint Map
(Table from Step 1 -- features, models, volume, current monitoring)

## Quality Metrics
(Table from Step 2 -- metrics, thresholds, measurement methods, alert rules)

## Logging Architecture
(Strategy from Step 3 -- what to log, PII handling, sampling rates)

### Log schema
(Field list with types and PII flags)

### Sampling rules
- Metadata: (100% of calls)
- Full content: (N% of calls, 100% of errors)
- Quality eval: (N% sample via LLM-as-judge)

### PII handling
- (Approach: full logging / sanitized / metadata only)
- (Retention policy)
- (Access controls)

## Cost & Latency Monitoring
(Tables from Step 4 -- cost metrics, latency targets, alert thresholds)

## Drift & Regression Detection
(Signals and playbook from Step 5)

## Tooling
(Recommendation from Step 6 with rationale)

## Implementation Checklist
- [ ] **(P0)** Instrument metadata logging on all LLM calls (tokens, latency, cost, status)
- [ ] **(P0)** Set up cost tracking dashboard with daily spend by feature
- [ ] **(P0)** Configure latency alerts (p95 > threshold)
- [ ] **(P1)** Implement prompt/response logging with PII handling
- [ ] **(P1)** Set up golden dataset eval as weekly automated run
- [ ] **(P1)** Build quality metrics dashboard
- [ ] **(P2)** Implement drift detection (output distribution monitoring)
- [ ] **(P2)** Create regression response playbook and runbook
- [ ] **(P2)** Set up model version change alerts

## Open Questions
- (Unresolved monitoring decisions)
- (Things that need baseline data to determine thresholds)

Step 8: Review and finalize

Ask the user:

  • Are the quality metrics capturing what matters most for your AI features?
  • Is the logging strategy practical given PII constraints?
  • Are the cost alert thresholds realistic based on current spend?
  • Is the drift detection approach proportionate to your risk tolerance?
  • Does the tooling recommendation fit your existing stack and budget?
  • Who owns LLM monitoring? (Engineering, ML team, product, SRE?)

Adjust based on feedback.

Output location

Present the plan as formatted text in the conversation for the user to copy into their product wiki, analytics documentation, or team shared drive.

Example Output

Input

  • Product: Meridian Health — AI-powered clinical documentation assistant that auto-generates SOAP notes and pre-authorization letters from physician voice recordings
  • LLM features in production: (1) SOAP note generation from transcribed audio, (2) prior auth letter drafting, (3) ICD-10 code suggestion
  • Models in use: GPT-4o for SOAP notes and prior auth (OpenAI), GPT-3.5-turbo for ICD-10 suggestions (cost optimization); no versioning tracked currently
  • Current monitoring: CloudWatch logs latency and HTTP status codes only; no token tracking, no quality metrics, no cost dashboards
  • Known problems: Physicians flagging ~15% of SOAP notes as needing "major edits"; prior auth letters occasionally cite incorrect CPT codes; monthly OpenAI bill jumped 340% over 3 months with no explanation
  • Constraints: Full PHI in prompts and responses — HIPAA compliance required, no third-party SaaS logging of raw content; existing stack is AWS (CloudWatch, Grafana)

Output (abbreviated)

LLM Observability Plan: Meridian Health Clinical Documentation Assistant

Generated: 2025-07-14
Product: AI-assisted clinical documentation -- SOAP notes, prior auth letters, ICD-10 coding
LLM features: 3 (SOAP generation, prior auth drafting, ICD-10 suggestion)
Models in use: GPT-4o (gpt-4o-2024-08-06), GPT-3.5-turbo (gpt-3.5-turbo-0125)


LLM Touchpoint Map

| Feature | Model | Est. calls/day | Avg latency | Current monitoring | Known issues |
| --- | --- | --- | --- | --- | --- |
| SOAP note generation | GPT-4o | ~3,200 | ~6.8s | HTTP status only | ~15% major edit rate; quality regression suspected |
| Prior auth letter drafting | GPT-4o | ~800 | ~9.1s | HTTP status only | Incorrect CPT codes in ~8% of letters |
| ICD-10 code suggestion | GPT-3.5-turbo | ~3,200 | ~1.4s | HTTP status only | Unknown accuracy; no tracking |

Total: ~7,200 LLM calls/day across 3 features. Cost anomaly on SOAP generation is the highest-priority investigative target.


Quality Metrics

| Metric | Feature | How to measure | Target threshold | Alert when |
| --- | --- | --- | --- | --- |
| Major edit rate | SOAP notes | Track physician edits post-generation (word delta > 40% = major) | < 8% major edits | > 12% in any rolling 7-day window |
| CPT/ICD code accuracy | Prior auth, ICD-10 | LLM-as-judge cross-check against structured EHR codes on 10% sample | > 94% match | < 88% match |
| Clinical completeness | SOAP notes | Rubric: does output contain all 4 SOAP sections with non-trivial content? | > 97% complete | < 93% complete |
| Hallucination rate | Prior auth | Automated check: cited diagnosis codes present in patient record? | < 3% | > 6% |
| Regeneration rate | All features | % of sessions where physician requests a new generation | < 5% | > 10% |
| Safety incidents | All features | Outputs flagged by content filter or physician-reported errors with patient safety implication | 0 critical | Any critical incident triggers P0 response |
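The plan does not define "word delta" precisely, so the sketch below picks one plausible reading: one minus the word-level similarity ratio from Python's standard `difflib.SequenceMatcher`. That definition, along with the function names, is an assumption for illustration, and the team should agree on the actual measure before setting thresholds against it.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_delta(generated: str, final: str) -> float:
    """Fraction of the text that changed, measured on words (0.0 = identical, 1.0 = no overlap)."""
    a, b = generated.split(), final.split()
    # autojunk=False avoids difflib's heuristic that skips very frequent tokens in long inputs.
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def major_edit_rate(pairs: List[Tuple[str, str]], threshold: float = 0.40) -> float:
    """Share of (generated, physician-final) note pairs whose word delta exceeds the threshold."""
    if not pairs:
        return 0.0
    major = sum(1 for generated, final in pairs if word_delta(generated, final) > threshold)
    return major / len(pairs)
```

Because this compares raw note text, it has to run inside the PHI boundary; only the resulting rate belongs in metadata logs or dashboards.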

Risk profile: All three features are customer-facing and clinically high-stakes. Full metric coverage applies to SOAP and prior auth. ICD-10 gets accuracy + cost monitoring as a minimum bar until baseline is established.


Logging Architecture

PII handling decision

HIPAA compliance prohibits logging raw prompt/response content to any third-party SaaS. All logging goes to AWS CloudWatch + S3 (encrypted, us-east-1) with access restricted to the on-call engineering team via IAM role. No Helicone, Langfuse cloud, or Datadog SaaS for content logs.

Three-tier log approach:

| Tier | Content | Volume | Retention |
| --- | --- | --- | --- |
| Tier 1 -- Metadata | Tokens, latency, cost, status, model version, feature tag, prompt_hash | 100% of calls | 90 days |
| Tier 2 -- Sanitized content | Response structure only (section headers present/absent, code counts, output length) -- no PHI | 100% of calls | 90 days |
| Tier 3 -- Full content | Raw prompt + response, encrypted at rest, restricted access | 100% of error/timeout calls; 2% random sample of successes | 30 days, then purge |
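Tier 2 depends on reducing a response to structural facts before anything is written to the log. A rough sketch of that reduction for SOAP notes is below. It assumes each section starts a line with its name (Subjective, Objective, Assessment, Plan), which may not match the real output format. The function name `sanitized_structure` is hypothetical, and code counts are omitted for brevity.

```python
import re
from typing import Dict

# Assumed section names; adjust to whatever headers the prompt template produces.
SOAP_SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")


def sanitized_structure(response_text: str) -> Dict[str, object]:
    """Summarize a response's shape without retaining any of its text."""
    present = {
        section: bool(
            re.search(rf"^\s*{section}\b", response_text, re.IGNORECASE | re.MULTILINE)
        )
        for section in SOAP_SECTIONS
    }
    return {
        "sections_present": present,
        "all_sections_present": all(present.values()),
        "output_chars": len(response_text),
        "output_words": len(response_text.split()),
    }
```

The returned dict contains only booleans and counts, so it can be logged on 100% of calls. The `all_sections_present` flag also gives a cheap first pass at a section-presence check, although it says nothing about whether the content is non-trivial.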

Log schema (Tier 1 — always captured)

| Field | Type | PII? |
| --- | --- | --- |
| request_id | string | No |
| session_id | string | No |
| timestamp | ISO 8601 | No |
| feature | enum: soap_note, prior_auth, icd10 | No |
| model | string (name + version from API response header) | No |
| input_tokens | integer | No |
| output_tokens | integer | No |
| latency_ms | integer | No |
| time_to_first_token_ms | integer | No |
| status | enum: success, error, timeout, rate_limited | No |
| error_type | string | No |
| prompt_hash | SHA-256 of system prompt template | No |
| cost_usd | float | No |
| provider_model_version | string (from API response) | No |
| user_id | anonymized hash | Yes -- hash only, no raw ID |

Sampling rules

  • Tier 1 metadata: 100% of all calls
  • Tier 2 sanitized structure: 100% of all calls
  • Tier 3 full content: 100% of errors/timeouts + 2% random success sample
  • Quality eval (LLM-as-judge): 5% sample for SOAP and prior auth; 10% for ICD-10 (higher code-accuracy risk)

Cost & Latency Monitoring

Cost monitoring

| Metric | Current baseline | Alert threshold | Dashboard cadence |
| --- | --- | --- | --- |
| Daily cost -- SOAP (GPT-4o) | ~$320/day (estimate from token logs) | > $480/day (1.5x) | Real-time |
| Daily cost -- Prior auth (GPT-4o) | ~$95/day | > $150/day | Real-time |
| Daily cost -- ICD-10 (GPT-3.5) | ~$12/day | > $25/day | Daily |
| Cost per SOAP note | ~$0.10 | > $0.20 | Weekly trend |
| Monthly projection | ~$12,600/mo | > $18,000/mo | Weekly |
| Token efficiency (SOAP) | Baseline TBD week 1 | > 20% increase in avg input tokens without quality gain | Weekly |

On the 340% cost spike: Hypothesis is prompt template bloat — input tokens grew without output quality improving. Token efficiency metric will confirm. Check prompt_hash history once Tier 1 logging is live.
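Once Tier 1 records exist, the prompt-bloat hypothesis can be checked by grouping input tokens by `prompt_hash`. The sketch below assumes records are dicts with `prompt_hash` and `input_tokens` keys, as in the log schema. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def avg_input_tokens_by_prompt_hash(records: Iterable[Mapping]) -> Dict[str, float]:
    """Average input tokens per call for each prompt template version."""
    totals = defaultdict(lambda: [0, 0])  # prompt_hash -> [token_sum, call_count]
    for record in records:
        entry = totals[record["prompt_hash"]]
        entry[0] += record["input_tokens"]
        entry[1] += 1
    return {prompt_hash: total / count for prompt_hash, (total, count) in totals.items()}


def token_growth(averages: Mapping[str, float], old_hash: str, new_hash: str) -> float:
    """Fractional growth in average input tokens from one template version to another."""
    return (averages[new_hash] - averages[old_hash]) / averages[old_hash]
```

A large jump in average input tokens between two hashes, with no matching gain in quality scores, would support the template-bloat explanation. Flat averages across hashes would point toward call volume or longer source transcripts instead.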

Latency targets

| Metric | Target | Alert |
| --- | --- | --- |
| SOAP -- p50 | < 5s | > 7.5s |
| SOAP -- p95 | < 12s | > 18s |
| Prior auth -- p50 | < 7s | > 10s |
| ICD-10 -- p50 | < 1.5s | > 2.5s |
| Time to first token (streaming, SOAP) | < 800ms | > 1.5s |
| Timeout rate (any feature) | < 0.5% | > 2% |
| Rate limit hit rate | < 0.1% | > 0.5% |

Drift & Regression Detection

| Signal | What to watch | Detection