Use this when you have (or are about to ship) AI-powered features and need to monitor them in production. Covers LLM-specific concerns that traditional observability and SRE monitoring miss: prompt/response quality, token costs, latency per model, hallucination drift, and model version regression. If you're looking to measure user behavior, use /observability-plan. If you're looking to measure system health, use /instrumentation-plan. This skill covers the AI-specific layer between those two.

The distinction: /observability-plan answers "Are users successful?" /instrumentation-plan answers "Is the system healthy?" This skill answers "Is the AI behaving correctly, consistently, and affordably?"

Related skills: Complements /observability-plan (product analytics) and /instrumentation-plan (SRE metrics). Eval criteria from /ai-eval-design become monitoring thresholds here. Quality dimensions from the /ai-health-check CARATS framework inform what to monitor.

The hard part most teams miss

Traditional monitoring watches for things that crash. An AI feature almost never crashes. It fails by producing a slightly worse output and running up a slightly larger bill, both of which are invisible until a user complains or finance asks a question.

The bill is the alert you get last. Cost and quality do not throw exceptions, so nothing pages you when they drift. An unmonitored AI feature can quietly run up a five-figure bill or degrade for weeks. Both need active instrumentation, not the absence of errors (Steps 2 and 4).
You are monitoring a distribution, not an uptime number. Quality degrades continuously, not in a binary up/down. A weekly golden-dataset score is the smoke detector. Without it, your regression detector is your users, and they detect by leaving.
An alert without an owner is decoration. The leverage is not another dashboard. It is naming who gets paged when quality drops or cost spikes, and what they do next (Step 5 playbook, Step 8 ownership). Monitoring nobody acts on is noise you pay to generate.

Everything below is the instrumentation. These three are why it is worth building.

Process

Step 1: Identify LLM touchpoints

Ask the user:

What AI features are in production (or about to ship)? (List each feature that calls an LLM)
Which models does each feature use? (Provider, model name, version)
What's the current monitoring? (Any logging, dashboards, or alerts already in place?)
What problems have you seen? (Quality issues, cost surprises, latency spikes, outages)
What's the sensitivity level? (Can you log prompts/responses, or are there PII/compliance constraints?)
What observability tools are in use or available? (Helicone, Langfuse, Datadog, custom, etc.)

Map each LLM touchpoint:

Feature	Model	Calls/day	Avg latency	Current monitoring	Known issues
(Chat support)	(Claude Sonnet)	(10,000)	(2.1s)	(None)	(Occasional hallucinations)
(Doc summary)	(Gemini Pro)	(500)	(8.3s)	(Basic logging)	(Slow on large docs)

Step 2: Define quality metrics

Quality monitoring answers: "Is the AI output still good?" Define metrics that catch degradation:

Metric	Definition	How to measure	Threshold	Alert when
Consistency score	Same input produces similar outputs over time	Run golden dataset weekly, compare scores	> 3.5 rubric avg	Score drops below 3.0
Hallucination rate	% of outputs containing fabricated information	Automated fact-check or LLM-as-judge sampling	< 5%	Rate exceeds 10%
Relevance score	% of outputs that address the user's actual question	LLM-as-judge on sample + user feedback signals	> 85%	Below 75%
Tone compliance	Output matches expected voice/style	Tone rubric scoring on sample	> 90% pass	Below 80%
Safety incidents	Harmful, biased, or inappropriate outputs	Content filter + human review of flagged items	0 critical	Any critical incident
User satisfaction signal	Thumbs up/down, regeneration rate, copy rate, conversation abandonment rate	In-product feedback + behavioral tracking (did the user accept, edit, or discard the output?)	(baseline)	Drops > 20% from baseline
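The Threshold and Alert when columns can be encoded as rules and checked on every eval run. The sketch below is one hypothetical encoding, not a prescribed implementation: `MetricRule`, `classify`, and the intermediate `warn` band (values between the threshold and the alert line) are assumptions added for illustration. The numbers come from the table above; the observed readings are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    name: str
    target: float           # healthy side of this value (Threshold column)
    alert: float            # page someone past this value (Alert when column)
    higher_is_better: bool


def classify(rule: MetricRule, value: float) -> str:
    """Return 'ok', 'warn', or 'alert' for one observed value."""
    if rule.higher_is_better:
        if value < rule.alert:
            return "alert"
        return "ok" if value > rule.target else "warn"
    if value > rule.alert:
        return "alert"
    return "ok" if value < rule.target else "warn"


RULES = [
    MetricRule("hallucination_rate", target=0.05, alert=0.10, higher_is_better=False),
    MetricRule("relevance_score", target=0.85, alert=0.75, higher_is_better=True),
    MetricRule("tone_compliance", target=0.90, alert=0.80, higher_is_better=True),
]

# Illustrative readings only, not real data.
observed = {"hallucination_rate": 0.07, "relevance_score": 0.88, "tone_compliance": 0.78}
statuses = {rule.name: classify(rule, observed[rule.name]) for rule in RULES}
```

With these sample readings, hallucination rate lands in the warn band, relevance is healthy, and tone compliance trips its alert.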

Not every feature needs every metric. Match metrics to the feature's risk profile:

Customer-facing, high-stakes: All metrics, tight thresholds
Internal tool, moderate stakes: Consistency + relevance + cost
Batch processing, low stakes: Cost + basic error rate

Step 3: Design the logging strategy

What to capture on every LLM call:

Field	Type	Purpose	PII concern?
`request_id`	string	Unique identifier for the call	No
`session_id`	string	Conversation or session ID (for multi-turn features)	No
`timestamp`	ISO 8601	When the call was made	No
`model`	string	Model name and version	No
`feature`	string	Which product feature triggered this	No
`input_tokens`	integer	Token count of the prompt	No
`output_tokens`	integer	Token count of the response	No
`latency_ms`	integer	Total response time	No
`time_to_first_token_ms`	integer	Streaming start time	No
`status`	string	Success, error, timeout, rate_limited	No
`error_type`	string	Error category if failed	No
`prompt_hash`	string	Hash of the system prompt template (not content)	No
`cost`	float	Calculated cost of this call	No
`tool_calls`	JSON	Tool/function calls made during the request (if any)	Depends on tool
`prompt_text`	string	Full prompt (if allowed)	Yes -- may contain PII
`response_text`	string	Full response (if allowed)	Yes -- may contain PII
`user_id`	string	Anonymized user identifier	Yes -- handle carefully
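A minimal sketch of emitting the metadata-only part of this schema as one JSON line per call. `build_log_record` and `prompt_hash` are hypothetical helper names, the example values are invented, and the record covers only a subset of the fields above (no prompt text, response text, or user ID), so it fits the metadata-only tier of PII handling.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def prompt_hash(template: str) -> str:
    """SHA-256 of the system prompt template, not the filled-in prompt."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def build_log_record(*, feature, model, template, input_tokens, output_tokens,
                     latency_ms, status, cost, session_id=None, error_type=None):
    """Assemble metadata fields from the schema; no prompt or response content."""
    return {
        "request_id": str(uuid.uuid4()),
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "model": model,
        "feature": feature,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "status": status,
        "error_type": error_type,
        "prompt_hash": prompt_hash(template),
        "cost": cost,
    }


record = build_log_record(
    feature="chat_support", model="example-model-v1",  # placeholder names
    template="You are a support assistant for {product}.",
    input_tokens=1200, output_tokens=300, latency_ms=2100,
    status="success", cost=0.0081,
)
line = json.dumps(record)  # one JSON object per log line
```

Hashing the template rather than the rendered prompt keeps the hash stable across users, so a change in `prompt_hash` means the team edited the prompt, not that a user typed something new.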

PII handling decisions:

Can you log full prompts and responses? (Best for debugging, worst for privacy)
If not, can you log sanitized versions? (Strip PII, keep structure)
If not, can you log metadata only? (Tokens, latency, cost -- no content)
What's the data retention policy? (30 days? 90 days? Indefinite?)
Who has access to raw logs vs. aggregated dashboards?

Sampling strategy for high-volume features:

Log metadata (tokens, latency, cost, status) on 100% of calls
Log full prompt/response on 1-10% of calls (configurable)
Log full prompt/response on 100% of error/timeout calls
Run quality eval (LLM-as-judge) on 1-5% sample
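The sampling rules above can be sketched as a single decision function. This is an illustration under assumptions: `sample_bucket` and `logging_plan` are hypothetical names, and the default rates (5% content, 2% eval) are just values picked from inside the ranges listed. Hashing the request ID instead of calling a random number generator makes the decision repeatable for the same request.

```python
import hashlib


def sample_bucket(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def logging_plan(request_id: str, status: str,
                 content_rate: float = 0.05, eval_rate: float = 0.02) -> dict:
    """Decide what to capture for one call."""
    failed = status in {"error", "timeout"}
    bucket = sample_bucket(request_id)
    return {
        "log_metadata": True,                              # 100% of calls
        "log_content": failed or bucket < content_rate,    # sample + all failures
        "run_quality_eval": bucket < eval_rate,            # LLM-as-judge sample
    }
```

Because both decisions share one bucket, any call chosen for quality eval also has its content logged whenever `eval_rate` is at most `content_rate`, which the judge needs in order to score it.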

Step 4: Plan cost and latency monitoring

Cost monitoring:

Metric	Formula	Dashboard	Alert when
Cost per interaction	(input_tokens x input_price + output_tokens x output_price)	Real-time	> 2x baseline
Daily cost by feature	Sum of interaction costs per feature per day	Daily	> budget ceiling
Monthly cost projection	Month-to-date cost + (avg daily cost x days remaining)	Weekly	> monthly budget
Cost per user	Total LLM cost / active users	Monthly	Trending up > 20% MoM
Token efficiency	Output quality score / tokens used	Weekly	Efficiency drops > 15%
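A worked sketch of the first and third rows. The prices are placeholders quoted per million tokens (an assumption; check your provider's price sheet), and the function names are hypothetical. The projection here is computed as spend so far plus the average daily cost times the days remaining in the month.

```python
def interaction_cost(input_tokens: int, output_tokens: int,
                     input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one call, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def monthly_projection(month_to_date_spend: float, days_elapsed: int,
                       days_in_month: int) -> float:
    """Spend so far plus average daily spend times the days remaining."""
    avg_daily = month_to_date_spend / days_elapsed
    return month_to_date_spend + avg_daily * (days_in_month - days_elapsed)


# Placeholder prices: 3.00 per million input tokens, 15.00 per million output tokens.
cost = interaction_cost(2000, 500, 3.0, 15.0)
projection = monthly_projection(4270.0, days_elapsed=10, days_in_month=30)
```

In this example the call costs about 0.0135 and the month projects to 12,810, so a budget of 12,000 would already warrant an alert on day 10.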

Cost the whole workflow, not just the model call. As of 2026, the per-LLM-call cost above undercounts agentic features. A single user-facing turn can fan out into retrieval, multiple tool calls, and several model calls before it answers. LangSmith and the other 2026 platforms now roll cost up across the entire agent run (retrieval + tool/API spend + every LLM hop), so attribute spend to the trace, not the call. Add a cost per completed task metric (total spend across all calls in one trace / one resolved user request) alongside cost per interaction. A feature can look cheap per call and expensive per task when it loops, retries, or over-retrieves.
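A minimal sketch of rolling spend up to the trace. The span shape (`trace_id`, `kind`, `cost_usd`) is a hypothetical log format, not any platform's schema, and the amounts are invented. One design choice is assumed here: spend from traces that never resolved stays in the numerator, since that money was still spent trying to complete tasks.

```python
from collections import defaultdict


def cost_per_trace(spans):
    """Sum every span's cost (LLM hop, retrieval, tool call) by trace."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["trace_id"]] += span["cost_usd"]
    return dict(totals)


def cost_per_completed_task(spans, completed_trace_ids):
    """Total spend across all traces divided by the number of resolved requests."""
    if not completed_trace_ids:
        return None  # nothing resolved yet; avoid dividing by zero
    return sum(cost_per_trace(spans).values()) / len(completed_trace_ids)


spans = [
    {"trace_id": "t1", "kind": "retrieval", "cost_usd": 0.002},
    {"trace_id": "t1", "kind": "llm", "cost_usd": 0.010},
    {"trace_id": "t1", "kind": "llm", "cost_usd": 0.030},
    {"trace_id": "t2", "kind": "llm", "cost_usd": 0.020},  # never resolved
]
per_trace = cost_per_trace(spans)
per_task = cost_per_completed_task(spans, completed_trace_ids={"t1"})
```

Here the most expensive single call is 0.03, but the cost per completed task is about 0.062, which is the gap the per-call metric hides.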

Latency monitoring:

Metric	Target	Alert when
p50 response time	(target, e.g., < 2s)	> 1.5x target
p95 response time	(target, e.g., < 5s)	> 2x target
p99 response time	(target, e.g., < 10s)	> 3x target
Time to first token (streaming)	(target, e.g., < 500ms)	> 1s
Timeout rate	< 1%	> 3%
Rate limit hit rate	< 0.1%	> 1%
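The percentile rows can be checked with a few lines over a window of `latency_ms` values. This sketch assumes the nearest-rank percentile definition (monitoring tools may interpolate differently) and uses the example targets and multipliers from the table; `percentile` and `latency_alerts` are hypothetical names and the latencies are synthetic.

```python
import math

# Alert multipliers from the table: p50 > 1.5x, p95 > 2x, p99 > 3x target.
ALERT_MULTIPLIER = {50: 1.5, 95: 2.0, 99: 3.0}


def percentile(values, pct: int):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_alerts(latencies_ms, targets_ms):
    """Return (name, observed, limit) for every percentile over its alert line."""
    alerts = []
    for pct, target in targets_ms.items():
        observed = percentile(latencies_ms, pct)
        limit = target * ALERT_MULTIPLIER[pct]
        if observed > limit:
            alerts.append((f"p{pct}", observed, limit))
    return alerts


targets = {50: 2000, 95: 5000, 99: 10000}       # example targets in ms
latencies = [1000] * 90 + [12000] * 10           # synthetic: a slow 10% tail
alerts = latency_alerts(latencies, targets)
```

In this synthetic window the median looks healthy at 1,000 ms while p95 sits at 12,000 ms, over its 10,000 ms alert line. An average-only dashboard would miss exactly this case.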

Step 4b: Cost-reduction playbook

Monitoring cost is half the job. When the dashboard shows the bill climbing, this is the ranked set of levers to pull, biggest typical win first. Pull them in order; the early ones are larger and lower-risk than the late ones.

Lever	Mechanism	Typical magnitude	Watch-out
Cache the stable prefix	Mark the unchanging part of the prompt (system prompt, retrieved context, few-shot block) as cached so repeated calls reuse it at a fraction of input cost	Up to ~90% off the cached input tokens; the single most-missed win	Only works if the prefix is byte-stable. A timestamp or per-request ID in the "stable" part silently drops the hit rate to zero
Right-size the model per call	Route easy calls to a cheaper, faster model and reserve the flagship for hard ones; or have a cheap model draft and the flagship verify only when needed	40 to 70% on the share of traffic that does not need the top model	Needs a cheap, reliable classifier for "is this hard"; a bad router sends hard calls to the weak model
Cap and trim output tokens	Output tokens are typically priced at a multiple of input tokens. Set a real max, and prompt for the shortest correct answer	Direct: 30% shorter output is roughly 30% off the output portion of the bill	Capping too tight truncates mid-answer; pair with a length-aware prompt
Retrieve context, do not stuff it	Move large reference material to RAG so each call pays only for the context it actually needs	Large on context-heavy features; you stop paying for the 90% of context any one call ignores	Retrieval quality becomes a new failure mode to monitor
Batch the non-urgent work	Send latency-tolerant jobs (summaries, backfills, offline scoring) through an async batch path	Commonly ~50% off versus the synchronous API	Not for anything a user is waiting on
Cache whole responses	Serve identical or near-identical queries from an exact or semantic cache, skipping the model entirely	100% on the cached share; large for features with repetitive queries	Semantic cache needs a similarity threshold; too loose returns a stale answer to a different question

Tie each lever back to the token efficiency metric in Step 4 (output quality per token). A cost cut that tanks quality is not a win; watch both numbers move together.
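The watch-out on the first lever (a prefix that is not byte-stable) can be checked offline before you rely on prompt caching. The sketch below is a rough proxy under assumptions: it compares character prefixes of rendered prompts, whereas providers cache on tokens and apply their own minimum lengths. `prefix_stability` is a hypothetical helper, and the prompts are invented.

```python
import hashlib
from collections import Counter


def prefix_stability(prompts, prefix_chars: int) -> float:
    """Share of prompts whose first prefix_chars characters match the most common prefix."""
    digests = Counter(
        hashlib.sha256(p[:prefix_chars].encode("utf-8")).hexdigest() for p in prompts
    )
    return digests.most_common(1)[0][1] / len(prompts)


SYSTEM = "You are a support assistant. Answer briefly.\n"

# Stable: the shared system prompt comes first, per-request content after it.
stable = [SYSTEM + f"User question {i}" for i in range(3)]
# Unstable: a per-request ID leads the prompt, so no two prefixes match.
unstable = [f"Request {i}. " + SYSTEM + "User question" for i in range(3)]

stable_share = prefix_stability(stable, len(SYSTEM))
unstable_share = prefix_stability(unstable, len(SYSTEM))
```

A share near 1.0 suggests the prefix is a good caching candidate. A share near 1/N means something per-request has leaked into the part you assumed was stable.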

Step 5: Design drift and regression detection

Model behavior changes over time -- from model updates, prompt changes, data shifts, or provider-side changes:

Drift detection approach:

Signal	What changes	How to detect	Frequency
Model version change	Provider updates the model	Monitor model version in API responses	Every call
Output distribution shift	Average output length, vocabulary, structure changes	Statistical comparison of output properties week over week	Weekly
Quality regression	Eval scores drop	Run golden dataset eval, compare to baseline	Weekly
Cost drift	Token usage changes without prompt changes	Compare avg tokens per call week over week	Daily
Latency drift	Response times change	Compare p50/p95 week over week	Daily
Prompt template change	Team modifies system prompts	Track prompt_hash, alert on changes, require eval rerun	On change
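The week-over-week comparisons in the table reduce to a relative change against the previous window. A minimal sketch, assuming a 15% tolerance (a placeholder, not a recommended value) and hypothetical helper names; the token counts are synthetic:

```python
from statistics import mean


def relative_change(previous, current) -> float:
    """Fractional change in the mean from the previous window to the current one."""
    baseline = mean(previous)
    return (mean(current) - baseline) / baseline


def has_drifted(previous, current, tolerance: float = 0.15) -> bool:
    """True when the mean moved more than the tolerance in either direction."""
    return abs(relative_change(previous, current)) > tolerance


last_week_tokens = [400, 420, 380]   # avg tokens per call, synthetic
this_week_tokens = [500, 520, 480]
change = relative_change(last_week_tokens, this_week_tokens)
drift = has_drifted(last_week_tokens, this_week_tokens)
```

The same function works for p50 latency or average output length. A 25% jump in tokens per call with no change in `prompt_hash` points at the model or the input data rather than the team's prompt edits.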

Regression response playbook:

Alert fires -- quality score dropped or cost spiked
Triage -- is this a model version change, prompt change, or data change?
Compare -- run golden dataset eval on current vs. previous model/prompt
Decide -- revert prompt, switch model version, adjust thresholds, or accept new baseline
Document -- log the incident and resolution for future reference

Step 5b: A/B testing prompts and models in production

The regression playbook above answers "did something break?" This answers "should we ship this change?" You cannot A/B a prompt the way you A/B a button color, because the outcome is multi-dimensional (quality and cost and latency at once) and per-call quality is not directly observed. Treat a prompt or model A/B as a controlled regression test with an offline arm and an online arm.

The safe rollout ladder (do not skip rungs):

Offline golden eval. Run the variant against the golden dataset first. If it regresses there, stop. This is free and catches most bad changes before any user sees them.
Shadow mode. Run the variant in parallel with production on live traffic, score it, but serve the current version to the user. Compares the two on real inputs at zero user risk.
Canary. Route a small percentage (1 to 5%) to the variant with guardrail metrics wired (the Step 2 quality metrics plus cost and latency). Ramp only if guardrails hold.
Holdout. Keep a slice on the old version after ramping so you can measure the real lift, not just assume it.
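For the canary rung, assignment should be sticky so a user does not flip between versions mid-conversation. One common way to get that is hash-based bucketing, sketched below. `assign_variant` and the salt value are hypothetical; change the salt per experiment so buckets do not correlate across tests.

```python
import hashlib


def assign_variant(user_id: str, canary_percent: float,
                   salt: str = "prompt-v2-canary") -> str:
    """Deterministically route a user to 'variant' or 'control'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return "variant" if bucket < canary_percent * 100 else "control"


# Simulated population to sanity-check the split at a 5% canary.
assignments = [assign_variant(f"user-{i}", canary_percent=5) for i in range(10_000)]
variant_share = assignments.count("variant") / len(assignments)
```

Ramping is a matter of raising `canary_percent`: users already in the variant stay there, because their bucket number does not change.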

Judge on the joint metric, never one axis. A prompt change that cuts cost 20% and quality 5% might be the right call or the wrong one; it depends entirely on where you sit relative to the quality bar from /ai-eval-design. Plot quality against cost and decide deliberately. Shipping a prompt because it is cheaper, without watching quality, is how features rot.

Read it out honestly. Sampled LLM-as-judge scores need enough volume to be significant; do not call a winner off 20 calls. Between eval runs, lean on cheap online proxies already in your logs (regeneration rate, thumbs, acceptance or edit rate from Step 2) as a faster, noisier signal. The offline golden score is the ground truth; the online proxies are the early warning.
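To make "enough volume to be significant" concrete, here is a sketch of a two-proportion z-test on judge pass rates for the control and the variant. It is one standard option, not the only valid readout; it assumes independent samples and a pass/fail judge, and it breaks down when the pooled pass rate is exactly 0 or 1. The counts are invented.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in pass rates, B minus A."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Same observed rates (85% vs 95%), very different sample sizes.
small = two_proportion_z(17, 20, 19, 20)
large = two_proportion_z(850, 1000, 950, 1000)
```

At 20 calls per arm the ten-point gap gives a p-value of roughly 0.29, which is indistinguishable from noise. At 1,000 calls per arm the same gap is decisive.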

Step 5c: Wire online evals into the trace (the core 2026 instrument)

The weekly golden-dataset run in Step 2 tells you the average is holding. It does not tell you which live request just went wrong, or why. The 2026 standard closes that gap: the trace is the primary record of every production run, and online evals score live runs as they happen. This is the AI-in-the-loop layer that replaces "wait for a user to complain."

Treat evals as a first-class layer on top of monitoring, not a separate offline activity. Post-LangChain 1.0 (October 2025), LangSmith and the comparable platforms run this natively: a trace captures the full run (every model call, retrieval, and tool call), and an online eval scores a sampled share of those live traces with either an LLM-as-judge rubric or a custom Python check.

How to set it up:

Make the trace the record. Every production run emits a trace covering the whole workflow, not just the final model call. The trace is what you score, alert on, and debug from.
Reuse the eval criteria you already wrote. The rubrics from /ai-eval-design become the online judge prompts here. Do not invent a second, looser definition of "good" for production. The offline criterion and the online threshold are the same standard, measured in two places.
Score a sample of live traffic continuously. Run the judge on 1 to 5% of production traces (100% of errors and flagged outputs), the same sampling spine as Step 3 logging. Each scored trace writes its result back as feedback on the trace, so a low score is one click from the exact run that produced it.
Set the threshold from the eval bar, then alert on it. A relevance rubric that must clear 85% offline (Step 2) becomes a live alert when the rolling online score drops below it. Wire the alert to PagerDuty or a webhook with a named owner (Step 8), not just a dashboard tile.

Why this beats the weekly run alone: the golden dataset is a fixed, friendly set of inputs. Online evals score the messy real distribution, on the inputs users actually send, in close to real time. Keep both. The weekly run is your regression baseline; the online evals are your live smoke detector and your fastest path from "a number moved" to "here is the trace that moved it."
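A minimal sketch of the alerting step: keep a rolling pass rate over the most recent judged traces and fire when it falls below the eval bar. `RollingScore` is a hypothetical class, not a LangSmith or Langfuse API; the window size and minimum sample count are placeholders, and the 0.85 threshold mirrors the relevance example above.

```python
from collections import deque


class RollingScore:
    """Rolling pass rate over the most recent judged traces."""

    def __init__(self, window: int, threshold: float, min_samples: int):
        self.results = deque(maxlen=window)   # oldest result drops off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def add(self, passed: bool) -> bool:
        """Record one judged trace; return True when the alert should fire."""
        self.results.append(1 if passed else 0)
        if len(self.results) < self.min_samples:
            return False                      # too few samples to trust the rate
        return sum(self.results) / len(self.results) < self.threshold


monitor = RollingScore(window=10, threshold=0.85, min_samples=5)
fired = [monitor.add(passed) for passed in [True] * 9 + [False, False]]
```

In this run the first failure leaves the window at 90% and nothing fires; the second drops it to 80% and trips the alert. In production the `True` return would call the pager or webhook with the failing trace ID attached.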

Step 6: Choose tooling

Tool	What it does	Best for	Pricing model
Helicone	LLM proxy with logging, cost tracking, caching	Teams wanting zero-code setup, cost optimization	Free tier + usage-based
Langfuse	Open-source LLM observability, tracing, eval	Teams wanting self-hosted or detailed tracing	Free (self-hosted) or cloud pricing
Braintrust	Eval platform with logging and experiments	Teams focused on systematic eval and prompt iteration	Usage-based
Datadog LLM Monitoring	Extension of Datadog APM for LLM calls	Teams already on Datadog	Per-host pricing
Arize Phoenix	Open-source LLM tracing and evaluation	Teams wanting self-hosted with strong eval integration	Free (self-hosted) or cloud pricing
LangSmith	LangChain's observability platform; trace-first with online evals and unified agent-workflow cost as a first-class layer (post-LangChain 1.0)	Teams using LangChain/LangGraph, or anyone wanting online evals scored on live traces	Free tier + usage-based
Custom (OpenTelemetry)	Roll your own with standard instrumentation	Teams with specific requirements or existing infra	Infrastructure cost

Selection criteria:

What's your existing monitoring stack? (Extend it vs. add a new tool)
Do you need self-hosted? (Compliance, data sovereignty)
What's the budget for monitoring tooling?
How many LLM calls per day? (Determines whether free tiers are viable)

Default recommendation for most teams: Start with Langfuse if you don't already have LLM monitoring in your stack. It covers tracing, eval, and cost tracking in one tool, offers both self-hosted and cloud options, and has strong open-source community momentum and adoption as of mid-2026. Move to Datadog LLM Monitoring only if your team already runs Datadog and wants unified APM + LLM observability.

Step 7: Generate the observability plan

Compile into a structured document:

# LLM Observability Plan: (Product name)

**Generated:** (date)
**Product:** (brief description)
**LLM features:** (count and list)
**Models in use:** (list with versions)

## LLM Touchpoint Map
(Table from Step 1 -- features, models, volume, current monitoring)

## Quality Metrics
(Table from Step 2 -- metrics, thresholds, measurement methods, alert rules)

## Logging Architecture
(Strategy from Step 3 -- what to log, PII handling, sampling rates)

### Log schema
(Field list with types and PII flags)

### Sampling rules
- Metadata: (100% of calls)
- Full content: (N% of calls, 100% of errors)
- Quality eval: (N% sample via LLM-as-judge)

### PII handling
- (Approach: full logging / sanitized / metadata only)
- (Retention policy)
- (Access controls)

## Cost & Latency Monitoring
(Tables from Step 4 -- cost metrics, latency targets, alert thresholds)

## Drift & Regression Detection
(Signals and playbook from Step 5)

## Tooling
(Recommendation from Step 6 with rationale)

## Implementation Checklist
- [ ] **(P0)** Instrument metadata logging on all LLM calls (tokens, latency, cost, status)
- [ ] **(P0)** Set up cost tracking dashboard with daily spend by feature
- [ ] **(P0)** Configure latency alerts (p95 > threshold)
- [ ] **(P1)** Implement prompt/response logging with PII handling
- [ ] **(P1)** Set up golden dataset eval as weekly automated run
- [ ] **(P1)** Emit a full-workflow trace per run; wire online evals to score 1-5% of live traces using the `/ai-eval-design` rubrics, alerting on the same thresholds
- [ ] **(P1)** Add cost-per-completed-task (spend rolled up across the whole agent trace) alongside cost per interaction
- [ ] **(P1)** Build quality metrics dashboard
- [ ] **(P2)** Implement drift detection (output distribution monitoring)
- [ ] **(P2)** Create regression response playbook and runbook
- [ ] **(P2)** Set up model version change alerts

## Open Questions
- (Unresolved monitoring decisions)
- (Things that need baseline data to determine thresholds)

Step 8: Review and finalize

Ask the user:

Are the quality metrics capturing what matters most for your AI features?
Is the logging strategy practical given PII constraints?
Are the cost alert thresholds realistic based on current spend?
Is the drift detection approach proportionate to your risk tolerance?
Does the tooling recommendation fit your existing stack and budget?
Who owns LLM monitoring? (Engineering, ML team, product, SRE?)

Adjust based on feedback.

Output location

Present the plan as formatted text in the conversation for the user to copy into their product wiki, analytics documentation, or team shared drive.

Example Output

Input

Product: Meridian Health — AI-powered clinical documentation assistant that auto-generates SOAP notes and pre-authorization letters from physician voice recordings
LLM features in production: (1) SOAP note generation from transcribed audio, (2) prior auth letter drafting, (3) ICD-10 code suggestion
Models in use: GPT-4o for SOAP notes and prior auth (OpenAI), GPT-3.5-turbo for ICD-10 suggestions (cost optimization); no versioning tracked currently
Current monitoring: CloudWatch logs latency and HTTP status codes only; no token tracking, no quality metrics, no cost dashboards
Known problems: Physicians flagging ~15% of SOAP notes as needing "major edits"; prior auth letters occasionally cite incorrect CPT codes; monthly OpenAI bill jumped 340% over 3 months with no explanation
Constraints: Full PHI in prompts and responses — HIPAA compliance required, no third-party SaaS logging of raw content; existing stack is AWS (CloudWatch, Grafana)

Output (abbreviated)

LLM Observability Plan: Meridian Health Clinical Documentation Assistant

Generated: 2025-07-14
Product: AI-assisted clinical documentation: SOAP notes, prior auth letters, ICD-10 coding
LLM features: 3 (SOAP generation, prior auth drafting, ICD-10 suggestion)
Models in use: GPT-4o (gpt-4o-2024-08-06), GPT-3.5-turbo (gpt-3.5-turbo-0125)

LLM Touchpoint Map

Feature	Model	Est. calls/day	Avg latency	Current monitoring	Known issues
SOAP note generation	GPT-4o	~3,200	~6.8s	HTTP status only	~15% major edit rate; quality regression suspected
Prior auth letter drafting	GPT-4o	~800	~9.1s	HTTP status only	Incorrect CPT codes in ~8% of letters
ICD-10 code suggestion	GPT-3.5-turbo	~3,200	~1.4s	HTTP status only	Unknown accuracy; no tracking

Total: ~7,200 LLM calls/day across 3 features. Cost anomaly on SOAP generation is the highest-priority investigative target.

Quality Metrics

Metric	Feature	How to measure	Target threshold	Alert when
Major edit rate	SOAP notes	Track physician edits post-generation (word delta > 40% = major)	< 8% major edits	> 12% in any rolling 7-day window
CPT/ICD code accuracy	Prior auth, ICD-10	LLM-as-judge cross-check against structured EHR codes on 10% sample	> 94% match	< 88% match
Clinical completeness	SOAP notes	Rubric: does output contain all 4 SOAP sections with non-trivial content?	> 97% complete	< 93% complete
Hallucination rate	Prior auth	Automated check: cited diagnosis codes present in patient record?	< 3%	> 6%
Regeneration rate	All features	% of sessions where physician requests a new generation	< 5%	> 10%
Safety incidents	All features	Outputs flagged by content filter or physician-reported errors with patient safety implication	0 critical	Any critical incident triggers P0 response
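The automated hallucination check for prior auth letters could look like the sketch below: extract the diagnosis codes a letter cites and flag any that are missing from the patient record. The regular expression is a simplified approximation of the ICD-10 code shape, not a validator, and `unsupported_codes`, the sample letter, and the record contents are all hypothetical.

```python
import re

# Simplified ICD-10 shape: letter, digit, alphanumeric, optional ".xxxx" suffix.
ICD10_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def unsupported_codes(letter_text: str, record_codes) -> set:
    """Codes cited in the letter that do not appear in the patient record."""
    cited = set(ICD10_PATTERN.findall(letter_text))
    return cited - set(record_codes)


letter = "Patient has E11.9 and I10; also J45.909."   # synthetic text, no real PHI
flagged = unsupported_codes(letter, record_codes={"E11.9", "I10"})
```

A letter counts toward the hallucination rate when `flagged` is non-empty. Run the check inside the HIPAA boundary, since it reads raw letter content.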

Risk profile: All three features are customer-facing and clinically high-stakes. Full metric coverage applies to SOAP and prior auth. ICD-10 gets accuracy + cost monitoring as a minimum bar until baseline is established.

Logging Architecture

PII handling decision

HIPAA compliance prohibits logging raw prompt/response content to any third-party SaaS. All logging goes to AWS CloudWatch + S3 (encrypted, us-east-1) with access restricted to the on-call engineering team via IAM role. No Helicone, Langfuse cloud, or Datadog SaaS for content logs.

Three-tier log approach:

Tier	Content	Volume	Retention
Tier 1 — Metadata	Tokens, latency, cost, status, model version, feature tag, prompt_hash	100% of calls	90 days
Tier 2 — Sanitized content	Response structure only (section headers present/absent, code counts, output length) — no PHI	100% of calls	90 days
Tier 3 — Full content	Raw prompt + response, encrypted at rest, restricted access	100% of error/timeout calls; 2% random sample of successes	30 days, then purge

Log schema (Tier 1 — always captured)

Field	Type	PII?
`request_id`	string	No
`session_id`	string	No
`timestamp`	ISO 8601	No
`feature`	enum: soap_note, prior_auth, icd10	No
`model`	string (name + version from API response header)	No
`input_tokens`	integer	No
`output_tokens`	integer	No
`latency_ms`	integer	No
`time_to_first_token_ms`	integer	No
`status`	enum: success, error, timeout, rate_limited	No
`error_type`	string	No
`prompt_hash`	SHA-256 of system prompt template	No
`cost_usd`	float	No
`provider_model_version`	string (from API response)	No
`user_id`	anonymized hash	Yes — hash only, no raw ID

Sampling rules

Tier 1 metadata: 100% of all calls
Tier 2 sanitized structure: 100% of all calls
Tier 3 full content: 100% of errors/timeouts + 2% random success sample
Quality eval (LLM-as-judge): 5% sample for SOAP and prior auth; 10% for ICD-10 (higher code-accuracy risk, and no accuracy baseline yet)

Cost & Latency Monitoring

Cost monitoring

Metric	Current baseline	Alert threshold	Dashboard cadence
Daily cost — SOAP (GPT-4o)	~$320/day (estimate from token logs)	> $480/day (1.5x)	Real-time
Daily cost — Prior auth (GPT-4o)	~$95/day	> $150/day	Real-time
Daily cost — ICD-10 (GPT-3.5)	~$12/day	> $25/day	Daily
Cost per SOAP note	~$0.10	> $0.20	Weekly trend
Monthly projection	~$12,600/mo	> $18,000/mo	Weekly
Token efficiency (SOAP)	Baseline TBD week 1	> 20% increase in avg input tokens without quality gain	Weekly

On the 340% cost spike: Hypothesis is prompt template bloat — input tokens grew without output quality improving. Token efficiency metric will confirm. Check prompt_hash history once Tier 1 logging is live.

Latency targets

Metric	Target	Alert
SOAP — p50	< 5s	> 7.5s
SOAP — p95	< 12s	> 18s
Prior auth — p50	< 7s	> 10s
ICD-10 — p50	< 1.5s	> 2.5s
Time to first token (streaming, SOAP)	< 800ms	> 1.5s
Timeout rate (any feature)	< 0.5%	> 2%
Rate limit hit rate	< 0.1%	> 0.5%

Drift & Regression Detection

| Signal | What to watch | Detection
