Use this when you have (or are about to ship) AI-powered features and need to monitor them in production. Covers LLM-specific concerns that traditional observability and SRE monitoring miss: prompt/response quality, token costs, latency per model, hallucination drift, and model version regression. If you're looking to measure user behavior, use /observability-plan. If you're looking to measure system health, use /instrumentation-plan. This skill covers the AI-specific layer between those two.
The distinction:
- /observability-plan answers "Are users successful?"
- /instrumentation-plan answers "Is the system healthy?"
- This skill answers "Is the AI behaving correctly, consistently, and affordably?"
Related skills: Complements /observability-plan (product analytics) and /instrumentation-plan (SRE metrics). Eval criteria from /ai-eval-design become monitoring thresholds here. Quality dimensions from the /ai-health-check CARATS framework inform what to monitor.
Process
Step 1: Identify LLM touchpoints
Ask the user:
- What AI features are in production (or about to ship)? (List each feature that calls an LLM)
- Which models does each feature use? (Provider, model name, version)
- What's the current monitoring? (Any logging, dashboards, or alerts already in place?)
- What problems have you seen? (Quality issues, cost surprises, latency spikes, outages)
- What's the sensitivity level? (Can you log prompts/responses, or are there PII/compliance constraints?)
- What observability tools are in use or available? (Helicone, Langfuse, Datadog, custom, etc.)
Map each LLM touchpoint:
| Feature | Model | Calls/day | Avg latency | Current monitoring | Known issues |
|---|---|---|---|---|---|
| (Chat support) | (Claude Sonnet) | (10,000) | (2.1s) | (None) | (Occasional hallucinations) |
| (Doc summary) | (Gemini Pro) | (500) | (8.3s) | (Basic logging) | (Slow on large docs) |
Step 2: Define quality metrics
Quality monitoring answers: "Is the AI output still good?" Define metrics that catch degradation:
| Metric | Definition | How to measure | Threshold | Alert when |
|---|---|---|---|---|
| Consistency score | Same input produces similar outputs over time | Run golden dataset weekly, compare scores | > 3.5 rubric avg | Score drops below 3.0 |
| Hallucination rate | % of outputs containing fabricated information | Automated fact-check or LLM-as-judge sampling | < 5% | Rate exceeds 10% |
| Relevance score | % of outputs that address the user's actual question | LLM-as-judge on sample + user feedback signals | > 85% | Below 75% |
| Tone compliance | Output matches expected voice/style | Tone rubric scoring on sample | > 90% pass | Below 80% |
| Safety incidents | Harmful, biased, or inappropriate outputs | Content filter + human review of flagged items | 0 critical | Any critical incident |
| User satisfaction signal | Thumbs up/down, regeneration rate, copy rate, conversation abandonment rate | In-product feedback + behavioral tracking (did the user accept, edit, or discard the output?) | (baseline) | Drops > 20% from baseline |
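The "Alert when" column can be encoded as simple threshold rules. The following is a minimal sketch, assuming metric values arrive as plain numbers from your eval pipeline. The `AlertRule` class, the metric keys, and `firing_alerts` are hypothetical names, the limits are the illustrative values from the table, and the baseline-relative user satisfaction rule is left out because it needs a stored baseline.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class AlertRule:
    """One row of the quality-metrics table (names are illustrative)."""
    metric: str
    direction: Literal["below", "above"]  # which side of the limit triggers the alert
    limit: float

    def fires(self, value: float) -> bool:
        if self.direction == "below":
            return value < self.limit
        return value > self.limit


# Alert limits copied from the table above; tune them to your own baselines.
RULES = [
    AlertRule("consistency_score", "below", 3.0),
    AlertRule("hallucination_rate", "above", 0.10),
    AlertRule("relevance_score", "below", 0.75),
    AlertRule("tone_compliance", "below", 0.80),
    AlertRule("critical_safety_incidents", "above", 0),
]


def firing_alerts(observed: dict[str, float]) -> list[str]:
    """Return the metrics whose observed value breaches its rule.

    Metrics missing from `observed` are skipped rather than treated as failures.
    """
    return [r.metric for r in RULES if r.metric in observed and r.fires(observed[r.metric])]
```

For example, with a consistency score of 3.4, a hallucination rate of 0.12, and a relevance score of 0.81, only `hallucination_rate` breaches its rule.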
Not every feature needs every metric. Match metrics to the feature's risk profile:
- Customer-facing, high-stakes: All metrics, tight thresholds
- Internal tool, moderate stakes: Consistency + relevance + cost
- Batch processing, low stakes: Cost + basic error rate
Step 3: Design the logging strategy
What to capture on every LLM call:
| Field | Type | Purpose | PII concern? |
|---|---|---|---|
| request_id | string | Unique identifier for the call | No |
| session_id | string | Conversation or session ID (for multi-turn features) | No |
| timestamp | ISO 8601 | When the call was made | No |
| model | string | Model name and version | No |
| feature | string | Which product feature triggered this | No |
| input_tokens | integer | Token count of the prompt | No |
| output_tokens | integer | Token count of the response | No |
| latency_ms | integer | Total response time | No |
| time_to_first_token_ms | integer | Streaming start time | No |
| status | string | Success, error, timeout, rate_limited | No |
| error_type | string | Error category if failed | No |
| prompt_hash | string | Hash of the system prompt template (not content) | No |
| cost | float | Calculated cost of this call | No |
| tool_calls | JSON | Tool/function calls made during the request (if any) | Depends on tool |
| prompt_text | string | Full prompt (if allowed) | Yes -- may contain PII |
| response_text | string | Full response (if allowed) | Yes -- may contain PII |
| user_id | string | Anonymized user identifier | Yes -- handle carefully |
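One way to make the schema concrete is a small record type that serializes to a JSON log line. This is a sketch under assumptions: the `LLMCallRecord` class is a hypothetical name, it covers only the metadata fields flagged "No" for PII, and the content fields (`prompt_text`, `response_text`, `tool_calls`, `user_id`) are left out so they can be handled by a separate, access-controlled path.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LLMCallRecord:
    """Metadata-only record for one LLM call; content fields are omitted on purpose."""
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    status: str  # "success", "error", "timeout", or "rate_limited"
    prompt_hash: str
    cost: float
    session_id: Optional[str] = None
    time_to_first_token_ms: Optional[int] = None
    error_type: Optional[str] = None
    # Generated at construction time if the caller does not supply them.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        """Serialize to a single JSON line for a structured log sink."""
        return json.dumps(asdict(self), sort_keys=True)
```

Emitting one JSON object per call keeps the record queryable in most log backends without committing to a specific vendor.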
PII handling decisions:
- Can you log full prompts and responses? (Best for debugging, worst for privacy)
- If not, can you log sanitized versions? (Strip PII, keep structure)
- If not, can you log metadata only? (Tokens, latency, cost -- no content)
- What's the data retention policy? (30 days? 90 days? Indefinite?)
- Who has access to raw logs vs. aggregated dashboards?
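If you choose the sanitized option, redaction can replace matched values with typed placeholders so the text keeps its structure. The sketch below is illustrative only: the two regular expressions are assumptions that catch common email and phone formats, they will miss names, addresses, and record numbers, and the phone pattern can over-match other long digit runs such as dates or IDs. Production redaction usually needs a dedicated PII detection tool.

```python
import re

# Illustrative patterns only; they are not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def sanitize(text: str) -> str:
    """Replace matched PII with typed placeholders, leaving the rest of the text intact."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(sanitize("Reach me at jane.doe@example.com or 555-123-4567."))
    # -> Reach me at [EMAIL] or [PHONE].
```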
Sampling strategy for high-volume features:
- Log metadata (tokens, latency, cost, status) on 100% of calls
- Log full prompt/response on 1-10% of calls (configurable)
- Log full prompt/response on 100% of error/timeout calls
- Run quality eval (LLM-as-judge) on 1-5% sample
Step 4: Plan cost and latency monitoring
Cost monitoring:
| Metric | Formula | Dashboard | Alert when |
|---|---|---|---|
| Cost per interaction | (input_tokens x input_price + output_tokens x output_price) | Real-time | > 2x baseline |
| Daily cost by feature | Sum of interaction costs per feature per day | Daily | > budget ceiling |
| Monthly cost projection | Month-to-date spend + (avg daily cost x days remaining) | Weekly | > monthly budget |
| Cost per user | Total LLM cost / active users | Monthly | Trending up > 20% MoM |
| Token efficiency | Output quality score / tokens used | Weekly | Efficiency drops > 15% |
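The cost formulas can be worked through in a few lines. This is a minimal sketch: the prices in `PRICES` are placeholders rather than any provider's real price list, prices are assumed to be quoted per million tokens, and the projection assumes month-to-date spend plus the average daily cost applied to the remaining days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per one million tokens (placeholder values, not a real price list)."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical prices for illustration; load real ones from your provider's pricing.
PRICES = {"example-model": ModelPrice(input_per_mtok=3.00, output_per_mtok=15.00)}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost per interaction: input_tokens x input_price + output_tokens x output_price."""
    p = PRICES[model]
    return (input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok) / 1_000_000


def monthly_projection(spend_to_date: float, days_elapsed: int, days_in_month: int) -> float:
    """Month-to-date spend plus the average daily cost applied to the remaining days."""
    if days_elapsed <= 0:
        raise ValueError("need at least one full day of spend to project")
    daily_avg = spend_to_date / days_elapsed
    return spend_to_date + daily_avg * (days_in_month - days_elapsed)
```

With the placeholder prices, a call with 1,000 input tokens and 500 output tokens costs $0.0105, and $300 spent over the first 10 days of a 30-day month projects to $900.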
Latency monitoring:
| Metric | Target | Alert when |
|---|---|---|
| p50 response time | (target, e.g., < 2s) | > 1.5x target |
| p95 response time | (target, e.g., < 5s) | > 2x target |
| p99 response time | (target, e.g., < 10s) | > 3x target |
| Time to first token (streaming) | (target, e.g., < 500ms) | > 1s |
| Timeout rate | < 1% | > 3% |
| Rate limit hit rate | < 0.1% | > 1% |
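Percentile alerts need a consistent percentile definition. The sketch below assumes the nearest-rank method over a window of latency samples in seconds and uses the example targets from the table. `LATENCY_RULES` and `latency_alerts` are hypothetical names, and most monitoring backends compute percentiles for you.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# percentile -> (target in seconds, alert multiplier), from the example targets above.
LATENCY_RULES = {50: (2.0, 1.5), 95: (5.0, 2.0), 99: (10.0, 3.0)}


def latency_alerts(latencies_s: list[float]) -> dict[str, float]:
    """Return each percentile that exceeds target x multiplier, with its observed value."""
    breaches = {}
    for pct, (target, multiplier) in LATENCY_RULES.items():
        observed = percentile(latencies_s, pct)
        if observed > target * multiplier:
            breaches[f"p{pct}"] = observed
    return breaches
```

For a window of 90 calls at 1.0s and 10 calls at 12.0s, only p95 breaches: 12.0s is above the 10s alert line for p95 but below the 30s alert line for p99.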
Step 5: Design drift and regression detection
Model behavior changes over time because of model updates, prompt changes, shifts in input data, or other provider-side changes.
Drift detection approach:
| Signal | What changes | How to detect | Frequency |
|---|---|---|---|
| Model version change | Provider updates the model | Monitor model version in API responses | Every call |
| Output distribution shift | Average output length, vocabulary, structure changes | Statistical comparison of output properties week over week | Weekly |
| Quality regression | Eval scores drop | Run golden dataset eval, compare to baseline | Weekly |
| Cost drift | Token usage changes without prompt changes | Compare avg tokens per call week over week | Daily |
| Latency drift | Response times change | Compare p50/p95 week over week | Daily |
| Prompt template change | Team modifies system prompts | Track prompt_hash, alert on changes, require eval rerun | On change |
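Two of these signals are easy to sketch: a prompt hash that changes whenever the template changes, and a week-over-week comparison of a numeric property such as tokens per call. The 15% tolerance is an assumed default, not a recommendation from this guide, and comparing means is the simplest possible check. A distribution test may suit noisy metrics better.

```python
import hashlib
from statistics import mean


def prompt_hash(template: str) -> str:
    """SHA-256 of the system prompt template, so any edit yields a new hash."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def relative_change(previous: list[float], current: list[float]) -> float:
    """Week-over-week change in the mean, as a fraction of last week's mean."""
    baseline = mean(previous)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(current) - baseline) / baseline


def drifted(previous: list[float], current: list[float], tolerance: float = 0.15) -> bool:
    """Flag drift when the mean moves more than `tolerance` in either direction."""
    return abs(relative_change(previous, current)) > tolerance
```

Hash the template before variables are filled in. Hashing the rendered prompt would produce a new hash on nearly every call and hide real template changes.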
Regression response playbook:
- Alert fires -- quality score dropped or cost spiked
- Triage -- is this a model version change, prompt change, or data change?
- Compare -- run golden dataset eval on current vs. previous model/prompt
- Decide -- revert prompt, switch model version, adjust thresholds, or accept new baseline
- Document -- log the incident and resolution for future reference
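The triage step can be partly automated if each golden-dataset run stores the model version and prompt hash alongside its score. The sketch below is one possible shape: `Snapshot`, `triage`, and the returned labels are hypothetical, and the default `min_drop` of 0.5 rubric points is an assumption that mirrors the gap between the 3.5 target and the 3.0 alert line in Step 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """State captured alongside each golden-dataset eval run (hypothetical shape)."""
    model_version: str
    prompt_hash: str
    eval_score: float


def triage(baseline: Snapshot, current: Snapshot, min_drop: float = 0.5) -> str:
    """Suggest the first thing to investigate when the eval score regresses."""
    if baseline.eval_score - current.eval_score < min_drop:
        return "no_regression"
    model_changed = baseline.model_version != current.model_version
    prompt_changed = baseline.prompt_hash != current.prompt_hash
    if model_changed and prompt_changed:
        return "model_and_prompt_changed"  # re-run the eval changing one at a time
    if model_changed:
        return "model_version_change"
    if prompt_changed:
        return "prompt_change"
    return "data_change_suspected"  # neither changed, so look at the inputs
```

The label only narrows where to look first. The compare and decide steps still need a person reviewing the eval results.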
Step 6: Choose tooling
| Tool | What it does | Best for | Pricing model |
|---|---|---|---|
| Helicone | LLM proxy with logging, cost tracking, caching | Teams wanting zero-code setup, cost optimization | Free tier + usage-based |
| Langfuse | Open-source LLM observability, tracing, eval | Teams wanting self-hosted or detailed tracing | Free (self-hosted) or cloud pricing |
| Braintrust | Eval platform with logging and experiments | Teams focused on systematic eval and prompt iteration | Usage-based |
| Datadog LLM Monitoring | Extension of Datadog APM for LLM calls | Teams already on Datadog | Per-host pricing |
| Arize Phoenix | Open-source LLM tracing and evaluation | Teams wanting self-hosted with strong eval integration | Free (self-hosted) or cloud pricing |
| LangSmith | LangChain's observability platform | Teams using LangChain/LangGraph | Free tier + usage-based |
| Custom (OpenTelemetry) | Roll your own with standard instrumentation | Teams with specific requirements or existing infra | Infrastructure cost |
Selection criteria:
- What's your existing monitoring stack? (Extend it vs. add a new tool)
- Do you need self-hosted? (Compliance, data sovereignty)
- What's the budget for monitoring tooling?
- How many LLM calls per day? (Determines whether free tiers are viable)
Default recommendation for most teams: Start with Langfuse if you don't already have LLM monitoring in your stack. It covers tracing, eval, and cost tracking in one tool, offers both self-hosted and cloud options, and has strong open-source community momentum (5.0 rating, 41 reviews on Product Hunt as of March 2026). Choose Datadog LLM Monitoring instead if your team already runs Datadog and wants unified APM + LLM observability.
Step 7: Generate the observability plan
Compile into a structured document:
# LLM Observability Plan: (Product name)
**Generated:** (date)
**Product:** (brief description)
**LLM features:** (count and list)
**Models in use:** (list with versions)
## LLM Touchpoint Map
(Table from Step 1 -- features, models, volume, current monitoring)
## Quality Metrics
(Table from Step 2 -- metrics, thresholds, measurement methods, alert rules)
## Logging Architecture
(Strategy from Step 3 -- what to log, PII handling, sampling rates)
### Log schema
(Field list with types and PII flags)
### Sampling rules
- Metadata: (100% of calls)
- Full content: (N% of calls, 100% of errors)
- Quality eval: (N% sample via LLM-as-judge)
### PII handling
- (Approach: full logging / sanitized / metadata only)
- (Retention policy)
- (Access controls)
## Cost & Latency Monitoring
(Tables from Step 4 -- cost metrics, latency targets, alert thresholds)
## Drift & Regression Detection
(Signals and playbook from Step 5)
## Tooling
(Recommendation from Step 6 with rationale)
## Implementation Checklist
- [ ] **(P0)** Instrument metadata logging on all LLM calls (tokens, latency, cost, status)
- [ ] **(P0)** Set up cost tracking dashboard with daily spend by feature
- [ ] **(P0)** Configure latency alerts (p95 > threshold)
- [ ] **(P1)** Implement prompt/response logging with PII handling
- [ ] **(P1)** Set up golden dataset eval as weekly automated run
- [ ] **(P1)** Build quality metrics dashboard
- [ ] **(P2)** Implement drift detection (output distribution monitoring)
- [ ] **(P2)** Create regression response playbook and runbook
- [ ] **(P2)** Set up model version change alerts
## Open Questions
- (Unresolved monitoring decisions)
- (Things that need baseline data to determine thresholds)
Step 8: Review and finalize
Ask the user:
- Are the quality metrics capturing what matters most for your AI features?
- Is the logging strategy practical given PII constraints?
- Are the cost alert thresholds realistic based on current spend?
- Is the drift detection approach proportionate to your risk tolerance?
- Does the tooling recommendation fit your existing stack and budget?
- Who owns LLM monitoring? (Engineering, ML team, product, SRE?)
Adjust based on feedback.
Output location
Present the plan as formatted text in the conversation for the user to copy into their product wiki, analytics documentation, or team shared drive.
Example Output
Input
- Product: Meridian Health — AI-powered clinical documentation assistant that auto-generates SOAP notes and pre-authorization letters from physician voice recordings
- LLM features in production: (1) SOAP note generation from transcribed audio, (2) prior auth letter drafting, (3) ICD-10 code suggestion
- Models in use: GPT-4o for SOAP notes and prior auth (OpenAI), GPT-3.5-turbo for ICD-10 suggestions (cost optimization); no versioning tracked currently
- Current monitoring: CloudWatch logs latency and HTTP status codes only; no token tracking, no quality metrics, no cost dashboards
- Known problems: Physicians flagging ~15% of SOAP notes as needing "major edits"; prior auth letters occasionally cite incorrect CPT codes; monthly OpenAI bill jumped 340% over 3 months with no explanation
- Constraints: Full PHI in prompts and responses — HIPAA compliance required, no third-party SaaS logging of raw content; existing stack is AWS (CloudWatch, Grafana)
Output (abbreviated)
LLM Observability Plan: Meridian Health Clinical Documentation Assistant
Generated: 2025-07-14
Product: AI-assisted clinical documentation (SOAP notes, prior auth letters, ICD-10 coding)
LLM features: 3 (SOAP generation, prior auth drafting, ICD-10 suggestion)
Models in use: GPT-4o (gpt-4o-2024-08-06), GPT-3.5-turbo (gpt-3.5-turbo-0125)
LLM Touchpoint Map
| Feature | Model | Est. calls/day | Avg latency | Current monitoring | Known issues |
|---|---|---|---|---|---|
| SOAP note generation | GPT-4o | ~3,200 | ~6.8s | HTTP status only | ~15% major edit rate; quality regression suspected |
| Prior auth letter drafting | GPT-4o | ~800 | ~9.1s | HTTP status only | Incorrect CPT codes in ~8% of letters |
| ICD-10 code suggestion | GPT-3.5-turbo | ~3,200 | ~1.4s | HTTP status only | Unknown accuracy; no tracking |
Total: ~7,200 LLM calls/day across 3 features. Cost anomaly on SOAP generation is the highest-priority investigative target.
Quality Metrics
| Metric | Feature | How to measure | Target threshold | Alert when |
|---|---|---|---|---|
| Major edit rate | SOAP notes | Track physician edits post-generation (word delta > 40% = major) | < 8% major edits | > 12% in any rolling 7-day window |
| CPT/ICD code accuracy | Prior auth, ICD-10 | LLM-as-judge cross-check against structured EHR codes on 10% sample | > 94% match | < 88% match |
| Clinical completeness | SOAP notes | Rubric: does output contain all 4 SOAP sections with non-trivial content? | > 97% complete | < 93% complete |
| Hallucination rate | Prior auth | Automated check: cited diagnosis codes present in patient record? | < 3% | > 6% |
| Regeneration rate | All features | % of sessions where physician requests a new generation | < 5% | > 10% |
| Safety incidents | All features | Outputs flagged by content filter or physician-reported errors with patient safety implication | 0 critical | Any critical incident triggers P0 response |
Risk profile: All three features are customer-facing and clinically high-stakes. Full metric coverage applies to SOAP and prior auth. ICD-10 gets accuracy + cost monitoring as a minimum bar until baseline is established.
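The major edit rate depends on how "word delta" is defined. One possible definition, sketched here, is one minus the word-level similarity ratio from Python's `difflib`. The 40% threshold comes from the table above. The function names and the choice of `SequenceMatcher` are assumptions, and other edit-distance measures would work too.

```python
from difflib import SequenceMatcher


def word_delta(generated: str, final: str) -> float:
    """Approximate share of words changed between the generated and the final note."""
    a, b = generated.split(), final.split()
    if not a and not b:
        return 0.0
    # ratio() is 2 * matches / (len(a) + len(b)); 1 - ratio approximates the share changed.
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_major_edit(generated: str, final: str, threshold: float = 0.40) -> bool:
    """True when the word delta exceeds the threshold from the metrics table."""
    return word_delta(generated, final) > threshold


def major_edit_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of (generated, final) pairs classified as major edits."""
    if not pairs:
        return 0.0
    return sum(is_major_edit(g, f) for g, f in pairs) / len(pairs)
```

Only the resulting delta and the major/minor flag need to be logged, so this check can run where the PHI already lives without writing note content to the metrics pipeline.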
Logging Architecture
PII handling decision
The HIPAA constraint stated in the input rules out logging raw prompt/response content to third-party SaaS tools. All logging goes to AWS CloudWatch + S3 (encrypted, us-east-1) with access restricted to the on-call engineering team via IAM role. No Helicone, Langfuse cloud, or Datadog SaaS for content logs.
Three-tier log approach:
| Tier | Content | Volume | Retention |
|---|---|---|---|
| Tier 1 — Metadata | Tokens, latency, cost, status, model version, feature tag, prompt_hash | 100% of calls | 90 days |
| Tier 2 — Sanitized content | Response structure only (section headers present/absent, code counts, output length) — no PHI | 100% of calls | 90 days |
| Tier 3 — Full content | Raw prompt + response, encrypted at rest, restricted access | 100% of error/timeout calls; 2% random sample of successes | 30 days, then purge |
Log schema (Tier 1 — always captured)
| Field | Type | PII? |
|---|---|---|
| request_id | string | No |
| session_id | string | No |
| timestamp | ISO 8601 | No |
| feature | enum: soap_note, prior_auth, icd10 | No |
| model | string (name + version from API response header) | No |
| input_tokens | integer | No |
| output_tokens | integer | No |
| latency_ms | integer | No |
| time_to_first_token_ms | integer | No |
| status | enum: success, error, timeout, rate_limited | No |
| error_type | string | No |
| prompt_hash | SHA-256 of system prompt template | No |
| cost_usd | float | No |
| provider_model_version | string (from API response) | No |
| user_id | anonymized hash | Yes (hash only, no raw ID) |
Sampling rules
- Tier 1 metadata: 100% of all calls
- Tier 2 sanitized structure: 100% of all calls
- Tier 3 full content: 100% of errors/timeouts + 2% random success sample
- Quality eval (LLM-as-judge): 5% sample for SOAP and prior auth; 10% for ICD-10 (higher code-accuracy risk, no accuracy baseline yet)
Cost & Latency Monitoring
Cost monitoring
| Metric | Current baseline | Alert threshold | Dashboard cadence |
|---|---|---|---|
| Daily cost — SOAP (GPT-4o) | ~$320/day (estimate from token logs) | > $480/day (1.5x) | Real-time |
| Daily cost — Prior auth (GPT-4o) | ~$95/day | > $150/day | Real-time |
| Daily cost — ICD-10 (GPT-3.5) | ~$12/day | > $25/day | Daily |
| Cost per SOAP note | ~$0.10 | > $0.20 | Weekly trend |
| Monthly projection | ~$12,600/mo | > $18,000/mo | Weekly |
| Token efficiency (SOAP) | Baseline TBD week 1 | > 20% increase in avg input tokens without quality gain | Weekly |
On the 340% cost spike: Hypothesis is prompt template bloat — input tokens grew without output quality improving. Token efficiency metric will confirm. Check prompt_hash history once Tier 1 logging is live.
Latency targets
| Metric | Target | Alert |
|---|---|---|
| SOAP — p50 | < 5s | > 7.5s |
| SOAP — p95 | < 12s | > 18s |
| Prior auth — p50 | < 7s | > 10s |
| ICD-10 — p50 | < 1.5s | > 2.5s |
| Time to first token (streaming, SOAP) | < 800ms | > 1.5s |
| Timeout rate (any feature) | < 0.5% | > 2% |
| Rate limit hit rate | < 0.1% | > 0.5% |
Drift & Regression Detection
| Signal | What to watch | Detection