Observability Plan

You need to plan or audit product observability for user behavior and outcomes.

Use this when you need to design, audit, or improve product-level observability — measuring what users do, how they experience the product, and whether they accomplish their goals. This covers event tracking, user journeys, task completion, funnel analysis, and product performance from the user's perspective. If you're looking to measure system health, uptime, or deployment reliability, use /instrumentation-plan instead.

The distinction: Observability answers "Are users successful?" Instrumentation answers "Is the system healthy?" Both are needed. Start here if you can't answer basic questions about how users interact with your product.

For AI-powered features: This skill covers product-level observability. If you need to monitor LLM-specific concerns (prompt quality, token costs, hallucination drift, model regression), use /llm-observability-plan -- it covers the AI-specific layer between product analytics and system instrumentation.

Process

Step 1: Gather context

Ask the user to provide:

  1. Product description — what does this product do? Who are the users? What are the primary use cases?
  2. Current analytics — what's already tracked? (existing analytics tools, dashboards, event logs)
  3. Key user journeys — the 3-5 most important paths a user takes (e.g., signup → first value, search → purchase, onboard → habit)
  4. Business goals — what does success look like? (activation rate, retention, revenue, engagement)
  5. Known blind spots — what questions about user behavior can't be answered today?
  6. Analytics stack — tools in use or under consideration (Amplitude, Mixpanel, PostHog, GA4, custom, etc.)

If the user doesn't have all of this, work with what's available. Flag gaps as assumptions.

Step 2: Define the event taxonomy

A clean event taxonomy is the foundation of product observability. Design it before implementing anything.

Event naming convention:

Use a consistent Object Action or object_action pattern:

| Pattern | Example | Anti-pattern |
| --- | --- | --- |
| Object Action | Button Clicked, Page Viewed, Form Submitted | click, pageview, submit |
| Namespace prefix | Onboarding Step Completed, Search Query Executed | step_done, searched |
| Past tense for completed actions | Account Created, Item Purchased | creating_account, buying |
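
A naming convention is easier to keep if it is checked automatically. The sketch below is one possible validator for the Title Case "Object Action" form; the function name `is_valid_event_name` and the `ALLOWED_ACTIONS` verb list are illustrative assumptions, not part of any analytics tool.

```python
import re

# One or more Title Case object words followed by an action word,
# separated by single spaces, e.g. "Onboarding Step Completed".
_TITLE_WORDS = re.compile(r"^[A-Z][A-Za-z0-9]*( [A-Z][A-Za-z0-9]*)+$")

# Hypothetical allow-list of past-tense action verbs; extend it per taxonomy.
ALLOWED_ACTIONS = {
    "Clicked", "Viewed", "Submitted", "Completed",
    "Executed", "Created", "Purchased",
}


def is_valid_event_name(name: str) -> bool:
    """Return True if `name` is Title Case and ends in an allowed past-tense verb."""
    if not _TITLE_WORDS.match(name):
        return False
    return name.split(" ")[-1] in ALLOWED_ACTIONS
```

A check like this could run in code review or CI so that names such as `click` or `creating_account` are rejected before they reach production data.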

Event categories:

| Category | Purpose | Examples |
| --- | --- | --- |
| Lifecycle | Track user progression | Account Created, Onboarding Completed, Subscription Started, Account Churned |
| Engagement | Track feature usage | Feature Used, Content Viewed, Search Executed, Export Generated |
| Conversion | Track goal completion | Trial Started, Purchase Completed, Upgrade Initiated |
| Navigation | Track movement patterns | Page Viewed, Tab Switched, Navigation Clicked |
| Error / Friction | Track failure points | Error Displayed, Form Validation Failed, Timeout Experienced |
| Security / Anomaly | Track security-relevant behavior for baseline building | Permission Elevated, Unusual Access Pattern, Data Export Volume Exceeded, Off-Hours Activity, New Device Login |

Event properties (payload structure):

Every event should include:

| Property | Type | Purpose | Example |
| --- | --- | --- | --- |
| event_name | string | What happened | Button Clicked |
| timestamp | ISO 8601 | When it happened | 2026-03-05T14:30:00Z |
| user_id | string | Who did it (anonymized if needed) | usr_abc123 |
| session_id | string | Session grouping | sess_xyz789 |
| page / screen | string | Where it happened | /dashboard |
| properties | object | Context-specific details | { button_name: "Export", format: "CSV" } |
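
The payload structure above could be represented as a small typed record. This is a minimal sketch using only the Python standard library; the `Event` class and its `to_payload` method are hypothetical names, and a real analytics SDK will have its own capture call.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


def _utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string with a trailing 'Z'."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class Event:
    event_name: str                 # what happened
    user_id: str                    # who did it (anonymized if needed)
    session_id: str                 # session grouping
    page: str                       # where it happened
    properties: dict[str, Any] = field(default_factory=dict)  # context details
    timestamp: str = field(default_factory=_utc_now_iso)      # when it happened

    def to_payload(self) -> dict[str, Any]:
        """Return a plain dict ready to serialize as JSON."""
        return asdict(self)


event = Event(
    event_name="Button Clicked",
    user_id="usr_abc123",
    session_id="sess_xyz789",
    page="/dashboard",
    properties={"button_name": "Export", "format": "CSV"},
)
payload = event.to_payload()
```

Keeping the required fields as constructor arguments means an event cannot be built without answering who, what, where, and when.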

Taxonomy design rules:

  • Decide on a naming convention before your first event — retrofitting is expensive
  • Every event must answer: Who did what, where, when, and with what context?
  • Limit property cardinality — a property with 10,000 unique values is hard to analyze
  • Version your taxonomy — when you rename or restructure events, document the change
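
The cardinality rule can be audited against a sample of events. The following sketch counts distinct values per property and reports those above a limit; the function name and the default limit of 1,000 are assumptions chosen for illustration, so pick a limit that suits the analytics tool in use.

```python
from collections import defaultdict


def high_cardinality_properties(events, max_unique=1000):
    """Return {property_name: distinct_value_count} for properties above max_unique.

    `events` is an iterable of dicts, each with an optional "properties" dict.
    """
    seen = defaultdict(set)
    for event in events:
        for key, value in event.get("properties", {}).items():
            # repr() lets unhashable values (lists, dicts) be counted too.
            seen[key].add(repr(value))
    return {key: len(values) for key, values in seen.items() if len(values) > max_unique}
```

Properties that show up in the result (free-text search queries, raw URLs with IDs) are candidates for bucketing or for moving into a separate, less frequently queried field.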

Present the taxonomy as a table:

| Event Name | Category | Trigger | Key Properties | Priority |
| --- | --- | --- | --- | --- |
| (Page Viewed) | Navigation | Any page load | page_path, referrer, load_time_ms | P0 |
| (Feature X Used) | Engagement | User completes action X | feature_name, input_type, result_count | P0 |
| (Signup Completed) | Lifecycle | Registration finishes | signup_method, referral_source | P0 |

Step 3: Map user journeys and funnels

For each key user journey, define the funnel:

Funnel template:

| Step | Event | Success Criteria | Expected Drop-off | Alert If |
| --- | --- | --- | --- | --- |
| 1. (Entry) | Page Viewed (landing) | User arrives |  | Traffic < (threshold) |
| 2. (Engagement) | Feature Explored | User interacts | 40-60% drop expected | Drop > 70% |
| 3. (Activation) | First Value Achieved | User gets value | 20-40% drop expected | Drop > 50% |
| 4. (Conversion) | Goal Completed | User converts | 10-30% drop expected | Drop > 40% |

For each funnel:

  1. Name the journey — e.g., "New user → first value" or "Search → purchase"
  2. Define the steps — specific events that mark progression
  3. Set baseline expectations — what's a healthy drop-off rate at each step?
  4. Define alert thresholds — when does drop-off signal a problem vs. normal behavior?
  5. Identify branch points — where do users take alternate paths? Are those paths tracked?
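
To make the funnel definition concrete, here is a rough sketch of how step-level drop-off and alerts might be computed from a chronological event stream. It assumes a strict ordered funnel (a user counts at a step only after reaching the previous one) and drop-off measured relative to the previous step; `funnel_report` and its argument shapes are hypothetical, and most analytics tools provide an equivalent built-in.

```python
def funnel_report(events, steps, alert_thresholds):
    """Compute users and drop-off per funnel step.

    events: iterable of (user_id, event_name) tuples in chronological order.
    steps: ordered list of event names that define the funnel.
    alert_thresholds: {event_name: max acceptable drop-off fraction}.
    """
    reached = [set() for _ in steps]
    for user_id, event_name in events:
        for i, step in enumerate(steps):
            # A user enters step i only if they already reached step i - 1.
            if event_name == step and (i == 0 or user_id in reached[i - 1]):
                reached[i].add(user_id)

    report = []
    for i, step in enumerate(steps):
        users = len(reached[i])
        if i == 0 or not reached[i - 1]:
            drop = None  # no previous step to compare against
        else:
            drop = 1 - users / len(reached[i - 1])
        threshold = alert_thresholds.get(step)
        alert = drop is not None and threshold is not None and drop > threshold
        report.append({"step": step, "users": users, "drop_off": drop, "alert": alert})
    return report
```

Called with the template's thresholds (for example `{"Feature Explored": 0.70, "First Value Achieved": 0.50, "Goal Completed": 0.40}`), the report flags any step whose drop-off exceeds its "Alert If" value.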

Step 4: Task completion and timing

Beyond funnels, measure whether users accomplish what they came to do:

Task completion framework:

| Task | Start Event | End Event | Success Criteria | Time Target | Measure |
| --- | --- | --- | --- | --- | --- |
| (Complete onboarding) | Onboarding Started | Onboarding Completed | All steps finished | < 5 minutes | Completion rate, median time |
| (Find and use a feature) | Search Executed | Feature Used | User finds what they need | < 30 seconds | Success rate, time to result |
| (Submit a form) | Form Opened | Form Submitted | Valid submission | < 2 minutes | Completion rate, error rate, abandonment point |

Timing metrics to capture:

  • Time to first value — how long from signup/entry to the first meaningful outcome?
  • Time on task — how long does a specific workflow take?
  • Time between sessions — how frequently do users return?
  • Perceived performance — Core Web Vitals (LCP, INP, CLS) as user-facing performance signals; INP replaced FID as a Core Web Vital in March 2024
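
Completion rate and time on task can both be derived from paired start and end events. The sketch below assumes events arrive as `(user_id, event_name, iso_timestamp)` tuples sorted by time and that each user is measured on their first start only; `task_metrics` is an illustrative helper, not a library function.

```python
from datetime import datetime
from statistics import median


def task_metrics(events, start_event, end_event):
    """Return completion rate and median time-on-task in seconds.

    events: iterable of (user_id, event_name, iso_timestamp), sorted by time.
    """
    started = {}    # user_id -> first start time
    completed = {}  # user_id -> seconds from first start to first completion
    for user_id, name, ts in events:
        when = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if name == start_event:
            started.setdefault(user_id, when)
        elif name == end_event and user_id in started and user_id not in completed:
            completed[user_id] = (when - started[user_id]).total_seconds()

    return {
        "started": len(started),
        "completed": len(completed),
        "completion_rate": len(completed) / len(started) if started else None,
        "median_seconds": median(completed.values()) if completed else None,
    }
```

Comparing `median_seconds` against the Time Target column shows whether a task meets its target for the typical user; users who started but never completed indicate where to look for abandonment.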

Behavioral baseline signals (optional -- include when the product feeds security or anomaly detection):

If the organization uses AI-powered security tools (UEBA, anomaly detection), product observability events serve double duty -- they're both product analytics and security intelligence. Consider tracking these behavioral patterns as part of the event taxonomy:

| Signal | What it establishes | Anomaly example |
| --- | --- | --- |
| Access frequency per user | Normal usage cadence | Sudden spike or off-pattern access |
| Typical session duration | Expected engagement length | Unusually long or short sessions |
| Normal data access volume | Baseline download/export behavior | Bulk data export outside normal range |
| Geographic consistency | Expected access locations | Login from new region or impossible travel |
| Feature access patterns | Which features a user typically uses | Sudden access to admin or sensitive features |

These signals feed /telemetry-readiness-audit assessments and enable AI security tools to build meaningful behavioral baselines from the same instrumentation effort.
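
As a simple illustration of what "baseline" means here, the sketch below flags a value that sits far from a user's own history using a z-score. This is a deliberately basic stand-in chosen for clarity, not how any particular UEBA product works; the function name and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Return True if `current` deviates from `history` by more than z_threshold
    sample standard deviations. Needs at least two historical points."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is off-baseline
    return abs(current - mu) / sigma > z_threshold


# Hypothetical daily export record counts for one user.
daily_exports = [10, 12, 11, 9, 10, 13, 11]
```

With this history, `is_anomalous(daily_exports, 500)` is true while `is_anomalous(daily_exports, 12)` is not, which mirrors the "bulk data export outside normal range" example in the table.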

Step 5: Feature adoption and retention signals

Track whether features are actually used and whether usage sticks:

Adoption metrics:

| Metric | Formula | What it tells you |
| --- | --- | --- |
| Feature adoption rate | Users who used feature / total active users | Is the feature discoverable? |
| Activation rate | Users who completed key action / users who signed up | Are users getting value? |
| Breadth of use | # of features used per user per session | Are users exploring or stuck? |
| Depth of use | Frequency of feature use per user per week | Is usage habitual or one-time? |
| Retention (D1/D7/D30) | Users returning on day N / users who started on day 0 | Does the product stick? |
| Stickiness (DAU/MAU) | Daily active users / monthly active users | How often do users come back? |
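
The ratio formulas in the table translate directly into code. The helpers below are a minimal sketch with hypothetical names; they take counts or sets of user IDs that would normally come from the analytics tool's query layer.

```python
def ratio(numerator, denominator):
    """Safe division: returns None when the denominator is zero."""
    return numerator / denominator if denominator else None


def feature_adoption_rate(feature_users, active_users):
    """Users who used the feature / total active users."""
    return ratio(feature_users, active_users)


def retention(day0_users, day_n_users):
    """Share of the day-0 cohort (a set of user IDs) seen again on day N."""
    return ratio(len(day0_users & day_n_users), len(day0_users))


def stickiness(dau, mau):
    """DAU / MAU. DAU is commonly averaged over the same month as MAU."""
    return ratio(dau, mau)
```

For example, a day-0 cohort of four users of whom two return on day 7 gives a D7 retention of 0.5; users active on day 7 who were not in the cohort are ignored by the set intersection.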

Cohort analysis guidance:

  • Always segment by acquisition cohort (week or month of first use)
  • Compare feature adoption across cohorts to detect trends
  • Separate new users from power users in adoption metrics — they have different baselines

Step 6: Experiment instrumentation

If the team runs A/B tests or experiments, ensure the observability layer supports them:

Experiment tracking requirements:

  • Every user session tagged with active experiment variants
  • Experiment assignment logged as an event (Experiment Assigned with experiment_name, variant, user_id)
  • Primary and secondary metrics defined before the experiment starts
  • Sample size and duration calculated before launch (not after)
  • Guardrail metrics defined — metrics that must not degrade (e.g., page load time, error rate)

Experiment event structure:

| Event | Properties | When |
| --- | --- | --- |
| Experiment Assigned | experiment_name, variant, assignment_method | User enters experiment |
| Experiment Exposed | experiment_name, variant, exposure_context | User sees the variant |
| Experiment Goal Reached | experiment_name, variant, goal_name, goal_value | User hits primary metric |
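
One common way to keep assignment stable across sessions is to derive the variant from a hash of the experiment name and user ID. The sketch below shows that idea and the matching Experiment Assigned event; the function names and the `"sha256_hash"` value for `assignment_method` are assumptions, and teams using an experimentation platform should rely on its assignment logic instead.

```python
import hashlib


def assign_variant(user_id, experiment_name, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


def assignment_event(user_id, experiment_name, variants=("control", "treatment")):
    """Build the Experiment Assigned event for a user."""
    return {
        "event_name": "Experiment Assigned",
        "user_id": user_id,
        "properties": {
            "experiment_name": experiment_name,
            "variant": assign_variant(user_id, experiment_name, variants),
            "assignment_method": "sha256_hash",  # hypothetical label
        },
    }
```

Because the experiment name is part of the hash key, the same user can land in different variants across experiments while always getting the same variant within one.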

Step 7: Generate the observability plan

Compile everything into a single document:


Observability Plan — (Project name)

Generated: (date)
Product: (brief description)
Current state: (summary of what's tracked today)

Event Taxonomy

(Table from Step 2 — event names, categories, triggers, properties, priority)

User Journey Funnels

(Funnel definitions from Step 3 — one per key journey)

Task Completion Metrics

(Table from Step 4 — tasks, events, targets, timing)

Feature Adoption & Retention

(Metrics from Step 5 — adoption rate, activation, retention cohorts)

Experiment Instrumentation

(Structure from Step 6 — if applicable; omit if team doesn't run experiments yet)

Implementation Checklist

Priority-ordered list of what to implement next:

  • (P0) (Most critical gap — e.g., "No event tracking exists; implement page views and core action events")
  • (P0) (Second critical gap — e.g., "Signup funnel has no step-level tracking")
  • (P1) (Important but not urgent — e.g., "Add timing instrumentation to onboarding flow")
  • (P1) (Next important item)
  • (P2) (Nice to have — e.g., "Implement breadth-of-use metric across feature set")

Data Governance Notes

  • PII handling: (what user data is collected, how it's anonymized or consented)
  • Retention policy: (how long event data is kept)
  • Access: (who can see raw events vs. aggregated dashboards)

Open Questions

(Anything that couldn't be resolved without more information)


Step 8: Review and refine

Ask the user:

  • Does the event taxonomy cover the questions you need to answer about user behavior?
  • Are the funnels measuring the right steps? Any steps missing or too granular?
  • Are the task completion targets realistic based on what you know about user behavior?
  • Is the implementation checklist ordered correctly for your current priorities?
  • Are there privacy or compliance constraints that affect what can be tracked?
  • Do you need experiment instrumentation now, or is that a future concern?

Adjust based on feedback.

Related skills

  • /instrumentation-plan — plan SRE instrumentation for system health, uptime, and deployment reliability
  • /experiment-design — design a specific experiment to validate a product hypothesis
  • /research-synthesize — synthesize qualitative research alongside quantitative observability data

Output location

Present the plan as formatted text in the conversation. The user can copy it into their product wiki, analytics documentation, or team shared drive.

Example Output

Input

  • Product description: Capsule is a B2B SaaS product that helps HR teams build and manage employee onboarding programs — creating task checklists, sending automated welcome sequences, and tracking new hire progress through their first 90 days
  • Current analytics: Google Analytics 4 (page views only), no event tracking beyond that; no funnel visibility; onboarding completion rates estimated manually from Salesforce notes
  • Key user journeys: (1) HR admin signup → first program published, (2) New hire receives invite → completes onboarding checklist, (3) HR admin reviews progress dashboard → takes action on at-risk employee
  • Business goals: Improve activation rate (HR admin publishes first program within 7 days of signup), improve new hire completion rate (target 80%), reduce churn among accounts where no program is published within 14 days
  • Known blind spots: No visibility into where HR admins abandon the program builder, no data on which checklist steps new hires skip most, no session-level timing, can't tell if progress dashboard is actually used
  • Analytics stack: Migrating to PostHog (self-hosted); Salesforce for CRM; Intercom for in-app messaging; no data warehouse yet

Output (abbreviated)

Observability Plan — Capsule HR Onboarding

Generated: 2025-07-14
Product: B2B SaaS for HR teams that builds and tracks employee onboarding programs through 90-day new hire journeys
Current state: GA4 page view tracking only. No event-level visibility. Activation, funnel, and task completion data are estimated manually. PostHog migration is in progress, so this plan targets PostHog.


Event Taxonomy

| Event Name | Category | Trigger | Key Properties | Priority |
| --- | --- | --- | --- | --- |
| Page Viewed | Navigation | Any page load | page_path, referrer, load_time_ms, user_role | P0 |
| Account Created | Lifecycle | HR admin completes registration | signup_method, company_size, referral_source | P0 |
| Program Builder Opened | Engagement | Admin clicks "Create Program" | entry_point, template_used | P0 |
| Program Step Added | Engagement | Admin adds a task to program | step_type, step_index, program_id | P0 |
| Program Published | Conversion | Admin clicks "Publish" | program_id, step_count, time_to_publish_days, template_used | P0 |
| New Hire Invited | Lifecycle | Admin sends onboarding invite | program_id, invite_method, days_before_start_date | P0 |
| Onboarding Checklist Opened | Engagement | New hire opens their checklist | program_id, device_type, hours_since_invite | P0 |
| Checklist Step Completed | Engagement | New hire marks a step done | step_id, step_type, step_index, program_id, completion_method | P0 |
| Checklist Step Skipped | Error / Friction | New hire skips or bypasses a step | step_id, step_type, step_index, skip_reason | P0 |
| Onboarding Completed | Lifecycle | All required steps finished | program_id, total_steps, days_to_complete, skip_count | P0 |
| Progress Dashboard Viewed | Engagement | Admin opens new hire progress view | new_hire_count, at_risk_count, view_depth_seconds | P1 |
| At-Risk Employee Actioned | Conversion | Admin sends nudge or reassigns step | action_type, days_since_last_hire_activity, program_id | P1 |
| Program Builder Abandoned | Error / Friction | Admin exits builder without publishing (session ends) | last_step_reached, steps_added, time_in_builder_minutes | P1 |
| Form Validation Failed | Error / Friction | Inline error shown to user | form_name, field_name, error_type, user_role | P1 |
| Experiment Assigned | Lifecycle | User enters A/B test | experiment_name, variant, user_role | P1 |
| Account Churned | Lifecycle | Subscription cancelled or not renewed | tenure_days, programs_published, last_active_date | P1 |
| Bulk Export Generated | Security / Anomaly | Admin exports new hire data | record_count, export_format, time_of_day | P2 |
| Permission Elevated | Security / Anomaly | User role changed to admin | changed_by, previous_role, account_id | P2 |

User Journey Funnels

Journey 1: HR Admin Signup → First Program Published (Activation)

| Step | Event | Success Criteria | Expected Drop-off | Alert If |
| --- | --- | --- | --- | --- |
| 1. Signup | Account Created | Admin registers |  | Volume > 20% below 7-day avg |
| 2. Builder Entry | Program Builder Opened | Admin starts building within 7 days | 20–30% drop | Drop > 45% |
| 3. Content Added | Program Step Added (3+ events) | Admin adds at least 3 steps | 20–30% drop | Drop > 40% |
| 4. Published | Program Published | Admin publishes first program | 25–35% drop | Drop > 50% |

Target activation rate: ≥ 55% of signups publish a program within 7 days
Critical blind spot addressed: Program Builder Abandoned event reveals where admins stall; step count and time in builder pinpoint the friction.
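
A quick arithmetic check can confirm whether per-step expectations and the end-to-end target agree. The sketch below multiplies the step-level pass-through rates for Journey 1, assuming each expected drop-off is measured relative to the previous step; `end_to_end_conversion` is an illustrative helper.

```python
from math import prod


def end_to_end_conversion(drop_offs):
    """Overall conversion implied by a list of per-step drop-off fractions."""
    return prod(1 - d for d in drop_offs)


# Expected drop-offs for steps 2-4 of Journey 1 (low and high ends of each range).
best_case = end_to_end_conversion([0.20, 0.20, 0.25])
worst_case = end_to_end_conversion([0.30, 0.30, 0.35])
target = 0.55
```

Under that assumption the ranges imply roughly 32% to 48% of signups publishing a program, which is below the 55% target. Either the target is a stretch goal relative to the expected drop-offs, or the expected ranges should be tightened once real baseline data exists.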


Journey 2: New Hire → Onboarding Completed

| Step | Event | Success Criteria | Expected Drop-off | Alert If |
| --- | --- | --- | --- | --- |
| 1. Invited | New Hire Invited | Invite delivered |  | Delivery failure rate > 5% |
| 2. Checklist Opened | Onboarding Checklist Opened | New hire opens within 48 hrs | 10–20% drop | Drop > 35% |
| 3. First Step Completed | Checklist Step Completed (step_index = 1) | Any first action taken | 15–25% drop | Drop > 40% |
| 4. Halfway | Checklist Step Completed (step_index = 50% of total) | Sustained progress | 15–25% drop | Drop > 35% |
| 5. Completed | Onboarding Completed | All required steps done | 10–20% drop | Completion rate < 70% |

Target completion rate: ≥ 80% of invited new hires
Note: Checklist Step Skipped by step_index will reveal which specific tasks block completion; this is Capsule's most actionable unknown today.


Journey 3: HR Admin → Progress Dashboard → Action Taken

| Step | Event | Success Criteria | Expected Drop-off | Alert If |
| --- | --- | --- | --- | --- |
| 1. Dashboard Opened | Progress Dashboard Viewed | Admin views dashboard |  | Less than 40% of active accounts/week |
| 2. At-Risk Identified | Dashboard view with at_risk_count > 0 | Admin sees a flagged hire | Varies |  |
| 3. Action Taken | At-Risk Employee Actioned | Admin responds within 48 hrs | 40–60% drop | Action rate < 25% on at-risk accounts |

Task Completion Metrics

| Task | Start Event | End Event | Success Criteria | Time Target | Measure |
| --- | --- | --- | --- | --- | --- |
| Publish first program | Program Builder Opened | Program Published | Program has ≥ 3 steps | < 20 minutes | Completion rate, median time, abandonment step |
| New hire completes onboarding | Onboarding Checklist Opened | Onboarding Completed | All required steps done | < 30 days | Completion rate, skip