Metrics

Every team I work with has too many metrics and too little signal. Dashboards with 40 charts that nobody opens. Weekly reports that get skimmed and filed. Quarterly reviews where everyone argues about which number matters. The problem is never "we don't measure enough." The problem is that measurement isn't connected to decisions.

The metrics hierarchy

Good metrics practice follows a hierarchy. Each level serves a different audience and cadence:

Level	Question it answers	Cadence	Audience
North Star	Are we winning?	Quarterly	Leadership
Health metrics	Is the system working?	Weekly	Product + Engineering
Diagnostic metrics	What's broken and why?	Daily/on-demand	Engineering + Support
Experiment metrics	Did this change work?	Per experiment	Product

Most teams mix these levels. They put diagnostic metrics in the quarterly review and north star metrics in the daily standup. The result is noise. Match the metric to the decision it's supposed to inform.
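One way to keep the levels from mixing is to tag every metric with its level and check each review's agenda against it. The sketch below is illustrative only: the forum names and the metric catalogue are hypothetical placeholders, not part of any particular tool.

```python
from enum import Enum


class Level(Enum):
    NORTH_STAR = "north star"
    HEALTH = "health"
    DIAGNOSTIC = "diagnostic"
    EXPERIMENT = "experiment"


# Which level each forum is meant to review, following the table above.
# The forum names are assumptions made for this example.
FORUM_LEVEL = {
    "quarterly review": Level.NORTH_STAR,
    "weekly review": Level.HEALTH,
    "daily triage": Level.DIAGNOSTIC,
    "experiment readout": Level.EXPERIMENT,
}

# Hypothetical metric catalogue: every metric is tagged with exactly one level.
METRIC_LEVEL = {
    "net revenue retention": Level.NORTH_STAR,
    "activation rate": Level.HEALTH,
    "p95 API latency by endpoint": Level.DIAGNOSTIC,
}


def misplaced(forum: str, metrics: list[str]) -> list[str]:
    """Return the metrics on a forum's agenda that belong to a different level."""
    expected = FORUM_LEVEL[forum]
    return [m for m in metrics if METRIC_LEVEL[m] is not expected]


agenda = ["net revenue retention", "p95 API latency by endpoint"]
noise = misplaced("quarterly review", agenda)
```

Here `noise` contains only the latency metric, which belongs in daily triage rather than in the quarterly review.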

Choosing metrics that work

A useful metric passes three tests:

Actionable - if it moves, you know what to do about it. "Monthly active users" fails this test for most teams, because a change in it rarely points to a specific cause or fix. "Activation rate for users who complete onboarding" passes it.
Comparable - you can compare it across time periods, segments, or experiments. Absolute numbers are usually worse than rates or ratios.
Connected to outcomes - it ladders up to something the business cares about. If you can't draw a line from the metric to revenue, retention, or user value, it's a vanity metric. (See outcomes over output for why this distinction matters.)
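To make the "comparable" test concrete, here is a minimal sketch of why a rate travels across segments better than an absolute count. The segment names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentCounts:
    name: str
    completed_onboarding: int  # users who finished onboarding
    activated: int             # of those, users who reached the activation event


def activation_rate(seg: SegmentCounts) -> float:
    """Share of onboarded users who activated; 0.0 when nobody onboarded."""
    if seg.completed_onboarding == 0:
        return 0.0
    return seg.activated / seg.completed_onboarding


# Made-up numbers: the larger segment wins on raw count but loses on rate.
segments = [
    SegmentCounts("self-serve", completed_onboarding=2000, activated=600),
    SegmentCounts("sales-led", completed_onboarding=200, activated=120),
]

rates = {s.name: activation_rate(s) for s in segments}
best_by_count = max(segments, key=lambda s: s.activated).name
best_by_rate = max(segments, key=activation_rate).name
```

By raw count the self-serve segment looks stronger (600 activated users against 120), but its rate is 30% against 60% for sales-led. The rate is the number you can compare across segments of different sizes.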

Metrics for AI systems

AI products need measurement practices that traditional software doesn't. The failure modes are different: AI systems rarely fail by crashing; they degrade. Instead of throwing errors, they produce confidently wrong answers.

The AI Health Indicator framework provides six dimensions to measure (Consistency, Accuracy, Reliability, Alignment, Tone, Security). For each dimension, you need:

Baseline metrics - what does "normal" look like? Establish this before launch, not after the first incident.
Drift detection - is quality changing over time? Model updates, data distribution shifts, and user behavior changes can all degrade output quality silently.
Eval pass rates - what percentage of outputs meet your quality criteria? This is the closest AI equivalent of a test suite's pass rate. Track it continuously.
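The three pieces fit together: the baseline is a pass rate measured before launch, and drift detection asks whether a recent window of pass/fail results has fallen meaningfully below it. A minimal sketch, assuming eval results are stored as booleans and using a two-proportion z-score as the drift signal. The threshold of -2.0 is an arbitrary starting point, not a recommendation.

```python
import math


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval results that passed."""
    if not results:
        raise ValueError("need at least one eval result")
    return sum(results) / len(results)


def drift_z_score(baseline: list[bool], recent: list[bool]) -> float:
    """Two-proportion z-score; negative means recent is worse than baseline."""
    n1, n2 = len(baseline), len(recent)
    p1, p2 = pass_rate(baseline), pass_rate(recent)
    pooled = (sum(baseline) + sum(recent)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0  # both samples are all-pass or all-fail: no measurable gap
    return (p2 - p1) / se


def has_drifted(baseline: list[bool], recent: list[bool],
                z_threshold: float = -2.0) -> bool:
    """Flag drift when the recent pass rate sits well below the baseline."""
    return drift_z_score(baseline, recent) <= z_threshold


# Made-up data: a 92% baseline, then one bad week (85%) and one normal week (91%).
baseline = [True] * 920 + [False] * 80
bad_week = [True] * 170 + [False] * 30
normal_week = [True] * 182 + [False] * 18

bad_flag = has_drifted(baseline, bad_week)
normal_flag = has_drifted(baseline, normal_week)
```

With these numbers the bad week is flagged and the normal week is not, even though both sit below the baseline. The point of using a statistical threshold rather than "any drop" is to avoid paging someone over sampling noise.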

The eval-metric bridge

Teams with strong traditional engineering backgrounds often ask, "How is this different from testing?" The key difference is that tests verify deterministic behavior (given X, expect Y), while evals verify stochastic behavior (given X, the output should be correct, consistent, and appropriate most of the time). You need both.

Prefer pass/fail classification over subjective scales. "Is this factually correct? Yes/No" is more useful than "Rate the quality 1-10." It's easier to track, easier to automate, and produces cleaner trend data.
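The difference from a unit test shows up in code as a loop and a threshold: an eval samples the same input several times, applies a yes/no check to each output, and passes only if enough samples pass. A minimal sketch follows. `fake_model` is a hypothetical stand-in for a real model call, and the sample count and 90% threshold are assumptions to tune per use case.

```python
import itertools
from typing import Callable


def eval_case(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              samples: int = 10,
              min_pass_rate: float = 0.9) -> tuple[float, bool]:
    """Sample the model repeatedly and apply a binary check to each output."""
    outcomes = [check(generate(prompt)) for _ in range(samples)]
    rate = sum(outcomes) / samples
    return rate, rate >= min_pass_rate


# Hypothetical stand-in for a model: right nine times out of ten.
_answers = itertools.cycle(["Paris"] * 9 + ["Lyon"])


def fake_model(prompt: str) -> str:
    return next(_answers)


def mentions_paris(output: str) -> bool:
    """Binary criterion: is the answer factually correct? Yes/No."""
    return "paris" in output.lower()


rate, passed = eval_case(fake_model, mentions_paris,
                         "What is the capital of France?")
```

Because each check returns a boolean, the per-case rates aggregate directly into the pass-rate trend line. A 1-10 score would need a rubric and a calibration step before it could be tracked the same way.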

Common anti-patterns

The dashboard graveyard. Team builds elaborate dashboards during a metrics initiative. Six months later, nobody opens them. Fix: tie every dashboard to a recurring meeting or decision point. If no meeting uses it, delete it.

Metric theater. Team reports metrics that always look good because they chose metrics that can't go down. Fix: include at least one metric that measures what you're bad at.

Measurement without action. "Awareness" metrics that nobody acts on. Fix: ask whether anyone would do anything differently if the metric dropped 20%. If not, stop tracking it.

AI-specific: testing outputs instead of outcomes. Measuring whether the AI produced a response (uptime) instead of whether the response was useful (quality). The first is table stakes. The second is the product.
