AI Health Indicator

A diagnostic framework for catching AI quality problems before your users do

Most teams shipping AI features have no structured way to answer "is this actually working?" They rely on user complaints, gut feel, or the absence of obvious failures. That is not a quality strategy. That is hope.

The AI Health Indicator (AHI) is a diagnostic framework built around six dimensions that matter for production AI systems. It gives you a score, a risk map, and a clear picture of where to invest next.

The CARATS framework

CARATS stands for Consistency, Accuracy, Reliability, Alignment, Tone, and Security. Each dimension captures a different way AI systems can fail silently.

| Dimension | What it measures | Silent failure mode |
|---|---|---|
| Consistency | Same input produces similar output across runs | Outputs drift without anyone noticing |
| Accuracy | Outputs are factually correct and complete | Confident wrong answers that look right |
| Reliability | System performs under load, over time, across user segments | Works in demo, degrades in production |
| Alignment | Outputs match what users actually need | Technically correct but practically useless |
| Tone | Communication style fits the audience and context | Medical assistant sounds like a chatbot |
| Security | Protected from injection, leakage, adversarial manipulation | Prompt injection exposes system instructions |
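
As one illustration of how a dimension could be checked in practice, the sketch below estimates Consistency by sending the same input several times and comparing the outputs pairwise. It is a minimal example, not part of the framework itself: the function name `consistency_score` and the sample outputs are hypothetical, and character-level similarity from Python's standard `difflib` is only a crude proxy for "similar output" (a semantic comparison would likely work better on real responses).

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) across outputs for the same input."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical outputs collected by sending the same prompt three times.
stable = ["Refunds take 5 business days."] * 3
drifting = [
    "Refunds take 5 business days.",
    "We do not offer refunds.",
    "Contact support for refund timing.",
]

print(f"stable:   {consistency_score(stable):.2f}")
print(f"drifting: {consistency_score(drifting):.2f}")
```

Running a check like this on a fixed set of prompts at regular intervals is one way to notice drift that would otherwise go unseen.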

The framework is built from patterns across 30+ AI engagements. One theme kept emerging: teams were measuring whether the AI was running, but not whether it was working.

How to use it

Run the assessment

The AI Health Check tool walks you through 10 questions covering all six dimensions plus structural health factors (evaluation maturity, context discipline, experimentation rigor). It takes about 5 minutes and produces a scored breakdown with risk areas highlighted.

Read the scores

Each dimension scores on a 1-5 scale:

  • 4.0+: Healthy. You have practices in place and they're working.
  • 3.5 to just under 4.0: At risk. You have some practices but gaps are showing.
  • Below 3.5: Needs attention. This dimension is a liability.

Your overall score is the average, but the real value is the per-dimension breakdown. A team scoring 4.5 on Consistency but 2.0 on Security has a very different action plan than one scoring 3.0 across the board.
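
The scoring rules above can be sketched in a few lines. This is an illustrative reading of the bands, not the Health Check tool's actual implementation: the function names and the two sets of team scores are hypothetical. Team A's scores are chosen so both teams share the same overall average.

```python
from statistics import mean

# Thresholds taken from the bands described above; the names are illustrative.
HEALTHY = 4.0
AT_RISK = 3.5


def band(score: float) -> str:
    """Map a 1-5 dimension score to its health band."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    if score >= HEALTHY:
        return "healthy"
    if score >= AT_RISK:
        return "at risk"
    return "needs attention"


def summarize(scores: dict[str, float]) -> dict:
    """Return the overall average, per-dimension bands, and gaps (weakest first)."""
    gaps = sorted(
        (name for name, s in scores.items() if s < HEALTHY),
        key=lambda name: scores[name],
    )
    return {
        "overall": round(mean(scores.values()), 2),
        "bands": {name: band(s) for name, s in scores.items()},
        "gaps": gaps,
    }


# Hypothetical scores for illustration only.
team_a = {"Consistency": 4.5, "Accuracy": 3.0, "Reliability": 3.0,
          "Alignment": 3.0, "Tone": 2.5, "Security": 2.0}
team_b = {name: 3.0 for name in team_a}

print(summarize(team_a)["overall"], summarize(team_a)["gaps"][:2])
print(summarize(team_b)["overall"], summarize(team_b)["gaps"][:2])
```

Both teams average 3.0, yet Team A's weakest dimension is Security while Team B has no single outlier, which is why the per-dimension breakdown drives the action plan.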

Act on the gaps

For each dimension below healthy, the framework points to specific practices.

Beyond CARATS: structural health

CARATS measures output quality. But output quality depends on structural factors:

Evaluation maturity asks whether your team writes evals before building features. If you only check quality after launch, you're doing quality assurance. If you define expected behavior before implementation, you're doing eval-driven development. The difference mirrors the one between test-driven development (TDD) and manual testing.
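
To make the eval-first idea concrete, here is a minimal sketch in which the eval cases are written before the feature exists and then run against a placeholder. Everything in it is an assumption for illustration: the `EvalCase` structure, the phrase-matching check, the refund-policy cases, and `placeholder_feature`, which stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_include: list[str]      # phrases the answer has to contain
    must_not_include: list[str]  # phrases that indicate a failure


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases that the generate function passes."""
    passed = 0
    for case in cases:
        answer = generate(case.prompt).lower()
        has_required = all(p.lower() in answer for p in case.must_include)
        has_forbidden = any(p.lower() in answer for p in case.must_not_include)
        if has_required and not has_forbidden:
            passed += 1
    return passed / len(cases)


# Written before the feature exists: these cases define what "working" means.
CASES = [
    EvalCase("What is your refund window?", ["30 days"], ["no refunds"]),
    EvalCase("Can I get a refund after 60 days?", ["not eligible"], []),
]


def placeholder_feature(prompt: str) -> str:
    # Stand-in for the real model call, which is not implemented yet.
    return "Refunds are available within 30 days of purchase."


print(f"pass rate: {run_evals(placeholder_feature, CASES):.0%}")
```

The placeholder passes the first case and fails the second, so the failing eval shows what remains to be built, much as a failing unit test does in TDD.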

Context discipline asks whether your AI agents get the right information at the right time. Poor context management causes drift, hallucination, and inconsistency. Teams with strong context discipline use structured knowledge documents, scoped tool access, and explicit context boundaries.

Bounded autonomy asks whether there are clear lines between what the AI can do alone and what requires human review. Most failures come from AI systems operating outside their competence boundary without anyone knowing.
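
One way to make that line explicit is a small routing gate that sits in front of every action the AI proposes. The sketch below is hypothetical: the action names, the confidence floor, and the assumption that the system reports a confidence value at all are illustrative choices, not something the framework prescribes.

```python
from dataclasses import dataclass

# Hypothetical policy: the action names and confidence floor are illustrative.
AUTONOMOUS_ACTIONS = {"answer_faq", "summarize_ticket"}
CONFIDENCE_FLOOR = 0.8


@dataclass
class ProposedAction:
    name: str
    confidence: float  # assumed to be supplied by the system, 0.0 to 1.0


def route(action: ProposedAction) -> str:
    """Return 'auto' if the AI may act alone, otherwise 'human_review'."""
    if action.name not in AUTONOMOUS_ACTIONS:
        return "human_review"  # outside the allow-list: always escalate
    if action.confidence < CONFIDENCE_FLOOR:
        return "human_review"  # allowed action, but the system is unsure
    return "auto"


for proposed in [
    ProposedAction("answer_faq", 0.95),
    ProposedAction("answer_faq", 0.50),
    ProposedAction("issue_refund", 0.99),
]:
    print(proposed.name, proposed.confidence, "->", route(proposed))
```

The allow-list defaults to escalation, so any action nobody has explicitly approved goes to a human, however confident the system claims to be.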

When to run it

  • At kickoff - baseline before building
  • Monthly - track trends, catch drift
  • When something feels off - structured diagnosis instead of guessing
  • Before a launch - confirm readiness across all dimensions

The assessment is lightweight enough to run regularly. The value compounds as you track scores over time and can see whether your investments are moving the needle.
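
Tracking scores over time can be as simple as comparing the two most recent assessments per dimension. The sketch below assumes each run is stored as a mapping from dimension name to score; the `flag_regressions` name, the 0.5-point threshold, and the sample history are hypothetical.

```python
def flag_regressions(
    history: list[dict[str, float]], min_drop: float = 0.5
) -> dict[str, float]:
    """Return dimensions whose score fell by at least min_drop between the last two runs.

    Values are the signed change (latest minus previous), so drops are negative.
    """
    if len(history) < 2:
        return {}
    previous, latest = history[-2], history[-1]
    return {
        name: round(latest[name] - previous[name], 2)
        for name in latest
        if name in previous and previous[name] - latest[name] >= min_drop
    }


# Hypothetical monthly results, oldest first.
history = [
    {"Consistency": 4.2, "Security": 3.8},
    {"Consistency": 4.3, "Security": 3.0},
]

print(flag_regressions(history))
```

Here Security is flagged and Consistency is not, so the diagnosis can start with Security.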
