
AI Health Check

You need a structural health check of an AI system.

Use this to diagnose structural delivery risk in an AI-heavy engagement: at kickoff, mid-engagement, or when something feels off but you can't pinpoint why. The check scores the six CARATS dimensions, then evaluates structural factors including constraint design, evaluation maturity, context discipline, and stakeholder alignment.

Related skills: For deeper eval planning on a specific feature, use /ai-eval-design. For speccing a new AI feature, use /ai-product-spec.

Process

Step 1: Gather engagement context

Ask the assessor:

  1. What engagement is this for? (Client, project, duration so far.)
  2. What AI architecture mode is the team using? (Chatbot, AI Flow, AI App, Agentic, Cognitive — or "not sure.")
  3. What's prompting this assessment? (Kickoff baseline, something feels off, monthly check-in, escalation risk.)
  4. Who are the key stakeholders? (PM, engineering lead, client product lead, executives.)

Step 2: Evaluate the six CARATS dimensions

For each CARATS dimension, ask the assessor to describe the current state and score it:

C — Consistency

  • Do AI components produce similar outputs for similar inputs?
  • Are there known inconsistency problems?
  • Score: Red (no consistency checks) / Yellow (ad hoc) / Green (systematic evaluation)
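
One way to move Consistency from ad hoc to systematic is to replay paraphrases of the same request and measure how often the labels agree. The sketch below is illustrative only: `consistency_rate`, `toy_classifier`, and the sample inputs are hypothetical names and data, not part of the CARATS framework, and the stand-in classifier should be replaced by a call to the real AI component.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(classify: Callable[[str], str], variants: Sequence[str]) -> float:
    """Share of paraphrased inputs that receive the most common label."""
    if not variants:
        raise ValueError("need at least one input variant")
    labels = [classify(text) for text in variants]
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical stand-in for the AI component under assessment.
def toy_classifier(text: str) -> str:
    return "urgent" if "urgent" in text.lower() else "routine"


# Three phrasings of the same request; the third omits the keyword.
paraphrases = [
    "Urgent: patient needs MRI approval today",
    "Patient needs MRI approval today, this is urgent",
    "MRI approval needed today for patient",
]
rate = consistency_rate(toy_classifier, paraphrases)
print(f"consistency rate: {rate:.2f}")
```

A rate well below 1.0 on inputs that should be equivalent is concrete evidence for a Yellow or Red score, and tracking it over time is one form of systematic evaluation.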

A — Accuracy

  • Are outputs factually correct and complete?
  • How are accuracy issues detected — by users, by tests, or not at all?
  • Score: Red / Yellow / Green

R — Reliability

  • Does the system perform consistently over time and under load?
  • Are there known flaky behaviors or degradation patterns?
  • Score: Red / Yellow / Green

A — Alignment

  • Do outputs match the intended purpose and user expectations?
  • Are there gaps between what the AI does and what users expect?
  • Score: Red / Yellow / Green

T — Tone

  • Is the communication style appropriate for the audience and context?
  • Are there known tone issues (too formal, too casual, inconsistent)?
  • Score: Red / Yellow / Green

S — Security

  • Are outputs safe from injection, data leakage, and adversarial manipulation?
  • Have security boundaries been tested?
  • Has adversarial robustness been assessed? For example, can the AI be manipulated by poisoning input data, skewing behavioral baselines, or exploiting model confidence?
  • For AI that analyzes other systems (UEBA, anomaly detection, threat analysis): are the data sources feeding it complete and trustworthy, or could gaps in telemetry lead to confident but wrong conclusions?
  • Score: Red / Yellow / Green

Step 3: Assess structural health factors

Beyond CARATS, evaluate these structural indicators:

Evaluation maturity:

  • Are there application-specific evals? (Not just "does it work" but binary pass/fail tests for known failure modes.)
  • Is the team practicing Eval-Driven Development (EDD)?
  • Score: Red (no evals) / Yellow (some evals, reactive) / Green (EDD practiced, evals before features)
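
To make "binary pass/fail tests for known failure modes" concrete, here is a minimal sketch of an application-specific eval suite. `EvalCase`, `run_evals`, `toy_system`, and the case data are all hypothetical; a real suite would call the deployed system and draw its cases from failures the team has actually observed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str        # names the failure mode being guarded against
    input_text: str
    expected: str


def run_evals(system: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Return a strictly binary pass/fail result per case."""
    return {case.name: system(case.input_text) == case.expected for case in cases}


# Hypothetical stand-in for the system under test.
def toy_system(text: str) -> str:
    return "urgent" if "chest pain" in text.lower() else "routine"


cases = [
    EvalCase("chest_pain_is_urgent", "Patient reports chest pain", "urgent"),
    EvalCase("refill_is_routine", "Routine prescription refill", "routine"),
    EvalCase("stroke_signs_are_urgent", "Sudden facial droop and slurred speech", "urgent"),
]
results = run_evals(toy_system, cases)
failed = [name for name, passed in results.items() if not passed]
print("failed evals:", failed)
```

Under Eval-Driven Development, a case like `stroke_signs_are_urgent` would be written before the feature that is meant to satisfy it, so a failing eval is the starting point rather than a surprise.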

Context discipline:

  • Is context engineered (right information, right scope, right time) or ad hoc?
  • Are there known context failure modes — drift, repeated work, inconsistent outputs?
  • Score: Red / Yellow / Green

Bounded autonomy:

  • Are there clear boundaries for what AI can do autonomously vs. what requires human review?
  • Are escalation paths defined?
  • Score: Red / Yellow / Green
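
A bounded autonomy policy can often be written down as a small routing rule. The following sketch assumes the AI component exposes a confidence value between 0 and 1; the threshold values and the route names (`auto_apply`, `human_review`, `escalate`) are placeholders chosen for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    label: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def route(decision: Decision, auto_threshold: float = 0.9,
          review_threshold: float = 0.6) -> str:
    """Decide whether the AI may act alone, needs review, or must escalate."""
    if not 0.0 <= decision.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if decision.confidence >= auto_threshold:
        return "auto_apply"      # AI may act without a human
    if decision.confidence >= review_threshold:
        return "human_review"    # standard review queue
    return "escalate"            # defined escalation path for low confidence


print(route(Decision("urgent", 0.95)))
print(route(Decision("urgent", 0.70)))
print(route(Decision("urgent", 0.30)))
```

If the team cannot fill in the two thresholds and name who receives the `escalate` route, that gap is itself a finding for this factor.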

Observability:

  • Can the team see what agents are doing and why?
  • Are there runtime dashboards, logs, or monitoring?
  • Score: Red / Yellow / Green
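
As a rough illustration of the minimum logging that makes decisions reconstructable, the sketch below writes one JSON record per AI decision using Python's standard `logging` and `json` modules. The field names, the `ai_decisions` logger name, and the `override_rate` helper are assumptions for this example; in regulated settings the raw input would likely need redaction before logging.

```python
import json
import logging
from datetime import datetime, timezone
from typing import List, Optional

logger = logging.getLogger("ai_decisions")  # hypothetical logger name


def log_decision(input_text: str, output: str, confidence: float,
                 overridden: Optional[bool] = None) -> dict:
    """Emit one JSON line per AI decision and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": input_text,
        "output": output,
        "confidence": confidence,
        "overridden": overridden,  # None until a human reviewer responds
    }
    logger.info(json.dumps(record))
    return record


def override_rate(records: List[dict]) -> float:
    """Share of reviewed decisions that a human overrode."""
    reviewed = [r for r in records if r["overridden"] is not None]
    if not reviewed:
        return 0.0
    return sum(1 for r in reviewed if r["overridden"]) / len(reviewed)


records = [
    log_decision("request A", "urgent", 0.92, overridden=False),
    log_decision("request B", "routine", 0.41, overridden=True),
    log_decision("request C", "routine", 0.77),  # not yet reviewed
]
print(f"override rate: {override_rate(records):.2f}")
```

With records like these, "can the team see what agents are doing and why?" becomes a query rather than a guess.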

Stakeholder alignment:

  • Do all stakeholders understand that AI is probabilistic, not deterministic?
  • Are there unvoiced concerns (Silent Escalator risk)?
  • Score: Red / Yellow / Green

Data quality:

  • Is the data feeding AI components complete, consistent, and fresh?
  • Are there known blind spots in the data pipeline, such as missing log sources, untagged resources, or gaps in audit trails?
  • Has anyone assessed whether the input data quality is sufficient for the AI's intended accuracy?
  • For AI that builds behavioral baselines: is the training data representative of normal operations, or could noisy/incomplete data produce misleading baselines?
  • Score: Red (no data quality assessment done) / Yellow (some awareness, no systematic approach) / Green (data quality monitored and maintained)
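
Completeness and freshness are the easiest of these questions to automate. The sketch below shows one possible shape for such checks; the function names, field names, and the 24-hour freshness window are assumptions for illustration, and real thresholds depend on the data source.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Mapping, Sequence


def completeness(records: Sequence[Mapping[str, object]],
                 required_fields: Iterable[str]) -> float:
    """Fraction of records where every required field is present and non-empty."""
    required = list(required_fields)
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(field) not in (None, "") for field in required)
    )
    return complete / len(records)


def is_fresh(latest_event: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the newest record is no older than max_age."""
    return now - latest_event <= max_age


# Hypothetical sample: one record is missing its source tag.
sample = [
    {"user": "a", "source": "vpn", "ts": "2024-11-14T08:00:00Z"},
    {"user": "b", "source": "", "ts": "2024-11-14T08:05:00Z"},
]
now = datetime(2024, 11, 14, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 11, 14, 8, 5, tzinfo=timezone.utc)

print(completeness(sample, ["user", "source", "ts"]))
print(is_fresh(latest, now, max_age=timedelta(hours=24)))
```

Running checks like these on a schedule and alerting on regressions is one route from Yellow (some awareness) to Green (monitored and maintained).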

Step 4: Generate the AI Health Indicator (AHI) report

# AI Health Indicator Report

**Engagement:** (Name)
**Assessed by:** (Name and role)
**Date:** (Date)
**Architecture mode:** (Chatbot / AI Flow / AI App / Agentic / Cognitive)

---

## CARATS Scorecard

| Dimension | Score | Key finding |
|-----------|-------|-------------|
| Consistency | (R/Y/G) | (One sentence) |
| Accuracy | (R/Y/G) | (One sentence) |
| Reliability | (R/Y/G) | (One sentence) |
| Alignment | (R/Y/G) | (One sentence) |
| Tone | (R/Y/G) | (One sentence) |
| Security | (R/Y/G) | (One sentence) |

## Structural Health

| Factor | Score | Key finding |
|--------|-------|-------------|
| Evaluation maturity | (R/Y/G) | (One sentence) |
| Context discipline | (R/Y/G) | (One sentence) |
| Bounded autonomy | (R/Y/G) | (One sentence) |
| Observability | (R/Y/G) | (One sentence) |
| Stakeholder alignment | (R/Y/G) | (One sentence) |
| Data quality | (R/Y/G) | (One sentence) |

## Overall health: (Healthy / At Risk / Escalation Risk)

## Top risks

1. (Highest-impact risk — what could go wrong and when)
2. (Second risk)
3. (Third risk)

## Recommended interventions

### Immediate (this week)
- (Action to address the highest risk)

### Short-term (next 2-4 weeks)
- (Actions to move Red scores to Yellow)

### Structural (ongoing)
- (Practices to adopt for sustained health)

## Escalation risk assessment

(Is the engagement on track to hit the Week 4-8 escalation pattern? What are the leading indicators?)

## Next check-in: (Date — typically 2-4 weeks)
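
The template does not say how individual scores roll up into the overall health rating, so that call is left to the assessor's judgment. If a repeatable rule is wanted, it could look something like the sketch below. The thresholds (three or more Reds for Escalation Risk; any Red, or Yellow on at least half the factors, for At Risk) are assumptions made for this example, not part of the framework, and `overall_health` and `scorecard_rows` are hypothetical helper names.

```python
from typing import Dict

VALID_SCORES = {"R", "Y", "G"}


def overall_health(scores: Dict[str, str], escalation_reds: int = 3) -> str:
    """Map R/Y/G scores to Healthy / At Risk / Escalation Risk (assumed rule)."""
    invalid = {s for s in scores.values() if s not in VALID_SCORES}
    if invalid:
        raise ValueError(f"unknown scores: {sorted(invalid)}")
    reds = sum(1 for s in scores.values() if s == "R")
    yellows = sum(1 for s in scores.values() if s == "Y")
    if reds >= escalation_reds:
        return "Escalation Risk"
    if reds > 0 or yellows * 2 >= len(scores):
        return "At Risk"
    return "Healthy"


def scorecard_rows(scores: Dict[str, str], findings: Dict[str, str]) -> str:
    """Render the body rows of a scorecard table in the template's format."""
    return "\n".join(
        f"| {name} | {score} | {findings.get(name, '')} |"
        for name, score in scores.items()
    )


scores = {"Consistency": "Y", "Accuracy": "R", "Tone": "G"}
findings = {"Accuracy": "Error rate has never been measured"}
print(overall_health(scores))
print(scorecard_rows(scores, findings))
```

Whatever rule is used, the assessor should still be free to override it in the report when a single Red (for example, on Security) is severe enough on its own.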

Step 5: Review and discuss

Present the report and ask:

  • Does this match your gut feeling about the engagement's health?
  • Are there risks I've missed?
  • Which interventions feel most impactful and achievable?
  • Are there stakeholder conversations that need to happen based on this?

Output location

Present the AHI report as formatted text in the conversation. The assessor can copy it into engagement documentation or use it to frame a stakeholder conversation.

Example Output

Input

  • Engagement: Meridian Health Partners — AI-assisted prior authorization triage system, 6 weeks in (12-week engagement)
  • Architecture mode: AI Flow (LLM-powered classification pipeline routing auth requests to clinical reviewers)
  • Assessment trigger: Mid-engagement check-in; PM flagged that clinical reviewers are "correcting the AI constantly" but no one has quantified the error rate
  • Key stakeholders: Engagement PM (Riya Nair), engineering lead (Derek Luo), client product lead (Dr. Sandra Osei, VP of Clinical Operations), CFO sponsoring the initiative

Output

AI Health Indicator Report

Engagement: Meridian Health Partners, Prior Authorization Triage System
Assessed by: Riya Nair, Engagement PM
Date: November 14, 2024
Architecture mode: AI Flow


CARATS Scorecard

| Dimension | Score | Key finding |
|-----------|-------|-------------|
| Consistency | 🟡 Yellow | Same request type produces different urgency classifications depending on input phrasing; no systematic checks in place |
| Accuracy | 🔴 Red | Clinical reviewers report frequent misclassifications but error rate has never been measured; no ground truth logging |
| Reliability | 🟡 Yellow | Pipeline performs well under normal load but has produced silent failures (no output, no error) on two occasions |
| Alignment | 🔴 Red | System optimizes for routing speed; reviewers report it misses clinical nuance that determines actual authorization priority |
| Tone | 🟢 Green | Rationale summaries generated for reviewers are clear and appropriately clinical in register |
| Security | 🟡 Yellow | PHI handling reviewed at kickoff; prompt injection boundaries not tested; no adversarial robustness assessment done |

Structural Health

| Factor | Score | Key finding |
|--------|-------|-------------|
| Evaluation maturity | 🔴 Red | No application-specific evals exist; "works" is defined as "reviewer didn't escalate," a lagging, invisible signal |
| Context discipline | 🟡 Yellow | Authorization request context is passed in full but prior auth history and payer-specific rules are inconsistently included |
| Bounded autonomy | 🟡 Yellow | All outputs require human review, which is appropriate, but escalation paths for low-confidence classifications are undefined |
| Observability | 🔴 Red | No runtime logging of classification decisions or confidence scores; team cannot reconstruct why a routing decision was made |
| Stakeholder alignment | 🟡 Yellow | Dr. Osei understands AI limitations; CFO sponsor has expressed expectation of "90%+ straight-through processing" within 90 days, not yet pressure-tested against current accuracy reality |
| Data quality | 🟡 Yellow | Training examples drawn from last 18 months of auth requests, but pandemic-era volume patterns may be skewing urgency baselines; no freshness monitoring |

Overall health: 🔴 Escalation Risk


Top risks

  1. Unquantified accuracy gap becomes a client trust crisis. Clinical reviewers are absorbing AI errors silently. When Dr. Osei or the CFO asks for a performance number — likely at the Week 8 business review — the team will have no defensible answer. This is the classic Week 4-8 escalation pattern in motion.
  2. The CFO's 90%+ straight-through processing expectation is almost certainly unachievable at current accuracy levels, and no one has had the conversation to reset it. When the gap surfaces, it will look as though the team has been concealing it.
  3. Zero observability means defects compound invisibly. Without logging classification rationale or confidence scores, the team cannot distinguish systemic model failure from edge cases — making it impossible to prioritize fixes or demonstrate improvement.

Recommended interventions

Immediate (this week)

  • Instrument the pipeline to log every classification decision with input, output, confidence score, and reviewer override (yes/no). This is non-negotiable before Week 8.
  • Set up a 30-minute alignment conversation between Riya and the CFO's chief of staff to introduce the concept of an accuracy baseline period, framing the current phase as "measuring to improve," not "already at target."

Short-term (next 2-4 weeks)

  • Build a minimum eval set of 50 labeled prior auth cases (drawn with Dr. Osei's clinical team) covering known hard cases: pediatric requests, oncology urgency flags, and payer-specific edge cases. Run weekly against the pipeline.
  • Define confidence thresholds: classifications below a set score should auto-route to a senior reviewer rather than a standard queue. Document this as a bounded autonomy policy.
  • Audit context injection to confirm payer rule sets and prior auth history are included consistently — spot-check 20 recent requests manually.

Structural (ongoing)

  • Adopt Eval-Driven Development: no new classification category or routing rule ships without a corresponding eval case added to the suite.
  • Establish a monthly data quality review to assess whether the training distribution still reflects current auth request patterns, especially as payer mix changes.
  • Add a standing agenda item in weekly team syncs: "What did reviewers override this week, and why?" This converts reviewer corrections into a continuous signal rather than invisible noise.

Escalation risk assessment

This engagement matches the Week 4-8 escalation pattern closely. The team is six weeks in, a senior stakeholder (CFO) holds an unvalidated performance expectation, accuracy has never been measured, and the primary failure signal (reviewer overrides) is being absorbed without capture. The combination of no observability and a high-stakes business review on the horizon creates a predictable forcing event. The intervention window is approximately two weeks before the CFO conversation becomes unavoidable.

Leading indicators to watch: (1) any increase in reviewer complaints reaching Dr. Osei, (2) CFO's team requesting an early performance update, (3) Derek flagging pipeline changes that would affect classification logic — which, without evals, cannot be safely deployed.


Next check-in: December 5, 2024