To test for AI hallucinations, you build an eval that checks whether the model's output is grounded in fact or source material rather than invented. You gather inputs where the AI is tempted to make things up, define what counts as an unsupported claim, and score each answer for statements it can't back up.
A hallucination is output that sounds confident and reads fluently but isn't true or isn't supported by the source you gave the model. It's the failure mode that erodes trust fastest, because it doesn't look like a failure. There's no error, no crash, just a wrong answer delivered with total composure.
Why you can't just eyeball it
The thing that makes hallucinations dangerous is the same thing that makes them hard to catch: they're plausible. A reviewer skimming outputs will wave through a confident, well-formed answer without checking whether the cited policy actually says that. Hallucinations hide in the outputs that look most correct, which is exactly why you need a structured test rather than a gut check.
How to build a hallucination eval
A hallucination eval is a regular eval pointed at a specific question: is every claim supported? Three scoring methods do most of the work.
- Grounding checks. For systems that answer from a source (retrieval, a document, a knowledge base), check whether each claim in the output traces back to the provided context. Anything that doesn't is a candidate hallucination.
- Citation verification. If the AI cites sources, verify the citation exists and actually supports the claim. Models invent plausible-looking references, so a citation is not proof on its own.
- LLM-as-judge for faithfulness. A second model grades whether the answer stays faithful to the source, flagging claims that go beyond it. Give the judge the source text and the answer, and ask it to list unsupported statements before scoring.
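Here is a minimal sketch of the three methods above, with several assumptions: each sentence counts as one claim, "supported" is approximated by crude word overlap with the source, citations appear as bracketed IDs like `[policy-1]`, and the judge prompt's wording and 1 to 5 scale are illustrative. All function names are hypothetical. The overlap check is only a first-pass filter: it will miss a claim that reuses the source's words but changes a number, which is the kind of case the judge step is meant to catch.

```python
import re

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and",
             "for", "on", "it", "that", "this", "with", "be", "as", "at", "by", "or"}


def content_words(text: str) -> set[str]:
    """Lowercased words and numbers, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def split_claims(answer: str) -> list[str]:
    """Naive sentence split; treats each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def unsupported_claims(answer: str, source: str, threshold: float = 0.6) -> list[str]:
    """Grounding check: flag sentences whose content words are mostly absent from the source."""
    source_words = content_words(source)
    flagged = []
    for claim in split_claims(answer):
        words = content_words(claim)
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < threshold:
            flagged.append(claim)
    return flagged


def missing_citations(answer: str, known_ids: set[str]) -> list[str]:
    """Citation check: return cited IDs that do not exist in the known source set.

    This only checks existence. Whether the cited source supports the claim
    still needs a grounding check or a judge.
    """
    cited = re.findall(r"\[([\w-]+)\]", answer)
    return [c for c in cited if c not in known_ids]


def build_judge_prompt(source: str, answer: str) -> str:
    """Faithfulness judge: ask for unsupported statements first, then a score."""
    return (
        "You are grading an answer for faithfulness to a source.\n"
        "First list every statement in the answer that the source does not support.\n"
        "Then give a score from 1 (mostly unsupported) to 5 (fully supported).\n\n"
        f"SOURCE:\n{source}\n\nANSWER:\n{answer}\n"
    )


source = "Refunds are available within 30 days of purchase. Items must be unused."
answer = "Refunds are available within 30 days of purchase. Shipping is free for premium members."
print(unsupported_claims(answer, source))
print(missing_citations("See [policy-1] and [policy-9].", {"policy-1"}))
```

The prompt builder returns a string only; sending it to a second model depends on whichever client you already use.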
Build your dataset from the cases most likely to trigger invention: questions just outside the source's coverage, ambiguous prompts, and requests for specifics the model won't have. Those are where hallucinations live.
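One way to keep those trigger cases organized is to tag each one by the kind of temptation it represents. The sketch below is an assumption about structure, not a required format: the `EvalCase` fields, the category names, and the example questions are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    source: str           # the context the model is given
    category: str         # hypothetical tags: out_of_coverage, ambiguous, missing_specifics
    should_abstain: bool  # True when the right answer is "I don't know"


# Invented examples, one per trigger category named in the text.
REFUND_POLICY = "Refunds are available within 30 days of purchase."
cases = [
    EvalCase("Can I get a refund after 45 days if the item is defective?",
             REFUND_POLICY, "out_of_coverage", True),
    EvalCase("What's the policy?",
             REFUND_POLICY, "ambiguous", False),
    EvalCase("Which clause number covers refunds, and when was it last updated?",
             REFUND_POLICY, "missing_specifics", True),
]


def count_by_category(dataset: list[EvalCase]) -> Counter:
    """Show how many cases cover each trigger type, to spot gaps in the dataset."""
    return Counter(case.category for case in dataset)


print(count_by_category(cases))
```

Tagging cases this way lets you report the hallucination rate per category rather than as one blended number.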
Detecting is not the same as reducing
An eval tells you how often the model makes things up. It doesn't fix the cause. Once you can measure the rate, you can work the levers that lower it: better retrieval so the answer has real source to stand on, prompts that tell the model to say "I don't know" rather than guess, and a lower temperature on factual tasks. The eval is what proves any of those changes actually helped instead of just feeling safer.
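To check whether a change helped, you can score the same dataset before and after and compare the rates. This is a minimal sketch: it assumes each case has already been reduced to a True/False "contains a hallucination" flag by whatever scoring method you use, and the flag values below are invented for illustration.

```python
def hallucination_rate(flags: list[bool]) -> float:
    """Fraction of eval cases flagged as containing at least one unsupported claim."""
    if not flags:
        raise ValueError("need at least one scored case")
    return sum(flags) / len(flags)


def compare_runs(baseline: list[bool], candidate: list[bool]) -> float:
    """Change in hallucination rate (negative means fewer hallucinations).

    Both runs must cover the same cases, otherwise the comparison is meaningless.
    """
    if len(baseline) != len(candidate):
        raise ValueError("runs must be scored on the same dataset")
    return hallucination_rate(candidate) - hallucination_rate(baseline)


# Invented flags: one entry per eval case, True = hallucination found.
before_prompt_change = [True, False, True, False, False]
after_prompt_change = [False, False, True, False, False]

delta = compare_runs(before_prompt_change, after_prompt_change)
print(f"rate change: {delta:+.2f}")
```

With only a handful of cases, one flipped flag moves the rate a lot, so a larger dataset gives a more trustworthy comparison.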
Where to go next
A hallucination eval is one application of the broader discipline of AI evals. To build the full picture, start with how to design AI evals, see the workflow in eval-driven development for AI products, and draft your first one with the free AI Eval Builder. If the goal is getting a whole team to test AI output as a habit, that's AI adoption consulting.