What is an AI hallucination?

An AI hallucination is output that is fluent and confident but factually wrong or unsupported by the source the model was given. It happens because language models predict plausible text, not verified truth, so they can generate convincing claims, citations, or details that don't exist.

How do you measure an AI hallucination rate?

Run the model against a dataset of inputs, then score each output for claims that aren't supported by fact or source. The hallucination rate is the share of outputs containing at least one unsupported claim. Grounding checks, citation verification, and an LLM judge grading faithfulness are the common scoring methods.
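As a rough sketch of that arithmetic, assuming some scorer has already attached a list of unsupported claims to each output: the `ScoredOutput` class, the `hallucination_rate` function, and the sample data below are all made up for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredOutput:
    """One model output plus the claims a scorer flagged as unsupported."""
    output_id: str
    unsupported_claims: list[str] = field(default_factory=list)


def hallucination_rate(scored: list[ScoredOutput]) -> float:
    """Share of outputs that contain at least one unsupported claim."""
    if not scored:
        raise ValueError("cannot compute a rate over zero outputs")
    flagged = sum(1 for s in scored if s.unsupported_claims)
    return flagged / len(scored)


# Illustrative data only: four scored outputs, two of them flagged.
results = [
    ScoredOutput("q1"),
    ScoredOutput("q2", ["Refunds are available for 90 days."]),
    ScoredOutput("q3"),
    ScoredOutput("q4", ["Cites a policy memo that is not in the source."]),
]

print(f"{hallucination_rate(results):.0%}")  # prints 50%
```

Note that an output with five unsupported claims counts the same as an output with one. That matches the definition above, which is per output, not per claim.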

Can you eliminate AI hallucinations completely?

No. You can reduce them substantially with better retrieval, prompts that permit "I don't know," and lower temperature on factual tasks, but no current technique drives the rate to zero. That's why a standing hallucination eval matters: it tells you the real rate and whether your mitigations are working.

How to Test for AI Hallucinations: A Practical Guide

To test for AI hallucinations, you build an eval that checks whether the model's output is grounded in fact or source material rather than invented. You gather inputs where the AI is tempted to make things up, define what counts as an unsupported claim, and score each answer for statements it can't back up.

A hallucination is output that sounds confident and reads fluently but isn't true or isn't supported by the source you gave the model. It's the failure mode that erodes trust fastest, because it doesn't look like a failure. There's no error, no crash, just a wrong answer delivered with total composure.

Why you can't just eyeball it

The thing that makes hallucinations dangerous is the same thing that makes them hard to catch: they're plausible. A reviewer skimming outputs will wave through a confident, well-formed answer without checking whether the cited policy actually says that. Hallucinations hide in the outputs that look most correct, which is exactly why you need a structured test rather than a gut check.

How to build a hallucination eval

A hallucination eval is a regular eval pointed at a specific question: is every claim supported? Three scoring methods do most of the work.

Grounding checks. For systems that answer from a source (retrieval, a document, a knowledge base), check whether each claim in the output traces back to the provided context. Anything that doesn't is a candidate hallucination.

Citation verification. If the AI cites sources, verify the citation exists and actually supports the claim. Models invent plausible-looking references, so a citation is not proof on its own.

LLM-as-judge for faithfulness. A second model grades whether the answer stays faithful to the source, flagging claims that go beyond it. Give the judge the source text and the answer, and ask it to list unsupported statements before scoring.
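Here is a minimal sketch of two of these methods, under loud assumptions. The grounding check uses crude word overlap as a stand-in for real claim verification, so it will miss a changed number or a negation; treat its flags as candidates for review, not verdicts. The judge prompt wording, the 0.6 threshold, and every function name are hypothetical, and sending the prompt to a model is left out because that depends on your provider.

```python
import re


def split_claims(answer: str) -> list[str]:
    """Naive claim splitter: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def _content_words(text: str) -> set[str]:
    """Lowercased words longer than three characters (a crude stopword filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_claims(answer: str, source: str, min_overlap: float = 0.6) -> list[str]:
    """Flag claims whose content words mostly do not appear in the source.

    Assumption: lexical overlap approximates grounding. It is only a first filter.
    """
    source_words = _content_words(source)
    flagged = []
    for claim in split_claims(answer):
        words = _content_words(claim)
        if not words:
            continue  # nothing to check in this sentence
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append(claim)
    return flagged


# Hypothetical judge prompt: list unsupported statements first, then score.
JUDGE_TEMPLATE = """You are grading an answer for faithfulness to a source.

SOURCE:
{source}

ANSWER:
{answer}

First, list every statement in the answer that the source does not support.
Then give a verdict on its own line: FAITHFUL or UNFAITHFUL."""


def build_judge_prompt(source: str, answer: str) -> str:
    """Fill the judge template with the source text and the answer to grade."""
    return JUDGE_TEMPLATE.format(source=source, answer=answer)


source = "Refunds are accepted within 30 days of purchase. Items must be unused."
answer = ("Refunds are accepted within 30 days of purchase. "
          "Shipping costs are always reimbursed in full.")

print(unsupported_claims(answer, source))
# ['Shipping costs are always reimbursed in full.']
```

The overlap check is cheap enough to run on every output, which makes it a reasonable pre-filter before the slower and costlier judge call.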

Build your dataset from the cases most likely to trigger invention: questions just outside the source's coverage, ambiguous prompts, and requests for specifics the model won't have. Those are where hallucinations live.
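One way to keep that mix honest is to tag each case with the trigger it targets and count the tags. The sketch below assumes a refund-policy source; the prompts, the category names, and the `expected` field are invented for illustration.

```python
from collections import Counter

# The three trigger types named above, as hypothetical tags.
CATEGORIES = {"outside_coverage", "ambiguous", "missing_specifics"}

# Illustrative cases for an assistant that answers from a refund policy.
cases = [
    {"id": "c1", "category": "outside_coverage",
     "prompt": "What is your warranty on refurbished items?",
     "expected": "abstain"},
    {"id": "c2", "category": "ambiguous",
     "prompt": "Can I send it back?",
     "expected": "ask_clarification"},
    {"id": "c3", "category": "missing_specifics",
     "prompt": "Which clause number covers international refunds?",
     "expected": "abstain"},
    {"id": "c4", "category": "outside_coverage",
     "prompt": "Do you price match competitors?",
     "expected": "abstain"},
]


def coverage(cases: list[dict]) -> dict[str, int]:
    """Count cases per trigger category, rejecting unknown tags."""
    unknown = {c["category"] for c in cases} - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(c["category"] for c in cases)
    return {cat: counts.get(cat, 0) for cat in sorted(CATEGORIES)}


print(coverage(cases))
# {'ambiguous': 1, 'missing_specifics': 1, 'outside_coverage': 2}
```

A category sitting at zero is a blind spot: the eval can't report hallucinations in a situation it never puts the model in.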

Detecting is not the same as reducing

An eval tells you how often the model makes things up. It doesn't fix the cause. Once you can measure the rate, you can work the levers that lower it: better retrieval so the answer has real source to stand on, prompts that tell the model to say "I don't know" rather than guess, and a lower temperature on factual tasks. The eval is what proves any of those changes actually helped instead of just feeling safer.
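To check whether a change helped, run the same cases before and after and compare case by case, not just rate against rate. A sketch, assuming each run has already been reduced to a mapping from case id to "did this output contain an unsupported claim"; `compare_runs` and the sample runs are hypothetical.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two eval runs over the same cases.

    Each dict maps a case id to True if that output contained an unsupported claim.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same cases")
    n = len(baseline)
    if n == 0:
        raise ValueError("cannot compare empty runs")
    # Cases that hallucinated before the change but not after it.
    fixed = sorted(c for c in baseline if baseline[c] and not candidate[c])
    # Cases that were clean before the change and hallucinate after it.
    regressed = sorted(c for c in baseline if not baseline[c] and candidate[c])
    return {
        "baseline_rate": sum(baseline.values()) / n,
        "candidate_rate": sum(candidate.values()) / n,
        "fixed": fixed,
        "regressed": regressed,
    }


# Illustrative runs: before and after adding an "I don't know" instruction.
before = {"c1": True, "c2": True, "c3": False, "c4": False}
after = {"c1": False, "c2": False, "c3": False, "c4": True}

print(compare_runs(before, after))
# {'baseline_rate': 0.5, 'candidate_rate': 0.25, 'fixed': ['c1', 'c2'], 'regressed': ['c4']}
```

The rate dropped, but `c4` got worse. A headline number alone would have hidden that regression, and with only a handful of cases a swing this size can also be noise, so use a larger dataset before trusting it.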

Where to go next

A hallucination eval is one application of the broader discipline of AI evals. To build the full picture, start with how to design AI evals, see the workflow in eval-driven development for AI products, and draft your first eval with the free AI Eval Builder. If the goal is getting a whole team to test AI output as a habit, that's AI adoption consulting.

