RAG Implementation

Use this when you need an AI feature to answer from your own knowledge (docs, tickets, a knowledge base) rather than from the model's training data. Covers the real first decision (RAG vs fine-tune vs long-context prompt), chunking, embeddings and retrieval, and the step teams skip: evaluating retrieval on its own. If you are speccing the whole feature, start with /ai-product-spec and use this for the retrieval layer.

Related skills: Spec the feature with /ai-product-spec. The model choice is in /multi-model-strategy. Evaluate the answers with /ai-eval-design. Monitor retrieval quality with /llm-observability-plan.

The hard part most teams miss

Retrieval quality is the product. The LLM is the easy part. Most "the RAG is hallucinating" reports are retrieval failures wearing a generation costume.

RAG is one of three answers, and often the wrong one. Grounding in private knowledge can mean retrieval, fine-tuning, or just putting the documents in a long-context prompt. Teams reach for RAG reflexively. If the corpus is small and stable, a long-context prompt is simpler and often more accurate, because nothing can be lost at the retrieval step; if the need is a behavior or format, fine-tuning fits better. Make this decision before building anything (Step 2).
If the model never sees the right chunk, no prompt can save it. The model can only answer from what retrieval handed it. Bad chunks, a mismatched embedding model, or a too-small top-k starve the model, and the failure looks like a generation problem. It is not.
Nobody evaluates retrieval separately, so nobody fixes it. Teams measure the final answer and tune the prompt, while the actual defect is recall: the answer was not in the retrieved set. Score retrieval on its own (Step 5) or you are debugging blind.
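
A quick way to tell the two failure modes apart, sketched under assumptions: for each wrong answer, check whether any chunk that contains the answer was in the retrieved set. `FailedCase` and `triage` are hypothetical names, and the sketch assumes you log retrieved chunk ids and have hand-labeled the relevant ones.

```python
from dataclasses import dataclass


@dataclass
class FailedCase:
    """One wrong answer, plus what retrieval returned for it (hypothetical record shape)."""
    query: str
    retrieved_ids: list[str]  # chunk ids handed to the model
    relevant_ids: set[str]    # chunk ids a human marked as containing the answer


def triage(case: FailedCase) -> str:
    """Label a wrong answer as a retrieval miss or a generation miss."""
    if not case.relevant_ids & set(case.retrieved_ids):
        return "retrieval"   # the model never saw a correct chunk
    return "generation"      # a correct chunk was present; inspect the prompt or model


cases = [
    FailedCase("How do I rotate an API key?", ["c4", "c9"], {"c2"}),
    FailedCase("What is the refund window?", ["c7", "c1"], {"c7"}),
]
for case in cases:
    print(case.query, "->", triage(case))
```

If most wrong answers triage as "retrieval", prompt tuning is unlikely to help.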

Process

Step 1: Gather inputs

Ask the user:

What knowledge are you grounding in? (Source, rough size: hundreds of pages, or millions.)
How often does it change? (Static, daily, real-time. This rules options in and out.)
What do the queries look like? (Lookups, multi-hop questions, summaries across many docs.)
What is the accuracy bar and the cost of a wrong answer? (Sets how much retrieval rigor is worth.)
Latency and cost ceiling per query? (Caps embedding, retrieval, and rerank choices.)
What structure does the source have? (Headings, tables, code, metadata you can filter on.)

Step 2: Decide RAG vs fine-tune vs long-context prompt

Approach	Fits when	Watch-out
Long-context prompt	Corpus is small and fairly stable; fits in the window	Cost per call scales with context; cache the stable prefix
RAG (retrieval)	Corpus is large, changes often, or you need citations to source	Retrieval quality becomes a system you must build and evaluate
Fine-tune	You need a behavior, format, or domain style, not fresh facts	Does not keep facts current; retrain to update knowledge

These combine (fine-tune for style plus RAG for facts). Pick deliberately; do not default to RAG because it is the familiar word.
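
The table can be read as a rough rule of thumb. The sketch below is one possible encoding, not a definitive policy: `suggest_grounding` is a hypothetical helper, and the `headroom` fraction (how much of the context window the corpus may occupy) is an illustrative assumption you should set for your own model and prompts.

```python
def suggest_grounding(
    corpus_tokens: int,
    context_window_tokens: int,
    changes_often: bool,
    needs_citations: bool,
    needs_style_or_format: bool,
    headroom: float = 0.5,  # assumption: leave half the window for the query and answer
) -> list[str]:
    """Return the approaches the Step 2 table points to for these inputs."""
    suggestions = []
    fits_in_window = corpus_tokens <= context_window_tokens * headroom
    if fits_in_window and not changes_often and not needs_citations:
        suggestions.append("long-context")
    else:
        # Large corpus, frequent change, or a need to cite sources
        suggestions.append("rag")
    if needs_style_or_format:
        # Fine-tuning addresses behavior and format, not fresh facts
        suggestions.append("fine-tune")
    return suggestions


print(suggest_grounding(20_000, 200_000, False, False, False))
print(suggest_grounding(5_000_000, 200_000, True, True, True))
```

Treat the output as a starting point for the discussion, then record the reasoning in the design (Step 6).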

Step 3: Chunking strategy

Size and overlap: chunk to the unit a query actually needs (a section, not a whole doc, not a single sentence). Overlap a little so an answer split across a boundary is not lost.
Respect structure: split on headings, list items, table rows, and code blocks rather than blind character counts. A chunk that straddles two topics retrieves for neither.
Carry metadata: attach source, section, date, and any field you will filter on. Metadata filtering is often a bigger accuracy win than a better embedding model.
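
The three points above can be sketched as a small structure-aware chunker. This is a minimal sketch under assumptions: the source is plain text with markdown-style `#` headings, size is measured in characters rather than tokens, and `Chunk`, `split_sections`, and `chunk_document` are hypothetical names. The `max_chars` default is illustrative, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def split_sections(doc: str):
    """Yield (heading, lines) pairs, starting a new section at each '#' heading line."""
    heading, lines = "", []
    for line in doc.splitlines():
        if line.startswith("#"):
            if lines:
                yield heading, lines
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if lines:
        yield heading, lines


def chunk_document(doc: str, source: str, max_chars: int = 1200,
                   overlap_paragraphs: int = 1) -> list[Chunk]:
    """Pack paragraphs into chunks that never cross a heading boundary."""
    chunks = []
    for heading, lines in split_sections(doc):
        paragraphs = [p.strip() for p in "\n".join(lines).split("\n\n") if p.strip()]
        meta = {"source": source, "section": heading}
        current = []
        for para in paragraphs:
            if current and sum(len(p) for p in current) + len(para) > max_chars:
                chunks.append(Chunk("\n\n".join(current), dict(meta)))
                # Carry the last paragraph(s) forward so a split answer is not lost
                current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current.append(para)
        if current:
            chunks.append(Chunk("\n\n".join(current), dict(meta)))
    return chunks


sample = "# Install\n\nRun the installer.\n\n# Usage\n\nCall the API.\n"
for chunk in chunk_document(sample, source="guide.md"):
    print(chunk.metadata["section"], "|", chunk.text)
```

Real corpora usually need more fields (date, product, access level) and special handling for tables and code blocks; the point is that the split follows the document's structure and every chunk carries the fields you will filter on.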

Step 4: Embeddings and retrieval

Embedding model: match it to your content (domain, language, code) and to your query style. The embedding model and the chunking decide what is findable.
Vector store and top-k: start with a top-k large enough that the right chunk is usually present, then narrow. Too-small top-k is a silent recall killer.
Hybrid and rerank: combine semantic search with keyword search when queries contain exact terms (names, IDs, error codes) that embeddings blur. Add a reranker when precision matters and top-k is noisy.
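
One common way to combine semantic and keyword results is reciprocal rank fusion (RRF). The source does not prescribe a fusion method; RRF is swapped in here as an illustration. The sketch assumes you already have two ranked lists of chunk ids from your own vector and keyword searches; the ids are made up, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_k: int = 5) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; break ties by id so the output is deterministic
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_k]


semantic_hits = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
keyword_hits = ["c9", "c3", "c4"]   # hypothetical keyword-search ranking
print(reciprocal_rank_fusion([semantic_hits, keyword_hits], top_k=3))
# -> ['c3', 'c9', 'c1']
```

A chunk that both searches rank highly rises to the top, while an exact-match hit that embeddings missed (`c9` here) still makes the cut. A reranker, if you add one, would run over this fused list.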

Step 5: Evaluate retrieval, separately from generation

Build a retrieval eval set: real queries paired with the chunk(s) that should be retrieved.
Measure recall@k: for what fraction of queries is a correct chunk in the top-k? This number, not the answer quality, tells you whether retrieval is the bottleneck.
Then evaluate the answer with /ai-eval-design, on top of known-good retrieval. Fixing the prompt while recall is low is wasted effort.
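
Recall@k as defined above takes only a few lines to compute. A minimal sketch, assuming an eval set of (query, relevant chunk ids) pairs and a `retrieve(query, k)` function that returns ranked chunk ids; `fake_index` is a stand-in for your real retriever and its contents are made up.

```python
from typing import Callable


def recall_at_k(eval_set: list[tuple[str, set[str]]],
                retrieve: Callable[[str, int], list[str]], k: int) -> float:
    """Fraction of queries with at least one correct chunk in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved = retrieve(query, k)[:k]
        if relevant_ids & set(retrieved):
            hits += 1
    return hits / len(eval_set)


# Stand-in retriever: a fixed ranking per query (hypothetical data)
fake_index = {
    "reset password": ["c2", "c5", "c1"],
    "refund policy": ["c8", "c3", "c4"],
}


def retrieve(query: str, k: int) -> list[str]:
    return fake_index.get(query, [])[:k]


eval_set = [("reset password", {"c5"}), ("refund policy", {"c8"})]
for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(eval_set, retrieve, k):.2f}")
# recall@1 = 0.50
# recall@3 = 1.00
```

Sweeping k like this also shows how large top-k must be before the right chunk is usually present, which feeds back into Step 4.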

Step 6: Output the RAG design

# RAG Design: (feature)

**Grounding decision:** (RAG / long-context / fine-tune / hybrid) and why
**Corpus:** (size, update frequency, structure)

## Chunking
- Strategy: (structure-aware split, size, overlap)
- Metadata captured: (fields)

## Retrieval
- Embedding model: (name and why)
- Store + top-k: (choice)
- Hybrid / rerank: (yes/no and why)

## Evaluation
- Retrieval eval set: (size, source)
- Recall@k target: (value)
- Answer eval: (link to eval plan)

## Open questions
- (unresolved decisions)

Step 7: Review