Use this when you need an AI feature to answer from your own knowledge (docs, tickets, a knowledge base) rather than the model's training. Covers the real first decision (RAG vs fine-tune vs long-context prompt), chunking, embeddings and retrieval, and the step teams skip: evaluating retrieval on its own. If you are speccing the whole feature, start with /ai-product-spec and use this for the retrieval layer.
Related skills: Specs the feature with
/ai-product-spec. The model choice is in/multi-model-strategy. Evaluate the answers with/ai-eval-design. Monitor retrieval quality with/llm-observability-plan.
The hard part most teams miss
Retrieval quality is the product. The LLM is the easy part. Most "the RAG is hallucinating" reports are retrieval failures wearing a generation costume.
- RAG is one of three answers, and often the wrong one. Grounding in private knowledge can mean retrieval, fine-tuning, or just putting the documents in a long-context prompt. Teams reach for RAG reflexively. If the corpus is small and stable, a long-context prompt is simpler and more accurate; if the need is a behavior or format, fine-tuning fits better. Make this decision before building anything (Step 2).
- If the model never sees the right chunk, no prompt can save it. The model can only answer from what retrieval handed it. Bad chunks, a mismatched embedding model, or a too-small top-k starve the model, and the failure looks like a generation problem. It is not.
- Nobody evaluates retrieval separately, so nobody fixes it. Teams measure the final answer and tune the prompt, while the actual defect is recall: the answer was not in the retrieved set. Score retrieval on its own (Step 5) or you are debugging blind.
Process
Step 1: Gather inputs
Ask the user:
- What knowledge are you grounding in? (Source, rough size: hundreds of pages, or millions.)
- How often does it change? (Static, daily, real-time. This rules options in and out.)
- What do the queries look like? (Lookups, multi-hop questions, summaries across many docs.)
- What is the accuracy bar and the cost of a wrong answer? (Sets how much retrieval rigor is worth.)
- Latency and cost ceiling per query? (Caps embedding, retrieval, and rerank choices.)
- What structure does the source have? (Headings, tables, code, metadata you can filter on.)
Step 2: Decide RAG vs fine-tune vs long-context prompt
| Approach | Fits when | Watch-out |
|---|---|---|
| Long-context prompt | Corpus is small and fairly stable; fits in the window | Cost per call scales with context; cache the stable prefix |
| RAG (retrieval) | Corpus is large, changes often, or you need citations to source | Retrieval quality becomes a system you must build and evaluate |
| Fine-tune | You need a behavior, format, or domain style, not fresh facts | Does not keep facts current; retrain to update knowledge |
These combine (fine-tune for style plus RAG for facts). Pick deliberately; do not default to RAG because it is the familiar word.
Step 3: Chunking strategy
- Size and overlap: chunk to the unit a query actually needs (a section, not a whole doc, not a single sentence). Overlap a little so an answer split across a boundary is not lost.
- Respect structure: split on headings, list items, table rows, and code blocks rather than blind character counts. A chunk that straddles two topics retrieves for neither.
- Carry metadata: attach source, section, date, and any field you will filter on. Metadata filtering is often a bigger accuracy win than a better embedding model.
Step 4: Embeddings and retrieval
- Embedding model: match it to your content (domain, language, code) and to your query style. The embedding model and the chunking decide what is findable.
- Vector store and top-k: start with a top-k large enough that the right chunk is usually present, then narrow. Too-small top-k is a silent recall killer.
- Hybrid and rerank: combine semantic search with keyword search when queries contain exact terms (names, IDs, error codes) that embeddings blur. Add a reranker when precision matters and top-k is noisy.
Step 5: Evaluate retrieval, separately from generation
- Build a retrieval eval set: real queries paired with the chunk(s) that should be retrieved.
- Measure recall@k: for what fraction of queries is a correct chunk in the top-k? This number, not the answer quality, tells you whether retrieval is the bottleneck.
- Then evaluate the answer with
/ai-eval-design, on top of known-good retrieval. Fixing the prompt while recall is low is wasted effort.
Step 6: Output the RAG design
# RAG Design: (feature)
**Grounding decision:** (RAG / long-context / fine-tune / hybrid) and why
**Corpus:** (size, update frequency, structure)
## Chunking
- Strategy: (structure-aware split, size, overlap)
- Metadata captured: (fields)
## Retrieval
- Embedding model: (name and why)
- Store + top-k: (choice)
- Hybrid / rerank: (yes/no and why)
## Evaluation
- Retrieval eval set: (size, source)
- Recall@k target: (value)
- Answer eval: (link to eval plan)
## Open questions
- (unresolved decisions)
Step 7: Review
Ask the user:
- Did you confirm RAG beats a long-context prompt for this corpus?
- Can you measure recall@k today, or are you flying blind on retrieval?
- Do queries contain exact terms that need keyword search alongside embeddings?
- How does the index stay fresh as the source changes?
Anti-patterns
| Anti-pattern | Why it fails | Do instead |
|---|---|---|
| RAG by reflex | Builds a retrieval system a long-context prompt would beat | Run the Step 2 decision first |
| Fixed-size blind chunking | Chunks straddle topics; retrieval gets noise | Split on structure, carry metadata |
| No retrieval eval | Tunes the prompt while recall is the real defect | Measure recall@k separately |
| Embeddings only for exact terms | IDs, names, error codes get blurred | Add keyword / hybrid search |
| Stuffing the whole corpus every call | Pays for context no single query needs | Retrieve the relevant chunks |
| Fine-tune to add facts | Knowledge goes stale and is costly to update | Fine-tune for behavior, retrieve for facts |
Output location
Present the RAG design as formatted text in the conversation for the user to copy into their design doc.