RAG Implementation

Ground an LLM in your own documents: RAG vs fine-tune vs long-context, chunking, embeddings, retrieval, and evaluating retrieval on its own.

Use this when you need an AI feature to answer from your own knowledge (docs, tickets, a knowledge base) rather than from the model's training data. Covers the real first decision (RAG vs fine-tune vs long-context prompt), chunking, embeddings and retrieval, and the step teams skip: evaluating retrieval on its own. If you are speccing the whole feature, start with /ai-product-spec and use this for the retrieval layer.

Related skills: Spec the feature with /ai-product-spec. Choose the model with /multi-model-strategy. Evaluate the answers with /ai-eval-design. Monitor retrieval quality with /llm-observability-plan.

The hard part most teams miss

Retrieval quality is the product. The LLM is the easy part. Most "the RAG is hallucinating" reports are retrieval failures wearing a generation costume.

  1. RAG is one of three answers, and often the wrong one. Grounding in private knowledge can mean retrieval, fine-tuning, or just putting the documents in a long-context prompt. Teams reach for RAG reflexively. If the corpus is small and stable, a long-context prompt is simpler and more accurate; if the need is a behavior or format, fine-tuning fits better. Make this decision before building anything (Step 2).
  2. If the model never sees the right chunk, no prompt can save it. The model can only answer from what retrieval handed it. Bad chunks, a mismatched embedding model, or a too-small top-k starve the model, and the failure looks like a generation problem. It is not.
  3. Nobody evaluates retrieval separately, so nobody fixes it. Teams measure the final answer and tune the prompt, while the actual defect is recall: the answer was not in the retrieved set. Score retrieval on its own (Step 5) or you are debugging blind.

Process

Step 1: Gather inputs

Ask the user:

  1. What knowledge are you grounding in? (Source, rough size: hundreds of pages, or millions.)
  2. How often does it change? (Static, daily, real-time. This rules options in and out.)
  3. What do the queries look like? (Lookups, multi-hop questions, summaries across many docs.)
  4. What is the accuracy bar and the cost of a wrong answer? (Sets how much retrieval rigor is worth.)
  5. Latency and cost ceiling per query? (Caps embedding, retrieval, and rerank choices.)
  6. What structure does the source have? (Headings, tables, code, metadata you can filter on.)

Step 2: Decide RAG vs fine-tune vs long-context prompt

| Approach | Fits when | Watch-out |
| --- | --- | --- |
| Long-context prompt | Corpus is small and fairly stable; fits in the window | Cost per call scales with context; cache the stable prefix |
| RAG (retrieval) | Corpus is large, changes often, or you need citations to source | Retrieval quality becomes a system you must build and evaluate |
| Fine-tune | You need a behavior, format, or domain style, not fresh facts | Does not keep facts current; retrain to update knowledge |

These combine (fine-tune for style plus RAG for facts). Pick deliberately; do not default to RAG because it is the familiar word.
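The Step 2 decision can be read as a rough rule of thumb. Below is a minimal sketch in Python, assuming you can estimate the corpus size in tokens. The `GroundingInputs` fields, the `choose_grounding` name, and the `window_fraction` default of 0.5 are illustrative assumptions, not values prescribed by this guide.

```python
from dataclasses import dataclass


@dataclass
class GroundingInputs:
    corpus_tokens: int            # rough token count of the whole corpus
    context_window_tokens: int    # usable context window of the chosen model
    changes_often: bool           # daily or real-time updates
    needs_citations: bool         # answers must point back to a source
    needs_behavior_or_style: bool # a format, tone, or domain style is required


def choose_grounding(inputs: GroundingInputs, window_fraction: float = 0.5) -> list[str]:
    """Suggest one or more approaches. A starting point, not a verdict."""
    approaches = []
    # Assumption: leave headroom in the window for the question and the answer.
    fits_window = inputs.corpus_tokens <= inputs.context_window_tokens * window_fraction
    if fits_window and not inputs.changes_often and not inputs.needs_citations:
        approaches.append("long-context")
    else:
        approaches.append("rag")
    # Fine-tuning addresses behavior, so it combines with either option above.
    if inputs.needs_behavior_or_style:
        approaches.append("fine-tune")
    return approaches
```

For example, a 20,000-token handbook that rarely changes would come back as `["long-context"]`, while a multi-million-token ticket archive that also needs a house style would come back as `["rag", "fine-tune"]`. Treat the output as a prompt for discussion; the accuracy bar and cost ceiling from Step 1 can still override it.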

Step 3: Chunking strategy

  • Size and overlap: chunk to the unit a query actually needs (a section, not a whole doc, not a single sentence). Overlap adjacent chunks slightly so an answer split across a boundary is not lost.
  • Respect structure: split on headings, list items, table rows, and code blocks rather than blind character counts. A chunk that straddles two topics retrieves for neither.
  • Carry metadata: attach source, section, date, and any field you will filter on. Metadata filtering is often a bigger accuracy win than a better embedding model.
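The three points above can be sketched together. This is a minimal example, assuming the source uses Markdown-style `#` headings and blank lines between paragraphs. The `chunk_by_headings` name, the `max_chars` budget, and the one-paragraph overlap are hypothetical choices for illustration.

```python
import re


def chunk_by_headings(text, source, max_chars=1200, overlap_paragraphs=1):
    """Split on heading lines, then pack paragraphs into chunks with overlap."""
    chunks = []
    section = "(no heading)"
    paragraphs = []

    def flush():
        # Turn the paragraphs collected for the current section into chunks.
        start = 0
        while start < len(paragraphs):
            end = start + 1
            size = len(paragraphs[start])
            while end < len(paragraphs) and size + len(paragraphs[end]) <= max_chars:
                size += len(paragraphs[end])
                end += 1
            chunks.append({
                "text": "\n\n".join(paragraphs[start:end]),
                "source": source,    # metadata you can filter on later
                "section": section,
            })
            if end >= len(paragraphs):
                break
            # Step back so the next chunk repeats the tail of this one,
            # but always advance by at least one paragraph.
            start = max(end - overlap_paragraphs, start + 1)
        paragraphs.clear()

    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # a heading closes the previous section
            lines = block.splitlines()
            section = lines[0].lstrip("#").strip()
            rest = "\n".join(lines[1:]).strip()
            if rest:
                paragraphs.append(rest)
        else:
            paragraphs.append(block)
    flush()
    return chunks
```

A chunk never crosses a heading, so it cannot straddle two sections. The sketch does not handle tables or fenced code blocks (a `#` comment at the start of a code paragraph would be mistaken for a heading); real sources usually need a format-specific parser for those.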

Step 4: Embeddings and retrieval

  • Embedding model: match it to your content (domain, language, code) and to your query style. The embedding model and the chunking decide what is findable.
  • Vector store and top-k: start with a top-k large enough that the right chunk is usually present, then narrow. Too-small top-k is a silent recall killer.
  • Hybrid and rerank: combine semantic search with keyword search when queries contain exact terms (names, IDs, error codes) that embeddings blur. Add a reranker when precision matters and top-k is noisy.
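One common way to combine semantic and keyword results is reciprocal rank fusion (RRF), which merges ranked lists using only each item's position. The sketch below assumes you already have two ranked lists of chunk IDs from your own semantic and keyword searches. The function name is illustrative, and `k=60` is a conventional default for RRF rather than a value from this guide.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks
    ranked well by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_k]


semantic_hits = ["c1", "c2", "c3"]  # hypothetical embedding search results
keyword_hits = ["c9", "c1", "c4"]   # hypothetical keyword search results
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits], top_k=3)
```

Here `c1` ranks first because both retrievers returned it, and `c9` survives even though the embedding search missed it, which is the case exact-term queries depend on. A reranker, if you add one, would then reorder this fused list.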

Step 5: Evaluate retrieval, separately from generation

  • Build a retrieval eval set: real queries paired with the chunk(s) that should be retrieved.
  • Measure recall@k: for what fraction of queries is a correct chunk in the top-k? This number, not the answer quality, tells you whether retrieval is the bottleneck.
  • Then evaluate the answer with /ai-eval-design, on top of known-good retrieval. Fixing the prompt while recall is low is wasted effort.
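Recall@k as defined above takes only a few lines to compute. A minimal sketch, assuming the eval set is a list of `(query, relevant_chunk_ids)` pairs and that `retrieve(query, k)` is your own retrieval function returning chunk IDs. Both names are placeholders, not a specific library's API.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Return (fraction of queries with a correct chunk in the top-k, misses)."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one query")
    hits = 0
    misses = []
    for query, relevant_ids in eval_set:
        retrieved = retrieve(query, k)[:k]
        if set(retrieved) & set(relevant_ids):
            hits += 1
        else:
            misses.append(query)  # keep the failures for inspection
    return hits / len(eval_set), misses


# Hypothetical eval set and a stand-in retriever backed by a fixed lookup.
eval_set = [
    ("reset password", {"kb-12"}),
    ("error E42", {"kb-7"}),
    ("refund policy", {"kb-3", "kb-4"}),
]
fake_index = {
    "reset password": ["kb-12", "kb-1"],
    "error E42": ["kb-2", "kb-9"],
    "refund policy": ["kb-8", "kb-4"],
}


def fake_retrieve(query, k):
    return fake_index[query][:k]


score, misses = recall_at_k(eval_set, fake_retrieve, k=2)
```

In this toy run two of the three queries are hits, and the miss is the exact-term query `error E42`. The list of missed queries is often more useful than the score itself, because it shows which kind of query retrieval is failing on.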

Step 6: Output the RAG design

# RAG Design: (feature)

**Grounding decision:** (RAG / long-context / fine-tune / hybrid) and why
**Corpus:** (size, update frequency, structure)

## Chunking
- Strategy: (structure-aware split, size, overlap)
- Metadata captured: (fields)

## Retrieval
- Embedding model: (name and why)
- Store + top-k: (choice)
- Hybrid / rerank: (yes/no and why)

## Evaluation
- Retrieval eval set: (size, source)
- Recall@k target: (value)
- Answer eval: (link to eval plan)

## Open questions
- (unresolved decisions)

Step 7: Review

Ask the user:

  • Did you confirm RAG beats a long-context prompt for this corpus?
  • Can you measure recall@k today, or are you flying blind on retrieval?
  • Do queries contain exact terms that need keyword search alongside embeddings?
  • How does the index stay fresh as the source changes?
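For the freshness question, one simple option is to re-embed only what changed by comparing content hashes. The sketch below assumes documents are available as a `{doc_id: text}` mapping and that you store one hash per indexed document. The `plan_index_update` name and both data shapes are hypothetical.

```python
import hashlib


def plan_index_update(current_docs, indexed_hashes):
    """Compare current content against stored hashes.

    Returns (ids to embed or re-embed, ids to delete, new hash map).
    """
    new_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    # New or changed documents need (re)chunking and (re)embedding.
    to_upsert = sorted(d for d, h in new_hashes.items() if indexed_hashes.get(d) != h)
    # Documents that vanished from the source should leave the index too.
    to_delete = sorted(d for d in indexed_hashes if d not in new_hashes)
    return to_upsert, to_delete, new_hashes
```

Deletions matter as much as updates: a chunk from a removed document will keep being retrieved until it is dropped from the index.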

Anti-patterns

| Anti-pattern | Why it fails | Do instead |
| --- | --- | --- |
| RAG by reflex | Builds a retrieval system a long-context prompt would beat | Run the Step 2 decision first |
| Fixed-size blind chunking | Chunks straddle topics; retrieval gets noise | Split on structure, carry metadata |
| No retrieval eval | Tunes the prompt while recall is the real defect | Measure recall@k separately |
| Embeddings only for exact terms | IDs, names, error codes get blurred | Add keyword / hybrid search |
| Stuffing the whole corpus every call | Pays for context no single query needs | Retrieve the relevant chunks |
| Fine-tune to add facts | Knowledge goes stale and is costly to update | Fine-tune for behavior, retrieve for facts |

Output location

Present the RAG design as formatted text in the conversation for the user to copy into their design doc.