AI / NLP · Teaching Reference

Retrieval-Augmented Generation

How RAG lets a language model answer questions grounded in your documents — not just its training data.

Example Code micro / rag-example git.shadyknollcave.io/micro/rag-example

Phase 1 · Offline — Indexing

Done once
📄
Documents
Raw text, PDFs, notes
✂️
Chunking
Overlapping word windows
🔢
Embedding Model
AI model that converts text to numbers
🗄️
Vector Index
Searchable database of chunk representations
Chunking detail Slides a window of N words across the document with a stride smaller than the window, so adjacent chunks overlap. Overlap prevents a sentence from being "cut" between two chunks and losing context.
Embedding detail Each chunk → a dense vector of ~384–768 floats. Semantically similar text lands close together in this high-dimensional space. The model is frozen; all chunks use the same weights.
Vector index detail All chunk representations are stored together in an index file on disk — essentially a searchable database of your document contents in numeric form. Simple implementations use in-memory arrays; production systems use dedicated vector databases (see the RAG Ecosystem guide) for scale and performance.

Phase 2 · Online — Query

Per question
💬
User Question
Natural language query
🔢
Embedding Model
Same model as Phase 1
📐
Similarity Search
Compares the question to every stored chunk
🏆
Top-K Chunks
Most relevant passages
📝
LLM Prompt
Chunks + question injected
Answer
Grounded in your docs
Similarity search The question is converted into the same numeric format as the stored chunks, then compared against all of them in one fast matrix operation, typically cosine similarity. Chunks scoring closest to 1.0 are considered most relevant. The LLM never reads the full document set; it sees only the top matching passages.
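The score described above is usually cosine similarity between the question vector and a chunk vector. A minimal sketch in pure Python (plain float lists stand in for real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Identical vectors score 1.0; unrelated (orthogonal) vectors score 0.0, which is why "closest to 1.0" means "most relevant."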
Why must it be the same embedding model? Chunks and the query must be encoded the same way to be comparable. Using different models would be like measuring distance with different rulers — the numbers wouldn't mean the same thing.
Prompt construction The retrieved chunks are pasted into a system message: "Use only the context below to answer." This grounds the LLM and strongly discourages hallucination on topics outside the documents.

❌ Without RAG

The LLM answers from training data alone. It may hallucinate facts it wasn't trained on, give outdated information, or confidently make up details about your private documents it has never seen.

✅ With RAG

The LLM only sees the retrieved excerpts and is instructed to answer from them. If the answer isn't in the retrieved chunks, a well-prompted model says so rather than hallucinating content outside its context window.

Student Pipeline Walk-Through

1

Chunking — Slice the documents into pieces

A long document is too big to search precisely. We slide a window of, say, 100 words across the text, moving 50 words at a time (50-word overlap). Each window = one chunk. Overlap ensures that any sentence shorter than the overlap, even one spanning a window boundary, lands fully inside at least one chunk. Think of it like tearing a book into index cards where consecutive cards share a few lines.
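The sliding-window idea fits in a few lines. A sketch (window and stride sizes are the illustrative values from the text, not fixed requirements):

```python
def chunk_words(text, window=100, stride=50):
    """Slide a window of `window` words with step `stride`.
    Overlap between adjacent chunks = window - stride."""
    words = text.split()
    chunks = []
    # Stop once the remaining words fit in the previous window's tail.
    for start in range(0, max(len(words) - stride, 1), stride):
        chunks.append(" ".join(words[start:start + window]))
    return chunks
```

With window=100 and stride=50, each chunk shares its last 50 words with the next one, matching the 50-word overlap described above.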

2

Indexing — Turn each chunk into a number vector

We pass every chunk through a sentence-transformer model, which outputs a vector of ~384 numbers (an embedding). These numbers encode the meaning of the text — similar sentences get similar vectors. All chunk vectors are stacked into a matrix: the vector index. This is done once and saved to disk.
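The indexing step can be sketched as below. The `toy_embed` function is a hashing stand-in so the example runs anywhere, not a real semantic model; a real pipeline would call something like `SentenceTransformer("all-MiniLM-L6-v2").encode` instead. The shape of the result is the same: one row per chunk, stacked into a matrix.

```python
import hashlib
import numpy as np

DIM = 384  # typical sentence-transformer output size

def toy_embed(text, dim=DIM):
    """Stand-in for a real embedding model: hashes each word into a
    fixed-size vector. Captures no semantics; only the shape matters here."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

def build_index(chunks):
    """Stack one vector per chunk into a (num_chunks, dim) matrix.
    This runs once, offline, and the matrix is saved to disk."""
    return np.vstack([toy_embed(c) for c in chunks])
```

Saving is one call (`np.save("vector_index.npy", index)`); production systems would store the same rows in a dedicated vector database instead.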

3

Retrieval — Find what's relevant using semantic similarity

When a user asks a question, we convert it into the same numeric format as the stored chunks using the same embedding model. A single fast operation then computes how similar the question is to every chunk at once. We pick the top-K highest-scoring chunks — those most likely to contain the answer. Unlike keyword search, this works on meaning: "heart attack" and "myocardial infarction" match even though the words differ.
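The "single fast operation" above is one matrix-vector product over normalized vectors. A NumPy sketch (toy vectors stand in for real embeddings):

```python
import numpy as np

def top_k(query_vec, index, k=3):
    """Cosine similarity of the query against every stored chunk at once.
    `index` is the (num_chunks, dim) matrix built during Phase 1."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = index_n @ q                    # one dot product per chunk
    order = np.argsort(scores)[::-1][:k]    # highest scores first
    return order, scores[order]
```

The returned indices point back at the original chunk texts, which is what gets pasted into the prompt in the next step.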

4

Generation — Ask the LLM with grounded context

We build a prompt that says: "Here are relevant excerpts from the documents: [top-K chunks]. Using only these, answer: [user question]." The LLM sees only the retrieved chunks, not the full corpus. It can now give a precise, sourced answer. If the chunks don't contain the answer, a well-prompted model will say so rather than hallucinating.
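A minimal prompt builder for this step; the exact wording is illustrative, not a canonical template:

```python
def build_prompt(chunks, question):
    """Inject the retrieved chunks and the user question into a grounded prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Use only the context below to answer. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The numbered `[1]`, `[2]` markers also let the model cite which excerpt an answer came from, which is useful for showing sources to the user.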

Go Deeper RAG Ecosystem Reference Enterprise platforms · Vector DBs · Embedding models · Frameworks
All Guides AI Teaching Reference Back to the full index of teaching pages