How RAG lets a language model answer questions grounded in your documents — not just its training data.
Without RAG, the LLM answers from its training data alone. It may hallucinate facts it was never trained on, give outdated information, or confidently invent details about your private documents, which it has never seen.
With RAG, the LLM sees only the retrieved excerpts and is instructed to answer from them. If the answer isn't in the retrieved chunks, it can say so instead of fabricating one.
A long document is too big to search precisely, so we slide a window of, say, 100 words across the text, moving 50 words at a time; each window is one chunk, and consecutive chunks overlap by 50 words. The overlap ensures that a sentence straddling two windows appears whole in at least one of them. Think of it as tearing a book into index cards where consecutive cards share a few lines.
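The sliding-window idea can be sketched in a few lines. The window and stride sizes below match the 100/50 example above, but the function names and defaults are illustrative, not a fixed API:

```python
def chunk_words(text, window=100, stride=50):
    """Split text into overlapping word windows.

    window: chunk size in words; stride: step between chunk starts.
    Consecutive chunks share (window - stride) words of overlap.
    """
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)]
    chunks = []
    # Stop once the remaining tail is already covered by the last window.
    for start in range(0, len(words) - stride, stride):
        chunks.append(" ".join(words[start:start + window]))
    return chunks

chunks = chunk_words("a b c d e f g h", window=5, stride=3)
# two chunks sharing the two-word overlap "d e"
```

With an 8-word input, a 5-word window, and a 3-word stride, the chunks are "a b c d e" and "d e f g h": every word is covered and the boundary words appear in both.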
We pass every chunk through a sentence-transformer model, which outputs a vector of ~384 numbers (an embedding). These numbers encode the meaning of the text — similar sentences get similar vectors. All chunk vectors are stacked into a matrix: the vector index. This is done once and saved to disk.
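The indexing step looks roughly like this. A real pipeline would call a sentence-transformer (e.g. a 384-dimension model) for `embed`; here a toy hashed bag-of-words embedding stands in so the sketch runs end-to-end — it only captures word overlap, not meaning, but the shapes and flow are the same:

```python
import hashlib
import numpy as np

DIM = 384  # matches the ~384-dim output of small sentence-transformers

def embed(text):
    """Toy stand-in for a sentence-transformer: hash each word into one
    of DIM buckets, then L2-normalize so dot products become cosine
    similarities."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(chunks):
    """Stack one embedding per chunk into an (n_chunks, DIM) matrix."""
    return np.vstack([embed(c) for c in chunks])

chunks = ["the heart pumps blood", "the liver filters toxins"]
index = build_index(chunks)
np.save("index.npy", index)  # built once, reloaded for every query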
When a user asks a question, we convert it into the same numeric format as the stored chunks using the same embedding model. A single matrix–vector product then scores the question against every chunk at once, and we pick the top-K highest-scoring chunks — those most likely to contain the answer. Unlike keyword search, this works on meaning: "heart attack" and "myocardial infarction" match even though the words differ.
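Assuming the chunk embeddings and the query embedding are L2-normalized, cosine similarity against every chunk really is one matrix–vector product. A minimal sketch with NumPy, using tiny hand-written 2-D vectors in place of real embeddings:

```python
import numpy as np

def top_k_chunks(query_vec, index, chunks, k=3):
    """index: (n_chunks, dim) matrix of L2-normalized chunk embeddings.
    One matrix-vector product scores the query against every chunk."""
    scores = index @ query_vec           # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(chunks[i], float(scores[i])) for i in best]

# Tiny worked example in 2-D (real embeddings have ~384 dims):
chunks = ["chest pain and heart attack",
          "liver enzyme levels",
          "myocardial infarction treatment"]
index = np.array([[1.0, 0.0],   # pretend these came from the embedding model
                  [0.0, 1.0],
                  [0.9, 0.1]])
index = index / np.linalg.norm(index, axis=1, keepdims=True)
query = np.array([1.0, 0.0])    # a question about heart attacks
results = top_k_chunks(query, index, chunks, k=2)
```

Note the "myocardial infarction" chunk ranks just below the "heart attack" chunk even though it shares no words with it — in this toy setup because its vector points the same way, in a real system because the embedding model maps the two phrases to nearby vectors.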
We build a prompt that says: "Here are relevant excerpts from the documents: [top-K chunks]. Using only these, answer: [user question]." The LLM sees only the retrieved chunks, not the full corpus. It can now give a precise, sourced answer. If the chunks don't contain the answer, a well-prompted model will say so rather than hallucinating.
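Prompt assembly is plain string formatting. The exact wording below is illustrative — any template that presents the excerpts, restricts the model to them, and gives it an explicit way out works:

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt from the top-K retrieved chunks.
    Numbered excerpts let the model (and the user) cite sources."""
    excerpts = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Here are relevant excerpts from the documents:\n\n"
        f"{excerpts}\n\n"
        "Using only these excerpts, answer the question below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Question: {question}"
    )

prompt = build_prompt("What is X?", ["Excerpt about X.", "Excerpt about Y."])
```

The explicit "say so" instruction is what turns a missing answer into an honest refusal rather than a hallucination.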