How RAG lets a language model answer questions grounded in your documents — not just its training data.
Without RAG, the LLM answers from its training data alone. It may hallucinate facts it was never trained on, give outdated information, or confidently invent details about your private documents, which it has never seen.
With RAG, the LLM sees only the retrieved excerpts and is instructed to answer from them. If the answer isn't in the retrieved chunks, it can say so instead of fabricating one.
A long document is too big to search precisely, so we slide a window of, say, 100 words across the text, moving 50 words at a time; each window is one chunk, and consecutive chunks overlap by 50 words. The overlap ensures that a sentence straddling two windows appears whole in at least one of them. Think of it as tearing a book into index cards where consecutive cards share a few lines.
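The sliding-window idea can be sketched in a few lines. The window and stride sizes below match the 100/50 example above, but the function names and defaults are illustrative, not a fixed API:

```python
def chunk_words(text, window=100, stride=50):
    """Split text into overlapping word windows.

    window: chunk size in words; stride: step between chunk starts.
    Consecutive chunks share (window - stride) words of overlap.
    """
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)]
    chunks = []
    # Stop once the remaining tail is already covered by the last window.
    for start in range(0, len(words) - stride, stride):
        chunks.append(" ".join(words[start:start + window]))
    return chunks

chunks = chunk_words("a b c d e f g h", window=5, stride=3)
# two chunks sharing the two-word overlap "d e"
```

With an 8-word input, a 5-word window, and a 3-word stride, the chunks are "a b c d e" and "d e f g h": every word is covered and the boundary words appear in both.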
We pass every chunk through a sentence-transformer model, which outputs a vector of ~384 numbers (an embedding). These numbers encode the meaning of the text — similar sentences get similar vectors. All chunk vectors are stacked into a matrix: the vector index. This is done once and saved to disk.
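The indexing step looks roughly like this. A real pipeline would call a sentence-transformer (e.g. a 384-dimension model) for `embed`; here a toy hashed bag-of-words embedding stands in so the sketch runs end-to-end — it only captures word overlap, not meaning, but the shapes and flow are the same:

```python
import hashlib
import numpy as np

DIM = 384  # matches the ~384-dim output of small sentence-transformers

def embed(text):
    """Toy stand-in for a sentence-transformer: hash each word into one
    of DIM buckets, then L2-normalize so dot products become cosine
    similarities."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(chunks):
    """Stack one embedding per chunk into an (n_chunks, DIM) matrix."""
    return np.vstack([embed(c) for c in chunks])

chunks = ["the heart pumps blood", "the liver filters toxins"]
index = build_index(chunks)
np.save("index.npy", index)  # built once, reloaded for every query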
When a user asks a question, we convert it into the same numeric format as the stored chunks using the same embedding model. A single matrix–vector product then scores the question against every chunk at once, and we pick the top-K highest-scoring chunks — those most likely to contain the answer. Unlike keyword search, this works on meaning: "heart attack" and "myocardial infarction" match even though the words differ.
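Assuming the chunk embeddings and the query embedding are L2-normalized, cosine similarity against every chunk really is one matrix–vector product. A minimal sketch with NumPy, using tiny hand-written 2-D vectors in place of real embeddings:

```python
import numpy as np

def top_k_chunks(query_vec, index, chunks, k=3):
    """index: (n_chunks, dim) matrix of L2-normalized chunk embeddings.
    One matrix-vector product scores the query against every chunk."""
    scores = index @ query_vec           # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(chunks[i], float(scores[i])) for i in best]

# Tiny worked example in 2-D (real embeddings have ~384 dims):
chunks = ["chest pain and heart attack",
          "liver enzyme levels",
          "myocardial infarction treatment"]
index = np.array([[1.0, 0.0],   # pretend these came from the embedding model
                  [0.0, 1.0],
                  [0.9, 0.1]])
index = index / np.linalg.norm(index, axis=1, keepdims=True)
query = np.array([1.0, 0.0])    # a question about heart attacks
results = top_k_chunks(query, index, chunks, k=2)
```

Note the "myocardial infarction" chunk ranks just below the "heart attack" chunk even though it shares no words with it — in this toy setup because its vector points the same way, in a real system because the embedding model maps the two phrases to nearby vectors.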
We build a prompt that says: "Here are relevant excerpts from the documents: [top-K chunks]. Using only these, answer: [user question]." The LLM sees only the retrieved chunks, not the full corpus. It can now give a precise, sourced answer. If the chunks don't contain the answer, a well-prompted model will say so rather than hallucinating.
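Prompt assembly is plain string formatting. The exact wording below is illustrative — any template that presents the excerpts, restricts the model to them, and gives it an explicit way out works:

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt from the top-K retrieved chunks.
    Numbered excerpts let the model (and the user) cite sources."""
    excerpts = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Here are relevant excerpts from the documents:\n\n"
        f"{excerpts}\n\n"
        "Using only these excerpts, answer the question below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Question: {question}"
    )

prompt = build_prompt("What is X?", ["Excerpt about X.", "Excerpt about Y."])
```

The explicit "say so" instruction is what turns a missing answer into an honest refusal rather than a hallucination.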