Generate Answers
Grounding the LLM in Your Data
Context Assembly
You've retrieved the top matching chunks. Now you need to assemble them into a prompt that the LLM can use to generate an answer. This is where RAG comes together.
┌──────────────────────────────────────────┐
│ System Prompt                            │
│ "Answer based ONLY on the provided       │
│  context. If the context doesn't         │
│  contain the answer, say so."            │
├──────────────────────────────────────────┤
│ Context Block                            │
│ [Source: handbook.md, Section: PTO]      │
│ Employees receive 15 days of paid...     │
│                                          │
│ [Source: handbook.md, Section: Holidays] │
│ The company observes 10 federal...       │
├──────────────────────────────────────────┤
│ User Question                            │
│ "How many vacation days do I get?"       │
└──────────────────────────────────────────┘

The context block sits between the system prompt and the user's question. Each chunk is labeled with its source, which helps the LLM attribute its answer and helps you verify correctness.
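The three-part layout in the diagram can be sketched as a list of chat messages. This is a minimal sketch, not a specific client's API: the `SearchResult` fields, the `ChatMessage` shape, and the function names `formatChunk` and `assembleMessages` are assumptions made for illustration.

```typescript
// Hypothetical shape of a retrieved chunk; field names mirror the diagram labels.
interface SearchResult {
  source: string;
  heading?: string;
  content: string;
}

// Generic chat message shape (an assumption; adapt it to your LLM client).
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

const SYSTEM_PROMPT =
  "Answer based ONLY on the provided context. " +
  "If the context doesn't contain the answer, say so.";

// Label one chunk with its source, and its section when one is known.
function formatChunk(chunk: SearchResult): string {
  const section = chunk.heading ? `, Section: ${chunk.heading}` : "";
  return `[Source: ${chunk.source}${section}]\n${chunk.content}`;
}

// System prompt first, then the labeled context block, then the question.
function assembleMessages(question: string, chunks: SearchResult[]): ChatMessage[] {
  const contextBlock = chunks.map(formatChunk).join("\n\n");
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `${contextBlock}\n\n${question}` },
  ];
}
```

Keeping the instructions in a separate system message is one option; folding everything into a single prompt string works too, as long as the order stays the same.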
Prompt Template
Here's the pattern:
function buildPrompt(question: string, chunks: SearchResult[]): string {
  // Label each chunk with a numbered source so the model can cite it.
  const context = chunks
    .map((c, i) =>
      `[Source ${i + 1}: ${c.source}${c.heading ? `, ${c.heading}` : ""}]\n${c.content}`
    )
    .join("\n\n");

  // Instructions first, then the context, then the question.
  return `Answer the user's question based ONLY on the following context.
If the context does not contain enough information to answer, say "I don't have enough information to answer that question."
Always cite which source(s) you used.
## Context
${context}
## Question
${question}`;
}

Key Design Decisions
Numbering the sources ([Source 1], [Source 2]) gives the LLM a way to reference them in its answer.

Grounding the LLM
Grounding means the LLM's response is anchored to specific evidence. Without grounding:
Q: "What's our remote work policy?"
A: "Most companies allow 2-3 days of remote work per week..." ← hallucinated from training data

With grounding:
Q: "What's our remote work policy?"
A: "According to the remote work policy, employees can work remotely
up to 3 days per week with manager approval. Core hours are
10 AM - 3 PM in your local timezone. [Source: remote-work.md]"

The grounded answer cites the actual policy document. If someone updates the policy, the RAG system reflects the change as soon as the updated document is re-indexed; no retraining is needed.
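Source labels also make a simple automated check possible: every source the answer cites should be one that was actually retrieved. The sketch below assumes the model cites in the `[Source: file]` style of the example answer above; `extractCitations` and `findUnknownCitations` are hypothetical helper names, not part of any library.

```typescript
// Collect the distinct file names cited as "[Source: name]" or "[Source: name, ...]".
function extractCitations(answer: string): string[] {
  const pattern = /\[Source:\s*([^\],]+)/g;
  const cited: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const name = match[1].trim();
    if (cited.indexOf(name) === -1) {
      cited.push(name);
    }
  }
  return cited;
}

// Return cited names that do not match any retrieved chunk's source.
function findUnknownCitations(answer: string, retrievedSources: string[]): string[] {
  return extractCitations(answer).filter(
    (name) => retrievedSources.indexOf(name) === -1
  );
}
```

An empty result only shows that the cited documents exist in the retrieved set. It does not prove that the claims in the answer are supported by them, so it complements a manual spot check rather than replacing it.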
Handling "I Don't Know"
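A well-built RAG system knows when it doesn't know. Two signals tell it to decline: retrieval returns no chunks above the similarity threshold, or the retrieved chunks don't contain the answer and the prompt instructs the model to say so.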
async function answer(question: string): Promise<string> {
  // Layer 1: the retrieval filter drops chunks below the similarity threshold.
  const chunks = await searchDocs(question, { threshold: 0.7 });
  if (chunks.length === 0) {
    return "I don't have any information about that topic in the documents I've been given.";
  }

  // Layer 2: the prompt tells the model to decline if the context lacks the answer.
  const prompt = buildPrompt(question, chunks);
  const response = await llm.chat(prompt);
  return response;
}

This two-layer check is important. The retrieval filter catches obvious misses ("What's the weather?"). The prompt instruction catches subtle ones where chunks match topically but don't contain the specific answer.
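To see which layer is doing the work, it can help to record the outcome of each request. The following is a sketch under assumptions: `ScoredChunk` and its `score` field are hypothetical (the chapter does not show what `searchDocs` returns), and the fallback sentence is the one quoted in the prompt template.

```typescript
// Hypothetical retrieval result with a similarity score between 0 and 1.
interface ScoredChunk {
  source: string;
  content: string;
  score: number;
}

// Layer 1: keep only chunks at or above the similarity threshold.
function filterByThreshold(chunks: ScoredChunk[], threshold: number): ScoredChunk[] {
  return chunks.filter((c) => c.score >= threshold);
}

// Layer 2: the exact refusal sentence the prompt template asks the model to use.
const FALLBACK = "I don't have enough information to answer that question.";

type Outcome = "no_relevant_chunks" | "context_lacked_answer" | "answered";

// Classify a request by which layer, if any, produced an "I don't know".
function classifyOutcome(keptChunks: ScoredChunk[], response: string): Outcome {
  if (keptChunks.length === 0) {
    return "no_relevant_chunks";
  }
  if (response.indexOf(FALLBACK) !== -1) {
    return "context_lacked_answer";
  }
  return "answered";
}
```

A high share of `context_lacked_answer` outcomes may suggest the threshold is too permissive or the documents have gaps, though that interpretation depends on your data.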
Temperature for Q&A
Temperature controls how creative/random the LLM's output is:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.0 | Deterministic, picks most likely token | Factual Q&A, data extraction |
| 0.3 | Slight variation, still focused | Document Q&A (our default) |
| 0.7 | More creative, diverse phrasing | Content generation, brainstorming |
| 1.0 | Samples from the model's unscaled distribution; most varied of these settings | Creative writing |
For RAG Q&A, use temperature 0.0-0.3. You want the model to stick to the facts in the retrieved context, not get creative.
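In most sampling implementations, temperature works by dividing the model's logits before the softmax step. The sketch below illustrates that mechanism on made-up logits; `softmaxWithTemperature` is an illustrative helper, and a hosted LLM API may apply temperature differently or combine it with other sampling settings.

```typescript
// Convert logits to probabilities, scaling by temperature first.
// Temperature 0 is treated as the greedy limit: all mass on the top logit.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature < 0) {
    throw new RangeError("temperature must be non-negative");
  }
  if (logits.length === 0) {
    return [];
  }
  const maxLogit = Math.max(...logits);
  if (temperature === 0) {
    const top = logits.indexOf(maxLogit);
    return logits.map((_, i) => (i === top ? 1 : 0));
  }
  // Subtract the max before exponentiating for numerical stability.
  const scaled = logits.map((l) => Math.exp((l - maxLogit) / temperature));
  const total = scaled.reduce((sum, v) => sum + v, 0);
  return scaled.map((v) => v / total);
}

// Made-up logits for three candidate tokens.
const exampleLogits = [2.0, 1.0, 0.1];
const focused = softmaxWithTemperature(exampleLogits, 0.3);
const varied = softmaxWithTemperature(exampleLogits, 1.0);
```

With these logits, the top token gets roughly 0.96 of the probability at temperature 0.3 and roughly 0.66 at 1.0, which is why lower settings keep answers closer to the most likely wording.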
Streaming Responses
For a good user experience, stream the response token by token instead of waiting for the full answer:
const stream = await llm.chat(prompt, { stream: true });
for await (const chunk of stream) {
  process.stdout.write(chunk); // or send via SSE to the frontend
}

Streaming makes the system feel responsive even when generating long answers. The user sees the answer forming in real time.
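For the SSE route mentioned in the code comment, each token has to be wrapped in a Server-Sent Events frame before it goes to the browser. This is a minimal sketch: `toSseFrame` and `forwardStream` are hypothetical helpers, the `write` callback stands in for whatever your server uses to send bytes, and the closing `done` event is a convention assumed here, not something the SSE format requires.

```typescript
// Wrap one token in an SSE frame. JSON-encoding the token keeps newlines
// inside it from being read as frame boundaries.
function toSseFrame(token: string): string {
  return `data: ${JSON.stringify(token)}\n\n`;
}

// Forward every token from the LLM stream to the client as it arrives.
async function forwardStream(
  tokens: AsyncIterable<string>,
  write: (frame: string) => void
): Promise<void> {
  for await (const token of tokens) {
    write(toSseFrame(token));
  }
  // Assumed convention: a named event that tells the client the answer is complete.
  write("event: done\ndata: {}\n\n");
}
```

In a Node HTTP handler, `write` could be `(frame) => res.write(frame)` after setting the `Content-Type` header to `text/event-stream`. The client then calls `JSON.parse` on each event's data to recover the token.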
This is chapter 5 of RAG in 60 Minutes.