
Generate Answers

Grounding the LLM in Your Data

Context Assembly

You've retrieved the top matching chunks. Now you need to assemble them into a prompt that the LLM can use to generate an answer. This is where RAG comes together.

┌──────────────────────────────────────────┐
│              System Prompt               │
│  "Answer based ONLY on the provided      │
│   context. If the context doesn't        │
│   contain the answer, say so."           │
├──────────────────────────────────────────┤
│              Context Block               │
│  [Source: handbook.md, Section: PTO]     │
│  Employees receive 15 days of paid...    │
│                                          │
│ [Source: handbook.md, Section: Holidays] │
│  The company observes 10 federal...      │
├──────────────────────────────────────────┤
│              User Question               │
│  "How many vacation days do I get?"      │
└──────────────────────────────────────────┘

The context block sits between the system prompt and the user's question. Each chunk is labeled with its source — this helps the LLM attribute its answer and helps you verify correctness.

Prompt Template

Here's the pattern:

function buildPrompt(question: string, chunks: SearchResult[]): string {
  const context = chunks
    .map((c, i) =>
      `[Source ${i + 1}: ${c.source}${c.heading ? `, ${c.heading}` : ""}]\n${c.content}`
    )
    .join("\n\n");

  return `Answer the user's question based ONLY on the following context.
If the context does not contain enough information to answer, say "I don't have enough information to answer that question."
Always cite which source(s) you used.

## Context

${context}

## Question

${question}`;
}
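The diagram above keeps the system prompt separate from the context and the question, while buildPrompt returns one string. If your chat API accepts role-tagged messages, one way to preserve that separation is sketched below. The `SearchResult` fields, the `ChatMessage` shape, and `buildMessages` are assumptions made for illustration; adapt them to whatever your client expects.

```typescript
// Assumed shape of a retrieved chunk; field names mirror those used by buildPrompt.
interface SearchResult {
  source: string;
  heading?: string;
  content: string;
}

// Hypothetical role-tagged message, similar to what many chat APIs accept.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

const SYSTEM_PROMPT = [
  "Answer the user's question based ONLY on the provided context.",
  `If the context does not contain enough information to answer, say "I don't have enough information to answer that question."`,
  "Always cite which source(s) you used.",
].join("\n");

function buildMessages(question: string, chunks: SearchResult[]): ChatMessage[] {
  // Label each chunk with a number and its origin so the model can cite it.
  const context = chunks
    .map((c, i) => {
      const label = c.heading ? `${c.source}, ${c.heading}` : c.source;
      return `[Source ${i + 1}: ${label}]\n${c.content}`;
    })
    .join("\n\n");

  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `## Context\n\n${context}\n\n## Question\n\n${question}` },
  ];
}
```

Keeping the instructions in the system message means the rules stay fixed while only the user message changes per question.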

Key Design Decisions

"ONLY on the following context": This constraint reduces hallucination. Without it, the LLM tends to mix retrieved facts with whatever it remembers from training data. It is a strong nudge, not a guarantee, so keep verifying answers against their sources.

Source labels: Numbering sources ([Source 1], [Source 2]) gives the LLM a way to reference them in its answer.

Explicit fallback: The instruction to say "I don't have enough information" tells the model to admit gaps rather than fabricate answers.
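Numbered labels also make the answer checkable in code. The sketch below pulls the cited source numbers out of an answer and flags any that do not correspond to a chunk you actually supplied. It assumes the model cites in the "[Source N]" style used by the template, which is not guaranteed; both function names are hypothetical.

```typescript
// Returns the distinct source numbers cited in an answer, in ascending order.
// Assumes citations look like "[Source 2]" or "[Source 2: handbook.md]".
function citedSourceNumbers(answer: string): number[] {
  const pattern = /\[Source (\d+)/g;
  const found = new Set<number>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    found.add(Number(match[1]));
  }
  return Array.from(found).sort((a, b) => a - b);
}

// Flags citations that point outside the range of sources placed in the prompt.
function invalidCitations(answer: string, sourceCount: number): number[] {
  return citedSourceNumbers(answer).filter((n) => n < 1 || n > sourceCount);
}
```

An empty result from `citedSourceNumbers` on a non-fallback answer is itself a useful warning sign: the model answered without citing anything.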

Grounding the LLM

Grounding means the LLM's response is anchored to specific evidence. Without grounding:

Q: "What's our remote work policy?"
A: "Most companies allow 2-3 days of remote work per week..."  ← hallucinated from training data

With grounding:

Q: "What's our remote work policy?"
A: "According to the remote work policy, employees can work remotely
    up to 3 days per week with manager approval. Core hours are
    10 AM - 3 PM in your local timezone. [Source: remote-work.md]"

The grounded answer cites the actual policy document. If someone updates the policy, the system reflects the change as soon as the updated document is re-indexed. The model itself needs no retraining.

Handling "I Don't Know"

A well-built RAG system knows when it doesn't know. Two signals:

No chunks above threshold: The retrieval step found nothing relevant (every similarity score fell below the cutoff, 0.7 in this example)

LLM says so: The prompt instruction tells it to say "I don't have enough information" when the retrieved chunks don't contain the answer

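async function answer(question: string): Promise<string> {
  // Layer 1: retrieval filter. Only chunks scoring at or above the threshold come back.
  const chunks = await searchDocs(question, { threshold: 0.7 });

  if (chunks.length === 0) {
    // Nothing relevant was retrieved, so skip the LLM call entirely.
    return "I don't have any information about that topic in the documents I've been given.";
  }

  // Layer 2: the prompt itself instructs the model to admit when the context falls short.
  const prompt = buildPrompt(question, chunks);
  const response = await llm.chat(prompt);
  return response;
}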

This two-layer check is important. The retrieval filter catches obvious misses ("What's the weather?"). The prompt instruction catches subtle ones where chunks match topically but don't contain the specific answer.
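To log or monitor which layer fired, you can compare the model's reply against the fallback sentence from the prompt. This is a minimal sketch: it assumes the model repeats the sentence close to verbatim, so a paraphrased refusal would slip through. `isFallbackAnswer`, `classifyOutcome`, and the outcome labels are hypothetical names.

```typescript
// The fallback sentence requested in the prompt template.
const FALLBACK_SENTENCE = "I don't have enough information to answer that question.";

// Normalizes case, curly apostrophes, and whitespace so small formatting differences still match.
function normalizeText(text: string): string {
  return text
    .toLowerCase()
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/\s+/g, " ")
    .trim();
}

// True when the reply contains the fallback sentence.
// Assumption: the model follows the instruction closely; paraphrases are not detected.
function isFallbackAnswer(response: string): boolean {
  return normalizeText(response).includes(normalizeText(FALLBACK_SENTENCE));
}

type AnswerOutcome = "no_chunks" | "llm_declined" | "answered";

// Classifies a request by which layer, if any, reported a gap.
function classifyOutcome(chunkCount: number, response: string): AnswerOutcome {
  if (chunkCount === 0) return "no_chunks";
  return isFallbackAnswer(response) ? "llm_declined" : "answered";
}
```

A high share of "llm_declined" outcomes suggests retrieval is returning on-topic chunks that lack the answer, which usually points at chunking or indexing rather than the prompt.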

Temperature for Q&A

Temperature controls how creative/random the LLM's output is:

Temperature	Behavior	Use Case
0.0	Near-deterministic, picks the most likely token	Factual Q&A, data extraction
0.3	Slight variation, still focused	Document Q&A (our default)
0.7	More creative, diverse phrasing	Content generation, brainstorming
1.0	High randomness, samples from the model's unscaled distribution	Creative writing

For RAG Q&A, use temperature 0.0-0.3. You want the model to stick to the facts in the retrieved context, not get creative.
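To see why low temperatures keep the model close to its most likely token, it helps to look at how temperature rescales token scores before sampling. The sketch below applies the standard softmax-with-temperature formula to made-up scores. `softmaxWithTemperature` and the example scores are illustrative only; real LLM APIs do this internally.

```typescript
// Converts raw token scores (logits) into probabilities, scaled by temperature.
// Temperature 0 is treated as greedy decoding: all probability goes to the highest score.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (logits.length === 0) return [];
  const maxLogit = Math.max(...logits);
  if (temperature <= 0) {
    const winner = logits.indexOf(maxLogit);
    return logits.map((_, i) => (i === winner ? 1 : 0));
  }
  // Subtract the max before exponentiating for numerical stability.
  const exps = logits.map((l) => Math.exp((l - maxLogit) / temperature));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Made-up scores for three candidate next tokens.
const logits = [2.0, 1.0, 0.5];
for (const t of [0, 0.3, 0.7, 1.0]) {
  const probs = softmaxWithTemperature(logits, t).map((p) => p.toFixed(3));
  console.log(`temperature ${t}: ${probs.join(", ")}`);
}
```

As the temperature drops, probability mass concentrates on the top-scoring token, which is the behavior you want when the answer should follow the retrieved context.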

Streaming Responses

For a good user experience, stream the response token by token instead of waiting for the full answer:

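// With `stream: true`, the client is assumed to return an async iterable of text pieces.
const stream = await llm.chat(prompt, { stream: true });

for await (const token of stream) {
  // `token` is a piece of generated text, not a retrieved document chunk.
  process.stdout.write(token); // or send via SSE to the frontend
}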

Streaming makes the system feel responsive even when generating long answers. The user sees the answer forming in real time.
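For the SSE option mentioned in the code comment, each piece of text has to be framed as a `data:` line followed by a blank line. The sketch below shows that framing with a stand-in token stream so it runs without a real LLM client. `formatSseEvent`, `streamAsSse`, `fakeTokenStream`, and the `[DONE]` marker are all assumptions for illustration.

```typescript
// Formats one piece of text as a Server-Sent Events message.
// JSON-encoding keeps newlines inside a token from breaking the SSE framing.
function formatSseEvent(token: string): string {
  return `data: ${JSON.stringify(token)}\n\n`;
}

// Forwards tokens from any async iterable to a writer and returns the full answer.
// `write` stands in for something like an HTTP response's write method.
async function streamAsSse(
  tokens: AsyncIterable<string>,
  write: (data: string) => void,
): Promise<string> {
  let full = "";
  for await (const token of tokens) {
    full += token;
    write(formatSseEvent(token));
  }
  // End-of-stream marker: a convention chosen for this sketch, not part of the SSE format.
  write("data: [DONE]\n\n");
  return full;
}

// Stand-in for the LLM stream so the sketch runs without a real client.
async function* fakeTokenStream(): AsyncGenerator<string> {
  for (const token of ["Employees ", "receive ", "15 days."]) {
    yield token;
  }
}
```

In a real server you would send the `Content-Type: text/event-stream` header before the first write, and the frontend would JSON-decode each `data:` payload. Returning the accumulated text lets you log or cache the complete answer once streaming ends.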

This is chapter 5 of RAG in 60 Minutes.
