Upload & Chunk
Splitting Documents for Search
Document Formats
Real companies store knowledge in many formats. Your RAG pipeline needs to handle all of them:
| Format | Example | Challenge |
|---|---|---|
| Markdown | Company handbook, policies | Headers create natural section boundaries |
| JSON | FAQ databases, product catalogs | Structured data needs flattening into text |
| Plain text | Meeting notes, emails | No structure — must infer boundaries |
| PDF | Contracts, reports | Needs text extraction (parsing libraries or OCR) |
In this course, your pre-seeded data includes Markdown policies, JSON FAQs, and JSON product docs. The patterns you learn here apply to any format.
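To make "flattening into text" concrete, here is a minimal sketch for FAQ-style JSON. The `FaqEntry` shape and both function names are assumptions for illustration; the course's pre-seeded data may use different field names.

```typescript
// Hypothetical FAQ record shape; the real data may use different fields.
interface FaqEntry {
  question: string;
  answer: string;
  category?: string;
}

// Turn one structured record into a single searchable text passage.
function flattenFaq(entry: FaqEntry): string {
  const lines = [`Q: ${entry.question}`, `A: ${entry.answer}`];
  if (entry.category) {
    // Put the category first so it reads like a short heading.
    lines.unshift(`Category: ${entry.category}`);
  }
  return lines.join("\n");
}

// Parse a JSON array of FAQ records and flatten each one.
function flattenFaqJson(json: string): string[] {
  const entries = JSON.parse(json) as FaqEntry[];
  return entries.map(flattenFaq);
}
```

Each FAQ entry is already small and self-contained, so one flattened entry can usually serve as one chunk.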
Why Chunking Matters
LLMs have a limited context window — you can't paste your entire handbook into every prompt. Even if you could, the model performs worse with too much irrelevant text. Chunking solves this by splitting documents into small, focused pieces so you only retrieve what's relevant.
```
┌──────────────────────────────────┐
│      Full Company Handbook       │
│     (50 pages, 25,000 words)     │
└──────────────┬───────────────────┘
               │ chunk
               ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │ │ Chunk N │
│   PTO   │ │ Remote  │ │ Expense │ │   ...   │
│ Policy  │ │  Work   │ │ Policy  │ │         │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
```

When someone asks "What's the PTO policy?", you retrieve Chunk 1, not the entire 50-page handbook.
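The idea of fetching only the relevant chunk can be sketched with a toy ranker. Real RAG systems usually rank chunks by embedding similarity; this example swaps in plain keyword overlap purely for illustration, and all function names are hypothetical.

```typescript
// Lowercase a string and split it into alphanumeric word tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Score a chunk by how many distinct query words it contains.
function keywordScore(query: string, chunk: string): number {
  const chunkWords = new Set(tokenize(chunk));
  let score = 0;
  new Set(tokenize(query)).forEach((word) => {
    if (chunkWords.has(word)) score += 1;
  });
  return score;
}

// Return the k best-scoring chunks, highest score first.
function retrieveTopK(query: string, chunks: string[], k: number): string[] {
  return chunks
    .map((chunk) => ({ chunk, score: keywordScore(query, chunk) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

With small, focused chunks, even this crude scorer can put the PTO chunk first for a PTO question, and only that chunk needs to go into the prompt.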
Overlap and Boundaries
The Overlap Problem
If you split at exactly 500 characters, you might cut a sentence in half:
```
Chunk 1: "...employees receive 15 days of paid time off per year. Unused days"
Chunk 2: "carry over up to a maximum of 5 days into the following year..."
```

The answer to "Do PTO days carry over?" is split across two chunks. Overlap fixes this by repeating some text at chunk boundaries:

```
Chunk 1: "...employees receive 15 days of paid time off per year. Unused days carry over up to"
Chunk 2: "Unused days carry over up to a maximum of 5 days into the following year..."
```

A typical overlap is 10-20% of chunk size.
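A fixed-size splitter with overlap might look like the sketch below. The function name and the defaults (500 characters with a 75-character overlap, which is 15%) are assumptions chosen to match the numbers above, not values prescribed by the course.

```typescript
// Split text into fixed-size character windows that share `overlap` characters.
function chunkWithOverlap(text: string, chunkSize = 500, overlap = 75): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be >= 0 and smaller than chunkSize");
  }
  const step = chunkSize - overlap; // how far each new window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

This version still cuts wherever the character count lands, so a sentence can be broken mid-word. The overlap only guarantees that text near each boundary appears in both neighboring chunks.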
Section-Aware Chunking
Better than fixed-size: split on natural boundaries. Markdown headers (##, ###) mark topic changes. Split there and you get topically coherent chunks:
```typescript
function chunkByHeaders(markdown: string): string[] {
  // Split on ## headers, keeping each section together
  return markdown
    .split(/(?=^## )/m)
    .filter((s) => s.trim().length > 0);
}
```

Chunk Size Tradeoffs
| Chunk Size | Pros | Cons |
|---|---|---|
| **Small** (200-300 chars) | Precise retrieval, less noise | May lose context, more chunks to search |
| **Medium** (500-1000 chars) | Good balance of context and precision; default choice for most use cases | Some noise and some lost context remain |
| **Large** (1500-2000 chars) | Full context preserved | May include irrelevant text, fewer fit in prompt |
Start with 500-800 characters for document Q&A. You can always adjust after seeing retrieval quality.
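Header-based splitting and a size target can be combined. The sketch below is one possible approach, not the course's implementation: it splits on `## ` headers first, then breaks any oversized section at paragraph boundaries. The name `chunkBySectionWithCap` and the 800-character default are assumptions.

```typescript
// Hypothetical helper: split by "## " headers, then cap each section's size.
function chunkBySectionWithCap(markdown: string, maxChars = 800): string[] {
  const sections = markdown
    .split(/(?=^## )/m)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  for (const section of sections) {
    if (section.length <= maxChars) {
      chunks.push(section);
      continue;
    }
    // Oversized section: pack whole paragraphs into chunks up to maxChars.
    // A single paragraph longer than maxChars is kept whole, not cut.
    let current = "";
    for (const paragraph of section.split(/\n\s*\n/)) {
      const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
      if (candidate.length > maxChars && current) {
        chunks.push(current);
        current = paragraph;
      } else {
        current = candidate;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```

Note that the later pieces of a split section no longer begin with their heading, which is one reason to store the heading separately as metadata.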
Building a Simple Document Splitter
Here's the pattern you'll implement:
```typescript
interface Chunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    heading?: string;
  };
}

function splitDocument(content: string, source: string): Chunk[] {
  // Split before each "## " header and drop whitespace-only sections
  const sections = content
    .split(/(?=^## )/m)
    .filter((s) => s.trim().length > 0);
  return sections.map((section, i) => ({
    id: `${source}-chunk-${i}`,
    content: section.trim(),
    metadata: {
      source,
      chunkIndex: i,
      // Undefined for any text that precedes the first "## " header
      heading: section.match(/^## (.+)/)?.[1],
    },
  }));
}
```

Each chunk carries metadata: the source file, its position, and the section heading. This metadata powers citations later in Module 6.
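As a rough illustration of how that metadata could feed a citation, the sketch below builds a display label from a chunk. The `formatCitation` function and its label format are hypothetical; Module 6 may take a different approach.

```typescript
// Same shape as the Chunk interface above, repeated so this sketch is self-contained.
interface CitedChunk {
  id: string;
  content: string;
  metadata: { source: string; chunkIndex: number; heading?: string };
}

// Hypothetical formatter: build a human-readable citation label.
function formatCitation(chunk: CitedChunk): string {
  const { source, chunkIndex, heading } = chunk.metadata;
  // Prefer the section heading; fall back to the chunk position.
  return heading
    ? `${source} > ${heading}`
    : `${source} (chunk ${chunkIndex})`;
}
```

A chunk from `handbook.md` with the heading `PTO Policy` would be labeled `handbook.md > PTO Policy`, while a chunk without a heading falls back to its index.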
This is chapter 2 of RAG in 60 Minutes.