
Upload & Chunk

Splitting Documents for Search

Document Formats

Real companies store knowledge in many formats. Your RAG pipeline needs to handle all of them:

| Format | Example | Challenge |
| --- | --- | --- |
| Markdown | Company handbook, policies | Headers create natural section boundaries |
| JSON | FAQ databases, product catalogs | Structured data needs flattening into text |
| Plain text | Meeting notes, emails | No structure, so boundaries must be inferred |
| PDF | Contracts, reports | Needs text extraction (parsing libraries or OCR) |

In this course, your pre-seeded data includes Markdown policies, JSON FAQs, and JSON product docs. The patterns you learn here apply to any format.
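
As a rough illustration of the "flattening" step for JSON sources, the sketch below turns FAQ records into plain-text passages. The `FaqEntry` shape (`question`, `answer`, optional `category`) and both function names are assumptions made for this example; the course's actual JSON schema may look different.

```typescript
// Hypothetical FAQ record shape; the real JSON schema may differ.
interface FaqEntry {
  question: string;
  answer: string;
  category?: string;
}

// Turn one structured record into a self-describing text passage.
function flattenFaq(entry: FaqEntry): string {
  const lines = [`Q: ${entry.question.trim()}`, `A: ${entry.answer.trim()}`];
  if (entry.category) {
    // Put the category first so it reads like a small heading.
    lines.unshift(`Category: ${entry.category.trim()}`);
  }
  return lines.join("\n");
}

// Parse a JSON array of FAQ records and flatten each one.
function flattenFaqJson(json: string): string[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of FAQ entries");
  }
  return (parsed as FaqEntry[]).map(flattenFaq);
}
```

Keeping the question and answer together in one passage means a single FAQ entry usually becomes a single chunk, with no further splitting needed.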

Why Chunking Matters

LLMs have a limited context window, so you can't paste your entire handbook into every prompt. Even if you could, models tend to answer less accurately when the relevant passage is buried in irrelevant text. Chunking solves this by splitting documents into small, focused pieces so you retrieve only what's relevant.

┌──────────────────────────────────┐
│     Full Company Handbook        │
│     (50 pages, 25,000 words)     │
└──────────────┬───────────────────┘
               │ chunk
               ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │ │ Chunk N │
│ PTO     │ │ Remote  │ │ Expense │ │ ...     │
│ Policy  │ │ Work    │ │ Policy  │ │         │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

When someone asks "What's the PTO policy?", you retrieve Chunk 1 — not the entire 50-page handbook.

Overlap and Boundaries

The Overlap Problem

If you split at exactly 500 characters, you might cut a sentence in half:

Chunk 1: "...employees receive 15 days of paid time off per year. Unused days"
Chunk 2: "carry over up to a maximum of 5 days into the following year..."

The answer to "Do PTO days carry over?" is split across two chunks. Overlap fixes this by repeating some text at chunk boundaries:

Chunk 1: "...employees receive 15 days of paid time off per year. Unused days carry over up to"
Chunk 2: "Unused days carry over up to a maximum of 5 days into the following year..."

A typical overlap is 10-20% of chunk size, so a 500-character chunk would repeat roughly 50-100 characters from its neighbor.
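
The sliding-window idea can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name `chunkWithOverlap` and the defaults (500 characters with a 75-character overlap, i.e. 15%) are assumptions chosen to match the numbers above.

```typescript
// Fixed-size chunking with overlap. All sizes are in characters.
function chunkWithOverlap(text: string, chunkSize = 500, overlap = 75): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be >= 0 and smaller than chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk begins with the last `overlap` characters of the previous one. Note that this still cuts at arbitrary character positions; the overlap only guarantees that a span shorter than `overlap` near a boundary appears whole in at least one chunk.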

Section-Aware Chunking

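A better approach than fixed-size splitting is to split on natural boundaries. Markdown headers (##, ###) mark topic changes, so splitting there gives you topically coherent chunks. The function below splits on level-2 headers (##) only, so any ### subsections stay inside their parent section:

function chunkByHeaders(markdown: string): string[] {
  // Split before each level-2 (##) header so every section keeps its heading.
  // The lookahead leaves the header in place; the m flag lets ^ match at line starts.
  return markdown
    .split(/(?=^## )/m)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}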
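
If you also want ### subsections to become their own chunks, one possible variant builds the split pattern from a maximum header level. The name `chunkByHeaderLevels` and the `maxLevel` parameter are hypothetical, introduced only for this sketch.

```typescript
// Split before every header from level 2 up to maxLevel (e.g. ## and ###).
function chunkByHeaderLevels(markdown: string, maxLevel = 3): string[] {
  if (maxLevel < 2 || maxLevel > 6) {
    throw new Error("maxLevel must be between 2 and 6");
  }
  // Matches line starts such as "## " or "### " without consuming them.
  const boundary = new RegExp(`(?=^#{2,${maxLevel}} )`, "m");
  return markdown
    .split(boundary)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}
```

Like the simpler version, this pattern does not know about fenced code blocks, so a line starting with `## ` inside a code sample would also be treated as a boundary.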

Chunk Size Tradeoffs

| Chunk Size | Pros | Cons |
| --- | --- | --- |
| **Small** (200-300 chars) | Precise retrieval, less noise | May lose context, more chunks to search |
| **Medium** (500-1000 chars) | Good balance of context and precision | Default choice for most use cases |
| **Large** (1500-2000 chars) | Full context preserved | May include irrelevant text, fewer fit in prompt |

Start with 500-800 characters for document Q&A. You can always adjust after seeing retrieval quality.
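
One way to apply a size target without cutting sentences is to pack whole paragraphs into chunks up to a limit. The sketch below assumes paragraphs are separated by blank lines; `packParagraphs` and its `maxChars` default of 800 are illustrative choices, not part of the course code.

```typescript
// Greedily pack paragraphs into chunks of at most maxChars characters.
function packParagraphs(text: string, maxChars = 800): string[] {
  const paragraphs = text
    .split(/\n\s*\n/) // blank lines separate paragraphs
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars || current === "") {
      // Either it fits, or it is a single oversized paragraph kept whole.
      current = candidate;
    } else {
      chunks.push(current);
      current = paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A paragraph longer than `maxChars` is kept intact rather than cut, so the limit is a target, not a hard cap. Changing `maxChars` is then the single knob to adjust after you see retrieval quality.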

Building a Simple Document Splitter

Here's the pattern you'll implement:

interface Chunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    heading?: string;
  };
}

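function splitDocument(content: string, source: string): Chunk[] {
  return content
    .split(/(?=^## )/m) // split before each level-2 header
    .map((section) => section.trim())
    .filter((section) => section.length > 0) // drop empty sections
    .map((section, i) => ({
      id: `${source}-chunk-${i}`,
      content: section,
      metadata: {
        source,
        chunkIndex: i,
        // undefined for any text that appears before the first header
        heading: section.match(/^## (.+)/)?.[1]?.trim(),
      },
    }));
}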

Each chunk carries metadata — the source file, its position, and the section heading. This metadata powers citations later in Module 6.
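
To give a feel for how that metadata might be used, here is a small sketch that builds a citation label from a chunk. The `CitableChunk` type, the `formatCitation` name, and the label format are assumptions for illustration; Module 6 may format citations differently.

```typescript
// Minimal structural type covering only the fields this sketch needs.
interface CitableChunk {
  content: string;
  metadata: { source: string; chunkIndex: number; heading?: string };
}

// Build a short human-readable label (hypothetical format).
// Falls back to the chunk position when no heading was captured.
function formatCitation(chunk: CitableChunk): string {
  const { source, chunkIndex, heading } = chunk.metadata;
  return heading ? `${source}, "${heading}"` : `${source}, chunk ${chunkIndex}`;
}
```

Because the type is structural, any object with the same `metadata` fields, including the chunks produced by the splitter above, can be passed in.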

This is chapter 2 of RAG in 60 Minutes.
