
Upload & Chunk

Splitting Documents for Search

Document Formats

Real companies store knowledge in many formats. Your RAG pipeline needs to handle all of them:

Format	Example	What to watch for
Markdown	Company handbook, policies	Headers create natural section boundaries
JSON	FAQ databases, product catalogs	Structured data needs flattening into text
Plain text	Meeting notes, emails	No structure — must infer boundaries
PDF	Contracts, reports	Needs text extraction (parsing libraries or OCR)

In this course, your pre-seeded data includes Markdown policies, JSON FAQs, and JSON product docs. The patterns you learn here apply to any format.
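As an illustration of the "flattening" step for JSON sources, here is a minimal sketch. The `FaqEntry` shape, the `flattenFaq` and `flattenFaqJson` helpers, and the sample records are all assumptions made for this example; the course's actual FAQ schema may differ.

```typescript
// Hypothetical shape of one FAQ record; the real schema may differ.
interface FaqEntry {
  question: string;
  answer: string;
  category?: string;
}

// Turn one structured record into a single searchable text passage.
function flattenFaq(entry: FaqEntry): string {
  const lines = [`Q: ${entry.question}`, `A: ${entry.answer}`];
  if (entry.category) {
    // Put the category first so it reads like a heading for the passage.
    lines.unshift(`Category: ${entry.category}`);
  }
  return lines.join("\n");
}

// Parse a JSON array of FAQ records and flatten each into its own text chunk.
function flattenFaqJson(json: string): string[] {
  const entries = JSON.parse(json) as FaqEntry[];
  return entries.map(flattenFaq);
}

// Made-up sample data, for illustration only.
const sampleJson = JSON.stringify([
  { question: "Do PTO days carry over?", answer: "Up to 5 days carry over.", category: "PTO" },
  { question: "Where is the handbook?", answer: "(placeholder answer)" },
]);
const faqTexts = flattenFaqJson(sampleJson);
console.log(faqTexts[0]);
```

Each FAQ entry is already small and self-contained, so in this sketch one record becomes one chunk with no further splitting.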

Why Chunking Matters

LLMs have a limited context window, so you can't paste your entire handbook into every prompt. Even if you could, answer quality tends to drop when the prompt contains a lot of irrelevant text. Chunking addresses both problems by splitting documents into small, focused pieces so you retrieve only what's relevant.

┌──────────────────────────────────┐
│     Full Company Handbook        │
│     (50 pages, 25,000 words)     │
└──────────────┬───────────────────┘
               │ chunk
               ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │ │ Chunk N │
│ PTO     │ │ Remote  │ │ Expense │ │ ...     │
│ Policy  │ │ Work    │ │ Policy  │ │         │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

When someone asks "What's the PTO policy?", you retrieve Chunk 1 — not the entire 50-page handbook.

Overlap and Boundaries

The Overlap Problem

If you split at exactly 500 characters, you might cut a sentence in half:

Chunk 1: "...employees receive 15 days of paid time off per year. Unused days"
Chunk 2: "carry over up to a maximum of 5 days into the following year..."

The answer to "Do PTO days carry over?" is split across two chunks, so neither chunk answers it alone. Overlap reduces this risk by repeating some text at chunk boundaries, so the second chunk starts before the cut:

Chunk 1: "...employees receive 15 days of paid time off per year. Unused days carry over up to"
Chunk 2: "Unused days carry over up to a maximum of 5 days into the following year..."

A typical overlap is 10-20% of chunk size, for example 50-100 characters for a 500-character chunk.
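A minimal sketch of fixed-size chunking with overlap might look like the following. The function name `chunkWithOverlap` and its defaults (500 characters with a 75-character overlap, which is 15%) are assumptions chosen for illustration, not values prescribed by the course.

```typescript
// Split text into fixed-size character windows that share `overlap` characters
// with the previous window.
function chunkWithOverlap(text: string, chunkSize = 500, overlap = 75): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be >= 0 and smaller than chunkSize");
  }

  const step = chunkSize - overlap; // how far each window advances
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window reaches the end, so the last chunk is never pure overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Tiny example so the windows are easy to see: size 4, overlap 1.
console.log(chunkWithOverlap("abcdefghij", 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

Note that this still cuts at arbitrary character positions. Overlap only helps when the text straddling a boundary is shorter than the overlap itself.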

Section-Aware Chunking

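A better approach than fixed-size splitting is to split on natural boundaries. Markdown headers (##, ###) mark topic changes, so splitting there yields topically coherent chunks. The function below splits on ## headers only, which keeps each ### subsection with its parent section: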

function chunkByHeaders(markdown: string): string[] {
  // Split on ## headers, keeping each section together
  return markdown
    .split(/(?=^## )/m)
    .filter((s) => s.trim().length > 0);
}
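The function above splits only on ## headers. If your documents have long sections, one possible variant also splits on ### headers. The sketch below is an assumption about how that could look; the name `chunkByHeaderLevels` and the sample handbook text are made up for illustration.

```typescript
// Variant: start a new chunk before every level-2 or level-3 header.
function chunkByHeaderLevels(markdown: string): string[] {
  return markdown
    .split(/(?=^#{2,3} )/m) // lookahead keeps each header with its section
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Made-up sample document, for illustration only.
const handbook = [
  "# Handbook",
  "",
  "## PTO",
  "Employees receive 15 days of paid time off per year.",
  "",
  "### Carryover",
  "Unused days carry over up to a maximum of 5 days.",
  "",
  "## Remote Work",
  "(placeholder text)",
].join("\n");

const sections = chunkByHeaderLevels(handbook);
console.log(sections.length); // 4
```

The tradeoff is that a ### chunk no longer carries its parent ## heading, so "Carryover" loses the context that it belongs to "PTO" unless you store the parent heading in metadata.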

Chunk Size Tradeoffs

Chunk Size	Pros	Cons
Small (200-300 chars)	Precise retrieval, less noise	May lose context, more chunks to search
Medium (500-1000 chars)	Good balance of context and precision; sensible default for most use cases	Less precise than small chunks, less complete than large ones
Large (1500-2000 chars)	Full context preserved	May include irrelevant text, fewer fit in prompt

Start with 500-800 characters for document Q&A. You can always adjust after seeing retrieval quality.
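One way to keep chunks near a target size without cutting sentences is to pack whole paragraphs together up to a limit. The sketch below assumes paragraphs are separated by blank lines; `splitByParagraphs` and the 800-character default are illustrative choices, not part of the course code.

```typescript
// Pack whole paragraphs into chunks of at most `maxChars` characters.
// A single paragraph longer than `maxChars` is kept whole rather than cut.
function splitByParagraphs(section: string, maxChars = 800): string[] {
  const paragraphs = section
    .split(/\n\s*\n/) // blank lines separate paragraphs
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current === "" || candidate.length <= maxChars) {
      // Either this is the first paragraph of a chunk, or it still fits.
      current = candidate;
    } else {
      // Adding this paragraph would exceed the limit: close the chunk.
      chunks.push(current);
      current = para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Tiny limit so the packing is easy to follow.
console.log(splitByParagraphs("aaaa\n\nbbbb\n\ncccc", 10)); // [ 'aaaa\n\nbbbb', 'cccc' ]
```

Changing `maxChars` and re-running retrieval is a cheap way to test the tradeoffs in the table above on your own data.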

Building a Simple Document Splitter

Here's the pattern you'll implement:

interface Chunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    heading?: string;
  };
}

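function splitDocument(content: string, source: string): Chunk[] {
  // Split before each "## " header so every header stays with its own section.
  return content
    .split(/(?=^## )/m)
    .map((section) => section.trim())
    // Drop empty sections (for example, blank lines before the first header).
    .filter((section) => section.length > 0)
    .map((section, i) => ({
      id: `${source}-chunk-${i}`,
      content: section,
      metadata: {
        source,
        chunkIndex: i,
        // Undefined when the section has no "## " header (text before the first header).
        heading: section.match(/^## (.+)/)?.[1],
      },
    }));
}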

Each chunk carries metadata — the source file, its position, and the section heading. This metadata powers citations later in Module 6.
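To make the role of metadata concrete, here is a small sketch of turning a chunk into a human-readable citation. The `formatCitation` helper, the `handbook.md` file name, and the citation format are assumptions for illustration; how Module 6 actually builds citations is not shown here.

```typescript
// Same shape as the Chunk interface above, repeated so this example stands alone.
interface Chunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    heading?: string;
  };
}

// Hypothetical helper: prefer the section heading, fall back to the chunk position.
function formatCitation(chunk: Chunk): string {
  const { source, chunkIndex, heading } = chunk.metadata;
  return heading
    ? `${source}, section "${heading}"`
    : `${source}, chunk ${chunkIndex}`;
}

// Hand-built example chunks; the file name is made up.
const ptoChunk: Chunk = {
  id: "handbook.md-chunk-1",
  content: "## PTO Policy\nEmployees receive 15 days of paid time off per year.",
  metadata: { source: "handbook.md", chunkIndex: 1, heading: "PTO Policy" },
};
const introChunk: Chunk = {
  id: "handbook.md-chunk-0",
  content: "Welcome to the handbook.",
  metadata: { source: "handbook.md", chunkIndex: 0 },
};

console.log(formatCitation(ptoChunk)); // handbook.md, section "PTO Policy"
console.log(formatCitation(introChunk)); // handbook.md, chunk 0
```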

This is chapter 2 of RAG in 60 Minutes.
