
Chunk & Organize

Smart Splitting and Metadata

Why You Can't Search Whole Documents

You have 46 ingested documents. Some are a single sentence (a bookmark summary). Others are several paragraphs (a meeting transcript). If you search over whole documents, short items get unfairly boosted — their entire content matches because there's nothing else to dilute the signal.

Chunking solves this by splitting documents into pieces of roughly equal size, so search compares apples to apples.

The Chunking Trade-Off

Chunk Size	Pros	Cons
Small (100-200 tokens)	Precise search results	Loses surrounding context
Medium (200-500 tokens)	Good balance of precision and context	May split mid-thought
Large (500-1000 tokens)	Full context preserved	Search results too broad

For a personal knowledge base, 200-400 tokens is the sweet spot. Your notes and bookmarks are already short. Meeting transcripts and articles need splitting.

Fixed-Size vs. Smart Splitting

Fixed-size chunking splits text every N characters, regardless of content:

"The key insight from today's standup is that the API refactor|
 is blocked on the database migration. We agreed to prioritize..."

The pipe shows where a fixed-size chunk might cut — right in the middle of a thought. This produces chunks that start and end mid-sentence.
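To make the problem concrete, here is a minimal sketch of fixed-size chunking. The function name `fixedSizeChunks` and the 60-character size are illustrative assumptions, not part of the course code.

```typescript
// Fixed-size chunking: cut every `size` characters, ignoring content.
function fixedSizeChunks(text: string, size: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

const standup =
  "The key insight from today's standup is that the API refactor " +
  "is blocked on the database migration. We agreed to prioritize it.";

// With a 60-character limit the first cut lands inside the word "refactor".
const pieces = fixedSizeChunks(standup, 60);
console.log(pieces);
```

Nothing is lost (joining the pieces restores the original text), but no individual piece is guaranteed to be readable on its own.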

Smart splitting respects natural boundaries:

Paragraph breaks (double newline)

Section headers (markdown ##)

Sentence endings (period + space)

Smart splitting produces chunks that are complete thoughts, even if they vary slightly in size.
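One way to sketch this: split on paragraph breaks first, fall back to sentence boundaries only for paragraphs that are too long, then pack the pieces greedily up to a token budget. The names `estimateTokens`, `splitSentences`, and `smartSplit` are hypothetical, and the 4-characters-per-token estimate is a rough assumption that a real tokenizer would replace.

```typescript
// Rough token estimate (assumption: about 4 characters per token in English).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Naive sentence splitter: text up to ., !, or ?. It will mis-split
// abbreviations such as "e.g."; good enough for a sketch.
function splitSentences(paragraph: string): string[] {
  const matches = paragraph.match(/[^.!?]+[.!?]+|[^.!?]+$/g);
  return (matches ?? [paragraph]).map((s) => s.trim()).filter((s) => s.length > 0);
}

// Split at paragraph breaks, break oversized paragraphs into sentences,
// then pack the units into chunks of at most `maxTokens`.
// A single sentence longer than `maxTokens` is kept whole as its own chunk.
function smartSplit(text: string, maxTokens: number): string[] {
  const units: string[] = [];
  for (const para of text.split(/\n\s*\n/)) {
    const trimmed = para.trim();
    if (trimmed.length === 0) continue;
    if (estimateTokens(trimmed) <= maxTokens) {
      units.push(trimmed);
    } else {
      for (const sentence of splitSentences(trimmed)) units.push(sentence);
    }
  }

  const chunks: string[] = [];
  let current = "";
  for (const unit of units) {
    const candidate = current.length === 0 ? unit : current + " " + unit;
    if (current.length > 0 && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // budget exceeded: close the chunk at a boundary
      current = unit;
    } else {
      current = candidate;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Because chunks are only ever closed at a paragraph or sentence boundary, their sizes vary, but each one reads as a complete thought.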

Overlap: Preventing Information Loss

When you split text into chunks, information at the boundary gets split too. Overlap fixes this by including the last N tokens of the previous chunk at the start of the next one:

Chunk 1: "...the API refactor is blocked on the database migration."
Chunk 2: "...blocked on the database migration. We agreed to prioritize the migration this sprint."

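With an overlap of 50-100 tokens, Chunk 2 carries both the blocker and the decision, so a search for "database migration priority" finds the complete thought in a single chunk, while Chunk 1 still matches queries about the migration itself.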
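The overlap step can be sketched as a pass over already-split chunks. This version counts whitespace-separated words as a stand-in for tokens, which is an assumption made to keep the example dependency-free; `addOverlap` is a hypothetical helper name.

```typescript
// Prepend the last `overlapWords` words of the previous chunk to each chunk.
// Words are used here as a simple approximation of tokens.
function addOverlap(chunks: string[], overlapWords: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0 || overlapWords <= 0) return chunk; // first chunk has no predecessor
    const prev = chunks[i - 1] ?? "";
    const prevWords = prev.split(/\s+/).filter((w) => w.length > 0);
    const tail = prevWords.slice(-overlapWords).join(" ");
    return tail.length > 0 ? tail + " " + chunk : chunk;
  });
}

const rawChunks = [
  "The API refactor is blocked on the database migration.",
  "We agreed to prioritize the migration this sprint.",
];

const overlapped = addOverlap(rawChunks, 5);
console.log(overlapped[1]);
// prints "blocked on the database migration. We agreed to prioritize the migration this sprint."
```

The overlap is taken from the original chunks rather than the already-overlapped ones, so it does not compound from one chunk to the next.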

Metadata Enrichment

Raw chunks are just text. Enriched chunks carry context:

interface EnrichedChunk {
  id: string;
  content: string;
  documentId: string;        // Parent document
  chunkIndex: number;        // Position in document
  source: string;            // notes, bookmarks, meetings...
  tags: string[];            // Inherited from parent + auto-detected
  createdAt: string;         // From parent document
  tokenCount: number;        // For cost estimation
}

Metadata enables filtered search. "Find my notes about TypeScript from last month" uses three filters: source=notes, tags=typescript, date=last 30 days — before semantic search even runs.
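A minimal sketch of that pre-filter, assuming chunks live in an in-memory array and `createdAt` holds an ISO 8601 timestamp. `ChunkFilter` and `filterChunks` are hypothetical names; a real store would likely push these filters into its query layer.

```typescript
interface EnrichedChunk {
  id: string;
  content: string;
  documentId: string;
  chunkIndex: number;
  source: string;
  tags: string[];
  createdAt: string; // assumed to be an ISO 8601 timestamp
  tokenCount: number;
}

// Hypothetical filter shape: every field is optional.
interface ChunkFilter {
  source?: string;
  tag?: string;
  since?: Date;
}

// Keep only chunks that satisfy every filter that was provided.
function filterChunks(chunks: EnrichedChunk[], filter: ChunkFilter): EnrichedChunk[] {
  return chunks.filter((c) => {
    if (filter.source !== undefined && c.source !== filter.source) return false;
    if (filter.tag !== undefined && c.tags.indexOf(filter.tag) === -1) return false;
    if (filter.since !== undefined && new Date(c.createdAt).getTime() < filter.since.getTime()) {
      return false;
    }
    return true;
  });
}

// Small helper to build sample data for the example.
function makeChunk(id: string, source: string, tags: string[], createdAt: string): EnrichedChunk {
  return { id, content: "", documentId: "doc-" + id, chunkIndex: 0, source, tags, createdAt, tokenCount: 0 };
}

const sampleChunks: EnrichedChunk[] = [
  makeChunk("c1", "notes", ["typescript"], "2024-05-10T09:00:00Z"),
  makeChunk("c2", "notes", ["typescript"], "2024-03-01T09:00:00Z"), // too old
  makeChunk("c3", "bookmarks", ["typescript"], "2024-05-12T09:00:00Z"), // wrong source
  makeChunk("c4", "notes", ["react"], "2024-05-11T09:00:00Z"), // wrong tag
];

const candidates = filterChunks(sampleChunks, {
  source: "notes",
  tag: "typescript",
  since: new Date("2024-05-01T00:00:00Z"),
});
console.log(candidates.map((c) => c.id)); // prints [ 'c1' ]
```

Only the surviving candidates would then be handed to semantic search, which keeps the expensive step small.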

Auto-Tagging

Not every document comes with tags. Auto-tagging adds them based on content patterns:

Keyword detection: if content mentions "React," "Vue," or "Angular," auto-tag as frontend

Entity extraction: detect project names, people, dates

Source-based defaults: meeting notes get meetings tag, bookmarks get reference

Auto-tags supplement manual tags — they don't replace them. The combination gives you rich filtering without requiring perfect tagging discipline.
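The keyword and source-based rules above can be sketched as follows. The rule tables and the `autoTag` function are illustrative assumptions; entity extraction is left out because it usually needs an NLP library or a model call.

```typescript
// Hypothetical rule tables; extend these with your own vocabulary.
const KEYWORD_RULES: { tag: string; keywords: string[] }[] = [
  { tag: "frontend", keywords: ["react", "vue", "angular"] },
];

const SOURCE_DEFAULTS: Record<string, string> = {
  meetings: "meetings",
  bookmarks: "reference",
};

// Return manual tags first, then any auto-detected tags not already present.
function autoTag(content: string, source: string, manualTags: string[]): string[] {
  const tags = manualTags.slice();
  const add = (tag: string): void => {
    if (tags.indexOf(tag) === -1) tags.push(tag);
  };

  // Whole-word, case-insensitive keyword matching.
  const words = content.toLowerCase().split(/[^a-z0-9]+/);
  for (const rule of KEYWORD_RULES) {
    if (rule.keywords.some((k) => words.indexOf(k) !== -1)) add(rule.tag);
  }

  const sourceDefault = SOURCE_DEFAULTS[source];
  if (sourceDefault !== undefined) add(sourceDefault);

  return tags;
}

console.log(autoTag("We are migrating the dashboard to React.", "meetings", ["q3"]));
// prints [ 'q3', 'frontend', 'meetings' ]
```

Manual tags are copied in first and never removed, which matches the idea that auto-tags only supplement them.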

When NOT to Chunk

Short documents (under 200 tokens) shouldn't be chunked at all. A bookmark summary or a one-paragraph note is already chunk-sized. Splitting it would destroy its meaning.

Rule of thumb: if the document is shorter than your target chunk size, keep it whole.
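As a sketch, the rule of thumb is a one-line guard in front of whatever splitter you use. `chunkOrKeep` is a hypothetical name, the splitter is passed in as a parameter, and the 4-characters-per-token estimate is again a rough assumption.

```typescript
// Rough token estimate (assumption: about 4 characters per token in English).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep documents at or under the target size whole; split only longer ones.
function chunkOrKeep(
  text: string,
  targetTokens: number,
  split: (text: string) => string[]
): string[] {
  return estimateTokens(text) <= targetTokens ? [text] : split(text);
}

// Placeholder splitter for the example: cuts the text in half.
const halve = (text: string): string[] => {
  const mid = Math.floor(text.length / 2);
  return [text.slice(0, mid), text.slice(mid)];
};

const bookmark = "Short bookmark summary about chunking strategies.";
console.log(chunkOrKeep(bookmark, 200, halve).length); // prints 1
```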

Key Takeaways

Chunking normalizes document length so search compares similarly sized pieces of text.

Smart splitting at paragraph and sentence boundaries beats fixed-size cutting.

Overlap (50-100 tokens) prevents information loss at chunk boundaries.

Metadata on chunks enables filtered search — narrow by source, date, or tags before semantic matching.

This is chapter 2 of AI-Powered Second Brain.
