Chunk & Organize

Smart Splitting and Metadata

Why You Can't Search Whole Documents

You have 46 ingested documents. Some are a single sentence (a bookmark summary). Others are several paragraphs (a meeting transcript). If you search over whole documents, short items get unfairly boosted: a one-sentence document that mentions the query topic matches almost entirely, while a long transcript covering the same topic has its signal diluted by everything else in it.

Chunking solves this by splitting documents into pieces of roughly equal size, so search compares apples to apples.

The Chunking Trade-Off

Chunk Size               | Pros                                  | Cons
Small (100-200 tokens)   | Precise search results                | Loses surrounding context
Medium (200-500 tokens)  | Good balance of precision and context | May split mid-thought
Large (500-1000 tokens)  | Full context preserved                | Search results too broad

For a personal knowledge base, 200-400 tokens is the sweet spot. Your notes and bookmarks are already short. Meeting transcripts and articles need splitting.
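
The sizes above are measured in tokens, and exact counts depend on the tokenizer your embedding model uses. As a rough sketch, the helper below assumes about 4 characters per token, a common approximation for English text rather than a guarantee. `estimateTokens`, `CHARS_PER_TOKEN`, and `inSweetSpot` are illustrative names, not part of any library.

```typescript
// Assumption: roughly 4 characters per token for English text.
// Swap in your model's real tokenizer when you need exact counts.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.trim().length / CHARS_PER_TOKEN);
}

// True when a piece of text already falls in the 200-400 token range.
function inSweetSpot(text: string, min: number = 200, max: number = 400): boolean {
  const tokens = estimateTokens(text);
  return tokens >= min && tokens <= max;
}
```

An estimate like this is usually good enough to decide whether a document needs splitting; it is not precise enough for billing.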

Fixed-Size vs. Smart Splitting

Fixed-size chunking splits text every N characters, regardless of content:

"The key insight from today's standup is that the API refactor|
 is blocked on the database migration. We agreed to prioritize..."

The pipe shows where a fixed-size chunk might cut — right in the middle of a thought. This produces chunks that start and end mid-sentence.
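
A minimal sketch of fixed-size chunking makes the problem easy to reproduce. The function name `fixedSizeChunks` and the chunk size of 61 characters are chosen only for illustration, so that the first cut lands where the pipe does in the example above.

```typescript
// Fixed-size chunking: cut every `size` characters, ignoring content.
function fixedSizeChunks(text: string, size: number): string[] {
  if (size <= 0 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

const standupNote =
  "The key insight from today's standup is that the API refactor " +
  "is blocked on the database migration. We agreed to prioritize...";

// fixedPieces[0] ends with "...the API refactor", in the middle of the sentence.
const fixedPieces = fixedSizeChunks(standupNote, 61);
```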

Smart splitting respects natural boundaries:

  • Paragraph breaks (double newline)
  • Section headers (markdown ##)
  • Sentence endings (period + space)

    Smart splitting produces chunks that are complete thoughts, even if they vary slightly in size.
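
One way to sketch boundary-aware splitting: prefer paragraph breaks, fall back to sentence endings for paragraphs that are too long, then pack whole units into chunks. This is a simplified illustration, not a definitive implementation. It measures size in characters rather than tokens, skips section-header handling, and the names `splitSentences` and `smartSplit` are hypothetical.

```typescript
// Split at sentence endings: ".", "!" or "?" followed by whitespace.
// The capturing group keeps the punctuation so it can be re-attached.
function splitSentences(paragraph: string): string[] {
  const parts = paragraph.split(/([.!?])\s+/);
  const sentences: string[] = [];
  for (let i = 0; i < parts.length; i += 2) {
    const sentence = `${parts[i]}${parts[i + 1] ?? ""}`.trim();
    if (sentence.length > 0) sentences.push(sentence);
  }
  return sentences;
}

function smartSplit(text: string, maxChars: number): string[] {
  // 1. Paragraph breaks (double newline) are the preferred boundary.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  // 2. Paragraphs that are too long fall back to sentence boundaries.
  const units: string[] = [];
  for (const paragraph of paragraphs) {
    if (paragraph.length <= maxChars) {
      units.push(paragraph);
    } else {
      units.push(...splitSentences(paragraph));
    }
  }

  // 3. Greedily pack whole units (joined by a space) up to maxChars.
  //    A single sentence longer than maxChars becomes its own oversized chunk.
  const chunks: string[] = [];
  let current = "";
  for (const unit of units) {
    const candidate = current ? `${current} ${unit}` : unit;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = unit;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Every chunk this produces starts and ends on a paragraph or sentence boundary, which is the property fixed-size cutting lacks.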

    Overlap: Preventing Information Loss

    When you split text into chunks, information at the boundary gets split too. Overlap fixes this by including the last N tokens of the previous chunk at the start of the next one:

    Chunk 1: "...the API refactor is blocked on the database migration."
    Chunk 2: "...blocked on the database migration. We agreed to prioritize the migration this sprint."

    With an overlap of 50-100 tokens, a search for "database migration priority" can match Chunk 2, which now holds both the blocker and the decision, instead of finding each half in a separate chunk.
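
A minimal sketch of overlap, assuming whitespace-separated words as a stand-in for tokens (a real pipeline would count tokens with the model's tokenizer). The name `addOverlap` is illustrative.

```typescript
// Prepend the last `overlapWords` words of the previous chunk to each chunk.
// Assumption: words approximate tokens closely enough for illustration.
function addOverlap(chunks: string[], overlapWords: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0 || overlapWords <= 0) return chunk;
    // Always read from the original previous chunk so overlaps do not compound.
    const previous = chunks[i - 1] ?? "";
    const words = previous.split(/\s+/).filter((w) => w.length > 0);
    const tail = words.slice(-overlapWords).join(" ");
    return `${tail} ${chunk}`;
  });
}

const rawChunks = [
  "The API refactor is blocked on the database migration.",
  "We agreed to prioritize the migration this sprint.",
];

// overlapped[1] now starts with "blocked on the database migration."
const overlapped = addOverlap(rawChunks, 5);
```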

    Metadata Enrichment

    Raw chunks are just text. Enriched chunks carry context:

    interface EnrichedChunk {
      id: string;
      content: string;
      documentId: string;        // Parent document
      chunkIndex: number;        // Position in document
      source: string;            // notes, bookmarks, meetings...
      tags: string[];            // Inherited from parent + auto-detected
      createdAt: string;         // From parent document
      tokenCount: number;        // For cost estimation
    }

    Metadata enables filtered search. "Find my notes about TypeScript from last month" uses three filters: source=notes, tags=typescript, date=last 30 days — before semantic search even runs.
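
The filtering step could look something like the sketch below. It assumes `createdAt` holds an ISO 8601 timestamp that `Date` can parse. `ChunkFilter`, `filterChunks`, and `daysBefore` are hypothetical names introduced for illustration.

```typescript
interface EnrichedChunk {
  id: string;
  content: string;
  documentId: string;
  chunkIndex: number;
  source: string;
  tags: string[];
  createdAt: string; // assumed to be an ISO 8601 timestamp
  tokenCount: number;
}

// Every field is optional; an omitted field means "don't filter on this".
interface ChunkFilter {
  source?: string;
  tag?: string;
  createdAfter?: Date;
}

function daysBefore(reference: Date, days: number): Date {
  return new Date(reference.getTime() - days * 24 * 60 * 60 * 1000);
}

// Cheap metadata filtering that runs before any semantic search.
function filterChunks(chunks: EnrichedChunk[], filter: ChunkFilter): EnrichedChunk[] {
  return chunks.filter((chunk) => {
    if (filter.source !== undefined && chunk.source !== filter.source) return false;
    if (filter.tag !== undefined && chunk.tags.indexOf(filter.tag) === -1) return false;
    if (
      filter.createdAfter !== undefined &&
      new Date(chunk.createdAt).getTime() < filter.createdAfter.getTime()
    ) {
      return false;
    }
    return true;
  });
}
```

For "my notes about TypeScript from last month", the call would be `filterChunks(allChunks, { source: "notes", tag: "typescript", createdAfter: daysBefore(new Date(), 30) })`, with `allChunks` being whatever array holds your chunks. Only the survivors need to go through semantic search.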

    Auto-Tagging

    Not every document comes with tags. Auto-tagging adds them based on content patterns:

  • Keyword detection: if content mentions "React," "Vue," or "Angular," auto-tag as frontend
  • Entity extraction: detect project names, people, dates
  • Source-based defaults: meeting notes get the meetings tag, bookmarks get the reference tag

    Auto-tags supplement manual tags; they don't replace them. The combination gives you rich filtering without requiring perfect tagging discipline.
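
Keyword detection and source-based defaults can be sketched in a few lines. Entity extraction is left out here because it usually needs an NLP library or a model call. The rule tables and the `autoTag` function are hypothetical; the source keys `meetings` and `bookmarks` are assumed to match the values stored in `source`.

```typescript
// Hypothetical rule tables; extend them to suit your own content.
const KEYWORD_RULES: { tag: string; keywords: string[] }[] = [
  { tag: "frontend", keywords: ["react", "vue", "angular"] },
];

const SOURCE_DEFAULTS: { source: string; tag: string }[] = [
  { source: "meetings", tag: "meetings" },
  { source: "bookmarks", tag: "reference" },
];

// Returns manual tags first, then any auto-tags not already present.
function autoTag(content: string, source: string, manualTags: string[]): string[] {
  const tags = manualTags.slice();
  const add = (tag: string): void => {
    if (tags.indexOf(tag) === -1) tags.push(tag);
  };

  // Match whole words so "review" does not trigger the "vue" keyword.
  const words = content.toLowerCase().split(/[^a-z0-9]+/);
  for (const rule of KEYWORD_RULES) {
    if (rule.keywords.some((keyword) => words.indexOf(keyword) !== -1)) {
      add(rule.tag);
    }
  }

  for (const rule of SOURCE_DEFAULTS) {
    if (rule.source === source) add(rule.tag);
  }
  return tags;
}
```

Because manual tags are copied first and auto-tags are only appended, nothing the user wrote by hand is ever overwritten.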

    When NOT to Chunk

    Short documents (under 200 tokens) shouldn't be chunked at all. A bookmark summary or a one-paragraph note is already chunk-sized. Splitting it would destroy its meaning.

    Rule of thumb: if the document is shorter than your target chunk size, keep it whole.
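
As a sketch, the rule of thumb is a single guard in front of whatever splitter you use. `SourceDocument` and `chunkOrKeep` are illustrative names, and the `split` parameter stands for any splitting function you already have.

```typescript
interface SourceDocument {
  id: string;
  content: string;
  tokenCount: number;
}

// Documents shorter than the target chunk size are kept whole.
function chunkOrKeep(
  doc: SourceDocument,
  targetTokens: number,
  split: (content: string) => string[],
): string[] {
  if (doc.tokenCount < targetTokens) {
    return [doc.content];
  }
  return split(doc.content);
}
```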

    Key Takeaways

  • Chunking normalizes document length so search compares similarly sized pieces of text.
  • Smart splitting at paragraph and sentence boundaries beats fixed-size cutting.
  • Overlap (50-100 tokens) prevents information loss at chunk boundaries.
  • Metadata on chunks enables filtered search — narrow by source, date, or tags before semantic matching.

    This is chapter 2 of AI-Powered Second Brain.
