Chunk & Organize
Smart Splitting and Metadata
Why You Can't Search Whole Documents
You have 46 ingested documents. Some are a single sentence (a bookmark summary). Others are several paragraphs (a meeting transcript). If you search over whole documents, short items get unfairly boosted — their entire content matches because there's nothing else to dilute the signal.
Chunking solves this by splitting documents into pieces of roughly equal size, so search compares apples to apples.
The Chunking Trade-Off
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise search results | Loses surrounding context |
| Medium (200-500 tokens) | Good balance of precision and context | May split mid-thought |
| Large (500-1000 tokens) | Full context preserved | Search results too broad |
For a personal knowledge base, 200-400 tokens is the sweet spot. Your notes and bookmarks are already short. Meeting transcripts and articles need splitting.
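To decide which size band a document falls into, you need at least a rough token count. The sketch below assumes about 4 characters per token, a common rule of thumb for English text and not a real tokenizer. `estimateTokens` and `sizeBand` are hypothetical helper names, and the band boundaries simply mirror the table above.

```typescript
// Assumption: ~4 characters per token. Swap in a real tokenizer for accuracy.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.trim().length / CHARS_PER_TOKEN);
}

type SizeBand = "small" | "medium" | "large" | "oversized";

// Bands follow the trade-off table; each upper bound is inclusive.
function sizeBand(tokenCount: number): SizeBand {
  if (tokenCount <= 200) return "small";
  if (tokenCount <= 500) return "medium";
  if (tokenCount <= 1000) return "large";
  return "oversized";
}
```

A document that lands in "large" or "oversized" is a candidate for splitting; one in "small" can usually stay whole.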
Fixed-Size vs. Smart Splitting
Fixed-size chunking splits text every N characters, regardless of content:
"The key insight from today's standup is that the API refactor|
is blocked on the database migration. We agreed to prioritize..."
The pipe shows where a fixed-size chunk might cut: right in the middle of a thought. This produces chunks that start and end mid-sentence.
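A minimal sketch of fixed-size chunking, to make the problem concrete. `fixedSizeChunks` is a hypothetical name; the function cuts every `size` characters and pays no attention to words or sentences.

```typescript
// Cut the text every `size` characters, ignoring word and sentence boundaries.
function fixedSizeChunks(text: string, size: number): string[] {
  if (size <= 0 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```

Nothing is lost (joining the chunks gives back the original text), but most chunks begin or end partway through a sentence.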
Smart splitting respects natural boundaries such as sentence endings, paragraph breaks, and Markdown headings (##).
Smart splitting produces chunks that are complete thoughts, even if they vary slightly in size.
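One way to sketch boundary-aware splitting: break the text into sentences, then pack whole sentences into a chunk until the next one would exceed the budget. This is an illustration under stated assumptions, not the course's implementation. The names are hypothetical, tokens are approximated at 4 characters each, and the regex treats every `.`, `!`, or `?` as a sentence end, so abbreviations and decimals will be split too.

```typescript
// Assumption: ~4 characters per token.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Split on blank lines (paragraphs), then on sentence-ending punctuation.
function splitIntoSentences(text: string): string[] {
  const sentences: string[] = [];
  for (const paragraph of text.split(/\n\s*\n/)) {
    const matches = paragraph.match(/[^.!?]+[.!?]*/g) || [];
    for (const match of matches) {
      const sentence = match.replace(/\s+/g, " ").trim();
      if (sentence.length > 0) sentences.push(sentence);
    }
  }
  return sentences;
}

// Greedily pack whole sentences into chunks of at most maxTokens.
// A single sentence longer than maxTokens becomes its own oversized chunk.
function smartChunks(text: string, maxTokens: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitIntoSentences(text)) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && approxTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Chunk sizes vary because the cut always lands on a sentence boundary, which is the trade-off described above.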
Overlap: Preventing Information Loss
When you split text into chunks, information at the boundary gets split too. Overlap fixes this by including the last N tokens of the previous chunk at the start of the next one:
Chunk 1: "...the API refactor is blocked on the database migration."
Chunk 2: "...blocked on the database migration. We agreed to prioritize the migration this sprint."
The overlap (50-100 tokens) ensures that a search for "database migration priority" hits both chunks.
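The overlap mechanism can be sketched as a sliding window. To keep the example short, it counts whitespace-separated words instead of tokens, which is an assumption; `chunksWithOverlap` and its parameters are hypothetical names.

```typescript
// Sliding window over words: each chunk repeats the last `overlapWords`
// words of the previous chunk. Words stand in for tokens here (assumption).
function chunksWithOverlap(
  text: string,
  chunkWords: number,
  overlapWords: number
): string[] {
  if (chunkWords <= 0 || overlapWords < 0 || overlapWords >= chunkWords) {
    throw new RangeError("need chunkWords > overlapWords >= 0");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkWords - overlapWords;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkWords).join(" "));
    if (start + chunkWords >= words.length) break; // this chunk reached the end
  }
  return chunks;
}
```

In practice you would combine this with boundary-aware splitting, so the repeated text is the last sentence or two of the previous chunk and not an arbitrary word window.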
Metadata Enrichment
Raw chunks are just text. Enriched chunks carry context:
interface EnrichedChunk {
  id: string;
  content: string;
  documentId: string;  // Parent document
  chunkIndex: number;  // Position in document
  source: string;      // notes, bookmarks, meetings...
  tags: string[];      // Inherited from parent + auto-detected
  createdAt: string;   // From parent document
  tokenCount: number;  // For cost estimation
}
Metadata enables filtered search. "Find my notes about TypeScript from last month" uses three filters: source=notes, tags=typescript, date=last 30 days, all applied before semantic search even runs.
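A metadata pre-filter like the one described could look like the sketch below. The interface is repeated so the block runs on its own. `ChunkFilter` and `filterChunks` are hypothetical names, and the code assumes `createdAt` holds an ISO 8601 date string.

```typescript
interface EnrichedChunk {
  id: string;
  content: string;
  documentId: string;
  chunkIndex: number;
  source: string;
  tags: string[];
  createdAt: string; // assumed to be an ISO 8601 date string
  tokenCount: number;
}

// Every field is optional; an omitted field means "don't filter on this".
interface ChunkFilter {
  source?: string;
  tag?: string;
  createdAfter?: Date;
}

// Narrow the candidate set with cheap metadata checks before semantic search.
function filterChunks(chunks: EnrichedChunk[], filter: ChunkFilter): EnrichedChunk[] {
  return chunks.filter((chunk) => {
    if (filter.source !== undefined && chunk.source !== filter.source) return false;
    if (filter.tag !== undefined && chunk.tags.indexOf(filter.tag) === -1) return false;
    if (
      filter.createdAfter !== undefined &&
      new Date(chunk.createdAt).getTime() < filter.createdAfter.getTime()
    ) {
      return false;
    }
    return true;
  });
}
```

For the "notes about TypeScript from last month" query, you would pass `{ source: "notes", tag: "typescript", createdAfter: <date 30 days ago> }` and run semantic search only over what remains.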
Auto-Tagging
Not every document comes with tags. Auto-tagging adds them based on content patterns:
For example, content that matches a front-end pattern gets a frontend tag, meetings get a meetings tag, and bookmarks get reference.
Auto-tags supplement manual tags; they don't replace them. The combination gives you rich filtering without requiring perfect tagging discipline.
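A rule-based auto-tagger along these lines might look like the following sketch. The tag names frontend, meetings, and reference come from the text; the keyword pattern (react, css, html) and the names `CONTENT_RULES`, `SOURCE_TAGS`, and `autoTag` are assumptions made for illustration.

```typescript
interface TagRule {
  tag: string;
  pattern: RegExp;
}

// Hypothetical keyword rule; only the tag name "frontend" comes from the text.
const CONTENT_RULES: TagRule[] = [
  { tag: "frontend", pattern: /\b(react|css|html)\b/i },
];

// Source-based tags: meetings get "meetings", bookmarks get "reference".
const SOURCE_TAGS: { [source: string]: string } = {
  meetings: "meetings",
  bookmarks: "reference",
};

// Manual tags are kept as-is; auto-tags are appended without duplicates.
function autoTag(content: string, source: string, manualTags: string[]): string[] {
  const tags = manualTags.slice();
  const add = (tag: string): void => {
    if (tags.indexOf(tag) === -1) tags.push(tag);
  };
  for (const rule of CONTENT_RULES) {
    if (rule.pattern.test(content)) add(rule.tag);
  }
  if (Object.prototype.hasOwnProperty.call(SOURCE_TAGS, source)) {
    add(SOURCE_TAGS[source]);
  }
  return tags;
}
```

Because manual tags are copied first and never removed, the auto-tagger can only add to what the user already chose.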
When NOT to Chunk
Short documents (under 200 tokens) shouldn't be chunked at all. A bookmark summary or a one-paragraph note is already chunk-sized. Splitting it would destroy its meaning.
Rule of thumb: if the document is shorter than your target chunk size, keep it whole.
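The rule of thumb translates into a small guard in front of whatever splitter you use. In this sketch the 200-token default comes from the text, while the 4-characters-per-token estimate and the names `roughTokenCount` and `chunkOrKeep` are assumptions.

```typescript
const TARGET_CHUNK_TOKENS = 200;

// Assumption: ~4 characters per token.
const roughTokenCount = (text: string): number => Math.ceil(text.length / 4);

// Keep short documents whole; hand longer ones to the supplied splitter.
function chunkOrKeep(
  text: string,
  split: (text: string) => string[],
  targetTokens: number = TARGET_CHUNK_TOKENS
): string[] {
  if (roughTokenCount(text) < targetTokens) {
    return [text];
  }
  return split(text);
}
```

Passing the splitter as a parameter keeps the guard independent of the chunking strategy, so the same check works for fixed-size and boundary-aware splitting.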
Key Takeaways
This is chapter 2 of AI-Powered Second Brain.