Collect Everything

Ingesting Notes, Bookmarks & Docs

The Collection Problem

You have knowledge scattered everywhere. Notes in one app, bookmarks in another, articles you half-read in a third, meeting notes in documents, project updates in chat threads. Each source has a different format, different metadata, different structure.

A second brain starts by solving this: get everything into one place, in one format, with the right metadata attached.

Why Unified Ingestion Matters

Without a unified ingestion layer, you end up with:

Notes you can't find because they live in an app your search doesn't cover

Bookmarks saved and forgotten because they're not connected to your notes

Meeting action items buried in transcripts nobody re-reads

Project context lost when you switch tools

The fix isn't a better app. It's a pipeline that normalizes everything into a common format.

The Universal Document Schema

Every piece of knowledge, regardless of source, has these core properties:

Field	Purpose	Example
`id`	Unique identifier	`note-003`, `bookmark-007`
`content`	The actual text	Note body, article summary, meeting transcript
`source`	Where it came from	`notes`, `bookmarks`, `articles`, `meetings`, `projects`
`title`	Human-readable label	"React Server Components Deep Dive"
`tags`	Topic labels	`["react", "architecture", "frontend"]`
`createdAt`	When it was captured	`2025-03-15`
`metadata`	Source-specific extras	URL for bookmarks, attendees for meetings

The key insight: metadata varies by source, but the core schema is universal. A bookmark has a URL. A meeting note has attendees. But both have content, tags, and a date.
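As a rough illustration, the schema in the table could be written as a TypeScript type. The interface name `BrainDocument` is a hypothetical choice (it avoids clashing with the DOM's global `Document` type), the ISO date string for `createdAt` is an assumption based on the table's example, and the sample values, including the placeholder URL and the `meeting-001` id, are invented for the sketch.

```typescript
// Source labels taken from the schema table.
type Source = "notes" | "bookmarks" | "articles" | "meetings" | "projects";

// Hypothetical name for the universal document type.
interface BrainDocument {
  id: string;
  content: string;
  source: Source;
  title: string;
  tags: string[];
  createdAt: string; // assumed to be an ISO date string, e.g. "2025-03-15"
  metadata: Record<string, unknown>; // source-specific extras
}

// A bookmark carries a URL in its metadata (placeholder URL).
const bookmark: BrainDocument = {
  id: "bookmark-007",
  content: "Summary of an article on React Server Components.",
  source: "bookmarks",
  title: "React Server Components Deep Dive",
  tags: ["react", "architecture", "frontend"],
  createdAt: "2025-03-15",
  metadata: { url: "https://example.com/rsc-deep-dive" },
};

// A meeting note carries attendees instead, but the core fields are identical.
const meeting: BrainDocument = {
  id: "meeting-001",
  content: "Agreed to ship the beta on Friday.",
  source: "meetings",
  title: "Weekly sync",
  tags: ["planning"],
  createdAt: "2025-03-15",
  metadata: { attendees: ["Ana", "Ben"] },
};
```

Typing `metadata` as an open record keeps the core fields strict while leaving room for whatever each source needs to attach.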

Building the Pipeline

A good ingestion pipeline follows three steps:

Read — Load each data source from its native format (JSON, CSV, plaintext)

Transform — Map source-specific fields to the universal schema

Deduplicate — Hash the content to prevent the same item from being stored twice

Source Files → Reader → Transformer → Deduplicator → Unified Documents

Each source gets its own reader function, but they all output the same Document type. This makes the rest of the pipeline (chunking, search, connections) source-agnostic.
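A minimal sketch of the read and transform steps for two sources might look like the following. The raw shapes (`RawNote`, `RawBookmark`), their field names, and the `BrainDocument` type name are assumptions made for illustration; real exports will differ per app. Deduplication is left out here to keep the focus on readers that share one output type.

```typescript
type Source = "notes" | "bookmarks" | "articles" | "meetings" | "projects";

// Hypothetical name for the universal document type.
interface BrainDocument {
  id: string;
  content: string;
  source: Source;
  title: string;
  tags: string[];
  createdAt: string;
  metadata: Record<string, unknown>;
}

// Hypothetical raw export shapes, one per source.
interface RawNote { id: string; heading: string; body: string; labels: string[]; created: string; }
interface RawBookmark { id: string; name: string; url: string; summary: string; topics: string[]; savedOn: string; }

// Reader for notes: parse the JSON export, then map fields onto the schema.
function readNotes(json: string): BrainDocument[] {
  const raw = JSON.parse(json) as RawNote[];
  return raw.map((n): BrainDocument => ({
    id: `note-${n.id}`,
    content: n.body,
    source: "notes",
    title: n.heading,
    tags: n.labels,
    createdAt: n.created,
    metadata: {},
  }));
}

// Reader for bookmarks: same output type, with the URL kept in metadata.
function readBookmarks(json: string): BrainDocument[] {
  const raw = JSON.parse(json) as RawBookmark[];
  return raw.map((b): BrainDocument => ({
    id: `bookmark-${b.id}`,
    content: b.summary,
    source: "bookmarks",
    title: b.name,
    tags: b.topics,
    createdAt: b.savedOn,
    metadata: { url: b.url },
  }));
}

// Usage with small in-memory samples standing in for files on disk.
const notesJson = JSON.stringify([
  { id: "003", heading: "Standup ideas", body: "Try async standups.", labels: ["process"], created: "2025-03-15" },
]);
const bookmarksJson = JSON.stringify([
  { id: "007", name: "React Server Components Deep Dive", url: "https://example.com/rsc", summary: "How RSC works.", topics: ["react"], savedOn: "2025-03-15" },
]);

const documents: BrainDocument[] = readNotes(notesJson).concat(readBookmarks(bookmarksJson));
```

Because both readers return `BrainDocument[]`, later stages can work on `documents` without checking which source an item came from.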

Content Hashing for Deduplication

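When you ingest from multiple sources, the same content can appear more than once: an article saved both as a bookmark and as a note, or a meeting summary pasted verbatim into a project doc. Hashing the normalized content catches these exact copies (it will not catch paraphrases or partial quotes):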

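import { createHash } from "crypto";

// Trim edge whitespace and lowercase the text so trivially different copies
// produce the same SHA-256 hex digest.
function hashContent(content: string): string {
  return createHash("sha256").update(content.trim().toLowerCase()).digest("hex");
}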

If two documents produce the same hash, keep the one with richer metadata (for example, the copy that has a URL or an attendee list attached) and drop the other.
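One way to sketch that rule is shown below. Two things here are assumptions rather than part of the original design: "richer metadata" is measured as the number of metadata keys, and `Doc` is a trimmed-down, hypothetical stand-in for the full document type.

```typescript
import { createHash } from "crypto";

// Trimmed-down stand-in for the full document type (hypothetical).
interface Doc {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

function hashContent(content: string): string {
  return createHash("sha256").update(content.trim().toLowerCase()).digest("hex");
}

// Keep one document per content hash. When two collide, keep the one with
// more metadata keys (an assumed proxy for "richer metadata").
function deduplicate(docs: Doc[]): Doc[] {
  const byHash = new Map<string, Doc>();
  for (const doc of docs) {
    const key = hashContent(doc.content);
    const kept = byHash.get(key);
    if (
      kept === undefined ||
      Object.keys(doc.metadata).length > Object.keys(kept.metadata).length
    ) {
      // Overwriting an existing key keeps its original position in the Map.
      byHash.set(key, doc);
    }
  }
  return Array.from(byHash.values());
}

// Usage: the first two items differ only in case and edge whitespace.
const unique = deduplicate([
  { id: "note-003", content: "Ship the beta on Friday.", metadata: {} },
  { id: "meeting-001", content: "  ship the beta on friday. ", metadata: { attendees: ["Ana", "Ben"] } },
  { id: "note-004", content: "Review the roadmap.", metadata: {} },
]);
```

Here `unique` holds `meeting-001` and `note-004`: the meeting copy wins the collision because it has one metadata key and the note has none. Ties keep whichever copy was seen first.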

Making It Extensible

A well-designed ingestion layer makes adding new sources cheap. Each source is a function that takes raw data and returns Documents. When you want to add Slack messages or email archives later, you write one new reader function, and the downstream stages need no changes because they only ever see Documents.
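A small reader registry is one possible way to get that property. Everything named here is hypothetical: `registerReader`, `ingest`, the `BrainDocument` type name, and especially the Slack message shape, which is an invented export format and not Slack's actual API. The `source` field is typed as a plain string in this sketch so new sources can be added without touching the type.

```typescript
// Hypothetical name for the universal document type.
interface BrainDocument {
  id: string;
  content: string;
  source: string;
  title: string;
  tags: string[];
  createdAt: string;
  metadata: Record<string, unknown>;
}

// A reader takes raw text from one source and returns unified documents.
type Reader = (raw: string) => BrainDocument[];

const readers = new Map<string, Reader>();

function registerReader(source: string, reader: Reader): void {
  readers.set(source, reader);
}

// Downstream code calls ingest() and never needs to know about individual sources.
function ingest(source: string, raw: string): BrainDocument[] {
  const reader = readers.get(source);
  if (reader === undefined) {
    throw new Error(`No reader registered for source "${source}"`);
  }
  return reader(raw);
}

// Adding Slack later is a single registration. The message shape is invented.
interface RawSlackMessage { id: string; channel: string; text: string; date: string; }

registerReader("slack", (raw) =>
  (JSON.parse(raw) as RawSlackMessage[]).map((m): BrainDocument => ({
    id: `slack-${m.id}`,
    content: m.text,
    source: "slack",
    title: `Message in #${m.channel}`,
    tags: [],
    createdAt: m.date,
    metadata: { channel: m.channel },
  }))
);

const slackDocs = ingest(
  "slack",
  JSON.stringify([{ id: "001", channel: "launch", text: "Beta ships Friday.", date: "2025-03-15" }])
);
```

Asking for a source with no registered reader fails loudly instead of silently skipping data, which makes a missing reader easy to spot.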

Key Takeaways

A second brain starts with unified ingestion — one schema for all sources.

Core fields (id, content, source, title, tags, createdAt) are universal; metadata varies by source.

Content hashing prevents duplicates when the same idea appears in multiple sources.

Design for extensibility — new sources should require only a new reader function.

This is chapter 1 of AI-Powered Second Brain.
