
Collect Everything

Ingesting Notes, Bookmarks & Docs

The Collection Problem

You have knowledge scattered everywhere. Notes in one app, bookmarks in another, articles you half-read in a third, meeting notes in documents, project updates in chat threads. Each source has a different format, different metadata, different structure.

A second brain starts by solving this: get everything into one place, in one format, with the right metadata attached.

Why Unified Ingestion Matters

Without a unified ingestion layer, you end up with:

  • Notes you can't find because they're in a different app than your search
  • Bookmarks saved and forgotten because they're not connected to your notes
  • Meeting action items buried in transcripts nobody re-reads
  • Project context lost when you switch tools

The fix isn't a better app. It's a pipeline that normalizes everything into a common format.

    The Universal Document Schema

    Every piece of knowledge, regardless of source, has these core properties:

    | Field | Purpose | Example |
    |---|---|---|
    | `id` | Unique identifier | `note-003`, `bookmark-007` |
    | `content` | The actual text | Note body, article summary, meeting transcript |
    | `source` | Where it came from | `notes`, `bookmarks`, `articles`, `meetings`, `projects` |
    | `title` | Human-readable label | "React Server Components Deep Dive" |
    | `tags` | Topic labels | `["react", "architecture", "frontend"]` |
    | `createdAt` | When it was captured | `2025-03-15` |
    | `metadata` | Source-specific extras | URL for bookmarks, attendees for meetings |

    The key insight: metadata varies by source, but the core schema is universal. A bookmark has a URL. A meeting note has attendees. But both have content, tags, and a date.
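
As a rough sketch, the fields listed above could be written down as a TypeScript type. The names `UnifiedDocument` and `Source`, the example URL, and the attendee names are assumptions made for illustration (the type is not called `Document` here only to avoid clashing with the browser's global `Document`); the field names and sample values come from the schema itself.

```typescript
// Hypothetical type names; field names follow the schema described above.
type Source = "notes" | "bookmarks" | "articles" | "meetings" | "projects";

interface UnifiedDocument {
  id: string;                         // e.g. "note-003"
  content: string;                    // the actual text
  source: Source;                     // where it came from
  title: string;                      // human-readable label
  tags: string[];                     // topic labels
  createdAt: string;                  // capture date, e.g. "2025-03-15"
  metadata: Record<string, unknown>;  // source-specific extras
}

// A bookmark carries a URL in its metadata...
const bookmark: UnifiedDocument = {
  id: "bookmark-007",
  content: "Summary of an article about React Server Components.",
  source: "bookmarks",
  title: "React Server Components Deep Dive",
  tags: ["react", "architecture", "frontend"],
  createdAt: "2025-03-15",
  metadata: { url: "https://example.com/rsc-deep-dive" }, // placeholder URL
};

// ...while a meeting note carries attendees. The core fields are identical.
const meeting: UnifiedDocument = {
  id: "meeting-002",
  content: "Agreed to ship the beta on Friday.",
  source: "meetings",
  title: "Weekly sync",
  tags: ["planning"],
  createdAt: "2025-03-14",
  metadata: { attendees: ["Ana", "Ben"] }, // made-up names
};
```

Typing `metadata` as an open record keeps the core schema fixed while letting each source attach whatever extras it has.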

    Building the Pipeline

    A good ingestion pipeline follows three steps:

  • Read: Load each data source from its native format (JSON, CSV, plaintext)
  • Transform: Map source-specific fields to the universal schema
  • Deduplicate: Hash the content to prevent the same item from being stored twice

    Source Files → Reader → Transformer → Deduplicator → Unified Documents

    Each source gets its own reader function, but they all output the same Document type. This makes the rest of the pipeline (chunking, search, connections) source-agnostic.
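
To make the read and transform steps concrete, here is a minimal sketch of two reader functions. The raw shapes (`RawBookmark`, `RawMeeting`), their field names, and the `makeId` helper are hypothetical; a real export from your bookmarking or meeting tool will look different. The point is only that both readers return the same type.

```typescript
// Hypothetical name for the article's Document type.
interface UnifiedDocument {
  id: string;
  content: string;
  source: string;
  title: string;
  tags: string[];
  createdAt: string;
  metadata: Record<string, unknown>;
}

// Assumed raw export shapes, invented for this example.
interface RawBookmark { url: string; title: string; description: string; labels: string[]; savedAt: string; }
interface RawMeeting { subject: string; transcript: string; attendees: string[]; date: string; }

// Builds ids like "bookmark-001" (zero-padded to three digits).
function makeId(prefix: string, index: number): string {
  return `${prefix}-${("000" + (index + 1)).slice(-3)}`;
}

// Read: parse the JSON export. Transform: map its fields onto the schema.
function readBookmarks(json: string): UnifiedDocument[] {
  const raw = JSON.parse(json) as RawBookmark[];
  return raw.map((b, i): UnifiedDocument => ({
    id: makeId("bookmark", i),
    content: b.description,
    source: "bookmarks",
    title: b.title,
    tags: b.labels,
    createdAt: b.savedAt,
    metadata: { url: b.url },
  }));
}

function readMeetings(json: string): UnifiedDocument[] {
  const raw = JSON.parse(json) as RawMeeting[];
  return raw.map((m, i): UnifiedDocument => ({
    id: makeId("meeting", i),
    content: m.transcript,
    source: "meetings",
    title: m.subject,
    tags: [], // this assumed export has no labels
    createdAt: m.date,
    metadata: { attendees: m.attendees },
  }));
}
```

Anything downstream can take `UnifiedDocument[]` without knowing which reader produced it.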

    Content Hashing for Deduplication

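    When you ingest from multiple sources, the same content can appear more than once, for example a note that also exists verbatim as a bookmark summary, or a meeting summary pasted unchanged into a project doc. Hashing the normalized content catches these exact copies (a partial quote or a paraphrase produces a different hash and is not caught):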

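    import { createHash } from "crypto";
    
    // Trim and lowercase before hashing so copies that differ only in
    // surrounding whitespace or letter case produce the same SHA-256 digest.
    function hashContent(content: string): string {
      return createHash("sha256").update(content.trim().toLowerCase()).digest("hex");
    }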

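    If two documents produce the same hash, treat them as duplicates: keep the one with richer metadata (for example, the copy that carries a URL or an attendee list) and drop the other.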
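
One way to apply that rule is sketched below. The text does not define "richer metadata", so this example assumes it means "more metadata keys"; that measure, the `DocLike` shape, and the `deduplicate` function are illustrative choices, not a prescribed implementation.

```typescript
import { createHash } from "crypto";

// Minimal shape needed for deduplication (hypothetical name).
interface DocLike {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

function hashContent(content: string): string {
  return createHash("sha256").update(content.trim().toLowerCase()).digest("hex");
}

// Keeps one document per content hash. On a collision, the document with
// more metadata keys wins (an assumed stand-in for "richer metadata").
function deduplicate<T extends DocLike>(docs: T[]): T[] {
  const byHash = new Map<string, T>();
  for (const doc of docs) {
    const hash = hashContent(doc.content);
    const existing = byHash.get(hash);
    if (
      !existing ||
      Object.keys(doc.metadata).length > Object.keys(existing.metadata).length
    ) {
      byHash.set(hash, doc);
    }
  }
  // A Map keeps the position of a key's first insertion, so output order
  // follows the first appearance of each distinct piece of content.
  return Array.from(byHash.values());
}
```

If both copies have the same number of metadata keys, the first one seen is kept.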

    Making It Extensible

    A well-designed ingestion layer makes adding new sources trivial. Each source is a function that takes raw data and returns Documents. When you want to add Slack messages or email archives later, you write one new reader function — everything downstream just works.
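
A small reader registry is one possible way to get this property. In the sketch below, `readers`, `ingest`, the plaintext note layout (first line is the title), and the Slack export shape are all assumptions invented for the example; Slack's real export format is not modeled here.

```typescript
// Hypothetical name for the article's Document type.
interface UnifiedDocument {
  id: string;
  content: string;
  source: string;
  title: string;
  tags: string[];
  createdAt: string;
  metadata: Record<string, unknown>;
}

// Every reader takes raw text and returns unified documents.
type Reader = (raw: string) => UnifiedDocument[];

const readers: Record<string, Reader> = {
  // Assumed plaintext note layout: first line is the title, the rest is the body.
  notes: (raw) => {
    const [title = "", ...body] = raw.split("\n");
    return [{
      id: "note-001",
      content: body.join("\n").trim(),
      source: "notes",
      title,
      tags: [],
      createdAt: "", // plaintext notes carry no date in this sketch
      metadata: {},
    }];
  },
};

// Runs the matching reader for each source; nothing here is source-specific.
function ingest(inputs: Record<string, string>): UnifiedDocument[] {
  const docs: UnifiedDocument[] = [];
  for (const source of Object.keys(inputs)) {
    const reader = readers[source];
    if (!reader) throw new Error(`No reader registered for source "${source}"`);
    docs.push(...reader(inputs[source] ?? ""));
  }
  return docs;
}

// Adding a new source later means registering one more reader.
// The message shape below is made up, not Slack's actual export format.
readers["slack"] = (raw) => {
  const messages = JSON.parse(raw) as
    { channel: string; user: string; text: string; date: string }[];
  return messages.map((m, i): UnifiedDocument => ({
    id: `slack-${("000" + (i + 1)).slice(-3)}`,
    content: m.text,
    source: "slack",
    title: `#${m.channel}: ${m.user}`,
    tags: [m.channel],
    createdAt: m.date,
    metadata: { channel: m.channel, user: m.user },
  }));
};
```

`ingest` did not change when the Slack reader was added, which is the property the paragraph above describes.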

    Key Takeaways

  • A second brain starts with unified ingestion — one schema for all sources.
  • Core fields (id, content, source, title, tags, createdAt) are universal; metadata varies by source.
  • Content hashing prevents duplicates when the same content appears in multiple sources.
  • Design for extensibility — new sources should require only a new reader function.

    This is chapter 1 of AI-Powered Second Brain.
