Code Parsing

Teaching Your AI to Read Diffs

Why Parsing Matters

A raw unified diff is just text -- lines starting with + and -. But a useful code review agent needs far more context: which functions were modified, what language the code is written in, whether the change is a refactor or a new feature, and which symbols are affected across the codebase.

Parsing transforms raw text into structured data that downstream analysis can consume. Without it, your AI is reviewing strings. With it, your AI understands code.

The principle: Parse once, analyze many times. The structured output of the parsing layer feeds every subsequent module -- detection, scoring, and review generation all depend on accurate parsing.

Unified Diff Format

Every code review system starts with the diff. The unified diff format has a precise structure:

| Element | Pattern | Purpose |
| --- | --- | --- |
| File header | `--- a/path` / `+++ b/path` | Identifies which file changed |
| Hunk header | `@@ -15,7 +15,18 @@` | Locates the change within the file |
| Context lines | Lines starting with a space | Unchanged surrounding code |
| Removed lines | Lines starting with `-` | Code that was deleted |
| Added lines | Lines starting with `+` | Code that was added |

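The hunk header @@ -15,7 +15,18 @@ means: the hunk starts at line 15 in the old file and covers 7 lines there, and starts at line 15 in the new file and covers 18 lines there. Each count includes every line on that side of the hunk (context plus removed lines for the old file, context plus added lines for the new file), not just the context. This line attribution is critical -- when the review generator creates a comment like "SQL injection on line 42," it needs exact line numbers from the parser.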
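
The line attribution described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the names `HUNK_RE`, `DiffLine`, and `parse_hunk` are hypothetical, and the function handles a single hunk whose first element is the header line.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@".
# In unified diffs an omitted count means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class DiffLine:
    kind: str                  # "context", "added", or "removed"
    old_lineno: Optional[int]  # None for added lines
    new_lineno: Optional[int]  # None for removed lines
    text: str


def parse_hunk(hunk_lines: list[str]) -> list[DiffLine]:
    """Attach old-file and new-file line numbers to every line of one hunk."""
    match = HUNK_RE.match(hunk_lines[0])
    if match is None:
        raise ValueError(f"not a hunk header: {hunk_lines[0]!r}")
    old_no, new_no = int(match.group(1)), int(match.group(3))

    result = []
    for raw in hunk_lines[1:]:
        if raw.startswith("\\"):
            # Skip the "\ No newline at end of file" marker.
            continue
        marker, text = raw[:1], raw[1:]
        if marker == "+":
            result.append(DiffLine("added", None, new_no, text))
            new_no += 1
        elif marker == "-":
            result.append(DiffLine("removed", old_no, None, text))
            old_no += 1
        else:
            # Context lines exist in both files, so both counters advance.
            result.append(DiffLine("context", old_no, new_no, text))
            old_no += 1
            new_no += 1
    return result
```

Added lines carry only a new-file line number and removed lines only an old-file one, which is what lets a review comment point at the correct side of the diff.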

AST Analysis

The diff parser tells you WHAT lines changed. The AST analyzer tells you WHAT THOSE LINES MEAN:

  • Functions -- Which functions were added, modified, or removed
  • Classes -- Which class definitions changed
  • Imports -- What new dependencies were introduced
  • Exports -- What the public API surface looks like now

For each changed line, the analyzer identifies which function or class scope it belongs to. This is how the parser knows "line 42 is inside the processPayment function" -- essential context for generating useful review comments.
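
As a rough sketch of that scope lookup, the following uses Python's standard `ast` module as a stand-in parser; a TypeScript codebase would need a TypeScript parser instead. The function name `enclosing_scopes` and its return shape are assumptions made for this example.

```python
import ast


def enclosing_scopes(source: str, changed_lines: set[int]) -> dict[int, str]:
    """Map each changed line to the innermost function or class containing it.

    Lines that fall outside every function and class (module level) are
    left out of the result.
    """
    tree = ast.parse(source)
    scopes: dict[int, str] = {}

    def visit(node: ast.AST, path: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                child_path = path + [child.name]
                for lineno in changed_lines:
                    if child.lineno <= lineno <= child.end_lineno:
                        # Nested scopes are visited afterwards and overwrite
                        # this entry, so the innermost scope wins.
                        scopes[lineno] = ".".join(child_path)
                visit(child, child_path)
            else:
                visit(child, path)

    visit(tree, [])
    return scopes
```

Feeding it the new-file line numbers of the added lines in a diff yields dotted scope names such as `Cart.total`, which a review generator could include in its comments.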

Symbol Resolution

Symbol resolution builds a dependency map: which symbols are defined where, and which files reference them. When a PR modifies createCheckoutSession in stripe.ts, and that function is called from 3 other files, the reviewer needs to know.

The blast radius -- how many files are affected by a change -- is a key risk signal. A change to a widely-imported utility is riskier than a change to an internal helper.
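
One way to picture the dependency map is as an inverted index from symbol name to the files that reference it. The sketch below assumes an earlier step has already extracted, per file, the set of symbol names it references; `build_reference_index` and `blast_radius` are hypothetical names. It counts direct references only and keys symbols by bare name, so transitive dependents and name collisions are ignored.

```python
from collections import defaultdict


def build_reference_index(refs_by_file: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert {file: symbols it references} into {symbol: files referencing it}."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, symbols in refs_by_file.items():
        for symbol in symbols:
            index[symbol].add(path)
    return dict(index)


def blast_radius(
    changed_symbols: set[str],
    defined_in: str,
    index: dict[str, set[str]],
) -> set[str]:
    """Return the files, other than the defining file, that reference a changed symbol."""
    affected: set[str] = set()
    for symbol in changed_symbols:
        affected |= index.get(symbol, set())
    affected.discard(defined_in)
    return affected


# Hypothetical project layout: only stripe.ts and createCheckoutSession
# come from the text above; the other files and symbols are made up.
refs = {
    "stripe.ts": {"createCheckoutSession"},
    "checkout.ts": {"createCheckoutSession"},
    "cart.ts": {"createCheckoutSession", "formatPrice"},
    "webhooks.ts": {"createCheckoutSession"},
    "utils.ts": {"formatPrice"},
}
index = build_reference_index(refs)
affected_files = blast_radius({"createCheckoutSession"}, "stripe.ts", index)
```

The size of `affected_files` is the blast radius in the sense used above: here the change reaches three files besides stripe.ts.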

Change Classification

The classifier combines signals to determine change type:

  • Feature -- New files, new exports, new routes (high confidence when multiple signals agree)
  • Bugfix -- Small targeted changes, often touching error handling
  • Refactor -- Code moved between files, symbols renamed, duplication reduced
  • Test -- Only test files modified
  • Docs -- Only documentation changed

Classification confidence reflects signal agreement. A PR adding new files AND new routes AND new types is almost certainly a feature. A PR modifying both production and test code is ambiguous. The confidence score communicates this uncertainty to downstream modules.
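
A toy version of signal-based classification might look like the following. Everything here is an assumption made for illustration: the signal names, the vote table, and the confidence formula (the winning type's votes divided by total votes plus one) are arbitrary choices that merely reward several agreeing signals over a single one.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ChangeSignals:
    # Hypothetical boolean signals derived from the parsed diff.
    new_files: bool = False
    new_exports: bool = False
    new_routes: bool = False
    small_targeted_change: bool = False
    touches_error_handling: bool = False
    symbols_renamed: bool = False
    code_moved_between_files: bool = False
    only_test_files: bool = False
    only_doc_files: bool = False


# Which change type each signal votes for.
VOTES = {
    "new_files": "feature",
    "new_exports": "feature",
    "new_routes": "feature",
    "small_targeted_change": "bugfix",
    "touches_error_handling": "bugfix",
    "symbols_renamed": "refactor",
    "code_moved_between_files": "refactor",
}


def classify(signals: ChangeSignals) -> tuple[str, float]:
    """Return (change_type, confidence) for a set of signals."""
    # "Only tests" and "only docs" are treated as decisive in this sketch.
    if signals.only_test_files:
        return "test", 1.0
    if signals.only_doc_files:
        return "docs", 1.0

    votes = Counter(
        change_type for name, change_type in VOTES.items() if getattr(signals, name)
    )
    if not votes:
        return "unknown", 0.0

    # On a tie, the type whose signal appears first in VOTES wins.
    change_type, count = votes.most_common(1)[0]
    # The +1 keeps a lone signal from reaching full confidence.
    confidence = count / (sum(votes.values()) + 1)
    return change_type, confidence
```

With this formula, three agreeing feature signals score higher than one, and a conflicting bugfix signal pulls the score back down, which is the kind of uncertainty a downstream module could act on.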

This is chapter 1 of AI Code Review Agent.
