Code Parsing
Teaching Your AI to Read Diffs
Why Parsing Matters
A raw unified diff is just text -- lines starting with + and -. But a useful code review agent needs far more context: which functions were modified, what language the code is written in, whether the change is a refactor or a new feature, and which symbols are affected across the codebase.
Parsing transforms raw text into structured data that downstream analysis can consume. Without it, your AI is reviewing strings. With it, your AI understands code.
The principle: Parse once, analyze many times. The structured output of the parsing layer feeds every subsequent module -- detection, scoring, and review generation all depend on accurate parsing.
Unified Diff Format
Every code review system starts with the diff. The unified diff format has a precise structure:
| Element | Pattern | Purpose |
|---|---|---|
| File header | `--- a/path` / `+++ b/path` | Identifies which file changed |
| Hunk header | `@@ -15,7 +15,18 @@` | Locates the change within the file |
| Context lines | Lines starting with space | Unchanged surrounding code |
| Removed lines | Lines starting with `-` | Code that was deleted |
| Added lines | Lines starting with `+` | Code that was added |
The hunk header `@@ -15,7 +15,18 @@` means: the hunk starts at line 15 in the old file and spans 7 lines there, and it starts at line 15 in the new file and spans 18 lines there. Each count covers every line of the hunk on that side, context lines included, not just the context. This line attribution is critical -- when the review generator creates a comment like "SQL injection on line 42," it needs exact line numbers from the parser.
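To make the line attribution concrete, here is a minimal sketch of reading a hunk header and numbering the lines that follow it. The names `parse_hunk_header`, `number_hunk_lines`, and `DiffLine` are illustrative assumptions, not part of any particular library, and the sketch handles a single hunk only.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Matches "@@ -old_start[,old_len] +new_start[,new_len] @@"; a missing length means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class DiffLine:  # hypothetical record type for this sketch
    kind: str                 # "context", "added", or "removed"
    old_line: Optional[int]   # line number in the old file, if the line exists there
    new_line: Optional[int]   # line number in the new file, if the line exists there
    text: str


def parse_hunk_header(header: str) -> Tuple[int, int, int, int]:
    """Return (old_start, old_len, new_start, new_len)."""
    m = HUNK_RE.match(header)
    if m is None:
        raise ValueError(f"not a hunk header: {header!r}")
    old_start, old_len, new_start, new_len = m.groups()
    return (
        int(old_start),
        int(old_len) if old_len is not None else 1,
        int(new_start),
        int(new_len) if new_len is not None else 1,
    )


def number_hunk_lines(header: str, body: List[str]) -> List[DiffLine]:
    """Attach old/new line numbers to each line of one hunk body."""
    old_no, _, new_no, _ = parse_hunk_header(header)
    out: List[DiffLine] = []
    for raw in body:
        if raw.startswith("\\"):  # e.g. "\ No newline at end of file"
            continue
        marker, text = raw[:1], raw[1:]
        if marker == "+":
            out.append(DiffLine("added", None, new_no, text))
            new_no += 1
        elif marker == "-":
            out.append(DiffLine("removed", old_no, None, text))
            old_no += 1
        else:  # context line: present in both files
            out.append(DiffLine("context", old_no, new_no, text))
            old_no += 1
            new_no += 1
    return out
```

Added lines carry only a new-file number and removed lines only an old-file number, which is what a review comment needs in order to point at the right side of the diff.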
AST Analysis
The diff parser tells you WHAT lines changed. The AST analyzer tells you WHAT THOSE LINES MEAN:
For each changed line, the analyzer identifies which function or class scope it belongs to. This is how the parser knows "line 42 is inside the processPayment function" -- essential context for generating useful review comments.
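As a rough illustration of scope mapping, the sketch below uses Python's standard `ast` module to find the innermost function or class that contains each changed line. It assumes the changed file is Python source; a TypeScript file would need a different parser, and the function name `enclosing_scopes` is hypothetical.

```python
import ast
from typing import Dict, Iterable, List, Optional, Tuple


def enclosing_scopes(source: str, changed_lines: Iterable[int]) -> Dict[int, Optional[str]]:
    """Map each changed line to the dotted name of its innermost def/class, or None."""
    tree = ast.parse(source)
    scopes: List[Tuple[int, int, str]] = []  # (first_line, last_line, qualified_name)

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}.{child.name}" if prefix else child.name
                scopes.append((child.lineno, child.end_lineno, name))
                visit(child, name)
            else:
                visit(child, prefix)

    visit(tree, "")

    result: Dict[int, Optional[str]] = {}
    for line in changed_lines:
        containing = [s for s in scopes if s[0] <= line <= s[1]]
        # The innermost scope is the one spanning the fewest lines.
        result[line] = min(containing, key=lambda s: s[1] - s[0])[2] if containing else None
    return result
```

Feeding it the new-file line numbers of added lines from the diff parser yields labels such as `PaymentService.process_payment`, while module-level lines map to `None`.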
Symbol Resolution
Symbol resolution builds a dependency map: which symbols are defined where, and which files reference them. When a PR modifies createCheckoutSession in stripe.ts, and that function is called from 3 other files, the reviewer needs to know.
The blast radius -- how many files are affected by a change -- is a key risk signal. A change to a widely-imported utility is riskier than a change to an internal helper.
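One way to sketch the dependency map is as an inverted index from symbol to referencing files. The code below assumes an earlier pass has already extracted, per file, the set of symbols it references. The function names and the example file paths in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_reference_index(references: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert {file: symbols it references} into {symbol: files that reference it}."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path, symbols in references.items():
        for symbol in symbols:
            index[symbol].add(path)
    return dict(index)


def blast_radius(
    changed_symbols: Iterable[str],
    definitions: Dict[str, str],      # symbol -> file that defines it
    index: Dict[str, Set[str]],       # symbol -> files that reference it
) -> Set[str]:
    """Files other than the defining file that reference any changed symbol."""
    affected: Set[str] = set()
    for symbol in changed_symbols:
        defining_file = definitions.get(symbol)
        affected |= {p for p in index.get(symbol, set()) if p != defining_file}
    return affected
```

For the example above, with `createCheckoutSession` defined in `stripe.ts` and referenced from three made-up callers such as `checkout.ts`, `cart.ts`, and `webhooks.ts`, `blast_radius` would return those three paths, and the size of that set can serve as the risk signal.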
Change Classification
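The classifier combines signals, such as whether the PR adds new files, routes, or types and whether it touches production or test code, to determine change type.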
Classification confidence reflects signal agreement. A PR adding new files AND new routes AND new types is almost certainly a feature. A PR modifying both production and test code is ambiguous. The confidence score communicates this uncertainty to downstream modules.
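The idea of confidence as signal agreement can be sketched with a simple voting scheme. Everything here is an assumption made for illustration: the signal names, the change types, and the equal-weight voting rule are not the course's actual classifier.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping: each signal votes for the change types it is consistent with.
SIGNAL_VOTES: Dict[str, List[str]] = {
    "new_files": ["feature"],
    "new_routes": ["feature"],
    "new_types": ["feature"],
    "production_modified": ["feature", "bugfix", "refactor"],
    "tests_modified": ["feature", "bugfix", "refactor"],
}


def classify(signals: Iterable[str]) -> Tuple[str, float]:
    """Return (change_type, confidence), where confidence is the winner's share of votes."""
    scores: Counter = Counter()
    for signal in signals:
        candidates = SIGNAL_VOTES.get(signal, [])
        for change_type in candidates:
            # A signal consistent with several types splits its vote evenly.
            scores[change_type] += 1.0 / len(candidates)
    if not scores:
        return ("unknown", 0.0)
    total = sum(scores.values())
    label, top = max(scores.items(), key=lambda kv: kv[1])
    return (label, top / total)
```

With this scheme, three agreeing feature signals give a confidence of 1.0, while a PR that only shows `production_modified` and `tests_modified` spreads its votes across three types and lands near one third, which is the uncertainty downstream modules should see.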
This is chapter 1 of AI Code Review Agent.