Production Pipeline
From Demo to Automated Workflow
The 30/70 Rule
Building the review tool is 30% of the work. Making it run automatically, collect feedback, and improve over time is the other 70%. A code review agent that requires manual invocation gets abandoned within a week. One that automatically reviews every PR and posts comments on GitHub becomes part of the team's workflow.
GitHub Webhook Integration
GitHub sends HTTP POST requests to your URL when events happen. For code review, three events matter:
| Event | When | Action |
|---|---|---|
| `pull_request.opened` | New PR created | Run initial review |
| `pull_request.synchronize` | New commits pushed to PR | Re-run review on updated diff |
| `pull_request.reopened` | Closed PR reopened | Re-run review |
The webhook handler fetches the PR's current diff (GitHub returns a raw diff when you request `GET /repos/{owner}/{repo}/pulls/{number}` with the `application/vnd.github.diff` media type), runs the review on it, and posts the results back to the PR.
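A minimal sketch of that handler, assuming a Flask app, the `requests` library, a shared webhook secret, and a `run_review()` entry point into the agent; these names are placeholders, not part of GitHub's API:

```python
import hashlib
import hmac
import os

import requests
from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]   # set when registering the webhook
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REVIEW_ACTIONS = {"opened", "synchronize", "reopened"}


def run_review(repo: str, pr_number: int, diff: str) -> None:
    """Placeholder: hand the diff to the review agent (not shown here)."""


@app.post("/webhooks/github")
def handle_webhook():
    # Reject any payload that doesn't carry a valid HMAC signature from GitHub.
    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), request.data, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, request.headers.get("X-Hub-Signature-256", "")):
        abort(401)

    payload = request.get_json()
    if (request.headers.get("X-GitHub-Event") != "pull_request"
            or payload.get("action") not in REVIEW_ACTIONS):
        return "", 204  # ignore everything except the three events above

    repo = payload["repository"]["full_name"]
    number = payload["pull_request"]["number"]
    # Fetch the raw diff by requesting the PR with the diff media type.
    diff = requests.get(
        f"https://api.github.com/repos/{repo}/pulls/{number}",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github.diff",
        },
        timeout=30,
    ).text

    run_review(repo, number, diff)
    return "", 202
```

GitHub expects webhook deliveries to be acknowledged quickly, so in a real deployment the handler would enqueue the review job and return immediately rather than run it inline.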
CI/CD as a Merge Gate

Posting review comments is informative. Blocking merges is enforcing. GitHub commit statuses and check runs serve this purpose:
Commit statuses are simple: success, failure, or pending. They appear as green check, red X, or yellow dot in the PR. When configured as a required status check, the merge button is disabled until the check passes.
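Setting a status is one API call per commit. A sketch with `requests`; the context name and report URL are placeholders:

```python
import requests

def post_commit_status(repo: str, sha: str, state: str,
                       description: str, token: str) -> None:
    """Set a commit status; state is one of pending, success, failure, or error."""
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/statuses/{sha}",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "state": state,
            "context": "ai-code-review",        # label shown next to the check in the PR
            "description": description[:140],   # keep it short; GitHub caps the length
            "target_url": "https://reviews.example.com/latest",  # placeholder report link
        },
        timeout=10,
    )
    resp.raise_for_status()
```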
Check runs are richer. They support annotations (inline comments in GitHub's diff view), markdown summaries in the Checks tab, and conclusions (success, failure, neutral, action_required). The annotations appear exactly like a human reviewer's comments.
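Creating a check run looks like the sketch below (again with `requests`). The check runs endpoint requires authenticating as a GitHub App installation; the annotation fields shown follow GitHub's Checks API:

```python
import requests

def post_check_run(repo: str, head_sha: str, conclusion: str, summary: str,
                   annotations: list[dict], token: str) -> None:
    """Create a completed check run whose annotations appear inline in the diff."""
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/check-runs",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "name": "ai-code-review",
            "head_sha": head_sha,
            "status": "completed",
            "conclusion": conclusion,            # success, failure, neutral, action_required, ...
            "output": {
                "title": "AI code review",
                "summary": summary,              # markdown rendered in the Checks tab
                "annotations": annotations[:50], # the API accepts at most 50 per request
            },
        },
        timeout=10,
    )
    resp.raise_for_status()

# One annotation per finding:
example_annotation = {
    "path": "src/app.py",            # hypothetical file
    "start_line": 42,
    "end_line": 42,
    "annotation_level": "warning",   # notice, warning, or failure
    "message": "Query built with string formatting; possible SQL injection.",
}
```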
The decision logic: any dimension scoring below the fail threshold (< 60) or any critical issue present sets the check to failure. Warnings allow merge but surface suggestions.
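That logic fits in a few lines. The 0-100 score range and the issue shape are illustrative assumptions:

```python
FAIL_THRESHOLD = 60  # dimensions scored 0-100; anything below this fails the check

def decide_conclusion(dimension_scores: dict[str, int], issues: list[dict]) -> str:
    """Map review results onto a check-run conclusion."""
    if any(score < FAIL_THRESHOLD for score in dimension_scores.values()):
        return "failure"
    if any(issue.get("severity") == "critical" for issue in issues):
        return "failure"
    # Warnings still go out as annotations, but they don't block the merge.
    return "success"
```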
The Feedback Flywheel
Static analysis tools start strong and decay. Rules that were accurate at launch accumulate false positives as the codebase evolves. The feedback loop prevents this decay.
When the agent posts a review comment, it includes a feedback mechanism (reaction buttons, a link, or integration with the PR review UI) so developers can mark each comment as accurate or as a false positive.
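One low-friction option is to read thumbs-up / thumbs-down reactions on the agent's own comments through GitHub's reactions endpoint for PR review comments. The mapping from reactions to labels below is an assumption, not something GitHub defines:

```python
import requests

def comment_feedback(repo: str, comment_id: int, token: str) -> str | None:
    """Read +1/-1 reactions on one of the agent's PR review comments."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/pulls/comments/{comment_id}/reactions",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    counts = {"+1": 0, "-1": 0}
    for reaction in resp.json():
        if reaction["content"] in counts:
            counts[reaction["content"]] += 1
    if counts["+1"] == counts["-1"] == 0:
        return None  # no feedback yet
    return "true_positive" if counts["+1"] >= counts["-1"] else "false_positive"
```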
This feedback drives three improvements; most directly, it supplies the labels behind the accuracy metrics below.
Accuracy Metrics
Two metrics matter for evaluating detector quality:
Precision -- Of the issues we flagged, how many were actually issues?
`precision = true_positives / (true_positives + false_positives)`

High precision means a low false-positive rate. Developers trust the tool because when it flags something, it's usually right.
Recall -- Of the actual issues that existed, how many did we catch?
`recall = true_positives / (true_positives + false_negatives)`

High recall means the tool doesn't miss real issues. The codebase is safer because issues don't slip through.
The trade-off: tuning for precision (fewer false positives) reduces recall (more missed issues), and vice versa. Production systems target precision >= 80% and recall >= 70% -- it's better to be trustworthy than exhaustive.
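Computing both metrics from collected feedback is a few lines of bookkeeping; the label names match the hypothetical ones in the feedback sketch above:

```python
from collections import Counter

def precision_recall(labels: list[str]) -> tuple[float, float]:
    """Compute precision and recall from per-finding feedback labels.

    Expected labels: "true_positive", "false_positive", and "false_negative"
    (false negatives come from real issues found later that the agent missed).
    """
    c = Counter(labels)
    tp, fp, fn = c["true_positive"], c["false_positive"], c["false_negative"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = ["true_positive", "true_positive", "false_positive", "false_negative"]
precision, recall = precision_recall(labels)
healthy = precision >= 0.80 and recall >= 0.70  # the targets discussed above
```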
Production Hardening
Five concerns separate a demo from a deployed system.