
Production Pipeline

From Demo to Automated Workflow

The 30/70 Rule

Building the review tool is 30% of the work. Making it run automatically, collect feedback, and improve over time is the other 70%. A code review agent that requires manual invocation gets abandoned within a week. One that automatically reviews every PR and posts comments on GitHub becomes part of the team's workflow.

GitHub Webhook Integration

GitHub sends HTTP POST requests to your URL when events happen. For code review, three events matter:

| Event | When | Action |
| --- | --- | --- |
| `pull_request.opened` | New PR created | Run initial review |
| `pull_request.synchronize` | New commits pushed to PR | Re-run review on updated diff |
| `pull_request.reopened` | Closed PR reopened | Re-run review |

The webhook handler:

  • Verifies the signature -- GitHub signs each payload with HMAC-SHA256. Always verify to prevent spoofed requests.
  • Fetches the diff -- Calls GET /repos/{owner}/{repo}/pulls/{number} with the `application/vnd.github.diff` Accept header to get the current diff.
  • Runs the pipeline -- Parse, detect, score, review -- the same pipeline from the API route.
  • Posts results -- Creates a PR review via the GitHub API with inline comments and summary.
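
A minimal handler sketch for the first two steps, assuming Express and Node 18+ (for the built-in fetch); `runReviewPipeline` is a hypothetical stand-in for the parse/detect/score/review pipeline, and posting results is sketched in the next section:

```typescript
import crypto from "node:crypto";
import express from "express";

const app = express();
// Keep the raw bytes: the HMAC must be computed over the exact payload GitHub signed.
app.use(express.raw({ type: "application/json" }));

// Hypothetical entry point for the parse → detect → score → review pipeline.
async function runReviewPipeline(pullRequest: { number: number }): Promise<void> {}

function verifySignature(secret: string, body: Buffer, signature?: string): boolean {
  if (!signature) return false;
  const expected =
    "sha256=" + crypto.createHmac("sha256", secret).update(body).digest("hex");
  // Constant-time comparison; timingSafeEqual throws on length mismatch, so check first.
  return (
    expected.length === signature.length &&
    crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(signature))
  );
}

// Fetch the PR diff via the REST API's diff media type.
async function fetchDiff(owner: string, repo: string, number: number): Promise<string> {
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/pulls/${number}`,
    {
      headers: {
        accept: "application/vnd.github.diff",
        authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      },
    },
  );
  return res.text();
}

app.post("/webhooks/github", async (req, res) => {
  const signature = req.header("x-hub-signature-256");
  if (!verifySignature(process.env.WEBHOOK_SECRET!, req.body, signature)) {
    return res.status(401).send("invalid signature");
  }
  const payload = JSON.parse(req.body.toString("utf8"));
  const relevant =
    req.header("x-github-event") === "pull_request" &&
    ["opened", "synchronize", "reopened"].includes(payload.action);
  if (!relevant) return res.status(200).send("ignored");
  // Acknowledge fast -- GitHub times out webhook deliveries after 10 seconds.
  res.status(202).send("queued");
  await runReviewPipeline(payload.pull_request);
});

app.listen(3000);
```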

CI/CD as a Merge Gate

Posting review comments is informative. Blocking merges is enforcing. GitHub commit statuses and check runs serve this purpose:

Commit statuses are simple: success, failure, or pending. They appear as a green check, red X, or yellow dot in the PR. When configured as a required status check, the merge button is disabled until the check passes.

Check runs are richer. They support annotations (inline comments in GitHub's diff view), markdown summaries in the Checks tab, and conclusions (success, failure, neutral, action_required). The annotations appear exactly like a human reviewer's comments.

The decision logic: any dimension scoring below the fail threshold (< 60) or any critical issue present sets the check to failure. Warnings allow the merge but surface suggestions.
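
A sketch of that gate as a check run, assuming Octokit and hypothetical `Issue`/`ReviewResult` shapes from the scoring step (note that the Checks API requires GitHub App authentication):

```typescript
import { Octokit } from "@octokit/rest";

// Hypothetical shapes for the pipeline's output (dimension scores and issues).
interface Issue {
  path: string;
  line: number;
  severity: "critical" | "major" | "minor";
  message: string;
}
interface ReviewResult {
  dimensionScores: Record<string, number>;
  issues: Issue[];
  summaryMarkdown: string;
}

const FAIL_THRESHOLD = 60;

export async function publishCheckRun(
  octokit: Octokit,
  owner: string,
  repo: string,
  headSha: string,
  result: ReviewResult,
): Promise<void> {
  // Any dimension below the fail threshold, or any critical issue, blocks the merge.
  const failed =
    Object.values(result.dimensionScores).some((score) => score < FAIL_THRESHOLD) ||
    result.issues.some((issue) => issue.severity === "critical");

  await octokit.rest.checks.create({
    owner,
    repo,
    name: "ai-code-review",
    head_sha: headSha,
    status: "completed",
    conclusion: failed ? "failure" : "success",
    output: {
      title: failed ? "Review failed" : "Review passed",
      summary: result.summaryMarkdown,
      // Annotations render inline in the diff view; the API accepts at most 50 per request.
      annotations: result.issues.slice(0, 50).map((issue) => ({
        path: issue.path,
        start_line: issue.line,
        end_line: issue.line,
        annotation_level:
          issue.severity === "critical" ? ("failure" as const) : ("warning" as const),
        message: issue.message,
      })),
    },
  });
}
```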

The Feedback Flywheel

Static analysis tools start strong and decay. Rules that were accurate at launch accumulate false positives as the codebase evolves. The feedback loop prevents this decay.

When the agent posts a review comment, it includes a feedback mechanism (reaction buttons, a link, or integration with the PR review UI). Developers mark each comment as:

  • Helpful -- Real issue, useful suggestion
  • Not helpful -- False positive or unhelpful suggestion
  • Resolved -- Developer fixed the issue

This feedback drives three improvements:

  • Per-rule accuracy tracking -- If a rule has a 40% "not helpful" rate, it needs tuning. Raise its confidence threshold or refine its pattern.
  • Severity calibration -- If developers consistently dismiss "medium" issues from a specific rule, that rule's severity might be too aggressive for the team.
  • Detector evolution -- Rules with high true-positive rates get promoted (lower confidence threshold, higher severity). Rules with high false-positive rates get demoted or disabled.
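
A sketch of the first improvement, per-rule accuracy tracking; the `Feedback` shape and field names are assumptions about how reactions get stored:

```typescript
// Hypothetical record stored each time a developer reacts to a review comment.
interface Feedback {
  ruleId: string;
  verdict: "helpful" | "not_helpful" | "resolved";
}

// Flag rules whose "not helpful" rate crosses the tuning threshold (40% above).
function rulesNeedingTuning(
  feedback: Feedback[],
  maxNotHelpfulRate = 0.4,
  minSample = 10, // don't judge a rule on a handful of reactions
): string[] {
  const stats = new Map<string, { total: number; notHelpful: number }>();
  for (const f of feedback) {
    const s = stats.get(f.ruleId) ?? { total: 0, notHelpful: 0 };
    s.total += 1;
    if (f.verdict === "not_helpful") s.notHelpful += 1;
    stats.set(f.ruleId, s);
  }
  return [...stats.entries()]
    .filter(([, s]) => s.total >= minSample && s.notHelpful / s.total >= maxNotHelpfulRate)
    .map(([ruleId]) => ruleId);
}
```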

Accuracy Metrics

Two metrics matter for evaluating detector quality:

Precision -- Of the issues we flagged, how many were actually issues?

    precision = true_positives / (true_positives + false_positives)

High precision means a low false-positive rate. Developers trust the tool because when it flags something, it's usually right.

Recall -- Of the actual issues that existed, how many did we catch?

    recall = true_positives / (true_positives + false_negatives)

High recall means the tool doesn't miss real issues. The codebase is safer because issues don't slip through.

The trade-off: tuning for precision (fewer false positives) reduces recall (more missed issues), and vice versa. Production systems target precision >= 80% and recall >= 70% -- it's better to be trustworthy than exhaustive.
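
The same formulas as runnable code, with hypothetical counts: 40 confirmed flags, 8 dismissed as false positives, and 12 real issues the agent missed:

```typescript
interface DetectorCounts {
  truePositives: number;   // flagged and confirmed ("helpful")
  falsePositives: number;  // flagged but dismissed ("not helpful")
  falseNegatives: number;  // real issues the agent missed (e.g. found by humans)
}

const precision = (c: DetectorCounts) =>
  c.truePositives / (c.truePositives + c.falsePositives);
const recall = (c: DetectorCounts) =>
  c.truePositives / (c.truePositives + c.falseNegatives);

const counts = { truePositives: 40, falsePositives: 8, falseNegatives: 12 };
console.log(precision(counts).toFixed(2)); // "0.83" -- above the 80% target
console.log(recall(counts).toFixed(2));    // "0.77" -- above the 70% target
```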

Production Hardening

Five concerns separate a demo from a deployed system:

  • Caching -- Cache parsed ASTs and detection results by file content hash. When a PR updates, only re-analyze changed files (see the sketch after this list).
  • Error isolation -- If one detector crashes on one file, the rest of the review should still complete.
  • Monitoring -- Track pipeline latency (p50/p95/p99), false positive rate (from feedback), and API rate limit headroom.
  • Cost control -- If using AI for enhanced detection, set token budgets per PR. Fall back to rule-only mode when exceeded.
  • Staged rollout -- Start in comment-only mode (no merge blocking). Observe false positive rates for 2 weeks. Then enable merge gating once precision exceeds 80%.
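
A sketch of the caching concern; `analyzeFile` and `FileFindings` are hypothetical stand-ins for the detector pipeline:

```typescript
import crypto from "node:crypto";

// Hypothetical detector entry point and result type.
interface FileFindings {
  issues: { line: number; message: string }[];
}
declare function analyzeFile(path: string, content: string): Promise<FileFindings>;

// In-memory for the sketch; production would use Redis or similar shared storage.
const cache = new Map<string, FileFindings>();

// Key by content hash, not path: files untouched by the latest push are never re-analyzed.
export async function analyzeWithCache(
  path: string,
  content: string,
): Promise<FileFindings> {
  const key = crypto.createHash("sha256").update(content).digest("hex");
  const hit = cache.get(key);
  if (hit) return hit;
  const findings = await analyzeFile(path, content);
  cache.set(key, findings);
  return findings;
}
```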