Quality Scoring
Quantifying Code Quality Across Dimensions
Why Numbers Drive Behavior
"14 issues found" doesn't help a developer decide whether a PR is safe to merge. "72/100 security score with a downward trend" does. Numbers create accountability, enable automation (block merge below threshold), and track progress over time.
A single aggregate score is tempting but useless. A PR can be flawless on security but terrible on maintainability. Multi-dimensional scoring surfaces these trade-offs explicitly.
Four Quality Dimensions
| Dimension | Measures | Key Signals |
|---|---|---|
| Correctness | Will the code work as intended? | Null checks, off-by-one, race conditions, unhandled errors |
| Security | Is the code safe from attackers? | SQL injection, XSS, hardcoded secrets, insecure crypto |
| Maintainability | Can other developers understand and modify this? | God functions, deep nesting, complexity, naming |
| Performance | Will this code perform well at scale? | N+1 queries, blocking I/O, missing indexes |
Each dimension starts at 100 and loses points for each detected issue. The deduction formula:

```
deduction = severityWeight * confidence * categoryMultiplier
```

Severity weights: critical (-25), high (-15), medium (-8), low (-3). These are scaled by the detector's confidence score, so a high-severity issue with 0.72 confidence loses fewer points than one with 0.97 confidence.
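The deduction can be sketched directly. The severity weights come from the text; the default category multiplier of 1.0 is an assumption for illustration:

```typescript
// Per-issue deduction: severityWeight * confidence * categoryMultiplier.
// The categoryMultiplier default (1.0, i.e. no adjustment) is an assumption.
type Severity = "critical" | "high" | "medium" | "low";

const severityWeight: Record<Severity, number> = {
  critical: 25,
  high: 15,
  medium: 8,
  low: 3,
};

function deduction(
  severity: Severity,
  confidence: number, // detector confidence in [0, 1]
  categoryMultiplier = 1.0, // assumed default: no category adjustment
): number {
  return severityWeight[severity] * confidence * categoryMultiplier;
}

// A high-severity issue at 0.72 confidence costs fewer points than at 0.97:
deduction("high", 0.72); // 10.8
deduction("high", 0.97); // 14.55
```

Scaling by confidence keeps low-confidence detections from tanking a score on their own.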
Weighted Aggregation
The overall score is a weighted average of the four dimensions, with default weights that reflect enterprise priorities.
These weights are configurable. A fintech company might set security to 50%. A game studio might boost performance to 30%. A startup in rapid iteration might reduce maintainability to 10%.
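A minimal sketch of the aggregation follows. The default weight values are illustrative placeholders, not the tool's actual defaults; the fintech usage example applies the 50% security weighting mentioned above:

```typescript
// Weighted average of the four dimension scores.
// The default weights below are assumed for illustration.
interface DimensionScores {
  correctness: number;
  security: number;
  maintainability: number;
  performance: number;
}

const defaultWeights: DimensionScores = {
  correctness: 0.3,
  security: 0.3,
  maintainability: 0.2,
  performance: 0.2,
};

function overallScore(
  scores: DimensionScores,
  weights: DimensionScores = defaultWeights,
): number {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  return (
    (scores.correctness * weights.correctness +
      scores.security * weights.security +
      scores.maintainability * weights.maintainability +
      scores.performance * weights.performance) / total
  );
}

// A fintech team boosting security to 50%:
overallScore(
  { correctness: 90, security: 60, maintainability: 80, performance: 85 },
  { correctness: 0.25, security: 0.5, maintainability: 0.15, performance: 0.1 },
); // 73
```

Normalizing by the weight total means custom weights don't have to sum to exactly 1.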
Thresholds and Verdicts
Each dimension gets a verdict (approve, comment, or request changes) based on configurable thresholds.
The overall verdict follows the strictest dimension: if any dimension fails, the PR gets "request changes." This prevents a PR with excellent correctness but critical security issues from slipping through.
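The strictest-dimension rule can be sketched as follows; the specific cutoff values (80 to approve, below 60 to request changes) are assumptions for illustration:

```typescript
// Threshold-based verdicts. Cutoff values are illustrative assumptions.
type Verdict = "approve" | "comment" | "request_changes";

interface Thresholds {
  approve: number; // score at or above this passes
  requestChanges: number; // score below this fails
}

const defaultThresholds: Thresholds = { approve: 80, requestChanges: 60 };

function dimensionVerdict(score: number, t: Thresholds = defaultThresholds): Verdict {
  if (score >= t.approve) return "approve";
  if (score < t.requestChanges) return "request_changes";
  return "comment";
}

// The overall verdict follows the strictest dimension: one failing
// dimension fails the whole PR.
function overallVerdict(scores: number[], t: Thresholds = defaultThresholds): Verdict {
  const verdicts = scores.map((s) => dimensionVerdict(s, t));
  if (verdicts.includes("request_changes")) return "request_changes";
  if (verdicts.includes("comment")) return "comment";
  return "approve";
}

// Excellent correctness, failing security -> request changes:
overallVerdict([95, 55, 88, 90]); // "request_changes"
```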
Trend Tracking
Individual PR scores are snapshots. Trends tell the story of codebase health over time.
The trend tracker maintains a rolling window of the last 20 PR scores and computes, for each dimension, whether the trend is improving, stable, or declining.
When the repository's security trend is declining, the reviewer adds a note: "This PR continues a downward trend in security scores." This turns a reactive tool (review this PR) into a proactive one (your codebase is getting worse).
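A minimal tracker sketch: keep the last 20 scores and compare the recent half of the window against the older half. The comparison method and the two-point stability band are assumptions, not the tool's actual algorithm:

```typescript
// Rolling-window trend tracker. Classification compares the average of
// the newer half of the window against the older half; the +/-2-point
// stability band is an assumed tuning value.
const WINDOW = 20;

type Trend = "improving" | "stable" | "declining";

class TrendTracker {
  private scores: number[] = [];

  record(score: number): void {
    this.scores.push(score);
    if (this.scores.length > WINDOW) this.scores.shift();
  }

  trend(): Trend {
    if (this.scores.length < 4) return "stable"; // not enough data yet
    const mid = Math.floor(this.scores.length / 2);
    const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
    const delta = avg(this.scores.slice(mid)) - avg(this.scores.slice(0, mid));
    if (delta > 2) return "improving";
    if (delta < -2) return "declining";
    return "stable";
  }
}

// Security scores sliding downward across recent PRs:
const security = new TrendTracker();
[82, 80, 78, 75, 73, 70].forEach((s) => security.record(s));
security.trend(); // "declining"
```

A declining result is what triggers the "downward trend" note on the next review.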
Scoring Profiles
Different teams have different quality standards, and named profiles encode them.
The scoring engine accepts a profile name and applies the corresponding thresholds and weights. This makes the tool configurable without requiring teams to understand the underlying math.
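One way to encode such profiles is a name-keyed lookup bundling weights and thresholds together. The profile names and every number below are hypothetical, chosen only to show the shape of the configuration:

```typescript
// Hypothetical named scoring profiles. Names and values are illustrative,
// not the tool's built-in defaults.
interface Profile {
  weights: {
    correctness: number;
    security: number;
    maintainability: number;
    performance: number;
  };
  thresholds: { approve: number; requestChanges: number };
}

const profiles: Record<string, Profile> = {
  strict: {
    weights: { correctness: 0.3, security: 0.4, maintainability: 0.2, performance: 0.1 },
    thresholds: { approve: 90, requestChanges: 75 },
  },
  balanced: {
    weights: { correctness: 0.3, security: 0.3, maintainability: 0.2, performance: 0.2 },
    thresholds: { approve: 80, requestChanges: 60 },
  },
  prototype: {
    weights: { correctness: 0.4, security: 0.3, maintainability: 0.1, performance: 0.2 },
    thresholds: { approve: 70, requestChanges: 50 },
  },
};

function getProfile(name: string): Profile {
  const p = profiles[name];
  if (!p) throw new Error(`Unknown scoring profile: ${name}`);
  return p;
}
```

Teams pick a profile by name ("strict", "balanced") without ever touching weights directly.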
This is chapter 3 of AI Code Review Agent.