Quality Scoring

Quantifying Code Quality Across Dimensions

Why Numbers Drive Behavior

"14 issues found" doesn't help a developer decide whether a PR is safe to merge. "72/100 security score with a downward trend" does. Numbers create accountability, enable automation (block merge below threshold), and track progress over time.

A single aggregate score is tempting but misleading on its own. A PR can be flawless on security but terrible on maintainability, and one number averages that contrast away. Multi-dimensional scoring surfaces these trade-offs explicitly.

Four Quality Dimensions

Dimension | Measures | Key Signals
Correctness | Will the code work as intended? | Null checks, off-by-one, race conditions, unhandled errors
Security | Is the code safe from attackers? | SQL injection, XSS, hardcoded secrets, insecure crypto
Maintainability | Can other developers understand and modify this? | God functions, deep nesting, complexity, naming
Performance | Will this code perform well at scale? | N+1 queries, blocking I/O, missing indexes

Each dimension starts at 100 and loses points for each detected issue. The deduction formula:

deduction = severityWeight * confidence * categoryMultiplier

Severity weights: critical (-25), high (-15), medium (-8), low (-3). These are scaled by the detector's confidence score, so a high-severity issue with 0.72 confidence loses fewer points than one with 0.97 confidence.
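
As a concrete sketch in TypeScript (the name scoreDimension, the clamp to zero, and treating categoryMultiplier as a per-category tuning factor that defaults to 1 are all assumptions, not the exact implementation):

type Severity = "critical" | "high" | "medium" | "low";

interface Issue {
  severity: Severity;
  confidence: number; // detector confidence in [0, 1]
  category: string;   // e.g. "sql-injection", "naming"
}

// Severity weights from the formula above.
const SEVERITY_WEIGHTS: Record<Severity, number> = {
  critical: 25,
  high: 15,
  medium: 8,
  low: 3,
};

function scoreDimension(
  issues: Issue[],
  categoryMultipliers: Record<string, number> = {}, // assumed: defaults to 1 per category
): number {
  let score = 100;
  for (const issue of issues) {
    const multiplier = categoryMultipliers[issue.category] ?? 1;
    score -= SEVERITY_WEIGHTS[issue.severity] * issue.confidence * multiplier;
  }
  return Math.max(0, Math.round(score)); // never report below 0
}

Working through the example above: a high-severity issue at 0.97 confidence deducts 15 * 0.97 ≈ 14.6 points, versus 15 * 0.72 = 10.8 points at 0.72 confidence.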

Weighted Aggregation

The overall score is a weighted average of the four dimensions. Default weights reflect enterprise priorities:

  • Security: 35% -- A security vulnerability can compromise the entire system
  • Correctness: 30% -- Bugs directly affect users
  • Maintainability: 20% -- Long-term health matters but isn't urgent
  • Performance: 15% -- Performance issues are usually less severe than correctness or security

These weights are configurable. A fintech company might set security to 50%. A game studio might boost performance to 30%. A startup in rapid iteration might reduce maintainability to 10%.
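
A sketch of the aggregation, assuming weights are expressed as fractions summing to 1 (the DimensionScores shape and function name are illustrative):

interface DimensionScores {
  correctness: number;
  security: number;
  maintainability: number;
  performance: number;
}

// Default weights from the list above.
const DEFAULT_WEIGHTS: DimensionScores = {
  security: 0.35,
  correctness: 0.3,
  maintainability: 0.2,
  performance: 0.15,
};

function overallScore(
  scores: DimensionScores,
  weights: DimensionScores = DEFAULT_WEIGHTS,
): number {
  return Math.round(
    scores.security * weights.security +
      scores.correctness * weights.correctness +
      scores.maintainability * weights.maintainability +
      scores.performance * weights.performance,
  );
}

For example, scores of 60 security, 90 correctness, 85 maintainability, and 95 performance aggregate to 0.35 * 60 + 0.30 * 90 + 0.20 * 85 + 0.15 * 95 = 79.25, which rounds to 79.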

Thresholds and Verdicts

Each dimension gets a verdict based on configurable thresholds:

  • Pass (>= 80) -- No significant issues, safe to merge
  • Warn (60-79) -- Has issues worth addressing, but not blocking
  • Fail (< 60) -- Significant issues that should be fixed before merging

The overall verdict follows the strictest dimension: if any dimension fails, the PR gets "request changes." This prevents a PR with excellent correctness but critical security issues from slipping through.
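
In code, the verdict mapping might look like this (thresholds taken from the list above; the fail-first check is one plausible way to implement "strictest dimension wins"):

type Verdict = "pass" | "warn" | "fail";

function verdictFor(score: number, passAt = 80, warnAt = 60): Verdict {
  if (score >= passAt) return "pass";
  if (score >= warnAt) return "warn";
  return "fail";
}

// Strictest dimension wins: one failing dimension fails the whole PR.
function overallVerdict(dimensionVerdicts: Verdict[]): Verdict {
  if (dimensionVerdicts.includes("fail")) return "fail";
  if (dimensionVerdicts.includes("warn")) return "warn";
  return "pass";
}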

Trend Tracking

Individual PR scores are snapshots. Trends tell the story of codebase health over time.

The trend tracker maintains a rolling window of the last 20 PR scores and computes:

  • Moving average -- Smooths out individual PR variance
  • Linear regression slope -- Positive means improving, negative means declining
  • Per-dimension trends -- "Security is improving but maintainability is declining"
  • Velocity -- How fast the trend is changing

When the repository's security trend is declining, the reviewer adds a note: "This PR continues a downward trend in security scores." This turns a reactive tool (review this PR) into a proactive one (your codebase is getting worse).
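
A minimal tracker sketch (class and method names are assumptions; the slope is an ordinary least-squares fit over the window, and velocity could be derived as the change in slope between windows):

class TrendTracker {
  private static readonly WINDOW = 20; // rolling window of the last 20 PR scores
  private scores: number[] = [];

  record(score: number): void {
    this.scores.push(score);
    if (this.scores.length > TrendTracker.WINDOW) this.scores.shift();
  }

  movingAverage(): number {
    if (this.scores.length === 0) return 0;
    return this.scores.reduce((a, b) => a + b, 0) / this.scores.length;
  }

  // Least-squares slope over (index, score) pairs:
  // positive means improving, negative means declining.
  slope(): number {
    const n = this.scores.length;
    if (n < 2) return 0;
    const meanX = (n - 1) / 2;
    const meanY = this.movingAverage();
    let num = 0;
    let den = 0;
    this.scores.forEach((y, x) => {
      num += (x - meanX) * (y - meanY);
      den += (x - meanX) ** 2;
    });
    return num / den;
  }
}

Per-dimension trends fall out naturally: run one tracker per dimension and compare their slopes.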

Scoring Profiles

Different teams have different quality standards. Named profiles encode these:

  • Strict -- Pass >= 90, security weight 50%, zero tolerance for critical issues
  • Standard -- Pass >= 80, default weights, balanced approach
  • Relaxed -- Pass >= 70, even weights, appropriate for early-stage rapid iteration

The scoring engine accepts a profile name and applies the corresponding thresholds and weights. This makes the tool configurable without requiring teams to understand the underlying math.
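
Encoded as data, the profiles might look like this (only the pass thresholds, the strict profile's 50% security weight, and the relaxed profile's even weights come from the text; the remaining strict weights and the blockOnCritical flag name are assumptions):

interface ScoringProfile {
  passThreshold: number;
  weights: DimensionScores; // reuses the interface from the aggregation sketch
  blockOnCritical: boolean; // "zero tolerance for critical issues"
}

const PROFILES: Record<string, ScoringProfile> = {
  strict: {
    passThreshold: 90,
    // Assumed split for the remaining 50% after security.
    weights: { security: 0.5, correctness: 0.25, maintainability: 0.15, performance: 0.1 },
    blockOnCritical: true,
  },
  standard: {
    passThreshold: 80,
    weights: DEFAULT_WEIGHTS, // the 35/30/20/15 defaults
    blockOnCritical: false,
  },
  relaxed: {
    passThreshold: 70,
    weights: { security: 0.25, correctness: 0.25, maintainability: 0.25, performance: 0.25 },
    blockOnCritical: false,
  },
};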

This is chapter 3 of AI Code Review Agent.

Get the full hands-on course -- free during early access. Build the complete system. Your projects become your portfolio.

View course details