Quality Scoring

Quantifying Code Quality Across Dimensions

Why Numbers Drive Behavior

"14 issues found" doesn't help a developer decide whether a PR is safe to merge. "72/100 security score with a downward trend" does. Numbers create accountability, enable automation (block merge below threshold), and track progress over time.

A single aggregate score is tempting but misleading on its own. A PR can be flawless on security but terrible on maintainability, and one number averages that contrast away. Multi-dimensional scoring surfaces these trade-offs explicitly.

Four Quality Dimensions

Dimension | Measures | Key Signals
Correctness | Will the code work as intended? | Null checks, off-by-one, race conditions, unhandled errors
Security | Is the code safe from attackers? | SQL injection, XSS, hardcoded secrets, insecure crypto
Maintainability | Can other developers understand and modify this? | God functions, deep nesting, complexity, naming
Performance | Will this code perform well at scale? | N+1 queries, blocking I/O, missing indexes

Each dimension starts at 100 and loses points for each detected issue. The deduction formula:

deduction = severityWeight * confidence * categoryMultiplier

Severity weights: critical (-25), high (-15), medium (-8), low (-3). These are scaled by the detector's confidence score, so a high-severity issue with 0.72 confidence loses fewer points than one with 0.97 confidence.
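
As a concrete sketch in TypeScript (the name scoreDimension, the clamp to zero, and treating categoryMultiplier as a per-category tuning factor that defaults to 1 are all assumptions, not the exact implementation):

type Severity = "critical" | "high" | "medium" | "low";

interface Issue {
  severity: Severity;
  confidence: number; // detector confidence in [0, 1]
  category: string;   // e.g. "sql-injection", "naming"
}

// Severity weights from the formula above.
const SEVERITY_WEIGHTS: Record<Severity, number> = {
  critical: 25,
  high: 15,
  medium: 8,
  low: 3,
};

function scoreDimension(
  issues: Issue[],
  categoryMultipliers: Record<string, number> = {}, // assumed: defaults to 1 per category
): number {
  let score = 100;
  for (const issue of issues) {
    const multiplier = categoryMultipliers[issue.category] ?? 1;
    score -= SEVERITY_WEIGHTS[issue.severity] * issue.confidence * multiplier;
  }
  return Math.max(0, Math.round(score)); // never report below 0
}

Working through the example above: a high-severity issue at 0.97 confidence deducts 15 * 0.97 ≈ 14.6 points, versus 15 * 0.72 = 10.8 points at 0.72 confidence.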

Weighted Aggregation

The overall score is a weighted average of the four dimensions. Default weights reflect enterprise priorities:

  • Security: 35% -- A security vulnerability can compromise the entire system
  • Correctness: 30% -- Bugs directly affect users
  • Maintainability: 20% -- Long-term health matters but isn't urgent
  • Performance: 15% -- Performance issues are usually less severe than correctness or security

These weights are configurable. A fintech company might set security to 50%. A game studio might boost performance to 30%. A startup in rapid iteration might reduce maintainability to 10%.
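
A sketch of the aggregation, assuming weights are expressed as fractions summing to 1 (the DimensionScores shape and function name are illustrative):

interface DimensionScores {
  correctness: number;
  security: number;
  maintainability: number;
  performance: number;
}

// Default weights from the list above.
const DEFAULT_WEIGHTS: DimensionScores = {
  security: 0.35,
  correctness: 0.3,
  maintainability: 0.2,
  performance: 0.15,
};

function overallScore(
  scores: DimensionScores,
  weights: DimensionScores = DEFAULT_WEIGHTS,
): number {
  return Math.round(
    scores.security * weights.security +
      scores.correctness * weights.correctness +
      scores.maintainability * weights.maintainability +
      scores.performance * weights.performance,
  );
}

For example, scores of 60 security, 90 correctness, 85 maintainability, and 95 performance aggregate to 0.35 * 60 + 0.30 * 90 + 0.20 * 85 + 0.15 * 95 = 79.25, which rounds to 79.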

Thresholds and Verdicts

Each dimension gets a verdict based on configurable thresholds:

  • Pass (>= 80) -- No significant issues, safe to merge
  • Warn (60-79) -- Has issues worth addressing, but not blocking
  • Fail (< 60) -- Significant issues that should be fixed before merging

The overall verdict follows the strictest dimension: if any dimension fails, the PR gets "request changes." This prevents a PR with excellent correctness but critical security issues from slipping through.
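
In code, the verdict mapping might look like this (thresholds taken from the list above; the fail-first check is one plausible way to implement "strictest dimension wins"):

type Verdict = "pass" | "warn" | "fail";

function verdictFor(score: number, passAt = 80, warnAt = 60): Verdict {
  if (score >= passAt) return "pass";
  if (score >= warnAt) return "warn";
  return "fail";
}

// Strictest dimension wins: one failing dimension fails the whole PR.
function overallVerdict(dimensionVerdicts: Verdict[]): Verdict {
  if (dimensionVerdicts.includes("fail")) return "fail";
  if (dimensionVerdicts.includes("warn")) return "warn";
  return "pass";
}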

Trend Tracking

Individual PR scores are snapshots. Trends tell the story of codebase health over time.

The trend tracker maintains a rolling window of the last 20 PR scores and computes:

  • Moving average -- Smooths out individual PR variance
  • Linear regression slope -- Positive means improving, negative means declining
  • Per-dimension trends -- "Security is improving but maintainability is declining"
  • Velocity -- How fast the trend is changing

When the repository's security trend is declining, the reviewer adds a note: "This PR continues a downward trend in security scores." This turns a reactive tool (review this PR) into a proactive one (your codebase is getting worse).
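
A minimal tracker sketch (class and method names are assumptions; the slope is an ordinary least-squares fit over the window, and velocity could be derived as the change in slope between windows):

class TrendTracker {
  private static readonly WINDOW = 20; // rolling window of the last 20 PR scores
  private scores: number[] = [];

  record(score: number): void {
    this.scores.push(score);
    if (this.scores.length > TrendTracker.WINDOW) this.scores.shift();
  }

  movingAverage(): number {
    if (this.scores.length === 0) return 0;
    return this.scores.reduce((a, b) => a + b, 0) / this.scores.length;
  }

  // Least-squares slope over (index, score) pairs:
  // positive means improving, negative means declining.
  slope(): number {
    const n = this.scores.length;
    if (n < 2) return 0;
    const meanX = (n - 1) / 2;
    const meanY = this.movingAverage();
    let num = 0;
    let den = 0;
    this.scores.forEach((y, x) => {
      num += (x - meanX) * (y - meanY);
      den += (x - meanX) ** 2;
    });
    return num / den;
  }
}

Per-dimension trends fall out naturally: run one tracker per dimension and compare their slopes.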

Scoring Profiles

Different teams have different quality standards. Named profiles encode these:

  • Strict -- Pass >= 90, security weight 50%, zero tolerance for critical issues
  • Standard -- Pass >= 80, default weights, balanced approach
  • Relaxed -- Pass >= 70, even weights, appropriate for early-stage rapid iteration

The scoring engine accepts a profile name and applies the corresponding thresholds and weights. This makes the tool configurable without requiring teams to understand the underlying math.
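
Encoded as data, the profiles might look like this (only the pass thresholds, the strict profile's 50% security weight, and the relaxed profile's even weights come from the text; the remaining strict weights and the blockOnCritical flag name are assumptions):

interface ScoringProfile {
  passThreshold: number;
  weights: DimensionScores; // reuses the interface from the aggregation sketch
  blockOnCritical: boolean; // "zero tolerance for critical issues"
}

const PROFILES: Record<string, ScoringProfile> = {
  strict: {
    passThreshold: 90,
    // Assumed split for the remaining 50% after security.
    weights: { security: 0.5, correctness: 0.25, maintainability: 0.15, performance: 0.1 },
    blockOnCritical: true,
  },
  standard: {
    passThreshold: 80,
    weights: DEFAULT_WEIGHTS, // the 35/30/20/15 defaults
    blockOnCritical: false,
  },
  relaxed: {
    passThreshold: 70,
    weights: { security: 0.25, correctness: 0.25, maintainability: 0.25, performance: 0.25 },
    blockOnCritical: false,
  },
};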

This is chapter 3 of AI Code Review Agent.

Get the full hands-on course -- free during early access. Build the complete system. Your projects become your portfolio.

View course details