# Compare
The `compare` command computes deltas between two evaluation runs for A/B testing.
Run two evaluations and compare them:
```bash
agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl
```

## Options
| Option | Description |
|---|---|
| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
| `--format`, `-f` | Output format: `table` (default) or `json` |
| `--json` | Shorthand for `--format=json` |
## How It Works

- **Load Results** — reads both JSONL files containing evaluation results
- **Match by `test_id`** — pairs results with matching `test_id` fields
- **Compute Deltas** — calculates `delta = score2 - score1` for each pair
- **Compute Normalized Gain** — calculates `g = delta / (1 - score1)` for each pair (see below)
- **Classify Outcomes**:
  - **win**: `delta >= threshold` (candidate better)
  - **loss**: `delta <= -threshold` (baseline better)
  - **tie**: `|delta| < threshold` (no significant difference)
- **Output Summary** — prints a human-readable table or JSON
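The matching and classification steps above can be sketched in Python. This is a minimal illustration of the documented behavior, not agentv's actual implementation; the `score` field name on the input records is an assumption for the sketch.

```python
def compare_results(baseline, candidate, threshold=0.1):
    """Pair results by test_id, compute delta and normalized gain, classify."""
    by_id = {r["test_id"]: r for r in candidate}
    matched = []
    for r1 in baseline:
        r2 = by_id.get(r1["test_id"])
        if r2 is None:
            continue  # unmatched: present only in the baseline file
        delta = r2["score"] - r1["score"]
        # Normalized gain is undefined when the baseline is already perfect.
        g = None if r1["score"] >= 1.0 else delta / (1.0 - r1["score"])
        if delta >= threshold:
            outcome = "win"
        elif delta <= -threshold:
            outcome = "loss"
        else:
            outcome = "tie"
        matched.append({"test_id": r1["test_id"], "delta": delta,
                        "normalized_gain": g, "outcome": outcome})
    return matched

pairs = compare_results(
    [{"test_id": "fix-cwd-bug", "score": 0.0}],
    [{"test_id": "fix-cwd-bug", "score": 0.6}],
)
print(pairs[0]["outcome"])  # win (delta +0.6 >= default threshold 0.1)
```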
## Normalized Gain (g)

In addition to the raw delta, `compare` reports a normalized gain (`g`):

```
g = (score_candidate − score_baseline) / (1 − score_baseline)
```

`g` measures improvement relative to the remaining headroom rather than as an absolute number. This matters when baselines differ across tasks:
| Baseline | Candidate | Δ | g | Interpretation |
|---|---|---|---|---|
| 0.10 | 0.55 | +0.45 | +0.50 | Captured 50% of remaining headroom |
| 0.90 | 0.95 | +0.05 | +0.50 | Same proportional gain despite smaller Δ |
| 0.50 | 0.25 | −0.25 | −0.50 | Regression: lost 50% of headroom |
`g` is `null` when the baseline is already 1.0 (no headroom left to improve). Null values are excluded from the mean.
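The formula, the perfect-baseline case, and the null-excluding mean can be sketched as follows (an illustrative sketch, not agentv's code; `normalized_gain` and `mean_gain` are hypothetical names):

```python
def normalized_gain(baseline: float, candidate: float):
    """g = delta / remaining headroom; undefined (None) at a perfect baseline."""
    if baseline >= 1.0:
        return None  # no headroom left to improve
    return (candidate - baseline) / (1.0 - baseline)

def mean_gain(pairs):
    """Mean of g over (baseline, candidate) pairs, excluding None values."""
    gains = [g for b, c in pairs if (g := normalized_gain(b, c)) is not None]
    return sum(gains) / len(gains) if gains else None

# The worked rows from the table above:
print(round(normalized_gain(0.10, 0.55), 2))  # 0.5
print(round(normalized_gain(0.90, 0.95), 2))  # 0.5
print(normalized_gain(1.0, 1.0))              # None
```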
## Output Formats

### Table Format (default)

```
Comparing: baseline/ → candidate/

Test ID              Baseline  Candidate  Delta     Result
───────────────────  ────────  ─────────  ────────  ────────
fix-cwd-bug          0.00      0.60       +0.60     ✓ win
spec-driven-impl     0.40      0.80       +0.40     ✓ win
multi-file-refactor  0.60      0.40       -0.20     ✗ loss

Summary: 2 wins, 1 loss, 0 ties | Mean Δ: +0.267 | g: +0.256 | Status: improved
```

Wins are highlighted green, losses red, and ties gray. Colors are automatically disabled when output is piped or `NO_COLOR` is set.
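The color-disabling rule described above is the common no-color.org convention; a typical check looks like the sketch below (an assumption about how such a check is usually written, not agentv's actual code):

```python
import os
import sys

def colors_enabled(stream=None, env=None):
    """True only when writing to a real terminal with NO_COLOR unset.

    Per the no-color.org convention, the mere presence of the NO_COLOR
    environment variable (even empty) disables color.
    """
    stream = stream if stream is not None else sys.stdout
    env = env if env is not None else os.environ
    return stream.isatty() and "NO_COLOR" not in env

class FakeTty:
    def isatty(self):
        return True

print(colors_enabled(FakeTty(), env={}))                # True
print(colors_enabled(FakeTty(), env={"NO_COLOR": ""}))  # False: presence disables
```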
### JSON Format

Use `--json` or `--format=json` for machine-readable output. Fields use snake_case for Python ecosystem compatibility:

```json
{
  "matched": [
    {
      "test_id": "fix-cwd-bug",
      "score1": 0.0,
      "score2": 0.6,
      "delta": 0.6,
      "normalized_gain": 0.6,
      "outcome": "win"
    }
  ],
  "unmatched": { "file1": 0, "file2": 0 },
  "summary": {
    "total": 6,
    "matched": 3,
    "wins": 2,
    "losses": 1,
    "ties": 0,
    "mean_delta": 0.267,
    "mean_normalized_gain": 0.256
  }
}
```

## Exit Codes
| Code | Meaning |
|---|---|
| `0` | Candidate is equal or better (mean delta >= 0) |
| `1` | Baseline is better (regression detected) |
Use exit codes to gate CI pipelines — a non-zero exit signals regression.
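For gates stricter than the built-in exit code, the JSON summary can be inspected directly. The sketch below is hypothetical (the `gate` function and its `max_losses` budget are invented for illustration); field names come from the JSON format shown above:

```python
import json

SAMPLE = """{"summary": {"wins": 2, "losses": 1, "ties": 0,
             "mean_delta": 0.267, "mean_normalized_gain": 0.256}}"""

def gate(report: dict, max_losses: int = 0) -> int:
    """Return a CI exit code: 0 to pass, 1 to fail."""
    s = report["summary"]
    if s["mean_delta"] < 0:
        return 1  # overall regression, mirrors agentv's own exit code
    if s["losses"] > max_losses:
        return 1  # stricter than the built-in gate: cap individual losses
    return 0

report = json.loads(SAMPLE)
print(gate(report))                # 1: one loss exceeds the default budget of 0
print(gate(report, max_losses=1))  # 0: within budget
```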
## Workflow Examples

### Model Comparison

Compare different model versions:
```bash
# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl

# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl

# Compare results
agentv compare baseline.jsonl candidate.jsonl
```

### Prompt Optimization
Compare before/after prompt changes:
```bash
# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl

# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl

# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05
```

### CI Quality Gate
Fail CI if the candidate regresses:
```bash
#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."
```

## Tips

- **Threshold selection** — the default 0.1 means a score difference of at least 0.1 is required for a win or loss. Use stricter thresholds (e.g. 0.05) for critical evaluations.
- **Normalized gain vs delta** — use `g` to compare across tasks with different baseline difficulty; use `Δ` for absolute improvement tracking.
- **Unmatched results** — check the `unmatched` counts in the JSON output to identify tests that exist in only one file.
- **Multiple comparisons** — compare against multiple baselines by running the command once per baseline.