# Compare
The `compare` command computes deltas between two evaluation runs for A/B testing.
Run two evaluations and compare them:
```bash
agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl
```

## Options
| Option | Description |
|---|---|
| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
| `--format`, `-f` | Output format: `table` (default) or `json` |
| `--json` | Shorthand for `--format=json` |
## How It Works

- **Load Results** — reads both JSONL files containing evaluation results
- **Match by `test_id`** — pairs results with matching `test_id` fields
- **Compute Deltas** — calculates `delta = score2 - score1` for each pair
- **Compute Normalized Gain** — calculates `g = delta / (1 - score1)` for each pair (see below)
- **Classify Outcomes**:
  - **win**: `delta >= threshold` (candidate better)
  - **loss**: `delta <= -threshold` (baseline better)
  - **tie**: `|delta| < threshold` (no significant difference)
- **Output Summary** — prints a human-readable table or JSON
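The matching and classification steps above can be sketched in Python. This is a minimal illustration of the documented behavior, not agentv's actual implementation; the `score` field name on the input records is an assumption for the sketch.

```python
def compare_results(baseline, candidate, threshold=0.1):
    """Pair results by test_id, compute delta and normalized gain, classify."""
    by_id = {r["test_id"]: r for r in candidate}
    matched = []
    for r1 in baseline:
        r2 = by_id.get(r1["test_id"])
        if r2 is None:
            continue  # unmatched: present only in the baseline file
        delta = r2["score"] - r1["score"]
        # Normalized gain is undefined when the baseline is already perfect.
        g = None if r1["score"] >= 1.0 else delta / (1.0 - r1["score"])
        if delta >= threshold:
            outcome = "win"
        elif delta <= -threshold:
            outcome = "loss"
        else:
            outcome = "tie"
        matched.append({"test_id": r1["test_id"], "delta": delta,
                        "normalized_gain": g, "outcome": outcome})
    return matched

pairs = compare_results(
    [{"test_id": "fix-cwd-bug", "score": 0.0}],
    [{"test_id": "fix-cwd-bug", "score": 0.6}],
)
print(pairs[0]["outcome"])  # win (delta +0.6 >= default threshold 0.1)
```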
## Normalized Gain (g)

In addition to the raw delta, `compare` reports a normalized gain (`g`):

```
g = (score_candidate − score_baseline) / (1 − score_baseline)
```

`g` measures improvement relative to the remaining headroom rather than as an absolute number. This matters when baselines differ across tasks:
| Baseline | Candidate | Δ | g | Interpretation |
|---|---|---|---|---|
| 0.10 | 0.55 | +0.45 | +0.50 | Captured 50% of remaining headroom |
| 0.90 | 0.95 | +0.05 | +0.50 | Same proportional gain despite smaller Δ |
| 0.50 | 0.25 | −0.25 | −0.50 | Regression: lost 50% of headroom |
`g` is `null` when the baseline is already 1.0 (no headroom left to improve). Null values are excluded from the mean.
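The formula, the perfect-baseline case, and the null-excluding mean can be sketched as follows (an illustrative sketch, not agentv's code; `normalized_gain` and `mean_gain` are hypothetical names):

```python
def normalized_gain(baseline: float, candidate: float):
    """g = delta / remaining headroom; undefined (None) at a perfect baseline."""
    if baseline >= 1.0:
        return None  # no headroom left to improve
    return (candidate - baseline) / (1.0 - baseline)

def mean_gain(pairs):
    """Mean of g over (baseline, candidate) pairs, excluding None values."""
    gains = [g for b, c in pairs if (g := normalized_gain(b, c)) is not None]
    return sum(gains) / len(gains) if gains else None

# The worked rows from the table above:
print(round(normalized_gain(0.10, 0.55), 2))  # 0.5
print(round(normalized_gain(0.90, 0.95), 2))  # 0.5
print(normalized_gain(1.0, 1.0))              # None
```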
## Output Formats

### Table Format (default)

```
Comparing: baseline/ → candidate/

Test ID              Baseline  Candidate  Delta     Result
───────────────────  ────────  ─────────  ────────  ────────
fix-cwd-bug          0.00      0.60       +0.60     ✓ win
spec-driven-impl     0.40      0.80       +0.40     ✓ win
multi-file-refactor  0.60      0.40       -0.20     ✗ loss

Summary: 2 wins, 1 loss, 0 ties | Mean Δ: +0.267 | g: +0.256 | Status: improved
```

Wins are highlighted green, losses red, and ties gray. Colors are automatically disabled when output is piped or `NO_COLOR` is set.
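The color-disabling rule described above is the common no-color.org convention; a typical check looks like the sketch below (an assumption about how such a check is usually written, not agentv's actual code):

```python
import os
import sys

def colors_enabled(stream=None, env=None):
    """True only when writing to a real terminal with NO_COLOR unset.

    Per the no-color.org convention, the mere presence of the NO_COLOR
    environment variable (even empty) disables color.
    """
    stream = stream if stream is not None else sys.stdout
    env = env if env is not None else os.environ
    return stream.isatty() and "NO_COLOR" not in env

class FakeTty:
    def isatty(self):
        return True

print(colors_enabled(FakeTty(), env={}))                # True
print(colors_enabled(FakeTty(), env={"NO_COLOR": ""}))  # False: presence disables
```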
### JSON Format

Use `--json` or `--format=json` for machine-readable output. Fields use snake_case for Python ecosystem compatibility:

```json
{
  "matched": [
    {
      "test_id": "fix-cwd-bug",
      "score1": 0.0,
      "score2": 0.6,
      "delta": 0.6,
      "normalized_gain": 0.6,
      "outcome": "win"
    }
  ],
  "unmatched": { "file1": 0, "file2": 0 },
  "summary": {
    "total": 6,
    "matched": 3,
    "wins": 2,
    "losses": 1,
    "ties": 0,
    "mean_delta": 0.267,
    "mean_normalized_gain": 0.256
  }
}
```

## Exit Codes
| Code | Meaning |
|---|---|
| `0` | Candidate is equal or better (mean delta >= 0) |
| `1` | Baseline is better (regression detected) |
Use exit codes to gate CI pipelines — a non-zero exit signals regression.
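For gates stricter than the built-in exit code, the JSON summary can be inspected directly. The sketch below is hypothetical (the `gate` function and its `max_losses` budget are invented for illustration); field names come from the JSON format shown above:

```python
import json

SAMPLE = """{"summary": {"wins": 2, "losses": 1, "ties": 0,
             "mean_delta": 0.267, "mean_normalized_gain": 0.256}}"""

def gate(report: dict, max_losses: int = 0) -> int:
    """Return a CI exit code: 0 to pass, 1 to fail."""
    s = report["summary"]
    if s["mean_delta"] < 0:
        return 1  # overall regression, mirrors agentv's own exit code
    if s["losses"] > max_losses:
        return 1  # stricter than the built-in gate: cap individual losses
    return 0

report = json.loads(SAMPLE)
print(gate(report))                # 1: one loss exceeds the default budget of 0
print(gate(report, max_losses=1))  # 0: within budget
```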
## Workflow Examples

### Model Comparison

Compare different model versions:
```bash
# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl

# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl

# Compare results
agentv compare baseline.jsonl candidate.jsonl
```

### Prompt Optimization
Compare before/after prompt changes:
```bash
# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl

# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl

# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05
```

### CI Quality Gate
Fail CI if the candidate regresses:
```bash
#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."
```

## Tips

- **Threshold selection** — the default 0.1 means a score difference of at least 0.1 is required for a win or loss. Use stricter thresholds (e.g. 0.05) for critical evaluations.
- **Normalized gain vs delta** — use `g` to compare across tasks with different baseline difficulty; use `Δ` for absolute improvement tracking.
- **Unmatched results** — check the `unmatched` counts in the JSON output to identify tests that exist in only one file.
- **Multiple comparisons** — compare against multiple baselines by running the command once per baseline.