Benchmarking
daydream bench scores deep-review findings against
Martian's Code Review Benchmark
on an offline replay subset of 26 PRs.
The pinned PR registry lives in
daydream/benchmark/prs.py:
| Source repo | PR count |
|---|---|
| getsentry/sentry | 6 |
| grafana/grafana | 10 |
| calcom/cal.com | 10 |
| Total | 26 |
For each PR, the benchmark harness runs daydream --non-interactive --base <base_sha> --trajectory <path> <checkout> against a blobless clone, reads the
merged findings from .daydream/deep/merged-items.json, and injects them as
synthetic review comments into the benchmark corpus. Scoring then runs Martian's step2 (extract),
step2.5 (dedup), and step3 (judge) modules, producing micro-averaged precision
and recall against the golden labels.
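Micro-averaging pools the judge's counts across all PRs before dividing, so a PR with many findings weighs more than one with few. The sketch below illustrates that arithmetic only. The `PRCounts` shape and field names are assumptions for illustration, not the schema Martian's step3 actually emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PRCounts:
    """Judge verdict tallies for one PR (hypothetical shape)."""
    true_positives: int   # findings matched to a golden label
    false_positives: int  # findings with no matching golden label
    false_negatives: int  # golden labels that no finding matched


def micro_average(per_pr: list[PRCounts]) -> tuple[float, float]:
    """Pool counts across all PRs, then compute precision and recall once."""
    tp = sum(c.true_positives for c in per_pr)
    fp = sum(c.false_positives for c in per_pr)
    fn = sum(c.false_negatives for c in per_pr)
    # Guard against empty denominators (no findings, or no golden labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Two illustrative PRs: pooled tp=4, fp=5, fn=4.
counts = [PRCounts(3, 1, 1), PRCounts(1, 4, 3)]
precision, recall = micro_average(counts)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A macro average would instead compute each PR's precision and recall separately and then average those ratios, which gives every PR equal weight regardless of how many findings it produced.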
The judge requires MARTIAN_API_KEY. The default judge model is
anthropic/claude-opus-4.5. Two judge routes are available:
- Martian/OpenAI-compatible route: drives Martian's step2/step2.5/step3 modules via uv run. Produces evaluations.json with per-PR precision/recall.
- Direct Anthropic route (anthropic_score.py): calls the Anthropic Messages API directly, bypassing the Martian runtime. Same output schema; useful when the Martian runtime is unavailable.
Per-model results land under results/<model-with-slashes-replaced> in the
benchmark repo. Any reviewer backend can be benchmarked, including GLM via
OpenRouter.
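Model identifiers such as anthropic/claude-opus-4.5 contain a slash, which would otherwise create a nested directory under results/. A minimal sketch of the mapping follows. The helper name `results_dir` is hypothetical, and the replacement character is left as a required argument because the text above does not say which one the harness uses.

```python
from pathlib import Path


def results_dir(model: str, replacement: str, root: Path = Path("results")) -> Path:
    """Map a model identifier to a single-level per-model results directory.

    `replacement` is deliberately required: the character actually substituted
    for "/" is not stated in this document, so the caller must supply it.
    """
    if "/" in replacement:
        raise ValueError("replacement must not contain a slash")
    return root / model.replace("/", replacement)


# "-" is an arbitrary illustrative choice, not documented harness behaviour.
print(results_dir("anthropic/claude-opus-4.5", replacement="-"))
```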
Review-bot comparison harness
A standalone harness in
bench/review-bot-compare/
replays daydream against any third-party GitHub review bot (for example CodeRabbit or
Greptile) on real PR history. It enforces a same-snapshot invariant: daydream
reviews the exact code state the external bot reviewed.
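One way to picture the same-snapshot invariant is as a guard that compares commit SHAs before any scoring happens. The sketch below assumes the harness keys on the `original_commit_id` field that GitHub attaches to pull request review comments. Which field the harness actually uses is an assumption, and `assert_same_snapshot` is a hypothetical name.

```python
def assert_same_snapshot(bot_comment: dict, replay_head_sha: str) -> None:
    """Raise if daydream did not review the commit the bot commented on."""
    # Assumption: the bot's reviewed commit is taken from `original_commit_id`.
    bot_sha = bot_comment["original_commit_id"]
    if bot_sha != replay_head_sha:
        raise ValueError(
            f"snapshot mismatch: bot reviewed {bot_sha}, "
            f"replay checked out {replay_head_sha}"
        )


# Illustrative comment payload; the path and SHA are made up.
comment = {"path": "pkg/api/handler.go", "original_commit_id": "abc123"}
assert_same_snapshot(comment, "abc123")  # passes silently when SHAs agree
```

Failing fast on a mismatch keeps a later precision/recall figure from comparing reviews of two different code states.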
The four-stage pipeline communicates via JSON artifacts in
out/<owner__repo>/:
- Harvest (harvest.py): fetches the bot's historical review comments via gh api.
- Replay (replay.py): runs daydream against the exact PR checkout.
- Judge (judge.py): scores overlap between daydream and the external bot using deterministic and LLM-judged comparison.
- Compare (compare.py): produces per-PR precision/recall metrics with cost synthesized from measured tokens.
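The hand-off between stages can be sketched as each stage writing one JSON file into the per-repo directory and the next stage reading it back. The file names (`harvest.json`, `replay.json`), payload keys, and helper names below are hypothetical. Only the out/<owner__repo>/ layout comes from the description above.

```python
import json
import tempfile
from pathlib import Path


def artifact_dir(repo: str, root: Path) -> Path:
    """Turn "owner/repo" into root/owner__repo."""
    owner, name = repo.split("/", 1)
    return root / f"{owner}__{name}"


def write_artifact(stage: str, repo: str, payload: dict, root: Path) -> Path:
    """Persist one stage's output as <stage>.json (hypothetical file name)."""
    directory = artifact_dir(repo, root)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{stage}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path


def read_artifact(stage: str, repo: str, root: Path) -> dict:
    """Load a previous stage's output."""
    return json.loads((artifact_dir(repo, root) / f"{stage}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "out"
    repo = "grafana/grafana"
    # Stage 1 writes; stage 2 reads stage 1's artifact and writes its own.
    write_artifact("harvest", repo, {"comments": []}, root)
    harvested = read_artifact("harvest", repo, root)
    write_artifact("replay", repo, {"findings": [], "bot_comments": len(harvested["comments"])}, root)
    names = sorted(p.name for p in artifact_dir(repo, root).iterdir())
    print(names)
```

Keeping the stages decoupled through files means any one stage can be rerun on its own as long as the upstream artifact is still on disk.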
A self-contained HTML report can be rendered from the comparison results:
make benchmark-report
make benchmark-report BENCH=/path/to/benchmark/offline RUN=opus-4-5-baseline
Each run lands at bench/benchmark-report/runs/<run-id>/, never overwriting a
prior report. The generator auto-discovers every judge model under results/
and writes per-reviewer scorecards with sortable per-PR breakdowns.
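The never-overwrite guarantee can be approximated by refusing to reuse an existing run directory. This is a minimal sketch, not the generator's actual logic: `create_run_dir` is a hypothetical helper, and how real run ids are derived is not covered here.

```python
import tempfile
from pathlib import Path


def create_run_dir(runs_root: Path, run_id: str) -> Path:
    """Create runs_root/<run-id>/, failing if that run already exists."""
    run_dir = runs_root / run_id
    # exist_ok=False makes mkdir raise FileExistsError rather than reuse the directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    runs_root = Path(tmp) / "runs"
    first = create_run_dir(runs_root, "opus-4-5-baseline")
    try:
        create_run_dir(runs_root, "opus-4-5-baseline")
        overwrote = True
    except FileExistsError:
        overwrote = False
    print(first.name, overwrote)
```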
Five repos (Sentry, Grafana, Cal.com, Discourse, and Keycloak) are excluded
from the training corpus
(exclusion.txt)
so the benchmark remains a clean held-out evaluation set.