Benchmarking

daydream bench scores deep-review findings against Martian's Code Review Benchmark on an offline replay subset of 26 PRs.

The pinned PR registry lives in daydream/benchmark/prs.py:

Source repo        PR count
getsentry/sentry   6
grafana/grafana    10
calcom/cal.com     10
Total              26

For each PR, the benchmark harness runs daydream --non-interactive --base <base_sha> --trajectory <path> <checkout> against a blobless clone, reads the merged findings from .daydream/deep/merged-items.json, and injects them as synthetic review comments into the benchmark corpus. Scoring then runs Martian's step2 (extract), step2.5 (dedup), and step3 (judge) modules, producing micro-averaged precision and recall against the golden labels.
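
To make "micro-averaged" concrete, here is a minimal sketch in Python. It assumes the judge yields per-PR counts of true positives, false positives, and false negatives; the PRCounts class and micro_average function are hypothetical names for illustration, not part of Martian's modules or daydream. Micro-averaging pools the counts across all PRs before dividing, so PRs with more findings carry more weight than they would in a per-PR mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PRCounts:
    """Judge outcome for one PR (hypothetical shape, not Martian's schema)."""
    true_positives: int   # findings that match a golden label
    false_positives: int  # findings that match no golden label
    false_negatives: int  # golden labels that no finding matched


def micro_average(per_pr: list[PRCounts]) -> tuple[float, float]:
    """Pool counts across all PRs, then compute precision and recall once."""
    tp = sum(c.true_positives for c in per_pr)
    fp = sum(c.false_positives for c in per_pr)
    fn = sum(c.false_negatives for c in per_pr)
    # Guard against empty denominators (no findings, or no golden labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts for two PRs, purely for illustration.
counts = [PRCounts(8, 2, 2), PRCounts(1, 3, 0)]
precision, recall = micro_average(counts)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.64 recall=0.82
```

With these made-up counts the pooled precision is 9/14, whereas averaging the two per-PR precisions (0.80 and 0.25) would give 0.525.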

The judge requires MARTIAN_API_KEY. The default judge model is anthropic/claude-opus-4.5. Two judge routes are available:

  • Martian/OpenAI-compatible route: drives Martian's step2/step2.5/step3 modules via uv run. Produces evaluations.json with per-PR precision/recall.
  • Direct Anthropic route (anthropic_score.py): calls the Anthropic Messages API directly, bypassing the Martian runtime. Same output schema; useful when the Martian runtime is unavailable.

Per-model results land under results/<model-with-slashes-replaced> in the benchmark repo. Any reviewer backend can be benchmarked, including GLM via OpenRouter.

Review-bot comparison harness

A standalone harness in bench/review-bot-compare/ replays daydream against a third-party GitHub review bot (for example, CodeRabbit or Greptile) on real PR history. It enforces a same-snapshot invariant: daydream reviews the exact code state that the external bot reviewed.
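
One way to picture the same-snapshot invariant is as a commit comparison made before replay. The sketch below assumes the harness knows the full commit SHA the bot reviewed and the SHA of the replay checkout; SnapshotMismatch and assert_same_snapshot are hypothetical names, and the real harness may enforce the invariant differently.

```python
class SnapshotMismatch(Exception):
    """Raised when the replay checkout differs from the commit the bot reviewed."""


def assert_same_snapshot(bot_reviewed_sha: str, replay_head_sha: str) -> None:
    """Fail unless both reviews target the same full commit SHA."""
    expected = bot_reviewed_sha.strip().lower()
    actual = replay_head_sha.strip().lower()
    if not expected or not actual:
        raise ValueError("both commit SHAs must be non-empty")
    if expected != actual:
        raise SnapshotMismatch(
            f"bot reviewed {expected}, but the replay checkout is at {actual}"
        )


# Placeholder SHAs for illustration only.
assert_same_snapshot("ABC123 ", "abc123")  # passes after normalization
```

Failing fast here keeps a mismatched pair out of the later comparison, where it would show up as misleading disagreement between the two reviewers.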

The four-stage pipeline communicates via JSON artifacts in out/<owner__repo>/:

  1. Harvest (harvest.py): fetches the bot's historical review comments via gh api.
  2. Replay (replay.py): runs daydream against the exact PR checkout.
  3. Judge (judge.py): scores overlap between daydream and the external bot using deterministic and LLM-judged comparison.
  4. Compare (compare.py): produces per-PR precision/recall metrics, with cost estimated from measured token counts.
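
The JSON handoff between stages can be sketched as follows. Only the out/<owner__repo>/ layout comes from the description above; the helper names, the harvest.json filename, and the payload shape are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def artifact_dir(out_root: Path, repo_slug: str) -> Path:
    """Map 'owner/repo' to out_root/'owner__repo'."""
    owner, repo = repo_slug.split("/", 1)
    return out_root / f"{owner}__{repo}"


def write_artifact(directory: Path, name: str, payload: object) -> Path:
    """Serialize one stage's output so that the next stage can read it."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def read_artifact(directory: Path, name: str) -> object:
    """Load the output of an earlier stage."""
    return json.loads((directory / name).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    stage_dir = artifact_dir(Path(tmp) / "out", "example-owner/example-repo")
    # Hypothetical filename and payload; the real artifacts may differ.
    write_artifact(stage_dir, "harvest.json", [{"pr": 1, "body": "possible null dereference"}])
    harvested = read_artifact(stage_dir, "harvest.json")
    stage_dir_name = stage_dir.name
```

Keeping each stage's output on disk means a later stage can be rerun without repeating the earlier ones.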

A self-contained HTML report can be rendered from the comparison results:

make benchmark-report
make benchmark-report BENCH=/path/to/benchmark/offline RUN=opus-4-5-baseline

Each run lands at bench/benchmark-report/runs/<run-id>/, never overwriting a prior report. The generator auto-discovers every judge model under results/ and writes per-reviewer scorecards with sortable per-PR breakdowns.
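
The never-overwrite behavior could be implemented in several ways. A minimal sketch, assuming the run ID is already known, is to let directory creation fail when the run exists; create_run_dir is a hypothetical helper, not the generator's actual code.

```python
import tempfile
from pathlib import Path


def create_run_dir(runs_root: Path, run_id: str) -> Path:
    """Create runs_root/run_id, failing if that run already exists."""
    run_dir = runs_root / run_id
    # exist_ok=False makes mkdir raise FileExistsError instead of reusing the directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    runs_root = Path(tmp) / "runs"
    first = create_run_dir(runs_root, "example-run")
    first_name = first.name
    try:
        create_run_dir(runs_root, "example-run")
        second_attempt_blocked = False
    except FileExistsError:
        second_attempt_blocked = True
```

Relying on mkdir to fail, rather than checking for the directory first, avoids a race between the check and the creation.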

Five repos (Sentry, Grafana, Cal.com, Discourse, and Keycloak) are excluded from the training corpus via exclusion.txt so that the benchmark remains a clean held-out evaluation set.
