daydream

What it is

Daydream is a code-review agent that produces structured training data from its own runs. Use it when you want review findings to land as verified edits instead of a backlog. It reviews diffs using stack-specific Beagle skills, applies fixes gated on the project's test suite, and records every agent interaction as an ATIF (Agent Trajectory Interchange Format) v1.6 trajectory, the same open format emitted by OpenHands, Claude Code, and Codex. A bitemporal corpus pipeline then scores, labels, and projects those trajectories into JSONL datasets for supervised fine-tuning (SFT) and reinforcement learning (RL).

The longer-term goal is an open-weight code-review model trained on Daydream's own trajectory archive, benchmarked against commercial code-review bots and against human reviewers' acted-upon comments on held-out PRs.

Architecture

Review pipeline

Daydream runs in two modes. Deep review (the default) is a staged pipeline:

  1. Exploration pre-scan — tree-sitter import resolution and convention detection
  2. Intent analysis — understands the diff and commit history
  3. Alternative review — identifies improvements as numbered findings
  4. Per-stack reviews — parallel Beagle skill invocations, one per detected stack (Python, React, Elixir, Go, Rust, iOS)
  5. Cross-stack merge — deduplicates per-stack findings into a unified report

After the merge, a verification pass adjudicates each finding against the actual code and attaches a per-finding verdict with evidence. A "contradicts" verdict blocks the fixer from applying that recommendation literally. An optional fix gate then applies the surviving fixes one at a time and validates each result against the project's test suite, with a bounded fix-and-retry heal loop on failure. Shallow review (--shallow) collapses this to a single-skill loop (review → parse → fix → test), which is useful for single-stack projects or when forcing a specific Beagle skill.
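
The fix gate described above can be sketched roughly as follows. This is a minimal illustration, not Daydream's implementation: Finding, run_fix_gate, the callback parameters, the retry limit, and every verdict name other than "contradicts" are hypothetical, and the sketch simply skips a contradicted finding rather than modelling a non-literal fix.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    id: int
    verdict: str  # only "contradicts" comes from the text; other values are hypothetical


def run_fix_gate(
    findings: Iterable[Finding],
    apply_fix: Callable[[Finding], None],
    run_tests: Callable[[], bool],
    heal: Callable[[Finding], None],
    max_heal_attempts: int = 2,  # hypothetical bound on the heal loop
) -> dict[int, str]:
    """Apply surviving fixes one at a time, gating each on the test suite."""
    results: dict[int, str] = {}
    for finding in findings:
        if finding.verdict == "contradicts":
            # The verifier disagreed with the finding, so do not apply it.
            results[finding.id] = "blocked"
            continue
        apply_fix(finding)
        passed = run_tests()
        attempts = 0
        # Bounded fix-and-retry loop: try to heal, then re-run the tests.
        while not passed and attempts < max_heal_attempts:
            heal(finding)
            attempts += 1
            passed = run_tests()
        results[finding.id] = "applied" if passed else "failed"
    return results
```

Passing the test runner in as a callback keeps the gate independent of any particular test framework; a caller could wrap a subprocess call to the project's own test command.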

Trajectory recording

Every run produces an ATIF v1.6 trajectory capturing prompts, responses, tool calls, and per-step token and cost metrics. Parallel fan-outs fork sibling trajectories. Daydream redacts secrets before writing, using ordered rules that cover URL credentials, PEM blocks, env-var assignments, bare API keys, and JWTs, with a redact-or-omit failure mode. Interrupted runs flush a .partial file.
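
As a rough illustration of ordered redaction rules, the sketch below applies one regular expression per secret class in the order listed above. The patterns, placeholder strings, and function names are assumptions made for this example and are much narrower than a production rule set. Here redact-or-omit is read as "return nothing rather than unredacted text when redaction cannot run", which is also an assumption.

```python
import re
from typing import Optional

# Hypothetical rules, applied in order: (name, pattern, replacement).
RULES = [
    ("url_credentials",
     re.compile(r"([a-zA-Z][a-zA-Z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@"),
     r"\1[REDACTED]@"),
    ("pem_block",
     re.compile(r"-----BEGIN [A-Z ]+-----.*?-----END [A-Z ]+-----", re.S),
     "[REDACTED PEM]"),
    ("env_assignment",
     re.compile(r"\b([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD))=\S+"),
     r"\1=[REDACTED]"),
    ("bare_api_key",
     re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
     "[REDACTED]"),
    ("jwt",
     re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
     "[REDACTED JWT]"),
]


def redact(text: str) -> str:
    """Run every rule over the text, in list order."""
    for _name, pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def redact_or_omit(value: object) -> Optional[str]:
    """Return redacted text, or None (omit the field) if redaction fails."""
    try:
        return redact(value)  # raises TypeError for non-string input
    except Exception:
        return None
```

For example, redact("export API_TOKEN=abc123 done") yields "export API_TOKEN=[REDACTED] done", and redact_or_omit(None) yields None so the caller can drop the field.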

Corpus pipeline

A bitemporal pipeline converts archived trajectories into fine-tuning datasets:

  • Harvest assembles bronze signals (verifier verdicts, finding records, grounding rate, review length), scores an intrinsic reward, derives an outcome label, and appends a bitemporal annotation. Re-harvesting with unchanged evidence is an idempotent no-op, while a reward-version bump appends a new generation, so older as_of pins still resolve their original scores.
  • Reward scoring is a pure composite over capture-time signals:
    • Correctness (weight 0.6, the mean over per-finding verifier verdicts) and grounding (weight 0.4) carry all the positive credit. A failed format_valid gate zeroes the composite, and a bounded length penalty (weight 0.2) applies.
    • Missing signals are never imputed as zero: absent axes stay None with presence flags, and weights renormalize over the axes actually measured, so an uninstrumented run can't masquerade as a failed one.
    • Maintainer accept/reject feedback is kept as a sibling axis rather than subtracted into the composite.
    • Only the default weights earn the canonical reward-version stamp — custom weights get a content-hash suffix, so analysis-time sweeps can't contaminate the canonical corpus.
  • Build corpus projects as_of-pinned annotations into JSONL training records, filtered by label, reward threshold, skill, and license, and writes a content-addressed lineage.json manifest for byte-for-byte reproducibility. A temporal-leakage guard excludes from labels any annotation that was written before the pin but encodes a PR-merge outcome dated after it.
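
The reward composite described in the bullets above might look something like the sketch below. The 0.6, 0.4, and 0.2 weights, the format_valid gate, and renormalization over measured axes come from the description. Everything else is an assumption for illustration: the function and constant names, subtracting the length penalty, clamping the result at zero, and returning None when no axis was measured.

```python
from typing import Optional

WEIGHTS = {"correctness": 0.6, "grounding": 0.4}
LENGTH_PENALTY_WEIGHT = 0.2


def composite_reward(
    correctness: Optional[float],
    grounding: Optional[float],
    format_valid: bool,
    length_penalty: float = 0.0,  # assumed to lie in [0, 1]
) -> Optional[float]:
    """Score a run from capture-time signals without imputing missing axes."""
    if not format_valid:
        # The format gate zeroes the composite outright.
        return 0.0
    axes = {"correctness": correctness, "grounding": grounding}
    measured = {name: value for name, value in axes.items() if value is not None}
    if not measured:
        # Nothing was instrumented, so report "unknown" rather than zero.
        return None
    # Renormalize the weights over the axes that were actually measured.
    total_weight = sum(WEIGHTS[name] for name in measured)
    base = sum(WEIGHTS[name] * value for name, value in measured.items()) / total_weight
    # Bounded penalty: clamp the input, then subtract (assumed combination rule).
    penalty = LENGTH_PENALTY_WEIGHT * min(max(length_penalty, 0.0), 1.0)
    return max(0.0, base - penalty)
```

With this sketch, a run that has only a grounding score of 0.5 is scored 0.5 rather than 0.2, because the missing correctness axis drops out of the denominator instead of counting as zero.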

Archive

Each run is archived with its manifest, trajectory, review output, evaluation results, and deep artifacts. A SQLite index supports querying across projects by repo, backend, cost, grounding rate, and outcome label.
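
To show the kind of cross-project query such an index makes possible, here is a small sketch using Python's built-in sqlite3 module. The table name, column names, backend names, label values, and sample rows are all hypothetical; Daydream's actual schema may differ.

```python
import sqlite3

# In-memory database standing in for the archive index (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE runs (
        run_id         TEXT PRIMARY KEY,
        repo           TEXT,
        backend        TEXT,
        cost_usd       REAL,
        grounding_rate REAL,
        outcome_label  TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("run-001", "acme/api", "backend-a", 0.42, 0.91, "accepted"),
        ("run-002", "acme/web", "backend-b", 1.80, 0.95, "accepted"),
        ("run-003", "acme/api", "backend-a", 0.10, 0.40, "rejected"),
    ],
)


def well_grounded_runs(conn, min_grounding, max_cost):
    """Return (run_id, repo) for cheap, well-grounded runs, best first."""
    return conn.execute(
        "SELECT run_id, repo FROM runs "
        "WHERE grounding_rate >= ? AND cost_usd <= ? "
        "ORDER BY grounding_rate DESC",
        (min_grounding, max_cost),
    ).fetchall()
```

Parameterized queries (the ? placeholders) keep filter values out of the SQL string, which matters if repo names or labels ever come from untrusted input.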

Training roadmap and benchmarking

The roadmap targets an open-weight code-review model (Qwen2.5-Coder-7B-Instruct, trained with QLoRA, i.e. quantized low-rank adaptation) via a staged recipe: rejection-filtered SFT, span-segmented SFT on ATIF reasoning and action spans, and KTO (Kahneman-Tversky Optimization) preference training on PR-comment accept/reject labels. The daydream bench command scores deep-review findings against Martian's Code Review Benchmark on an offline replay subset of 26 PRs from Sentry, Grafana, and Cal.com, tracking precision and recall against the model's targets.
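
For reference, precision and recall over review findings reduce to the set arithmetic below. This is a generic sketch, not the benchmark's scoring code: it assumes each finding has already been matched to a reference comment and given a shared identifier, and how that matching is done is not specified here. The function name and identifiers are hypothetical.

```python
def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Compute precision and recall for matched finding identifiers."""
    true_positives = len(predicted & reference)
    # Precision: share of reported findings that match a reference comment.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference comments that the review surfaced.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

A review that reports three findings, two of which match a set of four reference comments, scores a precision of 2/3 and a recall of 1/2.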

Run a review

Daydream runs as a CLI agent against a working tree or a pull-request diff. The golden path is a deep, multi-stack review that runs review, fix, and test in sequence:

daydream /path/to/project

The run ends with a Review Summary table showing issues found, fixes applied, and a tests PASSED/FAILED badge.

Common variations:

  • daydream --comment — post findings as inline PR comments, with no fix or test phase
  • daydream --shallow — run the single-skill loop for single-stack projects

Per-phase model and backend selection, corpus commands, and the full flag surface live in the project's README.

Source

github.com/existential-birds/daydream