Architecture

Review pipeline

Daydream's review runs as one of four registered flows: deep (the default), shallow (--shallow), review (--review / --comment), and pr-feedback (daydream feedback). Each flow is an ordered list of named steps that a small engine (run_flow in daydream/flows/engine.py) executes over a shared context. The built-in steps are seeded into a per-run registry; a fork can insert, remove, replace, or reorder them through the extension seam. The deep flow's step bodies live in daydream/deep/orchestrator.py. Its user-visible steps are:

  1. Exploration pre-scan (exploration): tree-sitter import resolution and convention detection across changed files, implemented in daydream/exploration_runner.py.
  2. Intent analysis (intent): the agent reads the diff and commit history to understand what the change is trying to do.
  3. Alternative review (alternatives): identifies potential improvements as numbered findings, each with file, line, severity, and rationale.
  4. Per-stack reviews (per-stack-reviews): parallel Beagle skill invocations, one per detected stack (Python, React, Elixir, Go, Rust, iOS). Each stack gets its own review agent that runs the matching review-* skill.
  5. Per-stack parse (per-stack-parse): each stack's raw output is parsed into structured findings and cross-checked for duplicates.
  6. Arbiter (arbiter): resolves high-severity and contested findings by re-reading the code. The arbiter fires only when qualifying findings exist; the pre-flight estimate always includes it so the extra model call is never a surprise. An arbiter verdict can revise severity, confidence, description, or rationale. An explicit keep: false drops a finding. A missing or ambiguous verdict fails open: the original finding is retained.
  7. Cross-stack merge (cross-stack-merge): deduplicates per-stack findings into a unified report. A structural evidence gate (_apply_evidence_gate in phases.py) drops any finding that lacks a grounded path:line citation before it reaches merged-items.json. Dropped items are logged to .daydream/deep/dropped-speculative.json for audit. The confidence enum accepts only HIGH and MEDIUM; there is no LOW (no-evidence) value.
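
The arbiter's fail-open rule and the evidence gate described in steps 6 and 7 can be illustrated with a minimal sketch. This is not Daydream's code: the Finding shape, the verdict dictionary, the citation regex, and the function names apply_verdict and evidence_gate are all hypothetical, chosen only to show the two rules side by side.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding shape; the real model likely has more fields."""
    id: int
    description: str
    rationale: str
    severity: str
    confidence: str           # "HIGH" or "MEDIUM"
    citation: Optional[str]   # e.g. "src/app.py:42"


# Assumed citation format: a non-empty path, a colon, then a line number.
CITATION_RE = re.compile(r"^\S+:\d+$")

# The four fields the text says an arbiter verdict may revise.
REVISABLE = ("severity", "confidence", "description", "rationale")


def apply_verdict(finding: Finding, verdict: Optional[dict]) -> Optional[Finding]:
    """Return the (possibly revised) finding, or None if the arbiter dropped it."""
    if not isinstance(verdict, dict):
        return finding                       # missing verdict: fail open
    if verdict.get("keep") is False:
        return None                          # only an explicit keep: false drops
    changes = {k: verdict[k] for k in REVISABLE if k in verdict}
    return replace(finding, **changes)       # ambiguous "keep" values fall through


def evidence_gate(findings):
    """Split findings into (kept, dropped) by presence of a path:line citation."""
    kept, dropped = [], []
    for f in findings:
        grounded = f.citation is not None and CITATION_RE.match(f.citation) is not None
        (kept if grounded else dropped).append(f)
    return kept, dropped
```

In this sketch the dropped list is what would be written to the audit file, and only the kept list would continue to the merged report.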

After the merge, the fix half of the flow runs:

  • Verification (verify): adjudicates each finding against the actual code, attaching a per-finding verdict of consistent, uncertain, or contradicts, with evidence. A contradicts verdict blocks the fixer from applying that recommendation. The pass runs conditionally: it is skipped when no findings need verification.
  • Fix gate (fix-gate, then fix and test): applies the surviving fixes and validates the result against the project's test suite, with a bounded fix-and-retry heal loop on failure. Same-file findings are batched into a single run_agent() call (phase_fix_batched); distinct files run concurrently under an anyio.CapacityLimiter(4).
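
The combination of verdict filtering, same-file batching, and bounded concurrency can be sketched as follows. To keep the example dependency-free it uses asyncio.Semaphore from the standard library as a stand-in for anyio.CapacityLimiter(4); fix_file is a hypothetical placeholder for one batched run_agent() call, and the dictionary-shaped findings are an assumption.

```python
import asyncio
from collections import defaultdict

MAX_CONCURRENT_FILES = 4  # mirrors the limit of 4 described above


async def fix_file(path, findings):
    """Hypothetical stand-in for one batched agent call on a single file."""
    await asyncio.sleep(0)  # placeholder for the real agent work
    return path, [f["id"] for f in findings]


async def fix_all(findings):
    # Findings the verifier marked "contradicts" never reach the fixer.
    fixable = [f for f in findings if f.get("verdict") != "contradicts"]

    # Batch same-file findings so each file is edited by exactly one call.
    by_file = defaultdict(list)
    for f in fixable:
        by_file[f["file"]].append(f)

    limiter = asyncio.Semaphore(MAX_CONCURRENT_FILES)

    async def guarded(path, batch):
        async with limiter:  # at most four files are in flight at once
            return await fix_file(path, batch)

    results = await asyncio.gather(*(guarded(p, b) for p, b in by_file.items()))
    return dict(results)
```

Batching by file avoids two concurrent agents editing the same file, while the limiter caps how many distinct files are processed at the same time.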

The pre-flight agent count for a deep run is 2 + 2N + 2 (that is, 2N + 4), where N is the number of detected stacks: two TTT agents (intent and alternative review), N per-stack review agents, N per-stack parse passes, one merge agent, and one conditional arbiter agent (total_agent_count).
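
As a worked example of that arithmetic, the sketch below recomputes the estimate term by term. The function name preflight_agent_count is hypothetical and is not claimed to match total_agent_count in the codebase.

```python
def preflight_agent_count(num_stacks: int) -> int:
    """Estimate agent calls for a deep run with num_stacks detected stacks."""
    if num_stacks < 0:
        raise ValueError("num_stacks must be non-negative")
    ttt_agents = 2             # intent + alternative review
    review_agents = num_stacks
    parse_passes = num_stacks
    merge_agent = 1
    arbiter_agent = 1          # counted even though it only runs conditionally
    return ttt_agents + review_agents + parse_passes + merge_agent + arbiter_agent
```

For a repository with three detected stacks this gives 2 + 3 + 3 + 1 + 1 = 10.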

Shallow review (--shallow) is the shallow flow: a single-skill loop of review, parse, fix, and test steps, repeated for up to N passes (--loop N). It is useful for single-stack projects or when forcing a specific Beagle skill with --skill.
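
The shallow flow is a convenient way to picture the general model of a flow as an ordered list of named steps run over a shared context. The sketch below is illustrative only: Context, run_steps, replace_step, and run_shallow are hypothetical names, and stopping early when a pass surfaces no findings is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Context:
    """Hypothetical shared context handed to every step."""
    findings: list = field(default_factory=list)
    log: list = field(default_factory=list)


# A step is a (name, body) pair; a flow is an ordered list of steps.
Step = tuple[str, Callable[[Context], None]]


def run_steps(steps, ctx):
    """Execute each step in order against the shared context."""
    for name, body in steps:
        ctx.log.append(name)
        body(ctx)


def replace_step(steps, name, body):
    """Return a new step list with the named step's body swapped out."""
    if all(n != name for n, _ in steps):
        raise KeyError(name)
    return [(n, body if n == name else b) for n, b in steps]


def run_shallow(steps, ctx, max_passes):
    """Repeat the step list up to max_passes times; return passes executed."""
    for passes in range(1, max_passes + 1):
        run_steps(steps, ctx)
        if not ctx.findings:   # assumption: a clean pass ends the loop early
            return passes
    return max_passes
```

Because steps are addressed by name, an extension can swap one body (as replace_step does here) without touching the order of the others.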

Tiny-diff short-circuit: when a deep-mode diff has at most 2 changed files (configurable via shallow_fanout_threshold), the per-language fan-out collapses to a single combined assignment and the merge and arbiter stages are skipped. The evidence gate still applies to the bypass output. A single-file single-language diff still gets a per-language Beagle skill; only diffs with two or more distinct language stacks fall back to the generic reviewer.
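
One way to read the short-circuit rule is as a small planning function. The sketch below is an interpretation, not the project's implementation: plan_review and the keys of the returned dictionary are hypothetical, and treating a tiny diff with no detected stack as a generic review is an assumption.

```python
def plan_review(changed_files, stacks, shallow_fanout_threshold=2):
    """Decide how a deep-mode diff is fanned out.

    changed_files: list of changed paths.
    stacks: set of detected stack names, e.g. {"python", "react"}.
    """
    tiny = len(changed_files) <= shallow_fanout_threshold
    if not tiny:
        # Normal path: one assignment per stack, then merge (and arbiter if needed).
        return {
            "assignments": sorted(stacks),
            "merge": True,
            "arbiter_eligible": True,
            "evidence_gate": True,
        }
    # Tiny diff: a single combined assignment, with merge and arbiter skipped.
    reviewer = next(iter(stacks)) if len(stacks) == 1 else "generic"
    return {
        "assignments": [reviewer],
        "merge": False,
        "arbiter_eligible": False,
        "evidence_gate": True,   # the gate still applies to the bypass output
    }
```

Under these assumptions, a one-file Python change keeps the Python skill, while a two-file change spanning Python and React is handed to the generic reviewer.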

Trajectory recording

Every run produces an ATIF v1.6 trajectory capturing prompts, responses, tool calls, and per-step token and cost metrics. The recorder is the sole owner of ATIF Pydantic model construction; other modules import only its public surface.

Each agent invocation (run_agent) opens one Invocation scope on the shared recorder. Backends emit a unified AgentEvent stream (daydream/backends/__init__.py) that the invocation buffers into ATIF Steps. Event types: TextEvent, ThinkingEvent, ToolStartEvent, ToolResultEvent, CostEvent, MetricsEvent, TurnEndEvent, and ResultEvent. A TurnEndEvent closes the open Step, so a multi-turn invocation produces N Steps rather than one collapsed Step.
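
The effect of TurnEndEvent on step boundaries can be shown with a stripped-down buffer. The two event classes below are simplified stand-ins that reuse the names from the text; their fields, the events_to_steps helper, and the choice to flush trailing text as a final step are assumptions, and real ATIF Steps carry far more than a message string.

```python
from dataclasses import dataclass


@dataclass
class TextEvent:
    text: str


@dataclass
class TurnEndEvent:
    pass


def events_to_steps(events):
    """Buffer events into step messages, closing a step at each turn end."""
    steps = []
    parts = []
    for ev in events:
        if isinstance(ev, TurnEndEvent):
            if parts:                       # close the open step, if any
                steps.append("".join(parts))
                parts = []
        elif isinstance(ev, TextEvent):
            parts.append(ev.text)
        # Other event types are ignored in this sketch.
    if parts:
        steps.append("".join(parts))        # assumption: flush any unterminated turn
    return steps
```

A stream with two turn ends therefore yields two steps rather than one concatenated message.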

Parallel fan-outs fork sibling trajectories under the run directory. Each fork writes a separate trajectory file alongside the parent.

Daydream redacts secrets before writing. Ordered rules cover URL credentials, PEM private key blocks, secret-bearing environment variable assignments, bare API keys (sk-, ghp_, ghs_, xoxb-, AKIA), JWTs, and local username paths. The redactor applies to all four ATIF text surfaces: step messages, reasoning content, tool call arguments, and observation results. On internal failure, the redactor degrades to [REDACTION_FAILED] rather than letting the raw value through.
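
A minimal sketch of ordered, fail-closed redaction follows. The patterns cover only three of the rule families named above and are deliberately simplistic; the replacement tokens other than [REDACTION_FAILED], the RULES list, and the redact function are hypothetical and do not reproduce Daydream's actual rules.

```python
import re

# Ordered rules: earlier patterns run first. All patterns are illustrative.
RULES = [
    # URL credentials, e.g. https://user:password@host
    (re.compile(r"://[^/\s:@]+:[^/\s@]+@"), "://[REDACTED]@"),
    # PEM private key blocks
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "[REDACTED_PEM]",
    ),
    # Bare API keys with well-known prefixes
    (
        re.compile(r"\b(?:sk-|ghp_|ghs_|xoxb-)[A-Za-z0-9_-]{8,}|\bAKIA[A-Z0-9]{16}\b"),
        "[REDACTED_KEY]",
    ),
]


def redact(text):
    """Apply every rule in order; on any internal error, emit a fixed marker."""
    try:
        for pattern, replacement in RULES:
            text = pattern.sub(replacement, text)
        return text
    except Exception:
        # Fail closed: never return the raw value if redaction itself broke.
        return "[REDACTION_FAILED]"
```

The important property is the except branch: a bug in a rule costs the reader some context but never leaks the unredacted value.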

On SIGINT or SIGTERM, the signal handler (daydream/cli.py) flushes a .partial trajectory file with extra.partial set, so interrupted runs are not lost. The handler uses a module-level recorder stack rather than the async ContextVar, because signal handlers fire in the main thread outside the asyncio task context.

Archive

Each run is archived to ~/.daydream/archive/runs/<session_id>/ with its manifest, trajectory (including forks), review output, evaluation results (on by default; skip with --no-eval), deep artifacts, and the diff patch. A separate recommended.patch captures daydream's proposed diff distinct from the PR-under-review diff. Archiving is non-fatal: if it fails, the run still completes and the failure is logged.

A SQLite index at ~/.daydream/archive/index.db (daydream/archive/_schema.py) supports querying across projects. Two tables:

  • runs: 39 columns including repo_slug, backend, total_cost_usd, grounding_rate, outcome_labels, composite_reward, changed_files, head_sha, base_sha, and per-phase backend columns.
  • label_observations: the bitemporal label table. Tracks transaction time (observed_at), valid time (valid_at), evidence SHA, reward version, reward JSON, composite reward, reviewer logins, posterior flag, and source (auto or human). Human labels take precedence over automated labels in every projection.

The index runs in WAL mode with a 5-second busy timeout. Five indexes cover repo_slug, archived_at, outcome_labels, observed_at, and session_id.
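
The connection settings and the human-over-auto precedence rule can be reproduced with the standard sqlite3 module. In this sketch, open_index and latest_reward are hypothetical helpers, and the table is a heavily reduced stand-in for label_observations that borrows four column names from the description above; the real schema and projections are not shown in this document.

```python
import sqlite3


def open_index(path):
    """Open an index database in WAL mode with a 5-second busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")  # milliseconds
    # Reduced, illustrative schema; not the real label_observations table.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS label_observations (
            session_id TEXT NOT NULL,
            source TEXT NOT NULL CHECK (source IN ('auto', 'human')),
            observed_at TEXT NOT NULL,
            composite_reward REAL
        )
        """
    )
    return conn


def latest_reward(conn, session_id):
    """Project one reward per run: human labels win, then the newest observation."""
    row = conn.execute(
        """
        SELECT composite_reward
        FROM label_observations
        WHERE session_id = ?
        ORDER BY (source = 'human') DESC, observed_at DESC
        LIMIT 1
        """,
        (session_id,),
    ).fetchone()
    return None if row is None else row[0]
```

WAL mode lets readers proceed while a writer is active, and the busy timeout makes a second writer wait for up to five seconds instead of failing immediately with a locked-database error.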
