Architecture
Review pipeline
Daydream's review runs as one of four registered flows: deep (the default),
shallow (--shallow), review (--review / --comment), and pr-feedback
(daydream feedback). Each is an ordered list of named steps executed by a small
engine (run_flow in daydream/flows/engine.py) over a shared context. The
built-in steps are seeded into a per-run registry; a fork can insert, remove,
replace, or reorder them through the extension seam. The deep flow's step
bodies live in daydream/deep/orchestrator.py. Its user-visible steps are:
- Exploration pre-scan (exploration): tree-sitter import resolution and convention detection across changed files, implemented in daydream/exploration_runner.py.
- Intent analysis (intent): the agent reads the diff and commit history to understand what the change is trying to do.
- Alternative review (alternatives): identifies potential improvements as numbered findings, each with file, line, severity, and rationale.
- Per-stack reviews (per-stack-reviews): parallel Beagle skill invocations, one per detected stack (Python, React, Elixir, Go, Rust, iOS). Each stack gets its own review agent that runs the matching review-* skill.
- Per-stack parse (per-stack-parse): each stack's raw output is parsed into structured findings and cross-checked for duplicates.
- Arbiter (arbiter): resolves high-severity and contested findings by re-reading the code. The arbiter fires only when qualifying findings exist; the pre-flight estimate always includes it so the extra model call is never a surprise. An arbiter verdict can revise severity, confidence, description, or rationale. An explicit keep: false drops a finding. A missing or ambiguous verdict fails open: the original finding is retained.
- Cross-stack merge (cross-stack-merge): deduplicates per-stack findings into a unified report. A structural evidence gate (_apply_evidence_gate in phases.py) drops any finding that lacks a grounded path:line citation before it reaches merged-items.json. Dropped items are logged to .daydream/deep/dropped-speculative.json for audit. The confidence enum is HIGH/MEDIUM only; LOW/no-evidence is removed.
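The arbiter's fail-open rule and the evidence gate described in the list above can be sketched roughly as follows. This is an illustration only, not daydream's code: the Finding fields, the citation regex, and the function names apply_evidence_gate and apply_verdict are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional


# Hypothetical finding shape; the real schema is not shown in this document.
@dataclass(frozen=True)
class Finding:
    id: int
    description: str
    severity: str
    confidence: str  # "HIGH" or "MEDIUM"
    rationale: str = ""
    evidence: str = ""


# Assumed citation format: a path, a colon, then a line number.
_CITATION = re.compile(r"[\w./-]+:\d+")

# Fields an arbiter verdict may revise, per the description above.
_REVISABLE = ("severity", "confidence", "description", "rationale")


def apply_evidence_gate(findings):
    """Split findings into (kept, dropped) by presence of a path:line citation."""
    kept, dropped = [], []
    for finding in findings:
        target = kept if _CITATION.search(finding.evidence) else dropped
        target.append(finding)
    return kept, dropped


def apply_verdict(finding: Finding, verdict: Optional[dict]) -> Optional[Finding]:
    """Return the (possibly revised) finding, or None when explicitly dropped."""
    if not isinstance(verdict, dict):
        return finding  # missing verdict: fail open, keep the original
    if verdict.get("keep") is False:
        return None  # only an explicit keep: false drops the finding
    changes = {k: verdict[k] for k in _REVISABLE if isinstance(verdict.get(k), str)}
    return replace(finding, **changes) if changes else finding
```

In this sketch the dropped list is what would be written to the audit file, and any verdict that is not an unambiguous keep: false leaves the finding in place.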
After the merge, the fix half of the flow runs:
- Verification (verify): adjudicates each finding against the actual code, attaching a per-finding verdict of consistent, uncertain, or contradicts, with evidence. A contradicts verdict blocks the fixer from applying that recommendation. The pass runs conditionally: it is skipped when no findings need verification.
- Fix gate (fix-gate, then fix and test): applies the surviving fixes and validates the result against the project's test suite, with a bounded fix-and-retry heal loop on failure. Same-file findings are batched into a single run_agent() call (phase_fix_batched); distinct files run concurrently under an anyio.CapacityLimiter(4).
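The batching and concurrency rule in the fix step can be pictured with a small sketch. To stay within the standard library it uses asyncio.Semaphore(4) as a stand-in for anyio.CapacityLimiter(4); the dict-shaped findings, batch_by_file, fix_all, and the fix_file callback are hypothetical names, not daydream's API.

```python
import asyncio
from collections import defaultdict

MAX_CONCURRENT_FILES = 4  # mirrors the limiter size of 4 mentioned above


def batch_by_file(findings):
    """Group findings by file, skipping any the verifier marked 'contradicts'."""
    batches = defaultdict(list)
    for finding in findings:
        if finding.get("verdict") == "contradicts":
            continue  # a contradicts verdict blocks the fix
        batches[finding["file"]].append(finding)
    return dict(batches)


async def fix_all(findings, fix_file):
    """Call fix_file(path, items) once per file, at most four files at a time."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_FILES)

    async def run_one(path, items):
        async with limiter:
            return path, await fix_file(path, items)

    batches = batch_by_file(findings)
    results = await asyncio.gather(*(run_one(p, i) for p, i in batches.items()))
    return dict(results)
```

Grouping by file first means two findings in the same file never race each other, while the limiter caps how many files are being edited at once.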
The pre-flight agent count for a deep run (total_agent_count) is 2 + 2N + 2,
where N is the number of detected stacks: two TTT agents (intent and
alternative review), N per-stack review agents, N per-stack parse passes, one
merge agent, and one conditional arbiter agent.
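As a worked example of the formula above, the count can be written term by term. The function name is borrowed from the text, but the signature and the breakdown into local variables are assumptions for illustration.

```python
def total_agent_count(stack_count: int) -> int:
    """Pre-flight estimate 2 + 2N + 2 for N detected stacks."""
    if stack_count < 0:
        raise ValueError("stack_count must be non-negative")
    intent_and_alternatives = 2       # intent analysis + alternative review
    per_stack_reviews = stack_count   # one review agent per stack
    per_stack_parses = stack_count    # one parse pass per stack
    merge_and_arbiter = 2             # merge agent + conditional arbiter
    return (intent_and_alternatives + per_stack_reviews
            + per_stack_parses + merge_and_arbiter)


# One stack gives 2 + 2 + 2 = 6 agents; three stacks give 2 + 6 + 2 = 10.
print(total_agent_count(1), total_agent_count(3))
```

The arbiter is counted even though it only fires when qualifying findings exist, so the estimate is an upper bound for a given N.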
Shallow review (--shallow) is the shallow flow: a single-skill loop of
review, parse, fix, and test steps (repeated up to --loop N passes). It is
useful for single-stack projects or when forcing a specific Beagle skill with
--skill.
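The flow model used throughout this section (an ordered list of named steps, looked up in a per-run registry and run over a shared context) can be sketched in a few lines. This is a minimal illustration using the shallow flow's step names; run_steps, make_step, and the way a fork edits the registry are hypothetical and do not reflect the real run_flow signature or extension seam.

```python
from typing import Callable, Dict, List

Step = Callable[[dict], None]  # a step mutates the shared context in place


def run_steps(order: List[str], registry: Dict[str, Step], context: dict) -> dict:
    """Execute the named steps in order over one shared context."""
    for name in order:
        registry[name](context)
    return context


def make_step(label: str) -> Step:
    """Build a placeholder step that just records that it ran."""
    def step(context: dict) -> None:
        context.setdefault("trace", []).append(label)
    return step


# Built-in steps seeded into a per-run registry.
order = ["review", "parse", "fix", "test"]
registry = {name: make_step(name) for name in order}

# A fork replaces one step and inserts a new one before "fix".
registry["parse"] = make_step("custom-parse")
registry["lint"] = make_step("lint")
order.insert(order.index("fix"), "lint")

result = run_steps(order, registry, {})
print(result["trace"])
```

Keeping the order and the registry separate is what makes insert, remove, replace, and reorder independent operations in this sketch.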
Tiny-diff short-circuit: when a deep-mode diff has at most 2 changed files
(configurable via shallow_fanout_threshold), the per-language fan-out
collapses to a single combined assignment and the merge
and arbiter stages are skipped. The evidence gate still applies to the bypass
output. A single-file single-language diff still gets a per-language Beagle
skill; only diffs with two or more distinct language stacks fall back to the
generic reviewer.
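The routing decision described above might look something like the sketch below. It is a reading of the prose, not the actual implementation: plan_review, the returned dict keys, and the treatment of a diff with no detected stack are assumptions, and the default of 2 is taken from the "at most 2 changed files" wording.

```python
def plan_review(changed_files, stacks, shallow_fanout_threshold=2):
    """Decide between the full per-stack fan-out and the tiny-diff bypass."""
    unique_stacks = sorted(set(stacks))
    if len(changed_files) > shallow_fanout_threshold:
        # Normal deep run: one reviewer per stack, then merge (and maybe arbiter).
        return {"mode": "fan-out", "reviewers": unique_stacks,
                "merge": True, "evidence_gate": True}
    if len(unique_stacks) == 1:
        reviewer = unique_stacks[0]  # single language keeps its own skill
    else:
        reviewer = "generic"  # mixed stacks (or none, by assumption here)
    # Bypass: merge and arbiter are skipped, the evidence gate still applies.
    return {"mode": "short-circuit", "reviewers": [reviewer],
            "merge": False, "evidence_gate": True}
```

For example, a one-file Python change would be routed to the Python skill alone, while a two-file change touching Python and Go would go to the generic reviewer.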
Trajectory recording
Every run produces an ATIF v1.6 trajectory capturing prompts, responses, tool calls, and per-step token and cost metrics. The recorder is the sole owner of ATIF Pydantic model construction; other modules import only its public surface.
Each agent invocation (run_agent) opens one Invocation scope on the shared
recorder. Backends emit a unified AgentEvent stream
(daydream/backends/__init__.py) that the invocation buffers into ATIF Steps.
Event types: TextEvent,
ThinkingEvent, ToolStartEvent, ToolResultEvent, CostEvent,
MetricsEvent, TurnEndEvent, and ResultEvent. A TurnEndEvent closes the
open Step, so a multi-turn invocation produces N Steps rather than one collapsed
Step.
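The buffering rule, where each TurnEndEvent closes the open Step, can be shown with a simplified sketch. Only three of the event types are modelled, their fields are invented for the example, and a Step is represented as a plain list of events rather than an ATIF model.

```python
from dataclasses import dataclass


# Simplified stand-ins for the real event classes; the fields are assumptions.
@dataclass
class TextEvent:
    text: str


@dataclass
class ToolStartEvent:
    name: str


@dataclass
class TurnEndEvent:
    pass


def buffer_into_steps(events):
    """Group an event stream into steps, closing a step at each TurnEndEvent."""
    steps, current = [], []
    for event in events:
        if isinstance(event, TurnEndEvent):
            if current:
                steps.append(current)
                current = []
        else:
            current.append(event)
    if current:
        steps.append(current)  # flush events from an unterminated final turn
    return steps
```

A stream with two TurnEndEvents therefore yields two steps, which is the "N Steps rather than one collapsed Step" behaviour described above.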
Parallel fan-outs fork sibling trajectories under the run directory. Each fork writes a separate trajectory file alongside the parent.
Daydream redacts secrets before writing. Ordered rules cover URL credentials,
PEM private key blocks, secret-bearing environment variable assignments, bare
API keys (sk-, ghp_, ghs_, xoxb-, AKIA), JWTs, and local username paths. The
redactor applies to all four ATIF text surfaces: step messages, reasoning
content, tool call arguments, and observation results. On internal failure, the
redactor degrades to [REDACTION_FAILED] rather than letting the raw value
through.
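A reduced version of ordered, fail-closed redaction is sketched below. The three regexes are illustrative approximations covering only some of the rule classes listed above, and the [REDACTED] placeholder is an assumption; only the [REDACTION_FAILED] fallback string comes from the text.

```python
import re

# Ordered rules: applied top to bottom. Patterns are rough approximations.
RULES = [
    ("url-credentials", re.compile(r"(?<=://)[^/\s:@]+:[^/\s@]+(?=@)")),
    ("pem-private-key", re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL)),
    ("bare-api-key", re.compile(
        r"\b(?:sk-|ghp_|ghs_|xoxb-|AKIA)[A-Za-z0-9_-]{8,}")),
]


def redact(text):
    """Apply every rule in order; never return the raw value on failure."""
    try:
        for _name, pattern in RULES:
            text = pattern.sub("[REDACTED]", text)
        return text
    except Exception:
        # Fail closed: a broken redactor must not leak the original text.
        return "[REDACTION_FAILED]"


print(redact("clone https://bob:hunter2@example.com/repo.git"))
```

The same function would be applied to each of the four text surfaces before the trajectory is written.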
On SIGINT or SIGTERM, the signal handler (daydream/cli.py) flushes a .partial
trajectory file with extra.partial set, so interrupted
runs are not lost. The handler uses a module-level recorder stack rather than
the async ContextVar, because signal handlers fire in the main thread outside
the asyncio task context.
Archive
Each run is archived to ~/.daydream/archive/runs/<session_id>/ with its
manifest, trajectory (including forks), review output, evaluation results (on
by default; skip with --no-eval), deep artifacts, and the diff patch. A
separate recommended.patch captures daydream's proposed diff distinct from
the PR-under-review diff. Archiving is non-fatal:
if it fails, the run still completes and the failure is logged.
A SQLite index at ~/.daydream/archive/index.db (daydream/archive/_schema.py)
supports querying across projects. Two tables:
- runs: 39 columns including repo_slug, backend, total_cost_usd, grounding_rate, outcome_labels, composite_reward, changed_files, head_sha, base_sha, and per-phase backend columns.
- label_observations: the bitemporal label table. Tracks transaction time (observed_at), valid time (valid_at), evidence SHA, reward version, reward JSON, composite reward, reviewer logins, posterior flag, and source (auto or human). Human labels take precedence over automated labels in every projection.
The index runs in WAL mode with a 5-second busy timeout. Five indexes cover
repo_slug, archived_at, outcome_labels, observed_at, and session_id.
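The connection settings and the human-over-auto precedence rule can be sketched with the standard sqlite3 module. The table below is a heavily reduced, hypothetical version of label_observations with four columns, and open_index, create_label_table, and current_labels are illustrative names rather than the real schema module's functions.

```python
import sqlite3


def open_index(path):
    """Open the index in WAL mode with a 5-second busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")  # milliseconds
    return conn


def create_label_table(conn):
    # Reduced, assumed schema: the real table tracks many more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS label_observations ("
        "session_id TEXT, source TEXT, observed_at TEXT, composite_reward REAL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_labels_observed_at "
        "ON label_observations (observed_at)"
    )


def current_labels(conn):
    """Pick one label per session: human beats auto, then latest observed_at."""
    rows = conn.execute(
        "SELECT session_id, source, observed_at, composite_reward "
        "FROM label_observations"
    ).fetchall()
    best = {}
    for session_id, source, observed_at, reward in rows:
        rank = (source == "human", observed_at)  # ISO timestamps sort as text
        if session_id not in best or rank > best[session_id][0]:
            best[session_id] = (rank, source, reward)
    return {sid: (source, reward) for sid, (_, source, reward) in best.items()}
```

In this sketch an older human label still wins over a newer automated one, which is one way to read "human labels take precedence in every projection".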