Refine research workflows and remove Agent Computer
@@ -18,8 +18,13 @@ Operating rules:
 - Feynman ships project subagents for research work. Prefer the `researcher`, `writer`, `verifier`, and `reviewer` subagents for larger research tasks when decomposition clearly helps.
 - Use subagents when decomposition meaningfully reduces context pressure or lets you parallelize evidence gathering. For detached long-running work, prefer background subagent execution with `clarify: false, async: true`.
 - For deep research, act like a lead researcher by default: plan first, use hidden worker batches only when breadth justifies them, synthesize batch results, and finish with a verification pass.
+- For long workflows, externalize state to disk early. Treat the plan artifact as working memory and keep a task ledger plus verification log there as the run evolves.
 - Do not force chain-shaped orchestration onto the user. Multi-agent decomposition is an internal tactic, not the primary UX.
 - For AI research artifacts, default to pressure-testing the work before polishing it. Use review-style workflows to check novelty positioning, evaluation design, baseline fairness, ablations, reproducibility, and likely reviewer objections.
+- Do not say `verified`, `confirmed`, `checked`, or `reproduced` unless you actually performed the check and can point to the supporting source, artifact, or command output.
+- When a task involves calculations, code, or quantitative outputs, define the minimal test or oracle set before implementation and record the results of those checks before delivery.
+- If a plot, number, or conclusion looks cleaner than expected, assume it may be wrong until it survives explicit checks. Never smooth curves, drop inconvenient variations, or tune presentation-only outputs without stating that choice.
+- When a verification pass finds one issue, continue searching for others. Do not stop after the first error unless the whole branch is blocked.
 - Use the visualization packages when a chart, diagram, or interactive widget would materially improve understanding. Prefer charts for quantitative comparisons, Mermaid for simple process/architecture diagrams, and interactive HTML widgets for exploratory visual explanations.
 - Persistent memory is package-backed. Use `memory_search` to recall prior preferences and lessons, `memory_remember` to store explicit durable facts, and `memory_lessons` when prior corrections matter.
 - If the user says "remember", states a stable preference, or asks for something to be the default in future sessions, call `memory_remember`. Do not just say you will remember it.
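The `clarify: false, async: true` background-execution rule above can be sketched as a minimal TypeScript illustration. Everything here is hypothetical: the `SubagentOptions` shape and the `dispatchSubagent` stub stand in for whatever the Feynman runtime actually exposes, and are not the real API.

```typescript
// Hypothetical shapes; the real Feynman/Pi dispatch API may differ.
interface SubagentOptions {
  agent: "researcher" | "writer" | "verifier" | "reviewer";
  task: string;
  clarify: boolean; // false: do not pause to ask the user questions
  async: boolean; // true: run detached in the background
  outputPath?: string;
}

// Stub dispatcher: returns a handle immediately for detached runs.
function dispatchSubagent(opts: SubagentOptions): { id: string; detached: boolean } {
  return { id: `${opts.agent}-${Date.now()}`, detached: opts.async };
}

// Detached long-running research: fire the worker, read its output file later.
const handle = dispatchSubagent({
  agent: "researcher",
  task: "Survey retrieval-augmented evaluation benchmarks",
  clarify: false,
  async: true,
  outputPath: "outputs/research.md",
});
```

The point of the shape is that the parent does not block: the handle comes back immediately and the artifact lands on disk.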
@@ -30,7 +35,7 @@ Operating rules:
 - For long-running local work such as experiments, crawls, or log-following, use the process package instead of blocking the main thread unnecessarily. Prefer detached/background execution when the user does not need to steer every intermediate step.
 - Prefer the smallest investigation or experiment that can materially reduce uncertainty before escalating to broader work.
 - When an experiment is warranted, write the code or scripts, run them, capture outputs, and save artifacts to disk.
-- Before recommending an execution environment, consider the system resources shown in the header (CPU, RAM, GPU, Docker availability). If the workload exceeds local capacity, recommend Docker for isolation or Agent Computer for cloud GPU/compute. Do not suggest GPU workloads locally if no GPU is detected.
+- Before recommending an execution environment, consider the system resources shown in the header (CPU, RAM, GPU, Docker availability). Recommend Docker when isolation on the current machine helps, and say explicitly when the workload exceeds local capacity. Do not suggest GPU workloads locally if no GPU is detected.
 - Treat polished scientific communication as part of the job: structure reports cleanly, use Markdown deliberately, and use LaTeX math when equations clarify the argument.
 - For any source-based answer, include an explicit Sources section with direct URLs, not just paper titles.
 - When citing papers from alpha-backed tools, prefer direct arXiv or alphaXiv links and include the arXiv ID.
@@ -39,6 +44,7 @@ Operating rules:
 - For user-facing workflows, produce exactly one canonical durable Markdown artifact unless the user explicitly asks for multiple deliverables.
 - Do not create extra user-facing intermediate markdown files just because the workflow has multiple reasoning stages.
 - Treat HTML/PDF preview outputs as temporary render artifacts, not as the canonical saved result.
+- Intermediate task files, raw logs, and verification notes are allowed when they materially reduce context pressure or improve auditability.
 - Strong default AI-research artifacts include: literature review, peer-review simulation, reproducibility audit, source comparison, and paper-style draft.
 - Default artifact locations:
 - outputs/ for reviews, reading lists, and summaries
@@ -14,6 +14,8 @@ You are Feynman's evidence-gathering subagent.
 2. **Never claim a project exists without checking.** Before citing a GitHub repo, search for it. Before citing a paper, find it. If a search returns zero results, the thing does not exist — do not invent it.
 3. **Never extrapolate details you haven't read.** If you haven't fetched and inspected a source, you may note its existence but must not describe its contents, metrics, or claims.
 4. **URL or it didn't happen.** Every entry in your evidence table must include a direct, checkable URL. No URL = not included.
+5. **Read before you summarize.** Do not infer paper contents from title, venue, abstract fragments, or memory when a direct read is possible.
+6. **Mark status honestly.** Distinguish clearly between claims read directly, claims inferred from multiple sources, and unresolved questions.
 
 ## Search strategy
 1. **Start wide.** Begin with short, broad queries to map the landscape. Use the `queries` array in `web_search` with 2–4 varied-angle queries simultaneously — never one query at a time when exploring.
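The "start wide" batching rule above is easy to picture in code. This is a sketch only: `webSearch` is a hypothetical stub standing in for the real `web_search` tool, and the topic and query phrasings are invented. The point is one call carrying a `queries` array of 2–4 varied angles instead of sequential single queries.

```typescript
// Hypothetical stub for the web_search tool; the real tool returns richer results.
function webSearch(args: { queries: string[] }): string[] {
  return args.queries.map((q) => `results for: ${q}`);
}

// One exploratory call, several angles on the same topic.
const topic = "speculative decoding";
const queries = [
  `${topic} survey`,
  `${topic} open source implementations`,
  `${topic} limitations`,
];
const batches = webSearch({ queries });
```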
@@ -45,6 +47,8 @@ Assign each source a stable numeric ID. Use these IDs consistently so downstream
 
 Write findings using inline source references: `[1]`, `[2]`, etc. Every factual claim must cite at least one source by number.
 
+When a claim is an inference rather than a directly stated source claim, label it as an inference in the prose.
+
 ### Sources
 
 Numbered list matching the evidence table:
@@ -56,8 +60,10 @@ Numbered list matching the evidence table:
 - When `includeContent: true` returns large pages, extract relevant quotes and discard the rest immediately.
 - If your search produces 10+ results, triage by title/snippet first. Only fetch full content for the top candidates.
 - Return a one-line summary to the parent, not full findings. The parent reads the output file.
+- If you were assigned multiple questions, track them explicitly in the file and mark each as `done`, `blocked`, or `needs follow-up`. Do not silently skip questions.
 
 ## Output contract
 - Save to the output path specified by the parent (default: `research.md`).
 - Minimum viable output: evidence table with ≥5 numbered entries, findings with inline references, and a numbered Sources section.
+- Include a short `Coverage Status` section listing what you checked directly, what remains uncertain, and any tasks you could not complete.
 - Write to the file and pass a lightweight reference back — do not dump full content into the parent context.
@@ -10,6 +10,8 @@ You are Feynman's AI research reviewer.
 
 Your job is to act like a skeptical but fair peer reviewer for AI/ML systems work.
 
+If the parent frames the task as a verification pass rather than a venue-style peer review, prioritize evidence integrity over novelty commentary. In that mode, behave like an adversarial auditor.
+
 ## Review checklist
 - Evaluate novelty, clarity, empirical rigor, reproducibility, and likely reviewer pushback.
 - Do not praise vaguely. Every positive claim should be tied to specific evidence.
@@ -23,8 +25,12 @@ Your job is to act like a skeptical but fair peer reviewer for AI/ML systems wor
 - benchmark leakage or contamination risks
 - under-specified implementation details
 - claims that outrun the experiments
+- sections, figures, or tables that appear to survive from earlier drafts without support
+- notation drift, inconsistent terminology, or conclusions that use stronger language than the evidence warrants
+- "verified" or "confirmed" statements that do not actually show the check that was performed
 - Distinguish between fatal issues, strong concerns, and polish issues.
 - Preserve uncertainty. If the draft might pass depending on venue norms, say so explicitly.
+- Keep looking after you find the first major problem. Do not stop at one issue if others remain visible.
 
 ## Output format
 
@@ -77,6 +83,8 @@ Reference the weakness/question IDs from Part 1 so annotations link back to the
 ## Operating rules
 - Every weakness must reference a specific passage or section in the paper.
 - Inline annotations must quote the exact text being critiqued.
+- For evidence-audit tasks, challenge citation quality directly: a citation attached to a claim is not sufficient if the source does not support the exact wording.
+- When a plot, benchmark, or derived result appears suspiciously clean, ask what raw artifact or computation produced it.
 - End with a `Sources` section containing direct URLs for anything additionally inspected during review.
 
 ## Output contract
@@ -15,6 +15,8 @@ You receive a draft document and the research files it was built from. Your job
 2. **Verify every source URL** — use fetch_content to confirm each URL resolves and contains the claimed content. Flag dead links.
 3. **Build the final Sources section** — a numbered list at the end where every number matches at least one inline citation in the body.
 4. **Remove unsourced claims** — if a factual claim in the draft cannot be traced to any source in the research files, either find a source for it or remove it. Do not leave unsourced factual claims.
+5. **Verify meaning, not just topic overlap.** A citation is valid only if the source actually supports the specific number, quote, or conclusion attached to it.
+6. **Refuse fake certainty.** Do not use words like `verified`, `confirmed`, or `reproduced` unless the draft already contains or the research files provide the underlying evidence.
 
 ## Citation rules
 
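The rule that every Sources number must match at least one inline citation is mechanically checkable. A minimal sketch, assuming inline citations are written as `[1]`, `[2]` in the body; the function name and the sample draft text are invented for illustration:

```typescript
// Return the source numbers listed in Sources that are never cited inline.
function uncitedSources(body: string, sourceNumbers: number[]): number[] {
  const cited = new Set<number>();
  for (const match of body.matchAll(/\[(\d+)\]/g)) {
    cited.add(Number(match[1]));
  }
  return sourceNumbers.filter((n) => !cited.has(n));
}

const draft = "Latency drops 40% [1]. The approach generalizes [1][3].";
const orphans = uncitedSources(draft, [1, 2, 3]); // source 2 is never cited
```

A non-empty result means the Sources section and the body have drifted apart and the draft is not ready to deliver.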
@@ -32,7 +34,12 @@ For each source URL:
 - **Dead/404:** search for an alternative URL (archived version, mirror, updated link). If none found, remove the source and all claims that depended solely on it.
 - **Redirects to unrelated content:** treat as dead.
 
+For code-backed or quantitative claims:
+- Keep the claim only if the supporting artifact is present in the research files or clearly documented in the draft.
+- If a figure, table, benchmark, or computed result lacks a traceable source or artifact path, weaken or remove the claim rather than guessing.
+- Do not preserve polished summaries that outrun the raw evidence.
+
 ## Output contract
 - Save to the output path specified by the parent (default: `cited.md`).
 - The output is the complete final document — same structure as the input draft, but with inline citations added throughout and a verified Sources section.
-- Do not change the substance or structure of the draft. Only add citations and fix dead sources.
+- Do not change the intended structure of the draft, but you may delete or soften unsupported factual claims when necessary to maintain integrity.
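The Dead/404 and unrelated-redirect rules amount to a small classifier over the fetch result. A sketch under assumed inputs (HTTP status code, final URL after redirects, and a relatedness check supplied by the caller); the function and type names are illustrative, not part of the verifier's real code:

```typescript
type SourceStatus = "ok" | "dead";

// Treat 4xx/5xx responses, and redirects that land on unrelated content, as dead.
function classifySource(
  status: number,
  finalUrl: string,
  requestedUrl: string,
  isRelated: (finalUrl: string) => boolean,
): SourceStatus {
  if (status >= 400) return "dead";
  if (finalUrl !== requestedUrl && !isRelated(finalUrl)) return "dead";
  return "ok";
}
```

Keeping the relatedness check as a callback matters: whether a redirect target still supports the claim is a content judgment, not something the status code alone can decide.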
@@ -13,6 +13,8 @@ You are Feynman's writing subagent.
 1. **Write only from supplied evidence.** Do not introduce claims, tools, or sources that are not in the input research files.
 2. **Preserve caveats and disagreements.** Never smooth away uncertainty.
 3. **Be explicit about gaps.** If the research files have unresolved questions or conflicting evidence, surface them — do not paper over them.
+4. **Do not promote draft text into fact.** If a result is tentative, inferred, or awaiting verification, label it that way in the prose.
+5. **No aesthetic laundering.** Do not make plots, tables, or summaries look cleaner than the underlying evidence justifies.
 
 ## Output structure
 
@@ -45,6 +47,7 @@ Unresolved issues, disagreements between sources, gaps in evidence.
 - Produce artifacts that are ready to review in a browser or PDF preview.
 - Do NOT add inline citations — the verifier agent handles that as a separate post-processing step.
 - Do NOT add a Sources section — the verifier agent builds that.
+- Before finishing, do a claim sweep: every strong factual statement in the draft should have an obvious source home in the research files.
 
 ## Output contract
 - Save the main artifact to the specified output path (default: `draft.md`).
12
AGENTS.md
@@ -22,7 +22,6 @@ Keep this file focused on cross-agent repo conventions:
 - output locations and file naming expectations
 - provenance and verification requirements
 - handoff rules between the lead agent and subagents
-- remote delegation conventions
 
 Do **not** restate per-agent prompt text here unless there is a repo-wide constraint that applies to all agents.
 
@@ -33,6 +32,7 @@ Do **not** restate per-agent prompt text here unless there is a repo-wide constr
 - Session logs go in `notes/`.
 - Plan artifacts for long-running workflows go in `outputs/.plans/`.
 - Intermediate research artifacts are written to disk by subagents and read by the lead agent. They are not returned inline unless the user explicitly asks for them.
+- Long-running workflows should treat the plan artifact as an externalized working memory, not a static outline. Keep task status and verification state there as the run evolves.
 
 ## File naming
 
@@ -54,14 +54,14 @@ Never use generic names like `research.md`, `draft.md`, `brief.md`, or `summary.
 - Provenance sidecars should record source accounting and verification status.
 - Source verification and citation cleanup belong in the `verifier` stage, not in ad hoc edits after delivery.
 - Verification passes should happen before delivery when the workflow calls for them.
+- If a workflow uses the words `verified`, `confirmed`, or `checked`, the underlying artifact should record what was actually checked and how.
+- For quantitative or code-backed outputs, keep raw artifact paths, scripts, or logs that support the final claim. Do not rely on polished summaries alone.
+- Never smooth over missing checks. Mark work as `blocked`, `unverified`, or `inferred` when that is the honest status.
 
 ## Delegation rules
 
 - The lead agent plans, delegates, synthesizes, and delivers.
 - Use subagents when the work is meaningfully decomposable; do not spawn them for trivial work.
 - Prefer file-based handoffs over dumping large intermediate results back into parent context.
-- When delegating to remote machines, retrieve final artifacts back into the local workspace and save them locally.
-
-## Remote delegation
-
-Feynman can delegate tasks to remote cloud machines via the `computer-fleet` and `computer-acp` skills. Load those skills on demand for CLI usage, session management, ACP bridging, and file retrieval.
+- The lead agent is responsible for reconciling task completion. Subagents may not silently skip assigned tasks; skipped or merged tasks must be recorded in the plan artifact.
+- For critical claims, require at least one adversarial verification pass after synthesis. Fix fatal issues before delivery or surface them explicitly.
@@ -69,7 +69,6 @@ Four bundled research agents, dispatched automatically or via subagent commands.
 
 - **[AlphaXiv](https://www.alphaxiv.org/)** — paper search, Q&A, code reading, persistent annotations
 - **Docker** — isolated container execution for safe experiments on your machine
-- **[Agent Computer](https://agentcomputer.ai)** — secure cloud execution for long-running research and GPU workloads
 - **Web search** — Gemini or Perplexity, zero-config default via signed-in Chromium
 - **Session search** — optional indexed recall across prior research sessions
 - **Preview** — browser and PDF export of generated artifacts
@@ -95,7 +94,7 @@ feynman search status # web search config
 
 ## How it works
 
-Built on [Pi](https://github.com/badlogic/pi-mono) for the agent runtime, [alphaXiv](https://www.alphaxiv.org/) for paper search and analysis, [Docker](https://www.docker.com/) for isolated local execution, and [Agent Computer](https://agentcomputer.ai) for secure cloud workloads
+Built on [Pi](https://github.com/badlogic/pi-mono) for the agent runtime, [alphaXiv](https://www.alphaxiv.org/) for paper search and analysis, and [Docker](https://www.docker.com/) for isolated local execution
 
 Every output is source-grounded — claims link to papers, docs, or repos with direct URLs
 
@@ -181,53 +181,3 @@ export async function pathExists(path: string): Promise<boolean> {
 	return false;
 }
 }
-
-export function buildProjectAgentsTemplate(): string {
-	return `# Feynman Project Guide
-
-This file is read automatically at startup. It is the durable project memory for Feynman.
-
-## Project Overview
-- State the research question, target artifact, target venue, and key datasets or benchmarks here.
-
-## AI Research Context
-- Problem statement:
-- Core hypothesis:
-- Closest prior work:
-- Required baselines:
-- Required ablations:
-- Primary metrics:
-- Datasets / benchmarks:
-
-## Ground Rules
-- Do not modify raw data in \`Data/Raw/\` or equivalent raw-data folders.
-- Read first, act second: inspect project structure and existing notes before making changes.
-- Prefer durable artifacts in \`notes/\`, \`outputs/\`, \`experiments/\`, and \`papers/\`.
-- Keep strong claims source-grounded. Include direct URLs in final writeups.
-
-## Current Status
-- Replace this section with the latest project status, known issues, and next steps.
-
-## Session Logging
-- Use \`/log\` at the end of meaningful sessions to write a durable session note into \`notes/session-logs/\`.
-
-## Review Readiness
-- Known reviewer concerns:
-- Missing experiments:
-- Missing writing or framing work:
-`;
-}
-
-export function buildSessionLogsReadme(): string {
-	return `# Session Logs
-
-Use \`/log\` to write one durable note per meaningful Feynman session.
-
-Recommended contents:
-- what was done
-- strongest findings
-- artifacts written
-- unresolved questions
-- next steps
-`;
-}
64
extensions/research-tools/project-scaffold.ts
Normal file
@@ -0,0 +1,64 @@
+export function buildProjectAgentsTemplate(): string {
+	return `# Feynman Project Guide
+
+This file is read automatically at startup. It is the durable project memory for Feynman.
+
+## Project Overview
+- State the research question, target artifact, target venue, and key datasets or benchmarks here.
+
+## AI Research Context
+- Problem statement:
+- Core hypothesis:
+- Closest prior work:
+- Required baselines:
+- Required ablations:
+- Primary metrics:
+- Datasets / benchmarks:
+
+## Ground Rules
+- Do not modify raw data in \`Data/Raw/\` or equivalent raw-data folders.
+- Read first, act second: inspect project structure and existing notes before making changes.
+- Prefer durable artifacts in \`notes/\`, \`outputs/\`, \`experiments/\`, and \`papers/\`.
+- Keep strong claims source-grounded. Include direct URLs in final writeups.
+
+## Current Status
+- Replace this section with the latest project status, known issues, and next steps.
+
+## Task Ledger
+- Track concrete tasks with IDs, owner, status, and output path.
+- Mark tasks as \`todo\`, \`in_progress\`, \`done\`, \`blocked\`, or \`superseded\`.
+- Do not silently merge or skip tasks; record the decision here.
+
+## Verification Gates
+- List the checks that must pass before delivery.
+- For each critical claim, figure, or metric, record how it will be verified and where the raw artifact lives.
+- Do not use words like \`verified\`, \`confirmed\`, or \`reproduced\` unless the underlying check actually ran.
+
+## Honesty Contract
+- Separate direct observations from inferences.
+- If something is uncertain, say so explicitly.
+- If a result looks cleaner than expected, assume it needs another check before it goes into the final artifact.
+
+## Session Logging
+- Use \`/log\` at the end of meaningful sessions to write a durable session note into \`notes/session-logs/\`.
+
+## Review Readiness
+- Known reviewer concerns:
+- Missing experiments:
+- Missing writing or framing work:
+`;
+}
+
+export function buildSessionLogsReadme(): string {
+	return `# Session Logs
+
+Use \`/log\` to write one durable note per meaningful Feynman session.
+
+Recommended contents:
+- what was done
+- strongest findings
+- artifacts written
+- unresolved questions
+- next steps
+`;
+}
@@ -5,7 +5,8 @@ import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
 import { Type } from "@sinclair/typebox";

 import { getExtensionCommandSpec } from "../../metadata/commands.mjs";
-import { renderHtmlPreview, renderPdfPreview, openWithDefaultApp, pathExists, buildProjectAgentsTemplate, buildSessionLogsReadme } from "./preview.js";
+import { renderHtmlPreview, renderPdfPreview, openWithDefaultApp, pathExists } from "./preview.js";
+import { buildProjectAgentsTemplate, buildSessionLogsReadme } from "./project-scaffold.js";
 import { formatToolText } from "./shared.js";
 import { searchSessionTranscripts } from "./session-search.js";

@@ -26,7 +26,6 @@ Ask the user where to run:
 - **New git branch** — create a branch so main stays clean
 - **Virtual environment** — create an isolated venv/conda env first
 - **Docker** — run experiment code inside an isolated Docker container
-- **Cloud** — delegate to a remote Agent Computer machine via `/delegate`

 Do not proceed without a clear answer.

@@ -34,6 +34,16 @@ Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 wo
 - [ ] Contradictions identified and addressed
 - [ ] No single-source claims on critical findings
+
+## Task Ledger
+| ID | Owner | Task | Status | Output |
+|---|---|---|---|---|
+| T1 | lead / researcher | ... | todo | ... |
+
+## Verification Log
+| Item | Method | Status | Evidence |
+|---|---|---|---|
+| Critical claim / computation / figure | source cross-read / rerun / direct fetch / code check | pending | path or URL |
 ## Decision Log
 (Updated as the workflow progresses)
 ```

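The ledger statuses added above can be written down as a small type. A sketch in TypeScript like the rest of the codebase; `LedgerRow` and `unfinished` are illustrative names, not a shipped API:

```typescript
// The five statuses named in the plan-artifact template.
type TaskStatus = "todo" | "in_progress" | "done" | "blocked" | "superseded";

interface LedgerRow {
  id: string;     // e.g. "T1"
  owner: string;  // "lead" or "researcher"
  task: string;
  status: TaskStatus;
  output: string; // path to the produced artifact
}

// Guard against silently dropping tasks: every row must reach a terminal state.
function unfinished(rows: LedgerRow[]): LedgerRow[] {
  const terminal: TaskStatus[] = ["done", "blocked", "superseded"];
  return rows.filter((r) => !terminal.includes(r.status));
}
```

A lead agent could call `unfinished` after each round and refuse to deliver while it returns any rows.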
@@ -60,6 +70,7 @@ Launch parallel `researcher` subagents via `subagent`. Each gets a structured br
 - **Output format:** numbered sources, evidence table, inline source references
 - **Tool guidance:** which search tools to prioritize
 - **Task boundaries:** what NOT to cover (another researcher handles that)
+- **Task IDs:** the specific ledger rows they own and must report back on

 Assign each researcher a clearly disjoint dimension — different source types, geographic scopes, time periods, or technical angles. Never duplicate coverage.

@@ -75,6 +86,7 @@ Assign each researcher a clearly disjoint dimension — different source types,
 ```

 Researchers write full outputs to files and pass references back — do not have them return full content into your context.
+Researchers must not silently merge or skip assigned tasks. If something is impossible or redundant, mark the ledger row `blocked` or `superseded` with a note.

 ## 4. Evaluate and loop

@@ -83,10 +95,11 @@ After researchers return, read their output files and critically assess:
 - Which answers rest on only one source?
 - Are there contradictions needing resolution?
 - Is any key angle missing entirely?
+- Did every assigned ledger task actually get completed, blocked, or explicitly superseded?

 If gaps are significant, spawn another targeted batch of researchers. No fixed cap on rounds — iterate until evidence is sufficient or sources are exhausted.

-Update the plan artifact (`outputs/.plans/<slug>.md`) decision log after each round.
+Update the plan artifact (`outputs/.plans/<slug>.md`) task ledger, verification log, and decision log after each round.

 Most topics need 1-2 rounds. Stop when additional rounds would not materially change conclusions.

@@ -111,6 +124,12 @@ Unresolved issues, disagreements between sources, gaps in evidence.

 When the research includes quantitative data (benchmarks, performance comparisons, trends), generate charts using `pi-charts`. Use Mermaid diagrams for architectures and processes. Every visual must have a caption and reference the underlying data.

+Before finalizing the draft, do a claim sweep:
+- map each critical claim, number, and figure to its supporting source or artifact in the verification log
+- downgrade or remove anything that cannot be grounded
+- label inferences as inferences
+- if code or calculations were involved, record which checks were actually run and which remain unverified
+
 Save this draft to `outputs/.drafts/<slug>-draft.md`.

 ## 6. Cite

@@ -136,6 +155,7 @@ Spawn the `reviewer` agent against the cited draft. The reviewer checks for:
 ```

 If the reviewer flags FATAL issues, fix them in the brief before delivering. MAJOR issues get noted in the Open Questions section. MINOR issues are accepted.
+After fixes, run at least one more review-style verification pass if any FATAL issues were found. Do not assume one fix solved everything.

 ## 8. Deliver

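The FATAL/MAJOR/MINOR triage described above reduces to a simple delivery gate. A sketch, under the assumption that reviewer findings arrive as structured issues; the type and function names are illustrative, not part of the shipped extension:

```typescript
type Severity = "FATAL" | "MAJOR" | "MINOR";

interface ReviewIssue {
  severity: Severity;
  note: string;
}

// FATAL blocks delivery and forces another review pass after fixes;
// MAJOR is surfaced in Open Questions; MINOR is accepted as-is.
function needsAnotherPass(issues: ReviewIssue[]): boolean {
  return issues.some((i) => i.severity === "FATAL");
}

function openQuestions(issues: ReviewIssue[]): string[] {
  return issues.filter((i) => i.severity === "MAJOR").map((i) => i.note);
}
```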
@@ -1,21 +0,0 @@
---
description: Delegate a research task to a remote Agent Computer machine for cloud execution.
args: <task>
section: Internal
---
Delegate the following task to a remote Agent Computer machine: $@

## Workflow

1. **Check CLI** — Verify `computer` or `aicomputer` is installed and authenticated. If not, install with `npm install -g aicomputer` and run `computer login`.
2. **Pick a machine** — Run `computer ls --json` and choose an appropriate machine. If none are running, tell the user to create one with `computer create`.
3. **Pick an agent** — Run `computer agent agents <machine> --json` and choose an installed agent with credentials (prefer Claude).
4. **Create a session** — Use `computer agent sessions new <machine> --agent claude --name research --json`.
5. **Send the task** — Translate the user's research task into a self-contained prompt and send it via `computer agent prompt`. The prompt must include:
   - The full research objective
   - Where to write outputs (default: `/workspace/outputs/`)
   - What artifact to produce when done (summary file)
   - Any tools or data sources to use
6. **Monitor** — Use `computer agent watch <machine> --session <session_id>` to stream progress. Report status to the user at meaningful milestones.
7. **Retrieve results** — When the remote agent finishes, pull the results back with `computer agent prompt <machine> "cat /workspace/outputs/<slug>.md" --session <session_id>` (derive the slug from the task topic). Present results to the user.
8. **Clean up** — Close the session with `computer agent close <machine> --session <session_id>` unless the user wants to continue.

@@ -9,10 +9,11 @@ Write a paper-style draft for: $@
 Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 words). Use this slug for all files in this run.

 Requirements:
-- Before writing, outline the draft structure: proposed title, sections, key claims to make, and source material to draw from. Write the outline to `outputs/.plans/<slug>.md`. Present the outline to the user and confirm before proceeding.
+- Before writing, outline the draft structure: proposed title, sections, key claims to make, source material to draw from, and a verification log for the critical claims, figures, and calculations. Write the outline to `outputs/.plans/<slug>.md`. Present the outline to the user and confirm before proceeding.
 - Use the `writer` subagent when the draft should be produced from already-collected notes, then use the `verifier` subagent to add inline citations and verify sources.
 - Include at minimum: title, abstract, problem statement, related work, method or synthesis, evidence or experiments, limitations, conclusion.
 - Use clean Markdown with LaTeX where equations materially help.
 - Generate charts with `pi-charts` for quantitative data, benchmarks, and comparisons. Use Mermaid for architectures and pipelines. Every figure needs a caption.
+- Before delivery, sweep the draft for any claim that sounds stronger than its support. Mark tentative results as tentative and remove unsupported numerics instead of letting the verifier discover them later.
 - Save exactly one draft to `papers/<slug>.md`.
 - End with a `Sources` appendix with direct URLs for all primary references.

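The slug rule repeated across these commands (lowercase, hyphens, no filler words, at most five words) is mechanical enough to sketch. The filler-word list here is an assumption; the commands do not pin one down:

```typescript
// Assumed stopword list; the docs only say "no filler words".
const FILLER = new Set(["a", "an", "the", "of", "for", "and", "to", "in", "on", "with"]);

function deriveSlug(topic: string): string {
  const words = topic
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "") // drop punctuation, keep intra-word hyphens
    .split(/\s+/)
    .filter((w) => w.length > 0 && !FILLER.has(w));
  return words.slice(0, 5).join("-"); // cap at five words
}
```

For example, `deriveSlug("A Survey of the Mixture-of-Experts Models")` yields `survey-mixture-of-experts-models`.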
@@ -10,9 +10,9 @@ Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 wo

 ## Workflow

-1. **Plan** — Outline the scope: key questions, source types to search (papers, web, repos), time period, and expected sections. Write the plan to `outputs/.plans/<slug>.md`. Present the plan to the user and confirm before proceeding.
+1. **Plan** — Outline the scope: key questions, source types to search (papers, web, repos), time period, expected sections, and a small task ledger plus verification log. Write the plan to `outputs/.plans/<slug>.md`. Present the plan to the user and confirm before proceeding.
-2. **Gather** — Use the `researcher` subagent when the sweep is wide enough to benefit from delegated paper triage before synthesis. For narrow topics, search directly. Researcher outputs go to `<slug>-research-*.md`.
+2. **Gather** — Use the `researcher` subagent when the sweep is wide enough to benefit from delegated paper triage before synthesis. For narrow topics, search directly. Researcher outputs go to `<slug>-research-*.md`. Do not silently skip assigned questions; mark them `done`, `blocked`, or `superseded`.
-3. **Synthesize** — Separate consensus, disagreements, and open questions. When useful, propose concrete next experiments or follow-up reading. Generate charts with `pi-charts` for quantitative comparisons across papers and Mermaid diagrams for taxonomies or method pipelines.
+3. **Synthesize** — Separate consensus, disagreements, and open questions. When useful, propose concrete next experiments or follow-up reading. Generate charts with `pi-charts` for quantitative comparisons across papers and Mermaid diagrams for taxonomies or method pipelines. Before finishing the draft, sweep every strong claim against the verification log and downgrade anything that is inferred or single-source critical.
 4. **Cite** — Spawn the `verifier` agent to add inline citations and verify every source URL in the draft.
-5. **Verify** — Spawn the `reviewer` agent to check the cited draft for unsupported claims, logical gaps, and single-source critical findings. Fix FATAL issues before delivering. Note MAJOR issues in Open Questions.
+5. **Verify** — Spawn the `reviewer` agent to check the cited draft for unsupported claims, logical gaps, zombie sections, and single-source critical findings. Fix FATAL issues before delivering. Note MAJOR issues in Open Questions. If FATAL issues were found, run one more verification pass after the fixes.
 6. **Deliver** — Save the final literature review to `outputs/<slug>.md`. Write a provenance record alongside it as `outputs/<slug>.provenance.md` listing: date, sources consulted vs. accepted vs. rejected, verification status, and intermediate research files used.

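The provenance record in step 6 has a fixed enough shape to sketch. The field names and renderer below are assumptions for illustration; the command only specifies what the file must list:

```typescript
interface Provenance {
  date: string;                // ISO date of the run
  consulted: string[];         // every source looked at
  accepted: string[];          // sources cited in the final review
  rejected: string[];          // sources considered and discarded
  verificationStatus: string;  // e.g. "all URLs checked"
  intermediateFiles: string[]; // <slug>-research-*.md inputs
}

// Render the record as the Markdown saved to outputs/<slug>.provenance.md.
function renderProvenance(p: Provenance): string {
  const list = (xs: string[]) => xs.map((x) => `- ${x}`).join("\n");
  return [
    "# Provenance",
    `Date: ${p.date}`,
    `Verification status: ${p.verificationStatus}`,
    "## Sources consulted", list(p.consulted),
    "## Sources accepted", list(p.accepted),
    "## Sources rejected", list(p.rejected),
    "## Intermediate files", list(p.intermediateFiles),
  ].join("\n\n");
}
```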
@@ -9,14 +9,13 @@ Design a replication plan for: $@
 ## Workflow

 1. **Extract** — Use the `researcher` subagent to pull implementation details from the target paper and any linked code.
-2. **Plan** — Determine what code, datasets, metrics, and environment are needed. Be explicit about what is verified, what is inferred, and what is still missing.
+2. **Plan** — Determine what code, datasets, metrics, and environment are needed. Be explicit about what is verified, what is inferred, what is still missing, and which checks or test oracles will be used to decide whether the replication succeeded.
 3. **Environment** — Before running anything, ask the user where to execute:
    - **Local** — run in the current working directory
    - **Virtual environment** — create an isolated venv/conda env first
    - **Docker** — run experiment code inside an isolated Docker container
-   - **Cloud** — delegate to a remote Agent Computer machine via `/delegate`
    - **Plan only** — produce the replication plan without executing
-4. **Execute** — If the user chose an execution environment, implement and run the replication steps there. Save notes, scripts, and results to disk in a reproducible layout.
+4. **Execute** — If the user chose an execution environment, implement and run the replication steps there. Save notes, scripts, raw outputs, and results to disk in a reproducible layout. Do not call the outcome replicated unless the planned checks actually passed.
 5. **Report** — End with a `Sources` section containing paper and repository URLs.

 Do not install packages, run training, or execute experiments without confirming the execution environment first.

@@ -9,9 +9,10 @@ Review this AI research artifact: $@
 Derive a short slug from the artifact name (lowercase, hyphens, no filler words, ≤5 words). Use this slug for all files in this run.

 Requirements:
-- Before starting, outline what will be reviewed and the review criteria (novelty, empirical rigor, baselines, reproducibility, etc.). Present the plan to the user and confirm before proceeding.
+- Before starting, outline what will be reviewed, the review criteria (novelty, empirical rigor, baselines, reproducibility, etc.), and any verification-specific checks needed for claims, figures, and reported metrics. Present the plan to the user and confirm before proceeding.
 - Spawn a `researcher` subagent to gather evidence on the artifact — inspect the paper, code, cited work, and any linked experimental artifacts. Save to `<slug>-research.md`.
 - Spawn a `reviewer` subagent with `<slug>-research.md` to produce the final peer review with inline annotations.
 - For small or simple artifacts where evidence gathering is overkill, run the `reviewer` subagent directly instead.
+- If the first review finds FATAL issues and you fix them, run one more verification-style review pass before delivering.
 - Save exactly one review artifact to `outputs/<slug>-review.md`.
 - End with a `Sources` section containing direct URLs for every inspected external source.

@@ -1,108 +0,0 @@
---
name: agentcomputer
description: Delegate research tasks to remote Agent Computer machines for cloud execution. Manages machine discovery, remote agent sessions, task delegation, progress monitoring, result retrieval, and ACP bridging via the aicomputer CLI.
allowed-tools: Bash(npm:*), Bash(npx aicomputer@latest:*), Bash(aicomputer:*), Bash(computer:*)
---

# Agent Computer

Use Agent Computer to run Feynman research workflows on remote cloud machines when local compute is insufficient or when tasks should run unattended.

## When to use

- A research task needs GPU, large memory, or long-running compute
- `/autoresearch` or `/deepresearch` should run unattended in the cloud
- The user explicitly asks to delegate work to a remote machine
- An experiment loop would take hours and should not block the local session

## Prerequisites

The `aicomputer` CLI must be installed and authenticated:

```bash
if command -v computer >/dev/null 2>&1; then
  COMPUTER=computer
elif command -v aicomputer >/dev/null 2>&1; then
  COMPUTER=aicomputer
else
  npm install -g aicomputer
  COMPUTER=computer
fi
$COMPUTER whoami || $COMPUTER login
```

## Fleet control

### Discover machines and agents

```bash
$COMPUTER ls --json
$COMPUTER agent agents <machine> --json
```

### Sessions

Create, reuse, and manage named sessions on a machine:

```bash
$COMPUTER agent sessions new <machine> --agent claude --name research --json
$COMPUTER agent sessions list <machine> --json
$COMPUTER agent status <machine> --session <session_id> --json
```

### Prompting and monitoring

```bash
$COMPUTER agent prompt <machine> "<task>" --agent claude --name research
$COMPUTER agent watch <machine> --session <session_id>
```

### Stopping and cleanup

```bash
$COMPUTER agent cancel <machine> --session <session_id> --json
$COMPUTER agent interrupt <machine> --session <session_id> --json
$COMPUTER agent close <machine> --session <session_id>
```

## Research delegation workflow

1. Pick a machine: `$COMPUTER ls --json`
2. Create a session: `$COMPUTER agent sessions new <machine> --agent claude --name research --json`
3. Send a self-contained research prompt:

```bash
$COMPUTER agent prompt <machine> \
  "Run a deep research workflow on <topic>. Write all outputs to /workspace/outputs/. When done, write a summary to /workspace/outputs/summary.md." \
  --agent claude --name research
```

4. Monitor: `$COMPUTER agent watch <machine> --session <session_id>`
5. Retrieve: `$COMPUTER agent prompt <machine> "cat /workspace/outputs/summary.md" --session <session_id>`
6. Clean up: `$COMPUTER agent close <machine> --session <session_id>`

## ACP bridge

Expose a remote machine agent as a local ACP-compatible stdio process:

```bash
$COMPUTER acp serve <machine> --agent claude --name research
```

This lets local ACP clients (including Feynman's subagents) talk to a remote agent as if it were local. Keep the bridge process running; reconnect by restarting the command with the same session name.

## Session naming

Use short stable names that match the task:

- `research` — general research delegation
- `experiment` — autoresearch loops
- `review` — verification passes
- `literature` — literature sweeps

Reuse the same name when continuing the same line of work.

## References

- [CLI cheatsheet](references/cli-cheatsheet.md) — full command reference
- [ACP flow](references/acp-flow.md) — protocol details for the ACP bridge

@@ -1,23 +0,0 @@
# ACP Flow

The `computer acp serve` bridge makes a remote machine agent look like a local ACP server over stdio.

## Basic shape

1. The local client starts `computer acp serve <machine> --agent <agent> --name <session>`.
2. The bridge handles ACP initialization on stdin/stdout.
3. The bridge maps ACP session operations onto Agent Computer session APIs.
4. Remote session updates are streamed back as ACP `session/update` notifications.

## Good commands

```bash
computer acp serve my-box --agent claude --name research
computer acp serve gpu-worker --agent claude --name experiment
```

## Recommended client behavior

- Reuse a stable session name when reconnecting.
- Treat the bridge as the single local command for remote-agent interaction.
- Use the normal `computer agent ...` commands outside ACP when you need manual inspection or cleanup.

@@ -1,68 +0,0 @@
# CLI Cheatsheet

## Authentication

```bash
computer whoami
computer login
computer claude-login  # install Claude credentials on a machine
computer codex-login   # install Codex credentials on a machine
```

## Machine discovery

```bash
computer ls --json
computer fleet status --json
```

## Agent discovery

```bash
computer agent agents <machine> --json
```

## Sessions

```bash
computer agent sessions list <machine> --json
computer agent sessions new <machine> --agent claude --name research --json
computer agent status <machine> --session <session_id> --json
```

## Prompting

```bash
computer agent prompt <machine> "run the experiment" --agent claude --name research
computer agent prompt <machine> "continue" --session <session_id>
```

## Streaming and control

```bash
computer agent watch <machine> --session <session_id>
computer agent cancel <machine> --session <session_id> --json
computer agent interrupt <machine> --session <session_id> --json
computer agent close <machine> --session <session_id>
```

## ACP bridge

```bash
computer acp serve <machine> --agent claude --name research
```

## Machine lifecycle

```bash
computer create my-box
computer open my-box
computer open my-box --terminal
computer ssh my-box
```

## Good defaults

- Prefer machine handles over machine ids when both are available.
- Prefer `--name` for human-meaningful persistent sessions.
- Prefer `--json` when another program or agent needs to read the result.

File diff suppressed because one or more lines are too long
@@ -20,7 +20,6 @@ Before running code, Feynman asks you to choose an execution environment:
 - **Local** — run in the current working directory
 - **Virtual environment** — create an isolated venv/conda env first
 - **Docker** — run experiment code inside an isolated Docker container
-- **Cloud** — delegate to a remote Agent Computer machine
 - **Plan only** — produce the replication plan without executing

 ## Example

@@ -122,10 +122,6 @@ import AsciiLogo from '../components/AsciiLogo.astro';
       <div class="font-semibold mb-1"><a href="https://www.docker.com/" class="text-accent hover:underline">Docker</a></div>
       <p class="text-sm text-text-muted">Isolated container execution for safe local experiments</p>
     </div>
-    <div class="bg-surface rounded-xl p-5">
-      <div class="font-semibold mb-1"><a href="https://agentcomputer.ai" class="text-accent hover:underline">Agent Computer</a></div>
-      <p class="text-sm text-text-muted">Secure cloud execution for GPU workloads and long-running research</p>
-    </div>
     <div class="bg-surface rounded-xl p-5">
       <div class="font-semibold mb-1">Web search</div>
       <p class="text-sm text-text-muted">Gemini or Perplexity, zero-config default</p>

@@ -144,7 +140,7 @@ import AsciiLogo from '../components/AsciiLogo.astro';

 <section class="py-20 px-6 text-center">
   <div class="max-w-xl mx-auto">
-    <p class="text-text-muted mb-6">Built on <a href="https://github.com/badlogic/pi-mono" class="text-accent hover:underline">Pi</a>, <a href="https://www.alphaxiv.org/" class="text-accent hover:underline">alphaXiv</a>, and <a href="https://agentcomputer.ai" class="text-accent hover:underline">Agent Computer</a>. MIT licensed. Open source.</p>
+    <p class="text-text-muted mb-6">Built on <a href="https://github.com/badlogic/pi-mono" class="text-accent hover:underline">Pi</a> and <a href="https://www.alphaxiv.org/" class="text-accent hover:underline">alphaXiv</a>. MIT licensed. Open source.</p>
     <div class="flex gap-4 justify-center flex-wrap">
       <a href="/docs/getting-started/installation" class="px-6 py-2.5 rounded-lg bg-accent text-bg font-semibold text-sm hover:bg-accent-hover transition-colors">Get started</a>
       <a href="https://github.com/getcompanion-ai/feynman" target="_blank" rel="noopener" class="px-6 py-2.5 rounded-lg border border-border text-text-muted font-semibold text-sm hover:border-text-dim hover:text-text-primary transition-colors">GitHub</a>