Refine research workflows and remove Agent Computer
@@ -26,7 +26,6 @@ Ask the user where to run:
- **New git branch** — create a branch so main stays clean
- **Virtual environment** — create an isolated venv/conda env first
- **Docker** — run experiment code inside an isolated Docker container
- **Cloud** — delegate to a remote Agent Computer machine via `/delegate`

Do not proceed without a clear answer.
@@ -34,6 +34,16 @@ Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 wo
- [ ] Contradictions identified and addressed
- [ ] No single-source claims on critical findings

## Task Ledger

| ID | Owner | Task | Status | Output |
|---|---|---|---|---|
| T1 | lead / researcher | ... | todo | ... |

## Verification Log

| Item | Method | Status | Evidence |
|---|---|---|---|
| Critical claim / computation / figure | source cross-read / rerun / direct fetch / code check | pending | path or URL |

## Decision Log

(Updated as the workflow progresses)
```
@@ -60,6 +70,7 @@ Launch parallel `researcher` subagents via `subagent`. Each gets a structured br
- **Output format:** numbered sources, evidence table, inline source references
- **Tool guidance:** which search tools to prioritize
- **Task boundaries:** what NOT to cover (another researcher handles that)
- **Task IDs:** the specific ledger rows they own and must report back on

Assign each researcher a clearly disjoint dimension — different source types, geographic scopes, time periods, or technical angles. Never duplicate coverage.
@@ -75,6 +86,7 @@ Assign each researcher a clearly disjoint dimension — different source types,
```

Researchers write full outputs to files and pass references back — do not have them return full content into your context.
Researchers must not silently merge or skip assigned tasks. If something is impossible or redundant, mark the ledger row `blocked` or `superseded` with a note.

## 4. Evaluate and loop
@@ -83,10 +95,11 @@ After researchers return, read their output files and critically assess:
- Which answers rest on only one source?
- Are there contradictions needing resolution?
- Is any key angle missing entirely?
- Did every assigned ledger task actually get completed, blocked, or explicitly superseded?

If gaps are significant, spawn another targeted batch of researchers. No fixed cap on rounds — iterate until evidence is sufficient or sources are exhausted.

Update the plan artifact (`outputs/.plans/<slug>.md`) decision log after each round.
Update the plan artifact (`outputs/.plans/<slug>.md`) task ledger, verification log, and decision log after each round.

Most topics need 1-2 rounds. Stop when additional rounds would not materially change conclusions.
@@ -111,6 +124,12 @@ Unresolved issues, disagreements between sources, gaps in evidence.

When the research includes quantitative data (benchmarks, performance comparisons, trends), generate charts using `pi-charts`. Use Mermaid diagrams for architectures and processes. Every visual must have a caption and reference the underlying data.

Before finalizing the draft, do a claim sweep:
- map each critical claim, number, and figure to its supporting source or artifact in the verification log
- downgrade or remove anything that cannot be grounded
- label inferences as inferences
- if code or calculations were involved, record which checks were actually run and which remain unverified
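The sweep above can be sketched mechanically. This is a minimal illustration only, assuming a hypothetical tagging scheme in which critical claims are marked `[C1]`, `[C2]`, … in the draft and mirrored as rows in the verification log — the workflow itself does not prescribe any tag format:

```shell
# Sketch: list claim tags that appear in the draft but have no matching
# row in the verification log. The [C<n>] tag convention is an assumption.
draft=$(mktemp); log=$(mktemp)
printf 'Latency fell 40%% [C1]. Accuracy also rose [C2].\n' > "$draft"
printf '| [C1] | rerun | verified | results/bench.csv |\n' > "$log"

# Collect unique claim tags from the draft, then flag any tag
# that never appears in the verification log.
ungrounded=$(grep -o '\[C[0-9]*\]' "$draft" | sort -u \
  | while read -r claim; do
      grep -qF "$claim" "$log" || echo "UNGROUNDED: $claim"
    done)
echo "$ungrounded"
rm -f "$draft" "$log"
```

Anything the sweep flags gets downgraded, sourced, or removed before the draft is saved.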

Save this draft to `outputs/.drafts/<slug>-draft.md`.

## 6. Cite
@@ -136,6 +155,7 @@ Spawn the `reviewer` agent against the cited draft. The reviewer checks for:
```

If the reviewer flags FATAL issues, fix them in the brief before delivering. MAJOR issues get noted in the Open Questions section. MINOR issues are accepted.
After fixes, run at least one more review-style verification pass if any FATAL issues were found. Do not assume one fix solved everything.

## 8. Deliver
@@ -1,21 +0,0 @@
---
description: Delegate a research task to a remote Agent Computer machine for cloud execution.
args: <task>
section: Internal
---
Delegate the following task to a remote Agent Computer machine: $@

## Workflow

1. **Check CLI** — Verify `computer` or `aicomputer` is installed and authenticated. If not, install with `npm install -g aicomputer` and run `computer login`.
2. **Pick a machine** — Run `computer ls --json` and choose an appropriate machine. If none are running, tell the user to create one with `computer create`.
3. **Pick an agent** — Run `computer agent agents <machine> --json` and choose an installed agent with credentials (prefer Claude).
4. **Create a session** — Use `computer agent sessions new <machine> --agent claude --name research --json`.
5. **Send the task** — Translate the user's research task into a self-contained prompt and send it via `computer agent prompt`. The prompt must include:
   - The full research objective
   - Where to write outputs (default: `/workspace/outputs/`)
   - What artifact to produce when done (summary file)
   - Any tools or data sources to use
6. **Monitor** — Use `computer agent watch <machine> --session <session_id>` to stream progress. Report status to the user at meaningful milestones.
7. **Retrieve results** — When the remote agent finishes, pull the results back with `computer agent prompt <machine> "cat /workspace/outputs/<slug>.md" --session <session_id>` (derive the slug from the task topic). Present results to the user.
8. **Clean up** — Close the session with `computer agent close <machine> --session <session_id>` unless the user wants to continue.
@@ -9,10 +9,11 @@ Write a paper-style draft for: $@
Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 words). Use this slug for all files in this run.
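The slug rule can be sketched as a small shell helper — a minimal sketch only, since the workflow does not define which words count as filler (the stop-word list below is an illustrative assumption):

```shell
# Minimal sketch of slug derivation: lowercase, hyphen-joined,
# drop filler words, keep at most 5 words.
slugify() {
  echo "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' ' ' \
    | awk '{
        n = 0
        for (i = 1; i <= NF && n < 5; i++) {
          w = $i
          # skip common filler words (illustrative list, an assumption)
          if (w == "a" || w == "an" || w == "the" || w == "of" ||
              w == "for" || w == "and" || w == "in" || w == "on") continue
          out = (n ? out "-" : "") w
          n++
        }
        print out
      }'
}

slugify "A Survey of the State of LLM Agents"   # → survey-state-llm-agents
```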

Requirements:
- Before writing, outline the draft structure: proposed title, sections, key claims to make, and source material to draw from. Write the outline to `outputs/.plans/<slug>.md`. Present the outline to the user and confirm before proceeding.
- Before writing, outline the draft structure: proposed title, sections, key claims to make, source material to draw from, and a verification log for the critical claims, figures, and calculations. Write the outline to `outputs/.plans/<slug>.md`. Present the outline to the user and confirm before proceeding.
- Use the `writer` subagent when the draft should be produced from already-collected notes, then use the `verifier` subagent to add inline citations and verify sources.
- Include at minimum: title, abstract, problem statement, related work, method or synthesis, evidence or experiments, limitations, conclusion.
- Use clean Markdown with LaTeX where equations materially help.
- Generate charts with `pi-charts` for quantitative data, benchmarks, and comparisons. Use Mermaid for architectures and pipelines. Every figure needs a caption.
- Before delivery, sweep the draft for any claim that sounds stronger than its support. Mark tentative results as tentative and remove unsupported numerics instead of letting the verifier discover them later.
- Save exactly one draft to `papers/<slug>.md`.
- End with a `Sources` appendix with direct URLs for all primary references.
@@ -10,9 +10,9 @@ Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 wo

## Workflow

1. **Plan** — Outline the scope: key questions, source types to search (papers, web, repos), time period, and expected sections. Write the plan to `outputs/.plans/<slug>.md`. Present the plan to the user and confirm before proceeding.
2. **Gather** — Use the `researcher` subagent when the sweep is wide enough to benefit from delegated paper triage before synthesis. For narrow topics, search directly. Researcher outputs go to `<slug>-research-*.md`.
3. **Synthesize** — Separate consensus, disagreements, and open questions. When useful, propose concrete next experiments or follow-up reading. Generate charts with `pi-charts` for quantitative comparisons across papers and Mermaid diagrams for taxonomies or method pipelines.
1. **Plan** — Outline the scope: key questions, source types to search (papers, web, repos), time period, expected sections, and a small task ledger plus verification log. Write the plan to `outputs/.plans/<slug>.md`. Present the plan to the user and confirm before proceeding.
2. **Gather** — Use the `researcher` subagent when the sweep is wide enough to benefit from delegated paper triage before synthesis. For narrow topics, search directly. Researcher outputs go to `<slug>-research-*.md`. Do not silently skip assigned questions; mark them `done`, `blocked`, or `superseded`.
3. **Synthesize** — Separate consensus, disagreements, and open questions. When useful, propose concrete next experiments or follow-up reading. Generate charts with `pi-charts` for quantitative comparisons across papers and Mermaid diagrams for taxonomies or method pipelines. Before finishing the draft, sweep every strong claim against the verification log and downgrade anything that is inferred or single-source critical.
4. **Cite** — Spawn the `verifier` agent to add inline citations and verify every source URL in the draft.
5. **Verify** — Spawn the `reviewer` agent to check the cited draft for unsupported claims, logical gaps, and single-source critical findings. Fix FATAL issues before delivering. Note MAJOR issues in Open Questions.
5. **Verify** — Spawn the `reviewer` agent to check the cited draft for unsupported claims, logical gaps, zombie sections, and single-source critical findings. Fix FATAL issues before delivering. Note MAJOR issues in Open Questions. If FATAL issues were found, run one more verification pass after the fixes.
6. **Deliver** — Save the final literature review to `outputs/<slug>.md`. Write a provenance record alongside it as `outputs/<slug>.provenance.md` listing: date, sources consulted vs. accepted vs. rejected, verification status, and intermediate research files used.
@@ -9,14 +9,13 @@ Design a replication plan for: $@
## Workflow

1. **Extract** — Use the `researcher` subagent to pull implementation details from the target paper and any linked code.
2. **Plan** — Determine what code, datasets, metrics, and environment are needed. Be explicit about what is verified, what is inferred, and what is still missing.
2. **Plan** — Determine what code, datasets, metrics, and environment are needed. Be explicit about what is verified, what is inferred, what is still missing, and which checks or test oracles will be used to decide whether the replication succeeded.
3. **Environment** — Before running anything, ask the user where to execute:
   - **Local** — run in the current working directory
   - **Virtual environment** — create an isolated venv/conda env first
   - **Docker** — run experiment code inside an isolated Docker container
   - **Cloud** — delegate to a remote Agent Computer machine via `/delegate`
   - **Plan only** — produce the replication plan without executing
4. **Execute** — If the user chose an execution environment, implement and run the replication steps there. Save notes, scripts, and results to disk in a reproducible layout.
4. **Execute** — If the user chose an execution environment, implement and run the replication steps there. Save notes, scripts, raw outputs, and results to disk in a reproducible layout. Do not call the outcome replicated unless the planned checks actually passed.
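One way to make "planned checks actually passed" concrete is a pre-registered tolerance oracle. The sketch below is a hedged illustration, not part of the workflow: the relative-tolerance shape and the 5% default are assumptions to be fixed in the plan step:

```shell
# Sketch: exit 0 iff a measured metric is within a relative tolerance
# of the paper's reported value. The 5% default is an assumption.
replicated() {
  reported=$1; measured=$2; tol=${3:-0.05}
  awk -v r="$reported" -v m="$measured" -v t="$tol" 'BEGIN {
    d = m - r; if (d < 0) d = -d          # absolute difference
    lim = t * (r < 0 ? -r : r)            # tolerance band around reported
    exit (d <= lim) ? 0 : 1
  }'
}

replicated 0.80 0.78 && echo "within tolerance"
replicated 0.80 0.70 || echo "outside tolerance"
```

Recording the oracle and tolerance in the plan before execution keeps the success criterion from drifting after results come in.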
5. **Report** — End with a `Sources` section containing paper and repository URLs.

Do not install packages, run training, or execute experiments without confirming the execution environment first.
@@ -9,9 +9,10 @@ Review this AI research artifact: $@
Derive a short slug from the artifact name (lowercase, hyphens, no filler words, ≤5 words). Use this slug for all files in this run.

Requirements:
- Before starting, outline what will be reviewed and the review criteria (novelty, empirical rigor, baselines, reproducibility, etc.). Present the plan to the user and confirm before proceeding.
- Before starting, outline what will be reviewed, the review criteria (novelty, empirical rigor, baselines, reproducibility, etc.), and any verification-specific checks needed for claims, figures, and reported metrics. Present the plan to the user and confirm before proceeding.
- Spawn a `researcher` subagent to gather evidence on the artifact — inspect the paper, code, cited work, and any linked experimental artifacts. Save to `<slug>-research.md`.
- Spawn a `reviewer` subagent with `<slug>-research.md` to produce the final peer review with inline annotations.
- For small or simple artifacts where evidence gathering is overkill, run the `reviewer` subagent directly instead.
- If the first review finds FATAL issues and you fix them, run one more verification-style review pass before delivering.
- Save exactly one review artifact to `outputs/<slug>-review.md`.
- End with a `Sources` section containing direct URLs for every inspected external source.