Make deepresearch execute reliably over RPC
package-lock.json · 4 · generated
@@ -1,12 +1,12 @@
 {
   "name": "@companion-ai/feynman",
-  "version": "0.2.32",
+  "version": "0.2.33",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "@companion-ai/feynman",
-      "version": "0.2.32",
+      "version": "0.2.33",
       "hasInstallScript": true,
       "license": "MIT",
       "dependencies": {
@@ -1,6 +1,6 @@
 {
   "name": "@companion-ai/feynman",
-  "version": "0.2.32",
+  "version": "0.2.33",
   "description": "Research-first CLI agent built on Pi and alphaXiv",
   "license": "MIT",
   "type": "module",
@@ -4,221 +4,175 @@ args: <topic>
 section: Research Workflows
 topLevelCli: true
 ---
-Run a deep research workflow for: $@
+Run deep research for: $@
 
-This is an execution request, not a request to explain or implement the workflow instructions. Carry out the workflow with tools and durable files. Do not answer by describing the protocol, converting it into programming steps, or saying how someone could implement it.
+This is an execution request, not a request to explain or implement the workflow instructions.
+Execute the workflow. Do not answer by describing the protocol, do not explain these instructions, do not restate the protocol, and do not ask for confirmation. Do not stop after planning. Your first actions should be tool calls that create directories and write the plan artifact.
 
-You are the Lead Researcher. You plan, delegate, evaluate, verify, write, and cite. Internal orchestration is invisible to the user unless they ask.
+## Required Artifacts
 
-## 1. Plan
+Derive a short slug from the topic: lowercase, hyphenated, no filler words, at most 5 words.
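The slug rule in the new prompt text (lowercase, hyphenated, filler words dropped, at most five words) amounts to a small string transform. A hypothetical sketch for illustration only; the filler-word list is an assumption, not part of the commit:

```javascript
// Hypothetical sketch of the slug rule: lowercase, hyphenated,
// filler words dropped, at most five words. FILLER is an assumed list.
const FILLER = new Set(["a", "an", "the", "of", "for", "and", "to", "in", "on"]);

function deriveSlug(topic) {
  return topic
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "")       // strip punctuation
    .split(/[\s-]+/)                    // split into words
    .filter((w) => w && !FILLER.has(w)) // drop empty and filler words
    .slice(0, 5)                        // keep at most five words
    .join("-");
}

console.log(deriveSlug("Pricing of cloud sandboxes")); // → "pricing-cloud-sandboxes"
```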
 
-Analyze the research question using extended thinking. Develop a research strategy:
+Every run must leave these files on disk:
-- Key questions that must be answered
+- `outputs/.plans/<slug>.md`
-- Evidence types needed (papers, web, code, data, docs)
+- `outputs/.drafts/<slug>-draft.md`
-- Sub-questions disjoint enough to parallelize
+- `outputs/.drafts/<slug>-cited.md`
-- Source types and time periods that matter
+- `outputs/<slug>.md` or `papers/<slug>.md`
-- Acceptance criteria: what evidence would make the answer "sufficient"
+- `outputs/<slug>.provenance.md` or `papers/<slug>.provenance.md`
 
+If any capability fails, continue in degraded mode and still write a blocked or partial final output and provenance sidecar. Never end with chat-only output. Never end with only an explanation in chat. Use `Verification: BLOCKED` when verification could not be completed.
 
+## Step 1: Plan
 
+Create `outputs/.plans/<slug>.md` immediately. The plan must include:
+- Key questions
+- Evidence needed
+- Scale decision
+- Task ledger
+- Verification log
+- Decision log
 
 Make the scale decision before assigning owners in the plan. If the topic is a narrow "what is X" explainer, the plan must use lead-owned direct search tasks only; do not allocate researcher subagents in the task ledger.
 
-Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 words — e.g. "cloud-sandbox-pricing" not "deepresearch-plan"). Write the plan to `outputs/.plans/<slug>.md` as a self-contained artifact. Use this same slug for all artifacts in this run.
+Also save the plan with `memory_remember` using key `deepresearch.<slug>.plan` if that tool is available. If it is not available, continue without it.
-If `CHANGELOG.md` exists, read the most recent relevant entries before finalizing the plan. Once the workflow becomes multi-round or spans enough work to merit resume support, append concise entries to `CHANGELOG.md` after meaningful progress and before stopping.
 
-```markdown
+After writing the plan, continue immediately. Do not pause for approval.
-# Research Plan: [topic]
 
-## Questions
+## Step 2: Scale
-1. ...
 
-## Strategy
+Use direct search for:
-- Researcher allocations and dimensions
+- Single fact or narrow question, including "what is X" explainers
-- Expected rounds
+- Work you can answer with 3-10 tool calls
 
-## Acceptance Criteria
-- [ ] All key questions answered with ≥2 independent sources
-- [ ] Contradictions identified and addressed
-- [ ] No single-source claims on critical findings
 
-## Task Ledger
-| ID | Owner | Task | Status | Output |
-|---|---|---|---|---|
-| T1 | lead / researcher | ... | todo | ... |
 
-## Verification Log
-| Item | Method | Status | Evidence |
-|---|---|---|---|
-| Critical claim / computation / figure | source cross-read / rerun / direct fetch / code check | pending | path or URL |
 
-## Decision Log
-(Updated as the workflow progresses)
-```
 
-Also save the plan with `memory_remember` (type: `fact`, key: `deepresearch.<slug>.plan`) so it survives context truncation.
 
-Briefly summarize the plan to the user and continue immediately. Do not ask for confirmation or wait for a proceed response unless the user explicitly requested plan review.
 
-Do not stop after planning. If live search, subagents, web access, alphaXiv, or any other capability is unavailable, continue in degraded mode and write a durable blocked/partial report that records exactly which capabilities failed.
 
-## 2. Scale decision
 
-| Query type | Execution |
-|---|---|
-| Single fact or narrow question, including "what is X" explainers | Search directly yourself, no subagents, 3-10 tool calls |
-| Direct comparison (2-3 items) | 2 parallel `researcher` subagents |
-| Broad survey or multi-faceted topic | 3-4 parallel `researcher` subagents |
-| Complex multi-domain research | 4-6 parallel `researcher` subagents |
 
-Never spawn subagents for work you can do in 5 tool calls.
 For "what is X" explainer topics, you MUST NOT spawn researcher subagents unless the user explicitly asks for comprehensive coverage, current landscape, benchmarks, or production deployment.
 Do not inflate a simple explainer into a multi-agent survey.
 
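The scale rules in this hunk reduce to a small lookup from query kind to researcher count. A hypothetical sketch; the query-kind names are illustrative, not taken from the prompt:

```javascript
// Hypothetical sketch of the scale decision. Kinds and counts mirror
// the rules in the diff above; the kind names are illustrative.
function researcherCount(queryKind) {
  switch (queryKind) {
    case "narrow":       // single fact or "what is X" explainer
      return 0;          // lead-owned direct search, no subagents
    case "comparison":   // direct comparison of 2-3 items
      return 2;
    case "survey":       // broad or multi-faceted topic
      return 3;          // 3-4 in practice; minimum shown here
    case "multi-domain": // complex multi-domain research
      return 4;          // 4-6 in practice
    default:
      return 0;          // default to direct search when unsure
  }
}

console.log(researcherCount("survey")); // → 3
```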
-## 3. Spawn researchers
+Use subagents only when decomposition clearly helps:
+- Direct comparison of 2-3 items: 2 `researcher` subagents
+- Broad survey or multi-faceted topic: 3-4 `researcher` subagents
+- Complex multi-domain research: 4-6 `researcher` subagents
 
-Skip this section entirely when the scale decision chose direct search/no subagents. In that case, gather evidence yourself with search/fetch/paper tools, write notes directly to `<slug>-research-direct.md`, and continue to Section 4.
+## Step 3: Gather Evidence
 
-Launch parallel `researcher` subagents via `subagent`. Each gets a structured brief with:
+Avoid crash-prone PDF parsing in this workflow. Do not call `alpha_get_paper` and do not fetch `.pdf` URLs unless the user explicitly asks for PDF extraction. Prefer paper metadata, abstracts, HTML pages, official docs, and web snippets. If only a PDF exists, cite the PDF URL from search metadata and mark full-text PDF parsing as blocked instead of fetching it.
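The PDF-avoidance rule added here could be enforced mechanically before any fetch. A hypothetical guard, not code from the commit:

```javascript
// Hypothetical guard for the PDF-avoidance rule: never fetch .pdf
// URLs; cite them and mark full-text parsing as blocked instead.
function classifySource(url) {
  const path = new URL(url).pathname.toLowerCase();
  if (path.endsWith(".pdf")) {
    return { fetch: false, note: "cite URL; mark PDF parsing as blocked" };
  }
  return { fetch: true, note: "fetch HTML/metadata" };
}

console.log(classifySource("https://example.com/paper.pdf").fetch); // → false
```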
-- **Objective:** what to find
-- **Output format:** numbered sources, evidence table, inline source references
-- **Tool guidance:** which search tools to prioritize
-- **Task boundaries:** what NOT to cover (another researcher handles that)
-- **Task IDs:** the specific ledger rows they own and must report back on
 
-Assign each researcher a clearly disjoint dimension — different source types, geographic scopes, time periods, or technical angles. Never duplicate coverage.
+If direct search was chosen:
-Keep `subagent` tool-call JSON small and valid. For detailed task instructions, write a per-researcher brief first, e.g. `outputs/.plans/<slug>-T1.md`, then pass a short task string that points to that brief and the required output file. Do not place multi-paragraph instructions inside the `subagent` JSON.
+- Skip researcher spawning entirely.
-Use only supported `subagent` keys. Do not add extra keys such as `artifacts` unless the tool schema explicitly exposes them.
+- Search and fetch sources yourself.
-When using parallel researchers, always set `failFast: false` so one blocked researcher does not abort the whole workflow.
+- Write notes to `<slug>-research-direct.md`.
-Do not name exact tool commands in subagent tasks unless those tool names are visible in the current tool set. Prefer broad guidance such as "use paper search and web search"; if a PDF parser or paper fetch fails, the researcher must continue from metadata, abstracts, and web sources and mark PDF parsing as blocked.
+- Continue to synthesis.
 
-```
+If subagents were chosen:
+- Write a per-researcher brief first, such as `outputs/.plans/<slug>-T1.md`.
+- Keep `subagent` tool-call JSON small and valid.
+- Do not place multi-paragraph instructions inside the `subagent` JSON.
+- Use only supported `subagent` keys. Do not add extra keys such as `artifacts` unless the tool schema explicitly exposes them.
+- Always set `failFast: false`.
+- Do not name exact tool commands in subagent tasks unless those tool names are visible in the current tool set.
+- Prefer broad guidance such as "use paper search and web search"; if a PDF parser or paper fetch fails, the researcher must continue from metadata, abstracts, and web sources and mark PDF parsing as blocked.
 
+Example shape:
 
+```json
 {
-tasks: [
+"tasks": [
-{ agent: "researcher", task: "Read outputs/.plans/<slug>-T1.md and write <slug>-research-web.md.", output: "<slug>-research-web.md" },
+{ "agent": "researcher", "task": "Read outputs/.plans/<slug>-T1.md and write <slug>-research-web.md.", "output": "<slug>-research-web.md" },
-{ agent: "researcher", task: "Read outputs/.plans/<slug>-T2.md and write <slug>-research-papers.md.", output: "<slug>-research-papers.md" }
+{ "agent": "researcher", "task": "Read outputs/.plans/<slug>-T2.md and write <slug>-research-papers.md.", "output": "<slug>-research-papers.md" }
 ],
-concurrency: 4,
+"concurrency": 4,
-failFast: false
+"failFast": false
 }
 ```
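The "small and valid JSON" constraint on `subagent` calls can be checked before dispatch. A hypothetical validator; the character budgets are assumptions for illustration, not values from the prompt:

```javascript
// Hypothetical pre-dispatch check for the subagent payload rules in
// the diff above: short task strings that point at brief files, valid
// serializable JSON, failFast disabled. Size budgets are assumed.
function validateSubagentPayload(payload) {
  const json = JSON.stringify(payload); // throws only on circular refs
  const tasksOk = payload.tasks.every(
    (t) => t.task.length < 500 && t.task.includes("outputs/.plans/")
  );
  return json.length < 2000 && tasksOk && payload.failFast === false;
}

const ok = validateSubagentPayload({
  tasks: [
    {
      agent: "researcher",
      task: "Read outputs/.plans/<slug>-T1.md and write <slug>-research-web.md.",
      output: "<slug>-research-web.md",
    },
  ],
  concurrency: 4,
  failFast: false,
});
console.log(ok); // → true
```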
 
-Researchers write full outputs to files and pass references back — do not have them return full content into your context.
+After evidence gathering, update the plan ledger and verification log. If research failed, record exactly what failed and proceed with a blocked or partial draft.
-Researchers must not silently merge or skip assigned tasks. If something is impossible or redundant, mark the ledger row `blocked` or `superseded` with a note.
 
-## 4. Evaluate and loop
+## Step 4: Draft
 
-After researchers return, read their output files and critically assess:
+Write the report yourself. Do not delegate synthesis.
-- Which plan questions remain unanswered?
-- Which answers rest on only one source?
-- Are there contradictions needing resolution?
-- Is any key angle missing entirely?
-- Did every assigned ledger task actually get completed, blocked, or explicitly superseded?
 
-If gaps are significant, spawn another targeted batch of researchers. No fixed cap on rounds — iterate until evidence is sufficient or sources are exhausted.
+Save to `outputs/.drafts/<slug>-draft.md`.
 
-Update the plan artifact (`outputs/.plans/<slug>.md`) task ledger, verification log, and decision log after each round.
+Include:
-When the work spans multiple rounds, also append a concise chronological entry to `CHANGELOG.md` covering what changed, what was verified, what remains blocked, and the next recommended step.
+- Executive summary
+- Findings organized by question/theme
+- Evidence-backed caveats and disagreements
+- Open questions
+- No invented sources, results, figures, benchmarks, images, charts, or tables
 
-Most topics need 1-2 rounds. Stop when additional rounds would not materially change conclusions.
+Before citation, sweep the draft:
+- Every critical claim, number, figure, table, or benchmark must map to a source URL, research note, raw artifact path, or command/script output.
+- Remove or downgrade unsupported claims.
+- Mark inferences as inferences.
 
-If no researcher files can be produced because tools, subagents, or network access failed, create `outputs/.drafts/<slug>-draft.md` yourself as a blocked report with:
+## Step 5: Cite
-- what was requested,
-- which capabilities failed,
-- what evidence was and was not gathered,
-- a proposed source-gathering plan,
-- no invented sources or results.
 
-## 5. Write the report
+If direct search/no researcher subagents was chosen:
+- Do citation yourself.
+- Verify reachable HTML/doc URLs with available fetch/search tools.
+- Copy or rewrite `outputs/.drafts/<slug>-draft.md` to `outputs/.drafts/<slug>-cited.md` with inline citations and a Sources section.
+- Do not spawn the `verifier` subagent for simple direct-search runs.
 
-Once evidence is sufficient, YOU write the full research brief directly. Do not delegate writing to another agent. Read the research files, synthesize the findings, and produce a complete document:
+If researcher subagents were used, run the `verifier` agent after the draft exists. This step is mandatory and must complete before any reviewer runs. Do not run the `verifier` and `reviewer` in the same parallel `subagent` call.
 
-```markdown
+Use this shape:
-# Title
 
-## Executive Summary
+```json
-2-3 paragraph overview of key findings.
+{
+"agent": "verifier",
-## Section 1: ...
+"task": "Add inline citations to outputs/.drafts/<slug>-draft.md using the research files as source material. Verify every URL. Write the complete cited brief to outputs/.drafts/<slug>-cited.md.",
-Detailed findings organized by theme or question.
+"output": "outputs/.drafts/<slug>-cited.md"
+}
-## Section N: ...
 
-## Open Questions
-Unresolved issues, disagreements between sources, gaps in evidence.
 ```
 
-When the research includes quantitative data (benchmarks, performance comparisons, trends), generate charts using `pi-charts`. Use Mermaid diagrams for architectures and processes. Every visual must have a caption and reference the underlying data.
+After the verifier returns, verify on disk that `outputs/.drafts/<slug>-cited.md` exists. If the verifier wrote elsewhere, find the cited file and move or copy it to `outputs/.drafts/<slug>-cited.md`.
 
-Before finalizing the draft, do a claim sweep:
+## Step 6: Review
-- map each critical claim, number, and figure to its supporting source or artifact in the verification log
-- downgrade or remove anything that cannot be grounded
-- label inferences as inferences
-- if code or calculations were involved, record which checks were actually run and which remain unverified
 
-Save this draft to `outputs/.drafts/<slug>-draft.md`.
+If direct search/no researcher subagents was chosen:
+- Review the cited draft yourself.
+- Write `<slug>-verification.md` with FATAL / MAJOR / MINOR findings and the checks performed.
+- Fix FATAL issues before delivery.
+- Do not spawn the `reviewer` subagent for simple direct-search runs.
 
-## 6. Cite
+If researcher subagents were used, only after `outputs/.drafts/<slug>-cited.md` exists, run the `reviewer` agent against it.
 
-Spawn the `verifier` agent to post-process YOUR draft. The verifier agent adds inline citations, verifies every source URL, and produces the final output:
+Use this shape:
 
-```
+```json
-{ agent: "verifier", task: "Add inline citations to outputs/.drafts/<slug>-draft.md using the research files as source material. Verify every URL. Write the complete cited brief to outputs/.drafts/<slug>-cited.md.", output: "outputs/.drafts/<slug>-cited.md" }
+{
+"agent": "reviewer",
+"task": "Verify outputs/.drafts/<slug>-cited.md. Flag unsupported claims, logical gaps, single-source critical claims, and overstated confidence. This is a verification pass, not a peer review.",
+"output": "<slug>-verification.md"
+}
 ```
 
-The verifier agent does not rewrite the report — it only anchors claims to sources and builds the numbered Sources section.
+If the reviewer flags FATAL issues, fix them before delivery and run one more review pass. Note MAJOR issues in Open Questions. Accept MINOR issues.
-This step is mandatory and must complete before any reviewer runs. Do not run the `verifier` and `reviewer` in the same parallel `subagent` call.
-After the verifier returns, verify on disk that `outputs/.drafts/<slug>-cited.md` exists. If the verifier wrote to a different path, find the cited file, move or copy it to `outputs/.drafts/<slug>-cited.md`, and use that path from this point forward.
 
-## 7. Verify
+When applying reviewer fixes, do not issue one giant `edit` tool call with many replacements. Use small localized edits only for 1-3 simple corrections. For section rewrites, table rewrites, or more than 3 substantive fixes, read the cited draft and write a corrected full file to `outputs/.drafts/<slug>-revised.md` instead.
 
-Only after `outputs/.drafts/<slug>-cited.md` exists, spawn the `reviewer` agent against that cited draft. The reviewer checks for:
-- Unsupported claims that slipped past citation
-- Logical gaps or contradictions between sections
-- Single-source claims on critical findings
-- Overstated confidence relative to evidence quality
 
-```
-{ agent: "reviewer", task: "Verify outputs/.drafts/<slug>-cited.md — flag any claims that lack sufficient source backing, identify logical gaps, and check that confidence levels match evidence strength. This is a verification pass, not a peer review.", output: "<slug>-verification.md" }
-```
 
-If the reviewer flags FATAL issues, fix them in the brief before delivering. MAJOR issues get noted in the Open Questions section. MINOR issues are accepted.
-After fixes, run at least one more review-style verification pass if any FATAL issues were found. Do not assume one fix solved everything.
-When applying reviewer fixes, do not issue one giant `edit` tool call with many replacements. Use small localized edits only when there are 1-3 simple corrections. For section rewrites, table rewrites, or more than 3 substantive fixes, read the cited draft and write a corrected full file to `outputs/.drafts/<slug>-revised.md` instead. Then run the follow-up review against `outputs/.drafts/<slug>-revised.md`.
 The final candidate is `outputs/.drafts/<slug>-revised.md` if it exists; otherwise it is `outputs/.drafts/<slug>-cited.md`.
 
-## 8. Deliver
+## Step 7: Deliver
 
-Copy the final cited and verified output to the appropriate folder:
+Copy the final candidate to:
-- Paper-style drafts → `papers/`
+- `papers/<slug>.md` for paper-style drafts
-- Everything else → `outputs/`
+- `outputs/<slug>.md` for everything else
 
-Save the final output as `<slug>.md` (in `outputs/` or `papers/` per the rule above).
+Write provenance next to it as `<slug>.provenance.md`:
 
-Write a provenance record alongside it as `<slug>.provenance.md`:
 
 ```markdown
 # Provenance: [topic]
 
 - **Date:** [date]
-- **Rounds:** [number of researcher rounds]
+- **Rounds:** [number of research rounds]
-- **Sources consulted:** [total unique sources across all research files]
+- **Sources consulted:** [count and/or list]
-- **Sources accepted:** [sources that survived citation verification]
+- **Sources accepted:** [count and/or list]
-- **Sources rejected:** [dead links, unverifiable, or removed]
+- **Sources rejected:** [dead, unverifiable, or removed]
-- **Verification:** [PASS / PASS WITH NOTES — summary of reviewer findings]
+- **Verification:** [PASS / PASS WITH NOTES / BLOCKED]
 - **Plan:** outputs/.plans/<slug>.md
-- **Research files:** [list of intermediate <slug>-research-*.md files]
+- **Research files:** [files used]
 ```
 
-Before you stop, verify on disk that all of these exist:
+Before responding, verify on disk that all required artifacts exist. If verification could not be completed, set `Verification: BLOCKED` or `PASS WITH NOTES` and list the missing checks.
-- `outputs/.plans/<slug>.md`
-- `outputs/.drafts/<slug>-draft.md`
-- `outputs/.drafts/<slug>-cited.md` intermediate cited brief
-- `outputs/<slug>.md` or `papers/<slug>.md` final promoted deliverable
-- `outputs/<slug>.provenance.md` or `papers/<slug>.provenance.md` provenance sidecar
 
-Do not stop at the cited or revised draft alone. If the cited/revised brief exists but the promoted final output or provenance sidecar does not, create them before responding.
+Final response should be brief: link the final file, provenance file, and any blocked checks.
-If full verification could not be completed, still create the final deliverable and provenance sidecar with `Verification: BLOCKED` or `PASS WITH NOTES` and list the missing checks. Never end with only an explanation in chat.
 
-## Background execution
 
-If the user wants unattended execution or the sweep will clearly take a while:
-- Launch the full workflow via `subagent` using `clarify: false, async: true`
-- Report the async ID and how to check status with `subagent_status`
@@ -91,8 +91,10 @@ test("deepresearch keeps subagent tool calls small and skips subagents for narro
   assert.match(deepResearchPrompt, /lead-owned direct search tasks only/i);
   assert.match(deepResearchPrompt, /MUST NOT spawn researcher subagents/i);
   assert.match(deepResearchPrompt, /Do not inflate a simple explainer into a multi-agent survey/i);
-  assert.match(deepResearchPrompt, /Skip this section entirely when the scale decision chose direct search\/no subagents/i);
+  assert.match(deepResearchPrompt, /Skip researcher spawning entirely/i);
   assert.match(deepResearchPrompt, /<slug>-research-direct\.md/i);
+  assert.match(deepResearchPrompt, /Do not call `alpha_get_paper`/i);
+  assert.match(deepResearchPrompt, /do not fetch `\.pdf` URLs/i);
   assert.match(deepResearchPrompt, /Keep `subagent` tool-call JSON small and valid/i);
   assert.match(deepResearchPrompt, /write a per-researcher brief first/i);
   assert.match(deepResearchPrompt, /Do not place multi-paragraph instructions inside the `subagent` JSON/i);