Overhaul Feynman harness: streamline agents, prompts, and extensions
Remove legacy chains, skills, and config modules. Add citation agent, SYSTEM.md, modular research-tools extension, and web-access layer. Add ralph-wiggum to Pi package stack for long-running loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,17 +0,0 @@
---
description: Design the smallest convincing ablation set for an AI research project.
---
Design an ablation plan for: $@

Requirements:
- Identify the exact claims the paper is making.
- For each claim, determine what ablation or control is necessary to support it.
- Prefer the `verifier` subagent when the claim structure is complicated.
- Distinguish:
  - must-have ablations
  - nice-to-have ablations
  - unnecessary experiments
- Call out where benchmark norms imply mandatory controls.
- Optimize for the minimum convincing set, not experiment sprawl.
- If the user wants a durable artifact, save exactly one plan to `outputs/` as markdown.
- End with a `Sources` section containing direct URLs for any external sources used.

@@ -4,11 +4,8 @@ description: Compare a paper's claims against its public codebase and identify m
Audit the paper and codebase for: $@

Requirements:
- Prefer the `researcher` subagent for evidence gathering and the `verifier` subagent for the mismatch pass when the audit is non-trivial.
- Identify the canonical paper first with `alpha_search` and `alpha_get_paper`.
- Extract implementation-sensitive claims with `alpha_ask_paper`.
- If a public repo exists, inspect it with `alpha_read_code`.
- Compare claimed methods, defaults, metrics, and data handling against the repository.
- Use the `researcher` subagent for evidence gathering and the `citation` subagent to verify sources and add inline citations when the audit is non-trivial.
- Compare claimed methods, defaults, metrics, and data handling against the actual code.
- Call out missing code, mismatches, ambiguous defaults, and reproduction risks.
- End with a `Sources` section containing paper and repository URLs.
- Save exactly one audit artifact to `outputs/` as markdown.
- End with a `Sources` section containing paper and repository URLs.

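The alpha-tool sequence above might look like this in practice. The call shapes and argument names are illustrative assumptions, not the tools' documented signatures; the `<...>` placeholders stand for values discovered at run time:

```
alpha_search({ query: "<paper title or topic>" })            // locate the canonical paper
alpha_get_paper({ id: "<paper id from search>" })            // pull its full record
alpha_ask_paper({ id: "<paper id>", question: "Which optimizer, schedule, and batch size are used?" })
alpha_read_code({ repo: "<linked repo>", path: "<file>" })   // inspect the public implementation
```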
@@ -1,19 +1,32 @@
---
description: Turn a research idea into a paper-oriented end-to-end run with literature, hypotheses, experiments when possible, and a draft artifact.
description: Autonomous experiment loop — try ideas, measure results, keep what works, discard what doesn't, repeat.
---
Run an autoresearch workflow for: $@
Start an autoresearch optimization loop for: $@

Requirements:
- Prefer the project `auto` chain or the `planner` + `researcher` + `verifier` + `writer` subagents when the task is broad enough to benefit from decomposition.
- If the run is likely to take a while, or the user wants it detached, launch the subagent workflow in background with `clarify: false, async: true` and report how to inspect status.
- Start by clarifying the research objective, scope, and target contribution.
- Search for the strongest relevant primary sources first.
- If the topic is current, product-oriented, market-facing, or asks about latest developments, start with `web_search` and `fetch_content`.
- Use `alpha_search` for academic background or paper-centric parts of the topic, but do not rely on it alone for current topics.
- Build a compact evidence table before committing to a paper narrative.
- If experiments are feasible in the current environment, design and run the smallest experiment that materially reduces uncertainty.
- If experiments are not feasible, produce a paper-style draft that is explicit about missing validation and limitations.
- Produce one final durable markdown artifact for the user-facing result.
- If the result is a paper-style draft, save it to `papers/`; otherwise save it to `outputs/`.
- Do not create extra user-facing intermediate markdown files unless the user explicitly asks for them.
- End with a `Sources` section containing direct URLs for every source used.
This command uses pi-autoresearch. Enter autoresearch mode and begin the autonomous experiment loop.

## Behavior

- If `autoresearch.md` and `autoresearch.jsonl` already exist in the project, resume the existing session with the user's input as additional context.
- Otherwise, gather the optimization target from the user:
  - What to optimize (test speed, bundle size, training loss, build time, etc.)
  - The benchmark command to run
  - The metric name, unit, and direction (lower/higher is better)
  - Files in scope for changes
- Then initialize the session: create `autoresearch.md`, `autoresearch.sh`, run the baseline, and start looping.

## Loop

Each iteration: edit → commit → `run_experiment` → `log_experiment` → keep or revert → repeat. Do not stop unless interrupted or `maxIterations` is reached.

## Key tools

- `init_experiment` — one-time session config (name, metric, unit, direction)
- `run_experiment` — run the benchmark command, capture output and wall-clock time
- `log_experiment` — record result, auto-commit, update dashboard
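One iteration of the loop, sketched as tool calls. Beyond the documented name/metric/unit/direction fields, the argument names here are assumptions about pi-autoresearch, not its actual schema:

```
init_experiment({ name: "bundle-size", metric: "gzip_size", unit: "KB", direction: "lower" })  // once per session
run_experiment({ command: "./autoresearch.sh" })     // run the benchmark, capture output and wall-clock time
log_experiment({ value: 412.3, note: "tree-shook lodash imports" })  // record result, auto-commit, update dashboard
```

If the logged value beats the best so far, keep the edit; otherwise revert and try the next idea.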

## Subcommands

- `/autoresearch <text>` — start or resume the loop
- `/autoresearch off` — stop the loop, keep data
- `/autoresearch clear` — delete all state and start fresh

@@ -4,17 +4,8 @@ description: Compare multiple sources on a topic and produce a source-grounded m
Compare sources for: $@

Requirements:
- Use the `researcher` subagent to gather source material when the comparison set is broad, and the `verifier` subagent to pressure-test the resulting matrix when needed.
- Identify the strongest relevant primary sources first.
- For current or market-facing topics, use `web_search` and `fetch_content` to gather up-to-date primary sources before comparing them.
- For academic claims, use `alpha_search` and inspect the strongest papers directly.
- Inspect the top sources directly before comparing them.
- Build a comparison matrix covering:
  - source
  - key claim
  - evidence type
  - caveats
  - confidence
- Use the `researcher` subagent to gather source material when the comparison set is broad, and the `citation` subagent to verify sources and add inline citations to the final matrix.
- Build a comparison matrix covering: source, key claim, evidence type, caveats, confidence.
- Distinguish agreement, disagreement, and uncertainty clearly.
- Save exactly one comparison to `outputs/` as markdown.
- End with a `Sources` section containing direct URLs for every source used.
- If the user wants a durable artifact, save exactly one comparison to `outputs/` as markdown.

@@ -1,34 +1,107 @@
---
description: Run a thorough, source-heavy investigation on a topic and produce a durable research brief with explicit evidence and source links.
description: Run a thorough, source-heavy investigation on a topic and produce a durable research brief with inline citations.
---
Run a deep research workflow for: $@

Requirements:
- Treat `/deepresearch` as one coherent Feynman workflow from the user's perspective. Do not expose internal orchestration primitives unless the user explicitly asks.
- Start as the lead researcher. First make a compact plan: what must be answered, what evidence types are needed, and which sub-questions are worth splitting out.
- Stay single-agent by default for narrow topics. Only use `subagent` when the task is broad enough that separate context windows materially improve breadth or speed.
- If you use subagents, launch them as one worker batch around clearly disjoint sub-questions. Wait for the batch to finish, synthesize the results, and only then decide whether a second batch is needed.
- Prefer breadth-first worker batches for deep research: different market segments, different source types, different time periods, different technical angles, or different competing explanations.
- Use `researcher` workers for evidence gathering, `verifier` workers for adversarial claim-checking, and `writer` only if you already have solid evidence and need help polishing the final artifact.
- Do not make the workflow chain-shaped by default. Hidden worker batches are optional implementation details, not the user-facing model.
- If the user wants it to run unattended, or the sweep will clearly take a while, prefer background execution with `subagent` using `clarify: false, async: true`, then report how to inspect status.
- If the topic is current, product-oriented, market-facing, regulatory, or asks about latest developments, start with `web_search` and `fetch_content`.
- If the topic has an academic literature component, use `alpha_search`, `alpha_get_paper`, and `alpha_ask_paper` for the strongest papers.
- Do not rely on a single source type when the topic spans both current reality and academic background.
- Build a compact evidence table before synthesizing conclusions.
- After synthesis, run a final verification/citation pass. For the strongest claims, independently confirm support and remove anything unsupported, fabricated, or stale.
- Distinguish clearly between established facts, plausible inferences, disagreements, and unresolved questions.
- Produce exactly one durable markdown artifact in `outputs/`.
- The final artifact should read like one deep research memo, not like stitched-together worker transcripts.
- Do not leave extra user-facing intermediate markdown files behind unless the user explicitly asks for them.
- End with a `Sources` section containing direct URLs for every source used.
You are the Lead Researcher. You plan, delegate, evaluate, loop, write, and cite. Internal orchestration is invisible to the user unless they ask.

Default execution shape:
1. Clarify the actual research objective if needed.
2. Make a short plan and identify the key sub-questions.
3. Decide single-agent versus worker-batch execution.
4. Gather evidence across the needed source types.
5. Synthesize findings and identify remaining gaps.
6. If needed, run one more worker batch for unresolved gaps.
7. Perform a verification/citation pass.
8. Write the final brief with a strict `Sources` section.

## 1. Plan

Analyze the research question using extended thinking. Develop a research strategy:
- Key questions that must be answered
- Evidence types needed (papers, web, code, data, docs)
- Sub-questions disjoint enough to parallelize
- Source types and time periods that matter

Save the plan immediately with `memory_remember` (type: `fact`, key: `deepresearch.plan`). Context windows get truncated on long runs — the plan must survive.
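A minimal sketch of that save. The `type` and `key` fields come from the text above; the `value` field and overall call shape are assumptions about the tool:

```
memory_remember({
  type: "fact",
  key: "deepresearch.plan",
  value: "Questions: ... | Evidence types: ... | Disjoint sub-questions: ..."
})
```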

## 2. Scale decision

| Query type | Execution |
|---|---|
| Single fact or narrow question | Search directly yourself, no subagents, 3-10 tool calls |
| Direct comparison (2-3 items) | 2 parallel `researcher` subagents |
| Broad survey or multi-faceted topic | 3-4 parallel `researcher` subagents |
| Complex multi-domain research | 4-6 parallel `researcher` subagents |

Never spawn subagents for work you can do in 5 tool calls.

## 3. Spawn researchers

Launch parallel `researcher` subagents via `subagent`. Each gets a structured brief with:
- **Objective:** what to find
- **Output format:** numbered sources, evidence table, inline source references
- **Tool guidance:** which search tools to prioritize
- **Task boundaries:** what NOT to cover (another researcher handles that)

Assign each researcher a clearly disjoint dimension — different source types, geographic scopes, time periods, or technical angles. Never duplicate coverage.

```
{
  tasks: [
    { agent: "researcher", task: "...", output: "research-web.md" },
    { agent: "researcher", task: "...", output: "research-papers.md" }
  ],
  concurrency: 4,
  failFast: false
}
```

Researchers write full outputs to files and pass references back — do not have them return full content into your context.

## 4. Evaluate and loop

After researchers return, read their output files and critically assess:
- Which plan questions remain unanswered?
- Which answers rest on only one source?
- Are there contradictions needing resolution?
- Is any key angle missing entirely?

If gaps are significant, spawn another targeted batch of researchers. No fixed cap on rounds — iterate until evidence is sufficient or sources are exhausted. Update the stored plan with `memory_remember` as it evolves.

Most topics need 1-2 rounds. Stop when additional rounds would not materially change conclusions.

## 5. Write the report

Once evidence is sufficient, YOU write the full research brief directly. Do not delegate writing to another agent. Read the research files, synthesize the findings, and produce a complete document:

```markdown
# Title

## Executive Summary
2-3 paragraph overview of key findings.

## Section 1: ...
Detailed findings organized by theme or question.

## Section N: ...

## Open Questions
Unresolved issues, disagreements between sources, gaps in evidence.
```

Save this draft to a temp file (e.g., `draft.md` in the chain artifacts dir or a temp path).

## 6. Cite

Spawn the `citation` agent to post-process YOUR draft. The citation agent adds inline citations, verifies every source URL, and produces the final output:

```
{ agent: "citation", task: "Add inline citations to draft.md using the research files as source material. Verify every URL.", output: "brief.md" }
```

The citation agent does not rewrite the report — it only anchors claims to sources and builds the numbered Sources section.

## 7. Deliver

Copy the final cited output to the appropriate folder:
- Paper-style drafts → `papers/`
- Everything else → `outputs/`

Use a descriptive filename based on the topic.

## Background execution

If the user wants unattended execution or the sweep will clearly take a while:
- Launch the full workflow via `subagent` using `clarify: false, async: true`
- Report the async ID and how to check status with `subagent_status`
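Following the `subagent` call shape shown earlier in this file, a detached run might be launched as below. The task text is illustrative, and `clarify`/`async` are the only fields the text above actually specifies:

```
{
  tasks: [
    { agent: "researcher", task: "<full deep research brief for the topic>", output: "brief.md" }
  ],
  clarify: false,
  async: true
}
```

Report the returned async ID so the user can poll it with `subagent_status`.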
@@ -4,18 +4,8 @@ description: Turn research findings into a polished paper-style draft with equat
Write a paper-style draft for: $@

Requirements:
- Prefer the `writer` subagent when the draft should be produced from already-collected notes, and use `verifier` first if the evidence still looks shaky.
- Ground every claim in inspected sources, experiments, or explicit inference.
- Use clean Markdown structure with LaTeX where equations materially help.
- Include at minimum:
  - title
  - abstract
  - problem statement
  - related work
  - method or synthesis
  - evidence or experiments
  - limitations
  - conclusion
- If citations are available, include citation placeholders or references clearly enough to convert later.
- Add a `Sources` appendix with direct URLs for all primary references used while drafting.
- Use the `writer` subagent when the draft should be produced from already-collected notes, then use the `citation` subagent to add inline citations and verify sources.
- Include at minimum: title, abstract, problem statement, related work, method or synthesis, evidence or experiments, limitations, conclusion.
- Use clean Markdown with LaTeX where equations materially help.
- Save exactly one draft to `papers/` as markdown.
- End with a `Sources` appendix with direct URLs for all primary references.

@@ -5,12 +5,7 @@ Investigate the following topic as a literature review: $@

Requirements:
- Use the `researcher` subagent when the sweep is wide enough to benefit from delegated paper triage before synthesis.
- If the topic is academic or paper-centric, use `alpha_search` first.
- If the topic is current, product-oriented, market-facing, or asks about latest developments, use `web_search` and `fetch_content` first, then use `alpha_search` only for academic background.
- Use `alpha_get_paper` on the most relevant papers before making strong claims.
- Use `alpha_ask_paper` for targeted follow-up questions when the report is not enough.
- Prefer primary sources and note when something appears to be a preprint or secondary summary.
- Separate consensus, disagreements, and open questions.
- When useful, propose concrete next experiments or follow-up reading.
- End with a `Sources` section containing direct URLs for every paper or source used.
- If the user wants an artifact, write exactly one review to disk as markdown.
- Save exactly one literature review to `outputs/` as markdown.
- End with a `Sources` section containing direct URLs for every source used.

@@ -1,14 +0,0 @@
---
description: Produce a general research memo grounded in explicit sources and direct links.
---
Write a research memo about: $@

Requirements:
- Use the `researcher` and `writer` subagents when decomposition will improve quality or reduce context pressure.
- Start by finding the strongest relevant sources.
- If the topic is current, market-facing, product-oriented, regulatory, or asks about latest developments, use `web_search` and `fetch_content` first.
- Use `alpha_search` for academic background where relevant, but do not rely on it alone for current topics.
- Read or inspect the top sources directly before making strong claims.
- Distinguish facts, interpretations, and open questions.
- End with a `Sources` section containing direct URLs for every source used.
- If the user wants a durable artifact, save exactly one memo to `outputs/` as markdown.

@@ -1,15 +0,0 @@
---
description: Build a prioritized reading list on a research topic with rationale for each paper.
---
Create a research reading list for: $@

Requirements:
- Use the `researcher` subagent when a wider literature sweep would help before curating the final list.
- If the topic is academic, use `alpha_search` with `all` mode.
- If the topic is current, product-oriented, or asks for the latest landscape, use `web_search` and `fetch_content` first, then add `alpha_search` for academic background when relevant.
- Inspect the strongest papers or primary sources directly before recommending them.
- Use `alpha_ask_paper` when a paper's fit is unclear.
- Group papers by role when useful: foundational, strongest recent work, methods, benchmarks, critiques, replication targets.
- For each paper, explain why it is on the list.
- Include direct URLs for each recommended source.
- Save exactly one final reading list to `outputs/` as markdown.

@@ -1,18 +0,0 @@
---
description: Turn reviewer comments into a structured rebuttal and revision plan for an AI research paper.
---
Prepare a rebuttal workflow for: $@

Requirements:
- If reviewer comments are provided, organize them into a response matrix.
- If reviewer comments are not yet provided, infer the likely strongest objections from the current draft and review them before drafting responses.
- Prefer the `reviewer` subagent or the project `review` chain when fresh critical review is still needed.
- For each issue, produce:
  - reviewer concern
  - whether it is valid
  - evidence available now
  - paper changes needed
  - rebuttal language
- Do not overclaim fixes that have not been implemented.
- Save exactly one rebuttal matrix to `outputs/` as markdown.
- End with a `Sources` section containing direct URLs for all inspected external sources.

@@ -1,19 +0,0 @@
---
description: Build a related-work map and justify why an AI research project needs to exist.
---
Build the related-work and justification view for: $@

Requirements:
- Search for the closest and strongest relevant papers first.
- Prefer the `researcher` subagent when the space is broad or moving quickly.
- Identify:
  - foundational papers
  - closest prior work
  - strongest recent competing approaches
  - benchmarks and evaluation norms
  - critiques or known weaknesses in the area
- For each important paper, explain why it matters to this project.
- Be explicit about what real gap remains after considering the strongest prior work.
- If the project is not differentiated enough, say so clearly.
- If the user wants a durable result, save exactly one artifact to `outputs/` as markdown.
- End with a `Sources` section containing direct URLs.

@@ -4,11 +4,7 @@ description: Plan or execute a replication workflow for a paper, claim, or bench
Design a replication plan for: $@

Requirements:
- Use the `subagent` tool for decomposition when the replication needs separate planning, evidence extraction, and execution passes.
- Identify the canonical paper or source material first.
- Use `alpha_get_paper` for the target paper.
- Use `alpha_ask_paper` to extract the exact implementation or evaluation details you still need.
- If the paper links code, inspect it with `alpha_read_code`.
- Use the `researcher` subagent to extract implementation details from the target paper and any linked code.
- Determine what code, datasets, metrics, and environment are needed.
- If enough information is available locally, implement and run the replication steps.
- Save notes, scripts, and results to disk in a reproducible layout.

@@ -4,21 +4,8 @@ description: Simulate an AI research peer review with likely objections, severit
Review this AI research artifact: $@

Requirements:
- Prefer the project `review` chain or the `researcher` + `verifier` + `reviewer` subagents when the artifact is large or the review needs to inspect paper, code, and experiments together.
- Inspect the strongest relevant sources directly before making strong review claims.
- If the artifact is a paper or draft, evaluate:
  - novelty and related-work positioning
  - clarity of claims
  - baseline fairness
  - evaluation design
  - missing ablations
  - reproducibility details
  - whether conclusions outrun the evidence
- If code or experiment artifacts exist, compare them against the claimed method and evaluation.
- Produce:
  - short verdict
  - likely reviewer objections
  - severity for each issue
  - revision plan in priority order
- Spawn a `researcher` subagent to gather evidence on the artifact — inspect the paper, code, cited work, and any linked experimental artifacts. Save to `research.md`.
- Spawn a `reviewer` subagent with `research.md` to produce the final peer review with inline annotations.
- For small or simple artifacts where evidence gathering is overkill, run the `reviewer` subagent directly instead.
- Save exactly one review artifact to `outputs/` as markdown.
- End with a `Sources` section containing direct URLs for every inspected external source.

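The two-stage review could be expressed with the same task shape the deepresearch command uses for its `subagent` calls. The field names mirror that example; the task text here is illustrative:

```
{ agent: "researcher", task: "Gather evidence on the artifact: paper, code, cited work, linked experiments.", output: "research.md" }
{ agent: "reviewer", task: "Using research.md, write the peer review with inline annotations, severities, and a prioritized revision plan.", output: "review.md" }
```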
@@ -4,11 +4,8 @@ description: Set up a recurring or deferred research watch on a topic, company,
Create a research watch for: $@

Requirements:
- Start with a baseline sweep of the topic using the strongest relevant sources.
- If the watch is about current events, products, markets, regulations, or releases, use `web_search` and `fetch_content` first.
- If the watch has a literature component, add `alpha_search` and inspect the strongest papers directly.
- Start with a baseline sweep of the topic.
- Summarize what should be monitored, what signals matter, and what counts as a meaningful change.
- Use `schedule_prompt` to create the recurring or delayed follow-up instead of merely promising to check later.
- If the user wants detached execution for the initial sweep, use `subagent` in background mode and report how to inspect status.
- Save exactly one durable baseline artifact to `outputs/`.
- Save exactly one baseline artifact to `outputs/`.
- End with a `Sources` section containing direct URLs for every source used.

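The recurring follow-up might be created like this. The `schedule_prompt` argument names are assumptions, since the tool's schema is not given here:

```
schedule_prompt({
  prompt: "Re-run the research watch baseline sweep and report only meaningful changes against the saved baseline.",
  schedule: "<cron expression or delay>"
})
```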