Finalize workflow and prompt updates

2026-03-24 11:59:50 -07:00
parent d7afde8fc0
commit 1c90128605
7 changed files with 40 additions and 2 deletions
--- a/.feynman/SYSTEM.md
+++ b/.feynman/SYSTEM.md
@@ -19,6 +19,8 @@ Operating rules:
 - Use subagents when decomposition meaningfully reduces context pressure or lets you parallelize evidence gathering. For detached long-running work, prefer background subagent execution with `clarify: false, async: true`.
 - For deep research, act like a lead researcher by default: plan first, use hidden worker batches only when breadth justifies them, synthesize batch results, and finish with a verification pass.
 - For long workflows, externalize state to disk early. Treat the plan artifact as working memory and keep a task ledger plus verification log there as the run evolves.
 - For long-running or resumable work, use `CHANGELOG.md` in the workspace root as a lab notebook when it exists. Read it before resuming substantial work and append concise entries after meaningful progress, failed approaches, major verification results, or new blockers.
 - Do not create or update `CHANGELOG.md` for trivial one-shot tasks.
 - Do not force chain-shaped orchestration onto the user. Multi-agent decomposition is an internal tactic, not the primary UX.
 - For AI research artifacts, default to pressure-testing the work before polishing it. Use review-style workflows to check novelty positioning, evaluation design, baseline fairness, ablations, reproducibility, and likely reviewer objections.
 - Do not say `verified`, `confirmed`, `checked`, or `reproduced` unless you actually performed the check and can point to the supporting source, artifact, or command output.
@@ -35,6 +37,7 @@ Operating rules:
 - For long-running local work such as experiments, crawls, or log-following, use the process package instead of blocking the main thread unnecessarily. Prefer detached/background execution when the user does not need to steer every intermediate step.
 - Prefer the smallest investigation or experiment that can materially reduce uncertainty before escalating to broader work.
 - When an experiment is warranted, write the code or scripts, run them, capture outputs, and save artifacts to disk.
 - Before pausing long-running work, update the durable state on disk first: plan artifact, `CHANGELOG.md`, and any verification notes needed for the next session to resume cleanly.
 - Before recommending an execution environment, consider the system resources shown in the header (CPU, RAM, GPU, Docker availability). Recommend Docker when isolation on the current machine helps, and say explicitly when the workload exceeds local capacity. Do not suggest GPU workloads locally if no GPU is detected.
 - Treat polished scientific communication as part of the job: structure reports cleanly, use Markdown deliberately, and use LaTeX math when equations clarify the argument.
 - For any source-based answer, include an explicit Sources section with direct URLs, not just paper titles.
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -20,6 +20,7 @@ They are defined in `.feynman/agents/` and invoked via the Pi `subagent` tool.
 Keep this file focused on cross-agent repo conventions:
 - output locations and file naming expectations
 - workspace-level continuity expectations for long-running work
 - provenance and verification requirements
 - handoff rules between the lead agent and subagents
@@ -30,9 +31,12 @@ Do **not** restate per-agent prompt text here unless there is a repo-wide constr
 - Research outputs go in `outputs/`.
 - Paper-style drafts go in `papers/`.
 - Session logs go in `notes/`.
 - The workspace-level lab notebook lives at `CHANGELOG.md`.
 - Plan artifacts for long-running workflows go in `outputs/.plans/`.
 - Intermediate research artifacts are written to disk by subagents and read by the lead agent. They are not returned inline unless the user explicitly asks for them.
 - Long-running workflows should treat the plan artifact as an externalized working memory, not a static outline. Keep task status and verification state there as the run evolves.
 - Long-running or resumable workflows should also treat `CHANGELOG.md` as the chronological lab notebook: what changed, what failed, what was verified, and what should happen next.
 - Do not create or update `CHANGELOG.md` for trivial one-shot tasks.
 ## File naming
@@ -48,6 +52,14 @@ Every workflow that produces artifacts must derive a short **slug** from the top
 Never use generic names like `research.md`, `draft.md`, `brief.md`, or `summary.md`. Concurrent runs must not collide.
 ## Workspace changelog
 - `CHANGELOG.md` is a lab notebook, not release notes.
 - Read `CHANGELOG.md` before resuming substantial work when it exists.
 - Append concise entries after meaningful progress, failed approaches, major verification results, or new blockers.
 - Each entry should identify the active slug or objective and end with the next recommended step.
 - Mark verification state honestly: use labels such as `verified`, `unverified`, `blocked`, or `inferred` only when they match the underlying evidence.
 ## Provenance and verification
 - Every output from `/deepresearch` and `/lit` must include a `.provenance.md` sidecar.
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -0,0 +1,16 @@
 # CHANGELOG
 Workspace lab notebook for long-running or resumable research work.
 Use this file as a chronological record, not as release notes. Keep entries short, factual, and operational.
 ## Entry template
 ### YYYY-MM-DD HH:MM TZ — [slug or objective]
 - Objective: ...
 - Changed: ...
 - Verified: ...
 - Failed / learned: ...
 - Blockers: ...
 - Next: ...
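To illustrate the entry template above, here is a minimal Python sketch that renders one entry and appends it to the notebook. The names `format_entry`, `append_entry`, and `FIELDS` are hypothetical helpers, not part of Feynman, and the timezone label is passed in by the caller as an assumption rather than derived.

```python
from datetime import datetime

# Field order taken from the entry template; the helper names are hypothetical.
FIELDS = ("Objective", "Changed", "Verified", "Failed / learned", "Blockers", "Next")


def format_entry(slug, fields, when, tz_label):
    """Render one notebook entry as a heading followed by one bullet per field."""
    missing = [name for name in FIELDS if name not in fields]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # "\u2014" is the dash character used in the template heading.
    heading = f"### {when:%Y-%m-%d %H:%M} {tz_label} \u2014 {slug}"
    bullets = [f"- {name}: {fields[name]}" for name in FIELDS]
    return "\n".join([heading, *bullets]) + "\n"


def append_entry(path, entry):
    """Append an entry to the notebook without touching existing content."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("\n" + entry)


example = format_entry(
    "cloud-sandbox-pricing",
    {
        "Objective": "compare vendor pricing",
        "Changed": "added two sources",
        "Verified": "unverified",
        "Failed / learned": "one pricing page was stale",
        "Blockers": "none",
        "Next": "re-check rates against primary sources",
    },
    datetime(2026, 3, 24, 11, 59),
    "UTC",
)
```

Requiring every field keeps entries uniform, and the closing `Next` line matches the rule that each entry ends with the next recommended step. Whether to call `append_entry` at all is left to the caller, since trivial one-shot tasks should not touch `CHANGELOG.md`.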
--- a/README.md
+++ b/README.md
@@ -21,6 +21,8 @@ feynman setup
 feynman
 ```
 Feynman works directly inside your folder or repo. For long-running work, keep the stable repo contract in `AGENTS.md`, the current task brief in `outputs/.plans/`, and the chronological lab notebook in `CHANGELOG.md`.
 ---
 ## What you type → what happens
--- a/prompts/autoresearch.md
+++ b/prompts/autoresearch.md
@@ -11,6 +11,7 @@ This command uses pi-autoresearch.
 ## Step 1: Gather
 If `autoresearch.md` and `autoresearch.jsonl` already exist, ask the user if they want to resume or start fresh.
 If `CHANGELOG.md` exists, read the most recent relevant entries before resuming.
 Otherwise, collect the following from the user before doing anything else:
 - What to optimize (test speed, bundle size, training loss, build time, etc.)
@@ -48,6 +49,7 @@ Ask the user to confirm. Do not start the loop without explicit approval.
 Initialize the session: create `autoresearch.md`, `autoresearch.sh`, run the baseline, and start looping.
 Each iteration: edit → commit → `run_experiment` → `log_experiment` → keep or revert → repeat. Do not stop unless interrupted or `maxIterations` is reached.
 After the baseline and after meaningful iteration milestones, append a concise entry to `CHANGELOG.md` summarizing what changed, what metric result was observed, what failed, and the next step.
 ## Key tools
--- a/prompts/deepresearch.md
+++ b/prompts/deepresearch.md
@@ -18,6 +18,7 @@ Analyze the research question using extended thinking. Develop a research strate
 - Acceptance criteria: what evidence would make the answer "sufficient"
 Derive a short slug from the topic (lowercase, hyphens, no filler words, ≤5 words — e.g. "cloud-sandbox-pricing" not "deepresearch-plan"). Write the plan to `outputs/.plans/<slug>.md` as a self-contained artifact. Use this same slug for all artifacts in this run.
 If `CHANGELOG.md` exists, read the most recent relevant entries before finalizing the plan. Once the workflow becomes multi-round or spans enough work to merit resume support, append concise entries to `CHANGELOG.md` after meaningful progress and before stopping.
 ```markdown
 # Research Plan: [topic]
@@ -100,6 +101,7 @@ After researchers return, read their output files and critically assess:
 If gaps are significant, spawn another targeted batch of researchers. No fixed cap on rounds — iterate until evidence is sufficient or sources are exhausted.
 Update the plan artifact (`outputs/.plans/<slug>.md`) task ledger, verification log, and decision log after each round.
 When the work spans multiple rounds, also append a concise chronological entry to `CHANGELOG.md` covering what changed, what was verified, what remains blocked, and the next recommended step.
 Most topics need 1-2 rounds. Stop when additional rounds would not materially change conclusions.
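The slug rule in this prompt (lowercase, hyphens, no filler words, at most five words) can be sketched as below. This is an illustration only: the filler-word list is an assumption, since the prompt does not enumerate one, and `derive_slug` and `plan_path` are hypothetical names. The generic-name guard follows the file naming rule in `AGENTS.md`.

```python
import re

# Assumed filler words; the prompt only says "no filler words".
FILLER_WORDS = {"a", "an", "the", "of", "for", "and", "or", "to", "in", "on", "with"}

# Generic names that AGENTS.md says must never be used for artifacts.
GENERIC_NAMES = {"research", "draft", "brief", "summary"}


def derive_slug(topic, max_words=5):
    """Lowercase the topic, drop filler words, and join up to max_words with hyphens."""
    words = re.findall(r"[a-z0-9]+", topic.lower())
    kept = [word for word in words if word not in FILLER_WORDS]
    slug = "-".join(kept[:max_words])
    if not slug or slug in GENERIC_NAMES:
        raise ValueError(f"topic does not yield a usable slug: {topic!r}")
    return slug


def plan_path(slug):
    """Location of the plan artifact for a given slug."""
    return f"outputs/.plans/{slug}.md"
```

For example, `derive_slug("Cloud Sandbox Pricing")` yields `cloud-sandbox-pricing`, and `plan_path` maps that slug to `outputs/.plans/cloud-sandbox-pricing.md`. Truncating to the first five kept words is one simple choice; a real run may prefer to pick the most distinctive words instead.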
--- a/prompts/replicate.md
+++ b/prompts/replicate.md
@@ -8,7 +8,7 @@ Design a replication plan for: $@
 ## Workflow
-1. **Extract** — Use the `researcher` subagent to pull implementation details from the target paper and any linked code.
+1. **Extract** — Use the `researcher` subagent to pull implementation details from the target paper and any linked code. If `CHANGELOG.md` exists, read the most recent relevant entries before planning or resuming.
 2. **Plan** — Determine what code, datasets, metrics, and environment are needed. Be explicit about what is verified, what is inferred, what is still missing, and which checks or test oracles will be used to decide whether the replication succeeded.
 3. **Environment** — Before running anything, ask the user where to execute:
   - **Local** — run in the current working directory
@@ -16,6 +16,7 @@ Design a replication plan for: $@
   - **Docker** — run experiment code inside an isolated Docker container
   - **Plan only** — produce the replication plan without executing
 4. **Execute** — If the user chose an execution environment, implement and run the replication steps there. Save notes, scripts, raw outputs, and results to disk in a reproducible layout. Do not call the outcome replicated unless the planned checks actually passed.
-5. **Report** — End with a `Sources` section containing paper and repository URLs.
+5. **Log** — For multi-step or resumable replication work, append concise entries to `CHANGELOG.md` after meaningful progress, failed attempts, major verification outcomes, and before stopping. Record the active objective, what changed, what was checked, and the next step.
 6. **Report** — End with a `Sources` section containing paper and repository URLs.
 Do not install packages, run training, or execute experiments without confirming the execution environment first.
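Several prompts in this commit say to read the most recent relevant `CHANGELOG.md` entries before resuming. A minimal sketch of that lookup follows, assuming entries use the template heading `### YYYY-MM-DD HH:MM TZ — [slug or objective]` and are appended in chronological order. The functions `parse_entries` and `recent_entries` are hypothetical, and matching "relevant" entries by slug substring is an assumption.

```python
import re

EM_DASH = "\u2014"  # the dash character used in the template heading

# Matches real entry headings; the literal "YYYY-MM-DD" template line does not match.
HEADING = re.compile(
    r"^### (\d{4}-\d{2}-\d{2} \d{2}:\d{2} \S+) " + EM_DASH + r" (.+)$"
)


def parse_entries(text):
    """Split notebook text into entries with a timestamp, a label, and bullet lines."""
    entries = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = {"stamp": match.group(1), "label": match.group(2).strip(), "lines": []}
            entries.append(current)
        elif line.startswith("#"):
            current = None  # any other heading ends the current entry
        elif current is not None and line.strip():
            current["lines"].append(line.strip())
    return entries


def recent_entries(text, slug, limit=3):
    """Return the last `limit` entries whose label mentions the slug."""
    matching = [entry for entry in parse_entries(text) if slug in entry["label"]]
    return matching[-limit:]
```

Because the notebook is append-only in this sketch, taking the tail of the matching list is enough to get the latest entries without parsing timestamps.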