
salvacybersec 3126dadd19 chore: CLAUDE.md + build.py refresh + feynman-skills import

- CLAUDE.md: updated project guidance
- build.py: install flow tweaks (post install_opencode fix)
- personas/_shared/feynman-skills/: 20 Feynman skills imported from ~/Documents/opencode-skills-parked/, sibling _platform-mapping.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 01:35:13 +03:00

7.3 KiB



name	description	allowed-tools
summarize	Summarize any URL, local file, or PDF using the RLM pattern: the source is stored on disk and never injected raw into context. Use when the user asks to summarize a long document, paper, webpage, or PDF that might exceed safe context-window limits.	Bash(curl:*), Bash(pdftotext:*), Bash(python3:*)

Summarize (RLM Pattern)

Summarize a URL, local file, or PDF without injecting the full document into context. The source stays on disk as an external variable; only bounded windows enter context.

Derive a short slug from the source filename or URL domain (lowercase, hyphens, ≤5 words — e.g. attention-is-all-you-need). All files use this prefix.
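The slug rule above can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the function name `derive_slug`, the preference for a URL's last path segment over its domain, and the `"source"` fallback are all assumptions.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def derive_slug(source: str, max_words: int = 5) -> str:
    """Return a lowercase, hyphen-separated slug of at most max_words words."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumption: use the last path segment if there is one, else the domain.
        stem = PurePosixPath(parsed.path).stem or parsed.netloc
    else:
        # Local file: use the filename without its extension.
        stem = PurePosixPath(source).stem
    words = re.findall(r"[a-z0-9]+", stem.lower())
    # Assumption: fall back to a fixed name if nothing usable remains.
    return "-".join(words[:max_words]) or "source"
```

For example, `derive_slug("papers/Attention Is All You Need.pdf")` yields `attention-is-all-you-need`, and a bare `https://example.com/` falls back to the domain.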

Why the RLM pattern

Standard summarization injects the full document into context. Above ~15k tokens, early content degrades as the window fills (context rot). This workflow keeps the document on disk and reads only bounded windows — context pressure is proportional to the window size, not the document size.

Tier 1 (<8k chars) is a deliberate exception: direct injection is safe at ~2k tokens and windowed reading would add unnecessary friction.

Step 1 — Fetch, validate, measure

Run all guards before any tier logic. A failure here is cheap; a failure mid-Tier-3 is not.

GitHub repo URL (https://github.com/owner/repo — exactly 4 slashes): fetch the raw README instead. Try https://raw.githubusercontent.com/{owner}/{repo}/main/README.md, then /master/README.md. A repo HTML page is not the document the user wants to summarize.
Remote URL: fetch to disk: curl -sL -o outputs/.notes/<slug>-raw.txt <url>. Do NOT use a fetch tool whose return value enters context directly — that bypasses the RLM principle.
Local file or PDF: copy or extract to outputs/.notes/<slug>-raw.txt. For PDFs, extract text via pdftotext <file> outputs/.notes/<slug>-raw.txt (or equivalent) before measuring.
Empty or failed fetch: if the file is <50 bytes after fetching, stop and surface the error — do not proceed to tier selection.
Binary content: if the file is >1 KB but contains <100 readable text characters, stop and tell the user the content appears binary or unextracted.
Existing output: if outputs/<slug>-summary.md already exists, ask whether to overwrite or use a different slug. Do not proceed until confirmed.
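Two of the guards above lend themselves to a short Python sketch: rewriting a bare GitHub repo URL to raw README candidates, and the empty/binary checks on the fetched file. The function names are hypothetical, "1 KB" is taken to mean 1024 bytes, and "readable text characters" is interpreted as printable non-whitespace characters after UTF-8 decoding; both interpretations are assumptions.

```python
import re
from pathlib import Path
from typing import List, Optional


def github_readme_candidates(url: str) -> List[str]:
    """For a bare https://github.com/owner/repo URL, list raw README URLs to try in order."""
    m = re.fullmatch(r"https://github\.com/([^/]+)/([^/]+)", url)
    if not m:
        return []  # not a bare repo URL (more or fewer than 4 slashes)
    owner, repo = m.groups()
    base = f"https://raw.githubusercontent.com/{owner}/{repo}"
    return [f"{base}/main/README.md", f"{base}/master/README.md"]


def check_fetched_file(path: str) -> Optional[str]:
    """Return an error message if the file fails a Step 1 guard, else None."""
    data = Path(path).read_bytes()
    if len(data) < 50:
        return "empty or failed fetch (<50 bytes)"
    text = data.decode("utf-8", errors="ignore")
    # Assumption: "readable" means printable and not whitespace.
    readable = sum(1 for ch in text if ch.isprintable() and not ch.isspace())
    if len(data) > 1024 and readable < 100:
        return "content appears binary or unextracted"
    return None
```

Running these checks from a script keeps the raw file out of context: only the returned message, if any, needs to be surfaced to the user.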

Measure decoded text characters, not bytes: a byte count overstates the length of any text that contains multi-byte UTF-8 characters. Log: [summarize] source=<source> slug=<slug> chars=<count>.
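A minimal sketch of the measurement, assuming the raw file is UTF-8 text; the helper names `count_chars` and `measure_log_line` are hypothetical, and `errors="replace"` is an assumption so that a stray invalid byte does not abort the count.

```python
def count_chars(path: str) -> int:
    """Count decoded characters (not bytes) in a UTF-8 text file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return len(f.read())


def measure_log_line(source: str, slug: str, chars: int) -> str:
    """Format the Step 1 log line."""
    return f"[summarize] source={source} slug={slug} chars={chars}"
```

A file holding ten copies of "é" is 20 bytes on disk but counts as 10 characters here, which is the number the tier thresholds are defined against.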

Step 2 — Choose tier

Chars	Tier	Strategy
<8 000	1	Direct read — full content enters context (safe at ~2k tokens)
8 000 – 60 000	2	RLM-lite — windowed bash extraction, progressive notes to disk
>60 000	3	Full RLM — bash chunking + parallel researcher subagents

Log: [summarize] tier=<N> chars=<count>.
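The thresholds in the table translate directly into a small function. The name `choose_tier` is hypothetical; the boundaries follow the table, so exactly 8 000 and exactly 60 000 characters both fall in Tier 2.

```python
def choose_tier(chars: int) -> int:
    """Map a character count to tier 1, 2, or 3 using the table's thresholds."""
    if chars < 8_000:
        return 1  # direct read
    if chars <= 60_000:
        return 2  # RLM-lite windowed read
    return 3      # full RLM with parallel chunks
```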

Tier 1 — Direct read

Read outputs/.notes/<slug>-raw.txt in full. Summarize directly using the output format below. Write to outputs/<slug>-summary.md.

Tier 2 — RLM-lite windowed read

The document stays on disk. Extract 6 000-char windows via bash/python:

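# The Read tool uses line offsets, not char offsets, so use Python for windowing.
# Text-mode f.seek() takes opaque byte-based positions, not character offsets,
# so slice the decoded string to get exact character windows.
n = 0  # zero-based index of the window to read
with open("outputs/.notes/<slug>-raw.txt", encoding="utf-8") as f:
    text = f.read()  # held in the Python process only, never printed in full
window = text[n * 6000 : (n + 1) * 6000]
print(window)  # only this bounded window enters context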

For each window:

Extract key claims and evidence.
Append to outputs/.notes/<slug>-notes.md before reading the next window. This is the checkpoint: if the session is interrupted, processed windows survive.
Log: [summarize] window <N>/<total> done.

After all windows, synthesize outputs/.notes/<slug>-notes.md into outputs/<slug>-summary.md.
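The window-then-checkpoint loop can be sketched as two helpers. This is an illustration under assumptions: `iter_windows` and `append_notes` are hypothetical names, the "## Window N" heading written to the notes file is an invented layout, and the extraction of claims itself is done by the model between the two calls.

```python
import math
from typing import Iterator, Tuple

WINDOW = 6000  # window size in characters, as in Tier 2


def iter_windows(text: str, size: int = WINDOW) -> Iterator[Tuple[int, int, str]]:
    """Yield (window_number, total_windows, window_text), numbered from 1."""
    total = math.ceil(len(text) / size)
    for n in range(total):
        yield n + 1, total, text[n * size : (n + 1) * size]


def append_notes(notes_path: str, window_no: int, notes: str) -> None:
    """Append one window's notes; append mode keeps earlier windows if interrupted."""
    with open(notes_path, "a", encoding="utf-8") as f:
        f.write(f"\n## Window {window_no}\n{notes}\n")
```

Because the notes file is opened in append mode after every window, a session that stops at window 4 of 9 still has windows 1 to 3 on disk.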

Tier 3 — Full RLM parallel chunks

Each chunk gets a fresh researcher subagent context window, so context rot is avoided by construction: no subagent reads more than one 6 000-char chunk.

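Why 500-char overlap: academic documents contain multi-sentence arguments that span chunk boundaries. A 500-char overlap (~80 words) guarantees that any passage of up to 500 chars that straddles a boundary appears whole in at least one of the two adjacent chunks. Longer passages can still be split, which is what the BOUNDARY PARTIAL marker in 3c is for.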
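The overlap arithmetic can be checked with a short sketch. The helpers `chunk_ranges` and `covered_whole` are hypothetical; the ranges assume chunks start every chunk_size - overlap = 5 500 characters, as in the chunking script in 3a.

```python
from typing import List, Tuple


def chunk_ranges(text_len: int, chunk_size: int = 6000, overlap: int = 500) -> List[Tuple[int, int]]:
    """Return the [start, end) character range of each chunk."""
    step = chunk_size - overlap  # 5500: distance between chunk starts
    return [(s, min(s + chunk_size, text_len)) for s in range(0, text_len, step)]


def covered_whole(span_start: int, span_end: int, ranges: List[Tuple[int, int]]) -> bool:
    """True if the span [span_start, span_end) lies entirely inside one chunk."""
    return any(s <= span_start and span_end <= e for s, e in ranges)
```

With a 20 000-char document, a 400-char passage at 5 800 to 6 200 crosses the end of chunk 0 but sits wholly inside chunk 1 (5 500 to 11 500). A 700-char passage at 5 400 to 6 100 is not whole in any chunk. The same ranges give the approximate position of a missing chunk when reporting coverage gaps.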

3a. Chunk the document

import os
os.makedirs("outputs/.notes", exist_ok=True)

with open("outputs/.notes/<slug>-raw.txt", encoding="utf-8") as f:
    text = f.read()

chunk_size, overlap = 6000, 500
chunks, i = [], 0
while i < len(text):
    chunks.append(text[i : i + chunk_size])
    i += chunk_size - overlap

for n, chunk in enumerate(chunks):
    # Zero-pad so files sort correctly (chunk-002 before chunk-010)
    with open(f"outputs/.notes/<slug>-chunk-{n:03d}.txt", "w", encoding="utf-8") as f:
        f.write(chunk)

print(f"[summarize] chunks={len(chunks)} chunk_size={chunk_size} overlap={overlap}")

3b. Announce before spawning

Briefly summarize: "Source is ~<count> chars → <N> chunks → <N> researcher subagents. This may take several minutes." Then continue automatically. Do not ask for confirmation or wait for a proceed response unless the user explicitly requested review before launching.

3c. Dispatch researcher subagents

Dispatch one subagent per chunk (see ../_platform-mapping.md for role mapping). Each subagent's prompt:

Read ONLY outputs/.notes/<slug>-chunk-NNN.txt. Extract: (1) key claims (2) methodology or technical approach (3) cited evidence

Do NOT use web search or fetch external URLs — this is single-source summarization. If a claim appears to start or end mid-sentence at the file boundary, mark it BOUNDARY PARTIAL. Write to outputs/.notes/<slug>-summary-chunk-NNN.md.

Use failFast: false / equivalent so one chunk failure doesn't kill the batch. Cap concurrency at ~4 to avoid rate limits.
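The "no fail-fast, capped concurrency" semantics can be illustrated with a thread pool. This is only a sketch of the intended behaviour, not the platform's dispatch mechanism: `dispatch_all` is a hypothetical name, and `run_subagent` stands in for whatever call the platform provides to launch one subagent on one chunk.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def dispatch_all(
    chunk_ids: Iterable[int],
    run_subagent: Callable[[int], object],  # hypothetical: launches one subagent
    max_workers: int = 4,                   # concurrency cap from the text
) -> Tuple[List[int], Dict[int, str]]:
    """Run every chunk; collect failures instead of aborting the batch."""
    done: List[int] = []
    failed: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_subagent, cid): cid for cid in chunk_ids}
        for fut in as_completed(futures):
            cid = futures[fut]
            try:
                fut.result()
                done.append(cid)
            except Exception as exc:  # one failed chunk must not stop the others
                failed[cid] = repr(exc)
    return sorted(done), failed
```

The keys of `failed` are the chunk indices that would later be reported as coverage gaps.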

3d. Aggregate

After all subagents return, verify every expected outputs/.notes/<slug>-summary-chunk-NNN.md exists. Note any missing chunk indices — they appear in the Coverage gaps section of the output. Do not abort on partial coverage; a partial summary with gaps noted is more useful than none.
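The existence check can be sketched as below. The function name `missing_chunk_indices` is hypothetical; indices are zero-based and zero-padded to three digits, matching the chunk filenames written in 3a.

```python
from pathlib import Path
from typing import List


def missing_chunk_indices(notes_dir: str, slug: str, n_chunks: int) -> List[int]:
    """Return the chunk indices whose per-chunk summary file is absent."""
    missing = []
    for n in range(n_chunks):
        expected = Path(notes_dir) / f"{slug}-summary-chunk-{n:03d}.md"
        if not expected.is_file():
            missing.append(n)
    return missing
```

An empty list means full coverage; otherwise the returned indices go into the Coverage gaps section rather than aborting the run.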

When synthesizing:

Deduplicate — a claim in multiple chunks is one claim; keep the most complete formulation.
Resolve boundary conflicts — for adjacent-chunk contradictions, prefer the version with more supporting context.
Remove BOUNDARY PARTIAL markers where a complete version exists in a neighbouring chunk.

Write the final synthesis to outputs/<slug>-summary.md.

Output format

All tiers produce the same artifact at outputs/<slug>-summary.md:

# Summary: <document title or source filename>

**Source:** <URL or file path>
**Date:** YYYY-MM-DD
**Tier:** 1 | 2 (N windows) | 3 (N chunks)

## Key Claims
<3–7 most important assertions, each as a bullet>

## Methodology
<approach, dataset, evaluation, baselines — omit for non-research documents>

## Limitations
<what the source explicitly flags as weak, incomplete, or out of scope>

## Verdict
<one paragraph: what this document establishes, its credibility, who should read it>

## Sources
1. <title or filename> — <URL or file path>

## Coverage gaps
<only for Tier 3 with missing chunks: list missing indices and approximate character ranges>

Before stopping, verify on disk that outputs/<slug>-summary.md exists. Sources contains only the single source confirmed reachable in Step 1. No verifier subagent is needed — there are no URLs constructed from memory to verify.
