diff --git a/.planning/phases/11-osint_search_paste/11-02-SUMMARY.md b/.planning/phases/11-osint_search_paste/11-02-SUMMARY.md
new file mode 100644
index 0000000..8a1088d
--- /dev/null
+++ b/.planning/phases/11-osint_search_paste/11-02-SUMMARY.md
@@ -0,0 +1,91 @@
+---
+phase: 11-osint-search-paste
+plan: 02
+subsystem: recon
+tags: [pastebin, gist, paste-sites, scraping, osint]
+
+requires:
+  - phase: 10-osint-code-hosting
+    provides: ReconSource interface, shared HTTP client, extractAnchorHrefs helper, BuildQueries
+
+provides:
+  - PastebinSource for pastebin.com search+raw scanning
+  - GistPasteSource for gist.github.com unauthenticated search scraping
+  - PasteSitesSource multi-platform aggregator (dpaste, paste.ee, rentry, hastebin)
+
+affects: [11-03, recon-registration, recon-engine]
+
+tech-stack:
+  added: []
+  patterns: [two-phase search+raw-fetch for paste sources, multi-platform aggregator reuse from sandboxes]
+
+key-files:
+  created:
+    - pkg/recon/sources/pastebin.go
+    - pkg/recon/sources/pastebin_test.go
+    - pkg/recon/sources/gistpaste.go
+    - pkg/recon/sources/gistpaste_test.go
+    - pkg/recon/sources/pastesites.go
+    - pkg/recon/sources/pastesites_test.go
+  modified: []
+
+key-decisions:
+  - "Two-phase approach for all paste sources: search HTML for links, then fetch raw content and keyword-match"
+  - "PasteSitesSource reuses SandboxesSource multi-platform pattern with pastePlatform struct"
+  - "GistPasteSource named 'gistpaste' to avoid collision with Phase 10 GistSource ('gist')"
+
+patterns-established:
+  - "Paste source pattern: search page -> extract links -> fetch raw -> keyword match -> emit finding"
+
+requirements-completed: [RECON-PASTE-01]
+
+duration: 5min
+completed: 2026-04-06
+---
+
+# Phase 11 Plan 02: Paste Site Sources Summary
+
+**Three paste site ReconSources implementing two-phase search+raw-fetch with keyword matching against provider registry**
+
+## What Was Built
+
+### PastebinSource (`pkg/recon/sources/pastebin.go`)
+- Searches pastebin.com for provider keywords, extracts 8-char paste IDs from HTML
+- Fetches `/raw/{pasteID}` content (256KB cap), matches against provider keyword set
+- Emits findings with SourceType="recon:pastebin" and ProviderName from matched keyword
+- Rate: Every(3s), Burst 1, credential-free, respects robots.txt
+
+### GistPasteSource (`pkg/recon/sources/gistpaste.go`)
+- Scrapes gist.github.com public search (no auth needed, distinct from Phase 10 API-based GistSource)
+- Extracts gist links matching `//` pattern, fetches `{gistPath}/raw`
+- Keyword-matches raw content, emits findings with SourceType="recon:gistpaste"
+- Rate: Every(3s), Burst 1, credential-free
+
+### PasteSitesSource (`pkg/recon/sources/pastesites.go`)
+- Multi-platform aggregator following SandboxesSource pattern
+- Covers 4 paste sub-platforms: dpaste.org, paste.ee, rentry.co, hastebin.com
+- Each platform has configurable SearchPath, ResultLinkRegex, and RawPathTemplate
+- Per-platform error isolation: failures logged and skipped without aborting others
+- Findings tagged with `platform=` in KeyMasked field
+
+## Test Coverage
+
+9 tests total across 3 test files:
+- Sweep with httptest fixtures verifying finding extraction and keyword matching
+- Name/rate/burst/robots/enabled metadata assertions
+- Context cancellation handling
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Commits
+
+| Task | Commit | Description |
+|------|--------|-------------|
+| 1 | 3c500b5 | PastebinSource + GistPasteSource with tests |
+| 2 | ed148d4 | PasteSitesSource multi-paste aggregator with tests |
+
+## Self-Check: PASSED
+
+All 7 files found. Both commit hashes verified in git log.