docs(11-02): complete paste site sources plan

- SUMMARY.md for PastebinSource, GistPasteSource, PasteSitesSource
salvacybersec
2026-04-06 11:57:21 +03:00
parent ed148d47e1
commit da0bf800f9


@@ -0,0 +1,91 @@
---
phase: 11-osint-search-paste
plan: 02
subsystem: recon
tags: [pastebin, gist, paste-sites, scraping, osint]
requires:
- phase: 10-osint-code-hosting
provides: ReconSource interface, shared HTTP client, extractAnchorHrefs helper, BuildQueries
provides:
- PastebinSource for pastebin.com search+raw scanning
- GistPasteSource for gist.github.com unauthenticated search scraping
- PasteSitesSource multi-platform aggregator (dpaste, paste.ee, rentry, hastebin)
affects: [11-03, recon-registration, recon-engine]
tech-stack:
added: []
patterns: [two-phase search+raw-fetch for paste sources, multi-platform aggregator reuse from sandboxes]
key-files:
created:
- pkg/recon/sources/pastebin.go
- pkg/recon/sources/pastebin_test.go
- pkg/recon/sources/gistpaste.go
- pkg/recon/sources/gistpaste_test.go
- pkg/recon/sources/pastesites.go
- pkg/recon/sources/pastesites_test.go
modified: []
key-decisions:
- "Two-phase approach for all paste sources: search HTML for links, then fetch raw content and keyword-match"
- "PasteSitesSource reuses SandboxesSource multi-platform pattern with pastePlatform struct"
- "GistPasteSource named 'gistpaste' to avoid collision with Phase 10 GistSource ('gist')"
patterns-established:
- "Paste source pattern: search page -> extract links -> fetch raw -> keyword match -> emit finding"
requirements-completed: [RECON-PASTE-01]
duration: 5min
completed: 2026-04-06
---
# Phase 11 Plan 02: Paste Site Sources Summary
**Three paste site ReconSources implementing two-phase search+raw-fetch with keyword matching against provider registry**
## What Was Built
### PastebinSource (`pkg/recon/sources/pastebin.go`)
- Searches pastebin.com for provider keywords, extracts 8-char paste IDs from HTML
- Fetches `/raw/{pasteID}` content (256KB cap), matches against provider keyword set
- Emits findings with SourceType="recon:pastebin" and ProviderName from matched keyword
- Rate: Every(3s), Burst 1, credential-free, respects robots.txt
### GistPasteSource (`pkg/recon/sources/gistpaste.go`)
- Scrapes gist.github.com public search (no auth needed, distinct from Phase 10 API-based GistSource)
- Extracts gist links matching `/<user>/<hex-hash>` pattern, fetches `{gistPath}/raw`
- Keyword-matches raw content, emits findings with SourceType="recon:gistpaste"
- Rate: Every(3s), Burst 1, credential-free
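As a rough sketch of the link filtering step above: given hrefs already pulled from the search page (for instance by the Phase 10 `extractAnchorHrefs` helper, whose signature is not shown here), keep only those shaped like `/<user>/<hex-hash>` and append `/raw`. The regex, the 32-character hash length, and the function name `gistRawURLs` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// gistPathRe is an assumed pattern for "/<user>/<hex-hash>": a user segment
// followed by a 32-character lowercase hex ID and nothing after it.
var gistPathRe = regexp.MustCompile(`^/[A-Za-z0-9-]+/[0-9a-f]{32}$`)

// gistRawURLs filters hrefs down to gist paths and returns "{base}{gistPath}/raw"
// for each unique match, preserving first-seen order.
func gistRawURLs(base string, hrefs []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, h := range hrefs {
		if !gistPathRe.MatchString(h) || seen[h] {
			continue
		}
		seen[h] = true
		out = append(out, base+h+"/raw")
	}
	return out
}

func main() {
	hrefs := []string{
		"/alice/0123456789abcdef0123456789abcdef",
		"/alice/0123456789abcdef0123456789abcdef/stargazers", // sub-page, skipped
		"/search?q=token",                                    // navigation, skipped
		"/alice/0123456789abcdef0123456789abcdef",            // duplicate, skipped
	}
	for _, u := range gistRawURLs("https://gist.github.com", hrefs) {
		fmt.Println(u)
	}
}
```

Anchoring the pattern with `^` and `$` is what excludes sub-pages such as `/stargazers`, which otherwise share the same prefix as a gist link.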
### PasteSitesSource (`pkg/recon/sources/pastesites.go`)
- Multi-platform aggregator following SandboxesSource pattern
- Covers 4 paste sub-platforms: dpaste.org, paste.ee, rentry.co, hastebin.com
- Each platform has configurable SearchPath, ResultLinkRegex, and RawPathTemplate
- Per-platform error isolation: failures logged and skipped without aborting others
- Findings are tagged with `platform=<name>` in the KeyMasked field
## Test Coverage
9 tests total across 3 test files:
- Sweep with httptest fixtures verifying finding extraction and keyword matching
- Name/rate/burst/robots/enabled metadata assertions
- Context cancellation handling
## Deviations from Plan
None - plan executed exactly as written.
## Commits
| Task | Commit | Description |
|------|--------|-------------|
| 1 | 3c500b5 | PastebinSource + GistPasteSource with tests |
| 2 | ed148d4 | PasteSitesSource multi-paste aggregator with tests |
## Self-Check: PASSED
All 7 files found. Both commit hashes verified in git log.