---
phase: 11-osint-search-paste
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - pkg/recon/sources/pastebin.go
  - pkg/recon/sources/pastebin_test.go
  - pkg/recon/sources/gistpaste.go
  - pkg/recon/sources/gistpaste_test.go
  - pkg/recon/sources/pastesites.go
  - pkg/recon/sources/pastesites_test.go
autonomous: true
requirements: [RECON-PASTE-01]
must_haves:
  truths:
    - "PastebinSource scrapes Pastebin search results and emits findings for pastes containing provider keywords"
    - "GistPasteSource searches public GitHub Gists via unauthenticated scraping (distinct from Phase 10 GistSource, which uses the API)"
    - "PasteSitesSource aggregates results from dpaste, paste.ee, rentry.co, ix.io, and similar sites"
    - "All paste sources feed raw content through keyword matching against the provider registry"
    - "Missing credentials disable sources that need them; credential-free sources are always enabled"
  artifacts:
    - path: "pkg/recon/sources/pastebin.go"
      provides: "PastebinSource implementing recon.ReconSource"
      contains: "func (s *PastebinSource) Sweep"
    - path: "pkg/recon/sources/gistpaste.go"
      provides: "GistPasteSource implementing recon.ReconSource"
      contains: "func (s *GistPasteSource) Sweep"
    - path: "pkg/recon/sources/pastesites.go"
      provides: "PasteSitesSource implementing recon.ReconSource with multi-site sub-platform pattern"
      contains: "func (s *PasteSitesSource) Sweep"
  key_links:
    - from: "pkg/recon/sources/pastebin.go"
      to: "pkg/recon/sources/httpclient.go"
      via: "sources.Client for HTTP with retry"
      pattern: "client\\.Do"
    - from: "pkg/recon/sources/pastesites.go"
      to: "providers.Registry"
      via: "keyword matching on paste content"
      pattern: "keywordSet|BuildQueries"
---

Implement three paste-site ReconSource implementations: PastebinSource, GistPasteSource, and PasteSitesSource (a multi-site aggregator for dpaste, paste.ee, rentry.co, ix.io, etc.).

Purpose: RECON-PASTE-01 -- detect API key leaks across public paste sites.
Output: Three source files + tests covering paste site scanning.

@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
@pkg/recon/sources/gist.go (reference: Phase 10 GistSource uses the GitHub API -- this plan's GistPasteSource is a scraping alternative)
@pkg/recon/sources/replit.go (reference pattern for an HTML-scraping source)
@pkg/recon/sources/sandboxes.go (reference pattern for a multi-platform aggregator)

From pkg/recon/source.go:

```go
type ReconSource interface {
	Name() string
	RateLimit() rate.Limit
	Burst() int
	RespectsRobots() bool
	Enabled(cfg Config) bool
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```

From pkg/recon/sources/httpclient.go:

```go
func NewClient() *Client
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error)
```

From pkg/recon/sources/gist.go (existing Phase 10 GistSource -- avoid name collision):

```go
type GistSource struct { ... } // Name() == "gist" -- already taken
func (s *GistSource) keywordSet() map[string]string // pattern to reuse
```

Task 1: PastebinSource + GistPasteSource

pkg/recon/sources/pastebin.go, pkg/recon/sources/pastebin_test.go, pkg/recon/sources/gistpaste.go, pkg/recon/sources/gistpaste_test.go

- PastebinSource.Name() == "pastebin"
- PastebinSource.RateLimit() == rate.Every(3*time.Second) (conservative -- Pastebin scraping)
- PastebinSource.Burst() == 1
- PastebinSource.RespectsRobots() == true (HTML scraper)
- PastebinSource.Enabled() always true (credential-free Google dorking of pastebin.com)
- PastebinSource.Sweep(): for each provider keyword, run the dork `site:pastebin.com "{keyword}"` through the DuckDuckGo HTML endpoint (as a proxy, to avoid Google's ToS). Parse result links.
  For each pastebin.com URL found, fetch the raw paste content via the /raw/{paste_id} endpoint, scan the content for keyword matches, and emit a Finding with Source=paste URL, SourceType="recon:pastebin", and ProviderName from the match.
- GistPasteSource.Name() == "gistpaste" (not "gist" -- that's Phase 10's API source)
- GistPasteSource.RateLimit() == rate.Every(3*time.Second)
- GistPasteSource.RespectsRobots() == true (HTML scraper)
- GistPasteSource.Enabled() always true (credential-free)
- GistPasteSource.Sweep(): scrape gist.github.com/search?q={keyword} (public search, no auth needed), parse the HTML for gist links, fetch raw content, keyword-match against the registry

Create `pkg/recon/sources/pastebin.go`:

- Struct: `PastebinSource` with BaseURL, Registry, Limiters, Client fields
- Name() "pastebin", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep() uses a two-phase approach:
  - Phase A: Search -- iterate BuildQueries(registry, "pastebin"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (Pastebin's own search). Parse the HTML for paste links matching `^/[A-Za-z0-9]{8}$` (Pastebin paste IDs are 8 alphanumeric characters). Collect unique paste IDs.
  - Phase B: Fetch+Scan -- for each paste ID: wait on the limiter, GET `{BaseURL}/raw/{pasteID}`, read the body (256KB limit), and scan the content against keywordSet() (same pattern as GistSource.keywordSet). If any keyword matches, emit a Finding with Source=`{BaseURL}/{pasteID}`, SourceType="recon:pastebin", and ProviderName from the matched keyword.
- Helper: `pastebinKeywordSet(reg)` returning map[string]string (keyword -> provider name), same as the GistSource pattern.

Create `pkg/recon/sources/gistpaste.go`:

- Struct: `GistPasteSource` with BaseURL, Registry, Limiters, Client fields
- Name() "gistpaste", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): iterate BuildQueries(registry, "gistpaste").
  For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (gist.github.com search). Parse the HTML for gist links matching `^/[^/]+/[a-f0-9]+$`. For each gist link, construct the raw URL `{BaseURL}{gistPath}/raw` and fetch the content (256KB limit). Keyword-match and emit a Finding with SourceType="recon:gistpaste".

Tests: httptest servers serving HTML search results + raw paste content fixtures. Verify findings are emitted with the correct SourceType, Source URL, and ProviderName.

cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste" -v -count=1

PastebinSource and GistPasteSource compile, pass all tests, and handle ctx cancellation.

Task 2: PasteSitesSource (multi-paste aggregator)

pkg/recon/sources/pastesites.go, pkg/recon/sources/pastesites_test.go

- PasteSitesSource.Name() == "pastesites"
- PasteSitesSource.RateLimit() == rate.Every(3*time.Second)
- PasteSitesSource.RespectsRobots() == true
- PasteSitesSource.Enabled() always true (all credential-free)
- PasteSitesSource.Sweep() iterates across sub-platforms: dpaste.org, paste.ee, rentry.co, ix.io, hastebin.com
- Each sub-platform has: a Name, a SearchURL pattern, a result-link regex, and optional raw-URL construction
- Sweep emits at least one Finding per platform when fixture data matches keywords
- ctx cancellation stops the sweep promptly

Create `pkg/recon/sources/pastesites.go` following the SandboxesSource multi-platform pattern from pkg/recon/sources/sandboxes.go:

- Define a `pastePlatform` struct: Name string, SearchPath string (with %s for the query), ResultLinkRegex string, RawPathTemplate string (optional, for fetching raw content), IsJSON bool
- Default platforms:
  1. dpaste: SearchPath="/search/?q=%s", result links matching `^/[A-Za-z0-9]+$`, raw via `/{id}/raw`
  2. paste.ee: SearchPath="/search?q=%s", result links matching `^/p/[A-Za-z0-9]+$`, raw via `/r/{id}`
  3. rentry.co: SearchPath="/search?q=%s", result links matching `^/[a-z0-9-]+$`, raw via `/{slug}/raw`
  4.
  ix.io: no search endpoint -- skip (remove it from the platform list)
  5. hastebin: SearchPath="/search?q=%s", result links matching `^/[a-z]+$`, raw via `/raw/{id}`
- Struct: `PasteSitesSource` with Platforms []pastePlatform, BaseURL string (test override), Registry, Limiters, Client fields
- Name() "pastesites", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): for each platform, for each keyword from BuildQueries(registry, "pastesites"):
  1. Wait on the limiter
  2. GET `{platform base or BaseURL}{searchPath with keyword}`
  3. Parse the HTML and extract result links matching the platform regex
  4. For each result link: wait on the limiter, GET the raw-content URL, read the body (256KB limit), and keyword-match against the registry
  5. Emit a Finding with Source=paste URL, SourceType="recon:pastesites", and ProviderName from the keyword match
- Default platforms are populated in a `defaultPastePlatforms()` function. Tests override Platforms to use httptest URLs.

Test: an httptest mux serving search HTML + raw content for each sub-platform. Verify at least one Finding per platform fixture. Verify SourceType="recon:pastesites" on all.

cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPasteSites" -v -count=1

PasteSitesSource aggregates across multiple paste sites, keyword-matches content, and emits findings with the correct SourceType.

All paste sources compile and pass unit tests:

```bash
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste|TestPasteSites" -v -count=1
```

- 3 new source files exist (pastebin.go, gistpaste.go, pastesites.go) with tests
- Each implements recon.ReconSource with a compile-time assertion
- PasteSitesSource covers 3+ paste sub-platforms
- Keyword matching uses the provider Registry to populate ProviderName
- All tests pass

After completion, create `.planning/phases/11-osint_search_paste/11-02-SUMMARY.md`.