keyhunter/.planning/phases/11-osint_search_paste/11-02-PLAN.md
---
phase: 11-osint-search-paste
plan: "02"
type: execute
wave: 1
depends_on:
files_modified:
  - pkg/recon/sources/pastebin.go
  - pkg/recon/sources/pastebin_test.go
  - pkg/recon/sources/gistpaste.go
  - pkg/recon/sources/gistpaste_test.go
  - pkg/recon/sources/pastesites.go
  - pkg/recon/sources/pastesites_test.go
autonomous: true
requirements: [RECON-PASTE-01]
must_haves:
  truths:
    - PastebinSource scrapes Pastebin search results and emits findings for pastes containing provider keywords
    - GistPasteSource searches public GitHub Gists via unauthenticated scraping (distinct from Phase 10 GistSource, which uses the API)
    - PasteSitesSource aggregates results from dpaste, paste.ee, rentry.co, ix.io, and similar sites
    - All paste sources feed raw content through keyword matching against the provider registry
    - Missing credentials disable sources that need them; credential-free sources are always enabled
  artifacts:
    - path: pkg/recon/sources/pastebin.go
      provides: PastebinSource implementing recon.ReconSource
      contains: func (s *PastebinSource) Sweep
    - path: pkg/recon/sources/gistpaste.go
      provides: GistPasteSource implementing recon.ReconSource
      contains: func (s *GistPasteSource) Sweep
    - path: pkg/recon/sources/pastesites.go
      provides: PasteSitesSource implementing recon.ReconSource with multi-site sub-platform pattern
      contains: func (s *PasteSitesSource) Sweep
  key_links:
    - from: pkg/recon/sources/pastebin.go
      to: pkg/recon/sources/httpclient.go
      via: sources.Client for HTTP with retry
      pattern: client.Do
    - from: pkg/recon/sources/pastesites.go
      to: providers.Registry
      via: keyword matching on paste content
      pattern: keywordSet|BuildQueries
---
Implement three paste site ReconSource implementations: PastebinSource, GistPasteSource, and PasteSitesSource (multi-site aggregator for dpaste, paste.ee, rentry.co, ix.io, etc.).

Purpose: RECON-PASTE-01 -- detect API key leaks across public paste sites. Output: Three source files + tests covering paste site scanning.

<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
@pkg/recon/sources/gist.go (reference: Phase 10 GistSource uses the GitHub API -- this plan's GistPasteSource is a scraping alternative)
@pkg/recon/sources/replit.go (reference pattern for an HTML scraping source)
@pkg/recon/sources/sandboxes.go (reference pattern for a multi-platform aggregator)

From pkg/recon/source.go:

```go
type ReconSource interface {
	Name() string
	RateLimit() rate.Limit
	Burst() int
	RespectsRobots() bool
	Enabled(cfg Config) bool
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```
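The success criteria call for a compile-time interface assertion on each source. A minimal sketch of that pattern, using a simplified local stand-in for `recon.ReconSource` and `Finding` so it is self-contained (the real interface above also has `RateLimit`, `Burst`, `RespectsRobots`, and a `Config` parameter on `Enabled`):

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins; the real types live in pkg/recon/source.go.
type Finding struct{ Source, SourceType, ProviderName string }

type ReconSource interface {
	Name() string
	Enabled() bool
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}

type PastebinSource struct{}

func (s *PastebinSource) Name() string  { return "pastebin" }
func (s *PastebinSource) Enabled() bool { return true } // credential-free
func (s *PastebinSource) Sweep(ctx context.Context, query string, out chan<- Finding) error {
	return nil // stub
}

// Compile-time assertion: the build fails if PastebinSource stops
// satisfying the interface. In the real code this would read
// `var _ recon.ReconSource = (*PastebinSource)(nil)`.
var _ ReconSource = (*PastebinSource)(nil)

func main() {
	fmt.Println((&PastebinSource{}).Name())
}
```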

From pkg/recon/sources/httpclient.go:

```go
func NewClient() *Client
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error)
```

From pkg/recon/sources/gist.go (existing Phase 10 GistSource -- avoid name collision):

```go
type GistSource struct { ... }  // Name() == "gist" -- already taken
func (s *GistSource) keywordSet() map[string]string  // pattern to reuse
```
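A sketch of the keywordSet pattern the new sources reuse: flatten registry providers into a keyword -> provider-name map, then substring-match paste bodies against it. The `provider` struct here is a hypothetical stand-in for the real registry entries in pkg/providers:

```go
package main

import (
	"fmt"
	"strings"
)

// provider is a made-up stand-in for a registry entry.
type provider struct {
	Name     string
	Keywords []string
}

// keywordSet flattens providers into keyword -> provider name,
// mirroring the GistSource.keywordSet pattern referenced above.
func keywordSet(providers []provider) map[string]string {
	set := make(map[string]string)
	for _, p := range providers {
		for _, kw := range p.Keywords {
			set[strings.ToLower(kw)] = p.Name
		}
	}
	return set
}

// matchContent reports which provider a paste body matches, if any.
func matchContent(body string, set map[string]string) (string, bool) {
	lower := strings.ToLower(body)
	for kw, name := range set {
		if strings.Contains(lower, kw) {
			return name, true
		}
	}
	return "", false
}

func main() {
	set := keywordSet([]provider{{Name: "openai", Keywords: []string{"sk-proj"}}})
	name, ok := matchContent("export OPENAI_API_KEY=sk-proj-abc", set)
	fmt.Println(name, ok)
}
```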
**Task 1: PastebinSource + GistPasteSource**

Files: pkg/recon/sources/pastebin.go, pkg/recon/sources/pastebin_test.go, pkg/recon/sources/gistpaste.go, pkg/recon/sources/gistpaste_test.go

- PastebinSource.Name() == "pastebin"
- PastebinSource.RateLimit() == rate.Every(3*time.Second) (conservative -- Pastebin scraping)
- PastebinSource.Burst() == 1
- PastebinSource.RespectsRobots() == true (HTML scraper)
- PastebinSource.Enabled() always true (credential-free Google dorking of pastebin.com)
- PastebinSource.Sweep(): For each provider keyword, scrape Google (via the same DuckDuckGo HTML endpoint as a proxy to avoid Google ToS) with query `site:pastebin.com "{keyword}"`. Parse result links. For each pastebin.com URL found, fetch the raw paste content via the /raw/{paste_id} endpoint, scan the content for keyword matches, and emit a Finding with Source=paste URL, SourceType="recon:pastebin", and ProviderName from the match.
- GistPasteSource.Name() == "gistpaste" (not "gist" -- that's Phase 10's API source)
- GistPasteSource.RateLimit() == rate.Every(3*time.Second)
- GistPasteSource.RespectsRobots() == true (HTML scraper)
- GistPasteSource.Enabled() always true (credential-free)
- GistPasteSource.Sweep(): Scrape gist.github.com/search?q={keyword} (public search, no auth needed), parse the HTML for gist links, fetch raw content, and keyword-match against the registry

Create `pkg/recon/sources/pastebin.go`:

- Struct: `PastebinSource` with BaseURL, Registry, Limiters, Client fields
- Name() "pastebin", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): use a two-phase approach:
  - Phase A: Search -- iterate BuildQueries(registry, "pastebin"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (Pastebin's own search). Parse the HTML for paste links matching the `^/[A-Za-z0-9]{8}$` pattern (Pastebin paste IDs are 8 alphanumeric characters). Collect unique paste IDs.
  - Phase B: Fetch+Scan -- for each paste ID: wait on the limiter, GET `{BaseURL}/raw/{pasteID}`, read the body (256KB limit), and scan the content against keywordSet() (same pattern as GistSource.keywordSet). If any keyword matches, emit a Finding with Source=`{BaseURL}/{pasteID}`, SourceType="recon:pastebin", and ProviderName from the matched keyword.
- Helper: `pastebinKeywordSet(reg)` returning map[string]string (keyword -> provider name), same as the GistSource pattern.
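Phase A's link extraction can be sketched as a regex over the search-results HTML; the href-scanning regex and `extractPasteIDs` helper here are illustrative, not the plan's prescribed names:

```go
package main

import (
	"fmt"
	"regexp"
)

// pasteLinkRe captures href values like "/aB3xYz99": a slash plus
// exactly 8 alphanumeric characters, per the plan's paste-ID rule.
// A real implementation might prefer an HTML parser over a regex.
var pasteLinkRe = regexp.MustCompile(`href="(/[A-Za-z0-9]{8})"`)

// extractPasteIDs pulls unique paste IDs out of a search-results page.
func extractPasteIDs(html string) []string {
	seen := make(map[string]bool)
	var ids []string
	for _, m := range pasteLinkRe.FindAllStringSubmatch(html, -1) {
		id := m[1][1:] // strip the leading "/"
		if !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	// Duplicate links collapse; short nav links like /login don't match.
	html := `<a href="/aB3xYz99">hit</a> <a href="/aB3xYz99">dup</a> <a href="/login">nav</a>`
	fmt.Println(extractPasteIDs(html))
}
```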
Create `pkg/recon/sources/gistpaste.go`:
- Struct: `GistPasteSource` with BaseURL, Registry, Limiters, Client fields
- Name() "gistpaste", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): iterate BuildQueries(registry, "gistpaste"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (gist.github.com search). Parse HTML for gist links matching `^/[^/]+/[a-f0-9]+$` pattern. For each gist link, construct raw URL `{BaseURL}{gistPath}/raw` and fetch content (limit 256KB). Keyword-match and emit Finding with SourceType="recon:gistpaste".
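The gist-link filter and raw-URL construction above can be sketched as follows; `rawGistURL` is a hypothetical helper name, and BaseURL stands in for gist.github.com (overridable in tests):

```go
package main

import (
	"fmt"
	"regexp"
)

// gistLinkRe is the plan's result-link pattern: /{user}/{hex gist id}.
var gistLinkRe = regexp.MustCompile(`^/[^/]+/[a-f0-9]+$`)

// rawGistURL validates a candidate link against the pattern and, if it
// matches, appends "/raw" to build the raw-content URL per the plan.
func rawGistURL(baseURL, gistPath string) (string, bool) {
	if !gistLinkRe.MatchString(gistPath) {
		return "", false
	}
	return baseURL + gistPath + "/raw", true
}

func main() {
	u, ok := rawGistURL("https://gist.github.com", "/alice/9f2c4e")
	fmt.Println(u, ok)
	// Non-gist paths such as search links are rejected.
	_, bad := rawGistURL("https://gist.github.com", "/search")
	fmt.Println(bad)
}
```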

Tests: httptest servers serving HTML search results + raw paste content fixtures. Verify findings emitted with correct SourceType, Source URL, and ProviderName.
Verify:

```bash
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste" -v -count=1
```

Done when: PastebinSource and GistPasteSource compile, pass all tests, and handle ctx cancellation.

**Task 2: PasteSitesSource (multi-paste aggregator)**

Files: pkg/recon/sources/pastesites.go, pkg/recon/sources/pastesites_test.go

- PasteSitesSource.Name() == "pastesites"
- PasteSitesSource.RateLimit() == rate.Every(3*time.Second)
- PasteSitesSource.RespectsRobots() == true
- PasteSitesSource.Enabled() always true (all credential-free)
- PasteSitesSource.Sweep() iterates across sub-platforms: dpaste.org, paste.ee, rentry.co, ix.io, hastebin.com
- Each sub-platform has: Name, SearchURL pattern, result link regex, and optional raw URL construction
- Sweep emits at least one Finding per platform when fixture data matches keywords
- ctx cancellation stops the sweep promptly

Create `pkg/recon/sources/pastesites.go` following the SandboxesSource multi-platform pattern from pkg/recon/sources/sandboxes.go:
- Define `pastePlatform` struct: Name string, SearchPath string (with %s for query), ResultLinkRegex string, RawPathTemplate string (optional, for fetching raw content), IsJSON bool
- Default platforms:
  1. dpaste: SearchPath="/search/?q=%s", result links matching `^/[A-Za-z0-9]+$`, raw via `/{id}/raw`
  2. paste.ee: SearchPath="/search?q=%s", result links matching `^/p/[A-Za-z0-9]+$`, raw via `/r/{id}`
  3. rentry.co: SearchPath="/search?q=%s", result links matching `^/[a-z0-9-]+$`, raw via `/{slug}/raw`
  4. ix.io: has no public search endpoint -- omit from the default platform list
  5. hastebin: SearchPath="/search?q=%s", result links matching `^/[a-z]+$`, raw via `/raw/{id}`

- Struct: `PasteSitesSource` with Platforms []pastePlatform, BaseURL string (test override), Registry, Limiters, Client fields
- Name() "pastesites", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): For each platform, for each keyword from BuildQueries(registry, "pastesites"):
  1. Wait limiter
  2. GET `{platform base or BaseURL}{searchPath with keyword}`
  3. Parse HTML, extract result links matching platform regex
  4. For each result link: wait limiter, GET raw content URL, read body (256KB limit), keyword-match against registry
  5. Emit Finding with Source=paste URL, SourceType="recon:pastesites", ProviderName from keyword match
- Default platforms populated in a `defaultPastePlatforms()` function. Tests override Platforms to use httptest URLs.

Test: httptest mux serving search HTML + raw content for each sub-platform. Verify at least one Finding per platform fixture. Verify SourceType="recon:pastesites" on all.
Verify:

```bash
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPasteSites" -v -count=1
```

Done when: PasteSitesSource aggregates across multiple paste sites, keyword-matches content, and emits findings with the correct SourceType.

All paste sources compile and pass unit tests:

```bash
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste|TestPasteSites" -v -count=1
```

<success_criteria>

  • 3 new source files exist (pastebin.go, gistpaste.go, pastesites.go) with tests
  • Each implements recon.ReconSource with compile-time assertion
  • PasteSitesSource covers 3+ paste sub-platforms
  • Keyword matching uses provider Registry for ProviderName population
  • All tests pass

</success_criteria>
After completion, create `.planning/phases/11-osint_search_paste/11-02-SUMMARY.md`