---
phase: 11-osint-search-paste
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- pkg/recon/sources/pastebin.go
- pkg/recon/sources/pastebin_test.go
- pkg/recon/sources/gistpaste.go
- pkg/recon/sources/gistpaste_test.go
- pkg/recon/sources/pastesites.go
- pkg/recon/sources/pastesites_test.go
autonomous: true
requirements: [RECON-PASTE-01]
must_haves:
truths:
- "PastebinSource scrapes Pastebin search results and emits findings for pastes containing provider keywords"
- "GistPasteSource searches public GitHub Gists via unauthenticated scraping (distinct from Phase 10 GistSource which uses API)"
- "PasteSitesSource aggregates results from dpaste, paste.ee, rentry.co, ix.io, and similar sites"
- "All paste sources feed raw content through keyword matching against the provider registry"
- "Missing credentials disable sources that need them; credential-free sources are always enabled"
artifacts:
- path: "pkg/recon/sources/pastebin.go"
provides: "PastebinSource implementing recon.ReconSource"
contains: "func (s *PastebinSource) Sweep"
- path: "pkg/recon/sources/gistpaste.go"
provides: "GistPasteSource implementing recon.ReconSource"
contains: "func (s *GistPasteSource) Sweep"
- path: "pkg/recon/sources/pastesites.go"
provides: "PasteSitesSource implementing recon.ReconSource with multi-site sub-platform pattern"
contains: "func (s *PasteSitesSource) Sweep"
key_links:
- from: "pkg/recon/sources/pastebin.go"
to: "pkg/recon/sources/httpclient.go"
via: "sources.Client for HTTP with retry"
pattern: "client\\.Do"
- from: "pkg/recon/sources/pastesites.go"
to: "providers.Registry"
via: "keyword matching on paste content"
pattern: "keywordSet|BuildQueries"
---
<objective>
Implement three paste site ReconSource implementations: PastebinSource, GistPasteSource, and PasteSitesSource (multi-site aggregator for dpaste, paste.ee, rentry.co, ix.io, etc.).
Purpose: RECON-PASTE-01 -- detect API key leaks across public paste sites.
Output: Three source files + tests covering paste site scanning.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
@pkg/recon/sources/gist.go (reference: Phase 10 GistSource uses GitHub API -- this plan's GistPasteSource is a scraping alternative)
@pkg/recon/sources/replit.go (reference pattern for HTML scraping source)
@pkg/recon/sources/sandboxes.go (reference pattern for multi-platform aggregator)
<interfaces>
From pkg/recon/source.go:
```go
type ReconSource interface {
Name() string
RateLimit() rate.Limit
Burst() int
RespectsRobots() bool
Enabled(cfg Config) bool
Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```
From pkg/recon/sources/httpclient.go:
```go
func NewClient() *Client
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error)
```
From pkg/recon/sources/gist.go (existing Phase 10 GistSource -- avoid name collision):
```go
type GistSource struct { ... } // Name() == "gist" -- already taken
func (s *GistSource) keywordSet() map[string]string // pattern to reuse
```
</interfaces>
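The success criteria call for a compile-time assertion that each source implements recon.ReconSource. A minimal sketch of the pattern, using a local stand-in for the interface (the real one above has more methods):

```go
package main

import "fmt"

// Stand-in for recon.ReconSource; the real interface lives in
// pkg/recon/source.go and also has RateLimit, Burst, Enabled, and Sweep.
type ReconSource interface {
	Name() string
	RespectsRobots() bool
}

// PastebinSource skeleton as planned in Task 1.
type PastebinSource struct{}

func (s *PastebinSource) Name() string         { return "pastebin" }
func (s *PastebinSource) RespectsRobots() bool { return true }

// Compile-time assertion: the build fails if PastebinSource
// ever stops satisfying the interface.
var _ ReconSource = (*PastebinSource)(nil)

func main() {
	fmt.Println((&PastebinSource{}).Name()) // prints "pastebin"
}
```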
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: PastebinSource + GistPasteSource</name>
<files>pkg/recon/sources/pastebin.go, pkg/recon/sources/pastebin_test.go, pkg/recon/sources/gistpaste.go, pkg/recon/sources/gistpaste_test.go</files>
<behavior>
- PastebinSource.Name() == "pastebin"
- PastebinSource.RateLimit() == rate.Every(3*time.Second) (conservative -- Pastebin scraping)
- PastebinSource.Burst() == 1
- PastebinSource.RespectsRobots() == true (HTML scraper)
- PastebinSource.Enabled() always true (credential-free scraping of pastebin.com's own search)
- PastebinSource.Sweep(): For each provider keyword, query Pastebin's search endpoint, parse the result links, fetch each paste's raw content via the /raw/{paste_id} endpoint, scan the content for keyword matches, and emit a Finding with Source=paste URL, SourceType="recon:pastebin", and ProviderName from the match.
- GistPasteSource.Name() == "gistpaste" (not "gist" -- that's Phase 10's API source)
- GistPasteSource.RateLimit() == rate.Every(3*time.Second)
- GistPasteSource.RespectsRobots() == true (HTML scraper)
- GistPasteSource.Enabled() always true (credential-free)
- GistPasteSource.Sweep(): Scrape gist.github.com/search?q={keyword} (public search, no auth needed), parse HTML for gist links, fetch raw content, keyword-match against registry
</behavior>
<action>
Create `pkg/recon/sources/pastebin.go`:
- Struct: `PastebinSource` with BaseURL, Registry, Limiters, Client fields
- Name() "pastebin", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): Use a two-phase approach:
Phase A: Search -- iterate BuildQueries(registry, "pastebin"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (Pastebin's own search). Parse HTML for paste links matching `^/[A-Za-z0-9]{8}$` pattern (Pastebin paste IDs are 8 alphanumeric chars). Collect unique paste IDs.
Phase B: Fetch+Scan -- for each paste ID: wait limiter, GET `{BaseURL}/raw/{pasteID}`, read body (limit 256KB), scan content against keywordSet() (same pattern as GistSource.keywordSet). If any keyword matches, emit Finding with Source=`{BaseURL}/{pasteID}`, SourceType="recon:pastebin", ProviderName from matched keyword.
- Helper: `pastebinKeywordSet(reg)` returning map[string]string (keyword -> provider name), same as GistSource pattern.
Create `pkg/recon/sources/gistpaste.go`:
- Struct: `GistPasteSource` with BaseURL, Registry, Limiters, Client fields
- Name() "gistpaste", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): iterate BuildQueries(registry, "gistpaste"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (gist.github.com search). Parse HTML for gist links matching `^/[^/]+/[a-f0-9]+$` pattern. For each gist link, construct raw URL `{BaseURL}{gistPath}/raw` and fetch content (limit 256KB). Keyword-match and emit Finding with SourceType="recon:gistpaste".
Tests: httptest servers serving HTML search results + raw paste content fixtures. Verify findings emitted with correct SourceType, Source URL, and ProviderName.
</action>
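A sketch of the Phase A link extraction, assuming the 8-character alphanumeric ID format stated above (the href-matching regex and the extractPasteIDs helper are illustrative, not the final parser):

```go
package main

import (
	"fmt"
	"regexp"
)

// Pastebin paste IDs are assumed to be exactly 8 alphanumeric
// characters, so only hrefs of that exact shape are collected.
var pasteLink = regexp.MustCompile(`href="/([A-Za-z0-9]{8})"`)

// extractPasteIDs returns unique paste IDs in first-seen order,
// matching the "collect unique paste IDs" step of Phase A.
func extractPasteIDs(html string) []string {
	seen := make(map[string]bool)
	var ids []string
	for _, m := range pasteLink.FindAllStringSubmatch(html, -1) {
		if id := m[1]; !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	page := `<a href="/Ab12Cd34">hit</a> <a href="/Ab12Cd34">dup</a> <a href="/login">nav</a>`
	fmt.Println(extractPasteIDs(page)) // prints "[Ab12Cd34]"
}
```

Navigation links like /login fall outside the 8-character pattern and are ignored; duplicates from paginated results are deduplicated before Phase B fetches anything.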
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste" -v -count=1</automated>
</verify>
<done>PastebinSource and GistPasteSource compile, pass all tests, handle ctx cancellation.</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: PasteSitesSource (multi-paste aggregator)</name>
<files>pkg/recon/sources/pastesites.go, pkg/recon/sources/pastesites_test.go</files>
<behavior>
- PasteSitesSource.Name() == "pastesites"
- PasteSitesSource.RateLimit() == rate.Every(3*time.Second)
- PasteSitesSource.RespectsRobots() == true
- PasteSitesSource.Enabled() always true (all credential-free)
- PasteSitesSource.Sweep() iterates across sub-platforms: dpaste.org, paste.ee, rentry.co, and hastebin.com (ix.io is excluded -- it has no search endpoint)
- Each sub-platform has: Name, SearchURL pattern, result link regex, and optional raw URL construction
- Sweep emits at least one Finding per platform when fixture data matches keywords
- ctx cancellation stops the sweep promptly
</behavior>
<action>
Create `pkg/recon/sources/pastesites.go` following the SandboxesSource multi-platform pattern from pkg/recon/sources/sandboxes.go:
- Define `pastePlatform` struct: Name string, SearchPath string (with %s for query), ResultLinkRegex string, RawPathTemplate string (optional, for fetching raw content), IsJSON bool
- Default platforms (ix.io is omitted: it has no search endpoint):
1. dpaste: SearchPath="/search/?q=%s", result links matching `^/[A-Za-z0-9]+$`, raw via `/{id}/raw`
2. paste.ee: SearchPath="/search?q=%s", result links matching `^/p/[A-Za-z0-9]+$`, raw via `/r/{id}`
3. rentry.co: SearchPath="/search?q=%s", result links matching `^/[a-z0-9-]+$`, raw via `/{slug}/raw`
4. hastebin: SearchPath="/search?q=%s", result links matching `^/[a-z]+$`, raw via `/raw/{id}`
- Struct: `PasteSitesSource` with Platforms []pastePlatform, BaseURL string (test override), Registry, Limiters, Client fields
- Name() "pastesites", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): For each platform, for each keyword from BuildQueries(registry, "pastesites"):
1. Wait limiter
2. GET `{platform base or BaseURL}{searchPath with keyword}`
3. Parse HTML, extract result links matching platform regex
4. For each result link: wait limiter, GET raw content URL, read body (256KB limit), keyword-match against registry
5. Emit Finding with Source=paste URL, SourceType="recon:pastesites", ProviderName from keyword match
- Default platforms populated in a `defaultPastePlatforms()` function. Tests override Platforms to use httptest URLs.
Test: httptest mux serving search HTML + raw content for each sub-platform. Verify at least one Finding per platform fixture. Verify SourceType="recon:pastesites" on all.
</action>
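The keyword matching in step 4 follows the keywordSet pattern from GistSource. A sketch (the keywords and provider names here are placeholder examples, not real registry entries):

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative keyword -> provider map mirroring GistSource.keywordSet.
// The real map is built from the provider Registry.
func keywordSet() map[string]string {
	return map[string]string{
		"sk_live_": "stripe",
		"AKIA":     "aws",
	}
}

// matchProvider returns the provider name for the first keyword found
// in the paste content, which populates Finding.ProviderName.
func matchProvider(content string, kw map[string]string) (string, bool) {
	for keyword, provider := range kw {
		if strings.Contains(content, keyword) {
			return provider, true
		}
	}
	return "", false
}

func main() {
	provider, ok := matchProvider("export KEY=AKIAEXAMPLE", keywordSet())
	fmt.Println(provider, ok) // prints "aws true"
}
```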
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPasteSites" -v -count=1</automated>
</verify>
<done>PasteSitesSource aggregates across multiple paste sites, keyword-matches content, emits findings with correct SourceType.</done>
</task>
</tasks>
<verification>
All paste sources compile and pass unit tests:
```bash
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste|TestPasteSites" -v -count=1
```
</verification>
<success_criteria>
- 3 new source files exist (pastebin.go, gistpaste.go, pastesites.go) with tests
- Each implements recon.ReconSource with compile-time assertion
- PasteSitesSource covers 3+ paste sub-platforms
- Keyword matching uses provider Registry for ProviderName population
- All tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/11-osint_search_paste/11-02-SUMMARY.md`
</output>