docs(11): create phase plan -- 3 plans for search engine dorking + paste sites
.planning/phases/11-osint_search_paste/11-02-PLAN.md (new file, 199 lines)
---
phase: 11-osint-search-paste
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - pkg/recon/sources/pastebin.go
  - pkg/recon/sources/pastebin_test.go
  - pkg/recon/sources/gistpaste.go
  - pkg/recon/sources/gistpaste_test.go
  - pkg/recon/sources/pastesites.go
  - pkg/recon/sources/pastesites_test.go
autonomous: true
requirements: [RECON-PASTE-01]

must_haves:
  truths:
    - "PastebinSource scrapes Pastebin search results and emits findings for pastes containing provider keywords"
    - "GistPasteSource searches public GitHub Gists via unauthenticated scraping (distinct from Phase 10 GistSource, which uses the API)"
    - "PasteSitesSource aggregates results from dpaste, paste.ee, rentry.co, hastebin, and similar sites"
    - "All paste sources feed raw content through keyword matching against the provider registry"
    - "Missing credentials disable sources that need them; credential-free sources are always enabled"
  artifacts:
    - path: "pkg/recon/sources/pastebin.go"
      provides: "PastebinSource implementing recon.ReconSource"
      contains: "func (s *PastebinSource) Sweep"
    - path: "pkg/recon/sources/gistpaste.go"
      provides: "GistPasteSource implementing recon.ReconSource"
      contains: "func (s *GistPasteSource) Sweep"
    - path: "pkg/recon/sources/pastesites.go"
      provides: "PasteSitesSource implementing recon.ReconSource with multi-site sub-platform pattern"
      contains: "func (s *PasteSitesSource) Sweep"
  key_links:
    - from: "pkg/recon/sources/pastebin.go"
      to: "pkg/recon/sources/httpclient.go"
      via: "sources.Client for HTTP with retry"
      pattern: "client\\.Do"
    - from: "pkg/recon/sources/pastesites.go"
      to: "providers.Registry"
      via: "keyword matching on paste content"
      pattern: "keywordSet|BuildQueries"
---

<objective>
Implement three paste site ReconSource implementations: PastebinSource, GistPasteSource, and PasteSitesSource (a multi-site aggregator for dpaste, paste.ee, rentry.co, hastebin, and similar sites).

Purpose: RECON-PASTE-01 -- detect API key leaks across public paste sites.
Output: three source files plus tests covering paste site scanning.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
@pkg/recon/sources/gist.go (reference: Phase 10 GistSource uses the GitHub API -- this plan's GistPasteSource is a scraping alternative)
@pkg/recon/sources/replit.go (reference pattern for an HTML-scraping source)
@pkg/recon/sources/sandboxes.go (reference pattern for a multi-platform aggregator)

<interfaces>
From pkg/recon/source.go:
```go
type ReconSource interface {
    Name() string
    RateLimit() rate.Limit
    Burst() int
    RespectsRobots() bool
    Enabled(cfg Config) bool
    Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```

From pkg/recon/sources/httpclient.go:
```go
func NewClient() *Client
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error)
```

From pkg/recon/sources/gist.go (existing Phase 10 GistSource -- avoid name collision):
```go
type GistSource struct { ... } // Name() == "gist" -- already taken
func (s *GistSource) keywordSet() map[string]string // pattern to reuse
```
</interfaces>
</context>

<tasks>

<task type="auto" tdd="true">
<name>Task 1: PastebinSource + GistPasteSource</name>
<files>pkg/recon/sources/pastebin.go, pkg/recon/sources/pastebin_test.go, pkg/recon/sources/gistpaste.go, pkg/recon/sources/gistpaste_test.go</files>
<behavior>
- PastebinSource.Name() == "pastebin"
- PastebinSource.RateLimit() == rate.Every(3*time.Second) (conservative -- Pastebin scraping)
- PastebinSource.Burst() == 1
- PastebinSource.RespectsRobots() == true (HTML scraper)
- PastebinSource.Enabled() always true (credential-free scraping of pastebin.com)
- PastebinSource.Sweep(): for each provider keyword, query Pastebin's own search with the keyword, parse the result links for paste IDs, fetch each paste's raw content via the /raw/{paste_id} endpoint, scan the content for keyword matches, and emit a Finding with Source=paste URL, SourceType="recon:pastebin", and ProviderName from the match.
- GistPasteSource.Name() == "gistpaste" (not "gist" -- that's Phase 10's API source)
- GistPasteSource.RateLimit() == rate.Every(3*time.Second)
- GistPasteSource.RespectsRobots() == true (HTML scraper)
- GistPasteSource.Enabled() always true (credential-free)
- GistPasteSource.Sweep(): scrape gist.github.com/search?q={keyword} (public search, no auth needed), parse the HTML for gist links, fetch raw content, and keyword-match against the registry
</behavior>
<action>
Create `pkg/recon/sources/pastebin.go`:
- Struct: `PastebinSource` with BaseURL, Registry, Limiters, Client fields
- Name() "pastebin", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): use a two-phase approach:
  Phase A: Search -- iterate BuildQueries(registry, "pastebin"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (Pastebin's own search). Parse the HTML for paste links matching `^/[A-Za-z0-9]{8}$` (Pastebin paste IDs are 8 alphanumeric chars). Collect unique paste IDs.
  Phase B: Fetch+Scan -- for each paste ID: wait on the limiter, GET `{BaseURL}/raw/{pasteID}`, read the body (limit 256KB), and scan the content against keywordSet() (same pattern as GistSource.keywordSet). If any keyword matches, emit a Finding with Source=`{BaseURL}/{pasteID}`, SourceType="recon:pastebin", and ProviderName from the matched keyword.
- Helper: `pastebinKeywordSet(reg)` returning map[string]string (keyword -> provider name), same as the GistSource pattern.
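The two phases above reduce to a pair of pure helpers plus HTTP plumbing. A sketch under stated assumptions: the helper names, sample HTML, and keyword map are illustrative, not existing code, and the 8-character paste-ID rule comes from this plan:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hrefRe pulls link targets shaped like Pastebin paste IDs
// (8 alphanumeric chars, per the plan) out of search-result HTML.
var hrefRe = regexp.MustCompile(`href="(/[A-Za-z0-9]{8})"`)

// extractPasteIDs is Phase A's parse step: collect unique paste IDs.
func extractPasteIDs(html string) []string {
	seen := map[string]bool{}
	var ids []string
	for _, m := range hrefRe.FindAllStringSubmatch(html, -1) {
		id := strings.TrimPrefix(m[1], "/")
		if !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids
}

// matchKeywords is Phase B's scan step: keyword -> provider name,
// applied to raw paste content (mirrors the GistSource keywordSet pattern).
func matchKeywords(content string, keywords map[string]string) []string {
	var providers []string
	for kw, provider := range keywords {
		if strings.Contains(content, kw) {
			providers = append(providers, provider)
		}
	}
	return providers
}

func main() {
	html := `<a href="/aB3dE9fG">leak</a> <a href="/aB3dE9fG">dup</a> <a href="/about">nav</a>`
	fmt.Println(extractPasteIDs(html)) // [aB3dE9fG]

	kws := map[string]string{"sk_live_": "stripe"}
	fmt.Println(matchKeywords("config: sk_live_abc123", kws)) // [stripe]
}
```

The real Sweep would wrap these with the limiter waits, sources.Client fetches, and Finding emission described above.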

Create `pkg/recon/sources/gistpaste.go`:
- Struct: `GistPasteSource` with BaseURL, Registry, Limiters, Client fields
- Name() "gistpaste", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): iterate BuildQueries(registry, "gistpaste"). For each keyword, GET `{BaseURL}/search?q={url.QueryEscape(keyword)}` (gist.github.com search). Parse the HTML for gist links matching `^/[^/]+/[a-f0-9]+$`. For each gist link, construct the raw URL `{BaseURL}{gistPath}/raw` and fetch the content (limit 256KB). Keyword-match and emit a Finding with SourceType="recon:gistpaste".

Tests: httptest servers serving HTML search results + raw paste content fixtures. Verify findings are emitted with the correct SourceType, Source URL, and ProviderName.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste" -v -count=1</automated>
</verify>
<done>PastebinSource and GistPasteSource compile, pass all tests, and handle ctx cancellation.</done>
</task>

<task type="auto" tdd="true">
<name>Task 2: PasteSitesSource (multi-paste aggregator)</name>
<files>pkg/recon/sources/pastesites.go, pkg/recon/sources/pastesites_test.go</files>
<behavior>
- PasteSitesSource.Name() == "pastesites"
- PasteSitesSource.RateLimit() == rate.Every(3*time.Second)
- PasteSitesSource.RespectsRobots() == true
- PasteSitesSource.Enabled() always true (all sites are credential-free)
- PasteSitesSource.Sweep() iterates across sub-platforms: dpaste.org, paste.ee, rentry.co, and hastebin.com (ix.io is excluded -- it has no search endpoint)
- Each sub-platform has: Name, SearchURL pattern, result link regex, and optional raw URL construction
- Sweep emits at least one Finding per platform when fixture data matches keywords
- ctx cancellation stops the sweep promptly
</behavior>
<action>
Create `pkg/recon/sources/pastesites.go` following the SandboxesSource multi-platform pattern from pkg/recon/sources/sandboxes.go:

- Define `pastePlatform` struct: Name string, SearchPath string (with %s for query), ResultLinkRegex string, RawPathTemplate string (optional, for fetching raw content), IsJSON bool
- Default platforms:
  1. dpaste: SearchPath="/search/?q=%s", result links matching `^/[A-Za-z0-9]+$`, raw via `/{id}/raw`
  2. paste.ee: SearchPath="/search?q=%s", result links matching `^/p/[A-Za-z0-9]+$`, raw via `/r/{id}`
  3. rentry.co: SearchPath="/search?q=%s", result links matching `^/[a-z0-9-]+$`, raw via `/{slug}/raw`
  4. hastebin: SearchPath="/search?q=%s", result links matching `^/[a-z]+$`, raw via `/raw/{id}`
  (ix.io is omitted from the list: it has no search endpoint.)

- Struct: `PasteSitesSource` with Platforms []pastePlatform, BaseURL string (test override), Registry, Limiters, Client fields
- Name() "pastesites", RateLimit() Every(3s), Burst() 1, RespectsRobots() true
- Enabled() always true
- Sweep(): for each platform, for each keyword from BuildQueries(registry, "pastesites"):
  1. Wait on the limiter
  2. GET `{platform base or BaseURL}{searchPath with keyword}`
  3. Parse the HTML and extract result links matching the platform regex
  4. For each result link: wait on the limiter, GET the raw content URL, read the body (256KB limit), and keyword-match against the registry
  5. Emit a Finding with Source=paste URL, SourceType="recon:pastesites", and ProviderName from the keyword match
- Default platforms populated in a `defaultPastePlatforms()` function. Tests override Platforms to use httptest URLs.
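Steps 3-4 of that loop reduce to regex filtering plus template substitution. A sketch with hypothetical helper names, using paste.ee-style link and raw patterns as the example:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// anchorRe pulls every href target out of search-result HTML.
var anchorRe = regexp.MustCompile(`href="([^"]+)"`)

// resultLinks keeps only anchors that match the platform's
// ResultLinkRegex, discarding navigation links.
func resultLinks(html, linkPattern string) []string {
	re := regexp.MustCompile(linkPattern)
	var links []string
	for _, m := range anchorRe.FindAllStringSubmatch(html, -1) {
		if re.MatchString(m[1]) {
			links = append(links, m[1])
		}
	}
	return links
}

func main() {
	html := `<a href="/p/Ab12Cd">hit</a> <a href="/login">nav</a>`
	links := resultLinks(html, `^/p/[A-Za-z0-9]+$`) // paste.ee-style pattern
	fmt.Println(links)                              // [/p/Ab12Cd]

	for _, l := range links {
		// Step 4: build the raw-content URL from the RawPathTemplate.
		id := strings.TrimPrefix(l, "/p/")
		fmt.Println("https://paste.ee" + fmt.Sprintf("/r/%s", id))
	}
}
```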

Tests: httptest mux serving search HTML + raw content for each sub-platform. Verify at least one Finding per platform fixture, and verify SourceType="recon:pastesites" on all.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPasteSites" -v -count=1</automated>
</verify>
<done>PasteSitesSource aggregates across multiple paste sites, keyword-matches content, and emits findings with the correct SourceType.</done>
</task>

</tasks>

<verification>
All paste sources compile and pass unit tests:
```bash
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestPastebin|TestGistPaste|TestPasteSites" -v -count=1
```
</verification>

<success_criteria>
- 3 new source files exist (pastebin.go, gistpaste.go, pastesites.go) with tests
- Each implements recon.ReconSource with a compile-time assertion
- PasteSitesSource covers 3+ paste sub-platforms
- Keyword matching uses the provider Registry to populate ProviderName
- All tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/11-osint_search_paste/11-02-SUMMARY.md`
</output>