diff --git a/.planning/phases/09-osint-infrastructure/09-CONTEXT.md b/.planning/phases/09-osint-infrastructure/09-CONTEXT.md
new file mode 100644
index 0000000..b20e463
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-CONTEXT.md
@@ -0,0 +1,120 @@
+# Phase 9: OSINT Infrastructure - Context
+
+**Gathered:** 2026-04-05
+**Status:** Ready for planning
+**Mode:** Auto-generated
+
+
+## Phase Boundary
+
+Foundation for all OSINT/recon work. Creates the `pkg/recon/` package with:
+- `ReconSource` interface every source module implements
+- Per-source `rate.Limiter` (golang.org/x/time/rate) — no central limiter
+- Stealth mode: user-agent rotation, jitter delays
+- robots.txt respect for web scrapers
+- Parallel sweep orchestrator that fans out to all enabled sources
+- Deduplication across sources before persisting
+
+Phases 10-16 implement individual sources on top of this foundation. Phase 9 ships with a single `ReconSource` stub (e.g., `ExampleSource`) to prove the framework end-to-end.
+
+
+
+
+## Implementation Decisions
+
+### ReconSource Interface
+```go
+type ReconSource interface {
+    Name() string                    // "shodan", "github", etc.
+    RateLimit() rate.Limit           // per-source token bucket rate
+    Burst() int                      // burst capacity
+    RespectsRobots() bool            // web scrapers: true; APIs: false
+    Enabled(cfg Config) bool         // enabled via config/env
+    Sweep(ctx context.Context, query string, out chan<- Finding) error
+}
+```
+
+### Package Layout
+```
+pkg/recon/
+  source.go — ReconSource interface
+  engine.go — Engine with Register/Sweep, parallel fanout, dedup
+  limiter.go — per-source rate.Limiter wrapper + jitter
+  stealth.go — user-agent pool, randomized delays
+  robots.go — robots.txt cache + allowed() check
+  dedup.go — cross-source finding dedup (hash by key+source_url)
+  example.go — ExampleSource stub implementation for testing
+  engine_test.go
+  stealth_test.go
+  robots_test.go
+```
+
+### Rate Limiting (RECON-INFRA-05)
+- Each source constructs `rate.NewLimiter(r.RateLimit(), r.Burst())` at registration
+- `Engine.Sweep` calls `limiter.Wait(ctx)` before each source request
+- Default rates per source configurable via config file
+
+### Stealth Mode (RECON-INFRA-06)
+- User-Agent pool: 10 realistic browser UAs in stealth.go (Chrome/Firefox/Safari on Linux/macOS/Windows)
+- Jitter: random delay between 100ms and 1s before each request
+- Activated via `--stealth` flag on `keyhunter recon` command
+- "source exhausted" log events when a source returns zero results or hits rate limit
+
+### robots.txt (RECON-INFRA-07)
+- `pkg/recon/robots.go`: simple parser, cached per-host for 1 hour
+- Only consulted when `RespectsRobots()` returns true (web scrapers)
+- `--respect-robots` flag default on, can be disabled with `--no-respect-robots`
+- Check User-Agent: "keyhunter" against disallowed paths
+
+### Parallel Sweep Orchestrator (RECON-INFRA-08)
+- `Engine.SweepAll(ctx, query, cfg)` fans out to all enabled sources via ants pool
+- Each source runs in its own goroutine, writes findings to shared chan
+- Deduplication: SHA256(provider + masked_key + source_url) — drop duplicates
+- Result aggregated and returned for persistence
+
+### CLI Command
+- `keyhunter recon full [--stealth] [--respect-robots] [--query=STRING]` — sweep all enabled sources
+- `keyhunter recon list` — list registered sources
+- Phase 9 ships with `ExampleSource` to demonstrate the pipeline; real sources come in later phases
+
+### Dependencies
+- `golang.org/x/time/rate` — already in go.mod
+- `github.com/temoto/robotstxt` — for robots.txt parsing (small, well-maintained)
+- No new deps beyond robotstxt
+
+
+
+
+## Existing Code Insights
+
+### Reusable Assets
+- pkg/engine/engine.go — ants pool pattern to mirror
+- pkg/engine/finding.go — Finding struct (can reuse or create recon-specific)
+- pkg/storage/findings.go — persistence
+- cmd/stubs.go — recon stub to replace with real tree
+
+### Finding vs ReconFinding
+- Decision: reuse existing `engine.Finding`, extend SourceType field with recon sources
+- SourceType values: "file", "git", "stdin", "url", "clipboard" (Phase 4), now add "recon:shodan", "recon:github", "recon:example"
+
+
+
+
+## Specific Ideas
+
+- ExampleSource should generate fake findings from a hardcoded list to prove the pipeline
+- Each source's `Name()` should match its YAML dork source name for easy cross-referencing
+- Dedup should preserve the first-seen source and metadata
+- Stealth UA pool: include rare-but-realistic UAs to avoid UA-fingerprint blocking
+
+
+
+
+## Deferred Ideas
+
+- Proxy/TOR support — can be added later via http.Transport config
+- Source-specific retry with exponential backoff — source modules handle their own retries
+- Distributed rate limiting across multiple keyhunter instances — out of scope
+- Webhook notifications on source exhaustion — defer to Phase 17 (Telegram)
+
+
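The dedup rule in this context file (SHA256 over provider, masked key, and source URL, keeping the first-seen source and metadata) could be sketched roughly as below. This is an illustration under stated assumptions, not the planned `dedup.go`: the `Finding` struct is a hypothetical stand-in for `engine.Finding`, whose real fields are not shown here, and the field names, the NUL separator, and the helpers `dedupKey` and `Dedup` are all invented for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Finding is a hypothetical stand-in for engine.Finding; the real struct's
// fields are not shown in the context file.
type Finding struct {
	Provider   string
	MaskedKey  string
	SourceURL  string
	SourceType string // e.g. "recon:example"
}

// dedupKey hashes provider, masked key, and source URL. The NUL separator is
// an assumption; it keeps ("ab", "c") and ("a", "bc") from hashing alike.
func dedupKey(f Finding) string {
	sum := sha256.Sum256([]byte(f.Provider + "\x00" + f.MaskedKey + "\x00" + f.SourceURL))
	return hex.EncodeToString(sum[:])
}

// Dedup returns findings in input order, keeping only the first one seen for
// each key so the first-seen source and metadata are preserved.
func Dedup(in []Finding) []Finding {
	seen := make(map[string]struct{}, len(in))
	out := make([]Finding, 0, len(in))
	for _, f := range in {
		k := dedupKey(f)
		if _, dup := seen[k]; dup {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	findings := []Finding{
		{"openai", "sk-a...1234", "https://example.com/a", "recon:example"},
		{"openai", "sk-a...1234", "https://example.com/a", "recon:github"},
		{"openai", "sk-a...1234", "https://example.com/b", "recon:example"},
	}
	for _, f := range Dedup(findings) {
		fmt.Println(f.SourceType, f.SourceURL)
	}
}
```

In this sketch the second finding is dropped because it shares provider, masked key, and URL with the first, even though a different source reported it. The third survives because its URL differs. In a streaming engine the same `seen` map could sit behind the shared findings channel instead of operating on a slice.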