# Phase 9: OSINT Infrastructure - Context

Gathered: 2026-04-05
Status: Ready for planning
Mode: Auto-generated
## Phase Boundary

Foundation for all OSINT/recon work. Creates the `pkg/recon/` package with:

- `ReconSource` interface every source module implements
- Per-source `rate.Limiter` (`golang.org/x/time/rate`), no central limiter
- Stealth mode: user-agent rotation, jitter delays
- robots.txt respect for web scrapers
- Parallel sweep orchestrator that fans out to all enabled sources
- Deduplication across sources before persisting

Phases 10-16 implement individual sources on top of this foundation. Phase 9 ships with a single `ReconSource` stub (e.g., `ExampleSource`) to prove the framework end-to-end.
## ReconSource Interface

```go
type ReconSource interface {
	Name() string             // "shodan", "github", etc.
	RateLimit() rate.Limit    // per-source token bucket rate
	Burst() int               // burst capacity
	RespectsRobots() bool     // web scrapers: true; APIs: false
	Enabled(cfg Config) bool  // enabled via config/env
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```
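A minimal sketch of how the `ExampleSource` stub might satisfy this interface. To keep it standard-library only, `Limit` stands in for `rate.Limit`, and `Finding`, `Config`, the fake findings, and the `collect` helper are all hypothetical placeholders, not the project's real types.

```go
package main

import (
	"context"
	"fmt"
)

// Limit stands in for rate.Limit (events per second) in this sketch.
type Limit float64

// Finding and Config are pared-down, hypothetical stand-ins.
type Finding struct {
	Provider, MaskedKey, SourceURL, SourceType string
}

type Config struct{ Enabled map[string]bool }

// ReconSource mirrors the interface above, with Limit replacing rate.Limit.
type ReconSource interface {
	Name() string
	RateLimit() Limit
	Burst() int
	RespectsRobots() bool
	Enabled(cfg Config) bool
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}

// ExampleSource emits a fixed list of fake findings.
type ExampleSource struct{}

func (ExampleSource) Name() string            { return "example" }
func (ExampleSource) RateLimit() Limit        { return 5 } // assumed value
func (ExampleSource) Burst() int              { return 1 } // assumed value
func (ExampleSource) RespectsRobots() bool    { return false }
func (ExampleSource) Enabled(cfg Config) bool { return cfg.Enabled["example"] }

func (s ExampleSource) Sweep(ctx context.Context, query string, out chan<- Finding) error {
	fakes := []Finding{
		{Provider: "openai", MaskedKey: "sk-...abcd", SourceURL: "https://example.invalid/a"},
		{Provider: "github", MaskedKey: "ghp_...wxyz", SourceURL: "https://example.invalid/b"},
	}
	for _, f := range fakes {
		f.SourceType = "recon:" + s.Name()
		select {
		case out <- f:
		case <-ctx.Done(): // stop early if the sweep is cancelled
			return ctx.Err()
		}
	}
	return nil
}

// Compile-time check that ExampleSource implements ReconSource.
var _ ReconSource = ExampleSource{}

// collect runs one source to completion and gathers its findings.
func collect(src ReconSource, query string) ([]Finding, error) {
	out := make(chan Finding)
	errc := make(chan error, 1)
	go func() {
		errc <- src.Sweep(context.Background(), query, out)
		close(out)
	}()
	var got []Finding
	for f := range out {
		got = append(got, f)
	}
	return got, <-errc
}

func main() {
	got, err := collect(ExampleSource{}, "test")
	if err != nil {
		fmt.Println("sweep failed:", err)
		return
	}
	for _, f := range got {
		fmt.Println(f.SourceType, f.Provider, f.MaskedKey)
	}
}
```

The `select` on `ctx.Done()` keeps a source from blocking forever if the consumer stops reading.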
## Package Layout

```
pkg/recon/
  source.go        # ReconSource interface
  engine.go        # Engine with Register/Sweep, parallel fanout, dedup
  limiter.go       # per-source rate.Limiter wrapper + jitter
  stealth.go       # user-agent pool, randomized delays
  robots.go        # robots.txt cache + allowed() check
  dedup.go         # cross-source finding dedup (hash by key+source_url)
  example.go       # ExampleSource stub implementation for testing
  engine_test.go
  stealth_test.go
  robots_test.go
```
## Rate Limiting (RECON-INFRA-05)

- Each source constructs `rate.NewLimiter(r.RateLimit(), r.Burst())` at registration
- `Engine.Sweep` calls `limiter.Wait(ctx)` before each source request
- Default rates per source configurable via config file
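One way the per-source wiring could look, sketched with the standard library only. The `waiter` interface, `limiterSet`, and `countingLimiter` are hypothetical names. `*rate.Limiter` from `golang.org/x/time/rate` has a matching `Wait(ctx)` method, so in the real package the registered value would presumably be `rate.NewLimiter(src.RateLimit(), src.Burst())` rather than the test double used here.

```go
package main

import (
	"context"
	"fmt"
)

// waiter is the one method the engine needs from a limiter.
// *rate.Limiter (golang.org/x/time/rate) provides this method.
type waiter interface {
	Wait(ctx context.Context) error
}

// countingLimiter is a test double: it never blocks, it only counts calls.
type countingLimiter struct{ calls int }

func (c *countingLimiter) Wait(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	c.calls++
	return nil
}

// limiterSet holds one limiter per registered source; there is no shared limiter.
type limiterSet struct {
	bySource map[string]waiter
}

// register stores the limiter built for a source at registration time.
func (s *limiterSet) register(name string, lim waiter) {
	if s.bySource == nil {
		s.bySource = make(map[string]waiter)
	}
	s.bySource[name] = lim
}

// do waits on the named source's limiter, then runs the request.
func (s *limiterSet) do(ctx context.Context, name string, request func() error) error {
	lim, ok := s.bySource[name]
	if !ok {
		return fmt.Errorf("source %q is not registered", name)
	}
	if err := lim.Wait(ctx); err != nil {
		return err
	}
	return request()
}

// demo registers two sources and issues three gated requests.
func demo() (shodanWaits, githubWaits int, err error) {
	shodan, github := &countingLimiter{}, &countingLimiter{}
	set := &limiterSet{}
	set.register("shodan", shodan)
	set.register("github", github)
	noop := func() error { return nil }
	for _, name := range []string{"shodan", "shodan", "github"} {
		if err := set.do(context.Background(), name, noop); err != nil {
			return 0, 0, err
		}
	}
	return shodan.calls, github.calls, nil
}

func main() {
	s, g, err := demo()
	fmt.Println("shodan waits:", s, "github waits:", g, "err:", err)
}
```

Depending on the small `waiter` interface rather than the concrete limiter type keeps engine tests fast, since a double can replace real token-bucket delays.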
## Stealth Mode (RECON-INFRA-06)

- User-Agent pool: 10 realistic browser UAs in `stealth.go` (Chrome/Firefox/Safari on Linux/macOS/Windows)
- Jitter: random delay between 100ms and 1s before each request
- Activated via `--stealth` flag on `keyhunter recon` command
- "source exhausted" log events when a source returns zero results or hits rate limit
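A rough sketch of the User-Agent rotation and jitter described above. The three User-Agent strings are illustrative placeholders (the plan calls for 10), and `randomUA`, `jitter`, `newStealthRequest`, and `sampleOK` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

const (
	minJitter = 100 * time.Millisecond
	maxJitter = 1 * time.Second
)

// userAgents is a shortened, illustrative pool.
var userAgents = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
	"Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
}

// randomUA picks a User-Agent uniformly at random from the pool.
func randomUA(r *rand.Rand) string {
	return userAgents[r.Intn(len(userAgents))]
}

// jitter returns a random delay in [minJitter, maxJitter].
func jitter(r *rand.Rand) time.Duration {
	span := int64(maxJitter - minJitter)
	return minJitter + time.Duration(r.Int63n(span+1))
}

// newStealthRequest builds a GET request carrying a rotated User-Agent.
// It does not send the request.
func newStealthRequest(r *rand.Rand, url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", randomUA(r))
	return req, nil
}

// inPool reports whether ua is one of the pooled User-Agents.
func inPool(ua string) bool {
	for _, candidate := range userAgents {
		if candidate == ua {
			return true
		}
	}
	return false
}

// sampleOK draws n User-Agents and delays and checks they stay in bounds.
func sampleOK(seed int64, n int) bool {
	r := rand.New(rand.NewSource(seed))
	for i := 0; i < n; i++ {
		if !inPool(randomUA(r)) {
			return false
		}
		if d := jitter(r); d < minJitter || d > maxJitter {
			return false
		}
	}
	return true
}

func main() {
	// A *rand.Rand is not safe for concurrent use; use one per goroutine.
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	req, err := newStealthRequest(r, "https://example.invalid/search")
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println("User-Agent:", req.Header.Get("User-Agent"))
	fmt.Println("delay before request:", jitter(r))
}
```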
## robots.txt (RECON-INFRA-07)

- `pkg/recon/robots.go`: simple parser, cached per-host for 1 hour
- Only consulted when `RespectsRobots()` returns true (web scrapers)
- `--respect-robots` flag default on, can be disabled with `--no-respect-robots`
- Check User-Agent: "keyhunter" against disallowed paths
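The per-host cache with a 1-hour lifetime might be sketched as below. Fetching and parsing are hidden behind a hypothetical `fetch` callback that returns the Disallow prefixes applying to the "keyhunter" user agent; the real parsing would go through the robots.txt library named under Dependencies. The prefix match is a deliberate simplification of robots.txt matching rules.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// rules is a simplified stand-in for parsed robots.txt data.
type rules struct {
	disallow []string  // path prefixes disallowed for "keyhunter"
	fetched  time.Time // when this entry was loaded
}

// robotsCache keeps one rules entry per host for ttl.
type robotsCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time                    // injectable clock for tests
	fetch func(host string) ([]string, error) // hypothetical download + parse step
	hosts map[string]rules
}

func newRobotsCache(fetch func(string) ([]string, error)) *robotsCache {
	return &robotsCache{ttl: time.Hour, now: time.Now, fetch: fetch, hosts: make(map[string]rules)}
}

// allowed reports whether path on host may be requested.
// Holding the lock during fetch serialises lookups; acceptable for a sketch.
func (c *robotsCache) allowed(host, path string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	r, ok := c.hosts[host]
	if !ok || c.now().Sub(r.fetched) >= c.ttl {
		dis, err := c.fetch(host)
		if err != nil {
			return false, err
		}
		r = rules{disallow: dis, fetched: c.now()}
		c.hosts[host] = r
	}
	for _, prefix := range r.disallow {
		// An empty Disallow value means nothing is disallowed.
		if prefix != "" && strings.HasPrefix(path, prefix) {
			return false, nil
		}
	}
	return true, nil
}

// demo checks one allowed and one disallowed path on the same host.
func demo() (allowedPublic, allowedPrivate bool, fetches int) {
	cache := newRobotsCache(func(string) ([]string, error) {
		fetches++
		return []string{"/private/"}, nil
	})
	allowedPublic, _ = cache.allowed("example.invalid", "/public/page")
	allowedPrivate, _ = cache.allowed("example.invalid", "/private/keys")
	return allowedPublic, allowedPrivate, fetches
}

// demoExpiry advances a fake clock past the TTL and counts fetches.
func demoExpiry() (fetches int) {
	clock := time.Unix(0, 0)
	cache := newRobotsCache(func(string) ([]string, error) {
		fetches++
		return nil, nil
	})
	cache.now = func() time.Time { return clock }
	cache.allowed("example.invalid", "/")
	clock = clock.Add(30 * time.Minute)
	cache.allowed("example.invalid", "/") // still cached
	clock = clock.Add(time.Hour)
	cache.allowed("example.invalid", "/") // entry is 90 minutes old, refetched
	return fetches
}

func main() {
	pub, priv, n := demo()
	fmt.Println("public allowed:", pub, "private allowed:", priv, "fetches:", n)
	fmt.Println("fetches across expiry:", demoExpiry())
}
```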
## Parallel Sweep Orchestrator (RECON-INFRA-08)

- `Engine.SweepAll(ctx, query, cfg)` fans out to all enabled sources via ants pool
- Each source runs in its own goroutine, writes findings to shared chan
- Deduplication: SHA256(provider + masked_key + source_url), drop duplicates
- Result aggregated and returned for persistence
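The fan-out and dedup steps above could be sketched as follows. This version uses plain goroutines with a `sync.WaitGroup` instead of the ants pool, to stay standard-library only. `sweeper`, `fixedSource`, `dedupKey`, and `sweepAll` are hypothetical names, and the NUL separator between hashed fields is an added assumption so adjacent fields cannot run together.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Finding is a pared-down, hypothetical stand-in for engine.Finding.
type Finding struct {
	Provider, MaskedKey, SourceURL, SourceType string
}

// sweeper is the part of ReconSource this sketch needs.
type sweeper interface {
	Name() string
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}

// dedupKey hashes provider, masked key and source URL.
func dedupKey(f Finding) string {
	sum := sha256.Sum256([]byte(f.Provider + "\x00" + f.MaskedKey + "\x00" + f.SourceURL))
	return hex.EncodeToString(sum[:])
}

// sweepAll runs every source in its own goroutine and returns unique findings.
func sweepAll(ctx context.Context, query string, sources []sweeper) []Finding {
	out := make(chan Finding)
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(s sweeper) {
			defer wg.Done()
			if err := s.Sweep(ctx, query, out); err != nil {
				fmt.Printf("source %s failed: %v\n", s.Name(), err)
			}
		}(src)
	}
	// Close the shared channel once every source has returned.
	go func() {
		wg.Wait()
		close(out)
	}()

	seen := make(map[string]bool)
	var unique []Finding
	for f := range out {
		k := dedupKey(f)
		if seen[k] {
			continue // the first-seen finding and its metadata are kept
		}
		seen[k] = true
		unique = append(unique, f)
	}
	return unique
}

// fixedSource is a test double that emits a fixed list of findings.
type fixedSource struct {
	name     string
	findings []Finding
}

func (s fixedSource) Name() string { return s.name }

func (s fixedSource) Sweep(ctx context.Context, _ string, out chan<- Finding) error {
	for _, f := range s.findings {
		select {
		case out <- f:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// demo sweeps two sources that share one finding and returns the unique count.
func demo() int {
	dup := Finding{Provider: "openai", MaskedKey: "sk-...abcd", SourceURL: "https://example.invalid/a"}
	other := Finding{Provider: "openai", MaskedKey: "sk-...abcd", SourceURL: "https://example.invalid/b"}
	sources := []sweeper{
		fixedSource{name: "one", findings: []Finding{dup, other}},
		fixedSource{name: "two", findings: []Finding{dup}},
	}
	return len(sweepAll(context.Background(), "q", sources))
}

func main() {
	fmt.Println("unique findings:", demo())
}
```

Because the source URL is part of the hash, the same masked key found at two different URLs is kept as two findings.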
## CLI Command

- `keyhunter recon full [--stealth] [--respect-robots] [--query=STRING]`: sweep all enabled sources
- `keyhunter recon list`: list registered sources
- Phase 9 ships with `ExampleSource` to demonstrate the pipeline; real sources come in later phases
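For illustration only, the flag handling for `recon full` could be parsed as below using the standard library `flag` package. The project's actual CLI framework is not stated here, so this is an assumption, and `fullOptions` and `parseFull` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// fullOptions holds the parsed flags for "keyhunter recon full".
type fullOptions struct {
	Stealth       bool
	RespectRobots bool
	Query         string
}

// parseFull parses the arguments that follow "keyhunter recon full".
func parseFull(args []string) (fullOptions, error) {
	var opts fullOptions
	var noRobots bool
	fs := flag.NewFlagSet("recon full", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the sketch's output
	fs.BoolVar(&opts.Stealth, "stealth", false, "rotate user agents and add jitter")
	fs.BoolVar(&opts.RespectRobots, "respect-robots", true, "honor robots.txt for web scrapers")
	fs.BoolVar(&noRobots, "no-respect-robots", false, "ignore robots.txt")
	fs.StringVar(&opts.Query, "query", "", "query passed to each enabled source")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	if noRobots {
		opts.RespectRobots = false // the explicit opt-out wins
	}
	return opts, nil
}

func main() {
	opts, err := parseFull([]string{"--stealth", "--query=sk-"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

The `flag` package treats one and two leading dashes as equivalent, so `--stealth` and `-stealth` both parse.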
## Dependencies

- `golang.org/x/time/rate`: already in go.mod
- `github.com/temoto/robotstxt`: for robots.txt parsing (small, well-maintained)
- No new deps beyond robotstxt
<code_context>
## Existing Code Insights

### Reusable Assets

- pkg/engine/engine.go — ants pool pattern to mirror
- pkg/engine/finding.go — Finding struct (can reuse or create recon-specific)
- pkg/storage/findings.go — persistence
- cmd/stubs.go — recon stub to replace with real tree

### Finding vs ReconFinding

- Decision: reuse existing `engine.Finding`, extend SourceType field with recon sources
- SourceType values: "file", "git", "stdin", "url", "clipboard" (Phase 4), now add "recon:shodan", "recon:github", "recon:example"
</code_context>
## Specific Ideas

- ExampleSource should generate fake findings from a hardcoded list to prove the pipeline
- Each source's `Name()` should match its YAML dork source name for easy cross-referencing
- Dedup should preserve the first-seen source and metadata
- Stealth UA pool: include rare-but-realistic UAs to avoid UA-fingerprint blocking

## Deferred Ideas

- Proxy/TOR support: can be added later via http.Transport config
- Source-specific retry with exponential backoff: source modules handle their own retries
- Distributed rate limiting across multiple keyhunter instances: out of scope
- Webhook notifications on source exhaustion: defer to Phase 17 (Telegram)