keyhunter/.planning/phases/09-osint-infrastructure/09-CONTEXT.md
2026-04-06 00:33:44 +03:00

# Phase 9: OSINT Infrastructure - Context

Gathered: 2026-04-05
Status: Ready for planning
Mode: Auto-generated

## Phase Boundary

Foundation for all OSINT/recon work. Creates the pkg/recon/ package with:

  • ReconSource interface every source module implements
  • Per-source rate.Limiter (golang.org/x/time/rate) — no central limiter
  • Stealth mode: user-agent rotation, jitter delays
  • robots.txt respect for web scrapers
  • Parallel sweep orchestrator that fans out to all enabled sources
  • Deduplication across sources before persisting

Phases 10-16 implement individual sources on top of this foundation. Phase 9 ships with a single ReconSource stub (e.g., ExampleSource) to prove the framework end-to-end.

## Implementation Decisions

### ReconSource Interface

```go
package recon

import (
	"context"

	"golang.org/x/time/rate"
)

type ReconSource interface {
	Name() string            // "shodan", "github", etc.
	RateLimit() rate.Limit   // per-source token bucket rate
	Burst() int              // burst capacity
	RespectsRobots() bool    // web scrapers: true; APIs: false
	Enabled(cfg Config) bool // enabled via config/env
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```

### Package Layout

```text
pkg/recon/
  source.go           # ReconSource interface
  engine.go           # Engine with Register/Sweep, parallel fanout, dedup
  limiter.go          # per-source rate.Limiter wrapper + jitter
  stealth.go          # user-agent pool, randomized delays
  robots.go           # robots.txt cache + allowed() check
  dedup.go            # cross-source finding dedup (SHA256 of provider + masked_key + source_url)
  example.go          # ExampleSource stub implementation for testing
  engine_test.go
  stealth_test.go
  robots_test.go
```

### Rate Limiting (RECON-INFRA-05)

  • Each source constructs rate.NewLimiter(r.RateLimit(), r.Burst()) at registration
  • Engine.Sweep calls limiter.Wait(ctx) before each source request
  • Default rates per source configurable via config file

### Stealth Mode (RECON-INFRA-06)

  • User-Agent pool: 10 realistic browser UAs in stealth.go (Chrome/Firefox/Safari on Linux/macOS/Windows)
  • Jitter: random delay between 100ms and 1s before each request
  • Activated via --stealth flag on keyhunter recon command
  • Emit "source exhausted" log events when a source returns zero results or hits its rate limit

### robots.txt (RECON-INFRA-07)

  • pkg/recon/robots.go: simple parser, cached per-host for 1 hour
  • Only consulted when RespectsRobots() returns true (web scrapers)
  • --respect-robots defaults to on; disable it with --no-respect-robots
  • Evaluate disallowed paths against the "keyhunter" User-Agent

### Parallel Sweep Orchestrator (RECON-INFRA-08)

  • Engine.SweepAll(ctx, query, cfg) fans out to all enabled sources via ants pool
  • Each source runs in its own goroutine, writes findings to shared chan
  • Deduplication: SHA256(provider + masked_key + source_url) — drop duplicates
  • Results are aggregated and returned for persistence

### CLI Command

  • keyhunter recon full [--stealth] [--respect-robots] [--query=STRING] — sweep all enabled sources
  • keyhunter recon list — list registered sources
  • Phase 9 ships with ExampleSource to demonstrate the pipeline; real sources come in later phases
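
For illustration only, the flag semantics of `recon full` can be expressed with the standard library `flag` package. The project's actual CLI framework is not stated in this document, so `reconOptions` and `parseReconFull` are hypothetical, and letting --no-respect-robots win when both flags are present is an assumption.

```go
package main

import (
	"flag"
	"fmt"
)

type reconOptions struct {
	stealth       bool
	respectRobots bool
	query         string
}

// parseReconFull parses the arguments that would follow "keyhunter recon full".
func parseReconFull(args []string) (reconOptions, error) {
	var o reconOptions
	var noRobots bool

	fs := flag.NewFlagSet("recon full", flag.ContinueOnError)
	fs.BoolVar(&o.stealth, "stealth", false, "rotate user agents and add jitter")
	fs.BoolVar(&o.respectRobots, "respect-robots", true, "honor robots.txt for web scrapers")
	fs.BoolVar(&noRobots, "no-respect-robots", false, "disable robots.txt checks")
	fs.StringVar(&o.query, "query", "", "query passed to every enabled source")

	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// Assumption: the negative flag takes precedence if both are given.
	if noRobots {
		o.respectRobots = false
	}
	return o, nil
}

func main() {
	o, err := parseReconFull([]string{"--stealth", "--query=example"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("stealth=%v respectRobots=%v query=%q\n", o.stealth, o.respectRobots, o.query)
}
```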

### Dependencies

  • golang.org/x/time/rate — already in go.mod
  • github.com/temoto/robotstxt — for robots.txt parsing (small, well-maintained)
  • No new deps beyond robotstxt

<code_context>

## Existing Code Insights

### Reusable Assets

  • pkg/engine/engine.go — ants pool pattern to mirror
  • pkg/engine/finding.go — Finding struct (can reuse or create recon-specific)
  • pkg/storage/findings.go — persistence
  • cmd/stubs.go — recon stub to replace with real tree

### Finding vs ReconFinding

  • Decision: reuse existing engine.Finding, extend SourceType field with recon sources
  • Existing SourceType values (Phase 4): "file", "git", "stdin", "url", "clipboard"; this phase adds "recon:shodan", "recon:github", "recon:example"

</code_context>

## Specific Ideas

  • ExampleSource should generate fake findings from a hardcoded list to prove the pipeline
  • Each source's Name() should match its YAML dork source name for easy cross-referencing
  • Dedup should preserve the first-seen source and metadata
  • Stealth UA pool: include rare-but-realistic UAs to avoid UA-fingerprint blocking
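
The first-seen rule, combined with the SHA256(provider + masked_key + source_url) key from the orchestrator decisions, could be sketched like this. The `Finding` struct is a simplified placeholder, and the NUL separator between fields is an assumption added here so that field boundaries cannot blur together.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Finding is a simplified placeholder for the existing engine.Finding.
type Finding struct {
	Provider, MaskedKey, SourceURL, SourceType string
}

// dedupKey hashes provider, masked key and source URL into a hex string.
func dedupKey(f Finding) string {
	sum := sha256.Sum256([]byte(f.Provider + "\x00" + f.MaskedKey + "\x00" + f.SourceURL))
	return hex.EncodeToString(sum[:])
}

// dedup keeps the first finding for each key, so the first-seen source and
// metadata survive and input order is preserved.
func dedup(in []Finding) []Finding {
	seen := make(map[string]struct{}, len(in))
	out := make([]Finding, 0, len(in))
	for _, f := range in {
		k := dedupKey(f)
		if _, dup := seen[k]; dup {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	findings := []Finding{
		{"provider-a", "aaaa****", "https://example.invalid/x", "recon:example"},
		{"provider-a", "aaaa****", "https://example.invalid/x", "recon:github"},
		{"provider-a", "aaaa****", "https://example.invalid/y", "recon:github"},
	}
	for _, f := range dedup(findings) {
		fmt.Println(f.SourceType, f.SourceURL)
	}
}
```

Note that the source URL is part of the key, so the same key found at two different URLs is kept as two findings.
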

## Deferred Ideas

  • Proxy/TOR support — can be added later via http.Transport config
  • Source-specific retry with exponential backoff — source modules handle their own retries
  • Distributed rate limiting across multiple keyhunter instances — out of scope
  • Webhook notifications on source exhaustion — defer to Phase 17 (Telegram)