docs(09): OSINT infrastructure context
.planning/phases/09-osint-infrastructure/09-CONTEXT.md
@@ -0,0 +1,120 @@
# Phase 9: OSINT Infrastructure - Context

**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated

<domain>
## Phase Boundary

Foundation for all OSINT/recon work. Creates the `pkg/recon/` package with:
- `ReconSource` interface that every source module implements
- Per-source `rate.Limiter` (`golang.org/x/time/rate`); no central limiter
- Stealth mode: user-agent rotation, jitter delays
- `robots.txt` respect for web scrapers
- Parallel sweep orchestrator that fans out to all enabled sources
- Deduplication across sources before persisting

Phases 10-16 implement individual sources on top of this foundation. Phase 9 ships with a single `ReconSource` stub (e.g., `ExampleSource`) to prove the framework end-to-end.

</domain>

<decisions>
## Implementation Decisions

### ReconSource Interface
```go
type ReconSource interface {
	Name() string            // "shodan", "github", etc.
	RateLimit() rate.Limit   // per-source token bucket rate
	Burst() int              // burst capacity
	RespectsRobots() bool    // web scrapers: true; APIs: false
	Enabled(cfg Config) bool // enabled via config/env
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```

### Package Layout
```
pkg/recon/
  source.go        - ReconSource interface
  engine.go        - Engine with Register/Sweep, parallel fanout, dedup
  limiter.go       - per-source rate.Limiter wrapper + jitter
  stealth.go       - user-agent pool, randomized delays
  robots.go        - robots.txt cache + allowed() check
  dedup.go         - cross-source finding dedup (hash of provider + masked_key + source_url)
  example.go       - ExampleSource stub implementation for testing
  engine_test.go
  stealth_test.go
  robots_test.go
```

### Rate Limiting (RECON-INFRA-05)
- Each source constructs `rate.NewLimiter(r.RateLimit(), r.Burst())` at registration
- `Engine.Sweep` calls `limiter.Wait(ctx)` before each source request
- Default rates per source configurable via config file

### Stealth Mode (RECON-INFRA-06)
- User-Agent pool: 10 realistic browser UAs in `stealth.go` (Chrome/Firefox/Safari on Linux/macOS/Windows)
- Jitter: random delay between 100ms and 1s before each request
- Activated via `--stealth` flag on `keyhunter recon` command
- "source exhausted" log events when a source returns zero results or hits rate limit

### robots.txt (RECON-INFRA-07)
- `pkg/recon/robots.go`: simple parser, cached per-host for 1 hour
- Only consulted when `RespectsRobots()` returns true (web scrapers)
- `--respect-robots` flag is on by default; disable with `--no-respect-robots`
- Check User-Agent: "keyhunter" against disallowed paths

### Parallel Sweep Orchestrator (RECON-INFRA-08)
- `Engine.SweepAll(ctx, query, cfg)` fans out to all enabled sources via ants pool
- Each source runs in its own goroutine, writes findings to shared chan
- Deduplication: key each finding by SHA256(provider + masked_key + source_url) and drop duplicates
- Result aggregated and returned for persistence

### CLI Command
- `keyhunter recon full [--stealth] [--respect-robots] [--query=STRING]`: sweep all enabled sources
- `keyhunter recon list`: list registered sources
- Phase 9 ships with `ExampleSource` to demonstrate the pipeline; real sources come in later phases

### Dependencies
- `golang.org/x/time/rate`: already in go.mod
- `github.com/temoto/robotstxt`: for robots.txt parsing (small, well-maintained)
- No new deps beyond robotstxt

</decisions>

<code_context>
## Existing Code Insights

### Reusable Assets
- `pkg/engine/engine.go`: ants pool pattern to mirror
- `pkg/engine/finding.go`: Finding struct (can reuse or create recon-specific)
- `pkg/storage/findings.go`: persistence
- `cmd/stubs.go`: recon stub to replace with real tree

### Finding vs ReconFinding
- Decision: reuse existing `engine.Finding`, extend SourceType field with recon sources
- SourceType values: "file", "git", "stdin", "url", "clipboard" (Phase 4), now add "recon:shodan", "recon:github", "recon:example"
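
One way to keep the `recon:` prefix convention in a single place is a pair of small helpers, sketched here. The helper names and the `reconPrefix` constant are hypothetical; only the `recon:<name>` format comes from the decision above.

```go
package main

import "strings"

// reconPrefix marks a SourceType as coming from a recon source.
const reconPrefix = "recon:"

// reconSourceType builds the SourceType value for a source name,
// e.g. "shodan" becomes "recon:shodan".
func reconSourceType(name string) string {
	return reconPrefix + name
}

// reconSourceName extracts the source name from a SourceType.
// The boolean is false for non-recon values such as "file" or "git".
func reconSourceName(sourceType string) (string, bool) {
	if !strings.HasPrefix(sourceType, reconPrefix) {
		return "", false
	}
	return strings.TrimPrefix(sourceType, reconPrefix), true
}
```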

</code_context>

<specifics>
## Specific Ideas

- ExampleSource should generate fake findings from a hardcoded list to prove the pipeline
- Each source's `Name()` should match its YAML dork source name for easy cross-referencing
- Dedup should preserve the first-seen source and metadata
- Stealth UA pool: include rare-but-realistic UAs to avoid UA-fingerprint blocking

</specifics>

<deferred>
## Deferred Ideas

- Proxy/TOR support: can be added later via http.Transport config
- Source-specific retry with exponential backoff: source modules handle their own retries
- Distributed rate limiting across multiple keyhunter instances: out of scope
- Webhook notifications on source exhaustion: defer to Phase 17 (Telegram)

</deferred>