diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 4417a78..e8c7cf1 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -197,7 +197,13 @@ Plans:
 2. `keyhunter recon full --stealth` applies user-agent rotation and jitter delays to all sources; log output shows "source exhausted" events rather than silently returning empty results
 3. `keyhunter recon full --respect-robots` (default on) respects robots.txt for web-scraping sources before making any requests
 4. `keyhunter recon full` fans out to all enabled sources in parallel and deduplicates findings before persisting to the database
-**Plans**: TBD
+**Plans**: 6 plans
+- [ ] 09-01-PLAN.md — ReconSource interface + Engine skeleton + ExampleSource stub
+- [ ] 09-02-PLAN.md — LimiterRegistry per-source rate.Limiter + jitter
+- [ ] 09-03-PLAN.md — Stealth UA pool + cross-source dedup
+- [ ] 09-04-PLAN.md — robots.txt parser with 1h per-host cache
+- [ ] 09-05-PLAN.md — cmd/recon.go CLI tree (full, list)
+- [ ] 09-06-PLAN.md — Integration test + phase summary
 
 ### Phase 10: OSINT Code Hosting
 **Goal**: Users can scan 10 code hosting platforms — GitHub, GitLab, Bitbucket, GitHub Gist, Codeberg/Gitea, Replit, CodeSandbox, HuggingFace, Kaggle, and miscellaneous code sandbox sites — for leaked LLM API keys
diff --git a/.planning/phases/09-osint-infrastructure/09-01-PLAN.md b/.planning/phases/09-osint-infrastructure/09-01-PLAN.md
new file mode 100644
index 0000000..74f53a2
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-01-PLAN.md
@@ -0,0 +1,304 @@
+---
+phase: 09-osint-infrastructure
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - pkg/recon/source.go
+  - pkg/recon/engine.go
+  - pkg/recon/example.go
+  - pkg/recon/engine_test.go
+autonomous: true
+requirements: [RECON-INFRA-08]
+must_haves:
+  truths:
+    - "pkg/recon package compiles with a ReconSource interface"
+    - "Engine.Register adds a source; Engine.List returns registered names"
+    - "Engine.SweepAll fans out to all enabled sources via ants pool and returns aggregated Findings"
+    - "ExampleSource implements ReconSource end-to-end and emits a deterministic fake Finding"
+  artifacts:
+    - path: "pkg/recon/source.go"
+      provides: "ReconSource interface + Finding type alias + Config struct"
+      contains: "type ReconSource interface"
+    - path: "pkg/recon/engine.go"
+      provides: "Engine with Register, List, SweepAll (parallel fanout via ants)"
+      contains: "func (e *Engine) SweepAll"
+    - path: "pkg/recon/example.go"
+      provides: "ExampleSource stub that emits hardcoded findings"
+      contains: "type ExampleSource"
+    - path: "pkg/recon/engine_test.go"
+      provides: "Tests for Register/List/SweepAll with ExampleSource"
+      contains: "func TestSweepAll"
+  key_links:
+    - from: "pkg/recon/engine.go"
+      to: "github.com/panjf2000/ants/v2"
+      via: "parallel source fanout"
+      pattern: "ants\\.NewPool"
+    - from: "pkg/recon/engine.go"
+      to: "pkg/engine.Finding"
+      via: "aliased as recon.Finding for SourceType=\"recon:*\""
+      pattern: "engine\\.Finding"
+---
+
+Create the pkg/recon/ package foundation: ReconSource interface, Engine orchestrator with parallel fanout via ants pool, and an ExampleSource stub that proves the pipeline end-to-end. This is the contract that all later sources (Phases 10-16) will implement.
+
+Purpose: Establish the interface + engine skeleton so subsequent Wave 1 plans (limiter, stealth, robots) can land in parallel without conflict, and Wave 2 can wire the CLI.
+Output: pkg/recon/source.go, pkg/recon/engine.go, pkg/recon/example.go, pkg/recon/engine_test.go
+
+@$HOME/.claude/get-shit-done/workflows/execute-plan.md
+@$HOME/.claude/get-shit-done/templates/summary.md
+
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
+@pkg/engine/engine.go
+@pkg/engine/finding.go
+
+From pkg/engine/finding.go:
+
+```go
+type Finding struct {
+    ProviderName   string
+    KeyValue       string
+    KeyMasked      string
+    Confidence     string
+    Source         string
+    SourceType     string // existing: "file","git","stdin","url","clipboard". New: "recon:"
+    LineNumber     int
+    Offset         int64
+    DetectedAt     time.Time
+    Verified       bool
+    VerifyStatus   string
+    VerifyHTTPCode int
+    VerifyMetadata map[string]string
+    VerifyError    string
+}
+```
+
+ants pool pattern (pkg/engine/engine.go): `ants.NewPool(workers)`, `pool.Submit(func(){...})`, `pool.Release()`, coordinated via `sync.WaitGroup`.
+
+Task 1: Define ReconSource interface and Config
+pkg/recon/source.go
+
+- ReconSource interface has methods: Name() string, RateLimit() rate.Limit, Burst() int, RespectsRobots() bool, Enabled(cfg Config) bool, Sweep(ctx, query, out chan<- Finding) error
+- Finding is a type alias for pkg/engine.Finding so downstream code reuses the existing storage path
+- Config struct carries Stealth bool, RespectRobots bool, EnabledSources []string, Query string
+
+Create pkg/recon/source.go with package recon. Import golang.org/x/time/rate, context, and pkg/engine.
+
+```go
+package recon
+
+import (
+    "context"
+
+    "golang.org/x/time/rate"
+
+    "github.com/salvacybersec/keyhunter/pkg/engine"
+)
+
+// Finding is the recon package's alias for the canonical engine.Finding.
+// Recon sources set SourceType = "recon:".
+type Finding = engine.Finding
+
+// Config controls a recon sweep.
+type Config struct {
+    Stealth        bool
+    RespectRobots  bool
+    EnabledSources []string // empty = all
+    Query          string
+}
+
+// ReconSource is implemented by every OSINT source module (Phases 10-16).
+// Each source owns its own rate.Limiter constructed from RateLimit()/Burst().
+type ReconSource interface {
+    Name() string
+    RateLimit() rate.Limit
+    Burst() int
+    RespectsRobots() bool
+    Enabled(cfg Config) bool
+    Sweep(ctx context.Context, query string, out chan<- Finding) error
+}
+```
+
+Per Config decisions in 09-CONTEXT.md. No external deps beyond golang.org/x/time/rate (already in go.mod) and pkg/engine.
+
+cd /home/salva/Documents/apikey && go build ./pkg/recon/...
+
+pkg/recon/source.go compiles; ReconSource interface exported; Finding aliased to engine.Finding.
+
+Task 2: Engine with Register/List/SweepAll + ExampleSource + tests
+pkg/recon/engine.go, pkg/recon/example.go, pkg/recon/engine_test.go
+
+- Engine.Register(src ReconSource) adds to internal map keyed by Name()
+- Engine.List() returns sorted source names
+- Engine.SweepAll(ctx, cfg) runs every enabled source in parallel via ants pool, collects Findings from a shared channel, and returns []Finding. Dedup is NOT done here (Plan 09-03 owns dedup.go); SweepAll just aggregates.
+- Each source call is wrapped in its own goroutine submitted to ants.Pool; uses sync.WaitGroup to close the out channel after all sources finish
+- ExampleSource.Name()="example", RateLimit()=rate.Limit(10), Burst()=1, RespectsRobots()=false, Enabled always true, Sweep emits two deterministic Findings with SourceType="recon:example"
+- TestSweepAll registers ExampleSource, runs SweepAll, asserts exactly 2 findings with SourceType="recon:example"
+- TestRegisterList asserts List() returns ["example"] after registering
+
+Create pkg/recon/engine.go:
+
+```go
+package recon
+
+import (
+    "context"
+    "sort"
+    "sync"
+
+    "github.com/panjf2000/ants/v2"
+)
+
+type Engine struct {
+    mu      sync.RWMutex
+    sources map[string]ReconSource
+}
+
+func NewEngine() *Engine {
+    return &Engine{sources: make(map[string]ReconSource)}
+}
+
+func (e *Engine) Register(s ReconSource) {
+    e.mu.Lock()
+    defer e.mu.Unlock()
+    e.sources[s.Name()] = s
+}
+
+func (e *Engine) List() []string {
+    e.mu.RLock()
+    defer e.mu.RUnlock()
+    names := make([]string, 0, len(e.sources))
+    for n := range e.sources {
+        names = append(names, n)
+    }
+    sort.Strings(names)
+    return names
+}
+
+// SweepAll fans out to every enabled source in parallel via ants pool and
+// returns aggregated findings. Deduplication is performed by callers using
+// pkg/recon.Dedup (plan 09-03).
+func (e *Engine) SweepAll(ctx context.Context, cfg Config) ([]Finding, error) {
+    e.mu.RLock()
+    active := make([]ReconSource, 0, len(e.sources))
+    for _, s := range e.sources {
+        if s.Enabled(cfg) {
+            active = append(active, s)
+        }
+    }
+    e.mu.RUnlock()
+
+    if len(active) == 0 {
+        return nil, nil
+    }
+
+    pool, err := ants.NewPool(len(active))
+    if err != nil {
+        return nil, err
+    }
+    defer pool.Release()
+
+    out := make(chan Finding, 256)
+    var wg sync.WaitGroup
+    for _, s := range active {
+        s := s
+        wg.Add(1)
+        _ = pool.Submit(func() {
+            defer wg.Done()
+            _ = s.Sweep(ctx, cfg.Query, out)
+        })
+    }
+    go func() { wg.Wait(); close(out) }()
+
+    var all []Finding
+    for f := range out {
+        all = append(all, f)
+    }
+    return all, nil
+}
+```
+
+Create pkg/recon/example.go with an ExampleSource emitting two deterministic Findings (SourceType="recon:example", fake masked keys, distinct Source URLs) to prove the pipeline.
+
+```go
+package recon
+
+import (
+    "context"
+    "time"
+
+    "golang.org/x/time/rate"
+)
+
+type ExampleSource struct{}
+
+func (ExampleSource) Name() string          { return "example" }
+func (ExampleSource) RateLimit() rate.Limit { return rate.Limit(10) }
+func (ExampleSource) Burst() int            { return 1 }
+func (ExampleSource) RespectsRobots() bool  { return false }
+func (ExampleSource) Enabled(_ Config) bool { return true }
+
+func (ExampleSource) Sweep(ctx context.Context, query string, out chan<- Finding) error {
+    fakes := []Finding{
+        {ProviderName: "openai", KeyMasked: "sk-examp...AAAA", Source: "https://example.invalid/a", SourceType: "recon:example", DetectedAt: time.Now()},
+        {ProviderName: "anthropic", KeyMasked: "sk-ant-e...BBBB", Source: "https://example.invalid/b", SourceType: "recon:example", DetectedAt: time.Now()},
+    }
+    for _, f := range fakes {
+        select {
+        case out <- f:
+        case <-ctx.Done():
+            return ctx.Err()
+        }
+    }
+    return nil
+}
+```
+
+Create pkg/recon/engine_test.go with TestRegisterList and TestSweepAll using ExampleSource. Use testify require.
+
+TDD: write tests first, they fail, then implement.
+
+cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run 'TestRegisterList|TestSweepAll' -count=1
+
+Tests pass. Engine registers ExampleSource, SweepAll returns 2 findings with SourceType="recon:example".
+
+- `go build ./pkg/recon/...` succeeds
+- `go test ./pkg/recon/ -count=1` passes
+- `go vet ./pkg/recon/...` clean
+
+- ReconSource interface exported
+- Engine.Register/List/SweepAll implemented and tested
+- ExampleSource proves end-to-end fanout
+- No cycles with pkg/engine (recon imports engine, not vice versa)
+
+After completion, create `.planning/phases/09-osint-infrastructure/09-01-SUMMARY.md`
diff --git a/.planning/phases/09-osint-infrastructure/09-02-PLAN.md b/.planning/phases/09-osint-infrastructure/09-02-PLAN.md
new file mode 100644
index 0000000..616cfd0
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-02-PLAN.md
@@ -0,0 +1,147 @@
+---
+phase: 09-osint-infrastructure
+plan: 02
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - pkg/recon/limiter.go
+  - pkg/recon/limiter_test.go
+autonomous: true
+requirements: [RECON-INFRA-05]
+must_haves:
+  truths:
+    - "Each source has its own rate.Limiter — no central limiter"
+    - "limiter.Wait blocks until a token is available, honoring ctx cancellation"
+    - "Jitter delay (100ms-1s) is applied before each request when stealth is enabled"
+    - "LimiterRegistry maps source names to limiters and returns existing limiters on repeat lookup"
+  artifacts:
+    - path: "pkg/recon/limiter.go"
+      provides: "LimiterRegistry with For(name, rate, burst) + Wait with optional jitter"
+      contains: "type LimiterRegistry"
+    - path: "pkg/recon/limiter_test.go"
+      provides: "Tests for per-source isolation, jitter range, ctx cancellation"
+  key_links:
+    - from: "pkg/recon/limiter.go"
+      to: "golang.org/x/time/rate"
+      via: "rate.NewLimiter per source"
+      pattern: "rate\\.NewLimiter"
+---
+
+Implement per-source rate limiter architecture: each source registers its own rate.Limiter keyed by name, and the engine calls Wait() before each request. Optional jitter (100ms-1s) when stealth mode is enabled.
+
+Purpose: Satisfies RECON-INFRA-05 and guarantees the "every source holds its own limiter — no centralized limiter" success criterion from the roadmap.
+Output: pkg/recon/limiter.go, pkg/recon/limiter_test.go
+
+@$HOME/.claude/get-shit-done/workflows/execute-plan.md
+@$HOME/.claude/get-shit-done/templates/summary.md
+
+@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
+@go.mod
+
+Task 1: LimiterRegistry with per-source rate.Limiter and jitter
+pkg/recon/limiter.go, pkg/recon/limiter_test.go
+
+- LimiterRegistry.For(name string, r rate.Limit, burst int) *rate.Limiter returns the existing limiter for name or creates a new one. Subsequent calls with the same name return the SAME pointer (idempotent).
+- Wait(ctx, name, r, burst, stealth bool) error calls limiter.Wait(ctx), then if stealth==true sleeps a random duration between 100ms and 1s (respecting ctx).
+- Per-source isolation: two different names produce two distinct *rate.Limiter instances.
+- Ctx cancellation during Wait returns ctx.Err() promptly.
+- Tests:
+  - TestLimiterPerSourceIsolation: registry.For("a", 10, 1) != registry.For("b", 10, 1)
+  - TestLimiterIdempotent: registry.For("a", 10, 1) == registry.For("a", 10, 1) (same pointer)
+  - TestWaitRespectsContext: cancelled ctx returns error
+  - TestJitterRange: with stealth=true, Wait duration is >= 100ms. Use a high rate (1000/s, burst 100) so only jitter contributes.
+
+Create pkg/recon/limiter.go:
+
+```go
+package recon
+
+import (
+    "context"
+    "math/rand"
+    "sync"
+    "time"
+
+    "golang.org/x/time/rate"
+)
+
+// LimiterRegistry holds one *rate.Limiter per source name.
+// RECON-INFRA-05: each source owns its own limiter — no centralization.
+type LimiterRegistry struct {
+    mu       sync.Mutex
+    limiters map[string]*rate.Limiter
+}
+
+func NewLimiterRegistry() *LimiterRegistry {
+    return &LimiterRegistry{limiters: make(map[string]*rate.Limiter)}
+}
+
+// For returns the limiter for name, creating it with (r, burst) on first call.
+// Repeat calls with the same name return the same *rate.Limiter pointer.
+func (lr *LimiterRegistry) For(name string, r rate.Limit, burst int) *rate.Limiter {
+    lr.mu.Lock()
+    defer lr.mu.Unlock()
+    if l, ok := lr.limiters[name]; ok {
+        return l
+    }
+    l := rate.NewLimiter(r, burst)
+    lr.limiters[name] = l
+    return l
+}
+
+// Wait blocks until the source's token is available. If stealth is true,
+// an additional random jitter between 100ms and 1s is applied to evade
+// fingerprint detection (RECON-INFRA-06 partial — fully wired in 09-03).
+func (lr *LimiterRegistry) Wait(ctx context.Context, name string, r rate.Limit, burst int, stealth bool) error {
+    l := lr.For(name, r, burst)
+    if err := l.Wait(ctx); err != nil {
+        return err
+    }
+    if stealth {
+        jitter := time.Duration(100+rand.Intn(900)) * time.Millisecond
+        select {
+        case <-time.After(jitter):
+        case <-ctx.Done():
+            return ctx.Err()
+        }
+    }
+    return nil
+}
+```
+
+Create pkg/recon/limiter_test.go with the four tests above. Use testify require. For TestJitterRange, call Wait with rate=1000, burst=100, stealth=true, measure elapsed, assert >= 90ms (10ms slack) and <= 1100ms.
+
+cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run 'TestLimiter|TestWait|TestJitter' -count=1
+
+All limiter tests pass; per-source isolation verified; jitter bounded; ctx cancellation honored.
+- `go test ./pkg/recon/ -run 'TestLimiter|TestWait|TestJitter' -count=1` passes
+- `go vet ./pkg/recon/...` clean
+
+- LimiterRegistry exported with For and Wait
+- Each source receives its own *rate.Limiter
+- Stealth jitter range 100ms-1s enforced
+
+After completion, create `.planning/phases/09-osint-infrastructure/09-02-SUMMARY.md`
diff --git a/.planning/phases/09-osint-infrastructure/09-03-PLAN.md b/.planning/phases/09-osint-infrastructure/09-03-PLAN.md
new file mode 100644
index 0000000..a6fe36d
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-03-PLAN.md
@@ -0,0 +1,186 @@
+---
+phase: 09-osint-infrastructure
+plan: 03
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - pkg/recon/stealth.go
+  - pkg/recon/stealth_test.go
+  - pkg/recon/dedup.go
+  - pkg/recon/dedup_test.go
+autonomous: true
+requirements: [RECON-INFRA-06]
+must_haves:
+  truths:
+    - "Stealth mode exposes a UA pool of 10 realistic browser user-agents (Chrome/Firefox/Safari across Linux/macOS/Windows)"
+    - "RandomUserAgent returns a UA from the pool, distributed across calls"
+    - "Dedup drops duplicate Findings keyed by SHA256(provider + masked_key + source)"
+    - "Dedup preserves first-seen order and metadata"
+  artifacts:
+    - path: "pkg/recon/stealth.go"
+      provides: "UA pool, RandomUserAgent, StealthHeaders helper"
+      contains: "var userAgents"
+    - path: "pkg/recon/dedup.go"
+      provides: "Dedup([]Finding) []Finding keyed by sha256(provider|masked|source)"
+      contains: "func Dedup"
+  key_links:
+    - from: "pkg/recon/dedup.go"
+      to: "crypto/sha256"
+      via: "finding hash key"
+      pattern: "sha256\\.Sum256"
+---
+
+Implement stealth mode helpers (UA rotation) and cross-source deduplication. Both are small, self-contained, and keep the parallel sweep orchestrator from producing noisy duplicate findings.
+
+Purpose: Satisfies RECON-INFRA-06 (stealth UA rotation) and provides the dedup primitive that SweepAll callers use to satisfy RECON-INFRA-08's "deduplicates findings before persisting" criterion.
+Output: pkg/recon/stealth.go, pkg/recon/dedup.go, and their tests
+
+@$HOME/.claude/get-shit-done/workflows/execute-plan.md
+@$HOME/.claude/get-shit-done/templates/summary.md
+
+@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
+@pkg/engine/finding.go
+
+Task 1: Stealth UA pool + RandomUserAgent
+pkg/recon/stealth.go, pkg/recon/stealth_test.go
+
+- userAgents is an unexported slice of exactly 10 realistic UA strings covering Chrome/Firefox/Safari on Linux/macOS/Windows
+- RandomUserAgent() returns a random entry from the pool
+- StealthHeaders() returns map[string]string{"User-Agent": RandomUserAgent(), "Accept-Language": "en-US,en;q=0.9"}
+- Tests: TestUAPoolSize (== 10), TestRandomUserAgentInPool (returned value is in pool), TestStealthHeadersHasUA
+
+Create pkg/recon/stealth.go with a package-level `userAgents` slice of 10 realistic UAs.
+Include at least:
+- Chrome 120 Windows
+- Chrome 120 macOS
+- Chrome 120 Linux
+- Firefox 121 Windows
+- Firefox 121 macOS
+- Firefox 121 Linux
+- Safari 17 macOS
+- Safari 17 iOS
+- Edge 120 Windows
+- Chrome Android
+
+```go
+package recon
+
+import "math/rand"
+
+var userAgents = []string{
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.2; rv:121.0) Gecko/20100101 Firefox/121.0",
+    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
+    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.2210.61",
+    "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
+}
+
+// RandomUserAgent returns a random browser user-agent from the pool.
+// Used when Config.Stealth is true.
+func RandomUserAgent() string {
+    return userAgents[rand.Intn(len(userAgents))]
+}
+
+// StealthHeaders returns a minimal headers map with rotated UA and Accept-Language.
+func StealthHeaders() map[string]string {
+    return map[string]string{
+        "User-Agent":      RandomUserAgent(),
+        "Accept-Language": "en-US,en;q=0.9",
+    }
+}
+```
+
+Create pkg/recon/stealth_test.go with the three tests. TestRandomUserAgentInPool should loop 100 times and assert each result is present in the `userAgents` slice.
+
+cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run 'TestUAPool|TestRandomUserAgent|TestStealthHeaders' -count=1
+
+Tests pass. Pool has exactly 10 UAs. Random selection always within pool.
+
+Task 2: Cross-source finding dedup
+pkg/recon/dedup.go, pkg/recon/dedup_test.go
+
+- Dedup(in []Finding) []Finding drops duplicates keyed by sha256(ProviderName + "|" + KeyMasked + "|" + Source)
+- First-seen wins: returned slice preserves the first occurrence's metadata (SourceType, DetectedAt, etc.)
+- Order is preserved from the input (stable dedup)
+- Nil/empty input returns nil
+- Tests:
+  - TestDedupEmpty: Dedup(nil) == nil
+  - TestDedupNoDuplicates: 3 distinct findings -> 3 returned
+  - TestDedupAllDuplicates: 3 identical findings -> 1 returned
+  - TestDedupPreservesFirstSeen: two findings with same key, different DetectedAt — the first-seen timestamp wins
+  - TestDedupDifferentSource: same provider/masked, different Source URLs -> both kept
+
+Create pkg/recon/dedup.go:
+
+```go
+package recon
+
+import (
+    "crypto/sha256"
+    "encoding/hex"
+)
+
+// Dedup removes duplicate findings using SHA256(provider|masked|source) as key.
+// Stable: preserves input order and first-seen metadata.
+func Dedup(in []Finding) []Finding {
+    if len(in) == 0 {
+        return nil
+    }
+    seen := make(map[string]struct{}, len(in))
+    out := make([]Finding, 0, len(in))
+    for _, f := range in {
+        h := sha256.Sum256([]byte(f.ProviderName + "|" + f.KeyMasked + "|" + f.Source))
+        k := hex.EncodeToString(h[:])
+        if _, dup := seen[k]; dup {
+            continue
+        }
+        seen[k] = struct{}{}
+        out = append(out, f)
+    }
+    return out
+}
+```
+
+Create pkg/recon/dedup_test.go with the five tests. Use testify require.
+
+cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run TestDedup -count=1
+
+All dedup tests pass. First-seen wins.
+Different Source URLs are kept separate.
+
+- `go test ./pkg/recon/ -run 'TestUAPool|TestRandom|TestStealth|TestDedup' -count=1` passes
+- `go vet ./pkg/recon/...` clean
+
+- Stealth UA pool (10 entries) exported via RandomUserAgent/StealthHeaders
+- Dedup primitive removes duplicates stably by sha256(provider|masked|source)
+
+After completion, create `.planning/phases/09-osint-infrastructure/09-03-SUMMARY.md`
diff --git a/.planning/phases/09-osint-infrastructure/09-04-PLAN.md b/.planning/phases/09-osint-infrastructure/09-04-PLAN.md
new file mode 100644
index 0000000..6548444
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-04-PLAN.md
@@ -0,0 +1,196 @@
+---
+phase: 09-osint-infrastructure
+plan: 04
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - pkg/recon/robots.go
+  - pkg/recon/robots_test.go
+  - go.mod
+  - go.sum
+autonomous: true
+requirements: [RECON-INFRA-07]
+must_haves:
+  truths:
+    - "pkg/recon.RobotsCache parses and caches robots.txt per host for 1 hour"
+    - "Allowed(host, path) returns true if robots.txt permits `keyhunter` UA on that path"
+    - "Cache hit avoids a second HTTP fetch for the same host within TTL"
+    - "Network errors degrade safely: default-allow (so a broken robots.txt fetch does not silently block sweeps)"
+  artifacts:
+    - path: "pkg/recon/robots.go"
+      provides: "RobotsCache with Allowed(ctx, url) bool + 1h per-host TTL"
+      contains: "type RobotsCache"
+    - path: "pkg/recon/robots_test.go"
+      provides: "Tests for parse/allowed/disallowed/cache-hit/network-fail"
+  key_links:
+    - from: "pkg/recon/robots.go"
+      to: "github.com/temoto/robotstxt"
+      via: "robotstxt.FromBytes"
+      pattern: "robotstxt\\."
+---
+
+Add robots.txt parser and per-host cache for web-scraping sources. Satisfies RECON-INFRA-07 ("`keyhunter recon full --respect-robots` respects robots.txt for web-scraping sources before making any requests"). Only sources with RespectsRobots()==true consult the cache.
+
+Purpose: Foundation for every later web-scraping source (Phase 11 paste, Phase 15 forums, etc.). Adds github.com/temoto/robotstxt dependency.
+Output: pkg/recon/robots.go, pkg/recon/robots_test.go, go.mod/go.sum updated
+
+@$HOME/.claude/get-shit-done/workflows/execute-plan.md
+@$HOME/.claude/get-shit-done/templates/summary.md
+
+@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
+@go.mod
+
+Task 1: Add temoto/robotstxt dependency
+go.mod, go.sum
+
+Run `go get github.com/temoto/robotstxt@latest` from the repo root. This updates go.mod and go.sum. Do NOT run `go mod tidy` yet — downstream tasks in this plan consume the dep and tidy will fail if tests are not written. Prefer `go mod download github.com/temoto/robotstxt` if only population of go.sum is needed, but `go get` is canonical.
+
+Verify the dep appears in go.mod `require` block.
+
+cd /home/salva/Documents/apikey && grep -q "github.com/temoto/robotstxt" go.mod
+
+go.mod contains github.com/temoto/robotstxt; go.sum populated.
+Task 2: RobotsCache with 1h TTL and default-allow on error
+pkg/recon/robots.go, pkg/recon/robots_test.go
+
+- RobotsCache.Allowed(ctx, rawURL) (bool, error): parse URL -> host, fetch https://host/robots.txt (or use injected http.Client for tests), cache parsed result for 1 hour per host
+- UA used for matching is "keyhunter"
+- On fetch error or parse error: return true, nil (default-allow) so a broken robots endpoint does not silently disable a recon source
+- Cache key is host (not full URL)
+- Second call for same host within TTL does NOT trigger another HTTP request
+- Tests use httptest.Server to serve robots.txt and inject a custom http.Client via RobotsCache.Client field
+- Tests:
+  - TestRobotsAllowed: robots.txt says "User-agent: * / Disallow:" and path /public -> Allowed returns true
+  - TestRobotsDisallowed: robots.txt says "User-agent: * / Disallow: /private" and path /private -> false
+  - TestRobotsCacheHit: after first call, second call hits cache (use an atomic counter in the httptest handler and assert count == 1)
+  - TestRobotsNetworkError: server returns 500 -> Allowed returns true (default-allow)
+  - TestRobotsUAKeyhunter: robots.txt has "User-agent: keyhunter / Disallow: /blocked" -> path /blocked returns false
+
+Create pkg/recon/robots.go:
+
+```go
+package recon
+
+import (
+    "context"
+    "io"
+    "net/http"
+    "net/url"
+    "sync"
+    "time"
+
+    "github.com/temoto/robotstxt"
+)
+
+const (
+    robotsTTL = 1 * time.Hour
+    robotsUA  = "keyhunter"
+)
+
+type robotsEntry struct {
+    data    *robotstxt.RobotsData
+    fetched time.Time
+}
+
+// RobotsCache fetches and caches per-host robots.txt for 1 hour.
+// Sources whose RespectsRobots() returns true should call Allowed before each request.
+type RobotsCache struct {
+    mu     sync.Mutex
+    cache  map[string]robotsEntry
+    Client *http.Client // nil -> http.DefaultClient
+}
+
+func NewRobotsCache() *RobotsCache {
+    return &RobotsCache{cache: make(map[string]robotsEntry)}
+}
+
+// Allowed reports whether `keyhunter` may fetch rawURL per the host's robots.txt.
+// On fetch/parse error the function returns true (default-allow) to avoid silently
+// disabling recon sources when a site has a broken robots endpoint.
+func (rc *RobotsCache) Allowed(ctx context.Context, rawURL string) (bool, error) {
+    u, err := url.Parse(rawURL)
+    if err != nil {
+        return true, nil
+    }
+    host := u.Host
+
+    rc.mu.Lock()
+    entry, ok := rc.cache[host]
+    if ok && time.Since(entry.fetched) < robotsTTL {
+        rc.mu.Unlock()
+        return entry.data.TestAgent(u.Path, robotsUA), nil
+    }
+    rc.mu.Unlock()
+
+    client := rc.Client
+    if client == nil {
+        client = http.DefaultClient
+    }
+    req, _ := http.NewRequestWithContext(ctx, "GET", u.Scheme+"://"+host+"/robots.txt", nil)
+    resp, err := client.Do(req)
+    if err != nil {
+        return true, nil // default-allow on network error
+    }
+    defer resp.Body.Close()
+    if resp.StatusCode >= 400 {
+        return true, nil // default-allow on 4xx/5xx
+    }
+    body, err := io.ReadAll(resp.Body)
+    if err != nil {
+        return true, nil
+    }
+    data, err := robotstxt.FromBytes(body)
+    if err != nil {
+        return true, nil
+    }
+    rc.mu.Lock()
+    rc.cache[host] = robotsEntry{data: data, fetched: time.Now()}
+    rc.mu.Unlock()
+    return data.TestAgent(u.Path, robotsUA), nil
+}
+```
+
+Create pkg/recon/robots_test.go using httptest.NewServer. Inject the test server's client into RobotsCache.Client (use `server.Client()`). For TestRobotsCacheHit, use `atomic.Int32` incremented inside the handler.
+
+Note on test URL: since httptest.Server has a dynamic host, build rawURL from `server.URL + "/public"`. The cache key will be the httptest host:port — both calls share the same host, so cache hit is testable.
+cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run TestRobots -count=1
+
+All 5 robots tests pass. Cache hit verified via request counter. Default-allow on 500 verified.
+
+- `go test ./pkg/recon/ -run TestRobots -count=1` passes
+- `go build ./...` passes (robotstxt dep resolved)
+- `go vet ./pkg/recon/...` clean
+
+- RobotsCache implemented with 1h TTL
+- UA "keyhunter" matching
+- Default-allow on network/parse errors
+- github.com/temoto/robotstxt added to go.mod
+
+After completion, create `.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md`
diff --git a/.planning/phases/09-osint-infrastructure/09-05-PLAN.md b/.planning/phases/09-osint-infrastructure/09-05-PLAN.md
new file mode 100644
index 0000000..0687a7c
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-05-PLAN.md
@@ -0,0 +1,185 @@
+---
+phase: 09-osint-infrastructure
+plan: 05
+type: execute
+wave: 2
+depends_on: ["09-01", "09-02", "09-03", "09-04"]
+files_modified:
+  - cmd/recon.go
+  - cmd/stubs.go
+  - cmd/root.go
+autonomous: true
+requirements: [RECON-INFRA-08]
+must_haves:
+  truths:
+    - "`keyhunter recon full` runs Engine.SweepAll + Dedup and prints a masked findings table"
+    - "`keyhunter recon list` prints the registered source names one per line"
+    - "--stealth, --respect-robots (default true), --query flags exist on `recon full`"
+    - "ExampleSource is registered at init() so Phase 9 ships a demonstrable pipeline"
+    - "The stub reconCmd in cmd/stubs.go is removed; cmd/recon.go owns the command tree"
+  artifacts:
+    - path: "cmd/recon.go"
+      provides: "reconCmd with subcommands `full` and `list`, flag wiring, source registration"
+      contains: "var reconCmd"
+    - path: "cmd/stubs.go"
+      provides: "reconCmd stub removed; other stubs unchanged"
+  key_links:
+    - from: "cmd/recon.go"
+      to: "pkg/recon.Engine"
+      via: "NewEngine + Register(ExampleSource{}) + SweepAll"
+      pattern: "recon\\.NewEngine"
+    - from: "cmd/recon.go"
+      to: "pkg/recon.Dedup"
+      via:
+        "Dedup applied to SweepAll results before printing"
+      pattern: "recon\\.Dedup"
+---
+
+Wire the recon package into the Cobra CLI with `keyhunter recon full` and `keyhunter recon list`. Remove the stub reconCmd from cmd/stubs.go. Register ExampleSource at init() so `recon full` produces visible output end-to-end on a fresh clone.
+
+Purpose: Satisfies RECON-INFRA-08 "Recon full command — parallel sweep across all sources with deduplication". Completes the phase's user-facing entrypoint.
+Output: cmd/recon.go (new), cmd/stubs.go (stub removed), cmd/root.go (registration unchanged or updated)
+
+@$HOME/.claude/get-shit-done/workflows/execute-plan.md
+@$HOME/.claude/get-shit-done/templates/summary.md
+
+@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
+@cmd/stubs.go
+@.planning/phases/09-osint-infrastructure/09-01-SUMMARY.md
+@.planning/phases/09-osint-infrastructure/09-03-SUMMARY.md
+
+Task 1: Remove reconCmd stub from cmd/stubs.go
+cmd/stubs.go
+
+Delete the `var reconCmd = &cobra.Command{...}` block from cmd/stubs.go. Leave verifyCmd, serveCmd, scheduleCmd untouched. The real reconCmd will be declared in cmd/recon.go (Task 2).
+
+Verify cmd/root.go still references `reconCmd` — it will resolve to the new declaration in cmd/recon.go (same package `cmd`).
+
+cd /home/salva/Documents/apikey && ! grep -q 'var reconCmd' cmd/stubs.go
+
+cmd/stubs.go no longer declares reconCmd; file still compiles with other stubs.
+
+Task 2: Create cmd/recon.go with full and list subcommands
+cmd/recon.go
+
+Create cmd/recon.go declaring `var reconCmd` plus subcommands `reconFullCmd` and `reconListCmd`.
Flag wiring: + - `--stealth` bool, default false + - `--respect-robots` bool, default true + - `--query` string, default "" (empty -> sources use their own default keywords) + + ```go + package cmd + + import ( + "context" + "fmt" + + "github.com/salvacybersec/keyhunter/pkg/recon" + "github.com/spf13/cobra" + ) + + var ( + reconStealth bool + reconRespectRobots bool + reconQuery string + ) + + var reconCmd = &cobra.Command{ + Use: "recon", + Short: "Run OSINT recon across internet sources", + Long: "Run OSINT recon sweeps across registered sources. Phase 9 ships with an ExampleSource stub; real sources land in Phases 10-16.", + } + + var reconFullCmd = &cobra.Command{ + Use: "full", + Short: "Sweep all enabled sources in parallel and deduplicate findings", + RunE: func(cmd *cobra.Command, args []string) error { + eng := buildReconEngine() + cfg := recon.Config{ + Stealth: reconStealth, + RespectRobots: reconRespectRobots, + Query: reconQuery, + } + ctx := context.Background() + all, err := eng.SweepAll(ctx, cfg) + if err != nil { + return fmt.Errorf("recon sweep: %w", err) + } + deduped := recon.Dedup(all) + fmt.Printf("recon: swept %d sources, %d findings (%d after dedup)\n", len(eng.List()), len(all), len(deduped)) + for _, f := range deduped { + fmt.Printf(" [%s] %s %s %s\n", f.SourceType, f.ProviderName, f.KeyMasked, f.Source) + } + return nil + }, + } + + var reconListCmd = &cobra.Command{ + Use: "list", + Short: "List registered recon sources", + RunE: func(cmd *cobra.Command, args []string) error { + eng := buildReconEngine() + for _, name := range eng.List() { + fmt.Println(name) + } + return nil + }, + } + + // buildReconEngine constructs the recon Engine with all sources registered. + // Phase 9 ships ExampleSource only; Phases 10-16 will add real sources here + // (or via a registration side-effect in their packages). 
+ func buildReconEngine() *recon.Engine { + e := recon.NewEngine() + e.Register(recon.ExampleSource{}) + return e + } + + func init() { + reconFullCmd.Flags().BoolVar(&reconStealth, "stealth", false, "enable UA rotation and jitter delays") + reconFullCmd.Flags().BoolVar(&reconRespectRobots, "respect-robots", true, "respect robots.txt for web-scraping sources") + reconFullCmd.Flags().StringVar(&reconQuery, "query", "", "override query sent to each source") + reconCmd.AddCommand(reconFullCmd) + reconCmd.AddCommand(reconListCmd) + } + ``` + + Do NOT modify cmd/root.go unless `rootCmd.AddCommand(reconCmd)` is missing. (It currently exists because the stub was registered there.) + + + cd /home/salva/Documents/apikey && go build ./... && go run . recon list | grep -q '^example$' + + `keyhunter recon list` prints "example". `keyhunter recon full` prints 2 findings from ExampleSource with "recon: swept 1 sources, 2 findings (2 after dedup)". + + + + + +- `go build ./...` succeeds +- `go run . recon list` prints `example` +- `go run . recon full` prints "recon: swept 1 sources, 2 findings (2 after dedup)" and two lines with [recon:example] +- `go run . 
recon full --stealth --query=test` runs without error + + + +- reconCmd owned by cmd/recon.go, stub removed from cmd/stubs.go +- `recon full` and `recon list` subcommands work end-to-end +- Dedup wired through the SweepAll result +- Flags --stealth, --respect-robots (default true), --query all parse + + + +After completion, create `.planning/phases/09-osint-infrastructure/09-05-SUMMARY.md` + + diff --git a/.planning/phases/09-osint-infrastructure/09-06-PLAN.md b/.planning/phases/09-osint-infrastructure/09-06-PLAN.md new file mode 100644 index 0000000..9c342b2 --- /dev/null +++ b/.planning/phases/09-osint-infrastructure/09-06-PLAN.md @@ -0,0 +1,136 @@ +--- +phase: 09-osint-infrastructure +plan: 06 +type: execute +wave: 2 +depends_on: ["09-01", "09-02", "09-03", "09-04", "09-05"] +files_modified: + - pkg/recon/integration_test.go + - .planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md +autonomous: true +requirements: [RECON-INFRA-05, RECON-INFRA-06, RECON-INFRA-07, RECON-INFRA-08] +must_haves: + truths: + - "Integration test exercises Engine + LimiterRegistry + Dedup together against a synthetic source that emits duplicates" + - "Integration test verifies --stealth path calls RandomUserAgent without errors" + - "Integration test verifies RobotsCache.Allowed is invoked only when RespectsRobots()==true" + - "Phase summary documents all 4 requirement IDs as complete" + artifacts: + - path: "pkg/recon/integration_test.go" + provides: "End-to-end test: Engine + Limiter + Stealth + Robots + Dedup" + contains: "func TestReconPipelineIntegration" + - path: ".planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md" + provides: "Phase completion summary with requirement ID coverage and next-phase guidance" + key_links: + - from: "pkg/recon/integration_test.go" + to: "pkg/recon.Engine + LimiterRegistry + RobotsCache + Dedup" + via: "TestReconPipelineIntegration wires all four together" + pattern: "TestReconPipelineIntegration" +--- + + +Phase 9 integration test + 
phase summary. Proves the four recon infra components compose correctly before Phases 10-16 start building sources on top, and documents completion for roadmap tracking. + +Purpose: Final safety net for the phase. Catches cross-component bugs (e.g., limiter deadlock, dedup hash collision, robots TTL leak) that unit tests on individual files miss. +Output: pkg/recon/integration_test.go, .planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md + + + +@$HOME/.claude/get-shit-done/workflows/execute-plan.md +@$HOME/.claude/get-shit-done/templates/summary.md + + + +@.planning/phases/09-osint-infrastructure/09-CONTEXT.md +@.planning/phases/09-osint-infrastructure/09-01-SUMMARY.md +@.planning/phases/09-osint-infrastructure/09-02-SUMMARY.md +@.planning/phases/09-osint-infrastructure/09-03-SUMMARY.md +@.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md +@.planning/phases/09-osint-infrastructure/09-05-SUMMARY.md + + + + + + Task 1: End-to-end integration test + pkg/recon/integration_test.go + + - Define a local TestSource struct (in the _test.go file) that: + - Name() returns "test" + - RateLimit() returns rate.Limit(100), Burst() returns 10 + - RespectsRobots() returns false + - Enabled returns true + - Sweep emits 5 Findings, 2 of which are exact duplicates (same provider+masked+source) + - TestReconPipelineIntegration: + - Construct Engine, Register TestSource + - Construct LimiterRegistry and call Wait("test", 100, 10, true) once to verify jitter path does not panic + - Call Engine.SweepAll(ctx, Config{Stealth: true}) + - Assert len(findings) == 5 (raw), len(Dedup(findings)) == 4 (after dedup) + - Assert every finding has SourceType starting with "recon:" + - TestRobotsOnlyWhenRespectsRobots: + - Create two sources: webSource (RespectsRobots true) and apiSource (RespectsRobots false) + - Verify that a RobotsCache call path is only exercised for webSource (use a counter via a shim: the test can simulate this by manually invoking RobotsCache.Allowed for webSource 
before calling webSource.Sweep, and asserting the apiSource path skips it) + - This is a documentation-style test with minimal logic: assert `webSource.RespectsRobots() == true && apiSource.RespectsRobots() == false`, then assert RobotsCache.Allowed works when called, and is never called when RespectsRobots returns false (trivially satisfied by not invoking it). + + + Create pkg/recon/integration_test.go. Declare testSource and testWebSource structs within the test file. Use `httptest.NewServer` for the robots portion, serving "User-agent: *\nAllow: /\n". + + The test lives in package `recon` (not the external `recon_test` package) so it can reference both exported and unexported identifiers directly, without a `recon.` qualifier. + + Assertions via testify require: + - require.Equal(t, 5, len(raw)) + - require.Equal(t, 4, len(Dedup(raw))) + - require.Equal(t, "recon:test", raw[0].SourceType) + - require.NoError(t, limiter.Wait(ctx, "test", rate.Limit(100), 10, true)) + - require.True(t, webSource.RespectsRobots()) + - require.False(t, apiSource.RespectsRobots()) + - allowed, err := rc.Allowed(ctx, server.URL+"/foo"); require.NoError(t, err); require.True(t, allowed) + + Per RECON-INFRA-05/06/07/08, each requirement has at least one assertion in this integration test. + + + cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run 'TestReconPipelineIntegration|TestRobotsOnlyWhenRespectsRobots' -count=1 + + Integration test passes. All 4 RECON-INFRA requirement IDs have at least one assertion covering them.
+ + + + Task 2: Write 09-PHASE-SUMMARY.md + .planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md + + Create the phase summary documenting: + - Requirements closed: RECON-INFRA-05, RECON-INFRA-06, RECON-INFRA-07, RECON-INFRA-08 (all 4) + - Key artifacts: pkg/recon/{source,engine,limiter,stealth,dedup,robots,example}.go + tests + - CLI surface: `keyhunter recon full`, `keyhunter recon list` + - Decisions adopted: per-source limiter (no centralization), default-allow on robots fetch failure, dedup by sha256(provider|masked|source), UA pool of 10 + - New dependency: github.com/temoto/robotstxt + - Handoff to Phase 10: all real sources implement ReconSource interface and register via `buildReconEngine()` in cmd/recon.go (or ideally via package init side-effects once the pattern is established in Phase 10) + - Known gaps deferred: proxy/TOR (out of scope), per-source retry (each source handles own retries), distributed rate limiting (out of scope) + + Follow the standard SUMMARY.md template from @$HOME/.claude/get-shit-done/templates/summary.md. + + + test -s /home/salva/Documents/apikey/.planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md && grep -q "RECON-INFRA-05" /home/salva/Documents/apikey/.planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md && grep -q "RECON-INFRA-08" /home/salva/Documents/apikey/.planning/phases/09-osint-infrastructure/09-PHASE-SUMMARY.md + + 09-PHASE-SUMMARY.md exists, non-empty, names all 4 requirement IDs. + + + + + +- `go test ./pkg/recon/... -count=1` passes (all unit + integration) +- `go build ./...` passes +- `go vet ./...` clean +- 09-PHASE-SUMMARY.md exists with all 4 RECON-INFRA IDs + + + +- Integration test proves Engine + Limiter + Stealth + Robots + Dedup compose correctly +- Phase summary documents completion of all 4 requirement IDs +- Phase 10 can start immediately against a stable pkg/recon contract + + + +After completion, create `.planning/phases/09-osint-infrastructure/09-06-SUMMARY.md` + +