---
phase: 09-osint-infrastructure
plan: 04
type: execute
wave: 1
depends_on: []
files_modified:
  - pkg/recon/robots.go
  - pkg/recon/robots_test.go
  - go.mod
  - go.sum
autonomous: true
requirements: [RECON-INFRA-07]
must_haves:
  truths:
    - "pkg/recon.RobotsCache parses and caches robots.txt per host for 1 hour"
    - "Allowed(ctx, rawURL) returns true if robots.txt permits the `keyhunter` UA on that path"
    - "A cache hit avoids a second HTTP fetch for the same host within the TTL"
    - "Network errors degrade safely: default-allow, so a broken robots.txt fetch does not silently block sweeps"
  artifacts:
    - path: "pkg/recon/robots.go"
      provides: "RobotsCache with Allowed(ctx, rawURL) (bool, error) + 1h per-host TTL"
      contains: "type RobotsCache"
    - path: "pkg/recon/robots_test.go"
      provides: "Tests for parse/allowed/disallowed/cache-hit/network-fail"
  key_links:
    - from: "pkg/recon/robots.go"
      to: "github.com/temoto/robotstxt"
      via: "robotstxt.FromBytes"
      pattern: "robotstxt\\."
---

Add a robots.txt parser and per-host cache for web-scraping sources. Satisfies RECON-INFRA-07 ("`keyhunter recon full --respect-robots` respects robots.txt for web-scraping sources before making any requests"). Only sources with RespectsRobots() == true consult the cache.

Purpose: foundation for every later web-scraping source (Phase 11 paste sites, Phase 15 forums, etc.). Adds the github.com/temoto/robotstxt dependency.

Output: pkg/recon/robots.go, pkg/recon/robots_test.go, go.mod/go.sum updated.

@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
@go.mod

## Task 1: Add temoto/robotstxt dependency

Files: go.mod, go.sum

Run `go get github.com/temoto/robotstxt@latest` from the repo root. This updates go.mod and go.sum. Do NOT run `go mod tidy` yet: downstream tasks in this plan consume the dependency, and tidy will fail while the tests are not yet written.
Prefer `go mod download github.com/temoto/robotstxt` if only go.sum needs populating, but `go get` is the canonical route. Verify that the dependency appears in the go.mod `require` block.

Verify: `cd /home/salva/Documents/apikey && grep -q "github.com/temoto/robotstxt" go.mod`

Done when: go.mod contains github.com/temoto/robotstxt and go.sum is populated.

## Task 2: RobotsCache with 1h TTL and default-allow on error

Files: pkg/recon/robots.go, pkg/recon/robots_test.go

- `RobotsCache.Allowed(ctx, rawURL) (bool, error)`: parse the URL to get the host, fetch `https://host/robots.txt` (or use an injected http.Client in tests), and cache the parsed result per host for 1 hour
- The UA used for matching is "keyhunter"
- On fetch or parse error, return `true, nil` (default-allow) so a broken robots endpoint does not silently disable a recon source
- The cache key is the host (not the full URL)
- A second call for the same host within the TTL does NOT trigger another HTTP request
- Tests use httptest.Server to serve robots.txt and inject a custom http.Client via the `RobotsCache.Client` field

Tests:

- TestRobotsAllowed: robots.txt contains `User-agent: *` / `Disallow:`; path /public returns true
- TestRobotsDisallowed: robots.txt contains `User-agent: *` / `Disallow: /private`; path /private returns false
- TestRobotsCacheHit: after the first call, the second call hits the cache (use an atomic counter in the httptest handler and assert the count == 1)
- TestRobotsNetworkError: the server returns 500; Allowed returns true (default-allow)
- TestRobotsUAKeyhunter: robots.txt contains `User-agent: keyhunter` / `Disallow: /blocked`; path /blocked returns false

Create pkg/recon/robots.go:

```go
package recon

import (
	"context"
	"io"
	"net/http"
	"net/url"
	"sync"
	"time"

	"github.com/temoto/robotstxt"
)

const (
	robotsTTL = 1 * time.Hour
	robotsUA  = "keyhunter"
)

type robotsEntry struct {
	data    *robotstxt.RobotsData
	fetched time.Time
}

// RobotsCache fetches and caches per-host robots.txt for 1 hour.
// Sources whose RespectsRobots() returns true should call Allowed before each request.
type RobotsCache struct {
	mu     sync.Mutex
	cache  map[string]robotsEntry
	Client *http.Client // nil -> http.DefaultClient
}

func NewRobotsCache() *RobotsCache {
	return &RobotsCache{cache: make(map[string]robotsEntry)}
}

// Allowed reports whether `keyhunter` may fetch rawURL per the host's robots.txt.
// On fetch/parse error the function returns true (default-allow) to avoid silently
// disabling recon sources when a site has a broken robots endpoint.
func (rc *RobotsCache) Allowed(ctx context.Context, rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return true, nil
	}
	host := u.Host

	rc.mu.Lock()
	entry, ok := rc.cache[host]
	if ok && time.Since(entry.fetched) < robotsTTL {
		rc.mu.Unlock()
		return entry.data.TestAgent(u.Path, robotsUA), nil
	}
	rc.mu.Unlock()

	client := rc.Client
	if client == nil {
		client = http.DefaultClient
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.Scheme+"://"+host+"/robots.txt", nil)
	if err != nil {
		return true, nil // default-allow on malformed request URL
	}
	resp, err := client.Do(req)
	if err != nil {
		return true, nil // default-allow on network error
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return true, nil // default-allow on 4xx/5xx
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return true, nil
	}
	data, err := robotstxt.FromBytes(body)
	if err != nil {
		return true, nil
	}

	rc.mu.Lock()
	rc.cache[host] = robotsEntry{data: data, fetched: time.Now()}
	rc.mu.Unlock()

	return data.TestAgent(u.Path, robotsUA), nil
}
```

Create pkg/recon/robots_test.go using httptest.NewServer. Inject the test server's client into RobotsCache.Client (use `server.Client()`). For TestRobotsCacheHit, use an `atomic.Int32` incremented inside the handler.

Note on test URLs: since httptest.Server has a dynamic host, build rawURL from `server.URL + "/public"`. The cache key will be the httptest host:port; both calls share the same host, so the cache hit is testable.

Verify: `cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run TestRobots -count=1`

Done when: all 5 robots tests pass; the cache hit is verified via the request counter.
Default-allow on 500 is verified.

## Verification

- `go test ./pkg/recon/ -run TestRobots -count=1` passes
- `go build ./...` passes (robotstxt dep resolved)
- `go vet ./pkg/recon/...` clean
- RobotsCache implemented with 1h TTL
- UA "keyhunter" matching
- Default-allow on network/parse errors
- github.com/temoto/robotstxt added to go.mod

After completion, create `.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md`