---
phase: 09-osint-infrastructure
plan: 04
subsystem: recon
tags: [recon, robots, http, cache]
requires: [github.com/temoto/robotstxt]
provides:
  - pkg/recon.RobotsCache
  - pkg/recon.NewRobotsCache
affects: [pkg/recon]
tech-stack:
  added: [github.com/temoto/robotstxt@v1.1.2]
  patterns: [per-host TTL cache, default-allow on error, injectable http.Client]
key-files:
  created:
    - pkg/recon/robots.go
    - pkg/recon/robots_test.go
  modified:
    - go.mod
    - go.sum
decisions:
  - default-allow on fetch/parse error so a broken robots endpoint does not silently disable sources
  - cache key is host (not full URL) to share TTL across all paths on a site
  - UA matched is "keyhunter"; temoto/robotstxt TestAgent handles group precedence
  - RobotsCache.Client field lets tests inject httptest.Server.Client()
metrics:
  duration: ~6 min
  completed: 2026-04-05
  tasks: 2
  files: 4
requirements: [RECON-INFRA-07]
---

# Phase 9 Plan 04: robots.txt Parser with 1h Cache Summary

RobotsCache parses robots.txt via temoto/robotstxt and caches the result per host for 1 hour, with default-allow on network/parse errors and an injectable HTTP client for tests.

## Objective

Foundation for every web-scraping recon source (Phase 11 paste, Phase 15 forums, etc.): a thread-safe per-host robots.txt cache that is consulted only by sources with `RespectsRobots()==true`. Satisfies RECON-INFRA-07.

## What Was Built

- `pkg/recon/robots.go` (95 lines): `RobotsCache` struct with `NewRobotsCache()` constructor and `Allowed(ctx, rawURL) (bool, error)` method.
- `pkg/recon/robots_test.go` (118 lines): 5 tests covering allow/disallow/cache-hit/network-error/keyhunter-UA.
- `github.com/temoto/robotstxt@v1.1.2` added to `go.mod` and `go.sum`.

## Behavior

- Parses the URL and looks up its host in a `sync.Mutex`-guarded map.
- A cache hit within the 1h TTL returns `entry.data.TestAgent(path, "keyhunter")` without network I/O.
- A cache miss fetches `<scheme>://<host>/robots.txt`; on success, it stores the parsed `robotstxt.RobotsData`.
- Network error, HTTP 4xx/5xx, read error, or parse error => returns `true, nil` (default-allow).
- `Client` field (nil -> `http.DefaultClient`) lets tests inject `httptest.Server.Client()`.

## Tasks

| #  | Name                                | Commit  | Files                    |
|----|-------------------------------------|---------|--------------------------|
| 1  | Add temoto/robotstxt dependency     | c3b9fb4 | go.mod, go.sum           |
| 2a | Failing tests for RobotsCache (RED) | 4bd6c6b | pkg/recon/robots_test.go |
| 2b | Implement RobotsCache (GREEN)       | 0373931 | pkg/recon/robots.go      |

## Verification

- `go test ./pkg/recon/ -run TestRobots -count=1 -v`: 5/5 PASS
  - TestRobotsAllowed
  - TestRobotsDisallowed
  - TestRobotsCacheHit (request counter == 1 after 3 calls)
  - TestRobotsNetworkError (500 -> allowed)
  - TestRobotsUAKeyhunter (keyhunter group matched over wildcard)
- `go vet ./pkg/recon/...`: clean
- `go build ./...`: clean

## Decisions Made

1. **Default-allow on error**: preferred over fail-closed to avoid silently disabling recon sources when a target has a misconfigured/500ing robots endpoint. Documented inline and in tests.
2. **Host-keyed cache**: a single robots.txt governs all paths on a host; caching by host shares TTL across every request to that site.
3. **Injectable Client**: `Client *http.Client` field on `RobotsCache` instead of a functional-option constructor to keep the surface minimal; tests pass `srv.Client()` directly.
4. **UA "keyhunter" literal**: constant `robotsUA = "keyhunter"` so upstream sources do not need to pass it. temoto/robotstxt `TestAgent` handles group precedence (specific UA beats wildcard).

## Deviations from Plan

### Out-of-scope discovery (NOT fixed, logged only)

**Untracked `pkg/recon/engine_test.go`**: present in the working tree from parallel wave work (plan 09-05 scaffold). It references `NewEngine`/`ExampleSource`, which do not yet exist, so the test binary for `pkg/recon` fails to build until those types land.
- **Action taken:** Temporarily moved the file aside during my test run, verified my 5 tests pass + `go vet` clean + `go build ./...` clean, then restored the file unchanged.
- **Why not fixed:** Not caused by this plan (files_modified is `pkg/recon/robots.go` + `pkg/recon/robots_test.go` only). Belongs to plan 09-05 (wave 2, depends on 09-04). Fixing would be scope creep.
- **Impact on 09-04 verification:** None for this plan's code. With the orphan file moved aside, my tests, vet, and full build all pass; with the file in place, the `pkg/recon` test binary builds only after 09-05's types exist. Under the current repository state, running `go test ./pkg/recon/...` will fail on the orphan test file, not on my code. Plan 09-05 will resolve this when it lands `engine.go` and `example.go`.

Otherwise: plan executed exactly as written.

## Known Stubs

None.

## Self-Check: PASSED

- pkg/recon/robots.go: FOUND
- pkg/recon/robots_test.go: FOUND
- go.mod contains github.com/temoto/robotstxt: VERIFIED
- Commit c3b9fb4 (Task 1): FOUND
- Commit 4bd6c6b (Task 2 RED): FOUND
- Commit 0373931 (Task 2 GREEN): FOUND
- `go test ./pkg/recon/ -run TestRobots`: 5/5 PASS
- `go vet ./pkg/recon/...`: clean
- `go build ./...`: clean
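
For readers without the repository at hand, the pattern summarized under Behavior (per-host TTL cache, default-allow on fetch failure, injectable fetcher) can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `pkg/recon/robots.go`: `HostCache`, `Fetch`, `ErrFetch`, and `parseRules` are hypothetical names, the matcher is a deliberately simplified stand-in that ignores user-agent groups (the real code delegates to temoto/robotstxt `TestAgent`), and returning the error on an unparsable URL is an assumption.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
	"sync"
	"time"
)

// ErrFetch is a hypothetical sentinel a fetcher can return to signal failure.
var ErrFetch = errors.New("robots.txt fetch failed")

// rules is a simplified stand-in for parsed robots.txt data: it keeps only
// Disallow path prefixes and ignores user-agent groups entirely.
type rules struct{ disallow []string }

// parseRules collects every non-empty "Disallow:" value in body.
func parseRules(body string) rules {
	var r rules
	for _, line := range strings.Split(body, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "disallow") {
			continue
		}
		if p := strings.TrimSpace(val); p != "" {
			r.disallow = append(r.disallow, p)
		}
	}
	return r
}

// allowed reports whether path matches none of the Disallow prefixes.
func (r rules) allowed(path string) bool {
	for _, p := range r.disallow {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

type entry struct {
	data    rules
	fetched time.Time
}

// HostCache caches parsed rules per host for TTL (hypothetical type).
type HostCache struct {
	TTL   time.Duration
	Fetch func(robotsURL string) (string, error) // injectable, like a Client field

	mu sync.Mutex
	m  map[string]entry
}

// NewHostCache returns a cache with a 1h TTL and the given fetcher.
func NewHostCache(fetch func(string) (string, error)) *HostCache {
	return &HostCache{TTL: time.Hour, Fetch: fetch, m: make(map[string]entry)}
}

// Allowed reports whether rawURL may be fetched. A failed robots.txt fetch
// yields (true, nil): default-allow. An unparsable URL returns the error.
func (c *HostCache) Allowed(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	c.mu.Lock()
	e, ok := c.m[u.Host]
	c.mu.Unlock()
	if !ok || time.Since(e.fetched) >= c.TTL {
		// The lock is released during the fetch, so two goroutines may
		// fetch the same host concurrently; the last writer wins.
		body, ferr := c.Fetch(u.Scheme + "://" + u.Host + "/robots.txt")
		if ferr != nil {
			return true, nil // default-allow; failures are not cached
		}
		e = entry{data: parseRules(body), fetched: time.Now()}
		c.mu.Lock()
		c.m[u.Host] = e
		c.mu.Unlock()
	}
	path := u.Path
	if path == "" {
		path = "/"
	}
	return e.data.allowed(path), nil
}
```

Keying the map by `u.Host` means repeated calls for different paths on the same site trigger a single fetch within the TTL, which is the property the cache-hit test checks with its request counter. Passing the fetcher as a field mirrors the reason given for the `Client` field: tests can substitute a fake without a constructor option.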