default-allow on fetch/parse error so a broken robots endpoint does not silently disable sources
cache key is host (not full URL) to share TTL across all paths on a site
UA matched is "keyhunter"; temoto/robotstxt TestAgent handles group precedence
RobotsCache.Client field lets tests inject httptest.Server.Client()
duration: ~6 min
completed: 2026-04-05
tasks: 2
files: 4
RECON-INFRA-07
Phase 9 Plan 04: robots.txt Parser with 1h Cache Summary
RobotsCache parses robots.txt via temoto/robotstxt and caches results per host for 1 hour, with default-allow on network/parse errors and an injectable HTTP client for tests.
Objective
Provide the foundation for every web-scraping recon source (Phase 11 paste, Phase 15 forums, etc.): a thread-safe, per-host robots.txt cache that is consulted only by sources whose RespectsRobots() returns true. Satisfies RECON-INFRA-07.
What Was Built
pkg/recon/robots.go (95 lines) — RobotsCache struct with NewRobotsCache() constructor and Allowed(ctx, rawURL) (bool, error) method.
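The flow behind Allowed can be sketched with the standard library alone. This is a minimal illustration, not the contents of robots.go: hostCache, pathCheck, the load hook, and the example hostnames are hypothetical names, and the choices to return an error for a malformed URL and to skip caching failed loads are assumptions made for the sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/url"
	"strings"
	"sync"
	"time"
)

// pathCheck answers "may this path be fetched?" for one host. It is a
// hypothetical stand-in for parsed robots.txt data.
type pathCheck func(path string) bool

type cacheEntry struct {
	check   pathCheck
	fetched time.Time
}

// hostCache is an illustrative stand-in for RobotsCache, not the real type.
type hostCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
	// load fetches and parses robots.txt for one host (hypothetical hook).
	load func(ctx context.Context, host string) (pathCheck, error)
}

// Allowed looks up the rules by host, reloading them when the entry is
// missing or older than the TTL. The lock is held across load for simplicity.
func (c *hostCache) Allowed(ctx context.Context, rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err // assumption: a malformed URL is reported to the caller
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[u.Host] // keyed by host, so every path shares one entry
	if !ok || time.Since(e.fetched) >= c.ttl {
		check, err := c.load(ctx, u.Host)
		if err != nil {
			return true, nil // default-allow; the failure is not cached here
		}
		e = cacheEntry{check: check, fetched: time.Now()}
		c.entries[u.Host] = e
	}
	return e.check(u.Path), nil
}

// demo runs three lookups on one healthy host and one on a failing host,
// returning the number of load calls and the four verdicts.
func demo() (loads int, verdicts []bool) {
	c := &hostCache{
		ttl:     time.Hour,
		entries: map[string]cacheEntry{},
		load: func(ctx context.Context, host string) (pathCheck, error) {
			loads++
			if host == "down.example" {
				return nil, errors.New("robots.txt returned 500")
			}
			return func(p string) bool { return !strings.HasPrefix(p, "/private") }, nil
		},
	}
	for _, raw := range []string{
		"https://ok.example/a",
		"https://ok.example/private/b",
		"https://ok.example/c",
		"https://down.example/private/d",
	} {
		ok, _ := c.Allowed(context.Background(), raw)
		verdicts = append(verdicts, ok)
	}
	return loads, verdicts
}

func main() {
	loads, verdicts := demo()
	fmt.Println(loads, verdicts) // prints: 2 [true false true true]
}
```

Three lookups on ok.example trigger a single load, and the failing host is allowed even for a path the healthy host would block.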
go test ./pkg/recon/ -run TestRobots -count=1 -v — 5/5 PASS
TestRobotsAllowed
TestRobotsDisallowed
TestRobotsCacheHit (request counter == 1 after 3 calls)
TestRobotsNetworkError (500 -> allowed)
TestRobotsUAKeyhunter (keyhunter group matched over wildcard)
go vet ./pkg/recon/... — clean
go build ./... — clean
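The request-counter and 500 cases listed above depend on two mechanics: a test server that counts hits, and a client obtained from srv.Client(). The sketch below shows only those mechanics. It is not the project's test code, and fetchRobots and probe are hypothetical helpers. Treating every non-200 status as an error is an assumption of the sketch.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// fetchRobots is a hypothetical helper: it GETs base+"/robots.txt" with the
// supplied client and treats any non-200 status as an error.
func fetchRobots(ctx context.Context, client *http.Client, base string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, base+"/robots.txt", nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("robots.txt: unexpected status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// probe serves robots.txt with the given status, fetches it once through the
// server's own client, and reports the request count and whether the fetch failed.
func probe(status int) (hits int64, failed bool) {
	var n int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&n, 1) // request counter, as in the cache-hit test
		w.WriteHeader(status)
		fmt.Fprint(w, "User-agent: *\nDisallow: /private\n")
	}))
	defer srv.Close()

	_, err := fetchRobots(context.Background(), srv.Client(), srv.URL)
	return atomic.LoadInt64(&n), err != nil
}

func main() {
	for _, status := range []int{http.StatusOK, http.StatusInternalServerError} {
		hits, failed := probe(status)
		fmt.Printf("status=%d hits=%d failed=%v\n", status, hits, failed)
	}
}
```

A caller that applies default-allow would map the failed fetch for status 500 to "allowed", which is the behaviour TestRobotsNetworkError checks.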
Decisions Made
Default-allow on error — preferred over fail-closed to avoid silently disabling recon sources when a target has a misconfigured/500ing robots endpoint. Documented inline and in tests.
Host-keyed cache — a single robots.txt governs all paths on a host; caching by host shares TTL across every request to that site.
Injectable Client — Client *http.Client field on RobotsCache instead of a functional-option constructor to keep the surface minimal; tests pass srv.Client() directly.
UA "keyhunter" literal — constant robotsUA = "keyhunter" so upstream sources do not need to pass it. temoto/robotstxt TestAgent handles group precedence (specific UA beats wildcard).
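The "specific UA beats wildcard" rule can be illustrated with a deliberately simplified matcher. This is not temoto/robotstxt's algorithm or API: parseGroups and allowed are hypothetical, agents are matched by exact lowercase name, and only Disallow prefix rules are understood (no Allow lines, no path wildcards).

```go
package main

import (
	"fmt"
	"strings"
)

// robotsUA mirrors the constant named in the summary.
const robotsUA = "keyhunter"

// groups maps a lowercase user-agent name to its Disallow prefixes.
type groups map[string][]string

// parseGroups reads User-agent and Disallow lines only (simplified).
func parseGroups(body string) groups {
	g := groups{}
	var agents []string
	inRules := false
	for _, line := range strings.Split(body, "\n") {
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // drop comments
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.ToLower(strings.TrimSpace(parts[0]))
		val := strings.TrimSpace(parts[1])
		switch key {
		case "user-agent":
			if inRules { // a User-agent line after rules starts a new group
				agents, inRules = nil, false
			}
			name := strings.ToLower(val)
			agents = append(agents, name)
			if _, ok := g[name]; !ok {
				g[name] = nil // record the group even if it has no rules
			}
		case "disallow":
			inRules = true
			if val == "" {
				continue // an empty Disallow allows everything
			}
			for _, a := range agents {
				g[a] = append(g[a], val)
			}
		}
	}
	return g
}

// allowed uses the agent's own group when one exists and falls back to "*"
// only otherwise, so a specific group overrides the wildcard entirely.
func allowed(g groups, agent, path string) bool {
	rules, ok := g[strings.ToLower(agent)]
	if !ok {
		rules = g["*"]
	}
	for _, prefix := range rules {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	g := parseGroups("User-agent: *\nDisallow: /\n\nUser-agent: keyhunter\nDisallow: /private\n")
	fmt.Println(allowed(g, robotsUA, "/public"))    // true
	fmt.Println(allowed(g, robotsUA, "/private/x")) // false
	fmt.Println(allowed(g, "otherbot", "/public"))  // false
}
```

With this input the wildcard group blocks everything, yet keyhunter may still fetch /public because its own group is selected instead of the wildcard.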
Deviations from Plan
Out-of-scope discovery (NOT fixed, logged only)
Untracked pkg/recon/engine_test.go — present in the working tree from parallel wave work (plan 09-05 scaffold). It references NewEngine/ExampleSource which do not yet exist, causing the test binary for pkg/recon to fail to build until those types land.
Action taken: Temporarily moved the file aside during my test run, verified my 5 tests pass + go vet clean + go build ./... clean, then restored the file unchanged.
Why not fixed: Not caused by this plan (files_modified is pkg/recon/robots.go + pkg/recon/robots_test.go only). Belongs to plan 09-05 (wave 2, depends on 09-04). Fixing would be scope creep.
Impact on 09-04 verification: None. With the orphan file moved aside, my tests, vet, and full build all pass; with it in place, the pkg/recon test binary will build only once 09-05's types exist. Under the current repository state, running go test ./pkg/recon/... fails on the orphan test file, not on my code. Plan 09-05 will resolve this when it lands engine.go and example.go.