keyhunter/.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md
salvacybersec 4dbc38dcc5 docs(09-04): complete robots.txt cache plan
Adds SUMMARY, marks RECON-INFRA-07 complete, updates phase 9 roadmap.
2026-04-06 00:43:49 +03:00


phase: 09-osint-infrastructure
plan: 04
subsystem: recon
tags: recon, robots, http, cache
requires: github.com/temoto/robotstxt
provides: pkg/recon.RobotsCache, pkg/recon.NewRobotsCache
affects: pkg/recon
tech-stack:
  added: github.com/temoto/robotstxt@v1.1.2
  patterns: per-host TTL cache; default-allow on error; injectable http.Client
key-files:
  created: pkg/recon/robots.go, pkg/recon/robots_test.go
  modified: go.mod, go.sum
decisions:
  • default-allow on fetch/parse error so a broken robots endpoint does not silently disable sources
  • cache key is host (not full URL) to share TTL across all paths on a site
  • UA matched is "keyhunter"; temoto/robotstxt TestAgent handles group precedence
  • RobotsCache.Client field lets tests inject httptest.Server.Client()
metrics:
  duration: ~6 min
  completed: 2026-04-05
  tasks: 2
  files: 4
requirements: RECON-INFRA-07

Phase 9 Plan 04: robots.txt Parser with 1h Cache Summary

RobotsCache parses robots.txt via temoto/robotstxt and caches results per host for 1 hour, with default-allow on network/parse errors and injectable HTTP client for tests.

Objective

Provide the foundation for every web-scraping recon source (Phase 11 paste, Phase 15 forums, etc.): a thread-safe, per-host robots.txt cache that is consulted only by sources whose RespectsRobots() returns true. Satisfies RECON-INFRA-07.

What Was Built

  • pkg/recon/robots.go (95 lines) — RobotsCache struct with NewRobotsCache() constructor and Allowed(ctx, rawURL) (bool, error) method.
  • pkg/recon/robots_test.go (118 lines) — 5 tests covering allow/disallow/cache-hit/network-error/keyhunter-UA.
  • github.com/temoto/robotstxt@v1.1.2 added to go.mod and go.sum.

Behavior

  • Parses URL, looks up host in sync.Mutex-guarded map.
  • Cache hit within 1h TTL returns entry.data.TestAgent(path, "keyhunter") without network I/O.
  • Cache miss fetches <scheme>://<host>/robots.txt; on success, stores robotstxt.RobotsData.
  • Network error, HTTP 4xx/5xx, read error, or parse error => returns true, nil (default-allow).
  • Client field (nil -> http.DefaultClient) lets tests inject httptest.Server.Client().

Tasks

#   Name                                  Commit   Files
1   Add temoto/robotstxt dependency       c3b9fb4  go.mod, go.sum
2a  Failing tests for RobotsCache (RED)   4bd6c6b  pkg/recon/robots_test.go
2b  Implement RobotsCache (GREEN)         0373931  pkg/recon/robots.go

Verification

  • go test ./pkg/recon/ -run TestRobots -count=1 -v — 5/5 PASS
    • TestRobotsAllowed
    • TestRobotsDisallowed
    • TestRobotsCacheHit (request counter == 1 after 3 calls)
    • TestRobotsNetworkError (500 -> allowed)
    • TestRobotsUAKeyhunter (keyhunter group matched over wildcard)
  • go vet ./pkg/recon/... — clean
  • go build ./... — clean

Decisions Made

  1. Default-allow on error — preferred over fail-closed to avoid silently disabling recon sources when a target has a misconfigured/500ing robots endpoint. Documented inline and in tests.
  2. Host-keyed cache — a single robots.txt governs all paths on a host; caching by host shares TTL across every request to that site.
  3. Injectable Client: a Client *http.Client field on RobotsCache instead of a functional-option constructor, to keep the surface minimal; tests pass srv.Client() directly.
  4. UA "keyhunter" literal — constant robotsUA = "keyhunter" so upstream sources do not need to pass it. temoto/robotstxt TestAgent handles group precedence (specific UA beats wildcard).
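The precedence rule in decision 4 (a group naming the specific agent beats the wildcard group) can be illustrated with a simplified sketch. In the actual code this is delegated to temoto/robotstxt's TestAgent. The map-of-prefixes representation and the pickGroup and allowed helpers below are hypothetical, and agent matching is reduced to an exact lowercase lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// robotsUA matches the constant described in decision 4.
const robotsUA = "keyhunter"

// pickGroup returns the Disallow prefixes for ua when a group names it
// explicitly, and falls back to the wildcard group otherwise.
func pickGroup(groups map[string][]string, ua string) []string {
	if g, ok := groups[strings.ToLower(ua)]; ok {
		return g
	}
	return groups["*"]
}

// allowed reports whether path is outside every Disallow prefix of the chosen group.
func allowed(groups map[string][]string, ua, path string) bool {
	for _, prefix := range pickGroup(groups, ua) {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	groups := map[string][]string{
		"*":         {"/"},        // wildcard group disallows everything
		"keyhunter": {"/private"}, // specific group only blocks /private
	}
	fmt.Println(allowed(groups, robotsUA, "/public"))    // true
	fmt.Println(allowed(groups, "otherbot", "/public"))  // false
	fmt.Println(allowed(groups, robotsUA, "/private/a")) // false
}
```

This mirrors what TestRobotsUAKeyhunter is described as checking: the keyhunter group is applied even though a wildcard group is also present.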

Deviations from Plan

Out-of-scope discovery (NOT fixed, logged only)

Untracked pkg/recon/engine_test.go — present in the working tree from parallel wave work (plan 09-05 scaffold). It references NewEngine/ExampleSource which do not yet exist, causing the test binary for pkg/recon to fail to build until those types land.

  • Action taken: Temporarily moved the file aside during my test run, verified my 5 tests pass + go vet clean + go build ./... clean, then restored the file unchanged.
  • Why not fixed: Not caused by this plan (files_modified is pkg/recon/robots.go + pkg/recon/robots_test.go only). Belongs to plan 09-05 (wave 2, depends on 09-04). Fixing would be scope creep.
  • Impact on 09-04 verification: None for this plan's code. With the orphan file moved aside, the 5 tests, go vet, and go build ./... all pass. With the file in place, go test ./pkg/recon/... fails to build until 09-05's types exist, and that failure comes from the orphan test file, not from this plan's code. Plan 09-05 will resolve this when it lands engine.go and example.go.

Otherwise: plan executed exactly as written.

Known Stubs

None.

Self-Check: PASSED

  • pkg/recon/robots.go — FOUND
  • pkg/recon/robots_test.go — FOUND
  • go.mod contains github.com/temoto/robotstxt — VERIFIED
  • Commit c3b9fb4 (Task 1) — FOUND
  • Commit 4bd6c6b (Task 2 RED) — FOUND
  • Commit 0373931 (Task 2 GREEN) — FOUND
  • go test ./pkg/recon/ -run TestRobots — 5/5 PASS
  • go vet ./pkg/recon/... — clean
  • go build ./... — clean