From 4dbc38dcc5557a70d4171f81b594f9d049351fd2 Mon Sep 17 00:00:00 2001
From: salvacybersec
Date: Mon, 6 Apr 2026 00:43:49 +0300
Subject: [PATCH] docs(09-04): complete robots.txt cache plan

Adds SUMMARY, marks RECON-INFRA-07 complete, updates phase 9 roadmap.
---
 .planning/REQUIREMENTS.md                     |   2 +-
 .planning/ROADMAP.md                          |   6 +-
 .planning/STATE.md                            |  19 +--
 .../09-osint-infrastructure/09-04-SUMMARY.md  | 108 ++++++++++++++++++
 4 files changed, 122 insertions(+), 13 deletions(-)
 create mode 100644 .planning/phases/09-osint-infrastructure/09-04-SUMMARY.md

diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index fa71367..82c3a4d 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -206,7 +206,7 @@ Requirements for initial release. Each maps to roadmap phases.
 
 - [x] **RECON-INFRA-05**: Per-source rate limiter with configurable limits
 - [ ] **RECON-INFRA-06**: Stealth mode (--stealth) with UA rotation and increased delays
-- [ ] **RECON-INFRA-07**: robots.txt respect (--respect-robots, default on)
+- [x] **RECON-INFRA-07**: robots.txt respect (--respect-robots, default on)
 - [ ] **RECON-INFRA-08**: Recon full command — parallel sweep across all sources with deduplication
 
 ### Dork Engine
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index e8c7cf1..9ee3411 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -199,9 +199,9 @@ Plans:
   4. `keyhunter recon full` fans out to all enabled sources in parallel and deduplicates findings before persisting to the database
 **Plans**: 6 plans
 - [ ] 09-01-PLAN.md — ReconSource interface + Engine skeleton + ExampleSource stub
-- [ ] 09-02-PLAN.md — LimiterRegistry per-source rate.Limiter + jitter
+- [x] 09-02-PLAN.md — LimiterRegistry per-source rate.Limiter + jitter
 - [ ] 09-03-PLAN.md — Stealth UA pool + cross-source dedup
-- [ ] 09-04-PLAN.md — robots.txt parser with 1h per-host cache
+- [x] 09-04-PLAN.md — robots.txt parser with 1h per-host cache
 - [ ] 09-05-PLAN.md — cmd/recon.go CLI tree (full, list)
 - [ ] 09-06-PLAN.md — Integration test + phase summary
 
@@ -325,7 +325,7 @@ Phases execute in numeric order: 1 → 2 → 3 → ... → 18
 | 6. Output, Reporting & Key Management | 0/? | Not started | - |
 | 7. Import Adapters & CI/CD Integration | 0/? | Not started | - |
 | 8. Dork Engine | 0/? | Not started | - |
-| 9. OSINT Infrastructure | 0/? | Not started | - |
+| 9. OSINT Infrastructure | 2/6 | In Progress|  |
 | 10. OSINT Code Hosting | 0/? | Not started | - |
 | 11. OSINT Search & Paste | 0/? | Not started | - |
 | 12. OSINT IoT & Cloud Storage | 0/? | Not started | - |
diff --git a/.planning/STATE.md b/.planning/STATE.md
index b51ea73..8e1379b 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: executing
-stopped_at: Completed 08-07-PLAN.md
-last_updated: "2026-04-05T21:32:47.810Z"
+stopped_at: Completed 09-04-PLAN.md
+last_updated: "2026-04-05T21:43:35.883Z"
 last_activity: 2026-04-05
 progress:
   total_phases: 18
   completed_phases: 8
-  total_plans: 47
-  completed_plans: 47
+  total_plans: 53
+  completed_plans: 49
   percent: 20
 ---
 
@@ -21,12 +21,12 @@ progress:
 See: .planning/PROJECT.md (updated 2026-04-04)
 
 **Core value:** Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.
-**Current focus:** Phase 08 — dork-engine
+**Current focus:** Phase 09 — osint-infrastructure
 
 ## Current Position
 
-Phase: 9
-Plan: Not started
+Phase: 09 (osint-infrastructure) — EXECUTING
+Plan: 2 of 6
 Status: Ready to execute
 Last activity: 2026-04-05
 
@@ -82,6 +82,7 @@ Progress: [██░░░░░░░░] 20%
 | Phase 08-dork-engine P02 | 12min | 2 tasks | 11 files |
 | Phase 08-dork-engine P03 | 10m | 2 tasks | 10 files |
 | Phase 08-dork-engine P07 | 3m | 1 tasks | 1 files |
+| Phase 09-osint-infrastructure P04 | 6min | 2 tasks | 4 files |
 
 ## Accumulated Context
 
@@ -129,6 +130,6 @@ None yet.
 
 ## Session Continuity
 
-Last session: 2026-04-05T21:25:47.469Z
-Stopped at: Completed 08-07-PLAN.md
+Last session: 2026-04-05T21:43:35.879Z
+Stopped at: Completed 09-04-PLAN.md
 Resume file: None
diff --git a/.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md b/.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md
new file mode 100644
index 0000000..bacdd4d
--- /dev/null
+++ b/.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md
@@ -0,0 +1,108 @@
+---
+phase: 09-osint-infrastructure
+plan: 04
+subsystem: recon
+tags: [recon, robots, http, cache]
+requires: [github.com/temoto/robotstxt]
+provides:
+  - pkg/recon.RobotsCache
+  - pkg/recon.NewRobotsCache
+affects: [pkg/recon]
+tech-stack:
+  added: [github.com/temoto/robotstxt@v1.1.2]
+  patterns: [per-host TTL cache, default-allow on error, injectable http.Client]
+key-files:
+  created:
+    - pkg/recon/robots.go
+    - pkg/recon/robots_test.go
+  modified:
+    - go.mod
+    - go.sum
+decisions:
+  - default-allow on fetch/parse error so a broken robots endpoint does not silently disable sources
+  - cache key is host (not full URL) to share TTL across all paths on a site
+  - UA matched is "keyhunter"; temoto/robotstxt TestAgent handles group precedence
+  - RobotsCache.Client field lets tests inject httptest.Server.Client()
+metrics:
+  duration: ~6 min
+  completed: 2026-04-05
+  tasks: 2
+  files: 4
+requirements: [RECON-INFRA-07]
+---
+
+# Phase 9 Plan 04: robots.txt Parser with 1h Cache Summary
+
+RobotsCache parses robots.txt via temoto/robotstxt and caches results per host for 1 hour, with default-allow on network/parse errors and an injectable HTTP client for tests.
+
+## Objective
+
+Foundation for every web-scraping recon source (Phase 11 paste, Phase 15 forums, etc.): a thread-safe per-host robots.txt cache that only sources with `RespectsRobots()==true` consult. Satisfies RECON-INFRA-07.
+
+## What Was Built
+
+- `pkg/recon/robots.go` (95 lines) — `RobotsCache` struct with `NewRobotsCache()` constructor and `Allowed(ctx, rawURL) (bool, error)` method.
+- `pkg/recon/robots_test.go` (118 lines) — 5 tests covering allow/disallow/cache-hit/network-error/keyhunter-UA.
+- `github.com/temoto/robotstxt@v1.1.2` added to `go.mod` and `go.sum`.
+
+## Behavior
+
+- Parses URL, looks up host in `sync.Mutex`-guarded map.
+- Cache hit within 1h TTL returns `entry.data.TestAgent(path, "keyhunter")` without network I/O.
+- Cache miss fetches `<scheme>://<host>/robots.txt`; on success, stores `robotstxt.RobotsData`.
+- Network error, HTTP 4xx/5xx, read error, or parse error => returns `true, nil` (default-allow).
+- `Client` field (nil -> `http.DefaultClient`) lets tests inject `httptest.Server.Client()`.
+
+## Tasks
+
+| # | Name | Commit | Files |
+|---|------|--------|-------|
+| 1 | Add temoto/robotstxt dependency | c3b9fb4 | go.mod, go.sum |
+| 2a | Failing tests for RobotsCache (RED) | 4bd6c6b | pkg/recon/robots_test.go |
+| 2b | Implement RobotsCache (GREEN) | 0373931 | pkg/recon/robots.go |
+
+## Verification
+
+- `go test ./pkg/recon/ -run TestRobots -count=1 -v` — 5/5 PASS
+  - TestRobotsAllowed
+  - TestRobotsDisallowed
+  - TestRobotsCacheHit (request counter == 1 after 3 calls)
+  - TestRobotsNetworkError (500 -> allowed)
+  - TestRobotsUAKeyhunter (keyhunter group matched over wildcard)
+- `go vet ./pkg/recon/...` — clean
+- `go build ./...` — clean
+
+## Decisions Made
+
+1. **Default-allow on error** — preferred over fail-closed to avoid silently disabling recon sources when a target has a misconfigured/500ing robots endpoint. Documented inline and in tests.
+2. **Host-keyed cache** — a single robots.txt governs all paths on a host; caching by host shares TTL across every request to that site.
+3. **Injectable Client** — `Client *http.Client` field on `RobotsCache` instead of a functional-option constructor to keep the surface minimal; tests pass `srv.Client()` directly.
+4. **UA "keyhunter" literal** — constant `robotsUA = "keyhunter"` so upstream sources do not need to pass it. temoto/robotstxt `TestAgent` handles group precedence (specific UA beats wildcard).
+
+## Deviations from Plan
+
+### Out-of-scope discovery (NOT fixed, logged only)
+
+**Untracked `pkg/recon/engine_test.go`** — present in the working tree from parallel wave work (plan 09-05 scaffold). It references `NewEngine`/`ExampleSource`, which do not yet exist, so the test binary for `pkg/recon` fails to build until those types land.
+
+- **Action taken:** Temporarily moved the file aside during my test run, verified my 5 tests pass + `go vet` clean + `go build ./...` clean, then restored the file unchanged.
+- **Why not fixed:** Not caused by this plan (files_modified is `pkg/recon/robots.go` + `pkg/recon/robots_test.go` only). Belongs to plan 09-05 (wave 2, depends on 09-04). Fixing would be scope creep.
+- **Impact on 09-04 verification:** None on this plan's code. My tests, vet, and full build pass with the orphan file moved aside; with the file in place, the `pkg/recon` test binary builds only after 09-05's types exist. Under the current repository state, running `go test ./pkg/recon/...` fails on the orphan test file, not on my code. Plan 09-05 will resolve this when it lands `engine.go` and `example.go`.
+
+Otherwise: plan executed exactly as written.
+
+## Known Stubs
+
+None.
+
+## Self-Check: PASSED
+
+- pkg/recon/robots.go — FOUND
+- pkg/recon/robots_test.go — FOUND
+- go.mod contains github.com/temoto/robotstxt — VERIFIED
+- Commit c3b9fb4 (Task 1) — FOUND
+- Commit 4bd6c6b (Task 2 RED) — FOUND
+- Commit 0373931 (Task 2 GREEN) — FOUND
+- `go test ./pkg/recon/ -run TestRobots` — 5/5 PASS
+- `go vet ./pkg/recon/...` — clean
+- `go build ./...` — clean
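
Reviewer note: the Behavior section of 09-04-SUMMARY.md describes a per-host TTL cache with default-allow on error. The sketch below illustrates that flow using only the standard library; it is not the contents of `pkg/recon/robots.go`, which is not included in this patch. Assumptions: `parseRules` and the `rules` type are hypothetical stand-ins for `robotstxt.RobotsData` and its `TestAgent` method (they only honor `Disallow` prefixes and ignore group precedence), returning an error on an unparseable URL is a guess, and `runDemo` is an illustrative helper rather than one of the five tests listed above.

```go
package main

import (
	"bufio"
	"context"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"sync"
	"sync/atomic"
	"time"
)

const (
	robotsUA  = "keyhunter" // UA literal named in the summary
	robotsTTL = time.Hour   // 1h per-host TTL named in the summary
)

// rules is a hypothetical stand-in for robotstxt.RobotsData: a list of
// Disallow prefixes from groups addressed to "*" or to robotsUA.
type rules []string

func (r rules) allowed(path string) bool {
	for _, p := range r {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

// parseRules is a deliberately tiny parser used only for this sketch.
func parseRules(body io.Reader) rules {
	var out rules
	applies := false
	sc := bufio.NewScanner(body)
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		val = strings.TrimSpace(val)
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "user-agent":
			applies = val == "*" || strings.EqualFold(val, robotsUA)
		case "disallow":
			if applies && val != "" {
				out = append(out, val)
			}
		}
	}
	return out
}

type entry struct {
	data    rules
	fetched time.Time
}

// RobotsCache caches parsed rules per host behind a mutex.
type RobotsCache struct {
	Client *http.Client // nil means http.DefaultClient
	mu     sync.Mutex
	cache  map[string]entry
}

func NewRobotsCache() *RobotsCache { return &RobotsCache{cache: map[string]entry{}} }

// Allowed reports whether rawURL may be fetched. Fetch failures and HTTP
// 4xx/5xx responses yield (true, nil), i.e. default-allow.
func (c *RobotsCache) Allowed(ctx context.Context, rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err // assumption: the summary does not specify this case
	}
	c.mu.Lock()
	e, ok := c.cache[u.Host]
	c.mu.Unlock()
	if ok && time.Since(e.fetched) < robotsTTL {
		return e.data.allowed(u.Path), nil // cache hit: no network I/O
	}
	client := c.Client
	if client == nil {
		client = http.DefaultClient
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.Scheme+"://"+u.Host+"/robots.txt", nil)
	if err != nil {
		return true, nil
	}
	resp, err := client.Do(req)
	if err != nil {
		return true, nil
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return true, nil
	}
	data := parseRules(resp.Body)
	c.mu.Lock()
	c.cache[u.Host] = entry{data, time.Now()}
	c.mu.Unlock()
	return data.allowed(u.Path), nil
}

// runDemo serves a robots.txt from a local test server, makes three calls,
// and returns the first two verdicts plus the number of robots.txt fetches.
func runDemo() (publicOK, privateOK bool, fetches int) {
	var n int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt32(&n, 1)
		io.WriteString(w, "User-agent: *\nDisallow: /private\n")
	}))
	defer srv.Close()
	rc := NewRobotsCache()
	rc.Client = srv.Client()
	ctx := context.Background()
	publicOK, _ = rc.Allowed(ctx, srv.URL+"/public")
	privateOK, _ = rc.Allowed(ctx, srv.URL+"/private/x")
	rc.Allowed(ctx, srv.URL+"/other")
	return publicOK, privateOK, int(atomic.LoadInt32(&n))
}
```

Calling `runDemo()` from a `main` function or a test exercises the host-keyed cache: three `Allowed` calls against one host should trigger a single robots.txt fetch, which is the property `TestRobotsCacheHit` is described as checking. Keying by `u.Host` rather than the full URL is what lets the second and third calls skip the network.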