docs(09-04): complete robots.txt cache plan
Adds SUMMARY, marks RECON-INFRA-07 complete, updates phase 9 roadmap.
@@ -206,7 +206,7 @@ Requirements for initial release. Each maps to roadmap phases.
|
|||||||
|
|
||||||
- [x] **RECON-INFRA-05**: Per-source rate limiter with configurable limits
|
- [x] **RECON-INFRA-05**: Per-source rate limiter with configurable limits
|
||||||
- [ ] **RECON-INFRA-06**: Stealth mode (--stealth) with UA rotation and increased delays
|
- [ ] **RECON-INFRA-06**: Stealth mode (--stealth) with UA rotation and increased delays
|
||||||
- [ ] **RECON-INFRA-07**: robots.txt respect (--respect-robots, default on)
|
- [x] **RECON-INFRA-07**: robots.txt respect (--respect-robots, default on)
|
||||||
- [ ] **RECON-INFRA-08**: Recon full command — parallel sweep across all sources with deduplication
|
- [ ] **RECON-INFRA-08**: Recon full command — parallel sweep across all sources with deduplication
|
||||||
|
|
||||||
### Dork Engine
|
### Dork Engine
|
||||||
|
|||||||
@@ -199,9 +199,9 @@ Plans:
|
|||||||
4. `keyhunter recon full` fans out to all enabled sources in parallel and deduplicates findings before persisting to the database
|
4. `keyhunter recon full` fans out to all enabled sources in parallel and deduplicates findings before persisting to the database
|
||||||
**Plans**: 6 plans
|
**Plans**: 6 plans
|
||||||
- [ ] 09-01-PLAN.md — ReconSource interface + Engine skeleton + ExampleSource stub
|
- [ ] 09-01-PLAN.md — ReconSource interface + Engine skeleton + ExampleSource stub
|
||||||
- [ ] 09-02-PLAN.md — LimiterRegistry per-source rate.Limiter + jitter
|
- [x] 09-02-PLAN.md — LimiterRegistry per-source rate.Limiter + jitter
|
||||||
- [ ] 09-03-PLAN.md — Stealth UA pool + cross-source dedup
|
- [ ] 09-03-PLAN.md — Stealth UA pool + cross-source dedup
|
||||||
- [ ] 09-04-PLAN.md — robots.txt parser with 1h per-host cache
|
- [x] 09-04-PLAN.md — robots.txt parser with 1h per-host cache
|
||||||
- [ ] 09-05-PLAN.md — cmd/recon.go CLI tree (full, list)
|
- [ ] 09-05-PLAN.md — cmd/recon.go CLI tree (full, list)
|
||||||
- [ ] 09-06-PLAN.md — Integration test + phase summary
|
- [ ] 09-06-PLAN.md — Integration test + phase summary
|
||||||
|
|
||||||
@@ -325,7 +325,7 @@ Phases execute in numeric order: 1 → 2 → 3 → ... → 18
|
|||||||
| 6. Output, Reporting & Key Management | 0/? | Not started | - |
|
| 6. Output, Reporting & Key Management | 0/? | Not started | - |
|
||||||
| 7. Import Adapters & CI/CD Integration | 0/? | Not started | - |
|
| 7. Import Adapters & CI/CD Integration | 0/? | Not started | - |
|
||||||
| 8. Dork Engine | 0/? | Not started | - |
|
| 8. Dork Engine | 0/? | Not started | - |
|
||||||
| 9. OSINT Infrastructure | 0/? | Not started | - |
|
| 9. OSINT Infrastructure | 2/6 | In Progress | |
|
||||||
| 10. OSINT Code Hosting | 0/? | Not started | - |
|
| 10. OSINT Code Hosting | 0/? | Not started | - |
|
||||||
| 11. OSINT Search & Paste | 0/? | Not started | - |
|
| 11. OSINT Search & Paste | 0/? | Not started | - |
|
||||||
| 12. OSINT IoT & Cloud Storage | 0/? | Not started | - |
|
| 12. OSINT IoT & Cloud Storage | 0/? | Not started | - |
|
||||||
|
|||||||
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
|
|||||||
milestone: v1.0
|
milestone: v1.0
|
||||||
milestone_name: milestone
|
milestone_name: milestone
|
||||||
status: executing
|
status: executing
|
||||||
stopped_at: Completed 08-07-PLAN.md
|
stopped_at: Completed 09-04-PLAN.md
|
||||||
last_updated: "2026-04-05T21:32:47.810Z"
|
last_updated: "2026-04-05T21:43:35.883Z"
|
||||||
last_activity: 2026-04-05
|
last_activity: 2026-04-05
|
||||||
progress:
|
progress:
|
||||||
total_phases: 18
|
total_phases: 18
|
||||||
completed_phases: 8
|
completed_phases: 8
|
||||||
total_plans: 47
|
total_plans: 53
|
||||||
completed_plans: 47
|
completed_plans: 49
|
||||||
percent: 20
|
percent: 20
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -21,12 +21,12 @@ progress:
|
|||||||
See: .planning/PROJECT.md (updated 2026-04-04)
|
See: .planning/PROJECT.md (updated 2026-04-04)
|
||||||
|
|
||||||
**Core value:** Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.
|
**Core value:** Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.
|
||||||
**Current focus:** Phase 08 — dork-engine
|
**Current focus:** Phase 09 — osint-infrastructure
|
||||||
|
|
||||||
## Current Position
|
## Current Position
|
||||||
|
|
||||||
Phase: 9
|
Phase: 09 (osint-infrastructure) — EXECUTING
|
||||||
Plan: Not started
|
Plan: 2 of 6
|
||||||
Status: Ready to execute
|
Status: Ready to execute
|
||||||
Last activity: 2026-04-05
|
Last activity: 2026-04-05
|
||||||
|
|
||||||
@@ -82,6 +82,7 @@ Progress: [██░░░░░░░░] 20%
|
|||||||
| Phase 08-dork-engine P02 | 12min | 2 tasks | 11 files |
|
| Phase 08-dork-engine P02 | 12min | 2 tasks | 11 files |
|
||||||
| Phase 08-dork-engine P03 | 10m | 2 tasks | 10 files |
|
| Phase 08-dork-engine P03 | 10m | 2 tasks | 10 files |
|
||||||
| Phase 08-dork-engine P07 | 3m | 1 tasks | 1 files |
|
| Phase 08-dork-engine P07 | 3m | 1 tasks | 1 files |
|
||||||
|
| Phase 09-osint-infrastructure P04 | 6min | 2 tasks | 4 files |
|
||||||
|
|
||||||
## Accumulated Context
|
## Accumulated Context
|
||||||
|
|
||||||
@@ -129,6 +130,6 @@ None yet.
|
|||||||
|
|
||||||
## Session Continuity
|
## Session Continuity
|
||||||
|
|
||||||
Last session: 2026-04-05T21:25:47.469Z
|
Last session: 2026-04-05T21:43:35.879Z
|
||||||
Stopped at: Completed 08-07-PLAN.md
|
Stopped at: Completed 09-04-PLAN.md
|
||||||
Resume file: None
|
Resume file: None
|
||||||
|
|||||||
108
.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md
Normal file
@@ -0,0 +1,108 @@
|
|||||||
|
---
|
||||||
|
phase: 09-osint-infrastructure
|
||||||
|
plan: 04
|
||||||
|
subsystem: recon
|
||||||
|
tags: [recon, robots, http, cache]
|
||||||
|
requires: [github.com/temoto/robotstxt]
|
||||||
|
provides:
|
||||||
|
- pkg/recon.RobotsCache
|
||||||
|
- pkg/recon.NewRobotsCache
|
||||||
|
affects: [pkg/recon]
|
||||||
|
tech-stack:
|
||||||
|
added: [github.com/temoto/robotstxt@v1.1.2]
|
||||||
|
patterns: [per-host TTL cache, default-allow on error, injectable http.Client]
|
||||||
|
key-files:
|
||||||
|
created:
|
||||||
|
- pkg/recon/robots.go
|
||||||
|
- pkg/recon/robots_test.go
|
||||||
|
modified:
|
||||||
|
- go.mod
|
||||||
|
- go.sum
|
||||||
|
decisions:
|
||||||
|
- default-allow on fetch/parse error so a broken robots endpoint does not silently disable sources
|
||||||
|
- cache key is host (not full URL) to share TTL across all paths on a site
|
||||||
|
- UA matched is "keyhunter"; temoto/robotstxt TestAgent handles group precedence
|
||||||
|
- RobotsCache.Client field lets tests inject httptest.Server.Client()
|
||||||
|
metrics:
|
||||||
|
duration: ~6 min
|
||||||
|
completed: 2026-04-05
|
||||||
|
tasks: 2
|
||||||
|
files: 4
|
||||||
|
requirements: [RECON-INFRA-07]
|
||||||
|
---
|
||||||
|
|
||||||
|
# Phase 9 Plan 04: robots.txt Parser with 1h Cache Summary
|
||||||
|
|
||||||
|
RobotsCache parses robots.txt via temoto/robotstxt and caches the parsed result per host for 1 hour, with default-allow on network/parse errors and an injectable HTTP client for tests.
|
||||||
|
|
||||||
|
## Objective
|
||||||
|
|
||||||
|
Provide the foundation for every web-scraping recon source (Phase 11 paste, Phase 15 forums, etc.): a thread-safe, per-host robots.txt cache that is consulted only by sources whose `RespectsRobots()` returns true. Satisfies RECON-INFRA-07.
|
||||||
|
|
||||||
|
## What Was Built
|
||||||
|
|
||||||
|
- `pkg/recon/robots.go` (95 lines) — `RobotsCache` struct with `NewRobotsCache()` constructor and `Allowed(ctx, rawURL) (bool, error)` method.
|
||||||
|
- `pkg/recon/robots_test.go` (118 lines) — 5 tests covering allow/disallow/cache-hit/network-error/keyhunter-UA.
|
||||||
|
- `github.com/temoto/robotstxt@v1.1.2` added to `go.mod` and `go.sum`.
|
||||||
|
|
||||||
|
## Behavior
|
||||||
|
|
||||||
|
- Parses URL, looks up host in `sync.Mutex`-guarded map.
|
||||||
|
- Cache hit within 1h TTL returns `entry.data.TestAgent(path, "keyhunter")` without network I/O.
|
||||||
|
- Cache miss fetches `<scheme>://<host>/robots.txt`; on success, stores `robotstxt.RobotsData`.
|
||||||
|
- Network error, HTTP 4xx/5xx, read error, or parse error => returns `true, nil` (default-allow).
|
||||||
|
- `Client` field (nil -> `http.DefaultClient`) lets tests inject `httptest.Server.Client()`.
|
||||||
|
|
||||||
|
## Tasks
|
||||||
|
|
||||||
|
| # | Name | Commit | Files |
|
||||||
|
|---|------|--------|-------|
|
||||||
|
| 1 | Add temoto/robotstxt dependency | c3b9fb4 | go.mod, go.sum |
|
||||||
|
| 2a | Failing tests for RobotsCache (RED) | 4bd6c6b | pkg/recon/robots_test.go |
|
||||||
|
| 2b | Implement RobotsCache (GREEN) | 0373931 | pkg/recon/robots.go |
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
- `go test ./pkg/recon/ -run TestRobots -count=1 -v` — 5/5 PASS
|
||||||
|
  - TestRobotsAllowed
|
||||||
|
  - TestRobotsDisallowed
|
||||||
|
  - TestRobotsCacheHit (request counter == 1 after 3 calls)
|
||||||
|
  - TestRobotsNetworkError (500 -> allowed)
|
||||||
|
  - TestRobotsUAKeyhunter (keyhunter group matched over wildcard)
|
||||||
|
- `go vet ./pkg/recon/...` — clean
|
||||||
|
- `go build ./...` — clean
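The cache-hit check (request counter == 1 after 3 calls) relies on a common `net/http/httptest` pattern: count requests in the handler and inject the server's client into the code under test. The sketch below shows that pattern in isolation. `memoFetcher` and `countRequests` are hypothetical names used for illustration only; the actual test lives in `pkg/recon/robots_test.go` and is not reproduced here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// memoFetcher is a hypothetical stand-in for the cache under test: it
// remembers the first response body per URL and never refetches it.
type memoFetcher struct {
	Client *http.Client // nil falls back to http.DefaultClient

	mu     sync.Mutex
	bodies map[string]string
}

func (m *memoFetcher) client() *http.Client {
	if m.Client != nil {
		return m.Client
	}
	return http.DefaultClient
}

// Get returns the cached body for url, fetching it only on the first call.
func (m *memoFetcher) Get(url string) (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if body, ok := m.bodies[url]; ok {
		return body, nil
	}
	resp, err := m.client().Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if m.bodies == nil {
		m.bodies = make(map[string]string)
	}
	m.bodies[url] = string(raw)
	return string(raw), nil
}

// countRequests performs n lookups against a throwaway test server and
// reports how many requests actually reached it.
func countRequests(n int) (int64, error) {
	var hits int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&hits, 1)
		fmt.Fprint(w, "User-agent: *\nDisallow: /private\n")
	}))
	defer srv.Close()

	m := &memoFetcher{Client: srv.Client()} // inject the test server's client
	for i := 0; i < n; i++ {
		if _, err := m.Get(srv.URL + "/robots.txt"); err != nil {
			return 0, err
		}
	}
	return atomic.LoadInt64(&hits), nil
}

func main() {
	hits, err := countRequests(3)
	fmt.Println(hits, err)
}
```

A counter greater than 1 after repeated lookups would indicate that the cache is being bypassed.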
|
||||||
|
|
||||||
|
## Decisions Made
|
||||||
|
|
||||||
|
1. **Default-allow on error** — preferred over fail-closed to avoid silently disabling recon sources when a target has a misconfigured/500ing robots endpoint. Documented inline and in tests.
|
||||||
|
2. **Host-keyed cache** — a single robots.txt governs all paths on a host; caching by host shares TTL across every request to that site.
|
||||||
|
3. **Injectable Client** — `Client *http.Client` field on `RobotsCache` instead of a functional-option constructor to keep the surface minimal; tests pass `srv.Client()` directly.
|
||||||
|
4. **UA "keyhunter" literal** — constant `robotsUA = "keyhunter"` so upstream sources do not need to pass it. temoto/robotstxt `TestAgent` handles group precedence (specific UA beats wildcard).
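To make decision 4 concrete, the following sketch illustrates what "specific UA beats wildcard" means for a robots.txt file with both a `*` group and a `keyhunter` group. It is a deliberately simplified, standard-library-only illustration, not how temoto/robotstxt implements `TestAgent`: `parseGroups` and `allowed` are hypothetical helpers that handle only `User-agent` and `Disallow` lines with exact agent names and plain prefix matching.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseGroups maps each lower-cased User-agent token to the Disallow
// prefixes of its group. Allow lines, wildcards, and comments are ignored.
func parseGroups(robots string) map[string][]string {
	groups := make(map[string][]string)
	var agents []string // agents named by the group currently being read
	inRules := false    // true once the current group has seen a rule line
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)
		switch key {
		case "user-agent":
			if inRules { // a User-agent line after rules starts a new group
				agents, inRules = nil, false
			}
			agent := strings.ToLower(val)
			agents = append(agents, agent)
			if _, seen := groups[agent]; !seen {
				groups[agent] = nil // record the group even if it has no rules
			}
		case "disallow":
			inRules = true
			if val == "" {
				continue // an empty Disallow permits everything
			}
			for _, agent := range agents {
				groups[agent] = append(groups[agent], val)
			}
		}
	}
	return groups
}

// allowed applies the agent's own group if one exists, else the "*" group.
func allowed(groups map[string][]string, agent, path string) bool {
	rules, ok := groups[strings.ToLower(agent)]
	if !ok {
		rules = groups["*"]
	}
	for _, prefix := range rules {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	const robots = "User-agent: *\nDisallow: /\n\nUser-agent: keyhunter\nDisallow: /private\n"
	g := parseGroups(robots)
	fmt.Println(allowed(g, "keyhunter", "/public"))    // keyhunter group applies
	fmt.Println(allowed(g, "otherbot", "/public"))     // falls back to "*"
	fmt.Println(allowed(g, "keyhunter", "/private/x")) // blocked by keyhunter group
}
```

With this input the wildcard group blocks everything, yet `keyhunter` may still fetch `/public` because its own group takes precedence.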
|
||||||
|
|
||||||
|
## Deviations from Plan
|
||||||
|
|
||||||
|
### Out-of-scope discovery (NOT fixed, logged only)
|
||||||
|
|
||||||
|
**Untracked `pkg/recon/engine_test.go`** — present in the working tree from parallel wave work (plan 09-05 scaffold). It references `NewEngine`/`ExampleSource` which do not yet exist, causing the test binary for `pkg/recon` to fail to build until those types land.
|
||||||
|
|
||||||
|
- **Action taken:** Temporarily moved the file aside during the test run, verified that the 5 tests pass and that `go vet` and `go build ./...` are clean, then restored the file unchanged.
|
||||||
|
- **Why not fixed:** Not caused by this plan (files_modified is `pkg/recon/robots.go` + `pkg/recon/robots_test.go` only). Belongs to plan 09-05 (wave 2, depends on 09-04). Fixing would be scope creep.
|
||||||
|
- **Impact on 09-04 verification:** None for the code delivered by this plan. With the orphan file moved aside, the 5 tests pass and `go vet` and `go build ./...` are clean. With the file in place, under the current repository state, `go test ./pkg/recon/...` fails on the orphan test file, not on this plan's code. Plan 09-05 will resolve this when it lands `engine.go` and `example.go`.
|
||||||
|
|
||||||
|
Otherwise: plan executed exactly as written.
|
||||||
|
|
||||||
|
## Known Stubs
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
## Self-Check: PASSED
|
||||||
|
|
||||||
|
- pkg/recon/robots.go — FOUND
|
||||||
|
- pkg/recon/robots_test.go — FOUND
|
||||||
|
- go.mod contains github.com/temoto/robotstxt — VERIFIED
|
||||||
|
- Commit c3b9fb4 (Task 1) — FOUND
|
||||||
|
- Commit 4bd6c6b (Task 2 RED) — FOUND
|
||||||
|
- Commit 0373931 (Task 2 GREEN) — FOUND
|
||||||
|
- `go test ./pkg/recon/ -run TestRobots` — 5/5 PASS
|
||||||
|
- `go vet ./pkg/recon/...` — clean
|
||||||
|
- `go build ./...` — clean
|
||||||