docs(09): create phase plan
This commit is contained in:
196 .planning/phases/09-osint-infrastructure/09-04-PLAN.md (new file)

@@ -0,0 +1,196 @@
---
phase: 09-osint-infrastructure
plan: 04
type: execute
wave: 1
depends_on: []
files_modified:
  - pkg/recon/robots.go
  - pkg/recon/robots_test.go
  - go.mod
  - go.sum
autonomous: true
requirements: [RECON-INFRA-07]
must_haves:
  truths:
    - "pkg/recon.RobotsCache parses and caches robots.txt per host for 1 hour"
    - "Allowed(ctx, rawURL) returns true if robots.txt permits the `keyhunter` UA on that URL's path"
    - "Cache hit avoids a second HTTP fetch for the same host within TTL"
    - "Network errors degrade safely: default-allow (so a broken robots.txt fetch does not silently block sweeps)"
  artifacts:
    - path: "pkg/recon/robots.go"
      provides: "RobotsCache with Allowed(ctx, rawURL) (bool, error) + 1h per-host TTL"
      contains: "type RobotsCache"
    - path: "pkg/recon/robots_test.go"
      provides: "Tests for parse/allowed/disallowed/cache-hit/network-fail"
  key_links:
    - from: "pkg/recon/robots.go"
      to: "github.com/temoto/robotstxt"
      via: "robotstxt.FromBytes"
      pattern: "robotstxt\\."
---

<objective>
Add a robots.txt parser and per-host cache for web-scraping sources. Satisfies RECON-INFRA-07 ("`keyhunter recon full --respect-robots` respects robots.txt for web-scraping sources before making any requests"). Only sources with RespectsRobots() == true consult the cache.

Purpose: foundation for every later web-scraping source (Phase 11 paste sites, Phase 15 forums, etc.). Adds the github.com/temoto/robotstxt dependency.
Output: pkg/recon/robots.go, pkg/recon/robots_test.go, go.mod/go.sum updated.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/09-osint-infrastructure/09-CONTEXT.md
@go.mod
</context>

<tasks>

<task type="auto">
<name>Task 1: Add temoto/robotstxt dependency</name>
<files>go.mod, go.sum</files>
<action>
Run `go get github.com/temoto/robotstxt@latest` from the repo root. This updates go.mod and go.sum. Do NOT run `go mod tidy` yet: nothing imports the package until the next task, so tidy would drop the dependency again. Prefer `go mod download github.com/temoto/robotstxt` if only go.sum needs populating, but `go get` is the canonical step.

Verify the dependency appears in the go.mod `require` block.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && grep -q "github.com/temoto/robotstxt" go.mod</automated>
</verify>
<done>go.mod contains github.com/temoto/robotstxt; go.sum populated.</done>
</task>

<task type="auto" tdd="true">
<name>Task 2: RobotsCache with 1h TTL and default-allow on error</name>
<files>pkg/recon/robots.go, pkg/recon/robots_test.go</files>
<behavior>
- RobotsCache.Allowed(ctx, rawURL) (bool, error): parse rawURL to get the host, fetch scheme://host/robots.txt (using an injected http.Client in tests), and cache the parsed result for 1 hour per host
- The UA used for rule matching is "keyhunter"
- On fetch or parse error: return true, nil (default-allow) so a broken robots endpoint does not silently disable a recon source
- Cache key is the host (not the full URL)
- A second call for the same host within the TTL does NOT trigger another HTTP request
- Tests use httptest.Server to serve robots.txt and inject a custom http.Client via the RobotsCache.Client field
- Tests:
  - TestRobotsAllowed: robots.txt says "User-agent: * / Disallow:" and path /public -> Allowed returns true
  - TestRobotsDisallowed: robots.txt says "User-agent: * / Disallow: /private" and path /private -> false
  - TestRobotsCacheHit: after the first call, the second call hits the cache (use an atomic counter in the httptest handler and assert count == 1)
  - TestRobotsNetworkError: server returns 500 -> Allowed returns true (default-allow)
  - TestRobotsUAKeyhunter: robots.txt has "User-agent: keyhunter / Disallow: /blocked" -> path /blocked returns false
</behavior>
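For the UA-specific case, an illustrative fixture (not final test data) would give `keyhunter` its own group, so /blocked is denied for it while all other agents stay unrestricted:

```
User-agent: keyhunter
Disallow: /blocked

User-agent: *
Disallow:
```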
<action>
Create pkg/recon/robots.go:

```go
package recon

import (
	"context"
	"io"
	"net/http"
	"net/url"
	"sync"
	"time"

	"github.com/temoto/robotstxt"
)

const (
	robotsTTL = 1 * time.Hour
	robotsUA  = "keyhunter"
)

type robotsEntry struct {
	data    *robotstxt.RobotsData
	fetched time.Time
}

// RobotsCache fetches and caches per-host robots.txt for 1 hour.
// Sources whose RespectsRobots() returns true should call Allowed before each request.
type RobotsCache struct {
	mu     sync.Mutex
	cache  map[string]robotsEntry
	Client *http.Client // nil -> http.DefaultClient
}

func NewRobotsCache() *RobotsCache {
	return &RobotsCache{cache: make(map[string]robotsEntry)}
}

// Allowed reports whether `keyhunter` may fetch rawURL per the host's robots.txt.
// On fetch/parse error it returns true (default-allow) to avoid silently
// disabling recon sources when a site has a broken robots endpoint.
func (rc *RobotsCache) Allowed(ctx context.Context, rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return true, nil
	}
	host := u.Host

	rc.mu.Lock()
	entry, ok := rc.cache[host]
	if ok && time.Since(entry.fetched) < robotsTTL {
		rc.mu.Unlock()
		return entry.data.TestAgent(u.Path, robotsUA), nil
	}
	rc.mu.Unlock()

	client := rc.Client
	if client == nil {
		client = http.DefaultClient
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.Scheme+"://"+host+"/robots.txt", nil)
	if err != nil {
		return true, nil // malformed URL: default-allow rather than panic in client.Do
	}
	resp, err := client.Do(req)
	if err != nil {
		return true, nil // default-allow on network error
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return true, nil // default-allow on 4xx/5xx
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return true, nil
	}
	data, err := robotstxt.FromBytes(body)
	if err != nil {
		return true, nil
	}
	rc.mu.Lock()
	rc.cache[host] = robotsEntry{data: data, fetched: time.Now()}
	rc.mu.Unlock()
	return data.TestAgent(u.Path, robotsUA), nil
}
```

Create pkg/recon/robots_test.go using httptest.NewServer. Inject the test server's client into RobotsCache.Client (use `server.Client()`). For TestRobotsCacheHit, use an `atomic.Int32` incremented inside the handler.

Note on the test URL: since httptest.Server has a dynamic host, build rawURL from `server.URL + "/public"`. The cache key will be the httptest host:port; both calls share the same host, so the cache hit is testable.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run TestRobots -count=1</automated>
</verify>
<done>All 5 robots tests pass. Cache hit verified via request counter. Default-allow on 500 verified.</done>
</task>

</tasks>

<verification>
- `go test ./pkg/recon/ -run TestRobots -count=1` passes
- `go build ./...` passes (robotstxt dep resolved)
- `go vet ./pkg/recon/...` clean
</verification>

<success_criteria>
- RobotsCache implemented with 1h TTL
- UA "keyhunter" matching
- Default-allow on network/parse errors
- github.com/temoto/robotstxt added to go.mod
</success_criteria>

<output>
After completion, create `.planning/phases/09-osint-infrastructure/09-04-SUMMARY.md`
</output>
</content>