| phase | plan | type | wave | depends_on | files_modified | autonomous | requirements | must_haves |
|---|---|---|---|---|---|---|---|---|
| 09-osint-infrastructure | 04 | execute | 1 | | | true | | |
Purpose: Foundation for every later web-scraping source (Phase 11 paste, Phase 15 forums, etc.). Adds github.com/temoto/robotstxt dependency. Output: pkg/recon/robots.go, pkg/recon/robots_test.go, go.mod/go.sum updated
<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>
@.planning/phases/09-osint-infrastructure/09-CONTEXT.md @go.mod
Task 1: Add temoto/robotstxt dependency
go.mod, go.sum
Run `go get github.com/temoto/robotstxt@latest` from the repo root; this updates go.mod and go.sum. Do NOT run `go mod tidy` yet: downstream tasks in this plan consume the dependency, and tidy will fail before the tests are written. Prefer `go mod download github.com/temoto/robotstxt` if only go.sum needs populating, but `go get` is canonical. Verify the dependency appears in the go.mod `require` block.
cd /home/salva/Documents/apikey && grep -q "github.com/temoto/robotstxt" go.mod
go.mod contains github.com/temoto/robotstxt; go.sum populated.
Task 2: RobotsCache with 1h TTL and default-allow on error
pkg/recon/robots.go, pkg/recon/robots_test.go
- RobotsCache.Allowed(ctx, rawURL) (bool, error): parse URL -> host, fetch https://host/robots.txt (or use injected http.Client for tests), cache parsed result for 1 hour per host
- UA used for matching is "keyhunter"
- On fetch error or parse error: return true, nil (default-allow) so a broken robots endpoint does not silently disable a recon source
- Cache key is host (not full URL)
- Second call for same host within TTL does NOT trigger another HTTP request
- Tests use httptest.Server to serve robots.txt and inject a custom http.Client via RobotsCache.Client field
- Tests:
- TestRobotsAllowed: robots.txt says "User-agent: * / Disallow:" and path /public -> Allowed returns true
- TestRobotsDisallowed: robots.txt says "User-agent: * / Disallow: /private" and path /private -> false
- TestRobotsCacheHit: after first call, second call hits cache (use an atomic counter in the httptest handler and assert count == 1)
- TestRobotsNetworkError: server returns 500 -> Allowed returns true (default-allow)
- TestRobotsUAKeyhunter: robots.txt has "User-agent: keyhunter / Disallow: /blocked" -> path /blocked returns false
Create pkg/recon/robots.go:
```go
package recon

import (
	"context"
	"io"
	"net/http"
	"net/url"
	"sync"
	"time"

	"github.com/temoto/robotstxt"
)

const (
	robotsTTL = 1 * time.Hour
	robotsUA  = "keyhunter"
)

type robotsEntry struct {
	data    *robotstxt.RobotsData
	fetched time.Time
}

// RobotsCache fetches and caches per-host robots.txt for 1 hour.
// Sources whose RespectsRobots() returns true should call Allowed before each request.
type RobotsCache struct {
	mu     sync.Mutex
	cache  map[string]robotsEntry
	Client *http.Client // nil -> http.DefaultClient
}

func NewRobotsCache() *RobotsCache {
	return &RobotsCache{cache: make(map[string]robotsEntry)}
}

// Allowed reports whether `keyhunter` may fetch rawURL per the host's robots.txt.
// On fetch/parse error the function returns true (default-allow) to avoid silently
// disabling recon sources when a site has a broken robots endpoint.
func (rc *RobotsCache) Allowed(ctx context.Context, rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return true, nil // unparseable URL: default-allow
	}
	host := u.Host

	rc.mu.Lock()
	entry, ok := rc.cache[host]
	if ok && time.Since(entry.fetched) < robotsTTL {
		rc.mu.Unlock()
		return entry.data.TestAgent(u.Path, robotsUA), nil
	}
	rc.mu.Unlock()

	client := rc.Client
	if client == nil {
		client = http.DefaultClient
	}
	scheme := u.Scheme
	if scheme == "" {
		scheme = "https" // spec: fetch https://host/robots.txt when no scheme is given
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, scheme+"://"+host+"/robots.txt", nil)
	if err != nil {
		return true, nil // default-allow on malformed request
	}
	resp, err := client.Do(req)
	if err != nil {
		return true, nil // default-allow on network error
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return true, nil // default-allow on 4xx/5xx
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return true, nil // default-allow on read error
	}
	data, err := robotstxt.FromBytes(body)
	if err != nil {
		return true, nil // default-allow on parse error
	}

	rc.mu.Lock()
	rc.cache[host] = robotsEntry{data: data, fetched: time.Now()}
	rc.mu.Unlock()
	return data.TestAgent(u.Path, robotsUA), nil
}
```
Create pkg/recon/robots_test.go using httptest.NewServer. Inject the test server's client into RobotsCache.Client (use `server.Client()`). For TestRobotsCacheHit, use `atomic.Int32` incremented inside the handler.
Note on test URL: since httptest.Server has a dynamic host, build rawURL from `server.URL + "/public"`. The cache key will be the httptest host:port — both calls share the same host, so cache hit is testable.
cd /home/salva/Documents/apikey && go test ./pkg/recon/ -run TestRobots -count=1
All 5 robots tests pass. Cache hit verified via request counter. Default-allow on 500 verified.
- `go test ./pkg/recon/ -run TestRobots -count=1` passes
- `go build ./...` passes (robotstxt dep resolved)
- `go vet ./pkg/recon/...` clean
<success_criteria>
- RobotsCache implemented with 1h TTL
- UA "keyhunter" matching
- Default-allow on network/parse errors
- github.com/temoto/robotstxt added to go.mod </success_criteria>