---
phase: 10-osint-code-hosting
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - pkg/recon/sources/doc.go
  - pkg/recon/sources/httpclient.go
  - pkg/recon/sources/httpclient_test.go
  - pkg/recon/sources/queries.go
  - pkg/recon/sources/queries_test.go
  - pkg/recon/sources/register.go
autonomous: true
requirements: []
must_haves:
  truths:
    - "Shared retry HTTP client honors ctx cancellation and Retry-After on 429/403"
    - "Provider registry drives per-source query templates (no hardcoded literals)"
    - "Empty source registry compiles and exposes RegisterAll(engine, cfg)"
  artifacts:
    - path: "pkg/recon/sources/httpclient.go"
      provides: "Retrying *http.Client with context + Retry-After handling"
    - path: "pkg/recon/sources/queries.go"
      provides: "BuildQueries(registry, sourceName) []string generator"
    - path: "pkg/recon/sources/register.go"
      provides: "RegisterAll(engine *recon.Engine, cfg SourcesConfig) bootstrap"
  key_links:
    - from: "pkg/recon/sources/httpclient.go"
      to: "net/http + context"
      via: "Client.Do(ctx, req) (*http.Response, error)"
      pattern: "c.HTTP.Do"
    - from: "pkg/recon/sources/queries.go"
      to: "pkg/providers.Registry"
      via: "BuildQueries iterates reg.List() and formats provider keywords"
      pattern: "BuildQueries"
---
<objective>
Establish the shared foundation for all Phase 10 code hosting sources: a retry-aware HTTP
client wrapper, a provider→query template generator driven by the provider registry, and
an empty RegisterAll bootstrap that Plan 10-09 will fill in. No individual source is
implemented here — this plan exists so Wave 2 plans (10-02..10-08) can run in parallel
without fighting over shared helpers.
Purpose: Deduplicate retry/rate-limit/backoff logic across 10 sources; centralize query
generation so providers added later automatically flow to every source.
Output: Compilable `pkg/recon/sources` package skeleton with tested helpers.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@pkg/recon/source.go
@pkg/recon/limiter.go
@pkg/dorks/github.go
@pkg/providers/registry.go
<interfaces>
From pkg/recon/source.go:
```go
type ReconSource interface {
	Name() string
	RateLimit() rate.Limit
	Burst() int
	RespectsRobots() bool
	Enabled(cfg Config) bool
	Sweep(ctx context.Context, query string, out chan<- Finding) error
}
type Finding = engine.Finding
type Config struct { Stealth, RespectRobots bool; EnabledSources []string; Query string }
```
From pkg/recon/limiter.go:
```go
type LimiterRegistry struct { ... }
func NewLimiterRegistry() *LimiterRegistry
func (lr *LimiterRegistry) Wait(ctx, name, r, burst, stealth) error
```
From pkg/providers/registry.go:
```go
func (r *Registry) List() []Provider
// Provider has: Name string, Keywords []string, Patterns []Pattern, Tier int
```
From pkg/engine/finding.go:
```go
type Finding struct {
	ProviderName, KeyValue, KeyMasked, Confidence, Source, SourceType string
	LineNumber int; Offset int64; DetectedAt time.Time
	Verified bool; VerifyStatus string; ...
}
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Shared retry HTTP client helper</name>
<files>pkg/recon/sources/doc.go, pkg/recon/sources/httpclient.go, pkg/recon/sources/httpclient_test.go</files>
<behavior>
- Test A: 200 OK returns response unchanged, body readable
- Test B: 429 with Retry-After:1 triggers one retry then succeeds (verify via httptest counter)
- Test C: 403 with Retry-After triggers retry
- Test D: 401 returns ErrUnauthorized immediately, no retry
- Test E: Ctx cancellation during retry sleep returns ctx.Err()
- Test F: MaxRetries exhausted returns wrapped last-status error
</behavior>
<action>
Create `pkg/recon/sources/doc.go` with the package comment: "Package sources hosts per-OSINT-source ReconSource implementations for Phase 10 code hosting (GitHub, GitLab, Bitbucket, Gist, Codeberg, HuggingFace, Kaggle, Replit, CodeSandbox, sandboxes). Each source implements pkg/recon.ReconSource."
Create `pkg/recon/sources/httpclient.go` exporting:
```go
package sources

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// ErrUnauthorized is returned when an API rejects credentials (401).
var ErrUnauthorized = errors.New("sources: unauthorized (check credentials)")

// Client is the shared retry wrapper every Phase 10 source uses.
type Client struct {
	HTTP       *http.Client
	MaxRetries int    // default 2
	UserAgent  string // default "keyhunter-recon/1.0"
}

// NewClient returns a Client with a 30s timeout and 2 retries.
func NewClient() *Client {
	return &Client{HTTP: &http.Client{Timeout: 30 * time.Second}, MaxRetries: 2, UserAgent: "keyhunter-recon/1.0"}
}

// Do executes req with retries on 429/403/5xx honoring Retry-After.
// 401 returns ErrUnauthorized wrapped with the response body.
// Ctx cancellation is honored during sleeps.
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
	if req.Header.Get("User-Agent") == "" {
		req.Header.Set("User-Agent", c.UserAgent)
	}
	for attempt := 0; attempt <= c.MaxRetries; attempt++ {
		r, err := c.HTTP.Do(req.WithContext(ctx))
		if err != nil {
			return nil, fmt.Errorf("sources http: %w", err)
		}
		if r.StatusCode == http.StatusOK {
			return r, nil
		}
		if r.StatusCode == http.StatusUnauthorized {
			body := readBody(r)
			return nil, fmt.Errorf("%w: %s", ErrUnauthorized, body)
		}
		retriable := r.StatusCode == 429 || r.StatusCode == 403 || r.StatusCode >= 500
		if !retriable || attempt == c.MaxRetries {
			body := readBody(r)
			return nil, fmt.Errorf("sources http %d: %s", r.StatusCode, body)
		}
		// Read the header before closing the body, then sleep or bail on cancel.
		sleep := ParseRetryAfter(r.Header.Get("Retry-After"))
		r.Body.Close()
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	// Only reached when MaxRetries is negative and the loop never runs.
	return nil, fmt.Errorf("sources http: retries exhausted")
}

// ParseRetryAfter decodes integer-seconds Retry-After, defaulting to 1s.
func ParseRetryAfter(v string) time.Duration { ... }

// readBody reads up to 4KB of the body and closes it.
func readBody(r *http.Response) string { ... }
```
Create `pkg/recon/sources/httpclient_test.go` using `net/http/httptest`:
- Table-driven tests for each behavior above. Use an atomic counter to verify
retry attempt counts. Use `httptest.NewServer` with a handler that switches on
a request counter.
- For the ctx cancellation test: set Retry-After: 10, cancel ctx after 100ms, and assert
that ctx.Err() is returned within 500ms.
Do NOT build a LimiterRegistry wrapper here: each source calls LimiterRegistry.Wait itself
before calling Client.Do. This keeps Client single-purpose (retry only).
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestClient -v -timeout 30s</automated>
</verify>
<done>
All behaviors covered; Client.Do retries on 429/403/5xx honoring Retry-After; 401
returns ErrUnauthorized immediately; ctx cancellation respected; tests green.
</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: Provider-driven query generator + RegisterAll skeleton</name>
<files>pkg/recon/sources/queries.go, pkg/recon/sources/queries_test.go, pkg/recon/sources/register.go</files>
<behavior>
- Test A: BuildQueries(reg, "github") returns one query per distinct keyword (deduplicated across providers, see Test F), formatted as GitHub search syntax, e.g. `"sk-proj-" in:file`
- Test B: BuildQueries(reg, "gitlab") returns queries formatted for GitLab search syntax (raw keyword, no `in:file`)
- Test C: BuildQueries(reg, "huggingface") returns bare keyword queries
- Test D: Unknown source name returns bare keyword queries (safe default)
- Test E: Providers with empty Keywords slice are skipped
- Test F: Keyword dedup — if two providers share keyword, emit once per source
- Test G: RegisterAll(nil, cfg) is a no-op that does not panic; RegisterAll with empty cfg does not panic
</behavior>
<action>
Create `pkg/recon/sources/queries.go`:
```go
package sources

import (
	"fmt"
	"sort"

	"github.com/salvacybersec/keyhunter/pkg/providers"
)

// BuildQueries produces the search-string list a source should iterate for a
// given provider registry. Each keyword is formatted per source-specific syntax.
// Result is deterministic (sorted) for reproducible tests.
func BuildQueries(reg *providers.Registry, source string) []string {
	if reg == nil {
		return nil
	}
	// Collect distinct non-empty keywords across all providers.
	seen := make(map[string]struct{})
	for _, p := range reg.List() {
		for _, k := range p.Keywords {
			if k == "" {
				continue
			}
			seen[k] = struct{}{}
		}
	}
	keywords := make([]string, 0, len(seen))
	for k := range seen {
		keywords = append(keywords, k)
	}
	sort.Strings(keywords)
	out := make([]string, 0, len(keywords))
	for _, k := range keywords {
		out = append(out, formatQuery(source, k))
	}
	return out
}

func formatQuery(source, keyword string) string {
	switch source {
	case "github", "gist":
		return fmt.Sprintf("%q in:file", keyword)
	case "gitlab", "bitbucket", "codeberg":
		// GitLab code search doesn't support the in:file qualifier;
		// Bitbucket and Codeberg also take the raw keyword.
		return keyword
	default:
		return keyword
	}
}
```
Create `pkg/recon/sources/queries_test.go` using `providers.NewRegistryFromProviders`
with two synthetic providers (shared keyword to test dedup).
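As a worked example of the dedup and ordering expectations, the sketch below uses a hypothetical `provider` stand-in (not the real `providers.Provider`) and a hypothetical `githubQueries` helper that mirrors the GitHub branch of the logic above. The keywords `sk-`, `sk-proj-`, and `hf_` are illustrative fixtures only.

```go
package main

import (
	"fmt"
	"sort"
)

// provider is a hypothetical stand-in carrying only the fields used here.
type provider struct {
	Name     string
	Keywords []string
}

// githubQueries dedups keywords across providers, sorts them, and then
// formats each one with the GitHub "in:file" qualifier.
func githubQueries(ps []provider) []string {
	seen := make(map[string]struct{})
	for _, p := range ps {
		for _, k := range p.Keywords {
			if k != "" {
				seen[k] = struct{}{}
			}
		}
	}
	keywords := make([]string, 0, len(seen))
	for k := range seen {
		keywords = append(keywords, k)
	}
	sort.Strings(keywords)
	out := make([]string, 0, len(keywords))
	for _, k := range keywords {
		out = append(out, fmt.Sprintf("%q in:file", k))
	}
	return out
}

// fixtures returns two synthetic providers that share the keyword "sk-";
// the second also carries an empty keyword that must be skipped.
func fixtures() []provider {
	return []provider{
		{Name: "alpha", Keywords: []string{"sk-", "sk-proj-"}},
		{Name: "beta", Keywords: []string{"sk-", "", "hf_"}},
	}
}
```

With these fixtures, `githubQueries(fixtures())` yields three entries in sorted order: `"hf_" in:file`, `"sk-" in:file`, `"sk-proj-" in:file`. The shared keyword appears once and the empty one is dropped.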
Create `pkg/recon/sources/register.go`:
```go
package sources

import (
	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/recon"
)

// SourcesConfig carries per-source credentials read from viper/env by cmd/recon.go.
// Plan 10-09 fleshes this out; for now it is a placeholder struct so downstream
// plans can depend on its shape.
type SourcesConfig struct {
	GitHubToken      string
	GitLabToken      string
	BitbucketToken   string
	HuggingFaceToken string
	KaggleUser       string
	KaggleKey        string
	Registry         *providers.Registry
	Limiters         *recon.LimiterRegistry
}

// RegisterAll registers every Phase 10 code-hosting source on engine.
// Wave 2 plans append their source constructors here via additional
// registerXxx helpers in this file. Plan 10-09 writes the final list.
func RegisterAll(engine *recon.Engine, cfg SourcesConfig) {
	if engine == nil {
		return
	}
	// Populated by Plan 10-09 (after Wave 2 lands individual source files).
}
```
Do NOT wire this into cmd/recon.go yet — Plan 10-09 handles CLI integration after
every source exists.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestBuildQueries|TestRegisterAll" -v -timeout 30s && go build ./...</automated>
</verify>
<done>
BuildQueries is deterministic, dedups keywords, formats per-source syntax.
RegisterAll compiles as a no-op stub. Package builds with zero source
implementations — ready for Wave 2 plans to add files in parallel.
</done>
</task>
</tasks>
<verification>
- `go build ./...` succeeds
- `go test ./pkg/recon/sources/...` passes
- `go vet ./pkg/recon/sources/...` clean
</verification>
<success_criteria>
pkg/recon/sources package exists with httpclient.go, queries.go, register.go, doc.go
and all tests green. No source implementations present yet — that is Wave 2.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md`.
</output>