docs(10-osint-code-hosting): create phase 10 plans (9 plans across 3 waves)

This commit is contained in:
salvacybersec
2026-04-06 01:07:15 +03:00
parent cfe090a5c9
commit 191bdee3bc
10 changed files with 1611 additions and 1 deletion


@@ -215,7 +215,17 @@ Plans:
3. `keyhunter recon --sources=gist,bitbucket,codeberg` scans public gists, Bitbucket repos, and Codeberg/Gitea instances
4. `keyhunter recon --sources=replit,codesandbox,kaggle` scans public repls, sandboxes, and notebooks
5. All code hosting source findings are stored in the database with source attribution and deduplication
**Plans**: 9 plans
Plans:
- [ ] 10-01-PLAN.md — Shared HTTP client + provider-query generator + RegisterAll skeleton
- [ ] 10-02-PLAN.md — GitHubSource (RECON-CODE-01)
- [ ] 10-03-PLAN.md — GitLabSource (RECON-CODE-02)
- [ ] 10-04-PLAN.md — BitbucketSource + GistSource (RECON-CODE-03, RECON-CODE-04)
- [ ] 10-05-PLAN.md — CodebergSource/Gitea (RECON-CODE-05)
- [ ] 10-06-PLAN.md — HuggingFaceSource (RECON-CODE-08)
- [ ] 10-07-PLAN.md — Replit + CodeSandbox + Sandboxes scrapers (RECON-CODE-06, RECON-CODE-07, RECON-CODE-10)
- [ ] 10-08-PLAN.md — KaggleSource (RECON-CODE-09)
- [ ] 10-09-PLAN.md — RegisterAll wiring + CLI integration + end-to-end test
### Phase 11: OSINT Search & Paste
**Goal**: Users can run automated search engine dorking against Google, Bing, DuckDuckGo, Yandex, and Brave, and scan 15+ paste site aggregations for leaked API keys


@@ -0,0 +1,331 @@
---
phase: 10-osint-code-hosting
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- pkg/recon/sources/doc.go
- pkg/recon/sources/httpclient.go
- pkg/recon/sources/httpclient_test.go
- pkg/recon/sources/queries.go
- pkg/recon/sources/queries_test.go
- pkg/recon/sources/register.go
autonomous: true
requirements: []
must_haves:
truths:
- "Shared retry HTTP client honors ctx cancellation and Retry-After on 429/403"
- "Provider registry drives per-source query templates (no hardcoded literals)"
- "Empty source registry compiles and exposes RegisterAll(engine, cfg)"
artifacts:
- path: "pkg/recon/sources/httpclient.go"
provides: "Retrying *http.Client with context + Retry-After handling"
- path: "pkg/recon/sources/queries.go"
provides: "BuildQueries(registry, sourceName) []string generator"
- path: "pkg/recon/sources/register.go"
provides: "RegisterAll(engine *recon.Engine, cfg SourcesConfig) bootstrap"
key_links:
- from: "pkg/recon/sources/httpclient.go"
to: "net/http + context + golang.org/x/time/rate"
via: "Do(ctx, req) (*http.Response, error)"
pattern: "func \\(c \\*Client\\) Do"
- from: "pkg/recon/sources/queries.go"
to: "pkg/providers.Registry"
via: "BuildQueries iterates reg.List() and formats provider keywords"
pattern: "BuildQueries"
---
<objective>
Establish the shared foundation for all Phase 10 code hosting sources: a retry-aware HTTP
client wrapper, a provider→query template generator driven by the provider registry, and
an empty RegisterAll bootstrap that Plan 10-09 will fill in. No individual source is
implemented here — this plan exists so Wave 2 plans (10-02..10-08) can run in parallel
without fighting over shared helpers.
Purpose: Deduplicate retry/rate-limit/backoff logic across 10 sources; centralize query
generation so providers added later automatically flow to every source.
Output: Compilable `pkg/recon/sources` package skeleton with tested helpers.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@pkg/recon/source.go
@pkg/recon/limiter.go
@pkg/dorks/github.go
@pkg/providers/registry.go
<interfaces>
From pkg/recon/source.go:
```go
type ReconSource interface {
Name() string
RateLimit() rate.Limit
Burst() int
RespectsRobots() bool
Enabled(cfg Config) bool
Sweep(ctx context.Context, query string, out chan<- Finding) error
}
type Finding = engine.Finding
type Config struct { Stealth, RespectRobots bool; EnabledSources []string; Query string }
```
From pkg/recon/limiter.go:
```go
type LimiterRegistry struct { ... }
func NewLimiterRegistry() *LimiterRegistry
func (lr *LimiterRegistry) Wait(ctx context.Context, name string, r rate.Limit, burst int, stealth bool) error
```
From pkg/providers/registry.go:
```go
func (r *Registry) List() []Provider
// Provider has: Name string, Keywords []string, Patterns []Pattern, Tier int
```
From pkg/engine/finding.go:
```go
type Finding struct {
ProviderName, KeyValue, KeyMasked, Confidence, Source, SourceType string
LineNumber int; Offset int64; DetectedAt time.Time
Verified bool; VerifyStatus string; ...
}
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Shared retry HTTP client helper</name>
<files>pkg/recon/sources/doc.go, pkg/recon/sources/httpclient.go, pkg/recon/sources/httpclient_test.go</files>
<behavior>
- Test A: 200 OK returns response unchanged, body readable
- Test B: 429 with Retry-After:1 triggers one retry then succeeds (verify via httptest counter)
- Test C: 403 with Retry-After triggers retry
- Test D: 401 returns ErrUnauthorized immediately, no retry
- Test E: Ctx cancellation during retry sleep returns ctx.Err()
- Test F: MaxRetries exhausted returns wrapped last-status error
</behavior>
<action>
Create `pkg/recon/sources/doc.go` with the package comment: "Package sources hosts per-OSINT-source ReconSource implementations for Phase 10 code hosting (GitHub, GitLab, Bitbucket, Gist, Codeberg, HuggingFace, Kaggle, Replit, CodeSandbox, sandboxes). Each source implements pkg/recon.ReconSource."
Create `pkg/recon/sources/httpclient.go` exporting:
```go
package sources
import (
"context"
"errors"
"fmt"
"net/http"
"strconv"
"time"
)
// ErrUnauthorized is returned when an API rejects credentials (401).
var ErrUnauthorized = errors.New("sources: unauthorized (check credentials)")
// Client is the shared retry wrapper every Phase 10 source uses.
type Client struct {
HTTP *http.Client
MaxRetries int // default 2
UserAgent string // default "keyhunter-recon/1.0"
}
// NewClient returns a Client with a 30s timeout and 2 retries.
func NewClient() *Client {
return &Client{HTTP: &http.Client{Timeout: 30 * time.Second}, MaxRetries: 2, UserAgent: "keyhunter-recon/1.0"}
}
// Do executes req with retries on 429/403/5xx honoring Retry-After.
// 401 returns ErrUnauthorized wrapped with the response body.
// Ctx cancellation is honored during sleeps.
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
if req.Header.Get("User-Agent") == "" { req.Header.Set("User-Agent", c.UserAgent) }
for attempt := 0; attempt <= c.MaxRetries; attempt++ {
r, err := c.HTTP.Do(req.WithContext(ctx))
if err != nil { return nil, fmt.Errorf("sources http: %w", err) }
if r.StatusCode == http.StatusOK { return r, nil }
if r.StatusCode == http.StatusUnauthorized {
body := readBody(r)
return nil, fmt.Errorf("%w: %s", ErrUnauthorized, body)
}
retriable := r.StatusCode == 429 || r.StatusCode == 403 || r.StatusCode >= 500
if !retriable || attempt == c.MaxRetries {
body := readBody(r)
return nil, fmt.Errorf("sources http %d: %s", r.StatusCode, body)
}
sleep := ParseRetryAfter(r.Header.Get("Retry-After"))
r.Body.Close()
select {
case <-time.After(sleep):
case <-ctx.Done(): return nil, ctx.Err()
}
}
return nil, fmt.Errorf("sources http: retries exhausted")
}
// ParseRetryAfter decodes integer-seconds Retry-After, defaulting to 1s.
func ParseRetryAfter(v string) time.Duration {
if n, err := strconv.Atoi(v); err == nil && n > 0 {
return time.Duration(n) * time.Second
}
return time.Second
}
// readBody reads up to 4KB of the body and closes it.
func readBody(r *http.Response) string {
defer r.Body.Close()
buf := make([]byte, 4096)
n := 0
for n < len(buf) {
m, err := r.Body.Read(buf[n:])
n += m
if err != nil {
break
}
}
return string(buf[:n])
}
```
Create `pkg/recon/sources/httpclient_test.go` using `net/http/httptest`:
- Table-driven tests for each behavior above. Use an atomic counter to verify
retry attempt counts. Use `httptest.NewServer` with a handler that switches on
a request counter.
- For ctx cancellation test: set Retry-After: 10, cancel ctx inside 100ms, assert
ctx.Err() returned within 500ms.
Do NOT build a LimiterRegistry wrapper here — each source calls its own LimiterRegistry.Wait
before calling Client.Do. Keeps Client single-purpose (retry only).
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestClient -v -timeout 30s</automated>
</verify>
<done>
All behaviors covered; Client.Do retries on 429/403/5xx honoring Retry-After; 401
returns ErrUnauthorized immediately; ctx cancellation respected; tests green.
</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: Provider-driven query generator + RegisterAll skeleton</name>
<files>pkg/recon/sources/queries.go, pkg/recon/sources/queries_test.go, pkg/recon/sources/register.go</files>
<behavior>
- Test A: BuildQueries(reg, "github") returns one query per (provider, keyword) tuple formatted as GitHub search syntax, e.g. `"sk-proj-" in:file`
- Test B: BuildQueries(reg, "gitlab") returns queries formatted for GitLab search syntax (raw keyword, no `in:file`)
- Test C: BuildQueries(reg, "huggingface") returns bare keyword queries
- Test D: Unknown source name returns bare keyword queries (safe default)
- Test E: Providers with empty Keywords slice are skipped
- Test F: Keyword dedup — if two providers share keyword, emit once per source
- Test G: RegisterAll(nil, cfg) is a no-op that does not panic; RegisterAll with empty cfg does not panic
</behavior>
<action>
Create `pkg/recon/sources/queries.go`:
```go
package sources
import (
"fmt"
"sort"
"github.com/salvacybersec/keyhunter/pkg/providers"
)
// BuildQueries produces the search-string list a source should iterate for a
// given provider registry. Each keyword is formatted per source-specific syntax.
// Result is deterministic (sorted) for reproducible tests.
func BuildQueries(reg *providers.Registry, source string) []string {
if reg == nil { return nil }
seen := make(map[string]struct{})
for _, p := range reg.List() {
for _, k := range p.Keywords {
if k == "" { continue }
seen[k] = struct{}{}
}
}
keywords := make([]string, 0, len(seen))
for k := range seen { keywords = append(keywords, k) }
sort.Strings(keywords)
out := make([]string, 0, len(keywords))
for _, k := range keywords {
out = append(out, formatQuery(source, k))
}
return out
}
func formatQuery(source, keyword string) string {
switch source {
case "github", "gist":
return fmt.Sprintf("%q in:file", keyword)
case "gitlab":
return keyword // GitLab code search doesn't support in:file qualifier
case "bitbucket":
return keyword
case "codeberg":
return keyword
default:
return keyword
}
}
```
Create `pkg/recon/sources/queries_test.go` using `providers.NewRegistryFromProviders`
with two synthetic providers (shared keyword to test dedup).
Create `pkg/recon/sources/register.go`:
```go
package sources
import (
"github.com/salvacybersec/keyhunter/pkg/providers"
"github.com/salvacybersec/keyhunter/pkg/recon"
)
// SourcesConfig carries per-source credentials read from viper/env by cmd/recon.go.
// Plan 10-09 fleshes this out; for now it is a placeholder struct so downstream
// plans can depend on its shape.
type SourcesConfig struct {
GitHubToken string
GitLabToken string
BitbucketToken string
HuggingFaceToken string
KaggleUser string
KaggleKey string
Registry *providers.Registry
Limiters *recon.LimiterRegistry
}
// RegisterAll registers every Phase 10 code-hosting source on engine.
// Wave 2 plans append their source constructors here via additional
// registerXxx helpers in this file. Plan 10-09 writes the final list.
func RegisterAll(engine *recon.Engine, cfg SourcesConfig) {
if engine == nil { return }
// Populated by Plan 10-09 (after Wave 2 lands individual source files).
}
```
Do NOT wire this into cmd/recon.go yet — Plan 10-09 handles CLI integration after
every source exists.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestBuildQueries|TestRegisterAll" -v -timeout 30s && go build ./...</automated>
</verify>
<done>
BuildQueries is deterministic, dedups keywords, formats per-source syntax.
RegisterAll compiles as a no-op stub. Package builds with zero source
implementations — ready for Wave 2 plans to add files in parallel.
</done>
</task>
</tasks>
<verification>
- `go build ./...` succeeds
- `go test ./pkg/recon/sources/...` passes
- `go vet ./pkg/recon/sources/...` clean
</verification>
<success_criteria>
pkg/recon/sources package exists with httpclient.go, queries.go, register.go, doc.go
and all tests green. No source implementations present yet — that is Wave 2.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md`.
</output>


@@ -0,0 +1,238 @@
---
phase: 10-osint-code-hosting
plan: 02
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/github.go
- pkg/recon/sources/github_test.go
autonomous: true
requirements: [RECON-CODE-01]
must_haves:
truths:
- "GitHubSource.Sweep runs BuildQueries against GitHub /search/code and emits engine.Finding per match"
- "GitHubSource is disabled when cfg token is empty (logs and returns nil, no error)"
- "GitHubSource honors ctx cancellation mid-query and rate limiter tokens before each request"
- "Each Finding has SourceType=\"recon:github\" and Source = html_url"
artifacts:
- path: "pkg/recon/sources/github.go"
provides: "GitHubSource implementing recon.ReconSource"
contains: "func (s *GitHubSource) Sweep"
- path: "pkg/recon/sources/github_test.go"
provides: "httptest-driven unit tests"
key_links:
- from: "pkg/recon/sources/github.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do"
pattern: "s\\.client\\.Do"
- from: "pkg/recon/sources/github.go"
to: "pkg/recon/sources/queries.go"
via: "BuildQueries(reg, \"github\")"
pattern: "BuildQueries"
---
<objective>
Implement GitHubSource — the first real Phase 10 recon source. Refactors logic from
pkg/dorks/github.go (Phase 8's GitHubExecutor) into a recon.ReconSource. Emits
engine.Finding entries for every /search/code match, driven by provider keyword
queries from pkg/recon/sources/queries.go.
Purpose: RECON-CODE-01 — users can scan GitHub public code for leaked LLM keys.
Output: pkg/recon/sources/github.go + green tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/limiter.go
@pkg/dorks/github.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
@pkg/recon/sources/register.go
<interfaces>
Reference pkg/dorks/github.go for the response struct shapes (ghSearchResponse,
ghCodeItem, ghRepository, ghTextMatchEntry) — copy or alias them. GitHub Code Search
endpoint: GET /search/code?q=<query>&per_page=<n> with headers:
- Accept: application/vnd.github.v3.text-match+json
- Authorization: Bearer <token>
- User-Agent: keyhunter-recon
Rate limit: 30 req/min authenticated → rate.Every(2*time.Second), burst 1.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: GitHubSource implementation + tests</name>
<files>pkg/recon/sources/github.go, pkg/recon/sources/github_test.go</files>
<behavior>
- Test A: Enabled returns false when token empty; true when token set
- Test B: Sweep with empty token returns nil (no error, logs disabled)
- Test C: Sweep against httptest server decodes a 2-item response, emits 2 Findings on channel with SourceType="recon:github" and Source=html_url
- Test D: ProviderName is derived by matching query keyword back to provider via the registry (pass in synthetic registry)
- Test E: Ctx cancellation before first request returns ctx.Err()
- Test F: 401 from server returns wrapped ErrUnauthorized
- Test G: Multiple queries (from BuildQueries) iterate in sorted order
</behavior>
<action>
Create `pkg/recon/sources/github.go`:
```go
package sources
import (
"context"
"encoding/json"
"errors"
"fmt"
"net/http"
"net/url"
"time"
"golang.org/x/time/rate"
"github.com/salvacybersec/keyhunter/pkg/providers"
"github.com/salvacybersec/keyhunter/pkg/recon"
)
// GitHubSource implements recon.ReconSource against GitHub Code Search.
// RECON-CODE-01.
type GitHubSource struct {
Token string
BaseURL string // default https://api.github.com, overridable for tests
Registry *providers.Registry
Limiters *recon.LimiterRegistry
client *Client
}
// NewGitHubSource constructs a source with the default retry Client from NewClient().
func NewGitHubSource(token string, reg *providers.Registry, lim *recon.LimiterRegistry) *GitHubSource {
return &GitHubSource{Token: token, BaseURL: "https://api.github.com", Registry: reg, Limiters: lim, client: NewClient()}
}
func (s *GitHubSource) Name() string { return "github" }
func (s *GitHubSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) }
func (s *GitHubSource) Burst() int { return 1 }
func (s *GitHubSource) RespectsRobots() bool { return false }
func (s *GitHubSource) Enabled(_ recon.Config) bool { return s.Token != "" }
func (s *GitHubSource) Sweep(ctx context.Context, _ string, out chan<- recon.Finding) error {
if s.Token == "" { return nil }
base := s.BaseURL
if base == "" { base = "https://api.github.com" }
queries := BuildQueries(s.Registry, "github")
kwToProvider := keywordIndex(s.Registry)
for _, q := range queries {
if err := ctx.Err(); err != nil { return err }
if s.Limiters != nil {
if err := s.Limiters.Wait(ctx, s.Name(), s.RateLimit(), s.Burst(), false); err != nil { return err }
}
endpoint := fmt.Sprintf("%s/search/code?q=%s&per_page=30", base, url.QueryEscape(q))
req, _ := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
req.Header.Set("Accept", "application/vnd.github.v3.text-match+json")
req.Header.Set("Authorization", "Bearer "+s.Token)
resp, err := s.client.Do(ctx, req)
if err != nil {
if errors.Is(err, ErrUnauthorized) { return err }
// Other errors: log-and-continue per CONTEXT (sources downgrade, not abort)
continue
}
var parsed ghSearchResponse
if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
resp.Body.Close()
continue
}
resp.Body.Close()
provName := kwToProvider[extractKeyword(q)]
for _, it := range parsed.Items {
snippet := ""
if len(it.TextMatches) > 0 { snippet = it.TextMatches[0].Fragment }
f := recon.Finding{
ProviderName: provName,
KeyMasked: "",
Confidence: "low",
Source: it.HTMLURL,
SourceType: "recon:github",
DetectedAt: time.Now(),
}
_ = snippet // reserved for future content scan pass
select {
case out <- f:
case <-ctx.Done(): return ctx.Err()
}
}
}
return nil
}
// Response structs mirror pkg/dorks/github.go (kept private to this file
// to avoid cross-package coupling between dorks and recon/sources).
type ghSearchResponse struct { Items []ghCodeItem `json:"items"` }
type ghCodeItem struct {
HTMLURL string `json:"html_url"`
Repository ghRepository `json:"repository"`
TextMatches []ghTextMatchEntry `json:"text_matches"`
}
type ghRepository struct { FullName string `json:"full_name"` }
type ghTextMatchEntry struct { Fragment string `json:"fragment"` }
// keywordIndex maps keyword -> provider name using the registry.
func keywordIndex(reg *providers.Registry) map[string]string {
m := make(map[string]string)
if reg == nil { return m }
for _, p := range reg.List() {
for _, k := range p.Keywords { m[k] = p.Name }
}
return m
}
// extractKeyword parses the provider keyword out of a BuildQueries output.
// For github it's `"keyword" in:file`; for bare formats it's the whole string.
func extractKeyword(q string) string {
const suffix = ` in:file`
if len(q) > len(suffix) && q[len(q)-len(suffix):] == suffix {
q = q[:len(q)-len(suffix)]
}
if len(q) >= 2 && q[0] == '"' && q[len(q)-1] == '"' {
q = q[1 : len(q)-1]
}
return q
}
```
Create `pkg/recon/sources/github_test.go`:
- Use `providers.NewRegistryFromProviders` with 2 synthetic providers (openai/sk-proj-, anthropic/sk-ant-)
- Spin up `httptest.NewServer` that inspects `r.URL.Query().Get("q")` and returns
a JSON body with two items whose html_url encodes the query
- Assert 2 findings per query received on the channel within 2s using select/time.After
- Separate test for empty token: NewGitHubSource("", reg, lim).Sweep returns nil immediately
- Separate test for 401: server returns 401 → Sweep returns error wrapping ErrUnauthorized
- Cancel-test: cancel ctx before Sweep call; assert ctx.Err() returned
Leave GitHubSource unregistered (Plan 10-09 adds it to RegisterAll).
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestGitHub -v -timeout 30s</automated>
</verify>
<done>
GitHubSource satisfies recon.ReconSource (compile-time assert via `var _ recon.ReconSource = (*GitHubSource)(nil)`),
tests green, covers happy path + empty token + 401 + cancellation.
</done>
</task>
</tasks>
<verification>
- `go build ./...`
- `go test ./pkg/recon/sources/ -run TestGitHub -v`
- `go vet ./pkg/recon/sources/...`
</verification>
<success_criteria>
RECON-CODE-01 satisfied: GitHubSource queries /search/code using provider-registry-driven
keywords and emits engine.Finding. Ready for registration in Plan 10-09.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-02-SUMMARY.md`.
</output>


@@ -0,0 +1,120 @@
---
phase: 10-osint-code-hosting
plan: 03
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/gitlab.go
- pkg/recon/sources/gitlab_test.go
autonomous: true
requirements: [RECON-CODE-02]
must_haves:
truths:
- "GitLabSource.Sweep queries GitLab /api/v4/search?scope=blobs and emits Findings"
- "Disabled when token empty; enabled otherwise"
- "Findings have SourceType=\"recon:gitlab\" and Source = web_url of blob"
artifacts:
- path: "pkg/recon/sources/gitlab.go"
provides: "GitLabSource implementing recon.ReconSource"
- path: "pkg/recon/sources/gitlab_test.go"
provides: "httptest tests"
key_links:
- from: "pkg/recon/sources/gitlab.go"
to: "pkg/recon/sources/httpclient.go"
via: "c.client.Do(ctx, req)"
pattern: "client\\.Do"
---
<objective>
Implement GitLabSource against GitLab's Search API (/api/v4/search?scope=blobs).
Honors PRIVATE-TOKEN header auth, 2000 req/min rate limit.
Purpose: RECON-CODE-02.
Output: pkg/recon/sources/gitlab.go + tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
<interfaces>
GitLab Search API (docs: https://docs.gitlab.com/ee/api/search.html):
GET /api/v4/search?scope=blobs&search=<query>&per_page=20
Header: PRIVATE-TOKEN: <token>
Response (array of blob objects):
[{ "basename": "...", "data": "matched snippet", "path": "...", "project_id": 123,
"ref": "main", "startline": 42 }, ...]
The blob response carries no web URL. Building the real one would require an extra
/api/v4/projects/<id> lookup per result, so keep it minimal and synthesize a
placeholder: Source = "https://gitlab.com/projects/<project_id>/-/blob/<ref>/<path>".
Rate limit: 2000 req/min → rate.Every(30 * time.Millisecond) ≈ 2000/min, burst 5.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: GitLabSource implementation + tests</name>
<files>pkg/recon/sources/gitlab.go, pkg/recon/sources/gitlab_test.go</files>
<behavior>
- Test A: Enabled false when token empty
- Test B: Sweep queries /api/v4/search with scope=blobs, PRIVATE-TOKEN header set
- Test C: Decodes array response, emits one Finding per blob with Source containing project_id + path + ref
- Test D: 401 returns wrapped ErrUnauthorized
- Test E: Ctx cancellation respected
- Test F: Empty token → Sweep returns nil with no calls
</behavior>
<action>
Create `pkg/recon/sources/gitlab.go` with struct `GitLabSource { Token, BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`.
Default BaseURL: `https://gitlab.com`.
Name: "gitlab". RateLimit: `rate.Every(30 * time.Millisecond)`. Burst: 5. RespectsRobots: false.
Sweep loop:
- For each query from BuildQueries(reg, "gitlab"):
- Build `base + /api/v4/search?scope=blobs&search=<url-escaped>&per_page=20`
- Set header `PRIVATE-TOKEN: <token>`
- limiters.Wait, then client.Do
- Decode `[]glBlob` where glBlob has ProjectID int, Path, Ref, Data, Startline
- Emit Finding with Source = fmt.Sprintf("%s/projects/%d/-/blob/%s/%s", base, b.ProjectID, b.Ref, b.Path), SourceType="recon:gitlab", Confidence="low", ProviderName derived via keywordIndex(reg)
- Respect ctx.Done on send
Add compile-time assert: `var _ recon.ReconSource = (*GitLabSource)(nil)`.
Create `pkg/recon/sources/gitlab_test.go` with httptest server returning a JSON
array of two blob objects. Assert both Findings received, Source URLs contain
project IDs, ctx cancellation test, 401 test, empty-token test. Use synthetic
registry with 2 providers.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestGitLab -v -timeout 30s</automated>
</verify>
<done>
GitLabSource compiles, implements ReconSource, all test behaviors covered.
</done>
</task>
</tasks>
<verification>
- `go build ./...`
- `go test ./pkg/recon/sources/ -run TestGitLab -v`
</verification>
<success_criteria>
RECON-CODE-02 satisfied.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-03-SUMMARY.md`.
</output>


@@ -0,0 +1,163 @@
---
phase: 10-osint-code-hosting
plan: 04
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/bitbucket.go
- pkg/recon/sources/bitbucket_test.go
- pkg/recon/sources/gist.go
- pkg/recon/sources/gist_test.go
autonomous: true
requirements: [RECON-CODE-03, RECON-CODE-04]
must_haves:
truths:
- "BitbucketSource queries Bitbucket 2.0 code search API and emits Findings"
- "GistSource queries GitHub Gist search (re-uses GitHub token) and emits Findings"
- "Both disabled when respective credentials are empty"
artifacts:
- path: "pkg/recon/sources/bitbucket.go"
provides: "BitbucketSource implementing recon.ReconSource"
- path: "pkg/recon/sources/gist.go"
provides: "GistSource implementing recon.ReconSource"
key_links:
- from: "pkg/recon/sources/gist.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do with Bearer <github-token>"
pattern: "client\\.Do"
- from: "pkg/recon/sources/bitbucket.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do"
pattern: "client\\.Do"
---
<objective>
Implement BitbucketSource (RECON-CODE-03) and GistSource (RECON-CODE-04). Grouped
because both are small API integrations with similar shapes (JSON array/values,
per-item URL, token gating).
Purpose: RECON-CODE-03, RECON-CODE-04.
Output: Two new ReconSource implementations + tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
<interfaces>
Bitbucket 2.0 search (docs: https://developer.atlassian.com/cloud/bitbucket/rest/api-group-search/):
GET /2.0/workspaces/{workspace}/search/code?search_query=<query>
Auth: Bearer <token> (app password or OAuth)
Response: { "values": [{ "content_match_count": N, "file": {"path":"","commit":{...}}, "page_url": "..." }] }
Note: Requires a workspace param — make it configurable via SourcesConfig.BitbucketWorkspace;
if unset, source is disabled. Rate: 1000/hour → rate.Every(3.6 * time.Second), burst 1.
GitHub Gist search: GitHub does not expose a dedicated /search/gists endpoint that
searches gist contents. Use the /gists/public endpoint + client-side filtering as
fallback: GET /gists/public?per_page=100 returns public gists; for each gist, fetch
/gists/{id} and scan file contents for keyword matches. Keep implementation minimal:
just enumerate the first page, match against keyword list, emit Findings with
Source = gist.html_url. Auth: Bearer <github-token>. Rate: 30/min → rate.Every(2s).
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: BitbucketSource + tests</name>
<files>pkg/recon/sources/bitbucket.go, pkg/recon/sources/bitbucket_test.go</files>
<behavior>
- Test A: Enabled false when token OR workspace empty
- Test B: Enabled true when both set
- Test C: Sweep queries /2.0/workspaces/{ws}/search/code with Bearer header
- Test D: Decodes `{values:[{file:{path,commit:{...}},page_url:"..."}]}` and emits Finding with Source=page_url, SourceType="recon:bitbucket"
- Test E: 401 → ErrUnauthorized
- Test F: Ctx cancellation
</behavior>
<action>
Create `pkg/recon/sources/bitbucket.go`:
- Struct `BitbucketSource { Token, Workspace, BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL: `https://api.bitbucket.org`
- Name "bitbucket", RateLimit rate.Every(3600*time.Millisecond), Burst 1, RespectsRobots false
- Enabled = s.Token != "" && s.Workspace != ""
- Sweep: for each query in BuildQueries(reg, "bitbucket"), limiters.Wait, issue
GET request, decode into struct with `Values []struct{ PageURL string "json:page_url"; File struct{ Path string } "json:file" }`, emit Findings
- Compile-time assert `var _ recon.ReconSource = (*BitbucketSource)(nil)`
Create `pkg/recon/sources/bitbucket_test.go` with httptest server, synthetic
registry, assertions on URL path `/2.0/workspaces/testws/search/code`, Bearer
header, and emitted Findings.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestBitbucket -v -timeout 30s</automated>
</verify>
<done>
BitbucketSource passes all tests, implements ReconSource.
</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: GistSource + tests</name>
<files>pkg/recon/sources/gist.go, pkg/recon/sources/gist_test.go</files>
<behavior>
- Test A: Enabled false when GitHub token empty
- Test B: Sweep fetches /gists/public?per_page=100 with Bearer auth
- Test C: For each gist, iterates files map; if any file.content contains a provider keyword, emits one Finding with Source=gist.html_url
- Test D: Ctx cancellation
- Test E: 401 → ErrUnauthorized
- Test F: Gist without matching keyword → no Finding emitted
</behavior>
<action>
Create `pkg/recon/sources/gist.go`:
- Struct `GistSource { Token, BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- BaseURL default `https://api.github.com`
- Name "gist", RateLimit rate.Every(2*time.Second), Burst 1, RespectsRobots false
- Enabled = s.Token != ""
- Sweep flow:
1. Build keyword list from registry (flat set)
2. GET /gists/public?per_page=100 with Bearer header
3. Decode `[]struct{ HTMLURL string "json:html_url"; Files map[string]struct{ Filename, RawURL string "json:raw_url" } "json:files" }`
4. For each gist, for each file: if a keyword already matches metadata from the
list response (e.g. the filename), skip the raw fetch (keep Phase 10 minimal);
otherwise fetch file.RawURL and scan the content for any keyword from the
set. On a hit, emit one Finding per gist (not per file) with ProviderName
from the matched keyword.
5. Respect limiters.Wait before each outbound request (gist list + each raw fetch)
- Compile-time assert `var _ recon.ReconSource = (*GistSource)(nil)`
Create `pkg/recon/sources/gist_test.go`:
- httptest server with two routes: `/gists/public` returns 2 gists each with 1 file, raw_url pointing to same server `/raw/<id>`; `/raw/<id>` returns content containing "sk-proj-" for one and an unrelated string for the other
- Assert exactly 1 Finding emitted, Source matches the gist's html_url
- 401 test, ctx cancellation test, empty-token test
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestGist -v -timeout 30s</automated>
</verify>
<done>
GistSource emits Findings only when a known provider keyword is present in a gist
file body; all tests green.
</done>
</task>
</tasks>
<verification>
- `go build ./...`
- `go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist" -v`
</verification>
<success_criteria>
RECON-CODE-03 and RECON-CODE-04 satisfied.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md`.
</output>

---
phase: 10-osint-code-hosting
plan: 05
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/codeberg.go
- pkg/recon/sources/codeberg_test.go
autonomous: true
requirements: [RECON-CODE-05]
must_haves:
truths:
- "CodebergSource queries Gitea REST API /api/v1/repos/search and /api/v1/repos/.../contents for keyword matches"
- "No token required for public repos (but optional token honored if provided)"
- "Findings tagged SourceType=\"recon:codeberg\""
artifacts:
- path: "pkg/recon/sources/codeberg.go"
provides: "CodebergSource implementing recon.ReconSource (Gitea-compatible)"
key_links:
- from: "pkg/recon/sources/codeberg.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do"
pattern: "client\\.Do"
---
<objective>
Implement CodebergSource targeting Gitea's REST API. Codeberg.org runs Gitea, so the
same code works for any Gitea instance by configuring BaseURL. Public repos do not
require auth, but a token can be passed to raise rate limits.
Purpose: RECON-CODE-05.
Output: pkg/recon/sources/codeberg.go + tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
<interfaces>
Gitea API (v1, docs: https://docs.gitea.com/api):
GET /api/v1/repos/search?q=<query>&limit=50
Response: { "data": [{ "full_name": "...", "html_url": "..." }], "ok": true }
Header (optional): Authorization: token <token>
  For this phase we only use /repos/search, matching on repo metadata (name/description).
  Full-content code search is not uniformly available across Gitea instances (Codeberg
  enables Gitea's code search via a Bleve index, but the endpoint varies per instance),
  so we do not depend on it: instead we submit each provider keyword as a /repos/search
  query string and emit Findings keyed to each matching repo's html_url.
Rate: public unauth 60 req/hour → rate.Every(60 * time.Second). Burst 1.
With token: 1000/hour → rate.Every(3600 * time.Millisecond). Detect via token presence.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: CodebergSource + tests</name>
<files>pkg/recon/sources/codeberg.go, pkg/recon/sources/codeberg_test.go</files>
<behavior>
- Test A: Enabled always true (public API, token optional)
- Test B: Sweep queries /api/v1/repos/search?q=<query>&limit=50 for each BuildQueries entry
- Test C: Decodes `{data:[{full_name,html_url}]}` and emits Finding with Source=html_url, SourceType="recon:codeberg", ProviderName from keywordIndex
- Test D: With token set, Authorization header is "token <t>"; without token, header absent
- Test E: Ctx cancellation
- Test F: Unauth rate limit applied when Token empty (verified via RateLimit() return)
</behavior>
<action>
Create `pkg/recon/sources/codeberg.go`:
- Struct `CodebergSource { Token, BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL: `https://codeberg.org`
- Name "codeberg", RespectsRobots false
- RateLimit(): if Token == "" return rate.Every(60*time.Second), else rate.Every(3600*time.Millisecond)
- Burst 1
- Enabled always returns true
- Sweep: for each query, build `base + /api/v1/repos/search?q=<q>&limit=50`, set Authorization only when Token set, client.Do, decode, emit Findings
- Compile-time assert
Create `pkg/recon/sources/codeberg_test.go` with httptest server returning a
`{data:[...],ok:true}` body. Two test cases: with token (header present) and
without (header absent — use a flag inside the handler to capture).
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestCodeberg -v -timeout 30s</automated>
</verify>
<done>
CodebergSource implements ReconSource, tests green for both auth modes.
</done>
</task>
</tasks>
<verification>
- `go test ./pkg/recon/sources/ -run TestCodeberg -v`
</verification>
<success_criteria>
RECON-CODE-05 satisfied.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-05-SUMMARY.md`.
</output>

---
phase: 10-osint-code-hosting
plan: 06
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/huggingface.go
- pkg/recon/sources/huggingface_test.go
autonomous: true
requirements: [RECON-CODE-08]
must_haves:
truths:
- "HuggingFaceSource queries /api/spaces and /api/models search endpoints"
- "Token is optional — anonymous requests allowed at lower rate limit"
- "Findings have SourceType=\"recon:huggingface\" and Source = full HF URL"
artifacts:
- path: "pkg/recon/sources/huggingface.go"
provides: "HuggingFaceSource implementing recon.ReconSource"
key_links:
- from: "pkg/recon/sources/huggingface.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do"
pattern: "client\\.Do"
---
<objective>
Implement HuggingFaceSource scanning both Spaces and model repos via the HF Hub API.
Token optional; unauthenticated requests work but are rate-limited harder.
Purpose: RECON-CODE-08.
Output: pkg/recon/sources/huggingface.go + tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
<interfaces>
HuggingFace Hub API:
GET https://huggingface.co/api/spaces?search=<q>&limit=50
GET https://huggingface.co/api/models?search=<q>&limit=50
Response (either): array of { "id": "owner/name", "modelId"|"spaceId": "owner/name" }
Optional auth: Authorization: Bearer <hf-token>
  URL derivation: Source = "https://huggingface.co/spaces/<id>" for spaces, or "https://huggingface.co/<id>" for models.
Rate: 1000/hour authenticated → rate.Every(3600*time.Millisecond); unauth: rate.Every(10*time.Second), burst 1.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: HuggingFaceSource + tests</name>
<files>pkg/recon/sources/huggingface.go, pkg/recon/sources/huggingface_test.go</files>
<behavior>
- Test A: Enabled always true (token optional)
- Test B: Sweep hits both /api/spaces and /api/models endpoints for each query
- Test C: Decodes array of {id} and emits Findings with Source prefixed by "https://huggingface.co/spaces/" or "https://huggingface.co/" for models, SourceType="recon:huggingface"
- Test D: Authorization header present when token set, absent when empty
- Test E: Ctx cancellation respected
- Test F: RateLimit returns slower rate when token empty
</behavior>
<action>
Create `pkg/recon/sources/huggingface.go`:
- Struct `HuggingFaceSource { Token, BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL: `https://huggingface.co`
- Name "huggingface", RespectsRobots false, Burst 1
- RateLimit: token-dependent (see interfaces)
- Enabled always true
- Sweep: build keyword list, for each keyword iterate two endpoints
(`/api/spaces?search=<q>&limit=50`, `/api/models?search=<q>&limit=50`), emit
Findings. URL prefix differs per endpoint.
- Compile-time assert
Create `pkg/recon/sources/huggingface_test.go` with httptest server that routes
both paths. Assert exact number of Findings (2 per keyword × number of keywords)
and URL prefixes.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestHuggingFace -v -timeout 30s</automated>
</verify>
<done>
HuggingFaceSource passes tests covering both endpoints, token modes, cancellation.
</done>
</task>
</tasks>
<verification>
- `go test ./pkg/recon/sources/ -run TestHuggingFace -v`
</verification>
<success_criteria>
RECON-CODE-08 satisfied.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-06-SUMMARY.md`.
</output>

---
phase: 10-osint-code-hosting
plan: 07
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/replit.go
- pkg/recon/sources/replit_test.go
- pkg/recon/sources/codesandbox.go
- pkg/recon/sources/codesandbox_test.go
- pkg/recon/sources/sandboxes.go
- pkg/recon/sources/sandboxes_test.go
autonomous: true
requirements: [RECON-CODE-06, RECON-CODE-07, RECON-CODE-10]
must_haves:
truths:
- "ReplitSource scrapes replit.com search HTML and emits Findings tagged recon:replit"
- "CodeSandboxSource scrapes codesandbox.io search and emits Findings tagged recon:codesandbox"
- "SandboxesSource aggregates JSFiddle+CodePen+StackBlitz+Glitch+Observable+Gitpod with SourceType recon:sandboxes and sub-type in KeyMasked metadata slot"
- "All three RespectsRobots()==true and rate-limit conservatively (10/min)"
artifacts:
- path: "pkg/recon/sources/replit.go"
provides: "ReplitSource (scraper)"
- path: "pkg/recon/sources/codesandbox.go"
provides: "CodeSandboxSource (scraper)"
- path: "pkg/recon/sources/sandboxes.go"
provides: "SandboxesSource aggregator (JSFiddle, CodePen, StackBlitz, Glitch, Observable, Gitpod)"
key_links:
- from: "pkg/recon/sources/replit.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do on https://replit.com/search?q=..."
pattern: "client\\.Do"
- from: "pkg/recon/sources/sandboxes.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do on per-sandbox search URLs"
pattern: "client\\.Do"
---
<objective>
Implement three scraping-based sources for sandbox/IDE platforms without public
search APIs. All three honor robots.txt, use a conservative 10 req/min rate, and
emit Findings with best-effort HTML link extraction.
Purpose: RECON-CODE-06 (Replit), RECON-CODE-07 (CodeSandbox), RECON-CODE-10
(CodePen/JSFiddle/StackBlitz/Glitch/Observable/Gitpod aggregator).
Output: 3 new ReconSource implementations + tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/robots.go
@pkg/recon/sources/httpclient.go
<interfaces>
Scraping strategy (identical for all three sources in this plan):
1. Build per-provider keyword queries via BuildQueries (default format = bare keyword)
2. Fetch search URL via Client.Do (no auth headers)
3. Use a simple regex to extract result links from HTML (href="/@user/repl-name"
or href="/s/...") — use net/html parser for robustness
4. Emit one Finding per extracted link with SourceType="recon:<name>" and Source=absolute URL
5. Return early on ctx cancellation
Search URLs (approximations — confirm in action):
- Replit: https://replit.com/search?q=<q>&type=repls
- CodeSandbox: https://codesandbox.io/search?query=<q>&type=sandboxes
- CodePen: https://codepen.io/search/pens?q=<q>
- JSFiddle: https://jsfiddle.net/api/search/?q=<q> (returns JSON)
- StackBlitz: https://stackblitz.com/search?q=<q>
- Glitch: https://glitch.com/api/search/projects?q=<q>
- Observable: https://observablehq.com/search?query=<q>
- Gitpod: https://www.gitpod.io/ (no public search; skip with log)
  All three sources set RespectsRobots()=true. The robots.txt check itself lives in
  the existing pkg/recon/robots.go cache and is coordinated by the caller: Phase 9 is
  expected to wire it at the SweepAll level, so it is not performed inside these
  sources. If that wiring turns out to be absent, document a TODO in the code.
Rate limits: all 10 req/min → rate.Every(6 * time.Second). Burst 1.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: ReplitSource + CodeSandboxSource (scrapers)</name>
<files>pkg/recon/sources/replit.go, pkg/recon/sources/replit_test.go, pkg/recon/sources/codesandbox.go, pkg/recon/sources/codesandbox_test.go</files>
<behavior>
- Test A (each): Sweep fetches search URL for each keyword via httptest server
- Test B: HTML parsing extracts anchor hrefs matching expected result patterns (use golang.org/x/net/html)
- Test C: Each extracted link emitted as Finding with Source=absolute URL, SourceType="recon:replit" or "recon:codesandbox"
- Test D: RespectsRobots returns true
- Test E: Ctx cancellation respected
- Test F: Enabled always returns true (no auth)
</behavior>
<action>
Add `golang.org/x/net/html` to go.mod if not already (`go get golang.org/x/net/html`).
Create `pkg/recon/sources/replit.go`:
- Struct `ReplitSource { BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL: `https://replit.com`
- Name "replit", RateLimit rate.Every(6*time.Second), Burst 1, RespectsRobots true, Enabled always true
- Sweep: for each keyword from BuildQueries, GET `{base}/search?q={keyword}&type=repls`, parse HTML with `html.Parse`, walk DOM collecting `<a href>` matching regex `^/@[^/]+/[^/]+$` (repl URLs), emit Finding per absolute URL
- Compile-time assert
Create `pkg/recon/sources/replit_test.go`:
- httptest server returning fixed HTML snippet with 2 matching anchors + 1 non-matching
- Assert exactly 2 Findings with correct absolute URLs
Create `pkg/recon/sources/codesandbox.go` with same shape but:
- Default BaseURL `https://codesandbox.io`
- Name "codesandbox"
- Search URL: `{base}/search?query=<q>&type=sandboxes`
- Link regex: `^/s/[a-zA-Z0-9-]+$` or `/p/sandbox/...`
- SourceType "recon:codesandbox"
Create `pkg/recon/sources/codesandbox_test.go` analogous to replit_test.go.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox" -v -timeout 30s</automated>
</verify>
<done>
Both scrapers parse HTML, extract links, emit Findings; tests green.
</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: SandboxesSource aggregator (JSFiddle/CodePen/StackBlitz/Glitch/Observable/Gitpod)</name>
<files>pkg/recon/sources/sandboxes.go, pkg/recon/sources/sandboxes_test.go</files>
<behavior>
    - Test A: Sweep iterates every configured sub-platform for each keyword (via test override of the Platforms slice)
- Test B: JSFiddle returns JSON → parsed into Findings (Source from result URLs)
- Test C: CodePen HTML → anchor extraction
- Test D: One failing sub-platform does NOT abort others (log-and-continue)
    - Test E: SourceType = "recon:sandboxes"; the sub-platform identifier is carried in the `KeyMasked` field as a sentinel such as `platform=codepen` (a pragmatic placeholder until a dedicated Metadata field exists)
- Test F: Ctx cancellation
</behavior>
<action>
Create `pkg/recon/sources/sandboxes.go`:
- Define `subPlatform` struct: `{ Name, SearchURL, ResultLinkRegex string; IsJSON bool; JSONItemsKey string }`
- Default Platforms:
```go
var defaultPlatforms = []subPlatform{
{Name: "codepen", SearchURL: "https://codepen.io/search/pens?q=%s", ResultLinkRegex: `^/[^/]+/pen/[a-zA-Z0-9]+`, IsJSON: false},
{Name: "jsfiddle", SearchURL: "https://jsfiddle.net/api/search/?q=%s", IsJSON: true, JSONItemsKey: "results"},
{Name: "stackblitz", SearchURL: "https://stackblitz.com/search?q=%s", ResultLinkRegex: `^/edit/[a-zA-Z0-9-]+`, IsJSON: false},
{Name: "glitch", SearchURL: "https://glitch.com/api/search/projects?q=%s", IsJSON: true, JSONItemsKey: "results"},
{Name: "observable", SearchURL: "https://observablehq.com/search?query=%s", ResultLinkRegex: `^/@[^/]+/[^/]+`, IsJSON: false},
}
```
(Gitpod omitted — no public search; document in comment.)
- Struct `SandboxesSource { Platforms []subPlatform; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Name "sandboxes", RateLimit rate.Every(6*time.Second), Burst 1, RespectsRobots true, Enabled always true
- Sweep: for each platform, for each keyword, fetch URL, parse either JSON or HTML, emit Findings with Source=absolute URL and KeyMasked="platform="+p.Name
- On any per-platform error, log (use stdlib log package) and continue
Create `pkg/recon/sources/sandboxes_test.go`:
- Spin up a single httptest server; override Platforms slice with 2 platforms
pointing at `/codepen-search` (HTML) and `/jsfiddle-search` (JSON)
- Assert Findings from both platforms emitted
- Failure test: one platform returns 500 → log-and-continue, other still emits
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestSandboxes -v -timeout 30s</automated>
</verify>
<done>
SandboxesSource iterates sub-platforms, handles HTML and JSON formats, tolerates
per-platform failure, emits Findings tagged with platform identifier.
</done>
</task>
</tasks>
<verification>
- `go build ./...`
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -v`
</verification>
<success_criteria>
RECON-CODE-06, RECON-CODE-07, RECON-CODE-10 satisfied.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md`.
</output>

---
phase: 10-osint-code-hosting
plan: 08
type: execute
wave: 2
depends_on: [10-01]
files_modified:
- pkg/recon/sources/kaggle.go
- pkg/recon/sources/kaggle_test.go
autonomous: true
requirements: [RECON-CODE-09]
must_haves:
truths:
- "KaggleSource queries Kaggle public API /api/v1/kernels/list with Basic auth (username:key) and emits Findings"
- "Disabled when either KaggleUser or KaggleKey is empty"
- "Findings tagged recon:kaggle; Source = https://www.kaggle.com/code/<ref>"
artifacts:
- path: "pkg/recon/sources/kaggle.go"
provides: "KaggleSource implementing recon.ReconSource"
key_links:
- from: "pkg/recon/sources/kaggle.go"
to: "pkg/recon/sources/httpclient.go"
via: "Client.Do with req.SetBasicAuth(user, key)"
pattern: "SetBasicAuth"
---
<objective>
Implement KaggleSource querying Kaggle's public REST API for public notebooks
(kernels). Kaggle uses HTTP Basic auth (username + API key from kaggle.json).
Purpose: RECON-CODE-09.
Output: pkg/recon/sources/kaggle.go + tests.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/sources/httpclient.go
<interfaces>
Kaggle API (docs: https://www.kaggle.com/docs/api):
GET https://www.kaggle.com/api/v1/kernels/list?search=<q>&pageSize=50
Auth: HTTP Basic (username:key)
Response: array of { "ref": "owner/kernel-slug", "title": "...", "author": "..." }
URL derivation: https://www.kaggle.com/code/<ref>
Rate limit: 60/min → rate.Every(1*time.Second), burst 1.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: KaggleSource + tests</name>
<files>pkg/recon/sources/kaggle.go, pkg/recon/sources/kaggle_test.go</files>
<behavior>
- Test A: Enabled false when User empty; false when Key empty; true when both set
- Test B: Sweep sets Basic auth header via req.SetBasicAuth(user, key)
- Test C: Decodes array of {ref} → Findings with Source = baseURL + "/code/" + ref, SourceType="recon:kaggle"
- Test D: 401 → ErrUnauthorized
- Test E: Ctx cancellation
- Test F: Missing creds → Sweep returns nil immediately (no HTTP calls, verified via counter=0)
</behavior>
<action>
Create `pkg/recon/sources/kaggle.go`:
- Struct `KaggleSource { User, Key, BaseURL, WebBaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL `https://www.kaggle.com`, WebBaseURL same
- Name "kaggle", RateLimit rate.Every(1*time.Second), Burst 1, RespectsRobots false
- Enabled = s.User != "" && s.Key != ""
- Sweep: for each query from BuildQueries(reg, "kaggle"), build
`{base}/api/v1/kernels/list?search=<q>&pageSize=50`, call req.SetBasicAuth(User, Key),
client.Do, decode `[]struct{ Ref string "json:ref" }`, emit Findings
- Compile-time assert
Create `pkg/recon/sources/kaggle_test.go`:
- httptest server that validates Authorization header starts with "Basic " and
decodes to "testuser:testkey"
- Returns JSON array with 2 refs
- Assert 2 Findings with expected Source URLs
- Missing-creds test: Sweep returns nil, handler never called (use atomic counter)
- 401 and cancellation tests
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestKaggle -v -timeout 30s</automated>
</verify>
<done>
KaggleSource passes all tests, implements ReconSource.
</done>
</task>
</tasks>
<verification>
- `go test ./pkg/recon/sources/ -run TestKaggle -v`
</verification>
<success_criteria>
RECON-CODE-09 satisfied.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-08-SUMMARY.md`.
</output>

---
phase: 10-osint-code-hosting
plan: 09
type: execute
wave: 3
depends_on: [10-01, 10-02, 10-03, 10-04, 10-05, 10-06, 10-07, 10-08]
files_modified:
- pkg/recon/sources/register.go
- pkg/recon/sources/register_test.go
- pkg/recon/sources/integration_test.go
- cmd/recon.go
autonomous: true
requirements: []
must_haves:
truths:
- "RegisterAll wires all 10 Phase 10 sources onto a recon.Engine"
- "cmd/recon.go buildReconEngine() reads viper config + env vars for tokens and calls RegisterAll"
- "Integration test spins up httptest servers for all sources, runs SweepAll via Engine, asserts Findings from each source arrive with correct SourceType"
- "Guardrail: enabling a source without its required credential logs a skip but does not error"
artifacts:
- path: "pkg/recon/sources/register.go"
provides: "RegisterAll with 10 source constructors wired"
contains: "engine.Register"
- path: "pkg/recon/sources/integration_test.go"
provides: "End-to-end SweepAll test with httptest fixtures for every source"
- path: "cmd/recon.go"
provides: "CLI reads config and invokes sources.RegisterAll"
key_links:
- from: "cmd/recon.go"
to: "pkg/recon/sources.RegisterAll"
via: "sources.RegisterAll(eng, cfg)"
pattern: "sources\\.RegisterAll"
- from: "pkg/recon/sources/register.go"
to: "pkg/recon.Engine.Register"
via: "engine.Register(source)"
pattern: "engine\\.Register"
---
<objective>
Final Wave 3 plan: wire every Phase 10 source into `sources.RegisterAll`, update
`cmd/recon.go` to construct a real `SourcesConfig` from viper/env, and add an
end-to-end integration test that drives all 10 sources through recon.Engine.SweepAll
using httptest fixtures.
Purpose: Users can run `keyhunter recon full --sources=github,gitlab,...` and get
actual findings from any Phase 10 source whose credential is configured.
Output: Wired register.go + cmd/recon.go + passing integration test.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-02-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-03-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-05-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-06-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md
@.planning/phases/10-osint-code-hosting/10-08-SUMMARY.md
@pkg/recon/engine.go
@pkg/recon/source.go
@pkg/providers/registry.go
@cmd/recon.go
<interfaces>
After Wave 2, each source file in pkg/recon/sources/ exports a constructor
roughly of the form:
func NewGitHubSource(token, reg, lim) *GitHubSource
func NewGitLabSource(token, reg, lim) *GitLabSource
func NewBitbucketSource(token, workspace, reg, lim) *BitbucketSource
func NewGistSource(token, reg, lim) *GistSource
func NewCodebergSource(token, reg, lim) *CodebergSource
func NewHuggingFaceSource(token, reg, lim) *HuggingFaceSource
func NewReplitSource(reg, lim) *ReplitSource
func NewCodeSandboxSource(reg, lim) *CodeSandboxSource
func NewSandboxesSource(reg, lim) *SandboxesSource
func NewKaggleSource(user, key, reg, lim) *KaggleSource
(Verify actual signatures when reading Wave 2 SUMMARYs before writing register.go.)
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Wire RegisterAll + register_test.go</name>
<files>pkg/recon/sources/register.go, pkg/recon/sources/register_test.go</files>
<behavior>
- Test A: RegisterAll with a fresh engine and empty SourcesConfig registers all 10 sources by name (GitHub/GitLab/Bitbucket/Gist/Codeberg/HuggingFace/Replit/CodeSandbox/Sandboxes/Kaggle)
- Test B: engine.List() returns all 10 source names in sorted order
- Test C: Calling RegisterAll(nil, cfg) is a no-op (no panic)
- Test D: Sources without creds are still registered but their Enabled() returns false
</behavior>
<action>
Rewrite `pkg/recon/sources/register.go` RegisterAll body to construct each
source with appropriate fields from SourcesConfig and call engine.Register:
```go
func RegisterAll(engine *recon.Engine, cfg SourcesConfig) {
if engine == nil { return }
reg := cfg.Registry
lim := cfg.Limiters
engine.Register(NewGitHubSource(cfg.GitHubToken, reg, lim))
engine.Register(NewGitLabSource(cfg.GitLabToken, reg, lim))
engine.Register(NewBitbucketSource(cfg.BitbucketToken, cfg.BitbucketWorkspace, reg, lim))
engine.Register(NewGistSource(cfg.GitHubToken, reg, lim))
engine.Register(NewCodebergSource(cfg.CodebergToken, reg, lim))
engine.Register(NewHuggingFaceSource(cfg.HuggingFaceToken, reg, lim))
engine.Register(NewReplitSource(reg, lim))
engine.Register(NewCodeSandboxSource(reg, lim))
engine.Register(NewSandboxesSource(reg, lim))
engine.Register(NewKaggleSource(cfg.KaggleUser, cfg.KaggleKey, reg, lim))
}
```
Extend SourcesConfig with any fields Wave 2 introduced (BitbucketWorkspace,
CodebergToken). Adjust field names to actual Wave 2 SUMMARY signatures.
Create `pkg/recon/sources/register_test.go`:
- Build minimal registry via providers.NewRegistryFromProviders with 1 synthetic provider
- Build recon.Engine, call RegisterAll with cfg having all creds empty
- Assert eng.List() returns exactly these 10 names:
bitbucket, codeberg, codesandbox, gist, github, gitlab, huggingface, kaggle, replit, sandboxes
- Assert nil engine call is no-op (no panic)
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestRegisterAll -v -timeout 30s</automated>
</verify>
<done>
  RegisterAll wires all 10 sources; register_test.go green.
</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: Integration test across all sources + cmd/recon.go wiring</name>
<files>pkg/recon/sources/integration_test.go, cmd/recon.go</files>
<behavior>
- Integration test: spins up 10 httptest servers (or one multiplexed server with per-path routing) that return canned responses for each source's endpoints
- Uses BaseURL overrides on each source (direct construction, not RegisterAll, since RegisterAll uses production URLs)
- Registers each override-configured source on a fresh recon.Engine and calls SweepAll
- Asserts at least 1 Finding emerged for each of the 10 SourceType values: recon:github, recon:gitlab, recon:bitbucket, recon:gist, recon:codeberg, recon:huggingface, recon:replit, recon:codesandbox, recon:sandboxes, recon:kaggle
- CLI: `keyhunter recon list` (after wiring) prints all 10 source names in addition to "example"
</behavior>
<action>
Create `pkg/recon/sources/integration_test.go`:
- Build a single httptest server with a mux routing per-path:
`/search/code` (github) → ghSearchResponse JSON
`/api/v4/search` (gitlab) → blob array JSON
`/2.0/workspaces/ws/search/code` (bitbucket) → values JSON
`/gists/public` + `/raw/gist1` (gist) → gist list + raw matching keyword
`/api/v1/repos/search` (codeberg) → data array
`/api/spaces`, `/api/models` (huggingface) → id arrays
`/search?q=...&type=repls` (replit) → HTML fixture
`/search?query=...&type=sandboxes` (codesandbox) → HTML fixture
`/codepen-search` (sandboxes sub) → HTML; `/jsfiddle-search` → JSON
`/api/v1/kernels/list` (kaggle) → ref array
- For each source, construct with BaseURL/Platforms overrides pointing at test server
- Register all on a fresh recon.Engine
- Provide synthetic providers.Registry with keyword "sk-proj-" matching openai
- Call eng.SweepAll(ctx, recon.Config{Query:"ignored"})
- Assert findings grouped by SourceType covers all 10 expected values
- Use a 30s test timeout
Update `cmd/recon.go`:
- Import `github.com/salvacybersec/keyhunter/pkg/recon/sources`, `github.com/spf13/viper`, and the providers package
- In `buildReconEngine()`:
```go
func buildReconEngine() *recon.Engine {
e := recon.NewEngine()
e.Register(recon.ExampleSource{})
reg, err := providers.NewRegistry()
if err != nil {
fmt.Fprintf(os.Stderr, "recon: failed to load providers: %v\n", err)
return e
}
cfg := sources.SourcesConfig{
Registry: reg,
Limiters: recon.NewLimiterRegistry(),
GitHubToken: firstNonEmpty(os.Getenv("GITHUB_TOKEN"), viper.GetString("recon.github.token")),
GitLabToken: firstNonEmpty(os.Getenv("GITLAB_TOKEN"), viper.GetString("recon.gitlab.token")),
BitbucketToken: firstNonEmpty(os.Getenv("BITBUCKET_TOKEN"), viper.GetString("recon.bitbucket.token")),
BitbucketWorkspace: viper.GetString("recon.bitbucket.workspace"),
CodebergToken: firstNonEmpty(os.Getenv("CODEBERG_TOKEN"), viper.GetString("recon.codeberg.token")),
HuggingFaceToken: firstNonEmpty(os.Getenv("HUGGINGFACE_TOKEN"), viper.GetString("recon.huggingface.token")),
KaggleUser: firstNonEmpty(os.Getenv("KAGGLE_USERNAME"), viper.GetString("recon.kaggle.username")),
KaggleKey: firstNonEmpty(os.Getenv("KAGGLE_KEY"), viper.GetString("recon.kaggle.key")),
}
sources.RegisterAll(e, cfg)
return e
}
func firstNonEmpty(a, b string) string { if a != "" { return a }; return b }
```
- Preserve existing reconFullCmd / reconListCmd behavior.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestIntegration -v -timeout 60s && go build ./... && go run . recon list | sort</automated>
</verify>
<done>
Integration test passes with at least one Finding per SourceType across all 10
sources. `keyhunter recon list` prints all 10 source names plus "example".
</done>
</task>
</tasks>
<verification>
- `go build ./...`
- `go vet ./...`
- `go test ./pkg/recon/sources/... -v -timeout 60s`
- `go test ./pkg/recon/... -timeout 60s` (ensure no regression in Phase 9 recon tests)
- `go run . recon list` prints all 10 new source names
</verification>
<success_criteria>
All Phase 10 code hosting sources registered via sources.RegisterAll, wired into
cmd/recon.go, and exercised end-to-end by an integration test hitting httptest
fixtures for every source. Phase 10 requirements RECON-CODE-01..10 complete.
</success_criteria>
<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-09-SUMMARY.md`.
</output>