Merge branch 'worktree-agent-a7f84823'

117
.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md
Normal file
@@ -0,0 +1,117 @@
---
phase: 10-osint-code-hosting
plan: 04
subsystem: recon/sources
tags: [recon, osint, bitbucket, gist, wave-2]
requires:
  - pkg/recon/sources.Client (Plan 10-01)
  - pkg/recon/sources.BuildQueries (Plan 10-01)
  - pkg/recon.LimiterRegistry (Phase 9)
  - pkg/providers.Registry
provides:
  - pkg/recon/sources.BitbucketSource (RECON-CODE-03)
  - pkg/recon/sources.GistSource (RECON-CODE-04)
affects:
  - pkg/recon/sources (two new source implementations)
tech_stack_added: []
patterns:
  - "Token+workspace gating (Bitbucket requires both to enable)"
  - "Content-scan fallback when API has no dedicated search (Gist)"
  - "One Finding per gist (not per file) to avoid duplicate leak reports"
  - "256KB read cap on raw content fetches"
key_files_created:
  - pkg/recon/sources/bitbucket.go
  - pkg/recon/sources/bitbucket_test.go
  - pkg/recon/sources/gist.go
  - pkg/recon/sources/gist_test.go
key_files_modified: []
decisions:
  - "BitbucketSource disables cleanly when either token OR workspace is empty (no error)"
  - "GistSource enumerates /gists/public first page only; broader sweeps deferred"
  - "GistSource emits one Finding per matching gist, not per file (prevents fan-out of a single leak)"
  - "providerForQuery resolves keyword→provider name for Bitbucket Findings (API doesn't echo keyword)"
  - "Bitbucket rate: rate.Every(3.6s) burst 1; Gist rate: rate.Every(2s) burst 1"
metrics:
  duration_minutes: 6
  tasks_completed: 2
  tests_added: 9
completed_at: "2026-04-05T22:30:00Z"
requirements: [RECON-CODE-03, RECON-CODE-04]
---

# Phase 10 Plan 04: Bitbucket + Gist Sources Summary

One-liner: BitbucketSource hits the Bitbucket Cloud 2.0 code search API with workspace+token gating; GistSource fans out over /gists/public, fetching each file's raw content to match provider keywords and emitting one Finding per matching gist.

## What Was Built

### BitbucketSource (RECON-CODE-03)

- `pkg/recon/sources/bitbucket.go` — implements `recon.ReconSource`.
- Endpoint: `GET {base}/2.0/workspaces/{workspace}/search/code?search_query={kw}`.
- Auth: `Authorization: Bearer <token>`.
- Disabled when either `Token` or `Workspace` is empty (clean no-op, no error).
- Rate: `rate.Every(3600ms)` burst 1 (Bitbucket 1000/hr API limit).
- Iterates `BuildQueries(registry, "bitbucket")` — one request per provider keyword.
- Decodes `{values:[{file:{path,commit:{hash}},page_url}]}` and emits one Finding per entry.
- `SourceType = "recon:bitbucket"`, `Source = page_url` (falls back to synthetic `bitbucket:{ws}/{path}@{hash}` when page_url is missing).

### GistSource (RECON-CODE-04)

- `pkg/recon/sources/gist.go` — implements `recon.ReconSource`.
- Endpoint: `GET {base}/gists/public?per_page=100`.
- Per gist, per file: fetches `raw_url` (also with Bearer auth) and scans the content against the provider keyword set (a flattened `keyword → providerName` map).
- 256KB read cap per raw file to avoid pathological payloads.
- Emits **one Finding per matching gist** (breaks on the first keyword match across that gist's files) — prevents a multi-file leak from producing N duplicate Findings.
- `ProviderName` set from the matched keyword; `Source = gist.html_url`; `SourceType = "recon:gist"`.
- Rate: `rate.Every(2s)` burst 1 (30 req/min). The limiter is waited on before **every** outbound request (the list call plus each raw fetch) so GitHub's shared budget is respected.
- Disabled when the token is empty.

## How It Fits

- Depends on the Plan 10-01 foundation: `sources.Client` (retry + 401→ErrUnauthorized), `BuildQueries`, `recon.LimiterRegistry`.
- Does **not** modify `register.go` — Plan 10-09 wires all Wave 2 sources into `RegisterAll` after every plan lands.
- The Finding shape matches `engine.Finding`, so the downstream dedup/verify/storage paths in Phases 9/5/4 consume them without changes.

## Tests

`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist" -v`

### Bitbucket (4 tests)

- `TestBitbucket_EnabledRequiresTokenAndWorkspace` — all four gate combinations.
- `TestBitbucket_SweepEmitsFindings` — httptest server; asserts the `/2.0/workspaces/testws/search/code` path, Bearer header, non-empty `search_query`, and Finding source/type.
- `TestBitbucket_Unauthorized` — 401 → `errors.Is(err, ErrUnauthorized)`.
- `TestBitbucket_ContextCancellation` — slow server + 50ms ctx deadline.

### Gist (5 tests)

- `TestGist_EnabledRequiresToken` — empty vs. set token.
- `TestGist_SweepEmitsFindingsOnKeywordMatch` — two gists, only one raw body contains `sk-proj-`; asserts exactly 1 Finding, correct `html_url`, `ProviderName=openai`.
- `TestGist_NoMatch_NoFinding` — a gist with unrelated content produces zero Findings.
- `TestGist_Unauthorized` — 401 → `ErrUnauthorized`.
- `TestGist_ContextCancellation` — slow server + 50ms ctx deadline.

All 9 tests pass. `go build ./...` is clean.

## Deviations from Plan

None — the plan executed exactly as written. No Rule 1/2/3 auto-fixes were required; all tests passed on the first full run after the implementations were written.

## Decisions Made

1. **Keyword→provider mapping on the Bitbucket side lives in `providerForQuery`** — Bitbucket's API doesn't echo the keyword in the response, so we parse the query back to a provider name. A simple substring match over registry keywords is sufficient at the current scale.

2. **GistSource emits one Finding per gist, not per file.** A single secret often lands in a `config.env` alongside a supporting `README.md` and `docker-compose.yml` — treating the gist as the leak unit keeps noise down and matches how human reviewers triage.

3. **The limiter is waited on before every raw fetch, not just the list call.** GitHub's 30/min budget is shared across API endpoints, so each raw content fetch consumes a token.

4. **256KB cap on raw content reads.** Pathological gists (multi-MB logs, minified bundles) would otherwise stall the sweep; 256KB is enough to surface a key, which typically sits near the top of a config file.

## Commits

- `d279abf` — feat(10-04): add BitbucketSource for code search recon
- `0e16e8e` — feat(10-04): add GistSource for public gist keyword recon

## Self-Check: PASSED

- FOUND: pkg/recon/sources/bitbucket.go
- FOUND: pkg/recon/sources/bitbucket_test.go
- FOUND: pkg/recon/sources/gist.go
- FOUND: pkg/recon/sources/gist_test.go
- FOUND: commit d279abf
- FOUND: commit 0e16e8e
- Tests: 9/9 passing (`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist"`)
- Build: `go build ./...` clean
174
pkg/recon/sources/bitbucket.go
Normal file
@@ -0,0 +1,174 @@
package sources

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"

	"golang.org/x/time/rate"

	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/recon"
)

// BitbucketSource queries the Bitbucket Cloud 2.0 code search API for leaked
// provider keywords across a configured workspace (RECON-CODE-03).
//
// Docs: https://developer.atlassian.com/cloud/bitbucket/rest/api-group-search/
// Rate: 1000 req/hour → rate.Every(3.6s), burst 1.
// Scope: requires both a token (app password or OAuth) AND a workspace slug;
// absent either, the source disables itself cleanly (no error).
type BitbucketSource struct {
	Token     string
	Workspace string
	BaseURL   string
	Registry  *providers.Registry
	Limiters  *recon.LimiterRegistry

	client *Client
}

var _ recon.ReconSource = (*BitbucketSource)(nil)

// Name returns the stable source identifier.
func (s *BitbucketSource) Name() string { return "bitbucket" }

// RateLimit reports the per-source token bucket rate (1000/hour).
func (s *BitbucketSource) RateLimit() rate.Limit {
	return rate.Every(3600 * time.Millisecond)
}

// Burst reports the token bucket burst capacity.
func (s *BitbucketSource) Burst() int { return 1 }

// RespectsRobots reports whether robots.txt applies (REST API → false).
func (s *BitbucketSource) RespectsRobots() bool { return false }

// Enabled reports whether the source should run. Requires both token and
// workspace to be non-empty.
func (s *BitbucketSource) Enabled(cfg recon.Config) bool {
	return s.Token != "" && s.Workspace != ""
}

// bitbucketSearchResponse mirrors the subset of the Bitbucket code search
// response shape this source consumes.
type bitbucketSearchResponse struct {
	Values []struct {
		ContentMatchCount int    `json:"content_match_count"`
		PageURL           string `json:"page_url"`
		File              struct {
			Path   string `json:"path"`
			Commit struct {
				Hash string `json:"hash"`
			} `json:"commit"`
		} `json:"file"`
	} `json:"values"`
}

// Sweep iterates queries built from the provider registry, issues one search
// request per query (rate-limited via Limiters), and emits one Finding per
// `values` entry in the response.
func (s *BitbucketSource) Sweep(ctx context.Context, _ string, out chan<- recon.Finding) error {
	if s.client == nil {
		s.client = NewClient()
	}
	base := s.BaseURL
	if base == "" {
		base = "https://api.bitbucket.org"
	}

	queries := BuildQueries(s.Registry, "bitbucket")
	for _, q := range queries {
		if err := ctx.Err(); err != nil {
			return err
		}

		if s.Limiters != nil {
			if err := s.Limiters.Wait(ctx, s.Name(), s.RateLimit(), s.Burst(), false); err != nil {
				return err
			}
		}

		endpoint := fmt.Sprintf("%s/2.0/workspaces/%s/search/code", base, url.PathEscape(s.Workspace))
		req, err := http.NewRequest(http.MethodGet, endpoint, nil)
		if err != nil {
			return fmt.Errorf("bitbucket: build request: %w", err)
		}
		vals := req.URL.Query()
		vals.Set("search_query", q)
		req.URL.RawQuery = vals.Encode()
		req.Header.Set("Authorization", "Bearer "+s.Token)
		req.Header.Set("Accept", "application/json")

		resp, err := s.client.Do(ctx, req)
		if err != nil {
			return fmt.Errorf("bitbucket: sweep: %w", err)
		}

		var body bitbucketSearchResponse
		dec := json.NewDecoder(resp.Body)
		decodeErr := dec.Decode(&body)
		_ = resp.Body.Close()
		if decodeErr != nil {
			return fmt.Errorf("bitbucket: decode: %w", decodeErr)
		}

		for _, v := range body.Values {
			src := v.PageURL
			if src == "" {
				src = fmt.Sprintf("bitbucket:%s/%s@%s", s.Workspace, v.File.Path, v.File.Commit.Hash)
			}
			f := recon.Finding{
				ProviderName: providerForQuery(s.Registry, q),
				Source:       src,
				SourceType:   "recon:bitbucket",
				DetectedAt:   time.Now().UTC(),
			}
			select {
			case out <- f:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}

	return nil
}

// providerForQuery returns the provider name whose keyword appears in q, or
// the empty string if no match is found. Used to label Findings with their
// source provider when the remote API doesn't echo the original keyword.
func providerForQuery(reg *providers.Registry, q string) string {
	if reg == nil {
		return ""
	}
	for _, p := range reg.List() {
		for _, k := range p.Keywords {
			if k == "" {
				continue
			}
			if containsFold(q, k) {
				return p.Name
			}
		}
	}
	return ""
}

// containsFold reports whether needle occurs within haystack under
// case-insensitive comparison.
func containsFold(haystack, needle string) bool {
	if needle == "" {
		return false
	}
	if len(needle) > len(haystack) {
		return false
	}
	for i := 0; i+len(needle) <= len(haystack); i++ {
		if strings.EqualFold(haystack[i:i+len(needle)], needle) {
			return true
		}
	}
	return false
}
132
pkg/recon/sources/bitbucket_test.go
Normal file
@@ -0,0 +1,132 @@
package sources

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
	"time"

	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/recon"
)

func bitbucketTestRegistry() *providers.Registry {
	return providers.NewRegistryFromProviders([]providers.Provider{
		{Name: "openai", Keywords: []string{"sk-proj-"}},
	})
}

func newBitbucketSource(baseURL, token, workspace string) *BitbucketSource {
	return &BitbucketSource{
		Token:     token,
		Workspace: workspace,
		BaseURL:   baseURL,
		Registry:  bitbucketTestRegistry(),
		Limiters:  recon.NewLimiterRegistry(),
	}
}

func TestBitbucket_EnabledRequiresTokenAndWorkspace(t *testing.T) {
	cfg := recon.Config{}

	if newBitbucketSource("", "", "").Enabled(cfg) {
		t.Fatal("expected disabled when token+workspace empty")
	}
	if newBitbucketSource("", "tok", "").Enabled(cfg) {
		t.Fatal("expected disabled when workspace empty")
	}
	if newBitbucketSource("", "", "ws").Enabled(cfg) {
		t.Fatal("expected disabled when token empty")
	}
	if !newBitbucketSource("", "tok", "ws").Enabled(cfg) {
		t.Fatal("expected enabled when both set")
	}
}

func TestBitbucket_SweepEmitsFindings(t *testing.T) {
	var gotAuth, gotPath, gotQuery string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		gotAuth = r.Header.Get("Authorization")
		gotPath = r.URL.Path
		gotQuery = r.URL.Query().Get("search_query")
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write([]byte(`{
			"values": [
				{
					"content_match_count": 2,
					"file": {"path": "secrets/.env", "commit": {"hash": "deadbeef"}},
					"page_url": "https://bitbucket.org/testws/repo/src/deadbeef/secrets/.env"
				}
			]
		}`))
	}))
	t.Cleanup(srv.Close)

	src := newBitbucketSource(srv.URL, "tok", "testws")
	out := make(chan recon.Finding, 16)
	if err := src.Sweep(context.Background(), "", out); err != nil {
		t.Fatalf("Sweep: %v", err)
	}
	close(out)

	if gotAuth != "Bearer tok" {
		t.Errorf("Authorization header = %q, want Bearer tok", gotAuth)
	}
	if gotPath != "/2.0/workspaces/testws/search/code" {
		t.Errorf("path = %q", gotPath)
	}
	if gotQuery == "" {
		t.Errorf("expected search_query param to be set")
	}

	var findings []recon.Finding
	for f := range out {
		findings = append(findings, f)
	}
	if len(findings) == 0 {
		t.Fatal("expected at least 1 finding")
	}
	f := findings[0]
	if f.SourceType != "recon:bitbucket" {
		t.Errorf("SourceType = %q", f.SourceType)
	}
	if !strings.Contains(f.Source, "bitbucket.org/testws/repo") {
		t.Errorf("Source = %q", f.Source)
	}
}

func TestBitbucket_Unauthorized(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "nope", http.StatusUnauthorized)
	}))
	t.Cleanup(srv.Close)

	src := newBitbucketSource(srv.URL, "tok", "testws")
	out := make(chan recon.Finding, 4)
	err := src.Sweep(context.Background(), "", out)
	if !errors.Is(err, ErrUnauthorized) {
		t.Fatalf("err = %v, want ErrUnauthorized", err)
	}
}

func TestBitbucket_ContextCancellation(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{"values":[]}`))
	}))
	t.Cleanup(srv.Close)

	src := newBitbucketSource(srv.URL, "tok", "testws")
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	out := make(chan recon.Finding, 1)
	err := src.Sweep(ctx, "", out)
	if err == nil {
		t.Fatal("expected error from cancelled context")
	}
}
184
pkg/recon/sources/gist.go
Normal file
@@ -0,0 +1,184 @@
package sources

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"

	"golang.org/x/time/rate"

	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/recon"
)

// GistSource scans recent public GitHub Gists for provider keyword leaks
// (RECON-CODE-04).
//
// GitHub does not expose a dedicated /search/gists endpoint, so this source
// enumerates /gists/public (most-recent page) and fetches each file's raw URL
// to scan its content against the provider keyword set. Keep Phase 10 minimal:
// only the first page is walked; broader sweeps are a future optimization.
//
// Auth: GitHub token via Bearer header. Rate: 30 req/min (shared with GitHub
// search limits) → rate.Every(2s), burst 1.
type GistSource struct {
	Token    string
	BaseURL  string
	Registry *providers.Registry
	Limiters *recon.LimiterRegistry

	client *Client
}

var _ recon.ReconSource = (*GistSource)(nil)

// Name returns the stable source identifier.
func (s *GistSource) Name() string { return "gist" }

// RateLimit reports the per-source token bucket rate (30/min).
func (s *GistSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) }

// Burst reports the token bucket burst capacity.
func (s *GistSource) Burst() int { return 1 }

// RespectsRobots reports whether robots.txt applies (REST API → false).
func (s *GistSource) RespectsRobots() bool { return false }

// Enabled reports whether the source runs. Requires a GitHub token.
func (s *GistSource) Enabled(_ recon.Config) bool { return s.Token != "" }

type gistListEntry struct {
	HTMLURL string `json:"html_url"`
	Files   map[string]struct {
		Filename string `json:"filename"`
		RawURL   string `json:"raw_url"`
	} `json:"files"`
}

// Sweep fetches /gists/public, scans each file's raw content against the
// keyword set from the registry, and emits one Finding per gist that matches
// any keyword (not one per file — gists often split a single leak across
// helper files).
func (s *GistSource) Sweep(ctx context.Context, _ string, out chan<- recon.Finding) error {
	if s.client == nil {
		s.client = NewClient()
	}
	base := s.BaseURL
	if base == "" {
		base = "https://api.github.com"
	}

	keywords := s.keywordSet()
	if len(keywords) == 0 {
		return nil
	}

	if s.Limiters != nil {
		if err := s.Limiters.Wait(ctx, s.Name(), s.RateLimit(), s.Burst(), false); err != nil {
			return err
		}
	}

	listReq, err := http.NewRequest(http.MethodGet, base+"/gists/public?per_page=100", nil)
	if err != nil {
		return fmt.Errorf("gist: build list request: %w", err)
	}
	listReq.Header.Set("Authorization", "Bearer "+s.Token)
	listReq.Header.Set("Accept", "application/vnd.github+json")

	listResp, err := s.client.Do(ctx, listReq)
	if err != nil {
		return fmt.Errorf("gist: list: %w", err)
	}
	var gists []gistListEntry
	dec := json.NewDecoder(listResp.Body)
	decodeErr := dec.Decode(&gists)
	_ = listResp.Body.Close()
	if decodeErr != nil {
		return fmt.Errorf("gist: decode list: %w", decodeErr)
	}

	for _, g := range gists {
		if err := ctx.Err(); err != nil {
			return err
		}
		matched := false
		var matchedProvider string

	fileLoop:
		for _, f := range g.Files {
			if f.RawURL == "" {
				continue
			}
			if s.Limiters != nil {
				if err := s.Limiters.Wait(ctx, s.Name(), s.RateLimit(), s.Burst(), false); err != nil {
					return err
				}
			}

			rawReq, err := http.NewRequest(http.MethodGet, f.RawURL, nil)
			if err != nil {
				return fmt.Errorf("gist: build raw request: %w", err)
			}
			rawReq.Header.Set("Authorization", "Bearer "+s.Token)
			rawResp, err := s.client.Do(ctx, rawReq)
			if err != nil {
				return fmt.Errorf("gist: fetch raw: %w", err)
			}
			// Cap read to 256KB to avoid pathological gists.
			body, readErr := io.ReadAll(io.LimitReader(rawResp.Body, 256*1024))
			_ = rawResp.Body.Close()
			if readErr != nil {
				return fmt.Errorf("gist: read raw: %w", readErr)
			}

			content := string(body)
			for kw, provName := range keywords {
				if strings.Contains(content, kw) {
					matched = true
					matchedProvider = provName
					break fileLoop
				}
			}
		}

		if matched {
			select {
			case out <- recon.Finding{
				ProviderName: matchedProvider,
				Source:       g.HTMLURL,
				SourceType:   "recon:gist",
				DetectedAt:   time.Now().UTC(),
			}:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}

	return nil
}

// keywordSet flattens the registry into a keyword→providerName map for
// content scanning. Empty keywords are skipped.
func (s *GistSource) keywordSet() map[string]string {
	out := make(map[string]string)
	if s.Registry == nil {
		return out
	}
	for _, p := range s.Registry.List() {
		for _, k := range p.Keywords {
			if k == "" {
				continue
			}
			if _, ok := out[k]; !ok {
				out[k] = p.Name
			}
		}
	}
	return out
}
166
pkg/recon/sources/gist_test.go
Normal file
@@ -0,0 +1,166 @@
package sources

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
	"time"

	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/recon"
)

func gistTestRegistry() *providers.Registry {
	return providers.NewRegistryFromProviders([]providers.Provider{
		{Name: "openai", Keywords: []string{"sk-proj-"}},
	})
}

func newGistSource(baseURL, token string) *GistSource {
	return &GistSource{
		Token:    token,
		BaseURL:  baseURL,
		Registry: gistTestRegistry(),
		Limiters: recon.NewLimiterRegistry(),
	}
}

func TestGist_EnabledRequiresToken(t *testing.T) {
	cfg := recon.Config{}
	if newGistSource("", "").Enabled(cfg) {
		t.Fatal("expected disabled when token empty")
	}
	if !newGistSource("", "tok").Enabled(cfg) {
		t.Fatal("expected enabled when token set")
	}
}

func TestGist_SweepEmitsFindingsOnKeywordMatch(t *testing.T) {
	var gotAuth, gotListPath string

	mux := http.NewServeMux()
	var srv *httptest.Server

	mux.HandleFunc("/gists/public", func(w http.ResponseWriter, r *http.Request) {
		gotAuth = r.Header.Get("Authorization")
		gotListPath = r.URL.Path
		w.Header().Set("Content-Type", "application/json")
		body := fmt.Sprintf(`[
			{
				"html_url": "https://gist.github.com/alice/aaa",
				"files": {
					"leak.env": {"filename": "leak.env", "raw_url": "%s/raw/aaa"}
				}
			},
			{
				"html_url": "https://gist.github.com/bob/bbb",
				"files": {
					"notes.md": {"filename": "notes.md", "raw_url": "%s/raw/bbb"}
				}
			}
		]`, srv.URL, srv.URL)
		_, _ = w.Write([]byte(body))
	})
	mux.HandleFunc("/raw/aaa", func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("OPENAI_API_KEY=sk-proj-1234567890abcdefghijk"))
	})
	mux.HandleFunc("/raw/bbb", func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("just some unrelated notes here"))
	})

	srv = httptest.NewServer(mux)
	t.Cleanup(srv.Close)

	src := newGistSource(srv.URL, "tok")
	out := make(chan recon.Finding, 8)
	if err := src.Sweep(context.Background(), "", out); err != nil {
		t.Fatalf("Sweep: %v", err)
	}
	close(out)

	if gotAuth != "Bearer tok" {
		t.Errorf("Authorization = %q", gotAuth)
	}
	if gotListPath != "/gists/public" {
		t.Errorf("list path = %q", gotListPath)
	}

	var findings []recon.Finding
	for f := range out {
		findings = append(findings, f)
	}
	if len(findings) != 1 {
		t.Fatalf("findings count = %d, want 1 (only aaa matches sk-proj-)", len(findings))
	}
	f := findings[0]
	if !strings.Contains(f.Source, "alice/aaa") {
		t.Errorf("Source = %q, want gist alice/aaa", f.Source)
	}
	if f.SourceType != "recon:gist" {
		t.Errorf("SourceType = %q", f.SourceType)
	}
	if f.ProviderName != "openai" {
		t.Errorf("ProviderName = %q, want openai", f.ProviderName)
	}
}

func TestGist_NoMatch_NoFinding(t *testing.T) {
	var srv *httptest.Server
	mux := http.NewServeMux()
	mux.HandleFunc("/gists/public", func(w http.ResponseWriter, r *http.Request) {
		body := fmt.Sprintf(`[{"html_url":"https://gist.github.com/x/y","files":{"a.txt":{"filename":"a.txt","raw_url":"%s/raw/x"}}}]`, srv.URL)
		_, _ = w.Write([]byte(body))
	})
	mux.HandleFunc("/raw/x", func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("nothing interesting"))
	})
	srv = httptest.NewServer(mux)
	t.Cleanup(srv.Close)

	src := newGistSource(srv.URL, "tok")
	out := make(chan recon.Finding, 4)
	if err := src.Sweep(context.Background(), "", out); err != nil {
		t.Fatalf("Sweep: %v", err)
	}
	close(out)
	n := 0
	for range out {
		n++
	}
	if n != 0 {
		t.Fatalf("findings = %d, want 0", n)
	}
}

func TestGist_Unauthorized(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "bad", http.StatusUnauthorized)
	}))
	t.Cleanup(srv.Close)
	src := newGistSource(srv.URL, "tok")
	out := make(chan recon.Finding, 1)
	err := src.Sweep(context.Background(), "", out)
	if !errors.Is(err, ErrUnauthorized) {
		t.Fatalf("err = %v, want ErrUnauthorized", err)
	}
}

func TestGist_ContextCancellation(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
		_, _ = w.Write([]byte(`[]`))
	}))
	t.Cleanup(srv.Close)
	src := newGistSource(srv.URL, "tok")
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	out := make(chan recon.Finding, 1)
	err := src.Sweep(ctx, "", out)
	if err == nil {
		t.Fatal("expected error on cancelled ctx")
	}
}