From 21d5551aa42cf4e12efd68ba3dbfc706f0e86cb6 Mon Sep 17 00:00:00 2001
From: salvacybersec
Date: Mon, 6 Apr 2026 01:18:53 +0300
Subject: [PATCH] docs(10-04): complete Bitbucket + Gist sources plan

---
 .../10-osint-code-hosting/10-04-SUMMARY.md | 117 ++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 .planning/phases/10-osint-code-hosting/10-04-SUMMARY.md

diff --git a/.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md b/.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md
new file mode 100644
index 0000000..7d9d162
--- /dev/null
+++ b/.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md
@@ -0,0 +1,117 @@
+---
+phase: 10-osint-code-hosting
+plan: 04
+subsystem: recon/sources
+tags: [recon, osint, bitbucket, gist, wave-2]
+requires:
+  - pkg/recon/sources.Client (Plan 10-01)
+  - pkg/recon/sources.BuildQueries (Plan 10-01)
+  - pkg/recon.LimiterRegistry (Phase 9)
+  - pkg/providers.Registry
+provides:
+  - pkg/recon/sources.BitbucketSource (RECON-CODE-03)
+  - pkg/recon/sources.GistSource (RECON-CODE-04)
+affects:
+  - pkg/recon/sources (two new source implementations)
+tech_stack_added: []
+patterns:
+  - "Token+workspace gating (Bitbucket requires both to enable)"
+  - "Content-scan fallback when API has no dedicated search (Gist)"
+  - "One Finding per gist (not per file) to avoid duplicate leak reports"
+  - "256KB read cap on raw content fetches"
+key_files_created:
+  - pkg/recon/sources/bitbucket.go
+  - pkg/recon/sources/bitbucket_test.go
+  - pkg/recon/sources/gist.go
+  - pkg/recon/sources/gist_test.go
+key_files_modified: []
+decisions:
+  - "BitbucketSource disables cleanly when either token OR workspace is empty (no error)"
+  - "GistSource enumerates /gists/public first page only; broader sweeps deferred"
+  - "GistSource emits one Finding per matching gist, not per file (prevents fan-out of a single leak)"
+  - "providerForQuery resolves keyword→provider name for Bitbucket Findings (API doesn't echo keyword)"
+  - "Bitbucket rate: rate.Every(3.6s) burst 1; Gist rate: rate.Every(2s) burst 1"
+metrics:
+  duration_minutes: 6
+  tasks_completed: 2
+  tests_added: 9
+  completed_at: "2026-04-05T22:30:00Z"
+requirements: [RECON-CODE-03, RECON-CODE-04]
+---
+
+# Phase 10 Plan 04: Bitbucket + Gist Sources Summary
+
+One-liner: BitbucketSource hits the Cloud 2.0 code search API with workspace+token gating, and GistSource fans out over /gists/public fetching each file's raw content to match provider keywords, emitting one Finding per matching gist.
+
+## What Was Built
+
+### BitbucketSource (RECON-CODE-03)
+- `pkg/recon/sources/bitbucket.go` — implements `recon.ReconSource`.
+- Endpoint: `GET {base}/2.0/workspaces/{workspace}/search/code?search_query={kw}`.
+- Auth: `Authorization: Bearer `.
+- Disabled when either `Token` or `Workspace` is empty (clean no-op, no error).
+- Rate: `rate.Every(3600ms)` burst 1 (Bitbucket 1000/hr API limit).
+- Iterates `BuildQueries(registry, "bitbucket")` — one request per provider keyword.
+- Decodes `{values:[{file:{path,commit{hash}},page_url}]}` and emits one Finding per entry.
+- `SourceType = "recon:bitbucket"`, `Source = page_url` (falls back to synthetic `bitbucket:{ws}/{path}@{hash}` when page_url missing).
+
+### GistSource (RECON-CODE-04)
+- `pkg/recon/sources/gist.go` — implements `recon.ReconSource`.
+- Endpoint: `GET {base}/gists/public?per_page=100`.
+- Per gist, per file: fetches `raw_url` (also with Bearer auth) and scans content against the provider keyword set (flattened `keyword → providerName` map).
+- 256KB read cap per raw file to avoid pathological payloads.
+- Emits **one Finding per matching gist** (breaks on first keyword match across that gist's files) — prevents a multi-file leak from producing N duplicate Findings.
+- `ProviderName` set from the matched keyword; `Source = gist.html_url`; `SourceType = "recon:gist"`.
+- Rate: `rate.Every(2s)` burst 1 (30 req/min). Limiter waited before **every** outbound request (list + each raw fetch) so GitHub's shared budget is respected.
+- Disabled when token is empty.
+
+## How It Fits
+- Depends on Plan 10-01 foundation: `sources.Client` (retry + 401→ErrUnauthorized), `BuildQueries`, `recon.LimiterRegistry`.
+- Does **not** modify `register.go` — Plan 10-09 wires all Wave 2 sources into `RegisterAll` after every plan lands.
+- Finding shape matches `engine.Finding` so downstream dedup/verify/storage paths in Phases 9/5/4 consume them without changes.
+
+## Tests
+
+`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist" -v`
+
+### Bitbucket (4 tests)
+- `TestBitbucket_EnabledRequiresTokenAndWorkspace` — all four gate combinations.
+- `TestBitbucket_SweepEmitsFindings` — httptest server, asserts `/2.0/workspaces/testws/search/code` path, Bearer header, non-empty `search_query`, Finding source/type.
+- `TestBitbucket_Unauthorized` — 401 → `errors.Is(err, ErrUnauthorized)`.
+- `TestBitbucket_ContextCancellation` — slow server + 50ms ctx deadline.
+
+### Gist (5 tests)
+- `TestGist_EnabledRequiresToken` — empty vs set token.
+- `TestGist_SweepEmitsFindingsOnKeywordMatch` — two gists, only one raw body contains `sk-proj-`; asserts exactly 1 Finding, correct `html_url`, `ProviderName=openai`.
+- `TestGist_NoMatch_NoFinding` — gist with unrelated content produces zero Findings.
+- `TestGist_Unauthorized` — 401 → `ErrUnauthorized`.
+- `TestGist_ContextCancellation` — slow server + 50ms ctx deadline.
+
+All 9 tests pass. `go build ./...` is clean.
+
+## Deviations from Plan
+
+None — plan executed exactly as written. No Rule 1/2/3 auto-fixes were required; all tests passed on first full run after writing implementations.
+
+## Decisions Made
+
+1. **Keyword→provider mapping on the Bitbucket side lives in `providerForQuery`** — Bitbucket's API doesn't echo the keyword in the response, so we parse the query back to a provider name. Simple substring match over registry keywords is sufficient at current scale.
+2. **GistSource emits one Finding per gist, not per file.** A single secret often lands in a `config.env` with supporting `README.md` and `docker-compose.yml` — treating the gist as the leak unit keeps noise down and matches how human reviewers triage.
+3. **Limiter waited before every raw fetch, not just the list call.** GitHub's 30/min budget is shared across API endpoints, so each raw content fetch consumes a token.
+4. **256KB cap on raw content reads.** Pathological gists (multi-MB logs, minified bundles) would otherwise block the sweep; 256KB is enough to surface a key that's typically near the top of a config file.
+
+## Commits
+
+- `d279abf` — feat(10-04): add BitbucketSource for code search recon
+- `0e16e8e` — feat(10-04): add GistSource for public gist keyword recon
+
+## Self-Check: PASSED
+
+- FOUND: pkg/recon/sources/bitbucket.go
+- FOUND: pkg/recon/sources/bitbucket_test.go
+- FOUND: pkg/recon/sources/gist.go
+- FOUND: pkg/recon/sources/gist_test.go
+- FOUND: commit d279abf
+- FOUND: commit 0e16e8e
+- Tests: 9/9 passing (`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist"`)
+- Build: `go build ./...` clean