keyhunter/.planning/phases/10-osint-code-hosting/10-04-SUMMARY.md
2026-04-06 01:18:53 +03:00


---
phase: 10-osint-code-hosting
plan: 04
subsystem: recon/sources
tags: [recon, osint, bitbucket, gist, wave-2]
requires:
- pkg/recon/sources.Client (Plan 10-01)
- pkg/recon/sources.BuildQueries (Plan 10-01)
- pkg/recon.LimiterRegistry (Phase 9)
- pkg/providers.Registry
provides:
- pkg/recon/sources.BitbucketSource (RECON-CODE-03)
- pkg/recon/sources.GistSource (RECON-CODE-04)
affects:
- pkg/recon/sources (two new source implementations)
tech_stack_added: []
patterns:
- "Token+workspace gating (Bitbucket requires both to enable)"
- "Content-scan fallback when API has no dedicated search (Gist)"
- "One Finding per gist (not per file) to avoid duplicate leak reports"
- "256KB read cap on raw content fetches"
key_files_created:
- pkg/recon/sources/bitbucket.go
- pkg/recon/sources/bitbucket_test.go
- pkg/recon/sources/gist.go
- pkg/recon/sources/gist_test.go
key_files_modified: []
decisions:
- "BitbucketSource disables cleanly when either token OR workspace is empty (no error)"
- "GistSource enumerates /gists/public first page only; broader sweeps deferred"
- "GistSource emits one Finding per matching gist, not per file (prevents fan-out of a single leak)"
- "providerForQuery resolves keyword→provider name for Bitbucket Findings (API doesn't echo keyword)"
- "Bitbucket rate: rate.Every(3.6s) burst 1; Gist rate: rate.Every(2s) burst 1"
metrics:
duration_minutes: 6
tasks_completed: 2
tests_added: 9
completed_at: "2026-04-05T22:30:00Z"
requirements: [RECON-CODE-03, RECON-CODE-04]
---
# Phase 10 Plan 04: Bitbucket + Gist Sources Summary
One-liner: BitbucketSource hits the Cloud 2.0 code search API with workspace+token gating, and GistSource fans out over /gists/public fetching each file's raw content to match provider keywords, emitting one Finding per matching gist.
## What Was Built
### BitbucketSource (RECON-CODE-03)
- `pkg/recon/sources/bitbucket.go` — implements `recon.ReconSource`.
- Endpoint: `GET {base}/2.0/workspaces/{workspace}/search/code?search_query={kw}`.
- Auth: `Authorization: Bearer <token>`.
- Disabled when either `Token` or `Workspace` is empty (clean no-op, no error).
- Rate: `rate.Every(3600 * time.Millisecond)` burst 1, i.e. one request every 3.6 s, which matches Bitbucket's 1000 requests/hour API limit.
- Iterates `BuildQueries(registry, "bitbucket")` — one request per provider keyword.
- Decodes `{values:[{file:{path,commit{hash}},page_url}]}` and emits one Finding per entry.
- `SourceType = "recon:bitbucket"`, `Source = page_url` (falls back to synthetic `bitbucket:{ws}/{path}@{hash}` when page_url missing).
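
The decode-and-fallback step described above can be sketched roughly as follows. This is a minimal illustration that assumes only the response fields listed in this summary; `searchResponse` and `sourcesFrom` are hypothetical names, not the identifiers used in `bitbucket.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// searchResponse mirrors only the fields named in this summary
// (hypothetical type; the real API response carries more data).
type searchResponse struct {
	Values []struct {
		File struct {
			Path   string `json:"path"`
			Commit struct {
				Hash string `json:"hash"`
			} `json:"commit"`
		} `json:"file"`
		PageURL string `json:"page_url"`
	} `json:"values"`
}

// sourcesFrom is a hypothetical helper: it returns one Source string per
// entry, preferring page_url and falling back to the synthetic
// bitbucket:{ws}/{path}@{hash} form when page_url is missing.
func sourcesFrom(body []byte, workspace string) ([]string, error) {
	var resp searchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode bitbucket search response: %w", err)
	}
	out := make([]string, 0, len(resp.Values))
	for _, v := range resp.Values {
		src := v.PageURL
		if src == "" {
			src = fmt.Sprintf("bitbucket:%s/%s@%s", workspace, v.File.Path, v.File.Commit.Hash)
		}
		out = append(out, src)
	}
	return out, nil
}

func main() {
	// Made-up sample payload: the second entry has no page_url.
	body := []byte(`{"values":[
		{"file":{"path":"a.env","commit":{"hash":"abc123"}},"page_url":"https://example.test/a"},
		{"file":{"path":"b.env","commit":{"hash":"def456"}}}
	]}`)
	srcs, err := sourcesFrom(body, "testws")
	if err != nil {
		panic(err)
	}
	for _, s := range srcs {
		fmt.Println(s)
	}
}
```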
### GistSource (RECON-CODE-04)
- `pkg/recon/sources/gist.go` — implements `recon.ReconSource`.
- Endpoint: `GET {base}/gists/public?per_page=100`.
- Per gist, per file: fetches `raw_url` (also with Bearer auth) and scans content against the provider keyword set (flattened `keyword → providerName` map).
- 256KB read cap per raw file to avoid pathological payloads.
- Emits **one Finding per matching gist** (breaks on first keyword match across that gist's files) — prevents a multi-file leak from producing N duplicate Findings.
- `ProviderName` set from the matched keyword; `Source = gist.html_url`; `SourceType = "recon:gist"`.
- Rate: `rate.Every(2 * time.Second)` burst 1 (30 req/min). The limiter is awaited before **every** outbound request (the list call and each raw fetch) so GitHub's shared budget is respected.
- Disabled when token is empty.
## How It Fits
- Depends on Plan 10-01 foundation: `sources.Client` (retry + 401→ErrUnauthorized), `BuildQueries`, `recon.LimiterRegistry`.
- Does **not** modify `register.go` — Plan 10-09 wires all Wave 2 sources into `RegisterAll` after every plan lands.
- Finding shape matches `engine.Finding` so downstream dedup/verify/storage paths in Phases 9/5/4 consume them without changes.
## Tests
`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist" -v`
### Bitbucket (4 tests)
- `TestBitbucket_EnabledRequiresTokenAndWorkspace` — all four gate combinations.
- `TestBitbucket_SweepEmitsFindings` — httptest server, asserts `/2.0/workspaces/testws/search/code` path, Bearer header, non-empty `search_query`, Finding source/type.
- `TestBitbucket_Unauthorized` — 401 → `errors.Is(err, ErrUnauthorized)`.
- `TestBitbucket_ContextCancellation` — slow server + 50ms ctx deadline.
### Gist (5 tests)
- `TestGist_EnabledRequiresToken` — empty vs set token.
- `TestGist_SweepEmitsFindingsOnKeywordMatch` — two gists, only one raw body contains `sk-proj-`; asserts exactly 1 Finding, correct `html_url`, `ProviderName=openai`.
- `TestGist_NoMatch_NoFinding` — gist with unrelated content produces zero Findings.
- `TestGist_Unauthorized` — 401 → `ErrUnauthorized`.
- `TestGist_ContextCancellation` — slow server + 50ms ctx deadline.
All 9 tests pass. `go build ./...` is clean.
## Deviations from Plan
None — plan executed exactly as written. No Rule 1/2/3 auto-fixes were required; all tests passed on first full run after writing implementations.
## Decisions Made
1. **Keyword→provider mapping on the Bitbucket side lives in `providerForQuery`** — Bitbucket's API doesn't echo the keyword in the response, so we parse the query back to a provider name. Simple substring match over registry keywords is sufficient at current scale.
2. **GistSource emits one Finding per gist, not per file.** A single secret often lands in a `config.env` with supporting `README.md` and `docker-compose.yml` — treating the gist as the leak unit keeps noise down and matches how human reviewers triage.
3. **The limiter is awaited before every raw fetch, not just the list call.** GitHub's 30/min budget is shared across API endpoints, so each raw content fetch consumes a token.
4. **256KB cap on raw content reads.** Pathological gists (multi-MB logs, minified bundles) would otherwise block the sweep; 256KB is enough to surface a key that's typically near the top of a config file.
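
As an illustration of decision 1, a substring-based lookup might look like the sketch below. The signature is an assumption (the summary does not show the real one), and the flattened `keyword → providerName` map stands in for whatever the registry actually exposes.

```go
package main

import (
	"fmt"
	"strings"
)

// providerForQuery (signature assumed for illustration) returns the provider
// whose keyword appears in the query string, or "" when none matches.
// Map iteration is unordered, so if keywords from two providers both occur
// in the query, the result is whichever is visited first.
func providerForQuery(query string, keywords map[string]string) string {
	for kw, provider := range keywords {
		if kw != "" && strings.Contains(query, kw) {
			return provider
		}
	}
	return ""
}

func main() {
	// Hypothetical keyword set; real keywords come from the provider registry.
	keywords := map[string]string{"sk-proj-": "openai"}
	fmt.Println(providerForQuery(`"sk-proj-"`, keywords))
}
```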
## Commits
- `d279abf` — feat(10-04): add BitbucketSource for code search recon
- `0e16e8e` — feat(10-04): add GistSource for public gist keyword recon
## Self-Check: PASSED
- FOUND: pkg/recon/sources/bitbucket.go
- FOUND: pkg/recon/sources/bitbucket_test.go
- FOUND: pkg/recon/sources/gist.go
- FOUND: pkg/recon/sources/gist_test.go
- FOUND: commit d279abf
- FOUND: commit 0e16e8e
- Tests: 9/9 passing (`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist"`)
- Build: `go build ./...` clean