---
phase: 10-osint-code-hosting
plan: 04
subsystem: recon/sources
tags: [recon, osint, bitbucket, gist, wave-2]
requires:
  - pkg/recon/sources.Client (Plan 10-01)
  - pkg/recon/sources.BuildQueries (Plan 10-01)
  - pkg/recon.LimiterRegistry (Phase 9)
  - pkg/providers.Registry
provides:
  - pkg/recon/sources.BitbucketSource (RECON-CODE-03)
  - pkg/recon/sources.GistSource (RECON-CODE-04)
affects:
  - pkg/recon/sources (two new source implementations)
tech_stack_added: []
patterns:
  - "Token+workspace gating (Bitbucket requires both to enable)"
  - "Content-scan fallback when API has no dedicated search (Gist)"
  - "One Finding per gist (not per file) to avoid duplicate leak reports"
  - "256KB read cap on raw content fetches"
key_files_created:
  - pkg/recon/sources/bitbucket.go
  - pkg/recon/sources/bitbucket_test.go
  - pkg/recon/sources/gist.go
  - pkg/recon/sources/gist_test.go
key_files_modified: []
decisions:
  - "BitbucketSource disables cleanly when either token OR workspace is empty (no error)"
  - "GistSource enumerates /gists/public first page only; broader sweeps deferred"
  - "GistSource emits one Finding per matching gist, not per file (prevents fan-out of a single leak)"
  - "providerForQuery resolves keyword→provider name for Bitbucket Findings (API doesn't echo keyword)"
  - "Bitbucket rate: rate.Every(3.6s) burst 1; Gist rate: rate.Every(2s) burst 1"
metrics:
  duration_minutes: 6
  tasks_completed: 2
  tests_added: 9
  completed_at: "2026-04-05T22:30:00Z"
requirements: [RECON-CODE-03, RECON-CODE-04]
---

# Phase 10 Plan 04: Bitbucket + Gist Sources Summary

One-liner: BitbucketSource hits the Cloud 2.0 code search API with workspace+token gating, and GistSource fans out over /gists/public fetching each file's raw content to match provider keywords, emitting one Finding per matching gist.

## What Was Built

### BitbucketSource (RECON-CODE-03)
- `pkg/recon/sources/bitbucket.go` — implements `recon.ReconSource`.
- Endpoint: `GET {base}/2.0/workspaces/{workspace}/search/code?search_query={kw}`.
- Auth: `Authorization: Bearer <token>`.
- Disabled when either `Token` or `Workspace` is empty (clean no-op, no error).
- Rate: `rate.Every(3600ms)` burst 1 (Bitbucket 1000/hr API limit).
- Iterates `BuildQueries(registry, "bitbucket")` — one request per provider keyword.
- Decodes `{values:[{file:{path,commit{hash}},page_url}]}` and emits one Finding per entry.
- `SourceType = "recon:bitbucket"`, `Source = page_url` (falls back to synthetic `bitbucket:{ws}/{path}@{hash}` when page_url missing).

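The gating rule, URL shape, and `Source` fallback listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `bitbucket.go`: the `bitbucketConfig` type, its methods, `findingSources`, and the base URL used in `main` are hypothetical names chosen for the example. Only the endpoint path, the JSON field names, and the fallback format are taken from this summary.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// bitbucketConfig is a hypothetical stand-in for the source's settings.
type bitbucketConfig struct {
	BaseURL   string
	Token     string
	Workspace string
}

// enabled mirrors the gating rule: both token and workspace must be set.
func (c bitbucketConfig) enabled() bool {
	return c.Token != "" && c.Workspace != ""
}

// searchURL builds the code search URL for a single provider keyword.
func (c bitbucketConfig) searchURL(keyword string) string {
	return fmt.Sprintf("%s/2.0/workspaces/%s/search/code?search_query=%s",
		strings.TrimRight(c.BaseURL, "/"),
		url.PathEscape(c.Workspace),
		url.QueryEscape(keyword))
}

// searchResponse models only the fields named in the summary.
type searchResponse struct {
	Values []struct {
		File struct {
			Path   string `json:"path"`
			Commit struct {
				Hash string `json:"hash"`
			} `json:"commit"`
		} `json:"file"`
		PageURL string `json:"page_url"`
	} `json:"values"`
}

// findingSources decodes a response body and returns one Source string per
// entry, using the synthetic form when page_url is missing.
func findingSources(workspace string, body []byte) ([]string, error) {
	var resp searchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(resp.Values))
	for _, v := range resp.Values {
		src := v.PageURL
		if src == "" {
			src = fmt.Sprintf("bitbucket:%s/%s@%s", workspace, v.File.Path, v.File.Commit.Hash)
		}
		out = append(out, src)
	}
	return out, nil
}

func main() {
	// The base URL here is an assumption for the example.
	cfg := bitbucketConfig{BaseURL: "https://api.bitbucket.org", Token: "t", Workspace: "testws"}
	fmt.Println(cfg.enabled())
	fmt.Println(cfg.searchURL("sk-proj-"))

	body := []byte(`{"values":[{"file":{"path":"a/.env","commit":{"hash":"abc123"}}}]}`)
	srcs, err := findingSources(cfg.Workspace, body)
	fmt.Println(srcs, err)
}
```

Returning `false` from the gate rather than an error is what lets a sweep skip an unconfigured source silently.
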
### GistSource (RECON-CODE-04)
- `pkg/recon/sources/gist.go` — implements `recon.ReconSource`.
- Endpoint: `GET {base}/gists/public?per_page=100`.
- Per gist, per file: fetches `raw_url` (also with Bearer auth) and scans content against the provider keyword set (flattened `keyword → providerName` map).
- 256KB read cap per raw file to avoid pathological payloads.
- Emits **one Finding per matching gist** (breaks on first keyword match across that gist's files), which prevents a multi-file leak from producing N duplicate Findings.
- `ProviderName` set from the matched keyword; `Source = gist.html_url`; `SourceType = "recon:gist"`.
- Rate: `rate.Every(2s)` burst 1 (30 req/min). The limiter is awaited before **every** outbound request (the list call and each raw fetch) so GitHub's shared budget is respected.
- Disabled when token is empty.

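The per-gist scan described above (capped read, first match wins, one Finding per gist) might look something like the sketch below. It is not the code in `gist.go`: the `gist` and `finding` types, `gistFromStrings`, `matchProvider`, and `scanGist` are hypothetical, and in-memory readers stand in for the `raw_url` fetches so the example runs offline. Sorting the keywords is an assumption added here to keep the result deterministic.

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

// maxRawBytes is the 256KB read cap applied to each raw file.
const maxRawBytes = 256 * 1024

// gist is a hypothetical, trimmed-down view of one listed gist.
// Files holds readers standing in for the fetched raw_url bodies.
type gist struct {
	HTMLURL string
	Files   []io.Reader
}

// finding carries only the fields the summary names.
type finding struct {
	Source, SourceType, ProviderName string
}

// gistFromStrings builds a gist whose files are in-memory strings.
func gistFromStrings(htmlURL string, bodies ...string) gist {
	g := gist{HTMLURL: htmlURL}
	for _, b := range bodies {
		g.Files = append(g.Files, strings.NewReader(b))
	}
	return g
}

// matchProvider returns the provider for the first keyword (in sorted
// order) that occurs in content.
func matchProvider(content string, keywords map[string]string) (string, bool) {
	keys := make([]string, 0, len(keywords))
	for kw := range keywords {
		keys = append(keys, kw)
	}
	sort.Strings(keys)
	for _, kw := range keys {
		if kw != "" && strings.Contains(content, kw) {
			return keywords[kw], true
		}
	}
	return "", false
}

// scanGist reads at most maxRawBytes from each file and returns as soon as
// one file matches, so a gist yields at most one finding.
func scanGist(g gist, keywords map[string]string) (finding, bool, error) {
	for _, f := range g.Files {
		data, err := io.ReadAll(io.LimitReader(f, maxRawBytes))
		if err != nil {
			return finding{}, false, err
		}
		if provider, ok := matchProvider(string(data), keywords); ok {
			return finding{
				Source:       g.HTMLURL,
				SourceType:   "recon:gist",
				ProviderName: provider,
			}, true, nil
		}
	}
	return finding{}, false, nil
}

func main() {
	keywords := map[string]string{"sk-proj-": "openai"}
	g := gistFromStrings("https://gist.example/abc", "# README", "OPENAI_API_KEY=sk-proj-abc")
	f, ok, err := scanGist(g, keywords)
	fmt.Println(f, ok, err)
}
```

One consequence of the cap worth keeping in mind: a keyword that first appears past the 256KB mark in a file is not seen by this scan.
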
## How It Fits
- Depends on Plan 10-01 foundation: `sources.Client` (retry + 401→ErrUnauthorized), `BuildQueries`, `recon.LimiterRegistry`.
- Does **not** modify `register.go` — Plan 10-09 wires all Wave 2 sources into `RegisterAll` after every plan lands.
- Finding shape matches `engine.Finding` so downstream dedup/verify/storage paths in Phases 9/5/4 consume them without changes.

## Tests

`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist" -v`

### Bitbucket (4 tests)
- `TestBitbucket_EnabledRequiresTokenAndWorkspace` — all four gate combinations.
- `TestBitbucket_SweepEmitsFindings` — httptest server, asserts `/2.0/workspaces/testws/search/code` path, Bearer header, non-empty `search_query`, Finding source/type.
- `TestBitbucket_Unauthorized` — 401 → `errors.Is(err, ErrUnauthorized)`.
- `TestBitbucket_ContextCancellation` — slow server + 50ms ctx deadline.

### Gist (5 tests)
- `TestGist_EnabledRequiresToken` — empty vs set token.
- `TestGist_SweepEmitsFindingsOnKeywordMatch` — two gists, only one raw body contains `sk-proj-`; asserts exactly 1 Finding, correct `html_url`, `ProviderName=openai`.
- `TestGist_NoMatch_NoFinding` — gist with unrelated content produces zero Findings.
- `TestGist_Unauthorized` — 401 → `ErrUnauthorized`.
- `TestGist_ContextCancellation` — slow server + 50ms ctx deadline.

All 9 tests pass. `go build ./...` is clean.

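Two behaviours exercised by the tests above, the 401 to `ErrUnauthorized` mapping and cancellation under a short deadline, can be illustrated in isolation. The sketch below is an assumption about the general pattern, not the project's `sources.Client`: `statusToError`, `slowCall`, `isUnauthorized`, and `timesOut` are hypothetical helpers, and the `ErrUnauthorized` declared here is a local stand-in for the package's sentinel.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// ErrUnauthorized is a local stand-in for the sentinel named in the summary.
var ErrUnauthorized = errors.New("unauthorized")

// statusToError wraps the sentinel with %w so callers can still match it
// with errors.Is after context has been added to the message.
func statusToError(source string, status int) error {
	switch {
	case status == http.StatusUnauthorized:
		return fmt.Errorf("%s: %w", source, ErrUnauthorized)
	case status >= 400:
		return fmt.Errorf("%s: unexpected status %d", source, status)
	}
	return nil
}

// isUnauthorized reports whether err wraps ErrUnauthorized.
func isUnauthorized(err error) bool {
	return errors.Is(err, ErrUnauthorized)
}

// slowCall simulates a request that takes d unless ctx ends first.
func slowCall(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// timesOut runs slowCall for workMs under a deadline of deadlineMs and
// reports whether it ended with context.DeadlineExceeded.
func timesOut(deadlineMs, workMs int) bool {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(deadlineMs)*time.Millisecond)
	defer cancel()
	err := slowCall(ctx, time.Duration(workMs)*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(isUnauthorized(statusToError("gist", http.StatusUnauthorized)))
	fmt.Println(timesOut(50, 2000)) // 50ms deadline against a 2s call
}
```
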
## Deviations from Plan

None — plan executed exactly as written. No Rule 1/2/3 auto-fixes were required; all tests passed on first full run after writing implementations.

## Decisions Made

1. **Keyword→provider mapping on the Bitbucket side lives in `providerForQuery`.** Bitbucket's API doesn't echo the keyword in the response, so we map the query string back to a provider name. A simple substring match over registry keywords is sufficient at current scale.
2. **GistSource emits one Finding per gist, not per file.** A single secret often lands in a `config.env` with supporting `README.md` and `docker-compose.yml`; treating the gist as the leak unit keeps noise down and matches how human reviewers triage.
3. **The limiter is awaited before every raw fetch, not just the list call.** GitHub's 30/min budget is shared across API endpoints, so each raw content fetch consumes a token.
4. **256KB cap on raw content reads.** Pathological gists (multi-MB logs, minified bundles) would otherwise block the sweep; 256KB is enough to surface a key that's typically near the top of a config file.

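As a rough illustration of decisions 1 and 3, the sketch below shows a substring-based keyword lookup and the arithmetic behind the two limiter intervals. It makes several assumptions: the registry is represented as a plain `map[string]string`, the `sk-ant-` entry is only example data, trying longer keywords first is a choice made here for determinism, and `intervalFor` is a hypothetical helper rather than part of `golang.org/x/time/rate`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// providerForQuery returns the provider whose keyword occurs in query, or ""
// when none does. Longer keywords are tried first so the most specific one
// wins; that ordering is an assumption of this sketch.
func providerForQuery(query string, keywords map[string]string) string {
	keys := make([]string, 0, len(keywords))
	for kw := range keywords {
		keys = append(keys, kw)
	}
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) > len(keys[j])
		}
		return keys[i] < keys[j]
	})
	for _, kw := range keys {
		if kw != "" && strings.Contains(query, kw) {
			return keywords[kw]
		}
	}
	return ""
}

// intervalFor converts "requests per window" into the spacing between
// requests, which is the value one would hand to rate.Every.
func intervalFor(requests int, window time.Duration) time.Duration {
	return window / time.Duration(requests)
}

func main() {
	keywords := map[string]string{"sk-proj-": "openai", "sk-ant-": "anthropic"}
	fmt.Println(providerForQuery("sk-ant-", keywords)) // anthropic

	fmt.Println(intervalFor(1000, time.Hour)) // 3.6s (Bitbucket, 1000/hr)
	fmt.Println(intervalFor(30, time.Minute)) // 2s (Gist, 30/min)
}
```

With a burst of 1, these intervals mean a sweep of N keywords against Bitbucket takes at least roughly 3.6 × (N − 1) seconds, and a Gist sweep spends about 2 seconds per raw file fetched.
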
## Commits

- `d279abf` — feat(10-04): add BitbucketSource for code search recon
- `0e16e8e` — feat(10-04): add GistSource for public gist keyword recon

## Self-Check: PASSED

- FOUND: pkg/recon/sources/bitbucket.go
- FOUND: pkg/recon/sources/bitbucket_test.go
- FOUND: pkg/recon/sources/gist.go
- FOUND: pkg/recon/sources/gist_test.go
- FOUND: commit d279abf
- FOUND: commit 0e16e8e
- Tests: 9/9 passing (`go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist"`)
- Build: `go build ./...` clean