| phase | plan | subsystem |
|---|---|---|
| 10-osint-code-hosting | 04 | recon/sources |
Phase 10 Plan 04: Bitbucket + Gist Sources Summary
One-liner: BitbucketSource hits the Cloud 2.0 code search API with workspace+token gating, and GistSource fans out over /gists/public fetching each file's raw content to match provider keywords, emitting one Finding per matching gist.
What Was Built
BitbucketSource (RECON-CODE-03)
- pkg/recon/sources/bitbucket.go: implements recon.ReconSource.
- Endpoint: GET {base}/2.0/workspaces/{workspace}/search/code?search_query={kw}.
- Auth: Authorization: Bearer <token>.
- Disabled when either Token or Workspace is empty (clean no-op, no error).
- Rate: rate.Every(3600ms) burst 1 (Bitbucket 1000/hr API limit).
- Iterates BuildQueries(registry, "bitbucket"): one request per provider keyword.
- Decodes {values:[{file:{path,commit{hash}},page_url}]} and emits one Finding per entry. SourceType = "recon:bitbucket", Source = page_url (falls back to synthetic bitbucket:{ws}/{path}@{hash} when page_url is missing).
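The decode-and-fallback step above can be sketched as follows. This is a minimal illustration, not the code in bitbucket.go: the searchResponse type and the sourcesFrom helper are hypothetical names, and the struct covers only the fields named in the summary.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// searchResponse mirrors only the fields listed in the summary:
// {values:[{file:{path,commit{hash}},page_url}]}. Hypothetical type.
type searchResponse struct {
	Values []struct {
		File struct {
			Path   string `json:"path"`
			Commit struct {
				Hash string `json:"hash"`
			} `json:"commit"`
		} `json:"file"`
		PageURL string `json:"page_url"`
	} `json:"values"`
}

// sourcesFrom (hypothetical helper) returns one Source string per entry,
// preferring page_url and falling back to bitbucket:{ws}/{path}@{hash}.
func sourcesFrom(body []byte, workspace string) ([]string, error) {
	var resp searchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode bitbucket search response: %w", err)
	}
	out := make([]string, 0, len(resp.Values))
	for _, v := range resp.Values {
		src := v.PageURL
		if src == "" {
			src = fmt.Sprintf("bitbucket:%s/%s@%s", workspace, v.File.Path, v.File.Commit.Hash)
		}
		out = append(out, src)
	}
	return out, nil
}

func main() {
	body := []byte(`{"values":[
		{"file":{"path":"a.env","commit":{"hash":"abc123"}},"page_url":"https://example.invalid/a"},
		{"file":{"path":"b.env","commit":{"hash":"def456"}}}
	]}`)
	srcs, err := sourcesFrom(body, "testws")
	if err != nil {
		panic(err)
	}
	for _, s := range srcs {
		fmt.Println(s)
	}
}
```

The second entry has no page_url, so it takes the synthetic form; the first keeps its URL unchanged.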
GistSource (RECON-CODE-04)
- pkg/recon/sources/gist.go: implements recon.ReconSource.
- Endpoint: GET {base}/gists/public?per_page=100.
- Per gist, per file: fetches raw_url (also with Bearer auth) and scans content against the provider keyword set (flattened keyword → providerName map).
- 256KB read cap per raw file to avoid pathological payloads.
- Emits one Finding per matching gist (breaks on first keyword match across that gist's files), which prevents a multi-file leak from producing N duplicate Findings.
- ProviderName set from the matched keyword; Source = gist.html_url; SourceType = "recon:gist".
- Rate: rate.Every(2s) burst 1 (30 req/min). Limiter waited before every outbound request (list + each raw fetch) so GitHub's shared budget is respected.
- Disabled when token is empty.
How It Fits
- Depends on Plan 10-01 foundation: sources.Client (retry + 401→ErrUnauthorized), BuildQueries, recon.LimiterRegistry.
- Does not modify register.go; Plan 10-09 wires all Wave 2 sources into RegisterAll after every plan lands.
- Finding shape matches engine.Finding so downstream dedup/verify/storage paths in Phases 9/5/4 consume them without changes.
Tests
go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist" -v
Bitbucket (4 tests)
- TestBitbucket_EnabledRequiresTokenAndWorkspace: all four gate combinations.
- TestBitbucket_SweepEmitsFindings: httptest server, asserts /2.0/workspaces/testws/search/code path, Bearer header, non-empty search_query, Finding source/type.
- TestBitbucket_Unauthorized: 401 → errors.Is(err, ErrUnauthorized).
- TestBitbucket_ContextCancellation: slow server + 50ms ctx deadline.
Gist (5 tests)
- TestGist_EnabledRequiresToken: empty vs set token.
- TestGist_SweepEmitsFindingsOnKeywordMatch: two gists, only one raw body contains sk-proj-; asserts exactly 1 Finding, correct html_url, ProviderName=openai.
- TestGist_NoMatch_NoFinding: gist with unrelated content produces zero Findings.
- TestGist_Unauthorized: 401 → ErrUnauthorized.
- TestGist_ContextCancellation: slow server + 50ms ctx deadline.
All 9 tests pass. go build ./... is clean.
Deviations from Plan
None — plan executed exactly as written. No Rule 1/2/3 auto-fixes were required; all tests passed on first full run after writing implementations.
Decisions Made
- Keyword→provider mapping on the Bitbucket side lives in providerForQuery. Bitbucket's API doesn't echo the keyword in the response, so we parse the query back to a provider name. Simple substring match over registry keywords is sufficient at current scale.
- GistSource emits one Finding per gist, not per file. A single secret often lands in a config.env with supporting README.md and docker-compose.yml; treating the gist as the leak unit keeps noise down and matches how human reviewers triage.
- Limiter waited before every raw fetch, not just the list call. GitHub's 30/min budget is shared across API endpoints, so each raw content fetch consumes a token.
- 256KB cap on raw content reads. Pathological gists (multi-MB logs, minified bundles) would otherwise block the sweep; 256KB is enough to surface a key that's typically near the top of a config file.
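The substring lookup described in the first decision might look roughly like the sketch below. The provider type and the function signature are assumptions made for illustration; only the name providerForQuery and the substring-match approach come from the notes above.

```go
package main

import (
	"fmt"
	"strings"
)

// provider is a hypothetical stand-in for a registry entry.
type provider struct {
	Name     string
	Keywords []string
}

// providerForQuery returns the first provider that has a keyword contained
// in the query, or "" when none matches. The signature is assumed; the
// real function in bitbucket.go may differ.
func providerForQuery(providers []provider, query string) string {
	for _, p := range providers {
		for _, kw := range p.Keywords {
			// Skip empty keywords: every string contains "".
			if kw != "" && strings.Contains(query, kw) {
				return p.Name
			}
		}
	}
	return ""
}

func main() {
	registry := []provider{
		{Name: "openai", Keywords: []string{"sk-proj-"}},
		{Name: "example", Keywords: []string{"exk_live_"}},
	}
	fmt.Println(providerForQuery(registry, "sk-proj-"))
	fmt.Println(providerForQuery(registry, "unrelated") == "")
}
```

This is a linear scan over every keyword for each lookup, which fits the "sufficient at current scale" note; a reverse map from query to provider would be the obvious next step if the registry grows.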
Commits
- d279abf feat(10-04): add BitbucketSource for code search recon
- 0e16e8e feat(10-04): add GistSource for public gist keyword recon
Self-Check: PASSED
- FOUND: pkg/recon/sources/bitbucket.go
- FOUND: pkg/recon/sources/bitbucket_test.go
- FOUND: pkg/recon/sources/gist.go
- FOUND: pkg/recon/sources/gist_test.go
- FOUND: commit d279abf
- FOUND: commit 0e16e8e
- Tests: 9/9 passing (go test ./pkg/recon/sources/ -run "TestBitbucket|TestGist")
- Build: go build ./... clean