diff --git a/.planning/phases/10-osint-code-hosting/10-CONTEXT.md b/.planning/phases/10-osint-code-hosting/10-CONTEXT.md
new file mode 100644
index 0000000..0d113dd
--- /dev/null
+++ b/.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@@ -0,0 +1,133 @@
+# Phase 10: OSINT Code Hosting - Context
+
+**Gathered:** 2026-04-05
+**Status:** Ready for planning
+**Mode:** Auto-generated
+
+
+## Phase Boundary
+
+10 code hosting source implementations under `pkg/recon/sources/`:
+1. GitHub (code search API, needs token)
+2. GitLab (code search API, needs token)
+3. Bitbucket (REST API, public repos only without token)
+4. GitHub Gist (search API, needs token)
+5. Codeberg/Gitea (public API, no token needed for public repos)
+6. Replit (search scraping — no public API)
+7. CodeSandbox (search scraping)
+8. HuggingFace (Hub API, optional token for rate limit)
+9. Kaggle (notebook search, needs API key)
+10. Miscellaneous sandboxes (JSFiddle, CodePen — scraping-based)
+
+All implement `recon.ReconSource` interface from Phase 9.
+
+
+
+
+## Implementation Decisions
+
+### Package Layout
+```
+pkg/recon/sources/
+  github.go — GitHub code search (share with Phase 8's pkg/dorks/github.go logic)
+  gitlab.go — GitLab REST API
+  bitbucket.go — Bitbucket 2.0 API
+  gist.go — GitHub Gist search (may reuse github.go client)
+  codeberg.go — Gitea API (used for Codeberg + any Gitea instance)
+  replit.go — web scraping (robots.txt respected)
+  codesandbox.go — web scraping
+  huggingface.go — HF Hub API (/api/spaces, /api/models)
+  kaggle.go — Kaggle REST API
+  sandboxes.go — JSFiddle, CodePen, etc. aggregator
+  register.go — registers all sources with recon.Engine
+  *_test.go — httptest fixtures for each
+```
+
+### Common Pattern
+Each source:
+```go
+type GitHubSource struct {
+	token   string
+	client  *http.Client
+	limiter *rate.Limiter // from recon.LimiterRegistry
+}
+
+func (s *GitHubSource) Name() string { return "github" }
+func (s *GitHubSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) } // 30/min
+func (s *GitHubSource) Burst() int { return 1 }
+func (s *GitHubSource) RespectsRobots() bool { return false } // REST API
+func (s *GitHubSource) Enabled(cfg recon.Config) bool { return cfg.Token("github") != "" }
+func (s *GitHubSource) Sweep(ctx context.Context, query string, out chan<- engine.Finding) error {
+	// call API, parse results, normalize to Finding, send to out
+}
+```
+
+### Config
+- Extend recon.Config (from Phase 9) with `Token(source string) string` reading from env/config file
+- Per-source config: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `HUGGINGFACE_TOKEN`, `KAGGLE_KEY`, `BITBUCKET_TOKEN`
+- Sources without tokens log "source disabled: missing credential" and skip (not error)
+
+### Rate Limits (from public docs)
+- GitHub: 30 req/min (authenticated search)
+- GitLab: 2000 req/min
+- Bitbucket: 1000 req/hour
+- HuggingFace: 1000 req/hour authenticated
+- Kaggle: 60 req/min
+- Others (scrapers): 10 req/min max for safety
+
+### Scraping Sources
+- Replit, CodeSandbox, JSFiddle, CodePen — no public search API
+- Use robots.txt respect (inherited from Phase 9)
+- Conservative rate limit (10 req/min)
+- User-Agent: from stealth pool (if --stealth on)
+- Query by searching for provider keywords via site-specific URLs
+
+### Queries
+- Use keywords from loaded providers (registry.List()) to drive dorks
+- Example GitHub query: `"sk-proj-" in:file extension:env`
+- Query templates per provider/source in pkg/recon/sources/queries.go
+
+### Finding Normalization
+- Each source maps API response → engine.Finding
+- `Source` field: full URL to the match
+- `SourceType` field: "recon:github", "recon:gitlab", etc.
+- Deduplication happens in recon.Engine (from Phase 9)
+
+
+
+
+## Existing Code Insights
+
+### Reusable Assets
+- pkg/dorks/github.go — already has GitHub Code Search API client from Phase 8; refactor to share code
+- pkg/recon/source.go — ReconSource interface
+- pkg/recon/engine.go — SweepAll orchestrator
+- pkg/recon/limiter.go — LimiterRegistry
+- pkg/recon/stealth.go — UA pool
+- pkg/recon/robots.go — RobotsCache
+- pkg/providers/registry.go — List() for dork generation
+
+### Integration
+- Replace `pkg/recon/example.go` stub usage with real sources via `register.go`
+- cmd/recon.go already has `recon full` / `recon list` from Phase 9 — will auto-pick up new sources once registered
+
+
+
+
+## Specific Ideas
+
+- Share HTTP client with retry logic in pkg/recon/httpclient.go (new helper)
+- Token config: use viper (already imported) to read from ~/.keyhunter.yaml or env
+- Queries should be parametrized per provider, not hardcoded strings
+
+
+
+
+## Deferred Ideas
+
+- Private repo scanning (requires elevated tokens) — out of scope
+- GitHub organization-wide scanning — separate feature
+- Real-time PR scanning — defer to CI/CD phase (already done)
+- Webhook triggers on new code — out of scope
+
+
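
Note on the "Rate Limits" list in this diff: each budget can be turned into the minimum spacing between requests, which is the argument a source's `RateLimit()` would hand to `rate.Every` (as the GitHub snippet does with `2 * time.Second`). The following is a minimal sketch of that arithmetic using only the standard library. The `budget` type, its `interval` method, and the `budgets` table are hypothetical names introduced for illustration and are not part of the planned package; the numbers are copied from the list as written and have not been re-checked against each provider's documentation.

```go
package main

import (
	"fmt"
	"time"
)

// budget is a hypothetical helper type: n requests allowed per window.
type budget struct {
	source string
	n      int
	window time.Duration
}

// interval returns the minimum spacing between requests that keeps a
// source within its budget. A source could pass this to rate.Every.
func (b budget) interval() time.Duration {
	return b.window / time.Duration(b.n)
}

// budgets mirrors the "Rate Limits" list in 10-CONTEXT.md (assumed values).
var budgets = []budget{
	{"github", 30, time.Minute},
	{"gitlab", 2000, time.Minute},
	{"bitbucket", 1000, time.Hour},
	{"huggingface", 1000, time.Hour},
	{"kaggle", 60, time.Minute},
	{"scrapers", 10, time.Minute},
}

func main() {
	// Print one line per source: name and spacing between requests.
	for _, b := range budgets {
		fmt.Printf("%-12s %v\n", b.source, b.interval())
	}
}
```

With these inputs the GitHub entry works out to 2s, which agrees with the `// 30/min` comment in the Common Pattern snippet, and the scraper budget of 10 req/min works out to one request every 6s. Keeping the budgets in one table, rather than hardcoding a duration in each source file, would make it easier to adjust a limit if a provider's published figure changes.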