Phase 10: OSINT Code Hosting - Context
Gathered: 2026-04-05 Status: Ready for planning Mode: Auto-generated
## Phase Boundary
10 code hosting source implementations under pkg/recon/sources/:
- GitHub (code search API, needs token)
- GitLab (code search API, needs token)
- Bitbucket (REST API, public repos only without token)
- GitHub Gist (search API, needs token)
- Codeberg/Gitea (public API, no token needed for public repos)
- Replit (search scraping — no public API)
- CodeSandbox (search scraping)
- HuggingFace (Hub API, optional token for rate limit)
- Kaggle (notebook search, needs API key)
- Miscellaneous sandboxes (JSFiddle, CodePen — scraping-based)
All implement the recon.ReconSource interface from Phase 9.
Package Layout
pkg/recon/sources/
github.go — GitHub code search (share with Phase 8's pkg/dorks/github.go logic)
gitlab.go — GitLab REST API
bitbucket.go — Bitbucket 2.0 API
gist.go — GitHub Gist search (may reuse github.go client)
codeberg.go — Gitea API (used for Codeberg + any Gitea instance)
replit.go — web scraping (robots.txt respected)
codesandbox.go — web scraping
huggingface.go — HF Hub API (/api/spaces, /api/models)
kaggle.go — Kaggle REST API
sandboxes.go — JSFiddle, CodePen, etc. aggregator
register.go — registers all sources with recon.Engine
*_test.go — httptest fixtures for each
Common Pattern
Each source:
type GitHubSource struct {
token string
client *http.Client
limiter *rate.Limiter // from recon.LimiterRegistry
}
func (s *GitHubSource) Name() string { return "github" }
func (s *GitHubSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) } // 30/min
func (s *GitHubSource) Burst() int { return 1 }
func (s *GitHubSource) RespectsRobots() bool { return false } // REST API
func (s *GitHubSource) Enabled(cfg recon.Config) bool { return cfg.Token("github") != "" }
func (s *GitHubSource) Sweep(ctx context.Context, query string, out chan<- engine.Finding) error {
	// call the API, parse results, normalize each hit to a Finding, send it to out
	return nil
}
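The Sweep placeholder above could be filled in roughly as follows. This is a minimal sketch, not the project's implementation: `Finding` is a local stand-in for engine.Finding with only the two fields this document names, the `baseURL` field is an assumption added so the source can be pointed at an httptest fixture, and the response struct reads only the `items[].html_url` field of a GitHub-style search response.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// Finding is a hypothetical stand-in for engine.Finding.
type Finding struct {
	Source     string // full URL to the match
	SourceType string // e.g. "recon:github"
}

// searchResponse reads only the items[].html_url field of the search result.
type searchResponse struct {
	Items []struct {
		HTMLURL string `json:"html_url"`
	} `json:"items"`
}

type GitHubSource struct {
	token   string
	baseURL string // assumption: overridable so tests can target a fixture
	client  *http.Client
}

// Sweep queries the search endpoint and streams normalized findings to out.
func (s *GitHubSource) Sweep(ctx context.Context, query string, out chan<- Finding) error {
	endpoint := s.baseURL + "/search/code?q=" + url.QueryEscape(query)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+s.token)
	resp, err := s.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("github search: unexpected status %d", resp.StatusCode)
	}
	var parsed searchResponse
	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
		return err
	}
	for _, item := range parsed.Items {
		select {
		case out <- Finding{Source: item.HTMLURL, SourceType: "recon:github"}:
		case <-ctx.Done(): // stop early if the sweep is cancelled
			return ctx.Err()
		}
	}
	return nil
}

// sweepAgainstFixture runs Sweep against an in-process httptest server.
func sweepAgainstFixture() ([]Finding, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"items":[{"html_url":"https://github.com/o/r/blob/main/.env"}]}`)
	}))
	defer srv.Close()

	src := &GitHubSource{token: "test", baseURL: srv.URL, client: srv.Client()}
	out := make(chan Finding, 8)
	err := src.Sweep(context.Background(), `"sk-proj-" in:file`, out)
	close(out)
	var got []Finding
	for f := range out {
		got = append(got, f)
	}
	return got, err
}

func main() {
	got, err := sweepAgainstFixture()
	fmt.Println(got, err)
}
```

Keeping the base URL injectable is what lets the *_test.go fixtures run without real credentials or network access.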
Config
- Extend recon.Config (from Phase 9) with Token(source string) string, reading from env/config file
- Per-source config: GITHUB_TOKEN, GITLAB_TOKEN, HUGGINGFACE_TOKEN, KAGGLE_KEY, BITBUCKET_TOKEN
- Sources without tokens log "source disabled: missing credential" and skip (not error)
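One way the token lookup and the skip-on-missing-credential behaviour might fit together is sketched below. `Config` here is a hypothetical stand-in for recon.Config, the injectable `lookup` function is an assumption made for testability, and the source names other than "github" are guesses at the keys the real code would use. Reading from a config file is left out.

```go
package main

import (
	"fmt"
	"os"
)

// tokenEnvVars maps a source name to the environment variable listed above.
// Only the "github" key appears in this document; the rest are assumed.
var tokenEnvVars = map[string]string{
	"github":      "GITHUB_TOKEN",
	"gitlab":      "GITLAB_TOKEN",
	"huggingface": "HUGGINGFACE_TOKEN",
	"kaggle":      "KAGGLE_KEY",
	"bitbucket":   "BITBUCKET_TOKEN",
}

// Config is a hypothetical stand-in for recon.Config.
type Config struct {
	lookup func(key string) string // os.Getenv in production, a stub in tests
}

// NewEnvConfig returns a Config backed by the process environment.
func NewEnvConfig() Config { return Config{lookup: os.Getenv} }

// Token returns the credential for a source, or "" if none is configured.
func (c Config) Token(source string) string {
	envVar, ok := tokenEnvVars[source]
	if !ok {
		return ""
	}
	return c.lookup(envVar)
}

// enabledSources splits names into sources with a credential and sources to skip.
func enabledSources(c Config, names []string) (enabled, skipped []string) {
	for _, name := range names {
		if c.Token(name) == "" {
			skipped = append(skipped, name)
			continue
		}
		enabled = append(enabled, name)
	}
	return enabled, skipped
}

func main() {
	enabled, skipped := enabledSources(NewEnvConfig(), []string{"github", "gitlab", "kaggle"})
	for _, name := range skipped {
		// Missing credentials are logged and skipped, not treated as errors.
		fmt.Printf("source disabled: missing credential (%s)\n", name)
	}
	fmt.Println("enabled:", enabled)
}
```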
Rate Limits (from public docs)
- GitHub: 30 req/min (authenticated search; GitHub documents a stricter 10 req/min for the code search endpoint specifically)
- GitLab: 2000 req/min
- Bitbucket: 1000 req/hour
- HuggingFace: 1000 req/hour authenticated
- Kaggle: 60 req/min
- Others (scrapers): 10 req/min max for safety
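To turn these budgets into limiter settings, each quota can be converted to a minimum spacing between requests, which is the value a call like rate.Every expects. The sketch below only does the arithmetic; the `quota` type and the map keys are illustrative names, not part of the existing code.

```go
package main

import (
	"fmt"
	"time"
)

// quota describes a budget of n requests per window.
type quota struct {
	n      int
	window time.Duration
}

// interval returns the minimum spacing between requests that stays within the quota.
func (q quota) interval() time.Duration {
	return q.window / time.Duration(q.n)
}

// quotas mirrors the list above; keys are hypothetical source names.
var quotas = map[string]quota{
	"github":      {30, time.Minute},
	"gitlab":      {2000, time.Minute},
	"bitbucket":   {1000, time.Hour},
	"huggingface": {1000, time.Hour},
	"kaggle":      {60, time.Minute},
	"scraper":     {10, time.Minute},
}

func main() {
	order := []string{"github", "gitlab", "bitbucket", "huggingface", "kaggle", "scraper"}
	for _, name := range order {
		fmt.Printf("%-12s one request every %v\n", name, quotas[name].interval())
	}
}
```

For GitHub this gives one request every 2 seconds, matching the rate.Every(2 * time.Second) value in the Common Pattern snippet.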
Scraping Sources
- Replit, CodeSandbox, JSFiddle, CodePen — no public search API
- Respect robots.txt (handling inherited from Phase 9)
- Conservative rate limit (10 req/min)
- User-Agent: from stealth pool (if --stealth on)
- Query by searching for provider keywords via site-specific URLs
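A scraping source's request construction might look like the sketch below: the provider keyword goes into a site-specific search URL, and a User-Agent is set only when one is supplied (for example from the stealth pool). The search URL, the `q` parameter name, and the User-Agent string are placeholders, not the real endpoints of any of the sites listed; the robots.txt check and rate limiting are assumed to happen elsewhere.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
)

// newScrapeRequest builds a GET request for a site search page.
// searchURL and param are site-specific; userAgent may be empty.
func newScrapeRequest(ctx context.Context, searchURL, param, keyword, userAgent string) (*http.Request, error) {
	u, err := url.Parse(searchURL)
	if err != nil {
		return nil, err
	}
	q := u.Query()
	q.Set(param, keyword) // Encode below takes care of escaping
	u.RawQuery = q.Encode()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	if userAgent != "" {
		req.Header.Set("User-Agent", userAgent)
	}
	return req, nil
}

// demoRequest builds a request against a placeholder host.
func demoRequest() (*http.Request, error) {
	return newScrapeRequest(context.Background(),
		"https://sandbox.example/search", "q", "sk-proj-", "keyhunter-example/0.1")
}

func main() {
	req, err := demoRequest()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.String(), req.Header.Get("User-Agent"))
}
```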
Queries
- Use keywords from loaded providers (registry.List()) to drive dorks
- Example GitHub query: "sk-proj-" in:file extension:env
- Query templates per provider/source in pkg/recon/sources/queries.go
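As an illustration of what queries.go could contain, the sketch below expands provider keywords through a per-source template. The `{keyword}` placeholder syntax, the `queryTemplates` map, and `buildQueries` are all hypothetical; only the GitHub template is taken from the example above, and the GitLab entry is an assumption included to show a second source.

```go
package main

import (
	"fmt"
	"strings"
)

// queryTemplates holds one search template per source.
var queryTemplates = map[string]string{
	"github": `"{keyword}" in:file extension:env`,
	"gitlab": `"{keyword}"`, // assumption: plain quoted keyword search
}

// buildQueries expands each non-empty keyword through the source's template.
// It returns nil for sources without a template.
func buildQueries(source string, keywords []string) []string {
	tmpl, ok := queryTemplates[source]
	if !ok {
		return nil
	}
	queries := make([]string, 0, len(keywords))
	for _, kw := range keywords {
		if kw == "" {
			continue
		}
		queries = append(queries, strings.ReplaceAll(tmpl, "{keyword}", kw))
	}
	return queries
}

func main() {
	// In the real code the keywords would come from registry.List().
	for _, q := range buildQueries("github", []string{"sk-proj-", "ghp_"}) {
		fmt.Println(q)
	}
}
```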
Finding Normalization
- Each source maps API response → engine.Finding
- Source field: full URL to the match
- SourceType field: "recon:github", "recon:gitlab", etc.
- Deduplication happens in recon.Engine (from Phase 9)
<code_context>
Existing Code Insights
Reusable Assets
- pkg/dorks/github.go — already has GitHub Code Search API client from Phase 8; refactor to share code
- pkg/recon/source.go — ReconSource interface
- pkg/recon/engine.go — SweepAll orchestrator
- pkg/recon/limiter.go — LimiterRegistry
- pkg/recon/stealth.go — UA pool
- pkg/recon/robots.go — RobotsCache
- pkg/providers/registry.go — List() for dork generation
Integration
- Replace pkg/recon/example.go stub usage with real sources via register.go
- cmd/recon.go already has recon full / recon list from Phase 9; it will auto-pick up new sources once registered
</code_context>
## Specific Ideas
- Share HTTP client with retry logic in pkg/recon/httpclient.go (new helper)
- Token config: use viper (already imported) to read from ~/.keyhunter.yaml or env
- Queries should be parametrized per provider, not hardcoded strings
- Private repo scanning (requires elevated tokens) — out of scope
- GitHub organization-wide scanning — separate feature
- Real-time PR scanning — defer to CI/CD phase (already done)
- Webhook triggers on new code — out of scope
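For the shared HTTP client with retry logic proposed for pkg/recon/httpclient.go, one possible shape is sketched below. The function name, the choice to retry on 429 and 5xx, and the linear backoff are all assumptions rather than decisions recorded here; the demo uses a stub transport so it runs without network access, and it assumes maxAttempts is at least 1.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// doWithRetry issues a GET and retries on transport errors, 429, and 5xx,
// waiting backoff*attempt between tries. maxAttempts must be >= 1.
func doWithRetry(ctx context.Context, client *http.Client, url string, maxAttempts int, backoff time.Duration) (*http.Response, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		switch {
		case err != nil:
			lastErr = err
		case resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500:
			resp.Body.Close()
			lastErr = fmt.Errorf("retryable status %d", resp.StatusCode)
		default:
			return resp, nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-time.After(backoff * time.Duration(attempt)):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// roundTripFunc adapts a function to http.RoundTripper for the demo.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demoRetry fails twice with 503, then succeeds; it returns the final
// status code and the number of calls the transport saw.
func demoRetry() (int, int, error) {
	calls := 0
	client := &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		calls++
		status := http.StatusOK
		if calls < 3 {
			status = http.StatusServiceUnavailable
		}
		return &http.Response{StatusCode: status, Body: http.NoBody, Header: make(http.Header), Request: r}, nil
	})}
	resp, err := doWithRetry(context.Background(), client, "http://retry.example/search", 5, time.Millisecond)
	if err != nil {
		return 0, calls, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, calls, nil
}

func main() {
	status, calls, err := demoRetry()
	fmt.Println(status, calls, err)
}
```

A production helper would probably also honour a Retry-After header where the API sends one; that is left out to keep the sketch short.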