docs(10): OSINT code hosting context

salvacybersec
2026-04-06 00:59:18 +03:00
parent 226274ca9e
commit cfe090a5c9


@@ -0,0 +1,133 @@
# Phase 10: OSINT Code Hosting - Context
**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated
<domain>
## Phase Boundary
10 code hosting source implementations under `pkg/recon/sources/`:
1. GitHub (code search API, needs token)
2. GitLab (code search API, needs token)
3. Bitbucket (REST API, public repos only without token)
4. GitHub Gist (search API, needs token)
5. Codeberg/Gitea (public API, no token needed for public repos)
6. Replit (search scraping — no public API)
7. CodeSandbox (search scraping)
8. HuggingFace (Hub API, optional token for rate limit)
9. Kaggle (notebook search, needs API key)
10. Miscellaneous sandboxes (JSFiddle, CodePen — scraping-based)
All ten implement the `recon.ReconSource` interface from Phase 9.
</domain>
<decisions>
## Implementation Decisions
### Package Layout
```
pkg/recon/sources/
github.go — GitHub code search (share with Phase 8's pkg/dorks/github.go logic)
gitlab.go — GitLab REST API
bitbucket.go — Bitbucket 2.0 API
gist.go — GitHub Gist search (may reuse github.go client)
codeberg.go — Gitea API (used for Codeberg + any Gitea instance)
replit.go — web scraping (robots.txt respected)
codesandbox.go — web scraping
huggingface.go — HF Hub API (/api/spaces, /api/models)
kaggle.go — Kaggle REST API
sandboxes.go — JSFiddle, CodePen, etc. aggregator
register.go — registers all sources with recon.Engine
*_test.go — httptest fixtures for each
```
### Common Pattern
Each source follows the same shape, shown here for GitHub:
```go
// Needs context, net/http, time, golang.org/x/time/rate,
// plus the project's recon and engine packages.
type GitHubSource struct {
	token   string
	client  *http.Client
	limiter *rate.Limiter // from recon.LimiterRegistry
}

func (s *GitHubSource) Name() string          { return "github" }
func (s *GitHubSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) } // 30/min
func (s *GitHubSource) Burst() int            { return 1 }
func (s *GitHubSource) RespectsRobots() bool  { return false } // REST API, no scraping

// Enabled reports whether a GitHub token is configured.
func (s *GitHubSource) Enabled(cfg recon.Config) bool { return cfg.Token("github") != "" }

func (s *GitHubSource) Sweep(ctx context.Context, query string, out chan<- engine.Finding) error {
	// Call the API, parse results, normalize each hit to a Finding, and send it to out.
	return nil
}
```
### Config
- Extend recon.Config (from Phase 9) with `Token(source string) string` reading from env/config file
- Per-source config: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `HUGGINGFACE_TOKEN`, `KAGGLE_KEY`, `BITBUCKET_TOKEN`
- Sources that require a token but have none configured log "source disabled: missing credential" and are skipped (this is not an error)
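
The token lookup and skip behavior above could be sketched roughly as follows. This is a minimal illustration, not the Phase 9 code: `Config` here is a hypothetical stand-in for `recon.Config`, the `lookup` field and `enabledSources` helper are invented for the example, and only the environment variables are covered (the config-file path is left out).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// envKeys maps a source name to the environment variable listed above.
var envKeys = map[string]string{
	"github":      "GITHUB_TOKEN",
	"gitlab":      "GITLAB_TOKEN",
	"huggingface": "HUGGINGFACE_TOKEN",
	"kaggle":      "KAGGLE_KEY",
	"bitbucket":   "BITBUCKET_TOKEN",
}

// Config is a hypothetical stand-in for recon.Config.
type Config struct {
	lookup func(string) string // injectable for tests; defaults to os.Getenv
}

// Token returns the credential for source, or "" when none is configured.
func (c Config) Token(source string) string {
	key, ok := envKeys[strings.ToLower(source)]
	if !ok {
		return ""
	}
	get := c.lookup
	if get == nil {
		get = os.Getenv
	}
	return strings.TrimSpace(get(key))
}

// enabledSources (hypothetical helper) partitions names into sources with a
// credential and sources that should be skipped.
func enabledSources(cfg Config, names []string) (enabled, skipped []string) {
	for _, n := range names {
		if cfg.Token(n) == "" {
			skipped = append(skipped, n)
			continue
		}
		enabled = append(enabled, n)
	}
	return enabled, skipped
}

func main() {
	enabled, skipped := enabledSources(Config{}, []string{"github", "gitlab", "kaggle"})
	for _, n := range skipped {
		// Skipping is informational, not an error.
		fmt.Printf("source disabled: missing credential (%s)\n", n)
	}
	fmt.Println("enabled:", enabled)
}
```

Injecting the lookup function keeps the sketch testable without touching the real environment.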
### Rate Limits (from public docs)
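- GitHub: 30 req/min (authenticated search; GitHub documents a lower cap of 10 req/min for the code search endpoint, so confirm which limit applies)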
- GitLab: 2000 req/min
- Bitbucket: 1000 req/hour
- HuggingFace: 1000 req/hour authenticated
- Kaggle: 60 req/min
- Others (scrapers): 10 req/min max for safety
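
To turn the budgets above into limiter settings, each one can be converted to a minimum spacing between requests. The sketch below uses only the standard library. The `budget` type and `spacing` function are hypothetical names, and unknown sources are assumed to fall back to the 10 req/min scraper budget.

```go
package main

import (
	"fmt"
	"time"
)

// budget is a hypothetical type: a number of requests allowed per window.
type budget struct {
	requests int
	window   time.Duration
}

// budgets mirrors the per-source limits listed above.
var budgets = map[string]budget{
	"github":      {30, time.Minute},
	"gitlab":      {2000, time.Minute},
	"bitbucket":   {1000, time.Hour},
	"huggingface": {1000, time.Hour},
	"kaggle":      {60, time.Minute},
}

// scraperBudget is the conservative default for scraping-based sources.
var scraperBudget = budget{10, time.Minute}

// spacing returns the minimum interval between two requests to source.
func spacing(source string) time.Duration {
	b, ok := budgets[source]
	if !ok {
		b = scraperBudget
	}
	return b.window / time.Duration(b.requests)
}

func main() {
	for _, name := range []string{"github", "gitlab", "bitbucket", "huggingface", "kaggle", "replit"} {
		fmt.Printf("%-12s one request every %v\n", name, spacing(name))
	}
}
```

The resulting interval is the value one would pass to something like `rate.Every`. For GitHub it works out to 2 seconds, which matches the `RateLimit()` shown in the common pattern.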
### Scraping Sources
- Replit, CodeSandbox, JSFiddle, CodePen — no public search API
- Respect robots.txt (behavior inherited from Phase 9)
- Conservative rate limit (10 req/min)
- User-Agent: drawn from the stealth pool when `--stealth` is on
- Query each site by searching for provider keywords through its own search URL
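
As an illustration of the site-specific URL approach, the sketch below builds a search URL from a base endpoint, a query parameter name, and a provider keyword, then attaches a User-Agent. The base URL `https://sandbox.example/search`, the parameter name `q`, and the User-Agent string are placeholders, not the real endpoints or values for any of the sites listed. The robots.txt check and rate limiting are assumed to happen elsewhere.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
)

// searchURL (hypothetical helper) appends keyword as a query parameter to base.
func searchURL(base, param, keyword string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse search URL: %w", err)
	}
	q := u.Query()
	q.Set(param, keyword) // url.Values handles escaping
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	target, err := searchURL("https://sandbox.example/search", "q", "sk-proj-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, target, nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Placeholder value; in the design above this comes from the stealth pool.
	req.Header.Set("User-Agent", "example-agent/1.0")
	fmt.Println(req.Method, req.URL.String())
}
```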
### Queries
- Use keywords from loaded providers (registry.List()) to drive dorks
- Example GitHub query: `"sk-proj-" in:file extension:env`
- Query templates per provider/source in pkg/recon/sources/queries.go
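
One possible shape for `queries.go` is a per-source template that receives each provider keyword. The sketch below is an assumption about that layout: `queryTemplates` and `buildQueries` are hypothetical names, and only the GitHub template comes from the example above.

```go
package main

import "fmt"

// queryTemplates holds one format string per source; %q inserts the keyword
// wrapped in double quotes. Only the GitHub entry is taken from the example above.
var queryTemplates = map[string]string{
	"github": "%q in:file extension:env",
}

// buildQueries expands the template for source over keywords, skipping empty
// and duplicate keywords. Sources without a template get the quoted keyword alone.
func buildQueries(source string, keywords []string) []string {
	tmpl, ok := queryTemplates[source]
	if !ok {
		tmpl = "%q"
	}
	seen := make(map[string]bool, len(keywords))
	queries := make([]string, 0, len(keywords))
	for _, kw := range keywords {
		if kw == "" || seen[kw] {
			continue
		}
		seen[kw] = true
		queries = append(queries, fmt.Sprintf(tmpl, kw))
	}
	return queries
}

func main() {
	// In the real design the keywords would come from registry.List().
	for _, q := range buildQueries("github", []string{"sk-proj-", "sk-proj-", ""}) {
		fmt.Println(q)
	}
}
```

Note that `%q` applies Go string escaping, which is fine for plain ASCII key prefixes but may need replacing if keywords contain quotes or non-printable characters.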
### Finding Normalization
- Each source maps API response → engine.Finding
- `Source` field: full URL to the match
- `SourceType` field: "recon:github", "recon:gitlab", etc.
- Deduplication happens in recon.Engine (from Phase 9)
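
The mapping from an API response to a finding might look like the sketch below. `Finding` is a hypothetical two-field subset of `engine.Finding`, and the JSON shape (an `items` array whose entries carry `html_url`) follows the GitHub search response; other sources would need their own decoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical subset of engine.Finding with only the two
// fields discussed above.
type Finding struct {
	Source     string // full URL to the match
	SourceType string // e.g. "recon:github"
}

// searchResponse decodes only the fields the sketch needs.
type searchResponse struct {
	Items []struct {
		HTMLURL string `json:"html_url"`
	} `json:"items"`
}

// normalize converts a raw search response body into findings for source.
// Items without a URL are dropped; deduplication is left to the engine.
func normalize(source string, body []byte) ([]Finding, error) {
	var resp searchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode %s response: %w", source, err)
	}
	findings := make([]Finding, 0, len(resp.Items))
	for _, item := range resp.Items {
		if item.HTMLURL == "" {
			continue
		}
		findings = append(findings, Finding{
			Source:     item.HTMLURL,
			SourceType: "recon:" + source,
		})
	}
	return findings, nil
}

func main() {
	body := []byte(`{"items":[{"html_url":"https://example.com/owner/repo/blob/main/.env"}]}`)
	findings, err := normalize("github", body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range findings {
		fmt.Println(f.SourceType, f.Source)
	}
}
```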
</decisions>
<code_context>
## Existing Code Insights
### Reusable Assets
- pkg/dorks/github.go — already has GitHub Code Search API client from Phase 8; refactor to share code
- pkg/recon/source.go — ReconSource interface
- pkg/recon/engine.go — SweepAll orchestrator
- pkg/recon/limiter.go — LimiterRegistry
- pkg/recon/stealth.go — UA pool
- pkg/recon/robots.go — RobotsCache
- pkg/providers/registry.go — List() for dork generation
### Integration
- Replace `pkg/recon/example.go` stub usage with real sources via `register.go`
- cmd/recon.go already has `recon full` / `recon list` from Phase 9 — will auto-pick up new sources once registered
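
For orientation, `register.go` could be as small as a loop over the ten sources. Everything in the sketch below is hypothetical: `Source`, `Engine`, `Register`, `List`, and `RegisterAll` are stand-ins for the Phase 9 types, whose actual signatures are not shown in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// Source is a hypothetical, trimmed-down stand-in for recon.ReconSource.
type Source interface{ Name() string }

// namedSource is a placeholder implementation used only in this sketch.
type namedSource string

func (n namedSource) Name() string { return string(n) }

// Engine is a hypothetical stand-in for recon.Engine.
type Engine struct{ sources map[string]Source }

func NewEngine() *Engine { return &Engine{sources: make(map[string]Source)} }

// Register adds s and rejects duplicate names.
func (e *Engine) Register(s Source) error {
	if _, dup := e.sources[s.Name()]; dup {
		return fmt.Errorf("source %q already registered", s.Name())
	}
	e.sources[s.Name()] = s
	return nil
}

// List returns the registered source names in sorted order.
func (e *Engine) List() []string {
	names := make([]string, 0, len(e.sources))
	for name := range e.sources {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// RegisterAll registers one placeholder per file in the package layout.
func RegisterAll(e *Engine) error {
	for _, name := range []string{
		"github", "gitlab", "bitbucket", "gist", "codeberg",
		"replit", "codesandbox", "huggingface", "kaggle", "sandboxes",
	} {
		if err := e.Register(namedSource(name)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	e := NewEngine()
	if err := RegisterAll(e); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(e.List())
}
```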
</code_context>
<specifics>
## Specific Ideas
- Share HTTP client with retry logic in pkg/recon/httpclient.go (new helper)
- Token config: use viper (already imported) to read from ~/.keyhunter.yaml or env
- Queries should be parametrized per provider, not hardcoded strings
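
The shared HTTP helper with retry logic could start from something like the sketch below. It is one possible design, not a settled one: the retry policy (retry on 429 and 5xx, three attempts, doubling delay from 500 ms) and the names `shouldRetry`, `backoff`, and `getWithRetry` are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

const (
	maxAttempts = 3
	baseDelay   = 500 * time.Millisecond
)

// shouldRetry reports whether a status code is worth retrying.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

// backoff doubles the delay on each attempt (attempt is zero-based).
func backoff(attempt int) time.Duration { return baseDelay << attempt }

// getWithRetry issues a GET and retries on transport errors or retryable
// statuses. delay is injectable so tests can avoid real sleeps.
func getWithRetry(ctx context.Context, client *http.Client, target string, delay func(int) time.Duration) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil && !shouldRetry(resp.StatusCode) {
			return resp, nil // caller closes the body
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("status %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(delay(attempt)):
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// Test server that fails twice with 503, then succeeds.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		calls++
		if calls < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	fast := func(int) time.Duration { return time.Millisecond }
	resp, err := getWithRetry(context.Background(), srv.Client(), srv.URL, fast)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status", resp.StatusCode, "after", calls, "calls")
}
```

In production the `delay` argument would be `backoff`. A fuller version would likely also honor `Retry-After` headers, which this sketch ignores.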
</specifics>
<deferred>
## Deferred Ideas
- Private repo scanning (requires elevated tokens) — out of scope
- GitHub organization-wide scanning — separate feature
- Real-time PR scanning — defer to CI/CD phase (already done)
- Webhook triggers on new code — out of scope
</deferred>