docs(10): OSINT code hosting context
133
.planning/phases/10-osint-code-hosting/10-CONTEXT.md
Normal file
@@ -0,0 +1,133 @@
# Phase 10: OSINT Code Hosting - Context

**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated

<domain>
## Phase Boundary

Ten code-hosting source implementations under `pkg/recon/sources/`:
1. GitHub (code search API, needs token)
2. GitLab (code search API, needs token)
3. Bitbucket (REST API, public repos only without token)
4. GitHub Gist (search API, needs token)
5. Codeberg/Gitea (public API, no token needed for public repos)
6. Replit (search scraping; no public API)
7. CodeSandbox (search scraping)
8. HuggingFace (Hub API, optional token for a higher rate limit)
9. Kaggle (notebook search, needs API key)
10. Miscellaneous sandboxes (JSFiddle, CodePen; scraping-based)

All implement the `recon.ReconSource` interface from Phase 9.

</domain>

<decisions>
## Implementation Decisions

### Package Layout
```
pkg/recon/sources/
  github.go       - GitHub code search (shares logic with Phase 8's pkg/dorks/github.go)
  gitlab.go       - GitLab REST API
  bitbucket.go    - Bitbucket 2.0 API
  gist.go         - GitHub Gist search (may reuse the github.go client)
  codeberg.go     - Gitea API (used for Codeberg and any other Gitea instance)
  replit.go       - web scraping (robots.txt respected)
  codesandbox.go  - web scraping
  huggingface.go  - HF Hub API (/api/spaces, /api/models)
  kaggle.go       - Kaggle REST API
  sandboxes.go    - JSFiddle, CodePen, etc. aggregator
  queries.go      - query templates per provider/source
  register.go     - registers all sources with recon.Engine
  *_test.go       - httptest fixtures for each source
```

### Common Pattern
Each source:
```go
// Imports needed: context, net/http, time, golang.org/x/time/rate,
// plus the project's recon and engine packages.

// GitHubSource implements recon.ReconSource for GitHub code search.
type GitHubSource struct {
	token   string
	client  *http.Client
	limiter *rate.Limiter // from recon.LimiterRegistry
}

func (s *GitHubSource) Name() string          { return "github" }
func (s *GitHubSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) } // 30/min
func (s *GitHubSource) Burst() int            { return 1 }
func (s *GitHubSource) RespectsRobots() bool  { return false } // REST API, not scraping

// Enabled reports whether a GitHub token is configured.
func (s *GitHubSource) Enabled(cfg recon.Config) bool { return cfg.Token("github") != "" }

// Sweep runs the query and streams normalized findings to out.
func (s *GitHubSource) Sweep(ctx context.Context, query string, out chan<- engine.Finding) error {
	// call API, parse results, normalize to Finding, send to out
	return nil
}
```

### Config
- Extend `recon.Config` (from Phase 9) with `Token(source string) string`, which reads from env vars or the config file
- Per-source credentials: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `HUGGINGFACE_TOKEN`, `KAGGLE_KEY`, `BITBUCKET_TOKEN`
- Sources that require a token but have none log "source disabled: missing credential" and are skipped (not treated as an error)
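
A minimal sketch of the `Token` lookup described above, assuming credentials come only from environment variables (the config-file path is omitted). `Config` here is a stand-in for `recon.Config`, and the `Getenv` field and `envVarFor` map are hypothetical names introduced for illustration, not part of the Phase 9 code.

```go
package main

import (
	"fmt"
	"os"
)

// envVarFor maps a source name to the environment variable listed above.
// Sources missing from the map are treated as having no credential.
var envVarFor = map[string]string{
	"github":      "GITHUB_TOKEN",
	"gitlab":      "GITLAB_TOKEN",
	"huggingface": "HUGGINGFACE_TOKEN",
	"kaggle":      "KAGGLE_KEY",
	"bitbucket":   "BITBUCKET_TOKEN",
}

// Config is a stand-in for recon.Config (hypothetical shape).
// Getenv is injectable so the lookup can be exercised without
// touching the real process environment.
type Config struct {
	Getenv func(string) string
}

// Token returns the credential for source, or "" when none is configured.
func (c Config) Token(source string) string {
	name, ok := envVarFor[source]
	if !ok || c.Getenv == nil {
		return ""
	}
	return c.Getenv(name)
}

func main() {
	cfg := Config{Getenv: os.Getenv}
	for _, src := range []string{"github", "gitlab", "kaggle"} {
		if cfg.Token(src) == "" {
			// Skip rather than fail, matching the behaviour described above.
			fmt.Printf("source disabled: missing credential (%s)\n", src)
			continue
		}
		fmt.Printf("source enabled: %s\n", src)
	}
}
```

Returning an empty string for an unknown or unset credential keeps `Enabled` a one-line check, as in the Common Pattern.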

### Rate Limits (from public docs)
- GitHub: 30 req/min (authenticated search; GitHub's docs list a separate 10 req/min limit for the code search endpoint, so confirm which applies)
- GitLab: 2000 req/min
- Bitbucket: 1000 req/hour
- HuggingFace: 1000 req/hour authenticated
- Kaggle: 60 req/min
- Others (scrapers): 10 req/min max for safety
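
To make the figures above comparable, the sketch below converts each "N requests per window" quota into a minimum spacing between requests, which is the value a call like `rate.Every` expects. The `quota` type and `quotas` table are hypothetical; only the numbers come from the list above.

```go
package main

import (
	"fmt"
	"time"
)

// quota describes "requests per window" as listed above.
type quota struct {
	requests int
	window   time.Duration
}

// interval returns the minimum spacing between requests that stays
// within the quota when requests are evenly paced.
func (q quota) interval() time.Duration {
	return q.window / time.Duration(q.requests)
}

// quotas mirrors the documented limits; "scraper" covers the
// scraping-based sources.
var quotas = map[string]quota{
	"github":      {30, time.Minute},
	"gitlab":      {2000, time.Minute},
	"bitbucket":   {1000, time.Hour},
	"huggingface": {1000, time.Hour},
	"kaggle":      {60, time.Minute},
	"scraper":     {10, time.Minute},
}

func main() {
	order := []string{"github", "gitlab", "bitbucket", "huggingface", "kaggle", "scraper"}
	for _, name := range order {
		fmt.Printf("%-12s one request every %v\n", name, quotas[name].interval())
	}
}
```

For GitHub this yields a 2-second interval, matching `rate.Every(2 * time.Second)` in the Common Pattern.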

### Scraping Sources
- Replit, CodeSandbox, JSFiddle, CodePen: no public search API
- Respect robots.txt (handling inherited from Phase 9)
- Conservative rate limit (10 req/min)
- User-Agent: drawn from the stealth pool when `--stealth` is on
- Query by searching for provider keywords via site-specific URLs

### Queries
- Use keywords from loaded providers (`registry.List()`) to drive dorks
- Example GitHub query: `"sk-proj-" in:file extension:env`
- Query templates per provider/source live in `pkg/recon/sources/queries.go`
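
One way the per-source templates could look, as a sketch only. The `{keyword}` placeholder, the `templates` map, and `buildQuery` are hypothetical names; the GitHub template is taken from the example above, while the GitLab entry is a placeholder with no source-specific qualifiers. Keywords are assumed not to contain double quotes.

```go
package main

import (
	"fmt"
	"strings"
)

// templates maps a source name to a query template. {keyword} is
// replaced with the quoted provider keyword.
var templates = map[string]string{
	"github": `{keyword} in:file extension:env`,
	"gitlab": `{keyword}`, // placeholder: no qualifiers assumed
}

// buildQuery fills the template for source with keyword. The second
// return value is false when no template exists for the source.
func buildQuery(source, keyword string) (string, bool) {
	tmpl, ok := templates[source]
	if !ok {
		return "", false
	}
	quoted := `"` + keyword + `"`
	return strings.ReplaceAll(tmpl, "{keyword}", quoted), true
}

func main() {
	// In the real code the keywords would come from registry.List().
	for _, kw := range []string{"sk-proj-", "hf_"} {
		if q, ok := buildQuery("github", kw); ok {
			fmt.Println(q)
		}
	}
}
```

Keeping the template separate from the keyword means a new provider only adds a keyword, and a new source only adds one template.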

### Finding Normalization
- Each source maps API response → engine.Finding
- `Source` field: full URL to the match
- `SourceType` field: "recon:github", "recon:gitlab", etc.
- Deduplication happens in recon.Engine (from Phase 9)
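
The mapping above can be sketched as follows. `Finding` is a stand-in for `engine.Finding` that carries only the two fields named here, and `searchHit` is a hypothetical, already-decoded API result; real responses differ per source.

```go
package main

import "fmt"

// Finding is a stand-in for engine.Finding with only the fields
// discussed above.
type Finding struct {
	Source     string // full URL to the match
	SourceType string // "recon:<source name>"
}

// searchHit is a hypothetical decoded search result.
type searchHit struct {
	HTMLURL string
}

// normalize maps one hit from the named source to a Finding.
func normalize(sourceName string, hit searchHit) Finding {
	return Finding{
		Source:     hit.HTMLURL,
		SourceType: "recon:" + sourceName,
	}
}

func main() {
	hit := searchHit{HTMLURL: "https://example.com/owner/repo/blob/main/.env"}
	f := normalize("github", hit)
	fmt.Printf("%s %s\n", f.SourceType, f.Source)
}
```

Deriving `SourceType` from the source's `Name()` keeps the prefix consistent across all ten sources.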

</decisions>

<code_context>
## Existing Code Insights

### Reusable Assets
- `pkg/dorks/github.go`: already has the GitHub Code Search API client from Phase 8; refactor to share code
- `pkg/recon/source.go`: `ReconSource` interface
- `pkg/recon/engine.go`: `SweepAll` orchestrator
- `pkg/recon/limiter.go`: `LimiterRegistry`
- `pkg/recon/stealth.go`: UA pool
- `pkg/recon/robots.go`: `RobotsCache`
- `pkg/providers/registry.go`: `List()` for dork generation

### Integration
- Replace `pkg/recon/example.go` stub usage with real sources via `register.go`
- `cmd/recon.go` already has `recon full` / `recon list` from Phase 9 and will pick up new sources automatically once they are registered

</code_context>

<specifics>
## Specific Ideas

- Share one HTTP client with retry logic in `pkg/recon/httpclient.go` (new helper)
- Token config: use viper (already imported) to read from `~/.keyhunter.yaml` or env
- Queries should be parameterized per provider, not hardcoded strings
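
A rough sketch of the retry logic such a helper might contain, under stated assumptions: it retries on HTTP 429 and 5xx with doubling backoff, and it takes a callback instead of wrapping `*http.Client` so the example stays self-contained. `withRetry` and `retryable` are hypothetical names; the real helper's policy and shape are still open.

```go
package main

import (
	"fmt"
	"time"
)

// retryable reports whether an HTTP status is worth retrying:
// 429 (rate limited) or any 5xx (server-side failure).
func retryable(status int) bool {
	return status == 429 || status >= 500
}

// withRetry calls do up to maxAttempts times, doubling the wait after
// each failed attempt. do returns the HTTP status it observed and any
// transport error.
func withRetry(maxAttempts int, backoff time.Duration, do func() (int, error)) (int, error) {
	var status int
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, err = do()
		if err == nil && !retryable(status) {
			return status, nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	if err == nil {
		err = fmt.Errorf("giving up after %d attempts, last status %d", maxAttempts, status)
	}
	return status, err
}

func main() {
	calls := 0
	// Simulated endpoint: two 503s, then success.
	status, err := withRetry(3, 10*time.Millisecond, func() (int, error) {
		calls++
		if calls < 3 {
			return 503, nil
		}
		return 200, nil
	})
	fmt.Println(status, err, calls)
}
```

In the shared helper, the callback would be a request issued through the common `*http.Client`, with each source's limiter applied before every attempt.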

</specifics>

<deferred>
## Deferred Ideas

- Private repo scanning (requires elevated tokens): out of scope
- GitHub organization-wide scanning: separate feature
- Real-time PR scanning: deferred to the CI/CD phase (already done)
- Webhook triggers on new code: out of scope

</deferred>