docs(10): OSINT code hosting context
.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@@ -0,0 +1,133 @@
# Phase 10: OSINT Code Hosting - Context

**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated

<domain>
## Phase Boundary

10 code hosting source implementations under `pkg/recon/sources/`:
1. GitHub (code search API, needs token)
2. GitLab (code search API, needs token)
3. Bitbucket (REST API, public repos only without token)
4. GitHub Gist (search API, needs token)
5. Codeberg/Gitea (public API, no token needed for public repos)
6. Replit (search scraping; no public API)
7. CodeSandbox (search scraping)
8. HuggingFace (Hub API, optional token for a higher rate limit)
9. Kaggle (notebook search, needs API key)
10. Miscellaneous sandboxes (JSFiddle, CodePen; scraping-based)

All implement the `recon.ReconSource` interface from Phase 9.

</domain>

<decisions>
## Implementation Decisions

### Package Layout
```
pkg/recon/sources/
  github.go       - GitHub code search (shares logic with Phase 8's pkg/dorks/github.go)
  gitlab.go       - GitLab REST API
  bitbucket.go    - Bitbucket 2.0 API
  gist.go         - GitHub Gist search (may reuse github.go client)
  codeberg.go     - Gitea API (used for Codeberg + any Gitea instance)
  replit.go       - web scraping (robots.txt respected)
  codesandbox.go  - web scraping
  huggingface.go  - HF Hub API (/api/spaces, /api/models)
  kaggle.go       - Kaggle REST API
  sandboxes.go    - JSFiddle, CodePen, etc. aggregator
  register.go     - registers all sources with recon.Engine
  *_test.go       - httptest fixtures for each source
```

### Common Pattern
Each source follows the same shape; GitHub is shown as the example:
```go
import (
	"context"
	"net/http"
	"time"

	"golang.org/x/time/rate"
	// plus this project's recon and engine packages
)

type GitHubSource struct {
	token   string
	client  *http.Client
	limiter *rate.Limiter // from recon.LimiterRegistry
}

func (s *GitHubSource) Name() string { return "github" }
func (s *GitHubSource) RateLimit() rate.Limit { return rate.Every(2 * time.Second) } // 30/min
func (s *GitHubSource) Burst() int { return 1 }
func (s *GitHubSource) RespectsRobots() bool { return false } // REST API
func (s *GitHubSource) Enabled(cfg recon.Config) bool { return cfg.Token("github") != "" }
func (s *GitHubSource) Sweep(ctx context.Context, query string, out chan<- engine.Finding) error {
	// call API, parse results, normalize to Finding, send to out
	return nil
}
```

### Config
- Extend `recon.Config` (from Phase 9) with `Token(source string) string`, which reads from the environment or the config file
- Per-source config: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `HUGGINGFACE_TOKEN`, `KAGGLE_KEY`, `BITBUCKET_TOKEN`
- Sources that need a token but have none log "source disabled: missing credential" and are skipped; this is not an error

### Rate Limits (from public docs)
- GitHub: 30 req/min (authenticated search)
- GitLab: 2000 req/min
- Bitbucket: 1000 req/hour
- HuggingFace: 1000 req/hour authenticated
- Kaggle: 60 req/min
- Others (scrapers): 10 req/min max for safety

### Scraping Sources
- Replit, CodeSandbox, JSFiddle, CodePen: no public search API
- Respect robots.txt (behavior inherited from Phase 9)
- Conservative rate limit (10 req/min)
- User-Agent: drawn from the stealth pool when `--stealth` is on
- Query by searching for provider keywords via site-specific URLs
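
Building a site-specific search URL from a provider keyword mostly comes down to escaping the keyword correctly. The sketch below shows that step only. The URL templates are placeholders on the reserved `.example` domain, not the real search endpoints of any site, and `searchTemplates` and `searchURL` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchTemplates maps a scraper name to a search URL template.
// These hosts are placeholders (assumption); the real endpoints would
// have to be confirmed per site before use.
var searchTemplates = map[string]string{
	"replit":      "https://replit.example/search?q=%s",
	"codesandbox": "https://codesandbox.example/search?query=%s",
}

// searchURL fills the site's template with the query-escaped keyword.
func searchURL(site, keyword string) (string, error) {
	tmpl, ok := searchTemplates[site]
	if !ok {
		return "", fmt.Errorf("no search template for %q", site)
	}
	return fmt.Sprintf(tmpl, url.QueryEscape(keyword)), nil
}

func main() {
	for _, kw := range []string{"sk-proj-", "api key"} {
		u, err := searchURL("replit", kw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u)
	}
}
```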

### Queries
- Use keywords from loaded providers (`registry.List()`) to drive dorks
- Example GitHub query: `"sk-proj-" in:file extension:env`
- Query templates per provider/source in `pkg/recon/sources/queries.go`
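
A per-source template that is filled in with each provider keyword might look like the following. This is a sketch, not the contents of `queries.go`: the `templates` map and `buildQueries` function are hypothetical, and the keyword slice stands in for whatever `registry.List()` returns.

```go
package main

import "fmt"

// templates holds one query template per source. The GitHub entry
// reproduces the example query above; %s is the provider keyword.
var templates = map[string]string{
	"github": `"%s" in:file extension:env`,
}

// buildQueries expands a source's template once per keyword. It returns
// nil when the source has no template.
func buildQueries(source string, keywords []string) []string {
	tmpl, ok := templates[source]
	if !ok {
		return nil
	}
	queries := make([]string, 0, len(keywords))
	for _, kw := range keywords {
		queries = append(queries, fmt.Sprintf(tmpl, kw))
	}
	return queries
}

func main() {
	// Stand-in for keywords taken from the loaded providers.
	keywords := []string{"sk-proj-"}
	for _, q := range buildQueries("github", keywords) {
		fmt.Println(q)
	}
}
```

Keeping the template separate from the keyword list is what lets queries be parametrized per provider instead of being hardcoded strings.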

### Finding Normalization
- Each source maps API response → `engine.Finding`
- `Source` field: full URL to the match
- `SourceType` field: "recon:github", "recon:gitlab", etc.
- Deduplication happens in `recon.Engine` (from Phase 9)
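
As an illustration of the mapping, the sketch below decodes a search response and fills the two fields named above. It assumes a response shaped like GitHub's code search payload, with an `items` array whose entries carry an `html_url`; other sources will need their own decoding. The local `Finding` type is a stand-in for `engine.Finding`, and `normalize` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a stand-in for engine.Finding with only the two fields
// described above (assumption: the real type has more).
type Finding struct {
	Source     string
	SourceType string
}

// searchResponse models just the part of the payload needed here.
type searchResponse struct {
	Items []struct {
		HTMLURL string `json:"html_url"`
	} `json:"items"`
}

// normalize decodes body and returns one Finding per item that has a URL.
func normalize(sourceName string, body []byte) ([]Finding, error) {
	var resp searchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode %s response: %w", sourceName, err)
	}
	findings := make([]Finding, 0, len(resp.Items))
	for _, it := range resp.Items {
		if it.HTMLURL == "" {
			continue // nothing to link to, so nothing to report
		}
		findings = append(findings, Finding{
			Source:     it.HTMLURL,
			SourceType: "recon:" + sourceName,
		})
	}
	return findings, nil
}

func main() {
	body := []byte(`{"items":[{"html_url":"https://example.test/repo/blob/main/.env"}]}`)
	findings, err := normalize("github", body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range findings {
		fmt.Println(f.SourceType, f.Source)
	}
}
```

No deduplication is done here, since that is the engine's responsibility.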

</decisions>

<code_context>
## Existing Code Insights

### Reusable Assets
- `pkg/dorks/github.go`: already has the GitHub Code Search API client from Phase 8; refactor to share code
- `pkg/recon/source.go`: `ReconSource` interface
- `pkg/recon/engine.go`: `SweepAll` orchestrator
- `pkg/recon/limiter.go`: `LimiterRegistry`
- `pkg/recon/stealth.go`: UA pool
- `pkg/recon/robots.go`: `RobotsCache`
- `pkg/providers/registry.go`: `List()` for dork generation

### Integration
- Replace `pkg/recon/example.go` stub usage with real sources via `register.go`
- `cmd/recon.go` already has `recon full` and `recon list` from Phase 9; both pick up new sources automatically once they are registered

</code_context>

<specifics>
## Specific Ideas

- Share an HTTP client with retry logic in `pkg/recon/httpclient.go` (new helper)
- Token config: use viper (already imported) to read from `~/.keyhunter.yaml` or env
- Queries should be parametrized per provider, not hardcoded strings

</specifics>

<deferred>
## Deferred Ideas

- Private repo scanning (requires elevated tokens): out of scope
- GitHub organization-wide scanning: separate feature
- Real-time PR scanning: left to the CI/CD phase (already done)
- Webhook triggers on new code: out of scope

</deferred>