---
phase: 13-osint_package_registries_container_iac
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- pkg/recon/sources/npm.go
- pkg/recon/sources/npm_test.go
- pkg/recon/sources/pypi.go
- pkg/recon/sources/pypi_test.go
- pkg/recon/sources/cratesio.go
- pkg/recon/sources/cratesio_test.go
- pkg/recon/sources/rubygems.go
- pkg/recon/sources/rubygems_test.go
autonomous: true
requirements:
- RECON-PKG-01
- RECON-PKG-02
must_haves:
  truths:
    - "NpmSource searches npm registry for packages matching provider keywords and emits findings"
    - "PyPISource searches PyPI for packages matching provider keywords and emits findings"
    - "CratesIOSource searches crates.io for crates matching provider keywords and emits findings"
    - "RubyGemsSource searches rubygems.org for gems matching provider keywords and emits findings"
    - "All four sources handle context cancellation, empty registries, and HTTP errors gracefully"
  artifacts:
    - path: "pkg/recon/sources/npm.go"
      provides: "NpmSource implementing recon.ReconSource"
      contains: "func (s *NpmSource) Sweep"
    - path: "pkg/recon/sources/npm_test.go"
      provides: "httptest-based tests for NpmSource"
      contains: "httptest.NewServer"
    - path: "pkg/recon/sources/pypi.go"
      provides: "PyPISource implementing recon.ReconSource"
      contains: "func (s *PyPISource) Sweep"
    - path: "pkg/recon/sources/pypi_test.go"
      provides: "httptest-based tests for PyPISource"
      contains: "httptest.NewServer"
    - path: "pkg/recon/sources/cratesio.go"
      provides: "CratesIOSource implementing recon.ReconSource"
      contains: "func (s *CratesIOSource) Sweep"
    - path: "pkg/recon/sources/cratesio_test.go"
      provides: "httptest-based tests for CratesIOSource"
      contains: "httptest.NewServer"
    - path: "pkg/recon/sources/rubygems.go"
      provides: "RubyGemsSource implementing recon.ReconSource"
      contains: "func (s *RubyGemsSource) Sweep"
    - path: "pkg/recon/sources/rubygems_test.go"
      provides: "httptest-based tests for RubyGemsSource"
      contains: "httptest.NewServer"
  key_links:
    - from: "pkg/recon/sources/npm.go"
      to: "pkg/recon/source.go"
      via: "implements ReconSource interface"
      pattern: "var _ recon\\.ReconSource"
    - from: "pkg/recon/sources/pypi.go"
      to: "pkg/recon/source.go"
      via: "implements ReconSource interface"
      pattern: "var _ recon\\.ReconSource"
---
Implement four package registry ReconSource modules: npm, PyPI, Crates.io, and RubyGems.
Purpose: Enable KeyHunter to scan four major package registries for packages that may contain leaked API keys, covering the JavaScript, Python, Rust, and Ruby ecosystems.
Output: 4 source files + 4 test files in pkg/recon/sources/
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@pkg/recon/source.go
@pkg/recon/sources/register.go
@pkg/recon/sources/httpclient.go
@pkg/recon/sources/queries.go
@pkg/recon/sources/replit.go (pattern reference — credentialless scraper source)
@pkg/recon/sources/github.go (pattern reference — API-key-gated source)
@pkg/recon/sources/replit_test.go (test pattern reference)
From pkg/recon/source.go:
```go
type ReconSource interface {
Name() string
RateLimit() rate.Limit
Burst() int
RespectsRobots() bool
Enabled(cfg Config) bool
Sweep(ctx context.Context, query string, out chan<- Finding) error
}
```
From pkg/recon/sources/httpclient.go:
```go
func NewClient() *Client
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error)
```
From pkg/recon/sources/queries.go:
```go
func BuildQueries(reg *providers.Registry, source string) []string
```
Task 1: Implement NpmSource and PyPISource
pkg/recon/sources/npm.go, pkg/recon/sources/npm_test.go, pkg/recon/sources/pypi.go, pkg/recon/sources/pypi_test.go
Create NpmSource in npm.go following the established ReplitSource pattern (credentialless), but with RespectsRobots() returning false as specified below:
**NpmSource** (npm.go):
- Struct: `NpmSource` with fields `BaseURL string`, `Registry *providers.Registry`, `Limiters *recon.LimiterRegistry`, `Client *Client`
- Compile-time assertion: `var _ recon.ReconSource = (*NpmSource)(nil)`
- Name() returns "npm"
- RateLimit() returns rate.Every(2 * time.Second) — npm registry is generous but be polite
- Burst() returns 2
- RespectsRobots() returns false (API endpoint, not scraped HTML)
- Enabled() always returns true (no credentials needed)
- BaseURL defaults to "https://registry.npmjs.org" if empty
- Sweep() logic:
1. Call BuildQueries(s.Registry, "npm") to get keyword list
2. For each keyword, GET `{BaseURL}/-/v1/search?text={keyword}&size=20`
3. Parse JSON response: `{"objects": [{"package": {"name": "...", "links": {"npm": "..."}}}]}`
4. Define response structs: `npmSearchResponse`, `npmObject`, `npmPackage`, `npmLinks`
5. Emit one Finding per result with Source=links.npm (or construct from package name), SourceType="recon:npm", Confidence="low"
6. Honor ctx cancellation between queries, use Limiters.Wait before each request
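The parsing half of the steps above can be sketched as follows. This is a minimal illustration, not the final implementation: the struct names come from step 4, while the helper `npmSourceURLs` and the fallback URL format `https://www.npmjs.com/package/{name}` are assumptions made for the example. Emitting `Finding` values is left out because the `Finding` type is not shown in this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Response shapes mirroring the npm search JSON described in step 3.
type npmSearchResponse struct {
	Objects []npmObject `json:"objects"`
}

type npmObject struct {
	Package npmPackage `json:"package"`
}

type npmPackage struct {
	Name  string   `json:"name"`
	Links npmLinks `json:"links"`
}

type npmLinks struct {
	Npm string `json:"npm"`
}

// npmSourceURLs is a hypothetical helper: it decodes one search response and
// returns one URL per package, preferring links.npm and otherwise building
// a URL from the package name.
func npmSourceURLs(body []byte) ([]string, error) {
	var resp npmSearchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode npm search response: %w", err)
	}
	urls := make([]string, 0, len(resp.Objects))
	for _, obj := range resp.Objects {
		switch {
		case obj.Package.Links.Npm != "":
			urls = append(urls, obj.Package.Links.Npm)
		case obj.Package.Name != "":
			// Assumed fallback format when links.npm is missing.
			urls = append(urls, "https://www.npmjs.com/package/"+obj.Package.Name)
		}
	}
	return urls, nil
}

func main() {
	body := []byte(`{"objects":[
		{"package":{"name":"demo-a","links":{"npm":"https://www.npmjs.com/package/demo-a"}}},
		{"package":{"name":"demo-b"}}
	]}`)
	urls, err := npmSourceURLs(body)
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

In Sweep, each returned URL would become the Source of a Finding with SourceType="recon:npm" and Confidence="low".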
**PyPISource** (pypi.go):
- Same pattern as NpmSource
- Name() returns "pypi"
- RateLimit() returns rate.Every(2 * time.Second)
- Burst() returns 2
- RespectsRobots() returns false
- Enabled() always true
- BaseURL defaults to "https://pypi.org"
- Sweep() logic:
1. BuildQueries(s.Registry, "pypi")
2. For each keyword, GET `{BaseURL}/search/?q={keyword}`, which returns the HTML search results page with one package-snippet link per project. The alternatives do not fit: `{BaseURL}/pypi/{name}/json` only describes a single, already known package, and `{BaseURL}/simple/` lists every project and is too large.
3. Parse HTML response for project links matching `/project/[^/]+/` pattern
4. Emit Finding per result with Source="{BaseURL}/project/{name}/", SourceType="recon:pypi"
5. Use extractAnchorHrefs pattern or a simpler regex on href attributes
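The regex option from step 5 might look like the sketch below. The helper name `pypiProjectURLs` and the exact expression are assumptions for illustration. The signature of `extractAnchorHrefs` is not shown in this plan, so it is not used here.

```go
package main

import (
	"fmt"
	"regexp"
)

// projectHref captures the project name from attributes such as
// href="/project/some-name/".
var projectHref = regexp.MustCompile(`href="/project/([^/"]+)/"`)

// pypiProjectURLs is a hypothetical helper: it returns one
// {baseURL}/project/{name}/ URL per distinct project link found in the
// HTML, in order of first appearance.
func pypiProjectURLs(baseURL, html string) []string {
	seen := make(map[string]bool)
	var urls []string
	for _, m := range projectHref.FindAllStringSubmatch(html, -1) {
		name := m[1]
		if seen[name] {
			continue // the same project can be linked more than once
		}
		seen[name] = true
		urls = append(urls, baseURL+"/project/"+name+"/")
	}
	return urls
}

func main() {
	html := `<ul>
<li><a class="package-snippet" href="/project/demo-pkg/">demo-pkg</a></li>
<li><a class="package-snippet" href="/project/demo-pkg/">duplicate</a></li>
<li><a href="/help/">Help</a></li>
</ul>`
	for _, u := range pypiProjectURLs("https://pypi.org", html) {
		fmt.Println(u)
	}
}
```

Taking the base URL as a parameter keeps the helper usable against an httptest server, whose URL replaces `https://pypi.org` in tests.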
**Tests** — Follow replit_test.go pattern exactly:
- npm_test.go: httptest server returning canned npm search JSON. Test Sweep extracts findings, test Name/Rate/Burst, test ctx cancellation, test Enabled always true.
- pypi_test.go: httptest server returning canned HTML with package-snippet links. Same test categories.
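As a rough illustration of the fixture and cancellation checks, the sketch below uses only the standard library. `fetchSearch` and `runChecks` are hypothetical stand-ins: the real tests would call Sweep on the source under test and follow replit_test.go, which is not reproduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// fetchSearch is a hypothetical stand-in for the request step inside Sweep:
// it issues one context-aware GET against the npm-style search endpoint.
func fetchSearch(ctx context.Context, baseURL, keyword string) (string, error) {
	q := url.Values{"text": {keyword}, "size": {"20"}}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/-/v1/search?"+q.Encode(), nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// runChecks covers two cases: a canned fixture and an already cancelled context.
func runChecks() error {
	const fixture = `{"objects":[]}`
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/-/v1/search" || r.URL.Query().Get("text") != "demo" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, fixture)
	}))
	defer srv.Close()

	got, err := fetchSearch(context.Background(), srv.URL, "demo")
	if err != nil {
		return fmt.Errorf("fixture request: %w", err)
	}
	if got != fixture {
		return fmt.Errorf("body = %q, want %q", got, fixture)
	}

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel before the request is sent
	if _, err := fetchSearch(ctx, srv.URL, "demo"); !errors.Is(err, context.Canceled) {
		return fmt.Errorf("cancelled request: got %v, want context.Canceled", err)
	}
	return nil
}

func main() {
	if err := runChecks(); err != nil {
		panic(err)
	}
	fmt.Println("checks passed")
}
```

In the actual `_test.go` files the same checks would be written as `Test...` functions using `t.Fatalf` rather than returned errors.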
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestNpm|TestPyPI" -v -count=1
NpmSource and PyPISource pass all tests: Sweep emits correct findings from httptest fixtures, Name/Rate/Burst/Enabled return expected values, ctx cancellation is handled
Task 2: Implement CratesIOSource and RubyGemsSource
pkg/recon/sources/cratesio.go, pkg/recon/sources/cratesio_test.go, pkg/recon/sources/rubygems.go, pkg/recon/sources/rubygems_test.go
**CratesIOSource** (cratesio.go):
- Struct: `CratesIOSource` with `BaseURL`, `Registry`, `Limiters`, `Client`
- Compile-time assertion: `var _ recon.ReconSource = (*CratesIOSource)(nil)`
- Name() returns "crates"
- RateLimit() returns rate.Every(1 * time.Second) — crates.io asks for 1 req/sec
- Burst() returns 1
- RespectsRobots() returns false (JSON API)
- Enabled() always true
- BaseURL defaults to "https://crates.io"
- Sweep() logic:
1. BuildQueries(s.Registry, "crates")
2. For each keyword, GET `{BaseURL}/api/v1/crates?q={keyword}&per_page=20`
3. Parse JSON: `{"crates": [{"id": "...", "name": "...", "repository": "..."}]}`
4. Define response structs: `cratesSearchResponse`, `crateEntry`
5. Emit Finding per crate: Source="https://crates.io/crates/{name}", SourceType="recon:crates"
6. IMPORTANT: crates.io requires a custom User-Agent header. Set req.Header.Set("User-Agent", "keyhunter-recon/1.0 (https://github.com/salvacybersec/keyhunter)") before passing to client.Do
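A possible shape for the request and parsing steps above, assuming plain `net/http` in place of the project's `Client` wrapper. The helpers `cratesSearchURL`, `newCratesRequest`, and `crateSourceURLs` are hypothetical names; the struct names and the User-Agent string come from the steps above.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// User-Agent value taken from step 6.
const cratesUserAgent = "keyhunter-recon/1.0 (https://github.com/salvacybersec/keyhunter)"

type cratesSearchResponse struct {
	Crates []crateEntry `json:"crates"`
}

type crateEntry struct {
	ID         string `json:"id"`
	Name       string `json:"name"`
	Repository string `json:"repository"`
}

// cratesSearchURL builds the search URL with the keyword query-escaped.
func cratesSearchURL(baseURL, keyword string) string {
	q := url.Values{"q": {keyword}, "per_page": {"20"}}
	return baseURL + "/api/v1/crates?" + q.Encode()
}

// newCratesRequest builds the GET request and sets the custom User-Agent
// before the request is handed to the HTTP client.
func newCratesRequest(ctx context.Context, baseURL, keyword string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, cratesSearchURL(baseURL, keyword), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", cratesUserAgent)
	return req, nil
}

// crateSourceURLs decodes one search response into crate page URLs.
func crateSourceURLs(body []byte) ([]string, error) {
	var resp cratesSearchResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode crates.io response: %w", err)
	}
	urls := make([]string, 0, len(resp.Crates))
	for _, c := range resp.Crates {
		if c.Name != "" {
			urls = append(urls, "https://crates.io/crates/"+c.Name)
		}
	}
	return urls, nil
}

func main() {
	req, err := newCratesRequest(context.Background(), "https://crates.io", "demo")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())
	fmt.Println(req.Header.Get("User-Agent"))

	urls, err := crateSourceURLs([]byte(`{"crates":[{"id":"demo","name":"demo","repository":null}]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(urls)
}
```

A test can assert the header by reading `r.Header.Get("User-Agent")` inside the httptest handler.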
**RubyGemsSource** (rubygems.go):
- Same pattern
- Name() returns "rubygems"
- RateLimit() returns rate.Every(2 * time.Second)
- Burst() returns 2
- RespectsRobots() returns false (JSON API)
- Enabled() always true
- BaseURL defaults to "https://rubygems.org"
- Sweep() logic:
1. BuildQueries(s.Registry, "rubygems")
2. For each keyword, GET `{BaseURL}/api/v1/search.json?query={keyword}&page=1`
3. Parse JSON array: `[{"name": "...", "project_uri": "..."}]`
4. Define response struct: `rubyGemEntry`
5. Emit Finding per gem: Source=project_uri, SourceType="recon:rubygems"
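Unlike npm and crates.io, the RubyGems search response is a top-level JSON array, so it decodes straight into a slice. A minimal sketch, where `rubyGemsSearchURL` and `gemSourceURLs` are hypothetical helper names and skipping entries with an empty `project_uri` is an assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// rubyGemEntry holds the two fields used from each search result.
type rubyGemEntry struct {
	Name       string `json:"name"`
	ProjectURI string `json:"project_uri"`
}

// rubyGemsSearchURL builds the search URL with the keyword query-escaped.
func rubyGemsSearchURL(baseURL, keyword string) string {
	q := url.Values{"query": {keyword}, "page": {"1"}}
	return baseURL + "/api/v1/search.json?" + q.Encode()
}

// gemSourceURLs decodes the JSON array and returns each gem's project_uri.
func gemSourceURLs(body []byte) ([]string, error) {
	var gems []rubyGemEntry
	if err := json.Unmarshal(body, &gems); err != nil {
		return nil, fmt.Errorf("decode rubygems response: %w", err)
	}
	urls := make([]string, 0, len(gems))
	for _, g := range gems {
		if g.ProjectURI != "" {
			urls = append(urls, g.ProjectURI)
		}
	}
	return urls, nil
}

func main() {
	fmt.Println(rubyGemsSearchURL("https://rubygems.org", "demo"))

	urls, err := gemSourceURLs([]byte(`[{"name":"demo","project_uri":"https://rubygems.org/gems/demo"}]`))
	if err != nil {
		panic(err)
	}
	fmt.Println(urls)
}
```

An empty array decodes to zero results without error, which covers the empty-registry case.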
**Tests** — same httptest pattern:
- cratesio_test.go: httptest serving canned JSON with crate entries. Verify User-Agent header is set. Test all standard categories.
- rubygems_test.go: httptest serving canned JSON array. Test all standard categories.
cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestCratesIO|TestRubyGems" -v -count=1
CratesIOSource and RubyGemsSource pass all tests. CratesIO sends proper User-Agent header. Both emit correct findings from httptest fixtures.
All 8 new files compile and pass tests:
```bash
go test ./pkg/recon/sources/ -run "TestNpm|TestPyPI|TestCratesIO|TestRubyGems" -v -count=1
go vet ./pkg/recon/sources/
```
- 4 new source files implement recon.ReconSource interface
- 4 test files use httptest with canned fixtures
- All tests pass
- No compilation errors across the package