docs(10-osint-code-hosting): create phase 10 plans (9 plans across 3 waves)
.planning/phases/10-osint-code-hosting/10-07-PLAN.md (new file, 191 lines)
---
phase: 10-osint-code-hosting
plan: 07
type: execute
wave: 2
depends_on: [10-01]
files_modified:
  - pkg/recon/sources/replit.go
  - pkg/recon/sources/replit_test.go
  - pkg/recon/sources/codesandbox.go
  - pkg/recon/sources/codesandbox_test.go
  - pkg/recon/sources/sandboxes.go
  - pkg/recon/sources/sandboxes_test.go
autonomous: true
requirements: [RECON-CODE-06, RECON-CODE-07, RECON-CODE-10]
must_haves:
  truths:
    - "ReplitSource scrapes replit.com search HTML and emits Findings tagged recon:replit"
    - "CodeSandboxSource scrapes codesandbox.io search and emits Findings tagged recon:codesandbox"
    - "SandboxesSource aggregates JSFiddle+CodePen+StackBlitz+Glitch+Observable+Gitpod with SourceType recon:sandboxes and sub-type in KeyMasked metadata slot"
    - "All three RespectsRobots()==true and rate-limit conservatively (10/min)"
  artifacts:
    - path: "pkg/recon/sources/replit.go"
      provides: "ReplitSource (scraper)"
    - path: "pkg/recon/sources/codesandbox.go"
      provides: "CodeSandboxSource (scraper)"
    - path: "pkg/recon/sources/sandboxes.go"
      provides: "SandboxesSource aggregator (JSFiddle, CodePen, StackBlitz, Glitch, Observable, Gitpod)"
  key_links:
    - from: "pkg/recon/sources/replit.go"
      to: "pkg/recon/sources/httpclient.go"
      via: "Client.Do on https://replit.com/search?q=..."
      pattern: "client\\.Do"
    - from: "pkg/recon/sources/sandboxes.go"
      to: "pkg/recon/sources/httpclient.go"
      via: "Client.Do on per-sandbox search URLs"
      pattern: "client\\.Do"
---

<objective>
Implement three scraping-based sources for sandbox/IDE platforms without public
search APIs. All three honor robots.txt, use a conservative 10 req/min rate, and
emit Findings with best-effort HTML link extraction.

Purpose: RECON-CODE-06 (Replit), RECON-CODE-07 (CodeSandbox), RECON-CODE-10
(CodePen/JSFiddle/StackBlitz/Glitch/Observable/Gitpod aggregator).
Output: 3 new ReconSource implementations + tests.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/robots.go
@pkg/recon/sources/httpclient.go

<interfaces>
Scraping strategy (identical for all three sources in this plan):
1. Build per-provider keyword queries via BuildQueries (default format = bare keyword)
2. Fetch the search URL via Client.Do (no auth headers)
3. Parse the HTML with the net/html parser (robust against real-world markup) and
   keep only anchor hrefs matching a simple result-link regex (e.g.
   href="/@user/repl-name" or href="/s/...")
4. Emit one Finding per extracted link with SourceType="recon:<name>" and Source=absolute URL
5. Return early on ctx cancellation

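The extraction step above can be sketched in plain Go. This is a stdlib-only illustration (the plan itself calls for walking the DOM via golang.org/x/net/html); the sample markup, the `extractLinks` helper, and the repl-path regex are assumptions for demonstration only:

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefRe pulls href attribute values out of raw HTML; replRe keeps only
// paths that look like repl result links (e.g. /@user/repl-name).
var (
	hrefRe = regexp.MustCompile(`href="([^"]+)"`)
	replRe = regexp.MustCompile(`^/@[^/]+/[^/]+$`)
)

// extractLinks returns absolute URLs for every href that matches the
// result-link pattern, resolved against the source's base URL.
func extractLinks(rawHTML, base string) []string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil
	}
	var out []string
	for _, m := range hrefRe.FindAllStringSubmatch(rawHTML, -1) {
		if !replRe.MatchString(m[1]) {
			continue
		}
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		out = append(out, baseURL.ResolveReference(ref).String())
	}
	return out
}

func main() {
	page := `<a href="/@alice/my-repl">ok</a> <a href="/login">no</a>`
	fmt.Println(extractLinks(page, "https://replit.com"))
	// Only the /@alice/my-repl anchor survives the filter.
}
```

A DOM walk replaces `hrefRe` in the real sources, but the filter-and-resolve shape stays the same.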
Search URLs (approximations; confirm in action):
- Replit: https://replit.com/search?q=<q>&type=repls
- CodeSandbox: https://codesandbox.io/search?query=<q>&type=sandboxes
- CodePen: https://codepen.io/search/pens?q=<q>
- JSFiddle: https://jsfiddle.net/api/search/?q=<q> (returns JSON)
- StackBlitz: https://stackblitz.com/search?q=<q>
- Glitch: https://glitch.com/api/search/projects?q=<q>
- Observable: https://observablehq.com/search?query=<q>
- Gitpod: https://www.gitpod.io/ (no public search; skip with log)

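Each URL above substitutes a keyword for `<q>`; a minimal sketch of that substitution, assuming `%s`-style templates as in Task 2's defaults (the `buildSearchURL` helper name is hypothetical):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL drops an escaped keyword into a provider's %s template,
// so multi-word keywords survive as a single query parameter.
func buildSearchURL(template, keyword string) string {
	return fmt.Sprintf(template, url.QueryEscape(keyword))
}

func main() {
	fmt.Println(buildSearchURL("https://codepen.io/search/pens?q=%s", "api key"))
}
```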
All three sources set RespectsRobots()=true. The engine honors this via the
existing pkg/recon/robots.go cache (the caller coordinates the RobotsCache
check; it is not done here because Phase 9 wires it at the SweepAll level; if
not, document a TODO in the code).

Rate limits: all 10 req/min → rate.Every(6 * time.Second). Burst 1.
</interfaces>
</context>

<tasks>

<task type="auto" tdd="true">
<name>Task 1: ReplitSource + CodeSandboxSource (scrapers)</name>
<files>pkg/recon/sources/replit.go, pkg/recon/sources/replit_test.go, pkg/recon/sources/codesandbox.go, pkg/recon/sources/codesandbox_test.go</files>
<behavior>
- Test A (each): Sweep fetches the search URL for each keyword via an httptest server
- Test B: HTML parsing extracts anchor hrefs matching the expected result patterns (use golang.org/x/net/html)
- Test C: Each extracted link is emitted as a Finding with Source=absolute URL, SourceType="recon:replit" or "recon:codesandbox"
- Test D: RespectsRobots returns true
- Test E: Ctx cancellation respected
- Test F: Enabled always returns true (no auth required)
</behavior>
<action>
Add `golang.org/x/net/html` to go.mod if not already present (`go get golang.org/x/net/html`).

Create `pkg/recon/sources/replit.go`:
- Struct `ReplitSource { BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL: `https://replit.com`
- Name "replit", RateLimit rate.Every(6*time.Second), Burst 1, RespectsRobots true, Enabled always true
- Sweep: for each keyword from BuildQueries, GET `{base}/search?q={keyword}&type=repls`, parse the HTML with `html.Parse`, walk the DOM collecting `<a href>` values matching the regex `^/@[^/]+/[^/]+$` (repl URLs), and emit one Finding per absolute URL
- Compile-time interface assertion

Create `pkg/recon/sources/replit_test.go`:
- httptest server returning a fixed HTML snippet with 2 matching anchors + 1 non-matching
- Assert exactly 2 Findings with the correct absolute URLs

Create `pkg/recon/sources/codesandbox.go` with the same shape but:
- Default BaseURL `https://codesandbox.io`
- Name "codesandbox"
- Search URL: `{base}/search?query=<q>&type=sandboxes`
- Link regex: `^/s/[a-zA-Z0-9-]+$` or `/p/sandbox/...`
- SourceType "recon:codesandbox"

Create `pkg/recon/sources/codesandbox_test.go` analogous to replit_test.go.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox" -v -timeout 30s</automated>
</verify>
<done>
Both scrapers parse HTML, extract links, and emit Findings; tests green.
</done>
</task>

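Task 1's compile-time assertion bullet can be illustrated as follows; the `ReconSource` interface shown here is a stand-in (the real one lives in pkg/recon/source.go and has more methods):

```go
package main

import "fmt"

// ReconSource is a stand-in for the project's source interface; the
// method set is illustrative, not the real contract.
type ReconSource interface {
	Name() string
	RespectsRobots() bool
}

type ReplitSource struct{}

func (r *ReplitSource) Name() string         { return "replit" }
func (r *ReplitSource) RespectsRobots() bool { return true }

// The compile-time assertion from Task 1: the build fails if *ReplitSource
// ever stops satisfying the interface. It costs nothing at runtime.
var _ ReconSource = (*ReplitSource)(nil)

func main() {
	var s ReconSource = &ReplitSource{}
	fmt.Println(s.Name(), s.RespectsRobots())
}
```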
<task type="auto" tdd="true">
<name>Task 2: SandboxesSource aggregator (JSFiddle/CodePen/StackBlitz/Glitch/Observable/Gitpod)</name>
<files>pkg/recon/sources/sandboxes.go, pkg/recon/sources/sandboxes_test.go</files>
<behavior>
- Test A: Sweep iterates every configured sub-platform for each keyword (via a test override of the Platforms slice)
- Test B: JSFiddle returns JSON → parsed into Findings (Source taken from result URLs)
- Test C: CodePen HTML → anchor extraction
- Test D: One failing sub-platform does NOT abort the others (log-and-continue)
- Test E: SourceType = "recon:sandboxes"; the sub-platform identifier goes into the `KeyMasked` slot (sentinel value `platform=codepen`; a pragmatic placeholder until a dedicated Metadata field exists)
- Test F: Ctx cancellation respected
</behavior>
<action>
Create `pkg/recon/sources/sandboxes.go`:
- Define the `subPlatform` struct: `{ Name, SearchURL, ResultLinkRegex string; IsJSON bool; JSONItemsKey string }`
- Default Platforms:

```go
var defaultPlatforms = []subPlatform{
	{Name: "codepen", SearchURL: "https://codepen.io/search/pens?q=%s", ResultLinkRegex: `^/[^/]+/pen/[a-zA-Z0-9]+`, IsJSON: false},
	{Name: "jsfiddle", SearchURL: "https://jsfiddle.net/api/search/?q=%s", IsJSON: true, JSONItemsKey: "results"},
	{Name: "stackblitz", SearchURL: "https://stackblitz.com/search?q=%s", ResultLinkRegex: `^/edit/[a-zA-Z0-9-]+`, IsJSON: false},
	{Name: "glitch", SearchURL: "https://glitch.com/api/search/projects?q=%s", IsJSON: true, JSONItemsKey: "results"},
	{Name: "observable", SearchURL: "https://observablehq.com/search?query=%s", ResultLinkRegex: `^/@[^/]+/[^/]+`, IsJSON: false},
}
```

(Gitpod omitted: no public search; document this in a comment.)
- Struct `SandboxesSource { Platforms []subPlatform; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Name "sandboxes", RateLimit rate.Every(6*time.Second), Burst 1, RespectsRobots true, Enabled always true
- Sweep: for each platform, for each keyword, fetch the URL, parse either JSON or HTML, and emit Findings with Source=absolute URL and KeyMasked="platform="+p.Name
- On any per-platform error, log (stdlib log package) and continue

Create `pkg/recon/sources/sandboxes_test.go`:
- Spin up a single httptest server; override the Platforms slice with 2 platforms
  pointing at `/codepen-search` (HTML) and `/jsfiddle-search` (JSON)
- Assert that Findings from both platforms are emitted
- Failure test: one platform returns 500 → log-and-continue; the other still emits
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestSandboxes -v -timeout 30s</automated>
</verify>
<done>
SandboxesSource iterates sub-platforms, handles HTML and JSON formats, tolerates
per-platform failure, and emits Findings tagged with a platform identifier.
</done>
</task>

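Task 2's JSON branch (IsJSON + JSONItemsKey) might look roughly like this; the `parseJSONResults` helper and the per-item `url` field are assumptions to confirm against the live JSFiddle/Glitch responses:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseJSONResults decodes a sub-platform JSON payload and collects the
// url field from each item under itemsKey ("results" in the defaults).
func parseJSONResults(body []byte, itemsKey string) ([]string, error) {
	var payload map[string]json.RawMessage
	if err := json.Unmarshal(body, &payload); err != nil {
		return nil, err
	}
	var items []struct {
		URL string `json:"url"`
	}
	if err := json.Unmarshal(payload[itemsKey], &items); err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(items))
	for _, it := range items {
		urls = append(urls, it.URL)
	}
	return urls, nil
}

func main() {
	body := []byte(`{"results":[{"url":"https://jsfiddle.net/abc/"},{"url":"https://jsfiddle.net/def/"}]}`)
	urls, err := parseJSONResults(body, "results")
	fmt.Println(urls, err)
}
```

A decode error here is exactly the per-platform failure the log-and-continue behavior (Test D) is meant to absorb.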
</tasks>

<verification>
- `go build ./...`
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -v`
</verification>

<success_criteria>
RECON-CODE-06, RECON-CODE-07, RECON-CODE-10 satisfied.
</success_criteria>

<output>
After completion, create `.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md`.
</output>