---
phase: 10-osint-code-hosting
plan: 07
type: execute
wave: 2
depends_on: [10-01]
files_modified:
  - pkg/recon/sources/replit.go
  - pkg/recon/sources/replit_test.go
  - pkg/recon/sources/codesandbox.go
  - pkg/recon/sources/codesandbox_test.go
  - pkg/recon/sources/sandboxes.go
  - pkg/recon/sources/sandboxes_test.go
autonomous: true
requirements: [RECON-CODE-06, RECON-CODE-07, RECON-CODE-10]
must_haves:
  truths:
    - "ReplitSource scrapes replit.com search HTML and emits Findings tagged recon:replit"
    - "CodeSandboxSource scrapes codesandbox.io search and emits Findings tagged recon:codesandbox"
    - "SandboxesSource aggregates JSFiddle+CodePen+StackBlitz+Glitch+Observable+Gitpod with SourceType recon:sandboxes and sub-type in KeyMasked metadata slot"
    - "All three RespectsRobots()==true and rate-limit conservatively (10/min)"
  artifacts:
    - path: "pkg/recon/sources/replit.go"
      provides: "ReplitSource (scraper)"
    - path: "pkg/recon/sources/codesandbox.go"
      provides: "CodeSandboxSource (scraper)"
    - path: "pkg/recon/sources/sandboxes.go"
      provides: "SandboxesSource aggregator (JSFiddle, CodePen, StackBlitz, Glitch, Observable, Gitpod)"
  key_links:
    - from: "pkg/recon/sources/replit.go"
      to: "pkg/recon/sources/httpclient.go"
      via: "Client.Do on https://replit.com/search?q=..."
      pattern: "client\\.Do"
    - from: "pkg/recon/sources/sandboxes.go"
      to: "pkg/recon/sources/httpclient.go"
      via: "Client.Do on per-sandbox search URLs"
      pattern: "client\\.Do"
---

Implement three scraping-based sources for sandbox/IDE platforms that have no public search APIs. All three honor robots.txt, use a conservative 10 req/min rate, and emit Findings via best-effort HTML link extraction.

Purpose: RECON-CODE-06 (Replit), RECON-CODE-07 (CodeSandbox), RECON-CODE-10 (CodePen/JSFiddle/StackBlitz/Glitch/Observable/Gitpod aggregator). Output: 3 new ReconSource implementations + tests.
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
@.planning/phases/10-osint-code-hosting/10-CONTEXT.md
@.planning/phases/10-osint-code-hosting/10-01-SUMMARY.md
@pkg/recon/source.go
@pkg/recon/robots.go
@pkg/recon/sources/httpclient.go

Scraping strategy (identical for all three sources in this plan):

1. Build per-provider keyword queries via BuildQueries (default format = bare keyword)
2. Fetch the search URL via Client.Do (no auth headers)
3. Extract result links from the HTML (`href="/@user/repl-name"` or `href="/s/..."`) — prefer the net/html parser over a bare regex for robustness
4. Emit one Finding per extracted link, with SourceType=`recon:<source name>` and Source=absolute URL
5. Return early on ctx cancellation

Search URLs (approximations — confirm in action):

- Replit: `https://replit.com/search?q=<keyword>&type=repls`
- CodeSandbox: `https://codesandbox.io/search?query=<keyword>&type=sandboxes`
- CodePen: `https://codepen.io/search/pens?q=<keyword>`
- JSFiddle: `https://jsfiddle.net/api/search/?q=<keyword>` (returns JSON)
- StackBlitz: `https://stackblitz.com/search?q=<keyword>`
- Glitch: `https://glitch.com/api/search/projects?q=<keyword>`
- Observable: `https://observablehq.com/search?query=<keyword>`
- Gitpod: `https://www.gitpod.io/` (no public search; skip with a log line)

All three sources set RespectsRobots()=true. The engine honors this via the existing pkg/recon/robots.go cache (the caller coordinates the RobotsCache check; it is not done here because Phase 9 wires it at the SweepAll level — if it turns out not to be wired there, document a TODO in code).

Rate limits: all 10 req/min → rate.Every(6 * time.Second), Burst 1.
## Task 1: ReplitSource + CodeSandboxSource (scrapers)

Files: pkg/recon/sources/replit.go, pkg/recon/sources/replit_test.go, pkg/recon/sources/codesandbox.go, pkg/recon/sources/codesandbox_test.go

- Test A (each): Sweep fetches the search URL for each keyword via an httptest server
- Test B: HTML parsing extracts anchor hrefs matching the expected result patterns (use golang.org/x/net/html)
- Test C: Each extracted link is emitted as a Finding with Source=absolute URL and SourceType="recon:replit" or "recon:codesandbox"
- Test D: RespectsRobots returns true
- Test E: Ctx cancellation is respected
- Test F: Enabled always returns true (no auth)

Add `golang.org/x/net/html` to go.mod if not already present (`go get golang.org/x/net/html`).

Create `pkg/recon/sources/replit.go`:

- Struct `ReplitSource { BaseURL string; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Default BaseURL: `https://replit.com`
- Name "replit", RateLimit rate.Every(6*time.Second), Burst 1, RespectsRobots true, Enabled always true
- Sweep: for each keyword from BuildQueries, GET `{base}/search?q={keyword}&type=repls`, parse the HTML with `html.Parse`, walk the DOM collecting anchor `href` values matching the regex `^/@[^/]+/[^/]+$` (repl URLs), and emit one Finding per absolute URL
- Compile-time interface assertion

Create `pkg/recon/sources/replit_test.go`:

- httptest server returning a fixed HTML snippet with 2 matching anchors + 1 non-matching
- Assert exactly 2 Findings with the correct absolute URLs

Create `pkg/recon/sources/codesandbox.go` with the same shape but:

- Default BaseURL `https://codesandbox.io`
- Name "codesandbox"
- Search URL: `{base}/search?query={keyword}&type=sandboxes`
- Link regex: `^/s/[a-zA-Z0-9-]+$` or `/p/sandbox/...`
- SourceType "recon:codesandbox"

Create `pkg/recon/sources/codesandbox_test.go` analogous to replit_test.go.

Verify with `cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox" -v -timeout 30s`.

Done when: both scrapers parse HTML, extract links, and emit Findings; tests green.
## Task 2: SandboxesSource aggregator (JSFiddle/CodePen/StackBlitz/Glitch/Observable/Gitpod)

Files: pkg/recon/sources/sandboxes.go, pkg/recon/sources/sandboxes_test.go

- Test A: Sweep iterates all configured sub-platforms for each keyword (via a test override of the Platforms slice)
- Test B: JSFiddle returns JSON → parsed into Findings (Source taken from result URLs)
- Test C: CodePen HTML → anchor extraction
- Test D: One failing sub-platform does NOT abort the others (log-and-continue)
- Test E: SourceType = "recon:sandboxes"; the sub-platform identifier goes into the `KeyMasked` slot as the sentinel `platform=codepen` etc. — a pragmatic placeholder until a proper Metadata field exists
- Test F: Ctx cancellation

Create `pkg/recon/sources/sandboxes.go`:

- Define a `subPlatform` struct: `{ Name, SearchURL, ResultLinkRegex string; IsJSON bool; JSONItemsKey string }`
- Default Platforms:

```go
var defaultPlatforms = []subPlatform{
	{Name: "codepen", SearchURL: "https://codepen.io/search/pens?q=%s", ResultLinkRegex: `^/[^/]+/pen/[a-zA-Z0-9]+`, IsJSON: false},
	{Name: "jsfiddle", SearchURL: "https://jsfiddle.net/api/search/?q=%s", IsJSON: true, JSONItemsKey: "results"},
	{Name: "stackblitz", SearchURL: "https://stackblitz.com/search?q=%s", ResultLinkRegex: `^/edit/[a-zA-Z0-9-]+`, IsJSON: false},
	{Name: "glitch", SearchURL: "https://glitch.com/api/search/projects?q=%s", IsJSON: true, JSONItemsKey: "results"},
	{Name: "observable", SearchURL: "https://observablehq.com/search?query=%s", ResultLinkRegex: `^/@[^/]+/[^/]+`, IsJSON: false},
}
```

(Gitpod omitted — no public search; document this in a comment.)
- Struct `SandboxesSource { Platforms []subPlatform; Registry *providers.Registry; Limiters *recon.LimiterRegistry; client *Client }`
- Name "sandboxes", RateLimit rate.Every(6*time.Second), Burst 1, RespectsRobots true, Enabled always true
- Sweep: for each platform, for each keyword, fetch the URL, parse either JSON or HTML, and emit Findings with Source=absolute URL and KeyMasked="platform="+p.Name
- On any per-platform error, log (use the stdlib log package) and continue

Create `pkg/recon/sources/sandboxes_test.go`:

- Spin up a single httptest server; override the Platforms slice with 2 platforms pointing at `/codepen-search` (HTML) and `/jsfiddle-search` (JSON)
- Assert Findings from both platforms are emitted
- Failure test: one platform returns 500 → log-and-continue; the other still emits

Verify with `cd /home/salva/Documents/apikey && go test ./pkg/recon/sources/ -run TestSandboxes -v -timeout 30s`.

Done when: SandboxesSource iterates sub-platforms, handles both HTML and JSON formats, tolerates per-platform failure, and emits Findings tagged with the platform identifier.

Final checks:

- `go build ./...`
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -v`

RECON-CODE-06, RECON-CODE-07, RECON-CODE-10 satisfied. After completion, create `.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md`.