docs(10-07): complete sandbox/IDE scraping sources plan

salvacybersec
2026-04-06 01:19:57 +03:00
parent ecebffd27d
commit 12c402ab67
4 changed files with 143 additions and 10 deletions

@@ -0,0 +1,132 @@
---
phase: 10-osint-code-hosting
plan: 07
subsystem: recon/sources
tags: [recon, osint, scraping, sandboxes, replit, codesandbox, wave-2]
requires:
- pkg/recon/sources.Client (Plan 10-01)
- pkg/recon/sources.BuildQueries (Plan 10-01)
- pkg/recon.LimiterRegistry (Phase 9)
- pkg/providers.Registry
- golang.org/x/net/html (already indirect in go.mod)
provides:
- pkg/recon/sources.ReplitSource
- pkg/recon/sources.CodeSandboxSource
- pkg/recon/sources.SandboxesSource (aggregator over codepen/jsfiddle/stackblitz/glitch/observable)
- pkg/recon/sources.extractAnchorHrefs (package-internal HTML helper)
- pkg/recon/sources.extractJSONURLs (package-internal JSON helper)
affects:
- pkg/recon/sources (added three ReconSource implementations)
tech_stack_added: []
patterns:
- "HTML scraping via golang.org/x/net/html walker + per-source anchor regex"
- "Aggregator source iterates sub-platforms with per-platform log-and-continue failure tolerance"
- "Sub-platform identity encoded in Finding.KeyMasked as 'platform=<name>' (pragmatic slot until Finding gains Metadata)"
- "10 req/min conservative rate limit (rate.Every(6*time.Second), burst 1) for all three scrapers"
key_files_created:
- pkg/recon/sources/replit.go
- pkg/recon/sources/replit_test.go
- pkg/recon/sources/codesandbox.go
- pkg/recon/sources/codesandbox_test.go
- pkg/recon/sources/sandboxes.go
- pkg/recon/sources/sandboxes_test.go
key_files_modified: []
decisions:
- "SandboxesSource omits Gitpod (no public search endpoint verified 2026-04); add when one surfaces"
- "Sub-platform identity goes in KeyMasked='platform=<name>' as a pragmatic placeholder until engine.Finding exposes a structured Metadata field"
- "Per-platform errors abort that platform's remaining queries but not the whole Sweep (log-and-continue)"
- "extractAnchorHrefs deduplicates hrefs per response so mis-linked anchors don't emit twice"
- "SearchPath accepts either absolute URLs (prod) or relative paths (tests set BaseURL) — single field serves both"
metrics:
duration_minutes: 6
tasks_completed: 2
tests_added: 15
completed_at: "2026-04-05T22:18:00Z"
---
# Phase 10 Plan 07: Sandbox/IDE Scraping Sources Summary
One-liner: Three HTML/JSON-scraping ReconSource implementations (Replit, CodeSandbox, Sandboxes aggregator) that honor robots.txt and emit Findings at a conservative 10 req/min.
## What Was Built
1. **`ReplitSource`**: scrapes `https://replit.com/search?q=...&type=repls`, extracts `/@user/repl` anchors with the `golang.org/x/net/html` parser, and emits Findings tagged `recon:replit`. The default BaseURL is `https://replit.com`; tests inject `httptest.Server.URL`.
2. **`CodeSandboxSource`**: mirrors ReplitSource against `https://codesandbox.io/search?query=...&type=sandboxes`, extracting `/s/<slug>` anchors and emitting Findings tagged `recon:codesandbox`.
3. **`SandboxesSource`**: umbrella aggregator iterating five sub-platforms (CodePen HTML, JSFiddle JSON, StackBlitz HTML, Glitch JSON, Observable HTML). Gitpod is deliberately omitted because no public search endpoint could be verified as of 2026-04. Per-platform failures are logged and skipped so one broken endpoint doesn't fail the sweep. Sub-platform identity is threaded into `Finding.KeyMasked` as `platform=<name>` until a structured metadata field lands.
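The aggregator behaviour described in item 3 can be sketched roughly as below. This is a minimal illustration, not the code in `sandboxes.go`: the `finding` struct, the `Search` function field, the `sweep` and `runDemo` names, and the example URLs are all assumptions made for the sketch. Only the log-and-continue policy and the `platform=<name>` convention come from the summary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// finding is a hypothetical stand-in for engine.Finding; only the two
// fields this sketch needs are modelled.
type finding struct {
	Source    string // URL of the matching sandbox
	KeyMasked string // carries "platform=<name>" as a placeholder
}

// subPlatform is an assumed shape: a name plus a search function.
type subPlatform struct {
	Name   string
	Search func(ctx context.Context, query string) ([]string, error)
}

// sweep runs every query against every platform. An error abandons that
// platform's remaining queries but never the whole sweep.
func sweep(ctx context.Context, platforms []subPlatform, queries []string) []finding {
	var out []finding
	for _, p := range platforms {
		for _, q := range queries {
			if ctx.Err() != nil {
				return out // caller cancelled: stop promptly
			}
			urls, err := p.Search(ctx, q)
			if err != nil {
				log.Printf("sandboxes: skipping %s: %v", p.Name, err)
				break // drop this platform's remaining queries
			}
			for _, u := range urls {
				out = append(out, finding{Source: u, KeyMasked: "platform=" + p.Name})
			}
		}
	}
	return out
}

// runDemo wires one always-failing and one healthy fake platform.
func runDemo() []finding {
	platforms := []subPlatform{
		{Name: "codepen", Search: func(context.Context, string) ([]string, error) {
			return nil, errors.New("HTTP 503")
		}},
		{Name: "jsfiddle", Search: func(_ context.Context, q string) ([]string, error) {
			return []string{"https://jsfiddle.example/" + q}, nil
		}},
	}
	return sweep(context.Background(), platforms, []string{"a", "b"})
}

func main() {
	for _, f := range runDemo() {
		fmt.Println(f.KeyMasked, f.Source)
	}
}
```

In this run the failing fake platform is logged once and skipped, while the healthy one still contributes one finding per query.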
All three implement `recon.ReconSource` with `RespectsRobots()=true`, `Enabled()=true`, `RateLimit()=rate.Every(6*time.Second)`, `Burst()=1`. Compile-time assertions guarantee interface conformance.
Two shared helpers live alongside the scrapers:
- `extractAnchorHrefs(io.Reader, *regexp.Regexp) ([]string, error)` — DOM walker returning deduplicated `<a href>` values matching the regex.
- `extractJSONURLs(io.Reader, itemsKey, urlKey string) ([]string, error)` — generic `{items: [{url: "..."}]}` decoder.
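A decoder with the `extractJSONURLs` signature listed above could look roughly like the following standard-library sketch. The function body, the handling of missing or non-string fields, and the `demoJSONURLs` helper are assumptions; the actual helper in the package may behave differently on malformed input.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// extractJSONURLs decodes a {"<itemsKey>": [{"<urlKey>": "..."}]} document
// and returns the URL strings in order. Items without a usable string under
// urlKey are skipped (an assumption of this sketch).
func extractJSONURLs(r io.Reader, itemsKey, urlKey string) ([]string, error) {
	var doc map[string]json.RawMessage
	if err := json.NewDecoder(r).Decode(&doc); err != nil {
		return nil, err
	}
	raw, ok := doc[itemsKey]
	if !ok {
		return nil, nil // no items key: treat as zero results
	}
	var items []map[string]json.RawMessage
	if err := json.Unmarshal(raw, &items); err != nil {
		return nil, err
	}
	var urls []string
	for _, item := range items {
		var u string
		if v, ok := item[urlKey]; ok && json.Unmarshal(v, &u) == nil && u != "" {
			urls = append(urls, u)
		}
	}
	return urls, nil
}

// demoJSONURLs feeds a small fixture through the decoder.
func demoJSONURLs() ([]string, error) {
	body := `{"items":[{"url":"https://example.test/a"},{"title":"no url"},{"url":"https://example.test/b"}]}`
	return extractJSONURLs(strings.NewReader(body), "items", "url")
}

func main() {
	urls, err := demoJSONURLs()
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(urls) // prints "[https://example.test/a https://example.test/b]"
}
```

Keeping the item fields as `json.RawMessage` lets one helper serve endpoints whose item objects carry different extra fields, which is presumably why the keys are parameters.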
## Tasks
| # | Name | Commit | Status |
| - | ----------------------------------------------------- | -------- | ------ |
| 1 | ReplitSource + CodeSandboxSource (scrapers, TDD) | 62a347f | done |
| 2 | SandboxesSource aggregator (CodePen+JSFiddle+...) | ecebffd | done |
## Tests
15 tests total, all green (`go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes"` → PASS in ~12s).
Replit (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name/RateLimit/Burst.
CodeSandbox (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name.
Sandboxes (5): multi-platform HTML+JSON mux test, failing-platform log-and-continue, Name/RespectsRobots/Enabled/Burst, ctx cancel, default-platforms sanity.
## Decisions Made
- **Sub-platform identity in KeyMasked:** The plan explicitly called out this pragmatic placeholder. `engine.Finding` has no `Metadata map[string]string` slot yet — adding one is architectural (Rule 4 territory) and out of scope for a scraper plan. The `platform=<name>` prefix is greppable and round-trips cleanly.
- **SearchPath single field serves prod + tests:** Rather than split absolute/relative URL fields, SearchPath accepts either. When `BaseURL` is set and the formatted URL starts with `/`, BaseURL is prepended. Tests use relative paths against an httptest server; prod uses absolute URLs.
- **Log-and-continue per platform, not per query:** A failing platform means its endpoint is broken or its response shape has changed, so retrying with another keyword wastes rate-limit budget. The platform's remaining queries are skipped and the sweep moves on to the next platform.
- **Default platforms as package var, not constructor arg:** Tests override the defaults by setting the `Platforms: []subPlatform{...}` field directly on the source, which is simpler than exposing a `WithPlatforms(...)` option.
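The SearchPath rule from the second decision above can be sketched as a small pure function. This assumes SearchPath is a `fmt` template with a single `%s` verb for the escaped query; the function name `resolveSearchURL` and the example hosts are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveSearchURL formats searchPath with the escaped query, then prepends
// baseURL only when baseURL is set and the formatted result is a relative
// path. Absolute search paths pass through unchanged.
func resolveSearchURL(baseURL, searchPath, query string) string {
	u := fmt.Sprintf(searchPath, url.QueryEscape(query))
	if baseURL != "" && strings.HasPrefix(u, "/") {
		return strings.TrimRight(baseURL, "/") + u
	}
	return u
}

func main() {
	// Test style: relative path resolved against an httptest-like base URL.
	fmt.Println(resolveSearchURL("http://127.0.0.1:8080", "/search?q=%s&type=repls", "sk-proj"))
	// prints "http://127.0.0.1:8080/search?q=sk-proj&type=repls"

	// Prod style: absolute URL, no base needed.
	fmt.Println(resolveSearchURL("", "https://replit.com/search?q=%s&type=repls", "a b"))
	// prints "https://replit.com/search?q=a+b&type=repls"
}
```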
## Deviations from Plan
### Scope-boundary deviations
**1. [Rule 3 — Blocking] Parallel-plan build breakage**
- **Found during:** Task 1 initial test run
- **Issue:** Concurrent executors (plans 10-02 GitHub, 10-03 GitLab, and 10-08 Kaggle) had uncommitted duplicate `keywordIndex` helpers and a mid-edit `github_test.go` with an unused `fmt` import, breaking the package build.
- **Resolution:** Did not modify any other plan's files. Waited for the parallel executors to commit their state; by the time I re-ran `go test`, plans 10-02/10-03/10-08 had advanced to green states on their own, and my tests then compiled and passed cleanly.
- **Files modified:** None (resolved by waiting for the other plans to commit and re-running the tests).
**2. [Rule 3 — Blocking] Lost test files during stash pop**
- **Found during:** Task 1 investigation of parallel-plan breakage
- **Issue:** Early in the session I attempted `git stash -u` to inspect a clean HEAD. The follow-up `git stash pop` failed to restore untracked files (`could not restore untracked files from stash`), leaving `replit_test.go` and `codesandbox_test.go` missing from the worktree.
- **Resolution:** Rewrote both test files from scratch (they were still in conversation context). No impact on the final deliverable.
- **Files modified:** Re-created `pkg/recon/sources/replit_test.go` and `pkg/recon/sources/codesandbox_test.go`.
- **Lesson:** Avoid `git stash -u` in worktrees with untracked new-file work. Move files to `/tmp/` explicitly instead.
No functional deviations from the plan spec: every `<behavior>` clause is covered by a test, every `<files>` entry was created, and both `<verify>` commands pass.
## Verification
- `go build ./...`: clean
- `go vet ./pkg/recon/sources/`: clean
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -timeout 60s`: PASS (15 tests, ~12s; the runtime is spent waiting on the rate limiter, not on CPU)
## Ready For
Plan 10-09 (RegisterAll wiring) can now import `ReplitSource`, `CodeSandboxSource`, and `SandboxesSource` and register them on `recon.Engine` alongside the GitHub/GitLab/Codeberg/HuggingFace/Kaggle sources delivered in sibling Wave 2 plans. No further config plumbing needed — all three are credential-free.
## Self-Check: PASSED
Files on disk:
- FOUND: pkg/recon/sources/replit.go
- FOUND: pkg/recon/sources/replit_test.go
- FOUND: pkg/recon/sources/codesandbox.go
- FOUND: pkg/recon/sources/codesandbox_test.go
- FOUND: pkg/recon/sources/sandboxes.go
- FOUND: pkg/recon/sources/sandboxes_test.go
Commits in git history:
- FOUND: 62a347f (feat(10-07): add Replit and CodeSandbox scraping sources)
- FOUND: ecebffd (feat(10-07): add SandboxesSource aggregator)