docs(10-07): complete sandbox/IDE scraping sources plan
132
.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md
Normal file
@@ -0,0 +1,132 @@
---
phase: 10-osint-code-hosting
plan: 07
subsystem: recon/sources
tags: [recon, osint, scraping, sandboxes, replit, codesandbox, wave-2]
requires:
- pkg/recon/sources.Client (Plan 10-01)
- pkg/recon/sources.BuildQueries (Plan 10-01)
- pkg/recon.LimiterRegistry (Phase 9)
- pkg/providers.Registry
- golang.org/x/net/html (already indirect in go.mod)
provides:
- pkg/recon/sources.ReplitSource
- pkg/recon/sources.CodeSandboxSource
- pkg/recon/sources.SandboxesSource (aggregator over codepen/jsfiddle/stackblitz/glitch/observable)
- pkg/recon/sources.extractAnchorHrefs (package-internal HTML helper)
- pkg/recon/sources.extractJSONURLs (package-internal JSON helper)
affects:
- pkg/recon/sources (added three ReconSource implementations)
tech_stack_added: []
patterns:
- "HTML scraping via golang.org/x/net/html walker + per-source anchor regex"
- "Aggregator source iterates sub-platforms with per-platform log-and-continue failure tolerance"
- "Sub-platform identity encoded in Finding.KeyMasked as 'platform=<name>' (pragmatic slot until Finding gains Metadata)"
- "10 req/min conservative rate limit (rate.Every(6*time.Second), burst 1) for all three scrapers"
key_files_created:
- pkg/recon/sources/replit.go
- pkg/recon/sources/replit_test.go
- pkg/recon/sources/codesandbox.go
- pkg/recon/sources/codesandbox_test.go
- pkg/recon/sources/sandboxes.go
- pkg/recon/sources/sandboxes_test.go
key_files_modified: []
decisions:
- "SandboxesSource omits Gitpod (no public search endpoint verified 2026-04); add when one surfaces"
- "Sub-platform identity goes in KeyMasked='platform=<name>' as a pragmatic placeholder until engine.Finding exposes a structured Metadata field"
- "Per-platform errors abort that platform's remaining queries but not the whole Sweep (log-and-continue)"
- "extractAnchorHrefs deduplicates hrefs per response so mis-linked anchors don't emit twice"
- "SearchPath accepts either absolute URLs (prod) or relative paths (tests set BaseURL); single field serves both"
metrics:
duration_minutes: 6
tasks_completed: 2
tests_added: 15
completed_at: "2026-04-05T22:18:00Z"
---

# Phase 10 Plan 07: Sandbox/IDE Scraping Sources Summary

One-liner: Three HTML/JSON-scraping ReconSource implementations (Replit, CodeSandbox, Sandboxes aggregator) that honor robots.txt and limit their requests to a conservative 10 req/min.

## What Was Built

1. **`ReplitSource`**: scrapes `https://replit.com/search?q=...&type=repls`, extracts `/@user/repl` anchors with the `golang.org/x/net/html` parser, and emits Findings tagged `recon:replit`. The default BaseURL is `https://replit.com`; tests inject `httptest.Server.URL`.

2. **`CodeSandboxSource`**: mirrors ReplitSource against `https://codesandbox.io/search?query=...&type=sandboxes`, extracting `/s/<slug>` anchors and emitting Findings tagged `recon:codesandbox`.

3. **`SandboxesSource`**: umbrella aggregator iterating five sub-platforms (CodePen HTML, JSFiddle JSON, StackBlitz HTML, Glitch JSON, Observable HTML). Gitpod is intentionally omitted, and the omission is documented. Per-platform failures are logged and skipped so one broken endpoint doesn't fail the sweep. Sub-platform identity is threaded into `Finding.KeyMasked` as `platform=<name>` until a structured metadata field lands.

All three implement `recon.ReconSource` with `RespectsRobots()=true`, `Enabled()=true`, `RateLimit()=rate.Every(6*time.Second)` (one request every 6 seconds, i.e. 10 req/min), and `Burst()=1`. Compile-time assertions guarantee interface conformance.

Two shared helpers live alongside the scrapers:
- `extractAnchorHrefs(io.Reader, *regexp.Regexp) ([]string, error)`: DOM walker returning deduplicated `<a href>` values matching the regex.
- `extractJSONURLs(io.Reader, itemsKey, urlKey string) ([]string, error)`: generic `{items: [{url: "..."}]}` decoder.

## Tasks

| # | Name                                                  | Commit   | Status |
| - | ----------------------------------------------------- | -------- | ------ |
| 1 | ReplitSource + CodeSandboxSource (scrapers, TDD)      | 62a347f  | done   |
| 2 | SandboxesSource aggregator (CodePen+JSFiddle+...)     | ecebffd  | done   |

## Tests

15 tests total, all green (`go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes"` → PASS in ~12s).

Replit (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name/RateLimit/Burst.

CodeSandbox (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name.

Sandboxes (5): multi-platform HTML+JSON mux test, failing-platform log-and-continue, Name/RespectsRobots/Enabled/Burst, ctx cancel, default-platforms sanity.

## Decisions Made

- **Sub-platform identity in KeyMasked:** The plan explicitly called out this pragmatic placeholder. `engine.Finding` has no `Metadata map[string]string` slot yet; adding one is architectural (Rule 4 territory) and out of scope for a scraper plan. The `platform=<name>` prefix is greppable and round-trips cleanly.
- **SearchPath single field serves prod + tests:** Rather than split absolute/relative URL fields, SearchPath accepts either. When `BaseURL` is set and the formatted URL starts with `/`, BaseURL is prepended. Tests use relative paths against an httptest server; prod uses absolute URLs.
- **Log-and-continue per platform, not per query:** A failing platform means the endpoint is broken or its response shape has changed, so retrying with another keyword is wasted budget. We skip the platform entirely and move on.
- **Default platforms as package var, not constructor arg:** Tests override by passing a `Platforms: []subPlatform{...}` slice directly, which is simpler than exposing a `WithPlatforms(...)` option.

## Deviations from Plan

### Scope-boundary deviations

**1. [Rule 3 - Blocking] Parallel-plan build breakage**

- **Found during:** Task 1 initial test run
- **Issue:** Concurrent executors (plan 10-02 GitHub, plan 10-03 GitLab, plan 10-08 Kaggle) had uncommitted duplicate `keywordIndex` helpers and a mid-edit `github_test.go` with an unused `fmt` import, which broke the package build.
- **Resolution:** Did not modify any other plan's files. Waited for the parallel executors to commit their state; by the time I re-ran `go test`, plans 10-02/10-03/10-08 had advanced to green states on their own, and my tests then compiled and passed cleanly.
- **Files modified:** None (the resolution was waiting and re-running).

**2. [Rule 3 - Blocking] Lost test files during stash pop**

- **Found during:** Task 1 investigation of parallel-plan breakage
- **Issue:** Early in the session I attempted `git stash -u` to inspect a clean HEAD. The follow-up `git stash pop` failed to restore untracked files (`could not restore untracked files from stash`), and `replit_test.go` and `codesandbox_test.go` were lost as a result.
- **Resolution:** Rewrote both test files from scratch (they were still in conversation context). Zero impact on final deliverable.
- **Files modified:** Re-created `pkg/recon/sources/replit_test.go` and `pkg/recon/sources/codesandbox_test.go`.
- **Lesson:** Avoid `git stash -u` in worktrees with untracked new-file work. Move files to `/tmp/` explicitly instead.

No functional deviations from the plan spec: every `<behavior>` clause is covered by a test, every `<files>` entry was created, and both `<verify>` commands pass.

## Verification

- `go build ./...`: clean
- `go vet ./pkg/recon/sources/`: clean
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -timeout 60s`: PASS (15 tests, ~12s; the time is spent waiting on the rate limiter, not on CPU)

## Ready For

Plan 10-09 (RegisterAll wiring) can now import `ReplitSource`, `CodeSandboxSource`, and `SandboxesSource` and register them on `recon.Engine` alongside the GitHub/GitLab/Codeberg/HuggingFace/Kaggle sources delivered in sibling Wave 2 plans. No further config plumbing is needed: all three are credential-free.

## Self-Check: PASSED

Files on disk:
- FOUND: pkg/recon/sources/replit.go
- FOUND: pkg/recon/sources/replit_test.go
- FOUND: pkg/recon/sources/codesandbox.go
- FOUND: pkg/recon/sources/codesandbox_test.go
- FOUND: pkg/recon/sources/sandboxes.go
- FOUND: pkg/recon/sources/sandboxes_test.go

Commits in git history:
- FOUND: 62a347f (feat(10-07): add Replit and CodeSandbox scraping sources)
- FOUND: ecebffd (feat(10-07): add SandboxesSource aggregator)