| phase | plan | subsystem | tags | requires | provides | affects | tech_stack_added | patterns | key_files_created | key_files_modified | decisions | metrics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10-osint-code-hosting | 07 | recon/sources | | | | | | | | | | |
Phase 10 Plan 07: Sandbox/IDE Scraping Sources Summary
One-liner: Three HTML/JSON-scraping ReconSource implementations (Replit, CodeSandbox, Sandboxes aggregator) that honor robots.txt and emit Findings at a conservative 10 req/min.
What Was Built
- `ReplitSource`: scrapes `https://replit.com/search?q=...&type=repls`, extracts `/@user/repl` anchors via the `golang.org/x/net/html` parser, and emits Findings tagged `recon:replit`. Default BaseURL `https://replit.com`; tests inject `httptest.Server.URL`.
- `CodeSandboxSource`: mirrors ReplitSource against `https://codesandbox.io/search?query=...&type=sandboxes`, extracting `/s/<slug>` anchors and emitting Findings tagged `recon:codesandbox`.
- `SandboxesSource`: umbrella aggregator iterating five sub-platforms (CodePen HTML, JSFiddle JSON, StackBlitz HTML, Glitch JSON, Observable HTML). Gitpod is omitted, and the omission is documented. Per-platform failures are logged and skipped so one broken endpoint doesn't fail the sweep. Sub-platform identity is threaded into `Finding.KeyMasked` as `platform=<name>` until a structured metadata field lands.
All three implement recon.ReconSource with RespectsRobots()=true, Enabled()=true, RateLimit()=rate.Every(6*time.Second), Burst()=1. Compile-time assertions guarantee interface conformance.
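As a rough illustration of the contract above, the sketch below shows a compile-time interface assertion and why a 6-second interval corresponds to 10 requests per minute. The `reconSource` interface here is a reduced, hypothetical version of `recon.ReconSource`: the real `RateLimit()` returns a `rate.Limit` from `golang.org/x/time/rate`, whereas this sketch exposes the interval as a `time.Duration` so it depends only on the standard library. The `Name()` value is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// reconSource is a reduced, hypothetical stand-in for recon.ReconSource.
type reconSource interface {
	Name() string
	RespectsRobots() bool
	Enabled() bool
	Interval() time.Duration // assumed substitute for RateLimit()
	Burst() int
}

type replitSource struct{}

func (replitSource) Name() string            { return "replit" } // assumed value
func (replitSource) RespectsRobots() bool    { return true }
func (replitSource) Enabled() bool           { return true }
func (replitSource) Interval() time.Duration { return 6 * time.Second }
func (replitSource) Burst() int              { return 1 }

// Compile-time assertion: the build fails if replitSource ever stops
// satisfying the interface.
var _ reconSource = (*replitSource)(nil)

// perMinute converts a minimum interval between requests into a
// requests-per-minute figure.
func perMinute(interval time.Duration) float64 {
	return float64(time.Minute) / float64(interval)
}

func main() {
	s := replitSource{}
	fmt.Printf("%s: %.0f req/min, burst %d\n", s.Name(), perMinute(s.Interval()), s.Burst())
}
```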
Two shared helpers live alongside the scrapers:
- `extractAnchorHrefs(io.Reader, *regexp.Regexp) ([]string, error)`: DOM walker returning deduplicated `<a href>` values matching the regex.
- `extractJSONURLs(io.Reader, itemsKey, urlKey string) ([]string, error)`: generic `{items: [{url: "..."}]}` decoder.
Tasks
| # | Name | Commit | Status |
|---|---|---|---|
| 1 | ReplitSource + CodeSandboxSource (scrapers, TDD) | 62a347f | done |
| 2 | SandboxesSource aggregator (CodePen+JSFiddle+...) | ecebffd | done |
Tests
15 tests total, all green (go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" → PASS in ~12s).
Replit (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name/RateLimit/Burst.
CodeSandbox (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name.
Sandboxes (5): multi-platform HTML+JSON mux test, failing-platform log-and-continue, Name/RespectsRobots/Enabled/Burst, ctx cancel, default-platforms sanity.
Decisions Made
- Sub-platform identity in KeyMasked: The plan explicitly called out this pragmatic placeholder. `engine.Finding` has no `Metadata map[string]string` slot yet; adding one is architectural (Rule 4 territory) and out of scope for a scraper plan. The `platform=<name>` prefix is greppable and round-trips cleanly.
- SearchPath single field serves prod + tests: Rather than split absolute/relative URL fields, SearchPath accepts either. When `BaseURL` is set and the formatted URL starts with `/`, BaseURL is prepended. Tests use relative paths against an httptest server; prod uses absolute URLs.
- Log-and-continue per platform, not per query: A failing platform means the endpoint is broken or has changed shape, so retrying with another keyword is wasted budget. We skip the platform entirely and move on.
- Default platforms as package var, not constructor arg: Tests override by passing a `Platforms: []subPlatform{...}` slice directly, which is simpler than exposing a `WithPlatforms(...)` option.
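The SearchPath rule can be sketched as a small helper. This assumes SearchPath is a `fmt`-style format string with a single `%s` for the escaped query; the function name `resolveSearchURL` and the example hosts are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveSearchURL formats searchPath with the escaped query, then prepends
// baseURL only when the result is a relative path (starts with "/").
func resolveSearchURL(baseURL, searchPath, query string) string {
	formatted := fmt.Sprintf(searchPath, url.QueryEscape(query))
	if baseURL != "" && strings.HasPrefix(formatted, "/") {
		return strings.TrimRight(baseURL, "/") + formatted
	}
	return formatted
}

func main() {
	// Test-style usage: relative path against a local test server address.
	fmt.Println(resolveSearchURL("http://127.0.0.1:8080", "/search?q=%s", "api key"))
	// Prod-style usage: an absolute URL is left untouched.
	fmt.Println(resolveSearchURL("", "https://example.test/search?q=%s", "api key"))
}
```

Keeping one field means a platform definition does not need separate test and production variants; only the base URL changes between them.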
Deviations from Plan
Scope-boundary deviations
1. [Rule 3 — Blocking] Parallel-plan build breakage
- Found during: Task 1 initial test run
- Issue: Concurrent executors (plan 10-02 GitHub, plan 10-03 GitLab, plan 10-08 Kaggle) had uncommitted duplicate `keywordIndex` helpers and a mid-edit `github_test.go` with an unused `fmt` import, breaking the package build.
- Resolution: Did not modify any other plan's files. Waited for the parallel executors to commit their state; by the time I re-ran `go test`, plans 10-02/10-03/10-08 had advanced to green states on their own, and my tests then compiled and passed cleanly.
- Files modified: None (the issue resolved by waiting and re-running).
2. [Rule 3 — Blocking] Lost test files during stash pop
- Found during: Task 1 investigation of parallel-plan breakage
- Issue: Early in the session I attempted `git stash -u` to inspect a clean HEAD. The follow-up `git stash pop` failed to restore untracked files (`could not restore untracked files from stash`), silently deleting `replit_test.go` and `codesandbox_test.go`.
- Resolution: Rewrote both test files from scratch (they were still in conversation context). Zero impact on final deliverable.
- Files modified: Re-created `pkg/recon/sources/replit_test.go` and `pkg/recon/sources/codesandbox_test.go`.
- Lesson: Avoid `git stash -u` in worktrees with untracked new-file work. Move files to `/tmp/` explicitly instead.
No functional deviations from the plan spec: every <behavior> clause is covered by a test, every <files> entry was created, and both <verify> commands pass.
Verification
- `go build ./...`: clean
- `go vet ./pkg/recon/sources/`: clean
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -timeout 60s`: PASS (15 tests, ~12s; the 12s is limiter-gated, not CPU)
Ready For
Plan 10-09 (RegisterAll wiring) can now import ReplitSource, CodeSandboxSource, and SandboxesSource and register them on recon.Engine alongside the GitHub/GitLab/Codeberg/HuggingFace/Kaggle sources delivered in sibling Wave 2 plans. No further config plumbing needed — all three are credential-free.
Self-Check: PASSED
Files on disk:
- FOUND: pkg/recon/sources/replit.go
- FOUND: pkg/recon/sources/replit_test.go
- FOUND: pkg/recon/sources/codesandbox.go
- FOUND: pkg/recon/sources/codesandbox_test.go
- FOUND: pkg/recon/sources/sandboxes.go
- FOUND: pkg/recon/sources/sandboxes_test.go
Commits in git history: