keyhunter/.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md
2026-04-06 01:19:57 +03:00

---
phase: 10-osint-code-hosting
plan: 07
subsystem: recon/sources
tags: [recon, osint, scraping, sandboxes, replit, codesandbox, wave-2]
requires:
  - pkg/recon/sources.Client (Plan 10-01)
  - pkg/recon/sources.BuildQueries (Plan 10-01)
  - pkg/recon.LimiterRegistry (Phase 9)
  - pkg/providers.Registry
provides:
  - pkg/recon/sources.ReplitSource
  - pkg/recon/sources.CodeSandboxSource
  - pkg/recon/sources.SandboxesSource (aggregator over codepen/jsfiddle/stackblitz/glitch/observable)
  - pkg/recon/sources.extractAnchorHrefs (package-internal HTML helper)
  - pkg/recon/sources.extractJSONURLs (package-internal JSON helper)
affects:
  - pkg/recon/sources (added three ReconSource implementations)
tech_stack_added:
  - golang.org/x/net/html (already indirect in go.mod)
patterns:
  - HTML scraping via golang.org/x/net/html walker + per-source anchor regex
  - Aggregator source iterates sub-platforms with per-platform log-and-continue failure tolerance
  - Sub-platform identity encoded in Finding.KeyMasked as 'platform=<name>' (pragmatic slot until Finding gains Metadata)
  - 10 req/min conservative rate limit (rate.Every(6*time.Second), burst 1) for all three scrapers
key_files_created:
  - pkg/recon/sources/replit.go
  - pkg/recon/sources/replit_test.go
  - pkg/recon/sources/codesandbox.go
  - pkg/recon/sources/codesandbox_test.go
  - pkg/recon/sources/sandboxes.go
  - pkg/recon/sources/sandboxes_test.go
key_files_modified: []
decisions:
  - SandboxesSource omits Gitpod (no public search endpoint verified 2026-04); add when one surfaces
  - Sub-platform identity goes in KeyMasked='platform=<name>' as a pragmatic placeholder until engine.Finding exposes a structured Metadata field
  - Per-platform errors abort that platform's remaining queries but not the whole Sweep (log-and-continue)
  - extractAnchorHrefs deduplicates hrefs per response so mis-linked anchors don't emit twice
  - SearchPath accepts either absolute URLs (prod) or relative paths (tests set BaseURL); single field serves both
metrics:
  duration_minutes: 6
  tasks_completed: 2
  tests_added: 15
  completed_at: 2026-04-05T22:18:00Z
---

Phase 10 Plan 07: Sandbox/IDE Scraping Sources Summary

One-liner: Three HTML/JSON-scraping ReconSource implementations (Replit, CodeSandbox, Sandboxes aggregator) that honor robots.txt and emit Findings at a conservative 10 req/min.

What Was Built

  1. ReplitSource — scrapes https://replit.com/search?q=...&type=repls, extracts /@user/repl anchors via golang.org/x/net/html parser, emits Findings tagged recon:replit. Default BaseURL https://replit.com; tests inject httptest.Server.URL.

  2. CodeSandboxSource — mirrors ReplitSource against https://codesandbox.io/search?query=...&type=sandboxes, extracting /s/<slug> anchors and emitting Findings tagged recon:codesandbox.

  3. SandboxesSource — umbrella aggregator iterating five sub-platforms (CodePen HTML, JSFiddle JSON, StackBlitz HTML, Glitch JSON, Observable HTML). Gitpod is deliberately omitted (documented under Decisions). Per-platform failures are logged and skipped so one broken endpoint doesn't fail the sweep. Sub-platform identity is threaded into Finding.KeyMasked as platform=<name> until a structured metadata field lands.

All three implement recon.ReconSource with RespectsRobots()=true, Enabled()=true, RateLimit()=rate.Every(6*time.Second), Burst()=1. Compile-time assertions guarantee interface conformance.

Two shared helpers live alongside the scrapers:

  • extractAnchorHrefs(io.Reader, *regexp.Regexp) ([]string, error) — DOM walker returning deduplicated <a href> values matching the regex.
  • extractJSONURLs(io.Reader, itemsKey, urlKey string) ([]string, error) — generic {items: [{url: "..."}]} decoder.

Tasks

| # | Name | Commit | Status |
|---|------|--------|--------|
| 1 | ReplitSource + CodeSandboxSource (scrapers, TDD) | 62a347f | done |
| 2 | SandboxesSource aggregator (CodePen+JSFiddle+...) | ecebffd | done |

Tests

15 tests total, all green (go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" → PASS in ~12s).

Replit (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name/RateLimit/Burst.

CodeSandbox (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name.

Sandboxes (5): multi-platform HTML+JSON mux test, failing-platform log-and-continue, Name/RespectsRobots/Enabled/Burst, ctx cancel, default-platforms sanity.

Decisions Made

  • Sub-platform identity in KeyMasked: The plan explicitly called out this pragmatic placeholder. engine.Finding has no Metadata map[string]string slot yet — adding one is architectural (Rule 4 territory) and out of scope for a scraper plan. The platform=<name> prefix is greppable and round-trips cleanly.
  • SearchPath single field serves prod + tests: Rather than split absolute/relative URL fields, SearchPath accepts either. When BaseURL is set and the formatted URL starts with /, BaseURL is prepended. Tests use relative paths against an httptest server; prod uses absolute URLs.
  • Log-and-continue per platform, not per query: A failing platform means the endpoint is broken or shape-shifted — retrying with another keyword is wasted budget. We skip the platform entirely and move on.
  • Default platforms as package var, not constructor arg: Tests override by passing a Platforms: []subPlatform{...} slice directly, which is simpler than exposing a WithPlatforms(...) option.

Deviations from Plan

Scope-boundary deviations

1. [Rule 3 — Blocking] Parallel-plan build breakage

  • Found during: Task 1 initial test run
  • Issue: Concurrent executors (plans 10-02 GitHub, 10-03 GitLab, 10-08 Kaggle) had uncommitted duplicate keywordIndex helpers and a mid-edit github_test.go with an unused fmt import, breaking the package build.
  • Resolution: Did not modify any other plan's files. Waited for the parallel executors to commit their state; by the time I re-ran go test, plans 10-02/10-03/10-08 had advanced to green states on their own, and my tests then compiled and passed cleanly.
  • Files modified: None (all resolution came from patience + re-running).

2. [Rule 3 — Blocking] Lost test files during stash pop

  • Found during: Task 1 investigation of parallel-plan breakage
  • Issue: Early in the session I attempted git stash -u to inspect a clean HEAD. The follow-up git stash pop failed to restore untracked files (could not restore untracked files from stash), silently deleting replit_test.go and codesandbox_test.go.
  • Resolution: Rewrote both test files from scratch (they were still in conversation context). Zero impact on final deliverable.
  • Files modified: Re-created pkg/recon/sources/replit_test.go and pkg/recon/sources/codesandbox_test.go.
  • Lesson: Avoid git stash -u in worktrees with untracked new-file work. Move files to /tmp/ explicitly instead.

No functional deviations from the plan spec: every <behavior> clause is covered by a test, every <files> entry was created, and both <verify> commands pass.

Verification

  • go build ./... — clean
  • go vet ./pkg/recon/sources/ — clean
  • go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -timeout 60s — PASS (15 tests, ~12s; 12s is limiter-gated, not CPU)

Ready For

Plan 10-09 (RegisterAll wiring) can now import ReplitSource, CodeSandboxSource, and SandboxesSource and register them on recon.Engine alongside the GitHub/GitLab/Codeberg/HuggingFace/Kaggle sources delivered in sibling Wave 2 plans. No further config plumbing needed — all three are credential-free.

Self-Check: PASSED

Files on disk:

  • FOUND: pkg/recon/sources/replit.go
  • FOUND: pkg/recon/sources/replit_test.go
  • FOUND: pkg/recon/sources/codesandbox.go
  • FOUND: pkg/recon/sources/codesandbox_test.go
  • FOUND: pkg/recon/sources/sandboxes.go
  • FOUND: pkg/recon/sources/sandboxes_test.go

Commits in git history:

  • FOUND: 62a347f (feat(10-07): add Replit and CodeSandbox scraping sources)
  • FOUND: ecebffd (feat(10-07): add SandboxesSource aggregator)