docs(10-07): complete sandbox/IDE scraping sources plan
@@ -107,11 +107,11 @@ Requirements for initial release. Each maps to roadmap phases.
 - [ ] **RECON-CODE-03**: GitHub Gist search
 - [ ] **RECON-CODE-04**: Bitbucket code search
 - [ ] **RECON-CODE-05**: Codeberg/Gitea search (Gitea auto-discovered via Shodan)
-- [ ] **RECON-CODE-06**: Replit public repl scanning
-- [ ] **RECON-CODE-07**: CodeSandbox project scanning
+- [x] **RECON-CODE-06**: Replit public repl scanning
+- [x] **RECON-CODE-07**: CodeSandbox project scanning
 - [ ] **RECON-CODE-08**: HuggingFace Spaces and repos scanning
 - [ ] **RECON-CODE-09**: Kaggle notebook scanning
-- [ ] **RECON-CODE-10**: CodePen, JSFiddle, StackBlitz, Glitch, Observable, Gitpod scanning
+- [x] **RECON-CODE-10**: CodePen, JSFiddle, StackBlitz, Glitch, Observable, Gitpod scanning

 ### OSINT/Recon — Search Engine Dorking

@@ -223,7 +223,7 @@ Plans:
 - [ ] 10-04-PLAN.md — BitbucketSource + GistSource (RECON-CODE-03, RECON-CODE-04)
 - [ ] 10-05-PLAN.md — CodebergSource/Gitea (RECON-CODE-05)
 - [ ] 10-06-PLAN.md — HuggingFaceSource (RECON-CODE-08)
-- [ ] 10-07-PLAN.md — Replit + CodeSandbox + Sandboxes scrapers (RECON-CODE-06, RECON-CODE-07, RECON-CODE-10)
+- [x] 10-07-PLAN.md — Replit + CodeSandbox + Sandboxes scrapers (RECON-CODE-06, RECON-CODE-07, RECON-CODE-10)
 - [ ] 10-08-PLAN.md — KaggleSource (RECON-CODE-09)
 - [ ] 10-09-PLAN.md — RegisterAll wiring + CLI integration + end-to-end test

@@ -336,7 +336,7 @@ Phases execute in numeric order: 1 → 2 → 3 → ... → 18
 | 7. Import Adapters & CI/CD Integration | 0/? | Not started | - |
 | 8. Dork Engine | 0/? | Not started | - |
 | 9. OSINT Infrastructure | 2/6 | In Progress| |
-| 10. OSINT Code Hosting | 2/9 | In Progress| |
+| 10. OSINT Code Hosting | 3/9 | In Progress| |
 | 11. OSINT Search & Paste | 0/? | Not started | - |
 | 12. OSINT IoT & Cloud Storage | 0/? | Not started | - |
 | 13. OSINT Package Registries & Container/IaC | 0/? | Not started | - |
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: executing
-stopped_at: Completed 10-02-PLAN.md
-last_updated: "2026-04-05T22:17:17.284Z"
+stopped_at: Completed 10-07-PLAN.md
+last_updated: "2026-04-05T22:19:41.729Z"
 last_activity: 2026-04-05
 progress:
   total_phases: 18
   completed_phases: 9
   total_plans: 62
-  completed_plans: 56
+  completed_plans: 57
   percent: 20
 ---

@@ -87,6 +87,7 @@ Progress: [██░░░░░░░░] 20%
 | Phase 09-osint-infrastructure P06 | 8min | 2 tasks | 2 files |
 | Phase 10-osint-code-hosting P01 | 4m | 2 tasks | 7 files |
 | Phase 10-osint-code-hosting P02 | 5min | 1 tasks | 2 files |
+| Phase 10-osint-code-hosting P07 | 6 | 2 tasks | 6 files |

 ## Accumulated Context

@@ -137,6 +138,6 @@ None yet.

 ## Session Continuity

-Last session: 2026-04-05T22:17:11.799Z
-Stopped at: Completed 10-02-PLAN.md
+Last session: 2026-04-05T22:19:41.725Z
+Stopped at: Completed 10-07-PLAN.md
 Resume file: None
132
.planning/phases/10-osint-code-hosting/10-07-SUMMARY.md
Normal file
@@ -0,0 +1,132 @@
---
phase: 10-osint-code-hosting
plan: 07
subsystem: recon/sources
tags: [recon, osint, scraping, sandboxes, replit, codesandbox, wave-2]
requires:
- pkg/recon/sources.Client (Plan 10-01)
- pkg/recon/sources.BuildQueries (Plan 10-01)
- pkg/recon.LimiterRegistry (Phase 9)
- pkg/providers.Registry
- golang.org/x/net/html (already indirect in go.mod)
provides:
- pkg/recon/sources.ReplitSource
- pkg/recon/sources.CodeSandboxSource
- pkg/recon/sources.SandboxesSource (aggregator over codepen/jsfiddle/stackblitz/glitch/observable)
- pkg/recon/sources.extractAnchorHrefs (package-internal HTML helper)
- pkg/recon/sources.extractJSONURLs (package-internal JSON helper)
affects:
- pkg/recon/sources (added three ReconSource implementations)
tech_stack_added: []
patterns:
- "HTML scraping via golang.org/x/net/html walker + per-source anchor regex"
- "Aggregator source iterates sub-platforms with per-platform log-and-continue failure tolerance"
- "Sub-platform identity encoded in Finding.KeyMasked as 'platform=<name>' (pragmatic slot until Finding gains Metadata)"
- "10 req/min conservative rate limit (rate.Every(6*time.Second), burst 1) for all three scrapers"
key_files_created:
- pkg/recon/sources/replit.go
- pkg/recon/sources/replit_test.go
- pkg/recon/sources/codesandbox.go
- pkg/recon/sources/codesandbox_test.go
- pkg/recon/sources/sandboxes.go
- pkg/recon/sources/sandboxes_test.go
key_files_modified: []
decisions:
- "SandboxesSource omits Gitpod (no public search endpoint verified 2026-04); add when one surfaces"
- "Sub-platform identity goes in KeyMasked='platform=<name>' as a pragmatic placeholder until engine.Finding exposes a structured Metadata field"
- "Per-platform errors abort that platform's remaining queries but not the whole Sweep (log-and-continue)"
- "extractAnchorHrefs deduplicates hrefs per response so mis-linked anchors don't emit twice"
- "SearchPath accepts either absolute URLs (prod) or relative paths (tests set BaseURL) — single field serves both"
metrics:
duration_minutes: 6
tasks_completed: 2
tests_added: 15
completed_at: "2026-04-05T22:18:00Z"
---

# Phase 10 Plan 07: Sandbox/IDE Scraping Sources Summary

One-liner: Three HTML/JSON-scraping ReconSource implementations (Replit, CodeSandbox, Sandboxes aggregator) that honor robots.txt and emit Findings at a conservative 10 req/min.

## What Was Built

1. **`ReplitSource`** — scrapes `https://replit.com/search?q=...&type=repls`, extracts `/@user/repl` anchors with the `golang.org/x/net/html` parser, and emits Findings tagged `recon:replit`. The default BaseURL is `https://replit.com`; tests inject `httptest.Server.URL`.

2. **`CodeSandboxSource`** — mirrors ReplitSource against `https://codesandbox.io/search?query=...&type=sandboxes`, extracting `/s/<slug>` anchors and emitting Findings tagged `recon:codesandbox`.

3. **`SandboxesSource`** — umbrella aggregator that iterates five sub-platforms (CodePen HTML, JSFiddle JSON, StackBlitz HTML, Glitch JSON, Observable HTML). Gitpod is deliberately omitted because no public search endpoint could be verified as of 2026-04. Per-platform failures are logged and skipped so one broken endpoint doesn't fail the sweep. Sub-platform identity is threaded into `Finding.KeyMasked` as `platform=<name>` until a structured metadata field lands.

All three implement `recon.ReconSource` with `RespectsRobots()=true`, `Enabled()=true`, `RateLimit()=rate.Every(6*time.Second)`, `Burst()=1`. Compile-time assertions guarantee interface conformance.

Two shared helpers live alongside the scrapers:
- `extractAnchorHrefs(io.Reader, *regexp.Regexp) ([]string, error)` — DOM walker returning deduplicated `<a href>` values matching the regex.
- `extractJSONURLs(io.Reader, itemsKey, urlKey string) ([]string, error)` — generic `{items: [{url: "..."}]}` decoder.

## Tasks

| # | Name                                              | Commit  | Status |
| - | ------------------------------------------------- | ------- | ------ |
| 1 | ReplitSource + CodeSandboxSource (scrapers, TDD)  | 62a347f | done   |
| 2 | SandboxesSource aggregator (CodePen+JSFiddle+...) | ecebffd | done   |

## Tests

15 tests total, all green (`go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes"` → PASS in ~12s).

Replit (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name/RateLimit/Burst.

CodeSandbox (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name.

Sandboxes (5): multi-platform HTML+JSON mux test, failing-platform log-and-continue, Name/RespectsRobots/Enabled/Burst, ctx cancel, default-platforms sanity.

## Decisions Made

- **Sub-platform identity in KeyMasked:** The plan explicitly called out this pragmatic placeholder. `engine.Finding` has no `Metadata map[string]string` slot yet — adding one is architectural (Rule 4 territory) and out of scope for a scraper plan. The `platform=<name>` prefix is greppable and round-trips cleanly.
- **SearchPath single field serves prod + tests:** Rather than split absolute/relative URL fields, SearchPath accepts either. When `BaseURL` is set and the formatted URL starts with `/`, BaseURL is prepended. Tests use relative paths against an httptest server; prod uses absolute URLs.
- **Log-and-continue per platform, not per query:** A failing platform means the endpoint is broken or shape-shifted — retrying with another keyword is wasted budget. We skip the platform entirely and move on.
- **Default platforms as package var, not constructor arg:** Tests override by passing a `Platforms: []subPlatform{...}` slice directly, which is simpler than exposing a `WithPlatforms(...)` option.
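
The SearchPath rule described above can be sketched as a small resolver. This assumes SearchPath is a `fmt`-style template with a single `%s` for the query, which the summary does not state; `resolveSearchURL` and the `example.test` host are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveSearchURL formats searchPath with the escaped query, then prepends
// baseURL only when the result is a relative path (starts with "/").
// Assumption: searchPath contains exactly one %s verb and no other '%'.
func resolveSearchURL(baseURL, searchPath, query string) string {
	u := fmt.Sprintf(searchPath, url.QueryEscape(query))
	if baseURL != "" && strings.HasPrefix(u, "/") {
		return strings.TrimRight(baseURL, "/") + u
	}
	return u
}

func main() {
	// Test-style usage: relative path resolved against an httptest-like base.
	fmt.Println(resolveSearchURL("http://127.0.0.1:8080", "/search?q=%s", "sk-"))
	// Prod-style usage: absolute URL passes through untouched.
	fmt.Println(resolveSearchURL("", "https://example.test/search?q=%s", "api key"))
}
```

Under this reading, an absolute SearchPath ignores BaseURL even when BaseURL is set, so production defaults keep working if a caller sets BaseURL for some other reason.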

## Deviations from Plan

### Scope-boundary deviations

**1. [Rule 3 — Blocking] Parallel-plan build breakage**

- **Found during:** Task 1 initial test run
- **Issue:** Concurrent executors (plan 10-02 GitHub, plan 10-03 GitLab, plan 10-08 Kaggle) had uncommitted duplicate `keywordIndex` helpers and a mid-edit `github_test.go` with an unused `fmt` import, which broke the package build.
- **Resolution:** Did not modify any other plan's files. Waited for the parallel executors to commit their state; by the time I re-ran `go test`, plans 10-02/10-03/10-08 had advanced to green states on their own, and my tests then compiled and passed cleanly.
- **Files modified:** None (the fix was to wait and re-run).

**2. [Rule 3 — Blocking] Lost test files during stash pop**

- **Found during:** Task 1 investigation of parallel-plan breakage
- **Issue:** Early in the session I ran `git stash -u` to inspect a clean HEAD. The follow-up `git stash pop` did not restore the untracked files (`could not restore untracked files from stash`), and `replit_test.go` and `codesandbox_test.go` were lost.
- **Resolution:** Rewrote both test files from scratch (they were still in conversation context). Zero impact on final deliverable.
- **Files modified:** Re-created `pkg/recon/sources/replit_test.go` and `pkg/recon/sources/codesandbox_test.go`.
- **Lesson:** Avoid `git stash -u` in worktrees with untracked new-file work. Move files to `/tmp/` explicitly instead.

No functional deviations from the plan spec: every `<behavior>` clause is covered by a test, every `<files>` entry was created, and both `<verify>` commands pass.

## Verification

- `go build ./...` — clean
- `go vet ./pkg/recon/sources/` — clean
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -timeout 60s` — PASS (15 tests, ~12s; the runtime is spent waiting on the rate limiter, not on CPU)

## Ready For

Plan 10-09 (RegisterAll wiring) can now import `ReplitSource`, `CodeSandboxSource`, and `SandboxesSource` and register them on `recon.Engine` alongside the GitHub/GitLab/Codeberg/HuggingFace/Kaggle sources delivered in sibling Wave 2 plans. No further config plumbing needed — all three are credential-free.

## Self-Check: PASSED

Files on disk:
- FOUND: pkg/recon/sources/replit.go
- FOUND: pkg/recon/sources/replit_test.go
- FOUND: pkg/recon/sources/codesandbox.go
- FOUND: pkg/recon/sources/codesandbox_test.go
- FOUND: pkg/recon/sources/sandboxes.go
- FOUND: pkg/recon/sources/sandboxes_test.go

Commits in git history:
- FOUND: 62a347f (feat(10-07): add Replit and CodeSandbox scraping sources)
- FOUND: ecebffd (feat(10-07): add SandboxesSource aggregator)