docs(10-07): complete sandbox/IDE scraping sources plan

salvacybersec
2026-04-06 01:19:57 +03:00
parent ecebffd27d
commit 12c402ab67
4 changed files with 143 additions and 10 deletions


@@ -107,11 +107,11 @@ Requirements for initial release. Each maps to roadmap phases.
- [ ] **RECON-CODE-03**: GitHub Gist search
- [ ] **RECON-CODE-04**: Bitbucket code search
- [ ] **RECON-CODE-05**: Codeberg/Gitea search (Gitea auto-discovered via Shodan)
- [ ] **RECON-CODE-06**: Replit public repl scanning
- [ ] **RECON-CODE-07**: CodeSandbox project scanning
- [x] **RECON-CODE-06**: Replit public repl scanning
- [x] **RECON-CODE-07**: CodeSandbox project scanning
- [ ] **RECON-CODE-08**: HuggingFace Spaces and repos scanning
- [ ] **RECON-CODE-09**: Kaggle notebook scanning
- [ ] **RECON-CODE-10**: CodePen, JSFiddle, StackBlitz, Glitch, Observable, Gitpod scanning
- [x] **RECON-CODE-10**: CodePen, JSFiddle, StackBlitz, Glitch, Observable, Gitpod scanning
### OSINT/Recon — Search Engine Dorking


@@ -223,7 +223,7 @@ Plans:
- [ ] 10-04-PLAN.md — BitbucketSource + GistSource (RECON-CODE-03, RECON-CODE-04)
- [ ] 10-05-PLAN.md — CodebergSource/Gitea (RECON-CODE-05)
- [ ] 10-06-PLAN.md — HuggingFaceSource (RECON-CODE-08)
- [ ] 10-07-PLAN.md — Replit + CodeSandbox + Sandboxes scrapers (RECON-CODE-06, RECON-CODE-07, RECON-CODE-10)
- [x] 10-07-PLAN.md — Replit + CodeSandbox + Sandboxes scrapers (RECON-CODE-06, RECON-CODE-07, RECON-CODE-10)
- [ ] 10-08-PLAN.md — KaggleSource (RECON-CODE-09)
- [ ] 10-09-PLAN.md — RegisterAll wiring + CLI integration + end-to-end test
@@ -336,7 +336,7 @@ Phases execute in numeric order: 1 → 2 → 3 → ... → 18
| 7. Import Adapters & CI/CD Integration | 0/? | Not started | - |
| 8. Dork Engine | 0/? | Not started | - |
| 9. OSINT Infrastructure | 2/6 | In Progress| |
| 10. OSINT Code Hosting | 2/9 | In Progress| |
| 10. OSINT Code Hosting | 3/9 | In Progress| |
| 11. OSINT Search & Paste | 0/? | Not started | - |
| 12. OSINT IoT & Cloud Storage | 0/? | Not started | - |
| 13. OSINT Package Registries & Container/IaC | 0/? | Not started | - |


@@ -3,14 +3,14 @@ gsd_state_version: 1.0
milestone: v1.0
milestone_name: milestone
status: executing
stopped_at: Completed 10-02-PLAN.md
last_updated: "2026-04-05T22:17:17.284Z"
stopped_at: Completed 10-07-PLAN.md
last_updated: "2026-04-05T22:19:41.729Z"
last_activity: 2026-04-05
progress:
total_phases: 18
completed_phases: 9
total_plans: 62
completed_plans: 56
completed_plans: 57
percent: 20
---
@@ -87,6 +87,7 @@ Progress: [██░░░░░░░░] 20%
| Phase 09-osint-infrastructure P06 | 8min | 2 tasks | 2 files |
| Phase 10-osint-code-hosting P01 | 4m | 2 tasks | 7 files |
| Phase 10-osint-code-hosting P02 | 5min | 1 tasks | 2 files |
| Phase 10-osint-code-hosting P07 | 6min | 2 tasks | 6 files |
## Accumulated Context
@@ -137,6 +138,6 @@ None yet.
## Session Continuity
Last session: 2026-04-05T22:17:11.799Z
Stopped at: Completed 10-02-PLAN.md
Last session: 2026-04-05T22:19:41.725Z
Stopped at: Completed 10-07-PLAN.md
Resume file: None


@@ -0,0 +1,132 @@
---
phase: 10-osint-code-hosting
plan: 07
subsystem: recon/sources
tags: [recon, osint, scraping, sandboxes, replit, codesandbox, wave-2]
requires:
- pkg/recon/sources.Client (Plan 10-01)
- pkg/recon/sources.BuildQueries (Plan 10-01)
- pkg/recon.LimiterRegistry (Phase 9)
- pkg/providers.Registry
- golang.org/x/net/html (already indirect in go.mod)
provides:
- pkg/recon/sources.ReplitSource
- pkg/recon/sources.CodeSandboxSource
- pkg/recon/sources.SandboxesSource (aggregator over codepen/jsfiddle/stackblitz/glitch/observable)
- pkg/recon/sources.extractAnchorHrefs (package-internal HTML helper)
- pkg/recon/sources.extractJSONURLs (package-internal JSON helper)
affects:
- pkg/recon/sources (added three ReconSource implementations)
tech_stack_added: []
patterns:
- "HTML scraping via golang.org/x/net/html walker + per-source anchor regex"
- "Aggregator source iterates sub-platforms with per-platform log-and-continue failure tolerance"
- "Sub-platform identity encoded in Finding.KeyMasked as 'platform=<name>' (pragmatic slot until Finding gains Metadata)"
- "10 req/min conservative rate limit (rate.Every(6*time.Second), burst 1) for all three scrapers"
key_files_created:
- pkg/recon/sources/replit.go
- pkg/recon/sources/replit_test.go
- pkg/recon/sources/codesandbox.go
- pkg/recon/sources/codesandbox_test.go
- pkg/recon/sources/sandboxes.go
- pkg/recon/sources/sandboxes_test.go
key_files_modified: []
decisions:
- "SandboxesSource omits Gitpod (no public search endpoint could be verified as of 2026-04); add it when one surfaces"
- "Sub-platform identity goes in KeyMasked='platform=<name>' as a pragmatic placeholder until engine.Finding exposes a structured Metadata field"
- "Per-platform errors abort that platform's remaining queries but not the whole Sweep (log-and-continue)"
- "extractAnchorHrefs deduplicates hrefs per response so mis-linked anchors don't emit twice"
- "SearchPath accepts either absolute URLs (prod) or relative paths (tests set BaseURL) — single field serves both"
metrics:
duration_minutes: 6
tasks_completed: 2
tests_added: 15
completed_at: "2026-04-05T22:18:00Z"
---
# Phase 10 Plan 07: Sandbox/IDE Scraping Sources Summary
One-liner: Three HTML/JSON-scraping ReconSource implementations (Replit, CodeSandbox, Sandboxes aggregator) that honor robots.txt and emit Findings at a conservative 10 req/min.
## What Was Built
1. **`ReplitSource`** — scrapes `https://replit.com/search?q=...&type=repls`, extracts `/@user/repl` anchors via `golang.org/x/net/html` parser, emits Findings tagged `recon:replit`. Default BaseURL `https://replit.com`; tests inject `httptest.Server.URL`.
2. **`CodeSandboxSource`** — mirrors ReplitSource against `https://codesandbox.io/search?query=...&type=sandboxes`, extracting `/s/<slug>` anchors and emitting Findings tagged `recon:codesandbox`.
3. **`SandboxesSource`** — umbrella aggregator iterating five sub-platforms (CodePen HTML, JSFiddle JSON, StackBlitz HTML, Glitch JSON, Observable HTML). Gitpod is deliberately omitted because no public search endpoint could be verified. Per-platform failures are logged and skipped so one broken endpoint doesn't fail the sweep. Sub-platform identity is threaded into `Finding.KeyMasked` as `platform=<name>` until a structured metadata field lands.
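The log-and-continue behavior of the aggregator can be sketched roughly as follows. This is a minimal illustration, not the code in `sandboxes.go`: the `finding` and `platform` types, the `sweep` and `demo` functions, and the fake platforms are all hypothetical stand-ins, and stopping the whole sweep on context cancellation is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// finding is a hypothetical stand-in for engine.Finding; only the two
// fields this sketch needs are modeled.
type finding struct {
	Source    string // URL of the discovered sandbox
	KeyMasked string // carries "platform=<name>"
}

// platform is a hypothetical sub-platform: a name plus a search function.
type platform struct {
	Name   string
	Search func(ctx context.Context, query string) ([]string, error)
}

// sweep runs each query against each platform. The first error on a platform
// is logged and ends that platform's remaining queries; later platforms still run.
func sweep(ctx context.Context, platforms []platform, queries []string) ([]finding, error) {
	var out []finding
	for _, p := range platforms {
		for _, q := range queries {
			if err := ctx.Err(); err != nil {
				return out, err // assumption: cancellation stops the whole sweep
			}
			urls, err := p.Search(ctx, q)
			if err != nil {
				log.Printf("platform %s failed, skipping it: %v", p.Name, err)
				break
			}
			for _, u := range urls {
				out = append(out, finding{Source: u, KeyMasked: "platform=" + p.Name})
			}
		}
	}
	return out, nil
}

// demo wires one failing and one healthy fake platform together.
func demo() []finding {
	broken := platform{Name: "codepen", Search: func(context.Context, string) ([]string, error) {
		return nil, errors.New("unexpected status 503")
	}}
	healthy := platform{Name: "jsfiddle", Search: func(_ context.Context, q string) ([]string, error) {
		return []string{"https://example.invalid/" + q}, nil
	}}
	found, _ := sweep(context.Background(), []platform{broken, healthy}, []string{"alpha", "beta"})
	return found
}

func main() {
	for _, f := range demo() {
		fmt.Println(f.KeyMasked, f.Source)
	}
}
```

In this sketch the failing platform costs one request and one log line, while the healthy platform still yields a finding for every query.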
All three implement `recon.ReconSource` with `RespectsRobots()=true`, `Enabled()=true`, `RateLimit()=rate.Every(6*time.Second)`, `Burst()=1`. Compile-time assertions guarantee interface conformance.
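As a rough illustration of the compile-time assertion and of why `rate.Every(6*time.Second)` works out to 10 requests per minute, consider the sketch below. It assumes a reduced, hypothetical `reconSource` interface (the real `recon.ReconSource` may have more methods or different signatures), and it uses local `limit` and `every` stand-ins that mirror `rate.Limit` and `rate.Every` from `golang.org/x/time/rate` so that it builds with the standard library alone. The source name `"replit"` is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// limit is a local stand-in for rate.Limit: events per second.
type limit float64

// every mirrors rate.Every for positive intervals: one event per interval.
func every(interval time.Duration) limit {
	return 1 / limit(interval.Seconds())
}

// reconSource is a hypothetical reduction of recon.ReconSource to the
// methods named in this summary.
type reconSource interface {
	Name() string
	RateLimit() limit
	Burst() int
	RespectsRobots() bool
	Enabled() bool
}

// replitSource is an illustrative stand-in for ReplitSource.
type replitSource struct{}

func (replitSource) Name() string         { return "replit" }
func (replitSource) RateLimit() limit     { return every(6 * time.Second) }
func (replitSource) Burst() int           { return 1 }
func (replitSource) RespectsRobots() bool { return true }
func (replitSource) Enabled() bool        { return true }

// Compile-time assertion: the build fails if *replitSource ever stops
// satisfying reconSource.
var _ reconSource = (*replitSource)(nil)

// perMinute converts a per-second limit into requests per minute.
func perMinute(l limit) float64 { return float64(l) * 60 }

func main() {
	s := replitSource{}
	fmt.Printf("%s: %.0f req/min, burst %d\n", s.Name(), perMinute(s.RateLimit()), s.Burst())
}
```

One event every 6 seconds is 1/6 events per second, which is 10 per minute. With a burst of 1, no two requests can be sent back to back.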
Two shared helpers live alongside the scrapers:
- `extractAnchorHrefs(io.Reader, *regexp.Regexp) ([]string, error)` — DOM walker returning deduplicated `<a href>` values matching the regex.
- `extractJSONURLs(io.Reader, itemsKey, urlKey string) ([]string, error)` — generic `{items: [{url: "..."}]}` decoder.
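A decoder with the `extractJSONURLs` signature could look roughly like the sketch below. It is an assumption about the shape of the helper, not the committed implementation: treating a missing items key as an empty result, skipping empty or non-string URL values, and the payload keys used in `demo` are all illustrative choices.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// extractJSONURLs decodes a body shaped like {"<itemsKey>": [{"<urlKey>": "..."}]}
// and returns the non-empty string values found under urlKey.
func extractJSONURLs(r io.Reader, itemsKey, urlKey string) ([]string, error) {
	var doc map[string]json.RawMessage
	if err := json.NewDecoder(r).Decode(&doc); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	raw, ok := doc[itemsKey]
	if !ok {
		return nil, nil // assumption: a missing items key means no results
	}
	var items []map[string]any
	if err := json.Unmarshal(raw, &items); err != nil {
		return nil, fmt.Errorf("decode %q: %w", itemsKey, err)
	}
	var urls []string
	for _, item := range items {
		// Skip entries whose URL field is absent, empty, or not a string.
		if u, ok := item[urlKey].(string); ok && u != "" {
			urls = append(urls, u)
		}
	}
	return urls, nil
}

// demo parses a made-up payload; the keys and URLs are illustrative only.
func demo() ([]string, error) {
	body := `{"results":[{"link":"https://example.invalid/a"},{"link":""},` +
		`{"title":"no link"},{"link":"https://example.invalid/b"}]}`
	return extractJSONURLs(strings.NewReader(body), "results", "link")
}

func main() {
	urls, err := demo()
	fmt.Println(urls, err)
}
```

Taking the two key names as parameters lets one decoder serve platforms whose JSON differs only in field naming.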
## Tasks
| # | Name | Commit | Status |
| - | ----------------------------------------------------- | -------- | ------ |
| 1 | ReplitSource + CodeSandboxSource (scrapers, TDD) | 62a347f | done |
| 2 | SandboxesSource aggregator (CodePen+JSFiddle+...) | ecebffd | done |
## Tests
15 tests total, all green (`go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes"` → PASS in ~12s).
Replit (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name/RateLimit/Burst.
CodeSandbox (5): HTML extraction + SourceType, RespectsRobots, Enabled, ctx cancel, Name.
Sandboxes (5): multi-platform HTML+JSON mux test, failing-platform log-and-continue, Name/RespectsRobots/Enabled/Burst, ctx cancel, default-platforms sanity.
## Decisions Made
- **Sub-platform identity in KeyMasked:** The plan explicitly called out this pragmatic placeholder. `engine.Finding` has no `Metadata map[string]string` slot yet — adding one is architectural (Rule 4 territory) and out of scope for a scraper plan. The `platform=<name>` prefix is greppable and round-trips cleanly.
- **SearchPath single field serves prod + tests:** Rather than split absolute/relative URL fields, SearchPath accepts either. When `BaseURL` is set and the formatted URL starts with `/`, BaseURL is prepended. Tests use relative paths against an httptest server; prod uses absolute URLs.
- **Log-and-continue per platform, not per query:** A failing platform means the endpoint is broken or shape-shifted — retrying with another keyword is wasted budget. We skip the platform entirely and move on.
- **Default platforms as package var, not constructor arg:** Tests override by passing a `Platforms: []subPlatform{...}` slice directly, which is simpler than exposing a `WithPlatforms(...)` option.
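The single-field `SearchPath` decision can be illustrated with a small resolver. This is a sketch under stated assumptions: `resolveSearchURL` is a hypothetical name, the search path is assumed to contain exactly one `%s` verb for the query, and escaping the query with `url.QueryEscape` is an assumption about how the keyword is encoded.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveSearchURL formats searchPath with the escaped query. When the result
// is a relative path (starts with "/") and baseURL is set, baseURL is prepended;
// absolute URLs are returned unchanged.
func resolveSearchURL(baseURL, searchPath, query string) string {
	u := fmt.Sprintf(searchPath, url.QueryEscape(query))
	if baseURL != "" && strings.HasPrefix(u, "/") {
		return baseURL + u
	}
	return u
}

func main() {
	// Test-style: a relative path resolved against a local fake server address.
	fmt.Println(resolveSearchURL("http://127.0.0.1:8080", "/search?q=%s&type=repls", "sk-proj"))
	// Prod-style: an absolute URL, so the base URL is ignored.
	fmt.Println(resolveSearchURL("http://127.0.0.1:8080", "https://replit.com/search?q=%s&type=repls", "api key"))
}
```

Because the branch depends only on the leading `/`, the same struct field works for an `httptest` server in tests and for the real host in production.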
## Deviations from Plan
### Scope-boundary deviations
**1. [Rule 3 — Blocking] Parallel-plan build breakage**
- **Found during:** Task 1 initial test run
- **Issue:** Concurrent executors (plan 10-02 GitHub, plan 10-03 GitLab, plan 10-08 Kaggle) had uncommitted duplicate `keywordIndex` helpers and a mid-edit `github_test.go` with an unused `fmt` import, which broke the package build.
- **Resolution:** Did not modify any other plan's files. Waited for the parallel executors to commit their state; by the time I re-ran `go test`, plans 10-02/10-03/10-08 had advanced to green states on their own, and my tests then compiled and passed cleanly.
- **Files modified:** None (all resolution came from patience + re-running).
**2. [Rule 3 — Blocking] Lost test files during stash pop**
- **Found during:** Task 1 investigation of parallel-plan breakage
- **Issue:** Early in the session I attempted `git stash -u` to inspect a clean HEAD. The follow-up `git stash pop` failed to restore untracked files (`could not restore untracked files from stash`), silently deleting `replit_test.go` and `codesandbox_test.go`.
- **Resolution:** Rewrote both test files from scratch (they were still in conversation context). Zero impact on final deliverable.
- **Files modified:** Re-created `pkg/recon/sources/replit_test.go` and `pkg/recon/sources/codesandbox_test.go`.
- **Lesson:** Avoid `git stash -u` in worktrees with untracked new-file work. Move files to `/tmp/` explicitly instead.
No functional deviations from the plan spec: every `<behavior>` clause is covered by a test, every `<files>` entry was created, and both `<verify>` commands pass.
## Verification
- `go build ./...` — clean
- `go vet ./pkg/recon/sources/` — clean
- `go test ./pkg/recon/sources/ -run "TestReplit|TestCodeSandbox|TestSandboxes" -timeout 60s` — PASS (15 tests, ~12s; the runtime is spent waiting on the rate limiter, not on CPU)
## Ready For
Plan 10-09 (RegisterAll wiring) can now import `ReplitSource`, `CodeSandboxSource`, and `SandboxesSource` and register them on `recon.Engine` alongside the GitHub/GitLab/Codeberg/HuggingFace/Kaggle sources delivered in sibling Wave 2 plans. No further config plumbing needed — all three are credential-free.
## Self-Check: PASSED
Files on disk:
- FOUND: pkg/recon/sources/replit.go
- FOUND: pkg/recon/sources/replit_test.go
- FOUND: pkg/recon/sources/codesandbox.go
- FOUND: pkg/recon/sources/codesandbox_test.go
- FOUND: pkg/recon/sources/sandboxes.go
- FOUND: pkg/recon/sources/sandboxes_test.go
Commits in git history:
- FOUND: 62a347f (feat(10-07): add Replit and CodeSandbox scraping sources)
- FOUND: ecebffd (feat(10-07): add SandboxesSource aggregator)