---
phase: 10-osint-code-hosting
plan: 08
subsystem: recon
tags: [kaggle, osint, http-basic-auth, httptest]

requires:
  - phase: 10-osint-code-hosting
    provides: "recon.ReconSource interface, sources.Client, BuildQueries, LimiterRegistry (Plan 10-01)"
provides:
  - "KaggleSource implementing recon.ReconSource against Kaggle /api/v1/kernels/list"
  - "HTTP Basic auth wiring via req.SetBasicAuth(user, key)"
  - "Finding normalization to Source=<web>/code/<ref>, SourceType=recon:kaggle"
affects: [10-09-register, 10-full-integration]

tech-stack:
  added: []
  patterns:
    - "Basic-auth recon source pattern (user + key) as counterpart to bearer-token sources"
    - "Credential-gated Sweep: return nil without HTTP when either credential missing"

key-files:
  created:
    - pkg/recon/sources/kaggle.go
    - pkg/recon/sources/kaggle_test.go
  modified: []

key-decisions:
  - "Short-circuit Sweep with nil error when User or Key is empty: no HTTP, no log spam"
  - "kaggleKernel decoder ignores non-ref fields so API additions don't break decode"
  - "Ignore decode errors and continue to next query (downgrade, not abort), matching the GitHubSource pattern"

patterns-established:
  - "Basic auth: req.SetBasicAuth(s.User, s.Key) after NewRequestWithContext"
  - "Web URL derivation from API ref: web + /code/ + ref"

requirements-completed: [RECON-CODE-09]

duration: 8min
completed: 2026-04-05
---

# Phase 10 Plan 08: KaggleSource Summary

**KaggleSource emits Findings from Kaggle public notebook search via HTTP Basic auth against `/api/v1/kernels/list`**

## Performance

- **Duration:** ~8 min
- **Tasks:** 1 (TDD)
- **Files created:** 2

## Accomplishments

- KaggleSource type implementing recon.ReconSource (Name, RateLimit, Burst, RespectsRobots, Enabled, Sweep)
- Credential-gated: both User and Key are required; when either is missing, Sweep returns nil with zero HTTP calls
- HTTP Basic auth wired via req.SetBasicAuth to Kaggle's /api/v1/kernels/list endpoint
- Findings normalized with SourceType "recon:kaggle" and Source = WebBaseURL + "/code/" + ref
- 60 req/min rate limit via rate.Every(1*time.Second), burst 1, honoring the per-source LimiterRegistry
- Compile-time interface assertion: `var _ recon.ReconSource = (*KaggleSource)(nil)`
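
The credential gate, Basic auth wiring, and Finding normalization listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `kaggle.go`: the `Finding` and `ErrUnauthorized` definitions, the slice-returning `Sweep` signature, the `search` query parameter, the `ref` JSON key, and the `stubSweep` helper are all assumptions made so the example is self-contained. Rate limiting is omitted.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// ErrUnauthorized and Finding are simplified stand-ins (assumed shapes)
// for the types the real recon packages provide.
var ErrUnauthorized = errors.New("unauthorized")

type Finding struct {
	Source     string
	SourceType string
}

type kaggleKernel struct {
	Ref string `json:"ref"` // assumed JSON key
}

type KaggleSource struct {
	User, Key  string
	BaseURL    string // API root; tests point this at a stub server
	WebBaseURL string // stem for Finding URLs
	Client     *http.Client
}

// Enabled reports whether both credentials are present.
func (s *KaggleSource) Enabled() bool { return s.User != "" && s.Key != "" }

// Sweep issues one request per query. Returning a slice is an assumption
// made for brevity; the real interface may deliver Findings differently.
func (s *KaggleSource) Sweep(ctx context.Context, queries []string) ([]Finding, error) {
	if !s.Enabled() {
		return nil, nil // credential gate: no HTTP, no error
	}
	var out []Finding
	for _, q := range queries {
		if err := ctx.Err(); err != nil {
			return out, err
		}
		// The "search" parameter name is an assumption.
		endpoint := s.BaseURL + "/api/v1/kernels/list?search=" + url.QueryEscape(q)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
		if err != nil {
			return out, err
		}
		req.SetBasicAuth(s.User, s.Key)
		resp, err := s.Client.Do(req)
		if err != nil {
			if ctx.Err() != nil {
				return out, ctx.Err() // cancelled mid-request
			}
			continue // transport error: downgrade and try the next query
		}
		var kernels []kaggleKernel
		var decodeErr error
		if resp.StatusCode == http.StatusOK {
			decodeErr = json.NewDecoder(resp.Body).Decode(&kernels)
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusUnauthorized {
			return out, fmt.Errorf("kaggle: %w", ErrUnauthorized) // hard fail
		}
		if decodeErr != nil {
			continue // decode error: skip this query, keep sweeping
		}
		for _, k := range kernels {
			if k.Ref != "" {
				out = append(out, Finding{
					Source:     s.WebBaseURL + "/code/" + k.Ref,
					SourceType: "recon:kaggle",
				})
			}
		}
	}
	return out, nil
}

// stubSweep (hypothetical helper) runs Sweep against an in-process
// server that accepts only testuser:testkey, as a usage sketch.
func stubSweep(user, key string) ([]Finding, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if u, k, ok := r.BasicAuth(); !ok || u != "testuser" || k != "testkey" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, `[{"ref":"alice/nb-one"},{"ref":"bob/nb-two"}]`)
	}))
	defer srv.Close()
	s := &KaggleSource{User: user, Key: key, BaseURL: srv.URL,
		WebBaseURL: "https://www.kaggle.com", Client: srv.Client()}
	return s.Sweep(context.Background(), []string{"api key"})
}
```

Under these assumptions, `stubSweep("testuser", "testkey")` yields two Findings whose Source values start with `https://www.kaggle.com/code/`, while `stubSweep("", "testkey")` returns `nil, nil` without contacting the server.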

## Task Commits

1. **Task 1: KaggleSource + tests (TDD)** - `243b740` (feat)

## Files Created

- `pkg/recon/sources/kaggle.go`: KaggleSource implementation, kaggleKernel decoder, interface assertion
- `pkg/recon/sources/kaggle_test.go`: 6 httptest-driven tests

## Test Coverage

| Test | Covers |
|------|--------|
| TestKaggle_Enabled | All 4 credential combinations (empty/empty, user-only, key-only, both) |
| TestKaggle_Sweep_BasicAuthAndFindings | Authorization header decoded as testuser:testkey, 2 refs → 2 Findings with correct Source URLs and recon:kaggle SourceType |
| TestKaggle_Sweep_MissingCredentials_NoHTTP | Atomic counter verifies zero HTTP calls when either User or Key is empty |
| TestKaggle_Sweep_Unauthorized | 401 response wrapped as ErrUnauthorized |
| TestKaggle_Sweep_CtxCancellation | Pre-cancelled ctx returns context.Canceled promptly |
| TestKaggle_ReconSourceInterface | Compile + runtime assertions on Name, Burst, RespectsRobots, RateLimit |

All 6 tests pass in isolation: `go test ./pkg/recon/sources/ -run TestKaggle -v`
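
The zero-HTTP check in TestKaggle_Sweep_MissingCredentials_NoHTTP relies on an atomic counter inside an httptest handler. A rough sketch of that pattern follows; `gatedGet` and `countCalls` are hypothetical helpers standing in for the real Sweep and test body, and an actual test would live in a `_test.go` file and report through `*testing.T`.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// gatedGet stands in for a credential-gated Sweep (hypothetical helper):
// it issues a request only when both credentials are non-empty.
func gatedGet(client *http.Client, url, user, key string) error {
	if user == "" || key == "" {
		return nil // gate: no HTTP, no error
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.SetBasicAuth(user, key)
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// countCalls returns how many requests reached the stub server for the
// given credential pair.
func countCalls(user, key string) (int32, error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt32(&hits, 1) // handlers run on other goroutines
	}))
	defer srv.Close()
	err := gatedGet(srv.Client(), srv.URL, user, key)
	return atomic.LoadInt32(&hits), err
}
```

The counter is atomic because the handler runs on a server goroutine while the assertion reads it from the test goroutine.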

## Decisions Made

- **Missing-cred behavior:** Sweep returns nil (no error) when either credential is absent. This matches the GitHubSource pattern: disabled sources are logged and skipped at the Engine level rather than raising an error.
- **Decode tolerance:** The kaggleKernel struct declares only `Ref string`. Other fields (title, author, language) are silently discarded, so upstream API changes don't break the source.
- **Error downgrade:** Non-401 HTTP errors skip to the next query rather than aborting the whole sweep. 401 is the only hard-fail case because it means the credentials are actually invalid, which is not a transient condition.
- **Dual BaseURL fields:** BaseURL (API) and WebBaseURL (Finding URL stem) are separate struct fields, so tests can point BaseURL at httptest.NewServer while WebBaseURL stays at the production kaggle.com domain for assertion stability.
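
The decode-tolerance decision leans on the default behavior of Go's `encoding/json`, which ignores object keys that have no matching struct field. A minimal sketch, assuming the API returns a JSON array whose elements carry a `ref` key; `decodeRefs` is a hypothetical helper, not a function from `kaggle.go`:

```go
package main

import "encoding/json"

// kaggleKernel declares only the field the source needs. The "ref" JSON
// key is an assumption about the response shape.
type kaggleKernel struct {
	Ref string `json:"ref"`
}

// decodeRefs (hypothetical helper) extracts the non-empty refs from a
// kernels/list-style JSON array, ignoring every other field.
func decodeRefs(body []byte) ([]string, error) {
	var kernels []kaggleKernel
	if err := json.Unmarshal(body, &kernels); err != nil {
		return nil, err
	}
	refs := make([]string, 0, len(kernels))
	for _, k := range kernels {
		if k.Ref != "" {
			refs = append(refs, k.Ref)
		}
	}
	return refs, nil
}
```

A payload that gains new fields still decodes, whereas a body that is not valid JSON returns an error that the caller can downgrade to "skip this query".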

## Deviations from Plan

None. The plan was executed exactly as written. All truths from frontmatter (`must_haves`) are satisfied:
- KaggleSource queries `/api/v1/kernels/list` with Basic auth → TestKaggle_Sweep_BasicAuthAndFindings
- Disabled when either credential is empty → TestKaggle_Enabled + TestKaggle_Sweep_MissingCredentials_NoHTTP
- Findings tagged recon:kaggle with Source = web + /code/ + ref → TestKaggle_Sweep_BasicAuthAndFindings

## Issues Encountered

- **Sibling-wave file churn:** During testing, sibling Wave 2 plans (10-02 GitHub, 10-05 Replit, 10-07 CodeSandbox, 10-03 GitLab) had already dropped partial files into `pkg/recon/sources/` in the main repo. A stray `github_test.go` with no matching `github.go` broke package compilation. Resolved by running tests in this plan's git worktree, where only kaggle.go and kaggle_test.go are present alongside the Plan 10-01 scaffolding. No cross-plan changes were made, so the scope boundary was respected. The final wave merge will resolve all sibling files together.

## Next Phase Readiness

- KaggleSource is ready for registration in Plan 10-09 (`RegisterAll` wiring).
- No blockers for downstream plans. RECON-CODE-09 satisfied.

## Self-Check: PASSED

- File exists: `pkg/recon/sources/kaggle.go` - FOUND
- File exists: `pkg/recon/sources/kaggle_test.go` - FOUND
- Commit exists: `243b740` - FOUND (feat(10-08): add KaggleSource with HTTP Basic auth)
- Tests pass: 6/6 TestKaggle_* (verified with sibling files stashed to isolate package build)

---

*Phase: 10-osint-code-hosting*
*Plan: 08*
*Completed: 2026-04-05*