diff --git a/.planning/phases/04-input-sources/04-CONTEXT.md b/.planning/phases/04-input-sources/04-CONTEXT.md
new file mode 100644
index 0000000..adac12c
--- /dev/null
+++ b/.planning/phases/04-input-sources/04-CONTEXT.md
@@ -0,0 +1,103 @@
+# Phase 4: Input Sources - Context
+
+**Gathered:** 2026-04-05
+**Status:** Ready for planning
+**Mode:** Auto-generated
+
+
+## Phase Boundary
+
+Users can point KeyHunter at any content source — local files/directories, git history across all branches, piped content (stdin), remote URLs, and the clipboard — and all are scanned through the same three-stage pipeline established in Phase 1. This phase extends the `sources` package with new `Source` implementations and wires them through the `scan` CLI command.
+
+
+
+
+## Implementation Decisions
+
+### Directory/File Scanning (INPUT-01, CORE-07)
+- **Recursive traversal** via `filepath.WalkDir` (stdlib, fast, Go 1.16+)
+- **Glob exclusion** via `--exclude` flag accepting multiple patterns, matched with `filepath.Match` per path
+- **Default exclusions**: `.git/`, `node_modules/`, `vendor/`, `*.min.js`, `*.map` (configurable)
+- **mmap threshold**: files > 10MB use `golang.org/x/exp/mmap` for zero-copy reads; smaller files use `os.ReadFile`
+- **Binary file detection**: skip files with null bytes in first 512 bytes (simple heuristic, avoids text encoding complexity)
+
+### Git History (INPUT-02)
+- **Library**: `github.com/go-git/go-git/v5` (pure Go, no CGO) — already ecosystem-standard for Go git ops
+- **Scope**: all branches (refs/heads/*), all tags (refs/tags/*), all stash entries
+- **Commit iteration**: walk each ref's commits, diff file contents, scan each blob
+- **`--since=YYYY-MM-DD`**: filter by commit author date
+- **Deduplication**: hash blob OIDs, skip already-scanned blobs (git objects are deduplicated by content)
+- **Performance**: parallel blob scanning via ants pool
+
+### Stdin (INPUT-03)
+- **Trigger**: `keyhunter scan stdin` subcommand OR `keyhunter scan -` positional
+- **Implementation**: `StdinSource` reads from `os.Stdin` in chunks, feeds pipeline
+- **Source name in findings**: "stdin"
+
+### URL (INPUT-04)
+- **HTTP client**: stdlib `net/http` with 30s timeout, follows redirects (max 5)
+- **User-Agent**: `keyhunter/`
+- **Content-length limit**: 50MB (reject larger)
+- **Content-type**: accept any text/*, application/json, application/javascript, application/xml
+- **TLS**: verify by default, `--insecure` flag to skip (off by default for safety)
+- **URL scheme support**: http, https only (no file://, ftp://, etc.)
+
+### Clipboard (INPUT-05)
+- **Library**: `github.com/atotto/clipboard` (cross-platform: Linux via xclip/xsel/wl-clipboard, macOS via pbpaste, Windows via native API)
+- **Graceful fallback**: if clipboard tool not installed, return clear error with install instructions
+- **Linux detection**: check for xclip/xsel/wl-paste at startup, not during scan
+
+### Source Interface Extension (INPUT-06)
+- Existing `Source` interface in `pkg/engine/sources/source.go`: `Chunks(ctx, out chan<- types.Chunk) error`
+- All new sources implement this interface — pipeline stays unchanged
+- `sources.NewXxxSource(args)` factory functions for each source type
+- `cmd/scan.go` selects source based on flags: `--git`, `--url`, `--clipboard`, `stdin` subcommand, else default FileSource
+
+
+
+
+## Existing Code Insights
+
+### Reusable Assets
+- `pkg/engine/sources/source.go` — Source interface (keep unchanged)
+- `pkg/engine/sources/file.go` — single-file FileSource (extend to DirSource OR keep separate)
+- `pkg/engine/engine.go` — three-stage pipeline (unchanged)
+- `pkg/types/chunk.go` — Chunk struct (unchanged)
+- `cmd/scan.go` — wire new flags + source selection
+
+### Established Patterns
+- Source emits `types.Chunk{Data, Source, Offset}` on the channel
+- Context cancellation: all sources check `ctx.Done()` in their loops
+- Error propagation: return error from `Chunks()`, engine drains channel
+
+### Integration Points
+- `cmd/scan.go` — add flags: `--exclude`, `--git`, `--url`, `--clipboard`, `--since`, `--max-file-size`, `--insecure`, `--follow-redirects`
+- New files: `pkg/engine/sources/{dir,git,stdin,url,clipboard}.go`
+- Tests: `pkg/engine/sources/*_test.go`
+
+### go.mod additions needed
+- `github.com/go-git/go-git/v5`
+- `github.com/atotto/clipboard`
+- `golang.org/x/exp/mmap`
+
+
+
+
+## Specific Ideas
+
+- Default exclusions should match common "generated/vendored" directories to reduce noise
+- Dir source should emit files in deterministic order for reproducible test output
+- Git source should report commit SHA in finding source field: "git:abc1234:path/to/file.go"
+- URL source should report URL in finding source field: "url:https://example.com/..."
+
+
+
+
+## Deferred Ideas
+
+- Clipboard watch mode (poll clipboard for changes) — out of scope
+- HTTP authentication (basic/bearer for --url) — defer to Phase 5 or later
+- SFTP/SCP URL schemes — out of scope
+- Archive file extraction (.zip, .tar.gz scanning) — defer to later phase
+
+