# Phase 4: Input Sources - Context

**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated

## Phase Boundary

Users can point KeyHunter at any content source (local files/directories, git history across all branches, piped content via stdin, remote URLs, and the clipboard), and all are scanned through the same three-stage pipeline established in Phase 1. This phase extends the `sources` package with new `Source` implementations and wires them through the `scan` CLI command.

## Implementation Decisions

### Directory/File Scanning (INPUT-01, CORE-07)

- **Recursive traversal** via `filepath.WalkDir` (stdlib, fast, Go 1.16+)
- **Glob exclusion** via `--exclude` flag accepting multiple patterns, matched with `filepath.Match` per path (note that `filepath.Match` wildcards do not cross path separators, so directory patterns such as `.git/` must be matched per path component)
- **Default exclusions**: `.git/`, `node_modules/`, `vendor/`, `*.min.js`, `*.map` (configurable)
- **mmap threshold**: files > 10MB use `golang.org/x/exp/mmap` for zero-copy reads; smaller files use `os.ReadFile`
- **Binary file detection**: skip files with a null byte in the first 512 bytes (simple heuristic, avoids text encoding complexity)

### Git History (INPUT-02)

- **Library**: `github.com/go-git/go-git/v5` (pure Go, no CGO), already ecosystem-standard for Go git ops
- **Scope**: all branches (`refs/heads/*`), all tags (`refs/tags/*`), all stash entries
- **Commit iteration**: walk each ref's commits, diff file contents, scan each blob
- **`--since=YYYY-MM-DD`**: filter by commit author date
- **Deduplication**: track already-scanned blob OIDs and skip repeats (git objects are content-addressed, so identical content shares one OID)
- **Performance**: parallel blob scanning via ants pool

### Stdin (INPUT-03)

- **Trigger**: `keyhunter scan stdin` subcommand OR `keyhunter scan -` positional
- **Implementation**: `StdinSource` reads from `os.Stdin` in chunks and feeds the pipeline
- **Source name in findings**: "stdin"

### URL (INPUT-04)

- **HTTP client**: stdlib `net/http` with 30s timeout, follows redirects (max 5)
- **User-Agent**: `keyhunter/`
- **Content-length limit**: 50MB (reject larger)
- **Content-type**: accept any `text/*`, `application/json`, `application/javascript`, `application/xml`
- **TLS**: certificates verified by default; `--insecure` flag skips verification (off by default for safety)
- **URL scheme support**: http, https only (no file://, ftp://, etc.)

### Clipboard (INPUT-05)

- **Library**: `github.com/atotto/clipboard` (cross-platform: Linux via xclip/xsel/wl-clipboard, macOS via pbpaste, Windows via native API)
- **Graceful fallback**: if no clipboard tool is installed, return a clear error with install instructions
- **Linux detection**: check for xclip/xsel/wl-paste at startup, not during scan

### Source Interface Extension (INPUT-06)

- Existing `Source` interface in `pkg/engine/sources/source.go`: `Chunks(ctx, out chan<- types.Chunk) error`
- All new sources implement this interface, so the pipeline stays unchanged
- `sources.NewXxxSource(args)` factory functions for each source type
- `cmd/scan.go` selects the source based on flags: `--git`, `--url`, `--clipboard`, `stdin` subcommand, else default FileSource

## Existing Code Insights

### Reusable Assets

- `pkg/engine/sources/source.go`: Source interface (keep unchanged)
- `pkg/engine/sources/file.go`: single-file FileSource (extend to DirSource OR keep separate)
- `pkg/engine/engine.go`: three-stage pipeline (unchanged)
- `pkg/types/chunk.go`: Chunk struct (unchanged)
- `cmd/scan.go`: wire new flags + source selection

### Established Patterns

- Source emits `types.Chunk{Data, Source, Offset}` on the channel
- Context cancellation: all sources check `ctx.Done()` in their loops
- Error propagation: return error from `Chunks()`, engine drains channel

### Integration Points

- `cmd/scan.go`: add flags `--exclude`, `--git`, `--url`, `--clipboard`, `--since`, `--max-file-size`, `--insecure`, `--follow-redirects`
- New files: `pkg/engine/sources/{dir,git,stdin,url,clipboard}.go`
- Tests: `pkg/engine/sources/*_test.go`

### go.mod additions needed

- `github.com/go-git/go-git/v5`
- `github.com/atotto/clipboard`
- `golang.org/x/exp` (module that provides the `mmap` package)

## Specific Ideas

- Default exclusions should match common "generated/vendored" directories to reduce noise
- Dir source should emit files in deterministic order for reproducible test output
- Git source should report commit SHA in finding source field: "git:abc1234:path/to/file.go"
- URL source should report URL in finding source field: "url:https://example.com/..."

## Deferred Ideas

- Clipboard watch mode (poll clipboard for changes): out of scope
- HTTP authentication (basic/bearer for `--url`): defer to Phase 5 or later
- SFTP/SCP URL schemes: out of scope
- Archive file extraction (.zip, .tar.gz scanning): defer to later phase