Phase 4: Input Sources - Context
Gathered: 2026-04-05 Status: Ready for planning Mode: Auto-generated
## Phase Boundary

Users can point KeyHunter at any content source: local files and directories, git history across all branches, piped content (stdin), remote URLs, and the clipboard. All are scanned through the same three-stage pipeline established in Phase 1. This phase extends the sources package with new Source implementations and wires them through the scan CLI command.

Directory/File Scanning (INPUT-01, CORE-07)
- Recursive traversal via `filepath.WalkDir` (stdlib, fast, Go 1.16+)
- Glob exclusion via `--exclude` flag accepting multiple patterns, matched with `filepath.Match` per path
- Default exclusions: `.git/`, `node_modules/`, `vendor/`, `*.min.js`, `*.map` (configurable)
- mmap threshold: files > 10MB use `golang.org/x/exp/mmap` for zero-copy reads; smaller files use `os.ReadFile`
- Binary file detection: skip files with null bytes in first 512 bytes (simple heuristic, avoids text encoding complexity)
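
The traversal, exclusion, and binary-detection decisions above can be sketched with the standard library alone. This is a minimal sketch, not the planned `DirSource`: the helper names `isBinary`, `excluded`, and `walkFiles` are hypothetical, patterns are matched against the base name of each path (an assumption, since the notes only say "per path"), directory patterns are written without a trailing slash, and the mmap path for large files is left out.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// isBinary reports whether the first 512 bytes of data contain a null byte.
func isBinary(data []byte) bool {
	if len(data) > 512 {
		data = data[:512]
	}
	return bytes.IndexByte(data, 0) >= 0
}

// excluded reports whether the base name of path matches any glob pattern.
// Matching on the base name is an assumption made for this sketch.
func excluded(path string, patterns []string) bool {
	base := filepath.Base(path)
	for _, p := range patterns {
		if ok, err := filepath.Match(p, base); err == nil && ok {
			return true
		}
	}
	return false
}

// walkFiles returns the regular files under root that are neither excluded
// nor detected as binary. A real source would read only the first 512 bytes
// for detection; reading the whole file keeps the sketch short.
func walkFiles(root string, patterns []string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if path != root && excluded(path, patterns) {
			if d.IsDir() {
				return filepath.SkipDir // prune the whole excluded directory
			}
			return nil
		}
		if !d.Type().IsRegular() {
			return nil // directories, symlinks, devices
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if !isBinary(data) {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	defaults := []string{".git", "node_modules", "vendor", "*.min.js", "*.map"}
	files, err := walkFiles(".", defaults)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		return
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

`filepath.WalkDir` visits entries in lexical order, so the returned slice is already deterministic for a given tree.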
Git History (INPUT-02)
- Library: `github.com/go-git/go-git/v5` (pure Go, no CGO), already ecosystem-standard for Go git ops
- Scope: all branches (refs/heads/), all tags (refs/tags/), all stash entries
- Commit iteration: walk each ref's commits, diff file contents, scan each blob
- `--since=YYYY-MM-DD`: filter by commit author date
- Deduplication: record each blob OID in a set and skip blobs already scanned (git objects are content-addressed, so identical content shares one OID)
- Performance: parallel blob scanning via ants pool
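
Because blobs are scanned in parallel, the set of already-scanned OIDs has to be safe for concurrent use. The sketch below shows one way to do that with a mutex-guarded map; the type name `seenBlobs` and the use of hex strings for OIDs are assumptions for illustration, and no go-git calls are shown.

```go
package main

import (
	"fmt"
	"sync"
)

// seenBlobs is a concurrency-safe set of blob OIDs (hex strings in this sketch).
type seenBlobs struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenBlobs() *seenBlobs {
	return &seenBlobs{seen: make(map[string]struct{})}
}

// firstVisit records oid and reports true only the first time it is seen,
// so exactly one worker ends up scanning each blob.
func (s *seenBlobs) firstVisit(oid string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[oid]; ok {
		return false
	}
	s.seen[oid] = struct{}{}
	return true
}

func main() {
	set := newSeenBlobs()
	// The same blob reached from two commits is scanned once.
	for _, oid := range []string{"a1b2", "c3d4", "a1b2"} {
		fmt.Println(oid, set.firstVisit(oid))
	}
}
```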
Stdin (INPUT-03)
- Trigger: `keyhunter scan stdin` subcommand OR `keyhunter scan -` positional
- Implementation: `StdinSource` reads from `os.Stdin` in chunks, feeds pipeline
- Source name in findings: "stdin"
URL (INPUT-04)
- HTTP client: stdlib `net/http` with 30s timeout, follows redirects (max 5)
- User-Agent: `keyhunter/<version>`
- Content-length limit: 50MB (reject larger)
- Content-type: accept any text/*, application/json, application/javascript, application/xml
- TLS: verify by default, `--insecure` flag to skip (off by default for safety)
- URL scheme support: http, https only (no file://, ftp://, etc.)
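
The client settings, scheme check, and content-type filter listed above can be sketched with `net/http`, `net/url`, and `mime`. The function names `newClient`, `checkScheme`, and `acceptContentType` are hypothetical, and the sketch reads "max 5" as "follow at most five redirects"; the body size limit, User-Agent header, and `--insecure` handling are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
	"net/http"
	"net/url"
	"strings"
	"time"
)

const maxRedirects = 5

// newClient builds a client with a 30s timeout and a redirect cap.
// via holds the requests made so far, so this follows at most 5 redirects.
func newClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) > maxRedirects {
				return errors.New("too many redirects")
			}
			return nil
		},
	}
}

// checkScheme accepts only http and https URLs.
func checkScheme(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported URL scheme %q", u.Scheme)
	}
	return nil
}

// acceptContentType reports whether a Content-Type header value names one of
// the accepted media types, ignoring parameters such as charset.
func acceptContentType(header string) bool {
	mt, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	switch mt {
	case "application/json", "application/javascript", "application/xml":
		return true
	}
	return strings.HasPrefix(mt, "text/")
}

func main() {
	_ = newClient() // no request is sent in this sketch
	fmt.Println(checkScheme("https://example.com/config.js"))
	fmt.Println(checkScheme("file:///etc/passwd"))
	fmt.Println(acceptContentType("text/plain; charset=utf-8"))
	fmt.Println(acceptContentType("image/png"))
}
```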
Clipboard (INPUT-05)
- Library: `github.com/atotto/clipboard` (cross-platform: Linux via xclip/xsel/wl-clipboard, macOS via pbpaste, Windows via native API)
- Graceful fallback: if clipboard tool not installed, return clear error with install instructions
- Linux detection: check for xclip/xsel/wl-paste at startup, not during scan
Source Interface Extension (INPUT-06)
- Existing `Source` interface in `pkg/engine/sources/source.go`: `Chunks(ctx, out chan<- types.Chunk) error`
- All new sources implement this interface; pipeline stays unchanged
- `sources.NewXxxSource(args)` factory functions for each source type
- `cmd/scan.go` selects source based on flags: `--git`, `--url`, `--clipboard`, `stdin` subcommand, else default FileSource
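
The selection step in `cmd/scan.go` reduces to a single switch over the parsed inputs. The sketch below only names the chosen source; `scanOptions` and `sourceKind` are hypothetical, and the precedence order when several inputs are given at once is an assumption that the notes do not settle.

```go
package main

import "fmt"

// scanOptions is a hypothetical struct holding the parsed CLI inputs.
type scanOptions struct {
	Stdin     bool   // stdin subcommand or "-" positional
	Git       bool   // --git
	URL       string // --url
	Clipboard bool   // --clipboard
}

// sourceKind names the source to construct. The precedence order here is an
// assumption; anything not selected falls back to the file source.
func sourceKind(o scanOptions) string {
	switch {
	case o.Stdin:
		return "stdin"
	case o.Git:
		return "git"
	case o.URL != "":
		return "url"
	case o.Clipboard:
		return "clipboard"
	default:
		return "file"
	}
}

func main() {
	fmt.Println(sourceKind(scanOptions{URL: "https://example.com/app.js"}))
	fmt.Println(sourceKind(scanOptions{}))
}
```

Rejecting conflicting flags outright, instead of silently applying a precedence, would be an equally reasonable choice.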
<code_context>
Existing Code Insights
Reusable Assets
- `pkg/engine/sources/source.go`: Source interface (keep unchanged)
- `pkg/engine/sources/file.go`: single-file FileSource (extend to DirSource OR keep separate)
- `pkg/engine/engine.go`: three-stage pipeline (unchanged)
- `pkg/types/chunk.go`: Chunk struct (unchanged)
- `cmd/scan.go`: wire new flags + source selection
Established Patterns
- Source emits `types.Chunk{Data, Source, Offset}` on the channel
- Context cancellation: all sources check `ctx.Done()` in their loops
- Error propagation: return error from `Chunks()`, engine drains channel
Integration Points
- `cmd/scan.go`: add flags `--exclude`, `--git`, `--url`, `--clipboard`, `--since`, `--max-file-size`, `--insecure`, `--follow-redirects`
- New files: `pkg/engine/sources/{dir,git,stdin,url,clipboard}.go`
- Tests: `pkg/engine/sources/*_test.go`
go.mod additions needed
- `github.com/go-git/go-git/v5`
- `github.com/atotto/clipboard`
- `golang.org/x/exp/mmap`
</code_context>
## Specific Ideas

- Default exclusions should match common "generated/vendored" directories to reduce noise
- Dir source should emit files in deterministic order for reproducible test output
- Git source should report commit SHA in finding source field: "git:abc1234:path/to/file.go"
- URL source should report URL in finding source field: "url:https://example.com/..."
- Clipboard watch mode (poll clipboard for changes) — out of scope
- HTTP authentication (basic/bearer for --url) — defer to Phase 5 or later
- SFTP/SCP URL schemes — out of scope
- Archive file extraction (.zip, .tar.gz scanning) — defer to later phase
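
The finding source formats suggested above for git and URL sources ("git:abc1234:path/to/file.go" and "url:https://example.com/...") can be produced by two small helpers. The names `gitSourceName` and `urlSourceName` are hypothetical, and the 7-character SHA abbreviation is inferred from the example, not stated as a rule.

```go
package main

import "fmt"

// gitSourceName formats "git:<short sha>:<path>". Abbreviating to 7
// characters is an assumption based on the example in the notes.
func gitSourceName(sha, path string) string {
	if len(sha) > 7 {
		sha = sha[:7]
	}
	return fmt.Sprintf("git:%s:%s", sha, path)
}

// urlSourceName prefixes the scanned URL with "url:".
func urlSourceName(rawURL string) string {
	return "url:" + rawURL
}

func main() {
	fmt.Println(gitSourceName("abc1234def5678", "path/to/file.go"))
	fmt.Println(urlSourceName("https://example.com/app.js"))
}
```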