keyhunter/.planning/phases/04-input-sources/04-CONTEXT.md
2026-04-05 15:00:25 +03:00

# Phase 4: Input Sources - Context

Gathered: 2026-04-05
Status: Ready for planning
Mode: Auto-generated

## Phase Boundary

Users can point KeyHunter at any content source — local files/directories, git history across all branches, piped content (stdin), remote URLs, and the clipboard — and all are scanned through the same three-stage pipeline established in Phase 1. This phase extends the sources package with new Source implementations and wires them through the scan CLI command.

## Implementation Decisions

Directory/File Scanning (INPUT-01, CORE-07)

  • Recursive traversal via filepath.WalkDir (stdlib, fast, Go 1.16+)
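  • Glob exclusion via a repeatable --exclude flag; each pattern is matched with filepath.Match. Because filepath.Match wildcards never cross a path separator, directory patterns such as node_modules/ have to be tested against individual path components (or the base name) rather than the full path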
  • Default exclusions: .git/, node_modules/, vendor/, *.min.js, *.map (configurable)
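  • mmap threshold: files > 10MB are read through golang.org/x/exp/mmap, which memory-maps the file instead of loading it onto the heap (its ReaderAt still copies into the caller's buffer); smaller files use os.ReadFile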
  • Binary file detection: skip files with null bytes in first 512 bytes (simple heuristic, avoids text encoding complexity)
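
A minimal sketch of the traversal rules above, assuming exclusion patterns are tested against each path component. The names isExcluded, looksBinary, and walkFiles are hypothetical and not existing KeyHunter functions; the mmap threshold is left out.

```go
package main

import (
	"bytes"
	"io/fs"
	"path/filepath"
	"strings"
)

// isExcluded reports whether any component of the relative path matches one
// of the glob patterns. Matching per component is an assumption made here,
// because filepath.Match wildcards never cross a path separator.
func isExcluded(rel string, patterns []string) bool {
	for _, part := range strings.Split(filepath.ToSlash(rel), "/") {
		for _, p := range patterns {
			p = strings.TrimSuffix(p, "/") // "node_modules/" -> "node_modules"
			if ok, err := filepath.Match(p, part); err == nil && ok {
				return true
			}
		}
	}
	return false
}

// looksBinary applies the NUL-byte heuristic to at most the first 512 bytes.
func looksBinary(head []byte) bool {
	if len(head) > 512 {
		head = head[:512]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// walkFiles returns the regular files under root that are not excluded.
// filepath.WalkDir visits entries in lexical order, so the result is stable.
func walkFiles(root string, patterns []string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		if rel != "." && isExcluded(rel, patterns) {
			if d.IsDir() {
				return fs.SkipDir // prune the whole excluded directory
			}
			return nil
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}
```

Returning fs.SkipDir for an excluded directory prunes the subtree, so large trees such as node_modules/ are never descended into.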

Git History (INPUT-02)

  • Library: github.com/go-git/go-git/v5 (pure Go, no CGO) — already ecosystem-standard for Go git ops
  • Scope: all branches (refs/heads/), all tags (refs/tags/), all stash entries
  • Commit iteration: walk each ref's commits, diff each commit's file contents against its parent, and scan every changed blob
  • --since=YYYY-MM-DD: filter by commit author date
  • Deduplication: keep a set of already-scanned blob OIDs and skip repeats (an OID is already a content hash, so identical content always maps to the same OID)
  • Performance: parallel blob scanning via ants pool
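
The deduplication and --since rules can be sketched with the standard library alone. This is an illustration under assumptions, not the go-git integration itself: blobSet, firstVisit, and parseSince are hypothetical names, and OIDs are held as hex strings. The mutex is there because blobs are scanned in parallel.

```go
package main

import (
	"sync"
	"time"
)

// blobSet records blob OIDs that have already been queued for scanning.
type blobSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newBlobSet() *blobSet {
	return &blobSet{seen: make(map[string]struct{})}
}

// firstVisit returns true exactly once per OID, even under concurrent calls.
func (s *blobSet) firstVisit(oid string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[oid]; ok {
		return false
	}
	s.seen[oid] = struct{}{}
	return true
}

// parseSince parses the YYYY-MM-DD value of --since. With no zone in the
// layout, time.Parse returns the date at midnight UTC.
func parseSince(value string) (time.Time, error) {
	return time.Parse("2006-01-02", value)
}
```

A commit would then be skipped when its author date is before the parsed cutoff, and a blob would be scanned only when firstVisit returns true.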

Stdin (INPUT-03)

  • Trigger: keyhunter scan stdin subcommand OR keyhunter scan - positional
  • Implementation: StdinSource reads from os.Stdin in chunks, feeds pipeline
  • Source name in findings: "stdin"

URL (INPUT-04)

  • HTTP client: stdlib net/http with 30s timeout, follows redirects (max 5)
  • User-Agent: keyhunter/<version>
  • Content-length limit: 50MB (reject larger); the body read is capped at the same size, since chunked responses may omit Content-Length
  • Content-type: accept any text/*, application/json, application/javascript, application/xml
  • TLS: certificates are verified by default; the --insecure flag (off by default) skips verification
  • URL scheme support: http, https only (no file://, ftp://, etc.)
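
The URL rules above could be combined roughly as follows. This is a sketch under assumptions: checkScheme, allowedContentType, newClient, and fetch are hypothetical names, 50MB is read as 50 MiB, and the redirect check copies the shape of the stdlib default policy with 5 in place of 10. The --insecure flag is not covered.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/url"
	"strings"
	"time"
)

const maxBody = 50 << 20 // assumed reading of the 50MB limit

// checkScheme rejects everything except http and https.
func checkScheme(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported URL scheme %q", u.Scheme)
	}
	return nil
}

// allowedContentType accepts text/* plus the three listed application types.
func allowedContentType(header string) bool {
	mt, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	if strings.HasPrefix(mt, "text/") {
		return true
	}
	switch mt {
	case "application/json", "application/javascript", "application/xml":
		return true
	}
	return false
}

// newClient builds a client with a 30s timeout and a redirect cap of 5.
func newClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 5 {
				return errors.New("stopped after 5 redirects")
			}
			return nil
		},
	}
}

// fetch downloads raw and enforces the scheme, type, and size rules.
func fetch(c *http.Client, raw, userAgent string) ([]byte, error) {
	if err := checkScheme(raw); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodGet, raw, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.ContentLength > maxBody {
		return nil, errors.New("response exceeds size limit")
	}
	if !allowedContentType(resp.Header.Get("Content-Type")) {
		return nil, errors.New("unsupported content type")
	}
	// Read one byte past the limit so oversized chunked bodies are detected.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBody+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxBody {
		return nil, errors.New("response exceeds size limit")
	}
	return body, nil
}
```

A call such as fetch(newClient(), target, "keyhunter/"+version) would apply all checks; version is whatever build variable the project already exposes.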

Clipboard (INPUT-05)

  • Library: github.com/atotto/clipboard (cross-platform: Linux via xclip/xsel/wl-clipboard, macOS via pbpaste, Windows via native API)
  • Graceful fallback: if clipboard tool not installed, return clear error with install instructions
  • Linux detection: check for xclip/xsel/wl-paste at startup, not during scan

Source Interface Extension (INPUT-06)

  • Existing Source interface in pkg/engine/sources/source.go: Chunks(ctx, out chan<- types.Chunk) error
  • All new sources implement this interface — pipeline stays unchanged
  • sources.NewXxxSource(args) factory functions for each source type
  • cmd/scan.go selects source based on flags: --git, --url, --clipboard, the stdin subcommand or - positional, else default FileSource
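
To make the selection order concrete, here is a sketch of the interface and a precedence function. The Source signature follows the line above; Chunk's field types, the scanFlags struct, and selectSourceKind are assumptions made for illustration, including the guess that --git and --clipboard are booleans and that stdin arrives as the first positional argument.

```go
package main

import "context"

// Chunk is a local stand-in for types.Chunk with assumed field types.
type Chunk struct {
	Data   []byte
	Source string
	Offset int64
}

// Source matches the interface described for pkg/engine/sources/source.go.
type Source interface {
	Chunks(ctx context.Context, out chan<- Chunk) error
}

// scanFlags is a hypothetical view of the parsed scan command line.
type scanFlags struct {
	Git       bool
	URL       string
	Clipboard bool
	Args      []string
}

// selectSourceKind returns which source the scan command should build.
// The precedence follows the order in which the flags are listed above.
func selectSourceKind(f scanFlags) string {
	switch {
	case f.Git:
		return "git"
	case f.URL != "":
		return "url"
	case f.Clipboard:
		return "clipboard"
	case len(f.Args) > 0 && (f.Args[0] == "stdin" || f.Args[0] == "-"):
		return "stdin"
	default:
		return "file"
	}
}
```

cmd/scan.go would map the returned kind to the matching sources.NewXxxSource factory; the pipeline only ever sees the Source interface.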

<code_context>

Existing Code Insights

Reusable Assets

  • pkg/engine/sources/source.go — Source interface (keep unchanged)
  • pkg/engine/sources/file.go — single-file FileSource (extend to DirSource OR keep separate)
  • pkg/engine/engine.go — three-stage pipeline (unchanged)
  • pkg/types/chunk.go — Chunk struct (unchanged)
  • cmd/scan.go — wire new flags + source selection

Established Patterns

  • Source emits types.Chunk{Data, Source, Offset} on the channel
  • Context cancellation: all sources check ctx.Done() in their loops
  • Error propagation: return error from Chunks(), engine drains channel

Integration Points

  • cmd/scan.go — add flags: --exclude, --git, --url, --clipboard, --since, --max-file-size, --insecure, --follow-redirects
  • New files: pkg/engine/sources/{dir,git,stdin,url,clipboard}.go
  • Tests: pkg/engine/sources/*_test.go

go.mod additions needed

  • github.com/go-git/go-git/v5
  • github.com/atotto/clipboard
  • golang.org/x/exp (the module that provides the mmap package)

</code_context>

## Specific Ideas

  • Default exclusions should match common "generated/vendored" directories to reduce noise
  • Dir source should emit files in deterministic order for reproducible test output (filepath.WalkDir already visits entries in lexical order)
  • Git source should report commit SHA in finding source field: "git:abc1234:path/to/file.go"
  • URL source should report URL in finding source field: "url:https://example.com/..."
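
The two source-field formats could be produced by helpers like these. gitSourceName and urlSourceName are hypothetical names, and abbreviating the SHA to seven characters is an assumption read off the "abc1234" example.

```go
package main

// gitSourceName builds "git:<short sha>:<path>", abbreviating the SHA to
// seven characters as in the example above.
func gitSourceName(sha, path string) string {
	if len(sha) > 7 {
		sha = sha[:7]
	}
	return "git:" + sha + ":" + path
}

// urlSourceName builds "url:<url>" for findings from the URL source.
func urlSourceName(rawURL string) string {
	return "url:" + rawURL
}
```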

## Deferred Ideas

  • Clipboard watch mode (poll clipboard for changes) — out of scope
  • HTTP authentication (basic/bearer for --url) — defer to Phase 5 or later
  • SFTP/SCP URL schemes — out of scope
  • Archive file extraction (.zip, .tar.gz scanning) — defer to later phase