---
phase: 04-input-sources
plan: 02
subsystem: input-sources
tags: [sources, dir, mmap, wave-1]
requires:
  - 04-01 (golang.org/x/exp/mmap module available)
  - pkg/engine/sources/source.go Source interface
  - pkg/types/chunk.go Chunk struct
provides:
  - DirSource recursive directory scanner implementing Source interface
  - Shared emitChunks/isBinary helpers for all sources
  - mmap-backed large file reading for FileSource and DirSource
  - Default exclusion set for generated/vendored dirs
affects:
  - pkg/engine/sources/dir.go
  - pkg/engine/sources/dir_test.go
  - pkg/engine/sources/file.go
tech-stack:
  added:
    - golang.org/x/exp/mmap (promoted from indirect to direct use)
  patterns:
    - Collect-then-sort path list for deterministic walk order
    - Shared package-level emitChunks helper reused by FileSource and DirSource
    - Binary detection via NUL sniff in leading 512 bytes
key-files:
  created:
    - pkg/engine/sources/dir.go
    - pkg/engine/sources/dir_test.go
    - .planning/phases/04-input-sources/deferred-items.md
  modified:
    - pkg/engine/sources/file.go
decisions:
  - Ordered paths by sort.Strings after full walk rather than streaming, so the public contract guarantees reproducible emission order even if filepath.WalkDir's internal ordering changes
  - Added NewDirSourceRaw (no-defaults) alongside NewDirSource (defaults-merged) to keep tests free of leakage from DefaultExcludes while giving callers an ergonomic constructor
  - Handled `dir/**` glob style with an explicit suffix check, because filepath.Match does not support ** natively and .git/**, node_modules/**, vendor/** must match nested paths
  - Per-file errors during emission are swallowed (the walk continues) rather than aborting the entire scan; the engine logs elsewhere, and context errors are still propagated
  - Refactored FileSource to share emitChunks/isBinary so single-file and directory paths stay behaviorally identical (binary skip + mmap threshold)
metrics:
  duration: "~4 min"
  completed: 2026-04-05
  tasks_completed: 1
  files_modified: 3
  files_created: 3
---

# Phase 4 Plan 2: DirSource Recursive Directory Scanner Summary

DirSource walks any directory recursively with glob-based exclusions, binary-file skipping, and mmap-backed reads for files of 10 MB or more, giving the scan pipeline its primary input source. FileSource was refactored to share the same emit and mmap helpers so both paths stay consistent.

## Objective

Implement `DirSource`, a recursive directory scanner that walks a root path via `filepath.WalkDir`, honors glob exclusion patterns, detects and skips binary files, and uses memory-mapped I/O for large files. Satisfies INPUT-01 (directory/recursive scanning with exclusions) and CORE-07 (mmap large file reading).

## What Was Built

### pkg/engine/sources/dir.go (218 lines)

- `DirSource` struct with `Root`, `Excludes`, `ChunkSize` fields
- `NewDirSource(root, extraExcludes...)`: merges DefaultExcludes with user extras
- `NewDirSourceRaw(root, excludes)`: bypasses defaults entirely (used by tests)
- `Chunks(ctx, out)`: validates that root is a directory, walks via `filepath.WalkDir`, collects non-excluded file paths, sorts them, and emits each file's chunks in order
- `isExcluded(rel, base)`: matches glob patterns against both basename and relative path, with explicit `dir/**` suffix handling
- `emitFile(ctx, path, out)`: reads the file via `mmap.Open` if size ≥ MmapThreshold (10 MB), else `os.ReadFile`; skips binary files
- `isBinary(data)`: NUL byte sniff in the first 512 bytes (shared helper)
- `emitChunks(ctx, data, source, chunkSize, out)`: shared overlapping chunk emitter (4096 size / 256 overlap)
- Constants: `MmapThreshold = 10MB`, `BinarySniffSize = 512`, `DefaultExcludes = [.git/**, node_modules/**, vendor/**, *.min.js, *.map]`

### pkg/engine/sources/file.go (refactored, 60 lines)

- FileSource now shares `emitChunks`, `isBinary`, and the mmap read path with DirSource
- Files ≥ 10 MB are read via `golang.org/x/exp/mmap`
- Binary files are skipped before emission

### pkg/engine/sources/dir_test.go (146 lines)

Eight subtests, all passing under `-race`:

| Test | Status | What it proves |
|------|--------|----------------|
| TestDirSource_RecursiveWalk | PASS | Emits 3 chunks for 3 nested files; sources are sorted |
| TestDirSource_DefaultExcludes | PASS | `.git/config`, `node_modules/foo.js`, `vendor/bar.go`, `app.min.js`, `app.js.map` all skipped; `keep.txt` emitted |
| TestDirSource_UserExclude | PASS | `*.log` user glob skips `drop.log`, keeps `keep.txt` |
| TestDirSource_BinarySkipped | PASS | File with NUL in first bytes skipped; text file emitted |
| TestDirSource_MmapLargeFile | PASS | >10MB file emits chunks via mmap path |
| TestDirSource_DeterministicOrdering | PASS | Two consecutive runs produce identical source order |
| TestDirSource_MissingRoot | PASS | Non-existent root returns error |
| TestDirSource_CtxCancellation | PASS | Pre-cancelled ctx returns `context.Canceled` |

## Verification

```
$ go build ./pkg/engine/sources/...   # exit 0
$ go vet ./pkg/engine/sources/...     # clean
$ go test ./pkg/engine/sources/... -run 'TestDirSource|TestFileSource' -race -count=1
ok  github.com/salvacybersec/keyhunter/pkg/engine/sources  1.120s
```

Acceptance greps all hit:

- `mmap.Open`: 2 hits (dir.go:157, file.go:41)
- `filepath.WalkDir`: 1 hit (dir.go:77)
- `DefaultExcludes`: 4 hits (dir.go)
- `isBinary`: present in dir.go:180

## Deviations from Plan

### Minor exclude-semantics adjustment

The plan's `isExcluded` implementation matched `dir/**` only against the relative path. I added one extra check so that `base == prefix` also matches. For the top-level `.git` directory entry that WalkDir reports, `rel` is just `.git`, which the relative-path check already matches, so the extra check is purely defensive: it protects against filepath quirks on Windows.

### Root-directory skip guard

Added an early `if path == d.Root { return nil }` inside the WalkDir callback.
Without it, an exclusion pattern that accidentally matches the root's own relative path (`.`) could return `SkipDir` for the root, halting the walk before it starts. This is defensive, not spec-changing.

## Deferred Issues

### TestClipboardSource_ReaderError (pre-existing, out of scope)

`pkg/engine/sources/clipboard_test.go:42` fails because the test expects an error message containing the substring `"clipboard"`, but the current clipboard source returns `"ClipboardSource: read: no xclip installed"`. This failure is in a different source (clipboard, owned by a later plan) and is unrelated to DirSource. It is logged in `.planning/phases/04-input-sources/deferred-items.md` for the plan that owns the clipboard source. Scoped test runs (`-run 'TestDirSource|TestFileSource'`) are green under `-race`.

## Follow-up Hooks for Later Plans

- `cmd/scan.go` needs wiring: select `NewDirSource(path, excludeFlags...)` when the positional argument is a directory. Plan 04-06 (scan command integration).
- Exclude-flag plumbing: `--exclude` CLI flag to append to DefaultExcludes.
- Per-file error logging: today errors are swallowed; once the engine has a structured logger, DirSource should emit a warning for each skipped file.

## Self-Check: PASSED

- pkg/engine/sources/dir.go: FOUND (218 lines)
- pkg/engine/sources/dir_test.go: FOUND (146 lines)
- pkg/engine/sources/file.go: FOUND (refactored, 60 lines)
- RED commit ce6298f: verified in git log
- GREEN commit 6f834c9: verified in git log
- All TestDirSource subtests pass under -race