Ordered paths by sort.Strings after full walk rather than streaming, so the public contract guarantees reproducible emission order even if filepath.WalkDir's internal ordering changes
Added NewDirSourceRaw (no-defaults) alongside NewDirSource (defaults-merged) to keep tests free of leakage from DefaultExcludes while giving callers an ergonomic constructor
Handled `dir/**` glob style with an explicit suffix check — filepath.Match does not support ** natively, and we need .git/**, node_modules/**, vendor/** to match nested paths
Per-file errors during emission are swallowed (continue walking) rather than aborting the entire scan — the engine logs elsewhere; context errors are still propagated
Refactored FileSource to share emitChunks/isBinary so single-file and directory paths stay behaviorally identical (binary skip + mmap threshold)
duration: ~4 min
completed: 2026-04-05
tasks_completed: 1
files_modified: 3
files_created: 3
Phase 4 Plan 2: DirSource Recursive Directory Scanner Summary
DirSource walks any directory recursively with glob-based exclusions, binary-file skipping, and mmap-backed reads for files of 10 MB or larger, giving the scan pipeline its primary input source. FileSource was refactored to share the same emit and mmap helpers so both paths stay consistent.
Objective
Implement DirSource — a recursive directory scanner that walks a root path via filepath.WalkDir, honors glob exclusion patterns, detects and skips binary files, and uses memory-mapped I/O for large files. Satisfies INPUT-01 (directory/recursive scanning with exclusions) and CORE-07 (mmap large file reading).
What Was Built
pkg/engine/sources/dir.go (218 lines)
DirSource struct with Root, Excludes, ChunkSize fields
NewDirSource(root, extraExcludes...) — merges DefaultExcludes with user extras
NewDirSourceRaw(root, excludes) — bypasses defaults entirely (used by tests)
Chunks(ctx, out) — validates root is a directory, walks via filepath.WalkDir, collects non-excluded file paths, sorts them, and emits each file's chunks in order
isExcluded(rel, base) — matches glob patterns against both basename and relative path, with explicit dir/** suffix handling
emitFile(ctx, path, out) — reads file via mmap.Open if size ≥ MmapThreshold (10 MB), else os.ReadFile; skips binary files
isBinary(data) — NUL byte sniff in the first 512 bytes (shared helper)
FileSource now shares emitChunks, isBinary, and the mmap read path with DirSource
Files ≥ 10 MB are read via golang.org/x/exp/mmap
Binary files are skipped before emission
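The binary check described above can be sketched as follows. This is a minimal illustration of a NUL-byte sniff over the first 512 bytes, not the code in dir.go; the constant name sniffLen is an assumption made for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffLen is a hypothetical name for the number of leading bytes inspected;
// 512 matches the figure quoted in this summary.
const sniffLen = 512

// isBinary reports whether data looks binary, meaning a NUL byte appears
// within the first sniffLen bytes. NULs later in the file are not examined.
func isBinary(data []byte) bool {
	if len(data) > sniffLen {
		data = data[:sniffLen]
	}
	return bytes.IndexByte(data, 0) >= 0
}

func main() {
	fmt.Println(isBinary([]byte("plain text")))
	fmt.Println(isBinary([]byte{'E', 'L', 'F', 0}))
}
```

Limiting the sniff to a fixed prefix keeps the check cheap for large files, at the cost of missing a NUL that first appears past the window.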
pkg/engine/sources/dir_test.go (146 lines)
Eight subtests, all passing under -race:
| Test | Status | What it proves |
|---|---|---|
| TestDirSource_RecursiveWalk | PASS | Emits 3 chunks for 3 nested files; sources are sorted |
| TestDirSource_DefaultExcludes | PASS | .git/config, node_modules/foo.js, vendor/bar.go, app.min.js, app.js.map all skipped; keep.txt emitted |
| TestDirSource_UserExclude | PASS | *.log user glob skips drop.log, keeps keep.txt |
| TestDirSource_BinarySkipped | PASS | File with NUL in first bytes skipped; text file emitted |
| TestDirSource_MmapLargeFile | PASS | >10MB file emits chunks via mmap path |
| TestDirSource_DeterministicOrdering | PASS | Two consecutive runs produce identical source order |
| TestDirSource_MissingRoot | PASS | Non-existent root returns error |
| TestDirSource_CtxCancellation | PASS | Pre-cancelled ctx returns context.Canceled |
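The ordering guarantee that TestDirSource_DeterministicOrdering exercises comes from collecting every path first and sorting before anything is emitted. A minimal sketch of that collect-then-sort pattern is shown here; collectSorted and demoTree are hypothetical names, and unlike DirSource this sketch applies no exclusions and aborts on any walk error.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// collectSorted walks root, gathers the paths of all non-directory entries
// relative to root, and sorts them so the result does not depend on the
// order in which the walk happens to visit entries.
func collectSorted(root string) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // includes the case where root does not exist
		}
		if d.IsDir() {
			return nil
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		paths = append(paths, filepath.ToSlash(rel))
		return nil
	})
	if err != nil {
		return nil, err
	}
	sort.Strings(paths)
	return paths, nil
}

// demoTree builds a throwaway directory tree and returns its sorted file list.
func demoTree() ([]string, error) {
	root, err := os.MkdirTemp("", "dirsource-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	for _, name := range []string{"b.txt", "sub/c.txt", "a.txt"} {
		full := filepath.Join(root, filepath.FromSlash(name))
		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(full, []byte("x"), 0o644); err != nil {
			return nil, err
		}
	}
	return collectSorted(root)
}

func main() {
	paths, err := demoTree()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(paths)
}
```

The trade-off is memory: the full path list is held before the first chunk is emitted, which is the cost of making the order part of the contract.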
Verification
$ go build ./pkg/engine/sources/... # exit 0
$ go vet ./pkg/engine/sources/... # clean
$ go test ./pkg/engine/sources/... -run 'TestDirSource|TestFileSource' -race -count=1
ok github.com/salvacybersec/keyhunter/pkg/engine/sources 1.120s
Acceptance greps all hit:
mmap.Open — 2 hits (dir.go:157, file.go:41)
filepath.WalkDir — 1 hit (dir.go:77)
DefaultExcludes — 4 hits (dir.go)
isBinary — present in dir.go:180
Deviations from Plan
Minor exclude-semantics adjustment
The plan's isExcluded implementation matched dir/** only against the relative path. I added one extra check so that base == prefix also matches, which lets .git/** exclude the .git directory entry itself when WalkDir reports it. For the top-level .git directory the relative path is just .git and therefore already matched, so the extra check is defensive: it protects against filepath quirks on Windows.
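A rough sketch of this matching logic, under the assumption that rel is a slash-separated path relative to the scan root and base is its final element. The signature (patterns passed as a slice) and the pattern list in main are illustrative only and are not the actual DefaultExcludes.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isExcluded reports whether an entry should be skipped. Patterns ending in
// "/**" are handled by hand because filepath.Match has no "**" support;
// all other patterns are tried against both the basename and the relative path.
func isExcluded(patterns []string, rel, base string) bool {
	for _, p := range patterns {
		if strings.HasSuffix(p, "/**") {
			prefix := strings.TrimSuffix(p, "/**")
			// rel == prefix:  the directory entry itself at the root
			// base == prefix: the extra defensive check described above
			// HasPrefix:      anything nested under the directory
			if rel == prefix || base == prefix || strings.HasPrefix(rel, prefix+"/") {
				return true
			}
			continue
		}
		if ok, _ := filepath.Match(p, base); ok {
			return true
		}
		if ok, _ := filepath.Match(p, rel); ok {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical pattern list for illustration.
	pats := []string{".git/**", "node_modules/**", "*.log"}
	fmt.Println(isExcluded(pats, ".git/config", "config"))
	fmt.Println(isExcluded(pats, "src/main.go", "main.go"))
}
```

Appending "/" to the prefix before the HasPrefix test keeps a sibling such as .github from being caught by .git/**.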
Root-directory skip guard
Added an early if path == d.Root { return nil } inside the WalkDir callback. Without it, an exclusion pattern that accidentally matches an empty relative path (.) could cause SkipDir on the root, halting the walk before it starts. This is defensive, not spec-changing.
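The guard can be illustrated with a small walk whose skip predicate deliberately matches the empty relative path. This is a sketch under assumptions: walkFiltered, demoGuard, and the skip predicate are hypothetical names, and the real callback in dir.go does more than this.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// walkFiltered collects non-directory paths relative to root, pruning any
// directory for which skip returns true. The root is never tested against skip.
func walkFiltered(root string, skip func(rel string) bool) ([]string, error) {
	var out []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if path == root {
			return nil // root guard: never evaluate exclusions for rel "."
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		rel = filepath.ToSlash(rel)
		if skip(rel) {
			if d.IsDir() {
				return filepath.SkipDir // prune the whole subtree
			}
			return nil // skip just this file
		}
		if !d.IsDir() {
			out = append(out, rel)
		}
		return nil
	})
	return out, err
}

// demoGuard walks a throwaway tree with a predicate that would match "."
// and shows that the walk still reaches keep.txt.
func demoGuard() ([]string, error) {
	root, err := os.MkdirTemp("", "dirsource-guard")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	if err := os.MkdirAll(filepath.Join(root, ".git"), 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(root, ".git", "config"), []byte("x"), 0o644); err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(root, "keep.txt"), []byte("x"), 0o644); err != nil {
		return nil, err
	}
	skip := func(rel string) bool { return rel == "." || rel == ".git" }
	return walkFiltered(root, skip)
}

func main() {
	files, err := demoGuard()
	fmt.Println(files, err)
}
```

Returning SkipDir only for directories matters as well: returning it for a file entry would skip the remaining entries in that file's parent directory.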
Deferred Issues
TestClipboardSource_ReaderError (pre-existing, out of scope)
pkg/engine/sources/clipboard_test.go:42 fails because the test expects an error message containing the substring "clipboard", but the current clipboard source returns "ClipboardSource: read: no xclip installed". The message only contains the capitalized "ClipboardSource", which presumably does not satisfy a case-sensitive substring match. This failure is in a different source (clipboard, owned by a later plan) and is unrelated to DirSource. Logged to .planning/phases/04-input-sources/deferred-items.md for the plan that owns the clipboard source.
Scoped test runs (-run 'TestDirSource|TestFileSource') are green under -race.
Follow-up Hooks for Later Plans
cmd/scan.go needs wiring: select NewDirSource(path, excludeFlags...) when the positional argument is a directory. This belongs to Plan 04-06 (scan command integration).
Exclude-flag plumbing: --exclude CLI flag to append to DefaultExcludes.
Per-file error logging: today errors are swallowed; once the engine has a structured logger, DirSource should emit a warning for each skipped file.
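One possible shape for the per-file warning hook is sketched here. It assumes the emission loop can be expressed as a function over already-sorted paths, and it uses the standard log package as a stand-in for the structured logger that does not exist yet. emitAll, demoSkip, and demoCancelled are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// emitAll calls emit for each path in order. A failure on a single file is
// logged and counted, then the loop continues; context cancellation or
// deadline expiry stops the loop and is returned to the caller.
func emitAll(ctx context.Context, paths []string, emit func(context.Context, string) error) (int, error) {
	skipped := 0
	for _, p := range paths {
		if err := ctx.Err(); err != nil {
			return skipped, err
		}
		err := emit(ctx, p)
		switch {
		case err == nil:
			// file emitted normally
		case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
			return skipped, err
		default:
			log.Printf("dirsource: skipping %s: %v", p, err)
			skipped++
		}
	}
	return skipped, nil
}

// demoSkip runs emitAll over three made-up paths, one of which fails.
func demoSkip() (int, error) {
	emit := func(_ context.Context, p string) error {
		if p == "unreadable.txt" {
			return errors.New("permission denied")
		}
		return nil
	}
	return emitAll(context.Background(), []string{"a.txt", "unreadable.txt", "b.txt"}, emit)
}

// demoCancelled reports whether a pre-cancelled context surfaces as context.Canceled.
func demoCancelled() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := emitAll(ctx, []string{"a.txt"}, func(context.Context, string) error { return nil })
	return errors.Is(err, context.Canceled)
}

func main() {
	n, err := demoSkip()
	fmt.Println(n, err)
	fmt.Println(demoCancelled())
}
```

Returning the skipped count alongside the error would let a caller report how many files were dropped without parsing log output; whether DirSource should expose that is left open here.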
Self-Check: PASSED
pkg/engine/sources/dir.go — FOUND (218 lines)
pkg/engine/sources/dir_test.go — FOUND (146 lines)
pkg/engine/sources/file.go — FOUND (refactored, 60 lines)