keyhunter/.planning/phases/04-input-sources/04-02-SUMMARY.md
2026-04-05 15:19:50 +03:00

phase: 04-input-sources
plan: 02
subsystem: input-sources
tags: sources, dir, mmap, wave-1
requires:
  • 04-01 (golang.org/x/exp/mmap module available)
  • pkg/engine/sources/source.go Source interface
  • pkg/types/chunk.go Chunk struct
provides:
  • DirSource recursive directory scanner implementing Source interface
  • Shared emitChunks/isBinary helpers for all sources
  • mmap-backed large file reading for FileSource and DirSource
  • Default exclusion set for generated/vendored dirs
affects:
  • pkg/engine/sources/dir.go
  • pkg/engine/sources/dir_test.go
  • pkg/engine/sources/file.go
tech-stack:
  added:
    • golang.org/x/exp/mmap (promoted from indirect to direct use)
  patterns:
    • Collect-then-sort path list for deterministic walk order
    • Shared package-level emitChunks helper reused by FileSource and DirSource
    • Binary detection via NUL sniff in leading 512 bytes
key-files:
  created:
    • pkg/engine/sources/dir.go
    • pkg/engine/sources/dir_test.go
    • .planning/phases/04-input-sources/deferred-items.md
  modified:
    • pkg/engine/sources/file.go
decisions:
  • Ordered paths by sort.Strings after full walk rather than streaming, so the public contract guarantees reproducible emission order even if filepath.WalkDir's internal ordering changes
  • Added NewDirSourceRaw (no-defaults) alongside NewDirSource (defaults-merged) to keep tests free of leakage from DefaultExcludes while giving callers an ergonomic constructor
  • Handled `dir/**` glob style with an explicit suffix check: filepath.Match does not support ** natively, and we need .git/**, node_modules/**, vendor/** to match nested paths
  • Per-file errors during emission are swallowed (continue walking) rather than aborting the entire scan; the engine logs elsewhere, and context errors are still propagated
  • Refactored FileSource to share emitChunks/isBinary so single-file and directory paths stay behaviorally identical (binary skip + mmap threshold)
metrics:
  duration: ~4 min
  completed: 2026-04-05
  tasks_completed: 1
  files_modified: 3
  files_created: 3

Phase 4 Plan 2: DirSource Recursive Directory Scanner Summary

DirSource walks a directory recursively with glob-based exclusions, binary-file skipping, and mmap-backed reads for files of 10 MB or more, giving the scan pipeline its primary input source. FileSource was refactored to share the same emit and mmap helpers so both paths stay consistent.

Objective

Implement DirSource — a recursive directory scanner that walks a root path via filepath.WalkDir, honors glob exclusion patterns, detects and skips binary files, and uses memory-mapped I/O for large files. Satisfies INPUT-01 (directory/recursive scanning with exclusions) and CORE-07 (mmap large file reading).

What Was Built

pkg/engine/sources/dir.go (218 lines)

  • DirSource struct with Root, Excludes, ChunkSize fields
  • NewDirSource(root, extraExcludes...) — merges DefaultExcludes with user extras
  • NewDirSourceRaw(root, excludes) — bypasses defaults entirely (used by tests)
  • Chunks(ctx, out) — validates root is a directory, walks via filepath.WalkDir, collects non-excluded file paths, sorts them, and emits each file's chunks in order
  • isExcluded(rel, base) — matches glob patterns against both basename and relative path, with explicit dir/** suffix handling
  • emitFile(ctx, path, out) — reads file via mmap.Open if size ≥ MmapThreshold (10 MB), else os.ReadFile; skips binary files
  • isBinary(data) — NUL byte sniff in the first 512 bytes (shared helper)
  • emitChunks(ctx, data, source, chunkSize, out) — shared overlapping chunk emitter (4096 size / 256 overlap)
  • Constants: MmapThreshold = 10MB, BinarySniffSize = 512, DefaultExcludes = [.git/**, node_modules/**, vendor/**, *.min.js, *.map]

pkg/engine/sources/file.go (refactored, 60 lines)

  • FileSource now shares emitChunks, isBinary, and the mmap read path with DirSource
  • Files ≥ 10 MB are read via golang.org/x/exp/mmap
  • Binary files are skipped before emission

pkg/engine/sources/dir_test.go (146 lines)

Eight tests, all passing under -race:

| Test | Status | What it proves |
| --- | --- | --- |
| TestDirSource_RecursiveWalk | PASS | Emits 3 chunks for 3 nested files; sources are sorted |
| TestDirSource_DefaultExcludes | PASS | .git/config, node_modules/foo.js, vendor/bar.go, app.min.js, app.js.map all skipped; keep.txt emitted |
| TestDirSource_UserExclude | PASS | *.log user glob skips drop.log, keeps keep.txt |
| TestDirSource_BinarySkipped | PASS | File with NUL in first bytes skipped; text file emitted |
| TestDirSource_MmapLargeFile | PASS | >10MB file emits chunks via mmap path |
| TestDirSource_DeterministicOrdering | PASS | Two consecutive runs produce identical source order |
| TestDirSource_MissingRoot | PASS | Non-existent root returns error |
| TestDirSource_CtxCancellation | PASS | Pre-cancelled ctx returns context.Canceled |

Verification

$ go build ./pkg/engine/sources/...        # exit 0
$ go vet ./pkg/engine/sources/...          # clean
$ go test ./pkg/engine/sources/... -run 'TestDirSource|TestFileSource' -race -count=1
ok  	github.com/salvacybersec/keyhunter/pkg/engine/sources	1.120s

Acceptance greps all hit:

  • mmap.Open — 2 hits (dir.go:157, file.go:41)
  • filepath.WalkDir — 1 hit (dir.go:77)
  • DefaultExcludes — 4 hits (dir.go)
  • isBinary — present in dir.go:180

Deviations from Plan

Minor exclude-semantics adjustment

The plan's isExcluded implementation matched dir/** only against the relative path. I added one extra check so that base == prefix also matches. For the top-level .git directory entry that WalkDir reports, the relative path is just .git, which the original check already matched, so the extra check is purely defensive: it protects against filepath quirks on Windows.
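
A minimal sketch of this matching logic, under the assumption that patterns are passed in explicitly (the real isExcluded takes only rel and base and presumably reads the patterns from the DirSource):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// defaultExcludes mirrors the default set quoted in this summary.
var defaultExcludes = []string{".git/**", "node_modules/**", "vendor/**", "*.min.js", "*.map"}

// isExcluded reports whether a path should be skipped.
// rel is the path relative to the walk root; base is its final element.
func isExcluded(patterns []string, rel, base string) bool {
	relSlash := filepath.ToSlash(rel) // compare with forward slashes on every OS
	for _, p := range patterns {
		if strings.HasSuffix(p, "/**") {
			// filepath.Match has no "**", so "dir/**" is handled by prefix checks.
			prefix := strings.TrimSuffix(p, "/**")
			if relSlash == prefix || base == prefix || strings.HasPrefix(relSlash, prefix+"/") {
				return true
			}
			continue
		}
		// Ordinary globs are tried against the basename, then the relative path.
		if ok, _ := filepath.Match(p, base); ok {
			return true
		}
		if ok, _ := filepath.Match(p, rel); ok {
			return true
		}
	}
	return false
}

func main() {
	for _, rel := range []string{".git", ".git/config", "web/app.min.js", "src/main.go"} {
		fmt.Printf("%-16s excluded=%v\n", rel, isExcluded(defaultExcludes, rel, filepath.Base(rel)))
	}
}
```

Note that in this sketch the base == prefix branch also matches any nested entry whose name equals the prefix (for example sub/vendor), which is in line with the stated goal of matching nested paths.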

Root-directory skip guard

Added an early if path == d.Root { return nil } inside the WalkDir callback. Without it, an exclusion pattern that happens to match the root's own relative path (.) could return SkipDir for the root and end the walk before any file is visited. This is defensive, not spec-changing.

Deferred Issues

TestClipboardSource_ReaderError (pre-existing, out of scope)

pkg/engine/sources/clipboard_test.go:42 fails because the test expects an error message containing the substring "clipboard", but the current clipboard source returns "ClipboardSource: read: no xclip installed", which has only the capitalized "Clipboard" and so does not satisfy a case-sensitive match. This failure is in a different source (clipboard, owned by a later plan) and is unrelated to DirSource. Logged to .planning/phases/04-input-sources/deferred-items.md for the plan that owns the clipboard source.

Scoped test runs (-run 'TestDirSource|TestFileSource') are green under -race.

Follow-up Hooks for Later Plans

  • cmd/scan.go needs wiring: select NewDirSource(path, excludeFlags...) when the positional argument is a directory. Plan 04-06 (scan command integration).
  • Exclude-flag plumbing: --exclude CLI flag to append to DefaultExcludes.
  • Per-file error logging: today errors are swallowed; once the engine has a structured logger, DirSource should emit a warning for each skipped file.

Self-Check: PASSED

  • pkg/engine/sources/dir.go — FOUND (218 lines)
  • pkg/engine/sources/dir_test.go — FOUND (146 lines)
  • pkg/engine/sources/file.go — FOUND (refactored, 60 lines)
  • RED commit ce6298f — verified in git log
  • GREEN commit 6f834c9 — verified in git log
  • All TestDirSource tests pass under -race