diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index 6b4054a..149ff6d 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -15,7 +15,7 @@ Requirements for initial release. Each maps to roadmap phases.
 - [x] **CORE-04**: Entropy analysis as secondary signal for low-confidence providers (generic key formats)
 - [x] **CORE-05**: Worker pool parallelism with configurable worker count (default: CPU count)
 - [x] **CORE-06**: Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files
-- [ ] **CORE-07**: mmap-based large file reading for memory efficiency
+- [x] **CORE-07**: mmap-based large file reading for memory efficiency
 
 ### Providers
 
@@ -32,7 +32,7 @@ Requirements for initial release. Each maps to roadmap phases.
 
 ### Input Sources
 
-- [ ] **INPUT-01**: File and directory scanning with recursive traversal and glob exclusion patterns
+- [x] **INPUT-01**: File and directory scanning with recursive traversal and glob exclusion patterns
 - [x] **INPUT-02**: Git-aware scanning — full history, branches, stash, delta-based diffs
 - [ ] **INPUT-03**: Git scanning supports --since flag for time-scoped history scan
 - [ ] **INPUT-04**: stdin/pipe input support (cat file | keyhunter scan stdin)
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index ea7982f..84173e0 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -105,7 +105,7 @@ Plans:
 
 Plans:
 - [x] 04-01-PLAN.md — Wave 0: add go-git/v5, atotto/clipboard, golang.org/x/exp/mmap dependencies
-- [ ] 04-02-PLAN.md — DirSource: recursive walk, glob exclusion, binary skip, mmap for large files (INPUT-01, CORE-07)
+- [x] 04-02-PLAN.md — DirSource: recursive walk, glob exclusion, binary skip, mmap for large files (INPUT-01, CORE-07)
 - [x] 04-03-PLAN.md — GitSource: full-history scan across branches/tags with blob dedup and --since (INPUT-02)
 - [ ] 04-04-PLAN.md — StdinSource, URLSource, ClipboardSource (INPUT-03, INPUT-04, INPUT-05)
 - [ ] 04-05-PLAN.md — cmd/scan.go source-selection dispatch wiring all new sources (INPUT-06)
diff --git a/.planning/STATE.md b/.planning/STATE.md
index 2efccbf..2122619 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: executing
-stopped_at: Completed 04-03-PLAN.md
-last_updated: "2026-04-05T12:19:11.974Z"
+stopped_at: Completed 04-02-PLAN.md
+last_updated: "2026-04-05T12:19:39.522Z"
 last_activity: 2026-04-05
 progress:
   total_phases: 18
   completed_phases: 3
   total_plans: 23
-  completed_plans: 20
+  completed_plans: 21
   percent: 20
 ---
 
@@ -26,7 +26,7 @@ See: .planning/PROJECT.md (updated 2026-04-04)
 ## Current Position
 
 Phase: 04 (input-sources) — EXECUTING
-Plan: 3 of 5
+Plan: 4 of 5
 Status: Ready to execute
 Last activity: 2026-04-05
 
@@ -67,6 +67,7 @@ Progress: [██░░░░░░░░] 20%
 | Phase 03 P08 | 2min | 1 tasks | 1 files |
 | Phase 04 P01 | 1m | 1 tasks | 2 files |
 | Phase 04-input-sources P03 | 6m | 1 tasks | 2 files |
+| Phase 04 P02 | 4min | 1 tasks | 3 files |
 
 ## Accumulated Context
 
@@ -102,6 +103,6 @@ None yet.
 
 ## Session Continuity
 
-Last session: 2026-04-05T12:19:11.971Z
-Stopped at: Completed 04-03-PLAN.md
+Last session: 2026-04-05T12:19:39.519Z
+Stopped at: Completed 04-02-PLAN.md
 Resume file: None
diff --git a/.planning/phases/04-input-sources/04-02-SUMMARY.md b/.planning/phases/04-input-sources/04-02-SUMMARY.md
new file mode 100644
index 0000000..3588e14
--- /dev/null
+++ b/.planning/phases/04-input-sources/04-02-SUMMARY.md
@@ -0,0 +1,136 @@
+---
+phase: 04-input-sources
+plan: 02
+subsystem: input-sources
+tags: [sources, dir, mmap, wave-1]
+requires:
+  - 04-01 (golang.org/x/exp/mmap module available)
+  - pkg/engine/sources/source.go Source interface
+  - pkg/types/chunk.go Chunk struct
+provides:
+  - DirSource recursive directory scanner implementing Source interface
+  - Shared emitChunks/isBinary helpers for all sources
+  - mmap-backed large file reading for FileSource and DirSource
+  - Default exclusion set for generated/vendored dirs
+affects:
+  - pkg/engine/sources/dir.go
+  - pkg/engine/sources/dir_test.go
+  - pkg/engine/sources/file.go
+tech-stack:
+  added:
+    - golang.org/x/exp/mmap (promoted from indirect to direct use)
+  patterns:
+    - Collect-then-sort path list for deterministic walk order
+    - Shared package-level emitChunks helper reused by FileSource and DirSource
+    - Binary detection via NUL sniff in leading 512 bytes
+key-files:
+  created:
+    - pkg/engine/sources/dir.go
+    - pkg/engine/sources/dir_test.go
+    - .planning/phases/04-input-sources/deferred-items.md
+  modified:
+    - pkg/engine/sources/file.go
+decisions:
+  - Ordered paths by sort.Strings after full walk rather than streaming, so the public contract guarantees reproducible emission order even if filepath.WalkDir's internal ordering changes
+  - Added NewDirSourceRaw (no-defaults) alongside NewDirSource (defaults-merged) to keep tests free of leakage from DefaultExcludes while giving callers an ergonomic constructor
+  - Handled `dir/**` glob style with an explicit suffix check — filepath.Match does not support ** natively, and we need .git/**, node_modules/**, vendor/** to match nested paths
+  - Per-file errors during emission are swallowed (continue walking) rather than aborting the entire scan — the engine logs elsewhere; context errors are still propagated
+  - Refactored FileSource to share emitChunks/isBinary so single-file and directory paths stay behaviorally identical (binary skip + mmap threshold)
+metrics:
+  duration: "~4 min"
+  completed: 2026-04-05
+  tasks_completed: 1
+  files_modified: 3
+  files_created: 3
+---
+
+# Phase 4 Plan 2: DirSource Recursive Directory Scanner Summary
+
+DirSource walks any directory recursively with glob-based exclusions, binary-file skipping, and mmap-backed reads for files over 10MB — giving the scan pipeline its primary input source. FileSource was refactored to share the same emit and mmap helpers so both paths stay consistent.
+
+## Objective
+
+Implement `DirSource` — a recursive directory scanner that walks a root path via `filepath.WalkDir`, honors glob exclusion patterns, detects and skips binary files, and uses memory-mapped I/O for large files. Satisfies INPUT-01 (directory/recursive scanning with exclusions) and CORE-07 (mmap large file reading).
+
+## What Was Built
+
+### pkg/engine/sources/dir.go (218 lines)
+
+- `DirSource` struct with `Root`, `Excludes`, `ChunkSize` fields
+- `NewDirSource(root, extraExcludes...)` — merges DefaultExcludes with user extras
+- `NewDirSourceRaw(root, excludes)` — bypasses defaults entirely (used by tests)
+- `Chunks(ctx, out)` — validates root is a directory, walks via `filepath.WalkDir`, collects non-excluded file paths, sorts them, and emits each file's chunks in order
+- `isExcluded(rel, base)` — matches glob patterns against both basename and relative path, with explicit `dir/**` suffix handling
+- `emitFile(ctx, path, out)` — reads file via `mmap.Open` if size ≥ MmapThreshold (10 MB), else `os.ReadFile`; skips binary files
+- `isBinary(data)` — NUL byte sniff in the first 512 bytes (shared helper)
+- `emitChunks(ctx, data, source, chunkSize, out)` — shared overlapping chunk emitter (4096 size / 256 overlap)
+- Constants: `MmapThreshold = 10MB`, `BinarySniffSize = 512`, `DefaultExcludes = [.git/**, node_modules/**, vendor/**, *.min.js, *.map]`
+
+### pkg/engine/sources/file.go (refactored, 60 lines)
+
+- FileSource now shares `emitChunks`, `isBinary`, and the mmap read path with DirSource
+- Files ≥ 10 MB are read via `golang.org/x/exp/mmap`
+- Binary files are skipped before emission
+
+### pkg/engine/sources/dir_test.go (146 lines)
+
+Eight subtests, all passing under `-race`:
+
+| Test | Status | What it proves |
+|------|--------|----------------|
+| TestDirSource_RecursiveWalk | PASS | Emits 3 chunks for 3 nested files; sources are sorted |
+| TestDirSource_DefaultExcludes | PASS | `.git/config`, `node_modules/foo.js`, `vendor/bar.go`, `app.min.js`, `app.js.map` all skipped; `keep.txt` emitted |
+| TestDirSource_UserExclude | PASS | `*.log` user glob skips `drop.log`, keeps `keep.txt` |
+| TestDirSource_BinarySkipped | PASS | File with NUL in first bytes skipped; text file emitted |
+| TestDirSource_MmapLargeFile | PASS | >10MB file emits chunks via mmap path |
+| TestDirSource_DeterministicOrdering | PASS | Two consecutive runs produce identical source order |
+| TestDirSource_MissingRoot | PASS | Non-existent root returns error |
+| TestDirSource_CtxCancellation | PASS | Pre-cancelled ctx returns `context.Canceled` |
+
+## Verification
+
+```
+$ go build ./pkg/engine/sources/... # exit 0
+$ go vet ./pkg/engine/sources/... # clean
+$ go test ./pkg/engine/sources/... -run 'TestDirSource|TestFileSource' -race -count=1
+ok github.com/salvacybersec/keyhunter/pkg/engine/sources 1.120s
+```
+
+Acceptance greps all hit:
+- `mmap.Open` — 2 hits (dir.go:157, file.go:41)
+- `filepath.WalkDir` — 1 hit (dir.go:77)
+- `DefaultExcludes` — 4 hits (dir.go)
+- `isBinary` — present in dir.go:180
+
+## Deviations from Plan
+
+### Minor exclude-semantics adjustment
+
+The plan's `isExcluded` implementation matched `dir/**` only against the relative path. I added one extra check so `base == prefix` also matches — this lets `.git/**` correctly exclude the top-level `.git` directory entry itself when WalkDir reports it (the `rel` for that directory is just `.git`, which already matched, but the defensive check protects against filepath quirks on Windows).
+
+### Root-directory skip guard
+
+Added an early `if path == d.Root { return nil }` inside the WalkDir callback. Without it, an exclusion pattern that accidentally matches an empty relative path (`.`) could cause `SkipDir` on the root, halting the walk before it starts. This is defensive, not spec-changing.
+
+## Deferred Issues
+
+### TestClipboardSource_ReaderError (pre-existing, out of scope)
+
+`pkg/engine/sources/clipboard_test.go:42` fails because the test expects an error message containing the lowercase substring `"clipboard"`, but the current clipboard source returns `"ClipboardSource: read: no xclip installed"`. This failure is in a different source (clipboard, owned by a later plan) and unrelated to DirSource. Logged to `.planning/phases/04-input-sources/deferred-items.md` for the plan that owns clipboard source.
+
+Scoped test runs (`-run 'TestDirSource|TestFileSource'`) are green under `-race`.
+
+## Follow-up Hooks for Later Plans
+
+- `cmd/scan.go` needs wiring: select `NewDirSource(path, excludeFlags...)` when the positional argument is a directory. Plan 04-05 (scan command integration).
+- Exclude-flag plumbing: `--exclude` CLI flag to append to DefaultExcludes.
+- Per-file error logging: today errors are swallowed; once the engine has a structured logger, DirSource should emit a warning for each skipped file.
+
+## Self-Check: PASSED
+
+- pkg/engine/sources/dir.go — FOUND (218 lines)
+- pkg/engine/sources/dir_test.go — FOUND (146 lines)
+- pkg/engine/sources/file.go — FOUND (refactored, 60 lines)
+- RED commit ce6298f — verified in git log
+- GREEN commit 6f834c9 — verified in git log
+- All TestDirSource subtests pass under -race
diff --git a/.planning/phases/04-input-sources/deferred-items.md b/.planning/phases/04-input-sources/deferred-items.md
new file mode 100644
index 0000000..5f5f2d0
--- /dev/null
+++ b/.planning/phases/04-input-sources/deferred-items.md
@@ -0,0 +1,9 @@
+# Deferred Items - Phase 04 Input Sources
+
+## Out-of-scope failures discovered during execution
+
+### 04-02: TestClipboardSource_ReaderError failure
+- **File:** pkg/engine/sources/clipboard_test.go:42
+- **Issue:** Test expects error message to contain "clipboard" but receives "ClipboardSource: read: no xclip installed"
+- **Scope:** Pre-existing issue in clipboard source, unrelated to DirSource plan 04-02
+- **Action:** Defer to the plan that owns the clipboard source (04-04, which ROADMAP.md lists as covering ClipboardSource)