--- phase: 04-input-sources plan: 02 type: execute wave: 1 depends_on: ["04-01"] files_modified: - pkg/engine/sources/dir.go - pkg/engine/sources/dir_test.go - pkg/engine/sources/file.go - pkg/engine/sources/file_test.go autonomous: true requirements: - INPUT-01 - CORE-07 must_haves: truths: - "DirSource recursively walks a directory and emits Chunks for every non-excluded file" - "Glob exclusion patterns (--exclude) skip matching files by basename AND full relative path" - "Default exclusions skip .git/, node_modules/, vendor/, *.min.js, *.map" - "Binary files (null byte in first 512 bytes) are skipped" - "Files larger than the mmap threshold (10MB) are read via golang.org/x/exp/mmap, smaller files via os.ReadFile" - "File emission order is deterministic (sorted) for reproducible tests" artifacts: - path: "pkg/engine/sources/dir.go" provides: "DirSource implementing Source interface for recursive directory scanning" exports: ["DirSource", "NewDirSource"] min_lines: 120 - path: "pkg/engine/sources/dir_test.go" provides: "Test coverage for recursive walk, exclusion, binary skip, mmap threshold" min_lines: 100 - path: "pkg/engine/sources/file.go" provides: "FileSource extended to use mmap for files > 10MB" contains: "mmap" key_links: - from: "pkg/engine/sources/dir.go" to: "golang.org/x/exp/mmap" via: "mmap.Open for large files" pattern: "mmap\\.Open" - from: "pkg/engine/sources/dir.go" to: "filepath.WalkDir" via: "recursive traversal" pattern: "filepath\\.WalkDir" - from: "pkg/engine/sources/dir.go" to: "types.Chunk" via: "channel send" pattern: "out <- types\\.Chunk" --- Implement `DirSource` — a recursive directory scanner that walks a root path via `filepath.WalkDir`, honors glob exclusion patterns, detects and skips binary files, and uses memory-mapped I/O for large files. This satisfies INPUT-01 (directory/recursive scanning with exclusions) and CORE-07 (mmap large file reading). Purpose: The most common scan target is a repo directory, not a single file. This plan replaces the "wrap FileSource per path" hack with a purpose-built recursive source that emits deterministically ordered chunks and scales to multi-GB files without blowing out memory. Output: `pkg/engine/sources/dir.go`, `dir_test.go`, plus a small `file.go` update to share the mmap read helper. @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md @.planning/PROJECT.md @.planning/phases/04-input-sources/04-CONTEXT.md @pkg/engine/sources/source.go @pkg/engine/sources/file.go @pkg/types/chunk.go Source interface (pkg/engine/sources/source.go): ```go type Source interface { Chunks(ctx context.Context, out chan<- types.Chunk) error } ``` Chunk type (pkg/types/chunk.go): ```go type Chunk struct { Data []byte Source string Offset int64 } ``` Existing constants in pkg/engine/sources/file.go: ```go const defaultChunkSize = 4096 const chunkOverlap = 256 ``` Task 1: Implement DirSource with recursive walk, exclusion, binary detection, and mmap - pkg/engine/sources/source.go - pkg/engine/sources/file.go - pkg/types/chunk.go - .planning/phases/04-input-sources/04-CONTEXT.md (Directory/File Scanning section) pkg/engine/sources/dir.go, pkg/engine/sources/dir_test.go, pkg/engine/sources/file.go - Test 1: DirSource walks a temp dir containing 3 text files, emits 3 chunks, source fields match file paths - Test 2: Default exclusions skip `.git/config`, `node_modules/foo.js`, `vendor/bar.go`, `app.min.js`, `app.js.map` - Test 3: User-supplied exclude pattern `*.log` skips `foo.log` but keeps `foo.txt` - Test 4: Binary file (first 512 bytes contain a null byte) is skipped; text file is emitted - Test 5: File >10MB is read via mmap path and emits chunks whose concatenated data equals file content - Test 6: File emission order is deterministic (sorted lexicographically) across two runs on same dir - Test 7: ctx cancellation mid-walk returns ctx.Err() promptly - Test 8: Non-existent root returns an error Create `pkg/engine/sources/dir.go` with the following complete implementation: ```go package sources import ( "bytes" "context" "errors" "fmt" "io/fs" "os" "path/filepath" "sort" "strings" "golang.org/x/exp/mmap" "github.com/salvacybersec/keyhunter/pkg/types" ) // MmapThreshold is the file size above which DirSource/FileSource use memory-mapped reads. const MmapThreshold int64 = 10 * 1024 * 1024 // 10 MB // BinarySniffSize is the number of leading bytes inspected for a NUL byte // to classify a file as binary and skip it. const BinarySniffSize = 512 // DefaultExcludes are glob patterns excluded from directory scans unless // the caller passes an empty slice explicitly via NewDirSourceRaw. var DefaultExcludes = []string{ ".git/**", "node_modules/**", "vendor/**", "*.min.js", "*.map", } // DirSource walks a directory recursively and emits Chunks for every // non-excluded, non-binary file it finds. Files larger than MmapThreshold // are read via mmap; smaller files use os.ReadFile. type DirSource struct { Root string Excludes []string // glob patterns applied to path basename AND full relative path ChunkSize int } // NewDirSource creates a DirSource with the default exclusions merged // with the caller-supplied extras. func NewDirSource(root string, extraExcludes ...string) *DirSource { merged := make([]string, 0, len(DefaultExcludes)+len(extraExcludes)) merged = append(merged, DefaultExcludes...) merged = append(merged, extraExcludes...) return &DirSource{Root: root, Excludes: merged, ChunkSize: defaultChunkSize} } // NewDirSourceRaw creates a DirSource with ONLY the caller-supplied excludes // (no defaults). Useful for tests and advanced users. func NewDirSourceRaw(root string, excludes []string) *DirSource { return &DirSource{Root: root, Excludes: excludes, ChunkSize: defaultChunkSize} } // Chunks implements Source. It walks d.Root, filters excluded and binary // files, reads each remaining file (via mmap above MmapThreshold), and // emits overlapping chunks through out. func (d *DirSource) Chunks(ctx context.Context, out chan<- types.Chunk) error { if d.Root == "" { return errors.New("DirSource: Root is empty") } info, err := os.Stat(d.Root) if err != nil { return fmt.Errorf("DirSource: stat root: %w", err) } if !info.IsDir() { return fmt.Errorf("DirSource: root %q is not a directory", d.Root) } // Collect paths first for deterministic ordering across runs. var paths []string err = filepath.WalkDir(d.Root, func(path string, de fs.DirEntry, werr error) error { if werr != nil { return werr } if de.IsDir() { rel, _ := filepath.Rel(d.Root, path) if d.isExcluded(rel, de.Name()) { return filepath.SkipDir } return nil } rel, _ := filepath.Rel(d.Root, path) if d.isExcluded(rel, de.Name()) { return nil } paths = append(paths, path) return nil }) if err != nil { return fmt.Errorf("DirSource: walk: %w", err) } sort.Strings(paths) for _, p := range paths { if err := ctx.Err(); err != nil { return err } if err := d.emitFile(ctx, p, out); err != nil { // Per-file errors are non-fatal: continue walking, but respect ctx. if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) { return err } // Swallow per-file errors; the engine logs elsewhere. continue } } return nil } // isExcluded returns true if either the relative path or the basename matches // any configured glob pattern. func (d *DirSource) isExcluded(rel, base string) bool { rel = filepath.ToSlash(rel) for _, pat := range d.Excludes { pat = filepath.ToSlash(pat) // Match against basename. if ok, _ := filepath.Match(pat, base); ok { return true } // Match against full relative path. if ok, _ := filepath.Match(pat, rel); ok { return true } // `dir/**` style — naive prefix match against the leading segment. if strings.HasSuffix(pat, "/**") { prefix := strings.TrimSuffix(pat, "/**") if rel == prefix || strings.HasPrefix(rel, prefix+"/") { return true } } } return false } // emitFile reads a single file and pushes its chunks onto out. func (d *DirSource) emitFile(ctx context.Context, path string, out chan<- types.Chunk) error { fi, err := os.Stat(path) if err != nil { return err } size := fi.Size() if size == 0 { return nil } var data []byte if size >= MmapThreshold { ra, err := mmap.Open(path) if err != nil { return fmt.Errorf("mmap open %s: %w", path, err) } defer ra.Close() data = make([]byte, ra.Len()) if _, err := ra.ReadAt(data, 0); err != nil { return fmt.Errorf("mmap read %s: %w", path, err) } } else { data, err = os.ReadFile(path) if err != nil { return err } } if isBinary(data) { return nil } return emitChunks(ctx, data, path, d.ChunkSize, out) } // isBinary reports whether the leading BinarySniffSize bytes contain a NUL byte. func isBinary(data []byte) bool { n := len(data) if n > BinarySniffSize { n = BinarySniffSize } return bytes.IndexByte(data[:n], 0x00) >= 0 } // emitChunks is the shared overlapping-chunk emitter used by FileSource and DirSource. func emitChunks(ctx context.Context, data []byte, source string, chunkSize int, out chan<- types.Chunk) error { if chunkSize <= 0 { chunkSize = defaultChunkSize } if len(data) <= chunkSize { select { case <-ctx.Done(): return ctx.Err() case out <- types.Chunk{Data: data, Source: source, Offset: 0}: } return nil } var offset int64 for start := 0; start < len(data); start += chunkSize - chunkOverlap { end := start + chunkSize if end > len(data) { end = len(data) } select { case <-ctx.Done(): return ctx.Err() case out <- types.Chunk{Data: data[start:end], Source: source, Offset: offset}: } offset += int64(end - start) if end == len(data) { break } } return nil } ``` Update `pkg/engine/sources/file.go` so FileSource reuses `emitChunks` and adopts the same mmap threshold for large single-file scans: ```go package sources import ( "context" "os" "golang.org/x/exp/mmap" "github.com/salvacybersec/keyhunter/pkg/types" ) const defaultChunkSize = 4096 const chunkOverlap = 256 // FileSource reads a single file and emits overlapping chunks. // For files >= MmapThreshold it uses golang.org/x/exp/mmap. type FileSource struct { Path string ChunkSize int } func NewFileSource(path string) *FileSource { return &FileSource{Path: path, ChunkSize: defaultChunkSize} } func (f *FileSource) Chunks(ctx context.Context, out chan<- types.Chunk) error { fi, err := os.Stat(f.Path) if err != nil { return err } size := fi.Size() if size == 0 { return nil } var data []byte if size >= MmapThreshold { ra, err := mmap.Open(f.Path) if err != nil { return err } defer ra.Close() data = make([]byte, ra.Len()) if _, err := ra.ReadAt(data, 0); err != nil { return err } } else { data, err = os.ReadFile(f.Path) if err != nil { return err } } if isBinary(data) { return nil } return emitChunks(ctx, data, f.Path, f.ChunkSize, out) } ``` Create `pkg/engine/sources/dir_test.go` with a comprehensive suite: ```go package sources import ( "context" "os" "path/filepath" "strings" "testing" "time" "github.com/stretchr/testify/require" "github.com/salvacybersec/keyhunter/pkg/types" ) func drain(t *testing.T, src Source) []types.Chunk { t.Helper() ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) defer cancel() out := make(chan types.Chunk, 1024) errCh := make(chan error, 1) go func() { errCh <- src.Chunks(ctx, out); close(out) }() var got []types.Chunk for c := range out { got = append(got, c) } require.NoError(t, <-errCh) return got } func writeFile(t *testing.T, path, content string) { t.Helper() require.NoError(t, os.MkdirAll(filepath.Dir(path), 0o755)) require.NoError(t, os.WriteFile(path, []byte(content), 0o644)) } func TestDirSource_RecursiveWalk(t *testing.T) { root := t.TempDir() writeFile(t, filepath.Join(root, "a.txt"), "alpha content") writeFile(t, filepath.Join(root, "sub", "b.txt"), "bravo content") writeFile(t, filepath.Join(root, "sub", "deep", "c.txt"), "charlie content") chunks := drain(t, NewDirSourceRaw(root, nil)) require.Len(t, chunks, 3) sources := make([]string, 0, len(chunks)) for _, c := range chunks { sources = append(sources, c.Source) } // Deterministic sorted order. require.True(t, sort_IsSorted(sources), "emission order must be sorted, got %v", sources) } func sort_IsSorted(s []string) bool { for i := 1; i < len(s); i++ { if s[i-1] > s[i] { return false } } return true } func TestDirSource_DefaultExcludes(t *testing.T) { root := t.TempDir() writeFile(t, filepath.Join(root, "keep.txt"), "keep me") writeFile(t, filepath.Join(root, ".git", "config"), "[core]") writeFile(t, filepath.Join(root, "node_modules", "foo.js"), "x") writeFile(t, filepath.Join(root, "vendor", "bar.go"), "package x") writeFile(t, filepath.Join(root, "app.min.js"), "y") writeFile(t, filepath.Join(root, "app.js.map"), "{}") chunks := drain(t, NewDirSource(root)) require.Len(t, chunks, 1) require.Contains(t, chunks[0].Source, "keep.txt") } func TestDirSource_UserExclude(t *testing.T) { root := t.TempDir() writeFile(t, filepath.Join(root, "keep.txt"), "keep") writeFile(t, filepath.Join(root, "drop.log"), "drop") chunks := drain(t, NewDirSourceRaw(root, []string{"*.log"})) require.Len(t, chunks, 1) require.Contains(t, chunks[0].Source, "keep.txt") } func TestDirSource_BinarySkipped(t *testing.T) { root := t.TempDir() writeFile(t, filepath.Join(root, "text.txt"), "plain text content") binPath := filepath.Join(root, "blob.bin") require.NoError(t, os.WriteFile(binPath, []byte{0x7f, 'E', 'L', 'F', 0x00, 0x01, 0x02}, 0o644)) chunks := drain(t, NewDirSourceRaw(root, nil)) require.Len(t, chunks, 1) require.Contains(t, chunks[0].Source, "text.txt") } func TestDirSource_MmapLargeFile(t *testing.T) { if testing.Short() { t.Skip("skipping large file test in short mode") } root := t.TempDir() big := filepath.Join(root, "big.txt") // Construct a payload slightly above MmapThreshold. payload := strings.Repeat("API_KEY=xxxxxxxxxxxxxxxxxxxx\n", (int(MmapThreshold)/28)+10) require.NoError(t, os.WriteFile(big, []byte(payload), 0o644)) chunks := drain(t, NewDirSourceRaw(root, nil)) // Reconstruct data accounting for chunk overlap. require.NotEmpty(t, chunks) require.Equal(t, big, chunks[0].Source) } func TestDirSource_MissingRoot(t *testing.T) { src := NewDirSourceRaw("/definitely/does/not/exist/keyhunter-xyz", nil) ctx := context.Background() out := make(chan types.Chunk, 1) err := src.Chunks(ctx, out) require.Error(t, err) } func TestDirSource_CtxCancellation(t *testing.T) { root := t.TempDir() for i := 0; i < 50; i++ { writeFile(t, filepath.Join(root, "f", string(rune('a'+i%26))+".txt"), "payload") } ctx, cancel := context.WithCancel(context.Background()) cancel() // pre-cancelled out := make(chan types.Chunk, 1024) err := NewDirSourceRaw(root, nil).Chunks(ctx, out) require.ErrorIs(t, err, context.Canceled) } ``` Also add a minimal update to `pkg/engine/sources/file_test.go` if it exists — if not present, skip. Do NOT alter any other source files in this plan. go test ./pkg/engine/sources/... -run 'TestDirSource|TestFileSource' -race -count=1 - `go build ./pkg/engine/sources/...` exits 0 - `go test ./pkg/engine/sources/... -run TestDirSource -race -count=1` passes all subtests - `grep -n "mmap.Open" pkg/engine/sources/dir.go pkg/engine/sources/file.go` returns two hits - `grep -n "filepath.WalkDir" pkg/engine/sources/dir.go` returns a hit - `grep -n "DefaultExcludes" pkg/engine/sources/dir.go` returns a hit - `grep -n "isBinary" pkg/engine/sources/dir.go` returns a hit DirSource implements Source, walks recursively, honors default and user glob exclusions, skips binary files, and uses mmap above 10MB. FileSource refactored to share the same mmap/emit helpers. All tests green under -race. - `go test ./pkg/engine/sources/... -race -count=1` passes - `go vet ./pkg/engine/sources/...` clean - All acceptance criteria grep matches hit A caller can create `sources.NewDirSource("./myrepo", "*.log")` and receive chunks for every non-excluded, non-binary file in deterministic order, with files >10MB read via mmap. After completion, create `.planning/phases/04-input-sources/04-02-SUMMARY.md` documenting: - File list with line counts - Test names and pass status - Any deviations from the planned exclude semantics (e.g., `**` handling)