keyhunter/.planning/phases/04-input-sources/04-02-PLAN.md
2026-04-06 12:27:23 +03:00


---
phase: 04-input-sources
plan: 02
type: execute
wave: 1
depends_on:
  - 04-01
files_modified:
  - pkg/engine/sources/dir.go
  - pkg/engine/sources/dir_test.go
  - pkg/engine/sources/file.go
  - pkg/engine/sources/file_test.go
autonomous: true
requirements:
  - INPUT-01
  - CORE-07
must_haves:
  truths:
    - "DirSource recursively walks a directory and emits Chunks for every non-excluded file"
    - "Glob exclusion patterns (--exclude) skip matching files by basename AND full relative path"
    - "Default exclusions skip .git/, node_modules/, vendor/, *.min.js, *.map"
    - "Binary files (null byte in first 512 bytes) are skipped"
    - "Files larger than the mmap threshold (10MB) are read via golang.org/x/exp/mmap, smaller files via os.ReadFile"
    - "File emission order is deterministic (sorted) for reproducible tests"
  artifacts:
    - path: "pkg/engine/sources/dir.go"
      provides: "DirSource implementing Source interface for recursive directory scanning"
      exports:
        - DirSource
        - NewDirSource
      min_lines: 120
    - path: "pkg/engine/sources/dir_test.go"
      provides: "Test coverage for recursive walk, exclusion, binary skip, mmap threshold"
      min_lines: 100
    - path: "pkg/engine/sources/file.go"
      provides: "FileSource extended to use mmap for files > 10MB"
      contains: "mmap"
  key_links:
    - from: "pkg/engine/sources/dir.go"
      to: "golang.org/x/exp/mmap"
      via: "mmap.Open for large files"
      pattern: "mmap.Open"
    - from: "pkg/engine/sources/dir.go"
      to: "filepath.WalkDir"
      via: "recursive traversal"
      pattern: "filepath.WalkDir"
    - from: "pkg/engine/sources/dir.go"
      to: "types.Chunk"
      via: "channel send"
      pattern: "out <- types.Chunk"
---

Implement `DirSource` — a recursive directory scanner that walks a root path via `filepath.WalkDir`, honors glob exclusion patterns, detects and skips binary files, and uses memory-mapped I/O for large files. This satisfies INPUT-01 (directory/recursive scanning with exclusions) and CORE-07 (mmap large file reading).

Purpose: The most common scan target is a repo directory, not a single file. This plan replaces the "wrap FileSource per path" hack with a purpose-built recursive source that emits deterministically ordered chunks and reads files at or above the 10 MB threshold through mmap instead of os.ReadFile.

Output: pkg/engine/sources/dir.go, dir_test.go, plus a small file.go update so FileSource shares the chunk-emission and binary-detection helpers and adopts the same mmap threshold.

<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md
@.planning/phases/04-input-sources/04-CONTEXT.md
@pkg/engine/sources/source.go
@pkg/engine/sources/file.go
@pkg/types/chunk.go

Source interface (pkg/engine/sources/source.go):

```go
type Source interface {
	Chunks(ctx context.Context, out chan<- types.Chunk) error
}
```

Chunk type (pkg/types/chunk.go):

type Chunk struct {
    Data   []byte
    Source string
    Offset int64
}

Existing constants in pkg/engine/sources/file.go:

const defaultChunkSize = 4096
const chunkOverlap = 256
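
To make the arithmetic behind these two constants concrete, here is a small sketch that lists the byte ranges an overlapping chunker would cover, assuming each chunk starts `defaultChunkSize - chunkOverlap` (3840) bytes after the previous one. The `window` type and `windows` function are hypothetical helpers, not part of the codebase.

```go
package main

import "fmt"

const (
	defaultChunkSize = 4096
	chunkOverlap     = 256
)

// window is a half-open byte range [Start, End) within a file.
type window struct{ Start, End int }

// windows lists the ranges covered when a file of n bytes is cut into
// chunks of at most size bytes, each sharing overlap bytes with its
// predecessor.
func windows(n, size, overlap int) []window {
	if n <= 0 || size <= overlap {
		return nil // a non-positive step would never advance
	}
	var out []window
	for start := 0; start < n; start += size - overlap {
		end := start + size
		if end > n {
			end = n
		}
		out = append(out, window{start, end})
		if end == n {
			break
		}
	}
	return out
}

func main() {
	// A 10000-byte file yields [0, 4096), [3840, 7936), [7680, 10000).
	for _, w := range windows(10000, defaultChunkSize, chunkOverlap) {
		fmt.Printf("[%d, %d)\n", w.Start, w.End)
	}
}
```

The overlap exists so that a secret straddling a chunk boundary still appears whole in at least one chunk, provided it is no longer than `chunkOverlap` bytes.
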
Task 1: Implement DirSource with recursive walk, exclusion, binary detection, and mmap

- pkg/engine/sources/source.go
- pkg/engine/sources/file.go
- pkg/types/chunk.go
- .planning/phases/04-input-sources/04-CONTEXT.md (Directory/File Scanning section)

pkg/engine/sources/dir.go, pkg/engine/sources/dir_test.go, pkg/engine/sources/file.go

- Test 1: DirSource walks a temp dir containing 3 text files, emits 3 chunks, source fields match file paths
- Test 2: Default exclusions skip `.git/config`, `node_modules/foo.js`, `vendor/bar.go`, `app.min.js`, `app.js.map`
- Test 3: User-supplied exclude pattern `*.log` skips `foo.log` but keeps `foo.txt`
- Test 4: Binary file (first 512 bytes contain a null byte) is skipped; text file is emitted
- Test 5: File >10MB is read via mmap path and emits chunks whose concatenated data equals file content
- Test 6: File emission order is deterministic (sorted lexicographically) across two runs on same dir
- Test 7: ctx cancellation mid-walk returns ctx.Err() promptly
- Test 8: Non-existent root returns an error

Create `pkg/engine/sources/dir.go` with the following complete implementation:

package sources

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
	"strings"

	"golang.org/x/exp/mmap"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

// MmapThreshold is the file size at or above which DirSource/FileSource use memory-mapped reads.
const MmapThreshold int64 = 10 * 1024 * 1024 // 10 MB

// BinarySniffSize is the number of leading bytes inspected for a NUL byte
// to classify a file as binary and skip it.
const BinarySniffSize = 512

// DefaultExcludes are glob patterns excluded from directory scans unless
// the caller passes an empty slice explicitly via NewDirSourceRaw.
var DefaultExcludes = []string{
	".git/**",
	"node_modules/**",
	"vendor/**",
	"*.min.js",
	"*.map",
}

// DirSource walks a directory recursively and emits Chunks for every
// non-excluded, non-binary file it finds. Files of at least MmapThreshold
// bytes are read via mmap; smaller files use os.ReadFile.
type DirSource struct {
	Root      string
	Excludes  []string // glob patterns applied to path basename AND full relative path
	ChunkSize int
}

// NewDirSource creates a DirSource with the default exclusions merged
// with the caller-supplied extras.
func NewDirSource(root string, extraExcludes ...string) *DirSource {
	merged := make([]string, 0, len(DefaultExcludes)+len(extraExcludes))
	merged = append(merged, DefaultExcludes...)
	merged = append(merged, extraExcludes...)
	return &DirSource{Root: root, Excludes: merged, ChunkSize: defaultChunkSize}
}

// NewDirSourceRaw creates a DirSource with ONLY the caller-supplied excludes
// (no defaults). Useful for tests and advanced users.
func NewDirSourceRaw(root string, excludes []string) *DirSource {
	return &DirSource{Root: root, Excludes: excludes, ChunkSize: defaultChunkSize}
}

// Chunks implements Source. It walks d.Root, filters excluded and binary
// files, reads each remaining file (via mmap above MmapThreshold), and
// emits overlapping chunks through out.
func (d *DirSource) Chunks(ctx context.Context, out chan<- types.Chunk) error {
	if d.Root == "" {
		return errors.New("DirSource: Root is empty")
	}
	info, err := os.Stat(d.Root)
	if err != nil {
		return fmt.Errorf("DirSource: stat root: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("DirSource: root %q is not a directory", d.Root)
	}

	// Collect paths first for deterministic ordering across runs.
	var paths []string
	err = filepath.WalkDir(d.Root, func(path string, de fs.DirEntry, werr error) error {
		if werr != nil {
			return werr
		}
		// Compute the relative path once; directories and files share the check.
		rel, rerr := filepath.Rel(d.Root, path)
		if rerr != nil {
			return rerr
		}
		if d.isExcluded(rel, de.Name()) {
			if de.IsDir() {
				return filepath.SkipDir // prune the whole subtree
			}
			return nil
		}
		if !de.IsDir() {
			paths = append(paths, path)
		}
		return nil
	})
	if err != nil {
		return fmt.Errorf("DirSource: walk: %w", err)
	}
	sort.Strings(paths)

	for _, p := range paths {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := d.emitFile(ctx, p, out); err != nil {
			// Per-file errors are non-fatal: continue walking, but respect ctx.
			if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
				return err
			}
			// Swallow per-file errors; the engine logs elsewhere.
			continue
		}
	}
	return nil
}

// isExcluded returns true if either the relative path or the basename matches
// any configured glob pattern.
func (d *DirSource) isExcluded(rel, base string) bool {
	rel = filepath.ToSlash(rel)
	for _, pat := range d.Excludes {
		pat = filepath.ToSlash(pat)
		// Match against basename.
		if ok, _ := filepath.Match(pat, base); ok {
			return true
		}
		// Match against full relative path.
		if ok, _ := filepath.Match(pat, rel); ok {
			return true
		}
		// `dir/**` style — naive prefix match against the leading segment.
		if strings.HasSuffix(pat, "/**") {
			prefix := strings.TrimSuffix(pat, "/**")
			if rel == prefix || strings.HasPrefix(rel, prefix+"/") {
				return true
			}
		}
	}
	return false
}

// emitFile reads a single file and pushes its chunks onto out.
func (d *DirSource) emitFile(ctx context.Context, path string, out chan<- types.Chunk) error {
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	size := fi.Size()
	if size == 0 {
		return nil
	}

	var data []byte
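	// mmap.ReaderAt does not expose the mapped bytes as a slice, so ReadAt
	// copies them into a heap buffer; peak memory stays proportional to the
	// file size.
	if size >= MmapThreshold {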
		ra, err := mmap.Open(path)
		if err != nil {
			return fmt.Errorf("mmap open %s: %w", path, err)
		}
		defer ra.Close()
		data = make([]byte, ra.Len())
		if _, err := ra.ReadAt(data, 0); err != nil {
			return fmt.Errorf("mmap read %s: %w", path, err)
		}
	} else {
		data, err = os.ReadFile(path)
		if err != nil {
			return err
		}
	}

	if isBinary(data) {
		return nil
	}
	return emitChunks(ctx, data, path, d.ChunkSize, out)
}

// isBinary reports whether the leading BinarySniffSize bytes contain a NUL byte.
func isBinary(data []byte) bool {
	n := len(data)
	if n > BinarySniffSize {
		n = BinarySniffSize
	}
	return bytes.IndexByte(data[:n], 0x00) >= 0
}

// emitChunks is the shared overlapping-chunk emitter used by FileSource and DirSource.
func emitChunks(ctx context.Context, data []byte, source string, chunkSize int, out chan<- types.Chunk) error {
	if chunkSize <= 0 {
		chunkSize = defaultChunkSize
	}
	if len(data) <= chunkSize {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- types.Chunk{Data: data, Source: source, Offset: 0}:
		}
		return nil
	}
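	// Each chunk starts chunkSize-chunkOverlap bytes after the previous one,
	// so Offset is the chunk's start position within the file.
	step := chunkSize - chunkOverlap
	if step <= 0 {
		step = chunkSize // guard against a loop that never advances
	}
	for start := 0; start < len(data); start += step {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- types.Chunk{Data: data[start:end], Source: source, Offset: int64(start)}:
		}
		if end == len(data) {
			break
		}
	}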
	return nil
}
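
The exclusion semantics are easy to misread, so the following sketch replays the three checks from `isExcluded` on a few sample paths. It is an approximation under stated assumptions: it uses `path.Match` on slash-separated paths (which matches `filepath.Match` on Unix-like systems), and the standalone `excluded` function is a hypothetical stand-in for the method.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// excluded reports whether rel (slash-separated, relative to the scan root)
// is matched by any pattern, using the same three checks as isExcluded:
// basename match, full relative path match, and a "dir/**" prefix match.
func excluded(patterns []string, rel string) bool {
	base := path.Base(rel)
	for _, pat := range patterns {
		if ok, _ := path.Match(pat, base); ok {
			return true
		}
		if ok, _ := path.Match(pat, rel); ok {
			return true
		}
		if strings.HasSuffix(pat, "/**") {
			prefix := strings.TrimSuffix(pat, "/**")
			if rel == prefix || strings.HasPrefix(rel, prefix+"/") {
				return true
			}
		}
	}
	return false
}

func main() {
	pats := []string{".git/**", "node_modules/**", "vendor/**", "*.min.js", "*.map"}
	for _, rel := range []string{
		"src/app.min.js",          // basename matches *.min.js
		"vendor/lib/x.go",         // prefix match on vendor/
		"web/node_modules/foo.js", // nested node_modules is NOT matched
		"main.go",
	} {
		fmt.Printf("%-26s excluded=%v\n", rel, excluded(pats, rel))
	}
}
```

Because the `dir/**` check only compares against the leading path segment, a nested `node_modules` or `vendor` directory is still scanned. This is the kind of deviation from the planned exclude semantics that the summary should record.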

Update pkg/engine/sources/file.go so FileSource reuses emitChunks and adopts the same mmap threshold for large single-file scans:

package sources

import (
	"context"
	"os"

	"golang.org/x/exp/mmap"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

const defaultChunkSize = 4096
const chunkOverlap = 256

// FileSource reads a single file and emits overlapping chunks.
// For files >= MmapThreshold it uses golang.org/x/exp/mmap.
type FileSource struct {
	Path      string
	ChunkSize int
}

func NewFileSource(path string) *FileSource {
	return &FileSource{Path: path, ChunkSize: defaultChunkSize}
}

func (f *FileSource) Chunks(ctx context.Context, out chan<- types.Chunk) error {
	fi, err := os.Stat(f.Path)
	if err != nil {
		return err
	}
	size := fi.Size()
	if size == 0 {
		return nil
	}
	var data []byte
	if size >= MmapThreshold {
		ra, err := mmap.Open(f.Path)
		if err != nil {
			return err
		}
		defer ra.Close()
		data = make([]byte, ra.Len())
		if _, err := ra.ReadAt(data, 0); err != nil {
			return err
		}
	} else {
		data, err = os.ReadFile(f.Path)
		if err != nil {
			return err
		}
	}
	if isBinary(data) {
		return nil
	}
	return emitChunks(ctx, data, f.Path, f.ChunkSize, out)
}

Create pkg/engine/sources/dir_test.go with a comprehensive suite:

package sources

import (
	"context"
	"os"
	"path/filepath"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/require"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

func drain(t *testing.T, src Source) []types.Chunk {
	t.Helper()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	out := make(chan types.Chunk, 1024)
	errCh := make(chan error, 1)
	go func() { errCh <- src.Chunks(ctx, out); close(out) }()
	var got []types.Chunk
	for c := range out {
		got = append(got, c)
	}
	require.NoError(t, <-errCh)
	return got
}

func writeFile(t *testing.T, path, content string) {
	t.Helper()
	require.NoError(t, os.MkdirAll(filepath.Dir(path), 0o755))
	require.NoError(t, os.WriteFile(path, []byte(content), 0o644))
}

func TestDirSource_RecursiveWalk(t *testing.T) {
	root := t.TempDir()
	writeFile(t, filepath.Join(root, "a.txt"), "alpha content")
	writeFile(t, filepath.Join(root, "sub", "b.txt"), "bravo content")
	writeFile(t, filepath.Join(root, "sub", "deep", "c.txt"), "charlie content")

	chunks := drain(t, NewDirSourceRaw(root, nil))
	require.Len(t, chunks, 3)

	sources := make([]string, 0, len(chunks))
	for _, c := range chunks {
		sources = append(sources, c.Source)
	}
	// Deterministic sorted order.
	require.True(t, isSorted(sources), "emission order must be sorted, got %v", sources)
}

// isSorted reports whether s is in non-decreasing lexicographic order.
func isSorted(s []string) bool {
	for i := 1; i < len(s); i++ {
		if s[i-1] > s[i] {
			return false
		}
	}
	return true
}

func TestDirSource_DefaultExcludes(t *testing.T) {
	root := t.TempDir()
	writeFile(t, filepath.Join(root, "keep.txt"), "keep me")
	writeFile(t, filepath.Join(root, ".git", "config"), "[core]")
	writeFile(t, filepath.Join(root, "node_modules", "foo.js"), "x")
	writeFile(t, filepath.Join(root, "vendor", "bar.go"), "package x")
	writeFile(t, filepath.Join(root, "app.min.js"), "y")
	writeFile(t, filepath.Join(root, "app.js.map"), "{}")

	chunks := drain(t, NewDirSource(root))
	require.Len(t, chunks, 1)
	require.Contains(t, chunks[0].Source, "keep.txt")
}

func TestDirSource_UserExclude(t *testing.T) {
	root := t.TempDir()
	writeFile(t, filepath.Join(root, "keep.txt"), "keep")
	writeFile(t, filepath.Join(root, "drop.log"), "drop")

	chunks := drain(t, NewDirSourceRaw(root, []string{"*.log"}))
	require.Len(t, chunks, 1)
	require.Contains(t, chunks[0].Source, "keep.txt")
}

func TestDirSource_BinarySkipped(t *testing.T) {
	root := t.TempDir()
	writeFile(t, filepath.Join(root, "text.txt"), "plain text content")
	binPath := filepath.Join(root, "blob.bin")
	require.NoError(t, os.WriteFile(binPath, []byte{0x7f, 'E', 'L', 'F', 0x00, 0x01, 0x02}, 0o644))

	chunks := drain(t, NewDirSourceRaw(root, nil))
	require.Len(t, chunks, 1)
	require.Contains(t, chunks[0].Source, "text.txt")
}

func TestDirSource_MmapLargeFile(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping large file test in short mode")
	}
	root := t.TempDir()
	big := filepath.Join(root, "big.txt")
	// Construct a payload slightly above MmapThreshold.
	payload := strings.Repeat("API_KEY=xxxxxxxxxxxxxxxxxxxx\n", (int(MmapThreshold)/28)+10)
	require.NoError(t, os.WriteFile(big, []byte(payload), 0o644))

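	chunks := drain(t, NewDirSourceRaw(root, nil))
	require.NotEmpty(t, chunks)
	require.Equal(t, big, chunks[0].Source)

	// Reconstruct the data: every chunk after the first repeats the last
	// chunkOverlap bytes of its predecessor, so drop that prefix.
	got := append([]byte(nil), chunks[0].Data...)
	for _, c := range chunks[1:] {
		got = append(got, c.Data[chunkOverlap:]...)
	}
	require.True(t, string(got) == payload, "reassembled data must equal file content")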
}

func TestDirSource_MissingRoot(t *testing.T) {
	src := NewDirSourceRaw("/definitely/does/not/exist/keyhunter-xyz", nil)
	ctx := context.Background()
	out := make(chan types.Chunk, 1)
	err := src.Chunks(ctx, out)
	require.Error(t, err)
}

func TestDirSource_CtxCancellation(t *testing.T) {
	root := t.TempDir()
	for i := 0; i < 50; i++ {
		writeFile(t, filepath.Join(root, "f", string(rune('a'+i%26))+".txt"), "payload")
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // pre-cancelled
	out := make(chan types.Chunk, 1024)
	err := NewDirSourceRaw(root, nil).Chunks(ctx, out)
	require.ErrorIs(t, err, context.Canceled)
}

Also add a minimal update to pkg/engine/sources/file_test.go if it exists; if not present, skip. Do NOT alter any other source files in this plan.

go test ./pkg/engine/sources/... -run 'TestDirSource|TestFileSource' -race -count=1

<acceptance_criteria>
- go build ./pkg/engine/sources/... exits 0
- go test ./pkg/engine/sources/... -run TestDirSource -race -count=1 passes all subtests
- grep -n "mmap.Open" pkg/engine/sources/dir.go pkg/engine/sources/file.go returns two hits
- grep -n "filepath.WalkDir" pkg/engine/sources/dir.go returns a hit
- grep -n "DefaultExcludes" pkg/engine/sources/dir.go returns a hit
- grep -n "isBinary" pkg/engine/sources/dir.go returns a hit
</acceptance_criteria>

DirSource implements Source, walks recursively, honors default and user glob exclusions, skips binary files, and uses mmap above 10MB. FileSource refactored to share the same mmap/emit helpers. All tests green under -race.

- `go test ./pkg/engine/sources/... -race -count=1` passes
- `go vet ./pkg/engine/sources/...` clean
- All acceptance criteria grep matches hit

<success_criteria> A caller can create sources.NewDirSource("./myrepo", "*.log") and receive chunks for every non-excluded, non-binary file in deterministic order, with files >10MB read via mmap. </success_criteria>

After completion, create `.planning/phases/04-input-sources/04-02-SUMMARY.md` documenting:

- File list with line counts
- Test names and pass status
- Any deviations from the planned exclude semantics (e.g., `**` handling)