---
phase: 01-foundation
plan: 04
type: execute
wave: 2
depends_on: [01-02]
files_modified:
  - pkg/types/chunk.go
  - pkg/engine/finding.go
  - pkg/engine/entropy.go
  - pkg/engine/filter.go
  - pkg/engine/detector.go
  - pkg/engine/engine.go
  - pkg/engine/sources/source.go
  - pkg/engine/sources/file.go
  - pkg/engine/scanner_test.go
autonomous: true
requirements: [CORE-01, CORE-04, CORE-05, CORE-06]
must_haves:
  truths:
    - "Shannon entropy function returns expected values for known inputs"
    - "Aho-Corasick pre-filter passes chunks containing provider keywords and drops those without"
    - "Detector correctly identifies OpenAI and Anthropic key patterns in test fixtures via regex"
    - "Full scan pipeline: scan testdata/samples/openai_key.txt → Finding with ProviderName==openai"
    - "Full scan pipeline: scan testdata/samples/no_keys.txt → zero findings"
    - "Worker pool uses ants v2 with configurable worker count"
  artifacts:
    - path: "pkg/types/chunk.go"
      provides: "Chunk struct (Data []byte, Source string, Offset int64) — shared by engine and sources packages"
      exports: ["Chunk"]
    - path: "pkg/engine/finding.go"
      provides: "Finding struct (provider, key value, masked, confidence, source, line)"
      exports: ["Finding", "MaskKey"]
    - path: "pkg/engine/entropy.go"
      provides: "Shannon(s string) float64 — ~10 line stdlib math implementation"
      exports: ["Shannon"]
    - path: "pkg/engine/filter.go"
      provides: "KeywordFilter stage — runs Aho-Corasick and passes/drops chunks"
      exports: ["KeywordFilter"]
    - path: "pkg/engine/detector.go"
      provides: "Detector stage — applies provider regexps and entropy check to chunks"
      exports: ["Detect"]
    - path: "pkg/engine/engine.go"
      provides: "Engine struct with Scan(ctx, src, cfg) <-chan Finding"
      exports: ["Engine", "NewEngine", "ScanConfig"]
    - path: "pkg/engine/sources/source.go"
      provides: "Source interface with Chunks(ctx, chan<- types.Chunk) error"
      exports: ["Source"]
    - path: "pkg/engine/sources/file.go"
      provides: "FileSource implementing Source for single-file scanning"
      exports: ["FileSource", "NewFileSource"]
  key_links:
    - from: "pkg/engine/engine.go"
      to: "pkg/providers/registry.go"
      via: "Engine holds *providers.Registry, uses Registry.AC() for pre-filter"
      pattern: "providers\\.Registry"
    - from: "pkg/engine/filter.go"
      to: "github.com/petar-dambovaliev/aho-corasick"
      via: "AC.FindAll() on each chunk"
      pattern: "FindAll"
    - from: "pkg/engine/detector.go"
      to: "pkg/engine/entropy.go"
      via: "Shannon() called when EntropyMin > 0 in pattern"
      pattern: "Shannon"
    - from: "pkg/engine/engine.go"
      to: "github.com/panjf2000/ants/v2"
      via: "ants.NewPool for detector workers"
      pattern: "ants\\.NewPool"
    - from: "pkg/engine/sources/source.go"
      to: "pkg/types/chunk.go"
      via: "Source interface uses types.Chunk — avoids circular import with pkg/engine"
      pattern: "types\\.Chunk"
---

Build the three-stage scanning engine pipeline: Aho-Corasick keyword pre-filter, regex + entropy detector workers using an ants goroutine pool, and a FileSource adapter. Wire them together in an Engine that emits Findings on a channel.

Purpose: The scan engine is the core differentiator. Plans 02 and 03 provide its dependencies (Registry for patterns + keywords, storage types for Finding). The CLI (Plan 05) calls Engine.Scan() to implement `keyhunter scan`.

Output: pkg/types/chunk.go, pkg/engine/{finding,entropy,filter,detector,engine}.go, and sources/{source,file}.go. scanner_test.go stubs filled.

NOTE on CORE-07 (mmap large file reading): FileSource uses os.ReadFile() in Phase 1, which is sufficient for the test fixtures. mmap-based reading for files > 10MB is deferred to Phase 4 (Input Sources), where it belongs architecturally alongside all other source adapter work.

@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-02-SUMMARY.md

The sources sub-package (pkg/engine/sources) needs the Chunk type.
If Chunk were defined in pkg/engine, then sources would import engine, and engine imports sources (for the Source interface) — a circular import that Go will refuse to compile.

Resolution: define Chunk in pkg/types, a shared, import-free package:

- pkg/types/chunk.go — defines types.Chunk
- pkg/engine/sources — imports pkg/types (no circular dep)
- pkg/engine — imports pkg/types and pkg/engine/sources (no circular dep)

The providers package (Plan 02 dependency) exposes:

```go
package providers

type Provider struct {
	Name     string
	Keywords []string
	Patterns []Pattern
	Tier     int
}

type Pattern struct {
	Regex      string
	EntropyMin float64
	Confidence string
}

type Registry struct{ /* ... */ }

func (r *Registry) List() []Provider
func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built Aho-Corasick
```

Pipeline channels and stages:

```
chunksChan     chan types.Chunk  (buffer: 1000)
detectableChan chan types.Chunk  (buffer: 500)
resultsChan    chan Finding      (buffer: 100)

Stage 1: Source.Chunks() → chunksChan               (goroutine, closes chan on done)
Stage 2: KeywordFilter(chunksChan) → detectableChan (goroutine, AC.FindAll)
Stage 3: N detector workers (ants pool) → resultsChan
```

Core types:

```go
type ScanConfig struct {
	Workers int  // default: runtime.NumCPU() * 8
	Verify  bool // Phase 5 — always false in Phase 1
	Unmask  bool // for output layer
}

type Source interface {
	Chunks(ctx context.Context, out chan<- types.Chunk) error
}

type FileSource struct {
	Path      string
	ChunkSize int // bytes per chunk, default 4096
}
```

Chunking strategy: read the file in chunks of ChunkSize bytes with an overlap of max(256, maxPatternLen) to avoid splitting a key across chunk boundaries.

Dependencies:

```go
import ahocorasick "github.com/petar-dambovaliev/aho-corasick"
// ac.FindAll(s string) []ahocorasick.Match — returns match positions

import "github.com/panjf2000/ants/v2"
// pool, _ := ants.NewPool(workers, ants.WithOptions(...))
// pool.Submit(func() { ... })
// pool.ReleaseTimeout(timeout)
```

**Task 1: Shared types package, Finding, and Shannon entropy function**

Files: pkg/types/chunk.go, pkg/engine/finding.go, pkg/engine/entropy.go

Context:
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CORE-04 row: Shannon entropy, ~10-line stdlib function, threshold 3.5 bits/char)
- /home/salva/Documents/apikey/pkg/storage/findings.go (Finding and MaskKey defined there — engine.Finding is a separate type for the pipeline)

Tests:
- Test 1: Shannon("aaaaaaa") → value near 0.0 (all same characters, no entropy)
- Test 2: Shannon("abcdefgh") → value near 3.0 (8 distinct chars)
- Test 3: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") → >= 3.5 (real key entropy)
- Test 4: Shannon("") → 0.0 (empty string)
- Test 5: MaskKey("sk-proj-abc1234") → "sk-proj-...1234" (first 8 + last 4)
- Test 6: MaskKey("abc") → "****" (too short to mask)

Create **pkg/types/chunk.go** — the shared type that breaks the circular import:

```go
package types

// Chunk is a segment of file content passed through the scanning pipeline.
// Defined in pkg/types (not pkg/engine) so that pkg/engine/sources can use it
// without creating a circular import with pkg/engine.
type Chunk struct {
	Data   []byte // raw bytes
	Source string // file path, URL, or description
	Offset int64  // byte offset of this chunk within the source
}
```

Create **pkg/engine/finding.go**:

```go
package engine

import "time"

// Finding represents a detected API key from the scanning pipeline.
// KeyValue holds the plaintext key — the storage layer encrypts it before persisting.
type Finding struct {
	ProviderName string
	KeyValue     string // full plaintext key
	KeyMasked    string // first8...last4
	Confidence   string // "high", "medium", "low"
	Source       string // file path or description
	SourceType   string // "file", "dir", "git", "stdin", "url"
	LineNumber   int
	Offset       int64
	DetectedAt   time.Time
}

// MaskKey returns a masked representation: first 8 chars + "..." + last 4 chars.
// Returns "****" if the key is shorter than 12 characters.
func MaskKey(key string) string {
	if len(key) < 12 {
		return "****"
	}
	return key[:8] + "..." + key[len(key)-4:]
}
```

Create **pkg/engine/entropy.go**:

```go
package engine

import "math"

// Shannon computes the Shannon entropy of a string in bits per character.
// Returns 0.0 for empty strings.
// A value >= 3.5 indicates high randomness, consistent with real API keys.
func Shannon(s string) float64 {
	if len(s) == 0 {
		return 0.0
	}
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}
```

Verify:

```bash
cd /home/salva/Documents/apikey && go build ./pkg/types/... && go build ./pkg/engine/... && echo "BUILD OK"
```

Done when:
- `go build ./pkg/types/...` exits 0
- `go build ./pkg/engine/...` exits 0
- pkg/types/chunk.go exports Chunk with fields Data, Source, Offset
- pkg/engine/finding.go exports Finding and MaskKey
- pkg/engine/entropy.go exports Shannon using math.Log2
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- MaskKey("sk-proj-abc1234") produces "sk-proj-...1234"

Outcome: pkg/types/Chunk exists (no imports, no circular dependency risk); Finding, MaskKey, and Shannon exist and compile.
**Task 2: Pipeline stages, engine orchestration, FileSource, and filled test stubs**

Files: pkg/engine/filter.go, pkg/engine/detector.go, pkg/engine/engine.go, pkg/engine/sources/source.go, pkg/engine/sources/file.go, pkg/engine/scanner_test.go

Context:
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 2: Three-Stage Scanning Pipeline — exact channel-based code example)
- /home/salva/Documents/apikey/pkg/types/chunk.go
- /home/salva/Documents/apikey/pkg/engine/chunk.go (if exists — use pkg/types/chunk.go instead)
- /home/salva/Documents/apikey/pkg/engine/finding.go
- /home/salva/Documents/apikey/pkg/engine/entropy.go
- /home/salva/Documents/apikey/pkg/providers/registry.go (Registry.AC() and Registry.List() signatures)

Tests:
- Test 1: Scan testdata/samples/openai_key.txt → 1 finding, ProviderName=="openai", KeyValue contains "sk-proj-"
- Test 2: Scan testdata/samples/anthropic_key.txt → 1 finding, ProviderName=="anthropic"
- Test 3: Scan testdata/samples/no_keys.txt → 0 findings
- Test 4: Scan testdata/samples/multiple_keys.txt → 2 findings (openai + anthropic)
- Test 5: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") >= 3.5 (entropy check)
- Test 6: KeywordFilter drops a chunk with text "hello world" (no provider keywords)

Create **pkg/engine/sources/source.go**:

```go
package sources

import (
	"context"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

// Source is the interface all input adapters must implement.
// Chunks writes content segments to the out channel until the source is exhausted or ctx is cancelled.
// NOTE: Source is defined in the sources sub-package (not pkg/engine) and uses pkg/types.Chunk
// to avoid a circular import: engine → sources → engine.
type Source interface {
	Chunks(ctx context.Context, out chan<- types.Chunk) error
}
```

Create **pkg/engine/sources/file.go**:

```go
package sources

import (
	"context"
	"os"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

const defaultChunkSize = 4096
const chunkOverlap = 256 // overlap between chunks to avoid splitting keys at boundaries

// FileSource reads a single file and emits overlapping chunks.
type FileSource struct {
	Path      string
	ChunkSize int
}

// NewFileSource creates a FileSource for the given path with the default chunk size.
func NewFileSource(path string) *FileSource {
	return &FileSource{Path: path, ChunkSize: defaultChunkSize}
}

// Chunks reads the file in overlapping segments and sends each chunk to out.
// Uses os.ReadFile for simplicity in Phase 1. mmap for files > 10MB is implemented
// in Phase 4 (Input Sources) alongside all other source adapter enhancements.
func (f *FileSource) Chunks(ctx context.Context, out chan<- types.Chunk) error {
	data, err := os.ReadFile(f.Path)
	if err != nil {
		return err
	}

	size := f.ChunkSize
	if size <= 0 {
		size = defaultChunkSize
	}

	if len(data) <= size {
		// File fits in one chunk
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- types.Chunk{Data: data, Source: f.Path, Offset: 0}:
		}
		return nil
	}

	// Emit overlapping chunks
	step := size - chunkOverlap
	if step <= 0 {
		step = size // guard: a ChunkSize <= chunkOverlap would otherwise loop forever
	}
	for start := 0; start < len(data); start += step {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		chunk := types.Chunk{
			Data:   data[start:end],
			Source: f.Path,
			Offset: int64(start), // true byte offset of this chunk within the file
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- chunk:
		}
		if end == len(data) {
			break
		}
	}
	return nil
}
```

Create **pkg/engine/filter.go**:

```go
package engine

import (
	ahocorasick "github.com/petar-dambovaliev/aho-corasick"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

// KeywordFilter filters a stream of chunks using an Aho-Corasick automaton.
// Only chunks that contain at least one provider keyword are sent to out.
// This is Stage 2 of the pipeline (runs after Source, before Detector).
func KeywordFilter(ac ahocorasick.AhoCorasick, in <-chan types.Chunk, out chan<- types.Chunk) {
	for chunk := range in {
		if len(ac.FindAll(string(chunk.Data))) > 0 {
			out <- chunk
		}
	}
}
```

Create **pkg/engine/detector.go**:

```go
package engine

import (
	"regexp"
	"strings"
	"time"

	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/types"
)

// Detect applies provider regex patterns and optional entropy checks to a chunk.
// It returns all findings from the chunk.
// NOTE: patterns are compiled per call in Phase 1; caching compiled regexps is a
// later optimization.
func Detect(chunk types.Chunk, providerList []providers.Provider) []Finding {
	var findings []Finding
	content := string(chunk.Data)

	for _, p := range providerList {
		for _, pat := range p.Patterns {
			re, err := regexp.Compile(pat.Regex)
			if err != nil {
				continue // invalid regex — skip silently
			}
			for _, match := range re.FindAllString(content, -1) {
				// Apply entropy check if a threshold is set
				if pat.EntropyMin > 0 && Shannon(match) < pat.EntropyMin {
					continue // entropy too low — likely a placeholder
				}
				findings = append(findings, Finding{
					ProviderName: p.Name,
					KeyValue:     match,
					KeyMasked:    MaskKey(match),
					Confidence:   pat.Confidence,
					Source:       chunk.Source,
					SourceType:   "file",
					LineNumber:   lineNumber(content, match),
					Offset:       chunk.Offset,
					DetectedAt:   time.Now(),
				})
			}
		}
	}
	return findings
}

// lineNumber returns the 1-based line number where match first appears in content
// (relative to the start of the chunk).
func lineNumber(content, match string) int {
	idx := strings.Index(content, match)
	if idx < 0 {
		return 0
	}
	return strings.Count(content[:idx], "\n") + 1
}
```

Create **pkg/engine/engine.go**:

```go
package engine

import (
	"context"
	"runtime"
	"sync"
	"time"

	"github.com/panjf2000/ants/v2"

	"github.com/salvacybersec/keyhunter/pkg/engine/sources"
	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/types"
)

// ScanConfig controls scan execution parameters.
type ScanConfig struct {
	Workers int  // number of detector goroutines; defaults to runtime.NumCPU() * 8
	Verify  bool // opt-in active verification (Phase 5)
	Unmask  bool // include full key in Finding.KeyValue
}

// Engine orchestrates the three-stage scanning pipeline.
type Engine struct {
	registry *providers.Registry
}

// NewEngine creates an Engine backed by the given provider registry.
func NewEngine(registry *providers.Registry) *Engine {
	return &Engine{registry: registry}
}

// Scan runs the three-stage pipeline against src and returns a channel of Findings.
// The channel is closed when all chunks have been processed.
// The caller must drain the channel fully or cancel ctx to avoid goroutine leaks.
func (e *Engine) Scan(ctx context.Context, src sources.Source, cfg ScanConfig) (<-chan Finding, error) {
	workers := cfg.Workers
	if workers <= 0 {
		workers = runtime.NumCPU() * 8
	}

	// Create the pool before launching any stage so a pool error cannot leak goroutines.
	pool, err := ants.NewPool(workers)
	if err != nil {
		return nil, err
	}

	chunksChan := make(chan types.Chunk, 1000)
	detectableChan := make(chan types.Chunk, 500)
	resultsChan := make(chan Finding, 100)

	// Stage 1: source → chunksChan
	go func() {
		defer close(chunksChan)
		_ = src.Chunks(ctx, chunksChan)
	}()

	// Stage 2: keyword pre-filter → detectableChan
	go func() {
		defer close(detectableChan)
		KeywordFilter(e.registry.AC(), chunksChan, detectableChan)
	}()

	// Stage 3: detector workers → resultsChan
	providerList := e.registry.List()
	var wg sync.WaitGroup

	go func() {
		defer func() {
			wg.Wait()
			close(resultsChan)
			_ = pool.ReleaseTimeout(5 * time.Second)
		}()
		for chunk := range detectableChan {
			c := chunk // capture loop variable (required before Go 1.22)
			wg.Add(1)
			if err := pool.Submit(func() {
				defer wg.Done()
				for _, f := range Detect(c, providerList) {
					select {
					case resultsChan <- f:
					case <-ctx.Done():
						return
					}
				}
			}); err != nil {
				wg.Done() // task was never scheduled
			}
		}
	}()

	return resultsChan, nil
}
```

Fill **pkg/engine/scanner_test.go** (replacing stubs from Plan 01):

```go
package engine_test

import (
	"context"
	"testing"

	"github.com/salvacybersec/keyhunter/pkg/engine"
	"github.com/salvacybersec/keyhunter/pkg/engine/sources"
	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func newTestRegistry(t *testing.T) *providers.Registry {
	t.Helper()
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	return reg
}

func TestShannonEntropy(t *testing.T) {
	assert.InDelta(t, 0.0, engine.Shannon("aaaaaaa"), 0.01)
	assert.Greater(t, engine.Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr"), 3.5)
	assert.Equal(t, 0.0, engine.Shannon(""))
}

func TestKeywordPreFilter(t *testing.T) {
	reg := newTestRegistry(t)
	ac := reg.AC()

	// Chunk with OpenAI keyword should pass
	matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-test")
	assert.NotEmpty(t, matches)

	// Chunk with no keywords should be dropped
	noMatches := ac.FindAll("hello world no secrets here")
	assert.Empty(t, noMatches)
}

func TestScannerPipelineOpenAI(t *testing.T) {
	reg := newTestRegistry(t)
	eng := engine.NewEngine(reg)
	src := sources.NewFileSource("../../testdata/samples/openai_key.txt")
	cfg := engine.ScanConfig{Workers: 2}

	ch, err := eng.Scan(context.Background(), src, cfg)
	require.NoError(t, err)

	var findings []engine.Finding
	for f := range ch {
		findings = append(findings, f)
	}

	require.Len(t, findings, 1, "expected exactly 1 finding in openai_key.txt")
	assert.Equal(t, "openai", findings[0].ProviderName)
	assert.Contains(t, findings[0].KeyValue, "sk-proj-")
}

func TestScannerPipelineNoKeys(t *testing.T) {
	reg := newTestRegistry(t)
	eng := engine.NewEngine(reg)
	src := sources.NewFileSource("../../testdata/samples/no_keys.txt")
	cfg := engine.ScanConfig{Workers: 2}

	ch, err := eng.Scan(context.Background(), src, cfg)
	require.NoError(t, err)

	var findings []engine.Finding
	for f := range ch {
		findings = append(findings, f)
	}

	assert.Empty(t, findings, "expected zero findings in no_keys.txt")
}

func TestScannerPipelineMultipleKeys(t *testing.T) {
	reg := newTestRegistry(t)
	eng := engine.NewEngine(reg)
	src := sources.NewFileSource("../../testdata/samples/multiple_keys.txt")
	cfg := engine.ScanConfig{Workers: 2}

	ch, err := eng.Scan(context.Background(), src, cfg)
	require.NoError(t, err)

	var findings []engine.Finding
	for f := range ch {
		findings = append(findings, f)
	}

	assert.GreaterOrEqual(t, len(findings), 2, "expected at least 2 findings in multiple_keys.txt")

	var names []string
	for _, f := range findings {
		names = append(names, f.ProviderName)
	}
	assert.Contains(t, names, "openai")
	assert.Contains(t, names, "anthropic")
}
```

Verify:

```bash
cd /home/salva/Documents/apikey && go test ./pkg/engine/... -v -count=1 2>&1 | tail -30
```

Done when:
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS (no SKIP)
- `go build ./...` exits 0 with no circular import errors
- TestShannonEntropy passes — 0.0 for "aaaaaaa", >= 3.5 for real key pattern
- TestKeywordPreFilter passes — AC matches sk-proj-, empty for "hello world"
- TestScannerPipelineOpenAI passes — 1 finding with ProviderName=="openai"
- TestScannerPipelineNoKeys passes — 0 findings
- TestScannerPipelineMultipleKeys passes — >= 2 findings with both provider names
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'KeywordFilter' pkg/engine/engine.go` exits 0
- pkg/types/chunk.go exists and pkg/engine/sources imports pkg/types (not pkg/engine)

Outcome: three-stage scanning pipeline works end-to-end: FileSource → KeywordFilter (AC) → Detect (regex + entropy) → Finding channel. Circular import resolved via pkg/types. All engine tests pass.

After both tasks:
- `go build ./...` exits 0 with zero circular import errors
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- `grep -rq 'pkg/types' pkg/engine/sources/source.go` exits 0 (sources imports types, not engine)
- Scanning testdata/samples/openai_key.txt returns 1 finding with provider "openai"
- Scanning testdata/samples/no_keys.txt returns 0 findings

Delivered:
- Three-stage pipeline: AC pre-filter → regex + entropy detector → results channel (CORE-01, CORE-06)
- Shannon entropy function using stdlib math (CORE-04)
- ants v2 goroutine pool with configurable worker count (CORE-05)
- FileSource adapter reading files in overlapping chunks using os.ReadFile (mmap deferred to Phase 4)
- pkg/types/Chunk breaks the engine↔sources circular import
- All engine tests pass against real testdata fixtures

After completion, create `.planning/phases/01-foundation/01-04-SUMMARY.md` following the summary template.
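As a closing sanity check on the chunking strategy described earlier, the overlap behavior can be exercised in isolation. This is a hypothetical standalone sketch (`chunkify` is an illustrative helper, not part of the codebase) using the Phase 1 defaults of 4096-byte chunks with a fixed 256-byte overlap, showing that a key straddling a chunk boundary still appears intact in at least one chunk:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkify mirrors FileSource's loop: fixed-size chunks advancing by
// (size - overlap) bytes, so any substring shorter than the overlap
// appears whole in at least one chunk.
func chunkify(data []byte, size, overlap int) [][]byte {
	step := size - overlap
	if step <= 0 {
		step = size // guard against a non-positive step
	}
	var chunks [][]byte
	for start := 0; start < len(data); start += step {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		chunks = append(chunks, data[start:end])
		if end == len(data) {
			break
		}
	}
	return chunks
}

func main() {
	// Place a fake 51-char key straddling the first 4096-byte boundary.
	key := "sk-proj-" + strings.Repeat("A", 43)
	data := make([]byte, 8192)
	for i := range data {
		data[i] = 'x'
	}
	copy(data[4070:], key) // key occupies bytes 4070..4120, crossing offset 4096

	found := 0
	for _, c := range chunkify(data, 4096, 256) {
		if strings.Contains(string(c), key) {
			found++
		}
	}
	fmt.Println(found > 0) // true — the overlapping middle chunk holds the whole key
}
```

Note that a key inside the overlap region can be reported from two adjacent chunks; with the test fixtures (which fit in a single chunk) this duplication never occurs, and deduplication can be handled alongside the Phase 4 source work if needed.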