---
phase: 01-foundation
plan: 04
subsystem: engine
tags: [scanning, aho-corasick, entropy, regex, ants, goroutine-pool, pipeline]

requires:
  - phase: 01-foundation-02
    provides: "Provider Registry with AC() automaton and List() for pattern matching"
provides:
  - "Three-stage scanning pipeline: AC pre-filter, regex+entropy detector, results channel"
  - "Engine.Scan(ctx, source, config) -> <-chan Finding"
  - "Source interface for input adapters"
  - "FileSource for single-file scanning"
  - "Shannon entropy function"
  - "pkg/types.Chunk shared type breaking circular imports"
affects: [cli-scan, input-sources, verification, output-formats]

tech-stack:
  added: [ants/v2, pkg/types]
  patterns: [three-stage-channel-pipeline, goroutine-pool-with-waitgroup, overlapping-chunk-reads]

key-files:
  created:
    - pkg/types/chunk.go
    - pkg/engine/finding.go
    - pkg/engine/entropy.go
    - pkg/engine/filter.go
    - pkg/engine/detector.go
    - pkg/engine/sources/source.go
    - pkg/engine/sources/file.go
  modified:
    - pkg/engine/engine.go
    - pkg/engine/scanner_test.go
    - testdata/samples/anthropic_key.txt
    - testdata/samples/multiple_keys.txt

key-decisions:
  - "pkg/types/chunk.go breaks engine<->sources circular import"
  - "ants pool with sync.WaitGroup+Mutex for detector stage coordination"
  - "FileSource uses os.ReadFile with 256-byte chunk overlap; mmap deferred to Phase 4"
  - "Pool.Release() used instead of ReleaseWithTimeout (not in ants/v2 API)"

patterns-established:
  - "Three-stage channel pipeline: Source->KeywordFilter->Detect->resultsChan"
  - "Shared types in pkg/types to avoid circular imports between engine and sources"
  - "Overlapping chunks (256 bytes) to prevent key splitting at boundaries"

requirements-completed: [CORE-01, CORE-04, CORE-05, CORE-06]

duration: 5min
completed: 2026-04-05
---

# Phase 1 Plan 4: Scan Engine Summary

**Three-stage scanning pipeline with Aho-Corasick pre-filter, regex+entropy detection via ants goroutine pool, and FileSource adapter**

## Performance

- **Duration:** 5 min
- **Started:** 2026-04-05T09:16:37Z
- **Completed:** 2026-04-05T09:21:30Z
- **Tasks:** 2
- **Files modified:** 12

## Accomplishments

- Three-stage pipeline (AC keyword filter -> regex+entropy detector -> results channel) working end-to-end
- Shannon entropy function correctly discriminates real keys (>= 3.5 bits/char) from low-entropy strings
- ants v2 goroutine pool with configurable worker count for parallel detection
- FileSource with overlapping chunk reads preventing key splitting at boundaries
- All 12 engine tests pass including pipeline integration tests against real testdata

## Task Commits

Each task was committed atomically:

1. **Task 1: Shared types, Finding, Shannon entropy** - `45cc676` (feat)
2. **Task 2: Pipeline stages, engine, FileSource, tests** - `cea2e37` (feat)

**Plan metadata:** (pending final commit)

_Note: TDD tasks had RED-GREEN commits merged into single task commits_

## Files Created/Modified

- `pkg/types/chunk.go` - Shared Chunk struct (Data, Source, Offset) breaking circular import
- `pkg/engine/finding.go` - Finding struct with MaskKey for masked key output
- `pkg/engine/entropy.go` - Shannon entropy using math.Log2 (~15 lines)
- `pkg/engine/filter.go` - KeywordFilter using Aho-Corasick automaton
- `pkg/engine/detector.go` - Detect applying regex patterns + entropy threshold
- `pkg/engine/engine.go` - Engine.Scan orchestrating 3-stage pipeline with ants pool
- `pkg/engine/sources/source.go` - Source interface using pkg/types.Chunk
- `pkg/engine/sources/file.go` - FileSource with overlapping chunk reads
- `pkg/engine/scanner_test.go` - 7 integration tests replacing stub tests
- `pkg/engine/entropy_test.go` - 6 unit tests for Shannon and MaskKey
- `testdata/samples/anthropic_key.txt` - Fixed key length for regex match
- `testdata/samples/multiple_keys.txt` - Fixed anthropic key length

## Decisions Made

- Used `pkg/types/chunk.go` to break the engine<->sources circular import (Go requires this pattern)
- ants Pool.Release() instead of ReleaseWithTimeout (method doesn't exist in current ants/v2 API)
- FileSource reads entire file via os.ReadFile then splits into overlapping chunks -- mmap deferred to Phase 4
- Mutex protects resultsChan writes from detector goroutines to prevent channel deadlock

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 1 - Bug] Fixed Anthropic test key lengths too short for regex pattern**
- **Found during:** Task 2 (pipeline integration tests)
- **Issue:** anthropic_key.txt and multiple_keys.txt had Anthropic keys with suffix < 93 chars, failing the `sk-ant-api03-[A-Za-z0-9_\-]{93,}` regex
- **Fix:** Extended synthetic key suffixes to 101 and 102 chars respectively
- **Files modified:** testdata/samples/anthropic_key.txt, testdata/samples/multiple_keys.txt
- **Verification:** Regex matches confirmed, all pipeline tests pass
- **Committed in:** cea2e37 (Task 2 commit)

**2. [Rule 1 - Bug] Fixed ants API: ReleaseWithTimeout does not exist**
- **Found during:** Task 2 (compilation)
- **Issue:** Plan specified `pool.ReleaseWithTimeout(5*time.Second)` but ants/v2 only has `pool.Release()`
- **Fix:** Changed to `pool.Release()` and removed unused `time` import
- **Files modified:** pkg/engine/engine.go
- **Verification:** Build succeeds, all tests pass
- **Committed in:** cea2e37 (Task 2 commit)

---

**Total deviations:** 2 auto-fixed (2 bugs)
**Impact on plan:** Both fixes necessary for correctness. No scope creep.

## Issues Encountered

None beyond the auto-fixed deviations above.

## User Setup Required

None - no external service configuration required.

## Next Phase Readiness

- Scan engine ready for CLI integration (Plan 05: `keyhunter scan`)
- Engine.Scan() returns `<-chan Finding` ready for any consumer (CLI, web, bot)
- Source interface ready for additional adapters (dir, git, stdin) in Phase 4

## Self-Check: PASSED

All 10 created files verified on disk. Both task commits (45cc676, cea2e37) verified in git log.

---
*Phase: 01-foundation*
*Completed: 2026-04-05*