docs(01-04): complete scan engine plan
- SUMMARY.md with pipeline implementation details
- STATE.md updated with progress and decisions
- ROADMAP.md and REQUIREMENTS.md updated
@@ -9,11 +9,11 @@ Requirements for initial release. Each maps to roadmap phases.
 
 ### Core Engine
 
-- [ ] **CORE-01**: Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline
+- [x] **CORE-01**: Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline
 - [x] **CORE-02**: Provider definitions loaded from YAML files embedded at compile time via Go embed
 - [x] **CORE-03**: Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata
-- [ ] **CORE-04**: Entropy analysis as secondary signal for low-confidence providers (generic key formats)
+- [x] **CORE-04**: Entropy analysis as secondary signal for low-confidence providers (generic key formats)
-- [ ] **CORE-05**: Worker pool parallelism with configurable worker count (default: CPU count)
+- [x] **CORE-05**: Worker pool parallelism with configurable worker count (default: CPU count)
 - [x] **CORE-06**: Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files
 - [ ] **CORE-07**: mmap-based large file reading for memory efficiency
 
@@ -46,10 +46,10 @@ Decimal phases appear between their surrounding integers in numeric order.
 **Plans**: 5 plans
 
 Plans:
-- [ ] 01-01-PLAN.md — Go module init, dependency installation, test scaffolding and testdata fixtures
+- [x] 01-01-PLAN.md — Go module init, dependency installation, test scaffolding and testdata fixtures
 - [x] 01-02-PLAN.md — Provider registry: YAML schema, embed loader, Aho-Corasick automaton, Registry struct
-- [ ] 01-03-PLAN.md — Storage layer: AES-256-GCM encryption, Argon2id key derivation, SQLite + Finding CRUD
+- [x] 01-03-PLAN.md — Storage layer: AES-256-GCM encryption, Argon2id key derivation, SQLite + Finding CRUD
-- [ ] 01-04-PLAN.md — Scan engine pipeline: keyword pre-filter, regex+entropy detector, FileSource, ants worker pool
+- [x] 01-04-PLAN.md — Scan engine pipeline: keyword pre-filter, regex+entropy detector, FileSource, ants worker pool
 - [ ] 01-05-PLAN.md — CLI wiring: scan, providers list/info/stats, config init/set/get, output table
 
 ### Phase 2: Tier 1-2 Providers
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: planning
-stopped_at: Completed 01-foundation 01-02-PLAN.md
+stopped_at: Completed 01-foundation 01-04-PLAN.md
-last_updated: "2026-04-04T21:12:49.099Z"
+last_updated: "2026-04-05T09:22:35.186Z"
 last_activity: 2026-04-04 — Roadmap created, 18 phases defined covering 146 v1 requirements
 progress:
   total_phases: 18
   completed_phases: 0
   total_plans: 5
-  completed_plans: 1
+  completed_plans: 4
   percent: 20
 ---
 
@@ -53,6 +53,7 @@ Progress: [██░░░░░░░░] 20%
 
 *Updated after each plan completion*
 | Phase 01-foundation P02 | 9 | 2 tasks | 11 files |
+| Phase 01-foundation P04 | 5min | 2 tasks | 12 files |
 
 ## Accumulated Context
 
@@ -67,6 +68,7 @@ Recent decisions affecting current work:
 - Roadmap: Verification (Phase 5) requires consent prompt + LEGAL.md — not optional polish
 - [Phase 01-foundation]: Provider YAML in dual locations: providers/ (user-visible) and pkg/providers/definitions/ (embed) — Go embed cannot use '..' paths
 - [Phase 01-foundation]: Aho-Corasick built with DFA=true at NewRegistry() for O(n) keyword pre-filtering across all providers
+- [Phase 01-foundation]: pkg/types/chunk.go breaks engine<->sources circular import; ants pool with WaitGroup+Mutex for detector coordination
 
 ### Pending Todos
 
@@ -81,6 +83,6 @@ None yet.
 
 ## Session Continuity
 
-Last session: 2026-04-04T21:12:49.095Z
+Last session: 2026-04-05T09:22:35.183Z
-Stopped at: Completed 01-foundation 01-02-PLAN.md
+Stopped at: Completed 01-foundation 01-04-PLAN.md
 Resume file: None
147
.planning/phases/01-foundation/01-04-SUMMARY.md
Normal file
@@ -0,0 +1,147 @@
---
phase: 01-foundation
plan: 04
subsystem: engine
tags: [scanning, aho-corasick, entropy, regex, ants, goroutine-pool, pipeline]

requires:
  - phase: 01-foundation-02
    provides: "Provider Registry with AC() automaton and List() for pattern matching"
provides:
  - "Three-stage scanning pipeline: AC pre-filter, regex+entropy detector, results channel"
  - "Engine.Scan(ctx, source, config) -> <-chan Finding"
  - "Source interface for input adapters"
  - "FileSource for single-file scanning"
  - "Shannon entropy function"
  - "pkg/types.Chunk shared type breaking circular imports"
affects: [cli-scan, input-sources, verification, output-formats]

tech-stack:
  added: [ants/v2, pkg/types]
  patterns: [three-stage-channel-pipeline, goroutine-pool-with-waitgroup, overlapping-chunk-reads]

key-files:
  created:
    - pkg/types/chunk.go
    - pkg/engine/finding.go
    - pkg/engine/entropy.go
    - pkg/engine/filter.go
    - pkg/engine/detector.go
    - pkg/engine/sources/source.go
    - pkg/engine/sources/file.go
  modified:
    - pkg/engine/engine.go
    - pkg/engine/scanner_test.go
    - testdata/samples/anthropic_key.txt
    - testdata/samples/multiple_keys.txt

key-decisions:
  - "pkg/types/chunk.go breaks engine<->sources circular import"
  - "ants pool with sync.WaitGroup+Mutex for detector stage coordination"
  - "FileSource uses os.ReadFile with 256-byte chunk overlap; mmap deferred to Phase 4"
  - "Pool.Release() used instead of ReleaseWithTimeout (not in ants/v2 API)"

patterns-established:
  - "Three-stage channel pipeline: Source->KeywordFilter->Detect->resultsChan"
  - "Shared types in pkg/types to avoid circular imports between engine and sources"
  - "Overlapping chunks (256 bytes) to prevent key splitting at boundaries"

requirements-completed: [CORE-01, CORE-04, CORE-05, CORE-06]

duration: 5min
completed: 2026-04-05
---

# Phase 1 Plan 4: Scan Engine Summary

**Three-stage scanning pipeline with Aho-Corasick pre-filter, regex+entropy detection via ants goroutine pool, and FileSource adapter**

## Performance

- **Duration:** 5 min
- **Started:** 2026-04-05T09:16:37Z
- **Completed:** 2026-04-05T09:21:30Z
- **Tasks:** 2
- **Files modified:** 12

## Accomplishments
- Three-stage pipeline (AC keyword filter -> regex+entropy detector -> results channel) working end-to-end
- Shannon entropy function correctly discriminates real keys (>= 3.5 bits/char) from low-entropy strings
- ants v2 goroutine pool with configurable worker count for parallel detection
- FileSource with overlapping chunk reads preventing key splitting at boundaries
- All 12 engine tests pass including pipeline integration tests against real testdata
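
To illustrate the entropy check listed above, here is a minimal sketch, assuming entropy is measured per character over the whole candidate string. The name `shannonEntropy` and the sample strings are hypothetical; the code in `pkg/engine/entropy.go` is not shown in this commit and may differ.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy of s in bits per character.
// Hypothetical helper for illustration only.
func shannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	if total == 0 {
		return 0
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// 3.5 bits/char is the threshold quoted in the summary.
	const threshold = 3.5
	for _, s := range []string{"aaaaaaaaaaaaaaaa", "abcdefghijklmnop"} {
		h := shannonEntropy(s)
		fmt.Printf("%q: %.2f bits/char, passes: %v\n", s, h, h >= threshold)
	}
}
```

A run of one repeated character scores 0 bits/char and 16 distinct characters score 4 bits/char, which is the kind of gap a 3.5 threshold separates.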

## Task Commits

Each task was committed atomically:

1. **Task 1: Shared types, Finding, Shannon entropy** - `45cc676` (feat)
2. **Task 2: Pipeline stages, engine, FileSource, tests** - `cea2e37` (feat)

**Plan metadata:** (pending final commit)

_Note: TDD tasks had RED-GREEN commits merged into single task commits_

## Files Created/Modified
- `pkg/types/chunk.go` - Shared Chunk struct (Data, Source, Offset) breaking circular import
- `pkg/engine/finding.go` - Finding struct with MaskKey for masked key output
- `pkg/engine/entropy.go` - Shannon entropy using math.Log2 (~15 lines)
- `pkg/engine/filter.go` - KeywordFilter using Aho-Corasick automaton
- `pkg/engine/detector.go` - Detect applying regex patterns + entropy threshold
- `pkg/engine/engine.go` - Engine.Scan orchestrating 3-stage pipeline with ants pool
- `pkg/engine/sources/source.go` - Source interface using pkg/types.Chunk
- `pkg/engine/sources/file.go` - FileSource with overlapping chunk reads
- `pkg/engine/scanner_test.go` - 7 integration tests replacing stub tests
- `pkg/engine/entropy_test.go` - 6 unit tests for Shannon and MaskKey
- `testdata/samples/anthropic_key.txt` - Fixed key length for regex match
- `testdata/samples/multiple_keys.txt` - Fixed anthropic key length
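
The overlapping chunk reads in `file.go` could look roughly like the sketch below. It assumes the chunk carries the three fields named above (`Data`, `Source`, `Offset`); the field types, the `splitOverlapping` helper, and the tiny sizes in `main` are hypothetical and chosen only to keep the output readable.

```go
package main

import "fmt"

// chunk mirrors the fields the summary lists for pkg/types.Chunk.
// The field types are assumptions.
type chunk struct {
	Data   []byte
	Source string
	Offset int64
}

// splitOverlapping cuts data into chunks of at most size bytes. Each chunk
// after the first starts overlap bytes before the previous chunk ended.
func splitOverlapping(data []byte, source string, size, overlap int) []chunk {
	if size <= overlap {
		panic("size must be larger than overlap")
	}
	var out []chunk
	step := size - overlap
	for start := 0; start < len(data); start += step {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		out = append(out, chunk{Data: data[start:end], Source: source, Offset: int64(start)})
		if end == len(data) {
			break // the last chunk already reaches the end of the input
		}
	}
	return out
}

func main() {
	// The summary uses a 256-byte overlap; 4 and 2 are demo values.
	for _, c := range splitOverlapping([]byte("0123456789"), "demo.txt", 4, 2) {
		fmt.Printf("offset=%d data=%q\n", c.Offset, c.Data)
	}
}
```

With this scheme any match no longer than `overlap+1` bytes lies entirely inside at least one chunk, which is why the overlap has to be sized against the longest expected key.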

## Decisions Made
- Used `pkg/types/chunk.go` to break the engine<->sources circular import (Go requires this pattern)
- ants Pool.Release() instead of ReleaseWithTimeout (method doesn't exist in current ants/v2 API)
- FileSource reads entire file via os.ReadFile then splits into overlapping chunks -- mmap deferred to Phase 4
- Mutex protects resultsChan writes from detector goroutines to prevent channel deadlock
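
The detector-stage coordination described above can be sketched with the standard library alone. This is not the ants-based code from `engine.go`: a buffered channel stands in for the pool, the mutex is left out because channel sends are already safe for concurrent use, and `detectAll` and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// detectAll runs detect over every chunk with at most workers goroutines in
// flight and streams the hits on the returned channel.
func detectAll(chunks []string, workers int, detect func(string) []string) <-chan string {
	if workers <= 0 {
		workers = runtime.NumCPU() // default named in CORE-05
	}
	results := make(chan string)
	sem := make(chan struct{}, workers) // bounds concurrency like a fixed-size pool
	var wg sync.WaitGroup
	go func() {
		for _, c := range chunks {
			sem <- struct{}{} // wait for a free worker slot
			wg.Add(1)
			go func(c string) {
				defer wg.Done()
				defer func() { <-sem }()
				for _, hit := range detect(c) {
					results <- hit
				}
			}(c)
		}
		wg.Wait()
		close(results) // every sender has returned, so closing is safe
	}()
	return results
}

func main() {
	chunks := []string{"alpha", "beta", "gamma"}
	echo := func(s string) []string { return []string{s} }
	n := 0
	for range detectAll(chunks, 2, echo) {
		n++
	}
	fmt.Println(n, "results")
}
```

The point of the `WaitGroup` here is ordering: the results channel is closed only after all detector goroutines finish, so a consumer ranging over it terminates cleanly. The sketch assumes the consumer drains the channel; it has no cancellation path.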

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 1 - Bug] Fixed Anthropic test key lengths too short for regex pattern**
- **Found during:** Task 2 (pipeline integration tests)
- **Issue:** anthropic_key.txt and multiple_keys.txt had Anthropic keys with suffix < 93 chars, failing the `sk-ant-api03-[A-Za-z0-9_\-]{93,}` regex
- **Fix:** Extended synthetic key suffixes to 101 and 102 chars respectively
- **Files modified:** testdata/samples/anthropic_key.txt, testdata/samples/multiple_keys.txt
- **Verification:** Regex matches confirmed, all pipeline tests pass
- **Committed in:** cea2e37 (Task 2 commit)

**2. [Rule 1 - Bug] Fixed ants API: ReleaseWithTimeout does not exist**
- **Found during:** Task 2 (compilation)
- **Issue:** Plan specified `pool.ReleaseWithTimeout(5*time.Second)` but ants/v2 only has `pool.Release()`
- **Fix:** Changed to `pool.Release()` and removed unused `time` import
- **Files modified:** pkg/engine/engine.go
- **Verification:** Build succeeds, all tests pass
- **Committed in:** cea2e37 (Task 2 commit)

---

**Total deviations:** 2 auto-fixed (2 bugs)
**Impact on plan:** Both fixes necessary for correctness. No scope creep.
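
The length requirement behind the first auto-fixed issue can be checked with a few lines of Go. The pattern is the one quoted in that issue; the `fakeKey` and `matchesAnthropic` helpers and the all-`A` suffixes are hypothetical and stand in for the real fixtures, which are not shown in this commit.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// anthropicKey is the pattern quoted in the first auto-fixed issue.
var anthropicKey = regexp.MustCompile(`sk-ant-api03-[A-Za-z0-9_\-]{93,}`)

// fakeKey builds a synthetic key whose suffix is n characters long.
func fakeKey(n int) string {
	return "sk-ant-api03-" + strings.Repeat("A", n)
}

// matchesAnthropic reports whether s contains a match for the pattern.
func matchesAnthropic(s string) bool {
	return anthropicKey.MatchString(s)
}

func main() {
	fmt.Println(matchesAnthropic(fakeKey(92)))  // false: suffix is one character short
	fmt.Println(matchesAnthropic(fakeKey(101))) // true: suffix length used in the fix
}
```

This only exercises the regex stage; whether a given string also passes the entropy threshold is a separate check.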

## Issues Encountered
None beyond the auto-fixed deviations above.

## User Setup Required

None - no external service configuration required.

## Next Phase Readiness
- Scan engine ready for CLI integration (Plan 05: `keyhunter scan`)
- Engine.Scan() returns `<-chan Finding` ready for any consumer (CLI, web, bot)
- Source interface ready for additional adapters (dir, git, stdin) in Phase 4
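
For readers wiring a consumer to the `<-chan Finding` result, the sketch below shows the general shape of a three-stage channel pipeline using only the standard library. It is not the project's `Engine.Scan`: `strings.Contains` stands in for the Aho-Corasick pre-filter, there is no entropy check or worker pool, and every type and function name (`finding`, `provider`, `scan`, `runDemo`) is hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"regexp"
	"strings"
)

type finding struct{ Provider, Match string }

type provider struct {
	Name     string
	Keywords []string
	Pattern  *regexp.Regexp
}

type candidate struct {
	chunk string
	hits  []provider
}

// scan wires source -> keyword filter -> regex detector with channels.
func scan(ctx context.Context, chunks []string, providers []provider) <-chan finding {
	src := make(chan string)
	go func() { // stage 1: emit chunks
		defer close(src)
		for _, c := range chunks {
			select {
			case src <- c:
			case <-ctx.Done():
				return
			}
		}
	}()
	filtered := make(chan candidate)
	go func() { // stage 2: keep only chunks containing a provider keyword
		defer close(filtered)
		for c := range src {
			var hits []provider
			for _, p := range providers {
				for _, kw := range p.Keywords {
					if strings.Contains(c, kw) {
						hits = append(hits, p)
						break
					}
				}
			}
			if len(hits) == 0 {
				continue
			}
			select {
			case filtered <- candidate{chunk: c, hits: hits}:
			case <-ctx.Done():
				return
			}
		}
	}()
	out := make(chan finding)
	go func() { // stage 3: run the regex of each provider whose keyword hit
		defer close(out)
		for cand := range filtered {
			for _, p := range cand.hits {
				for _, m := range p.Pattern.FindAllString(cand.chunk, -1) {
					select {
					case out <- finding{Provider: p.Name, Match: m}:
					case <-ctx.Done():
						return
					}
				}
			}
		}
	}()
	return out
}

// runDemo scans chunks with one made-up provider and collects the findings.
func runDemo(chunks []string) []finding {
	demo := provider{Name: "demo", Keywords: []string{"demo_"}, Pattern: regexp.MustCompile(`demo_[a-z0-9]{8}`)}
	var got []finding
	for f := range scan(context.Background(), chunks, []provider{demo}) {
		got = append(got, f)
	}
	return got
}

func main() {
	for _, f := range runDemo([]string{"nothing here", "token=demo_abcd1234 end"}) {
		fmt.Printf("%s: %s\n", f.Provider, f.Match)
	}
}
```

Because the result is a receive-only channel that is closed when scanning ends, any consumer can simply range over it, which is what makes the same engine usable from a CLI, a web handler, or a bot.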

## Self-Check: PASSED

All 10 created files verified on disk. Both task commits (45cc676, cea2e37) verified in git log.

---
*Phase: 01-foundation*
*Completed: 2026-04-05*