keyhunter/.planning/phases/01-foundation/01-04-PLAN.md
salvacybersec 684b67cb73 docs(01-foundation): create phase 1 plan — 5 plans across 3 execution waves
Wave 0: module init + test scaffolding (01-01)
Wave 1: provider registry (01-02) + storage layer (01-03) in parallel
Wave 2: scan engine pipeline (01-04, depends on 01-02)
Wave 3: CLI wiring + integration checkpoint (01-05, depends on all)

Covers all 16 Phase 1 requirements: CORE-01 through CORE-07, STOR-01 through STOR-03,
CLI-01 through CLI-05, PROV-10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:44:09 +03:00


phase: 01-foundation
plan: 04
type: execute
wave: 2
depends_on: [01-02]
files_modified:
  - pkg/engine/chunk.go
  - pkg/engine/finding.go
  - pkg/engine/entropy.go
  - pkg/engine/filter.go
  - pkg/engine/detector.go
  - pkg/engine/engine.go
  - pkg/engine/sources/source.go
  - pkg/engine/sources/file.go
  - pkg/engine/scanner_test.go
autonomous: true
requirements: [CORE-01, CORE-04, CORE-05, CORE-06, CORE-07]
truths:
  - Shannon entropy function returns expected values for known inputs
  - Aho-Corasick pre-filter passes chunks containing provider keywords and drops those without
  - Detector correctly identifies OpenAI and Anthropic key patterns in test fixtures via regex
  - Full scan pipeline: scan testdata/samples/openai_key.txt → Finding with ProviderName==openai
  - Full scan pipeline: scan testdata/samples/no_keys.txt → zero findings
  - Worker pool uses ants v2 with configurable worker count
artifacts:
  - path: pkg/engine/chunk.go
    provides: Chunk struct (Data []byte, Source string, Offset int64)
    exports: [Chunk]
  - path: pkg/engine/finding.go
    provides: Finding struct (provider, key value, masked, confidence, source, line)
    exports: [Finding, MaskKey]
  - path: pkg/engine/entropy.go
    provides: Shannon(s string) float64 — ~10 line stdlib math implementation
    exports: [Shannon]
  - path: pkg/engine/filter.go
    provides: KeywordFilter stage — runs Aho-Corasick and passes/drops chunks
    exports: [KeywordFilter]
  - path: pkg/engine/detector.go
    provides: Detector stage — applies provider regexps and entropy check to chunks
    exports: [Detector]
  - path: pkg/engine/engine.go
    provides: Engine struct with Scan(ctx, src, cfg) <-chan Finding
    exports: [Engine, NewEngine, ScanConfig]
  - path: pkg/engine/sources/source.go
    provides: Source interface with Chunks(ctx, chan<- Chunk) error
    exports: [Source]
  - path: pkg/engine/sources/file.go
    provides: FileSource implementing Source for single-file scanning
    exports: [FileSource, NewFileSource]
key_links:
  - from: pkg/engine/engine.go
    to: pkg/providers/registry.go
    via: Engine holds *providers.Registry, uses Registry.AC() for pre-filter
    pattern: providers.Registry
  - from: pkg/engine/filter.go
    to: github.com/petar-dambovaliev/aho-corasick
    via: AC.FindAll() on each chunk
    pattern: FindAll
  - from: pkg/engine/detector.go
    to: pkg/engine/entropy.go
    via: Shannon() called when EntropyMin > 0 in pattern
    pattern: Shannon
  - from: pkg/engine/engine.go
    to: github.com/panjf2000/ants/v2
    via: ants.NewPool for detector workers
    pattern: ants.NewPool
Build the three-stage scanning engine pipeline: Aho-Corasick keyword pre-filter, regex + entropy detector workers using ants goroutine pool, and a FileSource adapter. Wire them together in an Engine that emits Findings on a channel.

Purpose: The scan engine is the core differentiator. Plans 02 and 03 provide its dependencies (Registry for patterns + keywords, storage types for Finding). The CLI (Plan 05) calls Engine.Scan() to implement keyhunter scan. Output: pkg/engine/{chunk,finding,entropy,filter,detector,engine}.go and sources/{source,file}.go. scanner_test.go stubs filled.

<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-02-SUMMARY.md

From Plan 01-02, package providers exposes:

```go
package providers

type Provider struct {
    Name     string
    Keywords []string
    Patterns []Pattern
    Tier     int
}

type Pattern struct {
    Regex      string
    EntropyMin float64
    Confidence string
}

type Registry struct{ /* ... */ }

func (r *Registry) List() []Provider
func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built Aho-Corasick
```

Channel buffers:
- chunksChan chan Chunk (buffer: 1000)
- detectableChan chan Chunk (buffer: 500)
- resultsChan chan Finding (buffer: 100)

Stages:
- Stage 1: Source.Chunks() → chunksChan (goroutine, closes chan on done)
- Stage 2: KeywordFilter(chunksChan) → detectableChan (goroutine, AC.FindAll)
- Stage 3: N detector workers (ants pool) → resultsChan

```go
type ScanConfig struct {
    Workers int  // default: runtime.NumCPU() * 8
    Verify  bool // Phase 5 — always false in Phase 1
    Unmask  bool // for output layer
}
```

```go
type Source interface {
    Chunks(ctx context.Context, out chan<- Chunk) error
}
```

```go
type FileSource struct {
    Path      string
    ChunkSize int // bytes per chunk, default 4096
}
```

Chunking strategy: read file in chunks of ChunkSize bytes with overlap of max(256, maxPatternLen) to avoid splitting a key across chunk boundaries.

```go
import ahocorasick "github.com/petar-dambovaliev/aho-corasick"

// ac.FindAll(s string) []ahocorasick.Match — returns match positions
```

```go
import "github.com/panjf2000/ants/v2"

// pool, _ := ants.NewPool(workers, ants.WithOptions(...))
// pool.Submit(func() { ... })
// pool.ReleaseTimeout(timeout)
```

**Task 1: Core types and Shannon entropy function**

**Files:** pkg/engine/chunk.go, pkg/engine/finding.go, pkg/engine/entropy.go

**References:**
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CORE-04 row: Shannon entropy, ~10-line stdlib function, threshold 3.5 bits/char)
- /home/salva/Documents/apikey/pkg/storage/findings.go (Finding and MaskKey defined there — engine.Finding is a separate type for the pipeline)

**Tests:**
- Test 1: Shannon("aaaaaaa") → value near 0.0 (all same characters, no entropy)
- Test 2: Shannon("abcdefgh") → value near 3.0 (8 distinct chars)
- Test 3: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") → >= 3.5 (real key entropy)
- Test 4: Shannon("") → 0.0 (empty string)
- Test 5: MaskKey("sk-proj-abc1234") → "sk-proj-...1234" (first 8 + last 4)
- Test 6: MaskKey("abc") → "****" (too short to mask)

Create **pkg/engine/chunk.go**:

```go
package engine

// Chunk is a segment of file content passed through the scanning pipeline.
type Chunk struct {
    Data   []byte // raw bytes
    Source string // file path, URL, or description
    Offset int64  // byte offset of this chunk within the source
}
```

Create **pkg/engine/finding.go**:
```go
package engine

import "time"

// Finding represents a detected API key from the scanning pipeline.
// KeyValue holds the plaintext key — the storage layer encrypts it before persisting.
type Finding struct {
    ProviderName string
    KeyValue     string // full plaintext key
    KeyMasked    string // first8...last4
    Confidence   string // "high", "medium", "low"
    Source       string // file path or description
    SourceType   string // "file", "dir", "git", "stdin", "url"
    LineNumber   int
    Offset       int64
    DetectedAt   time.Time
}

// MaskKey returns a masked representation: first 8 chars + "..." + last 4 chars.
// Returns "****" if the key is shorter than 12 characters.
func MaskKey(key string) string {
    if len(key) < 12 {
        return "****"
    }
    return key[:8] + "..." + key[len(key)-4:]
}

```

Create **pkg/engine/entropy.go**:

```go
package engine

import "math"

// Shannon computes the Shannon entropy of a string in bits per character.
// Returns 0.0 for empty strings.
// A value >= 3.5 indicates high randomness, consistent with real API keys.
func Shannon(s string) float64 {
    if len(s) == 0 {
        return 0.0
    }
    freq := make(map[rune]float64)
    for _, c := range s {
        freq[c]++
    }
    n := float64(len([]rune(s)))
    var entropy float64
    for _, count := range freq {
        p := count / n
        entropy -= p * math.Log2(p)
    }
    return entropy
}
```

Verify:

```
cd /home/salva/Documents/apikey && go build ./pkg/engine/... && echo "BUILD OK"
```

Checks:
- `go build ./pkg/engine/...` exits 0
- pkg/engine/chunk.go exports Chunk with fields Data, Source, Offset
- pkg/engine/finding.go exports Finding and MaskKey
- pkg/engine/entropy.go exports Shannon using math.Log2
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- Shannon("aaaaaaa") == 0.0 (manually verifiable from code)
- MaskKey("sk-proj-abc1234") produces "sk-proj-...1234"

**Done when:** Chunk, Finding, MaskKey, and Shannon exist and compile. Shannon uses stdlib math only — no external library.

**Task 2: Pipeline stages, engine orchestration, FileSource, and filled test stubs**

**Files:** pkg/engine/filter.go, pkg/engine/detector.go, pkg/engine/engine.go, pkg/engine/sources/source.go, pkg/engine/sources/file.go, pkg/engine/scanner_test.go

**References:**
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 2: Three-Stage Scanning Pipeline — exact channel-based code example)
- /home/salva/Documents/apikey/pkg/engine/chunk.go
- /home/salva/Documents/apikey/pkg/engine/finding.go
- /home/salva/Documents/apikey/pkg/engine/entropy.go
- /home/salva/Documents/apikey/pkg/providers/registry.go (Registry.AC() and Registry.List() signatures)

**Tests:**
- Test 1: Scan testdata/samples/openai_key.txt → 1 finding, ProviderName=="openai", KeyValue contains "sk-proj-"
- Test 2: Scan testdata/samples/anthropic_key.txt → 1 finding, ProviderName=="anthropic"
- Test 3: Scan testdata/samples/no_keys.txt → 0 findings
- Test 4: Scan testdata/samples/multiple_keys.txt → 2 findings (openai + anthropic)
- Test 5: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") >= 3.5 (entropy check)
- Test 6: KeywordFilter drops a chunk with text "hello world" (no provider keywords)

Create **pkg/engine/sources/source.go**:

```go
package sources

import (
    "context"

    "github.com/salvacybersec/keyhunter/pkg/engine"
)

// Source is the interface all input adapters must implement.
// Chunks writes content segments to the out channel until the source
// is exhausted or ctx is cancelled.
type Source interface {
    Chunks(ctx context.Context, out chan<- engine.Chunk) error
}
```

Create **pkg/engine/sources/file.go**:
```go
package sources

import (
    "context"
    "os"

    "github.com/salvacybersec/keyhunter/pkg/engine"
)

const defaultChunkSize = 4096
const chunkOverlap = 256 // overlap between chunks to avoid splitting keys at boundaries

// FileSource reads a single file and emits overlapping chunks.
type FileSource struct {
    Path      string
    ChunkSize int
}

// NewFileSource creates a FileSource for the given path with the default chunk size.
func NewFileSource(path string) *FileSource {
    return &FileSource{Path: path, ChunkSize: defaultChunkSize}
}

// Chunks reads the file in overlapping segments and sends each chunk to out.
func (f *FileSource) Chunks(ctx context.Context, out chan<- engine.Chunk) error {
    data, err := os.ReadFile(f.Path)
    if err != nil {
        return err
    }
    size := f.ChunkSize
    if size <= 0 {
        size = defaultChunkSize
    }
    if size <= chunkOverlap {
        size = chunkOverlap * 2 // keep the loop stride positive
    }
    if len(data) <= size {
        // File fits in one chunk
        select {
        case <-ctx.Done():
            return ctx.Err()
        case out <- engine.Chunk{Data: data, Source: f.Path, Offset: 0}:
        }
        return nil
    }
    // Emit overlapping chunks; Offset is each chunk's start position in the file.
    for start := 0; start < len(data); start += size - chunkOverlap {
        end := start + size
        if end > len(data) {
            end = len(data)
        }
        chunk := engine.Chunk{
            Data:   data[start:end],
            Source: f.Path,
            Offset: int64(start),
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case out <- chunk:
        }
        if end == len(data) {
            break
        }
    }
    return nil
}

```

Create **pkg/engine/filter.go**:

```go
package engine

import (
    ahocorasick "github.com/petar-dambovaliev/aho-corasick"
)

// KeywordFilter filters a stream of chunks using an Aho-Corasick automaton.
// Only chunks that contain at least one provider keyword are sent to out.
// This is Stage 2 of the pipeline (runs after Source, before Detector).
func KeywordFilter(ac ahocorasick.AhoCorasick, in <-chan Chunk, out chan<- Chunk) {
    for chunk := range in {
        if len(ac.FindAll(string(chunk.Data))) > 0 {
            out <- chunk
        }
    }
}

```

Create **pkg/engine/detector.go**:

```go
package engine

import (
    "regexp"
    "strings"
    "time"

    "github.com/salvacybersec/keyhunter/pkg/providers"
)

// Detect applies provider regex patterns and an optional entropy check to a chunk.
// It returns all findings from the chunk.
func Detect(chunk Chunk, providerList []providers.Provider) []Finding {
    var findings []Finding
    content := string(chunk.Data)

    for _, p := range providerList {
        for _, pat := range p.Patterns {
            // Compiled per call for simplicity; pre-compile once per
            // Registry load if profiling shows this is hot.
            re, err := regexp.Compile(pat.Regex)
            if err != nil {
                continue // invalid regex — skip silently
            }
            matches := re.FindAllString(content, -1)
            for _, match := range matches {
                // Apply entropy check if threshold is set
                if pat.EntropyMin > 0 && Shannon(match) < pat.EntropyMin {
                    continue // too low entropy — likely a placeholder
                }
                line := lineNumber(content, match)
                findings = append(findings, Finding{
                    ProviderName: p.Name,
                    KeyValue:     match,
                    KeyMasked:    MaskKey(match),
                    Confidence:   pat.Confidence,
                    Source:       chunk.Source,
                    SourceType:   "file",
                    LineNumber:   line,
                    Offset:       chunk.Offset,
                    DetectedAt:   time.Now(),
                })
            }
        }
    }
    return findings
}

// lineNumber returns the 1-based line number where match first appears in
// content (relative to this chunk, not the whole file).
func lineNumber(content, match string) int {
    idx := strings.Index(content, match)
    if idx < 0 {
        return 0
    }
    return strings.Count(content[:idx], "\n") + 1
}

```

Create **pkg/engine/engine.go**:

```go
package engine

import (
    "context"
    "runtime"
    "sync"
    "time"

    "github.com/panjf2000/ants/v2"
    "github.com/salvacybersec/keyhunter/pkg/providers"
    "github.com/salvacybersec/keyhunter/pkg/engine/sources"
)

// ScanConfig controls scan execution parameters.
type ScanConfig struct {
    Workers int  // number of detector goroutines; defaults to runtime.NumCPU() * 8
    Verify  bool // opt-in active verification (Phase 5)
    Unmask  bool // include full key in Finding.KeyValue
}

// Engine orchestrates the three-stage scanning pipeline.
type Engine struct {
    registry *providers.Registry
}

// NewEngine creates an Engine backed by the given provider registry.
func NewEngine(registry *providers.Registry) *Engine {
    return &Engine{registry: registry}
}

// Scan runs the three-stage pipeline against src and returns a channel of Findings.
// The channel is closed when all chunks have been processed.
// The caller must drain the channel fully or cancel ctx to avoid goroutine leaks.
func (e *Engine) Scan(ctx context.Context, src sources.Source, cfg ScanConfig) (<-chan Finding, error) {
    workers := cfg.Workers
    if workers <= 0 {
        workers = runtime.NumCPU() * 8
    }

    chunksChan := make(chan Chunk, 1000)
    detectableChan := make(chan Chunk, 500)
    resultsChan := make(chan Finding, 100)

    // Stage 1: source → chunksChan
    go func() {
        defer close(chunksChan)
        _ = src.Chunks(ctx, chunksChan)
    }()

    // Stage 2: keyword pre-filter → detectableChan
    go func() {
        defer close(detectableChan)
        KeywordFilter(e.registry.AC(), chunksChan, detectableChan)
    }()

    // Stage 3: detector workers → resultsChan
    pool, err := ants.NewPool(workers)
    if err != nil {
        close(resultsChan)
        return nil, err
    }
    providerList := e.registry.List()

    var wg sync.WaitGroup

    go func() {
        defer func() {
            wg.Wait()
            close(resultsChan)
            _ = pool.ReleaseTimeout(5 * time.Second)
        }()

        for chunk := range detectableChan {
            c := chunk // capture loop variable for the closure
            wg.Add(1)
            if err := pool.Submit(func() {
                defer wg.Done()
                // Channel sends are goroutine-safe; no extra mutex needed.
                for _, f := range Detect(c, providerList) {
                    select {
                    case resultsChan <- f:
                    case <-ctx.Done():
                    }
                }
            }); err != nil {
                wg.Done() // Submit rejected the task; it will never run
            }
        }
    }()

    return resultsChan, nil
}

```

Fill **pkg/engine/scanner_test.go** (replacing stubs from Plan 01):

```go
package engine_test

import (
    "context"
    "testing"

    "github.com/salvacybersec/keyhunter/pkg/engine"
    "github.com/salvacybersec/keyhunter/pkg/engine/sources"
    "github.com/salvacybersec/keyhunter/pkg/providers"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func newTestRegistry(t *testing.T) *providers.Registry {
    t.Helper()
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    return reg
}

func TestShannonEntropy(t *testing.T) {
    assert.InDelta(t, 0.0, engine.Shannon("aaaaaaa"), 0.01)
    assert.Greater(t, engine.Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr"), 3.5)
    assert.Equal(t, 0.0, engine.Shannon(""))
}

func TestKeywordPreFilter(t *testing.T) {
    reg := newTestRegistry(t)
    ac := reg.AC()

    // Chunk with OpenAI keyword should pass
    matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-test")
    assert.NotEmpty(t, matches)

    // Chunk with no keywords should be dropped
    noMatches := ac.FindAll("hello world no secrets here")
    assert.Empty(t, noMatches)
}

func TestScannerPipelineOpenAI(t *testing.T) {
    reg := newTestRegistry(t)
    eng := engine.NewEngine(reg)
    src := sources.NewFileSource("../../testdata/samples/openai_key.txt")
    cfg := engine.ScanConfig{Workers: 2}

    ch, err := eng.Scan(context.Background(), src, cfg)
    require.NoError(t, err)

    var findings []engine.Finding
    for f := range ch {
        findings = append(findings, f)
    }

    require.Len(t, findings, 1, "expected exactly 1 finding in openai_key.txt")
    assert.Equal(t, "openai", findings[0].ProviderName)
    assert.Contains(t, findings[0].KeyValue, "sk-proj-")
}

func TestScannerPipelineNoKeys(t *testing.T) {
    reg := newTestRegistry(t)
    eng := engine.NewEngine(reg)
    src := sources.NewFileSource("../../testdata/samples/no_keys.txt")
    cfg := engine.ScanConfig{Workers: 2}

    ch, err := eng.Scan(context.Background(), src, cfg)
    require.NoError(t, err)

    var findings []engine.Finding
    for f := range ch {
        findings = append(findings, f)
    }

    assert.Empty(t, findings, "expected zero findings in no_keys.txt")
}

func TestScannerPipelineMultipleKeys(t *testing.T) {
    reg := newTestRegistry(t)
    eng := engine.NewEngine(reg)
    src := sources.NewFileSource("../../testdata/samples/multiple_keys.txt")
    cfg := engine.ScanConfig{Workers: 2}

    ch, err := eng.Scan(context.Background(), src, cfg)
    require.NoError(t, err)

    var findings []engine.Finding
    for f := range ch {
        findings = append(findings, f)
    }

    assert.GreaterOrEqual(t, len(findings), 2, "expected at least 2 findings in multiple_keys.txt")

    var names []string
    for _, f := range findings {
        names = append(names, f.ProviderName)
    }
    assert.Contains(t, names, "openai")
    assert.Contains(t, names, "anthropic")
}
```

Verify:

```
cd /home/salva/Documents/apikey && go test ./pkg/engine/... -v -count=1 2>&1 | tail -30
```

Checks:
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS (no SKIP)
- TestShannonEntropy passes — 0.0 for "aaaaaaa", >= 3.5 for real key pattern
- TestKeywordPreFilter passes — AC matches sk-proj-, empty for "hello world"
- TestScannerPipelineOpenAI passes — 1 finding with ProviderName=="openai"
- TestScannerPipelineNoKeys passes — 0 findings
- TestScannerPipelineMultipleKeys passes — >= 2 findings with both provider names
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'KeywordFilter' pkg/engine/engine.go` exits 0
- `go build ./...` still exits 0

**Done when:** Three-stage scanning pipeline works end-to-end: FileSource → KeywordFilter (AC) → Detect (regex + entropy) → Finding channel. All engine tests pass.

After both tasks:
- `go test ./pkg/engine/... -v -count=1` exits 0 with 6 tests PASS
- `go build ./...` exits 0
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- Scanning testdata/samples/openai_key.txt returns 1 finding with provider "openai"
- Scanning testdata/samples/no_keys.txt returns 0 findings

<success_criteria>

  • Three-stage pipeline: AC pre-filter → regex + entropy detector → results channel (CORE-01, CORE-06)
  • Shannon entropy function using stdlib math (CORE-04)
  • ants v2 goroutine pool with configurable worker count (CORE-05)
  • FileSource adapter reading files in overlapping chunks (CORE-07 partial — full mmap in Phase 4)
  • All engine tests pass against real testdata fixtures </success_criteria>
After completion, create `.planning/phases/01-foundation/01-04-SUMMARY.md` following the summary template.