---
phase: 01-foundation
plan: 04
type: execute
wave: 2
depends_on: [01-02]
files_modified:
  - pkg/types/chunk.go
  - pkg/engine/finding.go
  - pkg/engine/entropy.go
  - pkg/engine/filter.go
  - pkg/engine/detector.go
  - pkg/engine/engine.go
  - pkg/engine/sources/source.go
  - pkg/engine/sources/file.go
  - pkg/engine/scanner_test.go
autonomous: true
requirements: [CORE-01, CORE-04, CORE-05, CORE-06]

must_haves:
  truths:
    - "Shannon entropy function returns expected values for known inputs"
    - "Aho-Corasick pre-filter passes chunks containing provider keywords and drops those without"
    - "Detector correctly identifies OpenAI and Anthropic key patterns in test fixtures via regex"
    - "Full scan pipeline: scan testdata/samples/openai_key.txt → Finding with ProviderName==openai"
    - "Full scan pipeline: scan testdata/samples/no_keys.txt → zero findings"
    - "Worker pool uses ants v2 with configurable worker count"
  artifacts:
    - path: "pkg/types/chunk.go"
      provides: "Chunk struct (Data []byte, Source string, Offset int64) — shared by engine and sources packages"
      exports: ["Chunk"]
    - path: "pkg/engine/finding.go"
      provides: "Finding struct (provider, key value, masked, confidence, source, line)"
      exports: ["Finding", "MaskKey"]
    - path: "pkg/engine/entropy.go"
      provides: "Shannon(s string) float64 — ~10 line stdlib math implementation"
      exports: ["Shannon"]
    - path: "pkg/engine/filter.go"
      provides: "KeywordFilter stage — runs Aho-Corasick and passes/drops chunks"
      exports: ["KeywordFilter"]
    - path: "pkg/engine/detector.go"
      provides: "Detector stage — applies provider regexps and entropy check to chunks"
      exports: ["Detect"]
    - path: "pkg/engine/engine.go"
      provides: "Engine struct with Scan(ctx, src, cfg) <-chan Finding"
      exports: ["Engine", "NewEngine", "ScanConfig"]
    - path: "pkg/engine/sources/source.go"
      provides: "Source interface with Chunks(ctx, chan<- types.Chunk) error"
      exports: ["Source"]
    - path: "pkg/engine/sources/file.go"
      provides: "FileSource implementing Source for single-file scanning"
      exports: ["FileSource", "NewFileSource"]
  key_links:
    - from: "pkg/engine/engine.go"
      to: "pkg/providers/registry.go"
      via: "Engine holds *providers.Registry, uses Registry.AC() for pre-filter"
      pattern: "providers\\.Registry"
    - from: "pkg/engine/filter.go"
      to: "github.com/petar-dambovaliev/aho-corasick"
      via: "AC.FindAll() on each chunk"
      pattern: "FindAll"
    - from: "pkg/engine/detector.go"
      to: "pkg/engine/entropy.go"
      via: "Shannon() called when EntropyMin > 0 in pattern"
      pattern: "Shannon"
    - from: "pkg/engine/engine.go"
      to: "github.com/panjf2000/ants/v2"
      via: "ants.NewPool for detector workers"
      pattern: "ants\\.NewPool"
    - from: "pkg/engine/sources/source.go"
      to: "pkg/types/chunk.go"
      via: "Source interface uses types.Chunk — avoids circular import with pkg/engine"
      pattern: "types\\.Chunk"
---

<objective>
Build the three-stage scanning engine pipeline: an Aho-Corasick keyword pre-filter, regex + entropy detector workers running on an ants goroutine pool, and a FileSource adapter. Wire them together in an Engine that emits Findings on a channel.

Purpose: The scan engine is the core differentiator. Plans 02 and 03 provide its dependencies (Registry for patterns + keywords, storage types for Finding). The CLI (Plan 05) calls Engine.Scan() to implement `keyhunter scan`.

Output: pkg/types/chunk.go, pkg/engine/{finding,entropy,filter,detector,engine}.go and sources/{source,file}.go. scanner_test.go stubs filled.

NOTE on CORE-07 (mmap large file reading): FileSource uses os.ReadFile() in Phase 1, which is sufficient for the test fixtures. mmap-based reading for files > 10MB is deferred to Phase 4 (Input Sources), where it belongs architecturally alongside all other source adapter work.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-02-SUMMARY.md

<interfaces>
<!-- IMPORTANT: Circular import prevention -->
The sources sub-package (pkg/engine/sources) needs the Chunk type.
If Chunk were defined in pkg/engine, then sources would import engine, and engine imports
sources (for the Source interface) — a circular import. Go will refuse to compile.

Resolution: Define Chunk in pkg/types (a shared, import-free package):
  pkg/types/chunk.go — defines types.Chunk
  pkg/engine/sources — imports pkg/types (no circular dep)
  pkg/engine — imports pkg/types and pkg/engine/sources (no circular dep)

<!-- Provider Registry types (from Plan 02) -->
package providers

type Provider struct {
    Name     string
    Keywords []string
    Patterns []Pattern
    Tier     int
}

type Pattern struct {
    Regex      string
    EntropyMin float64
    Confidence string
}

type Registry struct { ... }
func (r *Registry) List() []Provider
func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built Aho-Corasick

<!-- Three-stage pipeline pattern from RESEARCH.md Pattern 2 -->
chunksChan     chan types.Chunk (buffer: 1000)
detectableChan chan types.Chunk (buffer: 500)
resultsChan    chan Finding     (buffer: 100)

Stage 1: Source.Chunks() → chunksChan (goroutine, closes chan on done)
Stage 2: KeywordFilter(chunksChan) → detectableChan (goroutine, AC.FindAll)
Stage 3: N detector workers (ants pool) → resultsChan

<!-- ScanConfig -->
type ScanConfig struct {
    Workers int  // default: runtime.NumCPU() * 8
    Verify  bool // Phase 5 — always false in Phase 1
    Unmask  bool // for output layer
}

<!-- Source interface -->
type Source interface {
    Chunks(ctx context.Context, out chan<- types.Chunk) error
}

<!-- FileSource -->
type FileSource struct {
    Path      string
    ChunkSize int // bytes per chunk, default 4096
}

Chunking strategy: read the file in chunks of ChunkSize bytes with an overlap of max(256, maxPatternLen)
to avoid splitting a key across chunk boundaries.

<!-- Aho-Corasick import -->
import ahocorasick "github.com/petar-dambovaliev/aho-corasick"
// ac.FindAll(s string) []ahocorasick.Match — returns match positions

<!-- ants import -->
import "github.com/panjf2000/ants/v2"
// pool, _ := ants.NewPool(workers, ants.WithOptions(...))
// pool.Submit(func() { ... })
// pool.ReleaseWithTimeout(timeout)
</interfaces>
</context>

<tasks>

<task type="auto" tdd="true">
<name>Task 1: Shared types package, Finding, and Shannon entropy function</name>
<files>pkg/types/chunk.go, pkg/engine/finding.go, pkg/engine/entropy.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CORE-04 row: Shannon entropy, ~10-line stdlib function, threshold 3.5 bits/char)
- /home/salva/Documents/apikey/pkg/storage/findings.go (Finding and MaskKey defined there — engine.Finding is a separate type for the pipeline)
</read_first>
<behavior>
- Test 1: Shannon("aaaaaaa") → value near 0.0 (all same characters, no entropy)
- Test 2: Shannon("abcdefgh") → value near 3.0 (8 distinct chars)
- Test 3: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") → >= 3.5 (real key entropy)
- Test 4: Shannon("") → 0.0 (empty string)
- Test 5: MaskKey("sk-proj-abc1234") → "sk-proj-...1234" (first 8 + last 4)
- Test 6: MaskKey("abc") → "****" (too short to mask)
</behavior>
<action>
Create **pkg/types/chunk.go** — the shared type that breaks the circular import:

```go
package types

// Chunk is a segment of file content passed through the scanning pipeline.
// Defined in pkg/types (not pkg/engine) so that pkg/engine/sources can use it
// without creating a circular import with pkg/engine.
type Chunk struct {
	Data   []byte // raw bytes
	Source string // file path, URL, or description
	Offset int64  // byte offset of this chunk within the source
}
```

Create **pkg/engine/finding.go**:

```go
package engine

import "time"

// Finding represents a detected API key from the scanning pipeline.
// KeyValue holds the plaintext key — the storage layer encrypts it before persisting.
type Finding struct {
	ProviderName string
	KeyValue     string // full plaintext key
	KeyMasked    string // first8...last4
	Confidence   string // "high", "medium", "low"
	Source       string // file path or description
	SourceType   string // "file", "dir", "git", "stdin", "url"
	LineNumber   int
	Offset       int64
	DetectedAt   time.Time
}

// MaskKey returns a masked representation: first 8 chars + "..." + last 4 chars.
// Returns "****" if the key is shorter than 12 characters.
func MaskKey(key string) string {
	if len(key) < 12 {
		return "****"
	}
	return key[:8] + "..." + key[len(key)-4:]
}
```

Create **pkg/engine/entropy.go**:

```go
package engine

import "math"

// Shannon computes the Shannon entropy of a string in bits per character.
// Returns 0.0 for empty strings.
// A value >= 3.5 indicates high randomness, consistent with real API keys.
func Shannon(s string) float64 {
	if len(s) == 0 {
		return 0.0
	}
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build ./pkg/types/... && go build ./pkg/engine/... && echo "BUILD OK"</automated>
</verify>
<acceptance_criteria>
- `go build ./pkg/types/...` exits 0
- `go build ./pkg/engine/...` exits 0
- pkg/types/chunk.go exports Chunk with fields Data, Source, Offset
- pkg/engine/finding.go exports Finding and MaskKey
- pkg/engine/entropy.go exports Shannon using math.Log2
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- MaskKey("sk-proj-abc1234") produces "sk-proj-...1234"
</acceptance_criteria>
<done>pkg/types/Chunk exists (no imports, no circular dependency risk), Finding, MaskKey, and Shannon exist and compile.</done>
</task>

<task type="auto" tdd="true">
<name>Task 2: Pipeline stages, engine orchestration, FileSource, and filled test stubs</name>
<files>
pkg/engine/filter.go,
pkg/engine/detector.go,
pkg/engine/engine.go,
pkg/engine/sources/source.go,
pkg/engine/sources/file.go,
pkg/engine/scanner_test.go
</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 2: Three-Stage Scanning Pipeline — exact channel-based code example)
- /home/salva/Documents/apikey/pkg/types/chunk.go
- /home/salva/Documents/apikey/pkg/engine/chunk.go (if exists — use pkg/types/chunk.go instead)
- /home/salva/Documents/apikey/pkg/engine/finding.go
- /home/salva/Documents/apikey/pkg/engine/entropy.go
- /home/salva/Documents/apikey/pkg/providers/registry.go (Registry.AC() and Registry.List() signatures)
</read_first>
<behavior>
- Test 1: Scan testdata/samples/openai_key.txt → 1 finding, ProviderName=="openai", KeyValue contains "sk-proj-"
- Test 2: Scan testdata/samples/anthropic_key.txt → 1 finding, ProviderName=="anthropic"
- Test 3: Scan testdata/samples/no_keys.txt → 0 findings
- Test 4: Scan testdata/samples/multiple_keys.txt → 2 findings (openai + anthropic)
- Test 5: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") >= 3.5 (entropy check)
- Test 6: KeywordFilter drops a chunk with text "hello world" (no provider keywords)
</behavior>
<action>
Create **pkg/engine/sources/source.go**:

```go
package sources

import (
	"context"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

// Source is the interface all input adapters must implement.
// Chunks writes content segments to the out channel until the source is exhausted or ctx is cancelled.
// NOTE: Source is defined in the sources sub-package (not pkg/engine) and uses pkg/types.Chunk
// to avoid a circular import: engine → sources → engine.
type Source interface {
	Chunks(ctx context.Context, out chan<- types.Chunk) error
}
```

Create **pkg/engine/sources/file.go**:

```go
package sources

import (
	"context"
	"os"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

const defaultChunkSize = 4096
const chunkOverlap = 256 // overlap between chunks to avoid splitting keys at boundaries

// FileSource reads a single file and emits overlapping chunks.
type FileSource struct {
	Path      string
	ChunkSize int
}

// NewFileSource creates a FileSource for the given path with the default chunk size.
func NewFileSource(path string) *FileSource {
	return &FileSource{Path: path, ChunkSize: defaultChunkSize}
}

// Chunks reads the file in overlapping segments and sends each chunk to out.
// Uses os.ReadFile for simplicity in Phase 1. mmap for files > 10MB is implemented
// in Phase 4 (Input Sources) alongside all other source adapter enhancements.
func (f *FileSource) Chunks(ctx context.Context, out chan<- types.Chunk) error {
	data, err := os.ReadFile(f.Path)
	if err != nil {
		return err
	}
	size := f.ChunkSize
	if size <= 0 {
		size = defaultChunkSize
	}
	if len(data) <= size {
		// File fits in one chunk
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- types.Chunk{Data: data, Source: f.Path, Offset: 0}:
		}
		return nil
	}
	// Emit overlapping chunks. Offset is the chunk's start position in the file —
	// accumulating emitted lengths would drift from the true position because
	// consecutive chunks overlap by chunkOverlap bytes.
	for start := 0; start < len(data); start += size - chunkOverlap {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		chunk := types.Chunk{
			Data:   data[start:end],
			Source: f.Path,
			Offset: int64(start),
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- chunk:
		}
		if end == len(data) {
			break
		}
	}
	return nil
}
```

Create **pkg/engine/filter.go**:

```go
package engine

import (
	ahocorasick "github.com/petar-dambovaliev/aho-corasick"

	"github.com/salvacybersec/keyhunter/pkg/types"
)

// KeywordFilter filters a stream of chunks using an Aho-Corasick automaton.
// Only chunks that contain at least one provider keyword are sent to out.
// This is Stage 2 of the pipeline (runs after Source, before Detector).
func KeywordFilter(ac ahocorasick.AhoCorasick, in <-chan types.Chunk, out chan<- types.Chunk) {
	for chunk := range in {
		if len(ac.FindAll(string(chunk.Data))) > 0 {
			out <- chunk
		}
	}
}
```

Create **pkg/engine/detector.go**:

```go
package engine

import (
	"regexp"
	"strings"
	"time"

	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/types"
)

// Detect applies provider regex patterns and optional entropy checks to a chunk.
// It returns all findings from the chunk.
func Detect(chunk types.Chunk, providerList []providers.Provider) []Finding {
	var findings []Finding
	content := string(chunk.Data)

	for _, p := range providerList {
		for _, pat := range p.Patterns {
			re, err := regexp.Compile(pat.Regex)
			if err != nil {
				continue // invalid regex — skip silently
			}
			matches := re.FindAllString(content, -1)
			for _, match := range matches {
				// Apply entropy check if threshold is set
				if pat.EntropyMin > 0 && Shannon(match) < pat.EntropyMin {
					continue // too low entropy — likely a placeholder
				}
				line := lineNumber(content, match)
				findings = append(findings, Finding{
					ProviderName: p.Name,
					KeyValue:     match,
					KeyMasked:    MaskKey(match),
					Confidence:   pat.Confidence,
					Source:       chunk.Source,
					SourceType:   "file",
					LineNumber:   line,
					Offset:       chunk.Offset,
					DetectedAt:   time.Now(),
				})
			}
		}
	}
	return findings
}

// lineNumber returns the 1-based line number where match first appears in content.
func lineNumber(content, match string) int {
	idx := strings.Index(content, match)
	if idx < 0 {
		return 0
	}
	return strings.Count(content[:idx], "\n") + 1
}
```

Create **pkg/engine/engine.go**:

```go
package engine

import (
	"context"
	"runtime"
	"sync"
	"time"

	"github.com/panjf2000/ants/v2"

	"github.com/salvacybersec/keyhunter/pkg/engine/sources"
	"github.com/salvacybersec/keyhunter/pkg/providers"
	"github.com/salvacybersec/keyhunter/pkg/types"
)

// ScanConfig controls scan execution parameters.
type ScanConfig struct {
	Workers int  // number of detector goroutines; defaults to runtime.NumCPU() * 8
	Verify  bool // opt-in active verification (Phase 5)
	Unmask  bool // include full key in Finding.KeyValue
}

// Engine orchestrates the three-stage scanning pipeline.
type Engine struct {
	registry *providers.Registry
}

// NewEngine creates an Engine backed by the given provider registry.
func NewEngine(registry *providers.Registry) *Engine {
	return &Engine{registry: registry}
}

// Scan runs the three-stage pipeline against src and returns a channel of Findings.
// The channel is closed when all chunks have been processed.
// The caller must drain the channel fully or cancel ctx to avoid goroutine leaks.
func (e *Engine) Scan(ctx context.Context, src sources.Source, cfg ScanConfig) (<-chan Finding, error) {
	workers := cfg.Workers
	if workers <= 0 {
		workers = runtime.NumCPU() * 8
	}

	// Buffer sizes follow RESEARCH.md Pattern 2.
	chunksChan := make(chan types.Chunk, 1000)
	detectableChan := make(chan types.Chunk, 500)
	resultsChan := make(chan Finding, 100)

	// Stage 1: source → chunksChan
	go func() {
		defer close(chunksChan)
		_ = src.Chunks(ctx, chunksChan)
	}()

	// Stage 2: keyword pre-filter → detectableChan
	go func() {
		defer close(detectableChan)
		KeywordFilter(e.registry.AC(), chunksChan, detectableChan)
	}()

	// Stage 3: detector workers → resultsChan
	pool, err := ants.NewPool(workers)
	if err != nil {
		close(resultsChan)
		return nil, err
	}
	providerList := e.registry.List()

	var wg sync.WaitGroup
	var mu sync.Mutex // keeps each chunk's findings contiguous on resultsChan

	go func() {
		defer func() {
			wg.Wait()
			close(resultsChan)
			pool.ReleaseWithTimeout(5 * time.Second)
		}()

		for chunk := range detectableChan {
			c := chunk // capture loop variable (pre-Go 1.22 semantics)
			wg.Add(1)
			task := func() {
				defer wg.Done()
				found := Detect(c, providerList)
				mu.Lock()
				for _, f := range found {
					select {
					case resultsChan <- f:
					case <-ctx.Done():
					}
				}
				mu.Unlock()
			}
			if err := pool.Submit(task); err != nil {
				wg.Done() // pool released or overloaded — don't deadlock wg.Wait
			}
		}
	}()

	return resultsChan, nil
}
```

Fill **pkg/engine/scanner_test.go** (replacing stubs from Plan 01):

```go
package engine_test

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"github.com/salvacybersec/keyhunter/pkg/engine"
	"github.com/salvacybersec/keyhunter/pkg/engine/sources"
	"github.com/salvacybersec/keyhunter/pkg/providers"
)

func newTestRegistry(t *testing.T) *providers.Registry {
	t.Helper()
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	return reg
}

func TestShannonEntropy(t *testing.T) {
	assert.InDelta(t, 0.0, engine.Shannon("aaaaaaa"), 0.01)
	assert.Greater(t, engine.Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr"), 3.5)
	assert.Equal(t, 0.0, engine.Shannon(""))
}

func TestKeywordPreFilter(t *testing.T) {
	reg := newTestRegistry(t)
	ac := reg.AC()

	// Chunk with OpenAI keyword should pass
	matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-test")
	assert.NotEmpty(t, matches)

	// Chunk with no keywords should be dropped
	noMatches := ac.FindAll("hello world no secrets here")
	assert.Empty(t, noMatches)
}

func TestScannerPipelineOpenAI(t *testing.T) {
	reg := newTestRegistry(t)
	eng := engine.NewEngine(reg)
	src := sources.NewFileSource("../../testdata/samples/openai_key.txt")
	cfg := engine.ScanConfig{Workers: 2}

	ch, err := eng.Scan(context.Background(), src, cfg)
	require.NoError(t, err)

	var findings []engine.Finding
	for f := range ch {
		findings = append(findings, f)
	}

	require.Len(t, findings, 1, "expected exactly 1 finding in openai_key.txt")
	assert.Equal(t, "openai", findings[0].ProviderName)
	assert.Contains(t, findings[0].KeyValue, "sk-proj-")
}

func TestScannerPipelineNoKeys(t *testing.T) {
	reg := newTestRegistry(t)
	eng := engine.NewEngine(reg)
	src := sources.NewFileSource("../../testdata/samples/no_keys.txt")
	cfg := engine.ScanConfig{Workers: 2}

	ch, err := eng.Scan(context.Background(), src, cfg)
	require.NoError(t, err)

	var findings []engine.Finding
	for f := range ch {
		findings = append(findings, f)
	}

	assert.Empty(t, findings, "expected zero findings in no_keys.txt")
}

func TestScannerPipelineMultipleKeys(t *testing.T) {
	reg := newTestRegistry(t)
	eng := engine.NewEngine(reg)
	src := sources.NewFileSource("../../testdata/samples/multiple_keys.txt")
	cfg := engine.ScanConfig{Workers: 2}

	ch, err := eng.Scan(context.Background(), src, cfg)
	require.NoError(t, err)

	var findings []engine.Finding
	for f := range ch {
		findings = append(findings, f)
	}

	assert.GreaterOrEqual(t, len(findings), 2, "expected at least 2 findings in multiple_keys.txt")

	var names []string
	for _, f := range findings {
		names = append(names, f.ProviderName)
	}
	assert.Contains(t, names, "openai")
	assert.Contains(t, names, "anthropic")
}
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/engine/... -v -count=1 2>&1 | tail -30</automated>
</verify>
<acceptance_criteria>
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS (no SKIP)
- `go build ./...` exits 0 with no circular import errors
- TestShannonEntropy passes — 0.0 for "aaaaaaa", >= 3.5 for real key pattern
- TestKeywordPreFilter passes — AC matches sk-proj-, empty for "hello world"
- TestScannerPipelineOpenAI passes — 1 finding with ProviderName=="openai"
- TestScannerPipelineNoKeys passes — 0 findings
- TestScannerPipelineMultipleKeys passes — >= 2 findings with both provider names
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'KeywordFilter' pkg/engine/engine.go` exits 0
- pkg/types/chunk.go exists and pkg/engine/sources imports pkg/types (not pkg/engine)
</acceptance_criteria>
<done>Three-stage scanning pipeline works end-to-end: FileSource → KeywordFilter (AC) → Detect (regex + entropy) → Finding channel. Circular import resolved via pkg/types. All engine tests pass.</done>
</task>

</tasks>

<verification>
After both tasks:
- `go build ./...` exits 0 with zero circular import errors
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- `grep -rq 'pkg/types' pkg/engine/sources/source.go` exits 0 (sources imports types, not engine)
- Scanning testdata/samples/openai_key.txt returns 1 finding with provider "openai"
- Scanning testdata/samples/no_keys.txt returns 0 findings
</verification>

<success_criteria>
- Three-stage pipeline: AC pre-filter → regex + entropy detector → results channel (CORE-01, CORE-06)
- Shannon entropy function using stdlib math (CORE-04)
- ants v2 goroutine pool with configurable worker count (CORE-05)
- FileSource adapter reading files in overlapping chunks using os.ReadFile (mmap deferred to Phase 4)
- pkg/types/Chunk breaks the engine↔sources circular import
- All engine tests pass against real testdata fixtures
</success_criteria>

<output>
After completion, create `.planning/phases/01-foundation/01-04-SUMMARY.md` following the summary template.
</output>