docs(phase-1): research foundation phase

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
salvacybersec
2026-04-04 23:32:10 +03:00
parent ee92aad4cf
commit fa3916a417

@@ -0,0 +1,764 @@
# Phase 1: Foundation - Research
**Researched:** 2026-04-04
**Domain:** Go CLI scaffolding, provider registry schema, three-stage scan pipeline, encrypted SQLite storage
**Confidence:** HIGH
---
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| CORE-01 | Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline | Aho-Corasick pre-filter (petar-dambovaliev/aho-corasick) + Go RE2 regexp; buffered channel pipeline pattern documented |
| CORE-02 | Provider definitions loaded from YAML files embedded at compile time via Go embed | `//go:embed providers/*.yaml` + `fs.WalkDir` to iterate; gopkg.in/yaml.v3 for parse |
| CORE-03 | Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata | Registry struct holding `[]Provider` loaded at startup; injected via constructor not global |
| CORE-04 | Entropy analysis as secondary signal for low-confidence providers | Shannon entropy implemented as ~10-line stdlib function using `math` package; threshold 3.5 bits/char |
| CORE-05 | Worker pool parallelism with configurable worker count (default: CPU count) | `ants.NewPool(runtime.NumCPU() * 8)` for detectors; `ants.NewPool(runtime.NumCPU())` for verifiers |
| CORE-06 | Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files | petar-dambovaliev/aho-corasick: 20x faster than cloudflare's, used by TruffleHog; build automaton at registry init |
| CORE-07 | mmap-based large file reading for memory efficiency | `golang.org/x/sys/unix.Mmap` or `syscall.Mmap` for large file sources; skip binary files via magic bytes |
| STOR-01 | SQLite database for persisting scan results, keys, recon history | `modernc.org/sqlite` (pure Go, no CGo); `database/sql` interface; WAL mode; schema embedded via `//go:embed` |
| STOR-02 | Application-level AES-256 encryption for stored keys and sensitive config | `crypto/aes` + `crypto/cipher` GCM mode; nonce prepended to ciphertext stored in BLOB column |
| STOR-03 | Encryption key derived from user passphrase via Argon2 | `golang.org/x/crypto/argon2` IDKey with time=1, memory=64*1024, threads=4, keyLen=32 (the values suggested in the package docs; RFC 9106's 64 MiB option uses time=3) |
| CLI-01 | Cobra-based CLI with commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule | cobra v1.10.2 command tree; `cmd/` package; main.go < 30 lines |
| CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | `viper.WriteConfigAs(filepath)` with `os.MkdirAll`; `PersistentPreRunE` for config load |
| CLI-03 | keyhunter config set key value persists values | `viper.Set(key, value)` + `viper.WriteConfig()` |
| CLI-04 | keyhunter providers list/info/stats for provider management | Registry.List(), Registry.Get(name) from loaded YAML; lipgloss table for terminal output |
| CLI-05 | Scan flags: --providers, --category, --confidence, --exclude, --verify, --workers, --output, --unmask, --notify | Persistent flags on `scan` command; viper.BindPFlag for config file override |
| PROV-10 | Provider YAML schema includes format_version and last_verified fields validated at load time | Custom `UnmarshalYAML` method on Provider struct; return error if format_version == 0 or last_verified empty |
</phase_requirements>
---
## Summary
Phase 1 builds the three foundations everything else depends on: the provider registry (YAML schema + embed), the storage layer (SQLite + AES-256 encryption), and the CLI skeleton (Cobra + Viper). These are the zero-dependency components in the architecture; nothing downstream can be built until all three exist. The scanning engine (three-stage pipeline) is also scoped here because the Phase 1 success criteria require a working scan to validate these foundations.
The stack for this phase is fully settled from prior research. The first open question from SUMMARY.md, which Aho-Corasick library to use, is now resolved: TruffleHog originally used `petar-dambovaliev/aho-corasick` (confirmed via the TruffleHog blog post crediting Petar Dambovaliev). That library is 20x faster than cloudflare/ahocorasick and uses 1/8th the memory. The second open question, Argon2 vs PBKDF2 for key derivation, is also resolved: use Argon2id (`golang.org/x/crypto/argon2.IDKey`), the variant RFC 9106 recommends. It ships in the `x/crypto` module that this phase already depends on.
The AES-256 encryption approach is application-level (not SQLCipher), using `crypto/aes` GCM mode with the nonce prepended to the ciphertext stored as a BLOB. This preserves `CGO_ENABLED=0` throughout. Key derivation uses Argon2id to produce a 32-byte key from a user passphrase + random salt stored alongside the database. For Phase 1, the salt can be stored in the config YAML; OS keychain integration (zalando/go-keyring) can be added in a later phase without schema migration.
**Primary recommendation:** Build in order: (1) Provider YAML schema + embed loader, (2) SQLite schema + AES-256 crypto layer, (3) Scanning engine pipeline (Aho-Corasick + RE2 + entropy), (4) Cobra/Viper CLI wiring. The scan pipeline validation is the integration test that proves all three foundations work together.
---
## Project Constraints (from CLAUDE.md)
All directives from CLAUDE.md are binding. Key constraints for Phase 1:
| Constraint | Directive |
|------------|-----------|
| Language | Go 1.22+ only. No other language. |
| CGO | `CGO_ENABLED=0` throughout — single binary, cross-compilation |
| SQLite driver | `modernc.org/sqlite` — NOT `mattn/go-sqlite3` (CGo) |
| SQLite encryption | Application-level AES-256 via `crypto/aes` — NOT SQLCipher (CGo) |
| CLI framework | `cobra v1.10.2` + `viper v1.21.0` — no alternatives |
| YAML parsing | `gopkg.in/yaml.v3` — not sigs.k8s.io/yaml or goccy/go-yaml |
| Concurrency | `ants v2.12.0` worker pool |
| Architecture | Plugin-based — providers as YAML files, compile-time embedded via `go:embed` |
| Regex engine | Go stdlib `regexp` (RE2-backed) ONLY — never `regexp2` or PCRE |
| Verification | Opt-in (`--verify` flag) — passive scanning by default |
| Key masking | Default masked output, `--unmask` for full keys |
| Worker pool | `github.com/panjf2000/ants/v2` v2.12.0 |
| Output formatting | `github.com/charmbracelet/lipgloss` (latest) |
| Web stack | `chi v5.2.5` + `templ v0.3.1001` + htmx + Tailwind v4 (Phase 18 only — do not scaffold in Phase 1) |
| Telegram | `telego v1.8.0` (Phase 17 only) |
| Scheduler | `gocron v2.19.1` (Phase 17 only) |
| Build | `go build -ldflags="-s -w"` for stripped binary |
| Forbidden patterns | Fiber, Echo, mattn/go-sqlite3, SQLCipher, robfig/cron, regexp2, Full Bubble Tea TUI |
---
## Standard Stack
### Core (Phase 1 only)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `github.com/spf13/cobra` | v1.10.2 | CLI command tree | Industry standard; used by TruffleHog, Gitleaks, Kubernetes, Docker CLI |
| `github.com/spf13/viper` | v1.21.0 | Config management (YAML/env/flags) | Cobra-native integration; v1.21.0 fixed key-casing bugs |
| `modernc.org/sqlite` | v1.35.x (SQLite 3.51.2) | Embedded database | Pure Go; no CGo; cross-compiles cleanly; updated 2026-03-17 |
| `gopkg.in/yaml.v3` | v3.0.1 | Parse provider YAML | Handles inline/anchored structs; stable; transitive dep of cobra anyway |
| `github.com/petar-dambovaliev/aho-corasick` | latest (2025-04-24) | Keyword pre-filter | 20x faster than cloudflare/ahocorasick; 1/8 memory; used by TruffleHog; MIT license |
| `github.com/panjf2000/ants/v2` | v2.12.0 | Worker pool | Battle-tested goroutine pool; v2.12.0 adds ReleaseContext for clean shutdown |
| `golang.org/x/crypto` | latest x/ | Argon2 key derivation | Official Go extended library; IDKey for Argon2id; same package used for other crypto needs |
| `golang.org/x/time` | latest x/ | Rate limiting (future-proofing) | Needed for CORE-05 workers; token bucket; add now to avoid later go.mod churn |
| `github.com/charmbracelet/lipgloss` | latest | Terminal table output | Declarative styles; used across Go security tool ecosystem |
| `github.com/stretchr/testify` | v1.10.x | Test assertions | Assert/require only; standard in Go ecosystem |
| `embed` (stdlib) | — | Compile-time YAML embed | Go 1.16+ native; no external dep |
| `crypto/aes`, `crypto/cipher` (stdlib) | — | AES-256-GCM encryption | Standard library; no CGo; GCM provides authenticated encryption |
| `math` (stdlib) | — | Shannon entropy calculation | ~10-line implementation; no library needed |
| `database/sql` (stdlib) | — | SQL interface over modernc.org/sqlite | Driver registered as `"sqlite"`; raw SQL; no ORM |
### Not Needed in Phase 1 (scaffold stubs only if required by interface)
| Library | Deferred To |
|---------|-------------|
| `chi v5.2.5` | Phase 18 (Web Dashboard) |
| `templ v0.3.1001` | Phase 18 (Web Dashboard) |
| `telego v1.8.0` | Phase 17 (Telegram Bot) |
| `gocron v2.19.1` | Phase 17 (Scheduler) |
**Installation (Phase 1 dependencies):**
```bash
go mod init github.com/yourusername/keyhunter
go get github.com/spf13/cobra@v1.10.2
go get github.com/spf13/viper@v1.21.0
go get modernc.org/sqlite@latest
go get gopkg.in/yaml.v3@v3.0.1
go get github.com/petar-dambovaliev/aho-corasick@latest
go get github.com/panjf2000/ants/v2@v2.12.0
go get golang.org/x/crypto@latest
go get golang.org/x/time@latest
go get github.com/charmbracelet/lipgloss@latest
go get github.com/stretchr/testify@latest
```
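**Version verification (run before writing go.mod manually):**
```bash
# The @latest suffix queries the module proxy; without it, go list -m
# only reports modules that are already in the build list.
go list -m github.com/petar-dambovaliev/aho-corasick@latest
go list -m modernc.org/sqlite@latest
go list -m github.com/panjf2000/ants/v2@latest
```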
---
## Architecture Patterns
### Recommended Project Structure (Phase 1 scope)
```
keyhunter/
  main.go                # < 30 lines, cobra root Execute()
  cmd/
    root.go              # rootCmd, persistent flags, PersistentPreRunE config load
    scan.go              # scan command + flags
    providers.go         # providers list/info/stats commands
    config.go            # config init/set/get commands
  pkg/
    providers/
      loader.go          # embed.FS + fs.WalkDir + yaml.Unmarshal
      registry.go        # Registry struct, Get/List/Stats methods
      schema.go          # Provider, Pattern, VerifySpec structs + UnmarshalYAML validation
    engine/
      engine.go          # Engine struct, Scan() method, pipeline orchestration
      pipeline.go        # channel wiring: chunksChan, detectableChan, resultsChan
      filter.go          # Aho-Corasick pre-filter stage
      detector.go        # Regex + entropy detector worker
      entropy.go         # Shannon entropy function
      chunk.go           # Chunk type (content []byte, source string, offset int64)
      finding.go         # Finding type (provider, key_value, key_masked, confidence, source, path)
      sources/
        source.go        # Source interface
        file.go          # FileSource
        dir.go           # DirSource (recursive with glob exclude)
    storage/
      db.go              # DB struct, Open(), migrations via embedded schema.sql
      schema.sql         # DDL for findings, scans, settings tables
      encrypt.go         # AES-256-GCM Encrypt(plaintext, key) / Decrypt(ciphertext, key)
      crypto.go          # Argon2id key derivation: DeriveKey(passphrase, salt)
      findings.go        # CRUD for findings table
      scans.go           # CRUD for scans table
    config/
      config.go          # Config struct, Load(), defaults
    output/
      table.go           # lipgloss colored terminal table
      json.go            # encoding/json output
  providers/
    openai.yaml          # Reference provider definitions (Phase 1: schema examples only)
    anthropic.yaml
    huggingface.yaml
```
### Pattern 1: Provider Registry with Compile-Time Embed
**What:** Provider YAML definitions embedded via `//go:embed` and loaded into an in-memory registry at startup.
**When to use:** Always. Never load provider files from disk at runtime.
```go
// Source: Go embed docs (https://pkg.go.dev/embed)
package providers

import (
	"embed"
	"fmt"
	"io/fs"
	"path/filepath"

	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
	"gopkg.in/yaml.v3"
)

// Embed patterns are resolved relative to this package's directory and may
// not contain "..", so the providers/ directory must sit next to this file
// (or the directive must move to a package that contains it).
//
//go:embed providers/*.yaml
var providersFS embed.FS

type Registry struct {
	providers []Provider
	ac        ahocorasick.AhoCorasick
}

func NewRegistry() (*Registry, error) {
	var providers []Provider
	err := fs.WalkDir(providersFS, "providers", func(path string, d fs.DirEntry, err error) error {
		// Propagate walk errors; skip directories and non-YAML files.
		if err != nil || d.IsDir() || filepath.Ext(path) != ".yaml" {
			return err
		}
		data, err := providersFS.ReadFile(path)
		if err != nil {
			return err
		}
		var p Provider
		if err := yaml.Unmarshal(data, &p); err != nil {
			return fmt.Errorf("provider %s: %w", path, err)
		}
		providers = append(providers, p)
		return nil
	})
	if err != nil {
		return nil, err
	}

	// Build Aho-Corasick automaton from all keywords
	var keywords []string
	for _, p := range providers {
		keywords = append(keywords, p.Keywords...)
	}
	builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
	ac := builder.Build(keywords)
	return &Registry{providers: providers, ac: ac}, nil
}
```
### Pattern 2: Three-Stage Scanning Pipeline (Buffered Channels)
**What:** Source adapters produce chunks onto buffered channels. Aho-Corasick pre-filter reduces candidates. Detector workers apply regex + entropy.
**When to use:** All scan operations. Never skip the pre-filter.
```go
// Source: TruffleHog v3 architecture (https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick)
// Requires imports: context, runtime, sync.
func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) {
	// The --workers flag defaults to 0, meaning "CPU count". Without this
	// guard, zero detector goroutines would start and the pipeline would stall.
	workers := cfg.Workers
	if workers <= 0 {
		workers = runtime.NumCPU()
	}

	chunksChan := make(chan Chunk, 1000)
	detectableChan := make(chan Chunk, 500)
	resultsChan := make(chan Finding, 100)

	// Stage 1: Source → chunks
	go func() {
		defer close(chunksChan)
		src.Chunks(ctx, chunksChan)
	}()

	// Stage 2: Aho-Corasick keyword pre-filter
	go func() {
		defer close(detectableChan)
		for chunk := range chunksChan {
			if len(e.registry.AC().FindAll(string(chunk.Data))) > 0 {
				detectableChan <- chunk
			}
		}
	}()

	// Stage 3: Detector workers
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range detectableChan {
				e.detect(chunk, resultsChan)
			}
		}()
	}

	// Close the results channel once every detector has drained its input.
	go func() {
		wg.Wait()
		close(resultsChan)
	}()
	return resultsChan, nil
}
```
### Pattern 3: AES-256-GCM Column Encryption
**What:** Encrypt key values before storing in SQLite. Nonce prepended to ciphertext. Key derived via Argon2id.
**When to use:** Every write of a full API key to storage.
```go
// Source: Go crypto/cipher docs (https://pkg.go.dev/crypto/cipher)
import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"io"
)

// Encrypt returns nonce || ciphertext || GCM tag.
func Encrypt(plaintext []byte, key []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	ciphertext := gcm.Seal(nonce, nonce, plaintext, nil) // nonce prepended
	return ciphertext, nil
}

// Decrypt splits off the prepended nonce and authenticates before returning.
func Decrypt(ciphertext []byte, key []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonceSize := gcm.NonceSize()
	if len(ciphertext) < nonceSize {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:]
	return gcm.Open(nil, nonce, ciphertext, nil)
}
```
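A short round-trip check can make the nonce-prepended layout concrete. The sketch below is self-contained and uses only the standard library. `sealValue` and `openValue` are hypothetical names that mirror `Encrypt` and `Decrypt` above, and the randomly generated key stands in for the Argon2id-derived key. It also flips one bit of the stored blob to show that GCM rejects tampered data instead of returning garbage.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
	"io"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealValue (hypothetical name) returns nonce || ciphertext || tag.
func sealValue(plaintext, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openValue (hypothetical name) reverses sealValue and verifies the tag.
func openValue(blob, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(blob) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, blob[:n], blob[n:], nil)
}

func main() {
	// Demo key only; in the design above the key comes from Argon2id.
	key := make([]byte, 32)
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		panic(err)
	}

	blob, err := sealValue([]byte("sk-example-not-a-real-key"), key)
	if err != nil {
		panic(err)
	}
	plain, err := openValue(blob, key)
	fmt.Println("round trip:", string(plain), err)

	// Flip one bit in the stored blob; authentication must fail.
	blob[len(blob)-1] ^= 0x01
	_, err = openValue(blob, key)
	fmt.Println("tampered blob rejected:", err != nil)
}
```

The stored BLOB is 12 bytes of nonce plus the plaintext length plus a 16-byte tag, which is worth keeping in mind when sizing columns or writing fixtures.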
### Pattern 4: Argon2id Key Derivation
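**What:** Derive 32-byte AES key from user passphrase + random salt. The parameters below follow the `golang.org/x/crypto/argon2` package documentation (time=1, memory=64 MiB). RFC 9106 Section 4 pairs 64 MiB with t=3 and p=4, so raise the time parameter to 3 to match the RFC's low-memory recommendation.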
**When to use:** On first use (generate and store salt); on subsequent use (re-derive key from passphrase + stored salt).
```go
// Source: https://pkg.go.dev/golang.org/x/crypto/argon2
import (
	"crypto/rand"

	"golang.org/x/crypto/argon2"
)

const (
	argon2Time    = 1
	argon2Memory  = 64 * 1024 // in KiB, i.e. 64 MiB
	argon2Threads = 4
	argon2KeyLen  = 32 // AES-256 key length
)

func DeriveKey(passphrase []byte, salt []byte) []byte {
	return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen)
}

// Generate salt (do once, persist in config)
func NewSalt() ([]byte, error) {
	salt := make([]byte, 32)
	_, err := rand.Read(salt) // crypto/rand, not math/rand
	return salt, err
}
```
### Pattern 5: Provider YAML Schema with Validation
**What:** Provider struct with `UnmarshalYAML` that validates required fields including `format_version` and `last_verified`.
**When to use:** Every provider YAML load. Return error on invalid schema — fail fast at startup.
```go
// Source: gopkg.in/yaml.v3 docs (https://pkg.go.dev/gopkg.in/yaml.v3)
type Provider struct {
	Name          string      `yaml:"name"`
	FormatVersion int         `yaml:"format_version"`
	LastVerified  string      `yaml:"last_verified"` // ISO date: "2026-04-04"
	Keywords      []string    `yaml:"keywords"`
	Patterns      []Pattern   `yaml:"patterns"`
	Verify        *VerifySpec `yaml:"verify,omitempty"`
}

func (p *Provider) UnmarshalYAML(value *yaml.Node) error {
	// rawProvider has no UnmarshalYAML method, which avoids infinite recursion.
	type rawProvider Provider
	if err := value.Decode((*rawProvider)(p)); err != nil {
		return err
	}
	if p.FormatVersion == 0 {
		return fmt.Errorf("provider %q: format_version is required", p.Name)
	}
	if p.LastVerified == "" {
		return fmt.Errorf("provider %q: last_verified is required", p.Name)
	}
	if len(p.Keywords) == 0 {
		return fmt.Errorf("provider %q: at least one keyword is required", p.Name)
	}
	if len(p.Patterns) == 0 {
		return fmt.Errorf("provider %q: at least one pattern is required", p.Name)
	}
	return nil
}
```
### Pattern 6: Shannon Entropy (stdlib only)
**What:** Calculate bits-per-character entropy of a string. No library needed.
**When to use:** CORE-04: as secondary signal after keyword pre-filter and regex confirm a candidate. Entropy threshold 3.5 bits/char for most providers.
```go
// Source: Shannon entropy formula (verified against TruffleHog entropy implementation)
import "math"

func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	// Count occurrences of each rune.
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}
```
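A few hand-checkable inputs help calibrate the 3.5 bits/char threshold. The sketch below repeats the function so it runs on its own; the sample strings are illustrative and not taken from any provider. Because a string with k distinct characters can score at most log2(k), a candidate needs at least 12 distinct characters to reach 3.5 at all (log2(11) is about 3.46).

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the bits-per-character entropy of s.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	samples := []string{
		"aaaaaaaa",         // one symbol: 0 bits
		"abababab",         // two equally likely symbols: 1 bit
		"abcdefgh",         // eight distinct symbols: 3 bits
		"0123456789abcdef", // sixteen distinct symbols: 4 bits
	}
	for _, s := range samples {
		fmt.Printf("%-20q %.2f\n", s, shannonEntropy(s))
	}
}
```

These four values are exact (0, 1, 3 and 4 bits), which makes them convenient fixtures for `TestShannonEntropy`.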
### Pattern 7: modernc.org/sqlite Initialization
**What:** Open SQLite DB with WAL mode and foreign keys. Schema embedded via `//go:embed`.
**When to use:** Storage layer initialization.
```go
// Source: https://practicalgobook.net/posts/go-sqlite-no-cgo/
import (
	"database/sql"
	_ "embed" // blank import: a plain "embed" import is unused and fails to compile

	_ "modernc.org/sqlite" // registers "sqlite" driver
)

//go:embed schema.sql
var schemaSQL string

func Open(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite", path)
	if err != nil {
		return nil, err
	}
	// PRAGMA foreign_keys applies per connection. Pinning the pool to one
	// connection is one simple way to make the setting below stick.
	db.SetMaxOpenConns(1)

	// Enable WAL mode for better concurrent access
	if _, err := db.Exec("PRAGMA journal_mode=WAL"); err != nil {
		db.Close()
		return nil, err
	}
	if _, err := db.Exec("PRAGMA foreign_keys=ON"); err != nil {
		db.Close()
		return nil, err
	}
	// Apply schema
	if _, err := db.Exec(schemaSQL); err != nil {
		db.Close()
		return nil, err
	}
	return db, nil
}
```
### Pattern 8: Cobra + Viper Wiring
**What:** Cobra command tree with Viper config loaded in PersistentPreRunE. Flags bound to Viper for config file override.
**When to use:** root.go and all command files.
```go
// Source: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/
var rootCmd = &cobra.Command{
	Use:   "keyhunter",
	Short: "API key scanner for 108+ LLM providers",
	PersistentPreRunE: func(cmd *cobra.Command, args []string) error {
		viper.SetConfigName(".keyhunter")
		viper.SetConfigType("yaml")
		viper.AddConfigPath("$HOME")
		viper.AutomaticEnv()
		viper.SetEnvPrefix("KEYHUNTER")
		if err := viper.ReadInConfig(); err != nil {
			if _, ok := err.(viper.ConfigFileNotFoundError); !ok {
				return err
			}
			// Config file not found: acceptable on first run
		}
		return nil
	},
}

// In init() of each command file:
func init() {
	scanCmd.Flags().IntP("workers", "w", 0, "worker count (default: CPU count)")
	viper.BindPFlag("workers", scanCmd.Flags().Lookup("workers"))
}
```
### Anti-Patterns to Avoid
- **Global provider registry:** Pass `*Registry` via constructor. Global state makes testing impossible without full initialization.
- **Unbuffered result channels:** Use `make(chan Finding, 1000)`. Unbuffered channels cause detector workers to block on slow output consumers, collapsing parallelism.
- **Runtime YAML loading:** Never load from filesystem at scan time. `//go:embed` only.
- **regexp2 or PCRE:** Go's `regexp` package (RE2) already provides linear-time guarantees. regexp2 loses this guarantee.
- **Entropy-only detection:** Never flag a candidate based solely on entropy score. Entropy is a secondary filter applied only after keyword pre-filter and regex confirm a pattern match.
- **Plaintext key column:** Never store full API key as TEXT. Always encrypt with AES-256-GCM before INSERT.
- **Raw passphrase or key in config:** Derive the AES key via Argon2id from a passphrase supplied at runtime (interactive prompt or the `KEYHUNTER_PASSPHRASE` env var). Never store the raw passphrase or raw key in the config file.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Multi-keyword search automaton | Custom Aho-Corasick or loop-based string search | `petar-dambovaliev/aho-corasick` | Aho-Corasick is O(n+m+z); naive loop is O(n*k); library is 20x faster than next-best Go option |
| Goroutine pool with dynamic resize | Manual goroutine spawn with WaitGroup | `ants v2.12.0` | Goroutine explosion on large repos; ants handles backpressure, panic recovery, context cancellation |
| AES key derivation from passphrase | SHA256(passphrase) or similar | `argon2.IDKey` from `golang.org/x/crypto` | A single fast hash (MD5, SHA-256) lets an attacker test passphrase guesses at very high rates on GPUs; Argon2id is memory-hard, which makes each guess expensive |
| SQLite column encryption | XOR, custom cipher, or base64 "encoding" | `crypto/aes` GCM via stdlib | GCM provides both confidentiality and authentication; custom schemes always have vulnerabilities |
| Config file management | Custom INI or JSON parser | `viper v1.21.0` | Viper handles YAML + env vars + CLI flags with correct precedence; hand-rolled configs miss env var override |
| CLI command parsing | `flag` stdlib or custom parser | `cobra v1.10.2` | Cobra provides nested subcommands, persistent flags, shell completion, help generation — stdlib flag lacks all of these |
**Key insight:** The Aho-Corasick pre-filter and Argon2id key derivation in particular are problems where "obvious" implementations (nested loops, SHA256) have well-documented security or performance failure modes that justify the dependency cost.
---
## Common Pitfalls
### Pitfall 1: Wrong Aho-Corasick Library Choice
**What goes wrong:** Using `cloudflare/ahocorasick` because it appears more prominent in search results. It uses 8x more memory and runs 20x slower than `petar-dambovaliev/aho-corasick`. For a tool scanning large repos with 108 keyword patterns, this difference is measurable.
**Why it happens:** cloudflare/ahocorasick appears first in many search results.
**How to avoid:** Use `github.com/petar-dambovaliev/aho-corasick` (verified as the library TruffleHog uses per their own blog post crediting Petar Dambovaliev). Build the automaton once at registry init; the automaton is thread-safe for concurrent reads.
**Warning signs:** Scans on 100MB+ repos running significantly slower than expected; high memory usage during keyword pre-filter stage.
---
### Pitfall 2: Encryption Key Stored in Config File as Raw Bytes
**What goes wrong:** Storing the 32-byte AES key directly in `~/.keyhunter.yaml` or as an env var. Anyone with config file read access can decrypt the entire database.
**Why it happens:** "Just store the key" is the simplest implementation. Argon2id + salt seems like unnecessary complexity.
**How to avoid:** Store only the randomly generated salt in config. Re-derive the key from `passphrase + salt` on each run using Argon2id. The passphrase is entered interactively or set via `KEYHUNTER_PASSPHRASE` env var (never stored). For Phase 1, interactive passphrase entry is acceptable. OS keychain integration (zalando/go-keyring) can be added later without schema migration.
**Warning signs:** `keyhunter_key:` field appearing as hex bytes in the YAML config file.
---
### Pitfall 3: Provider YAML Schema Allowing Missing format_version or last_verified
**What goes wrong:** Provider YAML loads without validation. Providers with missing `format_version` or stale `last_verified` accumulate over time. Pattern health tracking (PROV-10) becomes meaningless.
**Why it happens:** `yaml.Unmarshal` to a struct silently zero-values missing fields. No validation = no error.
**How to avoid:** Implement `UnmarshalYAML` with explicit validation on the Provider struct. Fail at startup (not at scan time) if any provider is invalid. This catches schema errors at development time, not production time.
**Warning signs:** `format_version: 0` appearing in any loaded provider struct.
---
### Pitfall 4: SQLite Without WAL Mode in a CLI Tool
**What goes wrong:** Default SQLite journal mode causes `SQLITE_BUSY` errors when the dashboard (Phase 18) or multiple concurrent processes read while a scan writes. The default rollback journal also has lower write throughput.
**Why it happens:** `sql.Open("sqlite", path)` uses the default rollback journal.
**How to avoid:** Always execute `PRAGMA journal_mode=WAL` immediately after opening the database. WAL is a one-time setting that persists in the database file. Also set `PRAGMA foreign_keys=ON` to enforce referential integrity; unlike WAL, it is a per-connection setting, so it must be applied to every connection in the `database/sql` pool.
**Warning signs:** `database is locked` errors during concurrent read/write operations.
---
### Pitfall 5: Entropy Check Before Regex Confirmation
**What goes wrong:** Running Shannon entropy on every chunk that passes the Aho-Corasick keyword filter yields a false positive rate of up to 80% (HashiCorp 2025 research). High-entropy strings like UUIDs, hashes, and base64-encoded content can all score above 3.5 bits/char.
**Why it happens:** Entropy feels like an independent signal and is applied eagerly as a quick filter.
**How to avoid:** Entropy is strictly a secondary signal. Apply it only to strings that have already matched both a keyword (Aho-Corasick) AND a provider regex pattern. The order is: keyword pre-filter → regex match → entropy calibration. Never entropy-only.
**Warning signs:** More than 30% of findings lacking a matching provider regex pattern.
---
### Pitfall 6: mmap for Small Files on Linux
**What goes wrong:** mmap is beneficial for large files (>10MB) where avoiding a full read-into-memory matters. For small files, mmap has higher setup overhead than a simple `os.ReadFile`. mmap also requires explicit cleanup to avoid address space exhaustion on directory scans with thousands of small files.
**Why it happens:** CORE-07 specifies mmap-based large file reading, and implementing it uniformly for all files seems simpler.
**How to avoid:** Use `os.ReadFile` for files under 10MB. Use mmap only above that threshold, with explicit `unix.Munmap` cleanup in a deferred call. Check for binary files before mmap — use the first 512 bytes of the file to detect binary content via `http.DetectContentType` and skip non-text files.
**Warning signs:** Address space exhaustion during directory scans; "too many open files" errors.
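The decision logic described above can be sketched without touching mmap itself. In this illustration `readStrategy` and `looksBinary` are hypothetical helper names, the 10 MiB cutoff is the threshold from this pitfall, and treating every sniffed type outside `text/` as binary is an assumption that may need loosening for some formats.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// mmapThreshold is the size above which mmap is preferred over a full read.
const mmapThreshold = 10 << 20 // 10 MiB

// looksBinary sniffs at most the first 512 bytes of a file.
// Assumption: anything http.DetectContentType does not label text/* is binary.
func looksBinary(head []byte) bool {
	if len(head) > 512 {
		head = head[:512]
	}
	return !strings.HasPrefix(http.DetectContentType(head), "text/")
}

// readStrategy returns "skip", "readfile" or "mmap" for a file of the
// given size whose leading bytes are head.
func readStrategy(size int64, head []byte) string {
	switch {
	case looksBinary(head):
		return "skip"
	case size < mmapThreshold:
		return "readfile"
	default:
		return "mmap"
	}
}

func main() {
	text := []byte("API_KEY=placeholder\n")
	png := []byte("\x89PNG\r\n\x1a\n") // PNG file signature

	fmt.Println(readStrategy(4<<10, text))  // readfile
	fmt.Println(readStrategy(64<<20, text)) // mmap
	fmt.Println(readStrategy(4<<10, png))   // skip
}
```

Checking for binary content first means a large binary never gets mapped at all, which keeps the mmap path limited to files that will actually be scanned.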
---
### Pitfall 7: Argon2 Parameters Too Low (Fast KDF = Weak Security)
**What goes wrong:** Using time=1, memory=4096 (a commonly copied example) instead of RFC 9106's recommendations. Fast key derivation makes brute-force attacks on the database passphrase practical.
**Why it happens:** Low parameters make tests run faster and startup feel snappier.
**How to avoid:** Use at least `time=1, memory=64*1024 (64 MiB), threads=4, keyLen=32` for Argon2id, the values suggested in the `golang.org/x/crypto/argon2` documentation. RFC 9106 Section 4 is stricter for the same memory budget (t=3, m=64 MiB, p=4), so prefer `time=3` if startup latency allows. Measure startup latency with the chosen parameters; the estimate for time=1 on modern hardware is ~100-300ms, which is acceptable for a CLI tool.
**Warning signs:** `argon2.IDKey(pass, salt, 1, 4096, 1, 32)` appearing in the code. The memory parameter should be 64*1024 (65536 KiB), not 4096.
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Linear keyword scan before regex | Aho-Corasick pre-filter | TruffleHog v3.28.6 (2024) | 2x average scan speedup; O(n+m+z) vs O(n*k) |
| mattn/go-sqlite3 (CGo) | modernc.org/sqlite (pure Go) | Go 1.16+ era | CGO_ENABLED=0 enabled; cross-compilation works |
| SQLCipher for DB encryption | Application-level AES-256-GCM | 2023-2025 | No CGo dependency; AES GCM provides authentication |
| PBKDF2 for key derivation | Argon2id | RFC 9106 (2021) | GPU-resistant; side-channel resistant; OWASP recommended |
| regexp2/PCRE in Go scanners | Go stdlib regexp (RE2) | Ongoing | ReDoS immune; linear time guaranteed |
| Storing full keys masked in DB | Encrypt key_encrypted column, store only mask | GHSA-4h8c-qrcq-cv5c (2024) | Database file no longer a credential dump |
**Deprecated/outdated:**
- `mattn/go-sqlite3`: Requires CGo; cross-compilation breaks; modernc.org/sqlite is the replacement.
- `robfig/cron`: Unmaintained since 2020; use gocron v2 (Phase 17).
- `cloudflare/ahocorasick`: Still maintained but 20x slower than petar-dambovaliev/aho-corasick; do not use.
- Entropy-only secret detection: HashiCorp 2025 research confirms 80% FP rate; layered pipeline is the current standard.
---
## Open Questions
1. **Passphrase UX for first run**
- What we know: Argon2id requires a passphrase to derive the AES key. For Phase 1, this must be handled on first `keyhunter scan` or `keyhunter config init`.
- What's unclear: Should Phase 1 use `bufio.NewReader(os.Stdin).ReadString('\n')` for passphrase entry, or skip encryption and use a generated random key stored in config (less secure but zero-friction)?
- Recommendation: Use a generated random 32-byte key stored in `~/.keyhunter.yaml` as base64 for Phase 1 (zero-friction). Document that this is a development shortcut; OS keychain integration (zalando/go-keyring) replaces it in a later phase. The encrypt/decrypt functions and schema are in place; only the key source changes.
2. **mmap on Linux: syscall vs golang.org/x/sys/unix**
- What we know: Both `syscall.Mmap` and `golang.org/x/sys/unix.Mmap` provide mmap on Linux. The x/ package has a cleaner API.
- What's unclear: Whether `golang.org/x/sys`, a transitive dependency of many packages (likely pulled in by viper or cobra), is already in go.sum or needs an explicit `go get`.
- Recommendation: Use `golang.org/x/sys/unix` for mmap in the file source adapter. It will almost certainly already be in go.sum. Only implement mmap for Phase 1 if CORE-07 is in scope for the minimal viable scan pipeline; otherwise defer to Phase 4.
3. **Provider YAML for Phase 1: how many definitions?**
- What we know: Phase 1 requires schema + PROV-10 (format_version, last_verified fields). Full 108-provider definitions are Phase 2-3.
- What's unclear: The success criterion "keyhunter scan ./somefile runs the three-stage pipeline and returns findings with provider names" implies at least one real provider definition must exist.
- Recommendation: Ship 3 reference provider definitions in Phase 1 (OpenAI, Anthropic, HuggingFace) with valid format_version and last_verified. All 108 providers are Phase 2-3 scope. These 3 definitions validate the schema and make the success criteria testable.
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Go | Core language | Yes | 1.26.1 (exceeds 1.22 requirement) | — |
| git | Version control | Yes | 2.53.0 | — |
| golangci-lint | Static analysis / CI | No | — | Install via `go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest` or skip in Phase 1 |
| npm | (not needed Phase 1) | Yes (4.2.2 via tailwind check) | — | Not needed until Phase 18 |
**Missing dependencies with fallback:**
- `golangci-lint`: Not found. Install before Phase 1 linting task, or skip lint gate for initial scaffold and add in CI pipeline. Fallback: `go vet ./...` catches most critical issues.
**Missing dependencies with no fallback:**
- None. Go 1.26.1 is available and exceeds the 1.22+ requirement.
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | `go test` (stdlib) + `testify v1.10.x` for assertions |
| Config file | None needed — standard `go test ./...` discovers `*_test.go` |
| Quick run command | `go test ./pkg/... -race -timeout 30s` |
| Full suite command | `go test ./... -race -cover -timeout 120s` |
### Phase Requirements to Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| CORE-01 | Scan pipeline detects known API key patterns in test input | integration | `go test ./pkg/engine/... -run TestScanPipeline -v` | No — Wave 0 |
| CORE-02 | Provider YAML loads from embed.FS without error | unit | `go test ./pkg/providers/... -run TestNewRegistry -v` | No — Wave 0 |
| CORE-03 | Registry holds correct provider count, Get() returns provider by name | unit | `go test ./pkg/providers/... -run TestRegistry -v` | No — Wave 0 |
| CORE-04 | Shannon entropy returns correct value for known inputs | unit | `go test ./pkg/engine/... -run TestShannonEntropy -v` | No — Wave 0 |
| CORE-05 | Worker pool uses correct concurrency; all workers complete | unit | `go test ./pkg/engine/... -run TestWorkerPool -v` | No — Wave 0 |
| CORE-06 | Aho-Corasick filter passes keyword-matched chunks; rejects non-matching | unit | `go test ./pkg/engine/... -run TestAhoCorasickFilter -v` | No — Wave 0 |
| CORE-07 | Large file source reads without OOM; binary files skipped | integration | `go test ./pkg/engine/sources/... -run TestFileSourceLarge -v` | No — Wave 0 |
| STOR-01 | SQLite DB opens; schema applies; WAL mode set | unit | `go test ./pkg/storage/... -run TestOpen -v` | No — Wave 0 |
| STOR-02 | AES-256-GCM encrypt/decrypt round-trip is lossless | unit | `go test ./pkg/storage/... -run TestEncryptDecrypt -v` | No — Wave 0 |
| STOR-03 | Argon2id DeriveKey produces 32-byte deterministic output | unit | `go test ./pkg/storage/... -run TestDeriveKey -v` | No — Wave 0 |
| CLI-01 | `keyhunter --help` exits 0; all subcommands listed | smoke | `go run . --help` | No — Wave 0 |
| CLI-02 | `keyhunter config init` creates ~/.keyhunter.yaml | integration | `go test ./cmd/... -run TestConfigInit -v` | No — Wave 0 |
| CLI-03 | `keyhunter config set key val` persists to YAML | integration | `go test ./cmd/... -run TestConfigSet -v` | No — Wave 0 |
| CLI-04 | `keyhunter providers list` returns at least 3 providers | integration | `go test ./cmd/... -run TestProvidersList -v` | No — Wave 0 |
| CLI-05 | `keyhunter scan --workers 4 testfile` uses 4 workers | integration | `go test ./cmd/... -run TestScanFlags -v` | No — Wave 0 |
| PROV-10 | Provider YAML with missing format_version returns error at load | unit | `go test ./pkg/providers/... -run TestProviderValidation -v` | No — Wave 0 |
### Sampling Rate
- **Per task commit:** `go test ./pkg/... -race -timeout 30s`
- **Per wave merge:** `go test ./... -race -cover -timeout 120s`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
All test files are new — this is a greenfield project.
- [ ] `pkg/providers/registry_test.go` — covers CORE-02, CORE-03, PROV-10
- [ ] `pkg/engine/entropy_test.go` — covers CORE-04
- [ ] `pkg/engine/filter_test.go` — covers CORE-06
- [ ] `pkg/engine/engine_test.go` — covers CORE-01, CORE-05
- [ ] `pkg/engine/sources/file_test.go` — covers CORE-07
- [ ] `pkg/storage/encrypt_test.go` — covers STOR-02
- [ ] `pkg/storage/crypto_test.go` — covers STOR-03
- [ ] `pkg/storage/db_test.go` — covers STOR-01
- [ ] `cmd/config_test.go` — covers CLI-02, CLI-03
- [ ] `cmd/providers_test.go` — covers CLI-04
- [ ] `cmd/scan_test.go` — covers CLI-05
- [ ] `testdata/fixtures/` — synthetic test files with known API key patterns for integration tests
- [ ] Framework install: `go get github.com/stretchr/testify@latest` — if not added during go.mod init
---
## Sources
### Primary (HIGH confidence)
- TruffleHog Aho-Corasick blog: https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick — confirmed petar-dambovaliev library and 2x speedup claim
- petar-dambovaliev/aho-corasick pkg.go.dev: https://pkg.go.dev/github.com/petar-dambovaliev/aho-corasick — API verified, last updated 2025-04-24
- Go crypto/cipher (AES-GCM): https://pkg.go.dev/crypto/cipher — Encrypt/Decrypt pattern verified
- Go argon2 package: https://pkg.go.dev/golang.org/x/crypto/argon2 — IDKey parameters from RFC 9106
- modernc.org/sqlite pkg.go.dev: https://pkg.go.dev/modernc.org/sqlite — pure Go confirmed, SQLite 3.51.2
- Go embed package: https://pkg.go.dev/embed — WalkDir pattern for loading embedded files
- gopkg.in/yaml.v3: https://pkg.go.dev/gopkg.in/yaml.v3 — UnmarshalYAML custom validation
- ants v2 README: https://github.com/panjf2000/ants — Pool usage pattern
- cobra docs: https://github.com/spf13/cobra — PersistentPreRunE config loading pattern
- viper docs: https://github.com/spf13/viper — BindPFlag, WriteConfigAs patterns
### Secondary (MEDIUM confidence)
- TruffleHog go.sum (BobuSumisu reference): https://github.com/trufflesecurity/trufflehog/blob/main/go.sum — historical library; petar-dambovaliev is current per blog post
- Practical Go SQLite (no CGo): https://practicalgobook.net/posts/go-sqlite-no-cgo/ — WAL mode pattern verified against official SQLite docs
- HashiCorp entropy FP research: https://www.hashicorp.com/en/blog/false-positives-a-big-problem-for-secret-scanners — 80% FP rate from entropy-only detection
- Cobra/Viper 2025 article: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/ — PersistentPreRunE pattern
- zalando/go-keyring: https://github.com/zalando/go-keyring — Linux uses D-Bus Secret Service (libsecret); noted as future improvement for Phase 1+
### Tertiary (LOW confidence)
- argon2aes reference implementation: https://github.com/presbrey/argon2aes — implementation pattern reference only; use stdlib directly
- Passphrase UX patterns: Community convention; no authoritative Go CLI standard for passphrase input UX
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all versions verified against official GitHub releases and pkg.go.dev as of 2026-04-04
- Architecture: HIGH — TruffleHog v3 pipeline is the proven model; channel patterns are established Go idiom
- Aho-Corasick library choice: HIGH — TruffleHog blog post explicitly credits petar-dambovaliev; pkg.go.dev confirms API and last update date
- AES-256+Argon2id approach: HIGH — stdlib only; RFC 9106 parameters; well-documented pattern
- Pitfalls: HIGH — sourced from GHSA advisory, HashiCorp research, official Go docs
- Test architecture: HIGH — standard `go test` patterns; no uncertainty
**Research date:** 2026-04-04
**Valid until:** 2026-07-04 (stable libraries; roughly 90 days is safe for Go stdlib + well-maintained ecosystem packages)