Phase 1: Foundation - Research
Researched: 2026-04-04
Domain: Go CLI scaffolding, provider registry schema, three-stage scan pipeline, encrypted SQLite storage
Confidence: HIGH
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| CORE-01 | Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline | Aho-Corasick pre-filter (petar-dambovaliev/aho-corasick) + Go RE2 regexp; buffered channel pipeline pattern documented |
| CORE-02 | Provider definitions loaded from YAML files embedded at compile time via Go embed | //go:embed providers/*.yaml + fs.WalkDir to iterate; gopkg.in/yaml.v3 for parse |
| CORE-03 | Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata | Registry struct holding []Provider loaded at startup; injected via constructor not global |
| CORE-04 | Entropy analysis as secondary signal for low-confidence providers | Shannon entropy implemented as ~10-line stdlib function using math package; threshold 3.5 bits/char |
| CORE-05 | Worker pool parallelism with configurable worker count (default: CPU count) | ants.NewPool(runtime.NumCPU() * 8) for detectors; ants.NewPool(runtime.NumCPU()) for verifiers |
| CORE-06 | Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files | petar-dambovaliev/aho-corasick: 20x faster than cloudflare's, used by TruffleHog; build automaton at registry init |
| CORE-07 | mmap-based large file reading for memory efficiency | golang.org/x/sys/unix.Mmap or syscall.Mmap for large file sources; skip binary files via magic bytes |
| STOR-01 | SQLite database for persisting scan results, keys, recon history | modernc.org/sqlite (pure Go, no CGo); database/sql interface; WAL mode; schema embedded via //go:embed |
| STOR-02 | Application-level AES-256 encryption for stored keys and sensitive config | crypto/aes + crypto/cipher GCM mode; nonce prepended to ciphertext stored in BLOB column |
| STOR-03 | Encryption key derived from user passphrase via Argon2 | golang.org/x/crypto/argon2 IDKey with RFC 9106 params: time=1, memory=64*1024, threads=4, keyLen=32 |
| CLI-01 | Cobra-based CLI with commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule | cobra v1.10.2 command tree; cmd/ package; main.go < 30 lines |
| CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | viper.WriteConfigAs(filepath) with os.MkdirAll; PersistentPreRunE for config load |
| CLI-03 | keyhunter config set key value persists values | viper.Set(key, value) + viper.WriteConfig() |
| CLI-04 | keyhunter providers list/info/stats for provider management | Registry.List(), Registry.Get(name) from loaded YAML; lipgloss table for terminal output |
| CLI-05 | Scan flags: --providers, --category, --confidence, --exclude, --verify, --workers, --output, --unmask, --notify | Persistent flags on scan command; viper.BindPFlag for config file override |
| PROV-10 | Provider YAML schema includes format_version and last_verified fields validated at load time | Custom UnmarshalYAML method on Provider struct; return error if format_version == 0 or last_verified empty |
</phase_requirements>
Summary
Phase 1 builds the three foundations everything else depends on: the provider registry (YAML schema + embed), the storage layer (SQLite + AES-256 encryption), and the CLI skeleton (Cobra + Viper). These are the zero-dependency components in the architecture: nothing downstream can be built until all three exist. The scanning engine (three-stage pipeline) is also scoped here because all Phase 1 success criteria require a working scan to validate the three foundations.
The stack for this phase is fully settled from prior research. The one open question from SUMMARY.md, which Aho-Corasick library to use, is now resolved: TruffleHog originally used petar-dambovaliev/aho-corasick (confirmed via the TruffleHog blog post crediting Petar Dambovaliev). That library is reported to be 20x faster than cloudflare/ahocorasick while using 1/8th the memory. A second open question, Argon2 vs PBKDF2 for key derivation, is also resolved: use Argon2id (golang.org/x/crypto/argon2.IDKey) per RFC 9106 recommendations. Argon2id is the modern standard and ships in the x/crypto module that the project needs for other operations anyway.
The AES-256 encryption approach is application-level (not SQLCipher), using crypto/aes GCM mode with the nonce prepended to the ciphertext stored as a BLOB. This preserves CGO_ENABLED=0 throughout. Key derivation uses Argon2id to produce a 32-byte key from a user passphrase + random salt stored alongside the database. For Phase 1, the salt can be stored in the config YAML; OS keychain integration (zalando/go-keyring) can be added in a later phase without schema migration.
Primary recommendation: Build in order: (1) Provider YAML schema + embed loader, (2) SQLite schema + AES-256 crypto layer, (3) Scanning engine pipeline (Aho-Corasick + RE2 + entropy), (4) Cobra/Viper CLI wiring. The scan pipeline validation is the integration test that proves all three foundations work together.
Project Constraints (from CLAUDE.md)
All directives from CLAUDE.md are binding. Key constraints for Phase 1:
| Constraint | Directive |
|---|---|
| Language | Go 1.22+ only. No other language. |
| CGO | CGO_ENABLED=0 throughout — single binary, cross-compilation |
| SQLite driver | modernc.org/sqlite — NOT mattn/go-sqlite3 (CGo) |
| SQLite encryption | Application-level AES-256 via crypto/aes — NOT SQLCipher (CGo) |
| CLI framework | cobra v1.10.2 + viper v1.21.0 — no alternatives |
| YAML parsing | gopkg.in/yaml.v3 — not sigs.k8s.io/yaml or goccy/go-yaml |
| Concurrency | ants v2.12.0 worker pool |
| Architecture | Plugin-based — providers as YAML files, compile-time embedded via go:embed |
| Regex engine | Go stdlib regexp (RE2-backed) ONLY — never regexp2 or PCRE |
| Verification | Opt-in (--verify flag) — passive scanning by default |
| Key masking | Default masked output, --unmask for full keys |
| Worker pool | github.com/panjf2000/ants/v2 v2.12.0 |
| Output formatting | github.com/charmbracelet/lipgloss (latest) |
| Web stack | chi v5.2.5 + templ v0.3.1001 + htmx + Tailwind v4 (Phase 18 only — do not scaffold in Phase 1) |
| Telegram | telego v1.8.0 (Phase 17 only) |
| Scheduler | gocron v2.19.1 (Phase 17 only) |
| Build | go build -ldflags="-s -w" for stripped binary |
| Forbidden patterns | Fiber, Echo, mattn/go-sqlite3, SQLCipher, robfig/cron, regexp2, Full Bubble Tea TUI |
Standard Stack
Core (Phase 1 only)
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| github.com/spf13/cobra | v1.10.2 | CLI command tree | Industry standard; used by TruffleHog, Gitleaks, Kubernetes, Docker CLI |
| github.com/spf13/viper | v1.21.0 | Config management (YAML/env/flags) | Cobra-native integration; v1.21.0 fixed key-casing bugs |
| modernc.org/sqlite | v1.35.x (SQLite 3.51.2) | Embedded database | Pure Go; no CGo; cross-compiles cleanly; updated 2026-03-17 |
| gopkg.in/yaml.v3 | v3.0.1 | Parse provider YAML | Handles inline/anchored structs; stable; transitive dep of cobra anyway |
| github.com/petar-dambovaliev/aho-corasick | latest (2025-04-24) | Keyword pre-filter | 20x faster than cloudflare/ahocorasick; 1/8 memory; used by TruffleHog; MIT license |
| github.com/panjf2000/ants/v2 | v2.12.0 | Worker pool | Battle-tested goroutine pool; v2.12.0 adds ReleaseContext for clean shutdown |
| golang.org/x/crypto | latest x/ | Argon2 key derivation | Official Go extended library; IDKey for Argon2id; same package used for other crypto needs |
| golang.org/x/time | latest x/ | Rate limiting (future-proofing) | Needed for CORE-05 workers; token bucket; add now to avoid later go.mod churn |
| github.com/charmbracelet/lipgloss | latest | Terminal table output | Declarative styles; used across Go security tool ecosystem |
| github.com/stretchr/testify | v1.10.x | Test assertions | Assert/require only; standard in Go ecosystem |
| embed (stdlib) | — | Compile-time YAML embed | Go 1.16+ native; no external dep |
| crypto/aes, crypto/cipher (stdlib) | — | AES-256-GCM encryption | Standard library; no CGo; GCM provides authenticated encryption |
| math (stdlib) | — | Shannon entropy calculation | ~10-line implementation; no library needed |
| database/sql (stdlib) | — | SQL interface over modernc.org/sqlite | Driver registered as "sqlite"; raw SQL; no ORM |
Not Needed in Phase 1 (scaffold stubs only if required by interface)
| Library | Deferred To |
|---|---|
| chi v5.2.5 | Phase 18 (Web Dashboard) |
| templ v0.3.1001 | Phase 18 (Web Dashboard) |
| telego v1.8.0 | Phase 17 (Telegram Bot) |
| gocron v2.19.1 | Phase 17 (Scheduler) |
Installation (Phase 1 dependencies):
go mod init github.com/yourusername/keyhunter
go get github.com/spf13/cobra@v1.10.2
go get github.com/spf13/viper@v1.21.0
go get modernc.org/sqlite@latest
go get gopkg.in/yaml.v3@v3.0.1
go get github.com/petar-dambovaliev/aho-corasick@latest
go get github.com/panjf2000/ants/v2@v2.12.0
go get golang.org/x/crypto@latest
go get golang.org/x/time@latest
go get github.com/charmbracelet/lipgloss@latest
go get github.com/stretchr/testify@latest
Version verification (run before writing go.mod manually):
go list -m github.com/petar-dambovaliev/aho-corasick
go list -m modernc.org/sqlite
go list -m github.com/panjf2000/ants/v2
Architecture Patterns
Recommended Project Structure (Phase 1 scope)
keyhunter/
  main.go                  # < 30 lines, cobra root Execute()
  cmd/
    root.go                # rootCmd, persistent flags, PersistentPreRunE config load
    scan.go                # scan command + flags
    providers.go           # providers list/info/stats commands
    config.go              # config init/set/get commands
  pkg/
    providers/
      loader.go            # embed.FS + fs.WalkDir + yaml.Unmarshal
      registry.go          # Registry struct, Get/List/Stats methods
      schema.go            # Provider, Pattern, VerifySpec structs + UnmarshalYAML validation
    engine/
      engine.go            # Engine struct, Scan() method, pipeline orchestration
      pipeline.go          # channel wiring: chunksChan, detectableChan, resultsChan
      filter.go            # Aho-Corasick pre-filter stage
      detector.go          # Regex + entropy detector worker
      entropy.go           # Shannon entropy function
      chunk.go             # Chunk type (content []byte, source string, offset int64)
      finding.go           # Finding type (provider, key_value, key_masked, confidence, source, path)
      sources/
        source.go          # Source interface
        file.go            # FileSource
        dir.go             # DirSource (recursive with glob exclude)
    storage/
      db.go                # DB struct, Open(), migrations via embedded schema.sql
      schema.sql           # DDL for findings, scans, settings tables
      encrypt.go           # AES-256-GCM Encrypt(plaintext, key) / Decrypt(ciphertext, key)
      crypto.go            # Argon2id key derivation: DeriveKey(passphrase, salt)
      findings.go          # CRUD for findings table
      scans.go             # CRUD for scans table
    config/
      config.go            # Config struct, Load(), defaults
    output/
      table.go             # lipgloss colored terminal table
      json.go              # encoding/json output
  providers/
    openai.yaml            # Reference provider definitions (Phase 1: schema examples only)
    anthropic.yaml
    huggingface.yaml
Pattern 1: Provider Registry with Compile-Time Embed
What: Provider YAML definitions embedded via //go:embed and loaded into an in-memory registry at startup.
When to use: Always. Never load provider files from disk at runtime.
// Source: Go embed docs (https://pkg.go.dev/embed)
package providers

import (
	"embed"
	"fmt"
	"io/fs"
	"path/filepath"

	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
	"gopkg.in/yaml.v3"
)

// Embed patterns cannot contain "..", so the providers/ directory must live
// inside the directory of the file that declares this embed.
//
//go:embed providers/*.yaml
var providersFS embed.FS

type Registry struct {
	providers []Provider
	ac        ahocorasick.AhoCorasick
}

func NewRegistry() (*Registry, error) {
	var providers []Provider
	err := fs.WalkDir(providersFS, "providers", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || filepath.Ext(path) != ".yaml" {
			return nil // skip directories and non-YAML files
		}
		data, err := providersFS.ReadFile(path)
		if err != nil {
			return err
		}
		var p Provider
		if err := yaml.Unmarshal(data, &p); err != nil {
			return fmt.Errorf("provider %s: %w", path, err)
		}
		providers = append(providers, p)
		return nil
	})
	if err != nil {
		return nil, err
	}

	// Build Aho-Corasick automaton from all keywords
	var keywords []string
	for _, p := range providers {
		keywords = append(keywords, p.Keywords...)
	}
	builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
	ac := builder.Build(keywords)
	return &Registry{providers: providers, ac: ac}, nil
}
Pattern 2: Three-Stage Scanning Pipeline (Buffered Channels)
What: Source adapters produce chunks onto buffered channels. Aho-Corasick pre-filter reduces candidates. Detector workers apply regex + entropy.
When to use: All scan operations. Never skip the pre-filter.
// Source: TruffleHog v3 architecture (https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick)
func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) {
	chunksChan := make(chan Chunk, 1000)
	detectableChan := make(chan Chunk, 500)
	resultsChan := make(chan Finding, 100)

	// Stage 1: Source → chunks
	go func() {
		defer close(chunksChan)
		src.Chunks(ctx, chunksChan)
	}()

	// Stage 2: Aho-Corasick keyword pre-filter
	go func() {
		defer close(detectableChan)
		for chunk := range chunksChan {
			if len(e.registry.AC().FindAll(string(chunk.Data))) == 0 {
				continue
			}
			select {
			case detectableChan <- chunk:
			case <-ctx.Done():
				return // stop forwarding once the scan is cancelled
			}
		}
	}()

	// Stage 3: Detector workers
	var wg sync.WaitGroup
	for i := 0; i < cfg.Workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range detectableChan {
				e.detect(chunk, resultsChan)
			}
		}()
	}

	// Close the results channel once every detector worker has finished.
	go func() {
		wg.Wait()
		close(resultsChan)
	}()
	return resultsChan, nil
}
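The channel wiring above can be exercised in miniature with only the standard library. The following is a sketch under stated assumptions, not the engine itself: strings.Contains stands in for the Aho-Corasick automaton, one hard-coded illustrative regex stands in for the provider registry, and the names chunk, finding, runPipeline, and scanAll are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// chunk and finding are hypothetical stand-ins for the engine's Chunk and Finding.
type chunk struct {
	Source string
	Data   string
}

type finding struct {
	Source string
	Match  string
}

// demoPattern is an illustrative regex, not a real provider definition.
var demoPattern = regexp.MustCompile(`sk-demo-[A-Za-z0-9]{8}`)

// runPipeline wires source -> keyword pre-filter -> detector workers and
// returns all findings once every stage has drained.
func runPipeline(ctx context.Context, inputs []chunk, workers int) []finding {
	chunks := make(chan chunk, 16)
	detectable := make(chan chunk, 16)
	results := make(chan finding, 16)

	// Stage 1: emit chunks, stopping early if the context is cancelled.
	go func() {
		defer close(chunks)
		for _, c := range inputs {
			select {
			case chunks <- c:
			case <-ctx.Done():
				return
			}
		}
	}()

	// Stage 2: cheap keyword check (stand-in for the Aho-Corasick automaton).
	go func() {
		defer close(detectable)
		for c := range chunks {
			if !strings.Contains(c.Data, "sk-demo-") {
				continue
			}
			select {
			case detectable <- c:
			case <-ctx.Done():
				return
			}
		}
	}()

	// Stage 3: regex detection in a fixed number of workers.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range detectable {
				for _, m := range demoPattern.FindAllString(c.Data, -1) {
					results <- finding{Source: c.Source, Match: m}
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []finding
	for f := range results {
		out = append(out, f)
	}
	return out
}

// scanAll runs the pipeline with a background context.
func scanAll(inputs []chunk, workers int) []finding {
	return runPipeline(context.Background(), inputs, workers)
}

func main() {
	inputs := []chunk{
		{Source: "a.env", Data: "TOKEN=sk-demo-abcd1234"},
		{Source: "b.txt", Data: "nothing to see here"},
		{Source: "c.env", Data: "sk-demo- keyword only"},
	}
	fmt.Println(len(scanAll(inputs, 4))) // 1: only a.env passes both stages
}
```

The third input shows why both stages exist: it passes the keyword pre-filter but fails the regex, so it costs one regex evaluation and produces no finding.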
Pattern 3: AES-256-GCM Column Encryption
What: Encrypt key values before storing in SQLite. Nonce prepended to ciphertext. Key derived via Argon2id.
When to use: Every write of a full API key to storage.
// Source: Go crypto/cipher docs (https://pkg.go.dev/crypto/cipher)
import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"io"
)

func Encrypt(plaintext []byte, key []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	ciphertext := gcm.Seal(nonce, nonce, plaintext, nil) // nonce prepended
	return ciphertext, nil
}

func Decrypt(ciphertext []byte, key []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonceSize := gcm.NonceSize()
	if len(ciphertext) < nonceSize {
		return nil, errors.New("ciphertext too short")
	}
	// Split the stored blob back into nonce and sealed payload.
	nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:]
	return gcm.Open(nil, nonce, ciphertext, nil)
}
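To make the stored BLOB layout concrete, the sketch below repeats the same construction under the hypothetical names seal and open and checks two properties: the blob is 28 bytes longer than the plaintext (12-byte GCM nonce plus 16-byte authentication tag), and flipping a single bit makes decryption fail. The fixed demoKey is for illustration only; in the design above the key comes from the KDF.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal returns nonce || ciphertext || tag.
func seal(plaintext, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and returns an error if the blob was modified.
func open(blob, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(blob) < n+gcm.Overhead() {
		return nil, errors.New("blob too short")
	}
	return gcm.Open(nil, blob[:n], blob[n:], nil)
}

// demoKey is a fixed key for illustration only; never hard-code real keys.
func demoKey() []byte { return bytes.Repeat([]byte{0x42}, 32) }

func main() {
	key := demoKey()
	secret := []byte("sk-example")

	blob, err := seal(secret, key)
	if err != nil {
		panic(err)
	}
	fmt.Println("overhead bytes:", len(blob)-len(secret)) // 12-byte nonce + 16-byte tag

	blob[len(blob)-1] ^= 0x01 // flip one bit of the tag
	_, err = open(blob, key)
	fmt.Println("tamper detected:", err != nil)
}
```

The fixed 28-byte overhead is worth knowing when sizing the BLOB column or writing round-trip tests.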
Pattern 4: Argon2id Key Derivation
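What: Derive 32-byte AES key from user passphrase + random salt. The parameters below (time=1, memory=64 MiB) match the suggestion in the golang.org/x/crypto/argon2 documentation; RFC 9106's memory-constrained profile pairs 64 MiB with t=3.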
When to use: On first use (generate and store salt); on subsequent use (re-derive key from passphrase + stored salt).
// Source: https://pkg.go.dev/golang.org/x/crypto/argon2
import (
	"crypto/rand"

	"golang.org/x/crypto/argon2"
)

const (
	argon2Time    = 1
	argon2Memory  = 64 * 1024 // in KiB, i.e. 64 MiB
	argon2Threads = 4
	argon2KeyLen  = 32 // AES-256 key length
)

func DeriveKey(passphrase []byte, salt []byte) []byte {
	return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen)
}

// Generate salt (do once, persist in config)
func NewSalt() ([]byte, error) {
	salt := make([]byte, 32)
	_, err := rand.Read(salt)
	return salt, err
}
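Because the salt is raw bytes and the config file is YAML, it needs a text encoding before it can be persisted. A minimal sketch, assuming base64 is chosen for that purpose (the source does not specify the encoding or the config key); encodeSalt and decodeSalt are hypothetical helper names.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

const saltLen = 32 // matches the 32-byte salt generated by NewSalt

// newSalt returns saltLen random bytes from the OS CSPRNG.
func newSalt() ([]byte, error) {
	salt := make([]byte, saltLen)
	if _, err := rand.Read(salt); err != nil {
		return nil, err
	}
	return salt, nil
}

// encodeSalt renders the salt as a string suitable for a YAML scalar.
func encodeSalt(salt []byte) string {
	return base64.StdEncoding.EncodeToString(salt)
}

// decodeSalt parses a stored salt and rejects values of the wrong length,
// which would silently derive a different key.
func decodeSalt(s string) ([]byte, error) {
	salt, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return nil, fmt.Errorf("decode salt: %w", err)
	}
	if len(salt) != saltLen {
		return nil, fmt.Errorf("salt is %d bytes, want %d", len(salt), saltLen)
	}
	return salt, nil
}

func main() {
	salt, err := newSalt()
	if err != nil {
		panic(err)
	}
	stored := encodeSalt(salt)
	fmt.Println("stored length:", len(stored)) // 44 base64 characters for 32 bytes

	loaded, err := decodeSalt(stored)
	fmt.Println("round trip ok:", err == nil && string(loaded) == string(salt))
}
```

The salt is not secret, so storing it in plain config is fine; what matters is that it is random, generated once, and validated on load.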
Pattern 5: Provider YAML Schema with Validation
What: Provider struct with UnmarshalYAML that validates required fields including format_version and last_verified.
When to use: Every provider YAML load. Return error on invalid schema — fail fast at startup.
// Source: gopkg.in/yaml.v3 docs (https://pkg.go.dev/gopkg.in/yaml.v3)
type Provider struct {
	Name          string      `yaml:"name"`
	FormatVersion int         `yaml:"format_version"`
	LastVerified  string      `yaml:"last_verified"` // ISO date: "2026-04-04"
	Keywords      []string    `yaml:"keywords"`
	Patterns      []Pattern   `yaml:"patterns"`
	Verify        *VerifySpec `yaml:"verify,omitempty"`
}

func (p *Provider) UnmarshalYAML(value *yaml.Node) error {
	// rawProvider has no UnmarshalYAML method, which avoids infinite recursion.
	type rawProvider Provider
	if err := value.Decode((*rawProvider)(p)); err != nil {
		return err
	}
	if p.FormatVersion == 0 {
		return fmt.Errorf("provider %q: format_version is required", p.Name)
	}
	if p.LastVerified == "" {
		return fmt.Errorf("provider %q: last_verified is required", p.Name)
	}
	if len(p.Keywords) == 0 {
		return fmt.Errorf("provider %q: at least one keyword is required", p.Name)
	}
	if len(p.Patterns) == 0 {
		return fmt.Errorf("provider %q: at least one pattern is required", p.Name)
	}
	return nil
}
Pattern 6: Shannon Entropy (stdlib only)
What: Calculate bits-per-character entropy of a string. No library needed.
When to use: CORE-04: as secondary signal after keyword pre-filter and regex confirm a candidate. Entropy threshold 3.5 bits/char for most providers.
// Source: Shannon entropy formula (verified against TruffleHog entropy implementation)
import "math"

func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	// Count occurrences of each rune.
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}
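A few hand-checkable inputs help calibrate the 3.5 bits/char threshold. The sketch below reuses the function above and prints the entropy of strings whose values follow directly from the formula: one repeated symbol gives 0, two equally likely symbols give 1, four give 2, and sixteen distinct hex digits give 4.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the entropy of s in bits per character.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy("aaaa"))             // 0.00
	fmt.Printf("%.2f\n", shannonEntropy("aabb"))             // 1.00
	fmt.Printf("%.2f\n", shannonEntropy("abcd"))             // 2.00
	fmt.Printf("%.2f\n", shannonEntropy("0123456789abcdef")) // 4.00
}
```

The last case is the important one: a hex alphabet tops out at 4 bits/char, so hashes and UUID-like strings can clear a 3.5 threshold without being API keys, which is why entropy is only ever a secondary signal here.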
Pattern 7: modernc.org/sqlite Initialization
What: Open SQLite DB with WAL mode and foreign keys. Schema embedded via //go:embed.
When to use: Storage layer initialization.
// Source: https://practicalgobook.net/posts/go-sqlite-no-cgo/
import (
	"database/sql"
	_ "embed" // blank import: required for //go:embed into a string variable

	_ "modernc.org/sqlite" // registers "sqlite" driver
)

//go:embed schema.sql
var schemaSQL string

func Open(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite", path)
	if err != nil {
		return nil, err
	}
	// Enable WAL mode for better concurrent access
	if _, err := db.Exec("PRAGMA journal_mode=WAL"); err != nil {
		db.Close()
		return nil, err
	}
	// foreign_keys is a per-connection setting; with database/sql pooling it
	// only applies to the connection that ran this statement.
	if _, err := db.Exec("PRAGMA foreign_keys=ON"); err != nil {
		db.Close()
		return nil, err
	}
	// Apply schema
	if _, err := db.Exec(schemaSQL); err != nil {
		db.Close()
		return nil, err
	}
	return db, nil
}
Pattern 8: Cobra + Viper Wiring
What: Cobra command tree with Viper config loaded in PersistentPreRunE. Flags bound to Viper for config file override.
When to use: root.go and all command files.
// Source: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/
var rootCmd = &cobra.Command{
	Use:   "keyhunter",
	Short: "API key scanner for 108+ LLM providers",
	PersistentPreRunE: func(cmd *cobra.Command, args []string) error {
		viper.SetConfigName(".keyhunter")
		viper.SetConfigType("yaml")
		viper.AddConfigPath("$HOME")
		viper.AutomaticEnv()
		viper.SetEnvPrefix("KEYHUNTER")
		if err := viper.ReadInConfig(); err != nil {
			if _, ok := err.(viper.ConfigFileNotFoundError); !ok {
				return err
			}
			// Config file not found — acceptable on first run
		}
		return nil
	},
}

// In init() of each command file:
func init() {
	scanCmd.Flags().IntP("workers", "w", 0, "worker count (default: CPU count)")
	viper.BindPFlag("workers", scanCmd.Flags().Lookup("workers"))
}
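Since the --workers flag defaults to 0 while the documented default is the CPU count, something has to translate "unset" into a real number before the pool is created. A minimal sketch of that step; resolveWorkers is a hypothetical helper name, and treating negative values like 0 is an assumption rather than specified behavior.

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveWorkers maps the --workers value to an effective worker count.
// Zero (flag not set) and negative values fall back to the CPU count.
func resolveWorkers(flagValue int) int {
	if flagValue > 0 {
		return flagValue
	}
	return runtime.NumCPU()
}

func main() {
	fmt.Println(resolveWorkers(4))      // explicit value wins
	fmt.Println(resolveWorkers(0) >= 1) // unset falls back to at least one worker
}
```

In the command handler this would typically be applied to the value read from Viper, so that a config file or environment override is resolved the same way as the flag.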
Anti-Patterns to Avoid
- Global provider registry: Pass *Registry via constructor. Global state makes testing impossible without full initialization.
- Unbuffered result channels: Use make(chan Finding, 1000). Unbuffered channels cause detector workers to block on slow output consumers, collapsing parallelism.
- Runtime YAML loading: Never load from filesystem at scan time. //go:embed only.
- regexp2 or PCRE: Go's regexp package (RE2) already provides linear-time guarantees. regexp2 loses this guarantee.
- Entropy-only detection: Never flag a candidate based solely on entropy score. Entropy is a secondary filter applied only after keyword pre-filter and regex confirm a pattern match.
- Plaintext key column: Never store full API key as TEXT. Always encrypt with AES-256-GCM before INSERT.
- os.Getenv for passphrase: Derive the AES key via Argon2id from a passphrase. Never store raw passphrase or raw key in config file.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Multi-keyword search automaton | Custom Aho-Corasick or loop-based string search | petar-dambovaliev/aho-corasick | Aho-Corasick is O(n+m+z); naive loop is O(n*k); library is 20x faster than next-best Go option |
| Goroutine pool with dynamic resize | Manual goroutine spawn with WaitGroup | ants v2.12.0 | Goroutine explosion on large repos; ants handles backpressure, panic recovery, context cancellation |
| AES key derivation from passphrase | SHA256(passphrase) or similar | argon2.IDKey from golang.org/x/crypto | MD5/SHA hash-based KDF is trivially brute-forceable; Argon2id is GPU-resistant by design |
| SQLite column encryption | XOR, custom cipher, or base64 "encoding" | crypto/aes GCM via stdlib | GCM provides both confidentiality and authentication; custom schemes always have vulnerabilities |
| Config file management | Custom INI or JSON parser | viper v1.21.0 | Viper handles YAML + env vars + CLI flags with correct precedence; hand-rolled configs miss env var override |
| CLI command parsing | flag stdlib or custom parser | cobra v1.10.2 | Cobra provides nested subcommands, persistent flags, shell completion, help generation — stdlib flag lacks all of these |
Key insight: The Aho-Corasick pre-filter and Argon2id key derivation in particular are problems where "obvious" implementations (nested loops, SHA256) have well-documented security or performance failure modes that justify the dependency cost.
Common Pitfalls
Pitfall 1: Wrong Aho-Corasick Library Choice
What goes wrong: Using cloudflare/ahocorasick because it appears more prominent in search results. It uses 8x more memory and runs 20x slower than petar-dambovaliev/aho-corasick. For a tool scanning large repos with 108 keyword patterns, this difference is measurable.
Why it happens: cloudflare/ahocorasick appears first in many search results.
How to avoid: Use github.com/petar-dambovaliev/aho-corasick (verified as the library TruffleHog uses per their own blog post crediting Petar Dambovaliev). Build the automaton once at registry init; the automaton is thread-safe for concurrent reads.
Warning signs: Scans on 100MB+ repos running significantly slower than expected; high memory usage during keyword pre-filter stage.
Pitfall 2: Encryption Key Stored in Config File as Raw Bytes
What goes wrong: Storing the 32-byte AES key directly in ~/.keyhunter.yaml or as an env var. Anyone with config file read access can decrypt the entire database.
Why it happens: "Just store the key" is the simplest implementation. Argon2id + salt seems like unnecessary complexity.
How to avoid: Store only the randomly generated salt in config. Re-derive the key from passphrase + salt on each run using Argon2id. The passphrase is entered interactively or set via KEYHUNTER_PASSPHRASE env var (never stored). For Phase 1, interactive passphrase entry is acceptable. OS keychain integration (zalando/go-keyring) can be added later without schema migration.
Warning signs: keyhunter_key: field appearing as hex bytes in the YAML config file.
Pitfall 3: Provider YAML Schema Allowing Missing format_version or last_verified
What goes wrong: Provider YAML loads without validation. Providers with missing format_version or stale last_verified accumulate over time. Pattern health tracking (PROV-10) becomes meaningless.
Why it happens: yaml.Unmarshal to a struct silently zero-values missing fields. No validation = no error.
How to avoid: Implement UnmarshalYAML with explicit validation on the Provider struct. Fail at startup (not at scan time) if any provider is invalid. This catches schema errors at development time, not production time.
Warning signs: format_version: 0 appearing in any loaded provider struct.
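The same fail-fast idea can be pushed slightly further than presence checks. The sketch below is independent of the YAML decoding step and works on an already-populated struct; it additionally verifies that last_verified parses as an ISO date and that every pattern compiles under Go's RE2 engine. The provider type, its simplified Patterns field (plain regex strings), and validateProvider are hypothetical names for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// provider mirrors the validated fields; Patterns is simplified to raw
// regex strings because the real Pattern struct is not shown in this document.
type provider struct {
	Name          string
	FormatVersion int
	LastVerified  string
	Keywords      []string
	Patterns      []string
}

// validateProvider collects every schema problem instead of stopping at the first.
func validateProvider(p provider) error {
	var errs []error
	if p.FormatVersion < 1 {
		errs = append(errs, errors.New("format_version is required"))
	}
	if _, err := time.Parse("2006-01-02", p.LastVerified); err != nil {
		errs = append(errs, fmt.Errorf("last_verified must be YYYY-MM-DD: %w", err))
	}
	if len(p.Keywords) == 0 {
		errs = append(errs, errors.New("at least one keyword is required"))
	}
	if len(p.Patterns) == 0 {
		errs = append(errs, errors.New("at least one pattern is required"))
	}
	for _, pat := range p.Patterns {
		// Lookarounds and backreferences are rejected here, enforcing RE2-only patterns.
		if _, err := regexp.Compile(pat); err != nil {
			errs = append(errs, fmt.Errorf("pattern %q does not compile: %w", pat, err))
		}
	}
	if len(errs) == 0 {
		return nil
	}
	return fmt.Errorf("provider %q: %w", p.Name, errors.Join(errs...))
}

func main() {
	good := provider{
		Name:          "example",
		FormatVersion: 1,
		LastVerified:  "2026-04-04",
		Keywords:      []string{"sk-"},
		Patterns:      []string{`sk-[A-Za-z0-9]{20,}`},
	}
	fmt.Println(validateProvider(good) == nil)

	bad := good
	bad.Patterns = []string{`sk-(?=abc)`} // PCRE lookahead, unsupported by RE2
	fmt.Println(validateProvider(bad) != nil)
}
```

Compiling patterns at load time also means a provider file that was written against a PCRE-style engine fails at startup rather than silently never matching.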
Pitfall 4: SQLite Without WAL Mode in a CLI Tool
What goes wrong: Default SQLite journal mode causes SQLITE_BUSY errors when the dashboard (Phase 18) or multiple concurrent processes read while a scan writes. The default rollback journal also has lower write throughput.
Why it happens: sql.Open("sqlite", path) uses the default rollback journal.
How to avoid: Always execute PRAGMA journal_mode=WAL immediately after opening the database; the WAL setting persists in the database file. Also set PRAGMA foreign_keys=ON to enforce referential integrity. Unlike WAL, foreign_keys is a per-connection setting and must be applied to every new connection.
Warning signs: database is locked errors during concurrent read/write operations.
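One way to get per-connection pragmas applied consistently is to carry them in the DSN rather than issuing them with Exec after opening. The sketch below only builds and inspects such a DSN with the standard library. It assumes the driver honors repeated `_pragma` query parameters, as described in the modernc.org/sqlite documentation; that should be verified against the pinned driver version. buildDSN and pragmasFromDSN are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN returns a file: DSN carrying the pragmas as query parameters.
// Assumption: the driver runs each _pragma value on every new connection.
func buildDSN(path string) string {
	q := url.Values{}
	q.Add("_pragma", "journal_mode(WAL)")
	q.Add("_pragma", "foreign_keys(1)")
	return "file:" + path + "?" + q.Encode()
}

// pragmasFromDSN extracts the _pragma values again, mainly for tests and logging.
func pragmasFromDSN(dsn string) []string {
	u, err := url.Parse(dsn)
	if err != nil {
		return nil
	}
	return u.Query()["_pragma"]
}

func main() {
	dsn := buildDSN("/tmp/keyhunter.db")
	fmt.Println(dsn)
	fmt.Println(pragmasFromDSN(dsn))
}
```

The resulting string would be passed to sql.Open("sqlite", dsn) in place of the bare path.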
Pitfall 5: Entropy Check Before Regex Confirmation
What goes wrong: Running Shannon entropy on every chunk that passes the Aho-Corasick keyword filter produces up to 80% false positive rate (HashiCorp 2025 research). High-entropy strings like UUIDs, hashes, and base64-encoded content all score above 3.5 bits/char.
Why it happens: Entropy feels like an independent signal and is applied eagerly as a quick filter.
How to avoid: Entropy is strictly a secondary signal. Apply it only to strings that have already matched both a keyword (Aho-Corasick) AND a provider regex pattern. The order is: keyword pre-filter → regex match → entropy calibration. Never entropy-only.
Warning signs: More than 30% of findings lacking a matching provider regex pattern.
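The required ordering (keyword, then regex, then entropy) is easiest to see on a single chunk. The sketch below is a simplified illustration, not the detector itself: strings.Contains stands in for the Aho-Corasick pre-filter, entropy is computed over the whole regex match, and the rule type, the "demo" provider, and its pattern are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// rule is a hypothetical, flattened view of one provider definition.
type rule struct {
	Provider   string
	Keyword    string
	Pattern    *regexp.Regexp
	MinEntropy float64
}

// entropy returns Shannon entropy in bits per character.
func entropy(s string) float64 {
	freq := make(map[rune]float64)
	n := 0.0
	for _, c := range s {
		freq[c]++
		n++
	}
	var h float64
	for _, count := range freq {
		p := count / n
		h -= p * math.Log2(p)
	}
	return h
}

// detect applies the stages in order and only scores entropy on regex matches.
func detect(data string, rules []rule) []string {
	var hits []string
	for _, r := range rules {
		if !strings.Contains(data, r.Keyword) { // stage 1: keyword pre-filter
			continue
		}
		for _, m := range r.Pattern.FindAllString(data, -1) { // stage 2: regex
			if entropy(m) < r.MinEntropy { // stage 3: entropy as secondary signal
				continue
			}
			hits = append(hits, r.Provider+":"+m)
		}
	}
	return hits
}

// demoRules returns one illustrative rule using the 3.5 bits/char threshold.
func demoRules() []rule {
	return []rule{{
		Provider:   "demo",
		Keyword:    "sk-demo-",
		Pattern:    regexp.MustCompile(`sk-demo-[A-Za-z0-9]{16}`),
		MinEntropy: 3.5,
	}}
}

func main() {
	data := "A=sk-demo-aaaaaaaaaaaaaaaa\nB=sk-demo-q7Zp2LmX9vRt4KbW\n"
	for _, h := range detect(data, demoRules()) {
		fmt.Println(h)
	}
}
```

Both lines match the regex, but only the second survives the entropy check; a random-looking string with no keyword or regex match is never scored at all.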
Pitfall 6: mmap for Small Files on Linux
What goes wrong: mmap is beneficial for large files (>10MB) where avoiding a full read-into-memory matters. For small files, mmap has higher setup overhead than a simple os.ReadFile. mmap also requires explicit cleanup to avoid address space exhaustion on directory scans with thousands of small files.
Why it happens: CORE-07 specifies mmap-based large file reading, and implementing it uniformly for all files seems simpler.
How to avoid: Use os.ReadFile for files under 10MB. Use mmap only above that threshold, with explicit unix.Munmap cleanup in a deferred call. Check for binary files before mmap — use the first 512 bytes of the file to detect binary content via http.DetectContentType and skip non-text files.
Warning signs: Address space exhaustion during directory scans; "too many open files" errors.
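The decision described above (skip binaries, read small files normally, reserve mmap for large ones) can be isolated into a pure function that is easy to test. A minimal sketch: the 10 MiB threshold and the 512-byte sniff come from the text above, while readStrategy, isBinary, and the "text/" prefix heuristic are assumptions made for illustration. The actual mmap call is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// mmapThreshold is the size at which mmap is preferred over os.ReadFile.
const mmapThreshold = 10 << 20 // 10 MiB

// isBinary sniffs up to the first 512 bytes and treats anything that is not
// detected as text/* as binary. This is a heuristic, not a guarantee.
func isBinary(head []byte) bool {
	if len(head) > 512 {
		head = head[:512]
	}
	return !strings.HasPrefix(http.DetectContentType(head), "text/")
}

// readStrategy returns "skip", "mmap", or "readfile" for a file of the given
// size whose leading bytes are head.
func readStrategy(size int64, head []byte) string {
	switch {
	case isBinary(head):
		return "skip"
	case size >= mmapThreshold:
		return "mmap"
	default:
		return "readfile"
	}
}

func main() {
	text := []byte("API_KEY=abc123\n")
	png := []byte("\x89PNG\r\n\x1a\n")

	fmt.Println(readStrategy(int64(len(text)), text)) // readfile
	fmt.Println(readStrategy(50<<20, text))           // mmap
	fmt.Println(readStrategy(50<<20, png))            // skip
}
```

Checking for binary content before the size check means a large binary never gets mapped in the first place.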
Pitfall 7: Argon2 Parameters Too Low (Fast KDF = Weak Security)
What goes wrong: Using time=1, memory=4096 (a commonly copied example) instead of RFC 9106's recommendations. Fast key derivation makes brute-force attacks on the database passphrase practical.
Why it happens: Low parameters make tests run faster and startup feel snappier.
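How to avoid: Use time=1, memory=64*1024 (64 MiB; the memory argument is expressed in KiB), threads=4, keyLen=32 for Argon2id. These values match the suggestion in the golang.org/x/crypto/argon2 documentation. RFC 9106's memory-constrained recommendation pairs 64 MiB with t=3, so raise time to 3 if strict RFC conformance is required. Test startup latency with these parameters: on modern hardware, key derivation takes ~100-300ms, which is acceptable for a CLI tool.
Warning signs: argon2.IDKey(pass, salt, 1, 4096, 1, 32) in the codebase. The memory parameter should be 64*1024 (65536), not 4096, and threads should be 4, not 1.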
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Linear keyword scan before regex | Aho-Corasick pre-filter | TruffleHog v3.28.6 (2024) | 2x average scan speedup; O(n+m+z) vs O(n*k) |
| mattn/go-sqlite3 (CGo) | modernc.org/sqlite (pure Go) | Go 1.16+ era | CGO_ENABLED=0 enabled; cross-compilation works |
| SQLCipher for DB encryption | Application-level AES-256-GCM | 2023-2025 | No CGo dependency; AES GCM provides authentication |
| PBKDF2 for key derivation | Argon2id | RFC 9106 (2021) | GPU-resistant; side-channel resistant; OWASP recommended |
| regexp2/PCRE in Go scanners | Go stdlib regexp (RE2) | Ongoing | ReDoS immune; linear time guaranteed |
| Storing full keys masked in DB | Encrypt key_encrypted column, store only mask | GHSA-4h8c-qrcq-cv5c (2024) | Database file no longer a credential dump |
Deprecated/outdated:
- mattn/go-sqlite3: Requires CGo; cross-compilation breaks; modernc.org/sqlite is the replacement.
- robfig/cron: Unmaintained since 2020; use gocron v2 (Phase 17).
- cloudflare/ahocorasick: Still maintained but 20x slower than petar-dambovaliev/aho-corasick; do not use.
- Entropy-only secret detection: HashiCorp 2025 research confirms 80% FP rate; layered pipeline is the current standard.
Open Questions
- Passphrase UX for first run
  - What we know: Argon2id requires a passphrase to derive the AES key. For Phase 1, this must be handled on first keyhunter scan or keyhunter config init.
  - What's unclear: Should Phase 1 use bufio.NewReader(os.Stdin).ReadString('\n') for passphrase entry, or skip the passphrase and use a generated random key stored in config (less secure but zero-friction)?
  - Recommendation: Use a generated random 32-byte key stored in ~/.keyhunter.yaml as base64 for Phase 1 (zero-friction). Document that this is a development shortcut and a deliberate, temporary exception to Pitfall 2; OS keychain integration (zalando/go-keyring) replaces it in a later phase. The encrypt/decrypt functions and schema are in place; only the key source changes.
- mmap on Linux: syscall vs golang.org/x/sys/unix
  - What we know: Both syscall.Mmap and golang.org/x/sys/unix.Mmap provide mmap on Linux. The x/ package has a cleaner API.
  - What's unclear: golang.org/x/sys is already a transitive dependency of many packages (likely pulled in by viper or cobra). It is unclear whether it is already in go.sum or needs explicit addition.
  - Recommendation: Use golang.org/x/sys/unix for mmap in the file source adapter. It will almost certainly already be in go.sum. Only implement mmap for Phase 1 if CORE-07 is in scope for the minimal viable scan pipeline; otherwise defer to Phase 4.
- Provider YAML for Phase 1: how many definitions?
  - What we know: Phase 1 requires schema + PROV-10 (format_version, last_verified fields). Full 108-provider definitions are Phase 2-3.
  - What's unclear: The success criterion "keyhunter scan ./somefile runs the three-stage pipeline and returns findings with provider names" implies at least one real provider definition must exist.
  - Recommendation: Ship 3 reference provider definitions in Phase 1 (OpenAI, Anthropic, HuggingFace) with valid format_version and last_verified. All 108 providers are Phase 2-3 scope. These 3 definitions validate the schema and make the success criteria testable.
Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|---|---|---|---|---|
| Go | Core language | Yes | 1.26.1 (exceeds 1.22 requirement) | — |
| git | Version control | Yes | 2.53.0 | — |
| golangci-lint | Static analysis / CI | No | — | Install via go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest or skip in Phase 1 |
| npm | (not needed Phase 1) | Yes (4.2.2 via tailwind check) | — | Not needed until Phase 18 |
Missing dependencies with fallback:
- golangci-lint: Not found. Install before the Phase 1 linting task, or skip the lint gate for the initial scaffold and add it in the CI pipeline. Fallback: go vet ./... catches most critical issues.
Missing dependencies with no fallback:
- None. Go 1.26.1 is available and exceeds the 1.22+ requirement.
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | go test (stdlib) + testify v1.10.x for assertions |
| Config file | None needed — standard go test ./... discovers *_test.go |
| Quick run command | go test ./pkg/... -race -timeout 30s |
| Full suite command | go test ./... -race -cover -timeout 120s |
Phase Requirements to Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| CORE-01 | Scan pipeline detects known API key patterns in test input | integration | go test ./pkg/engine/... -run TestScanPipeline -v | No — Wave 0 |
| CORE-02 | Provider YAML loads from embed.FS without error | unit | go test ./pkg/providers/... -run TestNewRegistry -v | No — Wave 0 |
| CORE-03 | Registry holds correct provider count, Get() returns provider by name | unit | go test ./pkg/providers/... -run TestRegistry -v | No — Wave 0 |
| CORE-04 | Shannon entropy returns correct value for known inputs | unit | go test ./pkg/engine/... -run TestShannonEntropy -v | No — Wave 0 |
| CORE-05 | Worker pool uses correct concurrency; all workers complete | unit | go test ./pkg/engine/... -run TestWorkerPool -v | No — Wave 0 |
| CORE-06 | Aho-Corasick filter passes keyword-matched chunks; rejects non-matching | unit | go test ./pkg/engine/... -run TestAhoCorasickFilter -v | No — Wave 0 |
| CORE-07 | Large file source reads without OOM; binary files skipped | integration | go test ./pkg/engine/sources/... -run TestFileSourceLarge -v | No — Wave 0 |
| STOR-01 | SQLite DB opens; schema applies; WAL mode set | unit | go test ./pkg/storage/... -run TestOpen -v | No — Wave 0 |
| STOR-02 | AES-256-GCM encrypt/decrypt round-trip is lossless | unit | go test ./pkg/storage/... -run TestEncryptDecrypt -v | No — Wave 0 |
| STOR-03 | Argon2id DeriveKey produces 32-byte deterministic output | unit | go test ./pkg/storage/... -run TestDeriveKey -v | No — Wave 0 |
| CLI-01 | keyhunter --help exits 0; all subcommands listed | smoke | go run ./... --help | No — Wave 0 |
| CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | integration | go test ./cmd/... -run TestConfigInit -v | No — Wave 0 |
| CLI-03 | keyhunter config set key val persists to YAML | integration | go test ./cmd/... -run TestConfigSet -v | No — Wave 0 |
| CLI-04 | keyhunter providers list returns at least 3 providers | integration | go test ./cmd/... -run TestProvidersList -v | No — Wave 0 |
| CLI-05 | keyhunter scan --workers 4 testfile uses 4 workers | integration | go test ./cmd/... -run TestScanFlags -v | No — Wave 0 |
| PROV-10 | Provider YAML with missing format_version returns error at load | unit | go test ./pkg/providers/... -run TestProviderValidation -v | No — Wave 0 |
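To make the CORE-04 row concrete, here is a minimal sketch of the kind of function TestShannonEntropy could exercise. The name shannonEntropy and the chosen inputs are assumptions for illustration, not the project's actual API; the inputs are picked because their entropy is exactly representable, which keeps assertions free of floating-point tolerance.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy (hypothetical name) returns the Shannon entropy of s
// in bits per byte. An empty string is defined here to have entropy 0.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	// Count how often each byte value occurs.
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// One repeated symbol, two equally likely symbols, four equally likely symbols.
	for _, in := range []string{"aaaa", "aabb", "abcd"} {
		fmt.Printf("%q -> %.2f bits/char\n", in, shannonEntropy(in))
	}
}
```

A table-driven test over inputs like these, plus one realistic high-entropy key fixture compared against the 3.5 bits/char threshold from the requirements table, would likely be enough for the unit level.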
Sampling Rate
- Per task commit: go test ./pkg/... -race -timeout 30s
- Per wave merge: go test ./... -race -cover -timeout 120s
- Phase gate: Full suite green before /gsd:verify-work
Wave 0 Gaps
All test files are new — this is a greenfield project.
- pkg/providers/registry_test.go — covers CORE-02, CORE-03, PROV-10
- pkg/engine/entropy_test.go — covers CORE-04
- pkg/engine/filter_test.go — covers CORE-06
- pkg/engine/engine_test.go — covers CORE-01, CORE-05
- pkg/engine/sources/file_test.go — covers CORE-07
- pkg/storage/encrypt_test.go — covers STOR-02
- pkg/storage/crypto_test.go — covers STOR-03
- pkg/storage/db_test.go — covers STOR-01
- cmd/config_test.go — covers CLI-02, CLI-03
- cmd/providers_test.go — covers CLI-04
- cmd/scan_test.go — covers CLI-05
- testdata/fixtures/ — synthetic test files with known API key patterns for integration tests
- Framework install: go get github.com/stretchr/testify@latest — if not added during go.mod init
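As an illustration of what the STOR-02 test file might cover, the following sketch shows an AES-256-GCM round trip with the nonce prepended to the ciphertext, using only the standard library. The function names encrypt and decrypt, the all-zero demo key, and the sample plaintext are assumptions for this example; in the planned design the key would come from the Argon2id derivation covered by STOR-03.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD; keys that are not 32 bytes are rejected
// so that AES-128 or AES-192 cannot be selected by accident.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("key must be 32 bytes for AES-256")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt (hypothetical name) returns nonce || ciphertext || tag.
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends to its first argument, so the nonce ends up as the prefix.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt (hypothetical name) splits off the nonce and authenticates the rest.
func decrypt(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("ciphertext shorter than nonce")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // demo key only; never use a fixed key in practice
	blob, err := encrypt(key, []byte("sk-test-123"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, blob)
	fmt.Println(string(plain) == "sk-test-123", err)
}
```

Beyond the lossless round trip, a test along these lines could also assert that a tampered blob fails to decrypt, since GCM authenticates the ciphertext as well as encrypting it.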
Sources
Primary (HIGH confidence)
- TruffleHog Aho-Corasick blog: https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick — confirmed petar-dambovaliev library and 2x speedup claim
- petar-dambovaliev/aho-corasick pkg.go.dev: https://pkg.go.dev/github.com/petar-dambovaliev/aho-corasick — API verified, last updated 2025-04-24
- Go crypto/cipher (AES-GCM): https://pkg.go.dev/crypto/cipher — Encrypt/Decrypt pattern verified
- Go argon2 package: https://pkg.go.dev/golang.org/x/crypto/argon2 — IDKey parameters from RFC 9106
- modernc.org/sqlite pkg.go.dev: https://pkg.go.dev/modernc.org/sqlite — pure Go confirmed, SQLite 3.51.2
- Go embed package: https://pkg.go.dev/embed — WalkDir pattern for loading embedded files
- gopkg.in/yaml.v3: https://pkg.go.dev/gopkg.in/yaml.v3 — UnmarshalYAML custom validation
- ants v2 README: https://github.com/panjf2000/ants — Pool usage pattern
- cobra docs: https://github.com/spf13/cobra — PersistentPreRunE config loading pattern
- viper docs: https://github.com/spf13/viper — BindPFlag, WriteConfigAs patterns
Secondary (MEDIUM confidence)
- TruffleHog go.sum (BobuSumisu reference): https://github.com/trufflesecurity/trufflehog/blob/main/go.sum — historical library; petar-dambovaliev is current per blog post
- Practical Go SQLite (no CGo): https://practicalgobook.net/posts/go-sqlite-no-cgo/ — WAL mode pattern verified against official SQLite docs
- HashiCorp entropy FP research: https://www.hashicorp.com/en/blog/false-positives-a-big-problem-for-secret-scanners — 80% FP rate from entropy-only detection
- Cobra/Viper 2025 article: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/ — PersistentPreRunE pattern
- zalando/go-keyring: https://github.com/zalando/go-keyring — Linux uses D-Bus Secret Service (libsecret); noted as future improvement for Phase 1+
Tertiary (LOW confidence)
- argon2aes reference implementation: https://github.com/presbrey/argon2aes — implementation pattern reference only; use stdlib directly
- Passphrase UX patterns: Community convention; no authoritative Go CLI standard for passphrase input UX
Metadata
Confidence breakdown:
- Standard stack: HIGH — all versions verified against official GitHub releases and pkg.go.dev as of 2026-04-04
- Architecture: HIGH — TruffleHog v3 pipeline is the proven model; channel patterns are established Go idiom
- Aho-Corasick library choice: HIGH — TruffleHog blog post explicitly credits petar-dambovaliev; pkg.go.dev confirms API and last update date
- AES-256+Argon2id approach: HIGH — stdlib only; RFC 9106 parameters; well-documented pattern
- Pitfalls: HIGH — sourced from GHSA advisory, HashiCorp research, official Go docs
- Test architecture: HIGH — standard go test patterns; no uncertainty
Research date: 2026-04-04 Valid until: 2026-07-04 (stable libraries; 90 days is safe for Go stdlib + well-maintained ecosystem packages)