From fa3916a41781f3836f802b7d2c067211840e5fdb Mon Sep 17 00:00:00 2001 From: salvacybersec Date: Sat, 4 Apr 2026 23:32:10 +0300 Subject: [PATCH] docs(phase-1): research foundation phase Co-Authored-By: Claude Sonnet 4.6 --- .planning/phases/01-foundation/01-RESEARCH.md | 764 ++++++++++++++++++ 1 file changed, 764 insertions(+) create mode 100644 .planning/phases/01-foundation/01-RESEARCH.md diff --git a/.planning/phases/01-foundation/01-RESEARCH.md b/.planning/phases/01-foundation/01-RESEARCH.md new file mode 100644 index 0000000..5fa14da --- /dev/null +++ b/.planning/phases/01-foundation/01-RESEARCH.md @@ -0,0 +1,764 @@ +# Phase 1: Foundation - Research + +**Researched:** 2026-04-04 +**Domain:** Go CLI scaffolding, provider registry schema, three-stage scan pipeline, encrypted SQLite storage +**Confidence:** HIGH + +--- + + +## Phase Requirements + +| ID | Description | Research Support | +|----|-------------|------------------| +| CORE-01 | Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline | Aho-Corasick pre-filter (petar-dambovaliev/aho-corasick) + Go RE2 regexp; buffered channel pipeline pattern documented | +| CORE-02 | Provider definitions loaded from YAML files embedded at compile time via Go embed | `//go:embed providers/*.yaml` + `fs.WalkDir` to iterate; gopkg.in/yaml.v3 for parse | +| CORE-03 | Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata | Registry struct holding `[]Provider` loaded at startup; injected via constructor not global | +| CORE-04 | Entropy analysis as secondary signal for low-confidence providers | Shannon entropy implemented as ~10-line stdlib function using `math` package; threshold 3.5 bits/char | +| CORE-05 | Worker pool parallelism with configurable worker count (default: CPU count) | `ants.NewPool(runtime.NumCPU() * 8)` for detectors; `ants.NewPool(runtime.NumCPU())` for verifiers | +| CORE-06 | Aho-Corasick keyword pre-filter runs before 
regex for 10x performance on large files | petar-dambovaliev/aho-corasick: 20x faster than cloudflare's, used by TruffleHog; build automaton at registry init | +| CORE-07 | mmap-based large file reading for memory efficiency | `golang.org/x/sys/unix.Mmap` or `syscall.Mmap` for large file sources; skip binary files via magic bytes | +| STOR-01 | SQLite database for persisting scan results, keys, recon history | `modernc.org/sqlite` (pure Go, no CGo); `database/sql` interface; WAL mode; schema embedded via `//go:embed` | +| STOR-02 | Application-level AES-256 encryption for stored keys and sensitive config | `crypto/aes` + `crypto/cipher` GCM mode; nonce prepended to ciphertext stored in BLOB column | +| STOR-03 | Encryption key derived from user passphrase via Argon2 | `golang.org/x/crypto/argon2` IDKey with RFC 9106 params: time=1, memory=64*1024, threads=4, keyLen=32 | +| CLI-01 | Cobra-based CLI with commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule | cobra v1.10.2 command tree; `cmd/` package; main.go < 30 lines | +| CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | `viper.WriteConfigAs(filepath)` with `os.MkdirAll`; `PersistentPreRunE` for config load | +| CLI-03 | keyhunter config set key value persists values | `viper.Set(key, value)` + `viper.WriteConfig()` | +| CLI-04 | keyhunter providers list/info/stats for provider management | Registry.List(), Registry.Get(name) from loaded YAML; lipgloss table for terminal output | +| CLI-05 | Scan flags: --providers, --category, --confidence, --exclude, --verify, --workers, --output, --unmask, --notify | Persistent flags on `scan` command; viper.BindPFlag for config file override | +| PROV-10 | Provider YAML schema includes format_version and last_verified fields validated at load time | Custom `UnmarshalYAML` method on Provider struct; return error if format_version == 0 or last_verified empty | + + +--- + +## Summary + +Phase 1 builds the three foundations 
everything else depends on: the provider registry (YAML schema + embed), the storage layer (SQLite + AES-256 encryption), and the CLI skeleton (Cobra + Viper). These components have no upstream dependencies in the architecture; nothing downstream can be built until they exist. The scanning engine (three-stage pipeline) is also scoped here because all Phase 1 success criteria require a working scan to validate those foundations. + +The stack for this phase is fully settled from prior research. The one open question from SUMMARY.md, which Aho-Corasick library to use, is now resolved: TruffleHog originally used `petar-dambovaliev/aho-corasick` (confirmed via the TruffleHog blog post crediting Petar Dambovaliev). That library is 20x faster than cloudflare/ahocorasick and uses 1/8th the memory. A second open question, Argon2 vs PBKDF2 for key derivation, is resolved: use Argon2id (`golang.org/x/crypto/argon2.IDKey`) per RFC 9106 recommendations; it is the modern standard and ships in the `x/crypto` package that is already needed for other operations. + +The AES-256 encryption approach is application-level (not SQLCipher), using `crypto/aes` GCM mode with the nonce prepended to the ciphertext stored as a BLOB. This preserves `CGO_ENABLED=0` throughout. Key derivation uses Argon2id to produce a 32-byte key from a user passphrase + random salt stored alongside the database. For Phase 1, the salt can be stored in the config YAML; OS keychain integration (zalando/go-keyring) can be added in a later phase without schema migration. + +**Primary recommendation:** Build in order: (1) Provider YAML schema + embed loader, (2) SQLite schema + AES-256 crypto layer, (3) Scanning engine pipeline (Aho-Corasick + RE2 + entropy), (4) Cobra/Viper CLI wiring. The scan pipeline validation is the integration test that proves all three foundations work together. + +--- + +## Project Constraints (from CLAUDE.md) + +All directives from CLAUDE.md are binding. 
Key constraints for Phase 1: + +| Constraint | Directive | +|------------|-----------| +| Language | Go 1.22+ only. No other language. | +| CGO | `CGO_ENABLED=0` throughout — single binary, cross-compilation | +| SQLite driver | `modernc.org/sqlite` — NOT `mattn/go-sqlite3` (CGo) | +| SQLite encryption | Application-level AES-256 via `crypto/aes` — NOT SQLCipher (CGo) | +| CLI framework | `cobra v1.10.2` + `viper v1.21.0` — no alternatives | +| YAML parsing | `gopkg.in/yaml.v3` — not sigs.k8s.io/yaml or goccy/go-yaml | +| Concurrency | `ants v2.12.0` worker pool | +| Architecture | Plugin-based — providers as YAML files, compile-time embedded via `go:embed` | +| Regex engine | Go stdlib `regexp` (RE2-backed) ONLY — never `regexp2` or PCRE | +| Verification | Opt-in (`--verify` flag) — passive scanning by default | +| Key masking | Default masked output, `--unmask` for full keys | +| Worker pool | `github.com/panjf2000/ants/v2` v2.12.0 | +| Output formatting | `github.com/charmbracelet/lipgloss` (latest) | +| Web stack | `chi v5.2.5` + `templ v0.3.1001` + htmx + Tailwind v4 (Phase 18 only — do not scaffold in Phase 1) | +| Telegram | `telego v1.8.0` (Phase 17 only) | +| Scheduler | `gocron v2.19.1` (Phase 17 only) | +| Build | `go build -ldflags="-s -w"` for stripped binary | +| Forbidden patterns | Fiber, Echo, mattn/go-sqlite3, SQLCipher, robfig/cron, regexp2, Full Bubble Tea TUI | + +--- + +## Standard Stack + +### Core (Phase 1 only) + +| Library | Version | Purpose | Why Standard | +|---------|---------|---------|--------------| +| `github.com/spf13/cobra` | v1.10.2 | CLI command tree | Industry standard; used by TruffleHog, Gitleaks, Kubernetes, Docker CLI | +| `github.com/spf13/viper` | v1.21.0 | Config management (YAML/env/flags) | Cobra-native integration; v1.21.0 fixed key-casing bugs | +| `modernc.org/sqlite` | v1.35.x (SQLite 3.51.2) | Embedded database | Pure Go; no CGo; cross-compiles cleanly; updated 2026-03-17 | +| `gopkg.in/yaml.v3` | v3.0.1 | Parse 
provider YAML | Handles inline/anchored structs; stable; transitive dep of cobra anyway | +| `github.com/petar-dambovaliev/aho-corasick` | latest (2025-04-24) | Keyword pre-filter | 20x faster than cloudflare/ahocorasick; 1/8 memory; used by TruffleHog; MIT license | +| `github.com/panjf2000/ants/v2` | v2.12.0 | Worker pool | Battle-tested goroutine pool; v2.12.0 adds ReleaseContext for clean shutdown | +| `golang.org/x/crypto` | latest x/ | Argon2 key derivation | Official Go extended library; IDKey for Argon2id; same package used for other crypto needs | +| `golang.org/x/time` | latest x/ | Rate limiting (future-proofing) | Needed for CORE-05 workers; token bucket; add now to avoid later go.mod churn | +| `github.com/charmbracelet/lipgloss` | latest | Terminal table output | Declarative styles; used across Go security tool ecosystem | +| `github.com/stretchr/testify` | v1.10.x | Test assertions | Assert/require only; standard in Go ecosystem | +| `embed` (stdlib) | — | Compile-time YAML embed | Go 1.16+ native; no external dep | +| `crypto/aes`, `crypto/cipher` (stdlib) | — | AES-256-GCM encryption | Standard library; no CGo; GCM provides authenticated encryption | +| `math` (stdlib) | — | Shannon entropy calculation | ~10-line implementation; no library needed | +| `database/sql` (stdlib) | — | SQL interface over modernc.org/sqlite | Driver registered as `"sqlite"`; raw SQL; no ORM | + +### Not Needed in Phase 1 (scaffold stubs only if required by interface) + +| Library | Deferred To | +|---------|-------------| +| `chi v5.2.5` | Phase 18 (Web Dashboard) | +| `templ v0.3.1001` | Phase 18 (Web Dashboard) | +| `telego v1.8.0` | Phase 17 (Telegram Bot) | +| `gocron v2.19.1` | Phase 17 (Scheduler) | + +**Installation (Phase 1 dependencies):** + +```bash +go mod init github.com/yourusername/keyhunter +go get github.com/spf13/cobra@v1.10.2 +go get github.com/spf13/viper@v1.21.0 +go get modernc.org/sqlite@latest +go get gopkg.in/yaml.v3@v3.0.1 +go get 
github.com/petar-dambovaliev/aho-corasick@latest +go get github.com/panjf2000/ants/v2@v2.12.0 +go get golang.org/x/crypto@latest +go get golang.org/x/time@latest +go get github.com/charmbracelet/lipgloss@latest +go get github.com/stretchr/testify@latest +``` + +**Version verification (run before writing go.mod manually):** + +```bash +go list -m github.com/petar-dambovaliev/aho-corasick +go list -m modernc.org/sqlite +go list -m github.com/panjf2000/ants/v2 +``` + +--- + +## Architecture Patterns + +### Recommended Project Structure (Phase 1 scope) + +``` +keyhunter/ + main.go # < 30 lines, cobra root Execute() + cmd/ + root.go # rootCmd, persistent flags, PersistentPreRunE config load + scan.go # scan command + flags + providers.go # providers list/info/stats commands + config.go # config init/set/get commands + pkg/ + providers/ + loader.go # embed.FS + fs.WalkDir + yaml.Unmarshal + registry.go # Registry struct, Get/List/Stats methods + schema.go # Provider, Pattern, VerifySpec structs + UnmarshalYAML validation + engine/ + engine.go # Engine struct, Scan() method, pipeline orchestration + pipeline.go # channel wiring: chunksChan, detectableChan, resultsChan + filter.go # Aho-Corasick pre-filter stage + detector.go # Regex + entropy detector worker + entropy.go # Shannon entropy function + chunk.go # Chunk type (content []byte, source string, offset int64) + finding.go # Finding type (provider, key_value, key_masked, confidence, source, path) + sources/ + source.go # Source interface + file.go # FileSource + dir.go # DirSource (recursive with glob exclude) + storage/ + db.go # DB struct, Open(), migrations via embedded schema.sql + schema.sql # DDL for findings, scans, settings tables + encrypt.go # AES-256-GCM Encrypt(plaintext, key) / Decrypt(ciphertext, key) + crypto.go # Argon2id key derivation: DeriveKey(passphrase, salt) + findings.go # CRUD for findings table + scans.go # CRUD for scans table + config/ + config.go # Config struct, Load(), defaults + 
output/ + table.go # lipgloss colored terminal table + json.go # encoding/json output + providers/ + openai.yaml # Reference provider definitions (Phase 1: schema examples only) + anthropic.yaml + huggingface.yaml +``` + +### Pattern 1: Provider Registry with Compile-Time Embed + +**What:** Provider YAML definitions embedded via `//go:embed` and loaded into an in-memory registry at startup. + +**When to use:** Always. Never load provider files from disk at runtime. + +```go +// Source: Go embed docs (https://pkg.go.dev/embed) + +package providers + +import ( + "embed" + "fmt" + "io/fs" + "path/filepath" + + ahocorasick "github.com/petar-dambovaliev/aho-corasick" + "gopkg.in/yaml.v3" +) + +// go:embed patterns cannot contain ".." path elements, so the file holding +// this directive must live in a directory that contains providers/ +// (for example the module root), not below it in pkg/providers/. +//go:embed providers/*.yaml +var providersFS embed.FS + +type Registry struct { + providers []Provider + ac ahocorasick.AhoCorasick +} + +func NewRegistry() (*Registry, error) { + var providers []Provider + err := fs.WalkDir(providersFS, "providers", func(path string, d fs.DirEntry, err error) error { + // Propagate walk errors; skip directories and non-YAML files. + if err != nil || d.IsDir() || filepath.Ext(path) != ".yaml" { + return err + } + data, err := providersFS.ReadFile(path) + if err != nil { + return err + } + var p Provider + if err := yaml.Unmarshal(data, &p); err != nil { + return fmt.Errorf("provider %s: %w", path, err) + } + providers = append(providers, p) + return nil + }) + if err != nil { + return nil, err + } + // Build Aho-Corasick automaton from all keywords + var keywords []string + for _, p := range providers { + keywords = append(keywords, p.Keywords...) + } + builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true}) + ac := builder.Build(keywords) + return &Registry{providers: providers, ac: ac}, nil +} +``` + +### Pattern 2: Three-Stage Scanning Pipeline (Buffered Channels) + +**What:** Source adapters produce chunks onto buffered channels. Aho-Corasick pre-filter reduces candidates. Detector workers apply regex + entropy. + +**When to use:** All scan operations. Never skip the pre-filter. 
+ +```go +// Source: TruffleHog v3 architecture (https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick) + +func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) { + chunksChan := make(chan Chunk, 1000) + detectableChan := make(chan Chunk, 500) + resultsChan := make(chan Finding, 100) + + // Stage 1: Source → chunks + go func() { + defer close(chunksChan) + src.Chunks(ctx, chunksChan) + }() + + // Stage 2: Aho-Corasick keyword pre-filter + go func() { + defer close(detectableChan) + for chunk := range chunksChan { + if len(e.registry.AC().FindAll(string(chunk.Data))) > 0 { + detectableChan <- chunk + } + } + }() + + // Stage 3: Detector workers + var wg sync.WaitGroup + for i := 0; i < cfg.Workers; i++ { + wg.Add(1) + go func() { + defer wg.Done() + for chunk := range detectableChan { + e.detect(chunk, resultsChan) + } + }() + } + go func() { + wg.Wait() + close(resultsChan) + }() + + return resultsChan, nil +} +``` + +### Pattern 3: AES-256-GCM Column Encryption + +**What:** Encrypt key values before storing in SQLite. Nonce prepended to ciphertext. Key derived via Argon2id. + +**When to use:** Every write of a full API key to storage. 
+ +```go +// Source: Go crypto/cipher docs (https://pkg.go.dev/crypto/cipher) + +import ( + "crypto/aes" + "crypto/cipher" + "crypto/rand" + "errors" + "io" +) + +func Encrypt(plaintext []byte, key []byte) ([]byte, error) { + block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256 + if err != nil { + return nil, err + } + gcm, err := cipher.NewGCM(block) + if err != nil { + return nil, err + } + nonce := make([]byte, gcm.NonceSize()) + if _, err := io.ReadFull(rand.Reader, nonce); err != nil { + return nil, err + } + ciphertext := gcm.Seal(nonce, nonce, plaintext, nil) // nonce prepended + return ciphertext, nil +} + +func Decrypt(ciphertext []byte, key []byte) ([]byte, error) { + block, err := aes.NewCipher(key) + if err != nil { + return nil, err + } + gcm, err := cipher.NewGCM(block) + if err != nil { + return nil, err + } + nonceSize := gcm.NonceSize() + if len(ciphertext) < nonceSize { + return nil, errors.New("ciphertext too short") + } + nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:] + return gcm.Open(nil, nonce, ciphertext, nil) +} +``` + +### Pattern 4: Argon2id Key Derivation + +**What:** Derive a 32-byte AES key from the user passphrase + random salt. Parameters follow the `golang.org/x/crypto/argon2` package documentation (time=1, memory=64 MiB); note that RFC 9106 pairs its 64 MiB option with time=3. + +**When to use:** On first use (generate and store salt); on subsequent use (re-derive key from passphrase + stored salt). 
+ +```go +// Source: https://pkg.go.dev/golang.org/x/crypto/argon2 + +import "golang.org/x/crypto/argon2" + +const ( + argon2Time = 1 + argon2Memory = 64 * 1024 // 64 MB + argon2Threads = 4 + argon2KeyLen = 32 // AES-256 key length +) + +func DeriveKey(passphrase []byte, salt []byte) []byte { + return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen) +} + +// Generate salt (do once, persist in config) +func NewSalt() ([]byte, error) { + salt := make([]byte, 32) + _, err := rand.Read(salt) + return salt, err +} +``` + +### Pattern 5: Provider YAML Schema with Validation + +**What:** Provider struct with `UnmarshalYAML` that validates required fields including `format_version` and `last_verified`. + +**When to use:** Every provider YAML load. Return error on invalid schema — fail fast at startup. + +```go +// Source: gopkg.in/yaml.v3 docs (https://pkg.go.dev/gopkg.in/yaml.v3) + +type Provider struct { + Name string `yaml:"name"` + FormatVersion int `yaml:"format_version"` + LastVerified string `yaml:"last_verified"` // ISO date: "2026-04-04" + Keywords []string `yaml:"keywords"` + Patterns []Pattern `yaml:"patterns"` + Verify *VerifySpec `yaml:"verify,omitempty"` +} + +func (p *Provider) UnmarshalYAML(value *yaml.Node) error { + type rawProvider Provider + if err := value.Decode((*rawProvider)(p)); err != nil { + return err + } + if p.FormatVersion == 0 { + return fmt.Errorf("provider %q: format_version is required", p.Name) + } + if p.LastVerified == "" { + return fmt.Errorf("provider %q: last_verified is required", p.Name) + } + if len(p.Keywords) == 0 { + return fmt.Errorf("provider %q: at least one keyword is required", p.Name) + } + if len(p.Patterns) == 0 { + return fmt.Errorf("provider %q: at least one pattern is required", p.Name) + } + return nil +} +``` + +### Pattern 6: Shannon Entropy (stdlib only) + +**What:** Calculate bits-per-character entropy of a string. No library needed. 
+ +**When to use:** CORE-04: as secondary signal after keyword pre-filter and regex confirm a candidate. Entropy threshold 3.5 bits/char for most providers. + +```go +// Source: Shannon entropy formula (verified against TruffleHog entropy implementation) + +import "math" + +func shannonEntropy(s string) float64 { + if len(s) == 0 { + return 0 + } + freq := make(map[rune]float64) + for _, c := range s { + freq[c]++ + } + n := float64(len([]rune(s))) + var entropy float64 + for _, count := range freq { + p := count / n + entropy -= p * math.Log2(p) + } + return entropy +} +``` + +### Pattern 7: modernc.org/sqlite Initialization + +**What:** Open SQLite DB with WAL mode and foreign keys. Schema embedded via `//go:embed`. + +**When to use:** Storage layer initialization. + +```go +// Source: https://practicalgobook.net/posts/go-sqlite-no-cgo/ + +import ( + "database/sql" + _ "embed" // blank import: required to use //go:embed with a string variable + + _ "modernc.org/sqlite" // registers "sqlite" driver +) + +//go:embed schema.sql +var schemaSQL string + +func Open(path string) (*sql.DB, error) { + db, err := sql.Open("sqlite", path) + if err != nil { + return nil, err + } + // Enable WAL mode for better concurrent access + if _, err := db.Exec("PRAGMA journal_mode=WAL"); err != nil { + return nil, err + } + if _, err := db.Exec("PRAGMA foreign_keys=ON"); err != nil { + return nil, err + } + // Apply schema + if _, err := db.Exec(schemaSQL); err != nil { + return nil, err + } + return db, nil +} +``` + +### Pattern 8: Cobra + Viper Wiring + +**What:** Cobra command tree with Viper config loaded in PersistentPreRunE. Flags bound to Viper for config file override. + +**When to use:** root.go and all command files. 
+ +```go +// Source: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/ + +var rootCmd = &cobra.Command{ + Use: "keyhunter", + Short: "API key scanner for 108+ LLM providers", + PersistentPreRunE: func(cmd *cobra.Command, args []string) error { + viper.SetConfigName(".keyhunter") + viper.SetConfigType("yaml") + viper.AddConfigPath("$HOME") + viper.AutomaticEnv() + viper.SetEnvPrefix("KEYHUNTER") + if err := viper.ReadInConfig(); err != nil { + if _, ok := err.(viper.ConfigFileNotFoundError); !ok { + return err + } + // Config file not found — acceptable on first run + } + return nil + }, +} + +// In init() of each command file: +func init() { + scanCmd.Flags().IntP("workers", "w", 0, "worker count (default: CPU count)") + viper.BindPFlag("workers", scanCmd.Flags().Lookup("workers")) +} +``` + +### Anti-Patterns to Avoid + +- **Global provider registry:** Pass `*Registry` via constructor. Global state makes testing impossible without full initialization. +- **Unbuffered result channels:** Use `make(chan Finding, 1000)`. Unbuffered channels cause detector workers to block on slow output consumers, collapsing parallelism. +- **Runtime YAML loading:** Never load from filesystem at scan time. `//go:embed` only. +- **regexp2 or PCRE:** Go's `regexp` package (RE2) already provides linear-time guarantees. regexp2 loses this guarantee. +- **Entropy-only detection:** Never flag a candidate based solely on entropy score. Entropy is a secondary filter applied only after keyword pre-filter and regex confirm a pattern match. +- **Plaintext key column:** Never store full API key as TEXT. Always encrypt with AES-256-GCM before INSERT. +- **os.Getenv for passphrase:** Derive the AES key via Argon2id from a passphrase. Never store raw passphrase or raw key in config file. 
+ +--- + +## Don't Hand-Roll + +| Problem | Don't Build | Use Instead | Why | +|---------|-------------|-------------|-----| +| Multi-keyword search automaton | Custom Aho-Corasick or loop-based string search | `petar-dambovaliev/aho-corasick` | Aho-Corasick is O(n+m+z); naive loop is O(n*k); library is 20x faster than next-best Go option | +| Goroutine pool with dynamic resize | Manual goroutine spawn with WaitGroup | `ants v2.12.0` | Goroutine explosion on large repos; ants handles backpressure, panic recovery, context cancellation | +| AES key derivation from passphrase | SHA256(passphrase) or similar | `argon2.IDKey` from `golang.org/x/crypto` | MD5/SHA hash-based KDF is trivially brute-forceable; Argon2id is GPU-resistant by design | +| SQLite column encryption | XOR, custom cipher, or base64 "encoding" | `crypto/aes` GCM via stdlib | GCM provides both confidentiality and authentication; custom schemes always have vulnerabilities | +| Config file management | Custom INI or JSON parser | `viper v1.21.0` | Viper handles YAML + env vars + CLI flags with correct precedence; hand-rolled configs miss env var override | +| CLI command parsing | `flag` stdlib or custom parser | `cobra v1.10.2` | Cobra provides nested subcommands, persistent flags, shell completion, help generation — stdlib flag lacks all of these | + +**Key insight:** The Aho-Corasick pre-filter and Argon2id key derivation in particular are problems where "obvious" implementations (nested loops, SHA256) have well-documented security or performance failure modes that justify the dependency cost. + +--- + +## Common Pitfalls + +### Pitfall 1: Wrong Aho-Corasick Library Choice + +**What goes wrong:** Using `cloudflare/ahocorasick` because it appears more prominent in search results. It uses 8x more memory and runs 20x slower than `petar-dambovaliev/aho-corasick`. For a tool scanning large repos with 108 keyword patterns, this difference is measurable. 
+ +**Why it happens:** cloudflare/ahocorasick appears first in many search results. + +**How to avoid:** Use `github.com/petar-dambovaliev/aho-corasick` (verified as the library TruffleHog uses per their own blog post crediting Petar Dambovaliev). Build the automaton once at registry init; the automaton is thread-safe for concurrent reads. + +**Warning signs:** Scans on 100MB+ repos running significantly slower than expected; high memory usage during keyword pre-filter stage. + +--- + +### Pitfall 2: Encryption Key Stored in Config File as Raw Bytes + +**What goes wrong:** Storing the 32-byte AES key directly in `~/.keyhunter.yaml` or as an env var. Anyone with config file read access can decrypt the entire database. + +**Why it happens:** "Just store the key" is the simplest implementation. Argon2id + salt seems like unnecessary complexity. + +**How to avoid:** Store only the randomly generated salt in config. Re-derive the key from `passphrase + salt` on each run using Argon2id. The passphrase is entered interactively or set via `KEYHUNTER_PASSPHRASE` env var (never stored). For Phase 1, interactive passphrase entry is acceptable. OS keychain integration (zalando/go-keyring) can be added later without schema migration. + +**Warning signs:** `keyhunter_key:` field appearing as hex bytes in the YAML config file. + +--- + +### Pitfall 3: Provider YAML Schema Allowing Missing format_version or last_verified + +**What goes wrong:** Provider YAML loads without validation. Providers with missing `format_version` or stale `last_verified` accumulate over time. Pattern health tracking (PROV-10) becomes meaningless. + +**Why it happens:** `yaml.Unmarshal` to a struct silently zero-values missing fields. No validation = no error. + +**How to avoid:** Implement `UnmarshalYAML` with explicit validation on the Provider struct. Fail at startup (not at scan time) if any provider is invalid. This catches schema errors at development time, not production time. 
+ +**Warning signs:** `format_version: 0` appearing in any loaded provider struct. + +--- + +### Pitfall 4: SQLite Without WAL Mode in a CLI Tool + +**What goes wrong:** Default SQLite journal mode causes `SQLITE_BUSY` errors when the dashboard (Phase 18) or multiple concurrent processes read while a scan writes. The default journal also performs slower write throughput. + +**Why it happens:** `sql.Open("sqlite", path)` uses the default rollback journal. + +**How to avoid:** Always execute `PRAGMA journal_mode=WAL` immediately after opening the database. This is a one-time setup that persists in the database file. Also set `PRAGMA foreign_keys=ON` to enforce referential integrity. + +**Warning signs:** `database is locked` errors during concurrent read/write operations. + +--- + +### Pitfall 5: Entropy Check Before Regex Confirmation + +**What goes wrong:** Running Shannon entropy on every chunk that passes the Aho-Corasick keyword filter produces up to 80% false positive rate (HashiCorp 2025 research). High-entropy strings like UUIDs, hashes, and base64-encoded content all score above 3.5 bits/char. + +**Why it happens:** Entropy feels like an independent signal and is applied eagerly as a quick filter. + +**How to avoid:** Entropy is strictly a secondary signal. Apply it only to strings that have already matched both a keyword (Aho-Corasick) AND a provider regex pattern. The order is: keyword pre-filter → regex match → entropy calibration. Never entropy-only. + +**Warning signs:** More than 30% of findings lacking a matching provider regex pattern. + +--- + +### Pitfall 6: mmap for Small Files on Linux + +**What goes wrong:** mmap is beneficial for large files (>10MB) where avoiding a full read-into-memory matters. For small files, mmap has higher setup overhead than a simple `os.ReadFile`. mmap also requires explicit cleanup to avoid address space exhaustion on directory scans with thousands of small files. 
+ +**Why it happens:** CORE-07 specifies mmap-based large file reading, and implementing it uniformly for all files seems simpler. + +**How to avoid:** Use `os.ReadFile` for files under 10MB. Use mmap only above that threshold, with explicit `unix.Munmap` cleanup in a deferred call. Check for binary files before mmap: use the first 512 bytes of the file to detect binary content via `http.DetectContentType` and skip non-text files. + +**Warning signs:** Address space exhaustion during directory scans; "too many open files" errors. + +--- + +### Pitfall 7: Argon2 Parameters Too Low (Fast KDF = Weak Security) + +**What goes wrong:** Using time=1, memory=4096 (a commonly copied example) instead of a recommended memory cost. Fast key derivation makes brute-force attacks on the database passphrase practical. + +**Why it happens:** Low parameters make tests run faster and startup feel snappier. + +**How to avoid:** Use the Argon2id parameters suggested in the `golang.org/x/crypto/argon2` package documentation: `time=1, memory=64*1024 (64MB), threads=4, keyLen=32`. RFC 9106 pairs its 64 MiB option with `time=3`, so raise `time` to 3 if strict RFC conformance is required. Test startup latency with these parameters; on modern hardware, key derivation takes ~100-300ms, which is acceptable for a CLI tool. + +**Warning signs:** Calls such as `argon2.IDKey(pass, salt, 1, 4096, 1, 32)`. The memory parameter should be 64*1024 (65536 KiB), not 4096. 
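Returning to the file-reading rule in Pitfall 6, the decision can be sketched as a small pure function. This is an illustration under assumptions: `readStrategy`, `looksBinary`, and the returned strategy names are hypothetical, and the threshold is written as 10 MiB to approximate the 10MB cutoff named above. Only `http.DetectContentType` is taken from the text.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// mmapThreshold approximates the 10MB cutoff from Pitfall 6 (assumed as 10 MiB).
const mmapThreshold = 10 << 20

// looksBinary sniffs the leading bytes of a file (DetectContentType
// considers at most the first 512) and treats anything that is not
// a text/* content type as binary.
func looksBinary(head []byte) bool {
	return !strings.HasPrefix(http.DetectContentType(head), "text/")
}

// readStrategy is a hypothetical helper that picks how a file source
// should read a file of the given size with the given leading bytes.
func readStrategy(size int64, head []byte) string {
	switch {
	case looksBinary(head):
		return "skip" // binary content is never scanned
	case size < mmapThreshold:
		return "readfile" // small file: plain os.ReadFile
	default:
		return "mmap" // large text file: mmap with deferred Munmap
	}
}

func main() {
	fmt.Println(readStrategy(120, []byte("OPENAI_API_KEY=abc\n")))
	fmt.Println(readStrategy(120, []byte("\x89PNG\r\n\x1a\n")))
	fmt.Println(readStrategy(20<<20, []byte("plain text log line\n")))
}
```

Keeping the decision separate from the actual I/O makes the threshold and the binary check unit-testable without creating large files.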
+ +--- + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| Linear keyword scan before regex | Aho-Corasick pre-filter | TruffleHog v3.28.6 (2024) | 2x average scan speedup; O(n+m+z) vs O(n*k) | +| mattn/go-sqlite3 (CGo) | modernc.org/sqlite (pure Go) | Go 1.16+ era | CGO_ENABLED=0 enabled; cross-compilation works | +| SQLCipher for DB encryption | Application-level AES-256-GCM | 2023-2025 | No CGo dependency; AES GCM provides authentication | +| PBKDF2 for key derivation | Argon2id | RFC 9106 (2021) | GPU-resistant; side-channel resistant; OWASP recommended | +| regexp2/PCRE in Go scanners | Go stdlib regexp (RE2) | Ongoing | ReDoS immune; linear time guaranteed | +| Storing full keys masked in DB | Encrypt key_encrypted column, store only mask | GHSA-4h8c-qrcq-cv5c (2024) | Database file no longer a credential dump | + +**Deprecated/outdated:** +- `mattn/go-sqlite3`: Requires CGo; cross-compilation breaks; modernc.org/sqlite is the replacement. +- `robfig/cron`: Unmaintained since 2020; use gocron v2 (Phase 17). +- `cloudflare/ahocorasick`: Still maintained but 20x slower than petar-dambovaliev/aho-corasick; do not use. +- Entropy-only secret detection: HashiCorp 2025 research confirms 80% FP rate; layered pipeline is the current standard. + +--- + +## Open Questions + +1. **Passphrase UX for first run** + - What we know: Argon2id requires a passphrase to derive the AES key. For Phase 1, this must be handled on first `keyhunter scan` or `keyhunter config init`. + - What's unclear: Should Phase 1 use `bufio.NewReader(os.Stdin).ReadString('\n')` for passphrase entry, or skip encryption and use a generated random key stored in config (less secure but zero-friction)? + - Recommendation: Use a generated random 32-byte key stored in `~/.keyhunter.yaml` as base64 for Phase 1 (zero-friction). 
Document that this is a development shortcut; OS keychain integration (zalando/go-keyring) replaces it in a later phase. The encrypt/decrypt functions and schema are in place; only the key source changes. + +2. **mmap on Linux: syscall vs golang.org/x/sys/unix** + - What we know: Both `syscall.Mmap` and `golang.org/x/sys/unix.Mmap` provide mmap on Linux. The x/ package has a cleaner API. + - What's unclear: `golang.org/x/sys` is already a transitive dependency of many packages (likely pulled in by viper or cobra). Whether it's already in go.sum or needs explicit addition. + - Recommendation: Use `golang.org/x/sys/unix` for mmap in the file source adapter. It will almost certainly already be in go.sum. Only implement mmap for Phase 1 if CORE-07 is in scope for the minimal viable scan pipeline; otherwise defer to Phase 4. + +3. **Provider YAML for Phase 1: how many definitions?** + - What we know: Phase 1 requires schema + PROV-10 (format_version, last_verified fields). Full 108-provider definitions are Phase 2-3. + - What's unclear: The success criterion "keyhunter scan ./somefile runs the three-stage pipeline and returns findings with provider names" implies at least one real provider definition must exist. + - Recommendation: Ship 3 reference provider definitions in Phase 1 (OpenAI, Anthropic, HuggingFace) with valid format_version and last_verified. All 108 providers are Phase 2-3 scope. These 3 definitions validate the schema and make the success criteria testable. 
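The Phase 1 shortcut recommended under question 1 (a random 32-byte key stored as base64) could look roughly like the sketch below. The function names `newEncodedKey` and `decodeKey` are hypothetical, and writing the encoded value into `~/.keyhunter.yaml` is left out; the sketch only covers generation and validation of the key material.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// newEncodedKey generates a random 32-byte AES-256 key and returns it
// base64-encoded, ready to be written to a config file.
func newEncodedKey() (string, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(key), nil
}

// decodeKey reverses newEncodedKey and rejects values that do not
// decode to exactly 32 bytes, so a truncated config entry fails early.
func decodeKey(encoded string) ([]byte, error) {
	key, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	if len(key) != 32 {
		return nil, errors.New("key must decode to 32 bytes")
	}
	return key, nil
}

func main() {
	enc, err := newEncodedKey()
	if err != nil {
		panic(err)
	}
	key, err := decodeKey(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(enc), len(key))
}
```

Because the decoded value is a plain 32-byte slice, it can later be swapped for an Argon2id-derived or keychain-sourced key without touching the encrypt/decrypt functions.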
+
+---
+
+## Environment Availability
+
+| Dependency | Required By | Available | Version | Fallback |
+|------------|------------|-----------|---------|----------|
+| Go | Core language | Yes | 1.26.1 (exceeds 1.22 requirement) | — |
+| git | Version control | Yes | 2.53.0 | — |
+| golangci-lint | Static analysis / CI | No | — | Install via `go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest` or skip in Phase 1 |
+| npm | (not needed Phase 1) | Yes (4.2.2 via tailwind check) | — | Not needed until Phase 18 |
+
+**Missing dependencies with fallback:**
+- `golangci-lint`: Not found. Install before Phase 1 linting task, or skip lint gate for initial scaffold and add in CI pipeline. Fallback: `go vet ./...` catches most critical issues.
+
+**Missing dependencies with no fallback:**
+- None. Go 1.26.1 is available and exceeds the 1.22+ requirement.
+
+---
+
+## Validation Architecture
+
+### Test Framework
+
+| Property | Value |
+|----------|-------|
+| Framework | `go test` (stdlib) + `testify v1.10.x` for assertions |
+| Config file | None needed — standard `go test ./...` discovers `*_test.go` |
+| Quick run command | `go test ./pkg/... -race -timeout 30s` |
+| Full suite command | `go test ./... -race -cover -timeout 120s` |
+
+### Phase Requirements to Test Map
+
+| Req ID | Behavior | Test Type | Automated Command | File Exists? |
+|--------|----------|-----------|-------------------|-------------|
+| CORE-01 | Scan pipeline detects known API key patterns in test input | integration | `go test ./pkg/engine/... -run TestScanPipeline -v` | No — Wave 0 |
+| CORE-02 | Provider YAML loads from embed.FS without error | unit | `go test ./pkg/providers/... -run TestNewRegistry -v` | No — Wave 0 |
+| CORE-03 | Registry holds correct provider count, Get() returns provider by name | unit | `go test ./pkg/providers/... -run TestRegistry -v` | No — Wave 0 |
+| CORE-04 | Shannon entropy returns correct value for known inputs | unit | `go test ./pkg/engine/... -run TestShannonEntropy -v` | No — Wave 0 |
+| CORE-05 | Worker pool uses correct concurrency; all workers complete | unit | `go test ./pkg/engine/... -run TestWorkerPool -v` | No — Wave 0 |
+| CORE-06 | Aho-Corasick filter passes keyword-matched chunks; rejects non-matching | unit | `go test ./pkg/engine/... -run TestAhoCorasickFilter -v` | No — Wave 0 |
+| CORE-07 | Large file source reads without OOM; binary files skipped | integration | `go test ./pkg/engine/sources/... -run TestFileSourceLarge -v` | No — Wave 0 |
+| STOR-01 | SQLite DB opens; schema applies; WAL mode set | unit | `go test ./pkg/storage/... -run TestOpen -v` | No — Wave 0 |
+| STOR-02 | AES-256-GCM encrypt/decrypt round-trip is lossless | unit | `go test ./pkg/storage/... -run TestEncryptDecrypt -v` | No — Wave 0 |
+| STOR-03 | Argon2id DeriveKey produces 32-byte deterministic output | unit | `go test ./pkg/storage/... -run TestDeriveKey -v` | No — Wave 0 |
+| CLI-01 | `keyhunter --help` exits 0; all subcommands listed | smoke | `go run . --help` | No — Wave 0 |
+| CLI-02 | `keyhunter config init` creates ~/.keyhunter.yaml | integration | `go test ./cmd/... -run TestConfigInit -v` | No — Wave 0 |
+| CLI-03 | `keyhunter config set key val` persists to YAML | integration | `go test ./cmd/... -run TestConfigSet -v` | No — Wave 0 |
+| CLI-04 | `keyhunter providers list` returns at least 3 providers | integration | `go test ./cmd/... -run TestProvidersList -v` | No — Wave 0 |
+| CLI-05 | `keyhunter scan --workers 4 testfile` uses 4 workers | integration | `go test ./cmd/... -run TestScanFlags -v` | No — Wave 0 |
+| PROV-10 | Provider YAML with missing format_version returns error at load | unit | `go test ./pkg/providers/... -run TestProviderValidation -v` | No — Wave 0 |
+
+### Sampling Rate
+
+- **Per task commit:** `go test ./pkg/... -race -timeout 30s`
+- **Per wave merge:** `go test ./... -race -cover -timeout 120s`
+- **Phase gate:** Full suite green before `/gsd:verify-work`
+
+### Wave 0 Gaps
+
+All test files are new — this is a greenfield project.
+
+- [ ] `pkg/providers/registry_test.go` — covers CORE-02, CORE-03, PROV-10
+- [ ] `pkg/engine/entropy_test.go` — covers CORE-04
+- [ ] `pkg/engine/filter_test.go` — covers CORE-06
+- [ ] `pkg/engine/engine_test.go` — covers CORE-01, CORE-05
+- [ ] `pkg/engine/sources/file_test.go` — covers CORE-07
+- [ ] `pkg/storage/encrypt_test.go` — covers STOR-02
+- [ ] `pkg/storage/crypto_test.go` — covers STOR-03
+- [ ] `pkg/storage/db_test.go` — covers STOR-01
+- [ ] `cmd/config_test.go` — covers CLI-02, CLI-03
+- [ ] `cmd/providers_test.go` — covers CLI-04
+- [ ] `cmd/scan_test.go` — covers CLI-05
+- [ ] `testdata/fixtures/` — synthetic test files with known API key patterns for integration tests
+- [ ] Framework install: `go get github.com/stretchr/testify@latest` — if not added during go.mod init
+
+---
+
+## Sources
+
+### Primary (HIGH confidence)
+
+- TruffleHog Aho-Corasick blog: https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick — confirmed petar-dambovaliev library and 2x speedup claim
+- petar-dambovaliev/aho-corasick pkg.go.dev: https://pkg.go.dev/github.com/petar-dambovaliev/aho-corasick — API verified, last updated 2025-04-24
+- Go crypto/cipher (AES-GCM): https://pkg.go.dev/crypto/cipher — Encrypt/Decrypt pattern verified
+- Go argon2 package: https://pkg.go.dev/golang.org/x/crypto/argon2 — IDKey parameters from RFC 9106
+- modernc.org/sqlite pkg.go.dev: https://pkg.go.dev/modernc.org/sqlite — pure Go confirmed, SQLite 3.51.2
+- Go embed package: https://pkg.go.dev/embed — WalkDir pattern for loading embedded files
+- gopkg.in/yaml.v3: https://pkg.go.dev/gopkg.in/yaml.v3 — UnmarshalYAML custom validation
+- ants v2 README: https://github.com/panjf2000/ants — Pool usage pattern
+- cobra docs: https://github.com/spf13/cobra — PersistentPreRunE config loading pattern
+- viper docs: https://github.com/spf13/viper — BindPFlag, WriteConfigAs patterns
+
+### Secondary (MEDIUM confidence)
+
+- TruffleHog go.sum (BobuSumisu reference): https://github.com/trufflesecurity/trufflehog/blob/main/go.sum — historical library; petar-dambovaliev is current per blog post
+- Practical Go SQLite (no CGo): https://practicalgobook.net/posts/go-sqlite-no-cgo/ — WAL mode pattern verified against official SQLite docs
+- HashiCorp entropy FP research: https://www.hashicorp.com/en/blog/false-positives-a-big-problem-for-secret-scanners — 80% FP rate from entropy-only detection
+- Cobra/Viper 2025 article: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/ — PersistentPreRunE pattern
+- zalando/go-keyring: https://github.com/zalando/go-keyring — Linux uses D-Bus Secret Service (libsecret); noted as a future improvement after Phase 1
+
+### Tertiary (LOW confidence)
+
+- argon2aes reference implementation: https://github.com/presbrey/argon2aes — implementation pattern reference only; use stdlib directly
+- Passphrase UX patterns: Community convention; no authoritative Go CLI standard for passphrase input UX
+
+---
+
+## Metadata
+
+**Confidence breakdown:**
+- Standard stack: HIGH — all versions verified against official GitHub releases and pkg.go.dev as of 2026-04-04
+- Architecture: HIGH — TruffleHog v3 pipeline is the proven model; channel patterns are established Go idiom
+- Aho-Corasick library choice: HIGH — TruffleHog blog post explicitly credits petar-dambovaliev; pkg.go.dev confirms API and last update date
+- AES-256+Argon2id approach: HIGH — stdlib only; RFC 9106 parameters; well-documented pattern
+- Pitfalls: HIGH — sourced from GHSA advisory, HashiCorp research, official Go docs
+- Test architecture: HIGH — standard `go test` patterns; no uncertainty
+
+**Research date:** 2026-04-04
+**Valid until:** 2026-07-04 (stable libraries; 90 days is safe for Go stdlib + well-maintained ecosystem packages)