# Phase 1: Foundation - Research **Researched:** 2026-04-04 **Domain:** Go CLI scaffolding, provider registry schema, three-stage scan pipeline, encrypted SQLite storage **Confidence:** HIGH --- ## Phase Requirements | ID | Description | Research Support | |----|-------------|------------------| | CORE-01 | Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline | Aho-Corasick pre-filter (petar-dambovaliev/aho-corasick) + Go RE2 regexp; buffered channel pipeline pattern documented | | CORE-02 | Provider definitions loaded from YAML files embedded at compile time via Go embed | `//go:embed providers/*.yaml` + `fs.WalkDir` to iterate; gopkg.in/yaml.v3 for parse | | CORE-03 | Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata | Registry struct holding `[]Provider` loaded at startup; injected via constructor not global | | CORE-04 | Entropy analysis as secondary signal for low-confidence providers | Shannon entropy implemented as ~10-line stdlib function using `math` package; threshold 3.5 bits/char | | CORE-05 | Worker pool parallelism with configurable worker count (default: CPU count) | `ants.NewPool(runtime.NumCPU() * 8)` for detectors; `ants.NewPool(runtime.NumCPU())` for verifiers | | CORE-06 | Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files | petar-dambovaliev/aho-corasick: 20x faster than cloudflare's, used by TruffleHog; build automaton at registry init | | CORE-07 | mmap-based large file reading for memory efficiency | `golang.org/x/sys/unix.Mmap` or `syscall.Mmap` for large file sources; skip binary files via magic bytes | | STOR-01 | SQLite database for persisting scan results, keys, recon history | `modernc.org/sqlite` (pure Go, no CGo); `database/sql` interface; WAL mode; schema embedded via `//go:embed` | | STOR-02 | Application-level AES-256 encryption for stored keys and sensitive config | `crypto/aes` + `crypto/cipher` GCM mode; 
nonce prepended to ciphertext stored in BLOB column | | STOR-03 | Encryption key derived from user passphrase via Argon2 | `golang.org/x/crypto/argon2` IDKey with RFC 9106 params: time=1, memory=64*1024, threads=4, keyLen=32 | | CLI-01 | Cobra-based CLI with commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule | cobra v1.10.2 command tree; `cmd/` package; main.go < 30 lines | | CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | `viper.WriteConfigAs(filepath)` with `os.MkdirAll`; `PersistentPreRunE` for config load | | CLI-03 | keyhunter config set key value persists values | `viper.Set(key, value)` + `viper.WriteConfig()` | | CLI-04 | keyhunter providers list/info/stats for provider management | Registry.List(), Registry.Get(name) from loaded YAML; lipgloss table for terminal output | | CLI-05 | Scan flags: --providers, --category, --confidence, --exclude, --verify, --workers, --output, --unmask, --notify | Persistent flags on `scan` command; viper.BindPFlag for config file override | | PROV-10 | Provider YAML schema includes format_version and last_verified fields validated at load time | Custom `UnmarshalYAML` method on Provider struct; return error if format_version == 0 or last_verified empty | --- ## Summary Phase 1 builds the three foundations everything else depends on: the provider registry (YAML schema + embed), the storage layer (SQLite + AES-256 encryption), and the CLI skeleton (Cobra + Viper). These are the two zero-dependency components in the architecture — nothing downstream can be built until both exist. The scanning engine (three-stage pipeline) is also scoped here because all Phase 1 success criteria require a working scan to validate the other two foundations. The stack for this phase is fully settled from prior research. 
The one open question from SUMMARY.md (which Aho-Corasick library to use) is now resolved: TruffleHog originally used `petar-dambovaliev/aho-corasick` (confirmed via the TruffleHog blog post crediting Petar Dambovaliev). That library is 20x faster than cloudflare/ahocorasick and uses 1/8th the memory. A second open question (Argon2 vs PBKDF2 for key derivation) is also resolved: use Argon2id (`golang.org/x/crypto/argon2.IDKey`) per RFC 9106 recommendations. Argon2id is the modern standard and ships in `golang.org/x/crypto`, which the project already needs for other operations.

The AES-256 encryption approach is application-level (not SQLCipher), using `crypto/aes` in GCM mode with the nonce prepended to the ciphertext and the result stored as a BLOB. This preserves `CGO_ENABLED=0` throughout. Key derivation uses Argon2id to produce a 32-byte key from a user passphrase plus a random salt stored alongside the database. For Phase 1, the salt can be stored in the config YAML; OS keychain integration (zalando/go-keyring) can be added in a later phase without schema migration.

**Primary recommendation:** Build in order: (1) Provider YAML schema + embed loader, (2) SQLite schema + AES-256 crypto layer, (3) Scanning engine pipeline (Aho-Corasick + RE2 + entropy), (4) Cobra/Viper CLI wiring. The scan pipeline validation is the integration test that proves all three foundations work together.

---

## Project Constraints (from CLAUDE.md)

All directives from CLAUDE.md are binding. Key constraints for Phase 1:

| Constraint | Directive |
|------------|-----------|
| Language | Go 1.22+ only. No other language.
| | CGO | `CGO_ENABLED=0` throughout — single binary, cross-compilation | | SQLite driver | `modernc.org/sqlite` — NOT `mattn/go-sqlite3` (CGo) | | SQLite encryption | Application-level AES-256 via `crypto/aes` — NOT SQLCipher (CGo) | | CLI framework | `cobra v1.10.2` + `viper v1.21.0` — no alternatives | | YAML parsing | `gopkg.in/yaml.v3` — not sigs.k8s.io/yaml or goccy/go-yaml | | Concurrency | `ants v2.12.0` worker pool | | Architecture | Plugin-based — providers as YAML files, compile-time embedded via `go:embed` | | Regex engine | Go stdlib `regexp` (RE2-backed) ONLY — never `regexp2` or PCRE | | Verification | Opt-in (`--verify` flag) — passive scanning by default | | Key masking | Default masked output, `--unmask` for full keys | | Worker pool | `github.com/panjf2000/ants/v2` v2.12.0 | | Output formatting | `github.com/charmbracelet/lipgloss` (latest) | | Web stack | `chi v5.2.5` + `templ v0.3.1001` + htmx + Tailwind v4 (Phase 18 only — do not scaffold in Phase 1) | | Telegram | `telego v1.8.0` (Phase 17 only) | | Scheduler | `gocron v2.19.1` (Phase 17 only) | | Build | `go build -ldflags="-s -w"` for stripped binary | | Forbidden patterns | Fiber, Echo, mattn/go-sqlite3, SQLCipher, robfig/cron, regexp2, Full Bubble Tea TUI | --- ## Standard Stack ### Core (Phase 1 only) | Library | Version | Purpose | Why Standard | |---------|---------|---------|--------------| | `github.com/spf13/cobra` | v1.10.2 | CLI command tree | Industry standard; used by TruffleHog, Gitleaks, Kubernetes, Docker CLI | | `github.com/spf13/viper` | v1.21.0 | Config management (YAML/env/flags) | Cobra-native integration; v1.21.0 fixed key-casing bugs | | `modernc.org/sqlite` | v1.35.x (SQLite 3.51.2) | Embedded database | Pure Go; no CGo; cross-compiles cleanly; updated 2026-03-17 | | `gopkg.in/yaml.v3` | v3.0.1 | Parse provider YAML | Handles inline/anchored structs; stable; transitive dep of cobra anyway | | `github.com/petar-dambovaliev/aho-corasick` | latest (2025-04-24) | Keyword 
pre-filter | 20x faster than cloudflare/ahocorasick; 1/8 memory; used by TruffleHog; MIT license | | `github.com/panjf2000/ants/v2` | v2.12.0 | Worker pool | Battle-tested goroutine pool; v2.12.0 adds ReleaseContext for clean shutdown | | `golang.org/x/crypto` | latest x/ | Argon2 key derivation | Official Go extended library; IDKey for Argon2id; same package used for other crypto needs | | `golang.org/x/time` | latest x/ | Rate limiting (future-proofing) | Needed for CORE-05 workers; token bucket; add now to avoid later go.mod churn | | `github.com/charmbracelet/lipgloss` | latest | Terminal table output | Declarative styles; used across Go security tool ecosystem | | `github.com/stretchr/testify` | v1.10.x | Test assertions | Assert/require only; standard in Go ecosystem | | `embed` (stdlib) | — | Compile-time YAML embed | Go 1.16+ native; no external dep | | `crypto/aes`, `crypto/cipher` (stdlib) | — | AES-256-GCM encryption | Standard library; no CGo; GCM provides authenticated encryption | | `math` (stdlib) | — | Shannon entropy calculation | ~10-line implementation; no library needed | | `database/sql` (stdlib) | — | SQL interface over modernc.org/sqlite | Driver registered as `"sqlite"`; raw SQL; no ORM | ### Not Needed in Phase 1 (scaffold stubs only if required by interface) | Library | Deferred To | |---------|-------------| | `chi v5.2.5` | Phase 18 (Web Dashboard) | | `templ v0.3.1001` | Phase 18 (Web Dashboard) | | `telego v1.8.0` | Phase 17 (Telegram Bot) | | `gocron v2.19.1` | Phase 17 (Scheduler) | **Installation (Phase 1 dependencies):** ```bash go mod init github.com/yourusername/keyhunter go get github.com/spf13/cobra@v1.10.2 go get github.com/spf13/viper@v1.21.0 go get modernc.org/sqlite@latest go get gopkg.in/yaml.v3@v3.0.1 go get github.com/petar-dambovaliev/aho-corasick@latest go get github.com/panjf2000/ants/v2@v2.12.0 go get golang.org/x/crypto@latest go get golang.org/x/time@latest go get github.com/charmbracelet/lipgloss@latest go get 
github.com/stretchr/testify@latest ``` **Version verification (run before writing go.mod manually):** ```bash go list -m github.com/petar-dambovaliev/aho-corasick go list -m modernc.org/sqlite go list -m github.com/panjf2000/ants/v2 ``` --- ## Architecture Patterns ### Recommended Project Structure (Phase 1 scope) ``` keyhunter/ main.go # < 30 lines, cobra root Execute() cmd/ root.go # rootCmd, persistent flags, PersistentPreRunE config load scan.go # scan command + flags providers.go # providers list/info/stats commands config.go # config init/set/get commands pkg/ providers/ loader.go # embed.FS + fs.WalkDir + yaml.Unmarshal registry.go # Registry struct, Get/List/Stats methods schema.go # Provider, Pattern, VerifySpec structs + UnmarshalYAML validation engine/ engine.go # Engine struct, Scan() method, pipeline orchestration pipeline.go # channel wiring: chunksChan, detectableChan, resultsChan filter.go # Aho-Corasick pre-filter stage detector.go # Regex + entropy detector worker entropy.go # Shannon entropy function chunk.go # Chunk type (content []byte, source string, offset int64) finding.go # Finding type (provider, key_value, key_masked, confidence, source, path) sources/ source.go # Source interface file.go # FileSource dir.go # DirSource (recursive with glob exclude) storage/ db.go # DB struct, Open(), migrations via embedded schema.sql schema.sql # DDL for findings, scans, settings tables encrypt.go # AES-256-GCM Encrypt(plaintext, key) / Decrypt(ciphertext, key) crypto.go # Argon2id key derivation: DeriveKey(passphrase, salt) findings.go # CRUD for findings table scans.go # CRUD for scans table config/ config.go # Config struct, Load(), defaults output/ table.go # lipgloss colored terminal table json.go # encoding/json output providers/ openai.yaml # Reference provider definitions (Phase 1: schema examples only) anthropic.yaml huggingface.yaml ``` ### Pattern 1: Provider Registry with Compile-Time Embed **What:** Provider YAML definitions embedded via 
`//go:embed` and loaded into an in-memory registry at startup.

**When to use:** Always. Never load provider files from disk at runtime.

```go
// Source: Go embed docs (https://pkg.go.dev/embed)
package providers

import (
	"embed"
	"fmt"
	"io/fs"
	"path/filepath"

	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
	"gopkg.in/yaml.v3"
)

// Embed patterns are resolved relative to this package's directory and may
// not contain "..", so the providers/ directory must live inside the
// directory of the package that declares this variable.
//
//go:embed providers/*.yaml
var providersFS embed.FS

type Registry struct {
	providers []Provider
	ac        ahocorasick.AhoCorasick
}

func NewRegistry() (*Registry, error) {
	var providers []Provider
	err := fs.WalkDir(providersFS, "providers", func(path string, d fs.DirEntry, err error) error {
		// Propagate walk errors; skip directories and non-YAML files.
		if err != nil || d.IsDir() || filepath.Ext(path) != ".yaml" {
			return err
		}
		data, err := providersFS.ReadFile(path)
		if err != nil {
			return err
		}
		var p Provider
		if err := yaml.Unmarshal(data, &p); err != nil {
			return fmt.Errorf("provider %s: %w", path, err)
		}
		providers = append(providers, p)
		return nil
	})
	if err != nil {
		return nil, err
	}

	// Build the Aho-Corasick automaton once from all provider keywords.
	var keywords []string
	for _, p := range providers {
		keywords = append(keywords, p.Keywords...)
	}
	builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
	ac := builder.Build(keywords)
	return &Registry{providers: providers, ac: ac}, nil
}
```

### Pattern 2: Three-Stage Scanning Pipeline (Buffered Channels)

**What:** Source adapters produce chunks onto buffered channels. The Aho-Corasick pre-filter reduces candidates. Detector workers apply regex + entropy.

**When to use:** All scan operations. Never skip the pre-filter.
```go // Source: TruffleHog v3 architecture (https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick) func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) { chunksChan := make(chan Chunk, 1000) detectableChan := make(chan Chunk, 500) resultsChan := make(chan Finding, 100) // Stage 1: Source → chunks go func() { defer close(chunksChan) src.Chunks(ctx, chunksChan) }() // Stage 2: Aho-Corasick keyword pre-filter go func() { defer close(detectableChan) for chunk := range chunksChan { if len(e.registry.AC().FindAll(string(chunk.Data))) > 0 { detectableChan <- chunk } } }() // Stage 3: Detector workers var wg sync.WaitGroup for i := 0; i < cfg.Workers; i++ { wg.Add(1) go func() { defer wg.Done() for chunk := range detectableChan { e.detect(chunk, resultsChan) } }() } go func() { wg.Wait() close(resultsChan) }() return resultsChan, nil } ``` ### Pattern 3: AES-256-GCM Column Encryption **What:** Encrypt key values before storing in SQLite. Nonce prepended to ciphertext. Key derived via Argon2id. **When to use:** Every write of a full API key to storage. 
```go
// Source: Go crypto/cipher docs (https://pkg.go.dev/crypto/cipher)
import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"io"
)

// Encrypt seals plaintext with AES-256-GCM and returns nonce || ciphertext.
func Encrypt(plaintext []byte, key []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst makes Seal append to it, so the nonce is prepended.
	ciphertext := gcm.Seal(nonce, nonce, plaintext, nil)
	return ciphertext, nil
}

// Decrypt splits the prepended nonce off and authenticates + decrypts the rest.
func Decrypt(ciphertext []byte, key []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonceSize := gcm.NonceSize()
	if len(ciphertext) < nonceSize {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:]
	return gcm.Open(nil, nonce, ciphertext, nil)
}
```

### Pattern 4: Argon2id Key Derivation

**What:** Derive a 32-byte AES key from the user passphrase plus a random salt. The parameters (time=1, memory=64 MiB, threads=4) follow the suggestion in the `golang.org/x/crypto/argon2` package documentation. Note that RFC 9106 Section 4 pairs 64 MiB with t=3 in its second recommended option, so `argon2Time` can be raised to 3 if strict RFC alignment is wanted.

**When to use:** On first use (generate and store salt); on subsequent use (re-derive key from passphrase + stored salt).

```go
// Source: https://pkg.go.dev/golang.org/x/crypto/argon2
import (
	"crypto/rand"

	"golang.org/x/crypto/argon2"
)

const (
	argon2Time    = 1
	argon2Memory  = 64 * 1024 // in KiB, i.e. 64 MiB
	argon2Threads = 4
	argon2KeyLen  = 32 // AES-256 key length
)

func DeriveKey(passphrase []byte, salt []byte) []byte {
	return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen)
}

// NewSalt generates a random salt (do once, persist in config).
func NewSalt() ([]byte, error) {
	salt := make([]byte, 32)
	_, err := rand.Read(salt)
	return salt, err
}
```

### Pattern 5: Provider YAML Schema with Validation

**What:** Provider struct with `UnmarshalYAML` that validates required fields including `format_version` and `last_verified`.

**When to use:** Every provider YAML load.
Return error on invalid schema — fail fast at startup. ```go // Source: gopkg.in/yaml.v3 docs (https://pkg.go.dev/gopkg.in/yaml.v3) type Provider struct { Name string `yaml:"name"` FormatVersion int `yaml:"format_version"` LastVerified string `yaml:"last_verified"` // ISO date: "2026-04-04" Keywords []string `yaml:"keywords"` Patterns []Pattern `yaml:"patterns"` Verify *VerifySpec `yaml:"verify,omitempty"` } func (p *Provider) UnmarshalYAML(value *yaml.Node) error { type rawProvider Provider if err := value.Decode((*rawProvider)(p)); err != nil { return err } if p.FormatVersion == 0 { return fmt.Errorf("provider %q: format_version is required", p.Name) } if p.LastVerified == "" { return fmt.Errorf("provider %q: last_verified is required", p.Name) } if len(p.Keywords) == 0 { return fmt.Errorf("provider %q: at least one keyword is required", p.Name) } if len(p.Patterns) == 0 { return fmt.Errorf("provider %q: at least one pattern is required", p.Name) } return nil } ``` ### Pattern 6: Shannon Entropy (stdlib only) **What:** Calculate bits-per-character entropy of a string. No library needed. **When to use:** CORE-04: as secondary signal after keyword pre-filter and regex confirm a candidate. Entropy threshold 3.5 bits/char for most providers. ```go // Source: Shannon entropy formula (verified against TruffleHog entropy implementation) import "math" func shannonEntropy(s string) float64 { if len(s) == 0 { return 0 } freq := make(map[rune]float64) for _, c := range s { freq[c]++ } n := float64(len([]rune(s))) var entropy float64 for _, count := range freq { p := count / n entropy -= p * math.Log2(p) } return entropy } ``` ### Pattern 7: modernc.org/sqlite Initialization **What:** Open SQLite DB with WAL mode and foreign keys. Schema embedded via `//go:embed`. **When to use:** Storage layer initialization. 
```go
// Source: https://practicalgobook.net/posts/go-sqlite-no-cgo/
import (
	"database/sql"
	_ "embed" // blank import: required to use //go:embed on a string variable

	_ "modernc.org/sqlite" // registers the "sqlite" driver
)

//go:embed schema.sql
var schemaSQL string

func Open(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite", path)
	if err != nil {
		return nil, err
	}
	// Enable WAL mode for better concurrent access (persists in the database file).
	if _, err := db.Exec("PRAGMA journal_mode=WAL"); err != nil {
		return nil, err
	}
	// foreign_keys is a per-connection setting: with database/sql pooling this
	// Exec only affects the connection it happens to run on.
	if _, err := db.Exec("PRAGMA foreign_keys=ON"); err != nil {
		return nil, err
	}
	// Apply schema
	if _, err := db.Exec(schemaSQL); err != nil {
		return nil, err
	}
	return db, nil
}
```

### Pattern 8: Cobra + Viper Wiring

**What:** Cobra command tree with Viper config loaded in `PersistentPreRunE`. Flags are bound to Viper so that config file values and flags resolve through one lookup.

**When to use:** root.go and all command files.

```go
// Source: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/
var rootCmd = &cobra.Command{
	Use:   "keyhunter",
	Short: "API key scanner for 108+ LLM providers",
	PersistentPreRunE: func(cmd *cobra.Command, args []string) error {
		viper.SetConfigName(".keyhunter")
		viper.SetConfigType("yaml")
		viper.AddConfigPath("$HOME")
		viper.SetEnvPrefix("KEYHUNTER")
		viper.AutomaticEnv()
		if err := viper.ReadInConfig(); err != nil {
			if _, ok := err.(viper.ConfigFileNotFoundError); !ok {
				return err
			}
			// Config file not found: acceptable on first run.
		}
		return nil
	},
}

// In init() of each command file:
func init() {
	scanCmd.Flags().IntP("workers", "w", 0, "worker count (default: CPU count)")
	viper.BindPFlag("workers", scanCmd.Flags().Lookup("workers"))
}
```

### Anti-Patterns to Avoid

- **Global provider registry:** Pass `*Registry` via constructor. Global state makes testing impossible without full initialization.
- **Unbuffered result channels:** Use `make(chan Finding, 1000)`. Unbuffered channels cause detector workers to block on slow output consumers, collapsing parallelism.
- **Runtime YAML loading:** Never load from filesystem at scan time. `//go:embed` only. - **regexp2 or PCRE:** Go's `regexp` package (RE2) already provides linear-time guarantees. regexp2 loses this guarantee. - **Entropy-only detection:** Never flag a candidate based solely on entropy score. Entropy is a secondary filter applied only after keyword pre-filter and regex confirm a pattern match. - **Plaintext key column:** Never store full API key as TEXT. Always encrypt with AES-256-GCM before INSERT. - **os.Getenv for passphrase:** Derive the AES key via Argon2id from a passphrase. Never store raw passphrase or raw key in config file. --- ## Don't Hand-Roll | Problem | Don't Build | Use Instead | Why | |---------|-------------|-------------|-----| | Multi-keyword search automaton | Custom Aho-Corasick or loop-based string search | `petar-dambovaliev/aho-corasick` | Aho-Corasick is O(n+m+z); naive loop is O(n*k); library is 20x faster than next-best Go option | | Goroutine pool with dynamic resize | Manual goroutine spawn with WaitGroup | `ants v2.12.0` | Goroutine explosion on large repos; ants handles backpressure, panic recovery, context cancellation | | AES key derivation from passphrase | SHA256(passphrase) or similar | `argon2.IDKey` from `golang.org/x/crypto` | MD5/SHA hash-based KDF is trivially brute-forceable; Argon2id is GPU-resistant by design | | SQLite column encryption | XOR, custom cipher, or base64 "encoding" | `crypto/aes` GCM via stdlib | GCM provides both confidentiality and authentication; custom schemes always have vulnerabilities | | Config file management | Custom INI or JSON parser | `viper v1.21.0` | Viper handles YAML + env vars + CLI flags with correct precedence; hand-rolled configs miss env var override | | CLI command parsing | `flag` stdlib or custom parser | `cobra v1.10.2` | Cobra provides nested subcommands, persistent flags, shell completion, help generation — stdlib flag lacks all of these | **Key insight:** The Aho-Corasick 
pre-filter and Argon2id key derivation in particular are problems where "obvious" implementations (nested loops, SHA256) have well-documented security or performance failure modes that justify the dependency cost. --- ## Common Pitfalls ### Pitfall 1: Wrong Aho-Corasick Library Choice **What goes wrong:** Using `cloudflare/ahocorasick` because it appears more prominent in search results. It uses 8x more memory and runs 20x slower than `petar-dambovaliev/aho-corasick`. For a tool scanning large repos with 108 keyword patterns, this difference is measurable. **Why it happens:** cloudflare/ahocorasick appears first in many search results. **How to avoid:** Use `github.com/petar-dambovaliev/aho-corasick` (verified as the library TruffleHog uses per their own blog post crediting Petar Dambovaliev). Build the automaton once at registry init; the automaton is thread-safe for concurrent reads. **Warning signs:** Scans on 100MB+ repos running significantly slower than expected; high memory usage during keyword pre-filter stage. --- ### Pitfall 2: Encryption Key Stored in Config File as Raw Bytes **What goes wrong:** Storing the 32-byte AES key directly in `~/.keyhunter.yaml` or as an env var. Anyone with config file read access can decrypt the entire database. **Why it happens:** "Just store the key" is the simplest implementation. Argon2id + salt seems like unnecessary complexity. **How to avoid:** Store only the randomly generated salt in config. Re-derive the key from `passphrase + salt` on each run using Argon2id. The passphrase is entered interactively or set via `KEYHUNTER_PASSPHRASE` env var (never stored). For Phase 1, interactive passphrase entry is acceptable. OS keychain integration (zalando/go-keyring) can be added later without schema migration. **Warning signs:** `keyhunter_key:` field appearing as hex bytes in the YAML config file. 
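The salt-only approach from Pitfall 2 might be wired roughly as below. This is a minimal sketch under stated assumptions, not the project's implementation: `LoadOrCreateSalt`, `PassphraseFromEnv`, `UnlockKey`, and the `KeyDeriver` type are hypothetical names, the salt is assumed to be stored hex-encoded in the config, and the derivation function is injected so the block stays stdlib-only (in practice one would pass `DeriveKey` from Pattern 4).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"os"
)

// passphraseEnv is the environment variable named in Pitfall 2.
const passphraseEnv = "KEYHUNTER_PASSPHRASE"

// KeyDeriver has the same shape as DeriveKey in Pattern 4 (hypothetical type).
type KeyDeriver func(passphrase, salt []byte) []byte

// LoadOrCreateSalt decodes the hex salt found in the config, or generates a
// fresh 32-byte salt when none is stored yet. Only the returned hex string
// is meant to be persisted.
func LoadOrCreateSalt(storedHex string) (salt []byte, saltHex string, err error) {
	if storedHex != "" {
		salt, err = hex.DecodeString(storedHex)
		if err != nil {
			return nil, "", err
		}
		return salt, storedHex, nil
	}
	salt = make([]byte, 32)
	if _, err = rand.Read(salt); err != nil {
		return nil, "", err
	}
	return salt, hex.EncodeToString(salt), nil
}

// PassphraseFromEnv reads the passphrase without ever writing it anywhere.
// Callers would fall back to an interactive prompt on error.
func PassphraseFromEnv() ([]byte, error) {
	p := os.Getenv(passphraseEnv)
	if p == "" {
		return nil, errors.New(passphraseEnv + " is not set")
	}
	return []byte(p), nil
}

// UnlockKey re-derives the AES-256 key on every run from passphrase + salt.
// Neither the passphrase nor the derived key is persisted.
func UnlockKey(passphrase []byte, saltHex string, derive KeyDeriver) ([]byte, error) {
	if len(passphrase) == 0 {
		return nil, errors.New("empty passphrase")
	}
	salt, err := hex.DecodeString(saltHex)
	if err != nil {
		return nil, err
	}
	key := derive(passphrase, salt)
	if len(key) != 32 {
		return nil, errors.New("derived key must be 32 bytes for AES-256")
	}
	return key, nil
}
```

With this split, the YAML config would only ever contain the hex salt, so the warning sign above (a raw key field in the config) cannot occur by construction.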
--- ### Pitfall 3: Provider YAML Schema Allowing Missing format_version or last_verified **What goes wrong:** Provider YAML loads without validation. Providers with missing `format_version` or stale `last_verified` accumulate over time. Pattern health tracking (PROV-10) becomes meaningless. **Why it happens:** `yaml.Unmarshal` to a struct silently zero-values missing fields. No validation = no error. **How to avoid:** Implement `UnmarshalYAML` with explicit validation on the Provider struct. Fail at startup (not at scan time) if any provider is invalid. This catches schema errors at development time, not production time. **Warning signs:** `format_version: 0` appearing in any loaded provider struct. --- ### Pitfall 4: SQLite Without WAL Mode in a CLI Tool **What goes wrong:** Default SQLite journal mode causes `SQLITE_BUSY` errors when the dashboard (Phase 18) or multiple concurrent processes read while a scan writes. The default journal also performs slower write throughput. **Why it happens:** `sql.Open("sqlite", path)` uses the default rollback journal. **How to avoid:** Always execute `PRAGMA journal_mode=WAL` immediately after opening the database. This is a one-time setup that persists in the database file. Also set `PRAGMA foreign_keys=ON` to enforce referential integrity. **Warning signs:** `database is locked` errors during concurrent read/write operations. --- ### Pitfall 5: Entropy Check Before Regex Confirmation **What goes wrong:** Running Shannon entropy on every chunk that passes the Aho-Corasick keyword filter produces up to 80% false positive rate (HashiCorp 2025 research). High-entropy strings like UUIDs, hashes, and base64-encoded content all score above 3.5 bits/char. **Why it happens:** Entropy feels like an independent signal and is applied eagerly as a quick filter. **How to avoid:** Entropy is strictly a secondary signal. Apply it only to strings that have already matched both a keyword (Aho-Corasick) AND a provider regex pattern. 
The order is: keyword pre-filter → regex match → entropy calibration. Never entropy-only.

**Warning signs:** More than 30% of findings lacking a matching provider regex pattern.

---

### Pitfall 6: mmap for Small Files on Linux

**What goes wrong:** mmap is applied uniformly to every file. It only pays off for large files (>10MB), where avoiding a full read into memory matters. For small files, mmap has higher setup overhead than a simple `os.ReadFile`, and every mapping needs explicit cleanup to avoid address space exhaustion on directory scans with thousands of small files.

**Why it happens:** CORE-07 specifies mmap-based large file reading, and implementing it uniformly for all files seems simpler.

**How to avoid:** Use `os.ReadFile` for files under 10MB. Use mmap only above that threshold, with explicit `unix.Munmap` cleanup in a deferred call. Check for binary files before mapping: pass the first 512 bytes of the file to `http.DetectContentType` and skip non-text files.

**Warning signs:** Address space exhaustion during directory scans; "too many open files" errors.

---

### Pitfall 7: Argon2 Parameters Too Low (Fast KDF = Weak Security)

**What goes wrong:** Using time=1, memory=4096 (a commonly copied example) instead of recommended parameters. Fast key derivation makes brute-force attacks on the database passphrase practical.

**Why it happens:** Low parameters make tests run faster and startup feel snappier.

**How to avoid:** Use `time=1, memory=64*1024 (64 MiB), threads=4, keyLen=32` for Argon2id, the values suggested in the `golang.org/x/crypto/argon2` package documentation. RFC 9106 Section 4 is stricter at this memory size (its second recommended option pairs 64 MiB with t=3), so raise `time` to 3 if startup latency allows. Test startup latency with the chosen parameters; on modern hardware, key derivation takes ~100-300ms, which is acceptable for a CLI tool.

**Warning signs:** `argon2.IDKey(pass, salt, 1, 4096, 1, 32)` in the code. The memory argument should be 64*1024 (65536 KiB), not 4096.
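The read-strategy rule from Pitfall 6 (skip binaries, `os.ReadFile` below 10MB, mmap above) can be sketched as a small pure decision function. The names `ReadStrategy`, `LooksBinary`, and `ChooseStrategy` are illustrative assumptions, and the sketch covers only the decision, not the actual `unix.Mmap`/`unix.Munmap` calls.

```go
package main

import (
	"net/http"
	"strings"
)

// mmapThreshold is the 10MB cut-off described in Pitfall 6.
const mmapThreshold = 10 << 20

// ReadStrategy says how a file source should load one file (hypothetical type).
type ReadStrategy int

const (
	SkipBinary ReadStrategy = iota // non-text content: do not scan
	ReadWhole                      // small file: os.ReadFile
	MapFile                        // large file: mmap with deferred Munmap
)

// LooksBinary sniffs at most the first 512 bytes, which is all that
// http.DetectContentType inspects, and treats any non-text type as binary.
func LooksBinary(head []byte) bool {
	if len(head) == 0 {
		return false // empty files are harmless to read
	}
	if len(head) > 512 {
		head = head[:512]
	}
	return !strings.HasPrefix(http.DetectContentType(head), "text/")
}

// ChooseStrategy applies the binary check first, then the size threshold.
func ChooseStrategy(size int64, head []byte) ReadStrategy {
	if LooksBinary(head) {
		return SkipBinary
	}
	if size < mmapThreshold {
		return ReadWhole
	}
	return MapFile
}
```

A file source would read the first 512 bytes, call `ChooseStrategy(info.Size(), head)`, and branch on the result; keeping the decision separate from the I/O makes it easy to unit-test without large fixtures.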
--- ## State of the Art | Old Approach | Current Approach | When Changed | Impact | |--------------|------------------|--------------|--------| | Linear keyword scan before regex | Aho-Corasick pre-filter | TruffleHog v3.28.6 (2024) | 2x average scan speedup; O(n+m+z) vs O(n*k) | | mattn/go-sqlite3 (CGo) | modernc.org/sqlite (pure Go) | Go 1.16+ era | CGO_ENABLED=0 enabled; cross-compilation works | | SQLCipher for DB encryption | Application-level AES-256-GCM | 2023-2025 | No CGo dependency; AES GCM provides authentication | | PBKDF2 for key derivation | Argon2id | RFC 9106 (2021) | GPU-resistant; side-channel resistant; OWASP recommended | | regexp2/PCRE in Go scanners | Go stdlib regexp (RE2) | Ongoing | ReDoS immune; linear time guaranteed | | Storing full keys masked in DB | Encrypt key_encrypted column, store only mask | GHSA-4h8c-qrcq-cv5c (2024) | Database file no longer a credential dump | **Deprecated/outdated:** - `mattn/go-sqlite3`: Requires CGo; cross-compilation breaks; modernc.org/sqlite is the replacement. - `robfig/cron`: Unmaintained since 2020; use gocron v2 (Phase 17). - `cloudflare/ahocorasick`: Still maintained but 20x slower than petar-dambovaliev/aho-corasick; do not use. - Entropy-only secret detection: HashiCorp 2025 research confirms 80% FP rate; layered pipeline is the current standard. --- ## Open Questions 1. **Passphrase UX for first run** - What we know: Argon2id requires a passphrase to derive the AES key. For Phase 1, this must be handled on first `keyhunter scan` or `keyhunter config init`. - What's unclear: Should Phase 1 use `bufio.NewReader(os.Stdin).ReadString('\n')` for passphrase entry, or skip encryption and use a generated random key stored in config (less secure but zero-friction)? - Recommendation: Use a generated random 32-byte key stored in `~/.keyhunter.yaml` as base64 for Phase 1 (zero-friction). Document that this is a development shortcut; OS keychain integration (zalando/go-keyring) replaces it in a later phase. 
The encrypt/decrypt functions and schema are in place; only the key source changes.

2. **mmap on Linux: syscall vs golang.org/x/sys/unix**
   - What we know: Both `syscall.Mmap` and `golang.org/x/sys/unix.Mmap` provide mmap on Linux. The x/ package has a cleaner API.
   - What's unclear: Whether `golang.org/x/sys` is already in go.sum or needs explicit addition. It is a transitive dependency of many packages and is likely pulled in by viper or cobra.
   - Recommendation: Use `golang.org/x/sys/unix` for mmap in the file source adapter. It will almost certainly already be in go.sum. Only implement mmap for Phase 1 if CORE-07 is in scope for the minimal viable scan pipeline; otherwise defer to Phase 4.

3. **Provider YAML for Phase 1: how many definitions?**
   - What we know: Phase 1 requires schema + PROV-10 (format_version, last_verified fields). Full 108-provider definitions are Phase 2-3.
   - What's unclear: The success criterion "keyhunter scan ./somefile runs the three-stage pipeline and returns findings with provider names" implies at least one real provider definition must exist.
   - Recommendation: Ship 3 reference provider definitions in Phase 1 (OpenAI, Anthropic, HuggingFace) with valid format_version and last_verified. All 108 providers are Phase 2-3 scope. These 3 definitions validate the schema and make the success criteria testable.

---

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Go | Core language | Yes | 1.26.1 (exceeds 1.22 requirement) | - |
| git | Version control | Yes | 2.53.0 | - |
| golangci-lint | Static analysis / CI | No | - | Install via `go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest` or skip in Phase 1 |
| npm | (not needed Phase 1) | Yes (4.2.2 via tailwind check) | - | Not needed until Phase 18 |

**Missing dependencies with fallback:**

- `golangci-lint`: Not found.
Install before Phase 1 linting task, or skip lint gate for initial scaffold and add in CI pipeline. Fallback: `go vet ./...` catches most critical issues. **Missing dependencies with no fallback:** - None. Go 1.26.1 is available and exceeds the 1.22+ requirement. --- ## Validation Architecture ### Test Framework | Property | Value | |----------|-------| | Framework | `go test` (stdlib) + `testify v1.10.x` for assertions | | Config file | None needed — standard `go test ./...` discovers `*_test.go` | | Quick run command | `go test ./pkg/... -race -timeout 30s` | | Full suite command | `go test ./... -race -cover -timeout 120s` | ### Phase Requirements to Test Map | Req ID | Behavior | Test Type | Automated Command | File Exists? | |--------|----------|-----------|-------------------|-------------| | CORE-01 | Scan pipeline detects known API key patterns in test input | integration | `go test ./pkg/engine/... -run TestScanPipeline -v` | No — Wave 0 | | CORE-02 | Provider YAML loads from embed.FS without error | unit | `go test ./pkg/providers/... -run TestNewRegistry -v` | No — Wave 0 | | CORE-03 | Registry holds correct provider count, Get() returns provider by name | unit | `go test ./pkg/providers/... -run TestRegistry -v` | No — Wave 0 | | CORE-04 | Shannon entropy returns correct value for known inputs | unit | `go test ./pkg/engine/... -run TestShannonEntropy -v` | No — Wave 0 | | CORE-05 | Worker pool uses correct concurrency; all workers complete | unit | `go test ./pkg/engine/... -run TestWorkerPool -v` | No — Wave 0 | | CORE-06 | Aho-Corasick filter passes keyword-matched chunks; rejects non-matching | unit | `go test ./pkg/engine/... -run TestAhoCorasickFilter -v` | No — Wave 0 | | CORE-07 | Large file source reads without OOM; binary files skipped | integration | `go test ./pkg/engine/sources/... -run TestFileSourceLarge -v` | No — Wave 0 | | STOR-01 | SQLite DB opens; schema applies; WAL mode set | unit | `go test ./pkg/storage/... 
-run TestOpen -v` | No — Wave 0 | | STOR-02 | AES-256-GCM encrypt/decrypt round-trip is lossless | unit | `go test ./pkg/storage/... -run TestEncryptDecrypt -v` | No — Wave 0 | | STOR-03 | Argon2id DeriveKey produces 32-byte deterministic output | unit | `go test ./pkg/storage/... -run TestDeriveKey -v` | No — Wave 0 | | CLI-01 | `keyhunter --help` exits 0; all subcommands listed | smoke | `go run ./... --help` | No — Wave 0 | | CLI-02 | `keyhunter config init` creates ~/.keyhunter.yaml | integration | `go test ./cmd/... -run TestConfigInit -v` | No — Wave 0 | | CLI-03 | `keyhunter config set key val` persists to YAML | integration | `go test ./cmd/... -run TestConfigSet -v` | No — Wave 0 | | CLI-04 | `keyhunter providers list` returns at least 3 providers | integration | `go test ./cmd/... -run TestProvidersList -v` | No — Wave 0 | | CLI-05 | `keyhunter scan --workers 4 testfile` uses 4 workers | integration | `go test ./cmd/... -run TestScanFlags -v` | No — Wave 0 | | PROV-10 | Provider YAML with missing format_version returns error at load | unit | `go test ./pkg/providers/... -run TestProviderValidation -v` | No — Wave 0 | ### Sampling Rate - **Per task commit:** `go test ./pkg/... -race -timeout 30s` - **Per wave merge:** `go test ./... -race -cover -timeout 120s` - **Phase gate:** Full suite green before `/gsd:verify-work` ### Wave 0 Gaps All test files are new — this is a greenfield project. 
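
As a rough illustration of the CORE-04 row in the test map, the sketch below shows a Shannon entropy function checked against inputs whose entropy can be computed by hand. The name `ShannonEntropy`, the `entropyThreshold` constant, and the byte-level counting are assumptions made for this example, not the project's actual API; the 3.5 bits/char value is the threshold stated in the phase requirements. In a real `entropy_test.go` the same cases would sit in a table-driven `go test` function instead of `main`.

```go
package main

import (
	"fmt"
	"math"
)

// entropyThreshold is the 3.5 bits/char cut-off cited for CORE-04.
// (Constant name is hypothetical.)
const entropyThreshold = 3.5

// ShannonEntropy returns the Shannon entropy of s in bits per byte.
// For ASCII key material, bytes and characters coincide.
// (Function name is hypothetical.)
func ShannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	// Count how often each byte value occurs.
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// Inputs whose entropy can be worked out by hand.
	cases := []struct {
		in   string
		want float64
	}{
		{"", 0},     // no data
		{"aaaa", 0}, // one symbol: no uncertainty
		{"abab", 1}, // two equally likely symbols
		{"abcd", 2}, // four equally likely symbols
	}
	for _, c := range cases {
		got := ShannonEntropy(c.in)
		if math.Abs(got-c.want) > 1e-9 {
			fmt.Printf("FAIL %q: got %v, want %v\n", c.in, got, c.want)
			continue
		}
		fmt.Printf("ok   %q: %.2f bits/char (meets threshold: %t)\n",
			c.in, got, got >= entropyThreshold)
	}
}
```

The hand-picked inputs use probabilities that are powers of two, so their expected values are exact; for arbitrary inputs a tolerance comparison such as the `1e-9` check above is the safer assertion.
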

- [ ] `pkg/providers/registry_test.go`: covers CORE-02, CORE-03, PROV-10
- [ ] `pkg/engine/entropy_test.go`: covers CORE-04
- [ ] `pkg/engine/filter_test.go`: covers CORE-06
- [ ] `pkg/engine/engine_test.go`: covers CORE-01, CORE-05
- [ ] `pkg/engine/sources/file_test.go`: covers CORE-07
- [ ] `pkg/storage/encrypt_test.go`: covers STOR-02
- [ ] `pkg/storage/crypto_test.go`: covers STOR-03
- [ ] `pkg/storage/db_test.go`: covers STOR-01
- [ ] `cmd/config_test.go`: covers CLI-02, CLI-03
- [ ] `cmd/providers_test.go`: covers CLI-04
- [ ] `cmd/scan_test.go`: covers CLI-05
- [ ] `testdata/fixtures/`: synthetic test files with known API key patterns for integration tests
- [ ] Framework install: `go get github.com/stretchr/testify@latest` (if not added during go.mod init)

---

## Sources

### Primary (HIGH confidence)

- TruffleHog Aho-Corasick blog: https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick - confirmed petar-dambovaliev library and 2x speedup claim
- petar-dambovaliev/aho-corasick pkg.go.dev: https://pkg.go.dev/github.com/petar-dambovaliev/aho-corasick - API verified, last updated 2025-04-24
- Go crypto/cipher (AES-GCM): https://pkg.go.dev/crypto/cipher - Encrypt/Decrypt pattern verified
- Go argon2 package: https://pkg.go.dev/golang.org/x/crypto/argon2 - IDKey parameters from RFC 9106
- modernc.org/sqlite pkg.go.dev: https://pkg.go.dev/modernc.org/sqlite - pure Go confirmed, SQLite 3.51.2
- Go embed package: https://pkg.go.dev/embed - WalkDir pattern for loading embedded files
- gopkg.in/yaml.v3: https://pkg.go.dev/gopkg.in/yaml.v3 - UnmarshalYAML custom validation
- ants v2 README: https://github.com/panjf2000/ants - Pool usage pattern
- cobra docs: https://github.com/spf13/cobra - PersistentPreRunE config loading pattern
- viper docs: https://github.com/spf13/viper - BindPFlag, WriteConfigAs patterns

### Secondary (MEDIUM confidence)

- TruffleHog go.sum (BobuSumisu reference): https://github.com/trufflesecurity/trufflehog/blob/main/go.sum - historical library; petar-dambovaliev is current per blog post
- Practical Go SQLite (no CGo): https://practicalgobook.net/posts/go-sqlite-no-cgo/ - WAL mode pattern verified against official SQLite docs
- HashiCorp entropy FP research: https://www.hashicorp.com/en/blog/false-positives-a-big-problem-for-secret-scanners - 80% FP rate from entropy-only detection
- Cobra/Viper 2025 article: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/ - PersistentPreRunE pattern
- zalando/go-keyring: https://github.com/zalando/go-keyring - Linux uses D-Bus Secret Service (libsecret); noted as a future improvement for Phase 1+

### Tertiary (LOW confidence)

- argon2aes reference implementation: https://github.com/presbrey/argon2aes - implementation pattern reference only; use stdlib directly
- Passphrase UX patterns: Community convention; no authoritative Go CLI standard for passphrase input UX

---

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH. All versions verified against official GitHub releases and pkg.go.dev as of 2026-04-04
- Architecture: HIGH. TruffleHog v3 pipeline is the proven model; channel patterns are established Go idiom
- Aho-Corasick library choice: HIGH. TruffleHog blog post explicitly credits petar-dambovaliev; pkg.go.dev confirms API and last update date
- AES-256+Argon2id approach: HIGH. Stdlib only; RFC 9106 parameters; well-documented pattern
- Pitfalls: HIGH. Sourced from GHSA advisory, HashiCorp research, official Go docs
- Test architecture: HIGH. Standard `go test` patterns; no uncertainty

**Research date:** 2026-04-04

**Valid until:** 2026-07-04 (stable libraries; 90 days is safe for Go stdlib + well-maintained ecosystem packages)