keyhunter/.planning/phases/01-foundation/01-RESEARCH.md
salvacybersec fa3916a417 docs(phase-1): research foundation phase
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:32:10 +03:00

Phase 1: Foundation - Research

Researched: 2026-04-04
Domain: Go CLI scaffolding, provider registry schema, three-stage scan pipeline, encrypted SQLite storage
Confidence: HIGH


<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| CORE-01 | Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline | Aho-Corasick pre-filter (petar-dambovaliev/aho-corasick) + Go RE2 regexp; buffered channel pipeline pattern documented |
| CORE-02 | Provider definitions loaded from YAML files embedded at compile time via Go embed | //go:embed providers/*.yaml + fs.WalkDir to iterate; gopkg.in/yaml.v3 for parse |
| CORE-03 | Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata | Registry struct holding []Provider loaded at startup; injected via constructor, not global |
| CORE-04 | Entropy analysis as secondary signal for low-confidence providers | Shannon entropy implemented as ~10-line stdlib function using math package; threshold 3.5 bits/char |
| CORE-05 | Worker pool parallelism with configurable worker count (default: CPU count) | ants.NewPool(runtime.NumCPU() * 8) for detectors; ants.NewPool(runtime.NumCPU()) for verifiers |
| CORE-06 | Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files | petar-dambovaliev/aho-corasick: 20x faster than cloudflare's, used by TruffleHog; build automaton at registry init |
| CORE-07 | mmap-based large file reading for memory efficiency | golang.org/x/sys/unix.Mmap or syscall.Mmap for large file sources; skip binary files via magic bytes |
| STOR-01 | SQLite database for persisting scan results, keys, recon history | modernc.org/sqlite (pure Go, no CGo); database/sql interface; WAL mode; schema embedded via //go:embed |
| STOR-02 | Application-level AES-256 encryption for stored keys and sensitive config | crypto/aes + crypto/cipher GCM mode; nonce prepended to ciphertext stored in BLOB column |
| STOR-03 | Encryption key derived from user passphrase via Argon2 | golang.org/x/crypto/argon2 IDKey with time=1, memory=64*1024 (KiB), threads=4, keyLen=32 (the values suggested in the package documentation) |
| CLI-01 | Cobra-based CLI with commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule | cobra v1.10.2 command tree; cmd/ package; main.go < 30 lines |
| CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | viper.WriteConfigAs(filepath) with os.MkdirAll; PersistentPreRunE for config load |
| CLI-03 | keyhunter config set key value persists values | viper.Set(key, value) + viper.WriteConfig() |
| CLI-04 | keyhunter providers list/info/stats for provider management | Registry.List(), Registry.Get(name) from loaded YAML; lipgloss table for terminal output |
| CLI-05 | Scan flags: --providers, --category, --confidence, --exclude, --verify, --workers, --output, --unmask, --notify | Persistent flags on scan command; viper.BindPFlag for config file override |
| PROV-10 | Provider YAML schema includes format_version and last_verified fields validated at load time | Custom UnmarshalYAML method on Provider struct; return error if format_version == 0 or last_verified empty |
</phase_requirements>

Summary

Phase 1 builds the three foundations everything else depends on: the provider registry (YAML schema + embed), the storage layer (SQLite + AES-256 encryption), and the CLI skeleton (Cobra + Viper). These are the zero-dependency components in the architecture: nothing downstream can be built until all three exist. The scanning engine (three-stage pipeline) is also scoped here because all Phase 1 success criteria require a working scan to validate the foundations.

The stack for this phase is fully settled from prior research. The one open question from SUMMARY.md (which Aho-Corasick library to use) is now resolved: TruffleHog originally used petar-dambovaliev/aho-corasick (confirmed via the TruffleHog blog post crediting Petar Dambovaliev). That library is 20x faster than cloudflare/ahocorasick and uses 1/8th the memory. A second open question (Argon2 vs PBKDF2 for key derivation) is also resolved: use Argon2id (golang.org/x/crypto/argon2.IDKey) per RFC 9106 recommendations. It is the modern standard and lives in the x/crypto module that is already needed for other operations.

The AES-256 encryption approach is application-level (not SQLCipher), using crypto/aes GCM mode with the nonce prepended to the ciphertext stored as a BLOB. This preserves CGO_ENABLED=0 throughout. Key derivation uses Argon2id to produce a 32-byte key from a user passphrase + random salt stored alongside the database. For Phase 1, the salt can be stored in the config YAML; OS keychain integration (zalando/go-keyring) can be added in a later phase without schema migration.

Primary recommendation: Build in order: (1) Provider YAML schema + embed loader, (2) SQLite schema + AES-256 crypto layer, (3) Scanning engine pipeline (Aho-Corasick + RE2 + entropy), (4) Cobra/Viper CLI wiring. The scan pipeline validation is the integration test that proves all three foundations work together.


Project Constraints (from CLAUDE.md)

All directives from CLAUDE.md are binding. Key constraints for Phase 1:

| Constraint | Directive |
|------------|-----------|
| Language | Go 1.22+ only. No other language. |
| CGO | CGO_ENABLED=0 throughout: single binary, cross-compilation |
| SQLite driver | modernc.org/sqlite, NOT mattn/go-sqlite3 (CGo) |
| SQLite encryption | Application-level AES-256 via crypto/aes, NOT SQLCipher (CGo) |
| CLI framework | cobra v1.10.2 + viper v1.21.0, no alternatives |
| YAML parsing | gopkg.in/yaml.v3, not sigs.k8s.io/yaml or goccy/go-yaml |
| Concurrency | ants v2.12.0 worker pool |
| Architecture | Plugin-based: providers as YAML files, compile-time embedded via go:embed |
| Regex engine | Go stdlib regexp (RE2-backed) ONLY, never regexp2 or PCRE |
| Verification | Opt-in (--verify flag); passive scanning by default |
| Key masking | Default masked output, --unmask for full keys |
| Worker pool | github.com/panjf2000/ants/v2 v2.12.0 |
| Output formatting | github.com/charmbracelet/lipgloss (latest) |
| Web stack | chi v5.2.5 + templ v0.3.1001 + htmx + Tailwind v4 (Phase 18 only; do not scaffold in Phase 1) |
| Telegram | telego v1.8.0 (Phase 17 only) |
| Scheduler | gocron v2.19.1 (Phase 17 only) |
| Build | go build -ldflags="-s -w" for stripped binary |
| Forbidden patterns | Fiber, Echo, mattn/go-sqlite3, SQLCipher, robfig/cron, regexp2, Full Bubble Tea TUI |

Standard Stack

Core (Phase 1 only)

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| github.com/spf13/cobra | v1.10.2 | CLI command tree | Industry standard; used by TruffleHog, Gitleaks, Kubernetes, Docker CLI |
| github.com/spf13/viper | v1.21.0 | Config management (YAML/env/flags) | Cobra-native integration; v1.21.0 fixed key-casing bugs |
| modernc.org/sqlite | v1.35.x (SQLite 3.51.2) | Embedded database | Pure Go; no CGo; cross-compiles cleanly; updated 2026-03-17 |
| gopkg.in/yaml.v3 | v3.0.1 | Parse provider YAML | Handles inline/anchored structs; stable; transitive dep of cobra anyway |
| github.com/petar-dambovaliev/aho-corasick | latest (2025-04-24) | Keyword pre-filter | 20x faster than cloudflare/ahocorasick; 1/8 memory; used by TruffleHog; MIT license |
| github.com/panjf2000/ants/v2 | v2.12.0 | Worker pool | Battle-tested goroutine pool; v2.12.0 adds ReleaseContext for clean shutdown |
| golang.org/x/crypto | latest x/ | Argon2 key derivation | Official Go extended library; IDKey for Argon2id; same package used for other crypto needs |
| golang.org/x/time | latest x/ | Rate limiting (future-proofing) | Needed for CORE-05 workers; token bucket; add now to avoid later go.mod churn |
| github.com/charmbracelet/lipgloss | latest | Terminal table output | Declarative styles; used across Go security tool ecosystem |
| github.com/stretchr/testify | v1.10.x | Test assertions | Assert/require only; standard in Go ecosystem |
| embed | (stdlib) | Compile-time YAML embed | Go 1.16+ native; no external dep |
| crypto/aes, crypto/cipher | (stdlib) | AES-256-GCM encryption | Standard library; no CGo; GCM provides authenticated encryption |
| math | (stdlib) | Shannon entropy calculation | ~10-line implementation; no library needed |
| database/sql | (stdlib) | SQL interface over modernc.org/sqlite | Driver registered as "sqlite"; raw SQL; no ORM |

Not Needed in Phase 1 (scaffold stubs only if required by interface)

| Library | Deferred To |
|---------|-------------|
| chi v5.2.5 | Phase 18 (Web Dashboard) |
| templ v0.3.1001 | Phase 18 (Web Dashboard) |
| telego v1.8.0 | Phase 17 (Telegram Bot) |
| gocron v2.19.1 | Phase 17 (Scheduler) |

Installation (Phase 1 dependencies):

go mod init github.com/yourusername/keyhunter
go get github.com/spf13/cobra@v1.10.2
go get github.com/spf13/viper@v1.21.0
go get modernc.org/sqlite@latest
go get gopkg.in/yaml.v3@v3.0.1
go get github.com/petar-dambovaliev/aho-corasick@latest
go get github.com/panjf2000/ants/v2@v2.12.0
go get golang.org/x/crypto@latest
go get golang.org/x/time@latest
go get github.com/charmbracelet/lipgloss@latest
go get github.com/stretchr/testify@latest

Version verification (run before writing go.mod manually):

go list -m github.com/petar-dambovaliev/aho-corasick
go list -m modernc.org/sqlite
go list -m github.com/panjf2000/ants/v2

Architecture Patterns

keyhunter/
  main.go                      # < 30 lines, cobra root Execute()
  cmd/
    root.go                    # rootCmd, persistent flags, PersistentPreRunE config load
    scan.go                    # scan command + flags
    providers.go               # providers list/info/stats commands
    config.go                  # config init/set/get commands
  pkg/
    providers/
      loader.go                # embed.FS + fs.WalkDir + yaml.Unmarshal
      registry.go              # Registry struct, Get/List/Stats methods
      schema.go                # Provider, Pattern, VerifySpec structs + UnmarshalYAML validation
    engine/
      engine.go                # Engine struct, Scan() method, pipeline orchestration
      pipeline.go              # channel wiring: chunksChan, detectableChan, resultsChan
      filter.go                # Aho-Corasick pre-filter stage
      detector.go              # Regex + entropy detector worker
      entropy.go               # Shannon entropy function
      chunk.go                 # Chunk type (content []byte, source string, offset int64)
      finding.go               # Finding type (provider, key_value, key_masked, confidence, source, path)
      sources/
        source.go              # Source interface
        file.go                # FileSource
        dir.go                 # DirSource (recursive with glob exclude)
    storage/
      db.go                    # DB struct, Open(), migrations via embedded schema.sql
      schema.sql               # DDL for findings, scans, settings tables
      encrypt.go               # AES-256-GCM Encrypt(plaintext, key) / Decrypt(ciphertext, key)
      crypto.go                # Argon2id key derivation: DeriveKey(passphrase, salt)
      findings.go              # CRUD for findings table
      scans.go                 # CRUD for scans table
    config/
      config.go                # Config struct, Load(), defaults
    output/
      table.go                 # lipgloss colored terminal table
      json.go                  # encoding/json output
  providers/
    openai.yaml                # Reference provider definitions (Phase 1: schema examples only)
    anthropic.yaml
    huggingface.yaml

Pattern 1: Provider Registry with Compile-Time Embed

What: Provider YAML definitions embedded via //go:embed and loaded into an in-memory registry at startup.

When to use: Always. Never load provider files from disk at runtime.

// Source: Go embed docs (https://pkg.go.dev/embed)

package providers

import (
    "embed"
    "fmt"
    "io/fs"
    "path"

    ahocorasick "github.com/petar-dambovaliev/aho-corasick"
    "gopkg.in/yaml.v3"
)

// Embed patterns may not contain "..", so the providers directory must sit
// inside (or below) the directory of the package that declares this variable.
//
//go:embed providers/*.yaml
var providersFS embed.FS

type Registry struct {
    providers []Provider
    ac        ahocorasick.AhoCorasick
}

// AC exposes the keyword automaton to the scan pipeline (see Pattern 2).
func (r *Registry) AC() ahocorasick.AhoCorasick { return r.ac }

func NewRegistry() (*Registry, error) {
    var providers []Provider
    err := fs.WalkDir(providersFS, "providers", func(p string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        // embed.FS paths always use forward slashes, so use path, not filepath.
        if d.IsDir() || path.Ext(p) != ".yaml" {
            return nil
        }
        data, err := providersFS.ReadFile(p)
        if err != nil {
            return err
        }
        var prov Provider
        if err := yaml.Unmarshal(data, &prov); err != nil {
            return fmt.Errorf("provider %s: %w", p, err)
        }
        providers = append(providers, prov)
        return nil
    })
    if err != nil {
        return nil, err
    }
    // Build Aho-Corasick automaton from all keywords
    var keywords []string
    for _, p := range providers {
        keywords = append(keywords, p.Keywords...)
    }
    builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
    ac := builder.Build(keywords)
    return &Registry{providers: providers, ac: ac}, nil
}
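
The loader above is tied to the package-level embed.FS, which makes it awkward to unit-test with fixture providers. One possible refinement, sketched below under the assumption that the walk is factored into a helper taking an fs.FS, lets tests pass an in-memory filesystem. The names providerFiles and sampleFS are hypothetical and not part of the planned package layout.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// providerFiles walks fsys from root and returns every *.yaml path it finds.
// Taking an fs.FS (instead of reading a package-level embed.FS) is an
// illustrative choice so tests can substitute an in-memory filesystem.
func providerFiles(fsys fs.FS, root string) ([]string, error) {
	var paths []string
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || path.Ext(p) != ".yaml" {
			return nil // skip directories and non-YAML files
		}
		paths = append(paths, p)
		return nil
	})
	return paths, err
}

// sampleFS is a hypothetical stand-in for the embedded providers directory.
func sampleFS() fs.FS {
	return fstest.MapFS{
		"providers/openai.yaml":    {Data: []byte("name: openai\n")},
		"providers/anthropic.yaml": {Data: []byte("name: anthropic\n")},
		"providers/README.md":      {Data: []byte("not a provider\n")},
	}
}

func main() {
	paths, err := providerFiles(sampleFS(), "providers")
	if err != nil {
		panic(err)
	}
	fmt.Println(paths)
}
```

In production the same helper would receive the embed.FS value, since embed.FS satisfies fs.FS; only the tests would use fstest.MapFS.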

Pattern 2: Three-Stage Scanning Pipeline (Buffered Channels)

What: Source adapters produce chunks onto buffered channels. Aho-Corasick pre-filter reduces candidates. Detector workers apply regex + entropy.

When to use: All scan operations. Never skip the pre-filter.

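// Source: TruffleHog v3 architecture (https://trufflesecurity.com/blog/making-trufflehog-faster-with-aho-corasick)
// Requires imports: context, runtime, sync.

func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) {
    workers := cfg.Workers
    if workers <= 0 {
        // The --workers flag defaults to 0, meaning "use the CPU count".
        // Without this guard no detector would drain detectableChan.
        workers = runtime.NumCPU()
    }

    chunksChan := make(chan Chunk, 1000)
    detectableChan := make(chan Chunk, 500)
    resultsChan := make(chan Finding, 100)

    // Stage 1: Source → chunks (the source is expected to honor ctx)
    go func() {
        defer close(chunksChan)
        src.Chunks(ctx, chunksChan)
    }()

    // Stage 2: Aho-Corasick keyword pre-filter
    go func() {
        defer close(detectableChan)
        for chunk := range chunksChan {
            if len(e.registry.AC().FindAll(string(chunk.Data))) == 0 {
                continue // no provider keyword present: skip regex entirely
            }
            select {
            case detectableChan <- chunk:
            case <-ctx.Done():
                return // stop forwarding once the scan is cancelled
            }
        }
    }()

    // Stage 3: Detector workers
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for chunk := range detectableChan {
                e.detect(chunk, resultsChan)
            }
        }()
    }
    // Close the results channel only after every detector has finished.
    go func() {
        wg.Wait()
        close(resultsChan)
    }()

    return resultsChan, nil
}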

Pattern 3: AES-256-GCM Column Encryption

What: Encrypt key values before storing in SQLite. Nonce prepended to ciphertext. Key derived via Argon2id.

When to use: Every write of a full API key to storage.

// Source: Go crypto/cipher docs (https://pkg.go.dev/crypto/cipher)

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "errors"
    "io"
)

func Encrypt(plaintext []byte, key []byte) ([]byte, error) {
    block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
        return nil, err
    }
    ciphertext := gcm.Seal(nonce, nonce, plaintext, nil) // nonce prepended
    return ciphertext, nil
}

func Decrypt(ciphertext []byte, key []byte) ([]byte, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonceSize := gcm.NonceSize()
    if len(ciphertext) < nonceSize {
        return nil, errors.New("ciphertext too short")
    }
    nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:]
    return gcm.Open(nil, nonce, ciphertext, nil)
}
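
To make the stored BLOB layout and the authentication property concrete, here is a compact, self-contained sketch. It assumes the default 12-byte GCM nonce and 16-byte tag; sealBlob and openBlob are hypothetical names that mirror Encrypt and Decrypt above. It shows that the blob is nonce || ciphertext || tag and that flipping a single bit makes decryption fail.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealBlob returns nonce || ciphertext || tag, the layout stored in the BLOB column.
func sealBlob(plaintext, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openBlob splits off the nonce and verifies the tag while decrypting.
func openBlob(blob, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(blob) < n {
		return nil, errors.New("blob too short")
	}
	return gcm.Open(nil, blob[:n], blob[n:], nil)
}

func main() {
	key := make([]byte, 32) // all-zero key: for demonstration only
	blob, err := sealBlob([]byte("sk-example"), key)
	if err != nil {
		panic(err)
	}
	fmt.Println("blob length:", len(blob)) // 12 (nonce) + 10 (plaintext) + 16 (tag)

	blob[len(blob)-1] ^= 0x01 // simulate corruption or tampering on disk
	_, err = openBlob(blob, key)
	fmt.Println("tampered blob rejected:", err != nil)
}
```

The fixed 28-byte overhead per value is worth keeping in mind when sizing the BLOB column or estimating database growth.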

Pattern 4: Argon2id Key Derivation

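What: Derive a 32-byte AES key from the user passphrase plus a random salt. The parameters follow the golang.org/x/crypto/argon2 IDKey documentation (time=1, memory=64*1024 KiB); RFC 9106 Section 4 pairs the same 64 MiB with t=3, p=4 in its low-memory recommendation for Argon2id.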

When to use: On first use (generate and store salt); on subsequent use (re-derive key from passphrase + stored salt).

// Source: https://pkg.go.dev/golang.org/x/crypto/argon2

import (
    "crypto/rand"

    "golang.org/x/crypto/argon2"
)

const (
    argon2Time    = 1
    argon2Memory  = 64 * 1024 // expressed in KiB, i.e. 64 MiB
    argon2Threads = 4
    argon2KeyLen  = 32 // AES-256 key length
)

func DeriveKey(passphrase []byte, salt []byte) []byte {
    return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen)
}

// Generate salt (do once, persist in config)
func NewSalt() ([]byte, error) {
    salt := make([]byte, 32)
    _, err := rand.Read(salt)
    return salt, err
}

Pattern 5: Provider YAML Schema with Validation

What: Provider struct with UnmarshalYAML that validates required fields including format_version and last_verified.

When to use: Every provider YAML load. Return error on invalid schema — fail fast at startup.

// Source: gopkg.in/yaml.v3 docs (https://pkg.go.dev/gopkg.in/yaml.v3)

import (
    "fmt"

    "gopkg.in/yaml.v3"
)

type Provider struct {
    Name          string      `yaml:"name"`
    FormatVersion int         `yaml:"format_version"`
    LastVerified  string      `yaml:"last_verified"` // ISO date: "2026-04-04"
    Keywords      []string    `yaml:"keywords"`
    Patterns      []Pattern   `yaml:"patterns"`
    Verify        *VerifySpec `yaml:"verify,omitempty"`
}

func (p *Provider) UnmarshalYAML(value *yaml.Node) error {
    type rawProvider Provider
    if err := value.Decode((*rawProvider)(p)); err != nil {
        return err
    }
    if p.FormatVersion == 0 {
        return fmt.Errorf("provider %q: format_version is required", p.Name)
    }
    if p.LastVerified == "" {
        return fmt.Errorf("provider %q: last_verified is required", p.Name)
    }
    if len(p.Keywords) == 0 {
        return fmt.Errorf("provider %q: at least one keyword is required", p.Name)
    }
    if len(p.Patterns) == 0 {
        return fmt.Errorf("provider %q: at least one pattern is required", p.Name)
    }
    return nil
}

Pattern 6: Shannon Entropy (stdlib only)

What: Calculate bits-per-character entropy of a string. No library needed.

When to use: CORE-04: as secondary signal after keyword pre-filter and regex confirm a candidate. Entropy threshold 3.5 bits/char for most providers.

// Source: Shannon entropy formula (verified against TruffleHog entropy implementation)

import "math"

func shannonEntropy(s string) float64 {
    if len(s) == 0 {
        return 0
    }
    freq := make(map[rune]float64)
    for _, c := range s {
        freq[c]++
    }
    n := float64(len([]rune(s)))
    var entropy float64
    for _, count := range freq {
        p := count / n
        entropy -= p * math.Log2(p)
    }
    return entropy
}
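
For reference, the quantity computed above and a worked value can be written out as follows. The second part is a consequence of the formula rather than something stated in the requirements: since entropy is bounded by the logarithm of the number of distinct characters, the 3.5 bits/char threshold can only be reached by strings with at least 12 distinct characters.

```latex
% Shannon entropy of a string s, where n_c is the count of character c in s
H(s) = -\sum_{c \in s} p_c \log_2 p_c , \qquad p_c = \frac{n_c}{|s|}

% Worked example: s = "aabbccdd" has four characters, each with p_c = 1/4
H(s) = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \ \text{bits/char}

% Upper bound with k distinct characters, and what the 3.5 threshold implies
H(s) \le \log_2 k , \qquad \log_2 11 \approx 3.46 < 3.5 < \log_2 12 \approx 3.58
```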

Pattern 7: modernc.org/sqlite Initialization

What: Open SQLite DB with WAL mode and foreign keys. Schema embedded via //go:embed.

When to use: Storage layer initialization.

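// Source: https://practicalgobook.net/posts/go-sqlite-no-cgo/

import (
    "database/sql"
    _ "embed" // blank import: embed is used only through the //go:embed directive

    _ "modernc.org/sqlite" // registers "sqlite" driver
)

//go:embed schema.sql
var schemaSQL string

func Open(path string) (*sql.DB, error) {
    // foreign_keys is a per-connection setting, so pass it in the DSN:
    // the driver then applies it to every connection in the database/sql pool.
    db, err := sql.Open("sqlite", "file:"+path+"?_pragma=foreign_keys(1)")
    if err != nil {
        return nil, err
    }
    // Enable WAL mode for better concurrent access (persists in the DB file)
    if _, err := db.Exec("PRAGMA journal_mode=WAL"); err != nil {
        db.Close()
        return nil, err
    }
    // Apply schema; it runs on every Open, so the DDL should be idempotent
    // (for example CREATE TABLE IF NOT EXISTS).
    if _, err := db.Exec(schemaSQL); err != nil {
        db.Close()
        return nil, err
    }
    return db, nil
}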

Pattern 8: Cobra + Viper Wiring

What: Cobra command tree with Viper config loaded in PersistentPreRunE. Flags bound to Viper for config file override.

When to use: root.go and all command files.

// Source: https://www.glukhov.org/post/2025/11/go-cli-applications-with-cobra-and-viper/

var rootCmd = &cobra.Command{
    Use:   "keyhunter",
    Short: "API key scanner for 108+ LLM providers",
    PersistentPreRunE: func(cmd *cobra.Command, args []string) error {
        viper.SetConfigName(".keyhunter")
        viper.SetConfigType("yaml")
        viper.AddConfigPath("$HOME")
        viper.AutomaticEnv()
        viper.SetEnvPrefix("KEYHUNTER")
        if err := viper.ReadInConfig(); err != nil {
            if _, ok := err.(viper.ConfigFileNotFoundError); !ok {
                return err
            }
            // Config file not found — acceptable on first run
        }
        return nil
    },
}

// In init() of each command file:
func init() {
    scanCmd.Flags().IntP("workers", "w", 0, "worker count (default: CPU count)")
    viper.BindPFlag("workers", scanCmd.Flags().Lookup("workers"))
}

Anti-Patterns to Avoid

  • Global provider registry: Pass *Registry via constructor. Global state makes testing impossible without full initialization.
  • Unbuffered result channels: Use make(chan Finding, 1000). Unbuffered channels cause detector workers to block on slow output consumers, collapsing parallelism.
  • Runtime YAML loading: Never load from filesystem at scan time. //go:embed only.
  • regexp2 or PCRE: Go's regexp package (RE2) already provides linear-time guarantees. regexp2 loses this guarantee.
  • Entropy-only detection: Never flag a candidate based solely on entropy score. Entropy is a secondary filter applied only after keyword pre-filter and regex confirm a pattern match.
  • Plaintext key column: Never store full API key as TEXT. Always encrypt with AES-256-GCM before INSERT.
  • Persisted passphrase or raw key: Derive the AES key via Argon2id from a passphrase at startup. Never store the raw passphrase or the raw key in the config file.
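
The "Plaintext key column" rule pairs with the default-masked output constraint: only a masked form of a key should ever sit next to the encrypted BLOB or reach the terminal without --unmask. The sketch below shows one possible masking rule. The format (first and last four characters joined by "...") is an assumption for illustration and is not specified anywhere in the requirements; maskKey is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// maskKey hides the middle of a key, keeping a short prefix and suffix so a
// finding stays recognizable. The "first 4 ... last 4" format is assumed for
// illustration. Keys are treated as ASCII, so byte slicing is safe here.
func maskKey(key string) string {
	const keep = 4
	if len(key) <= 2*keep {
		// Too short to reveal anything safely: mask every character.
		return strings.Repeat("*", len(key))
	}
	return key[:keep] + "..." + key[len(key)-keep:]
}

func main() {
	fmt.Println(maskKey("sk-abcdefghijklmnop1234")) // sk-a...1234
	fmt.Println(maskKey("short"))                   // *****
}
```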

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Multi-keyword search automaton | Custom Aho-Corasick or loop-based string search | petar-dambovaliev/aho-corasick | Aho-Corasick is O(n+m+z); naive loop is O(n*k); library is 20x faster than next-best Go option |
| Goroutine pool with dynamic resize | Manual goroutine spawn with WaitGroup | ants v2.12.0 | Goroutine explosion on large repos; ants handles backpressure, panic recovery, context cancellation |
| AES key derivation from passphrase | SHA256(passphrase) or similar | argon2.IDKey from golang.org/x/crypto | A single fast hash (MD5, SHA-256) is cheap to brute-force on GPUs; Argon2id is memory-hard and GPU-resistant by design |
| SQLite column encryption | XOR, custom cipher, or base64 "encoding" | crypto/aes GCM via stdlib | GCM provides both confidentiality and authentication; custom schemes are a common source of vulnerabilities |
| Config file management | Custom INI or JSON parser | viper v1.21.0 | Viper handles YAML + env vars + CLI flags with correct precedence; hand-rolled configs miss env var override |
| CLI command parsing | flag stdlib or custom parser | cobra v1.10.2 | Cobra provides nested subcommands, persistent flags, shell completion, and help generation; stdlib flag has no subcommand tree or shell completion |

Key insight: The Aho-Corasick pre-filter and Argon2id key derivation in particular are problems where "obvious" implementations (nested loops, SHA256) have well-documented security or performance failure modes that justify the dependency cost.


Common Pitfalls

Pitfall 1: Wrong Aho-Corasick Library Choice

What goes wrong: Using cloudflare/ahocorasick because it appears more prominent in search results. It uses 8x more memory and runs 20x slower than petar-dambovaliev/aho-corasick. For a tool scanning large repos with 108 keyword patterns, this difference is measurable.

Why it happens: cloudflare/ahocorasick appears first in many search results.

How to avoid: Use github.com/petar-dambovaliev/aho-corasick (verified as the library TruffleHog uses per their own blog post crediting Petar Dambovaliev). Build the automaton once at registry init; the automaton is thread-safe for concurrent reads.

Warning signs: Scans on 100MB+ repos running significantly slower than expected; high memory usage during keyword pre-filter stage.


Pitfall 2: Encryption Key Stored in Config File as Raw Bytes

What goes wrong: Storing the 32-byte AES key directly in ~/.keyhunter.yaml or as an env var. Anyone with config file read access can decrypt the entire database.

Why it happens: "Just store the key" is the simplest implementation. Argon2id + salt seems like unnecessary complexity.

How to avoid: Store only the randomly generated salt in config. Re-derive the key from passphrase + salt on each run using Argon2id. The passphrase is entered interactively or set via KEYHUNTER_PASSPHRASE env var (never stored). For Phase 1, interactive passphrase entry is acceptable. OS keychain integration (zalando/go-keyring) can be added later without schema migration.

Warning signs: keyhunter_key: field appearing as hex bytes in the YAML config file.


Pitfall 3: Provider YAML Schema Allowing Missing format_version or last_verified

What goes wrong: Provider YAML loads without validation. Providers with missing format_version or stale last_verified accumulate over time. Pattern health tracking (PROV-10) becomes meaningless.

Why it happens: yaml.Unmarshal to a struct silently zero-values missing fields. No validation = no error.

How to avoid: Implement UnmarshalYAML with explicit validation on the Provider struct. Fail at startup (not at scan time) if any provider is invalid. This catches schema errors at development time, not production time.

Warning signs: format_version: 0 appearing in any loaded provider struct.


Pitfall 4: SQLite Without WAL Mode in a CLI Tool

What goes wrong: The default SQLite journal mode causes SQLITE_BUSY errors when the dashboard (Phase 18) or multiple concurrent processes read while a scan writes. The default rollback journal also has lower write throughput than WAL.

Why it happens: sql.Open("sqlite", path) uses the default rollback journal.

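How to avoid: Always execute PRAGMA journal_mode=WAL immediately after opening the database. This is a one-time setup that persists in the database file. Also set PRAGMA foreign_keys=ON to enforce referential integrity; unlike the journal mode, this setting is per connection and must be applied to every connection the pool opens.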

Warning signs: database is locked errors during concurrent read/write operations.


Pitfall 5: Entropy Check Before Regex Confirmation

What goes wrong: Running Shannon entropy on every chunk that passes the Aho-Corasick keyword filter can produce a false positive rate of up to 80% (HashiCorp 2025 research). High-entropy strings such as UUIDs, hashes, and base64-encoded content can all score above 3.5 bits/char.

Why it happens: Entropy feels like an independent signal and is applied eagerly as a quick filter.

How to avoid: Entropy is strictly a secondary signal. Apply it only to strings that have already matched both a keyword (Aho-Corasick) AND a provider regex pattern. The order is: keyword pre-filter → regex match → entropy calibration. Never entropy-only.

Warning signs: More than 30% of findings lacking a matching provider regex pattern.
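
The required ordering (keyword, then regex, then entropy) can be sketched with the standard library alone. This is a simplified illustration: strings.Contains stands in for the Aho-Corasick automaton, and the keyword and regex are hypothetical placeholders rather than a real provider definition.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

const entropyThreshold = 3.5 // bits/char, as in CORE-04

// Hypothetical provider definition, for illustration only.
var (
	keyword = "sk-"
	pattern = regexp.MustCompile(`sk-[A-Za-z0-9]{20}`)
)

// shannonEntropy returns the bits-per-character entropy of s.
func shannonEntropy(s string) float64 {
	runes := []rune(s)
	if len(runes) == 0 {
		return 0
	}
	freq := make(map[rune]float64)
	for _, c := range runes {
		freq[c]++
	}
	var h float64
	for _, count := range freq {
		p := count / float64(len(runes))
		h -= p * math.Log2(p)
	}
	return h
}

// detect applies the three stages in order and never reports on entropy alone.
func detect(chunk string) []string {
	// Stage 1: keyword pre-filter (stand-in for the Aho-Corasick automaton).
	if !strings.Contains(chunk, keyword) {
		return nil
	}
	var findings []string
	// Stage 2: the provider regex must match.
	for _, m := range pattern.FindAllString(chunk, -1) {
		// Stage 3: entropy only calibrates an existing regex match.
		if shannonEntropy(m) >= entropyThreshold {
			findings = append(findings, m)
		}
	}
	return findings
}

func main() {
	fmt.Println(detect("token=sk-aB3dE5gH7jK9mN1pQ2rS"))       // regex match with varied characters
	fmt.Println(detect("token=sk-aaaaaaaaaaaaaaaaaaaa"))       // regex match, but repetitive
	fmt.Println(detect("id=9f86d081884c7d659a2feaa0c55ad015")) // random-looking, but no keyword
}
```

Only the first input produces a finding: the second matches the regex but has low entropy, and the third is never examined because it fails the keyword stage.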


Pitfall 6: mmap for Small Files on Linux

What goes wrong: mmap is beneficial for large files (>10MB) where avoiding a full read-into-memory matters. For small files, mmap has higher setup overhead than a simple os.ReadFile. mmap also requires explicit cleanup to avoid address space exhaustion on directory scans with thousands of small files.

Why it happens: CORE-07 specifies mmap-based large file reading, and implementing it uniformly for all files seems simpler.

How to avoid: Use os.ReadFile for files under 10MB. Use mmap only above that threshold, with explicit unix.Munmap cleanup in a deferred call. Check for binary files before mmap — use the first 512 bytes of the file to detect binary content via http.DetectContentType and skip non-text files.

Warning signs: Address space exhaustion during directory scans; "too many open files" errors.
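
The two decisions described above (which read strategy to use, and whether a file is worth scanning at all) can be sketched as small pure functions. This is a sketch under assumptions: the threshold is taken as 10 MiB, the "text/" prefix check is a heuristic, and useMmap and isLikelyText are hypothetical helper names. The actual mmap call is left out because it is platform-specific.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// mmapThreshold is the size above which mmap is preferred over os.ReadFile.
// 10 MiB is assumed here as the reading of the "10MB" guidance.
const mmapThreshold = 10 << 20

// useMmap reports whether a file of the given size should be memory-mapped.
func useMmap(size int64) bool {
	return size > mmapThreshold
}

// isLikelyText sniffs the first bytes of a file (DetectContentType looks at
// no more than 512) and accepts only content types starting with "text/".
// This is a heuristic: some textual formats are sniffed as other types.
func isLikelyText(head []byte) bool {
	return strings.HasPrefix(http.DetectContentType(head), "text/")
}

func main() {
	fmt.Println(useMmap(1<<20), useMmap(64<<20))
	fmt.Println(isLikelyText([]byte("OPENAI_API_KEY=sk-abc\n")))
	fmt.Println(isLikelyText([]byte("\x89PNG\r\n\x1a\n")))
}
```

A file source would call isLikelyText on the first 512 bytes before choosing between os.ReadFile and mmap, so binary files are skipped before any mapping is created.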


Pitfall 7: Argon2 Parameters Too Low (Fast KDF = Weak Security)

What goes wrong: Using time=1, memory=4096 (a commonly copied example) instead of the 64 MiB memory setting adopted here. Fast key derivation makes brute-force attacks on the database passphrase practical.

Why it happens: Low parameters make tests run faster and startup feel snappier.

How to avoid: Use time=1, memory=64*1024 (64 MiB, expressed in KiB), threads=4, keyLen=32 for Argon2id, as suggested in the golang.org/x/crypto/argon2 documentation. RFC 9106 Section 4 pairs the same 64 MiB with t=3 in its low-memory recommendation, so raise time to 3 if strict RFC alignment is required. Test startup latency with these parameters; on modern hardware, key derivation takes ~100-300ms, which is acceptable for a CLI tool.

Warning signs: argon2.IDKey(pass, salt, 1, 4096, 1, 32) in the codebase. The memory parameter should be 64*1024 (65536), not 4096.


State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Linear keyword scan before regex | Aho-Corasick pre-filter | TruffleHog v3.28.6 (2024) | 2x average scan speedup; O(n+m+z) vs O(n*k) |
| mattn/go-sqlite3 (CGo) | modernc.org/sqlite (pure Go) | Go 1.16+ era | CGO_ENABLED=0 enabled; cross-compilation works |
| SQLCipher for DB encryption | Application-level AES-256-GCM | 2023-2025 | No CGo dependency; AES GCM provides authentication |
| PBKDF2 for key derivation | Argon2id | RFC 9106 (2021) | GPU-resistant; side-channel resistant; OWASP recommended |
| regexp2/PCRE in Go scanners | Go stdlib regexp (RE2) | Ongoing | ReDoS immune; linear time guaranteed |
| Storing full keys masked in DB | Encrypt key_encrypted column, store only mask | GHSA-4h8c-qrcq-cv5c (2024) | Database file no longer a credential dump |

Deprecated/outdated:

  • mattn/go-sqlite3: Requires CGo; cross-compilation breaks; modernc.org/sqlite is the replacement.
  • robfig/cron: Unmaintained since 2020; use gocron v2 (Phase 17).
  • cloudflare/ahocorasick: Still maintained but 20x slower than petar-dambovaliev/aho-corasick; do not use.
  • Entropy-only secret detection: HashiCorp 2025 research confirms 80% FP rate; layered pipeline is the current standard.

Open Questions

  1. Passphrase UX for first run

    • What we know: Argon2id requires a passphrase to derive the AES key. For Phase 1, this must be handled on first keyhunter scan or keyhunter config init.
    • What's unclear: Should Phase 1 use bufio.NewReader(os.Stdin).ReadString('\n') for passphrase entry, or skip encryption and use a generated random key stored in config (less secure but zero-friction)?
    • Recommendation: Use a generated random 32-byte key stored in ~/.keyhunter.yaml as base64 for Phase 1 (zero-friction). Document that this is a development shortcut; OS keychain integration (zalando/go-keyring) replaces it in a later phase. The encrypt/decrypt functions and schema are in place; only the key source changes.
  2. mmap on Linux: syscall vs golang.org/x/sys/unix

    • What we know: Both syscall.Mmap and golang.org/x/sys/unix.Mmap provide mmap on Linux. The x/ package has a cleaner API.
    • What's unclear: golang.org/x/sys is already a transitive dependency of many packages (likely pulled in by viper or cobra), but it has not been confirmed whether it is already in go.sum or needs explicit addition.
    • Recommendation: Use golang.org/x/sys/unix for mmap in the file source adapter. It will almost certainly already be in go.sum. Only implement mmap for Phase 1 if CORE-07 is in scope for the minimal viable scan pipeline; otherwise defer to Phase 4.
  3. Provider YAML for Phase 1: how many definitions?

    • What we know: Phase 1 requires schema + PROV-10 (format_version, last_verified fields). Full 108-provider definitions are Phase 2-3.
    • What's unclear: The success criterion "keyhunter scan ./somefile runs the three-stage pipeline and returns findings with provider names" implies at least one real provider definition must exist.
    • Recommendation: Ship 3 reference provider definitions in Phase 1 (OpenAI, Anthropic, HuggingFace) with valid format_version and last_verified. All 108 providers are Phase 2-3 scope. These 3 definitions validate the schema and make the success criteria testable.
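
For the passphrase option weighed in open question 1, a minimal sketch of the lookup order could look like this: prefer the KEYHUNTER_PASSPHRASE environment variable, otherwise read one line from the given reader. The function names are hypothetical. Plain bufio input echoes what the user types, so a real prompt would more likely use term.ReadPassword from golang.org/x/term; that is omitted here to keep the sketch within the standard library.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

const passphraseEnv = "KEYHUNTER_PASSPHRASE"

// normalizePassphrase strips the trailing line ending and rejects empty input.
func normalizePassphrase(line string) (string, error) {
	p := strings.TrimRight(line, "\r\n")
	if p == "" {
		return "", errors.New("passphrase must not be empty")
	}
	return p, nil
}

// readPassphrase prefers the environment variable and falls back to reading
// one line from in. lookupEnv is injected (os.LookupEnv in production) so the
// function can be tested without touching the real environment.
func readPassphrase(lookupEnv func(string) (string, bool), in io.Reader) (string, error) {
	if v, ok := lookupEnv(passphraseEnv); ok && v != "" {
		return v, nil
	}
	line, err := bufio.NewReader(in).ReadString('\n')
	if err != nil && !errors.Is(err, io.EOF) {
		return "", err
	}
	return normalizePassphrase(line)
}

func main() {
	// Simulate an unset environment variable and a typed passphrase.
	noEnv := func(string) (string, bool) { return "", false }
	p, err := readPassphrase(noEnv, strings.NewReader("correct horse\n"))
	fmt.Println(p, err)
}
```

Whichever key source Phase 1 settles on, keeping it behind a single function like this means only that function changes when keychain integration arrives.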

Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|-------------|-----------|---------|----------|
| Go | Core language | Yes | 1.26.1 (exceeds 1.22 requirement) | n/a |
| git | Version control | Yes | 2.53.0 | n/a |
| golangci-lint | Static analysis / CI | No | n/a | Install via go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest or skip in Phase 1 |
| npm | (not needed Phase 1) | Yes | (4.2.2 via tailwind check) | Not needed until Phase 18 |

Missing dependencies with fallback:

  • golangci-lint: Not found. Install before Phase 1 linting task, or skip lint gate for initial scaffold and add in CI pipeline. Fallback: go vet ./... catches most critical issues.

Missing dependencies with no fallback:

  • None. Go 1.26.1 is available and exceeds the 1.22+ requirement.

Validation Architecture

Test Framework

| Property | Value |
|---|---|
| Framework | go test (stdlib) + testify v1.10.x for assertions |
| Config file | None needed; standard go test ./... discovers *_test.go |
| Quick run command | go test ./pkg/... -race -timeout 30s |
| Full suite command | go test ./... -race -cover -timeout 120s |

Phase Requirements to Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| CORE-01 | Scan pipeline detects known API key patterns in test input | integration | go test ./pkg/engine/... -run TestScanPipeline -v | No (Wave 0) |
| CORE-02 | Provider YAML loads from embed.FS without error | unit | go test ./pkg/providers/... -run TestNewRegistry -v | No (Wave 0) |
| CORE-03 | Registry holds correct provider count, Get() returns provider by name | unit | go test ./pkg/providers/... -run TestRegistry -v | No (Wave 0) |
| CORE-04 | Shannon entropy returns correct value for known inputs | unit | go test ./pkg/engine/... -run TestShannonEntropy -v | No (Wave 0) |
| CORE-05 | Worker pool uses correct concurrency; all workers complete | unit | go test ./pkg/engine/... -run TestWorkerPool -v | No (Wave 0) |
| CORE-06 | Aho-Corasick filter passes keyword-matched chunks; rejects non-matching | unit | go test ./pkg/engine/... -run TestAhoCorasickFilter -v | No (Wave 0) |
| CORE-07 | Large file source reads without OOM; binary files skipped | integration | go test ./pkg/engine/sources/... -run TestFileSourceLarge -v | No (Wave 0) |
| STOR-01 | SQLite DB opens; schema applies; WAL mode set | unit | go test ./pkg/storage/... -run TestOpen -v | No (Wave 0) |
| STOR-02 | AES-256-GCM encrypt/decrypt round-trip is lossless | unit | go test ./pkg/storage/... -run TestEncryptDecrypt -v | No (Wave 0) |
| STOR-03 | Argon2id DeriveKey produces 32-byte output that is deterministic for a fixed passphrase and salt | unit | go test ./pkg/storage/... -run TestDeriveKey -v | No (Wave 0) |
| CLI-01 | keyhunter --help exits 0; all subcommands listed | smoke | go run . --help (assumes the main package sits at the module root; go run ./... fails when the pattern matches more than one package) | No (Wave 0) |
| CLI-02 | keyhunter config init creates ~/.keyhunter.yaml | integration | go test ./cmd/... -run TestConfigInit -v | No (Wave 0) |
| CLI-03 | keyhunter config set key val persists to YAML | integration | go test ./cmd/... -run TestConfigSet -v | No (Wave 0) |
| CLI-04 | keyhunter providers list returns at least 3 providers | integration | go test ./cmd/... -run TestProvidersList -v | No (Wave 0) |
| CLI-05 | keyhunter scan --workers 4 testfile uses 4 workers | integration | go test ./cmd/... -run TestScanFlags -v | No (Wave 0) |
| PROV-10 | Provider YAML with missing format_version returns error at load | unit | go test ./pkg/providers/... -run TestProviderValidation -v | No (Wave 0) |

Sampling Rate

  • Per task commit: go test ./pkg/... -race -timeout 30s
  • Per wave merge: go test ./... -race -cover -timeout 120s
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

All test files are new — this is a greenfield project.

  • pkg/providers/registry_test.go — covers CORE-02, CORE-03, PROV-10
  • pkg/engine/entropy_test.go — covers CORE-04
  • pkg/engine/filter_test.go — covers CORE-06
  • pkg/engine/engine_test.go — covers CORE-01, CORE-05
  • pkg/engine/sources/file_test.go — covers CORE-07
  • pkg/storage/encrypt_test.go — covers STOR-02
  • pkg/storage/crypto_test.go — covers STOR-03
  • pkg/storage/db_test.go — covers STOR-01
  • cmd/config_test.go — covers CLI-02, CLI-03
  • cmd/providers_test.go — covers CLI-04
  • cmd/scan_test.go — covers CLI-05
  • testdata/fixtures/ — synthetic test files with known API key patterns for integration tests
  • Framework install: go get github.com/stretchr/testify@latest — if not added during go.mod init

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • argon2aes reference implementation: https://github.com/presbrey/argon2aes — implementation pattern reference only; use stdlib directly
  • Passphrase UX patterns: Community convention; no authoritative Go CLI standard for passphrase input UX

Metadata

Confidence breakdown:

  • Standard stack: HIGH — all versions verified against official GitHub releases and pkg.go.dev as of 2026-04-04
  • Architecture: HIGH — TruffleHog v3 pipeline is the proven model; channel patterns are established Go idiom
  • Aho-Corasick library choice: HIGH — TruffleHog blog post explicitly credits petar-dambovaliev; pkg.go.dev confirms API and last update date
  • AES-256+Argon2id approach: HIGH — stdlib only; RFC 9106 parameters; well-documented pattern
  • Pitfalls: HIGH — sourced from GHSA advisory, HashiCorp research, official Go docs
  • Test architecture: HIGH — standard go test patterns; no uncertainty

Research date: 2026-04-04 Valid until: 2026-07-04 (stable libraries; 90 days is safe for Go stdlib + well-maintained ecosystem packages)