From 684b67cb73c6508134bdfb00388291698c4fba98 Mon Sep 17 00:00:00 2001 From: salvacybersec Date: Sat, 4 Apr 2026 23:44:09 +0300 Subject: [PATCH] =?UTF-8?q?docs(01-foundation):=20create=20phase=201=20pla?= =?UTF-8?q?n=20=E2=80=94=205=20plans=20across=203=20execution=20waves?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wave 0: module init + test scaffolding (01-01) Wave 1: provider registry (01-02) + storage layer (01-03) in parallel Wave 2: scan engine pipeline (01-04, depends on 01-02) Wave 3: CLI wiring + integration checkpoint (01-05, depends on all) Covers all 16 Phase 1 requirements: CORE-01 through CORE-07, STOR-01 through STOR-03, CLI-01 through CLI-05, PROV-10. Co-Authored-By: Claude Sonnet 4.6 --- .planning/ROADMAP.md | 11 +- .planning/phases/01-foundation/01-01-PLAN.md | 359 +++++++++ .planning/phases/01-foundation/01-02-PLAN.md | 663 ++++++++++++++++ .planning/phases/01-foundation/01-03-PLAN.md | 634 ++++++++++++++++ .planning/phases/01-foundation/01-04-PLAN.md | 682 +++++++++++++++++ .planning/phases/01-foundation/01-05-PLAN.md | 748 +++++++++++++++++++ 6 files changed, 3095 insertions(+), 2 deletions(-) create mode 100644 .planning/phases/01-foundation/01-01-PLAN.md create mode 100644 .planning/phases/01-foundation/01-02-PLAN.md create mode 100644 .planning/phases/01-foundation/01-03-PLAN.md create mode 100644 .planning/phases/01-foundation/01-04-PLAN.md create mode 100644 .planning/phases/01-foundation/01-05-PLAN.md diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index e21aaa5..53c9549 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -43,7 +43,14 @@ Decimal phases appear between their surrounding integers in numeric order. 3. `keyhunter config init` creates `~/.keyhunter.yaml` and `keyhunter config set ` persists values 4. `keyhunter providers list` and `keyhunter providers info ` return provider metadata from YAML definitions 5. 
Provider YAML schema includes `format_version` and `last_verified` fields validated at load time -**Plans**: TBD +**Plans**: 5 plans + +Plans: +- [ ] 01-01-PLAN.md — Go module init, dependency installation, test scaffolding and testdata fixtures +- [ ] 01-02-PLAN.md — Provider registry: YAML schema, embed loader, Aho-Corasick automaton, Registry struct +- [ ] 01-03-PLAN.md — Storage layer: AES-256-GCM encryption, Argon2id key derivation, SQLite + Finding CRUD +- [ ] 01-04-PLAN.md — Scan engine pipeline: keyword pre-filter, regex+entropy detector, FileSource, ants worker pool +- [ ] 01-05-PLAN.md — CLI wiring: scan, providers list/info/stats, config init/set/get, output table ### Phase 2: Tier 1-2 Providers **Goal**: The 26 highest-value LLM provider YAML definitions exist with accurate regex patterns, keyword lists, confidence levels, and verify endpoints — covering OpenAI, Anthropic, Google AI, AWS Bedrock, Azure OpenAI and all major inference platforms @@ -248,7 +255,7 @@ Phases execute in numeric order: 1 → 2 → 3 → ... → 18 | Phase | Plans Complete | Status | Completed | |-------|----------------|--------|-----------| -| 1. Foundation | 0/? | Not started | - | +| 1. Foundation | 0/5 | Planning complete | - | | 2. Tier 1-2 Providers | 0/? | Not started | - | | 3. Tier 3-9 Providers | 0/? | Not started | - | | 4. Input Sources | 0/? 
| Not started | - | diff --git a/.planning/phases/01-foundation/01-01-PLAN.md b/.planning/phases/01-foundation/01-01-PLAN.md new file mode 100644 index 0000000..2e0d275 --- /dev/null +++ b/.planning/phases/01-foundation/01-01-PLAN.md @@ -0,0 +1,359 @@ +--- +phase: 01-foundation +plan: 01 +type: execute +wave: 0 +depends_on: [] +files_modified: + - go.mod + - go.sum + - main.go + - testdata/samples/openai_key.txt + - testdata/samples/anthropic_key.txt + - testdata/samples/no_keys.txt + - pkg/providers/registry_test.go + - pkg/storage/db_test.go + - pkg/engine/scanner_test.go +autonomous: true +requirements: [CORE-01, CORE-02, CORE-03, CORE-04, CORE-05, CORE-06, CORE-07, STOR-01, STOR-02, STOR-03, CLI-01] + +must_haves: + truths: + - "go.mod exists with all Phase 1 dependencies at pinned versions" + - "go build ./... succeeds with zero errors on a fresh checkout" + - "go test ./... -short runs without compilation errors (tests may fail — stubs are fine)" + - "testdata/ contains files with known key patterns for scanner integration tests" + artifacts: + - path: "go.mod" + provides: "Module declaration with all Phase 1 dependencies" + contains: "module github.com/salvacybersec/keyhunter" + - path: "main.go" + provides: "Binary entry point under 30 lines" + contains: "func main()" + - path: "testdata/samples/openai_key.txt" + provides: "Sample file with synthetic OpenAI key for scanner tests" + - path: "pkg/providers/registry_test.go" + provides: "Test stubs for provider loading and registry" + - path: "pkg/storage/db_test.go" + provides: "Test stubs for SQLite + encryption roundtrip" + - path: "pkg/engine/scanner_test.go" + provides: "Test stubs for pipeline stages" + key_links: + - from: "go.mod" + to: "petar-dambovaliev/aho-corasick" + via: "require directive" + pattern: "petar-dambovaliev/aho-corasick" + - from: "go.mod" + to: "modernc.org/sqlite" + via: "require directive" + pattern: "modernc.org/sqlite" +--- + + +Initialize the Go module, install all Phase 1 
dependencies at pinned versions, create the minimal main.go entry point, and lay down test scaffolding with testdata fixtures that every subsequent plan's tests depend on. + +Purpose: All subsequent plans require a compiling module and test infrastructure to exist before they can add production code and make tests green. Wave 0 satisfies this bootstrap requirement. +Output: go.mod, go.sum, main.go, pkg/*/test stubs, testdata/ fixtures. + + + +@$HOME/.claude/get-shit-done/workflows/execute-plan.md +@$HOME/.claude/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/phases/01-foundation/01-RESEARCH.md +@.planning/phases/01-foundation/01-VALIDATION.md + + + +Module: github.com/salvacybersec/keyhunter + + +Dependencies to install: + github.com/spf13/cobra@v1.10.2 + github.com/spf13/viper@v1.21.0 + modernc.org/sqlite@latest + gopkg.in/yaml.v3@v3.0.1 + github.com/petar-dambovaliev/aho-corasick@latest + github.com/panjf2000/ants/v2@v2.12.0 + golang.org/x/crypto@latest + golang.org/x/time@latest + github.com/charmbracelet/lipgloss@latest + github.com/stretchr/testify@latest + + +go 1.22 + + +keyhunter/ + main.go + cmd/ + root.go (created in Plan 05) + scan.go (created in Plan 05) + providers.go (created in Plan 05) + config.go (created in Plan 05) + pkg/ + providers/ (created in Plan 02) + engine/ (created in Plan 04) + storage/ (created in Plan 03) + config/ (created in Plan 05) + output/ (created in Plan 05) + providers/ (created in Plan 02) + testdata/ + samples/ + + + + + + + Task 1: Initialize Go module and install Phase 1 dependencies + go.mod, go.sum + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Standard Stack section — exact library versions) + - /home/salva/Documents/apikey/CLAUDE.md (Technology Stack table — version constraints) + + +Run the following commands in the project root (/home/salva/Documents/apikey): + +```bash +go mod init github.com/salvacybersec/keyhunter +go get 
github.com/spf13/cobra@v1.10.2 +go get github.com/spf13/viper@v1.21.0 +go get modernc.org/sqlite@latest +go get gopkg.in/yaml.v3@v3.0.1 +go get github.com/petar-dambovaliev/aho-corasick@latest +go get github.com/panjf2000/ants/v2@v2.12.0 +go get golang.org/x/crypto@latest +go get golang.org/x/time@latest +go get github.com/charmbracelet/lipgloss@latest +go get github.com/stretchr/testify@latest +go mod tidy +``` + +Verify the resulting go.mod contains: +- `module github.com/salvacybersec/keyhunter` +- `go 1.22` (or 1.22.x) +- `github.com/spf13/cobra v1.10.2` +- `github.com/spf13/viper v1.21.0` +- `github.com/petar-dambovaliev/aho-corasick` (any version) +- `github.com/panjf2000/ants/v2 v2.12.0` +- `modernc.org/sqlite` (any v1.35.x) +- `github.com/charmbracelet/lipgloss` (any version) + +Do NOT add: chi, templ, telego, gocron — these are Phase 17-18 only. +Do NOT use CGO_ENABLED=1 or mattn/go-sqlite3. + + + cd /home/salva/Documents/apikey && grep -q 'module github.com/salvacybersec/keyhunter' go.mod && grep -q 'cobra v1.10.2' go.mod && grep -q 'modernc.org/sqlite' go.mod && echo "go.mod OK" + + + - go.mod contains `module github.com/salvacybersec/keyhunter` + - go.mod contains `github.com/spf13/cobra v1.10.2` (exact) + - go.mod contains `github.com/spf13/viper v1.21.0` (exact) + - go.mod contains `github.com/panjf2000/ants/v2 v2.12.0` (exact) + - go.mod contains `modernc.org/sqlite` (v1.35.x) + - go.mod contains `github.com/petar-dambovaliev/aho-corasick` + - go.mod contains `golang.org/x/crypto` + - go.mod contains `github.com/charmbracelet/lipgloss` + - go.sum exists and is non-empty + - `go mod verify` exits 0 + + go.mod and go.sum committed with all Phase 1 dependencies at correct versions + + + + Task 2: Create main.go entry point and test scaffolding + + main.go, + testdata/samples/openai_key.txt, + testdata/samples/anthropic_key.txt, + testdata/samples/multiple_keys.txt, + testdata/samples/no_keys.txt, + pkg/providers/registry_test.go, + 
pkg/storage/db_test.go,
+ pkg/engine/scanner_test.go
+
+
+ - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-VALIDATION.md (Wave 0 Requirements and Per-Task Verification Map)
+ - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Architecture Patterns, project structure diagram)
+
+
+Create the following files:
+
+**main.go** (must be under 30 lines):
+```go
+package main
+
+import "github.com/salvacybersec/keyhunter/cmd"
+
+func main() {
+	cmd.Execute()
+}
+```
+
+**testdata/samples/openai_key.txt**: file containing a synthetic (non-real) OpenAI-style key for scanner integration tests:
+```
+# Test file: synthetic OpenAI key pattern
+OPENAI_API_KEY=sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr1234
+```
+
+**testdata/samples/anthropic_key.txt**: file containing a synthetic Anthropic-style key (at least 93 characters after the `sk-ant-api03-` prefix, so it satisfies the Anthropic pattern defined in Plan 02):
+```
+# Test file: synthetic Anthropic key pattern
+export ANTHROPIC_API_KEY="sk-ant-api03-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-ABCDE"
+```
+
+**testdata/samples/multiple_keys.txt**: file with both key types:
+```
+# Multiple providers in one file
+OPENAI_API_KEY=sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr5678
+ANTHROPIC_API_KEY=sk-ant-api03-XYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-XYZAB
+```
+
+**testdata/samples/no_keys.txt**: file with no keys (negative test case):
+```
+# This file contains no API keys
+# Used to verify false-positive rate is zero for clean files
+Hello world
+```
+
+**pkg/providers/registry_test.go**: test stubs (will be filled by Plan 02):
+```go
+package providers_test
+
+import (
+	"testing"
+)
+
+// TestRegistryLoad verifies that provider YAML files are loaded from embed.FS.
+// Stub: will be implemented when registry.go exists (Plan 02).
+func TestRegistryLoad(t *testing.T) {
+	t.Skip("stub: implement after registry.go exists")
+}
+
+// TestProviderSchemaValidation verifies format_version and last_verified are required.
+// Stub: will be implemented when schema.go validation exists (Plan 02). +func TestProviderSchemaValidation(t *testing.T) { + t.Skip("stub — implement after schema.go validation exists") +} + +// TestAhoCorasickBuild verifies Aho-Corasick automaton builds from provider keywords. +// Stub: will be implemented when registry builds automaton (Plan 02). +func TestAhoCorasickBuild(t *testing.T) { + t.Skip("stub — implement after registry AC build exists") +} +``` + +**pkg/storage/db_test.go** — test stubs (will be filled by Plan 03): +```go +package storage_test + +import ( + "testing" +) + +// TestDBOpen verifies SQLite database opens and creates schema. +// Stub: will be implemented when db.go exists (Plan 03). +func TestDBOpen(t *testing.T) { + t.Skip("stub — implement after db.go exists") +} + +// TestEncryptDecryptRoundtrip verifies AES-256-GCM encrypt/decrypt roundtrip. +// Stub: will be implemented when encrypt.go exists (Plan 03). +func TestEncryptDecryptRoundtrip(t *testing.T) { + t.Skip("stub — implement after encrypt.go exists") +} + +// TestArgon2KeyDerivation verifies Argon2id produces 32-byte key deterministically. +// Stub: will be implemented when crypto.go exists (Plan 03). +func TestArgon2KeyDerivation(t *testing.T) { + t.Skip("stub — implement after crypto.go exists") +} +``` + +**pkg/engine/scanner_test.go** — test stubs (will be filled by Plan 04): +```go +package engine_test + +import ( + "testing" +) + +// TestShannonEntropy verifies the entropy function returns expected values. +// Stub: will be implemented when entropy.go exists (Plan 04). +func TestShannonEntropy(t *testing.T) { + t.Skip("stub — implement after entropy.go exists") +} + +// TestKeywordPreFilter verifies Aho-Corasick pre-filter rejects files without keywords. +// Stub: will be implemented when filter.go exists (Plan 04). 
+func TestKeywordPreFilter(t *testing.T) { + t.Skip("stub — implement after filter.go exists") +} + +// TestScannerPipeline verifies end-to-end scan of testdata returns expected findings. +// Stub: will be implemented when engine.go exists (Plan 04). +func TestScannerPipeline(t *testing.T) { + t.Skip("stub — implement after engine.go exists") +} +``` + +Create the `cmd/` package directory with a minimal stub so main.go compiles: + +**cmd/root.go** (minimal stub — will be replaced by Plan 05): +```go +package cmd + +import "os" + +// Execute is a stub. The real command tree is built in Plan 05. +func Execute() { + _ = os.Args +} +``` + +After creating all files, run `go build ./...` to confirm the module compiles. + + + cd /home/salva/Documents/apikey && go build ./... && go test ./... -short 2>&1 | grep -v "^--- SKIP" | grep -v "^SKIP" | grep -v "^ok" || true && echo "BUILD OK" + + + - `go build ./...` exits 0 with no errors + - `go test ./... -short` exits 0 (all stubs skip, no failures) + - main.go is under 30 lines + - testdata/samples/openai_key.txt contains `sk-proj-` prefix + - testdata/samples/anthropic_key.txt contains `sk-ant-api03-` prefix + - testdata/samples/no_keys.txt contains no key patterns + - pkg/providers/registry_test.go, pkg/storage/db_test.go, pkg/engine/scanner_test.go each exist with skip-based stubs + - cmd/root.go exists so `go build ./...` compiles + + Module compiles, test stubs exist, testdata fixtures created. Subsequent plans can now add production code and make tests green. + + + + + +After both tasks: +- `cd /home/salva/Documents/apikey && go build ./...` exits 0 +- `go test ./... 
-short` exits 0 +- `grep -r 'sk-proj-' testdata/` finds the OpenAI test fixture +- `grep -r 'sk-ant-api03-' testdata/` finds the Anthropic test fixture +- go.mod has all required dependencies at specified versions + + + +- go.mod initialized with module path `github.com/salvacybersec/keyhunter` and Go 1.22 +- All 10 Phase 1 dependencies installed at correct versions +- main.go under 30 lines, compiles successfully +- 3 test stub files exist (providers, storage, engine) +- 4 testdata fixture files exist (openai key, anthropic key, multiple keys, no keys) +- `go build ./...` and `go test ./... -short` both exit 0 + + + +After completion, create `.planning/phases/01-foundation/01-01-SUMMARY.md` following the summary template. + diff --git a/.planning/phases/01-foundation/01-02-PLAN.md b/.planning/phases/01-foundation/01-02-PLAN.md new file mode 100644 index 0000000..ea31e83 --- /dev/null +++ b/.planning/phases/01-foundation/01-02-PLAN.md @@ -0,0 +1,663 @@ +--- +phase: 01-foundation +plan: 02 +type: execute +wave: 1 +depends_on: [01-01] +files_modified: + - providers/openai.yaml + - providers/anthropic.yaml + - providers/huggingface.yaml + - pkg/providers/schema.go + - pkg/providers/loader.go + - pkg/providers/registry.go + - pkg/providers/registry_test.go +autonomous: true +requirements: [CORE-02, CORE-03, CORE-06, PROV-10] + +must_haves: + truths: + - "Provider YAML files are embedded at compile time — no filesystem access at runtime" + - "Registry loads all YAML files from embed.FS and returns a slice of Provider structs" + - "Provider schema validation rejects YAML missing format_version or last_verified" + - "Aho-Corasick automaton is built from all provider keywords at registry init" + - "keyhunter providers list command lists providers (tested via registry methods)" + artifacts: + - path: "providers/openai.yaml" + provides: "Reference provider definition with all schema fields" + contains: "format_version" + - path: "pkg/providers/schema.go" + provides: 
"Provider, Pattern, VerifySpec Go structs with UnmarshalYAML validation" + exports: ["Provider", "Pattern", "VerifySpec"] + - path: "pkg/providers/registry.go" + provides: "Registry struct with List, Get, Stats, AC methods" + exports: ["Registry", "NewRegistry"] + - path: "pkg/providers/loader.go" + provides: "embed.FS declaration and fs.WalkDir loading logic" + contains: "go:embed" + key_links: + - from: "pkg/providers/loader.go" + to: "providers/*.yaml" + via: "//go:embed directive" + pattern: "go:embed.*providers" + - from: "pkg/providers/registry.go" + to: "github.com/petar-dambovaliev/aho-corasick" + via: "AC automaton build at NewRegistry()" + pattern: "ahocorasick" + - from: "pkg/providers/schema.go" + to: "format_version and last_verified YAML fields" + via: "UnmarshalYAML validation" + pattern: "UnmarshalYAML" +--- + + +Build the provider registry: YAML schema structs with validation, embed.FS loader, in-memory registry with List/Get/Stats/AC methods, and three reference provider YAML definitions. The Aho-Corasick automaton is built from all provider keywords at registry initialization. + +Purpose: Every downstream subsystem (scan engine, CLI providers command, verification engine) depends on the Registry interface. This plan establishes the stable contract they build against. +Output: providers/*.yaml, pkg/providers/{schema,loader,registry}.go, registry_test.go (stubs filled). 
+ + + +@$HOME/.claude/get-shit-done/workflows/execute-plan.md +@$HOME/.claude/get-shit-done/templates/summary.md + + + +@.planning/phases/01-foundation/01-RESEARCH.md +@.planning/phases/01-foundation/01-01-SUMMARY.md + + + +Full provider YAML structure: +```yaml +format_version: 1 +name: openai +display_name: OpenAI +tier: 1 +last_verified: "2026-04-04" +keywords: + - "sk-proj-" + - "openai" +patterns: + - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}' + entropy_min: 3.5 + confidence: high +verify: + method: GET + url: https://api.openai.com/v1/models + headers: + Authorization: "Bearer {KEY}" + valid_status: [200] + invalid_status: [401, 403] +``` + + +Provider struct fields: + FormatVersion int (yaml:"format_version" — must be >= 1) + Name string (yaml:"name") + DisplayName string (yaml:"display_name") + Tier int (yaml:"tier") + LastVerified string (yaml:"last_verified" — must be non-empty) + Keywords []string (yaml:"keywords") + Patterns []Pattern (yaml:"patterns") + Verify VerifySpec (yaml:"verify") + +Pattern struct fields: + Regex string (yaml:"regex") + EntropyMin float64 (yaml:"entropy_min") + Confidence string (yaml:"confidence" — "high", "medium", "low") + +VerifySpec struct fields: + Method string (yaml:"method") + URL string (yaml:"url") + Headers map[string]string (yaml:"headers") + ValidStatus []int (yaml:"valid_status") + InvalidStatus []int (yaml:"invalid_status") + + +type Registry struct { ... } +func NewRegistry() (*Registry, error) +func (r *Registry) List() []Provider +func (r *Registry) Get(name string) (Provider, bool) +func (r *Registry) Stats() RegistryStats // {Total int, ByTier map[int]int, ByConfidence map[string]int} +func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built automaton + + +The embed directive must reference providers relative to loader.go location. +loader.go is at pkg/providers/loader.go. +providers/ directory is at project root. 
+Do NOT use: //go:embed ../../providers/*.yaml
+go:embed patterns are resolved relative to the directory containing the source file
+and may not contain "." or ".." path elements, so loader.go cannot reach the
+root-level providers/ directory. RESEARCH.md shows that directive, but it will not compile.
+
+Rejected alternatives:
+  - A providers_embed.go at the project root (same dir as go.mod): it would have to be
+    package main, which breaks package separation.
+  - Symlinking providers/ into the package: moving the files is the cleaner option.
+
+Resolution: keep the embedded YAML files inside the providers package at
+pkg/providers/definitions/*.yaml (pkg/providers/definitions/openai.yaml etc.) and use:
+  //go:embed definitions/*.yaml
+Paths inside the embed.FS are then of the form "definitions/openai.yaml".
+Update files_modified accordingly.
+ + + + + + + Task 1: Provider YAML schema structs with validation + pkg/providers/schema.go, providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 1: Provider Registry, Provider YAML schema section, PROV-10 row in requirements table) + - /home/salva/Documents/apikey/.planning/research/ARCHITECTURE.md (Provider Registry component, YAML schema example) + + + - Test 1: Provider with format_version=0 → UnmarshalYAML returns error "format_version must be >= 1" + - Test 2: Provider with empty last_verified → UnmarshalYAML returns error "last_verified is required" + - Test 3: Valid provider YAML → UnmarshalYAML succeeds, Provider.Name == "openai" + - Test 4: Provider with no patterns → loaded successfully (patterns list can be empty for schema-only providers) + - Test 5: Pattern.Confidence not in {"high","medium","low"} → error "confidence must be high, medium, or low" + + +Create pkg/providers/schema.go: + +```go +package providers + +import ( + "fmt" + "gopkg.in/yaml.v3" +) + +// Provider represents a single API key provider definition loaded from YAML. +type Provider struct { + FormatVersion int `yaml:"format_version"` + Name string `yaml:"name"` + DisplayName string `yaml:"display_name"` + Tier int `yaml:"tier"` + LastVerified string `yaml:"last_verified"` + Keywords []string `yaml:"keywords"` + Patterns []Pattern `yaml:"patterns"` + Verify VerifySpec `yaml:"verify"` +} + +// Pattern defines a single regex pattern for API key detection. +type Pattern struct { + Regex string `yaml:"regex"` + EntropyMin float64 `yaml:"entropy_min"` + Confidence string `yaml:"confidence"` +} + +// VerifySpec defines how to verify a key is live (used by Phase 5 verification engine). 
+type VerifySpec struct { + Method string `yaml:"method"` + URL string `yaml:"url"` + Headers map[string]string `yaml:"headers"` + ValidStatus []int `yaml:"valid_status"` + InvalidStatus []int `yaml:"invalid_status"` +} + +// RegistryStats holds aggregate statistics about loaded providers. +type RegistryStats struct { + Total int + ByTier map[int]int + ByConfidence map[string]int +} + +// UnmarshalYAML implements yaml.Unmarshaler with schema validation (satisfies PROV-10). +func (p *Provider) UnmarshalYAML(value *yaml.Node) error { + // Use a type alias to avoid infinite recursion + type ProviderAlias Provider + var alias ProviderAlias + if err := value.Decode(&alias); err != nil { + return err + } + if alias.FormatVersion < 1 { + return fmt.Errorf("provider %q: format_version must be >= 1 (got %d)", alias.Name, alias.FormatVersion) + } + if alias.LastVerified == "" { + return fmt.Errorf("provider %q: last_verified is required", alias.Name) + } + validConfidences := map[string]bool{"high": true, "medium": true, "low": true, "": true} + for _, pat := range alias.Patterns { + if !validConfidences[pat.Confidence] { + return fmt.Errorf("provider %q: pattern confidence %q must be high, medium, or low", alias.Name, pat.Confidence) + } + } + *p = Provider(alias) + return nil +} +``` + +Create the three reference YAML provider definitions. These are SCHEMA EXAMPLES for Phase 1; full pattern libraries come in Phase 2-3. 
+ +**providers/openai.yaml:** +```yaml +format_version: 1 +name: openai +display_name: OpenAI +tier: 1 +last_verified: "2026-04-04" +keywords: + - "sk-proj-" + - "openai" +patterns: + - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}' + entropy_min: 3.5 + confidence: high +verify: + method: GET + url: https://api.openai.com/v1/models + headers: + Authorization: "Bearer {KEY}" + valid_status: [200] + invalid_status: [401, 403] +``` + +**providers/anthropic.yaml:** +```yaml +format_version: 1 +name: anthropic +display_name: Anthropic +tier: 1 +last_verified: "2026-04-04" +keywords: + - "sk-ant-api03-" + - "anthropic" +patterns: + - regex: 'sk-ant-api03-[A-Za-z0-9_\-]{93,}' + entropy_min: 3.5 + confidence: high +verify: + method: GET + url: https://api.anthropic.com/v1/models + headers: + x-api-key: "{KEY}" + anthropic-version: "2023-06-01" + valid_status: [200] + invalid_status: [401, 403] +``` + +**providers/huggingface.yaml:** +```yaml +format_version: 1 +name: huggingface +display_name: HuggingFace +tier: 3 +last_verified: "2026-04-04" +keywords: + - "hf_" + - "huggingface" +patterns: + - regex: 'hf_[A-Za-z0-9]{34,}' + entropy_min: 3.5 + confidence: high +verify: + method: GET + url: https://huggingface.co/api/whoami-v2 + headers: + Authorization: "Bearer {KEY}" + valid_status: [200] + invalid_status: [401, 403] +``` + + + cd /home/salva/Documents/apikey && go build ./pkg/providers/... && go test ./pkg/providers/... 
-run TestProviderSchemaValidation -v 2>&1 | head -30 + + + - `go build ./pkg/providers/...` exits 0 + - providers/openai.yaml contains `format_version: 1` and `last_verified` + - providers/anthropic.yaml contains `format_version: 1` and `last_verified` + - providers/huggingface.yaml contains `format_version: 1` and `last_verified` + - pkg/providers/schema.go exports: Provider, Pattern, VerifySpec, RegistryStats + - Provider.UnmarshalYAML returns error when format_version < 1 + - Provider.UnmarshalYAML returns error when last_verified is empty + - `grep -q 'UnmarshalYAML' pkg/providers/schema.go` exits 0 + + Provider schema structs exist with validation. Three reference YAML files exist with all required fields. + + + + Task 2: Embed loader, registry with Aho-Corasick, and filled test stubs + pkg/providers/loader.go, pkg/providers/registry.go, pkg/providers/registry_test.go + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 1: Provider Registry with Compile-Time Embed — exact code example) + - /home/salva/Documents/apikey/pkg/providers/schema.go (types just created in Task 1) + + + - Test 1: NewRegistry() loads 3 providers from embedded YAML → registry.List() returns slice of length 3 + - Test 2: registry.Get("openai") → returns Provider with Name=="openai", bool==true + - Test 3: registry.Get("nonexistent") → returns zero Provider, bool==false + - Test 4: registry.Stats().Total == 3 and Stats().ByTier[1] == 2 (openai + anthropic are tier 1) + - Test 5: AC automaton built — registry.AC().FindAll("sk-proj-abc") returns non-empty slice + - Test 6: AC automaton does NOT match — registry.AC().FindAll("hello world") returns empty slice + + +IMPORTANT NOTE ON EMBED PATHS: Go's embed package does NOT allow paths containing "..". +Since loader.go is at pkg/providers/loader.go, it CANNOT embed ../../providers/*.yaml. 
+
+Solution: place the embedded provider YAML files at pkg/providers/definitions/*.yaml and use:
+  //go:embed definitions/*.yaml
+
+For Phase 1, keep BOTH locations:
+  pkg/providers/definitions/openai.yaml (embedded into the binary)
+  providers/openai.yaml (user-facing reference copy created in Task 1)
+Copy the three YAML files from providers/ to pkg/providers/definitions/ at the end of this task.
+
+Create **pkg/providers/loader.go**:
+```go
+package providers
+
+import (
+	"embed"
+	"fmt"
+	"io/fs"
+	"path/filepath"
+
+	"gopkg.in/yaml.v3"
+)
+
+//go:embed definitions/*.yaml
+var definitionsFS embed.FS
+
+// loadProviders reads all YAML files from the embedded definitions FS.
+func loadProviders() ([]Provider, error) {
+	var providers []Provider
+	err := fs.WalkDir(definitionsFS, "definitions", func(path string, d fs.DirEntry, err error) error {
+		if err != nil {
+			return err
+		}
+		if d.IsDir() || filepath.Ext(path) != ".yaml" {
+			return nil
+		}
+		data, err := definitionsFS.ReadFile(path)
+		if err != nil {
+			return fmt.Errorf("reading provider file %s: %w", path, err)
+		}
+		var p Provider
+		if err := yaml.Unmarshal(data, &p); err != nil {
+			return fmt.Errorf("parsing provider %s: %w", path, err)
+		}
+		providers = append(providers, p)
+		return nil
+	})
+	return providers, err
+}
+```
+
+Create **pkg/providers/registry.go**:
+```go
+package providers
+
+import (
+	"fmt"
+
+	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
+)
+
+// Registry is the in-memory store of all loaded provider definitions.
+// It is initialized once at startup and is safe for concurrent reads.
+type Registry struct {
+	providers []Provider
+	index     map[string]int          // name -> slice index
+	ac        ahocorasick.AhoCorasick // pre-built automaton for keyword pre-filter
+}
+
+// NewRegistry loads all embedded provider YAML files, validates them, builds the
+// Aho-Corasick automaton from all provider keywords, and returns the Registry.
+func NewRegistry() (*Registry, error) {
+	providers, err := loadProviders()
+	if err != nil {
+		return nil, fmt.Errorf("loading providers: %w", err)
+	}
+
+	index := make(map[string]int, len(providers))
+	var keywords []string
+	for i, p := range providers {
+		index[p.Name] = i
+		keywords = append(keywords, p.Keywords...)
+	}
+
+	builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
+	ac := builder.Build(keywords)
+
+	return &Registry{
+		providers: providers,
+		index:     index,
+		ac:        ac,
+	}, nil
+}
+
+// List returns all loaded providers.
+func (r *Registry) List() []Provider {
+	return r.providers
+}
+
+// Get returns a provider by name and a boolean indicating whether it was found.
+func (r *Registry) Get(name string) (Provider, bool) {
+	idx, ok := r.index[name]
+	if !ok {
+		return Provider{}, false
+	}
+	return r.providers[idx], true
+}
+
+// Stats returns aggregate statistics about the loaded providers.
+func (r *Registry) Stats() RegistryStats {
+	stats := RegistryStats{
+		Total:        len(r.providers),
+		ByTier:       make(map[int]int),
+		ByConfidence: make(map[string]int),
+	}
+	for _, p := range r.providers {
+		stats.ByTier[p.Tier]++
+		for _, pat := range p.Patterns {
+			stats.ByConfidence[pat.Confidence]++
+		}
+	}
+	return stats
+}
+
+// AC returns the pre-built Aho-Corasick automaton for keyword pre-filtering.
+func (r *Registry) AC() ahocorasick.AhoCorasick {
+	return r.ac
+}
+```
+
+Then copy the three YAML files into the embed location:
+```bash
+mkdir -p /home/salva/Documents/apikey/pkg/providers/definitions
+cp /home/salva/Documents/apikey/providers/openai.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
+cp /home/salva/Documents/apikey/providers/anthropic.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
+cp /home/salva/Documents/apikey/providers/huggingface.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
+```
+
+Finally, fill in **pkg/providers/registry_test.go** (replacing the stubs from Plan 01). The file imports `gopkg.in/yaml.v3` because TestProviderSchemaValidation calls `yaml.Unmarshal` directly:
+```go
+package providers_test
+
+import (
+	"testing"
+
+	"github.com/salvacybersec/keyhunter/pkg/providers"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+	"gopkg.in/yaml.v3"
+)
+
+func TestRegistryLoad(t *testing.T) {
+	reg, err := providers.NewRegistry()
+	require.NoError(t, err)
+	assert.GreaterOrEqual(t, len(reg.List()), 3, "expected at least 3 providers loaded")
+}
+
+func TestRegistryGet(t *testing.T) {
+	reg, err := providers.NewRegistry()
+	require.NoError(t, err)
+
+	p, ok := reg.Get("openai")
+	assert.True(t, ok)
+	assert.Equal(t, "openai", p.Name)
+	assert.Equal(t, 1, p.Tier)
+
+	_, found := reg.Get("nonexistent-provider")
+	assert.False(t, found)
+}
+
+func TestRegistryStats(t *testing.T) {
+	reg, err := providers.NewRegistry()
+	require.NoError(t, err)
+
+	stats := reg.Stats()
+	assert.GreaterOrEqual(t, stats.Total, 3)
+	assert.GreaterOrEqual(t, stats.ByTier[1], 2, "expected at least 2 tier-1 providers")
+}
+
+func TestAhoCorasickBuild(t *testing.T) {
+	reg, err := providers.NewRegistry()
+	require.NoError(t, err)
+
+	ac := reg.AC()
+
+	// Should match an OpenAI keyword.
+	matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-abc")
+	assert.NotEmpty(t, matches, "expected AC to find a keyword in a string containing 'sk-proj-'")
+
+	// Should not match clean text.
+	noMatches := ac.FindAll("hello world nothing here")
+	assert.Empty(t, noMatches, "expected no AC matches in text with no provider keywords")
+}
+
+func TestProviderSchemaValidation(t *testing.T) {
+	invalid := []byte("format_version: 0\nname: invalid\nlast_verified: \"\"\n")
+	var p providers.Provider
+	err := yaml.Unmarshal(invalid, &p)
+	// require (not assert) so the err.Error() call that follows cannot hit a nil error.
+	require.Error(t, err, "expected validation error for format_version=0")
+ assert.Contains(t, err.Error(), "format_version") +} +``` + + + cd /home/salva/Documents/apikey && go test ./pkg/providers/... -v -count=1 2>&1 | tail -20 + + + - `go test ./pkg/providers/... -v` exits 0 with all 5 tests PASS (not SKIP) + - TestRegistryLoad passes with >= 3 providers + - TestRegistryGet passes — "openai" found, "nonexistent" not found + - TestRegistryStats passes — Total >= 3 + - TestAhoCorasickBuild passes — "sk-proj-" match found, "hello world" empty + - TestProviderSchemaValidation passes — error on format_version=0 + - `grep -r 'go:embed' pkg/providers/loader.go` exits 0 + - pkg/providers/definitions/ directory exists with 3 YAML files + + Registry loads providers from embedded YAML, builds Aho-Corasick automaton, exposes List/Get/Stats/AC. All 5 tests pass. + + + + + +After both tasks: +- `go test ./pkg/providers/... -v -count=1` exits 0 with 5 tests PASS +- `go build ./...` still exits 0 +- `grep -q 'format_version' providers/openai.yaml providers/anthropic.yaml providers/huggingface.yaml` exits 0 +- `grep -q 'go:embed' pkg/providers/loader.go` exits 0 +- pkg/providers/definitions/ has 3 YAML files (same content as providers/) + + + +- 3 reference provider YAML files exist in providers/ and pkg/providers/definitions/ with format_version and last_verified +- Provider schema validates format_version >= 1 and non-empty last_verified (PROV-10) +- Registry loads providers from embed.FS at compile time (CORE-02) +- Aho-Corasick automaton built from all keywords at NewRegistry() (CORE-06) +- Registry exposes List(), Get(), Stats(), AC() (CORE-03) +- 5 provider tests all pass + + + +After completion, create `.planning/phases/01-foundation/01-02-SUMMARY.md` following the summary template. 
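The PROV-10 rule in the success criteria (format_version >= 1 and a non-empty last_verified) can be sketched as a small standalone check. This is an illustration under stated assumptions, not the plan's `UnmarshalYAML` implementation: the `providerMeta` type and the `validate` function are hypothetical names, and YAML decoding is deliberately left out so the sketch needs only the standard library.

```go
package main

import "fmt"

// providerMeta is a hypothetical, trimmed-down view of the provider schema.
type providerMeta struct {
	FormatVersion int
	Name          string
	LastVerified  string
}

// validate applies the two load-time rules: format_version must be >= 1
// and last_verified must be non-empty.
func validate(p providerMeta) error {
	if p.FormatVersion < 1 {
		return fmt.Errorf("provider %q: format_version must be >= 1, got %d", p.Name, p.FormatVersion)
	}
	if p.LastVerified == "" {
		return fmt.Errorf("provider %q: last_verified is required", p.Name)
	}
	return nil
}

func main() {
	// Rejected: format_version is 0.
	fmt.Println(validate(providerMeta{FormatVersion: 0, Name: "invalid"}))
	// Accepted: both rules hold (the date is an arbitrary example value).
	fmt.Println(validate(providerMeta{FormatVersion: 1, Name: "openai", LastVerified: "2026-04-04"}))
}
```

Including the field name in the error message is what lets a test such as TestProviderSchemaValidation assert on the substring "format_version".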
+ diff --git a/.planning/phases/01-foundation/01-03-PLAN.md b/.planning/phases/01-foundation/01-03-PLAN.md new file mode 100644 index 0000000..b081e0d --- /dev/null +++ b/.planning/phases/01-foundation/01-03-PLAN.md @@ -0,0 +1,634 @@ +--- +phase: 01-foundation +plan: 03 +type: execute +wave: 1 +depends_on: [01-01] +files_modified: + - pkg/storage/schema.sql + - pkg/storage/encrypt.go + - pkg/storage/crypto.go + - pkg/storage/db.go + - pkg/storage/findings.go + - pkg/storage/db_test.go +autonomous: true +requirements: [STOR-01, STOR-02, STOR-03] + +must_haves: + truths: + - "SQLite database opens, runs migrations from embedded schema.sql, and closes cleanly" + - "AES-256-GCM Encrypt/Decrypt roundtrip produces the original plaintext" + - "Argon2id DeriveKey with the same passphrase and salt always returns the same 32-byte key" + - "A Finding can be saved to the database with the key_value stored encrypted and retrieved as plaintext" + - "The raw database file does NOT contain plaintext API key values" + artifacts: + - path: "pkg/storage/encrypt.go" + provides: "Encrypt(plaintext, key) and Decrypt(ciphertext, key) using AES-256-GCM" + exports: ["Encrypt", "Decrypt"] + - path: "pkg/storage/crypto.go" + provides: "DeriveKey(passphrase, salt) using Argon2id RFC 9106 params" + exports: ["DeriveKey", "NewSalt"] + - path: "pkg/storage/db.go" + provides: "DB struct with Open(), Close(), WAL mode, embedded schema migration" + exports: ["DB", "Open"] + - path: "pkg/storage/findings.go" + provides: "SaveFinding(finding, encKey) and ListFindings(encKey) CRUD" + exports: ["SaveFinding", "ListFindings", "Finding"] + - path: "pkg/storage/schema.sql" + provides: "CREATE TABLE statements for findings, scans, settings" + contains: "CREATE TABLE IF NOT EXISTS findings" + key_links: + - from: "pkg/storage/findings.go" + to: "pkg/storage/encrypt.go" + via: "Encrypt() called before INSERT, Decrypt() called after SELECT" + pattern: "Encrypt|Decrypt" + - from: "pkg/storage/db.go" + to: 
"pkg/storage/schema.sql" + via: "//go:embed schema.sql and db.Exec on open" + pattern: "go:embed.*schema" + - from: "pkg/storage/crypto.go" + to: "golang.org/x/crypto/argon2" + via: "argon2.IDKey call" + pattern: "argon2\\.IDKey" +--- + + +Build the storage layer: AES-256-GCM column encryption, Argon2id key derivation, SQLite database with WAL mode and embedded schema, and Finding CRUD operations that transparently encrypt key values on write and decrypt on read. + +Purpose: Scanner results from Plan 04 and CLI commands from Plan 05 need a storage layer to persist findings. The encryption contract (Encrypt/Decrypt/DeriveKey) must exist before the scanner pipeline can store keys. +Output: pkg/storage/{encrypt,crypto,db,findings,schema}.go and db_test.go (stubs filled). + + + +@$HOME/.claude/get-shit-done/workflows/execute-plan.md +@$HOME/.claude/get-shit-done/templates/summary.md + + + +@.planning/phases/01-foundation/01-RESEARCH.md +@.planning/phases/01-foundation/01-01-SUMMARY.md + + + +func Encrypt(plaintext []byte, key []byte) ([]byte, error) + // key must be exactly 32 bytes (AES-256) + // nonce prepended to ciphertext in returned []byte + // uses crypto/aes + crypto/cipher GCM + +func Decrypt(ciphertext []byte, key []byte) ([]byte, error) + // expects nonce prepended format from Encrypt() + // returns ErrCiphertextTooShort if len < nonceSize + + +func DeriveKey(passphrase []byte, salt []byte) []byte + // params: time=1, memory=64*1024, threads=4, keyLen=32 + // returns exactly 32 bytes deterministically + +func NewSalt() ([]byte, error) + // generates 16 random bytes via crypto/rand + + +findings table columns: + id INTEGER PRIMARY KEY AUTOINCREMENT + scan_id INTEGER REFERENCES scans(id) + provider_name TEXT NOT NULL + key_value BLOB NOT NULL -- AES-256-GCM encrypted, nonce prepended + key_masked TEXT NOT NULL -- first8...last4, stored plaintext for display + confidence TEXT NOT NULL -- "high", "medium", "low" + source_path TEXT + source_type TEXT -- "file", 
"dir", "git", "stdin", "url" + line_number INTEGER + created_at DATETIME DEFAULT CURRENT_TIMESTAMP + +scans table columns: + id INTEGER PRIMARY KEY AUTOINCREMENT + started_at DATETIME NOT NULL + finished_at DATETIME + source_path TEXT + finding_count INTEGER DEFAULT 0 + created_at DATETIME DEFAULT CURRENT_TIMESTAMP + +settings table columns: + key TEXT PRIMARY KEY + value TEXT NOT NULL + updated_at DATETIME DEFAULT CURRENT_TIMESTAMP + + +type Finding struct { + ID int64 + ScanID int64 + ProviderName string + KeyValue string // plaintext — encrypted before storage + KeyMasked string // first8chars...last4chars + Confidence string + SourcePath string + SourceType string + LineNumber int +} + + +import _ "modernc.org/sqlite" +// driver registered as "sqlite" (NOT "sqlite3") +db, err := sql.Open("sqlite", dataSourceName) + + + + + + + Task 1: AES-256-GCM encryption and Argon2id key derivation + pkg/storage/encrypt.go, pkg/storage/crypto.go + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 3: AES-256-GCM Column Encryption and Pattern 4: Argon2id Key Derivation — exact code examples) + + + - Test 1: Encrypt then Decrypt same key → returns original plaintext exactly + - Test 2: Encrypt produces output longer than input (nonce + tag overhead) + - Test 3: Two Encrypt calls on same plaintext → different ciphertext (random nonce) + - Test 4: Decrypt with wrong key → returns error (GCM authentication fails) + - Test 5: DeriveKey with same passphrase+salt → same 32-byte output (deterministic) + - Test 6: DeriveKey output is exactly 32 bytes + - Test 7: NewSalt() returns 16 bytes, two calls return different values + + +Create **pkg/storage/encrypt.go**: +```go +package storage + +import ( + "crypto/aes" + "crypto/cipher" + "crypto/rand" + "errors" + "io" +) + +// ErrCiphertextTooShort is returned when ciphertext is shorter than the GCM nonce size. 
+var ErrCiphertextTooShort = errors.New("ciphertext too short")
+
+// Encrypt encrypts plaintext using AES-256-GCM with a random nonce.
+// The nonce is prepended to the returned ciphertext.
+// key must be exactly 32 bytes (AES-256).
+func Encrypt(plaintext []byte, key []byte) ([]byte, error) {
+	block, err := aes.NewCipher(key)
+	if err != nil {
+		return nil, err
+	}
+	gcm, err := cipher.NewGCM(block)
+	if err != nil {
+		return nil, err
+	}
+	nonce := make([]byte, gcm.NonceSize())
+	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
+		return nil, err
+	}
+	// Seal appends ciphertext and tag to its first argument; passing nonce
+	// there yields the layout nonce || ciphertext || tag.
+	ciphertext := gcm.Seal(nonce, nonce, plaintext, nil)
+	return ciphertext, nil
+}
+
+// Decrypt decrypts ciphertext produced by Encrypt.
+// Expects the nonce to be prepended to the ciphertext.
+func Decrypt(ciphertext []byte, key []byte) ([]byte, error) {
+	block, err := aes.NewCipher(key)
+	if err != nil {
+		return nil, err
+	}
+	gcm, err := cipher.NewGCM(block)
+	if err != nil {
+		return nil, err
+	}
+	nonceSize := gcm.NonceSize()
+	if len(ciphertext) < nonceSize {
+		return nil, ErrCiphertextTooShort
+	}
+	nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:]
+	return gcm.Open(nil, nonce, ciphertext, nil)
+}
+```
+
+Create **pkg/storage/crypto.go**:
+```go
+package storage
+
+import (
+	"crypto/rand"
+
+	"golang.org/x/crypto/argon2"
+)
+
+const (
+	argon2Time    uint32 = 1
+	argon2Memory  uint32 = 64 * 1024 // in KiB, i.e. 64 MiB
+	argon2Threads uint8  = 4
+	argon2KeyLen  uint32 = 32 // AES-256 key length
+	saltSize             = 16
+)
+
+// DeriveKey produces a 32-byte AES-256 key from a passphrase and salt using Argon2id.
+// The parameters (time=1, memory=64 MiB, threads=4) follow the guidance in the
+// golang.org/x/crypto/argon2 package documentation. Note that RFC 9106 pairs
+// 64 MiB of memory with t=3 in its second recommended option; raising
+// argon2Time changes every derived key, so decide before any data is stored.
+// Given the same passphrase and salt, always returns the same key.
+func DeriveKey(passphrase []byte, salt []byte) []byte {
+	return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen)
+}
+
+// NewSalt generates a cryptographically random 16-byte salt.
+// Store alongside the database and reuse on each key derivation.
+func NewSalt() ([]byte, error) {
+	salt := make([]byte, saltSize)
+	if _, err := rand.Read(salt); err != nil {
+		return nil, err
+	}
+	return salt, nil
+}
+```
+
+
+    cd /home/salva/Documents/apikey && go build ./pkg/storage/... && echo "BUILD OK"
+
+
+    - `go build ./pkg/storage/...` exits 0
+    - pkg/storage/encrypt.go exports: Encrypt, Decrypt, ErrCiphertextTooShort
+    - pkg/storage/crypto.go exports: DeriveKey, NewSalt
+    - `grep -q 'argon2\.IDKey' pkg/storage/crypto.go` exits 0
+    - `grep -q 'crypto/aes' pkg/storage/encrypt.go` exits 0
+    - `grep -q 'cipher\.NewGCM' pkg/storage/encrypt.go` exits 0
+
+  Encrypt/Decrypt and DeriveKey/NewSalt exist and compile. Encryption uses AES-256-GCM with a random nonce. Key derivation uses Argon2id with time=1, memory=64 MiB, threads=4.
+ + + + Task 2: SQLite database, schema, Finding CRUD, and filled test stubs + pkg/storage/schema.sql, pkg/storage/db.go, pkg/storage/findings.go, pkg/storage/db_test.go + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (STOR-01 row, Pattern 1 for embed usage pattern) + - /home/salva/Documents/apikey/pkg/storage/encrypt.go (Encrypt/Decrypt signatures) + - /home/salva/Documents/apikey/pkg/storage/crypto.go (DeriveKey signature) + + + - Test 1: Open(":memory:") returns *DB without error, schema tables exist + - Test 2: Encrypt/Decrypt roundtrip — Encrypt([]byte("sk-proj-abc"), key) then Decrypt returns "sk-proj-abc" + - Test 3: DeriveKey(passphrase, salt) twice returns identical 32 bytes + - Test 4: NewSalt() twice returns different slices + - Test 5: SaveFinding stores finding → ListFindings decrypts and returns KeyValue == "sk-proj-test" + - Test 6: Database file (when not :memory:) does NOT contain literal "sk-proj-test" in raw bytes + + +Create **pkg/storage/schema.sql**: +```sql +-- KeyHunter database schema +-- Version: 1 + +CREATE TABLE IF NOT EXISTS scans ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + started_at DATETIME NOT NULL, + finished_at DATETIME, + source_path TEXT, + finding_count INTEGER DEFAULT 0, + created_at DATETIME DEFAULT CURRENT_TIMESTAMP +); + +CREATE TABLE IF NOT EXISTS findings ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + scan_id INTEGER REFERENCES scans(id), + provider_name TEXT NOT NULL, + key_value BLOB NOT NULL, + key_masked TEXT NOT NULL, + confidence TEXT NOT NULL, + source_path TEXT, + source_type TEXT, + line_number INTEGER, + created_at DATETIME DEFAULT CURRENT_TIMESTAMP +); + +CREATE TABLE IF NOT EXISTS settings ( + key TEXT PRIMARY KEY, + value TEXT NOT NULL, + updated_at DATETIME DEFAULT CURRENT_TIMESTAMP +); + +-- Indexes for common queries +CREATE INDEX IF NOT EXISTS idx_findings_scan_id ON findings(scan_id); +CREATE INDEX IF NOT EXISTS idx_findings_provider ON findings(provider_name); +CREATE INDEX 
IF NOT EXISTS idx_findings_created ON findings(created_at DESC); +``` + +Create **pkg/storage/db.go**: +```go +package storage + +import ( + "database/sql" + _ "embed" + "fmt" + + _ "modernc.org/sqlite" +) + +//go:embed schema.sql +var schemaSQLBytes []byte + +// DB wraps the sql.DB connection with KeyHunter-specific behavior. +type DB struct { + sql *sql.DB +} + +// Open opens or creates a SQLite database at path, runs embedded schema migrations, +// and enables WAL mode for better concurrent read performance. +// Use ":memory:" for tests. +func Open(path string) (*DB, error) { + sqlDB, err := sql.Open("sqlite", path) + if err != nil { + return nil, fmt.Errorf("opening database: %w", err) + } + + // Enable WAL mode for concurrent reads + if _, err := sqlDB.Exec("PRAGMA journal_mode=WAL"); err != nil { + sqlDB.Close() + return nil, fmt.Errorf("enabling WAL mode: %w", err) + } + + // Enable foreign keys + if _, err := sqlDB.Exec("PRAGMA foreign_keys=ON"); err != nil { + sqlDB.Close() + return nil, fmt.Errorf("enabling foreign keys: %w", err) + } + + // Run schema migrations + if _, err := sqlDB.Exec(string(schemaSQLBytes)); err != nil { + sqlDB.Close() + return nil, fmt.Errorf("running schema migrations: %w", err) + } + + return &DB{sql: sqlDB}, nil +} + +// Close closes the underlying database connection. +func (db *DB) Close() error { + return db.sql.Close() +} + +// SQL returns the underlying sql.DB for advanced use cases. +func (db *DB) SQL() *sql.DB { + return db.sql +} +``` + +Create **pkg/storage/findings.go**: +```go +package storage + +import ( + "fmt" + "time" +) + +// Finding represents a detected API key with metadata. +// KeyValue is always plaintext in this struct — encryption happens at the storage boundary. 
+type Finding struct {
+	ID           int64
+	ScanID       int64  // 0 means "no associated scan" and is stored as NULL
+	ProviderName string
+	KeyValue     string // plaintext; encrypted before storage, decrypted after retrieval
+	KeyMasked    string // first8...last4, stored plaintext
+	Confidence   string
+	SourcePath   string
+	SourceType   string
+	LineNumber   int
+	CreatedAt    time.Time
+}
+
+// MaskKey returns the masked form of a key: first 8 chars + "..." + last 4 chars.
+// If the key is shorter than 12 chars, it returns "****" so that no part of a short key leaks.
+func MaskKey(key string) string {
+	if len(key) < 12 {
+		return "****"
+	}
+	return key[:8] + "..." + key[len(key)-4:]
+}
+
+// SaveFinding encrypts the finding's KeyValue and persists the finding to the database.
+// encKey must be a 32-byte AES-256 key (from DeriveKey).
+func (db *DB) SaveFinding(f Finding, encKey []byte) (int64, error) {
+	encrypted, err := Encrypt([]byte(f.KeyValue), encKey)
+	if err != nil {
+		return 0, fmt.Errorf("encrypting key value: %w", err)
+	}
+
+	masked := f.KeyMasked
+	if masked == "" {
+		masked = MaskKey(f.KeyValue)
+	}
+
+	// With PRAGMA foreign_keys=ON, scan_id=0 would violate the reference to
+	// scans(id) because no such row exists. A nil value is stored as NULL,
+	// which the foreign key constraint permits.
+	var scanID any
+	if f.ScanID != 0 {
+		scanID = f.ScanID
+	}
+
+	res, err := db.sql.Exec(
+		`INSERT INTO findings (scan_id, provider_name, key_value, key_masked, confidence, source_path, source_type, line_number)
+		 VALUES (?, ?, ?, ?, ?, ?, ?, ?)`,
+		scanID, f.ProviderName, encrypted, masked, f.Confidence, f.SourcePath, f.SourceType, f.LineNumber,
+	)
+	if err != nil {
+		return 0, fmt.Errorf("inserting finding: %w", err)
+	}
+	return res.LastInsertId()
+}
+
+// ListFindings retrieves all findings, decrypting key values using encKey.
+// encKey must be the same 32-byte key used during SaveFinding.
+func (db *DB) ListFindings(encKey []byte) ([]Finding, error) {
+	// COALESCE maps a NULL scan_id back to 0 so it can be scanned into an int64.
+	rows, err := db.sql.Query(
+		`SELECT id, COALESCE(scan_id, 0), provider_name, key_value, key_masked, confidence,
+		        source_path, source_type, line_number, created_at
+		 FROM findings ORDER BY created_at DESC`,
+	)
+	if err != nil {
+		return nil, fmt.Errorf("querying findings: %w", err)
+	}
+	defer rows.Close()
+
+	var findings []Finding
+	for rows.Next() {
+		var f Finding
+		var encrypted []byte
+		var createdAt string
+		err := rows.Scan(
+			&f.ID, &f.ScanID, &f.ProviderName, &encrypted, &f.KeyMasked,
+			&f.Confidence, &f.SourcePath, &f.SourceType, &f.LineNumber, &createdAt,
+		)
+		if err != nil {
+			return nil, fmt.Errorf("scanning finding row: %w", err)
+		}
+		plain, err := Decrypt(encrypted, encKey)
+		if err != nil {
+			return nil, fmt.Errorf("decrypting finding %d: %w", f.ID, err)
+		}
+		f.KeyValue = string(plain)
+		// The parse error is ignored: CreatedAt stays zero if the driver
+		// returns the timestamp in a different layout.
+		f.CreatedAt, _ = time.Parse("2006-01-02 15:04:05", createdAt)
+		findings = append(findings, f)
+	}
+	return findings, rows.Err()
+}
+```
+
+Fill **pkg/storage/db_test.go** (replacing stubs from Plan 01):
+```go
+package storage_test
+
+import (
+	"testing"
+
+	"github.com/salvacybersec/keyhunter/pkg/storage"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+)
+
+func TestDBOpen(t *testing.T) {
+	db, err := storage.Open(":memory:")
+	require.NoError(t, err)
+	defer db.Close()
+
+	// Verify schema tables exist
+	rows, err := db.SQL().Query("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
+	require.NoError(t, err)
+	defer rows.Close()
+
+	var tables []string
+	for rows.Next() {
+		var name string
+		require.NoError(t, rows.Scan(&name))
+		tables = append(tables, name)
+	}
+	assert.Contains(t, tables, "findings")
+	assert.Contains(t, tables, "scans")
+	assert.Contains(t, tables, "settings")
+}
+
+func TestEncryptDecryptRoundtrip(t *testing.T) {
+	key := make([]byte, 32) // deterministic test key: bytes 0x00..0x1f
+	for i := range key {
+		key[i] = byte(i)
+	}
+	plaintext :=
[]byte("sk-proj-supersecretapikey1234") + + ciphertext, err := storage.Encrypt(plaintext, key) + require.NoError(t, err) + assert.Greater(t, len(ciphertext), len(plaintext), "ciphertext should be longer than plaintext") + + recovered, err := storage.Decrypt(ciphertext, key) + require.NoError(t, err) + assert.Equal(t, plaintext, recovered) +} + +func TestEncryptNonDeterministic(t *testing.T) { + key := make([]byte, 32) + plain := []byte("test-key") + ct1, err1 := storage.Encrypt(plain, key) + ct2, err2 := storage.Encrypt(plain, key) + require.NoError(t, err1) + require.NoError(t, err2) + assert.NotEqual(t, ct1, ct2, "same plaintext encrypted twice should produce different ciphertext") +} + +func TestDecryptWrongKey(t *testing.T) { + key1 := make([]byte, 32) + key2 := make([]byte, 32) + key2[0] = 0xFF + + ct, err := storage.Encrypt([]byte("secret"), key1) + require.NoError(t, err) + + _, err = storage.Decrypt(ct, key2) + assert.Error(t, err, "decryption with wrong key should fail") +} + +func TestArgon2KeyDerivation(t *testing.T) { + passphrase := []byte("my-secure-passphrase") + salt := []byte("1234567890abcdef") // 16 bytes + + key1 := storage.DeriveKey(passphrase, salt) + key2 := storage.DeriveKey(passphrase, salt) + + assert.Equal(t, 32, len(key1), "derived key must be 32 bytes") + assert.Equal(t, key1, key2, "same passphrase+salt must produce same key") +} + +func TestNewSalt(t *testing.T) { + salt1, err1 := storage.NewSalt() + salt2, err2 := storage.NewSalt() + require.NoError(t, err1) + require.NoError(t, err2) + assert.Equal(t, 16, len(salt1)) + assert.NotEqual(t, salt1, salt2, "two salts should differ") +} + +func TestSaveFindingEncrypted(t *testing.T) { + db, err := storage.Open(":memory:") + require.NoError(t, err) + defer db.Close() + + // Derive a test key + key := storage.DeriveKey([]byte("testpassphrase"), []byte("testsalt1234xxxx")) + + f := storage.Finding{ + ProviderName: "openai", + KeyValue: "sk-proj-test1234567890abcdefghijklmnopqr", + 
Confidence: "high", + SourcePath: "/test/file.env", + SourceType: "file", + LineNumber: 42, + } + + id, err := db.SaveFinding(f, key) + require.NoError(t, err) + assert.Greater(t, id, int64(0)) + + findings, err := db.ListFindings(key) + require.NoError(t, err) + require.Len(t, findings, 1) + assert.Equal(t, "sk-proj-test1234567890abcdefghijklmnopqr", findings[0].KeyValue) + assert.Equal(t, "openai", findings[0].ProviderName) + // Verify masking + assert.Equal(t, "sk-proj-...opqr", findings[0].KeyMasked) +} +``` + + + cd /home/salva/Documents/apikey && go test ./pkg/storage/... -v -count=1 2>&1 | tail -25 + + + - `go test ./pkg/storage/... -v -count=1` exits 0 with all 7 tests PASS (no SKIP) + - TestDBOpen finds tables: findings, scans, settings + - TestEncryptDecryptRoundtrip passes — recovered plaintext matches original + - TestEncryptNonDeterministic passes — two encryptions differ + - TestDecryptWrongKey passes — wrong key causes error + - TestArgon2KeyDerivation passes — 32 bytes, deterministic + - TestNewSalt passes — 16 bytes, non-deterministic + - TestSaveFindingEncrypted passes — stored and retrieved with correct KeyValue and KeyMasked + - `grep -q 'go:embed.*schema' pkg/storage/db.go` exits 0 + - `grep -q 'modernc.org/sqlite' pkg/storage/db.go` exits 0 + - `grep -q 'journal_mode=WAL' pkg/storage/db.go` exits 0 + + Storage layer complete — SQLite opens with schema, AES-256-GCM encrypt/decrypt works, Argon2id key derivation works, SaveFinding/ListFindings encrypt/decrypt transparently. All 7 tests pass. + + + + + +After both tasks: +- `go test ./pkg/storage/... 
-v -count=1` exits 0 with 7 tests PASS +- `go build ./...` still exits 0 +- `grep -q 'argon2\.IDKey' pkg/storage/crypto.go` exits 0 +- `grep -q 'cipher\.NewGCM' pkg/storage/encrypt.go` exits 0 +- `grep -q 'journal_mode=WAL' pkg/storage/db.go` exits 0 +- schema.sql contains CREATE TABLE for findings, scans, settings + + + +- SQLite database opens and auto-migrates from embedded schema.sql (STOR-01) +- AES-256-GCM column encryption works: Encrypt + Decrypt roundtrip returns original (STOR-02) +- Argon2id key derivation: DeriveKey deterministic, 32 bytes, RFC 9106 params (STOR-03) +- FindingCRUD: SaveFinding encrypts before INSERT, ListFindings decrypts after SELECT +- All 7 storage tests pass + + + +After completion, create `.planning/phases/01-foundation/01-03-SUMMARY.md` following the summary template. + diff --git a/.planning/phases/01-foundation/01-04-PLAN.md b/.planning/phases/01-foundation/01-04-PLAN.md new file mode 100644 index 0000000..0fc8b57 --- /dev/null +++ b/.planning/phases/01-foundation/01-04-PLAN.md @@ -0,0 +1,682 @@ +--- +phase: 01-foundation +plan: 04 +type: execute +wave: 2 +depends_on: [01-02] +files_modified: + - pkg/engine/chunk.go + - pkg/engine/finding.go + - pkg/engine/entropy.go + - pkg/engine/filter.go + - pkg/engine/detector.go + - pkg/engine/engine.go + - pkg/engine/sources/source.go + - pkg/engine/sources/file.go + - pkg/engine/scanner_test.go +autonomous: true +requirements: [CORE-01, CORE-04, CORE-05, CORE-06, CORE-07] + +must_haves: + truths: + - "Shannon entropy function returns expected values for known inputs" + - "Aho-Corasick pre-filter passes chunks containing provider keywords and drops those without" + - "Detector correctly identifies OpenAI and Anthropic key patterns in test fixtures via regex" + - "Full scan pipeline: scan testdata/samples/openai_key.txt → Finding with ProviderName==openai" + - "Full scan pipeline: scan testdata/samples/no_keys.txt → zero findings" + - "Worker pool uses ants v2 with configurable worker 
count" + artifacts: + - path: "pkg/engine/chunk.go" + provides: "Chunk struct (Data []byte, Source string, Offset int64)" + exports: ["Chunk"] + - path: "pkg/engine/finding.go" + provides: "Finding struct (provider, key value, masked, confidence, source, line)" + exports: ["Finding", "MaskKey"] + - path: "pkg/engine/entropy.go" + provides: "Shannon(s string) float64 — ~10 line stdlib math implementation" + exports: ["Shannon"] + - path: "pkg/engine/filter.go" + provides: "KeywordFilter stage — runs Aho-Corasick and passes/drops chunks" + exports: ["KeywordFilter"] + - path: "pkg/engine/detector.go" + provides: "Detector stage — applies provider regexps and entropy check to chunks" + exports: ["Detector"] + - path: "pkg/engine/engine.go" + provides: "Engine struct with Scan(ctx, src, cfg) <-chan Finding" + exports: ["Engine", "NewEngine", "ScanConfig"] + - path: "pkg/engine/sources/source.go" + provides: "Source interface with Chunks(ctx, chan<- Chunk) error" + exports: ["Source"] + - path: "pkg/engine/sources/file.go" + provides: "FileSource implementing Source for single-file scanning" + exports: ["FileSource", "NewFileSource"] + key_links: + - from: "pkg/engine/engine.go" + to: "pkg/providers/registry.go" + via: "Engine holds *providers.Registry, uses Registry.AC() for pre-filter" + pattern: "providers\\.Registry" + - from: "pkg/engine/filter.go" + to: "github.com/petar-dambovaliev/aho-corasick" + via: "AC.FindAll() on each chunk" + pattern: "FindAll" + - from: "pkg/engine/detector.go" + to: "pkg/engine/entropy.go" + via: "Shannon() called when EntropyMin > 0 in pattern" + pattern: "Shannon" + - from: "pkg/engine/engine.go" + to: "github.com/panjf2000/ants/v2" + via: "ants.NewPool for detector workers" + pattern: "ants\\.NewPool" +--- + + +Build the three-stage scanning engine pipeline: Aho-Corasick keyword pre-filter, regex + entropy detector workers using ants goroutine pool, and a FileSource adapter. 
Wire them together in an Engine that emits Findings on a channel. + +Purpose: The scan engine is the core differentiator. Plans 02 and 03 provide its dependencies (Registry for patterns + keywords, storage types for Finding). The CLI (Plan 05) calls Engine.Scan() to implement `keyhunter scan`. +Output: pkg/engine/{chunk,finding,entropy,filter,detector,engine}.go and sources/{source,file}.go. scanner_test.go stubs filled. + + + +@$HOME/.claude/get-shit-done/workflows/execute-plan.md +@$HOME/.claude/get-shit-done/templates/summary.md + + + +@.planning/phases/01-foundation/01-RESEARCH.md +@.planning/phases/01-foundation/01-02-SUMMARY.md + + + +package providers + +type Provider struct { + Name string + Keywords []string + Patterns []Pattern + Tier int +} + +type Pattern struct { + Regex string + EntropyMin float64 + Confidence string +} + +type Registry struct { ... } +func (r *Registry) List() []Provider +func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built Aho-Corasick + + +chunksChan chan Chunk (buffer: 1000) +detectableChan chan Chunk (buffer: 500) +resultsChan chan Finding (buffer: 100) + +Stage 1: Source.Chunks() → chunksChan (goroutine, closes chan on done) +Stage 2: KeywordFilter(chunksChan) → detectableChan (goroutine, AC.FindAll) +Stage 3: N detector workers (ants pool) → resultsChan + + +type ScanConfig struct { + Workers int // default: runtime.NumCPU() * 8 + Verify bool // Phase 5 — always false in Phase 1 + Unmask bool // for output layer +} + + +type Source interface { + Chunks(ctx context.Context, out chan<- Chunk) error +} + + +type FileSource struct { + Path string + ChunkSize int // bytes per chunk, default 4096 +} + +Chunking strategy: read file in chunks of ChunkSize bytes with overlap of max(256, maxPatternLen) +to avoid splitting a key across chunk boundaries. 
+ + +import ahocorasick "github.com/petar-dambovaliev/aho-corasick" +// ac.FindAll(s string) []ahocorasick.Match — returns match positions + + +import "github.com/panjf2000/ants/v2" +// pool, _ := ants.NewPool(workers, ants.WithOptions(...)) +// pool.Submit(func() { ... }) +// pool.ReleaseWithTimeout(timeout) + + + + + + + Task 1: Core types and Shannon entropy function + pkg/engine/chunk.go, pkg/engine/finding.go, pkg/engine/entropy.go + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CORE-04 row: Shannon entropy, ~10-line stdlib function, threshold 3.5 bits/char) + - /home/salva/Documents/apikey/pkg/storage/findings.go (Finding and MaskKey defined there — engine.Finding is a separate type for the pipeline) + + + - Test 1: Shannon("aaaaaaa") → value near 0.0 (all same characters, no entropy) + - Test 2: Shannon("abcdefgh") → value near 3.0 (8 distinct chars) + - Test 3: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") → >= 3.5 (real key entropy) + - Test 4: Shannon("") → 0.0 (empty string) + - Test 5: MaskKey("sk-proj-abc1234") → "sk-proj-...1234" (first 8 + last 4) + - Test 6: MaskKey("abc") → "****" (too short to mask) + + +Create **pkg/engine/chunk.go**: +```go +package engine + +// Chunk is a segment of file content passed through the scanning pipeline. +type Chunk struct { + Data []byte // raw bytes + Source string // file path, URL, or description + Offset int64 // byte offset of this chunk within the source +} +``` + +Create **pkg/engine/finding.go**: +```go +package engine + +import "time" + +// Finding represents a detected API key from the scanning pipeline. +// KeyValue holds the plaintext key — the storage layer encrypts it before persisting. 
+type Finding struct { + ProviderName string + KeyValue string // full plaintext key + KeyMasked string // first8...last4 + Confidence string // "high", "medium", "low" + Source string // file path or description + SourceType string // "file", "dir", "git", "stdin", "url" + LineNumber int + Offset int64 + DetectedAt time.Time +} + +// MaskKey returns a masked representation: first 8 chars + "..." + last 4 chars. +// Returns "****" if the key is shorter than 12 characters. +func MaskKey(key string) string { + if len(key) < 12 { + return "****" + } + return key[:8] + "..." + key[len(key)-4:] +} +``` + +Create **pkg/engine/entropy.go**: +```go +package engine + +import "math" + +// Shannon computes the Shannon entropy of a string in bits per character. +// Returns 0.0 for empty strings. +// A value >= 3.5 indicates high randomness, consistent with real API keys. +func Shannon(s string) float64 { + if len(s) == 0 { + return 0.0 + } + freq := make(map[rune]float64) + for _, c := range s { + freq[c]++ + } + n := float64(len([]rune(s))) + var entropy float64 + for _, count := range freq { + p := count / n + entropy -= p * math.Log2(p) + } + return entropy +} +``` + + + cd /home/salva/Documents/apikey && go build ./pkg/engine/... && echo "BUILD OK" + + + - `go build ./pkg/engine/...` exits 0 + - pkg/engine/chunk.go exports Chunk with fields Data, Source, Offset + - pkg/engine/finding.go exports Finding and MaskKey + - pkg/engine/entropy.go exports Shannon using math.Log2 + - `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0 + - Shannon("aaaaaaa") == 0.0 (manually verifiable from code) + - MaskKey("sk-proj-abc1234") produces "sk-proj-...1234" + + Chunk, Finding, MaskKey, and Shannon exist and compile. Shannon uses stdlib math only — no external library. 
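As a quick sanity check of the expected values in the behavior list, the following standalone sketch evaluates the same formula, H = -Σ p·log2(p), on those inputs. The lowercase `shannon` mirrors the function above; `looksRandom` and the way it wraps the 3.5 bits/char threshold are hypothetical additions for illustration, not part of the planned API.

```go
package main

import (
	"fmt"
	"math"
)

// shannon returns the Shannon entropy of s in bits per character.
func shannon(s string) float64 {
	if len(s) == 0 {
		return 0.0
	}
	freq := make(map[rune]float64)
	n := 0.0
	for _, c := range s {
		freq[c]++
		n++
	}
	var entropy float64
	for _, count := range freq {
		p := count / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

// looksRandom is a hypothetical helper: it reports whether s meets a minimum
// entropy, the way a pattern with EntropyMin > 0 might be checked.
func looksRandom(s string, minEntropy float64) bool {
	return shannon(s) >= minEntropy
}

func main() {
	fmt.Println(shannon("aaaaaaa"))  // one symbol, p = 1: entropy 0
	fmt.Println(shannon("abcdefgh")) // eight equally likely symbols: log2(8) = 3
	fmt.Println(looksRandom("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr", 3.5))
	fmt.Println(looksRandom("aaaaaaaaaaaaaaaa", 3.5))
}
```

The second case shows why a threshold of 3.5 is stricter than it may look: a string needs more than eight equally frequent distinct characters to exceed 3 bits per character.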
+ + + + Task 2: Pipeline stages, engine orchestration, FileSource, and filled test stubs + + pkg/engine/filter.go, + pkg/engine/detector.go, + pkg/engine/engine.go, + pkg/engine/sources/source.go, + pkg/engine/sources/file.go, + pkg/engine/scanner_test.go + + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 2: Three-Stage Scanning Pipeline — exact channel-based code example) + - /home/salva/Documents/apikey/pkg/engine/chunk.go + - /home/salva/Documents/apikey/pkg/engine/finding.go + - /home/salva/Documents/apikey/pkg/engine/entropy.go + - /home/salva/Documents/apikey/pkg/providers/registry.go (Registry.AC() and Registry.List() signatures) + + + - Test 1: Scan testdata/samples/openai_key.txt → 1 finding, ProviderName=="openai", KeyValue contains "sk-proj-" + - Test 2: Scan testdata/samples/anthropic_key.txt → 1 finding, ProviderName=="anthropic" + - Test 3: Scan testdata/samples/no_keys.txt → 0 findings + - Test 4: Scan testdata/samples/multiple_keys.txt → 2 findings (openai + anthropic) + - Test 5: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") >= 3.5 (entropy check) + - Test 6: KeywordFilter drops a chunk with text "hello world" (no provider keywords) + + +Create **pkg/engine/sources/source.go**: +```go +package sources + +import ( + "context" + + "github.com/salvacybersec/keyhunter/pkg/engine" +) + +// Source is the interface all input adapters must implement. +// Chunks writes content segments to the out channel until the source is exhausted or ctx is cancelled. +type Source interface { + Chunks(ctx context.Context, out chan<- engine.Chunk) error +} +``` + +Create **pkg/engine/sources/file.go**: +```go +package sources + +import ( + "context" + "os" + + "github.com/salvacybersec/keyhunter/pkg/engine" +) + +const defaultChunkSize = 4096 +const chunkOverlap = 256 // overlap between chunks to avoid splitting keys at boundaries + +// FileSource reads a single file and emits overlapping chunks. 
+type FileSource struct {
+	Path      string
+	ChunkSize int
+}
+
+// NewFileSource creates a FileSource for the given path with the default chunk size.
+func NewFileSource(path string) *FileSource {
+	return &FileSource{Path: path, ChunkSize: defaultChunkSize}
+}
+
+// Chunks reads the file in overlapping segments and sends each chunk to out.
+func (f *FileSource) Chunks(ctx context.Context, out chan<- engine.Chunk) error {
+	data, err := os.ReadFile(f.Path)
+	if err != nil {
+		return err
+	}
+	size := f.ChunkSize
+	if size <= chunkOverlap {
+		// A window no larger than the overlap would never advance; fall back to the default.
+		size = defaultChunkSize
+	}
+	if len(data) <= size {
+		// File fits in one chunk
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case out <- engine.Chunk{Data: data, Source: f.Path, Offset: 0}:
+		}
+		return nil
+	}
+	// Emit overlapping chunks; each window starts size-chunkOverlap bytes after the previous one.
+	for start := 0; start < len(data); start += size - chunkOverlap {
+		end := start + size
+		if end > len(data) {
+			end = len(data)
+		}
+		chunk := engine.Chunk{
+			Data:   data[start:end],
+			Source: f.Path,
+			Offset: int64(start), // byte offset of this window within the file
+		}
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case out <- chunk:
+		}
+		if end == len(data) {
+			break
+		}
+	}
+	return nil
+}
+```
+
+Create **pkg/engine/filter.go**:
+```go
+package engine
+
+import (
+	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
+)
+
+// KeywordFilter filters a stream of chunks using an Aho-Corasick automaton.
+// Only chunks that contain at least one provider keyword are sent to out.
+// This is Stage 2 of the pipeline (runs after Source, before Detector).
+func KeywordFilter(ac ahocorasick.AhoCorasick, in <-chan Chunk, out chan<- Chunk) {
+	for chunk := range in {
+		if len(ac.FindAll(string(chunk.Data))) > 0 {
+			out <- chunk
+		}
+	}
+}
+```
+
+Create **pkg/engine/detector.go**:
+```go
+package engine
+
+import (
+	"regexp"
+	"strings"
+	"time"
+
+	"github.com/salvacybersec/keyhunter/pkg/providers"
+)
+
+// Detect applies provider regex patterns and optional entropy checks to a chunk.
+// It returns all findings from the chunk.
+func Detect(chunk Chunk, providerList []providers.Provider) []Finding {
+	var findings []Finding
+	content := string(chunk.Data)
+
+	for _, p := range providerList {
+		for _, pat := range p.Patterns {
+			re, err := regexp.Compile(pat.Regex)
+			if err != nil {
+				continue // invalid regex — skip silently
+			}
+			// Use match indexes so repeated key text still gets its own line number.
+			for _, loc := range re.FindAllStringIndex(content, -1) {
+				match := content[loc[0]:loc[1]]
+				// Apply entropy check if threshold is set
+				if pat.EntropyMin > 0 && Shannon(match) < pat.EntropyMin {
+					continue // too low entropy — likely a placeholder
+				}
+				findings = append(findings, Finding{
+					ProviderName: p.Name,
+					KeyValue:     match,
+					KeyMasked:    MaskKey(match),
+					Confidence:   pat.Confidence,
+					Source:       chunk.Source,
+					SourceType:   "file",
+					LineNumber:   lineNumber(content, loc[0]),
+					Offset:       chunk.Offset,
+					DetectedAt:   time.Now(),
+				})
+			}
+		}
+	}
+	return findings
+}
+
+// lineNumber returns the 1-based line number of byte index idx within content.
+// The number is relative to the chunk, not to the whole file.
+func lineNumber(content string, idx int) int {
+	if idx < 0 || idx > len(content) {
+		return 0
+	}
+	return strings.Count(content[:idx], "\n") + 1
+}
+```
+
+Create **pkg/engine/engine.go**:
+```go
+package engine
+
+import (
+	"context"
+	"runtime"
+	"sync"
+	"time"
+
+	"github.com/panjf2000/ants/v2"
+	"github.com/salvacybersec/keyhunter/pkg/providers"
+)
+
+// Source is the input contract the engine consumes. It is declared here rather than
+// imported from pkg/engine/sources because that package imports pkg/engine for Chunk,
+// and Go rejects import cycles. Any sources.Source (e.g. *sources.FileSource) satisfies it.
+type Source interface {
+	Chunks(ctx context.Context, out chan<- Chunk) error
+}
+
+// ScanConfig controls scan execution parameters.
+type ScanConfig struct {
+	Workers int  // number of detector goroutines; defaults to runtime.NumCPU() * 8
+	Verify  bool // opt-in active verification (Phase 5)
+	Unmask  bool // include full key in Finding.KeyValue
+}
+
+// Engine orchestrates the three-stage scanning pipeline.
+type Engine struct {
+	registry *providers.Registry
+}
+
+// NewEngine creates an Engine backed by the given provider registry.
+func NewEngine(registry *providers.Registry) *Engine {
+	return &Engine{registry: registry}
+}
+
+// Scan runs the three-stage pipeline against src and returns a channel of Findings.
+// The channel is closed when all chunks have been processed.
+// The caller must drain the channel fully or cancel ctx to avoid goroutine leaks.
+func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) {
+	workers := cfg.Workers
+	if workers <= 0 {
+		workers = runtime.NumCPU() * 8
+	}
+
+	// Create the pool before starting any stage so a failure here leaves no goroutines behind.
+	pool, err := ants.NewPool(workers)
+	if err != nil {
+		return nil, err
+	}
+	providerList := e.registry.List()
+
+	chunksChan := make(chan Chunk, 1000)
+	detectableChan := make(chan Chunk, 500)
+	resultsChan := make(chan Finding, 100)
+
+	// Stage 1: source → chunksChan
+	go func() {
+		defer close(chunksChan)
+		_ = src.Chunks(ctx, chunksChan)
+	}()
+
+	// Stage 2: keyword pre-filter → detectableChan
+	go func() {
+		defer close(detectableChan)
+		KeywordFilter(e.registry.AC(), chunksChan, detectableChan)
+	}()
+
+	// Stage 3: detector workers → resultsChan
+	var wg sync.WaitGroup
+
+	go func() {
+		defer func() {
+			wg.Wait()
+			close(resultsChan)
+			_ = pool.ReleaseWithTimeout(5 * time.Second)
+		}()
+
+		for chunk := range detectableChan {
+			c := chunk // capture
+			wg.Add(1)
+			err := pool.Submit(func() {
+				defer wg.Done()
+				// Channel sends are safe from multiple goroutines; no mutex is needed.
+				for _, f := range Detect(c, providerList) {
+					select {
+					case resultsChan <- f:
+					case <-ctx.Done():
+						return
+					}
+				}
+			})
+			if err != nil {
+				wg.Done() // the task never ran; keep the WaitGroup balanced
+			}
+		}
+	}()
+
+	return resultsChan, nil
+}
+```
+
+Fill **pkg/engine/scanner_test.go** (replacing stubs from Plan 01):
+```go
+package engine_test
+
+import (
+	"context"
+	"testing"
+
+	"github.com/salvacybersec/keyhunter/pkg/engine"
+	"github.com/salvacybersec/keyhunter/pkg/engine/sources"
+	"github.com/salvacybersec/keyhunter/pkg/providers"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+)
+
+func newTestRegistry(t
*testing.T) *providers.Registry { + t.Helper() + reg, err := providers.NewRegistry() + require.NoError(t, err) + return reg +} + +func TestShannonEntropy(t *testing.T) { + assert.InDelta(t, 0.0, engine.Shannon("aaaaaaa"), 0.01) + assert.Greater(t, engine.Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr"), 3.5) + assert.Equal(t, 0.0, engine.Shannon("")) +} + +func TestKeywordPreFilter(t *testing.T) { + reg := newTestRegistry(t) + ac := reg.AC() + + // Chunk with OpenAI keyword should pass + matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-test") + assert.NotEmpty(t, matches) + + // Chunk with no keywords should be dropped + noMatches := ac.FindAll("hello world no secrets here") + assert.Empty(t, noMatches) +} + +func TestScannerPipelineOpenAI(t *testing.T) { + reg := newTestRegistry(t) + eng := engine.NewEngine(reg) + src := sources.NewFileSource("../../testdata/samples/openai_key.txt") + cfg := engine.ScanConfig{Workers: 2} + + ch, err := eng.Scan(context.Background(), src, cfg) + require.NoError(t, err) + + var findings []engine.Finding + for f := range ch { + findings = append(findings, f) + } + + require.Len(t, findings, 1, "expected exactly 1 finding in openai_key.txt") + assert.Equal(t, "openai", findings[0].ProviderName) + assert.Contains(t, findings[0].KeyValue, "sk-proj-") +} + +func TestScannerPipelineNoKeys(t *testing.T) { + reg := newTestRegistry(t) + eng := engine.NewEngine(reg) + src := sources.NewFileSource("../../testdata/samples/no_keys.txt") + cfg := engine.ScanConfig{Workers: 2} + + ch, err := eng.Scan(context.Background(), src, cfg) + require.NoError(t, err) + + var findings []engine.Finding + for f := range ch { + findings = append(findings, f) + } + + assert.Empty(t, findings, "expected zero findings in no_keys.txt") +} + +func TestScannerPipelineMultipleKeys(t *testing.T) { + reg := newTestRegistry(t) + eng := engine.NewEngine(reg) + src := sources.NewFileSource("../../testdata/samples/multiple_keys.txt") + cfg := 
engine.ScanConfig{Workers: 2} + + ch, err := eng.Scan(context.Background(), src, cfg) + require.NoError(t, err) + + var findings []engine.Finding + for f := range ch { + findings = append(findings, f) + } + + assert.GreaterOrEqual(t, len(findings), 2, "expected at least 2 findings in multiple_keys.txt") + + var names []string + for _, f := range findings { + names = append(names, f.ProviderName) + } + assert.Contains(t, names, "openai") + assert.Contains(t, names, "anthropic") +} +``` + + + cd /home/salva/Documents/apikey && go test ./pkg/engine/... -v -count=1 2>&1 | tail -30 + + + - `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS (no SKIP) + - TestShannonEntropy passes — 0.0 for "aaaaaaa", >= 3.5 for real key pattern + - TestKeywordPreFilter passes — AC matches sk-proj-, empty for "hello world" + - TestScannerPipelineOpenAI passes — 1 finding with ProviderName=="openai" + - TestScannerPipelineNoKeys passes — 0 findings + - TestScannerPipelineMultipleKeys passes — >= 2 findings with both provider names + - `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0 + - `grep -q 'KeywordFilter' pkg/engine/engine.go` exits 0 + - `go build ./...` still exits 0 + + Three-stage scanning pipeline works end-to-end: FileSource → KeywordFilter (AC) → Detect (regex + entropy) → Finding channel. All engine tests pass. + + + + + +After both tasks: +- `go test ./pkg/engine/... 
-v -count=1` exits 0 with all 5 tests in scanner_test.go passing
+- `go build ./...` exits 0
+- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
+- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
+- Scanning testdata/samples/openai_key.txt returns 1 finding with provider "openai"
+- Scanning testdata/samples/no_keys.txt returns 0 findings
+
+
+
+- Three-stage pipeline: AC pre-filter → regex + entropy detector → results channel (CORE-01, CORE-06)
+- Shannon entropy function using stdlib math (CORE-04)
+- ants v2 goroutine pool with configurable worker count (CORE-05)
+- FileSource adapter reading files in overlapping chunks (CORE-07 partial — full mmap in Phase 4)
+- All engine tests pass against real testdata fixtures
+
+
+
+After completion, create `.planning/phases/01-foundation/01-04-SUMMARY.md` following the summary template.
+
diff --git a/.planning/phases/01-foundation/01-05-PLAN.md b/.planning/phases/01-foundation/01-05-PLAN.md
new file mode 100644
index 0000000..fb95433
--- /dev/null
+++ b/.planning/phases/01-foundation/01-05-PLAN.md
@@ -0,0 +1,748 @@
+---
+phase: 01-foundation
+plan: 05
+type: execute
+wave: 3
+depends_on: [01-02, 01-03, 01-04]
+files_modified:
+  - cmd/root.go
+  - cmd/scan.go
+  - cmd/providers.go
+  - cmd/config.go
+  - pkg/config/config.go
+  - pkg/output/table.go
+autonomous: false
+requirements: [CLI-01, CLI-02, CLI-03, CLI-04, CLI-05]
+
+must_haves:
+  truths:
+    - "`keyhunter scan ./testdata/samples/openai_key.txt` runs the pipeline and prints a finding"
+    - "`keyhunter providers list` prints a table with at least 3 providers"
+    - "`keyhunter providers info openai` prints OpenAI provider details"
+    - "`keyhunter config init` creates ~/.keyhunter.yaml without error"
+    - "`keyhunter config set scan.workers 16` persists the value to ~/.keyhunter.yaml"
+    - "`keyhunter --help` shows all top-level commands: scan, providers, config"
+  artifacts:
+    - path: "cmd/root.go"
+      provides: "Cobra root command with config loading via cobra.OnInitialize"
+      contains: "cobra.Command"
+    - path: "cmd/scan.go"
+      provides: "scan command wiring Engine + FileSource + output table"
+      exports: ["scanCmd"]
+    - path: "cmd/providers.go"
+      provides: "providers list/info/stats subcommands using Registry"
+      exports: ["providersCmd"]
+    - path: "cmd/config.go"
+      provides: "config init/set/get subcommands using Viper"
+      exports: ["configCmd"]
+    - path: "pkg/config/config.go"
+      provides: "Config struct with Load() and defaults"
+      exports: ["Config", "Load"]
+    - path: "pkg/output/table.go"
+      provides: "lipgloss terminal table for printing Findings"
+      exports: ["PrintFindings"]
+  key_links:
+    - from: "cmd/scan.go"
+      to: "pkg/engine/engine.go"
+      via: "engine.NewEngine(registry).Scan() called in RunE"
+      pattern: "engine\\.NewEngine"
+    - from: "cmd/scan.go"
+      to: "pkg/storage/db.go"
+      via: "storage.Open() called, SaveFinding for each result"
+      pattern: "storage\\.Open"
+    - from: "cmd/root.go"
+      to: "github.com/spf13/viper"
+      via: "viper.SetConfigFile in initConfig (registered with cobra.OnInitialize)"
+      pattern: "viper\\.SetConfigFile"
+    - from: "cmd/providers.go"
+      to: "pkg/providers/registry.go"
+      via: "Registry.List(), Registry.Get(), Registry.Stats() called on the reg variable"
+      pattern: "reg\\.List|reg\\.Get|reg\\.Stats"
+---
+
+
+Wire all subsystems together through the Cobra CLI: scan command (engine + storage + output), providers list/info/stats commands, and config init/set/get commands. This is the integration layer — all business logic lives in pkg/, cmd/ only wires.
+
+Purpose: Satisfies all Phase 1 CLI requirements and delivers the first working `keyhunter scan` command that completes the end-to-end success criteria.
+Output: cmd/{root,scan,providers,config}.go, pkg/config/config.go, pkg/output/table.go.
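The `pattern` values under `key_links` are regular expressions (in the YAML, `"engine\\.NewEngine"` is the regex `engine\.NewEngine`). The sketch below shows one way such a pattern could be checked against source text. The `keyLink` struct and `linkPresent` helper are hypothetical names for illustration, and how the planning tooling actually evaluates these patterns is an assumption here.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyLink is a hypothetical in-memory form of one key_links entry.
type keyLink struct {
	From    string // file expected to contain the call
	Pattern string // regular expression from the frontmatter
}

// linkPresent reports whether the link's pattern matches anywhere in src.
func linkPresent(link keyLink, src string) (bool, error) {
	re, err := regexp.Compile(link.Pattern)
	if err != nil {
		return false, fmt.Errorf("key_link for %s: %w", link.From, err)
	}
	return re.MatchString(src), nil
}

func main() {
	// Short excerpts standing in for the real files.
	scanSrc := "eng := engine.NewEngine(reg)\ndb, err := storage.Open(dbPath)\n"
	providersSrc := "for _, p := range reg.List() {\n"

	checks := []struct {
		link keyLink
		src  string
	}{
		{keyLink{"cmd/scan.go", `engine\.NewEngine`}, scanSrc},
		{keyLink{"cmd/scan.go", `storage\.Open`}, scanSrc},
		// The match is purely textual: a pattern written for a variable named
		// "registry" does not match a call made on a variable named "reg".
		{keyLink{"cmd/providers.go", `registry\.List`}, providersSrc},
		{keyLink{"cmd/providers.go", `reg\.List`}, providersSrc},
	}
	for _, c := range checks {
		ok, err := linkPresent(c.link, c.src)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-18s %-20s present=%v\n", c.link.From, c.link.Pattern, ok)
	}
}
```

Because the match is textual, a pattern has to follow the identifier names actually used in the target file, not the type names.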
+ + + +@$HOME/.claude/get-shit-done/workflows/execute-plan.md +@$HOME/.claude/get-shit-done/templates/summary.md + + + +@.planning/phases/01-foundation/01-RESEARCH.md +@.planning/phases/01-foundation/01-02-SUMMARY.md +@.planning/phases/01-foundation/01-03-SUMMARY.md +@.planning/phases/01-foundation/01-04-SUMMARY.md + + + +package engine +type ScanConfig struct { Workers int; Verify bool; Unmask bool } +func NewEngine(registry *providers.Registry) *Engine +func (e *Engine) Scan(ctx context.Context, src sources.Source, cfg ScanConfig) (<-chan Finding, error) + + +package sources +func NewFileSource(path string) *FileSource + + +type Finding struct { + ProviderName string + KeyValue string + KeyMasked string + Confidence string + Source string + LineNumber int +} + + +package storage +func Open(path string) (*DB, error) +func (db *DB) SaveFinding(f Finding, encKey []byte) (int64, error) +func DeriveKey(passphrase []byte, salt []byte) []byte +func NewSalt() ([]byte, error) + + +package providers +func NewRegistry() (*Registry, error) +func (r *Registry) List() []Provider +func (r *Registry) Get(name string) (Provider, bool) +func (r *Registry) Stats() RegistryStats + + +DBPath: ~/.keyhunter/keyhunter.db +ConfigPath: ~/.keyhunter.yaml +Workers: runtime.NumCPU() * 8 +Passphrase: (prompt if not in env KEYHUNTER_PASSPHRASE — Phase 1: use empty string as dev default) + + +"database.path" → DBPath +"scan.workers" → Workers +"encryption.passphrase" → Passphrase (sensitive — warn in help) + + +Columns: PROVIDER | MASKED KEY | CONFIDENCE | SOURCE | LINE +Colors: use lipgloss.NewStyle().Foreground() for confidence: high=green, medium=yellow, low=red + + + + + + + Task 1: Config package, output table, and root command + pkg/config/config.go, pkg/output/table.go, cmd/root.go + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CLI-01, CLI-02, CLI-03 rows, Standard Stack: cobra v1.10.2 + viper v1.21.0) + - 
/home/salva/Documents/apikey/pkg/engine/finding.go (Finding struct fields for output)
+
+
+Create **pkg/config/config.go**:
+```go
+package config
+
+import (
+	"os"
+	"path/filepath"
+	"runtime"
+)
+
+// Config holds all KeyHunter runtime configuration.
+// Values come from ~/.keyhunter.yaml, environment variables, and CLI flags,
+// listed in increasing order of precedence (flags override env, env overrides the file).
+type Config struct {
+	DBPath     string // path to SQLite database file
+	ConfigPath string // path to config YAML file
+	Workers    int    // number of scanner worker goroutines
+	Passphrase string // encryption passphrase (sensitive)
+}
+
+// Load returns a Config with defaults applied.
+// Callers should override individual fields after Load() using viper-bound values.
+func Load() Config {
+	home, _ := os.UserHomeDir()
+	return Config{
+		DBPath:     filepath.Join(home, ".keyhunter", "keyhunter.db"),
+		ConfigPath: filepath.Join(home, ".keyhunter.yaml"),
+		Workers:    runtime.NumCPU() * 8,
+		Passphrase: "", // Phase 1: empty passphrase; Phase 6+ will prompt
+	}
+}
+```
+
+Create **pkg/output/table.go**:
+```go
+package output
+
+import (
+	"fmt"
+	"os"
+
+	"github.com/charmbracelet/lipgloss"
+	"github.com/salvacybersec/keyhunter/pkg/engine"
+)
+
+var (
+	styleHigh   = lipgloss.NewStyle().Foreground(lipgloss.Color("2")) // green
+	styleMedium = lipgloss.NewStyle().Foreground(lipgloss.Color("3")) // yellow
+	styleLow    = lipgloss.NewStyle().Foreground(lipgloss.Color("1")) // red
+	styleHeader = lipgloss.NewStyle().Bold(true).Underline(true)
+)
+
+// PrintFindings writes findings as a colored terminal table to stdout.
+// If unmask is true, KeyValue is shown; otherwise KeyMasked is shown.
+func PrintFindings(findings []engine.Finding, unmask bool) { + if len(findings) == 0 { + fmt.Println("No API keys found.") + return + } + + // Header + fmt.Fprintf(os.Stdout, "%-20s %-40s %-10s %-30s %s\n", + styleHeader.Render("PROVIDER"), + styleHeader.Render("KEY"), + styleHeader.Render("CONFIDENCE"), + styleHeader.Render("SOURCE"), + styleHeader.Render("LINE"), + ) + fmt.Println(lipgloss.NewStyle().Foreground(lipgloss.Color("8")).Render( + "──────────────────────────────────────────────────────────────────────────────────────────────────────────", + )) + + for _, f := range findings { + keyDisplay := f.KeyMasked + if unmask { + keyDisplay = f.KeyValue + } + + confStyle := styleLow + switch f.Confidence { + case "high": + confStyle = styleHigh + case "medium": + confStyle = styleMedium + } + + fmt.Fprintf(os.Stdout, "%-20s %-40s %-10s %-30s %d\n", + f.ProviderName, + keyDisplay, + confStyle.Render(f.Confidence), + truncate(f.Source, 28), + f.LineNumber, + ) + } + fmt.Printf("\n%d key(s) found.\n", len(findings)) +} + +func truncate(s string, max int) string { + if len(s) <= max { + return s + } + return "..." + s[len(s)-max+3:] +} +``` + +Create **cmd/root.go** (replaces the stub from Plan 01): +```go +package cmd + +import ( + "fmt" + "os" + "path/filepath" + + "github.com/spf13/cobra" + "github.com/spf13/viper" +) + +var cfgFile string + +// rootCmd is the base command when called without any subcommands. +var rootCmd = &cobra.Command{ + Use: "keyhunter", + Short: "KeyHunter — detect leaked LLM API keys across 108+ providers", + Long: `KeyHunter scans files, git history, and internet sources for leaked LLM API keys. +Supports 108+ providers with Aho-Corasick pre-filtering and regex + entropy detection.`, + SilenceUsage: true, +} + +// Execute is the entry point called by main.go. 
+func Execute() {
+	if err := rootCmd.Execute(); err != nil {
+		// Exit 2 on errors; exit 1 is reserved by scan for "keys found" (CLI-05).
+		os.Exit(2)
+	}
+}
+
+func init() {
+	cobra.OnInitialize(initConfig)
+	rootCmd.PersistentFlags().StringVar(&cfgFile, "config", "", "config file (default: ~/.keyhunter.yaml)")
+	rootCmd.AddCommand(scanCmd)
+	rootCmd.AddCommand(providersCmd)
+	rootCmd.AddCommand(configCmd)
+}
+
+func initConfig() {
+	if cfgFile != "" {
+		viper.SetConfigFile(cfgFile)
+	} else {
+		home, err := os.UserHomeDir()
+		if err != nil {
+			fmt.Fprintln(os.Stderr, "warning: cannot determine home directory:", err)
+			return
+		}
+		viper.SetConfigName(".keyhunter")
+		viper.SetConfigType("yaml")
+		viper.AddConfigPath(home)
+		viper.AddConfigPath(".")
+	}
+
+	viper.SetEnvPrefix("KEYHUNTER")
+	viper.AutomaticEnv()
+
+	// Defaults
+	viper.SetDefault("scan.workers", 0) // 0 = auto (CPU*8)
+	viper.SetDefault("database.path", filepath.Join(mustHomeDir(), ".keyhunter", "keyhunter.db"))
+
+	// Config file is optional — ignore if not found
+	_ = viper.ReadInConfig()
+}
+
+func mustHomeDir() string {
+	h, _ := os.UserHomeDir()
+	return h
+}
+```
+
+
+    cd /home/salva/Documents/apikey && go build ./... && go build -o keyhunter . && ./keyhunter --help 2>&1 | grep -E "scan|providers|config" && echo "HELP OK"
+
+
+    - `go build ./...` exits 0 (this only compiles; `go build -o keyhunter .` produces the binary)
+    - `./keyhunter --help` shows "scan", "providers", and "config" in command list
+    - pkg/config/config.go exports Config and Load
+    - pkg/output/table.go exports PrintFindings
+    - cmd/root.go declares rootCmd and Execute(), and references scanCmd, providersCmd, configCmd
+    - `grep -q 'viper\.SetConfigFile\|viper\.SetConfigName' cmd/root.go` exits 0
+    - lipgloss used for header and confidence coloring
+
+  Root command, config package, and output table exist. `keyhunter --help` shows the three top-level commands.
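One layout detail in `PrintFindings` is worth checking by hand: `fmt` measures `%-10s` widths in code points, so a cell that already carries ANSI styling sequences may get less visible padding than intended. The sketch below uses only the standard library; the hard-coded escape codes stand in for whatever lipgloss emits when styling is active, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Raw ANSI codes used as a stand-in for styled output (an assumption for
// illustration; lipgloss chooses its own sequences per terminal profile).
const (
	ansiGreen = "\x1b[32m"
	ansiReset = "\x1b[0m"
)

// styleThenPad colors the text first and pads afterwards, so the escape
// bytes are counted as part of the column width.
func styleThenPad(text string, width int) string {
	return fmt.Sprintf("%-*s", width, ansiGreen+text+ansiReset)
}

// padThenStyle pads the plain text to the column width and colors the result.
func padThenStyle(text string, width int) string {
	return ansiGreen + fmt.Sprintf("%-*s", width, text) + ansiReset
}

// visibleWidth counts the characters a terminal would actually display.
func visibleWidth(s string) int {
	plain := strings.NewReplacer(ansiGreen, "", ansiReset, "").Replace(s)
	return utf8.RuneCountInString(plain)
}

func main() {
	fmt.Printf("style then pad: visible width %d\n", visibleWidth(styleThenPad("high", 10)))
	fmt.Printf("pad then style: visible width %d\n", visibleWidth(padThenStyle("high", 10)))
}
```

Padding the plain text before styling keeps columns aligned whether or not colors are emitted.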
+ + + + Task 2: scan, providers, and config subcommands + cmd/scan.go, cmd/providers.go, cmd/config.go + + - /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CLI-04, CLI-05 rows, Pattern 2 pipeline usage) + - /home/salva/Documents/apikey/cmd/root.go (rootCmd, viper setup) + - /home/salva/Documents/apikey/pkg/engine/engine.go (Engine.Scan, ScanConfig) + - /home/salva/Documents/apikey/pkg/storage/db.go (Open, SaveFinding) + - /home/salva/Documents/apikey/pkg/providers/registry.go (NewRegistry, List, Get, Stats) + + +Create **cmd/scan.go**: +```go +package cmd + +import ( + "context" + "fmt" + "os" + "path/filepath" + "runtime" + + "github.com/spf13/cobra" + "github.com/spf13/viper" + "github.com/salvacybersec/keyhunter/pkg/config" + "github.com/salvacybersec/keyhunter/pkg/engine" + "github.com/salvacybersec/keyhunter/pkg/engine/sources" + "github.com/salvacybersec/keyhunter/pkg/output" + "github.com/salvacybersec/keyhunter/pkg/providers" + "github.com/salvacybersec/keyhunter/pkg/storage" +) + +var ( + flagWorkers int + flagVerify bool + flagUnmask bool + flagOutput string + flagExclude []string +) + +var scanCmd = &cobra.Command{ + Use: "scan ", + Short: "Scan a file or directory for leaked API keys", + Args: cobra.ExactArgs(1), + RunE: func(cmd *cobra.Command, args []string) error { + target := args[0] + + // Load config + cfg := config.Load() + if viper.GetInt("scan.workers") > 0 { + cfg.Workers = viper.GetInt("scan.workers") + } + + // Workers flag overrides config + workers := flagWorkers + if workers <= 0 { + workers = cfg.Workers + } + if workers <= 0 { + workers = runtime.NumCPU() * 8 + } + + // Initialize registry + reg, err := providers.NewRegistry() + if err != nil { + return fmt.Errorf("loading providers: %w", err) + } + + // Initialize engine + eng := engine.NewEngine(reg) + src := sources.NewFileSource(target) + + scanCfg := engine.ScanConfig{ + Workers: workers, + Verify: flagVerify, + Unmask: flagUnmask, + } + + // Open 
database (ensure directory exists)
+		dbPath := viper.GetString("database.path")
+		if dbPath == "" {
+			dbPath = cfg.DBPath
+		}
+		if err := os.MkdirAll(filepath.Dir(dbPath), 0700); err != nil {
+			return fmt.Errorf("creating database directory: %w", err)
+		}
+		db, err := storage.Open(dbPath)
+		if err != nil {
+			return fmt.Errorf("opening database: %w", err)
+		}
+		defer db.Close()
+
+		// Derive encryption key (Phase 1: empty passphrase with fixed dev salt)
+		salt := []byte("keyhunter-dev-s0") // Phase 1 placeholder — Phase 6 replaces with proper salt storage
+		encKey := storage.DeriveKey([]byte(cfg.Passphrase), salt)
+
+		// Run scan
+		ch, err := eng.Scan(context.Background(), src, scanCfg)
+		if err != nil {
+			return fmt.Errorf("starting scan: %w", err)
+		}
+
+		var findings []engine.Finding
+		for f := range ch {
+			findings = append(findings, f)
+			// Persist to storage
+			storeFinding := storage.Finding{
+				ProviderName: f.ProviderName,
+				KeyValue:     f.KeyValue,
+				KeyMasked:    f.KeyMasked,
+				Confidence:   f.Confidence,
+				SourcePath:   f.Source,
+				SourceType:   f.SourceType,
+				LineNumber:   f.LineNumber,
+			}
+			if _, err := db.SaveFinding(storeFinding, encKey); err != nil {
+				fmt.Fprintf(os.Stderr, "warning: failed to save finding: %v\n", err)
+			}
+		}
+
+		// Output
+		switch flagOutput {
+		case "json":
+			// Placeholder until Phase 6: keep stdout valid JSON and put the note on stderr.
+			fmt.Fprintln(os.Stderr, "note: JSON output is not implemented until Phase 6")
+			fmt.Println("[]")
+		default:
+			output.PrintFindings(findings, flagUnmask)
+		}
+
+		// Exit code semantics (CLI-05 / OUT-06): 0=clean, 1=found, 2=error
+		if len(findings) > 0 {
+			db.Close() // os.Exit skips deferred calls, so close the database explicitly
+			os.Exit(1)
+		}
+		return nil
+	},
+}
+
+func init() {
+	scanCmd.Flags().IntVar(&flagWorkers, "workers", 0, "number of worker goroutines (default: CPU*8)")
+	scanCmd.Flags().BoolVar(&flagVerify, "verify", false, "actively verify found keys (opt-in, Phase 5)")
+	scanCmd.Flags().BoolVar(&flagUnmask, "unmask", false, "show full key values (default: masked)")
+	scanCmd.Flags().StringVar(&flagOutput, "output", "table", "output format: table, json (more in Phase
6)") + scanCmd.Flags().StringSliceVar(&flagExclude, "exclude", nil, "glob patterns to exclude (e.g. *.min.js)") + viper.BindPFlag("scan.workers", scanCmd.Flags().Lookup("workers")) +} +``` + +Create **cmd/providers.go**: +```go +package cmd + +import ( + "fmt" + "os" + "strings" + + "github.com/charmbracelet/lipgloss" + "github.com/spf13/cobra" + "github.com/salvacybersec/keyhunter/pkg/providers" +) + +var providersCmd = &cobra.Command{ + Use: "providers", + Short: "Manage and inspect provider definitions", +} + +var providersListCmd = &cobra.Command{ + Use: "list", + Short: "List all loaded provider definitions", + RunE: func(cmd *cobra.Command, args []string) error { + reg, err := providers.NewRegistry() + if err != nil { + return err + } + bold := lipgloss.NewStyle().Bold(true) + fmt.Fprintf(os.Stdout, "%-20s %-6s %-8s %s\n", + bold.Render("NAME"), bold.Render("TIER"), bold.Render("PATTERNS"), bold.Render("KEYWORDS")) + fmt.Println(strings.Repeat("─", 70)) + for _, p := range reg.List() { + fmt.Fprintf(os.Stdout, "%-20s %-6d %-8d %s\n", + p.Name, p.Tier, len(p.Patterns), strings.Join(p.Keywords, ", ")) + } + stats := reg.Stats() + fmt.Printf("\nTotal: %d providers\n", stats.Total) + return nil + }, +} + +var providersInfoCmd = &cobra.Command{ + Use: "info ", + Short: "Show detailed info for a provider", + Args: cobra.ExactArgs(1), + RunE: func(cmd *cobra.Command, args []string) error { + reg, err := providers.NewRegistry() + if err != nil { + return err + } + p, ok := reg.Get(args[0]) + if !ok { + return fmt.Errorf("provider %q not found", args[0]) + } + fmt.Printf("Name: %s\n", p.Name) + fmt.Printf("Display Name: %s\n", p.DisplayName) + fmt.Printf("Tier: %d\n", p.Tier) + fmt.Printf("Last Verified: %s\n", p.LastVerified) + fmt.Printf("Keywords: %s\n", strings.Join(p.Keywords, ", ")) + fmt.Printf("Patterns: %d\n", len(p.Patterns)) + for i, pat := range p.Patterns { + fmt.Printf(" [%d] regex=%s confidence=%s entropy_min=%.1f\n", + i+1, pat.Regex, pat.Confidence, 
pat.EntropyMin) + } + if p.Verify.URL != "" { + fmt.Printf("Verify URL: %s %s\n", p.Verify.Method, p.Verify.URL) + } + return nil + }, +} + +var providersStatsCmd = &cobra.Command{ + Use: "stats", + Short: "Show provider statistics", + RunE: func(cmd *cobra.Command, args []string) error { + reg, err := providers.NewRegistry() + if err != nil { + return err + } + stats := reg.Stats() + fmt.Printf("Total providers: %d\n", stats.Total) + fmt.Printf("By tier:\n") + for tier := 1; tier <= 9; tier++ { + if count := stats.ByTier[tier]; count > 0 { + fmt.Printf(" Tier %d: %d\n", tier, count) + } + } + fmt.Printf("By confidence:\n") + for conf, count := range stats.ByConfidence { + fmt.Printf(" %s: %d\n", conf, count) + } + return nil + }, +} + +func init() { + providersCmd.AddCommand(providersListCmd) + providersCmd.AddCommand(providersInfoCmd) + providersCmd.AddCommand(providersStatsCmd) +} +``` + +Create **cmd/config.go**: +```go +package cmd + +import ( + "fmt" + "os" + "path/filepath" + + "github.com/spf13/cobra" + "github.com/spf13/viper" +) + +var configCmd = &cobra.Command{ + Use: "config", + Short: "Manage KeyHunter configuration", +} + +var configInitCmd = &cobra.Command{ + Use: "init", + Short: "Create default configuration file at ~/.keyhunter.yaml", + RunE: func(cmd *cobra.Command, args []string) error { + home, err := os.UserHomeDir() + if err != nil { + return fmt.Errorf("cannot determine home directory: %w", err) + } + configPath := filepath.Join(home, ".keyhunter.yaml") + + // Set defaults before writing + viper.SetDefault("scan.workers", 0) + viper.SetDefault("database.path", filepath.Join(home, ".keyhunter", "keyhunter.db")) + + if err := viper.WriteConfigAs(configPath); err != nil { + return fmt.Errorf("writing config: %w", err) + } + fmt.Printf("Config initialized: %s\n", configPath) + return nil + }, +} + +var configSetCmd = &cobra.Command{ + Use: "set ", + Short: "Set a configuration value", + Args: cobra.ExactArgs(2), + RunE: func(cmd *cobra.Command, 
args []string) error { + key, value := args[0], args[1] + viper.Set(key, value) + if err := viper.WriteConfig(); err != nil { + // If config file doesn't exist yet, create it + home, _ := os.UserHomeDir() + configPath := filepath.Join(home, ".keyhunter.yaml") + if err2 := viper.WriteConfigAs(configPath); err2 != nil { + return fmt.Errorf("writing config: %w", err2) + } + } + fmt.Printf("Set %s = %s\n", key, value) + return nil + }, +} + +var configGetCmd = &cobra.Command{ + Use: "get ", + Short: "Get a configuration value", + Args: cobra.ExactArgs(1), + RunE: func(cmd *cobra.Command, args []string) error { + val := viper.Get(args[0]) + if val == nil { + return fmt.Errorf("key %q not found", args[0]) + } + fmt.Printf("%v\n", val) + return nil + }, +} + +func init() { + configCmd.AddCommand(configInitCmd) + configCmd.AddCommand(configSetCmd) + configCmd.AddCommand(configGetCmd) +} +``` + + + cd /home/salva/Documents/apikey && go build -o keyhunter . && ./keyhunter providers list && ./keyhunter providers info openai && echo "PROVIDERS OK" + + + - `go build -o keyhunter .` exits 0 + - `./keyhunter --help` shows scan, providers, config commands + - `./keyhunter providers list` prints table with >= 3 rows including "openai" + - `./keyhunter providers info openai` prints Name, Tier, Keywords, Patterns, Verify URL + - `./keyhunter providers stats` prints "Total providers: 3" or more + - `./keyhunter config init` creates or updates ~/.keyhunter.yaml + - `./keyhunter config set scan.workers 16` exits 0 + - `./keyhunter scan testdata/samples/openai_key.txt` exits 1 (keys found) and prints a table row with "openai" + - `./keyhunter scan testdata/samples/no_keys.txt` exits 0 and prints "No API keys found." + - `grep -q 'viper\.BindPFlag' cmd/scan.go` exits 0 + + Full CLI works: scan finds and persists keys, providers list/info/stats work, config init/set/get work. Phase 1 success criteria all met. 
+
+
+
+
+Complete Phase 1 implementation:
+- Provider registry with 3 YAML definitions, Aho-Corasick automaton, schema validation
+- Storage layer with AES-256-GCM encryption, Argon2id key derivation, SQLite WAL mode
+- Three-stage scan engine: keyword pre-filter → regex + entropy detector → finding channel
+- CLI: keyhunter scan, providers list/info/stats, config init/set/get
+
+
+Run these commands from the project root and confirm each expected output:
+
+1. `cd /home/salva/Documents/apikey && go test ./... -v -count=1`
+   Expected: All tests PASS, zero FAIL, zero SKIP (the original test stubs should now be filled in)
+
+2. `./keyhunter scan testdata/samples/openai_key.txt`
+   Expected: Exit code 1, table printed with 1 row showing "openai" provider, masked key
+
+3. `./keyhunter scan testdata/samples/no_keys.txt`
+   Expected: Exit code 0, "No API keys found." printed
+
+4. `./keyhunter providers list`
+   Expected: Table with openai, anthropic, huggingface rows
+
+5. `./keyhunter providers info openai`
+   Expected: Name, Tier 1, Keywords including "sk-proj-", Pattern regex shown
+
+6. `./keyhunter config init`
+   Expected: "Config initialized: " followed by the absolute path of ~/.keyhunter.yaml, and the file exists
+
+7. `./keyhunter config set scan.workers 16 && ./keyhunter config get scan.workers`
+   Expected: "Set scan.workers = 16" then "16"
+
+8. Build the binary with production flags:
+   `CGO_ENABLED=0 go build -ldflags="-s -w" -o keyhunter-prod .`
+   Expected: Builds without error, binary produced
+
+  Type "approved" if all 8 checks pass, or describe which check failed and what output you saw.
+
+
+
+
+
+Full Phase 1 integration check:
+- `go test ./... -count=1` exits 0
+- `./keyhunter scan testdata/samples/openai_key.txt` exits 1 with findings table
+- `./keyhunter scan testdata/samples/no_keys.txt` exits 0 with "No API keys found."
+- `./keyhunter providers list` shows 3+ providers
+- `./keyhunter config init` creates ~/.keyhunter.yaml
+- `CGO_ENABLED=0 go build -ldflags="-s -w" -o keyhunter-prod .` exits 0
+
+
+
+- Cobra CLI with scan, providers, config commands (CLI-01)
+- `keyhunter config init` creates ~/.keyhunter.yaml (CLI-02)
+- `keyhunter config set key value` persists (CLI-03)
+- `keyhunter providers list/info/stats` work (CLI-04)
+- scan flags: --workers, --verify, --unmask, --output, --exclude (CLI-05)
+- All Phase 1 success criteria from ROADMAP.md satisfied:
+  1. `keyhunter scan ./somefile` runs three-stage pipeline and returns findings with provider names
+  2. Findings persisted to SQLite with AES-256 encrypted key_value
+  3. `keyhunter config init` and `config set` work
+  4. `keyhunter providers list/info` return provider metadata from YAML
+  5. Provider YAML has format_version and last_verified, validated at load time
+
+
+
+After completion, create `.planning/phases/01-foundation/01-05-SUMMARY.md` following the summary template.
+
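The verification steps above rely on an exit-code convention: `keyhunter scan` exits 1 when at least one key is found and 0 when none are. The scan command itself is not shown in this part of the patch, so the following is only a sketch of how that mapping could look; `exitCodeFor` and `summaryFor` are hypothetical helpers, and only the "No API keys found." message is taken from the plan.

```go
package main

import "fmt"

// exitCodeFor is a hypothetical helper (not from the plan). It maps the
// number of findings to the exit codes the verification steps expect:
// 1 if any key was found, 0 otherwise.
func exitCodeFor(findings int) int {
	if findings > 0 {
		return 1
	}
	return 0
}

// summaryFor is also hypothetical. It returns the message the plan expects
// for an empty result, and an empty string otherwise, since the findings
// table (not modelled here) would be printed instead.
func summaryFor(findings int) string {
	if findings == 0 {
		return "No API keys found."
	}
	return ""
}

func main() {
	// Illustrative counts only. A real command would end by passing the
	// result of exitCodeFor to os.Exit.
	for _, n := range []int{0, 1} {
		fmt.Printf("findings=%d exit=%d summary=%q\n", n, exitCodeFor(n), summaryFor(n))
	}
}
```

A non-zero exit on findings lets a CI job fail the build when a key is detected. The trade-off is that callers cannot tell "keys found" from "scan failed" unless real errors use a different code.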