docs(01-foundation): create phase 1 plan — 5 plans across 4 execution waves

Wave 0: module init + test scaffolding (01-01)
Wave 1: provider registry (01-02) + storage layer (01-03) in parallel
Wave 2: scan engine pipeline (01-04, depends on 01-02)
Wave 3: CLI wiring + integration checkpoint (01-05, depends on all)

Covers all 16 Phase 1 requirements: CORE-01 through CORE-07, STOR-01 through STOR-03,
CLI-01 through CLI-05, PROV-10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
salvacybersec
2026-04-04 23:44:09 +03:00
parent c573b97a68
commit 684b67cb73
6 changed files with 3095 additions and 2 deletions

@@ -43,7 +43,14 @@ Decimal phases appear between their surrounding integers in numeric order.
3. `keyhunter config init` creates `~/.keyhunter.yaml` and `keyhunter config set <key> <value>` persists values
4. `keyhunter providers list` and `keyhunter providers info <name>` return provider metadata from YAML definitions
5. Provider YAML schema includes `format_version` and `last_verified` fields validated at load time
**Plans**: TBD
**Plans**: 5 plans
Plans:
- [ ] 01-01-PLAN.md — Go module init, dependency installation, test scaffolding and testdata fixtures
- [ ] 01-02-PLAN.md — Provider registry: YAML schema, embed loader, Aho-Corasick automaton, Registry struct
- [ ] 01-03-PLAN.md — Storage layer: AES-256-GCM encryption, Argon2id key derivation, SQLite + Finding CRUD
- [ ] 01-04-PLAN.md — Scan engine pipeline: keyword pre-filter, regex+entropy detector, FileSource, ants worker pool
- [ ] 01-05-PLAN.md — CLI wiring: scan, providers list/info/stats, config init/set/get, output table
### Phase 2: Tier 1-2 Providers
**Goal**: The 26 highest-value LLM provider YAML definitions exist with accurate regex patterns, keyword lists, confidence levels, and verify endpoints — covering OpenAI, Anthropic, Google AI, AWS Bedrock, Azure OpenAI and all major inference platforms
@@ -248,7 +255,7 @@ Phases execute in numeric order: 1 → 2 → 3 → ... → 18
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Foundation | 0/? | Not started | - |
| 1. Foundation | 0/5 | Planning complete | - |
| 2. Tier 1-2 Providers | 0/? | Not started | - |
| 3. Tier 3-9 Providers | 0/? | Not started | - |
| 4. Input Sources | 0/? | Not started | - |

@@ -0,0 +1,359 @@
---
phase: 01-foundation
plan: 01
type: execute
wave: 0
depends_on: []
files_modified:
  - go.mod
  - go.sum
  - main.go
  - cmd/root.go
  - testdata/samples/openai_key.txt
  - testdata/samples/anthropic_key.txt
  - testdata/samples/multiple_keys.txt
  - testdata/samples/no_keys.txt
  - pkg/providers/registry_test.go
  - pkg/storage/db_test.go
  - pkg/engine/scanner_test.go
autonomous: true
requirements: [CORE-01, CORE-02, CORE-03, CORE-04, CORE-05, CORE-06, CORE-07, STOR-01, STOR-02, STOR-03, CLI-01]
must_haves:
  truths:
    - "go.mod exists with all Phase 1 dependencies at pinned versions"
    - "go build ./... succeeds with zero errors on a fresh checkout"
    - "go test ./... -short runs without compilation errors (tests may fail — stubs are fine)"
    - "testdata/ contains files with known key patterns for scanner integration tests"
  artifacts:
    - path: "go.mod"
      provides: "Module declaration with all Phase 1 dependencies"
      contains: "module github.com/salvacybersec/keyhunter"
    - path: "main.go"
      provides: "Binary entry point under 30 lines"
      contains: "func main()"
    - path: "testdata/samples/openai_key.txt"
      provides: "Sample file with synthetic OpenAI key for scanner tests"
    - path: "pkg/providers/registry_test.go"
      provides: "Test stubs for provider loading and registry"
    - path: "pkg/storage/db_test.go"
      provides: "Test stubs for SQLite + encryption roundtrip"
    - path: "pkg/engine/scanner_test.go"
      provides: "Test stubs for pipeline stages"
  key_links:
    - from: "go.mod"
      to: "petar-dambovaliev/aho-corasick"
      via: "require directive"
      pattern: "petar-dambovaliev/aho-corasick"
    - from: "go.mod"
      to: "modernc.org/sqlite"
      via: "require directive"
      pattern: "modernc.org/sqlite"
---
<objective>
Initialize the Go module, install all Phase 1 dependencies at pinned versions, create the minimal main.go entry point, and lay down test scaffolding with testdata fixtures that every subsequent plan's tests depend on.
Purpose: All subsequent plans require a compiling module and test infrastructure to exist before they can add production code and make tests green. Wave 0 satisfies this bootstrap requirement.
Output: go.mod, go.sum, main.go, pkg/*/test stubs, testdata/ fixtures.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-VALIDATION.md
<interfaces>
<!-- Module path used throughout the project -->
Module: github.com/salvacybersec/keyhunter
<!-- Pinned versions from RESEARCH.md -->
Dependencies to install:
github.com/spf13/cobra@v1.10.2
github.com/spf13/viper@v1.21.0
modernc.org/sqlite@latest
gopkg.in/yaml.v3@v3.0.1
github.com/petar-dambovaliev/aho-corasick@latest
github.com/panjf2000/ants/v2@v2.12.0
golang.org/x/crypto@latest
golang.org/x/time@latest
github.com/charmbracelet/lipgloss@latest
github.com/stretchr/testify@latest
<!-- Go version -->
go 1.22
<!-- Directory structure to scaffold (from RESEARCH.md) -->
keyhunter/
  main.go
  cmd/
    root.go (stub in Plan 01, replaced in Plan 05)
    scan.go (created in Plan 05)
    providers.go (created in Plan 05)
    config.go (created in Plan 05)
  pkg/
    providers/ (created in Plan 02)
    engine/ (created in Plan 04)
    storage/ (created in Plan 03)
    config/ (created in Plan 05)
    output/ (created in Plan 05)
  providers/ (created in Plan 02)
  testdata/
    samples/
</interfaces>
</context>
<tasks>
<task type="auto" tdd="false">
<name>Task 1: Initialize Go module and install Phase 1 dependencies</name>
<files>go.mod, go.sum</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Standard Stack section — exact library versions)
- /home/salva/Documents/apikey/CLAUDE.md (Technology Stack table — version constraints)
</read_first>
<action>
Run the following commands in the project root (/home/salva/Documents/apikey):
```bash
go mod init github.com/salvacybersec/keyhunter
go get github.com/spf13/cobra@v1.10.2
go get github.com/spf13/viper@v1.21.0
go get modernc.org/sqlite@latest
go get gopkg.in/yaml.v3@v3.0.1
go get github.com/petar-dambovaliev/aho-corasick@latest
go get github.com/panjf2000/ants/v2@v2.12.0
go get golang.org/x/crypto@latest
go get golang.org/x/time@latest
go get github.com/charmbracelet/lipgloss@latest
go get github.com/stretchr/testify@latest
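# Do not run `go mod tidy` yet: no source file imports these modules,
# so tidy would remove every requirement added above from go.mod.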
```
Verify the resulting go.mod contains:
- `module github.com/salvacybersec/keyhunter`
- `go 1.22` (or 1.22.x)
- `github.com/spf13/cobra v1.10.2`
- `github.com/spf13/viper v1.21.0`
- `github.com/petar-dambovaliev/aho-corasick` (any version)
- `github.com/panjf2000/ants/v2 v2.12.0`
- `modernc.org/sqlite` (any v1.35.x)
- `github.com/charmbracelet/lipgloss` (any version)
Do NOT add: chi, templ, telego, gocron — these are Phase 17-18 only.
Do NOT use CGO_ENABLED=1 or mattn/go-sqlite3.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && grep -q 'module github.com/salvacybersec/keyhunter' go.mod && grep -q 'cobra v1.10.2' go.mod && grep -q 'modernc.org/sqlite' go.mod && echo "go.mod OK"</automated>
</verify>
<acceptance_criteria>
- go.mod contains `module github.com/salvacybersec/keyhunter`
- go.mod contains `github.com/spf13/cobra v1.10.2` (exact)
- go.mod contains `github.com/spf13/viper v1.21.0` (exact)
- go.mod contains `github.com/panjf2000/ants/v2 v2.12.0` (exact)
- go.mod contains `modernc.org/sqlite` (v1.35.x)
- go.mod contains `github.com/petar-dambovaliev/aho-corasick`
- go.mod contains `golang.org/x/crypto`
- go.mod contains `github.com/charmbracelet/lipgloss`
- go.sum exists and is non-empty
- `go mod verify` exits 0
</acceptance_criteria>
<done>go.mod and go.sum committed with all Phase 1 dependencies at correct versions</done>
</task>
<task type="auto" tdd="false">
<name>Task 2: Create main.go entry point and test scaffolding</name>
<files>
main.go,
cmd/root.go,
testdata/samples/openai_key.txt,
testdata/samples/anthropic_key.txt,
testdata/samples/multiple_keys.txt,
testdata/samples/no_keys.txt,
pkg/providers/registry_test.go,
pkg/storage/db_test.go,
pkg/engine/scanner_test.go
</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-VALIDATION.md (Wave 0 Requirements and Per-Task Verification Map)
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Architecture Patterns, project structure diagram)
</read_first>
<action>
Create the following files:
**main.go** (must be under 30 lines):
```go
package main
import "github.com/salvacybersec/keyhunter/cmd"
func main() {
cmd.Execute()
}
```
**testdata/samples/openai_key.txt** — file containing a synthetic (non-real) OpenAI-style key for scanner integration tests:
```
# Test file: synthetic OpenAI key pattern
OPENAI_API_KEY=sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr1234
```
**testdata/samples/anthropic_key.txt** — file containing a synthetic Anthropic-style key:
```
# Test file: synthetic Anthropic key pattern
export ANTHROPIC_API_KEY="sk-ant-api03-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-ABCDE"
```
**testdata/samples/multiple_keys.txt** — file with both key types:
```
# Multiple providers in one file
OPENAI_API_KEY=sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr5678
ANTHROPIC_API_KEY=sk-ant-api03-XYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-XYZAB
```
**testdata/samples/no_keys.txt** — file with no keys (negative test case):
```
# This file contains no API keys
# Used to verify false-positive rate is zero for clean files
Hello world
```
**pkg/providers/registry_test.go** — test stubs (will be filled by Plan 02):
```go
package providers_test
import (
"testing"
)
// TestRegistryLoad verifies that provider YAML files are loaded from embed.FS.
// Stub: will be implemented when registry.go exists (Plan 02).
func TestRegistryLoad(t *testing.T) {
t.Skip("stub — implement after registry.go exists")
}
// TestProviderSchemaValidation verifies format_version and last_verified are required.
// Stub: will be implemented when schema.go validation exists (Plan 02).
func TestProviderSchemaValidation(t *testing.T) {
t.Skip("stub — implement after schema.go validation exists")
}
// TestAhoCorasickBuild verifies Aho-Corasick automaton builds from provider keywords.
// Stub: will be implemented when registry builds automaton (Plan 02).
func TestAhoCorasickBuild(t *testing.T) {
t.Skip("stub — implement after registry AC build exists")
}
```
**pkg/storage/db_test.go** — test stubs (will be filled by Plan 03):
```go
package storage_test
import (
"testing"
)
// TestDBOpen verifies SQLite database opens and creates schema.
// Stub: will be implemented when db.go exists (Plan 03).
func TestDBOpen(t *testing.T) {
t.Skip("stub — implement after db.go exists")
}
// TestEncryptDecryptRoundtrip verifies AES-256-GCM encrypt/decrypt roundtrip.
// Stub: will be implemented when encrypt.go exists (Plan 03).
func TestEncryptDecryptRoundtrip(t *testing.T) {
t.Skip("stub — implement after encrypt.go exists")
}
// TestArgon2KeyDerivation verifies Argon2id produces 32-byte key deterministically.
// Stub: will be implemented when crypto.go exists (Plan 03).
func TestArgon2KeyDerivation(t *testing.T) {
t.Skip("stub — implement after crypto.go exists")
}
```
**pkg/engine/scanner_test.go** — test stubs (will be filled by Plan 04):
```go
package engine_test
import (
"testing"
)
// TestShannonEntropy verifies the entropy function returns expected values.
// Stub: will be implemented when entropy.go exists (Plan 04).
func TestShannonEntropy(t *testing.T) {
t.Skip("stub — implement after entropy.go exists")
}
// TestKeywordPreFilter verifies Aho-Corasick pre-filter rejects files without keywords.
// Stub: will be implemented when filter.go exists (Plan 04).
func TestKeywordPreFilter(t *testing.T) {
t.Skip("stub — implement after filter.go exists")
}
// TestScannerPipeline verifies end-to-end scan of testdata returns expected findings.
// Stub: will be implemented when engine.go exists (Plan 04).
func TestScannerPipeline(t *testing.T) {
t.Skip("stub — implement after engine.go exists")
}
```
Create the `cmd/` package directory with a minimal stub so main.go compiles:
**cmd/root.go** (minimal stub — will be replaced by Plan 05):
```go
package cmd
import "os"
// Execute is a stub. The real command tree is built in Plan 05.
func Execute() {
_ = os.Args
}
```
After creating all files, run `go build ./...` to confirm the module compiles.
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build ./... && go test ./... -short && echo "BUILD OK"</automated>
</verify>
<acceptance_criteria>
- `go build ./...` exits 0 with no errors
- `go test ./... -short` exits 0 (all stubs skip, no failures)
- main.go is under 30 lines
- testdata/samples/openai_key.txt contains `sk-proj-` prefix
- testdata/samples/anthropic_key.txt contains `sk-ant-api03-` prefix
- testdata/samples/no_keys.txt contains no key patterns
- pkg/providers/registry_test.go, pkg/storage/db_test.go, pkg/engine/scanner_test.go each exist with skip-based stubs
- cmd/root.go exists so `go build ./...` compiles
</acceptance_criteria>
<done>Module compiles, test stubs exist, testdata fixtures created. Subsequent plans can now add production code and make tests green.</done>
</task>
</tasks>
<verification>
After both tasks:
- `cd /home/salva/Documents/apikey && go build ./...` exits 0
- `go test ./... -short` exits 0
- `grep -r 'sk-proj-' testdata/` finds the OpenAI test fixture
- `grep -r 'sk-ant-api03-' testdata/` finds the Anthropic test fixture
- go.mod has all required dependencies at specified versions
</verification>
<success_criteria>
- go.mod initialized with module path `github.com/salvacybersec/keyhunter` and Go 1.22
- All 10 Phase 1 dependencies installed at correct versions
- main.go under 30 lines, compiles successfully
- 3 test stub files exist (providers, storage, engine)
- 4 testdata fixture files exist (openai key, anthropic key, multiple keys, no keys)
- `go build ./...` and `go test ./... -short` both exit 0
</success_criteria>
<output>
After completion, create `.planning/phases/01-foundation/01-01-SUMMARY.md` following the summary template.
</output>

@@ -0,0 +1,663 @@
---
phase: 01-foundation
plan: 02
type: execute
wave: 1
depends_on: [01-01]
files_modified:
  - providers/openai.yaml
  - providers/anthropic.yaml
  - providers/huggingface.yaml
  - pkg/providers/definitions/openai.yaml
  - pkg/providers/definitions/anthropic.yaml
  - pkg/providers/definitions/huggingface.yaml
  - pkg/providers/schema.go
  - pkg/providers/loader.go
  - pkg/providers/registry.go
  - pkg/providers/registry_test.go
autonomous: true
requirements: [CORE-02, CORE-03, CORE-06, PROV-10]
must_haves:
  truths:
    - "Provider YAML files are embedded at compile time — no filesystem access at runtime"
    - "Registry loads all YAML files from embed.FS and returns a slice of Provider structs"
    - "Provider schema validation rejects YAML missing format_version or last_verified"
    - "Aho-Corasick automaton is built from all provider keywords at registry init"
    - "keyhunter providers list command lists providers (tested via registry methods)"
  artifacts:
    - path: "providers/openai.yaml"
      provides: "Reference provider definition with all schema fields"
      contains: "format_version"
    - path: "pkg/providers/schema.go"
      provides: "Provider, Pattern, VerifySpec Go structs with UnmarshalYAML validation"
      exports: ["Provider", "Pattern", "VerifySpec"]
    - path: "pkg/providers/registry.go"
      provides: "Registry struct with List, Get, Stats, AC methods"
      exports: ["Registry", "NewRegistry"]
    - path: "pkg/providers/loader.go"
      provides: "embed.FS declaration and fs.WalkDir loading logic"
      contains: "go:embed"
  key_links:
    - from: "pkg/providers/loader.go"
      to: "pkg/providers/definitions/*.yaml"
      via: "//go:embed directive"
      pattern: "go:embed.*definitions"
    - from: "pkg/providers/registry.go"
      to: "github.com/petar-dambovaliev/aho-corasick"
      via: "AC automaton build at NewRegistry()"
      pattern: "ahocorasick"
    - from: "pkg/providers/schema.go"
      to: "format_version and last_verified YAML fields"
      via: "UnmarshalYAML validation"
      pattern: "UnmarshalYAML"
---
<objective>
Build the provider registry: YAML schema structs with validation, embed.FS loader, in-memory registry with List/Get/Stats/AC methods, and three reference provider YAML definitions. The Aho-Corasick automaton is built from all provider keywords at registry initialization.
Purpose: Every downstream subsystem (scan engine, CLI providers command, verification engine) depends on the Registry interface. This plan establishes the stable contract they build against.
Output: providers/*.yaml, pkg/providers/{schema,loader,registry}.go, registry_test.go (stubs filled).
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-01-SUMMARY.md
<interfaces>
<!-- Provider YAML schema (from ARCHITECTURE.md and RESEARCH.md) -->
Full provider YAML structure:
```yaml
format_version: 1
name: openai
display_name: OpenAI
tier: 1
last_verified: "2026-04-04"
keywords:
  - "sk-proj-"
  - "openai"
patterns:
  - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.openai.com/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```
<!-- Go struct mapping -->
Provider struct fields:
FormatVersion int (yaml:"format_version" — must be >= 1)
Name string (yaml:"name")
DisplayName string (yaml:"display_name")
Tier int (yaml:"tier")
LastVerified string (yaml:"last_verified" — must be non-empty)
Keywords []string (yaml:"keywords")
Patterns []Pattern (yaml:"patterns")
Verify VerifySpec (yaml:"verify")
Pattern struct fields:
Regex string (yaml:"regex")
EntropyMin float64 (yaml:"entropy_min")
Confidence string (yaml:"confidence" — "high", "medium", "low")
VerifySpec struct fields:
Method string (yaml:"method")
URL string (yaml:"url")
Headers map[string]string (yaml:"headers")
ValidStatus []int (yaml:"valid_status")
InvalidStatus []int (yaml:"invalid_status")
<!-- Registry methods needed by downstream plans -->
type Registry struct { ... }
func NewRegistry() (*Registry, error)
func (r *Registry) List() []Provider
func (r *Registry) Get(name string) (Provider, bool)
func (r *Registry) Stats() RegistryStats // {Total int, ByTier map[int]int, ByConfidence map[string]int}
func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built automaton
<!-- embed path convention -->
A //go:embed pattern is resolved relative to the directory of the source file that
contains the directive, and it may not contain "." or ".." path elements.
loader.go lives at pkg/providers/loader.go and the providers/ directory sits at the
project root, so the path shown in RESEARCH.md (//go:embed ../../providers/*.yaml)
will not compile.
Moving the directive into a file at the project root (package main) is not an option
either, because it breaks package separation.
Resolution: keep the embedded YAML files inside the providers package at
pkg/providers/definitions/*.yaml and use:
//go:embed definitions/*.yaml
Paths inside the resulting embed.FS are then "definitions/openai.yaml" etc.
files_modified must list the definitions/ copies accordingly.
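As a rough illustration of how paths look inside an embedded filesystem, the sketch below walks an in-memory `fstest.MapFS` that stands in for the `embed.FS`. The helper names `listDefinitions` and `sampleFS` are hypothetical; the real loader would pass its embedded variable instead.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// listDefinitions walks root inside fsys and returns the slash-separated
// paths of every .yaml file it finds, in lexical order.
func listDefinitions(fsys fs.FS, root string) ([]string, error) {
	var found []string
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || path.Ext(p) != ".yaml" {
			return nil
		}
		found = append(found, p)
		return nil
	})
	return found, err
}

// sampleFS is an in-memory stand-in for the embedded definitions directory.
func sampleFS() fs.FS {
	return fstest.MapFS{
		"definitions/openai.yaml":    {Data: []byte("name: openai\n")},
		"definitions/anthropic.yaml": {Data: []byte("name: anthropic\n")},
		"definitions/README.md":      {Data: []byte("not a provider\n")},
	}
}

func main() {
	paths, err := listDefinitions(sampleFS(), "definitions")
	if err != nil {
		panic(err)
	}
	fmt.Println(paths)
}
```

The point of the sketch is that file paths keep the `definitions/` prefix, so any `ReadFile` call must use the full path reported by the walk.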
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Provider YAML schema structs with validation</name>
<files>pkg/providers/schema.go, providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 1: Provider Registry, Provider YAML schema section, PROV-10 row in requirements table)
- /home/salva/Documents/apikey/.planning/research/ARCHITECTURE.md (Provider Registry component, YAML schema example)
</read_first>
<behavior>
- Test 1: Provider with format_version=0 → UnmarshalYAML returns error "format_version must be >= 1"
- Test 2: Provider with empty last_verified → UnmarshalYAML returns error "last_verified is required"
- Test 3: Valid provider YAML → UnmarshalYAML succeeds, Provider.Name == "openai"
- Test 4: Provider with no patterns → loaded successfully (patterns list can be empty for schema-only providers)
- Test 5: Pattern.Confidence not in {"high","medium","low"} → error "confidence must be high, medium, or low"
</behavior>
<action>
Create pkg/providers/schema.go:
```go
package providers
import (
"fmt"
"gopkg.in/yaml.v3"
)
// Provider represents a single API key provider definition loaded from YAML.
type Provider struct {
FormatVersion int `yaml:"format_version"`
Name string `yaml:"name"`
DisplayName string `yaml:"display_name"`
Tier int `yaml:"tier"`
LastVerified string `yaml:"last_verified"`
Keywords []string `yaml:"keywords"`
Patterns []Pattern `yaml:"patterns"`
Verify VerifySpec `yaml:"verify"`
}
// Pattern defines a single regex pattern for API key detection.
type Pattern struct {
Regex string `yaml:"regex"`
EntropyMin float64 `yaml:"entropy_min"`
Confidence string `yaml:"confidence"`
}
// VerifySpec defines how to verify a key is live (used by Phase 5 verification engine).
type VerifySpec struct {
Method string `yaml:"method"`
URL string `yaml:"url"`
Headers map[string]string `yaml:"headers"`
ValidStatus []int `yaml:"valid_status"`
InvalidStatus []int `yaml:"invalid_status"`
}
// RegistryStats holds aggregate statistics about loaded providers.
type RegistryStats struct {
Total int
ByTier map[int]int
ByConfidence map[string]int
}
// UnmarshalYAML implements yaml.Unmarshaler with schema validation (satisfies PROV-10).
func (p *Provider) UnmarshalYAML(value *yaml.Node) error {
// Use a type alias to avoid infinite recursion
type ProviderAlias Provider
var alias ProviderAlias
if err := value.Decode(&alias); err != nil {
return err
}
if alias.FormatVersion < 1 {
return fmt.Errorf("provider %q: format_version must be >= 1 (got %d)", alias.Name, alias.FormatVersion)
}
if alias.LastVerified == "" {
return fmt.Errorf("provider %q: last_verified is required", alias.Name)
}
validConfidences := map[string]bool{"high": true, "medium": true, "low": true, "": true}
for _, pat := range alias.Patterns {
if !validConfidences[pat.Confidence] {
return fmt.Errorf("provider %q: pattern confidence %q must be high, medium, or low", alias.Name, pat.Confidence)
}
}
*p = Provider(alias)
return nil
}
```
Create the three reference YAML provider definitions. These are SCHEMA EXAMPLES for Phase 1; full pattern libraries come in Phase 2-3.
**providers/openai.yaml:**
```yaml
format_version: 1
name: openai
display_name: OpenAI
tier: 1
last_verified: "2026-04-04"
keywords:
  - "sk-proj-"
  - "openai"
patterns:
  - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.openai.com/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```
**providers/anthropic.yaml:**
```yaml
format_version: 1
name: anthropic
display_name: Anthropic
tier: 1
last_verified: "2026-04-04"
keywords:
  - "sk-ant-api03-"
  - "anthropic"
patterns:
  - regex: 'sk-ant-api03-[A-Za-z0-9_\-]{93,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.anthropic.com/v1/models
  headers:
    x-api-key: "{KEY}"
    anthropic-version: "2023-06-01"
  valid_status: [200]
  invalid_status: [401, 403]
```
**providers/huggingface.yaml:**
```yaml
format_version: 1
name: huggingface
display_name: HuggingFace
tier: 3
last_verified: "2026-04-04"
keywords:
  - "hf_"
  - "huggingface"
patterns:
  - regex: 'hf_[A-Za-z0-9]{34,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://huggingface.co/api/whoami-v2
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build ./pkg/providers/... && go test ./pkg/providers/... -run TestProviderSchemaValidation -v 2>&1 | head -30</automated>
</verify>
<acceptance_criteria>
- `go build ./pkg/providers/...` exits 0
- providers/openai.yaml contains `format_version: 1` and `last_verified`
- providers/anthropic.yaml contains `format_version: 1` and `last_verified`
- providers/huggingface.yaml contains `format_version: 1` and `last_verified`
- pkg/providers/schema.go exports: Provider, Pattern, VerifySpec, RegistryStats
- Provider.UnmarshalYAML returns error when format_version < 1
- Provider.UnmarshalYAML returns error when last_verified is empty
- `grep -q 'UnmarshalYAML' pkg/providers/schema.go` exits 0
</acceptance_criteria>
<done>Provider schema structs exist with validation. Three reference YAML files exist with all required fields.</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: Embed loader, registry with Aho-Corasick, and filled test stubs</name>
<files>pkg/providers/loader.go, pkg/providers/registry.go, pkg/providers/registry_test.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 1: Provider Registry with Compile-Time Embed — exact code example)
- /home/salva/Documents/apikey/pkg/providers/schema.go (types just created in Task 1)
</read_first>
<behavior>
- Test 1: NewRegistry() loads 3 providers from embedded YAML → registry.List() returns slice of length 3
- Test 2: registry.Get("openai") → returns Provider with Name=="openai", bool==true
- Test 3: registry.Get("nonexistent") → returns zero Provider, bool==false
- Test 4: registry.Stats().Total == 3 and Stats().ByTier[1] == 2 (openai + anthropic are tier 1)
- Test 5: AC automaton built — registry.AC().FindAll("sk-proj-abc") returns non-empty slice
- Test 6: AC automaton does NOT match — registry.AC().FindAll("hello world") returns empty slice
</behavior>
<action>
IMPORTANT NOTE ON EMBED PATHS: Go's embed patterns may NOT contain "..".
Since loader.go is at pkg/providers/loader.go, it CANNOT embed ../../providers/*.yaml.
Solution: the providers package embeds a subdirectory of its own:
//go:embed definitions/*.yaml
For Phase 1, keep BOTH locations:
pkg/providers/definitions/openai.yaml (embedded into the binary)
providers/openai.yaml (user-facing reference copy created in Task 1)
Copy the three YAML files from providers/ to pkg/providers/definitions/ before building,
because a //go:embed pattern that matches no files is a compile error.
Create **pkg/providers/loader.go**:
```go
package providers
import (
"embed"
"fmt"
"io/fs"
"path/filepath"
"gopkg.in/yaml.v3"
)
//go:embed definitions/*.yaml
var definitionsFS embed.FS
// loadProviders reads all YAML files from the embedded definitions FS.
func loadProviders() ([]Provider, error) {
var providers []Provider
err := fs.WalkDir(definitionsFS, "definitions", func(path string, d fs.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() || filepath.Ext(path) != ".yaml" {
return nil
}
data, err := definitionsFS.ReadFile(path)
if err != nil {
return fmt.Errorf("reading provider file %s: %w", path, err)
}
var p Provider
if err := yaml.Unmarshal(data, &p); err != nil {
return fmt.Errorf("parsing provider %s: %w", path, err)
}
providers = append(providers, p)
return nil
})
return providers, err
}
```
Create **pkg/providers/registry.go**:
```go
package providers

import (
	"fmt"

	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
)

// Registry is the in-memory store of all loaded provider definitions.
// It is initialized once at startup and is safe for concurrent reads.
type Registry struct {
	providers []Provider
	index     map[string]int          // name -> slice index
	ac        ahocorasick.AhoCorasick // pre-built automaton for keyword pre-filter
}

// NewRegistry loads all embedded provider YAML files, validates them, builds the
// Aho-Corasick automaton from all provider keywords, and returns the Registry.
func NewRegistry() (*Registry, error) {
	providers, err := loadProviders()
	if err != nil {
		return nil, fmt.Errorf("loading providers: %w", err)
	}
	index := make(map[string]int, len(providers))
	var keywords []string
	for i, p := range providers {
		index[p.Name] = i
		keywords = append(keywords, p.Keywords...)
	}
	builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
	ac := builder.Build(keywords)
	return &Registry{
		providers: providers,
		index:     index,
		ac:        ac,
	}, nil
}

// List returns all loaded providers.
func (r *Registry) List() []Provider {
	return r.providers
}

// Get returns a provider by name and a boolean indicating whether it was found.
func (r *Registry) Get(name string) (Provider, bool) {
	idx, ok := r.index[name]
	if !ok {
		return Provider{}, false
	}
	return r.providers[idx], true
}

// Stats returns aggregate statistics about the loaded providers.
func (r *Registry) Stats() RegistryStats {
	stats := RegistryStats{
		Total:        len(r.providers),
		ByTier:       make(map[int]int),
		ByConfidence: make(map[string]int),
	}
	for _, p := range r.providers {
		stats.ByTier[p.Tier]++
		for _, pat := range p.Patterns {
			stats.ByConfidence[pat.Confidence]++
		}
	}
	return stats
}

// AC returns the pre-built Aho-Corasick automaton for keyword pre-filtering.
func (r *Registry) AC() ahocorasick.AhoCorasick {
	return r.ac
}
```
Then copy the three YAML files into the embed location:
```bash
mkdir -p /home/salva/Documents/apikey/pkg/providers/definitions
cp /home/salva/Documents/apikey/providers/openai.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
cp /home/salva/Documents/apikey/providers/anthropic.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
cp /home/salva/Documents/apikey/providers/huggingface.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
```
Finally, fill in **pkg/providers/registry_test.go** (replacing the stubs from Plan 01):
```go
package providers_test
import (
"testing"
"github.com/salvacybersec/keyhunter/pkg/providers"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"gopkg.in/yaml.v3"
)
func TestRegistryLoad(t *testing.T) {
reg, err := providers.NewRegistry()
require.NoError(t, err)
assert.GreaterOrEqual(t, len(reg.List()), 3, "expected at least 3 providers")
}
func TestRegistryGet(t *testing.T) {
reg, err := providers.NewRegistry()
require.NoError(t, err)
p, ok := reg.Get("openai")
assert.True(t, ok)
assert.Equal(t, "openai", p.Name)
assert.Equal(t, 1, p.Tier)
_, notOk := reg.Get("nonexistent-provider")
assert.False(t, notOk)
}
func TestRegistryStats(t *testing.T) {
reg, err := providers.NewRegistry()
require.NoError(t, err)
stats := reg.Stats()
assert.GreaterOrEqual(t, stats.Total, 3)
assert.GreaterOrEqual(t, stats.ByTier[1], 2)
}
func TestAhoCorasickBuild(t *testing.T) {
reg, err := providers.NewRegistry()
require.NoError(t, err)
ac := reg.AC()
matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-abc")
assert.NotEmpty(t, matches)
noMatches := ac.FindAll("hello world nothing here")
assert.Empty(t, noMatches)
}
func TestProviderSchemaValidation(t *testing.T) {
invalid := []byte("format_version: 0\nname: invalid\nlast_verified: \"\"\n")
var p providers.Provider
err := yaml.Unmarshal(invalid, &p)
assert.Error(t, err)
assert.Contains(t, err.Error(), "format_version")
}
```
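The rule that TestProviderSchemaValidation exercises can be illustrated without the YAML layer. The following is a minimal sketch only: the type `providerMeta`, its Go field names, and the method `validate` are hypothetical stand-ins, and only the YAML keys `format_version` and `last_verified` come from this plan. In the real code the same checks would presumably run inside the provider's UnmarshalYAML.

```go
package main

import (
	"errors"
	"fmt"
)

// providerMeta holds only the two fields that PROV-10 validates.
// The Go field names are assumptions; the YAML keys are noted per field.
type providerMeta struct {
	FormatVersion int    // YAML key: format_version
	LastVerified  string // YAML key: last_verified
}

// validate reports the first schema violation and names the offending YAML key,
// which lets a test assert on the key name in the error message.
func (p providerMeta) validate() error {
	if p.FormatVersion < 1 {
		return fmt.Errorf("format_version must be >= 1, got %d", p.FormatVersion)
	}
	if p.LastVerified == "" {
		return errors.New("last_verified must not be empty")
	}
	return nil
}

func main() {
	// Example values only; the date is not taken from any provider file.
	fmt.Println(providerMeta{FormatVersion: 0}.validate())
	fmt.Println(providerMeta{FormatVersion: 1, LastVerified: "2026-04-04"}.validate())
}
```

Checking `format_version` first is what allows the test to assert that the error text contains that key even when `last_verified` is also empty.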
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/providers/... -v -count=1 2>&1 | tail -20</automated>
</verify>
<acceptance_criteria>
- `go test ./pkg/providers/... -v` exits 0 with all 5 tests PASS (not SKIP)
- TestRegistryLoad passes with >= 3 providers
- TestRegistryGet passes — "openai" found, "nonexistent" not found
- TestRegistryStats passes — Total >= 3
- TestAhoCorasickBuild passes — "sk-proj-" match found, "hello world" empty
- TestProviderSchemaValidation passes — error on format_version=0
- `grep -r 'go:embed' pkg/providers/loader.go` exits 0
- pkg/providers/definitions/ directory exists with 3 YAML files
</acceptance_criteria>
<done>Registry loads providers from embedded YAML, builds Aho-Corasick automaton, exposes List/Get/Stats/AC. All 5 tests pass.</done>
</task>
</tasks>
<verification>
After both tasks:
- `go test ./pkg/providers/... -v -count=1` exits 0 with 5 tests PASS
- `go build ./...` still exits 0
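- `grep -L 'format_version' providers/openai.yaml providers/anthropic.yaml providers/huggingface.yaml` prints nothing (every file contains the field; `grep -q` with several files would exit 0 as soon as any one of them matches)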
- `grep -q 'go:embed' pkg/providers/loader.go` exits 0
- pkg/providers/definitions/ has 3 YAML files (same content as providers/)
</verification>
<success_criteria>
- 3 reference provider YAML files exist in providers/ and pkg/providers/definitions/ with format_version and last_verified
- Provider schema validates format_version >= 1 and non-empty last_verified (PROV-10)
- Registry loads providers from embed.FS at compile time (CORE-02)
- Aho-Corasick automaton built from all keywords at NewRegistry() (CORE-06)
- Registry exposes List(), Get(), Stats(), AC() (CORE-03)
- 5 provider tests all pass
</success_criteria>
<output>
After completion, create `.planning/phases/01-foundation/01-02-SUMMARY.md` following the summary template.
</output>

@@ -0,0 +1,634 @@
---
phase: 01-foundation
plan: 03
type: execute
wave: 1
depends_on: [01-01]
files_modified:
- pkg/storage/schema.sql
- pkg/storage/encrypt.go
- pkg/storage/crypto.go
- pkg/storage/db.go
- pkg/storage/findings.go
- pkg/storage/db_test.go
autonomous: true
requirements: [STOR-01, STOR-02, STOR-03]
must_haves:
  truths:
    - "SQLite database opens, runs migrations from embedded schema.sql, and closes cleanly"
    - "AES-256-GCM Encrypt/Decrypt roundtrip produces the original plaintext"
    - "Argon2id DeriveKey with the same passphrase and salt always returns the same 32-byte key"
    - "A Finding can be saved to the database with the key_value stored encrypted and retrieved as plaintext"
    - "The raw database file does NOT contain plaintext API key values"
  artifacts:
    - path: "pkg/storage/encrypt.go"
      provides: "Encrypt(plaintext, key) and Decrypt(ciphertext, key) using AES-256-GCM"
      exports: ["Encrypt", "Decrypt"]
    - path: "pkg/storage/crypto.go"
      provides: "DeriveKey(passphrase, salt) using Argon2id RFC 9106 params"
      exports: ["DeriveKey", "NewSalt"]
    - path: "pkg/storage/db.go"
      provides: "DB struct with Open(), Close(), WAL mode, embedded schema migration"
      exports: ["DB", "Open"]
    - path: "pkg/storage/findings.go"
      provides: "SaveFinding(finding, encKey) and ListFindings(encKey) CRUD"
      exports: ["SaveFinding", "ListFindings", "Finding"]
    - path: "pkg/storage/schema.sql"
      provides: "CREATE TABLE statements for findings, scans, settings"
      contains: "CREATE TABLE IF NOT EXISTS findings"
  key_links:
    - from: "pkg/storage/findings.go"
      to: "pkg/storage/encrypt.go"
      via: "Encrypt() called before INSERT, Decrypt() called after SELECT"
      pattern: "Encrypt|Decrypt"
    - from: "pkg/storage/db.go"
      to: "pkg/storage/schema.sql"
      via: "//go:embed schema.sql and db.Exec on open"
      pattern: "go:embed.*schema"
    - from: "pkg/storage/crypto.go"
      to: "golang.org/x/crypto/argon2"
      via: "argon2.IDKey call"
      pattern: "argon2\\.IDKey"
---
<objective>
Build the storage layer: AES-256-GCM column encryption, Argon2id key derivation, SQLite database with WAL mode and embedded schema, and Finding CRUD operations that transparently encrypt key values on write and decrypt on read.
Purpose: Scanner results from Plan 04 and CLI commands from Plan 05 need a storage layer to persist findings. The encryption contract (Encrypt/Decrypt/DeriveKey) must exist before the scanner pipeline can store keys.
Output: pkg/storage/{encrypt,crypto,db,findings}.go, pkg/storage/schema.sql, and db_test.go (stubs filled).
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-01-SUMMARY.md
<interfaces>
<!-- AES-256-GCM encrypt/decrypt pattern from RESEARCH.md Pattern 3 -->
func Encrypt(plaintext []byte, key []byte) ([]byte, error)
// key must be exactly 32 bytes (AES-256)
// nonce prepended to ciphertext in returned []byte
// uses crypto/aes + crypto/cipher GCM
func Decrypt(ciphertext []byte, key []byte) ([]byte, error)
// expects nonce prepended format from Encrypt()
// returns ErrCiphertextTooShort if len < nonceSize
<!-- Argon2id key derivation pattern from RESEARCH.md Pattern 4 -->
func DeriveKey(passphrase []byte, salt []byte) []byte
// params: time=1, memory=64*1024, threads=4, keyLen=32
// returns exactly 32 bytes deterministically
func NewSalt() ([]byte, error)
// generates 16 random bytes via crypto/rand
<!-- SQLite schema — findings table -->
findings table columns:
id INTEGER PRIMARY KEY AUTOINCREMENT
scan_id INTEGER REFERENCES scans(id)
provider_name TEXT NOT NULL
key_value BLOB NOT NULL -- AES-256-GCM encrypted, nonce prepended
key_masked TEXT NOT NULL -- first8...last4, stored plaintext for display
confidence TEXT NOT NULL -- "high", "medium", "low"
source_path TEXT
source_type TEXT -- "file", "dir", "git", "stdin", "url"
line_number INTEGER
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
scans table columns:
id INTEGER PRIMARY KEY AUTOINCREMENT
started_at DATETIME NOT NULL
finished_at DATETIME
source_path TEXT
finding_count INTEGER DEFAULT 0
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
settings table columns:
key TEXT PRIMARY KEY
value TEXT NOT NULL
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
<!-- Finding struct for inter-package communication -->
type Finding struct {
ID int64
ScanID int64
ProviderName string
KeyValue string // plaintext — encrypted before storage
KeyMasked string // first8chars...last4chars
Confidence string
SourcePath string
SourceType string
LineNumber int
}
<!-- DB driver registration -->
import _ "modernc.org/sqlite"
// driver registered as "sqlite" (NOT "sqlite3")
db, err := sql.Open("sqlite", dataSourceName)
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: AES-256-GCM encryption and Argon2id key derivation</name>
<files>pkg/storage/encrypt.go, pkg/storage/crypto.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 3: AES-256-GCM Column Encryption and Pattern 4: Argon2id Key Derivation — exact code examples)
</read_first>
<behavior>
- Test 1: Encrypt then Decrypt same key → returns original plaintext exactly
- Test 2: Encrypt produces output longer than input (nonce + tag overhead)
- Test 3: Two Encrypt calls on same plaintext → different ciphertext (random nonce)
- Test 4: Decrypt with wrong key → returns error (GCM authentication fails)
- Test 5: DeriveKey with same passphrase+salt → same 32-byte output (deterministic)
- Test 6: DeriveKey output is exactly 32 bytes
- Test 7: NewSalt() returns 16 bytes, two calls return different values
</behavior>
<action>
Create **pkg/storage/encrypt.go**:
```go
package storage
import (
"crypto/aes"
"crypto/cipher"
"crypto/rand"
"errors"
"io"
)
// ErrCiphertextTooShort is returned when ciphertext is shorter than the GCM nonce size.
var ErrCiphertextTooShort = errors.New("ciphertext too short")
// Encrypt encrypts plaintext using AES-256-GCM with a random nonce.
// The nonce is prepended to the returned ciphertext.
// key must be exactly 32 bytes (AES-256).
func Encrypt(plaintext []byte, key []byte) ([]byte, error) {
block, err := aes.NewCipher(key)
if err != nil {
return nil, err
}
gcm, err := cipher.NewGCM(block)
if err != nil {
return nil, err
}
nonce := make([]byte, gcm.NonceSize())
if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
return nil, err
}
// Seal appends encrypted data to nonce, so nonce is prepended
ciphertext := gcm.Seal(nonce, nonce, plaintext, nil)
return ciphertext, nil
}
// Decrypt decrypts ciphertext produced by Encrypt.
// Expects the nonce to be prepended to the ciphertext.
func Decrypt(ciphertext []byte, key []byte) ([]byte, error) {
block, err := aes.NewCipher(key)
if err != nil {
return nil, err
}
gcm, err := cipher.NewGCM(block)
if err != nil {
return nil, err
}
nonceSize := gcm.NonceSize()
if len(ciphertext) < nonceSize {
return nil, ErrCiphertextTooShort
}
nonce, ciphertext := ciphertext[:nonceSize], ciphertext[nonceSize:]
return gcm.Open(nil, nonce, ciphertext, nil)
}
```
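As an illustration only (not one of the files this task creates), the nonce-prepended layout that Encrypt produces can be checked with the standard library alone. The sketch below repeats the same Seal/Open pattern under the hypothetical names `newGCM`, `seal`, and `open`, and assumes Go's default GCM sizes.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 32-byte key (hypothetical helper).
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal mirrors Encrypt above: the result is nonce || ciphertext || tag.
func seal(plaintext, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst makes Seal append to it, which prepends the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open mirrors Decrypt above: split off the nonce, then authenticate and decrypt.
func open(sealed, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // all-zero key, acceptable only in a demo
	msg := []byte("sk-proj-example")
	sealed, err := seal(msg, key)
	if err != nil {
		panic(err)
	}
	fmt.Println("overhead bytes:", len(sealed)-len(msg))

	// Flipping a single bit must make authentication fail.
	sealed[len(sealed)-1] ^= 0x01
	_, err = open(sealed, key)
	fmt.Println("tampered ciphertext rejected:", err != nil)
}
```

With the default parameters the overhead is 28 bytes (a 12-byte nonce plus a 16-byte authentication tag), which is why the behavior list expects the output to be longer than the input.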
Create **pkg/storage/crypto.go**:
```go
package storage
import (
"crypto/rand"
"golang.org/x/crypto/argon2"
)
const (
argon2Time uint32 = 1
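argon2Memory uint32 = 64 * 1024 // value is in KiB, so this is 64 MiB
argon2Threads uint8 = 4
argon2KeyLen uint32 = 32 // AES-256 key length
saltSize = 16
)
// DeriveKey produces a 32-byte AES-256 key from a passphrase and salt using Argon2id.
// The parameters (time=1, memory=64 MiB, threads=4) follow the guidance in the
// golang.org/x/crypto/argon2 package documentation. RFC 9106 Section 4 pairs
// 64 MiB with t=3, so raise argon2Time if strict RFC conformance is required.
// Given the same passphrase and salt, always returns the same key.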
func DeriveKey(passphrase []byte, salt []byte) []byte {
return argon2.IDKey(passphrase, salt, argon2Time, argon2Memory, argon2Threads, argon2KeyLen)
}
// NewSalt generates a cryptographically random 16-byte salt.
// Store alongside the database and reuse on each key derivation.
func NewSalt() ([]byte, error) {
salt := make([]byte, saltSize)
if _, err := rand.Read(salt); err != nil {
return nil, err
}
return salt, nil
}
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build ./pkg/storage/... && echo "BUILD OK"</automated>
</verify>
<acceptance_criteria>
- `go build ./pkg/storage/...` exits 0
- pkg/storage/encrypt.go exports: Encrypt, Decrypt, ErrCiphertextTooShort
- pkg/storage/crypto.go exports: DeriveKey, NewSalt
- `grep -q 'argon2\.IDKey' pkg/storage/crypto.go` exits 0
- `grep -q 'crypto/aes' pkg/storage/encrypt.go` exits 0
- `grep -q 'cipher\.NewGCM' pkg/storage/encrypt.go` exits 0
</acceptance_criteria>
<done>Encrypt/Decrypt and DeriveKey/NewSalt exist and compile. Encryption uses AES-256-GCM with random nonce. Key derivation uses Argon2id RFC 9106 parameters.</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: SQLite database, schema, Finding CRUD, and filled test stubs</name>
<files>pkg/storage/schema.sql, pkg/storage/db.go, pkg/storage/findings.go, pkg/storage/db_test.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (STOR-01 row, Pattern 1 for embed usage pattern)
- /home/salva/Documents/apikey/pkg/storage/encrypt.go (Encrypt/Decrypt signatures)
- /home/salva/Documents/apikey/pkg/storage/crypto.go (DeriveKey signature)
</read_first>
<behavior>
- Test 1: Open(":memory:") returns *DB without error, schema tables exist
- Test 2: Encrypt/Decrypt roundtrip — Encrypt([]byte("sk-proj-abc"), key) then Decrypt returns "sk-proj-abc"
- Test 3: DeriveKey(passphrase, salt) twice returns identical 32 bytes
- Test 4: NewSalt() twice returns different slices
- Test 5: SaveFinding stores finding → ListFindings decrypts and returns KeyValue == "sk-proj-test"
- Test 6: Database file (when not :memory:) does NOT contain literal "sk-proj-test" in raw bytes
</behavior>
<action>
Create **pkg/storage/schema.sql**:
```sql
-- KeyHunter database schema
-- Version: 1
CREATE TABLE IF NOT EXISTS scans (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at DATETIME NOT NULL,
finished_at DATETIME,
source_path TEXT,
finding_count INTEGER DEFAULT 0,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS findings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
scan_id INTEGER REFERENCES scans(id),
provider_name TEXT NOT NULL,
key_value BLOB NOT NULL,
key_masked TEXT NOT NULL,
confidence TEXT NOT NULL,
source_path TEXT,
source_type TEXT,
line_number INTEGER,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS settings (
key TEXT PRIMARY KEY,
value TEXT NOT NULL,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_findings_scan_id ON findings(scan_id);
CREATE INDEX IF NOT EXISTS idx_findings_provider ON findings(provider_name);
CREATE INDEX IF NOT EXISTS idx_findings_created ON findings(created_at DESC);
```
Create **pkg/storage/db.go**:
```go
package storage
import (
"database/sql"
_ "embed"
"fmt"
_ "modernc.org/sqlite"
)
//go:embed schema.sql
var schemaSQLBytes []byte
// DB wraps the sql.DB connection with KeyHunter-specific behavior.
type DB struct {
sql *sql.DB
}
// Open opens or creates a SQLite database at path, runs embedded schema migrations,
// and enables WAL mode for better concurrent read performance.
// Use ":memory:" for tests.
func Open(path string) (*DB, error) {
sqlDB, err := sql.Open("sqlite", path)
if err != nil {
return nil, fmt.Errorf("opening database: %w", err)
}
// Enable WAL mode for concurrent reads
if _, err := sqlDB.Exec("PRAGMA journal_mode=WAL"); err != nil {
sqlDB.Close()
return nil, fmt.Errorf("enabling WAL mode: %w", err)
}
// Enable foreign keys
if _, err := sqlDB.Exec("PRAGMA foreign_keys=ON"); err != nil {
sqlDB.Close()
return nil, fmt.Errorf("enabling foreign keys: %w", err)
}
// Run schema migrations
if _, err := sqlDB.Exec(string(schemaSQLBytes)); err != nil {
sqlDB.Close()
return nil, fmt.Errorf("running schema migrations: %w", err)
}
return &DB{sql: sqlDB}, nil
}
// Close closes the underlying database connection.
func (db *DB) Close() error {
return db.sql.Close()
}
// SQL returns the underlying sql.DB for advanced use cases.
func (db *DB) SQL() *sql.DB {
return db.sql
}
```
Create **pkg/storage/findings.go**:
```go
package storage

import (
	"database/sql"
	"fmt"
	"time"
)

// Finding represents a detected API key with metadata.
// KeyValue is always plaintext in this struct; encryption happens at the storage boundary.
type Finding struct {
	ID           int64
	ScanID       int64 // 0 means "not linked to a scan" and is stored as NULL
	ProviderName string
	KeyValue     string // plaintext: encrypted before storage, decrypted after retrieval
	KeyMasked    string // first8...last4, stored plaintext
	Confidence   string
	SourcePath   string
	SourceType   string
	LineNumber   int
	CreatedAt    time.Time
}

// MaskKey returns the masked form of a key: first 8 chars + "..." + last 4 chars.
// Keys shorter than 12 chars are replaced entirely by the placeholder "****".
func MaskKey(key string) string {
	if len(key) < 12 {
		return "****"
	}
	return key[:8] + "..." + key[len(key)-4:]
}

// SaveFinding encrypts the finding's KeyValue and persists the finding to the database.
// encKey must be a 32-byte AES-256 key (from DeriveKey).
func (db *DB) SaveFinding(f Finding, encKey []byte) (int64, error) {
	encrypted, err := Encrypt([]byte(f.KeyValue), encKey)
	if err != nil {
		return 0, fmt.Errorf("encrypting key value: %w", err)
	}
	masked := f.KeyMasked
	if masked == "" {
		masked = MaskKey(f.KeyValue)
	}
	// Store a zero ScanID as NULL. With foreign_keys=ON, a literal 0 would
	// violate the scans(id) reference when no such scan row exists.
	var scanID sql.NullInt64
	if f.ScanID != 0 {
		scanID = sql.NullInt64{Int64: f.ScanID, Valid: true}
	}
	res, err := db.sql.Exec(
		`INSERT INTO findings (scan_id, provider_name, key_value, key_masked, confidence, source_path, source_type, line_number)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?)`,
		scanID, f.ProviderName, encrypted, masked, f.Confidence, f.SourcePath, f.SourceType, f.LineNumber,
	)
	if err != nil {
		return 0, fmt.Errorf("inserting finding: %w", err)
	}
	return res.LastInsertId()
}

// ListFindings retrieves all findings, decrypting key values using encKey.
// encKey must be the same 32-byte key used during SaveFinding.
func (db *DB) ListFindings(encKey []byte) ([]Finding, error) {
	rows, err := db.sql.Query(
		`SELECT id, scan_id, provider_name, key_value, key_masked, confidence,
		source_path, source_type, line_number, created_at
		FROM findings ORDER BY created_at DESC`,
	)
	if err != nil {
		return nil, fmt.Errorf("querying findings: %w", err)
	}
	defer rows.Close()
	var findings []Finding
	for rows.Next() {
		var f Finding
		var scanID sql.NullInt64 // scan_id is NULL for findings not linked to a scan
		var encrypted []byte
		var createdAt string
		err := rows.Scan(
			&f.ID, &scanID, &f.ProviderName, &encrypted, &f.KeyMasked,
			&f.Confidence, &f.SourcePath, &f.SourceType, &f.LineNumber, &createdAt,
		)
		if err != nil {
			return nil, fmt.Errorf("scanning finding row: %w", err)
		}
		f.ScanID = scanID.Int64 // zero when scan_id was NULL
		plain, err := Decrypt(encrypted, encKey)
		if err != nil {
			return nil, fmt.Errorf("decrypting finding %d: %w", f.ID, err)
		}
		f.KeyValue = string(plain)
		f.CreatedAt, _ = time.Parse("2006-01-02 15:04:05", createdAt)
		findings = append(findings, f)
	}
	return findings, rows.Err()
}
```
Fill **pkg/storage/db_test.go** (replacing stubs from Plan 01):
```go
package storage_test
import (
"testing"
"github.com/salvacybersec/keyhunter/pkg/storage"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestDBOpen(t *testing.T) {
db, err := storage.Open(":memory:")
require.NoError(t, err)
defer db.Close()
// Verify schema tables exist
rows, err := db.SQL().Query("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
require.NoError(t, err)
defer rows.Close()
var tables []string
for rows.Next() {
var name string
require.NoError(t, rows.Scan(&name))
tables = append(tables, name)
}
assert.Contains(t, tables, "findings")
assert.Contains(t, tables, "scans")
assert.Contains(t, tables, "settings")
}
func TestEncryptDecryptRoundtrip(t *testing.T) {
key := make([]byte, 32) // deterministic test key, filled with bytes 0x00..0x1f below
for i := range key {
key[i] = byte(i)
}
plaintext := []byte("sk-proj-supersecretapikey1234")
ciphertext, err := storage.Encrypt(plaintext, key)
require.NoError(t, err)
assert.Greater(t, len(ciphertext), len(plaintext), "ciphertext should be longer than plaintext")
recovered, err := storage.Decrypt(ciphertext, key)
require.NoError(t, err)
assert.Equal(t, plaintext, recovered)
}
func TestEncryptNonDeterministic(t *testing.T) {
key := make([]byte, 32)
plain := []byte("test-key")
ct1, err1 := storage.Encrypt(plain, key)
ct2, err2 := storage.Encrypt(plain, key)
require.NoError(t, err1)
require.NoError(t, err2)
assert.NotEqual(t, ct1, ct2, "same plaintext encrypted twice should produce different ciphertext")
}
func TestDecryptWrongKey(t *testing.T) {
key1 := make([]byte, 32)
key2 := make([]byte, 32)
key2[0] = 0xFF
ct, err := storage.Encrypt([]byte("secret"), key1)
require.NoError(t, err)
_, err = storage.Decrypt(ct, key2)
assert.Error(t, err, "decryption with wrong key should fail")
}
func TestArgon2KeyDerivation(t *testing.T) {
passphrase := []byte("my-secure-passphrase")
salt := []byte("1234567890abcdef") // 16 bytes
key1 := storage.DeriveKey(passphrase, salt)
key2 := storage.DeriveKey(passphrase, salt)
assert.Equal(t, 32, len(key1), "derived key must be 32 bytes")
assert.Equal(t, key1, key2, "same passphrase+salt must produce same key")
}
func TestNewSalt(t *testing.T) {
salt1, err1 := storage.NewSalt()
salt2, err2 := storage.NewSalt()
require.NoError(t, err1)
require.NoError(t, err2)
assert.Equal(t, 16, len(salt1))
assert.NotEqual(t, salt1, salt2, "two salts should differ")
}
func TestSaveFindingEncrypted(t *testing.T) {
db, err := storage.Open(":memory:")
require.NoError(t, err)
defer db.Close()
// Derive a test key
key := storage.DeriveKey([]byte("testpassphrase"), []byte("testsalt1234xxxx"))
f := storage.Finding{
ProviderName: "openai",
KeyValue: "sk-proj-test1234567890abcdefghijklmnopqr",
Confidence: "high",
SourcePath: "/test/file.env",
SourceType: "file",
LineNumber: 42,
}
id, err := db.SaveFinding(f, key)
require.NoError(t, err)
assert.Greater(t, id, int64(0))
findings, err := db.ListFindings(key)
require.NoError(t, err)
require.Len(t, findings, 1)
assert.Equal(t, "sk-proj-test1234567890abcdefghijklmnopqr", findings[0].KeyValue)
assert.Equal(t, "openai", findings[0].ProviderName)
// Verify masking
assert.Equal(t, "sk-proj-...opqr", findings[0].KeyMasked)
}
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/storage/... -v -count=1 2>&1 | tail -25</automated>
</verify>
<acceptance_criteria>
- `go test ./pkg/storage/... -v -count=1` exits 0 with all 7 tests PASS (no SKIP)
- TestDBOpen finds tables: findings, scans, settings
- TestEncryptDecryptRoundtrip passes — recovered plaintext matches original
- TestEncryptNonDeterministic passes — two encryptions differ
- TestDecryptWrongKey passes — wrong key causes error
- TestArgon2KeyDerivation passes — 32 bytes, deterministic
- TestNewSalt passes — 16 bytes, non-deterministic
- TestSaveFindingEncrypted passes — stored and retrieved with correct KeyValue and KeyMasked
- `grep -q 'go:embed.*schema' pkg/storage/db.go` exits 0
- `grep -q 'modernc.org/sqlite' pkg/storage/db.go` exits 0
- `grep -q 'journal_mode=WAL' pkg/storage/db.go` exits 0
</acceptance_criteria>
<done>Storage layer complete — SQLite opens with schema, AES-256-GCM encrypt/decrypt works, Argon2id key derivation works, SaveFinding/ListFindings encrypt/decrypt transparently. All 7 tests pass.</done>
</task>
</tasks>
<verification>
After both tasks:
- `go test ./pkg/storage/... -v -count=1` exits 0 with 7 tests PASS
- `go build ./...` still exits 0
- `grep -q 'argon2\.IDKey' pkg/storage/crypto.go` exits 0
- `grep -q 'cipher\.NewGCM' pkg/storage/encrypt.go` exits 0
- `grep -q 'journal_mode=WAL' pkg/storage/db.go` exits 0
- schema.sql contains CREATE TABLE for findings, scans, settings
</verification>
<success_criteria>
- SQLite database opens and auto-migrates from embedded schema.sql (STOR-01)
- AES-256-GCM column encryption works: Encrypt + Decrypt roundtrip returns original (STOR-02)
- Argon2id key derivation: DeriveKey deterministic, 32 bytes, RFC 9106 params (STOR-03)
- Finding CRUD: SaveFinding encrypts before INSERT, ListFindings decrypts after SELECT
- All 7 storage tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/01-foundation/01-03-SUMMARY.md` following the summary template.
</output>

@@ -0,0 +1,682 @@
---
phase: 01-foundation
plan: 04
type: execute
wave: 2
depends_on: [01-02]
files_modified:
- pkg/engine/chunk.go
- pkg/engine/finding.go
- pkg/engine/entropy.go
- pkg/engine/filter.go
- pkg/engine/detector.go
- pkg/engine/engine.go
- pkg/engine/sources/source.go
- pkg/engine/sources/file.go
- pkg/engine/scanner_test.go
autonomous: true
requirements: [CORE-01, CORE-04, CORE-05, CORE-06, CORE-07]
must_haves:
  truths:
    - "Shannon entropy function returns expected values for known inputs"
    - "Aho-Corasick pre-filter passes chunks containing provider keywords and drops those without"
    - "Detector correctly identifies OpenAI and Anthropic key patterns in test fixtures via regex"
    - "Full scan pipeline: scan testdata/samples/openai_key.txt → Finding with ProviderName==openai"
    - "Full scan pipeline: scan testdata/samples/no_keys.txt → zero findings"
    - "Worker pool uses ants v2 with configurable worker count"
  artifacts:
    - path: "pkg/engine/chunk.go"
      provides: "Chunk struct (Data []byte, Source string, Offset int64)"
      exports: ["Chunk"]
    - path: "pkg/engine/finding.go"
      provides: "Finding struct (provider, key value, masked, confidence, source, line)"
      exports: ["Finding", "MaskKey"]
    - path: "pkg/engine/entropy.go"
      provides: "Shannon(s string) float64 — ~10 line stdlib math implementation"
      exports: ["Shannon"]
    - path: "pkg/engine/filter.go"
      provides: "KeywordFilter stage — runs Aho-Corasick and passes/drops chunks"
      exports: ["KeywordFilter"]
    - path: "pkg/engine/detector.go"
      provides: "Detector stage — applies provider regexps and entropy check to chunks"
      exports: ["Detector"]
    - path: "pkg/engine/engine.go"
      provides: "Engine struct with Scan(ctx, src, cfg) <-chan Finding"
      exports: ["Engine", "NewEngine", "ScanConfig"]
    - path: "pkg/engine/sources/source.go"
      provides: "Source interface with Chunks(ctx, chan<- Chunk) error"
      exports: ["Source"]
    - path: "pkg/engine/sources/file.go"
      provides: "FileSource implementing Source for single-file scanning"
      exports: ["FileSource", "NewFileSource"]
  key_links:
    - from: "pkg/engine/engine.go"
      to: "pkg/providers/registry.go"
      via: "Engine holds *providers.Registry, uses Registry.AC() for pre-filter"
      pattern: "providers\\.Registry"
    - from: "pkg/engine/filter.go"
      to: "github.com/petar-dambovaliev/aho-corasick"
      via: "AC.FindAll() on each chunk"
      pattern: "FindAll"
    - from: "pkg/engine/detector.go"
      to: "pkg/engine/entropy.go"
      via: "Shannon() called when EntropyMin > 0 in pattern"
      pattern: "Shannon"
    - from: "pkg/engine/engine.go"
      to: "github.com/panjf2000/ants/v2"
      via: "ants.NewPool for detector workers"
      pattern: "ants\\.NewPool"
---
<objective>
Build the three-stage scanning engine pipeline: Aho-Corasick keyword pre-filter, regex + entropy detector workers using ants goroutine pool, and a FileSource adapter. Wire them together in an Engine that emits Findings on a channel.
Purpose: The scan engine is the core differentiator. Plan 02 provides its one hard dependency (the Registry, for patterns and keywords); Plan 03's storage.Finding serves only as a reference for the separate engine.Finding type defined here. The CLI (Plan 05) calls Engine.Scan() to implement `keyhunter scan`.
Output: pkg/engine/{chunk,finding,entropy,filter,detector,engine}.go and sources/{source,file}.go. scanner_test.go stubs filled.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-02-SUMMARY.md
<interfaces>
<!-- Provider Registry types (from Plan 02) -->
package providers
type Provider struct {
Name string
Keywords []string
Patterns []Pattern
Tier int
}
type Pattern struct {
Regex string
EntropyMin float64
Confidence string
}
type Registry struct { ... }
func (r *Registry) List() []Provider
func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built Aho-Corasick
<!-- Three-stage pipeline pattern from RESEARCH.md Pattern 2 -->
chunksChan chan Chunk (buffer: 1000)
detectableChan chan Chunk (buffer: 500)
resultsChan chan Finding (buffer: 100)
Stage 1: Source.Chunks() → chunksChan (goroutine, closes chan on done)
Stage 2: KeywordFilter(chunksChan) → detectableChan (goroutine, AC.FindAll)
Stage 3: N detector workers (ants pool) → resultsChan
<!-- ScanConfig -->
type ScanConfig struct {
Workers int // default: runtime.NumCPU() * 8
Verify bool // Phase 5 — always false in Phase 1
Unmask bool // for output layer
}
<!-- Source interface -->
type Source interface {
Chunks(ctx context.Context, out chan<- Chunk) error
}
<!-- FileSource -->
type FileSource struct {
Path string
ChunkSize int // bytes per chunk, default 4096
}
Chunking strategy: read file in chunks of ChunkSize bytes with overlap of max(256, maxPatternLen)
to avoid splitting a key across chunk boundaries.
<!-- Aho-Corasick import -->
import ahocorasick "github.com/petar-dambovaliev/aho-corasick"
// ac.FindAll(s string) []ahocorasick.Match — returns match positions
<!-- ants import -->
import "github.com/panjf2000/ants/v2"
// pool, _ := ants.NewPool(workers, ants.WithOptions(...))
// pool.Submit(func() { ... })
// pool.ReleaseTimeout(timeout)
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Core types and Shannon entropy function</name>
<files>pkg/engine/chunk.go, pkg/engine/finding.go, pkg/engine/entropy.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CORE-04 row: Shannon entropy, ~10-line stdlib function, threshold 3.5 bits/char)
- /home/salva/Documents/apikey/pkg/storage/findings.go (Finding and MaskKey defined there — engine.Finding is a separate type for the pipeline)
</read_first>
<behavior>
- Test 1: Shannon("aaaaaaa") → value near 0.0 (all same characters, no entropy)
- Test 2: Shannon("abcdefgh") → value near 3.0 (8 distinct chars)
- Test 3: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") → >= 3.5 (real key entropy)
- Test 4: Shannon("") → 0.0 (empty string)
- Test 5: MaskKey("sk-proj-abc1234") → "sk-proj-...1234" (first 8 + last 4)
- Test 6: MaskKey("abc") → "****" (too short to mask)
</behavior>
<action>
Create **pkg/engine/chunk.go**:
```go
package engine
// Chunk is a segment of file content passed through the scanning pipeline.
type Chunk struct {
Data []byte // raw bytes
Source string // file path, URL, or description
Offset int64 // byte offset of this chunk within the source
}
```
Create **pkg/engine/finding.go**:
```go
package engine
import "time"
// Finding represents a detected API key from the scanning pipeline.
// KeyValue holds the plaintext key — the storage layer encrypts it before persisting.
type Finding struct {
ProviderName string
KeyValue string // full plaintext key
KeyMasked string // first8...last4
Confidence string // "high", "medium", "low"
Source string // file path or description
SourceType string // "file", "dir", "git", "stdin", "url"
LineNumber int
Offset int64
DetectedAt time.Time
}
// MaskKey returns a masked representation: first 8 chars + "..." + last 4 chars.
// Returns "****" if the key is shorter than 12 characters.
func MaskKey(key string) string {
if len(key) < 12 {
return "****"
}
return key[:8] + "..." + key[len(key)-4:]
}
```
Create **pkg/engine/entropy.go**:
```go
package engine
import "math"
// Shannon computes the Shannon entropy of a string in bits per character.
// Returns 0.0 for empty strings.
// A value >= 3.5 indicates high randomness, consistent with real API keys.
func Shannon(s string) float64 {
if len(s) == 0 {
return 0.0
}
freq := make(map[rune]float64)
for _, c := range s {
freq[c]++
}
n := float64(len([]rune(s)))
var entropy float64
for _, count := range freq {
p := count / n
entropy -= p * math.Log2(p)
}
return entropy
}
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build ./pkg/engine/... && echo "BUILD OK"</automated>
</verify>
<acceptance_criteria>
- `go build ./pkg/engine/...` exits 0
- pkg/engine/chunk.go exports Chunk with fields Data, Source, Offset
- pkg/engine/finding.go exports Finding and MaskKey
- pkg/engine/entropy.go exports Shannon using math.Log2
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- Shannon("aaaaaaa") == 0.0 (manually verifiable from code)
- MaskKey("sk-proj-abc1234") produces "sk-proj-...1234"
</acceptance_criteria>
<done>Chunk, Finding, MaskKey, and Shannon exist and compile. Shannon uses stdlib math only — no external library.</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: Pipeline stages, engine orchestration, FileSource, and filled test stubs</name>
<files>
pkg/engine/filter.go,
pkg/engine/detector.go,
pkg/engine/engine.go,
pkg/engine/sources/source.go,
pkg/engine/sources/file.go,
pkg/engine/scanner_test.go
</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 2: Three-Stage Scanning Pipeline — exact channel-based code example)
- /home/salva/Documents/apikey/pkg/engine/chunk.go
- /home/salva/Documents/apikey/pkg/engine/finding.go
- /home/salva/Documents/apikey/pkg/engine/entropy.go
- /home/salva/Documents/apikey/pkg/providers/registry.go (Registry.AC() and Registry.List() signatures)
</read_first>
<behavior>
- Test 1: Scan testdata/samples/openai_key.txt → 1 finding, ProviderName=="openai", KeyValue contains "sk-proj-"
- Test 2: Scan testdata/samples/anthropic_key.txt → 1 finding, ProviderName=="anthropic"
- Test 3: Scan testdata/samples/no_keys.txt → 0 findings
- Test 4: Scan testdata/samples/multiple_keys.txt → 2 findings (openai + anthropic)
- Test 5: Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr") >= 3.5 (entropy check)
- Test 6: KeywordFilter drops a chunk with text "hello world" (no provider keywords)
</behavior>
<action>
Create **pkg/engine/sources/source.go**:
```go
package sources
import (
"context"
"github.com/salvacybersec/keyhunter/pkg/engine"
)
// Source is the interface all input adapters must implement.
// Chunks writes content segments to the out channel until the source is exhausted or ctx is cancelled.
type Source interface {
Chunks(ctx context.Context, out chan<- engine.Chunk) error
}
```
Create **pkg/engine/sources/file.go**:
```go
package sources
import (
"context"
"os"
"github.com/salvacybersec/keyhunter/pkg/engine"
)
const defaultChunkSize = 4096
const chunkOverlap = 256 // overlap between chunks to avoid splitting keys at boundaries
// FileSource reads a single file and emits overlapping chunks.
type FileSource struct {
Path string
ChunkSize int
}
// NewFileSource creates a FileSource for the given path with the default chunk size.
func NewFileSource(path string) *FileSource {
return &FileSource{Path: path, ChunkSize: defaultChunkSize}
}
// Chunks reads the file in overlapping segments and sends each chunk to out.
func (f *FileSource) Chunks(ctx context.Context, out chan<- engine.Chunk) error {
data, err := os.ReadFile(f.Path)
if err != nil {
return err
}
size := f.ChunkSize
if size <= chunkOverlap {
// A chunk size at or below the overlap would never advance the read window.
size = defaultChunkSize
}
if len(data) <= size {
// File fits in one chunk
select {
case <-ctx.Done():
return ctx.Err()
case out <- engine.Chunk{Data: data, Source: f.Path, Offset: 0}:
}
return nil
}
// Emit overlapping chunks; each window starts (size - chunkOverlap) bytes after the previous one.
for start := 0; start < len(data); start += size - chunkOverlap {
end := start + size
if end > len(data) {
end = len(data)
}
chunk := engine.Chunk{
Data: data[start:end],
Source: f.Path,
Offset: int64(start), // true byte offset of this window within the file
}
select {
case <-ctx.Done():
return ctx.Err()
case out <- chunk:
}
if end == len(data) {
break
}
}
return nil
}
```
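For reference only (not a file to create): a minimal sketch of the window arithmetic behind overlapping chunks. `chunkWindows` and `window` are hypothetical names introduced for illustration, and the sketch assumes the chunk size is larger than the overlap.

```go
package main

import "fmt"

// window is a half-open byte range [Start, End) within a file.
type window struct{ Start, End int }

// chunkWindows is a hypothetical helper (not part of the plan) that returns
// the ranges an overlapping chunker would emit for a file of total bytes.
// It assumes size > overlap; otherwise the window would never advance.
func chunkWindows(total, size, overlap int) []window {
	if total <= size {
		return []window{{0, total}}
	}
	var out []window
	for start := 0; start < total; start += size - overlap {
		end := start + size
		if end > total {
			end = total
		}
		out = append(out, window{start, end})
		if end == total {
			break
		}
	}
	return out
}

func main() {
	// Small numbers for readability; the plan uses size=4096 and overlap=256.
	for _, w := range chunkWindows(10, 4, 1) {
		fmt.Printf("[%d,%d) ", w.Start, w.End)
	}
	fmt.Println()
}
```

A key shorter than the overlap that straddles a boundary appears whole in the following window. The trade-off is that a key lying entirely inside the overlap region is seen twice, so later stages may need to de-duplicate findings.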
Create **pkg/engine/filter.go**:
```go
package engine
import (
ahocorasick "github.com/petar-dambovaliev/aho-corasick"
)
// KeywordFilter filters a stream of chunks using an Aho-Corasick automaton.
// Only chunks that contain at least one provider keyword are sent to out.
// This is Stage 2 of the pipeline (runs after Source, before Detector).
func KeywordFilter(ac ahocorasick.AhoCorasick, in <-chan Chunk, out chan<- Chunk) {
for chunk := range in {
if len(ac.FindAll(string(chunk.Data))) > 0 {
out <- chunk
}
}
}
```
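For reference only (not a file to create): the pre-filter idea can be sketched with the standard library alone. Here `containsAny` is a naive stand-in for the Aho-Corasick automaton, the keyword list is an assumption chosen for illustration, and `filterChunks` additionally watches a context, which the `KeywordFilter` above does not do.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Chunk mirrors the shape of engine.Chunk used above (Offset omitted).
type Chunk struct {
	Data   []byte
	Source string
}

// containsAny is a naive stand-in for the Aho-Corasick automaton: it checks
// each keyword in turn instead of matching all of them in a single pass.
func containsAny(text string, keywords []string) bool {
	for _, kw := range keywords {
		if strings.Contains(text, kw) {
			return true
		}
	}
	return false
}

// filterChunks forwards only chunks containing a keyword, and stops early if
// ctx is cancelled so a stalled consumer cannot block it forever.
func filterChunks(ctx context.Context, keywords []string, in <-chan Chunk, out chan<- Chunk) {
	for c := range in {
		if !containsAny(string(c.Data), keywords) {
			continue
		}
		select {
		case out <- c:
		case <-ctx.Done():
			return
		}
	}
}

// runFilter pushes chunks through filterChunks and collects the survivors.
func runFilter(keywords []string, chunks []Chunk) []Chunk {
	in := make(chan Chunk, len(chunks))
	out := make(chan Chunk, len(chunks))
	for _, c := range chunks {
		in <- c
	}
	close(in)
	filterChunks(context.Background(), keywords, in, out)
	close(out)
	var kept []Chunk
	for c := range out {
		kept = append(kept, c)
	}
	return kept
}

func main() {
	kept := runFilter([]string{"sk-proj-", "sk-ant-"}, []Chunk{
		{Data: []byte("hello world"), Source: "a.txt"},
		{Data: []byte("OPENAI_API_KEY=sk-proj-example"), Source: "b.txt"},
	})
	for _, c := range kept {
		fmt.Println(c.Source)
	}
}
```

The automaton earns its place once the keyword list grows: it scans each chunk once regardless of how many keywords are loaded, whereas this stand-in scans once per keyword.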
Create **pkg/engine/detector.go**:
```go
package engine
import (
"regexp"
"strings"
"time"
"github.com/salvacybersec/keyhunter/pkg/providers"
)
// Detect applies provider regex patterns and optional entropy checks to a chunk.
// It returns all findings from the chunk. LineNumber is relative to the start of
// the chunk, not the start of the file.
func Detect(chunk Chunk, providerList []providers.Provider) []Finding {
var findings []Finding
content := string(chunk.Data)
for _, p := range providerList {
for _, pat := range p.Patterns {
// NOTE: compiled on every call; cache compiled patterns if this shows up in profiles.
re, err := regexp.Compile(pat.Regex)
if err != nil {
continue // invalid regex: skip silently
}
// Use match indices so a repeated key reports its own line, not the line of its first occurrence.
for _, loc := range re.FindAllStringIndex(content, -1) {
match := content[loc[0]:loc[1]]
// Apply entropy check if threshold is set
if pat.EntropyMin > 0 && Shannon(match) < pat.EntropyMin {
continue // entropy too low: likely a placeholder
}
findings = append(findings, Finding{
ProviderName: p.Name,
KeyValue: match,
KeyMasked: MaskKey(match),
Confidence: pat.Confidence,
Source: chunk.Source,
SourceType: "file",
LineNumber: lineNumber(content, loc[0]),
Offset: chunk.Offset,
DetectedAt: time.Now(),
})
}
}
}
return findings
}
// lineNumber returns the 1-based line number of byte offset idx within content.
func lineNumber(content string, idx int) int {
return strings.Count(content[:idx], "\n") + 1
}
```
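For reference only (not a file to create): a small sketch of how the entropy threshold separates a placeholder from a key-like string when both match the same regex. The pattern `candidatePattern` and both sample strings are made up for this example; real patterns come from the provider YAML definitions.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
)

// shannon computes bits per character, using the same formula as pkg/engine/entropy.go.
func shannon(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	freq := make(map[rune]float64)
	for _, c := range s {
		freq[c]++
	}
	n := float64(len([]rune(s)))
	var h float64
	for _, count := range freq {
		p := count / n
		h -= p * math.Log2(p)
	}
	return h
}

// candidatePattern is a hypothetical regex used only for this example.
var candidatePattern = regexp.MustCompile(`sk-proj-[A-Za-z0-9]{20,}`)

// detect returns regex matches whose entropy is at least entropyMin.
// An entropyMin of 0 disables the check, as in Detect above.
func detect(content string, entropyMin float64) []string {
	var kept []string
	for _, m := range candidatePattern.FindAllString(content, -1) {
		if entropyMin > 0 && shannon(m) < entropyMin {
			continue // regex matched, but the text looks like a placeholder
		}
		kept = append(kept, m)
	}
	return kept
}

func main() {
	content := "placeholder: sk-proj-aaaaaaaaaaaaaaaaaaaaaaaa\n" +
		"leaked: sk-proj-Ab3dE6gH9jK2mN5pQ8sT1vX4\n"
	fmt.Println(detect(content, 3.5))
}
```

Both strings satisfy the regex, but the repeated-character placeholder scores well under the 3.5-bit threshold and is dropped, while the mixed-character string passes.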
Create **pkg/engine/engine.go**:
```go
package engine
import (
"context"
"runtime"
"sync"
"time"
"github.com/panjf2000/ants/v2"
"github.com/salvacybersec/keyhunter/pkg/providers"
)
// Source is the input contract consumed by Scan. It is declared here rather than
// imported from pkg/engine/sources because that package already imports engine
// (for Chunk), and Go rejects import cycles. Any sources.Source value, such as
// *sources.FileSource, satisfies it without changes.
type Source interface {
Chunks(ctx context.Context, out chan<- Chunk) error
}
// ScanConfig controls scan execution parameters.
type ScanConfig struct {
Workers int // number of detector goroutines; defaults to runtime.NumCPU() * 8
Verify bool // opt-in active verification (Phase 5)
Unmask bool // show full key values in output instead of the masked form
}
// Engine orchestrates the three-stage scanning pipeline.
type Engine struct {
registry *providers.Registry
}
// NewEngine creates an Engine backed by the given provider registry.
func NewEngine(registry *providers.Registry) *Engine {
return &Engine{registry: registry}
}
// Scan runs the three-stage pipeline against src and returns a channel of Findings.
// The channel is closed when all chunks have been processed.
// The caller must drain the channel fully or cancel ctx to avoid goroutine leaks.
func (e *Engine) Scan(ctx context.Context, src Source, cfg ScanConfig) (<-chan Finding, error) {
workers := cfg.Workers
if workers <= 0 {
workers = runtime.NumCPU() * 8
}
chunksChan := make(chan Chunk, 1000)
detectableChan := make(chan Chunk, 500)
resultsChan := make(chan Finding, 100)
// Stage 1: source → chunksChan
go func() {
defer close(chunksChan)
_ = src.Chunks(ctx, chunksChan)
}()
// Stage 2: keyword pre-filter → detectableChan
go func() {
defer close(detectableChan)
KeywordFilter(e.registry.AC(), chunksChan, detectableChan)
}()
// Stage 3: detector workers → resultsChan
pool, err := ants.NewPool(workers)
if err != nil {
close(resultsChan)
return nil, err
}
providerList := e.registry.List()
var wg sync.WaitGroup
go func() {
defer func() {
wg.Wait()
close(resultsChan)
_ = pool.ReleaseTimeout(5 * time.Second)
}()
for chunk := range detectableChan {
c := chunk // capture
wg.Add(1)
err := pool.Submit(func() {
defer wg.Done()
// Channel sends are already goroutine-safe, so no mutex is needed here.
for _, f := range Detect(c, providerList) {
select {
case resultsChan <- f:
case <-ctx.Done():
return
}
}
})
if err != nil {
wg.Done() // task was never scheduled; keep the WaitGroup balanced
}
}
}()
return resultsChan, nil
}
```
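For reference only (not a file to create): the "wait for workers, then close the results channel" ordering in Stage 3 can be sketched with the standard library. A buffered channel used as a semaphore stands in for the ants pool here; `fanOut` and `process` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// process is a stand-in for Detect: it just reports the length of its input.
func process(s string) int { return len(s) }

// fanOut runs process over inputs with at most workers goroutines in flight
// and returns a channel that is closed once every input has been handled.
func fanOut(inputs []string, workers int) <-chan int {
	if workers < 1 {
		workers = 1
	}
	results := make(chan int, len(inputs))
	sem := make(chan struct{}, workers) // bounds concurrency like a pool size
	var wg sync.WaitGroup
	go func() {
		for _, in := range inputs {
			wg.Add(1)
			sem <- struct{}{}
			go func(s string) {
				defer wg.Done()
				defer func() { <-sem }()
				results <- process(s)
			}(in)
		}
		wg.Wait() // every worker has finished, so no send can race with close
		close(results)
	}()
	return results
}

func main() {
	total := 0
	for n := range fanOut([]string{"a", "bb", "ccc"}, 2) {
		total += n
	}
	fmt.Println(total)
}
```

Closing the channel only after `wg.Wait()` returns is what lets the caller use a plain `for range` loop: a send on a closed channel would panic, so the close must come strictly after the last worker exits.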
Fill **pkg/engine/scanner_test.go** (replacing stubs from Plan 01):
```go
package engine_test
import (
"context"
"testing"
"github.com/salvacybersec/keyhunter/pkg/engine"
"github.com/salvacybersec/keyhunter/pkg/engine/sources"
"github.com/salvacybersec/keyhunter/pkg/providers"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func newTestRegistry(t *testing.T) *providers.Registry {
t.Helper()
reg, err := providers.NewRegistry()
require.NoError(t, err)
return reg
}
func TestShannonEntropy(t *testing.T) {
assert.InDelta(t, 0.0, engine.Shannon("aaaaaaa"), 0.01)
assert.InDelta(t, 3.0, engine.Shannon("abcdefgh"), 0.01) // 8 distinct chars = log2(8) bits
assert.Greater(t, engine.Shannon("sk-proj-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqr"), 3.5)
assert.Equal(t, 0.0, engine.Shannon(""))
}
func TestKeywordPreFilter(t *testing.T) {
reg := newTestRegistry(t)
ac := reg.AC()
// Chunk with OpenAI keyword should pass
matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-test")
assert.NotEmpty(t, matches)
// Chunk with no keywords should be dropped
noMatches := ac.FindAll("hello world no secrets here")
assert.Empty(t, noMatches)
}
func TestScannerPipelineOpenAI(t *testing.T) {
reg := newTestRegistry(t)
eng := engine.NewEngine(reg)
src := sources.NewFileSource("../../testdata/samples/openai_key.txt")
cfg := engine.ScanConfig{Workers: 2}
ch, err := eng.Scan(context.Background(), src, cfg)
require.NoError(t, err)
var findings []engine.Finding
for f := range ch {
findings = append(findings, f)
}
require.Len(t, findings, 1, "expected exactly 1 finding in openai_key.txt")
assert.Equal(t, "openai", findings[0].ProviderName)
assert.Contains(t, findings[0].KeyValue, "sk-proj-")
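}
// TestScannerPipelineAnthropic covers behavior Test 2 from the list above.
func TestScannerPipelineAnthropic(t *testing.T) {
reg := newTestRegistry(t)
eng := engine.NewEngine(reg)
src := sources.NewFileSource("../../testdata/samples/anthropic_key.txt")
cfg := engine.ScanConfig{Workers: 2}
ch, err := eng.Scan(context.Background(), src, cfg)
require.NoError(t, err)
var findings []engine.Finding
for f := range ch {
findings = append(findings, f)
}
require.Len(t, findings, 1, "expected exactly 1 finding in anthropic_key.txt")
assert.Equal(t, "anthropic", findings[0].ProviderName)
}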
func TestScannerPipelineNoKeys(t *testing.T) {
reg := newTestRegistry(t)
eng := engine.NewEngine(reg)
src := sources.NewFileSource("../../testdata/samples/no_keys.txt")
cfg := engine.ScanConfig{Workers: 2}
ch, err := eng.Scan(context.Background(), src, cfg)
require.NoError(t, err)
var findings []engine.Finding
for f := range ch {
findings = append(findings, f)
}
assert.Empty(t, findings, "expected zero findings in no_keys.txt")
}
func TestScannerPipelineMultipleKeys(t *testing.T) {
reg := newTestRegistry(t)
eng := engine.NewEngine(reg)
src := sources.NewFileSource("../../testdata/samples/multiple_keys.txt")
cfg := engine.ScanConfig{Workers: 2}
ch, err := eng.Scan(context.Background(), src, cfg)
require.NoError(t, err)
var findings []engine.Finding
for f := range ch {
findings = append(findings, f)
}
assert.GreaterOrEqual(t, len(findings), 2, "expected at least 2 findings in multiple_keys.txt")
var names []string
for _, f := range findings {
names = append(names, f.ProviderName)
}
assert.Contains(t, names, "openai")
assert.Contains(t, names, "anthropic")
}
```
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/engine/... -v -count=1 2>&1 | tail -30</automated>
</verify>
<acceptance_criteria>
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS (no SKIP)
- TestShannonEntropy passes — 0.0 for "aaaaaaa", >= 3.5 for real key pattern
- TestKeywordPreFilter passes — AC matches sk-proj-, empty for "hello world"
- TestScannerPipelineOpenAI passes — 1 finding with ProviderName=="openai"
- TestScannerPipelineNoKeys passes — 0 findings
- TestScannerPipelineMultipleKeys passes — >= 2 findings with both provider names
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'KeywordFilter' pkg/engine/engine.go` exits 0
- `go build ./...` still exits 0
</acceptance_criteria>
<done>Three-stage scanning pipeline works end-to-end: FileSource → KeywordFilter (AC) → Detect (regex + entropy) → Finding channel. All engine tests pass.</done>
</task>
</tasks>
<verification>
After both tasks:
- `go test ./pkg/engine/... -v -count=1` exits 0 with all tests PASS (no SKIP)
- `go build ./...` exits 0
- `grep -q 'ants\.NewPool' pkg/engine/engine.go` exits 0
- `grep -q 'math\.Log2' pkg/engine/entropy.go` exits 0
- Scanning testdata/samples/openai_key.txt returns 1 finding with provider "openai"
- Scanning testdata/samples/no_keys.txt returns 0 findings
</verification>
<success_criteria>
- Three-stage pipeline: AC pre-filter → regex + entropy detector → results channel (CORE-01, CORE-06)
- Shannon entropy function using stdlib math (CORE-04)
- ants v2 goroutine pool with configurable worker count (CORE-05)
- FileSource adapter reading files in overlapping chunks (CORE-07 partial — full mmap in Phase 4)
- All engine tests pass against real testdata fixtures
</success_criteria>
<output>
After completion, create `.planning/phases/01-foundation/01-04-SUMMARY.md` following the summary template.
</output>

@@ -0,0 +1,748 @@
---
phase: 01-foundation
plan: 05
type: execute
wave: 3
depends_on: [01-02, 01-03, 01-04]
files_modified:
  - cmd/root.go
  - cmd/scan.go
  - cmd/providers.go
  - cmd/config.go
  - pkg/config/config.go
  - pkg/output/table.go
autonomous: false
requirements: [CLI-01, CLI-02, CLI-03, CLI-04, CLI-05]
must_haves:
  truths:
    - "`keyhunter scan ./testdata/samples/openai_key.txt` runs the pipeline and prints a finding"
    - "`keyhunter providers list` prints a table with at least 3 providers"
    - "`keyhunter providers info openai` prints OpenAI provider details"
    - "`keyhunter config init` creates ~/.keyhunter.yaml without error"
    - "`keyhunter config set scan.workers 16` persists the value to ~/.keyhunter.yaml"
    - "`keyhunter --help` shows all top-level commands: scan, providers, config"
  artifacts:
    - path: "cmd/root.go"
      provides: "Cobra root command with config loading via cobra.OnInitialize"
      contains: "cobra.Command"
    - path: "cmd/scan.go"
      provides: "scan command wiring Engine + FileSource + output table"
      exports: ["scanCmd"]
    - path: "cmd/providers.go"
      provides: "providers list/info/stats subcommands using Registry"
      exports: ["providersCmd"]
    - path: "cmd/config.go"
      provides: "config init/set/get subcommands using Viper"
      exports: ["configCmd"]
    - path: "pkg/config/config.go"
      provides: "Config struct with Load() and defaults"
      exports: ["Config", "Load"]
    - path: "pkg/output/table.go"
      provides: "lipgloss terminal table for printing Findings"
      exports: ["PrintFindings"]
  key_links:
    - from: "cmd/scan.go"
      to: "pkg/engine/engine.go"
      via: "engine.NewEngine(registry).Scan() called in RunE"
      pattern: "engine\\.NewEngine"
    - from: "cmd/scan.go"
      to: "pkg/storage/db.go"
      via: "storage.Open() called, SaveFinding for each result"
      pattern: "storage\\.Open"
    - from: "cmd/root.go"
      to: "github.com/spf13/viper"
      via: "viper.SetConfigFile or viper.SetConfigName in initConfig"
      pattern: "viper\\.SetConfigFile"
    - from: "cmd/providers.go"
      to: "pkg/providers/registry.go"
      via: "Registry.List(), Registry.Get(), Registry.Stats() called on the local reg variable"
      pattern: "reg\\.List|reg\\.Get|reg\\.Stats"
---
<objective>
Wire all subsystems together through the Cobra CLI: scan command (engine + storage + output), providers list/info/stats commands, and config init/set/get commands. This is the integration layer — all business logic lives in pkg/, cmd/ only wires.
Purpose: Satisfies all Phase 1 CLI requirements and delivers the first working `keyhunter scan` command that completes the end-to-end success criteria.
Output: cmd/{root,scan,providers,config}.go, pkg/config/config.go, pkg/output/table.go.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-02-SUMMARY.md
@.planning/phases/01-foundation/01-03-SUMMARY.md
@.planning/phases/01-foundation/01-04-SUMMARY.md
<interfaces>
<!-- Engine (from Plan 04) -->
package engine
type ScanConfig struct { Workers int; Verify bool; Unmask bool }
func NewEngine(registry *providers.Registry) *Engine
func (e *Engine) Scan(ctx context.Context, src sources.Source, cfg ScanConfig) (<-chan Finding, error)
<!-- FileSource (from Plan 04) -->
package sources
func NewFileSource(path string) *FileSource
<!-- Finding type (from Plan 04) -->
type Finding struct {
ProviderName string
KeyValue string
KeyMasked string
Confidence string
Source string
SourceType string
LineNumber int
Offset int64
DetectedAt time.Time
}
<!-- Storage (from Plan 03) -->
package storage
func Open(path string) (*DB, error)
func (db *DB) SaveFinding(f Finding, encKey []byte) (int64, error)
func DeriveKey(passphrase []byte, salt []byte) []byte
func NewSalt() ([]byte, error)
<!-- Registry (from Plan 02) -->
package providers
func NewRegistry() (*Registry, error)
func (r *Registry) List() []Provider
func (r *Registry) Get(name string) (Provider, bool)
func (r *Registry) Stats() RegistryStats
<!-- Config defaults -->
DBPath: ~/.keyhunter/keyhunter.db
ConfigPath: ~/.keyhunter.yaml
Workers: runtime.NumCPU() * 8
Passphrase: (prompt if not in env KEYHUNTER_PASSPHRASE — Phase 1: use empty string as dev default)
<!-- Viper config keys -->
"database.path" → DBPath
"scan.workers" → Workers
"encryption.passphrase" → Passphrase (sensitive — warn in help)
<!-- lipgloss table output -->
Columns: PROVIDER | MASKED KEY | CONFIDENCE | SOURCE | LINE
Colors: use lipgloss.NewStyle().Foreground() for confidence: high=green, medium=yellow, low=red
</interfaces>
</context>
<tasks>
<task type="auto" tdd="false">
<name>Task 1: Config package, output table, and root command</name>
<files>pkg/config/config.go, pkg/output/table.go, cmd/root.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CLI-01, CLI-02, CLI-03 rows, Standard Stack: cobra v1.10.2 + viper v1.21.0)
- /home/salva/Documents/apikey/pkg/engine/finding.go (Finding struct fields for output)
</read_first>
<action>
Create **pkg/config/config.go**:
```go
package config
import (
"os"
"path/filepath"
"runtime"
)
// Config holds all KeyHunter runtime configuration.
// Values are resolved from CLI flags, environment variables, ~/.keyhunter.yaml, and built-in defaults (highest precedence first).
type Config struct {
DBPath string // path to SQLite database file
ConfigPath string // path to config YAML file
Workers int // number of scanner worker goroutines
Passphrase string // encryption passphrase (sensitive)
}
// Load returns a Config with defaults applied.
// Callers should override individual fields after Load() using viper-bound values.
func Load() Config {
home, _ := os.UserHomeDir()
return Config{
DBPath: filepath.Join(home, ".keyhunter", "keyhunter.db"),
ConfigPath: filepath.Join(home, ".keyhunter.yaml"),
Workers: runtime.NumCPU() * 8,
Passphrase: "", // Phase 1: empty passphrase; Phase 6+ will prompt
}
}
```
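For reference only (not a file to create): a minimal sketch of the layering that the scan command applies to the worker count, assuming a value of 0 means "not set" at a given layer. `resolveWorkers` is a hypothetical helper, not part of the plan.

```go
package main

import "fmt"

// resolveWorkers is a hypothetical helper illustrating precedence:
// CLI flag first, then environment, then config file, then the default.
// A value of 0 (or below) means "not set" at that layer.
func resolveWorkers(flag, env, file, def int) int {
	for _, v := range []int{flag, env, file} {
		if v > 0 {
			return v
		}
	}
	return def
}

func main() {
	fmt.Println(resolveWorkers(0, 0, 16, 64)) // only the config file sets a value
	fmt.Println(resolveWorkers(4, 8, 16, 64)) // the flag wins over everything else
	fmt.Println(resolveWorkers(0, 0, 0, 64))  // nothing set, so the default applies
}
```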
Create **pkg/output/table.go**:
```go
package output
import (
"fmt"
"os"
"github.com/charmbracelet/lipgloss"
"github.com/salvacybersec/keyhunter/pkg/engine"
)
var (
styleHigh = lipgloss.NewStyle().Foreground(lipgloss.Color("2")) // green
styleMedium = lipgloss.NewStyle().Foreground(lipgloss.Color("3")) // yellow
styleLow = lipgloss.NewStyle().Foreground(lipgloss.Color("1")) // red
styleHeader = lipgloss.NewStyle().Bold(true).Underline(true)
)
// PrintFindings writes findings as a colored terminal table to stdout.
// If unmask is true, KeyValue is shown; otherwise KeyMasked is shown.
func PrintFindings(findings []engine.Finding, unmask bool) {
if len(findings) == 0 {
fmt.Println("No API keys found.")
return
}
// Header
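// Pad each header cell before styling: ANSI escape codes would otherwise count toward the %-Ns width.
fmt.Fprintf(os.Stdout, "%s %s %s %s %s\n",
styleHeader.Render(fmt.Sprintf("%-20s", "PROVIDER")),
styleHeader.Render(fmt.Sprintf("%-40s", "KEY")),
styleHeader.Render(fmt.Sprintf("%-10s", "CONFIDENCE")),
styleHeader.Render(fmt.Sprintf("%-30s", "SOURCE")),
styleHeader.Render("LINE"),
)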
fmt.Println(lipgloss.NewStyle().Foreground(lipgloss.Color("8")).Render(
"──────────────────────────────────────────────────────────────────────────────────────────────────────────",
))
for _, f := range findings {
keyDisplay := f.KeyMasked
if unmask {
keyDisplay = f.KeyValue
}
confStyle := styleLow
switch f.Confidence {
case "high":
confStyle = styleHigh
case "medium":
confStyle = styleMedium
}
fmt.Fprintf(os.Stdout, "%-20s %-40s %s %-30s %d\n",
f.ProviderName,
keyDisplay,
confStyle.Render(fmt.Sprintf("%-10s", f.Confidence)), // pad before styling so escape codes do not skew the column
truncate(f.Source, 28),
f.LineNumber,
)
}
fmt.Printf("\n%d key(s) found.\n", len(findings))
}
func truncate(s string, max int) string {
if len(s) <= max {
return s
}
return "..." + s[len(s)-max+3:]
}
```
Create **cmd/root.go** (replaces the stub from Plan 01):
```go
package cmd
import (
"fmt"
"os"
"path/filepath"
"github.com/spf13/cobra"
"github.com/spf13/viper"
)
var cfgFile string
// rootCmd is the base command when called without any subcommands.
var rootCmd = &cobra.Command{
Use: "keyhunter",
Short: "KeyHunter — detect leaked LLM API keys across 108+ providers",
Long: `KeyHunter scans files, git history, and internet sources for leaked LLM API keys.
Supports 108+ providers with Aho-Corasick pre-filtering and regex + entropy detection.`,
SilenceUsage: true,
}
// Execute is the entry point called by main.go.
// Errors exit with code 2; code 1 is reserved for "keys found" (see cmd/scan.go).
func Execute() {
if err := rootCmd.Execute(); err != nil {
os.Exit(2)
}
}
func init() {
cobra.OnInitialize(initConfig)
rootCmd.PersistentFlags().StringVar(&cfgFile, "config", "", "config file (default: ~/.keyhunter.yaml)")
rootCmd.AddCommand(scanCmd)
rootCmd.AddCommand(providersCmd)
rootCmd.AddCommand(configCmd)
}
func initConfig() {
if cfgFile != "" {
viper.SetConfigFile(cfgFile)
} else {
home, err := os.UserHomeDir()
if err != nil {
fmt.Fprintln(os.Stderr, "warning: cannot determine home directory:", err)
return
}
viper.SetConfigName(".keyhunter")
viper.SetConfigType("yaml")
viper.AddConfigPath(home)
viper.AddConfigPath(".")
}
viper.SetEnvPrefix("KEYHUNTER")
viper.AutomaticEnv()
// Defaults
viper.SetDefault("scan.workers", 0) // 0 = auto (CPU*8)
viper.SetDefault("database.path", filepath.Join(mustHomeDir(), ".keyhunter", "keyhunter.db"))
// Config file is optional — ignore if not found
_ = viper.ReadInConfig()
}
func mustHomeDir() string {
h, _ := os.UserHomeDir()
return h
}
```
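For reference only (not a file to create): as far as I understand Viper, `AutomaticEnv` upper-cases the key and prepends the prefix but leaves the dot in nested keys such as `scan.workers` untouched, so a variable like `KEYHUNTER_SCAN_WORKERS` is only picked up if a replacer is registered with `viper.SetEnvKeyReplacer(strings.NewReplacer(".", "_"))`. This should be verified against the pinned Viper version. The sketch below shows the name mapping with the standard library only; `envName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// envName is a hypothetical helper showing which environment variable a
// nested config key maps to once dots are replaced with underscores.
func envName(prefix, key string) string {
	replaced := strings.NewReplacer(".", "_").Replace(key)
	return strings.ToUpper(prefix + "_" + replaced)
}

func main() {
	for _, key := range []string{"scan.workers", "database.path"} {
		fmt.Printf("%s -> %s\n", key, envName("KEYHUNTER", key))
	}
}
```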
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build -o keyhunter . && ./keyhunter --help 2>&1 | grep -E "scan|providers|config" && echo "HELP OK"</automated>
</verify>
<acceptance_criteria>
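- `go build ./...` exits 0 (cmd/root.go references scanCmd, providersCmd, and configCmd, so the build passes only once the Task 2 files or temporary stubs exist)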
- `./keyhunter --help` shows "scan", "providers", and "config" in command list
- pkg/config/config.go exports Config and Load
- pkg/output/table.go exports PrintFindings
- cmd/root.go declares rootCmd, Execute(), scanCmd, providersCmd, configCmd referenced
- `grep -q 'viper\.SetConfigFile\|viper\.SetConfigName' cmd/root.go` exits 0
- lipgloss used for header and confidence coloring
</acceptance_criteria>
<done>Root command, config package, and output table exist. `keyhunter --help` shows the three top-level commands.</done>
</task>
<task type="auto" tdd="false">
<name>Task 2: scan, providers, and config subcommands</name>
<files>cmd/scan.go, cmd/providers.go, cmd/config.go</files>
<read_first>
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (CLI-04, CLI-05 rows, Pattern 2 pipeline usage)
- /home/salva/Documents/apikey/cmd/root.go (rootCmd, viper setup)
- /home/salva/Documents/apikey/pkg/engine/engine.go (Engine.Scan, ScanConfig)
- /home/salva/Documents/apikey/pkg/storage/db.go (Open, SaveFinding)
- /home/salva/Documents/apikey/pkg/providers/registry.go (NewRegistry, List, Get, Stats)
</read_first>
<action>
Create **cmd/scan.go**:
```go
package cmd
import (
"context"
"fmt"
"os"
"path/filepath"
"runtime"
"github.com/spf13/cobra"
"github.com/spf13/viper"
"github.com/salvacybersec/keyhunter/pkg/config"
"github.com/salvacybersec/keyhunter/pkg/engine"
"github.com/salvacybersec/keyhunter/pkg/engine/sources"
"github.com/salvacybersec/keyhunter/pkg/output"
"github.com/salvacybersec/keyhunter/pkg/providers"
"github.com/salvacybersec/keyhunter/pkg/storage"
)
var (
flagWorkers int
flagVerify bool
flagUnmask bool
flagOutput string
flagExclude []string
)
var scanCmd = &cobra.Command{
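Use: "scan <path>",
Short: "Scan a file for leaked API keys",
Args: cobra.ExactArgs(1),
RunE: func(cmd *cobra.Command, args []string) error {
target := args[0]
// Phase 1 ships only FileSource, and the engine discards source read errors,
// so reject missing paths and directories up front instead of reporting a clean scan.
info, err := os.Stat(target)
if err != nil {
return fmt.Errorf("reading target: %w", err)
}
if info.IsDir() {
return fmt.Errorf("%s is a directory; Phase 1 scans single files only", target)
}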
// Load config
cfg := config.Load()
if viper.GetInt("scan.workers") > 0 {
cfg.Workers = viper.GetInt("scan.workers")
}
// Workers flag overrides config
workers := flagWorkers
if workers <= 0 {
workers = cfg.Workers
}
if workers <= 0 {
workers = runtime.NumCPU() * 8
}
// Initialize registry
reg, err := providers.NewRegistry()
if err != nil {
return fmt.Errorf("loading providers: %w", err)
}
// Initialize engine
eng := engine.NewEngine(reg)
src := sources.NewFileSource(target)
scanCfg := engine.ScanConfig{
Workers: workers,
Verify: flagVerify,
Unmask: flagUnmask,
}
// Open database (ensure directory exists)
dbPath := viper.GetString("database.path")
if dbPath == "" {
dbPath = cfg.DBPath
}
if err := os.MkdirAll(filepath.Dir(dbPath), 0700); err != nil {
return fmt.Errorf("creating database directory: %w", err)
}
db, err := storage.Open(dbPath)
if err != nil {
return fmt.Errorf("opening database: %w", err)
}
defer db.Close()
// Derive encryption key (Phase 1: empty passphrase with fixed dev salt)
salt := []byte("keyhunter-dev-s0") // Phase 1 placeholder — Phase 6 replaces with proper salt storage
encKey := storage.DeriveKey([]byte(cfg.Passphrase), salt)
// Run scan
ch, err := eng.Scan(context.Background(), src, scanCfg)
if err != nil {
return fmt.Errorf("starting scan: %w", err)
}
var findings []engine.Finding
for f := range ch {
findings = append(findings, f)
// Persist to storage
storeFinding := storage.Finding{
ProviderName: f.ProviderName,
KeyValue: f.KeyValue,
KeyMasked: f.KeyMasked,
Confidence: f.Confidence,
SourcePath: f.Source,
SourceType: f.SourceType,
LineNumber: f.LineNumber,
}
if _, err := db.SaveFinding(storeFinding, encKey); err != nil {
fmt.Fprintf(os.Stderr, "warning: failed to save finding: %v\n", err)
}
}
// Output
switch flagOutput {
case "json":
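// Full JSON output lands in Phase 6; emit a valid empty array so consumers can still parse stdout.
fmt.Fprintln(os.Stderr, "warning: JSON output is not implemented until Phase 6")
fmt.Println("[]")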
default:
output.PrintFindings(findings, flagUnmask)
}
// Exit code semantics (CLI-05 / OUT-06): 0=clean, 1=found, 2=error
if len(findings) > 0 {
db.Close() // os.Exit skips deferred calls, so close the database explicitly
os.Exit(1)
}
return nil
},
}
func init() {
scanCmd.Flags().IntVar(&flagWorkers, "workers", 0, "number of worker goroutines (default: CPU*8)")
scanCmd.Flags().BoolVar(&flagVerify, "verify", false, "actively verify found keys (opt-in, Phase 5)")
scanCmd.Flags().BoolVar(&flagUnmask, "unmask", false, "show full key values (default: masked)")
scanCmd.Flags().StringVar(&flagOutput, "output", "table", "output format: table, json (more in Phase 6)")
scanCmd.Flags().StringSliceVar(&flagExclude, "exclude", nil, "glob patterns to exclude (e.g. *.min.js)")
viper.BindPFlag("scan.workers", scanCmd.Flags().Lookup("workers"))
}
```
Create **cmd/providers.go**:
```go
package cmd
import (
"fmt"
"os"
"strings"
"github.com/charmbracelet/lipgloss"
"github.com/spf13/cobra"
"github.com/salvacybersec/keyhunter/pkg/providers"
)
var providersCmd = &cobra.Command{
Use: "providers",
Short: "Manage and inspect provider definitions",
}
var providersListCmd = &cobra.Command{
Use: "list",
Short: "List all loaded provider definitions",
RunE: func(cmd *cobra.Command, args []string) error {
reg, err := providers.NewRegistry()
if err != nil {
return err
}
bold := lipgloss.NewStyle().Bold(true)
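// Pad header cells before styling so ANSI escape codes do not skew column widths.
fmt.Fprintf(os.Stdout, "%s %s %s %s\n",
bold.Render(fmt.Sprintf("%-20s", "NAME")), bold.Render(fmt.Sprintf("%-6s", "TIER")),
bold.Render(fmt.Sprintf("%-8s", "PATTERNS")), bold.Render("KEYWORDS"))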
fmt.Println(strings.Repeat("─", 70))
for _, p := range reg.List() {
fmt.Fprintf(os.Stdout, "%-20s %-6d %-8d %s\n",
p.Name, p.Tier, len(p.Patterns), strings.Join(p.Keywords, ", "))
}
stats := reg.Stats()
fmt.Printf("\nTotal: %d providers\n", stats.Total)
return nil
},
}
var providersInfoCmd = &cobra.Command{
Use: "info <name>",
Short: "Show detailed info for a provider",
Args: cobra.ExactArgs(1),
RunE: func(cmd *cobra.Command, args []string) error {
reg, err := providers.NewRegistry()
if err != nil {
return err
}
p, ok := reg.Get(args[0])
if !ok {
return fmt.Errorf("provider %q not found", args[0])
}
fmt.Printf("Name: %s\n", p.Name)
fmt.Printf("Display Name: %s\n", p.DisplayName)
fmt.Printf("Tier: %d\n", p.Tier)
fmt.Printf("Last Verified: %s\n", p.LastVerified)
fmt.Printf("Keywords: %s\n", strings.Join(p.Keywords, ", "))
fmt.Printf("Patterns: %d\n", len(p.Patterns))
for i, pat := range p.Patterns {
fmt.Printf(" [%d] regex=%s confidence=%s entropy_min=%.1f\n",
i+1, pat.Regex, pat.Confidence, pat.EntropyMin)
}
if p.Verify.URL != "" {
fmt.Printf("Verify URL: %s %s\n", p.Verify.Method, p.Verify.URL)
}
return nil
},
}
var providersStatsCmd = &cobra.Command{
Use: "stats",
Short: "Show provider statistics",
RunE: func(cmd *cobra.Command, args []string) error {
reg, err := providers.NewRegistry()
if err != nil {
return err
}
stats := reg.Stats()
fmt.Printf("Total providers: %d\n", stats.Total)
fmt.Printf("By tier:\n")
for tier := 1; tier <= 9; tier++ {
if count := stats.ByTier[tier]; count > 0 {
fmt.Printf(" Tier %d: %d\n", tier, count)
}
}
fmt.Printf("By confidence:\n")
for conf, count := range stats.ByConfidence {
fmt.Printf(" %s: %d\n", conf, count)
}
return nil
},
}
func init() {
providersCmd.AddCommand(providersListCmd)
providersCmd.AddCommand(providersInfoCmd)
providersCmd.AddCommand(providersStatsCmd)
}
```
Create **cmd/config.go**:
```go
package cmd
import (
"fmt"
"os"
"path/filepath"
"github.com/spf13/cobra"
"github.com/spf13/viper"
)
var configCmd = &cobra.Command{
Use: "config",
Short: "Manage KeyHunter configuration",
}
var configInitCmd = &cobra.Command{
Use: "init",
Short: "Create default configuration file at ~/.keyhunter.yaml",
RunE: func(cmd *cobra.Command, args []string) error {
home, err := os.UserHomeDir()
if err != nil {
return fmt.Errorf("cannot determine home directory: %w", err)
}
configPath := filepath.Join(home, ".keyhunter.yaml")
// Set defaults before writing
viper.SetDefault("scan.workers", 0)
viper.SetDefault("database.path", filepath.Join(home, ".keyhunter", "keyhunter.db"))
if err := viper.WriteConfigAs(configPath); err != nil {
return fmt.Errorf("writing config: %w", err)
}
fmt.Printf("Config initialized: %s\n", configPath)
return nil
},
}
var configSetCmd = &cobra.Command{
Use: "set <key> <value>",
Short: "Set a configuration value",
Args: cobra.ExactArgs(2),
RunE: func(cmd *cobra.Command, args []string) error {
key, value := args[0], args[1]
viper.Set(key, value)
if err := viper.WriteConfig(); err != nil {
// If config file doesn't exist yet, create it
home, _ := os.UserHomeDir()
configPath := filepath.Join(home, ".keyhunter.yaml")
if err2 := viper.WriteConfigAs(configPath); err2 != nil {
return fmt.Errorf("writing config: %w", err2)
}
}
fmt.Printf("Set %s = %s\n", key, value)
return nil
},
}
var configGetCmd = &cobra.Command{
Use: "get <key>",
Short: "Get a configuration value",
Args: cobra.ExactArgs(1),
RunE: func(cmd *cobra.Command, args []string) error {
val := viper.Get(args[0])
if val == nil {
return fmt.Errorf("key %q not found", args[0])
}
fmt.Printf("%v\n", val)
return nil
},
}
func init() {
configCmd.AddCommand(configInitCmd)
configCmd.AddCommand(configSetCmd)
configCmd.AddCommand(configGetCmd)
}
```
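For reference only (not a file to create): `config set` passes the raw argument string to `viper.Set`, so a value such as `16` is likely written to the YAML file as a quoted string, even though `viper.GetInt` converts it back on read. If typed values in the file are preferred, the argument could be parsed first. `parseValue` below is a hypothetical helper sketching that idea.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseValue is a hypothetical helper: it converts a CLI string into an int
// or bool when it parses cleanly, and otherwise leaves it as a string.
// Integers are tried first, so "1" and "0" become ints, not bools.
func parseValue(raw string) any {
	if n, err := strconv.Atoi(raw); err == nil {
		return n
	}
	if b, err := strconv.ParseBool(raw); err == nil {
		return b
	}
	return raw
}

func main() {
	for _, raw := range []string{"16", "true", "/tmp/keyhunter.db"} {
		v := parseValue(raw)
		fmt.Printf("%q -> %T\n", raw, v)
	}
}
```

One caveat of this approach: `strconv.ParseBool` also accepts short forms such as `t` and `f`, so a string value that happens to be one of those would be stored as a bool.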
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go build -o keyhunter . && ./keyhunter providers list && ./keyhunter providers info openai && echo "PROVIDERS OK"</automated>
</verify>
<acceptance_criteria>
- `go build -o keyhunter .` exits 0
- `./keyhunter --help` shows scan, providers, config commands
- `./keyhunter providers list` prints table with >= 3 rows including "openai"
- `./keyhunter providers info openai` prints Name, Tier, Keywords, Patterns, Verify URL
- `./keyhunter providers stats` prints "Total providers: 3" or more
- `./keyhunter config init` creates or updates ~/.keyhunter.yaml
- `./keyhunter config set scan.workers 16` exits 0
- `./keyhunter scan testdata/samples/openai_key.txt` exits 1 (keys found) and prints a table row with "openai"
- `./keyhunter scan testdata/samples/no_keys.txt` exits 0 and prints "No API keys found."
- `grep -q 'viper\.BindPFlag' cmd/scan.go` exits 0
</acceptance_criteria>
<done>Full CLI works: scan finds and persists keys, providers list/info/stats work, config init/set/get work. Phase 1 success criteria all met.</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<what-built>
Complete Phase 1 implementation:
- Provider registry with 3 YAML definitions, Aho-Corasick automaton, schema validation
- Storage layer with AES-256-GCM encryption, Argon2id key derivation, SQLite WAL mode
- Three-stage scan engine: keyword pre-filter → regex + entropy detector → finding channel
- CLI: keyhunter scan, providers list/info/stats, config init/set/get
</what-built>
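For the "regex + entropy detector" stage listed above, a minimal sketch of a Shannon entropy measure over the bytes of a candidate string. This section does not state the threshold or the exact formula the engine uses, so treat `shannonEntropy` as a hypothetical stand-in rather than the planned implementation:

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy of s in bits per byte.
// It counts bytes rather than runes, which is adequate for ASCII keys.
// Hypothetical helper; the real detector may compute this differently.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	entropy := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	// Illustrative inputs only; none of these is a real credential.
	samples := []string{
		"aaaaaaaaaaaaaaaa",
		"sk-proj-EXAMPLE-not-a-real-key",
		"q8ZrT1xLw93NvYb0",
	}
	for _, s := range samples {
		fmt.Printf("%-34q %.2f bits/byte\n", s, shannonEntropy(s))
	}
}
```

A repeated character scores 0 and a string of all-distinct characters scores log2 of its length, which is why an entropy check helps separate random-looking keys from placeholder text that happens to match a regex.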
<how-to-verify>
Run these commands from the project root and confirm each expected output:
1. `cd /home/salva/Documents/apikey && go test ./... -v -count=1`
   Expected: all tests PASS, with zero FAIL and zero SKIP (the original test stubs should be filled in by now, so none should still skip)
2. `./keyhunter scan testdata/samples/openai_key.txt`
   Expected: exit code 1, table printed with 1 row showing the "openai" provider and a masked key
3. `./keyhunter scan testdata/samples/no_keys.txt`
   Expected: exit code 0, "No API keys found." printed
4. `./keyhunter providers list`
   Expected: table with openai, anthropic, huggingface rows
5. `./keyhunter providers info openai`
   Expected: Name, Tier 1, Keywords including "sk-proj-", Pattern regex shown
6. `./keyhunter config init`
   Expected: "Config initialized: ~/.keyhunter.yaml" and the file exists
7. `./keyhunter config set scan.workers 16 && ./keyhunter config get scan.workers`
   Expected: "Set scan.workers = 16" then "16"
8. Build the binary with production flags:
   `CGO_ENABLED=0 go build -ldflags="-s -w" -o keyhunter-prod .`
   Expected: builds without error, binary produced
</how-to-verify>
<resume-signal>Type "approved" if all 8 checks pass, or describe which check failed and what output you saw.</resume-signal>
</task>
</tasks>
<verification>
Full Phase 1 integration check:
- `go test ./... -count=1` exits 0
- `./keyhunter scan testdata/samples/openai_key.txt` exits 1 with findings table
- `./keyhunter scan testdata/samples/no_keys.txt` exits 0 with "No API keys found."
- `./keyhunter providers list` shows 3+ providers
- `./keyhunter config init` creates ~/.keyhunter.yaml
- `CGO_ENABLED=0 go build -ldflags="-s -w" -o keyhunter-prod .` exits 0
</verification>
<success_criteria>
- Cobra CLI with scan, providers, config commands (CLI-01)
- `keyhunter config init` creates ~/.keyhunter.yaml (CLI-02)
- `keyhunter config set <key> <value>` persists values (CLI-03)
- `keyhunter providers list/info/stats` work (CLI-04)
- scan flags: --workers, --verify, --unmask, --output, --exclude (CLI-05)
- All Phase 1 success criteria from ROADMAP.md satisfied:
  1. `keyhunter scan ./somefile` runs the three-stage pipeline and returns findings with provider names
  2. Findings are persisted to SQLite with `key_value` AES-256 encrypted
  3. `keyhunter config init` creates `~/.keyhunter.yaml` and `keyhunter config set <key> <value>` persists values
  4. `keyhunter providers list` and `keyhunter providers info <name>` return provider metadata from YAML definitions
  5. Provider YAML schema includes `format_version` and `last_verified` fields, validated at load time
</success_criteria>
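Criterion 2 above depends on `key_value` being stored encrypted, and the storage layer is described as using AES-256-GCM. A minimal round-trip sketch using only the Go standard library, assuming a 32-byte key is already available (the Argon2id derivation step is omitted). The names `sealValue` and `openValue` and the nonce-prepended layout are assumptions of this sketch, not the plan's actual API:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealValue encrypts plaintext and returns nonce || ciphertext || tag.
// Hypothetical helper; storing the nonce as a prefix is an assumption.
func sealValue(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openValue reverses sealValue and fails if the data was modified.
func openValue(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	// A random key stands in for the Argon2id-derived key.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := sealValue(key, []byte("sk-proj-EXAMPLE-not-a-real-key"))
	if err != nil {
		panic(err)
	}
	plain, err := openValue(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("stored %d bytes, recovered %q\n", len(sealed), plain)
}
```

Because GCM is authenticated, a modified row fails to decrypt instead of yielding garbage, which gives a simple way to check by hand that the stored value is not plaintext.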
<output>
After completion, create `.planning/phases/01-foundation/01-05-SUMMARY.md` following the summary template.
</output>