keyhunter/.planning/phases/01-foundation/01-02-PLAN.md
salvacybersec 684b67cb73 docs(01-foundation): create phase 1 plan — 5 plans across 3 execution waves
Wave 0: module init + test scaffolding (01-01)
Wave 1: provider registry (01-02) + storage layer (01-03) in parallel
Wave 2: scan engine pipeline (01-04, depends on 01-02)
Wave 3: CLI wiring + integration checkpoint (01-05, depends on all)

Covers all 16 Phase 1 requirements: CORE-01 through CORE-07, STOR-01 through STOR-03,
CLI-01 through CLI-05, PROV-10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:44:09 +03:00


---
phase: 01-foundation
plan: 02
type: execute
wave: 1
depends_on:
  - 01-01
files_modified:
  - providers/openai.yaml
  - providers/anthropic.yaml
  - providers/huggingface.yaml
  - pkg/providers/schema.go
  - pkg/providers/loader.go
  - pkg/providers/registry.go
  - pkg/providers/registry_test.go
autonomous: true
requirements:
  - CORE-02
  - CORE-03
  - CORE-06
  - PROV-10
must_haves:
  truths:
    - "Provider YAML files are embedded at compile time; no filesystem access at runtime"
    - "Registry loads all YAML files from embed.FS and returns a slice of Provider structs"
    - "Provider schema validation rejects YAML missing format_version or last_verified"
    - "Aho-Corasick automaton is built from all provider keywords at registry init"
    - "keyhunter providers list command lists providers (tested via registry methods)"
  artifacts:
    - path: providers/openai.yaml
      provides: "Reference provider definition with all schema fields"
      contains: "format_version"
    - path: pkg/providers/schema.go
      provides: "Provider, Pattern, VerifySpec Go structs with UnmarshalYAML validation"
      exports:
        - Provider
        - Pattern
        - VerifySpec
    - path: pkg/providers/registry.go
      provides: "Registry struct with List, Get, Stats, AC methods"
      exports:
        - Registry
        - NewRegistry
    - path: pkg/providers/loader.go
      provides: "embed.FS declaration and fs.WalkDir loading logic"
      contains: "go:embed"
  key_links:
    - from: pkg/providers/loader.go
      to: "providers/*.yaml"
      via: "//go:embed directive"
      pattern: "go:embed.*providers"
    - from: pkg/providers/registry.go
      to: github.com/petar-dambovaliev/aho-corasick
      via: "AC automaton build at NewRegistry()"
      pattern: "ahocorasick"
    - from: pkg/providers/schema.go
      to: "format_version and last_verified YAML fields"
      via: "UnmarshalYAML validation"
      pattern: "UnmarshalYAML"
---

Build the provider registry: YAML schema structs with validation, embed.FS loader, in-memory registry with List/Get/Stats/AC methods, and three reference provider YAML definitions. The Aho-Corasick automaton is built from all provider keywords at registry initialization.

Purpose: Every downstream subsystem (scan engine, CLI providers command, verification engine) depends on the Registry interface. This plan establishes the stable contract they build against. Output: providers/*.yaml, pkg/providers/{schema,loader,registry}.go, registry_test.go (stubs filled).

<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/phases/01-foundation/01-RESEARCH.md
@.planning/phases/01-foundation/01-01-SUMMARY.md

Full provider YAML structure:

```yaml
format_version: 1
name: openai
display_name: OpenAI
tier: 1
last_verified: "2026-04-04"
keywords:
  - "sk-proj-"
  - "openai"
patterns:
  - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.openai.com/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

Provider struct fields:
- FormatVersion int (yaml:"format_version"; must be >= 1)
- Name string (yaml:"name")
- DisplayName string (yaml:"display_name")
- Tier int (yaml:"tier")
- LastVerified string (yaml:"last_verified"; must be non-empty)
- Keywords []string (yaml:"keywords")
- Patterns []Pattern (yaml:"patterns")
- Verify VerifySpec (yaml:"verify")

Pattern struct fields:
- Regex string (yaml:"regex")
- EntropyMin float64 (yaml:"entropy_min")
- Confidence string (yaml:"confidence"; one of "high", "medium", "low")

VerifySpec struct fields:
- Method string (yaml:"method")
- URL string (yaml:"url")
- Headers map[string]string (yaml:"headers")
- ValidStatus []int (yaml:"valid_status")
- InvalidStatus []int (yaml:"invalid_status")
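
The VerifySpec fields above are consumed by a later phase, so the following is only a minimal sketch of how they might fit together, not the project's verification engine. It assumes that `{KEY}` is a literal placeholder substituted into header values and that a status code in neither list is treated as inconclusive. `buildRequest` and `classify` are hypothetical names, and the request is constructed but never sent.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// VerifySpec mirrors the fields listed above (YAML tags omitted for brevity).
type VerifySpec struct {
	Method        string
	URL           string
	Headers       map[string]string
	ValidStatus   []int
	InvalidStatus []int
}

// buildRequest is a hypothetical helper. It assumes {KEY} is a literal
// placeholder that is replaced with the candidate key in every header value.
// The request is only constructed, never sent.
func buildRequest(spec VerifySpec, key string) (*http.Request, error) {
	req, err := http.NewRequest(spec.Method, spec.URL, nil)
	if err != nil {
		return nil, err
	}
	for name, value := range spec.Headers {
		req.Header.Set(name, strings.ReplaceAll(value, "{KEY}", key))
	}
	return req, nil
}

// classify is a hypothetical helper that maps an HTTP status code onto the
// spec's valid_status and invalid_status lists; anything else is "unknown".
func classify(spec VerifySpec, status int) string {
	for _, s := range spec.ValidStatus {
		if s == status {
			return "valid"
		}
	}
	for _, s := range spec.InvalidStatus {
		if s == status {
			return "invalid"
		}
	}
	return "unknown"
}

func main() {
	spec := VerifySpec{
		Method:        "GET",
		URL:           "https://api.openai.com/v1/models",
		Headers:       map[string]string{"Authorization": "Bearer {KEY}"},
		ValidStatus:   []int{200},
		InvalidStatus: []int{401, 403},
	}
	req, err := buildRequest(spec, "sk-proj-example")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Authorization"))
	fmt.Println(classify(spec, 200), classify(spec, 401), classify(spec, 500))
}
```

Keeping a third "unknown" outcome is a design choice of this sketch: it avoids reporting a key as invalid when the provider answers with a rate limit or server error.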

type Registry struct { ... }
func NewRegistry() (*Registry, error)
func (r *Registry) List() []Provider
func (r *Registry) Get(name string) (Provider, bool)
func (r *Registry) Stats() RegistryStats // {Total int, ByTier map[int]int, ByConfidence map[string]int}
func (r *Registry) AC() ahocorasick.AhoCorasick // pre-built automaton

Embed path constraint: a //go:embed pattern is resolved relative to the directory of the source file that contains the directive, and the pattern may not contain "." or ".." path elements. loader.go is at pkg/providers/loader.go and the providers/ directory is at the project root, so //go:embed ../../providers/*.yaml does not work, even though RESEARCH.md shows that path.

Options considered:
- Place a providers_embed.go at the project root (same directory as go.mod) in package main. Rejected: this breaks package separation.
- Symlink the root providers/ directory into the providers package. Rejected in favor of the next option.
- Move the YAML files to pkg/providers/definitions/*.yaml and embed them from loader.go with //go:embed definitions/*.yaml, a path that does not traverse upward.

Resolution: use the definitions/ subdirectory (pkg/providers/definitions/openai.yaml etc.) and update files_modified accordingly.

Task 1: Provider YAML schema structs with validation

Files: pkg/providers/schema.go, providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml

Read first:
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 1: Provider Registry, Provider YAML schema section, PROV-10 row in requirements table)
- /home/salva/Documents/apikey/.planning/research/ARCHITECTURE.md (Provider Registry component, YAML schema example)

Behavior:
- Test 1: Provider with format_version=0 → UnmarshalYAML returns error "format_version must be >= 1"
- Test 2: Provider with empty last_verified → UnmarshalYAML returns error "last_verified is required"
- Test 3: Valid provider YAML → UnmarshalYAML succeeds, Provider.Name == "openai"
- Test 4: Provider with no patterns → loaded successfully (patterns list can be empty for schema-only providers)
- Test 5: Pattern.Confidence not in {"high","medium","low"} → error "confidence must be high, medium, or low"

Create pkg/providers/schema.go:

package providers

import (
    "fmt"
    "gopkg.in/yaml.v3"
)

// Provider represents a single API key provider definition loaded from YAML.
type Provider struct {
    FormatVersion int        `yaml:"format_version"`
    Name          string     `yaml:"name"`
    DisplayName   string     `yaml:"display_name"`
    Tier          int        `yaml:"tier"`
    LastVerified  string     `yaml:"last_verified"`
    Keywords      []string   `yaml:"keywords"`
    Patterns      []Pattern  `yaml:"patterns"`
    Verify        VerifySpec `yaml:"verify"`
}

// Pattern defines a single regex pattern for API key detection.
type Pattern struct {
    Regex      string  `yaml:"regex"`
    EntropyMin float64 `yaml:"entropy_min"`
    Confidence string  `yaml:"confidence"`
}

// VerifySpec defines how to verify a key is live (used by Phase 5 verification engine).
type VerifySpec struct {
    Method        string            `yaml:"method"`
    URL           string            `yaml:"url"`
    Headers       map[string]string `yaml:"headers"`
    ValidStatus   []int             `yaml:"valid_status"`
    InvalidStatus []int             `yaml:"invalid_status"`
}

// RegistryStats holds aggregate statistics about loaded providers.
type RegistryStats struct {
    Total        int
    ByTier       map[int]int
    ByConfidence map[string]int
}

// UnmarshalYAML implements yaml.Unmarshaler with schema validation (satisfies PROV-10).
func (p *Provider) UnmarshalYAML(value *yaml.Node) error {
    // Decode into a locally defined type that has no UnmarshalYAML method,
    // which avoids infinite recursion.
    type ProviderAlias Provider
    var alias ProviderAlias
    if err := value.Decode(&alias); err != nil {
        return err
    }
    if alias.FormatVersion < 1 {
        return fmt.Errorf("provider %q: format_version must be >= 1 (got %d)", alias.Name, alias.FormatVersion)
    }
    if alias.LastVerified == "" {
        return fmt.Errorf("provider %q: last_verified is required", alias.Name)
    }
    validConfidences := map[string]bool{"high": true, "medium": true, "low": true, "": true}
    for _, pat := range alias.Patterns {
        if !validConfidences[pat.Confidence] {
            return fmt.Errorf("provider %q: pattern confidence %q must be high, medium, or low", alias.Name, pat.Confidence)
        }
    }
    *p = Provider(alias)
    return nil
}
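
The validation rules above can be exercised without the YAML dependency. The following is a minimal sketch using only the standard library; `validate` is a hypothetical stand-in for the checks inside `UnmarshalYAML`, and the trimmed-down structs carry only the fields the checks read. It also makes one detail visible: as written in schema.go, an empty confidence string is accepted.

```go
package main

import "fmt"

// Pattern and Provider are reduced copies of the schema structs,
// keeping only the fields the validation rules inspect.
type Pattern struct {
	Regex      string
	EntropyMin float64
	Confidence string
}

type Provider struct {
	FormatVersion int
	Name          string
	LastVerified  string
	Patterns      []Pattern
}

// validate is a hypothetical helper applying the same checks, in the same
// order, as the UnmarshalYAML method in schema.go.
func validate(p Provider) error {
	if p.FormatVersion < 1 {
		return fmt.Errorf("provider %q: format_version must be >= 1 (got %d)", p.Name, p.FormatVersion)
	}
	if p.LastVerified == "" {
		return fmt.Errorf("provider %q: last_verified is required", p.Name)
	}
	for _, pat := range p.Patterns {
		switch pat.Confidence {
		case "high", "medium", "low", "":
			// accepted; the empty string is allowed by schema.go as written
		default:
			return fmt.Errorf("provider %q: pattern confidence %q must be high, medium, or low", p.Name, pat.Confidence)
		}
	}
	return nil
}

func main() {
	cases := []Provider{
		{FormatVersion: 0, Name: "no-version", LastVerified: "2026-04-04"},
		{FormatVersion: 1, Name: "no-date"},
		{FormatVersion: 1, Name: "bad-conf", LastVerified: "2026-04-04", Patterns: []Pattern{{Confidence: "certain"}}},
		{FormatVersion: 1, Name: "schema-only", LastVerified: "2026-04-04"},
	}
	for _, c := range cases {
		fmt.Printf("%-12s -> %v\n", c.Name, validate(c))
	}
}
```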

Create the three reference YAML provider definitions. These are SCHEMA EXAMPLES for Phase 1; full pattern libraries come in Phase 2-3.

providers/openai.yaml:

format_version: 1
name: openai
display_name: OpenAI
tier: 1
last_verified: "2026-04-04"
keywords:
  - "sk-proj-"
  - "openai"
patterns:
  - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.openai.com/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]

providers/anthropic.yaml:

format_version: 1
name: anthropic
display_name: Anthropic
tier: 1
last_verified: "2026-04-04"
keywords:
  - "sk-ant-api03-"
  - "anthropic"
patterns:
  - regex: 'sk-ant-api03-[A-Za-z0-9_\-]{93,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.anthropic.com/v1/models
  headers:
    x-api-key: "{KEY}"
    anthropic-version: "2023-06-01"
  valid_status: [200]
  invalid_status: [401, 403]

providers/huggingface.yaml:

format_version: 1
name: huggingface
display_name: HuggingFace
tier: 3
last_verified: "2026-04-04"
keywords:
  - "hf_"
  - "huggingface"
patterns:
  - regex: 'hf_[A-Za-z0-9]{34,}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://huggingface.co/api/whoami-v2
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
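
Each pattern in the definitions above pairs a regex with an `entropy_min` threshold. This plan does not specify how the threshold is applied, so the following is only a sketch under a stated assumption: that `entropy_min` is compared against the Shannon entropy (bits per character) of the matched text. `shannonEntropy`, `isCandidate`, and the sample keys are hypothetical; the regex is the HuggingFace pattern from the definition above.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// hfPattern is the regex from providers/huggingface.yaml.
const hfPattern = `hf_[A-Za-z0-9]{34,}`

// Synthetic sample inputs (not real keys): one repetitive, one varied.
var (
	lowEntropyKey = "hf_" + strings.Repeat("a", 34)
	mixedKey      = "hf_abcdefghijklmnopqrstuvwxyz01234567"
)

// shannonEntropy returns the Shannon entropy of s in bits per character.
// ASSUMPTION: this is the measure entropy_min refers to.
func shannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	h := 0.0
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

// isCandidate is a hypothetical helper: it reports whether s contains a
// regex match whose entropy is at least entropyMin.
func isCandidate(pattern string, entropyMin float64, s string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	match := re.FindString(s)
	if match == "" {
		return false, nil
	}
	return shannonEntropy(match) >= entropyMin, nil
}

func main() {
	for _, s := range []string{lowEntropyKey, mixedKey, "hf_short"} {
		ok, err := isCandidate(hfPattern, 3.5, s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-40s candidate=%v\n", s, ok)
	}
}
```

Under this assumption, the repetitive sample matches the regex but falls below the 3.5 threshold, which is the kind of placeholder value an entropy check is meant to filter out.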

Verify:
cd /home/salva/Documents/apikey && go build ./pkg/providers/... && go test ./pkg/providers/... -run TestProviderSchemaValidation -v 2>&1 | head -30

Acceptance criteria:
- `go build ./pkg/providers/...` exits 0
- providers/openai.yaml contains `format_version: 1` and `last_verified`
- providers/anthropic.yaml contains `format_version: 1` and `last_verified`
- providers/huggingface.yaml contains `format_version: 1` and `last_verified`
- pkg/providers/schema.go exports: Provider, Pattern, VerifySpec, RegistryStats
- Provider.UnmarshalYAML returns error when format_version < 1
- Provider.UnmarshalYAML returns error when last_verified is empty
- `grep -q 'UnmarshalYAML' pkg/providers/schema.go` exits 0

Done: Provider schema structs exist with validation. Three reference YAML files exist with all required fields.

Task 2: Embed loader, registry with Aho-Corasick, and filled test stubs

Files: pkg/providers/loader.go, pkg/providers/registry.go, pkg/providers/registry_test.go

Read first:
- /home/salva/Documents/apikey/.planning/phases/01-foundation/01-RESEARCH.md (Pattern 1: Provider Registry with Compile-Time Embed; exact code example)
- /home/salva/Documents/apikey/pkg/providers/schema.go (types just created in Task 1)

Behavior:
- Test 1: NewRegistry() loads 3 providers from embedded YAML → registry.List() returns slice of length 3
- Test 2: registry.Get("openai") → returns Provider with Name=="openai", bool==true
- Test 3: registry.Get("nonexistent") → returns zero Provider, bool==false
- Test 4: registry.Stats().Total == 3 and Stats().ByTier[1] == 2 (openai + anthropic are tier 1)
- Test 5: AC automaton built: registry.AC().FindAll("sk-proj-abc") returns non-empty slice
- Test 6: AC automaton does NOT match: registry.AC().FindAll("hello world") returns empty slice

IMPORTANT NOTE ON EMBED PATHS: Go's embed package does NOT allow paths containing "..". Since loader.go is at pkg/providers/loader.go, it CANNOT embed ../../providers/*.yaml.

Solution: Place provider YAML files at pkg/providers/definitions/*.yaml and use: //go:embed definitions/*.yaml

This means the YAML files created in Task 1 at providers/openai.yaml etc. are the "source of truth" files users may inspect, while the embedded versions live in the package that owns the embed:
- pkg/providers/definitions/openai.yaml (embedded)
- providers/openai.yaml (user-facing; can symlink or keep as docs)

For Phase 1, keep BOTH: the providers/ root dir for user reference, definitions/ for embed. Copy the three YAML files from providers/ to pkg/providers/definitions/ at the end of this task.

Create pkg/providers/loader.go:

package providers

import (
    "embed"
    "fmt"
    "io/fs"
    "path/filepath"

    "gopkg.in/yaml.v3"
)

//go:embed definitions/*.yaml
var definitionsFS embed.FS

// loadProviders reads all YAML files from the embedded definitions FS.
func loadProviders() ([]Provider, error) {
    var providers []Provider
    err := fs.WalkDir(definitionsFS, "definitions", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        if d.IsDir() || filepath.Ext(path) != ".yaml" {
            return nil
        }
        data, err := definitionsFS.ReadFile(path)
        if err != nil {
            return fmt.Errorf("reading provider file %s: %w", path, err)
        }
        var p Provider
        if err := yaml.Unmarshal(data, &p); err != nil {
            return fmt.Errorf("parsing provider %s: %w", path, err)
        }
        providers = append(providers, p)
        return nil
    })
    return providers, err
}
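
The walk-and-filter logic in loadProviders can be tried without any embedded files, because `fs.WalkDir` accepts any `fs.FS`. The following sketch substitutes an in-memory `fstest.MapFS` for `definitionsFS`; `collectYAML` and `sampleFS` are hypothetical names, and YAML parsing is left out so that the example needs only the standard library.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sort"
	"testing/fstest"
)

// sampleFS returns an in-memory file system standing in for the embedded
// definitions directory. The file contents are placeholders.
func sampleFS() fs.FS {
	return fstest.MapFS{
		"definitions/openai.yaml":    {Data: []byte("name: openai\n")},
		"definitions/anthropic.yaml": {Data: []byte("name: anthropic\n")},
		"definitions/README.md":      {Data: []byte("not a provider\n")},
	}
}

// collectYAML walks root inside fsys and returns the raw bytes of every
// .yaml file, keyed by its slash-separated path. Directories and files with
// other extensions are skipped, as in loadProviders.
func collectYAML(fsys fs.FS, root string) (map[string][]byte, error) {
	files := make(map[string][]byte)
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || path.Ext(p) != ".yaml" {
			return nil
		}
		data, err := fs.ReadFile(fsys, p)
		if err != nil {
			return fmt.Errorf("reading provider file %s: %w", p, err)
		}
		files[p] = data
		return nil
	})
	return files, err
}

func main() {
	files, err := collectYAML(sampleFS(), "definitions")
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```

The sketch uses `path.Ext` because `io/fs` paths are always slash-separated; `filepath.Ext`, as used in loader.go, yields the same extension for these names.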

Create pkg/providers/registry.go:

package providers

import (
    "fmt"

    ahocorasick "github.com/petar-dambovaliev/aho-corasick"
)

// Registry is the in-memory store of all loaded provider definitions.
// It is initialized once at startup and is safe for concurrent reads.
type Registry struct {
    providers []Provider
    index     map[string]int          // name -> slice index
    ac        ahocorasick.AhoCorasick // pre-built automaton for keyword pre-filter
}

// NewRegistry loads all embedded provider YAML files, validates them, builds the
// Aho-Corasick automaton from all provider keywords, and returns the Registry.
func NewRegistry() (*Registry, error) {
    providers, err := loadProviders()
    if err != nil {
        return nil, fmt.Errorf("loading providers: %w", err)
    }

    index := make(map[string]int, len(providers))
    var keywords []string
    for i, p := range providers {
        index[p.Name] = i
        keywords = append(keywords, p.Keywords...)
    }

    builder := ahocorasick.NewAhoCorasickBuilder(ahocorasick.Opts{DFA: true})
    ac := builder.Build(keywords)

    return &Registry{
        providers: providers,
        index:     index,
        ac:        ac,
    }, nil
}

// List returns all loaded providers.
func (r *Registry) List() []Provider {
    return r.providers
}

// Get returns a provider by name and a boolean indicating whether it was found.
func (r *Registry) Get(name string) (Provider, bool) {
    idx, ok := r.index[name]
    if !ok {
        return Provider{}, false
    }
    return r.providers[idx], true
}

// Stats returns aggregate statistics about the loaded providers.
func (r *Registry) Stats() RegistryStats {
    stats := RegistryStats{
        Total:        len(r.providers),
        ByTier:       make(map[int]int),
        ByConfidence: make(map[string]int),
    }
    for _, p := range r.providers {
        stats.ByTier[p.Tier]++
        for _, pat := range p.Patterns {
            stats.ByConfidence[pat.Confidence]++
        }
    }
    return stats
}

// AC returns the pre-built Aho-Corasick automaton for keyword pre-filtering.
func (r *Registry) AC() ahocorasick.AhoCorasick {
    return r.ac
}
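
To illustrate what the keyword pre-filter does, here is a compact, hand-rolled Aho-Corasick matcher using only the standard library. It is a teaching sketch, not the API or implementation of github.com/petar-dambovaliev/aho-corasick; `matcher`, `newMatcher`, and `findAll` are hypothetical names. It builds a trie of keywords, adds failure links breadth-first, and then scans the input once while reporting every keyword that ends at each position.

```go
package main

import "fmt"

// node is one trie state: outgoing edges, a failure link, and the indices
// of all keywords that end at this state (directly or via failure links).
type node struct {
	next map[byte]int
	fail int
	out  []int
}

type matcher struct {
	nodes    []node
	keywords []string
}

// newMatcher builds the trie and then computes failure links breadth-first.
func newMatcher(keywords []string) *matcher {
	m := &matcher{nodes: []node{{next: map[byte]int{}}}, keywords: keywords}
	for i, kw := range keywords {
		cur := 0
		for j := 0; j < len(kw); j++ {
			nxt, ok := m.nodes[cur].next[kw[j]]
			if !ok {
				nxt = len(m.nodes)
				m.nodes = append(m.nodes, node{next: map[byte]int{}})
				m.nodes[cur].next[kw[j]] = nxt
			}
			cur = nxt
		}
		m.nodes[cur].out = append(m.nodes[cur].out, i)
	}
	// Depth-1 states keep the default failure link to the root (state 0).
	var queue []int
	for _, child := range m.nodes[0].next {
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for b, child := range m.nodes[cur].next {
			// Follow failure links from the parent until a state with an
			// edge on b is found, or the root is reached.
			f := m.nodes[cur].fail
			for {
				if nxt, ok := m.nodes[f].next[b]; ok {
					f = nxt
					break
				}
				if f == 0 {
					break
				}
				f = m.nodes[f].fail
			}
			m.nodes[child].fail = f
			m.nodes[child].out = append(m.nodes[child].out, m.nodes[f].out...)
			queue = append(queue, child)
		}
	}
	return m
}

// findAll scans text once and returns every keyword occurrence, in the
// order the occurrences end in the text.
func (m *matcher) findAll(text string) []string {
	var found []string
	state := 0
	for i := 0; i < len(text); i++ {
		b := text[i]
		for {
			if nxt, ok := m.nodes[state].next[b]; ok {
				state = nxt
				break
			}
			if state == 0 {
				break
			}
			state = m.nodes[state].fail
		}
		for _, k := range m.nodes[state].out {
			found = append(found, m.keywords[k])
		}
	}
	return found
}

func main() {
	m := newMatcher([]string{"sk-proj-", "openai", "sk-ant-api03-", "anthropic", "hf_", "huggingface"})
	fmt.Println(m.findAll("export OPENAI_API_KEY=sk-proj-abc"))
	fmt.Println(m.findAll("hello world nothing here"))
}
```

Matching here is case-sensitive and byte-based, so the uppercase `OPENAI` in the first input does not match the `openai` keyword; only `sk-proj-` is reported.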

Note: registry.go must import "fmt"; NewRegistry uses fmt.Errorf to wrap the loader error.

Then copy the three YAML files into the embed location:

mkdir -p /home/salva/Documents/apikey/pkg/providers/definitions
cp /home/salva/Documents/apikey/providers/openai.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
cp /home/salva/Documents/apikey/providers/anthropic.yaml /home/salva/Documents/apikey/pkg/providers/definitions/
cp /home/salva/Documents/apikey/providers/huggingface.yaml /home/salva/Documents/apikey/pkg/providers/definitions/

Finally, fill in pkg/providers/registry_test.go (replacing the stubs from Plan 01). The imports must include "gopkg.in/yaml.v3", which TestProviderSchemaValidation uses to call yaml.Unmarshal directly:

package providers_test

import (
    "testing"

    "github.com/salvacybersec/keyhunter/pkg/providers"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    "gopkg.in/yaml.v3"
)

func TestRegistryLoad(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    assert.GreaterOrEqual(t, len(reg.List()), 3, "expected at least 3 providers")
}

func TestRegistryGet(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)

    p, ok := reg.Get("openai")
    assert.True(t, ok)
    assert.Equal(t, "openai", p.Name)
    assert.Equal(t, 1, p.Tier)

    _, notOk := reg.Get("nonexistent-provider")
    assert.False(t, notOk)
}

func TestRegistryStats(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)

    stats := reg.Stats()
    assert.GreaterOrEqual(t, stats.Total, 3)
    assert.GreaterOrEqual(t, stats.ByTier[1], 2)
}

func TestAhoCorasickBuild(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)

    ac := reg.AC()
    matches := ac.FindAll("export OPENAI_API_KEY=sk-proj-abc")
    assert.NotEmpty(t, matches)

    noMatches := ac.FindAll("hello world nothing here")
    assert.Empty(t, noMatches)
}

func TestProviderSchemaValidation(t *testing.T) {
    invalid := []byte("format_version: 0\nname: invalid\nlast_verified: \"\"\n")
    var p providers.Provider
    err := yaml.Unmarshal(invalid, &p)
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "format_version")
}

Verify:
cd /home/salva/Documents/apikey && go test ./pkg/providers/... -v -count=1 2>&1 | tail -20

Acceptance criteria:
- `go test ./pkg/providers/... -v` exits 0 with all 5 tests PASS (not SKIP)
- TestRegistryLoad passes with >= 3 providers
- TestRegistryGet passes: "openai" found, "nonexistent" not found
- TestRegistryStats passes: Total >= 3
- TestAhoCorasickBuild passes: "sk-proj-" match found, "hello world" empty
- TestProviderSchemaValidation passes: error on format_version=0
- `grep -r 'go:embed' pkg/providers/loader.go` exits 0
- pkg/providers/definitions/ directory exists with 3 YAML files

Done: Registry loads providers from embedded YAML, builds Aho-Corasick automaton, exposes List/Get/Stats/AC. All 5 tests pass.

After both tasks:
- `go test ./pkg/providers/... -v -count=1` exits 0 with 5 tests PASS
- `go build ./...` still exits 0
- `grep -q 'format_version' providers/openai.yaml providers/anthropic.yaml providers/huggingface.yaml` exits 0
- `grep -q 'go:embed' pkg/providers/loader.go` exits 0
- pkg/providers/definitions/ has 3 YAML files (same content as providers/)

<success_criteria>

  • 3 reference provider YAML files exist in providers/ and pkg/providers/definitions/ with format_version and last_verified
  • Provider schema validates format_version >= 1 and non-empty last_verified (PROV-10)
  • Registry loads providers from embed.FS at compile time (CORE-02)
  • Aho-Corasick automaton built from all keywords at NewRegistry() (CORE-06)
  • Registry exposes List(), Get(), Stats(), AC() (CORE-03)
  • 5 provider tests all pass

</success_criteria>

After completion, create `.planning/phases/01-foundation/01-02-SUMMARY.md` following the summary template.