---
phase: 01-foundation
plan: 02
subsystem: providers
tags: [yaml, embed, aho-corasick, registry, go-embed, gopkg.in/yaml.v3]

# Dependency graph
requires:
  - phase: 01-01
    provides: go.mod with all Phase 1 dependencies, test scaffolding, cmd/root.go stub
provides:
  - Provider, Pattern, VerifySpec, RegistryStats Go structs with YAML validation
  - Registry with List(), Get(), Stats(), AC() methods
  - Aho-Corasick automaton built from all provider keywords at NewRegistry()
  - Three reference provider YAML definitions (openai, anthropic, huggingface)
  - Compile-time embed of provider YAML via pkg/providers/definitions/
affects:
  - scan-engine
  - cli-providers-command
  - verification-engine
  - storage-layer

# Tech tracking
tech-stack:
  added:
    - gopkg.in/yaml.v3 (UnmarshalYAML custom validation)
    - github.com/petar-dambovaliev/aho-corasick (keyword pre-filter automaton)
    - embed (stdlib) for compile-time YAML embedding
  patterns:
    - Provider YAML at providers/ (user-visible) + pkg/providers/definitions/ (embed location)
    - Type alias pattern for custom UnmarshalYAML without infinite recursion
    - Registry injected via constructor (NewRegistry), not global singleton

key-files:
  created:
    - pkg/providers/schema.go
    - pkg/providers/loader.go
    - pkg/providers/registry.go
    - pkg/providers/registry_test.go
    - pkg/providers/definitions/openai.yaml
    - pkg/providers/definitions/anthropic.yaml
    - pkg/providers/definitions/huggingface.yaml
    - providers/openai.yaml
    - providers/anthropic.yaml
    - providers/huggingface.yaml
  modified: []

key-decisions:
  - "Provider YAML kept in dual locations: providers/ (user-visible) and pkg/providers/definitions/ (embedded); Go embed cannot use '..' paths, so definitions/ subdirectory is canonical embed source"
  - "UnmarshalYAML validates format_version >= 1 and non-empty last_verified at parse time, not at registry use time; fail fast on malformed definitions"
  - "Aho-Corasick automaton built with DFA=true for deterministic performance; trades memory for guaranteed O(n) matching"
  - "Registry is safe for concurrent reads; no mutex needed since providers slice is written once at NewRegistry and never mutated"

patterns-established:
  - "Pattern 1: Type alias in UnmarshalYAML to avoid infinite recursion: `type ProviderAlias Provider`"
  - "Pattern 2: embed path convention: YAML at pkg/providers/definitions/, user docs at providers/"
  - "Pattern 3: Registry constructor NewRegistry() loads+validates+indexes+builds AC in one call"

requirements-completed: [CORE-02, CORE-03, CORE-06, PROV-10]

# Metrics
duration: 9min
completed: 2026-04-04
---

# Phase 01 Plan 02: Provider Registry Summary

**YAML schema structs with UnmarshalYAML validation, embed.FS loader, and Aho-Corasick registry serving List/Get/Stats/AC to all downstream subsystems**

## Performance

- **Duration:** ~9 min
- **Started:** 2026-04-04T21:02:31Z
- **Completed:** 2026-04-04T21:11:41Z
- **Tasks:** 2 (both TDD)
- **Files modified:** 10 created, 1 updated (registry_test.go)

## Accomplishments

- Provider YAML schema with parse-time validation (format_version >= 1, last_verified required, confidence enum)
- Registry loads 3 providers from embedded YAML at startup, builds Aho-Corasick automaton over all keywords
- Three reference provider YAML definitions with full verify specs (OpenAI, Anthropic, HuggingFace)
- All 5 provider tests pass: TestRegistryLoad, TestRegistryGet, TestRegistryStats, TestAhoCorasickBuild, TestProviderSchemaValidation

## Task Commits

Each task was committed atomically:

1. **TDD RED - Failing tests for schema and registry** - `ebaf7d7` (test)
2. **Task 1: Provider schema structs and reference YAMLs** - `4fcdc42` (feat)
3. **Task 2: Embed loader, registry with AC, filled test stubs** - `a9859b3` (feat)

_Note: Bootstrap (go.mod, main.go, test stubs) was included in the RED commit since Plan 01-01 runs in parallel._

## Files Created/Modified

- `pkg/providers/schema.go` - Provider, Pattern, VerifySpec, RegistryStats structs with UnmarshalYAML validation
- `pkg/providers/loader.go` - embed.FS declaration with //go:embed definitions/*.yaml and fs.WalkDir loader
- `pkg/providers/registry.go` - Registry struct with List(), Get(), Stats(), AC() methods and NewRegistry() constructor
- `pkg/providers/registry_test.go` - Full test implementation (replaced stub from Plan 01)
- `pkg/providers/definitions/openai.yaml` - Embedded OpenAI provider definition
- `pkg/providers/definitions/anthropic.yaml` - Embedded Anthropic provider definition
- `pkg/providers/definitions/huggingface.yaml` - Embedded HuggingFace provider definition
- `providers/openai.yaml` - User-visible OpenAI reference definition
- `providers/anthropic.yaml` - User-visible Anthropic reference definition
- `providers/huggingface.yaml` - User-visible HuggingFace reference definition

## Decisions Made

- **Dual YAML location:** providers/ for user reference, pkg/providers/definitions/ for embed. Go's embed package cannot traverse `..` paths, so definitions/ inside the package is the only valid embed location.
- **DFA mode for Aho-Corasick:** `Opts{DFA: true}` chosen for guaranteed O(n) matching at the cost of more memory and higher upfront build time; appropriate for a scanner tool that pays the build cost once and scans many files.
- **Constructor injection over globals:** NewRegistry() returns a value; callers inject it. No package-level `var Registry` global, which avoids init order issues and enables testing.

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 3 - Blocking] Bootstrapped Plan 01-01 prerequisites in this worktree**

- **Found during:** Pre-task setup
- **Issue:** Plan 01-02 depends on Plan 01-01 (go.mod, main.go, test stubs), which runs in parallel in a different worktree. This worktree had no go.mod.
- **Fix:** Executed Plan 01-01 bootstrap (go mod init, go get all 10 deps, main.go, cmd/root.go, testdata fixtures, test stub files) before starting Plan 01-02 tasks.
- **Files modified:** go.mod, go.sum, main.go, cmd/root.go, testdata/samples/*.txt, pkg/*/stub_test.go files
- **Verification:** `go build ./...` succeeded before Plan 01-02 task execution
- **Committed in:** ebaf7d7 (RED phase commit includes bootstrap)

**2. [Rule 3 - Blocking] go mod tidy required after adding production packages**

- **Found during:** Task 2 GREEN phase
- **Issue:** `go test` failed with "no required module provides package github.com/petar-dambovaliev/aho-corasick" even though it was in go.mod; tidy hadn't propagated it for non-test code.
- **Fix:** Ran `go mod tidy`, which resolved the module graph.
- **Files modified:** go.mod, go.sum
- **Verification:** `go test ./pkg/providers/...` passed after tidy

---

**Total deviations:** 2 auto-fixed (2 blocking/infrastructure)
**Impact on plan:** Both deviations were infrastructure setup, not scope changes. Plan objectives met exactly.

## Issues Encountered

- Go embed `..` path restriction required dual YAML directory strategy (documented in plan's context, confirmed during implementation)
- aho-corasick package name is `aho_corasick` (underscore), not `ahocorasick`; used import alias `ahocorasick` for cleaner code

## User Setup Required

None - no external service configuration required.

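
For reference, the loader shape that follows from the embed restriction noted under Issues Encountered might look roughly like the sketch below. It is not the code in `pkg/providers/loader.go`: `loadDefinitions` and `demoFS` are hypothetical names, and an in-memory `fstest.MapFS` stands in for the `embed.FS` declared with `//go:embed definitions/*.yaml` so the example runs without any files on disk. YAML parsing is left out to keep it stdlib-only.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// loadDefinitions walks root inside fsys and returns the raw bytes of every
// .yaml file, keyed by its path. Hypothetical helper: a real loader would
// also unmarshal each file into a Provider and validate it.
func loadDefinitions(fsys fs.FS, root string) (map[string][]byte, error) {
	out := make(map[string][]byte)
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // propagate walk errors instead of skipping silently
		}
		if d.IsDir() || path.Ext(p) != ".yaml" {
			return nil // only provider definition files are of interest
		}
		data, err := fs.ReadFile(fsys, p)
		if err != nil {
			return fmt.Errorf("read %s: %w", p, err)
		}
		out[p] = data
		return nil
	})
	if err != nil {
		return nil, err
	}
	return out, nil
}

// demoFS is an assumption for illustration: it mimics the layout of
// pkg/providers/definitions/ with placeholder file contents.
func demoFS() fstest.MapFS {
	return fstest.MapFS{
		"definitions/openai.yaml":    {Data: []byte("name: openai\n")},
		"definitions/anthropic.yaml": {Data: []byte("name: anthropic\n")},
		"definitions/README.md":      {Data: []byte("not a provider\n")},
	}
}

func main() {
	defs, err := loadDefinitions(demoFS(), "definitions")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(defs), "definitions loaded")
}
```

Because the helper accepts any `fs.FS`, the same function could be handed the embedded filesystem in production and a `MapFS` in tests.
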
## Next Phase Readiness

- Registry interface is stable: NewRegistry(), List(), Get(), Stats(), AC(); downstream plans can depend on these signatures
- Phase 03 (Storage Layer) can proceed immediately: no registry dependency
- Phase 04 (Scan Engine) can now wire AC() for keyword pre-filtering
- Phase 05 (CLI) can call Registry.List() for `keyhunter providers list`
- Known: only 3 reference providers embedded; Phase 02-03 will add all 108

---
*Phase: 01-foundation*
*Completed: 2026-04-04*
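
The registry surface that downstream phases depend on could be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `pkg/providers/registry.go`: the struct fields, the parameter and return types, and the keyword strings are all hypothetical, `NewRegistry` takes its providers as an argument here instead of loading embedded YAML, and `AC()` is omitted because it needs the third-party Aho-Corasick package.

```go
package main

import "fmt"

// Provider is a trimmed-down stand-in for the struct in schema.go.
// The field set is an assumption for illustration.
type Provider struct {
	Name     string
	Keywords []string
}

// RegistryStats fields are hypothetical; the summary only names the type.
type RegistryStats struct {
	Providers int
	Keywords  int
}

// Registry is written once in NewRegistry and only read afterwards,
// which is why concurrent readers need no mutex.
type Registry struct {
	providers []Provider
	index     map[string]int // provider name -> position in providers
}

// NewRegistry validates and indexes the providers in one call.
// Assumption: the real constructor loads definitions itself.
func NewRegistry(providers []Provider) (*Registry, error) {
	r := &Registry{
		providers: make([]Provider, len(providers)),
		index:     make(map[string]int, len(providers)),
	}
	copy(r.providers, providers)
	for i, p := range r.providers {
		if p.Name == "" {
			return nil, fmt.Errorf("provider %d: empty name", i)
		}
		if _, dup := r.index[p.Name]; dup {
			return nil, fmt.Errorf("duplicate provider %q", p.Name)
		}
		r.index[p.Name] = i
	}
	return r, nil
}

// List returns a shallow copy so callers cannot reorder the internal slice.
func (r *Registry) List() []Provider {
	return append([]Provider(nil), r.providers...)
}

// Get looks a provider up by name.
func (r *Registry) Get(name string) (Provider, bool) {
	i, ok := r.index[name]
	if !ok {
		return Provider{}, false
	}
	return r.providers[i], true
}

// Stats summarizes what was loaded.
func (r *Registry) Stats() RegistryStats {
	s := RegistryStats{Providers: len(r.providers)}
	for _, p := range r.providers {
		s.Keywords += len(p.Keywords)
	}
	return s
}

func main() {
	// Placeholder keywords, not taken from the real definitions.
	reg, err := NewRegistry([]Provider{
		{Name: "openai", Keywords: []string{"kw-a", "kw-b"}},
		{Name: "anthropic", Keywords: []string{"kw-c"}},
		{Name: "huggingface", Keywords: []string{"kw-d"}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", reg.Stats())
}
```

A caller such as the CLI would receive the constructed registry as an argument, in line with the constructor-injection decision, rather than reaching for a package-level variable.
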