docs(02): research phase domain - provider key formats and regex patterns

2026-04-05 13:06:03 +03:00
parent fea691f27b
commit b8c69cba7e
1 changed files with 573 additions and 0 deletions
--- a/.planning/phases/02-tier-1-2-providers/02-RESEARCH.md
+++ b/.planning/phases/02-tier-1-2-providers/02-RESEARCH.md
@@ -0,0 +1,573 @@
+# Phase 2: Tier 1-2 Providers - Research
+
+**Researched:** 2026-04-05
+**Domain:** LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
+**Confidence:** MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)
+
+## Summary
+
+This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is **accuracy of regex patterns and key format data** across 26 providers with varying documentation quality.
+
+For the well-documented providers (OpenAI, Anthropic, Google AI, and xAI in Tier 1; Groq and Replicate in Tier 2), key formats come with authoritative regex patterns from TruffleHog and gitleaks. Many others, including most Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli), lack distinctive key prefixes and use generic opaque token formats, which makes regex-only detection harder and requires keyword-context-based matching.
+
+**Primary recommendation:** Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/).
+
+<user_constraints>
+## User Constraints (from CONTEXT.md)
+
+### Locked Decisions
+None -- all implementation choices at Claude's discretion (infrastructure/data phase).
+
+### Claude's Discretion
+All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.
+
+### Deferred Ideas (OUT OF SCOPE)
+None -- discuss phase skipped.
+</user_constraints>
+
+<phase_requirements>
+## Phase Requirements
+
+| ID | Description | Research Support |
+|----|-------------|------------------|
+| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Regex patterns documented below for all 12; verify endpoints identified |
+| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, Sambanova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers |
+</phase_requirements>
+
+## Project Constraints (from CLAUDE.md)
+
+- **Go regexp only (RE2)** -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
+- **Providers in dual locations**: `providers/` (user-visible) and `pkg/providers/definitions/` (Go embed source). Files must be kept in sync.
+- **Schema**: `format_version: 1`, `name`, `display_name`, `tier`, `last_verified`, `keywords[]`, `patterns[]` (with `regex`, `entropy_min`, `confidence`), `verify` (with `method`, `url`, `headers`, `valid_status`, `invalid_status`).
+- **Confidence values**: `high`, `medium`, `low` (validated in UnmarshalYAML).
+- **Keywords**: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.
+
+## Standard Stack
+
+No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.
+
+### Existing Infrastructure Used
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| Provider struct | `pkg/providers/schema.go` | YAML schema with validation |
+| Loader | `pkg/providers/loader.go` | embed.FS walker for `definitions/*.yaml` |
+| Registry | `pkg/providers/registry.go` | Provider index + AC automaton build |
+| Tests | `pkg/providers/registry_test.go` | Registry load, get, stats, AC tests |
+
+## Architecture Patterns
+
+### File Placement (Dual Location)
+Every new provider YAML must be placed in BOTH:
+```
+providers/{name}.yaml                    # User-visible reference
+pkg/providers/definitions/{name}.yaml    # Go embed source (compiled into binary)
+```
+
+### YAML Template Pattern
+```yaml
+format_version: 1
+name: {provider_slug}
+display_name: {Display Name}
+tier: {1 or 2}
+last_verified: "2026-04-05"
+keywords:
+  - "{prefix_literal}"          # Exact key prefix for AC match
+  - "{provider_name_lowercase}" # Provider name for context match
+  - "{env_var_hint}"            # Common env var name fragments
+patterns:
+  - regex: '{RE2_compatible_regex}'
+    entropy_min: {3.0-4.0}
+    confidence: {high|medium|low}
+verify:
+  method: {GET|POST}
+  url: {lightweight_api_endpoint}
+  headers:
+    {auth_header}: "{KEY_placeholder}"
+  valid_status: [200]
+  invalid_status: [401, 403]
+```
+
+### Confidence Level Assignment Strategy
+| Key Format | Confidence | Rationale |
+|------------|------------|-----------|
+| Unique prefix + fixed length (e.g., `sk-ant-api03-`, `gsk_`, `r8_`, `xai-`) | **high** | Prefix alone is near-unique; false positive rate extremely low |
+| Unique prefix, variable length (e.g., `sk-proj-`, `AIzaSy`) | **high** | Prefix is distinctive enough |
+| Short generic prefix + context needed (e.g., a bare `sk-` as shown in some Together AI documentation) | **medium** | Prefix collides with OpenAI legacy; needs keyword context |
+| No prefix, opaque token (e.g., UUID, hex string, base64) | **low** | Requires strong keyword context; high false positive risk without AC pre-filter |
+| 32-char hex string (e.g., Azure OpenAI) | **low** | Extremely generic format; keyword context mandatory |
+
+### Keyword Strategy for Low-Confidence Providers
+Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:
+1. Provider name (lowercase)
+2. Common env var fragments (e.g., `together_api`, `baseten_api`, `modal_token`)
+3. API base URL fragments (e.g., `api.together.xyz`, `api.baseten.co`)
+4. SDK/config identifiers (e.g., `togetherai`, `deepinfra`)
+
+### Anti-Patterns to Avoid
+- **Overly broad regex without keyword anchor:** A pattern like `[A-Za-z0-9]{40}` without keywords would match every 40-char alphanumeric string -- useless.
+- **PCRE features in regex:** Go RE2 does not support lookahead (`(?=)`), lookbehind (`(?<=)`), or backreferences. All patterns must be RE2-safe.
+- **Hardcoding `T3BlbkFJ` for non-OpenAI providers:** The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
+- **Missing dual-location sync:** Forgetting to copy YAML to both `providers/` and `pkg/providers/definitions/`.
+
+## Provider Key Format Research
+
+### Tier 1: Frontier Providers (12)
+
+#### 1. OpenAI
+**Confidence: HIGH** -- TruffleHog verified
+- **Prefixes:** `sk-proj-`, `sk-svcacct-`, `sk-None-`, legacy `sk-` (all contain `T3BlbkFJ` base64 marker)
+- **TruffleHog regex:** `sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+`
+- **KeyHunter regex (simplified):** `sk-proj-[A-Za-z0-9_\-]{48,}` (existing, covers primary format)
+- **Note:** Existing openai.yaml only covers `sk-proj-`. Should add patterns for `sk-svcacct-` and legacy `sk-` with T3BlbkFJ marker.
+- **Verify:** GET `https://api.openai.com/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
+- **Keywords:** `sk-proj-`, `sk-svcacct-`, `openai`, `T3BlbkFJ`
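+
+A possible shape for the updated `openai.yaml`, combining the existing simplified pattern with the TruffleHog-derived one quoted above. This is a sketch under assumptions: the `entropy_min` values are not taken from the existing file, and `invalid_status` lists only the 401 named in the Verify bullet.
+
+```yaml
+# Sketch only -- entropy_min values are assumed, not copied from the existing openai.yaml
+format_version: 1
+name: openai
+display_name: OpenAI
+tier: 1
+last_verified: "2026-04-05"
+keywords:
+  - "sk-proj-"
+  - "sk-svcacct-"
+  - "openai"
+  - "T3BlbkFJ"
+patterns:
+  # Existing simplified pattern for project keys
+  - regex: 'sk-proj-[A-Za-z0-9_\-]{48,}'
+    entropy_min: 3.5
+    confidence: high
+  # TruffleHog-derived pattern anchored on the T3BlbkFJ marker
+  - regex: 'sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+'
+    entropy_min: 3.5
+    confidence: high
+verify:
+  method: GET
+  url: https://api.openai.com/v1/models
+  headers:
+    Authorization: "Bearer {KEY}"
+  valid_status: [200]
+  invalid_status: [401]
+```
+
+Both patterns use only non-capturing groups and character classes, so they stay within RE2. Note that the second alternative of the TruffleHog pattern does not admit a hyphen after `sk-`, so `sk-None-` keys may need their own pattern; that should be checked against sample keys at implementation time.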
+
+#### 2. Anthropic
+**Confidence: HIGH** -- TruffleHog + gitleaks verified
+- **Prefixes:** `sk-ant-api03-` (standard), `sk-ant-admin01-` (admin)
+- **TruffleHog regex:** `sk-ant-(?:admin01|api03)-[\w\-]{93}AA`
+- **gitleaks regex:** `sk-ant-api03-[a-zA-Z0-9_\-]{93}AA`
+- **Note:** Existing anthropic.yaml pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` should be tightened to end with `AA` suffix.
+- **Verify:** GET `https://api.anthropic.com/v1/models` with `x-api-key: {KEY}` + `anthropic-version: 2023-06-01` -- 200=valid, 401=invalid
+- **Keywords:** `sk-ant-api03-`, `sk-ant-admin01-`, `anthropic`
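+
+The tightening described in the Note could look like the fragment below. It is a sketch: the admin pattern assumes the same 93-character body and `AA` suffix that the TruffleHog regex applies to both prefixes, and the `entropy_min` values are assumed.
+
+```yaml
+# Fragment for anthropic.yaml -- entropy_min values are assumptions
+patterns:
+  - regex: 'sk-ant-api03-[A-Za-z0-9_\-]{93}AA'
+    entropy_min: 3.5
+    confidence: high
+  - regex: 'sk-ant-admin01-[A-Za-z0-9_\-]{93}AA'
+    entropy_min: 3.5
+    confidence: high
+verify:
+  method: GET
+  url: https://api.anthropic.com/v1/models
+  headers:
+    x-api-key: "{KEY}"
+    anthropic-version: "2023-06-01"
+  valid_status: [200]
+  invalid_status: [401]
+```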
+
+#### 3. Google AI (Gemini)
+**Confidence: HIGH** -- TruffleHog verified
+- **Prefix:** `AIzaSy`
+- **TruffleHog regex:** `AIzaSy[A-Za-z0-9_-]{33}`
+- **Total length:** 39 characters
+- **Note:** Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
+- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` -- 200=valid, 400/403=invalid
+- **Keywords:** `AIzaSy`, `google_api`, `gemini`
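+
+A hedged fragment for the Google AI definition follows. It assumes the verifier substitutes `{KEY}` inside `url` as well as inside headers, which the schema summary in this document does not confirm; the `entropy_min` value is also assumed.
+
+```yaml
+# Fragment for a Google AI definition -- {KEY} substitution in url is an assumption
+keywords:
+  - "AIzaSy"
+  - "google_api"
+  - "gemini"
+patterns:
+  - regex: 'AIzaSy[A-Za-z0-9_-]{33}'
+    entropy_min: 3.5
+    confidence: high
+verify:
+  method: GET
+  url: "https://generativelanguage.googleapis.com/v1/models?key={KEY}"
+  valid_status: [200]
+  invalid_status: [400, 403]
+```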
+
+#### 4. Google Vertex AI
+**Confidence: MEDIUM**
+- **Format:** Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
+- **Approach:** For API key mode, reuse `AIzaSy` pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with `"type": "service_account"` and `"private_key"` field).
+- **Recommendation:** Create a separate `vertex-ai` provider YAML that covers only the API key path with the `AIzaSy` pattern.
+- **Service account keys (out of scope):** The private key in a GCP service account JSON starts with `-----BEGIN PRIVATE KEY-----`, but this is a general GCP credential, not Vertex-specific, so it does not belong in this provider.
+- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` (same endpoint works for Vertex API keys)
+- **Keywords:** `vertex`, `google_cloud`, `AIzaSy`, `vertex_ai`
+
+#### 5. AWS Bedrock
+**Confidence: HIGH** -- gitleaks verified
+- **Long-lived prefix:** `ABSK` (the base64 payload that follows decodes to a string beginning with `BedrockAPIKey-`)
+- **gitleaks regex (long-lived):** `ABSK[A-Za-z0-9+/]{109,269}={0,2}`
+- **Short-lived prefix:** `bedrock-api-key-YmVkcm9ja` (base64 prefix)
+- **Also detect:** AWS IAM access keys (`AKIA[0-9A-Z]{16}` + secret) used with Bedrock
+- **Recommendation:** Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
+- **Verify:** Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
+- **Keywords:** `ABSK`, `bedrock`, `aws_bedrock`, `AKIA`
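+
+The two-pattern recommendation could be sketched as below. The slug `aws-bedrock`, the `entropy_min` values, and the `medium` confidence on the shared `AKIA` pattern are assumptions. The `verify` block is left out because how the schema represents an empty or placeholder verify is not established in this document.
+
+```yaml
+# Sketch only -- slug, entropy_min, and AKIA confidence are assumptions
+format_version: 1
+name: aws-bedrock
+display_name: AWS Bedrock
+tier: 1
+last_verified: "2026-04-05"
+keywords:
+  - "ABSK"
+  - "bedrock"
+  - "aws_bedrock"
+  - "AKIA"
+patterns:
+  # Bedrock-specific long-lived API key (gitleaks pattern)
+  - regex: 'ABSK[A-Za-z0-9+/]{109,269}={0,2}'
+    entropy_min: 3.5
+    confidence: high
+  # General AWS access key ID, shared with other AWS services
+  - regex: 'AKIA[0-9A-Z]{16}'
+    entropy_min: 3.0
+    confidence: medium
+# verify omitted: Bedrock requires SigV4 signing, so no simple HTTP check applies
+```
+
+The `{109,269}` bound stays below the repeat-count limit of 1000 that Go's regexp package enforces.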
+
+#### 6. Azure OpenAI
+**Confidence: MEDIUM** -- TruffleHog verified but pattern is generic
+- **Format:** 32-character lowercase hexadecimal string
+- **TruffleHog regex:** `[a-f0-9]{32}` (with keyword context requirement)
+- **Problem:** 32-char hex is extremely generic. MUST rely on keywords for context.
+- **Keywords:** `azure`, `openai.azure.com`, `azure_openai`, `api-key`, `cognitive`
+- **Verify:** GET `https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01` with `api-key: {KEY}` -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
+- **Entropy min:** 3.5 (a hex alphabet caps Shannon entropy at log2(16) = 4.0 bits per character)
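+
+Because the pattern is so generic, the definition depends almost entirely on its keyword list. A minimal sketch, assuming the slug `azure-openai` and leaving `verify` out because the endpoint needs a per-tenant resource name:
+
+```yaml
+# Sketch only -- the slug is hypothetical; verify omitted (resource name required)
+format_version: 1
+name: azure-openai
+display_name: Azure OpenAI
+tier: 1
+last_verified: "2026-04-05"
+keywords:
+  - "azure"
+  - "openai.azure.com"
+  - "azure_openai"
+  - "api-key"
+  - "cognitive"
+patterns:
+  - regex: '[a-f0-9]{32}'
+    entropy_min: 3.5
+    confidence: low
+```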
+
+#### 7. Meta AI (Llama API)
+**Confidence: LOW**
+- **Format:** Not publicly documented as of April 2026. Meta Llama API launched April 2025.
+- **Env var:** `META_LLAMA_API_KEY`
+- **Approach:** Generic long token pattern with strong keyword context.
+- **Keywords:** `meta`, `llama`, `meta_llama`, `llama_api`
+- **Regex:** Generic high-entropy alphanumeric pattern, low confidence (no documented prefix)
+- **Verify:** GET `https://api.llama.com/v1/models` with `Authorization: Bearer {KEY}` (inferred from OpenAI-compatible API)
+
+#### 8. xAI (Grok)
+**Confidence: HIGH** -- TruffleHog verified
+- **Prefix:** `xai-`
+- **TruffleHog regex:** `xai-[0-9a-zA-Z_]{80}`
+- **Total length:** 84 characters
+- **Verify:** GET `https://api.x.ai/v1/api-key` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
+- **Keywords:** `xai-`, `xai`, `grok`
+
+#### 9. Cohere
+**Confidence: MEDIUM** -- gitleaks verified but pattern requires context
+- **Format:** 40 alphanumeric characters, no distinctive prefix
+- **gitleaks regex:** Context-dependent match on `cohere` or `CO_API_KEY` keyword + `[a-zA-Z0-9]{40}`
+- **Problem:** 40-char alphanumeric overlaps with many other tokens.
+- **Keywords:** `cohere`, `CO_API_KEY`, `cohere_api`
+- **Verify:** GET `https://api.cohere.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
+- **Entropy min:** 4.0 (higher threshold to reduce false positives)
+
+#### 10. Mistral AI
+**Confidence: MEDIUM**
+- **Format:** Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
+- **Approach:** Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
+- **Keywords:** `mistral`, `MISTRAL_API_KEY`, `mistral.ai`, `la_plateforme`
+- **Verify:** GET `https://api.mistral.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
+
+#### 11. Inflection AI
+**Confidence: LOW**
+- **Format:** Not publicly documented. API launched late 2025 via inflection-sdk.
+- **Env var:** `PI_API_KEY`, `INFLECTION_API_KEY`
+- **Keywords:** `inflection`, `pi_api`, `PI_API_KEY`
+- **Verify:** Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.
+
+#### 12. AI21 Labs
+**Confidence: LOW**
+- **Format:** Not publicly documented with distinctive prefix.
+- **Env var:** `AI21_API_KEY`
+- **Keywords:** `ai21`, `AI21_API_KEY`, `jamba`, `jurassic`
+- **Verify:** GET `https://api.ai21.com/studio/v1/models` with `Authorization: Bearer {KEY}` -- inferred
+
+### Tier 2: Inference Platforms (14)
+
+#### 1. Together AI
+**Confidence: LOW-MEDIUM**
+- **Format:** Appears to use generic opaque tokens. Some documentation shows `sk-` prefix but this may be example placeholder.
+- **Keywords:** `together`, `TOGETHER_API_KEY`, `api.together.xyz`, `togetherai`
+- **Verify:** GET `https://api.together.xyz/v1/models` with `Authorization: Bearer {KEY}`
+
+#### 2. Fireworks AI
+**Confidence: LOW-MEDIUM**
+- **Format:** GitGuardian confirms "Prefixed: False". Opaque token format.
+- **Prefix:** `fw_` prefix has been reported in some sources but not confirmed by GitGuardian.
+- **Keywords:** `fireworks`, `FIREWORKS_API_KEY`, `fireworks.ai`, `fw_`
+- **Verify:** GET `https://api.fireworks.ai/inference/v1/models` with `Authorization: Bearer {KEY}`
+
+#### 3. Groq
+**Confidence: HIGH** -- TruffleHog verified
+- **Prefix:** `gsk_`
+- **TruffleHog regex:** `gsk_[a-zA-Z0-9]{52}`
+- **Total length:** 56 characters
+- **Verify:** GET `https://api.groq.com/openai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
+- **Keywords:** `gsk_`, `groq`, `GROQ_API_KEY`
+
+#### 4. Replicate
+**Confidence: HIGH** -- TruffleHog verified
+- **Prefix:** `r8_`
+- **TruffleHog regex:** `r8_[0-9A-Za-z-_]{37}`
+- **Total length:** 40 characters
+- **Verify:** GET `https://api.replicate.com/v1/predictions` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
+- **Keywords:** `r8_`, `replicate`, `REPLICATE_API_TOKEN`
+
+#### 5. Anyscale
+**Confidence: MEDIUM**
+- **Prefix:** `esecret_`
+- **Regex:** `esecret_[A-Za-z0-9_-]{20,}`
+- **Keywords:** `esecret_`, `anyscale`, `ANYSCALE_API_KEY`
+- **Verify:** GET `https://api.endpoints.anyscale.com/v1/models` with `Authorization: Bearer {KEY}`
+- **Note:** Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.
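+
+Put together, an Anyscale definition might look like the following. The `entropy_min` value and the status lists are assumptions carried over from the YAML template, not values confirmed against the Anyscale API.
+
+```yaml
+# Sketch only -- entropy_min and status codes follow the template, unverified
+format_version: 1
+name: anyscale
+display_name: Anyscale
+tier: 2
+last_verified: "2026-04-05"
+keywords:
+  - "esecret_"
+  - "anyscale"
+  - "ANYSCALE_API_KEY"
+patterns:
+  - regex: 'esecret_[A-Za-z0-9_-]{20,}'
+    entropy_min: 3.5
+    confidence: medium
+verify:
+  method: GET
+  url: https://api.endpoints.anyscale.com/v1/models
+  headers:
+    Authorization: "Bearer {KEY}"
+  valid_status: [200]
+  invalid_status: [401, 403]
+```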
+
+#### 6. DeepInfra
+**Confidence: LOW-MEDIUM**
+- **Format:** Opaque token. JWT-based scoped tokens use `jwt:` prefix.
+- **Keywords:** `deepinfra`, `DEEPINFRA_API_KEY`, `deepinfra.com`
+- **Verify:** GET `https://api.deepinfra.com/v1/openai/models` with `Authorization: Bearer {KEY}`
+
+#### 7. Lepton AI
+**Confidence: LOW**
+- **Format:** Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
+- **Keywords:** `lepton`, `LEPTON_API_TOKEN`, `lepton.ai`
+- **Verify:** Endpoint uncertain. Mark as placeholder.
+
+#### 8. Modal
+**Confidence: LOW**
+- **Format:** Not publicly documented with distinctive format.
+- **Keywords:** `modal`, `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `modal.com`
+- **Note:** Modal uses token ID + token secret pair, not a single API key.
+- **Verify:** Mark as placeholder.
+
+#### 9. Baseten
+**Confidence: LOW-MEDIUM**
+- **Format:** Uses API keys passed in header.
+- **Keywords:** `baseten`, `BASETEN_API_KEY`, `api.baseten.co`
+- **Verify:** GET `https://api.baseten.co/v1/models` with `Authorization: Api-Key {KEY}` (non-standard header)
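+
+The non-standard header only changes the value under `headers`. A fragment, with status codes assumed from the template rather than confirmed against Baseten:
+
+```yaml
+# Fragment for baseten.yaml -- status codes are template defaults, unverified
+verify:
+  method: GET
+  url: https://api.baseten.co/v1/models
+  headers:
+    Authorization: "Api-Key {KEY}"   # "Api-Key" scheme instead of "Bearer"
+  valid_status: [200]
+  invalid_status: [401, 403]
+```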
+
+#### 10. Cerebrium
+**Confidence: LOW**
+- **Format:** Not publicly documented with distinctive format.
+- **Keywords:** `cerebrium`, `CEREBRIUM_API_KEY`, `cerebrium.ai`
+- **Verify:** Mark as placeholder.
+
+#### 11. NovitaAI
+**Confidence: LOW**
+- **Format:** Not publicly documented with distinctive format.
+- **Keywords:** `novita`, `NOVITA_API_KEY`, `novita.ai`
+- **Verify:** GET `https://api.novita.ai/v3/openai/models` with `Authorization: Bearer {KEY}` (inferred)
+
+#### 12. SambaNova
+**Confidence: LOW**
+- **Format:** Not publicly documented with distinctive prefix.
+- **Keywords:** `sambanova`, `SAMBANOVA_API_KEY`, `sambastudio`, `snapi`
+- **Verify:** GET `https://api.sambanova.ai/v1/models` with `Authorization: Bearer {KEY}` (inferred OpenAI-compatible)
+
+#### 13. OctoAI
+**Confidence: LOW**
+- **Format:** Not publicly documented. OctoAI was shut down / merged in 2025.
+- **Keywords:** `octoai`, `OCTOAI_TOKEN`, `octo.ai`
+- **Verify:** Mark as placeholder (service may be defunct).
+
+#### 14. Friendli
+**Confidence: LOW**
+- **Format:** Uses "Friendli Token" -- not publicly documented format.
+- **Keywords:** `friendli`, `FRIENDLI_TOKEN`, `friendli.ai`
+- **Verify:** Mark as placeholder.
+
+### IMPORTANT: Perplexity Classification Issue
+
+The phase description lists Perplexity as Tier 2, while REQUIREMENTS.md lists it under Tier 3 (PROV-03). The PROV-02 row reproduced in the Phase Requirements table above names 14 providers and Perplexity is not among them, so the count must be reconciled against REQUIREMENTS.md at planning time. The gitleaks config has a pattern for it: `pplx-[a-zA-Z0-9]{48}`. **Follow the phase description** and define Perplexity as Tier 2 in this phase; if that turns out to be wrong, adjust the tier in Phase 3.
+
+#### Perplexity (listed in PROV-02)
+**Confidence: HIGH** -- gitleaks verified
+- **Prefix:** `pplx-`
+- **gitleaks regex:** `pplx-[a-zA-Z0-9]{48}`
+- **Total length:** 53 characters
+- **Keywords:** `pplx-`, `perplexity`, `PERPLEXITY_API_KEY`
+- **Verify:** `https://api.perplexity.ai/chat/completions` with `Authorization: Bearer {KEY}` is a POST endpoint, so a plain GET check does not apply. Use a model-listing endpoint if one is available.
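+
+A sketch of the detection side of the Perplexity definition. `verify` is left out because the Verify bullet does not identify a suitable GET endpoint; the `entropy_min` value is assumed.
+
+```yaml
+# Sketch only -- verify omitted pending a confirmed lightweight endpoint
+format_version: 1
+name: perplexity
+display_name: Perplexity
+tier: 2
+last_verified: "2026-04-05"
+keywords:
+  - "pplx-"
+  - "perplexity"
+  - "PERPLEXITY_API_KEY"
+patterns:
+  - regex: 'pplx-[a-zA-Z0-9]{48}'
+    entropy_min: 3.5
+    confidence: high
+```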
+
+## Don't Hand-Roll
+
+| Problem | Don't Build | Use Instead | Why |
+|---------|-------------|-------------|-----|
+| Regex validation | Custom regex validator | Go `regexp.Compile` + existing schema validation in `UnmarshalYAML` | Schema already validates confidence levels; add regex compilation check |
+| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data |
+| AC automaton rebuild | Manual keyword management | Existing `NewRegistry()` auto-builds AC from all provider keywords | Just add YAML files; registry handles everything |
+| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten |
+
+## Common Pitfalls
+
+### Pitfall 1: Catastrophic Backtracking in Regex
+**What goes wrong:** Complex regex with nested quantifiers causes exponential backtracking.
+**Why it happens:** Using patterns like `(a+)+` or `([A-Za-z0-9]+[-_]?)+`.
+**How to avoid:** Go RE2 engine prevents this by design -- but still write simple, linear patterns. Avoid unnecessary alternation groups.
+**Warning signs:** Regex compilation errors in tests.
+
+### Pitfall 2: Overly Broad Patterns Without Keywords
+**What goes wrong:** Pattern like `[A-Za-z0-9]{32}` matches random strings everywhere.
+**Why it happens:** Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
+**How to avoid:** Set confidence to `low`, require strong keyword list for AC pre-filtering. The scanner will only test regex AFTER AC finds a keyword match.
+**Warning signs:** Hundreds of false positives when scanning any codebase.
+
+### Pitfall 3: Forgetting Dual-Location File Sync
+**What goes wrong:** Provider added to `providers/` but not `pkg/providers/definitions/`, or vice versa.
+**Why it happens:** Phase 1 decision to maintain both locations.
+**How to avoid:** Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
+**Warning signs:** `go test` passes but `providers/` directory has different count than `pkg/providers/definitions/`.
+
+### Pitfall 4: Invalid YAML Schema
+**What goes wrong:** Missing `format_version`, empty `last_verified`, or invalid confidence value.
+**Why it happens:** Copy-paste errors or typos.
+**How to avoid:** The `UnmarshalYAML` validation catches these. Run `go test ./pkg/providers/...` after each batch.
+**Warning signs:** Test failures with schema validation errors.
+
+### Pitfall 5: Regex Not RE2-Compatible
+**What goes wrong:** Using PCRE features like `(?<=prefix)` or `(?!suffix)`.
+**Why it happens:** Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
+**How to avoid:** Test every regex with `regexp.MustCompile()` in Go. Strip boundary assertions that gitleaks adds (like `(?:[\x60'"\s;]|\\[nr]|$)` -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
+**Warning signs:** `regexp.Compile` returns error.
+
+### Pitfall 6: Schema Missing Category Field
+**What goes wrong:** CONTEXT.md mentions a `category` field in the YAML format, but `pkg/providers/schema.go` Provider struct has no `Category` field.
+**Why it happens:** CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
+**How to avoid:** Either add `Category` field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (`keyhunter providers list` might want category filtering).
+**Warning signs:** YAML has `category` field that gets silently ignored by Go YAML parser.
+
+## Code Examples
+
+### Provider YAML File (High-Confidence Prefixed Provider)
+```yaml
+# Source: TruffleHog xAI detector (verified)
+format_version: 1
+name: xai
+display_name: xAI
+tier: 1
+last_verified: "2026-04-05"
+keywords:
+  - "xai-"
+  - "xai"
+  - "grok"
+patterns:
+  - regex: 'xai-[0-9a-zA-Z_]{80}'
+    entropy_min: 3.5
+    confidence: high
+verify:
+  method: GET
+  url: https://api.x.ai/v1/api-key
+  headers:
+    Authorization: "Bearer {KEY}"
+  valid_status: [200]
+  invalid_status: [401, 403]
+```
+
+### Provider YAML File (Low-Confidence Generic Provider)
+```yaml
+# Low-confidence: no distinctive prefix, requires keyword context
+format_version: 1
+name: together
+display_name: Together AI
+tier: 2
+last_verified: "2026-04-05"
+keywords:
+  - "together"
+  - "together_api"
+  - "api.together.xyz"
+  - "togetherai"
+  - "TOGETHER_API_KEY"
+patterns:
+  - regex: '[A-Za-z0-9]{64}'
+    entropy_min: 4.0
+    confidence: low
+verify:
+  method: GET
+  url: https://api.together.xyz/v1/models
+  headers:
+    Authorization: "Bearer {KEY}"
+  valid_status: [200]
+  invalid_status: [401, 403]
+```
+
+### Test Pattern: Verify Regex Compiles
+```go
+func TestProviderRegexCompiles(t *testing.T) {
+    reg, err := providers.NewRegistry()
+    require.NoError(t, err)
+    for _, p := range reg.List() {
+        for i, pat := range p.Patterns {
+            _, err := regexp.Compile(pat.Regex)
+            assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
+        }
+    }
+}
+```
+
+### Test Pattern: Verify Provider Count
+```go
+func TestTier1And2ProviderCount(t *testing.T) {
+    reg, err := providers.NewRegistry()
+    require.NoError(t, err)
+    stats := reg.Stats()
+    assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
+    assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
+    assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
+}
+```
+
+### Test Pattern: AC Pre-Filter Matches Known Key Prefix
+```go
+func TestACMatchesKnownPrefixes(t *testing.T) {
+    reg, err := providers.NewRegistry()
+    require.NoError(t, err)
+    ac := reg.AC()
+    
+    prefixes := []string{
+        "sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123",
+        "xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123",
+    }
+    for _, prefix := range prefixes {
+        matches := ac.FindAll(prefix)
+        assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
+    }
+}
+```
+
+## State of the Art
+
+| Old Approach | Current Approach | When Changed | Impact |
+|--------------|------------------|--------------|--------|
+| OpenAI `sk-` prefix only | `sk-proj-`, `sk-svcacct-`, `sk-None-` with `T3BlbkFJ` marker | 2024 | Must detect all modern prefixes |
+| AWS Bedrock via IAM only | Bedrock-specific `ABSK` API keys | Late 2025 | New key type to detect |
+| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this |
+| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic `AA` suffix) | Current | Tighter patterns reduce false positives |
+
+## Open Questions
+
+1. **Category field in schema**
+   - What we know: CONTEXT.md mentions `category` in YAML format, but `schema.go` has no `Category` field.
+   - What's unclear: Whether to add it now or defer.
+   - Recommendation: Add ``Category string `yaml:"category"` `` to the Provider struct as part of this phase. Values like `"frontier"`, `"inference-platform"`, `"specialized"` would support CLI-04 filtering.
+
+2. **Perplexity tier assignment**
+   - What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3.
+   - What's unclear: Which is canonical.
+   - Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3.
+
+3. **OpenAI existing pattern needs update**
+   - What we know: Current `openai.yaml` only has `sk-proj-` pattern. Modern keys also use `sk-svcacct-`, `sk-None-`, and legacy `sk-` with `T3BlbkFJ` marker.
+   - Recommendation: Update openai.yaml with additional patterns.
+
+4. **Anthropic existing pattern can be tightened**
+   - What we know: Current pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` does not require the `AA` suffix.
+   - Recommendation: Tighten to `sk-ant-api03-[A-Za-z0-9_\-]{93}AA` per TruffleHog and add admin key pattern.
+
+5. **Defunct/transitioning providers**
+   - What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
+   - Recommendation: Still create YAML definitions for all of them -- keys issued while a service was operational can still turn up in code and history, and may retain value where the service continues under a successor platform.
+
+## Validation Architecture
+
+### Test Framework
+| Property | Value |
+|----------|-------|
+| Framework | Go standard `testing` + `testify` v1.x |
+| Config file | `go.mod` (no separate test config) |
+| Quick run command | `go test ./pkg/providers/... -v -count=1` |
+| Full suite command | `go test ./... -v -count=1` |
+
+### Phase Requirements -> Test Map
+| Req ID | Behavior | Test Type | Automated Command | File Exists? |
+|--------|----------|-----------|-------------------|-------------|
+| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier1Providers -v` | Wave 0 |
+| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier1Patterns -v` | Wave 0 |
+| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier2Providers -v` | Wave 0 |
+| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier2Patterns -v` | Wave 0 |
+| PROV-01+02 | AC automaton matches all provider keywords | unit | `go test ./pkg/providers/... -run TestACMatchesAllKeywords -v` | Wave 0 |
+| PROV-01+02 | Registry stats show 26+ providers | unit | `go test ./pkg/providers/... -run TestProviderCount -v` | Wave 0 |
+
+### Sampling Rate
+- **Per task commit:** `go test ./pkg/providers/... -v -count=1`
+- **Per wave merge:** `go test ./... -v -count=1`
+- **Phase gate:** Full suite green before `/gsd:verify-work`
+
+### Wave 0 Gaps
+- [ ] `pkg/providers/tier1_test.go` -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
+- [ ] `pkg/providers/tier2_test.go` -- covers PROV-02 (same as above for Tier 2)
+- [ ] Update `pkg/providers/registry_test.go` -- update `TestRegistryLoad` assertion from `>= 3` to `>= 26`
+
+## Sources
+
+### Primary (HIGH confidence)
+- [TruffleHog OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/openai/openai.go) -- regex pattern for OpenAI keys
+- [TruffleHog Anthropic detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/anthropic/anthropic.go) -- regex pattern for Anthropic keys
+- [TruffleHog Google Gemini detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/googlegemini/googlegemini.go) -- regex pattern for Google AI keys
+- [TruffleHog Groq detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/groq/groq.go) -- regex pattern `gsk_[a-zA-Z0-9]{52}`
+- [TruffleHog xAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/xai/xai.go) -- regex pattern `xai-[0-9a-zA-Z_]{80}`
+- [TruffleHog Replicate detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/replicate/replicate.go) -- regex pattern `r8_[0-9A-Za-z-_]{37}`
+- [TruffleHog Azure OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/azure_openai/azure_openai.go) -- 32-char hex with keyword context
+- [gitleaks config](https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml) -- Anthropic, Cohere, AWS Bedrock patterns
+
+### Secondary (MEDIUM confidence)
+- [AWS Bedrock API keys (Wiz blog)](https://www.wiz.io/blog/a-new-type-of-long-lived-key-on-aws-bedrock-api-keys) -- ABSK prefix documentation
+- [GitGuardian Groq detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/groq_api_key) -- confirms prefixed format
+- [GitGuardian Mistral detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/mistralai_apikey) -- confirms "Prefixed: False"
+- [GitGuardian Fireworks detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/fireworks_ai_api_key) -- confirms "Prefixed: False"
+- [OpenAI community forum](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531) -- sk-proj-, sk-svcacct-, sk-None- prefixes
+- [Replicate docs](https://replicate.com/docs/topics/security/api-tokens) -- r8_ prefix, 40-char format
+- [Anyscale docs](https://docs.anyscale.com/endpoints/text-generation/authenticate/) -- esecret_ prefix
+
+### Tertiary (LOW confidence -- needs validation at implementation)
+- Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
+- Inflection AI key format -- not documented, SDK-based inference only
+- AI21 Labs key format -- no prefix confirmed
+- Together AI key format -- no distinctive prefix confirmed
+- DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
+- Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented
+
+## Metadata
+
+**Confidence breakdown:**
+- Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
+- Architecture: HIGH -- dual-location YAML pattern well-established
+- Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
+- Provider key formats (Tier 1 less-known): LOW -- Meta, Inflection, AI21 have limited documentation
+- Provider key formats (Tier 2 prefixed): HIGH for Groq, Replicate, Perplexity; MEDIUM for Anyscale -- all have clear prefixes
+- Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
+- Pitfalls: HIGH -- well-understood RE2 and false-positive challenges
+
+**Research date:** 2026-04-05
+**Valid until:** 2026-05-05 (30 days -- API key formats change infrequently)