From b8c69cba7e21ae0401303b3548cf0687fe71c5b4 Mon Sep 17 00:00:00 2001 From: salvacybersec Date: Sun, 5 Apr 2026 13:06:03 +0300 Subject: [PATCH] docs(02): research phase domain - provider key formats and regex patterns --- .../02-tier-1-2-providers/02-RESEARCH.md | 573 ++++++++++++++++++ 1 file changed, 573 insertions(+) create mode 100644 .planning/phases/02-tier-1-2-providers/02-RESEARCH.md diff --git a/.planning/phases/02-tier-1-2-providers/02-RESEARCH.md b/.planning/phases/02-tier-1-2-providers/02-RESEARCH.md new file mode 100644 index 0000000..0a4acb8 --- /dev/null +++ b/.planning/phases/02-tier-1-2-providers/02-RESEARCH.md @@ -0,0 +1,573 @@ +# Phase 2: Tier 1-2 Providers - Research + +**Researched:** 2026-04-05 +**Domain:** LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints +**Confidence:** MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM) + +## Summary + +This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is **accuracy of regex patterns and key format data** across 26 providers with varying documentation quality. + +For the well-documented providers (OpenAI, Anthropic, Google AI, xAI, Cohere, and the Tier 2 provider Groq), key formats are backed by authoritative regex patterns from TruffleHog and gitleaks. Many Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli) lack distinctive key prefixes and use generic opaque token formats, which makes regex-only detection harder and requires keyword-context-based matching. + +**Primary recommendation:** Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. 
For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/). + + +## User Constraints (from CONTEXT.md) + +### Locked Decisions +None -- all implementation choices at Claude's discretion (infrastructure/data phase). + +### Claude's Discretion +All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates. + +### Deferred Ideas (OUT OF SCOPE) +None -- discuss phase skipped. + + + +## Phase Requirements + +| ID | Description | Research Support | +|----|-------------|------------------| +| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Regex patterns documented below for all 12; verify endpoints identified | +| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, Sambanova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers | + + +## Project Constraints (from CLAUDE.md) + +- **Go regexp only (RE2)** -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences). +- **Providers in dual locations**: `providers/` (user-visible) and `pkg/providers/definitions/` (Go embed source). Files must be kept in sync. 
+- **Schema**: `format_version: 1`, `name`, `display_name`, `tier`, `last_verified`, `keywords[]`, `patterns[]` (with `regex`, `entropy_min`, `confidence`), `verify` (with `method`, `url`, `headers`, `valid_status`, `invalid_status`). +- **Confidence values**: `high`, `medium`, `low` (validated in UnmarshalYAML). +- **Keywords**: lowercase, used for Aho-Corasick pre-filtering via DFA automaton. + +## Standard Stack + +No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1. + +### Existing Infrastructure Used +| Component | Location | Purpose | +|-----------|----------|---------| +| Provider struct | `pkg/providers/schema.go` | YAML schema with validation | +| Loader | `pkg/providers/loader.go` | embed.FS walker for `definitions/*.yaml` | +| Registry | `pkg/providers/registry.go` | Provider index + AC automaton build | +| Tests | `pkg/providers/registry_test.go` | Registry load, get, stats, AC tests | + +## Architecture Patterns + +### File Placement (Dual Location) +Every new provider YAML must be placed in BOTH: +``` +providers/{name}.yaml # User-visible reference +pkg/providers/definitions/{name}.yaml # Go embed source (compiled into binary) +``` + +### YAML Template Pattern +```yaml +format_version: 1 +name: {provider_slug} +display_name: {Display Name} +tier: {1 or 2} +last_verified: "2026-04-05" +keywords: + - "{prefix_literal}" # Exact key prefix for AC match + - "{provider_name_lowercase}" # Provider name for context match + - "{env_var_hint}" # Common env var name fragments +patterns: + - regex: '{RE2_compatible_regex}' + entropy_min: {3.0-4.0} + confidence: {high|medium|low} +verify: + method: {GET|POST} + url: {lightweight_api_endpoint} + headers: + {auth_header}: "{KEY_placeholder}" + valid_status: [200] + invalid_status: [401, 403] +``` + +### Confidence Level Assignment Strategy +| Key Format | Confidence | Rationale | +|------------|------------|-----------| +| Unique prefix + fixed length 
(e.g., `sk-ant-api03-`, `gsk_`, `r8_`, `xai-`, `AIzaSy`) | **high** | Prefix alone is near-unique; false positive rate extremely low | +| Unique prefix, variable length (e.g., `sk-proj-`) | **high** | Prefix is distinctive enough | +| Short generic prefix + context needed (e.g., a bare `sk-` prefix on a non-OpenAI key) | **medium** | Prefix collides with OpenAI legacy; needs keyword context | +| No prefix, opaque token (e.g., UUID, hex string, base64) | **low** | Requires strong keyword context; high false positive risk without AC pre-filter | +| 32-char hex string (e.g., Azure OpenAI) | **low** | Extremely generic format; keyword context mandatory | + +### Keyword Strategy for Low-Confidence Providers +Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include: +1. Provider name (lowercase) +2. Common env var fragments (e.g., `together_api`, `baseten_api`, `modal_token`) +3. API base URL fragments (e.g., `api.together.xyz`, `api.baseten.co`) +4. SDK/config identifiers (e.g., `togetherai`, `deepinfra`) + +### Anti-Patterns to Avoid +- **Overly broad regex without keyword anchor:** A pattern like `[A-Za-z0-9]{40}` without keywords would match every 40-char alphanumeric string -- useless. +- **PCRE features in regex:** Go RE2 does not support lookahead (`(?=)`), lookbehind (`(?<=)`), or backreferences. All patterns must be RE2-safe. +- **Hardcoding `T3BlbkFJ` for non-OpenAI providers:** `T3BlbkFJ` is the base64 encoding of "OpenAI" and is OpenAI-specific; do not use it for other providers. +- **Missing dual-location sync:** Forgetting to copy YAML to both `providers/` and `pkg/providers/definitions/`. + +## Provider Key Format Research + +### Tier 1: Frontier Providers (12) + +#### 1. 
OpenAI +**Confidence: HIGH** -- TruffleHog verified +- **Prefixes:** `sk-proj-`, `sk-svcacct-`, `sk-None-`, legacy `sk-` (all contain `T3BlbkFJ` base64 marker) +- **TruffleHog regex:** `sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+` +- **KeyHunter regex (simplified):** `sk-proj-[A-Za-z0-9_\-]{48,}` (existing, covers primary format) +- **Note:** Existing openai.yaml only covers `sk-proj-`. Should add patterns for `sk-svcacct-` and legacy `sk-` with T3BlbkFJ marker. +- **Verify:** GET `https://api.openai.com/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid +- **Keywords:** `sk-proj-`, `sk-svcacct-`, `openai`, `T3BlbkFJ` + +#### 2. Anthropic +**Confidence: HIGH** -- TruffleHog + gitleaks verified +- **Prefixes:** `sk-ant-api03-` (standard), `sk-ant-admin01-` (admin) +- **TruffleHog regex:** `sk-ant-(?:admin01|api03)-[\w\-]{93}AA` +- **gitleaks regex:** `sk-ant-api03-[a-zA-Z0-9_\-]{93}AA` +- **Note:** Existing anthropic.yaml pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` should be tightened to end with `AA` suffix. +- **Verify:** GET `https://api.anthropic.com/v1/models` with `x-api-key: {KEY}` + `anthropic-version: 2023-06-01` -- 200=valid, 401=invalid +- **Keywords:** `sk-ant-api03-`, `sk-ant-admin01-`, `anthropic` + +#### 3. Google AI (Gemini) +**Confidence: HIGH** -- TruffleHog verified +- **Prefix:** `AIzaSy` +- **TruffleHog regex:** `AIzaSy[A-Za-z0-9_-]{33}` +- **Total length:** 39 characters +- **Note:** Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen). +- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` -- 200=valid, 400/403=invalid +- **Keywords:** `AIzaSy`, `google_api`, `gemini` + +#### 4. Google Vertex AI +**Confidence: MEDIUM** +- **Format:** Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3). 
+- **Approach:** For API key mode, reuse `AIzaSy` pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with `"type": "service_account"` and `"private_key"` field). +- **Recommendation:** Create a separate `vertex-ai` provider YAML that covers only the API key path, using the `AIzaSy` pattern. +- **Service account keys:** The private key in a GCP service account JSON starts with `-----BEGIN PRIVATE KEY-----`, but this is a general GCP credential, not Vertex-specific, so it is left out of this provider. +- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` (same endpoint works for Vertex API keys) +- **Keywords:** `vertex`, `google_cloud`, `AIzaSy`, `vertex_ai` + +#### 5. AWS Bedrock +**Confidence: HIGH** -- gitleaks verified +- **Long-lived prefix:** `ABSK`, followed by a base64 payload that decodes to a string beginning with `BedrockAPIKey-` +- **gitleaks regex (long-lived):** `ABSK[A-Za-z0-9+/]{109,269}={0,2}` +- **Short-lived prefix:** `bedrock-api-key-YmVkcm9ja` (base64 prefix) +- **Also detect:** AWS IAM access keys (`AKIA[0-9A-Z]{16}` + secret) used with Bedrock +- **Recommendation:** Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services) +- **Verify:** Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder. +- **Keywords:** `ABSK`, `bedrock`, `aws_bedrock`, `AKIA` + +#### 6. Azure OpenAI +**Confidence: MEDIUM** -- TruffleHog verified but pattern is generic +- **Format:** 32-character lowercase hexadecimal string +- **TruffleHog regex:** `[a-f0-9]{32}` (with keyword context requirement) +- **Problem:** 32-char hex is extremely generic. MUST rely on keywords for context. 
+- **Keywords:** `azure`, `openai.azure.com`, `azure_openai`, `api-key`, `cognitive` +- **Verify:** GET `https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01` with `api-key: {KEY}`. The URL requires the resource name, so the key cannot be verified generically. Mark verify as placeholder. +- **Entropy min:** 3.5 (a hex string carries at most 4.0 bits of entropy per character) + +#### 7. Meta AI (Llama API) +**Confidence: LOW** +- **Format:** Not publicly documented as of April 2026. Meta Llama API launched April 2025. +- **Env var:** `META_LLAMA_API_KEY` +- **Approach:** Generic long token pattern with strong keyword context. +- **Keywords:** `meta`, `llama`, `meta_llama`, `llama_api` +- **Regex:** Generic high-entropy alphanumeric pattern, low confidence (opaque token with no prefix, per the confidence table) +- **Verify:** GET `https://api.llama.com/v1/models` with `Authorization: Bearer {KEY}` (inferred from OpenAI-compatible API) + +#### 8. xAI (Grok) +**Confidence: HIGH** -- TruffleHog verified +- **Prefix:** `xai-` +- **TruffleHog regex:** `xai-[0-9a-zA-Z_]{80}` +- **Total length:** 84 characters +- **Verify:** GET `https://api.x.ai/v1/api-key` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid +- **Keywords:** `xai-`, `xai`, `grok` + +#### 9. Cohere +**Confidence: MEDIUM** -- gitleaks verified but pattern requires context +- **Format:** 40 alphanumeric characters, no distinctive prefix +- **gitleaks regex:** Context-dependent match on `cohere` or `CO_API_KEY` keyword + `[a-zA-Z0-9]{40}` +- **Problem:** 40-char alphanumeric overlaps with many other tokens. +- **Keywords:** `cohere`, `CO_API_KEY`, `cohere_api` +- **Verify:** GET `https://api.cohere.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid +- **Entropy min:** 4.0 (higher threshold to reduce false positives) + +#### 10. Mistral AI +**Confidence: MEDIUM** +- **Format:** Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token. +- **Approach:** Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format. 
+- **Keywords:** `mistral`, `MISTRAL_API_KEY`, `mistral.ai`, `la_plateforme` +- **Verify:** GET `https://api.mistral.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid + +#### 11. Inflection AI +**Confidence: LOW** +- **Format:** Not publicly documented. API launched late 2025 via inflection-sdk. +- **Env var:** `PI_API_KEY`, `INFLECTION_API_KEY` +- **Keywords:** `inflection`, `pi_api`, `PI_API_KEY` +- **Verify:** Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder. + +#### 12. AI21 Labs +**Confidence: LOW** +- **Format:** Not publicly documented with distinctive prefix. +- **Env var:** `AI21_API_KEY` +- **Keywords:** `ai21`, `AI21_API_KEY`, `jamba`, `jurassic` +- **Verify:** GET `https://api.ai21.com/studio/v1/models` with `Authorization: Bearer {KEY}` -- inferred + +### Tier 2: Inference Platforms (14) + +#### 1. Together AI +**Confidence: LOW-MEDIUM** +- **Format:** Appears to use generic opaque tokens. Some documentation shows `sk-` prefix but this may be example placeholder. +- **Keywords:** `together`, `TOGETHER_API_KEY`, `api.together.xyz`, `togetherai` +- **Verify:** GET `https://api.together.xyz/v1/models` with `Authorization: Bearer {KEY}` + +#### 2. Fireworks AI +**Confidence: LOW-MEDIUM** +- **Format:** GitGuardian confirms "Prefixed: False". Opaque token format. +- **Prefix:** `fw_` prefix has been reported in some sources but not confirmed by GitGuardian. +- **Keywords:** `fireworks`, `FIREWORKS_API_KEY`, `fireworks.ai`, `fw_` +- **Verify:** GET `https://api.fireworks.ai/inference/v1/models` with `Authorization: Bearer {KEY}` + +#### 3. Groq +**Confidence: HIGH** -- TruffleHog verified +- **Prefix:** `gsk_` +- **TruffleHog regex:** `gsk_[a-zA-Z0-9]{52}` +- **Total length:** 56 characters +- **Verify:** GET `https://api.groq.com/openai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid +- **Keywords:** `gsk_`, `groq`, `GROQ_API_KEY` + +#### 4. 
Replicate +**Confidence: HIGH** -- TruffleHog verified +- **Prefix:** `r8_` +- **TruffleHog regex:** `r8_[0-9A-Za-z-_]{37}` +- **Total length:** 40 characters +- **Verify:** GET `https://api.replicate.com/v1/predictions` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid +- **Keywords:** `r8_`, `replicate`, `REPLICATE_API_TOKEN` + +#### 5. Anyscale +**Confidence: MEDIUM** +- **Prefix:** `esecret_` +- **Regex:** `esecret_[A-Za-z0-9_-]{20,}` +- **Keywords:** `esecret_`, `anyscale`, `ANYSCALE_API_KEY` +- **Verify:** GET `https://api.endpoints.anyscale.com/v1/models` with `Authorization: Bearer {KEY}` +- **Note:** Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting. + +#### 6. DeepInfra +**Confidence: LOW-MEDIUM** +- **Format:** Opaque token. JWT-based scoped tokens use `jwt:` prefix. +- **Keywords:** `deepinfra`, `DEEPINFRA_API_KEY`, `deepinfra.com` +- **Verify:** GET `https://api.deepinfra.com/v1/openai/models` with `Authorization: Bearer {KEY}` + +#### 7. Lepton AI +**Confidence: LOW** +- **Format:** Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton). +- **Keywords:** `lepton`, `LEPTON_API_TOKEN`, `lepton.ai` +- **Verify:** Endpoint uncertain. Mark as placeholder. + +#### 8. Modal +**Confidence: LOW** +- **Format:** Not publicly documented with distinctive format. +- **Keywords:** `modal`, `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `modal.com` +- **Note:** Modal uses token ID + token secret pair, not a single API key. +- **Verify:** Mark as placeholder. + +#### 9. Baseten +**Confidence: LOW-MEDIUM** +- **Format:** Uses API keys passed in header. +- **Keywords:** `baseten`, `BASETEN_API_KEY`, `api.baseten.co` +- **Verify:** GET `https://api.baseten.co/v1/models` with `Authorization: Api-Key {KEY}` (non-standard header) + +#### 10. Cerebrium +**Confidence: LOW** +- **Format:** Not publicly documented with distinctive format. 
+- **Keywords:** `cerebrium`, `CEREBRIUM_API_KEY`, `cerebrium.ai` +- **Verify:** Mark as placeholder. + +#### 11. NovitaAI +**Confidence: LOW** +- **Format:** Not publicly documented with distinctive format. +- **Keywords:** `novita`, `NOVITA_API_KEY`, `novita.ai` +- **Verify:** GET `https://api.novita.ai/v3/openai/models` with `Authorization: Bearer {KEY}` (inferred) + +#### 12. SambaNova +**Confidence: LOW** +- **Format:** Not publicly documented with distinctive prefix. +- **Keywords:** `sambanova`, `SAMBANOVA_API_KEY`, `sambastudio`, `snapi` +- **Verify:** GET `https://api.sambanova.ai/v1/models` with `Authorization: Bearer {KEY}` (inferred OpenAI-compatible) + +#### 13. OctoAI +**Confidence: LOW** +- **Format:** Not publicly documented. OctoAI was shut down / merged in 2025. +- **Keywords:** `octoai`, `OCTOAI_TOKEN`, `octo.ai` +- **Verify:** Mark as placeholder (service may be defunct). + +#### 14. Friendli +**Confidence: LOW** +- **Format:** Uses "Friendli Token" -- not publicly documented format. +- **Keywords:** `friendli`, `FRIENDLI_TOKEN`, `friendli.ai` +- **Verify:** Mark as placeholder. + +### IMPORTANT: Perplexity Classification Issue + +The phase description lists Perplexity as Tier 2, but REQUIREMENTS.md lists it under Tier 3 (PROV-03). The gitleaks config has a pattern: `pplx-[a-zA-Z0-9]{48}`. **Follow the phase description** which includes it in the Tier 2 list. If Perplexity should not be Tier 2, one of the other 14 Tier 2 providers from PROV-02 should replace it. Looking at the PROV-02 requirement list, Perplexity is listed there as one of the 14. This is correct for this phase. 
+ +#### Perplexity (listed in PROV-02) +**Confidence: HIGH** -- gitleaks verified +- **Prefix:** `pplx-` +- **gitleaks regex:** `pplx-[a-zA-Z0-9]{48}` +- **Total length:** 53 characters +- **Keywords:** `pplx-`, `perplexity`, `PERPLEXITY_API_KEY` +- **Verify:** `https://api.perplexity.ai/chat/completions` with `Authorization: Bearer {KEY}` is a POST endpoint, so a plain GET probe does not apply. Use a model-listing endpoint if one is available. + +## Don't Hand-Roll + +| Problem | Don't Build | Use Instead | Why | +|---------|-------------|-------------|-----| +| Regex validation | Custom regex validator | Go `regexp.Compile` + existing schema validation in `UnmarshalYAML` | Schema already validates confidence levels; add regex compilation check | +| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data | +| AC automaton rebuild | Manual keyword management | Existing `NewRegistry()` auto-builds AC from all provider keywords | Just add YAML files; registry handles everything | +| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten | + +## Common Pitfalls + +### Pitfall 1: Catastrophic Backtracking in Regex +**What goes wrong:** Complex regex with nested quantifiers causes exponential backtracking. +**Why it happens:** Using patterns like `(a+)+` or `([A-Za-z0-9]+[-_]?)+`. +**How to avoid:** Go RE2 engine prevents this by design -- but still write simple, linear patterns. Avoid unnecessary alternation groups. +**Warning signs:** Regex compilation errors in tests. + +### Pitfall 2: Overly Broad Patterns Without Keywords +**What goes wrong:** Pattern like `[A-Za-z0-9]{32}` matches random strings everywhere. +**Why it happens:** Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.). +**How to avoid:** Set confidence to `low`, require strong keyword list for AC pre-filtering. 
The scanner will only test regex AFTER AC finds a keyword match. +**Warning signs:** Hundreds of false positives when scanning any codebase. + +### Pitfall 3: Forgetting Dual-Location File Sync +**What goes wrong:** Provider added to `providers/` but not `pkg/providers/definitions/`, or vice versa. +**Why it happens:** Phase 1 decision to maintain both locations. +**How to avoid:** Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories. +**Warning signs:** `go test` passes but `providers/` directory has different count than `pkg/providers/definitions/`. + +### Pitfall 4: Invalid YAML Schema +**What goes wrong:** Missing `format_version`, empty `last_verified`, or invalid confidence value. +**Why it happens:** Copy-paste errors or typos. +**How to avoid:** The `UnmarshalYAML` validation catches these. Run `go test ./pkg/providers/...` after each batch. +**Warning signs:** Test failures with schema validation errors. + +### Pitfall 5: Regex Not RE2-Compatible +**What goes wrong:** Using PCRE features like `(?<=prefix)` or `(?!suffix)`. +**Why it happens:** Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers). +**How to avoid:** Test every regex with `regexp.MustCompile()` in Go. Strip boundary assertions that gitleaks adds (like `(?:[\x60'"\s;]|\\[nr]|$)` -- these are gitleaks-specific context matchers, not needed in our YAML patterns). +**Warning signs:** `regexp.Compile` returns error. + +### Pitfall 6: Schema Missing Category Field +**What goes wrong:** CONTEXT.md mentions a `category` field in the YAML format, but `pkg/providers/schema.go` Provider struct has no `Category` field. +**Why it happens:** CONTEXT.md describes an intended schema that was not fully implemented in Phase 1. 
+**How to avoid:** Either add `Category` field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (`keyhunter providers list` might want category filtering). +**Warning signs:** YAML has `category` field that gets silently ignored by Go YAML parser. + +## Code Examples + +### Provider YAML File (High-Confidence Prefixed Provider) +```yaml +# Source: TruffleHog xAI detector (verified) +format_version: 1 +name: xai +display_name: xAI +tier: 1 +last_verified: "2026-04-05" +keywords: + - "xai-" + - "xai" + - "grok" +patterns: + - regex: 'xai-[0-9a-zA-Z_]{80}' + entropy_min: 3.5 + confidence: high +verify: + method: GET + url: https://api.x.ai/v1/api-key + headers: + Authorization: "Bearer {KEY}" + valid_status: [200] + invalid_status: [401, 403] +``` + +### Provider YAML File (Low-Confidence Generic Provider) +```yaml +# Low-confidence: no distinctive prefix, requires keyword context +format_version: 1 +name: together +display_name: Together AI +tier: 2 +last_verified: "2026-04-05" +keywords: + - "together" + - "together_api" + - "api.together.xyz" + - "togetherai" + - "TOGETHER_API_KEY" +patterns: + - regex: '[A-Za-z0-9]{64}' + entropy_min: 4.0 + confidence: low +verify: + method: GET + url: https://api.together.xyz/v1/models + headers: + Authorization: "Bearer {KEY}" + valid_status: [200] + invalid_status: [401, 403] +``` + +### Test Pattern: Verify Regex Compiles +```go +func TestProviderRegexCompiles(t *testing.T) { + reg, err := providers.NewRegistry() + require.NoError(t, err) + for _, p := range reg.List() { + for i, pat := range p.Patterns { + _, err := regexp.Compile(pat.Regex) + assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex) + } + } +} +``` + +### Test Pattern: Verify Provider Count +```go +func TestTier1And2ProviderCount(t *testing.T) { + reg, err := providers.NewRegistry() + require.NoError(t, err) + stats := reg.Stats() + assert.GreaterOrEqual(t, 
stats.ByTier[1], 12, "expected at least 12 Tier 1 providers") + assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers") + assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers") +} +``` + +### Test Pattern: AC Pre-Filter Matches Known Key Prefix +```go +func TestACMatchesKnownPrefixes(t *testing.T) { + reg, err := providers.NewRegistry() + require.NoError(t, err) + ac := reg.AC() + + prefixes := []string{ + "sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123", + "xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123", + } + for _, prefix := range prefixes { + matches := ac.FindAll(prefix) + assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix) + } +} +``` + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| OpenAI `sk-` prefix only | `sk-proj-`, `sk-svcacct-`, `sk-None-` with `T3BlbkFJ` marker | 2024 | Must detect all modern prefixes | +| AWS Bedrock via IAM only | Bedrock-specific `ABSK` API keys | Late 2025 | New key type to detect | +| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this | +| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic `AA` suffix) | Current | Tighter patterns reduce false positives | + +## Open Questions + +1. **Category field in schema** + - What we know: CONTEXT.md mentions `category` in YAML format, but `schema.go` has no `Category` field. + - What's unclear: Whether to add it now or defer. + - Recommendation: Add `Category string \`yaml:"category"\`` to Provider struct as part of this phase. Values like `"frontier"`, `"inference-platform"`, `"specialized"` would support CLI-04 filtering. + +2. **Perplexity tier assignment** + - What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3. + - What's unclear: Which is canonical. 
+ - Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3. + +3. **OpenAI existing pattern needs update** + - What we know: Current `openai.yaml` only has `sk-proj-` pattern. Modern keys also use `sk-svcacct-`, `sk-None-`, and legacy `sk-` with `T3BlbkFJ` marker. + - Recommendation: Update openai.yaml with additional patterns. + +4. **Anthropic existing pattern can be tightened** + - What we know: Current pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` does not require the `AA` suffix. + - Recommendation: Tighten to `sk-ant-api03-[A-Za-z0-9_\-]{93}AA` per TruffleHog and add admin key pattern. + +5. **Defunct/transitioning providers** + - What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA. + - Recommendation: Still create YAML definitions for all -- leaked keys from defunct services can still have value if the service was operational. + +## Validation Architecture + +### Test Framework +| Property | Value | +|----------|-------| +| Framework | Go standard `testing` + `testify` v1.x | +| Config file | `go.mod` (no separate test config) | +| Quick run command | `go test ./pkg/providers/... -v -count=1` | +| Full suite command | `go test ./... -v -count=1` | + +### Phase Requirements -> Test Map +| Req ID | Behavior | Test Type | Automated Command | File Exists? | +|--------|----------|-----------|-------------------|-------------| +| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier1Providers -v` | Wave 0 | +| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier1Patterns -v` | Wave 0 | +| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier2Providers -v` | Wave 0 | +| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... 
-run TestTier2Patterns -v` | Wave 0 | +| PROV-01+02 | AC automaton matches all provider keywords | unit | `go test ./pkg/providers/... -run TestACMatchesAllKeywords -v` | Wave 0 | +| PROV-01+02 | Registry stats show 26+ providers | unit | `go test ./pkg/providers/... -run TestProviderCount -v` | Wave 0 | + +### Sampling Rate +- **Per task commit:** `go test ./pkg/providers/... -v -count=1` +- **Per wave merge:** `go test ./... -v -count=1` +- **Phase gate:** Full suite green before `/gsd:verify-work` + +### Wave 0 Gaps +- [ ] `pkg/providers/tier1_test.go` -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching) +- [ ] `pkg/providers/tier2_test.go` -- covers PROV-02 (same as above for Tier 2) +- [ ] Update `pkg/providers/registry_test.go` -- update `TestRegistryLoad` assertion from `>= 3` to `>= 26` + +## Sources + +### Primary (HIGH confidence) +- [TruffleHog OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/openai/openai.go) -- regex pattern for OpenAI keys +- [TruffleHog Anthropic detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/anthropic/anthropic.go) -- regex pattern for Anthropic keys +- [TruffleHog Google Gemini detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/googlegemini/googlegemini.go) -- regex pattern for Google AI keys +- [TruffleHog Groq detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/groq/groq.go) -- regex pattern `gsk_[a-zA-Z0-9]{52}` +- [TruffleHog xAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/xai/xai.go) -- regex pattern `xai-[0-9a-zA-Z_]{80}` +- [TruffleHog Replicate detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/replicate/replicate.go) -- regex pattern `r8_[0-9A-Za-z-_]{37}` +- [TruffleHog Azure OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/azure_openai/azure_openai.go) -- 
32-char hex with keyword context +- [gitleaks config](https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml) -- Anthropic, Cohere, AWS Bedrock patterns + +### Secondary (MEDIUM confidence) +- [AWS Bedrock API keys (Wiz blog)](https://www.wiz.io/blog/a-new-type-of-long-lived-key-on-aws-bedrock-api-keys) -- ABSK prefix documentation +- [GitGuardian Groq detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/groq_api_key) -- confirms prefixed format +- [GitGuardian Mistral detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/mistralai_apikey) -- confirms "Prefixed: False" +- [GitGuardian Fireworks detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/fireworks_ai_api_key) -- confirms "Prefixed: False" +- [OpenAI community forum](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531) -- sk-proj-, sk-svcacct-, sk-None- prefixes +- [Replicate docs](https://replicate.com/docs/topics/security/api-tokens) -- r8_ prefix, 40-char format +- [Anyscale docs](https://docs.anyscale.com/endpoints/text-generation/authenticate/) -- esecret_ prefix + +### Tertiary (LOW confidence -- needs validation at implementation) +- Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern +- Inflection AI key format -- not documented, SDK-based inference only +- AI21 Labs key format -- no prefix confirmed +- Together AI key format -- no distinctive prefix confirmed +- DeepInfra key format -- JWT-based tokens referenced but standard key format unclear +- Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented + +## Metadata + +**Confidence breakdown:** +- Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure +- Architecture: HIGH -- dual-location YAML pattern 
well-established +- Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns +- Provider key formats (Tier 1 less-known): MEDIUM -- Meta, Inflection, AI21 have limited documentation +- Provider key formats (Tier 2 prefixed): HIGH -- Groq, Replicate, Perplexity, Anyscale have clear prefixes +- Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats +- Pitfalls: HIGH -- well-understood RE2 and false-positive challenges + +**Research date:** 2026-04-05 +**Valid until:** 2026-05-05 (30 days -- API key formats change infrequently)