# Phase 2: Tier 1-2 Providers - Research

**Researched:** 2026-04-05
**Domain:** LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
**Confidence:** MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)

## Summary

This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is **accuracy of regex patterns and key format data** across 26 providers with varying documentation quality.

For the best-documented providers (OpenAI, Anthropic, Google AI, xAI, Cohere, and, in Tier 2, Groq), key formats come with authoritative regex patterns from TruffleHog and gitleaks. For most Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli), providers lack distinctive key prefixes and use generic opaque token formats, making regex-only detection harder and requiring keyword-context-based matching.

**Primary recommendation:** Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/).

## User Constraints (from CONTEXT.md)

### Locked Decisions

None -- all implementation choices at Claude's discretion (infrastructure/data phase).

### Claude's Discretion

All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.

### Deferred Ideas (OUT OF SCOPE)

None -- discuss phase skipped.
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Regex patterns documented below for all 12; verify endpoints identified |
| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, Sambanova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers |

## Project Constraints (from CLAUDE.md)

- **Go regexp only (RE2)** -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
- **Providers in dual locations**: `providers/` (user-visible) and `pkg/providers/definitions/` (Go embed source). Files must be kept in sync.
- **Schema**: `format_version: 1`, `name`, `display_name`, `tier`, `last_verified`, `keywords[]`, `patterns[]` (with `regex`, `entropy_min`, `confidence`), `verify` (with `method`, `url`, `headers`, `valid_status`, `invalid_status`).
- **Confidence values**: `high`, `medium`, `low` (validated in UnmarshalYAML).
- **Keywords**: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.

## Standard Stack

No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.
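The RE2-only constraint listed under Project Constraints can be checked with the standard library alone, which fits the "no new libraries" stance. The sketch below is illustrative rather than part of the existing codebase: `isRE2Safe` is a hypothetical helper name, and the candidate list mixes patterns quoted in this document with deliberately invalid PCRE-style variants invented for the demonstration.

```go
package main

import (
	"fmt"
	"regexp"
)

// isRE2Safe reports whether Go's regexp package (RE2 syntax) accepts the pattern.
// Hypothetical helper for illustration; not part of pkg/providers.
func isRE2Safe(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	candidates := []string{
		`gsk_[a-zA-Z0-9]{52}`,               // plain prefix + fixed length
		`ABSK[A-Za-z0-9+/]{109,269}={0,2}`,  // bounded repetition
		`(?<=key=)[a-f0-9]{32}`,             // lookbehind (invented example)
		`sk-(?!test)[A-Za-z0-9]+`,           // negative lookahead (invented example)
		`(\w+)-\1`,                          // backreference (invented example)
	}
	for _, c := range candidates {
		fmt.Printf("%-36s RE2-safe=%v\n", c, isRE2Safe(c))
	}
}
```

The first two patterns should be accepted and the last three rejected, since lookaround and backreferences are outside RE2 syntax. Running a check like this over every `regex` value before committing a YAML file surfaces incompatible patterns before they reach the registry loader.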
### Existing Infrastructure Used

| Component | Location | Purpose |
|-----------|----------|---------|
| Provider struct | `pkg/providers/schema.go` | YAML schema with validation |
| Loader | `pkg/providers/loader.go` | embed.FS walker for `definitions/*.yaml` |
| Registry | `pkg/providers/registry.go` | Provider index + AC automaton build |
| Tests | `pkg/providers/registry_test.go` | Registry load, get, stats, AC tests |

## Architecture Patterns

### File Placement (Dual Location)

Every new provider YAML must be placed in BOTH:

```
providers/{name}.yaml                    # User-visible reference
pkg/providers/definitions/{name}.yaml    # Go embed source (compiled into binary)
```

### YAML Template Pattern

```yaml
format_version: 1
name: {provider_slug}
display_name: {Display Name}
tier: {1 or 2}
last_verified: "2026-04-05"
keywords:
  - "{prefix_literal}"            # Exact key prefix for AC match
  - "{provider_name_lowercase}"   # Provider name for context match
  - "{env_var_hint}"              # Common env var name fragments
patterns:
  - regex: '{RE2_compatible_regex}'
    entropy_min: {3.0-4.0}
    confidence: {high|medium|low}
verify:
  method: {GET|POST}
  url: {lightweight_api_endpoint}
  headers:
    {auth_header}: "{KEY_placeholder}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Confidence Level Assignment Strategy

| Key Format | Confidence | Rationale |
|------------|------------|-----------|
| Unique prefix + fixed length (e.g., `sk-ant-api03-`, `gsk_`, `r8_`, `xai-`) | **high** | Prefix alone is near-unique; false positive rate extremely low |
| Unique prefix, variable length (e.g., `sk-proj-`, `AIzaSy`) | **high** | Prefix is distinctive enough |
| Short generic prefix + context needed (e.g., `sk-` for Cohere) | **medium** | Prefix collides with OpenAI legacy; needs keyword context |
| No prefix, opaque token (e.g., UUID, hex string, base64) | **low** | Requires strong keyword context; high false positive risk without AC pre-filter |
| 32-char hex string (e.g., Azure OpenAI) | **low** | Extremely generic format; keyword context mandatory |

### Keyword Strategy for Low-Confidence Providers

Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:

1. Provider name (lowercase)
2. Common env var fragments (e.g., `together_api`, `baseten_api`, `modal_token`)
3. API base URL fragments (e.g., `api.together.xyz`, `api.baseten.co`)
4. SDK/config identifiers (e.g., `togetherai`, `deepinfra`)

### Anti-Patterns to Avoid

- **Overly broad regex without keyword anchor:** A pattern like `[A-Za-z0-9]{40}` without keywords would match every 40-char alphanumeric string -- useless.
- **PCRE features in regex:** Go RE2 does not support lookahead (`(?=)`), lookbehind (`(?<=)`), or backreferences. All patterns must be RE2-safe.
- **Hardcoding `T3BlbkFJ` for non-OpenAI providers:** The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
- **Missing dual-location sync:** Forgetting to copy YAML to both `providers/` and `pkg/providers/definitions/`.

## Provider Key Format Research

### Tier 1: Frontier Providers (12)

#### 1. OpenAI

**Confidence: HIGH** -- TruffleHog verified

- **Prefixes:** `sk-proj-`, `sk-svcacct-`, `sk-None-`, legacy `sk-` (all contain `T3BlbkFJ` base64 marker)
- **TruffleHog regex:** `sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+`
- **KeyHunter regex (simplified):** `sk-proj-[A-Za-z0-9_\-]{48,}` (existing, covers primary format)
- **Note:** Existing openai.yaml only covers `sk-proj-`. Should add patterns for `sk-svcacct-` and legacy `sk-` with T3BlbkFJ marker.
- **Verify:** GET `https://api.openai.com/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `sk-proj-`, `sk-svcacct-`, `openai`, `T3BlbkFJ`

#### 2. Anthropic

**Confidence: HIGH** -- TruffleHog + gitleaks verified

- **Prefixes:** `sk-ant-api03-` (standard), `sk-ant-admin01-` (admin)
- **TruffleHog regex:** `sk-ant-(?:admin01|api03)-[\w\-]{93}AA`
- **gitleaks regex:** `sk-ant-api03-[a-zA-Z0-9_\-]{93}AA`
- **Note:** Existing anthropic.yaml pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` should be tightened to end with `AA` suffix.
- **Verify:** GET `https://api.anthropic.com/v1/models` with `x-api-key: {KEY}` + `anthropic-version: 2023-06-01` -- 200=valid, 401=invalid
- **Keywords:** `sk-ant-api03-`, `sk-ant-admin01-`, `anthropic`

#### 3. Google AI (Gemini)

**Confidence: HIGH** -- TruffleHog verified

- **Prefix:** `AIzaSy`
- **TruffleHog regex:** `AIzaSy[A-Za-z0-9_-]{33}`
- **Total length:** 39 characters
- **Note:** Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` -- 200=valid, 400/403=invalid
- **Keywords:** `AIzaSy`, `google_api`, `gemini`

#### 4. Google Vertex AI

**Confidence: MEDIUM**

- **Format:** Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
- **Approach:** For API key mode, reuse `AIzaSy` pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with `"type": "service_account"` and `"private_key"` field).
- **Recommendation:** Create a separate `vertex-ai` provider YAML that focuses on the API key path with the `AIzaSy` pattern.
- **Service account key regex:** The private key in a GCP service account JSON starts with `-----BEGIN PRIVATE KEY-----` -- but this is a general GCP credential, not Vertex-specific, so it is left out of this provider. Keep this provider focused on the API key path.
- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` (same endpoint works for Vertex API keys)
- **Keywords:** `vertex`, `google_cloud`, `AIzaSy`, `vertex_ai`

#### 5. AWS Bedrock

**Confidence: HIGH** -- gitleaks verified

- **Long-lived prefix:** `ABSK`, followed by a base64 payload that decodes to a string beginning `BedrockAPIKey-`
- **gitleaks regex (long-lived):** `ABSK[A-Za-z0-9+/]{109,269}={0,2}`
- **Short-lived prefix:** `bedrock-api-key-YmVkcm9ja` (base64 prefix)
- **Also detect:** AWS IAM access keys (`AKIA[0-9A-Z]{16}` + secret) used with Bedrock
- **Recommendation:** Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
- **Verify:** Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
- **Keywords:** `ABSK`, `bedrock`, `aws_bedrock`, `AKIA`

#### 6. Azure OpenAI

**Confidence: MEDIUM** -- TruffleHog verified but pattern is generic

- **Format:** 32-character lowercase hexadecimal string
- **TruffleHog regex:** `[a-f0-9]{32}` (with keyword context requirement)
- **Problem:** 32-char hex is extremely generic. MUST rely on keywords for context.
- **Keywords:** `azure`, `openai.azure.com`, `azure_openai`, `api-key`, `cognitive`
- **Verify:** GET `https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01` with `api-key: {KEY}` -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
- **Entropy min:** 3.5 (hex has theoretical max ~4.0)

#### 7. Meta AI (Llama API)

**Confidence: LOW**

- **Format:** Not publicly documented as of April 2026. Meta Llama API launched April 2025.
- **Env var:** `META_LLAMA_API_KEY`
- **Approach:** Generic long token pattern with strong keyword context.
- **Keywords:** `meta`, `llama`, `meta_llama`, `llama_api`
- **Regex:** Generic high-entropy alphanumeric pattern, medium confidence
- **Verify:** GET `https://api.llama.com/v1/models` with `Authorization: Bearer {KEY}` (inferred from OpenAI-compatible API)

#### 8. xAI (Grok)

**Confidence: HIGH** -- TruffleHog verified

- **Prefix:** `xai-`
- **TruffleHog regex:** `xai-[0-9a-zA-Z_]{80}`
- **Total length:** 84 characters
- **Verify:** GET `https://api.x.ai/v1/api-key` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `xai-`, `xai`, `grok`

#### 9. Cohere

**Confidence: MEDIUM** -- gitleaks verified but pattern requires context

- **Format:** 40 alphanumeric characters, no distinctive prefix
- **gitleaks regex:** Context-dependent match on `cohere` or `CO_API_KEY` keyword + `[a-zA-Z0-9]{40}`
- **Problem:** 40-char alphanumeric overlaps with many other tokens.
- **Keywords:** `cohere`, `CO_API_KEY`, `cohere_api`
- **Verify:** GET `https://api.cohere.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Entropy min:** 4.0 (higher threshold to reduce false positives)

#### 10. Mistral AI

**Confidence: MEDIUM**

- **Format:** Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
- **Approach:** Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
- **Keywords:** `mistral`, `MISTRAL_API_KEY`, `mistral.ai`, `la_plateforme`
- **Verify:** GET `https://api.mistral.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid

#### 11. Inflection AI

**Confidence: LOW**

- **Format:** Not publicly documented. API launched late 2025 via inflection-sdk.
- **Env var:** `PI_API_KEY`, `INFLECTION_API_KEY`
- **Keywords:** `inflection`, `pi_api`, `PI_API_KEY`
- **Verify:** Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.

#### 12. AI21 Labs

**Confidence: LOW**

- **Format:** Not publicly documented with distinctive prefix.
- **Env var:** `AI21_API_KEY`
- **Keywords:** `ai21`, `AI21_API_KEY`, `jamba`, `jurassic`
- **Verify:** GET `https://api.ai21.com/studio/v1/models` with `Authorization: Bearer {KEY}` -- inferred

### Tier 2: Inference Platforms (14)

#### 1. Together AI

**Confidence: LOW-MEDIUM**

- **Format:** Appears to use generic opaque tokens. Some documentation shows an `sk-` prefix, but this may be an example placeholder.
- **Keywords:** `together`, `TOGETHER_API_KEY`, `api.together.xyz`, `togetherai`
- **Verify:** GET `https://api.together.xyz/v1/models` with `Authorization: Bearer {KEY}`

#### 2. Fireworks AI

**Confidence: LOW-MEDIUM**

- **Format:** GitGuardian confirms "Prefixed: False". Opaque token format.
- **Prefix:** `fw_` prefix has been reported in some sources but not confirmed by GitGuardian.
- **Keywords:** `fireworks`, `FIREWORKS_API_KEY`, `fireworks.ai`, `fw_`
- **Verify:** GET `https://api.fireworks.ai/inference/v1/models` with `Authorization: Bearer {KEY}`

#### 3. Groq

**Confidence: HIGH** -- TruffleHog verified

- **Prefix:** `gsk_`
- **TruffleHog regex:** `gsk_[a-zA-Z0-9]{52}`
- **Total length:** 56 characters
- **Verify:** GET `https://api.groq.com/openai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `gsk_`, `groq`, `GROQ_API_KEY`

#### 4. Replicate

**Confidence: HIGH** -- TruffleHog verified

- **Prefix:** `r8_`
- **TruffleHog regex:** `r8_[0-9A-Za-z-_]{37}`
- **Total length:** 40 characters
- **Verify:** GET `https://api.replicate.com/v1/predictions` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `r8_`, `replicate`, `REPLICATE_API_TOKEN`

#### 5. Anyscale

**Confidence: MEDIUM**

- **Prefix:** `esecret_`
- **Regex:** `esecret_[A-Za-z0-9_-]{20,}`
- **Keywords:** `esecret_`, `anyscale`, `ANYSCALE_API_KEY`
- **Verify:** GET `https://api.endpoints.anyscale.com/v1/models` with `Authorization: Bearer {KEY}`
- **Note:** Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.

#### 6. DeepInfra

**Confidence: LOW-MEDIUM**

- **Format:** Opaque token. JWT-based scoped tokens use `jwt:` prefix.
- **Keywords:** `deepinfra`, `DEEPINFRA_API_KEY`, `deepinfra.com`
- **Verify:** GET `https://api.deepinfra.com/v1/openai/models` with `Authorization: Bearer {KEY}`

#### 7. Lepton AI

**Confidence: LOW**

- **Format:** Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
- **Keywords:** `lepton`, `LEPTON_API_TOKEN`, `lepton.ai`
- **Verify:** Endpoint uncertain. Mark as placeholder.

#### 8. Modal

**Confidence: LOW**

- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `modal`, `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `modal.com`
- **Note:** Modal uses token ID + token secret pair, not a single API key.
- **Verify:** Mark as placeholder.

#### 9. Baseten

**Confidence: LOW-MEDIUM**

- **Format:** Uses API keys passed in header.
- **Keywords:** `baseten`, `BASETEN_API_KEY`, `api.baseten.co`
- **Verify:** GET `https://api.baseten.co/v1/models` with `Authorization: Api-Key {KEY}` (non-standard header)

#### 10. Cerebrium

**Confidence: LOW**

- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `cerebrium`, `CEREBRIUM_API_KEY`, `cerebrium.ai`
- **Verify:** Mark as placeholder.

#### 11. NovitaAI

**Confidence: LOW**

- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `novita`, `NOVITA_API_KEY`, `novita.ai`
- **Verify:** GET `https://api.novita.ai/v3/openai/models` with `Authorization: Bearer {KEY}` (inferred)

#### 12. SambaNova

**Confidence: LOW**

- **Format:** Not publicly documented with distinctive prefix.
- **Keywords:** `sambanova`, `SAMBANOVA_API_KEY`, `sambastudio`, `snapi`
- **Verify:** GET `https://api.sambanova.ai/v1/models` with `Authorization: Bearer {KEY}` (inferred OpenAI-compatible)

#### 13. OctoAI

**Confidence: LOW**

- **Format:** Not publicly documented. OctoAI was shut down / merged in 2025.
- **Keywords:** `octoai`, `OCTOAI_TOKEN`, `octo.ai`
- **Verify:** Mark as placeholder (service may be defunct).

#### 14. Friendli

**Confidence: LOW**

- **Format:** Uses "Friendli Token" -- not publicly documented format.
- **Keywords:** `friendli`, `FRIENDLI_TOKEN`, `friendli.ai`
- **Verify:** Mark as placeholder.

### IMPORTANT: Perplexity Classification Issue

The phase description lists Perplexity as Tier 2, but REQUIREMENTS.md lists it under Tier 3 (PROV-03). The gitleaks config has a pattern: `pplx-[a-zA-Z0-9]{48}`. **Follow the phase description**, which includes it in the Tier 2 list. If Perplexity should not be Tier 2, one of the other 14 Tier 2 providers from PROV-02 should replace it. Looking at the PROV-02 requirement list, Perplexity is listed there as one of the 14. This is correct for this phase. Note, however, that the PROV-02 row quoted under Phase Requirements names 14 providers without Perplexity, so the tier should be confirmed against REQUIREMENTS.md before the YAML is written.

#### Perplexity (listed in PROV-02)

**Confidence: HIGH** -- gitleaks verified

- **Prefix:** `pplx-`
- **gitleaks regex:** `pplx-[a-zA-Z0-9]{48}`
- **Total length:** 53 characters
- **Keywords:** `pplx-`, `perplexity`, `PERPLEXITY_API_KEY`
- **Verify:** `https://api.perplexity.ai/chat/completions` with `Authorization: Bearer {KEY}` -- but this is a POST endpoint, so a plain GET will not work as written. Use a model-listing endpoint if one is available.
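The prefixed patterns above state both a regex and a total key length, so the two can be cross-checked mechanically. The following is a minimal sketch under stated assumptions: the `prefixed` struct and the `syntheticKey` and `matchesExactly` helpers are hypothetical names introduced here, and the keys are padded with the letter `a`, so they exercise length and character class only and bear no resemblance to real credentials.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// prefixed pairs a documented regex with the total key length stated above.
// Hypothetical type for illustration; not part of pkg/providers.
type prefixed struct {
	name, prefix, pattern string
	total                 int
}

// Values copied from the provider sections in this document.
var documented = []prefixed{
	{"google-ai", "AIzaSy", `AIzaSy[A-Za-z0-9_-]{33}`, 39},
	{"xai", "xai-", `xai-[0-9a-zA-Z_]{80}`, 84},
	{"groq", "gsk_", `gsk_[a-zA-Z0-9]{52}`, 56},
	{"replicate", "r8_", `r8_[0-9A-Za-z-_]{37}`, 40},
	{"perplexity", "pplx-", `pplx-[a-zA-Z0-9]{48}`, 53},
}

// syntheticKey builds a fake key: the prefix padded with 'a' to the stated length.
func syntheticKey(p prefixed) string {
	return p.prefix + strings.Repeat("a", p.total-len(p.prefix))
}

// matchesExactly reports whether the pattern's first match covers all of s.
func matchesExactly(pattern, s string) bool {
	return regexp.MustCompile(pattern).FindString(s) == s
}

func main() {
	for _, p := range documented {
		key := syntheticKey(p)
		fmt.Printf("%-11s len=%d full-match=%v\n",
			p.name, len(key), matchesExactly(p.pattern, key))
	}
}
```

A `full-match=false` line would indicate that the stated total length and the regex quantifier disagree, which is worth catching before the values are copied into a YAML definition. The same table could seed the sample-key cases planned for the tier tests.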
## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Regex validation | Custom regex validator | Go `regexp.Compile` + existing schema validation in `UnmarshalYAML` | Schema already validates confidence levels; add regex compilation check |
| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data |
| AC automaton rebuild | Manual keyword management | Existing `NewRegistry()` auto-builds AC from all provider keywords | Just add YAML files; registry handles everything |
| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten |

## Common Pitfalls

### Pitfall 1: Catastrophic Backtracking in Regex

**What goes wrong:** Complex regex with nested quantifiers causes exponential backtracking.
**Why it happens:** Using patterns like `(a+)+` or `([A-Za-z0-9]+[-_]?)+`.
**How to avoid:** Go RE2 engine prevents this by design -- but still write simple, linear patterns. Avoid unnecessary alternation groups.
**Warning signs:** Regex compilation errors in tests.

### Pitfall 2: Overly Broad Patterns Without Keywords

**What goes wrong:** Pattern like `[A-Za-z0-9]{32}` matches random strings everywhere.
**Why it happens:** Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
**How to avoid:** Set confidence to `low`, require strong keyword list for AC pre-filtering. The scanner will only test regex AFTER AC finds a keyword match.
**Warning signs:** Hundreds of false positives when scanning any codebase.

### Pitfall 3: Forgetting Dual-Location File Sync

**What goes wrong:** Provider added to `providers/` but not `pkg/providers/definitions/`, or vice versa.
**Why it happens:** Phase 1 decision to maintain both locations.
**How to avoid:** Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
**Warning signs:** `go test` passes but `providers/` directory has different count than `pkg/providers/definitions/`.

### Pitfall 4: Invalid YAML Schema

**What goes wrong:** Missing `format_version`, empty `last_verified`, or invalid confidence value.
**Why it happens:** Copy-paste errors or typos.
**How to avoid:** The `UnmarshalYAML` validation catches these. Run `go test ./pkg/providers/...` after each batch.
**Warning signs:** Test failures with schema validation errors.

### Pitfall 5: Regex Not RE2-Compatible

**What goes wrong:** Using PCRE features like `(?<=prefix)` or `(?!suffix)`.
**Why it happens:** Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
**How to avoid:** Test every regex with `regexp.MustCompile()` in Go. Strip boundary assertions that gitleaks adds (like `(?:[\x60'"\s;]|\\[nr]|$)` -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
**Warning signs:** `regexp.Compile` returns error.

### Pitfall 6: Schema Missing Category Field

**What goes wrong:** CONTEXT.md mentions a `category` field in the YAML format, but `pkg/providers/schema.go` Provider struct has no `Category` field.
**Why it happens:** CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
**How to avoid:** Either add `Category` field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (`keyhunter providers list` might want category filtering).
**Warning signs:** YAML has `category` field that gets silently ignored by Go YAML parser.
## Code Examples

### Provider YAML File (High-Confidence Prefixed Provider)

```yaml
# Source: TruffleHog xAI detector (verified)
format_version: 1
name: xai
display_name: xAI
tier: 1
last_verified: "2026-04-05"
keywords:
  - "xai-"
  - "xai"
  - "grok"
patterns:
  - regex: 'xai-[0-9a-zA-Z_]{80}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.x.ai/v1/api-key
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Provider YAML File (Low-Confidence Generic Provider)

```yaml
# Low-confidence: no distinctive prefix, requires keyword context
format_version: 1
name: together
display_name: Together AI
tier: 2
last_verified: "2026-04-05"
keywords:
  - "together"
  - "together_api"
  - "api.together.xyz"
  - "togetherai"
  - "TOGETHER_API_KEY"
patterns:
  - regex: '[A-Za-z0-9]{64}'
    entropy_min: 4.0
    confidence: low
verify:
  method: GET
  url: https://api.together.xyz/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Test Pattern: Verify Regex Compiles

```go
func TestProviderRegexCompiles(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)

	for _, p := range reg.List() {
		for i, pat := range p.Patterns {
			_, err := regexp.Compile(pat.Regex)
			assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
		}
	}
}
```

### Test Pattern: Verify Provider Count

```go
func TestTier1And2ProviderCount(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)

	stats := reg.Stats()
	assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
	assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
	assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
}
```

### Test Pattern: AC Pre-Filter Matches Known Key Prefix

```go
func TestACMatchesKnownPrefixes(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)

	ac := reg.AC()
	prefixes := []string{
		"sk-proj-abc123",
		"sk-ant-api03-abc",
		"AIzaSyABC123",
		"xai-abc123",
		"gsk_abc123",
		"r8_abc123",
		"pplx-abc123",
	}
	for _, prefix := range prefixes {
		matches := ac.FindAll(prefix)
		assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
	}
}
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| OpenAI `sk-` prefix only | `sk-proj-`, `sk-svcacct-`, `sk-None-` with `T3BlbkFJ` marker | 2024 | Must detect all modern prefixes |
| AWS Bedrock via IAM only | Bedrock-specific `ABSK` API keys | Late 2025 | New key type to detect |
| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this |
| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic `AA` suffix) | Current | Tighter patterns reduce false positives |

## Open Questions

1. **Category field in schema**
   - What we know: CONTEXT.md mentions `category` in YAML format, but `schema.go` has no `Category` field.
   - What's unclear: Whether to add it now or defer.
   - Recommendation: Add `` Category string `yaml:"category"` `` to Provider struct as part of this phase. Values like `"frontier"`, `"inference-platform"`, `"specialized"` would support CLI-04 filtering.

2. **Perplexity tier assignment**
   - What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3.
   - What's unclear: Which is canonical.
   - Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3.

3. **OpenAI existing pattern needs update**
   - What we know: Current `openai.yaml` only has `sk-proj-` pattern. Modern keys also use `sk-svcacct-`, `sk-None-`, and legacy `sk-` with `T3BlbkFJ` marker.
   - Recommendation: Update openai.yaml with additional patterns.

4. **Anthropic existing pattern can be tightened**
   - What we know: Current pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` does not require the `AA` suffix.
   - Recommendation: Tighten to `sk-ant-api03-[A-Za-z0-9_\-]{93}AA` per TruffleHog and add admin key pattern.

5. **Defunct/transitioning providers**
   - What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
   - Recommendation: Still create YAML definitions for all -- leaked keys from defunct services can still have value if the service was operational.

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | Go standard `testing` + `testify` v1.x |
| Config file | `go.mod` (no separate test config) |
| Quick run command | `go test ./pkg/providers/... -v -count=1` |
| Full suite command | `go test ./... -v -count=1` |

### Phase Requirements -> Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier1Providers -v` | Wave 0 |
| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier1Patterns -v` | Wave 0 |
| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier2Providers -v` | Wave 0 |
| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier2Patterns -v` | Wave 0 |
| PROV-01+02 | AC automaton matches all provider keywords | unit | `go test ./pkg/providers/... -run TestACMatchesAllKeywords -v` | Wave 0 |
| PROV-01+02 | Registry stats show 26+ providers | unit | `go test ./pkg/providers/... -run TestProviderCount -v` | Wave 0 |

### Sampling Rate

- **Per task commit:** `go test ./pkg/providers/... -v -count=1`
- **Per wave merge:** `go test ./... -v -count=1`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `pkg/providers/tier1_test.go` -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
- [ ] `pkg/providers/tier2_test.go` -- covers PROV-02 (same as above for Tier 2)
- [ ] Update `pkg/providers/registry_test.go` -- update `TestRegistryLoad` assertion from `>= 3` to `>= 26`

## Sources

### Primary (HIGH confidence)

- [TruffleHog OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/openai/openai.go) -- regex pattern for OpenAI keys
- [TruffleHog Anthropic detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/anthropic/anthropic.go) -- regex pattern for Anthropic keys
- [TruffleHog Google Gemini detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/googlegemini/googlegemini.go) -- regex pattern for Google AI keys
- [TruffleHog Groq detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/groq/groq.go) -- regex pattern `gsk_[a-zA-Z0-9]{52}`
- [TruffleHog xAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/xai/xai.go) -- regex pattern `xai-[0-9a-zA-Z_]{80}`
- [TruffleHog Replicate detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/replicate/replicate.go) -- regex pattern `r8_[0-9A-Za-z-_]{37}`
- [TruffleHog Azure OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/azure_openai/azure_openai.go) -- 32-char hex with keyword context
- [gitleaks config](https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml) -- Anthropic, Cohere, AWS Bedrock patterns

### Secondary (MEDIUM confidence)

- [AWS Bedrock API keys (Wiz blog)](https://www.wiz.io/blog/a-new-type-of-long-lived-key-on-aws-bedrock-api-keys) -- ABSK prefix documentation
- [GitGuardian Groq detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/groq_api_key) -- confirms prefixed format
- [GitGuardian Mistral detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/mistralai_apikey) -- confirms "Prefixed: False"
- [GitGuardian Fireworks detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/fireworks_ai_api_key) -- confirms "Prefixed: False"
- [OpenAI community forum](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531) -- sk-proj-, sk-svcacct-, sk-None- prefixes
- [Replicate docs](https://replicate.com/docs/topics/security/api-tokens) -- r8_ prefix, 40-char format
- [Anyscale docs](https://docs.anyscale.com/endpoints/text-generation/authenticate/) -- esecret_ prefix

### Tertiary (LOW confidence -- needs validation at implementation)

- Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
- Inflection AI key format -- not documented, SDK-based inference only
- AI21 Labs key format -- no prefix confirmed
- Together AI key format -- no distinctive prefix confirmed
- DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
- Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented

## Metadata

**Confidence breakdown:**

- Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
- Architecture: HIGH -- dual-location YAML pattern well-established
- Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
- Provider key formats (Tier 1 less-known): MEDIUM -- Meta, Inflection, AI21 have limited documentation
- Provider key formats (Tier 2 prefixed): HIGH -- Groq, Replicate, Perplexity, Anyscale have clear prefixes
- Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
- Pitfalls: HIGH -- well-understood RE2 and false-positive challenges

**Research date:** 2026-04-05
**Valid until:** 2026-05-05 (30 days -- API key formats change infrequently)