# Phase 2: Tier 1-2 Providers - Research

**Researched:** 2026-04-05
**Domain:** LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
**Confidence:** MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)

## Summary

This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is **accuracy of regex patterns and key format data** across 26 providers with varying documentation quality.

For the best-documented providers (OpenAI, Anthropic, Google AI, xAI, and Cohere in Tier 1; Groq and Replicate in Tier 2), key formats are covered by authoritative regex patterns from TruffleHog and gitleaks. Many of the remaining Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli) lack distinctive key prefixes and use generic opaque token formats, so regex-only detection is unreliable and keyword-context matching is required.

**Primary recommendation:** Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/).

<user_constraints>
## User Constraints (from CONTEXT.md)

### Locked Decisions
None -- all implementation choices at Claude's discretion (infrastructure/data phase).

### Claude's Discretion
All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.

### Deferred Ideas (OUT OF SCOPE)
None -- discuss phase skipped.
</user_constraints>

<phase_requirements>
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Key formats and keywords documented below for all 12; regex patterns for the documented formats; verify endpoints identified except where marked as placeholder (AWS Bedrock, Azure OpenAI, Inflection) |
| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers |
</phase_requirements>

## Project Constraints (from CLAUDE.md)

- **Go regexp only (RE2)** -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
- **Providers in dual locations**: `providers/` (user-visible) and `pkg/providers/definitions/` (Go embed source). Files must be kept in sync.
- **Schema**: `format_version: 1`, `name`, `display_name`, `tier`, `last_verified`, `keywords[]`, `patterns[]` (with `regex`, `entropy_min`, `confidence`), `verify` (with `method`, `url`, `headers`, `valid_status`, `invalid_status`).
- **Confidence values**: `high`, `medium`, `low` (validated in UnmarshalYAML).
- **Keywords**: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.
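
The RE2 constraint can be checked mechanically. The following is a minimal sketch, assuming only the standard library `regexp` package; `IsRE2Safe` is a hypothetical helper name, not part of the existing codebase. It shows that counted repeats and `\w` compile, while lookahead, lookbehind, and backreferences are rejected at compile time.

```go
package main

import (
	"fmt"
	"regexp"
)

// IsRE2Safe reports whether Go's regexp package (RE2 syntax) accepts the pattern.
// Hypothetical helper for illustration only.
func IsRE2Safe(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	candidates := []string{
		`xai-[0-9a-zA-Z_]{80}`,      // character class + counted repeat: accepted
		`sk-ant-api03-[\w\-]{93}AA`, // \w shorthand: accepted
		`(?=sk-)[a-z-]+`,            // lookahead: rejected
		`(?<=key=)[a-f0-9]{32}`,     // lookbehind: rejected
		`([a-z])\1`,                 // backreference: rejected
	}
	for _, p := range candidates {
		fmt.Printf("%-28s RE2-safe: %v\n", p, IsRE2Safe(p))
	}
}
```

A check like this could run over every `regex` value in the YAML files before they are committed.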

## Standard Stack

No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.

### Existing Infrastructure Used
| Component | Location | Purpose |
|-----------|----------|---------|
| Provider struct | `pkg/providers/schema.go` | YAML schema with validation |
| Loader | `pkg/providers/loader.go` | embed.FS walker for `definitions/*.yaml` |
| Registry | `pkg/providers/registry.go` | Provider index + AC automaton build |
| Tests | `pkg/providers/registry_test.go` | Registry load, get, stats, AC tests |

## Architecture Patterns

### File Placement (Dual Location)
Every new provider YAML must be placed in BOTH:
```
providers/{name}.yaml                    # User-visible reference
pkg/providers/definitions/{name}.yaml    # Go embed source (compiled into binary)
```

### YAML Template Pattern
```yaml
format_version: 1
name: {provider_slug}
display_name: {Display Name}
tier: {1 or 2}
last_verified: "2026-04-05"
keywords:
  - "{prefix_literal}"          # Exact key prefix for AC match
  - "{provider_name_lowercase}" # Provider name for context match
  - "{env_var_hint}"            # Common env var name fragments
patterns:
  - regex: '{RE2_compatible_regex}'
    entropy_min: {3.0-4.0}
    confidence: {high|medium|low}
verify:
  method: {GET|POST}
  url: {lightweight_api_endpoint}
  headers:
    {auth_header}: "{KEY_placeholder}"
  valid_status: [200]
  invalid_status: [401, 403]
```
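
To make the intent of the `verify` block concrete, here is a rough sketch of how a verifier might consume it. This is not the project's verifier: `VerifySpec`, `BuildRequest`, and `Classify` are hypothetical names, and the `{KEY}` placeholder convention is assumed from the examples in this document. The sketch only builds the request and classifies a status code; it does not send anything over the network.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// VerifySpec is a hypothetical in-memory mirror of the YAML `verify` block.
type VerifySpec struct {
	Method        string
	URL           string
	Headers       map[string]string
	ValidStatus   []int
	InvalidStatus []int
}

// BuildRequest substitutes the candidate key for every {KEY} placeholder
// in the URL and header values. It does not perform the request.
func BuildRequest(spec VerifySpec, key string) (*http.Request, error) {
	target := strings.ReplaceAll(spec.URL, "{KEY}", key)
	req, err := http.NewRequest(spec.Method, target, nil)
	if err != nil {
		return nil, err
	}
	for name, value := range spec.Headers {
		req.Header.Set(name, strings.ReplaceAll(value, "{KEY}", key))
	}
	return req, nil
}

// Classify maps an HTTP status code to "valid", "invalid", or "unknown".
func Classify(spec VerifySpec, status int) string {
	for _, s := range spec.ValidStatus {
		if s == status {
			return "valid"
		}
	}
	for _, s := range spec.InvalidStatus {
		if s == status {
			return "invalid"
		}
	}
	return "unknown" // e.g. rate limits or server errors
}

func main() {
	spec := VerifySpec{
		Method:        "GET",
		URL:           "https://api.x.ai/v1/api-key",
		Headers:       map[string]string{"Authorization": "Bearer {KEY}"},
		ValidStatus:   []int{200},
		InvalidStatus: []int{401, 403},
	}
	req, err := BuildRequest(spec, "xai-example")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
	fmt.Println(Classify(spec, 200), Classify(spec, 401), Classify(spec, 429))
}
```

Treating any status outside both lists as "unknown" keeps rate-limit or outage responses from being reported as a definite verdict.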

### Confidence Level Assignment Strategy
| Key Format | Confidence | Rationale |
|------------|------------|-----------|
| Unique prefix + fixed length (e.g., `sk-ant-api03-`, `gsk_`, `r8_`, `xai-`) | **high** | Prefix alone is near-unique; false positive rate extremely low |
| Unique prefix, variable length (e.g., `sk-proj-`, `AIzaSy`) | **high** | Prefix is distinctive enough |
| Short generic prefix + context needed (e.g., a bare `sk-` prefix) | **medium** | Prefix collides with OpenAI legacy keys; needs keyword context |
| No prefix, opaque token (e.g., UUID, hex string, base64) | **low** | Requires strong keyword context; high false positive risk without AC pre-filter |
| 32-char hex string (e.g., Azure OpenAI) | **low** | Extremely generic format; keyword context mandatory |

### Keyword Strategy for Low-Confidence Providers
Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:
1. Provider name (lowercase)
2. Common env var fragments (e.g., `together_api`, `baseten_api`, `modal_token`)
3. API base URL fragments (e.g., `api.together.xyz`, `api.baseten.co`)
4. SDK/config identifiers (e.g., `togetherai`, `deepinfra`)

### Anti-Patterns to Avoid
- **Overly broad regex without keyword anchor:** A pattern like `[A-Za-z0-9]{40}` without keywords would match every 40-char alphanumeric string -- useless.
- **PCRE features in regex:** Go RE2 does not support lookahead (`(?=)`), lookbehind (`(?<=)`), or backreferences. All patterns must be RE2-safe.
- **Hardcoding `T3BlbkFJ` for non-OpenAI providers:** The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
- **Missing dual-location sync:** Forgetting to copy YAML to both `providers/` and `pkg/providers/definitions/`.

## Provider Key Format Research

### Tier 1: Frontier Providers (12)

#### 1. OpenAI
**Confidence: HIGH** -- TruffleHog verified
- **Prefixes:** `sk-proj-`, `sk-svcacct-`, `sk-None-`, legacy `sk-` (all contain `T3BlbkFJ` base64 marker)
- **TruffleHog regex:** `sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+`
- **KeyHunter regex (simplified):** `sk-proj-[A-Za-z0-9_\-]{48,}` (existing, covers primary format)
- **Note:** Existing openai.yaml only covers `sk-proj-`. Should add patterns for `sk-svcacct-` and legacy `sk-` with T3BlbkFJ marker.
- **Verify:** GET `https://api.openai.com/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `sk-proj-`, `sk-svcacct-`, `openai`, `T3BlbkFJ`

#### 2. Anthropic
**Confidence: HIGH** -- TruffleHog + gitleaks verified
- **Prefixes:** `sk-ant-api03-` (standard), `sk-ant-admin01-` (admin)
- **TruffleHog regex:** `sk-ant-(?:admin01|api03)-[\w\-]{93}AA`
- **gitleaks regex:** `sk-ant-api03-[a-zA-Z0-9_\-]{93}AA`
- **Note:** Existing anthropic.yaml pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` should be tightened to end with `AA` suffix.
- **Verify:** GET `https://api.anthropic.com/v1/models` with `x-api-key: {KEY}` + `anthropic-version: 2023-06-01` -- 200=valid, 401=invalid
- **Keywords:** `sk-ant-api03-`, `sk-ant-admin01-`, `anthropic`

#### 3. Google AI (Gemini)
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `AIzaSy`
- **TruffleHog regex:** `AIzaSy[A-Za-z0-9_-]{33}`
- **Total length:** 39 characters
- **Note:** Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` -- 200=valid, 400/403=invalid
- **Keywords:** `AIzaSy`, `google_api`, `gemini`

#### 4. Google Vertex AI
**Confidence: MEDIUM**
- **Format:** Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
- **Approach:** For API key mode, reuse `AIzaSy` pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with `"type": "service_account"` and `"private_key"` field).
- **Recommendation:** Create a separate `vertex-ai` provider YAML that covers only the API key path with the `AIzaSy` pattern.
- **Service account keys:** The `private_key` field in a GCP service account JSON begins with `-----BEGIN PRIVATE KEY-----` (PKCS#8), but this is a general GCP credential, not Vertex-specific, so it stays out of scope for this provider.
- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` (same endpoint works for Vertex API keys)
- **Keywords:** `vertex`, `google_cloud`, `AIzaSy`, `vertex_ai`

#### 5. AWS Bedrock
**Confidence: HIGH** -- gitleaks verified
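- **Long-lived prefix:** `ABSK`, followed by a base64 payload that decodes to a string starting with `BedrockAPIKey-`
- **gitleaks regex (long-lived):** `ABSK[A-Za-z0-9+/]{109,269}={0,2}`
- **Short-lived prefix:** `bedrock-api-key-YmVkcm9ja` (`YmVkcm9ja` is how any payload starting with `bedrock` begins once base64-encoded)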
- **Also detect:** AWS IAM access keys (`AKIA[0-9A-Z]{16}` + secret) used with Bedrock
- **Recommendation:** Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
- **Verify:** Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
- **Keywords:** `ABSK`, `bedrock`, `aws_bedrock`, `AKIA`

#### 6. Azure OpenAI
**Confidence: MEDIUM** -- TruffleHog verified but pattern is generic
- **Format:** 32-character lowercase hexadecimal string
- **TruffleHog regex:** `[a-f0-9]{32}` (with keyword context requirement)
- **Problem:** 32-char hex is extremely generic. MUST rely on keywords for context.
- **Keywords:** `azure`, `openai.azure.com`, `azure_openai`, `api-key`, `cognitive`
- **Verify:** GET `https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01` with `api-key: {KEY}` -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
- **Entropy min:** 3.5 (hex has a theoretical maximum of 4.0 bits per character)

#### 7. Meta AI (Llama API)
**Confidence: LOW**
- **Format:** Not publicly documented as of April 2026. Meta Llama API launched April 2025.
- **Env var:** `META_LLAMA_API_KEY`
- **Approach:** Generic long token pattern with strong keyword context.
- **Keywords:** `meta`, `llama`, `meta_llama`, `llama_api`
- **Regex:** Generic high-entropy alphanumeric pattern, low confidence (no known prefix, so the opaque-token row of the confidence table applies)
- **Verify:** GET `https://api.llama.com/v1/models` with `Authorization: Bearer {KEY}` (inferred from OpenAI-compatible API)

#### 8. xAI (Grok)
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `xai-`
- **TruffleHog regex:** `xai-[0-9a-zA-Z_]{80}`
- **Total length:** 84 characters
- **Verify:** GET `https://api.x.ai/v1/api-key` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `xai-`, `xai`, `grok`

#### 9. Cohere
**Confidence: MEDIUM** -- gitleaks verified but pattern requires context
- **Format:** 40 alphanumeric characters, no distinctive prefix
- **gitleaks regex:** Context-dependent match on `cohere` or `CO_API_KEY` keyword + `[a-zA-Z0-9]{40}`
- **Problem:** 40-char alphanumeric overlaps with many other tokens.
- **Keywords:** `cohere`, `CO_API_KEY`, `cohere_api`
- **Verify:** GET `https://api.cohere.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Entropy min:** 4.0 (higher threshold to reduce false positives)

#### 10. Mistral AI
**Confidence: MEDIUM**
- **Format:** Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
- **Approach:** Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
- **Keywords:** `mistral`, `MISTRAL_API_KEY`, `mistral.ai`, `la_plateforme`
- **Verify:** GET `https://api.mistral.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid

#### 11. Inflection AI
**Confidence: LOW**
- **Format:** Not publicly documented. API launched late 2025 via inflection-sdk.
- **Env var:** `PI_API_KEY`, `INFLECTION_API_KEY`
- **Keywords:** `inflection`, `pi_api`, `PI_API_KEY`
- **Verify:** Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.

#### 12. AI21 Labs
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive prefix.
- **Env var:** `AI21_API_KEY`
- **Keywords:** `ai21`, `AI21_API_KEY`, `jamba`, `jurassic`
- **Verify:** GET `https://api.ai21.com/studio/v1/models` with `Authorization: Bearer {KEY}` -- inferred

### Tier 2: Inference Platforms (14)

#### 1. Together AI
**Confidence: LOW-MEDIUM**
- **Format:** Appears to use generic opaque tokens. Some documentation shows an `sk-` prefix, but this may be an example placeholder.
- **Keywords:** `together`, `TOGETHER_API_KEY`, `api.together.xyz`, `togetherai`
- **Verify:** GET `https://api.together.xyz/v1/models` with `Authorization: Bearer {KEY}`

#### 2. Fireworks AI
**Confidence: LOW-MEDIUM**
- **Format:** GitGuardian confirms "Prefixed: False". Opaque token format.
- **Prefix:** `fw_` prefix has been reported in some sources but not confirmed by GitGuardian.
- **Keywords:** `fireworks`, `FIREWORKS_API_KEY`, `fireworks.ai`, `fw_`
- **Verify:** GET `https://api.fireworks.ai/inference/v1/models` with `Authorization: Bearer {KEY}`

#### 3. Groq
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `gsk_`
- **TruffleHog regex:** `gsk_[a-zA-Z0-9]{52}`
- **Total length:** 56 characters
- **Verify:** GET `https://api.groq.com/openai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `gsk_`, `groq`, `GROQ_API_KEY`

#### 4. Replicate
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `r8_`
- **TruffleHog regex:** `r8_[0-9A-Za-z-_]{37}`
- **Total length:** 40 characters
- **Verify:** GET `https://api.replicate.com/v1/predictions` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `r8_`, `replicate`, `REPLICATE_API_TOKEN`

#### 5. Anyscale
**Confidence: MEDIUM**
- **Prefix:** `esecret_`
- **Regex:** `esecret_[A-Za-z0-9_-]{20,}`
- **Keywords:** `esecret_`, `anyscale`, `ANYSCALE_API_KEY`
- **Verify:** GET `https://api.endpoints.anyscale.com/v1/models` with `Authorization: Bearer {KEY}`
- **Note:** Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.

#### 6. DeepInfra
**Confidence: LOW-MEDIUM**
- **Format:** Opaque token. JWT-based scoped tokens use `jwt:` prefix.
- **Keywords:** `deepinfra`, `DEEPINFRA_API_KEY`, `deepinfra.com`
- **Verify:** GET `https://api.deepinfra.com/v1/openai/models` with `Authorization: Bearer {KEY}`

#### 7. Lepton AI
**Confidence: LOW**
- **Format:** Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
- **Keywords:** `lepton`, `LEPTON_API_TOKEN`, `lepton.ai`
- **Verify:** Endpoint uncertain. Mark as placeholder.

#### 8. Modal
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `modal`, `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `modal.com`
- **Note:** Modal uses token ID + token secret pair, not a single API key.
- **Verify:** Mark as placeholder.

#### 9. Baseten
**Confidence: LOW-MEDIUM**
- **Format:** Uses API keys passed in header.
- **Keywords:** `baseten`, `BASETEN_API_KEY`, `api.baseten.co`
- **Verify:** GET `https://api.baseten.co/v1/models` with `Authorization: Api-Key {KEY}` (standard header, non-standard `Api-Key` scheme instead of `Bearer`)

#### 10. Cerebrium
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `cerebrium`, `CEREBRIUM_API_KEY`, `cerebrium.ai`
- **Verify:** Mark as placeholder.

#### 11. NovitaAI
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `novita`, `NOVITA_API_KEY`, `novita.ai`
- **Verify:** GET `https://api.novita.ai/v3/openai/models` with `Authorization: Bearer {KEY}` (inferred)

#### 12. SambaNova
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive prefix.
- **Keywords:** `sambanova`, `SAMBANOVA_API_KEY`, `sambastudio`, `snapi`
- **Verify:** GET `https://api.sambanova.ai/v1/models` with `Authorization: Bearer {KEY}` (inferred OpenAI-compatible)

#### 13. OctoAI
**Confidence: LOW**
- **Format:** Not publicly documented. OctoAI was acquired by NVIDIA in 2024 and its public service was wound down.
- **Keywords:** `octoai`, `OCTOAI_TOKEN`, `octo.ai`
- **Verify:** Mark as placeholder (service may be defunct).

#### 14. Friendli
**Confidence: LOW**
- **Format:** Uses a "Friendli Token"; the token format is not publicly documented.
- **Keywords:** `friendli`, `FRIENDLI_TOKEN`, `friendli.ai`
- **Verify:** Mark as placeholder.

### IMPORTANT: Perplexity Classification Issue

The phase description lists Perplexity as Tier 2, but REQUIREMENTS.md lists it under Tier 3 (PROV-03), and the PROV-02 row in the Phase Requirements table above names 14 Tier 2 providers without it. The gitleaks config has a pattern for it: `pplx-[a-zA-Z0-9]{48}`. **Follow the phase description** and create a Tier 2 definition for Perplexity in this phase. If REQUIREMENTS.md turns out to be canonical, the tier value can be adjusted in Phase 3.

#### Perplexity (Tier 2 per the phase description)
**Confidence: HIGH** -- gitleaks verified
- **Prefix:** `pplx-`
- **gitleaks regex:** `pplx-[a-zA-Z0-9]{48}`
- **Total length:** 53 characters
- **Keywords:** `pplx-`, `perplexity`, `PERPLEXITY_API_KEY`
- **Verify:** `https://api.perplexity.ai/chat/completions` with `Authorization: Bearer {KEY}` is a POST endpoint, so a plain GET check does not apply. Use a model-listing endpoint if one is available.

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Regex validation | Custom regex validator | Go `regexp.Compile` + existing schema validation in `UnmarshalYAML` | Schema already validates confidence levels; add regex compilation check |
| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data |
| AC automaton rebuild | Manual keyword management | Existing `NewRegistry()` auto-builds AC from all provider keywords | Just add YAML files; registry handles everything |
| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten |

## Common Pitfalls

### Pitfall 1: Catastrophic Backtracking in Regex
**What goes wrong:** Complex regex with nested quantifiers causes exponential backtracking.
**Why it happens:** Using patterns like `(a+)+` or `([A-Za-z0-9]+[-_]?)+`.
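**How to avoid:** Go's RE2 engine guarantees linear-time matching, so this cannot happen at scan time -- but still write simple, linear patterns and avoid unnecessary alternation groups.
**Warning signs:** Patterns copied from PCRE-based tools that depend on backtracking-only features (e.g., backreferences) fail at `regexp.Compile` in tests.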

### Pitfall 2: Overly Broad Patterns Without Keywords
**What goes wrong:** Pattern like `[A-Za-z0-9]{32}` matches random strings everywhere.
**Why it happens:** Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
**How to avoid:** Set confidence to `low`, require strong keyword list for AC pre-filtering. The scanner will only test regex AFTER AC finds a keyword match.
**Warning signs:** Hundreds of false positives when scanning any codebase.

### Pitfall 3: Forgetting Dual-Location File Sync
**What goes wrong:** Provider added to `providers/` but not `pkg/providers/definitions/`, or vice versa.
**Why it happens:** Phase 1 decision to maintain both locations.
**How to avoid:** Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
**Warning signs:** `go test` passes but `providers/` directory has different count than `pkg/providers/definitions/`.

### Pitfall 4: Invalid YAML Schema
**What goes wrong:** Missing `format_version`, empty `last_verified`, or invalid confidence value.
**Why it happens:** Copy-paste errors or typos.
**How to avoid:** The `UnmarshalYAML` validation catches these. Run `go test ./pkg/providers/...` after each batch.
**Warning signs:** Test failures with schema validation errors.

### Pitfall 5: Regex Not RE2-Compatible
**What goes wrong:** Using PCRE features like `(?<=prefix)` or `(?!suffix)`.
**Why it happens:** Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
**How to avoid:** Test every regex with `regexp.Compile()` in Go and check the returned error. Strip boundary assertions that gitleaks adds (like `(?:[\x60'"\s;]|\\[nr]|$)` -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
**Warning signs:** `regexp.Compile` returns error.

### Pitfall 6: Schema Missing Category Field
**What goes wrong:** CONTEXT.md mentions a `category` field in the YAML format, but `pkg/providers/schema.go` Provider struct has no `Category` field.
**Why it happens:** CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
**How to avoid:** Either add `Category` field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (`keyhunter providers list` might want category filtering).
**Warning signs:** YAML has `category` field that gets silently ignored by Go YAML parser.

## Code Examples

### Provider YAML File (High-Confidence Prefixed Provider)
```yaml
# Source: TruffleHog xAI detector (verified)
format_version: 1
name: xai
display_name: xAI
tier: 1
last_verified: "2026-04-05"
keywords:
  - "xai-"
  - "xai"
  - "grok"
patterns:
  - regex: 'xai-[0-9a-zA-Z_]{80}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.x.ai/v1/api-key
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Provider YAML File (Low-Confidence Generic Provider)
```yaml
# Low-confidence: no distinctive prefix, requires keyword context
format_version: 1
name: together
display_name: Together AI
tier: 2
last_verified: "2026-04-05"
keywords:
  - "together"
  - "together_api"
  - "api.together.xyz"
  - "togetherai"
  - "together_api_key"         # keywords must be lowercase
patterns:
  - regex: '[A-Za-z0-9]{64}'
    entropy_min: 4.0
    confidence: low
verify:
  method: GET
  url: https://api.together.xyz/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Test Pattern: Verify Regex Compiles
```go
func TestProviderRegexCompiles(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    for _, p := range reg.List() {
        for i, pat := range p.Patterns {
            _, err := regexp.Compile(pat.Regex)
            assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
        }
    }
}
```

### Test Pattern: Verify Provider Count
```go
func TestTier1And2ProviderCount(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    stats := reg.Stats()
    assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
    assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
    assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
}
```

### Test Pattern: AC Pre-Filter Matches Known Key Prefix
```go
func TestACMatchesKnownPrefixes(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    ac := reg.AC()
    
    prefixes := []string{
        "sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123",
        "xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123",
    }
    for _, prefix := range prefixes {
        matches := ac.FindAll(prefix)
        assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
    }
}
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| OpenAI `sk-` prefix only | `sk-proj-`, `sk-svcacct-`, `sk-None-` with `T3BlbkFJ` marker | 2024 | Must detect all modern prefixes |
| AWS Bedrock via IAM only | Bedrock-specific `ABSK` API keys | Mid-2025 | New key type to detect |
| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this |
| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic `AA` suffix) | Current | Tighter patterns reduce false positives |

## Open Questions

1. **Category field in schema**
   - What we know: CONTEXT.md mentions `category` in YAML format, but `schema.go` has no `Category` field.
   - What's unclear: Whether to add it now or defer.
   - Recommendation: Add `Category string \`yaml:"category"\`` to Provider struct as part of this phase. Values like `"frontier"`, `"inference-platform"`, `"specialized"` would support CLI-04 filtering.

2. **Perplexity tier assignment**
   - What we know: The phase description lists Perplexity as Tier 2. PROV-03 also lists it under Tier 3, and the PROV-02 row in the Phase Requirements table names 14 providers without it.
   - What's unclear: Which is canonical.
   - Recommendation: Follow the phase description and define Perplexity as Tier 2. If needed, adjust in Phase 3.

3. **OpenAI existing pattern needs update**
   - What we know: Current `openai.yaml` only has `sk-proj-` pattern. Modern keys also use `sk-svcacct-`, `sk-None-`, and legacy `sk-` with `T3BlbkFJ` marker.
   - Recommendation: Update openai.yaml with additional patterns.

4. **Anthropic existing pattern can be tightened**
   - What we know: Current pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` does not require the `AA` suffix.
   - Recommendation: Tighten to `sk-ant-api03-[A-Za-z0-9_\-]{93}AA` per TruffleHog and add admin key pattern.

5. **Defunct/transitioning providers**
   - What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
   - Recommendation: Still create YAML definitions for all -- a leaked key is still worth flagging if the service was operational when the key was issued.

## Validation Architecture

### Test Framework
| Property | Value |
|----------|-------|
| Framework | Go standard `testing` + `testify` v1.x |
| Config file | `go.mod` (no separate test config) |
| Quick run command | `go test ./pkg/providers/... -v -count=1` |
| Full suite command | `go test ./... -v -count=1` |

### Phase Requirements -> Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier1Providers -v` | Wave 0 |
| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier1Patterns -v` | Wave 0 |
| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier2Providers -v` | Wave 0 |
| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier2Patterns -v` | Wave 0 |
| PROV-01+02 | AC automaton matches all provider keywords | unit | `go test ./pkg/providers/... -run TestACMatchesAllKeywords -v` | Wave 0 |
| PROV-01+02 | Registry stats show 26+ providers | unit | `go test ./pkg/providers/... -run TestTier1And2ProviderCount -v` | Wave 0 |

### Sampling Rate
- **Per task commit:** `go test ./pkg/providers/... -v -count=1`
- **Per wave merge:** `go test ./... -v -count=1`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps
- [ ] `pkg/providers/tier1_test.go` -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
- [ ] `pkg/providers/tier2_test.go` -- covers PROV-02 (same as above for Tier 2)
- [ ] Update `pkg/providers/registry_test.go` -- update `TestRegistryLoad` assertion from `>= 3` to `>= 26`
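
For the "sample key matching" part of these gaps, one possible shape is a table of prefixed formats checked against synthetic keys. The sketch below reuses the prefixes, regexes, and total lengths recorded in the research section; the `sample` type and `Check` function are hypothetical, and the keys are filler built with `strings.Repeat`, not real credentials.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sample describes one prefixed key format from the research notes.
type sample struct {
	name, prefix, pattern string
	bodyLen, totalLen     int
}

var samples = []sample{
	{"google-ai", "AIzaSy", `AIzaSy[A-Za-z0-9_-]{33}`, 33, 39},
	{"xai", "xai-", `xai-[0-9a-zA-Z_]{80}`, 80, 84},
	{"groq", "gsk_", `gsk_[a-zA-Z0-9]{52}`, 52, 56},
	{"replicate", "r8_", `r8_[0-9A-Za-z-_]{37}`, 37, 40},
	{"perplexity", "pplx-", `pplx-[a-zA-Z0-9]{48}`, 48, 53},
}

// Check builds a synthetic key and verifies both the documented total
// length and a full-string match against the documented pattern.
func Check(s sample) error {
	key := s.prefix + strings.Repeat("a", s.bodyLen)
	if len(key) != s.totalLen {
		return fmt.Errorf("%s: length %d, documented %d", s.name, len(key), s.totalLen)
	}
	re, err := regexp.Compile("^(?:" + s.pattern + ")$")
	if err != nil {
		return fmt.Errorf("%s: %w", s.name, err)
	}
	if !re.MatchString(key) {
		return fmt.Errorf("%s: synthetic key does not match %s", s.name, s.pattern)
	}
	return nil
}

func main() {
	for _, s := range samples {
		if err := Check(s); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", s.name)
	}
}
```

Anchoring the pattern with `^(?:...)$` in the test catches off-by-one errors in the documented lengths that an unanchored match would hide.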

## Sources

### Primary (HIGH confidence)
- [TruffleHog OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/openai/openai.go) -- regex pattern for OpenAI keys
- [TruffleHog Anthropic detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/anthropic/anthropic.go) -- regex pattern for Anthropic keys
- [TruffleHog Google Gemini detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/googlegemini/googlegemini.go) -- regex pattern for Google AI keys
- [TruffleHog Groq detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/groq/groq.go) -- regex pattern `gsk_[a-zA-Z0-9]{52}`
- [TruffleHog xAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/xai/xai.go) -- regex pattern `xai-[0-9a-zA-Z_]{80}`
- [TruffleHog Replicate detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/replicate/replicate.go) -- regex pattern `r8_[0-9A-Za-z-_]{37}`
- [TruffleHog Azure OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/azure_openai/azure_openai.go) -- 32-char hex with keyword context
- [gitleaks config](https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml) -- Anthropic, Cohere, AWS Bedrock patterns

### Secondary (MEDIUM confidence)
- [AWS Bedrock API keys (Wiz blog)](https://www.wiz.io/blog/a-new-type-of-long-lived-key-on-aws-bedrock-api-keys) -- ABSK prefix documentation
- [GitGuardian Groq detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/groq_api_key) -- confirms prefixed format
- [GitGuardian Mistral detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/mistralai_apikey) -- confirms "Prefixed: False"
- [GitGuardian Fireworks detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/fireworks_ai_api_key) -- confirms "Prefixed: False"
- [OpenAI community forum](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531) -- sk-proj-, sk-svcacct-, sk-None- prefixes
- [Replicate docs](https://replicate.com/docs/topics/security/api-tokens) -- r8_ prefix, 40-char format
- [Anyscale docs](https://docs.anyscale.com/endpoints/text-generation/authenticate/) -- esecret_ prefix

### Tertiary (LOW confidence -- needs validation at implementation)
- Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
- Inflection AI key format -- not documented, SDK-based inference only
- AI21 Labs key format -- no prefix confirmed
- Together AI key format -- no distinctive prefix confirmed
- DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
- Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
- Architecture: HIGH -- dual-location YAML pattern well-established
- Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
- Provider key formats (Tier 1 less-known): LOW -- Meta, Inflection, AI21 have limited documentation
- Provider key formats (Tier 2 prefixed): HIGH for Groq, Replicate, Perplexity; MEDIUM for Anyscale -- all have clear prefixes
- Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
- Pitfalls: HIGH -- well-understood RE2 and false-positive challenges

**Research date:** 2026-04-05
**Valid until:** 2026-05-05 (30 days -- API key formats change infrequently)