docs(02): research phase domain - provider key formats and regex patterns
.planning/phases/02-tier-1-2-providers/02-RESEARCH.md
@@ -0,0 +1,573 @@
# Phase 2: Tier 1-2 Providers - Research

**Researched:** 2026-04-05
**Domain:** LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
**Confidence:** MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)

## Summary

This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is **accuracy of regex patterns and key format data** across 26 providers with varying documentation quality.

For the well-documented providers (OpenAI, Anthropic, Google AI, xAI, and Cohere in Tier 1, plus Groq in Tier 2), key formats come with authoritative regex patterns from TruffleHog and gitleaks. For most Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli), keys lack distinctive prefixes and use generic opaque token formats, which makes regex-only detection harder and requires keyword-context-based matching.

**Primary recommendation:** Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (`providers/` and `pkg/providers/definitions/`).

<user_constraints>
## User Constraints (from CONTEXT.md)

### Locked Decisions
None -- all implementation choices at Claude's discretion (infrastructure/data phase).

### Claude's Discretion
All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.

### Deferred Ideas (OUT OF SCOPE)
None -- discuss phase skipped.
</user_constraints>

<phase_requirements>
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Regex patterns documented below for all 12; verify endpoints identified |
| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers |
</phase_requirements>

## Project Constraints (from CLAUDE.md)

- **Go regexp only (RE2)** -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
- **Providers in dual locations**: `providers/` (user-visible) and `pkg/providers/definitions/` (Go embed source). Files must be kept in sync.
- **Schema**: `format_version: 1`, `name`, `display_name`, `tier`, `last_verified`, `keywords[]`, `patterns[]` (with `regex`, `entropy_min`, `confidence`), `verify` (with `method`, `url`, `headers`, `valid_status`, `invalid_status`).
- **Confidence values**: `high`, `medium`, `low` (validated in UnmarshalYAML).
- **Keywords**: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.

## Standard Stack

No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.

### Existing Infrastructure Used
| Component | Location | Purpose |
|-----------|----------|---------|
| Provider struct | `pkg/providers/schema.go` | YAML schema with validation |
| Loader | `pkg/providers/loader.go` | embed.FS walker for `definitions/*.yaml` |
| Registry | `pkg/providers/registry.go` | Provider index + AC automaton build |
| Tests | `pkg/providers/registry_test.go` | Registry load, get, stats, AC tests |

## Architecture Patterns

### File Placement (Dual Location)
Every new provider YAML must be placed in BOTH:
```
providers/{name}.yaml                    # User-visible reference
pkg/providers/definitions/{name}.yaml    # Go embed source (compiled into binary)
```

### YAML Template Pattern
```yaml
format_version: 1
name: {provider_slug}
display_name: {Display Name}
tier: {1 or 2}
last_verified: "2026-04-05"
keywords:
  - "{prefix_literal}"           # Exact key prefix for AC match
  - "{provider_name_lowercase}"  # Provider name for context match
  - "{env_var_hint}"             # Common env var name fragments
patterns:
  - regex: '{RE2_compatible_regex}'
    entropy_min: {3.0-4.0}
    confidence: {high|medium|low}
verify:
  method: {GET|POST}
  url: {lightweight_api_endpoint}
  headers:
    {auth_header}: "{KEY_placeholder}"
  valid_status: [200]
  invalid_status: [401, 403]
```
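To make the `verify` block of the template concrete, here is a minimal sketch of how a verifier might consume it: substitute the candidate key into the URL and headers, then classify the response status. The `VerifySpec` type and both function names are hypothetical illustrations, not KeyHunter's actual verifier, and the host is a reserved placeholder, so no request is sent.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// VerifySpec mirrors the fields of the template's verify block.
// Hypothetical type for illustration; not taken from pkg/providers/schema.go.
type VerifySpec struct {
	Method        string
	URL           string
	Headers       map[string]string
	ValidStatus   []int
	InvalidStatus []int
}

// BuildRequest substitutes the {KEY} placeholder in the URL and in every
// header value, and returns the request without sending it.
func BuildRequest(spec VerifySpec, key string) (*http.Request, error) {
	target := strings.ReplaceAll(spec.URL, "{KEY}", key)
	req, err := http.NewRequest(spec.Method, target, nil)
	if err != nil {
		return nil, err
	}
	for name, value := range spec.Headers {
		req.Header.Set(name, strings.ReplaceAll(value, "{KEY}", key))
	}
	return req, nil
}

// Classify maps an HTTP status code onto the spec's status lists.
// Codes in neither list (for example 429 or 500) are reported as "unknown".
func Classify(spec VerifySpec, status int) string {
	for _, s := range spec.ValidStatus {
		if s == status {
			return "valid"
		}
	}
	for _, s := range spec.InvalidStatus {
		if s == status {
			return "invalid"
		}
	}
	return "unknown"
}

func main() {
	spec := VerifySpec{
		Method:        "GET",
		URL:           "https://api.example.invalid/v1/models", // placeholder host
		Headers:       map[string]string{"Authorization": "Bearer {KEY}"},
		ValidStatus:   []int{200},
		InvalidStatus: []int{401, 403},
	}
	req, err := BuildRequest(spec, "example-key")
	if err != nil {
		fmt.Println("build error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Authorization"))
	fmt.Println(Classify(spec, 200), Classify(spec, 401), Classify(spec, 429))
}
```

Keeping a third "unknown" outcome avoids treating rate limits or server errors as proof that a key is invalid.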

### Confidence Level Assignment Strategy
| Key Format | Confidence | Rationale |
|------------|------------|-----------|
| Unique prefix + fixed length (e.g., `sk-ant-api03-`, `gsk_`, `r8_`, `xai-`) | **high** | Prefix alone is near-unique; false positive rate extremely low |
| Unique prefix, variable length (e.g., `sk-proj-`, `AIzaSy`) | **high** | Prefix is distinctive enough |
| Short generic prefix + context needed (e.g., `sk-` for Cohere) | **medium** | Prefix collides with OpenAI legacy; needs keyword context |
| No prefix, opaque token (e.g., UUID, hex string, base64) | **low** | Requires strong keyword context; high false positive risk without AC pre-filter |
| 32-char hex string (e.g., Azure OpenAI) | **low** | Extremely generic format; keyword context mandatory |

### Keyword Strategy for Low-Confidence Providers
Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:
1. Provider name (lowercase)
2. Common env var fragments (e.g., `together_api`, `baseten_api`, `modal_token`)
3. API base URL fragments (e.g., `api.together.xyz`, `api.baseten.co`)
4. SDK/config identifiers (e.g., `togetherai`, `deepinfra`)

### Anti-Patterns to Avoid
- **Overly broad regex without keyword anchor:** A pattern like `[A-Za-z0-9]{40}` without keywords would match every 40-char alphanumeric string -- useless.
- **PCRE features in regex:** Go RE2 does not support lookahead (`(?=)`), lookbehind (`(?<=)`), or backreferences. All patterns must be RE2-safe.
- **Hardcoding `T3BlbkFJ` for non-OpenAI providers:** The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
- **Missing dual-location sync:** Forgetting to copy YAML to both `providers/` and `pkg/providers/definitions/`.

## Provider Key Format Research

### Tier 1: Frontier Providers (12)

#### 1. OpenAI
**Confidence: HIGH** -- TruffleHog verified
- **Prefixes:** `sk-proj-`, `sk-svcacct-`, `sk-None-`, legacy `sk-` (all contain `T3BlbkFJ` base64 marker)
- **TruffleHog regex:** `sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+`
- **KeyHunter regex (simplified):** `sk-proj-[A-Za-z0-9_\-]{48,}` (existing, covers primary format)
- **Note:** Existing openai.yaml only covers `sk-proj-`. Should add patterns for `sk-svcacct-` and legacy `sk-` with T3BlbkFJ marker.
- **Verify:** GET `https://api.openai.com/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `sk-proj-`, `sk-svcacct-`, `openai`, `T3BlbkFJ`

#### 2. Anthropic
**Confidence: HIGH** -- TruffleHog + gitleaks verified
- **Prefixes:** `sk-ant-api03-` (standard), `sk-ant-admin01-` (admin)
- **TruffleHog regex:** `sk-ant-(?:admin01|api03)-[\w\-]{93}AA`
- **gitleaks regex:** `sk-ant-api03-[a-zA-Z0-9_\-]{93}AA`
- **Note:** Existing anthropic.yaml pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` should be tightened to end with `AA` suffix.
- **Verify:** GET `https://api.anthropic.com/v1/models` with `x-api-key: {KEY}` + `anthropic-version: 2023-06-01` -- 200=valid, 401=invalid
- **Keywords:** `sk-ant-api03-`, `sk-ant-admin01-`, `anthropic`
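The effect of tightening the Anthropic pattern can be checked with a small Go sketch. It compares the existing loose pattern with the TruffleHog form quoted above, using synthetic strings (a repeated filler character, not real keys) built only to exercise the length and suffix rules. The helper name `syntheticKey` is made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Existing anthropic.yaml pattern: any body of 93 or more characters.
	loosePattern = regexp.MustCompile(`sk-ant-api03-[A-Za-z0-9_\-]{93,}`)
	// TruffleHog form: exactly 93 body characters followed by the AA suffix.
	tightPattern = regexp.MustCompile(`sk-ant-(?:admin01|api03)-[\w\-]{93}AA`)
)

// syntheticKey builds a fake key: prefix, 93 filler characters, then suffix.
func syntheticKey(prefix, suffix string) string {
	return prefix + strings.Repeat("x", 93) + suffix
}

func main() {
	cases := map[string]string{
		"api key with AA suffix":    syntheticKey("sk-ant-api03-", "AA"),
		"api key without AA suffix": syntheticKey("sk-ant-api03-", "zz"),
		"admin key with AA suffix":  syntheticKey("sk-ant-admin01-", "AA"),
	}
	for name, key := range cases {
		fmt.Printf("%-26s loose=%v tight=%v\n",
			name, loosePattern.MatchString(key), tightPattern.MatchString(key))
	}
}
```

The loose pattern accepts the string without the `AA` suffix and misses the admin prefix, while the tightened pattern rejects the former and accepts the latter.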

#### 3. Google AI (Gemini)
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `AIzaSy`
- **TruffleHog regex:** `AIzaSy[A-Za-z0-9_-]{33}`
- **Total length:** 39 characters
- **Note:** Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` -- 200=valid, 400/403=invalid
- **Keywords:** `AIzaSy`, `google_api`, `gemini`

#### 4. Google Vertex AI
**Confidence: MEDIUM**
- **Format:** Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
- **Approach:** For API key mode, reuse `AIzaSy` pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with `"type": "service_account"` and `"private_key"` field).
- **Recommendation:** Create a separate `vertex-ai` provider YAML that focuses on the API key path with the `AIzaSy` pattern.
- **Service account key regex:** The private key in a GCP service account JSON starts with `-----BEGIN PRIVATE KEY-----`, but this is a general GCP credential, not Vertex-specific, so it is left out of this provider.
- **Verify:** GET `https://generativelanguage.googleapis.com/v1/models?key={KEY}` (same endpoint works for Vertex API keys)
- **Keywords:** `vertex`, `google_cloud`, `AIzaSy`, `vertex_ai`

#### 5. AWS Bedrock
**Confidence: HIGH** -- gitleaks verified
- **Long-lived prefix:** `ABSK` (the rest of the key is base64 that decodes to a string starting with `BedrockAPIKey-`)
- **gitleaks regex (long-lived):** `ABSK[A-Za-z0-9+/]{109,269}={0,2}`
- **Short-lived prefix:** `bedrock-api-key-YmVkcm9ja` (base64 prefix)
- **Also detect:** AWS IAM access keys (`AKIA[0-9A-Z]{16}` + secret) used with Bedrock
- **Recommendation:** Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
- **Verify:** Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
- **Keywords:** `ABSK`, `bedrock`, `aws_bedrock`, `AKIA`

#### 6. Azure OpenAI
**Confidence: MEDIUM** -- TruffleHog verified but pattern is generic
- **Format:** 32-character lowercase hexadecimal string
- **TruffleHog regex:** `[a-f0-9]{32}` (with keyword context requirement)
- **Problem:** 32-char hex is extremely generic. MUST rely on keywords for context.
- **Keywords:** `azure`, `openai.azure.com`, `azure_openai`, `api-key`, `cognitive`
- **Verify:** GET `https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01` with `api-key: {KEY}` -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
- **Entropy min:** 3.5 (hex has theoretical max ~4.0)
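The hex ceiling of 4.0 follows from the Shannon entropy formula: with 16 equally likely symbols, each character carries log2(16) = 4 bits. The sketch below assumes `entropy_min` is compared against per-character Shannon entropy; the source does not spell out the formula the scanner uses, so treat `ShannonEntropy` as an illustrative stand-in.

```go
package main

import (
	"fmt"
	"math"
)

// ShannonEntropy returns the entropy of s in bits per character,
// computed from the relative frequency of each rune in s.
func ShannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	samples := []string{
		"0123456789abcdef0123456789abcdef", // every hex digit used equally often
		"00000000000000000000000000000000", // a single repeated symbol
	}
	for _, s := range samples {
		h := ShannonEntropy(s)
		fmt.Printf("%s entropy=%.2f passes(3.5)=%v\n", s, h, h >= 3.5)
	}
}
```

A perfectly balanced 32-char hex string reaches exactly 4.0, so a 3.5 threshold leaves some headroom for real keys, whose digits are not evenly distributed, while still rejecting low-variety strings such as padding or repeated digits.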

#### 7. Meta AI (Llama API)
**Confidence: LOW**
- **Format:** Not publicly documented as of April 2026. Meta Llama API launched April 2025.
- **Env var:** `META_LLAMA_API_KEY`
- **Approach:** Generic long token pattern with strong keyword context.
- **Keywords:** `meta`, `llama`, `meta_llama`, `llama_api`
- **Regex:** Generic high-entropy alphanumeric pattern, medium confidence
- **Verify:** GET `https://api.llama.com/v1/models` with `Authorization: Bearer {KEY}` (inferred from OpenAI-compatible API)

#### 8. xAI (Grok)
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `xai-`
- **TruffleHog regex:** `xai-[0-9a-zA-Z_]{80}`
- **Total length:** 84 characters
- **Verify:** GET `https://api.x.ai/v1/api-key` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `xai-`, `xai`, `grok`

#### 9. Cohere
**Confidence: MEDIUM** -- gitleaks verified but pattern requires context
- **Format:** 40 alphanumeric characters, no distinctive prefix
- **gitleaks regex:** Context-dependent match on `cohere` or `CO_API_KEY` keyword + `[a-zA-Z0-9]{40}`
- **Problem:** 40-char alphanumeric overlaps with many other tokens.
- **Keywords:** `cohere`, `CO_API_KEY`, `cohere_api`
- **Verify:** GET `https://api.cohere.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Entropy min:** 4.0 (higher threshold to reduce false positives)

#### 10. Mistral AI
**Confidence: MEDIUM**
- **Format:** Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
- **Approach:** Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
- **Keywords:** `mistral`, `MISTRAL_API_KEY`, `mistral.ai`, `la_plateforme`
- **Verify:** GET `https://api.mistral.ai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid

#### 11. Inflection AI
**Confidence: LOW**
- **Format:** Not publicly documented. API launched late 2025 via inflection-sdk.
- **Env var:** `PI_API_KEY`, `INFLECTION_API_KEY`
- **Keywords:** `inflection`, `pi_api`, `PI_API_KEY`
- **Verify:** Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.

#### 12. AI21 Labs
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive prefix.
- **Env var:** `AI21_API_KEY`
- **Keywords:** `ai21`, `AI21_API_KEY`, `jamba`, `jurassic`
- **Verify:** GET `https://api.ai21.com/studio/v1/models` with `Authorization: Bearer {KEY}` -- inferred

### Tier 2: Inference Platforms (14)

#### 1. Together AI
**Confidence: LOW-MEDIUM**
- **Format:** Appears to use generic opaque tokens. Some documentation shows an `sk-` prefix, but this may be an example placeholder.
- **Keywords:** `together`, `TOGETHER_API_KEY`, `api.together.xyz`, `togetherai`
- **Verify:** GET `https://api.together.xyz/v1/models` with `Authorization: Bearer {KEY}`

#### 2. Fireworks AI
**Confidence: LOW-MEDIUM**
- **Format:** GitGuardian confirms "Prefixed: False". Opaque token format.
- **Prefix:** `fw_` prefix has been reported in some sources but not confirmed by GitGuardian.
- **Keywords:** `fireworks`, `FIREWORKS_API_KEY`, `fireworks.ai`, `fw_`
- **Verify:** GET `https://api.fireworks.ai/inference/v1/models` with `Authorization: Bearer {KEY}`

#### 3. Groq
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `gsk_`
- **TruffleHog regex:** `gsk_[a-zA-Z0-9]{52}`
- **Total length:** 56 characters
- **Verify:** GET `https://api.groq.com/openai/v1/models` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `gsk_`, `groq`, `GROQ_API_KEY`

#### 4. Replicate
**Confidence: HIGH** -- TruffleHog verified
- **Prefix:** `r8_`
- **TruffleHog regex:** `r8_[0-9A-Za-z-_]{37}`
- **Total length:** 40 characters
- **Verify:** GET `https://api.replicate.com/v1/predictions` with `Authorization: Bearer {KEY}` -- 200=valid, 401=invalid
- **Keywords:** `r8_`, `replicate`, `REPLICATE_API_TOKEN`

#### 5. Anyscale
**Confidence: MEDIUM**
- **Prefix:** `esecret_`
- **Regex:** `esecret_[A-Za-z0-9_-]{20,}`
- **Keywords:** `esecret_`, `anyscale`, `ANYSCALE_API_KEY`
- **Verify:** GET `https://api.endpoints.anyscale.com/v1/models` with `Authorization: Bearer {KEY}`
- **Note:** Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.

#### 6. DeepInfra
**Confidence: LOW-MEDIUM**
- **Format:** Opaque token. JWT-based scoped tokens use `jwt:` prefix.
- **Keywords:** `deepinfra`, `DEEPINFRA_API_KEY`, `deepinfra.com`
- **Verify:** GET `https://api.deepinfra.com/v1/openai/models` with `Authorization: Bearer {KEY}`

#### 7. Lepton AI
**Confidence: LOW**
- **Format:** Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
- **Keywords:** `lepton`, `LEPTON_API_TOKEN`, `lepton.ai`
- **Verify:** Endpoint uncertain. Mark as placeholder.

#### 8. Modal
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `modal`, `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `modal.com`
- **Note:** Modal uses token ID + token secret pair, not a single API key.
- **Verify:** Mark as placeholder.

#### 9. Baseten
**Confidence: LOW-MEDIUM**
- **Format:** Uses API keys passed in header.
- **Keywords:** `baseten`, `BASETEN_API_KEY`, `api.baseten.co`
- **Verify:** GET `https://api.baseten.co/v1/models` with `Authorization: Api-Key {KEY}` (non-standard header)

#### 10. Cerebrium
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `cerebrium`, `CEREBRIUM_API_KEY`, `cerebrium.ai`
- **Verify:** Mark as placeholder.

#### 11. NovitaAI
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive format.
- **Keywords:** `novita`, `NOVITA_API_KEY`, `novita.ai`
- **Verify:** GET `https://api.novita.ai/v3/openai/models` with `Authorization: Bearer {KEY}` (inferred)

#### 12. SambaNova
**Confidence: LOW**
- **Format:** Not publicly documented with distinctive prefix.
- **Keywords:** `sambanova`, `SAMBANOVA_API_KEY`, `sambastudio`, `snapi`
- **Verify:** GET `https://api.sambanova.ai/v1/models` with `Authorization: Bearer {KEY}` (inferred OpenAI-compatible)

#### 13. OctoAI
**Confidence: LOW**
- **Format:** Not publicly documented. OctoAI was shut down / merged in 2025.
- **Keywords:** `octoai`, `OCTOAI_TOKEN`, `octo.ai`
- **Verify:** Mark as placeholder (service may be defunct).

#### 14. Friendli
**Confidence: LOW**
- **Format:** Uses "Friendli Token" -- not publicly documented format.
- **Keywords:** `friendli`, `FRIENDLI_TOKEN`, `friendli.ai`
- **Verify:** Mark as placeholder.

### IMPORTANT: Perplexity Classification Issue

The phase description lists Perplexity as Tier 2, while REQUIREMENTS.md also lists it under Tier 3 (PROV-03). The gitleaks config has a pattern for it: `pplx-[a-zA-Z0-9]{48}`. **Follow the phase description**, which includes Perplexity in the Tier 2 list; the PROV-02 requirement list names it as one of the 14, so Tier 2 is the correct assignment for this phase. If Perplexity turns out not to belong in Tier 2, one of the other PROV-02 providers should take its place.

#### Perplexity (listed in PROV-02)
**Confidence: HIGH** -- gitleaks verified
- **Prefix:** `pplx-`
- **gitleaks regex:** `pplx-[a-zA-Z0-9]{48}`
- **Total length:** 53 characters
- **Keywords:** `pplx-`, `perplexity`, `PERPLEXITY_API_KEY`
- **Verify:** `https://api.perplexity.ai/chat/completions` with `Authorization: Bearer {KEY}` is a POST endpoint, so it does not fit a lightweight GET check. Use a model listing endpoint if one is available.
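The "Total length" figures quoted for the fixed-length formats can be cross-checked mechanically: prefix length plus the regex repetition count must equal the stated total. The sketch below builds a synthetic string of exactly that length for each format and confirms that the quoted regex matches it in full and rejects a string one character shorter. `CheckFormat` and the table are illustrative helpers, not part of KeyHunter.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fixedLengthFormats lists the prefix, regex, and total length quoted in this
// document for providers whose keys have a fixed length.
var fixedLengthFormats = []struct {
	Provider string
	Prefix   string
	Regex    string
	TotalLen int
}{
	{"Google AI", "AIzaSy", `AIzaSy[A-Za-z0-9_-]{33}`, 39},
	{"xAI", "xai-", `xai-[0-9a-zA-Z_]{80}`, 84},
	{"Groq", "gsk_", `gsk_[a-zA-Z0-9]{52}`, 56},
	{"Replicate", "r8_", `r8_[0-9A-Za-z-_]{37}`, 40},
	{"Perplexity", "pplx-", `pplx-[a-zA-Z0-9]{48}`, 53},
}

// CheckFormat verifies that pattern matches a synthetic key of exactly
// totalLen characters and does not match one that is a character shorter.
func CheckFormat(prefix, pattern string, totalLen int) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return err
	}
	key := prefix + strings.Repeat("a", totalLen-len(prefix))
	if got := re.FindString(key); got != key {
		return fmt.Errorf("pattern matched %d of %d characters", len(got), len(key))
	}
	if re.MatchString(key[:len(key)-1]) {
		return fmt.Errorf("pattern also matches a %d-character key", len(key)-1)
	}
	return nil
}

func main() {
	for _, f := range fixedLengthFormats {
		if err := CheckFormat(f.Prefix, f.Regex, f.TotalLen); err != nil {
			fmt.Printf("%-10s MISMATCH: %v\n", f.Provider, err)
			continue
		}
		fmt.Printf("%-10s ok (%d chars)\n", f.Provider, f.TotalLen)
	}
}
```

The same table-driven shape could be reused in the tier tests to pair each provider with a sample key.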

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Regex validation | Custom regex validator | Go `regexp.Compile` + existing schema validation in `UnmarshalYAML` | Schema already validates confidence levels; add regex compilation check |
| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data |
| AC automaton rebuild | Manual keyword management | Existing `NewRegistry()` auto-builds AC from all provider keywords | Just add YAML files; registry handles everything |
| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten |

## Common Pitfalls

### Pitfall 1: Catastrophic Backtracking in Regex
**What goes wrong:** Complex regex with nested quantifiers causes exponential backtracking.
**Why it happens:** Using patterns like `(a+)+` or `([A-Za-z0-9]+[-_]?)+`.
**How to avoid:** Go's RE2-based `regexp` engine matches in linear time, so this cannot happen by design -- but still write simple, linear patterns. Avoid unnecessary alternation groups.
**Warning signs:** Regex compilation errors in tests.

### Pitfall 2: Overly Broad Patterns Without Keywords
**What goes wrong:** Pattern like `[A-Za-z0-9]{32}` matches random strings everywhere.
**Why it happens:** Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
**How to avoid:** Set confidence to `low`, require strong keyword list for AC pre-filtering. The scanner will only test regex AFTER AC finds a keyword match.
**Warning signs:** Hundreds of false positives when scanning any codebase.

### Pitfall 3: Forgetting Dual-Location File Sync
**What goes wrong:** Provider added to `providers/` but not `pkg/providers/definitions/`, or vice versa.
**Why it happens:** Phase 1 decision to maintain both locations.
**How to avoid:** Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
**Warning signs:** `go test` passes but `providers/` directory has different count than `pkg/providers/definitions/`.

### Pitfall 4: Invalid YAML Schema
**What goes wrong:** Missing `format_version`, empty `last_verified`, or invalid confidence value.
**Why it happens:** Copy-paste errors or typos.
**How to avoid:** The `UnmarshalYAML` validation catches these. Run `go test ./pkg/providers/...` after each batch.
**Warning signs:** Test failures with schema validation errors.

### Pitfall 5: Regex Not RE2-Compatible
**What goes wrong:** Using PCRE features like `(?<=prefix)` or `(?!suffix)`.
**Why it happens:** Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
**How to avoid:** Test every regex with `regexp.MustCompile()` in Go. Strip boundary assertions that gitleaks adds (like `(?:[\x60'"\s;]|\\[nr]|$)` -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
**Warning signs:** `regexp.Compile` returns error.
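A quick way to see which constructs Go rejects is to feed candidate patterns to `regexp.Compile` and inspect the error. The sketch below is a standalone illustration; `CheckRE2` is a hypothetical helper name and the sample patterns are made up to show each failure mode.

```go
package main

import (
	"fmt"
	"regexp"
)

// CheckRE2 reports whether pattern compiles under Go's RE2-based regexp
// package. It returns nil when the pattern is usable as-is.
func CheckRE2(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	patterns := []string{
		`gsk_[a-zA-Z0-9]{52}`,    // plain character class: accepted
		`(?<=key=)[a-f0-9]{32}`,  // lookbehind: rejected
		`sk-(?!test)[A-Za-z0-9]+`, // negative lookahead: rejected
		`([a-f0-9])\1`,           // backreference: rejected
	}
	for _, p := range patterns {
		if err := CheckRE2(p); err != nil {
			fmt.Printf("REJECT %-26s %v\n", p, err)
			continue
		}
		fmt.Printf("OK     %s\n", p)
	}
}
```

Using `regexp.Compile` rather than `regexp.MustCompile` in checks like this reports every bad pattern in one run instead of panicking on the first.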

### Pitfall 6: Schema Missing Category Field
**What goes wrong:** CONTEXT.md mentions a `category` field in the YAML format, but `pkg/providers/schema.go` Provider struct has no `Category` field.
**Why it happens:** CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
**How to avoid:** Either add `Category` field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (`keyhunter providers list` might want category filtering).
**Warning signs:** YAML has `category` field that gets silently ignored by Go YAML parser.

## Code Examples

### Provider YAML File (High-Confidence Prefixed Provider)
```yaml
# Source: TruffleHog xAI detector (verified)
format_version: 1
name: xai
display_name: xAI
tier: 1
last_verified: "2026-04-05"
keywords:
  - "xai-"
  - "xai"
  - "grok"
patterns:
  - regex: 'xai-[0-9a-zA-Z_]{80}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.x.ai/v1/api-key
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Provider YAML File (Low-Confidence Generic Provider)
```yaml
# Low-confidence: no distinctive prefix, requires keyword context
format_version: 1
name: together
display_name: Together AI
tier: 2
last_verified: "2026-04-05"
keywords:
  - "together"
  - "together_api"
  - "api.together.xyz"
  - "togetherai"
  - "TOGETHER_API_KEY"
patterns:
  - regex: '[A-Za-z0-9]{64}'
    entropy_min: 4.0
    confidence: low
verify:
  method: GET
  url: https://api.together.xyz/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

### Test Pattern: Verify Regex Compiles
```go
func TestProviderRegexCompiles(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	for _, p := range reg.List() {
		for i, pat := range p.Patterns {
			// Every pattern must compile under Go's RE2 engine.
			_, err := regexp.Compile(pat.Regex)
			assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
		}
	}
}
```

### Test Pattern: Verify Provider Count
```go
func TestTier1And2ProviderCount(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	stats := reg.Stats()
	assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
	assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
	assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
}
```

### Test Pattern: AC Pre-Filter Matches Known Key Prefix
```go
func TestACMatchesKnownPrefixes(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	ac := reg.AC()

	// Truncated sample strings: only the prefix matters for the AC pre-filter.
	prefixes := []string{
		"sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123",
		"xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123",
	}
	for _, prefix := range prefixes {
		matches := ac.FindAll(prefix)
		assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
	}
}
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| OpenAI `sk-` prefix only | `sk-proj-`, `sk-svcacct-`, `sk-None-` with `T3BlbkFJ` marker | 2024 | Must detect all modern prefixes |
| AWS Bedrock via IAM only | Bedrock-specific `ABSK` API keys | Late 2025 | New key type to detect |
| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this |
| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic `AA` suffix) | Current | Tighter patterns reduce false positives |

## Open Questions

1. **Category field in schema**
   - What we know: CONTEXT.md mentions `category` in YAML format, but `schema.go` has no `Category` field.
   - What's unclear: Whether to add it now or defer.
   - Recommendation: Add ``Category string `yaml:"category"` `` to Provider struct as part of this phase. Values like `"frontier"`, `"inference-platform"`, `"specialized"` would support CLI-04 filtering.

2. **Perplexity tier assignment**
   - What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3.
   - What's unclear: Which is canonical.
   - Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3.

3. **OpenAI existing pattern needs update**
   - What we know: Current `openai.yaml` only has `sk-proj-` pattern. Modern keys also use `sk-svcacct-`, `sk-None-`, and legacy `sk-` with `T3BlbkFJ` marker.
   - Recommendation: Update openai.yaml with additional patterns.

4. **Anthropic existing pattern can be tightened**
   - What we know: Current pattern `sk-ant-api03-[A-Za-z0-9_\-]{93,}` does not require the `AA` suffix.
   - Recommendation: Tighten to `sk-ant-api03-[A-Za-z0-9_\-]{93}AA` per TruffleHog and add admin key pattern.

5. **Defunct/transitioning providers**
   - What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
   - Recommendation: Still create YAML definitions for all -- keys issued while a service was operational may still sit in repositories and are worth flagging.

## Validation Architecture

### Test Framework
| Property | Value |
|----------|-------|
| Framework | Go standard `testing` + `testify` v1.x |
| Config file | `go.mod` (no separate test config) |
| Quick run command | `go test ./pkg/providers/... -v -count=1` |
| Full suite command | `go test ./... -v -count=1` |

### Phase Requirements -> Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier1Providers -v` | Wave 0 |
| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier1Patterns -v` | Wave 0 |
| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | `go test ./pkg/providers/... -run TestTier2Providers -v` | Wave 0 |
| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | `go test ./pkg/providers/... -run TestTier2Patterns -v` | Wave 0 |
| PROV-01+02 | AC automaton matches all provider keywords | unit | `go test ./pkg/providers/... -run TestACMatchesAllKeywords -v` | Wave 0 |
| PROV-01+02 | Registry stats show 26+ providers | unit | `go test ./pkg/providers/... -run TestTier1And2ProviderCount -v` | Wave 0 |

### Sampling Rate
- **Per task commit:** `go test ./pkg/providers/... -v -count=1`
- **Per wave merge:** `go test ./... -v -count=1`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps
- [ ] `pkg/providers/tier1_test.go` -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
- [ ] `pkg/providers/tier2_test.go` -- covers PROV-02 (same as above for Tier 2)
- [ ] Update `pkg/providers/registry_test.go` -- update `TestRegistryLoad` assertion from `>= 3` to `>= 26`
|
||||
|
||||
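The tier-count checks planned for the Wave 0 test files might look roughly like the sketch below. It assumes a hypothetical `provider` struct with `Name` and `Tier` fields; the real Phase 1 registry type and its accessors may differ. `syntheticProviders` only fabricates placeholder entries so the check can run without the YAML definitions.

```go
package main

import "fmt"

// provider is a hypothetical stand-in for a loaded registry entry; the real
// Phase 1 type and its field names may differ.
type provider struct {
	Name string
	Tier int
}

// checkTierCounts reports an error unless the slice holds exactly wantTier1
// Tier 1 providers and wantTier2 Tier 2 providers.
func checkTierCounts(ps []provider, wantTier1, wantTier2 int) error {
	counts := make(map[int]int)
	for _, p := range ps {
		counts[p.Tier]++
	}
	if counts[1] != wantTier1 {
		return fmt.Errorf("tier 1: got %d providers, want %d", counts[1], wantTier1)
	}
	if counts[2] != wantTier2 {
		return fmt.Errorf("tier 2: got %d providers, want %d", counts[2], wantTier2)
	}
	return nil
}

// syntheticProviders builds placeholder entries so the check can run without
// the real YAML definitions.
func syntheticProviders(tier1, tier2 int) []provider {
	ps := make([]provider, 0, tier1+tier2)
	for i := 0; i < tier1; i++ {
		ps = append(ps, provider{Name: fmt.Sprintf("tier1-%d", i), Tier: 1})
	}
	for i := 0; i < tier2; i++ {
		ps = append(ps, provider{Name: fmt.Sprintf("tier2-%d", i), Tier: 2})
	}
	return ps
}

func main() {
	ps := syntheticProviders(12, 14)
	if err := checkTierCounts(ps, 12, 14); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Printf("ok: %d providers\n", len(ps))
}
```

In the actual test files the slice would come from the registry loader, and the 12 and 14 targets come from the PROV-01 and PROV-02 rows of the test map.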
## Sources
### Primary (HIGH confidence)

- [TruffleHog OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/openai/openai.go) -- regex pattern for OpenAI keys
- [TruffleHog Anthropic detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/anthropic/anthropic.go) -- regex pattern for Anthropic keys
- [TruffleHog Google Gemini detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/googlegemini/googlegemini.go) -- regex pattern for Google AI keys
- [TruffleHog Groq detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/groq/groq.go) -- regex pattern `gsk_[a-zA-Z0-9]{52}`
- [TruffleHog xAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/xai/xai.go) -- regex pattern `xai-[0-9a-zA-Z_]{80}`
- [TruffleHog Replicate detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/replicate/replicate.go) -- regex pattern `r8_[0-9A-Za-z-_]{37}`
- [TruffleHog Azure OpenAI detector](https://github.com/trufflesecurity/trufflehog/blob/main/pkg/detectors/azure_openai/azure_openai.go) -- 32-char hex with keyword context
- [gitleaks config](https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml) -- Anthropic, Cohere, AWS Bedrock patterns

### Secondary (MEDIUM confidence)

- [AWS Bedrock API keys (Wiz blog)](https://www.wiz.io/blog/a-new-type-of-long-lived-key-on-aws-bedrock-api-keys) -- ABSK prefix documentation
- [GitGuardian Groq detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/groq_api_key) -- confirms prefixed format
- [GitGuardian Mistral detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/mistralai_apikey) -- confirms "Prefixed: False"
- [GitGuardian Fireworks detector](https://docs.gitguardian.com/secrets-detection/secrets-detection-engine/detectors/specifics/fireworks_ai_api_key) -- confirms "Prefixed: False"
- [OpenAI community forum](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531) -- sk-proj-, sk-svcacct-, sk-None- prefixes
- [Replicate docs](https://replicate.com/docs/topics/security/api-tokens) -- r8_ prefix, 40-char format
- [Anyscale docs](https://docs.anyscale.com/endpoints/text-generation/authenticate/) -- esecret_ prefix

### Tertiary (LOW confidence -- needs validation at implementation)

- Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
- Inflection AI key format -- not documented, SDK-based inference only
- AI21 Labs key format -- no prefix confirmed
- Together AI key format -- no distinctive prefix confirmed
- DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
- Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented

## Metadata
**Confidence breakdown:**
- Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
- Architecture: HIGH -- dual-location YAML pattern well-established
- Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
- Provider key formats (Tier 1 less-known): MEDIUM -- Meta, Inflection, AI21 have limited documentation
- Provider key formats (Tier 2 prefixed): HIGH -- Groq, Replicate, Perplexity, Anyscale have clear prefixes
- Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
- Pitfalls: HIGH -- well-understood RE2 and false-positive challenges

**Research date:** 2026-04-05
**Valid until:** 2026-05-05 (30 days -- API key formats change infrequently)