Phase 2: Tier 1-2 Providers - Research
Researched: 2026-04-05
Domain: LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
Confidence: MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)
Summary
This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is accuracy of regex patterns and key format data across 26 providers with varying documentation quality.
For well-documented providers (OpenAI, Anthropic, Google AI, xAI, and Cohere in Tier 1, plus Groq in Tier 2), authoritative regex patterns are available from TruffleHog and gitleaks. For Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli), many providers lack distinctive key prefixes and use generic opaque token formats, which makes regex-only detection harder and requires keyword-context-based matching.
Primary recommendation: Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/).
<user_constraints>
User Constraints (from CONTEXT.md)
Locked Decisions
None -- all implementation choices at Claude's discretion (infrastructure/data phase).
Claude's Discretion
All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.
Deferred Ideas (OUT OF SCOPE)
None -- discuss phase skipped.
</user_constraints>
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Regex patterns documented below for all 12; verify endpoints identified |
| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, Sambanova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers |
</phase_requirements>
Project Constraints (from CLAUDE.md)
- Go regexp only (RE2) -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
- Providers in dual locations: providers/ (user-visible) and pkg/providers/definitions/ (Go embed source). Files must be kept in sync.
- Schema: format_version: 1, name, display_name, tier, last_verified, keywords[], patterns[] (with regex, entropy_min, confidence), verify (with method, url, headers, valid_status, invalid_status).
- Confidence values: high, medium, low (validated in UnmarshalYAML).
- Keywords: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.
Standard Stack
No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.
Existing Infrastructure Used
| Component | Location | Purpose |
|---|---|---|
| Provider struct | pkg/providers/schema.go | YAML schema with validation |
| Loader | pkg/providers/loader.go | embed.FS walker for definitions/*.yaml |
| Registry | pkg/providers/registry.go | Provider index + AC automaton build |
| Tests | pkg/providers/registry_test.go | Registry load, get, stats, AC tests |
Architecture Patterns
File Placement (Dual Location)
Every new provider YAML must be placed in BOTH:
providers/{name}.yaml # User-visible reference
pkg/providers/definitions/{name}.yaml # Go embed source (compiled into binary)
YAML Template Pattern
format_version: 1
name: {provider_slug}
display_name: {Display Name}
tier: {1 or 2}
last_verified: "2026-04-05"
keywords:
  - "{prefix_literal}"            # Exact key prefix for AC match
  - "{provider_name_lowercase}"   # Provider name for context match
  - "{env_var_hint}"              # Common env var name fragments
patterns:
  - regex: '{RE2_compatible_regex}'
    entropy_min: {3.0-4.0}
    confidence: {high|medium|low}
verify:
  method: {GET|POST}
  url: {lightweight_api_endpoint}
  headers:
    {auth_header}: "{KEY_placeholder}"
  valid_status: [200]
  invalid_status: [401, 403]
Confidence Level Assignment Strategy
| Key Format | Confidence | Rationale |
|---|---|---|
| Unique prefix + fixed length (e.g., sk-ant-api03-, gsk_, r8_, xai-) | high | Prefix alone is near-unique; false positive rate extremely low |
| Unique prefix, variable length (e.g., sk-proj-, AIzaSy) | high | Prefix is distinctive enough |
| Short generic prefix + context needed (e.g., sk- for Cohere) | medium | Prefix collides with OpenAI legacy; needs keyword context |
| No prefix, opaque token (e.g., UUID, hex string, base64) | low | Requires strong keyword context; high false positive risk without AC pre-filter |
| 32-char hex string (e.g., Azure OpenAI) | low | Extremely generic format; keyword context mandatory |
Keyword Strategy for Low-Confidence Providers
Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:
- Provider name (lowercase)
- Common env var fragments (e.g., together_api, baseten_api, modal_token)
- API base URL fragments (e.g., api.together.xyz, api.baseten.co)
- SDK/config identifiers (e.g., togetherai, deepinfra)
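The keyword gating described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the candidate type and findKeys function are hypothetical names, and strings.Contains stands in for the real Aho-Corasick automaton, which this sketch does not use.

```go
package main

import (
	"regexp"
	"strings"
)

// candidate is a hypothetical, simplified stand-in for a provider definition.
type candidate struct {
	keywords []string       // lowercase context keywords
	pattern  *regexp.Regexp // generic, low-confidence key pattern
}

// findKeys returns pattern matches only when at least one keyword occurs in
// the text. strings.Contains is an assumption standing in for the
// Aho-Corasick pre-filter; the gating idea is the same.
func findKeys(c candidate, text string) []string {
	lower := strings.ToLower(text)
	for _, kw := range c.keywords {
		if strings.Contains(lower, kw) {
			return c.pattern.FindAllString(text, -1)
		}
	}
	return nil
}

// togetherLike mirrors a generic 64-character token that is only meaningful
// when it appears next to a provider keyword.
var togetherLike = candidate{
	keywords: []string{"together", "together_api", "api.together.xyz", "togetherai"},
	pattern:  regexp.MustCompile(`[A-Za-z0-9]{64}`),
}
```

With this gate, a line such as TOGETHER_API_KEY=<64 alphanumeric characters> yields one match, while the same 64-character string after checksum= yields none, because no keyword is present.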
Anti-Patterns to Avoid
- Overly broad regex without keyword anchor: A pattern like [A-Za-z0-9]{40} without keywords would match every 40-char alphanumeric string -- useless.
- PCRE features in regex: Go RE2 does not support lookahead ((?=)), lookbehind ((?<=)), or backreferences. All patterns must be RE2-safe.
- Hardcoding T3BlbkFJ for non-OpenAI providers: The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
- Missing dual-location sync: Forgetting to copy YAML to both providers/ and pkg/providers/definitions/.
Provider Key Format Research
Tier 1: Frontier Providers (12)
1. OpenAI
Confidence: HIGH -- TruffleHog verified
- Prefixes: sk-proj-, sk-svcacct-, sk-None-, legacy sk- (all contain T3BlbkFJ base64 marker)
- TruffleHog regex: sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+
- KeyHunter regex (simplified): sk-proj-[A-Za-z0-9_\-]{48,} (existing, covers primary format)
- Note: Existing openai.yaml only covers sk-proj-. Should add patterns for sk-svcacct- and legacy sk- with T3BlbkFJ marker.
- Verify: GET https://api.openai.com/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
- Keywords: sk-proj-, sk-svcacct-, openai, T3BlbkFJ
2. Anthropic
Confidence: HIGH -- TruffleHog + gitleaks verified
- Prefixes: sk-ant-api03- (standard), sk-ant-admin01- (admin)
- TruffleHog regex: sk-ant-(?:admin01|api03)-[\w\-]{93}AA
- gitleaks regex: sk-ant-api03-[a-zA-Z0-9_\-]{93}AA
- Note: Existing anthropic.yaml pattern sk-ant-api03-[A-Za-z0-9_\-]{93,} should be tightened to end with AA suffix.
- Verify: GET https://api.anthropic.com/v1/models with x-api-key: {KEY} + anthropic-version: 2023-06-01 -- 200=valid, 401=invalid
- Keywords: sk-ant-api03-, sk-ant-admin01-, anthropic
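To make the effect of the tightening concrete, the following sketch compares the existing pattern with the TruffleHog one on synthetic keys. The keys are fabricated filler (93 repeated characters), not real credentials, and syntheticKey is a hypothetical helper introduced only for this illustration.

```go
package main

import (
	"regexp"
	"strings"
)

var (
	// loosePattern is the existing anthropic.yaml pattern quoted above.
	loosePattern = regexp.MustCompile(`sk-ant-api03-[A-Za-z0-9_\-]{93,}`)
	// tightPattern is the TruffleHog pattern: fixed 93-char body plus AA suffix.
	tightPattern = regexp.MustCompile(`sk-ant-(?:admin01|api03)-[\w\-]{93}AA`)
)

// syntheticKey builds a fake key (not a real credential): the given prefix,
// a 93-character filler body, and the given suffix.
func syntheticKey(prefix, suffix string) string {
	return prefix + strings.Repeat("x", 93) + suffix
}
```

Under these assumptions, both patterns accept a synthetic key ending in AA, but only the loose pattern accepts the same key ending in zz. The tight pattern also covers the sk-ant-admin01- prefix, which the loose one misses.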
3. Google AI (Gemini)
Confidence: HIGH -- TruffleHog verified
- Prefix: AIzaSy
- TruffleHog regex: AIzaSy[A-Za-z0-9_-]{33}
- Total length: 39 characters
- Note: Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
- Verify: GET https://generativelanguage.googleapis.com/v1/models?key={KEY} -- 200=valid, 400/403=invalid
- Keywords: AIzaSy, google_api, gemini
4. Google Vertex AI
Confidence: MEDIUM
- Format: Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
- Approach: For API key mode, reuse AIzaSy pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with "type": "service_account" and "private_key" field).
- Recommendation: Create a separate vertex-ai provider YAML that focuses on the API key path with the AIzaSy pattern.
- Service account key regex: The private key in a GCP service account JSON starts with -----BEGIN PRIVATE KEY----- -- but this is a general GCP credential, not Vertex-specific. Keep this provider focused on the API key path.
- Verify: GET https://generativelanguage.googleapis.com/v1/models?key={KEY} (same endpoint works for Vertex API keys)
- Keywords: vertex, google_cloud, AIzaSy, vertex_ai
5. AWS Bedrock
Confidence: HIGH -- gitleaks verified
- Long-lived prefix: ABSK (the base64 payload that follows decodes to a string starting with BedrockAPIKey-)
- gitleaks regex (long-lived): ABSK[A-Za-z0-9+/]{109,269}={0,2}
- Short-lived prefix: bedrock-api-key-YmVkcm9ja (base64 prefix)
- Also detect: AWS IAM access keys (AKIA[0-9A-Z]{16} + secret) used with Bedrock
- Recommendation: Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
- Verify: Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
- Keywords: ABSK, bedrock, aws_bedrock, AKIA
6. Azure OpenAI
Confidence: MEDIUM -- TruffleHog verified but pattern is generic
- Format: 32-character lowercase hexadecimal string
- TruffleHog regex: [a-f0-9]{32} (with keyword context requirement)
- Problem: 32-char hex is extremely generic. MUST rely on keywords for context.
- Keywords: azure, openai.azure.com, azure_openai, api-key, cognitive
- Verify: GET https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01 with api-key: {KEY} -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
- Entropy min: 3.5 (hex has theoretical max ~4.0)
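The 4.0 ceiling for hex follows from Shannon entropy: with 16 equally likely symbols, each character carries log2(16) = 4 bits. The sketch below assumes entropy_min is compared against per-character Shannon entropy; the source does not spell out the scanner's exact formula, so treat shannonEntropy as an illustrative helper rather than the project's implementation.

```go
package main

import "math"

// shannonEntropy returns the Shannon entropy of s in bits per character.
// Assumption: this is the quantity entropy_min is compared against.
func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}
```

A 32-character hex string in which each of the 16 digits appears exactly twice reaches the 4.0 maximum, while a string of one repeated character scores 0. A 3.5 threshold therefore leaves some room for realistic hex keys whose digit distribution is uneven.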
7. Meta AI (Llama API)
Confidence: LOW
- Format: Not publicly documented as of April 2026. Meta Llama API launched April 2025.
- Env var: META_LLAMA_API_KEY
- Approach: Generic long token pattern with strong keyword context.
- Keywords: meta, llama, meta_llama, llama_api
- Regex: Generic high-entropy alphanumeric pattern, medium confidence
- Verify: GET https://api.llama.com/v1/models with Authorization: Bearer {KEY} (inferred from OpenAI-compatible API)
8. xAI (Grok)
Confidence: HIGH -- TruffleHog verified
- Prefix: xai-
- TruffleHog regex: xai-[0-9a-zA-Z_]{80}
- Total length: 84 characters
- Verify: GET https://api.x.ai/v1/api-key with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
- Keywords: xai-, xai, grok
9. Cohere
Confidence: MEDIUM -- gitleaks verified but pattern requires context
- Format: 40 alphanumeric characters, no distinctive prefix
- gitleaks regex: Context-dependent match on cohere or CO_API_KEY keyword + [a-zA-Z0-9]{40}
- Problem: 40-char alphanumeric overlaps with many other tokens.
- Keywords: cohere, CO_API_KEY, cohere_api
- Verify: GET https://api.cohere.ai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
- Entropy min: 4.0 (higher threshold to reduce false positives)
10. Mistral AI
Confidence: MEDIUM
- Format: Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
- Approach: Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
- Keywords: mistral, MISTRAL_API_KEY, mistral.ai, la_plateforme
- Verify: GET https://api.mistral.ai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
11. Inflection AI
Confidence: LOW
- Format: Not publicly documented. API launched late 2025 via inflection-sdk.
- Env var: PI_API_KEY, INFLECTION_API_KEY
- Keywords: inflection, pi_api, PI_API_KEY
- Verify: Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.
12. AI21 Labs
Confidence: LOW
- Format: Not publicly documented with distinctive prefix.
- Env var: AI21_API_KEY
- Keywords: ai21, AI21_API_KEY, jamba, jurassic
- Verify: GET https://api.ai21.com/studio/v1/models with Authorization: Bearer {KEY} -- inferred
Tier 2: Inference Platforms (14)
1. Together AI
Confidence: LOW-MEDIUM
- Format: Appears to use generic opaque tokens. Some documentation shows an sk- prefix, but this may be an example placeholder.
- Keywords: together, TOGETHER_API_KEY, api.together.xyz, togetherai
- Verify: GET https://api.together.xyz/v1/models with Authorization: Bearer {KEY}
2. Fireworks AI
Confidence: LOW-MEDIUM
- Format: GitGuardian confirms "Prefixed: False". Opaque token format.
- Prefix: An fw_ prefix has been reported in some sources but not confirmed by GitGuardian.
- Keywords: fireworks, FIREWORKS_API_KEY, fireworks.ai, fw_
- Verify: GET https://api.fireworks.ai/inference/v1/models with Authorization: Bearer {KEY}
3. Groq
Confidence: HIGH -- TruffleHog verified
- Prefix: gsk_
- TruffleHog regex: gsk_[a-zA-Z0-9]{52}
- Total length: 56 characters
- Verify: GET https://api.groq.com/openai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
- Keywords: gsk_, groq, GROQ_API_KEY
4. Replicate
Confidence: HIGH -- TruffleHog verified
- Prefix: r8_
- TruffleHog regex: r8_[0-9A-Za-z-_]{37}
- Total length: 40 characters
- Verify: GET https://api.replicate.com/v1/predictions with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
- Keywords: r8_, replicate, REPLICATE_API_TOKEN
5. Anyscale
Confidence: MEDIUM
- Prefix: esecret_
- Regex: esecret_[A-Za-z0-9_-]{20,}
- Keywords: esecret_, anyscale, ANYSCALE_API_KEY
- Verify: GET https://api.endpoints.anyscale.com/v1/models with Authorization: Bearer {KEY}
- Note: Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.
6. DeepInfra
Confidence: LOW-MEDIUM
- Format: Opaque token. JWT-based scoped tokens use jwt: prefix.
- Keywords: deepinfra, DEEPINFRA_API_KEY, deepinfra.com
- Verify: GET https://api.deepinfra.com/v1/openai/models with Authorization: Bearer {KEY}
7. Lepton AI
Confidence: LOW
- Format: Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
- Keywords: lepton, LEPTON_API_TOKEN, lepton.ai
- Verify: Endpoint uncertain. Mark as placeholder.
8. Modal
Confidence: LOW
- Format: Not publicly documented with distinctive format.
- Keywords: modal, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, modal.com
- Note: Modal uses token ID + token secret pair, not a single API key.
- Verify: Mark as placeholder.
9. Baseten
Confidence: LOW-MEDIUM
- Format: Uses API keys passed in header.
- Keywords: baseten, BASETEN_API_KEY, api.baseten.co
- Verify: GET https://api.baseten.co/v1/models with Authorization: Api-Key {KEY} (non-standard header)
10. Cerebrium
Confidence: LOW
- Format: Not publicly documented with distinctive format.
- Keywords: cerebrium, CEREBRIUM_API_KEY, cerebrium.ai
- Verify: Mark as placeholder.
11. NovitaAI
Confidence: LOW
- Format: Not publicly documented with distinctive format.
- Keywords: novita, NOVITA_API_KEY, novita.ai
- Verify: GET https://api.novita.ai/v3/openai/models with Authorization: Bearer {KEY} (inferred)
12. SambaNova
Confidence: LOW
- Format: Not publicly documented with distinctive prefix.
- Keywords: sambanova, SAMBANOVA_API_KEY, sambastudio, snapi
- Verify: GET https://api.sambanova.ai/v1/models with Authorization: Bearer {KEY} (inferred OpenAI-compatible)
13. OctoAI
Confidence: LOW
- Format: Not publicly documented. OctoAI was shut down / merged in 2025.
- Keywords: octoai, OCTOAI_TOKEN, octo.ai
- Verify: Mark as placeholder (service may be defunct).
14. Friendli
Confidence: LOW
- Format: Uses "Friendli Token" -- not publicly documented format.
- Keywords: friendli, FRIENDLI_TOKEN, friendli.ai
- Verify: Mark as placeholder.
IMPORTANT: Perplexity Classification Issue
The phase description lists Perplexity as Tier 2, but REQUIREMENTS.md also lists it under Tier 3 (PROV-03). The gitleaks config has a pattern for it: pplx-[a-zA-Z0-9]{48}. Follow the phase description, which includes Perplexity in the Tier 2 list; the PROV-02 requirement list names it among the 14, so Tier 2 is correct for this phase. If Perplexity turns out not to belong in Tier 2, one of the other PROV-02 providers should take its place.
Perplexity (listed in PROV-02)
Confidence: HIGH -- gitleaks verified
- Prefix: pplx-
- gitleaks regex: pplx-[a-zA-Z0-9]{48}
- Total length: 53 characters
- Keywords: pplx-, perplexity, PERPLEXITY_API_KEY
- Verify: GET https://api.perplexity.ai/chat/completions with Authorization: Bearer {KEY} -- but this is a POST endpoint. Use model listing if available.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Regex validation | Custom regex validator | Go regexp.Compile + existing schema validation in UnmarshalYAML | Schema already validates confidence levels; add regex compilation check |
| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data |
| AC automaton rebuild | Manual keyword management | Existing NewRegistry() auto-builds AC from all provider keywords | Just add YAML files; registry handles everything |
| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten |
Common Pitfalls
Pitfall 1: Catastrophic Backtracking in Regex
What goes wrong: In backtracking engines (PCRE and similar), complex regex with nested quantifiers causes exponential backtracking.
Why it happens: Using patterns like (a+)+ or ([A-Za-z0-9]+[-_]?)+.
How to avoid: Go's RE2 engine prevents this by design because it matches in time linear in the input -- but still write simple, linear patterns. Avoid unnecessary alternation groups.
Warning signs: Regex compilation errors in tests.
Pitfall 2: Overly Broad Patterns Without Keywords
What goes wrong: Pattern like [A-Za-z0-9]{32} matches random strings everywhere.
Why it happens: Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
How to avoid: Set confidence to low, require strong keyword list for AC pre-filtering. The scanner will only test regex AFTER AC finds a keyword match.
Warning signs: Hundreds of false positives when scanning any codebase.
Pitfall 3: Forgetting Dual-Location File Sync
What goes wrong: Provider added to providers/ but not pkg/providers/definitions/, or vice versa.
Why it happens: Phase 1 decision to maintain both locations.
How to avoid: Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
Warning signs: go test passes but providers/ directory has different count than pkg/providers/definitions/.
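One possible shape for that verification step is sketched below. It is an assumption-laden illustration, not existing project code: outOfSync and demo are hypothetical names, and the comparison is a plain byte-for-byte check of *.yaml files in the two directories.

```go
package main

import (
	"bytes"
	"os"
	"path/filepath"
	"sort"
)

// outOfSync returns the YAML file names that are missing from one directory
// or whose contents differ between the two (sorted for stable output).
func outOfSync(dirA, dirB string) ([]string, error) {
	names := map[string]bool{}
	for _, dir := range []string{dirA, dirB} {
		matches, err := filepath.Glob(filepath.Join(dir, "*.yaml"))
		if err != nil {
			return nil, err
		}
		for _, m := range matches {
			names[filepath.Base(m)] = true
		}
	}
	var diff []string
	for name := range names {
		a, errA := os.ReadFile(filepath.Join(dirA, name))
		b, errB := os.ReadFile(filepath.Join(dirB, name))
		if errA != nil || errB != nil || !bytes.Equal(a, b) {
			diff = append(diff, name)
		}
	}
	sort.Strings(diff)
	return diff, nil
}

// demo builds two throwaway directories and reports the drift between them.
func demo() ([]string, error) {
	a, err := os.MkdirTemp("", "providers")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(a)
	b, err := os.MkdirTemp("", "definitions")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(b)
	files := []struct{ dir, name, body string }{
		{a, "openai.yaml", "name: openai\n"},
		{b, "openai.yaml", "name: openai\n"},
		{a, "xai.yaml", "name: xai\n"}, // missing from b
		{a, "groq.yaml", "tier: 2\n"},
		{b, "groq.yaml", "tier: 1\n"}, // content differs
	}
	for _, f := range files {
		if err := os.WriteFile(filepath.Join(f.dir, f.name), []byte(f.body), 0o644); err != nil {
			return nil, err
		}
	}
	return outOfSync(a, b)
}
```

In the project this would be called as outOfSync("providers", "pkg/providers/definitions") from a test, failing when the returned slice is non-empty. The demo function reports groq.yaml and xai.yaml as drifted and leaves openai.yaml out.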
Pitfall 4: Invalid YAML Schema
What goes wrong: Missing format_version, empty last_verified, or invalid confidence value.
Why it happens: Copy-paste errors or typos.
How to avoid: The UnmarshalYAML validation catches these. Run go test ./pkg/providers/... after each batch.
Warning signs: Test failures with schema validation errors.
Pitfall 5: Regex Not RE2-Compatible
What goes wrong: Using PCRE features like (?<=prefix) or (?!suffix).
Why it happens: Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
How to avoid: Test every regex with regexp.MustCompile() in Go. Strip boundary assertions that gitleaks adds (like (?:[\x60'"\s;]|\\[nr]|$) -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
Warning signs: regexp.Compile returns error.
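A quick way to see which constructs Go rejects is to compile them directly. The helper name re2Safe below is hypothetical; regexp.Compile is the standard library call and is the only check performed.

```go
package main

import "regexp"

// re2Safe reports whether Go's regexp package (RE2 syntax) accepts pattern.
func re2Safe(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}
```

Under this check, gsk_[a-zA-Z0-9]{52} is accepted, while a lookbehind such as (?<=key=)[a-f0-9]{32}, a negative lookahead such as sk-(?!ant)[a-z]+, and a backreference such as (a)\1 are all rejected at compile time.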
Pitfall 6: Schema Missing Category Field
What goes wrong: CONTEXT.md mentions a category field in the YAML format, but pkg/providers/schema.go Provider struct has no Category field.
Why it happens: CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
How to avoid: Either add Category field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (keyhunter providers list might want category filtering).
Warning signs: YAML has category field that gets silently ignored by Go YAML parser.
Code Examples
Provider YAML File (High-Confidence Prefixed Provider)
# Source: TruffleHog xAI detector (verified)
format_version: 1
name: xai
display_name: xAI
tier: 1
last_verified: "2026-04-05"
keywords:
  - "xai-"
  - "xai"
  - "grok"
patterns:
  - regex: 'xai-[0-9a-zA-Z_]{80}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.x.ai/v1/api-key
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
Provider YAML File (Low-Confidence Generic Provider)
# Low-confidence: no distinctive prefix, requires keyword context
format_version: 1
name: together
display_name: Together AI
tier: 2
last_verified: "2026-04-05"
keywords:
  - "together"
  - "together_api"
  - "api.together.xyz"
  - "togetherai"
  - "TOGETHER_API_KEY"
patterns:
  - regex: '[A-Za-z0-9]{64}'
    entropy_min: 4.0
    confidence: low
verify:
  method: GET
  url: https://api.together.xyz/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
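How a verify block like the ones above might be turned into an HTTP check can be sketched as follows. Everything here is a hypothetical illustration: verifySpec, expand, classify, buildRequest and verify are invented names, and the actual verifier in the project may be structured differently. Only net/http and strings from the standard library are used.

```go
package main

import (
	"net/http"
	"strings"
)

// verifySpec is a hypothetical mirror of the YAML verify block.
type verifySpec struct {
	Method        string
	URL           string
	Headers       map[string]string
	ValidStatus   []int
	InvalidStatus []int
}

// expand substitutes the {KEY} placeholder in a URL or header template.
func expand(tmpl, key string) string {
	return strings.ReplaceAll(tmpl, "{KEY}", key)
}

// classify maps an HTTP status code to "valid", "invalid", or "unknown".
func classify(spec verifySpec, status int) string {
	for _, s := range spec.ValidStatus {
		if s == status {
			return "valid"
		}
	}
	for _, s := range spec.InvalidStatus {
		if s == status {
			return "invalid"
		}
	}
	return "unknown"
}

// buildRequest creates the request with the candidate key filled in.
func buildRequest(spec verifySpec, key string) (*http.Request, error) {
	req, err := http.NewRequest(spec.Method, expand(spec.URL, key), nil)
	if err != nil {
		return nil, err
	}
	for name, value := range spec.Headers {
		req.Header.Set(name, expand(value, key))
	}
	return req, nil
}

// verify sends the request and classifies the response status.
func verify(client *http.Client, spec verifySpec, key string) (string, error) {
	req, err := buildRequest(spec, key)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return classify(spec, resp.StatusCode), nil
}
```

Treating any status outside both lists as "unknown" (for example 429 or 5xx) is a design choice of this sketch; it avoids reporting a key as invalid when the provider is merely rate limiting or unavailable.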
Test Pattern: Verify Regex Compiles
func TestProviderRegexCompiles(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	for _, p := range reg.List() {
		for i, pat := range p.Patterns {
			_, err := regexp.Compile(pat.Regex)
			assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
		}
	}
}
Test Pattern: Verify Provider Count
func TestTier1And2ProviderCount(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	stats := reg.Stats()
	assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
	assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
	assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
}
Test Pattern: AC Pre-Filter Matches Known Key Prefix
func TestACMatchesKnownPrefixes(t *testing.T) {
	reg, err := providers.NewRegistry()
	require.NoError(t, err)
	ac := reg.AC()
	prefixes := []string{
		"sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123",
		"xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123",
	}
	for _, prefix := range prefixes {
		matches := ac.FindAll(prefix)
		assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
	}
}
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| OpenAI sk- prefix only | sk-proj-, sk-svcacct-, sk-None- with T3BlbkFJ marker | 2024 | Must detect all modern prefixes |
| AWS Bedrock via IAM only | Bedrock-specific ABSK API keys | Late 2025 | New key type to detect |
| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this |
| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic AA suffix) | Current | Tighter patterns reduce false positives |
Open Questions
1. Category field in schema
   - What we know: CONTEXT.md mentions category in YAML format, but schema.go has no Category field.
   - What's unclear: Whether to add it now or defer.
   - Recommendation: Add Category string `yaml:"category"` to Provider struct as part of this phase. Values like "frontier", "inference-platform", "specialized" would support CLI-04 filtering.
2. Perplexity tier assignment
   - What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3.
   - What's unclear: Which is canonical.
   - Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3.
3. OpenAI existing pattern needs update
   - What we know: Current openai.yaml only has sk-proj- pattern. Modern keys also use sk-svcacct-, sk-None-, and legacy sk- with T3BlbkFJ marker.
   - Recommendation: Update openai.yaml with additional patterns.
4. Anthropic existing pattern can be tightened
   - What we know: Current pattern sk-ant-api03-[A-Za-z0-9_\-]{93,} does not require the AA suffix.
   - Recommendation: Tighten to sk-ant-api03-[A-Za-z0-9_\-]{93}AA per TruffleHog and add admin key pattern.
5. Defunct/transitioning providers
   - What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
   - Recommendation: Still create YAML definitions for all -- leaked keys from defunct services can still have value if the service was operational.
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | Go standard testing + testify v1.x |
| Config file | go.mod (no separate test config) |
| Quick run command | go test ./pkg/providers/... -v -count=1 |
| Full suite command | go test ./... -v -count=1 |
Phase Requirements -> Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | go test ./pkg/providers/... -run TestTier1Providers -v | Wave 0 |
| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | go test ./pkg/providers/... -run TestTier1Patterns -v | Wave 0 |
| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | go test ./pkg/providers/... -run TestTier2Providers -v | Wave 0 |
| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | go test ./pkg/providers/... -run TestTier2Patterns -v | Wave 0 |
| PROV-01+02 | AC automaton matches all provider keywords | unit | go test ./pkg/providers/... -run TestACMatchesAllKeywords -v | Wave 0 |
| PROV-01+02 | Registry stats show 26+ providers | unit | go test ./pkg/providers/... -run TestProviderCount -v | Wave 0 |
Sampling Rate
- Per task commit: go test ./pkg/providers/... -v -count=1
- Per wave merge: go test ./... -v -count=1
- Phase gate: Full suite green before /gsd:verify-work
Wave 0 Gaps
- pkg/providers/tier1_test.go -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
- pkg/providers/tier2_test.go -- covers PROV-02 (same as above for Tier 2)
- Update pkg/providers/registry_test.go -- update TestRegistryLoad assertion from >= 3 to >= 26
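The sample key matching those test files are meant to cover could follow a table-driven shape like the sketch below. It uses only the prefixed patterns and total lengths documented in this research; the sample type and checkSamples function are hypothetical, and the keys are synthetic filler rather than real credentials.

```go
package main

import (
	"regexp"
	"strings"
)

// sample pairs a documented regex with the documented total key length.
type sample struct {
	name     string
	prefix   string
	regex    string
	totalLen int
}

// samples lists the prefixed formats and lengths quoted in this document.
var samples = []sample{
	{"google-ai", "AIzaSy", `AIzaSy[A-Za-z0-9_-]{33}`, 39},
	{"xai", "xai-", `xai-[0-9a-zA-Z_]{80}`, 84},
	{"groq", "gsk_", `gsk_[a-zA-Z0-9]{52}`, 56},
	{"replicate", "r8_", `r8_[0-9A-Za-z-_]{37}`, 40},
	{"perplexity", "pplx-", `pplx-[a-zA-Z0-9]{48}`, 53},
}

// checkSamples builds a synthetic key of the documented length for each
// entry and returns the names whose regex fails to compile or fails to
// match the whole key.
func checkSamples() []string {
	var failed []string
	for _, s := range samples {
		re, err := regexp.Compile(s.regex)
		key := s.prefix + strings.Repeat("a", s.totalLen-len(s.prefix))
		if err != nil || re.FindString(key) != key {
			failed = append(failed, s.name)
		}
	}
	return failed
}
```

Because each key is built from the stated total length, the check also confirms that the prefix length plus the quantifier adds up to the documented figure (for example 4 + 80 = 84 for xAI).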
Sources
Primary (HIGH confidence)
- TruffleHog OpenAI detector -- regex pattern for OpenAI keys
- TruffleHog Anthropic detector -- regex pattern for Anthropic keys
- TruffleHog Google Gemini detector -- regex pattern for Google AI keys
- TruffleHog Groq detector -- regex pattern gsk_[a-zA-Z0-9]{52}
- TruffleHog xAI detector -- regex pattern xai-[0-9a-zA-Z_]{80}
- TruffleHog Replicate detector -- regex pattern r8_[0-9A-Za-z-_]{37}
- TruffleHog Azure OpenAI detector -- 32-char hex with keyword context
- gitleaks config -- Anthropic, Cohere, AWS Bedrock patterns
Secondary (MEDIUM confidence)
- AWS Bedrock API keys (Wiz blog) -- ABSK prefix documentation
- GitGuardian Groq detector -- confirms prefixed format
- GitGuardian Mistral detector -- confirms "Prefixed: False"
- GitGuardian Fireworks detector -- confirms "Prefixed: False"
- OpenAI community forum -- sk-proj-, sk-svcacct-, sk-None- prefixes
- Replicate docs -- r8_ prefix, 40-char format
- Anyscale docs -- esecret_ prefix
Tertiary (LOW confidence -- needs validation at implementation)
- Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
- Inflection AI key format -- not documented, SDK-based inference only
- AI21 Labs key format -- no prefix confirmed
- Together AI key format -- no distinctive prefix confirmed
- DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
- Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented
Metadata
Confidence breakdown:
- Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
- Architecture: HIGH -- dual-location YAML pattern well-established
- Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
- Provider key formats (Tier 1 less-known): MEDIUM -- Meta, Inflection, AI21 have limited documentation
- Provider key formats (Tier 2 prefixed): HIGH -- Groq, Replicate, Perplexity, Anyscale have clear prefixes
- Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
- Pitfalls: HIGH -- well-understood RE2 and false-positive challenges
Research date: 2026-04-05
Valid until: 2026-05-05 (30 days -- API key formats change infrequently)