Phase 2: Tier 1-2 Providers - Research

Researched: 2026-04-05
Domain: LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
Confidence: MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)

Summary

This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is accuracy of regex patterns and key format data across 26 providers with varying documentation quality.

For providers with distinctive key prefixes (OpenAI, Anthropic, Google AI, xAI, Groq, Replicate), key formats are well-documented, with authoritative regex patterns available from TruffleHog and gitleaks. Many other providers, above all the Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli), lack distinctive key prefixes and use generic opaque token formats. Regex-only detection is unreliable for these, so they require keyword-context-based matching.

Primary recommendation: Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/).

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None -- all implementation choices at Claude's discretion (infrastructure/data phase).

Claude's Discretion

All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.

Deferred Ideas (OUT OF SCOPE)

None -- discuss phase skipped. </user_constraints>

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
| --- | --- | --- |
| PROV-01 | 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21) | Regex patterns documented below for all 12; verify endpoints identified |
| PROV-02 | 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli) | Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers |
</phase_requirements>

Project Constraints (from CLAUDE.md)

  • Go regexp only (RE2) -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
  • Providers in dual locations: providers/ (user-visible) and pkg/providers/definitions/ (Go embed source). Files must be kept in sync.
  • Schema: format_version: 1, name, display_name, tier, last_verified, keywords[], patterns[] (with regex, entropy_min, confidence), verify (with method, url, headers, valid_status, invalid_status).
  • Confidence values: high, medium, low (validated in UnmarshalYAML).
  • Keywords: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.

Standard Stack

No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.

Existing Infrastructure Used

| Component | Location | Purpose |
| --- | --- | --- |
| Provider struct | pkg/providers/schema.go | YAML schema with validation |
| Loader | pkg/providers/loader.go | embed.FS walker for definitions/*.yaml |
| Registry | pkg/providers/registry.go | Provider index + AC automaton build |
| Tests | pkg/providers/registry_test.go | Registry load, get, stats, AC tests |

Architecture Patterns

File Placement (Dual Location)

Every new provider YAML must be placed in BOTH:

providers/{name}.yaml                    # User-visible reference
pkg/providers/definitions/{name}.yaml    # Go embed source (compiled into binary)

YAML Template Pattern

format_version: 1
name: {provider_slug}
display_name: {Display Name}
tier: {1 or 2}
last_verified: "2026-04-05"
keywords:
  - "{prefix_literal}"          # Exact key prefix for AC match
  - "{provider_name_lowercase}" # Provider name for context match
  - "{env_var_hint}"            # Common env var name fragments
patterns:
  - regex: '{RE2_compatible_regex}'
    entropy_min: {3.0-4.0}
    confidence: {high|medium|low}
verify:
  method: {GET|POST}
  url: {lightweight_api_endpoint}
  headers:
    {auth_header}: "{KEY_placeholder}"
  valid_status: [200]
  invalid_status: [401, 403]
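
The verify block of the template can be read as a recipe for a single HTTP request. The following is a minimal Go sketch of that reading, not the project's verifier: the VerifySpec type, its field names, and the BuildRequest and Classify helpers are hypothetical, and the sketch assumes that {KEY} may appear in the URL as well as in header values. It only builds the request and never sends it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// VerifySpec mirrors the verify block of the YAML template.
// Hypothetical type: the real struct lives in pkg/providers/schema.go.
type VerifySpec struct {
	Method        string
	URL           string
	Headers       map[string]string
	ValidStatus   []int
	InvalidStatus []int
}

// BuildRequest substitutes the candidate key for the {KEY} placeholder in the
// URL (query-escaped) and in every header value.
func BuildRequest(spec VerifySpec, key string) (*http.Request, error) {
	target := strings.ReplaceAll(spec.URL, "{KEY}", url.QueryEscape(key))
	req, err := http.NewRequest(spec.Method, target, nil)
	if err != nil {
		return nil, err
	}
	for name, tmpl := range spec.Headers {
		req.Header.Set(name, strings.ReplaceAll(tmpl, "{KEY}", key))
	}
	return req, nil
}

// Classify maps a response status code onto the spec's status lists.
func Classify(spec VerifySpec, status int) string {
	for _, s := range spec.ValidStatus {
		if s == status {
			return "valid"
		}
	}
	for _, s := range spec.InvalidStatus {
		if s == status {
			return "invalid"
		}
	}
	return "unknown"
}

func main() {
	spec := VerifySpec{
		Method:        "GET",
		URL:           "https://api.x.ai/v1/api-key",
		Headers:       map[string]string{"Authorization": "Bearer {KEY}"},
		ValidStatus:   []int{200},
		InvalidStatus: []int{401, 403},
	}
	// The key below is a placeholder, not a real credential.
	req, err := BuildRequest(spec, "xai-EXAMPLE")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Authorization"))
	fmt.Println(Classify(spec, 401))
}
```

Keeping the status lists in data rather than code means a provider with an unusual response (for example 400 for an invalid Google key) needs only a YAML change.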

Confidence Level Assignment Strategy

| Key Format | Confidence | Rationale |
| --- | --- | --- |
| Unique prefix + fixed length (e.g., sk-ant-api03-, gsk_, r8_, xai-) | high | Prefix alone is near-unique; false positive rate extremely low |
| Unique prefix, variable length (e.g., sk-proj-, AIzaSy) | high | Prefix is distinctive enough |
| Short generic prefix + context needed (e.g., a bare sk- prefix, as shown in some Together AI documentation) | medium | Prefix collides with OpenAI legacy; needs keyword context |
| No prefix, opaque token (e.g., UUID, hex string, base64) | low | Requires strong keyword context; high false positive risk without AC pre-filter |
| 32-char hex string (e.g., Azure OpenAI) | low | Extremely generic format; keyword context mandatory |
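
The assignment rules above reduce to a small decision function. This is only an illustrative Go sketch: the KeyFormat type and AssignConfidence function are hypothetical names, and the two booleans are a simplification of the table's first column.

```go
package main

import "fmt"

// KeyFormat summarises what is known about a provider's key shape.
// Hypothetical type, introduced for illustration only.
type KeyFormat struct {
	UniquePrefix  bool // e.g. sk-ant-api03-, gsk_, r8_, xai-
	GenericPrefix bool // short prefix shared with other providers, e.g. sk-
}

// AssignConfidence applies the table: unique prefix -> high,
// generic prefix -> medium, no prefix -> low.
func AssignConfidence(f KeyFormat) string {
	switch {
	case f.UniquePrefix:
		return "high"
	case f.GenericPrefix:
		return "medium"
	default:
		return "low"
	}
}

func main() {
	fmt.Println("groq:", AssignConfidence(KeyFormat{UniquePrefix: true}))
	fmt.Println("azure-openai:", AssignConfidence(KeyFormat{}))
}
```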

Keyword Strategy for Low-Confidence Providers

Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:

  1. Provider name (lowercase)
  2. Common env var fragments (e.g., together_api, baseten_api, modal_token)
  3. API base URL fragments (e.g., api.together.xyz, api.baseten.co)
  4. SDK/config identifiers (e.g., togetherai, deepinfra)
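
To see why the keyword list matters, consider a sketch of keyword gating in Go. It swaps the Aho-Corasick automaton for a plain substring scan, which is slower but behaves the same for this purpose, and it assumes case-insensitive keyword matching by lowercasing the input. The function names and the synthetic token are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Keywords and pattern for a generic-format provider (Together AI example).
var (
	togetherKeywords = []string{"together", "together_api", "api.together.xyz", "togetherai"}
	togetherRe       = regexp.MustCompile(`[A-Za-z0-9]{64}`)
)

// hasKeyword is a naive stand-in for the Aho-Corasick pre-filter.
// Assumption: input is lowercased so lowercase keywords match any casing.
func hasKeyword(text string, keywords []string) bool {
	lower := strings.ToLower(text)
	for _, kw := range keywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

// FindCandidates runs the regex only when a keyword is present in the text.
func FindCandidates(text string, keywords []string, re *regexp.Regexp) []string {
	if !hasKeyword(text, keywords) {
		return nil
	}
	return re.FindAllString(text, -1)
}

// syntheticToken returns an n-character alphanumeric string (not a real key).
func syntheticToken(n int) string {
	return strings.Repeat("aB3x", n/4+1)[:n]
}

func main() {
	tok := syntheticToken(64)
	withContext := FindCandidates("TOGETHER_API_KEY="+tok, togetherKeywords, togetherRe)
	withoutContext := FindCandidates("checksum="+tok, togetherKeywords, togetherRe)
	fmt.Println("with keyword:", len(withContext), "without keyword:", len(withoutContext))
}
```

The same 64-character string is reported only when a Together-related keyword appears nearby, which is the behaviour a low-confidence pattern relies on.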

Anti-Patterns to Avoid

  • Overly broad regex without keyword anchor: A pattern like [A-Za-z0-9]{40} without keywords would match every 40-char alphanumeric string -- useless.
  • PCRE features in regex: Go RE2 does not support lookahead ((?=)), lookbehind ((?<=)), or backreferences. All patterns must be RE2-safe.
  • Hardcoding T3BlbkFJ for non-OpenAI providers: The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
  • Missing dual-location sync: Forgetting to copy YAML to both providers/ and pkg/providers/definitions/.

Provider Key Format Research

Tier 1: Frontier Providers (12)

1. OpenAI

Confidence: HIGH -- TruffleHog verified

  • Prefixes: sk-proj-, sk-svcacct-, sk-None-, legacy sk- (all contain T3BlbkFJ base64 marker)
  • TruffleHog regex: sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+
  • KeyHunter regex (simplified): sk-proj-[A-Za-z0-9_\-]{48,} (existing, covers primary format)
  • Note: Existing openai.yaml only covers sk-proj-. Should add patterns for sk-svcacct- and legacy sk- with T3BlbkFJ marker.
  • Verify: GET https://api.openai.com/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
  • Keywords: sk-proj-, sk-svcacct-, openai, T3BlbkFJ
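
A quick Go check of the TruffleHog expression exactly as quoted above, against made-up sample strings (none are real keys). As quoted, the expression accepts the sk-proj-, sk-svcacct-, and legacy sk- shapes but not sk-None-, because neither alternative allows the hyphen after "None". Whether upstream TruffleHog handles that prefix some other way has not been checked here.

```go
package main

import (
	"fmt"
	"regexp"
)

// The TruffleHog expression as quoted in this document.
var openAIRe = regexp.MustCompile(
	`sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+`)

// MatchesOpenAI reports whether s contains a match for the quoted pattern.
func MatchesOpenAI(s string) bool {
	return openAIRe.MatchString(s)
}

func main() {
	// Synthetic samples: correct shape, shortened bodies.
	samples := []string{
		"sk-proj-abc123T3BlbkFJdef456",
		"sk-svcacct-abc123T3BlbkFJdef456",
		"sk-abc123T3BlbkFJdef456",
		"sk-None-abc123T3BlbkFJdef456",
	}
	for _, s := range samples {
		fmt.Printf("%-34s match=%v\n", s, MatchesOpenAI(s))
	}
}
```

If sk-None- keys need to be detected, a separate pattern entry in openai.yaml would be required.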

2. Anthropic

Confidence: HIGH -- TruffleHog + gitleaks verified

  • Prefixes: sk-ant-api03- (standard), sk-ant-admin01- (admin)
  • TruffleHog regex: sk-ant-(?:admin01|api03)-[\w\-]{93}AA
  • gitleaks regex: sk-ant-api03-[a-zA-Z0-9_\-]{93}AA
  • Note: Existing anthropic.yaml pattern sk-ant-api03-[A-Za-z0-9_\-]{93,} should be tightened to end with AA suffix.
  • Verify: GET https://api.anthropic.com/v1/models with x-api-key: {KEY} + anthropic-version: 2023-06-01 -- 200=valid, 401=invalid
  • Keywords: sk-ant-api03-, sk-ant-admin01-, anthropic
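
The effect of the tightening can be sketched in Go by running the existing pattern and the TruffleHog pattern over synthetic strings of the right length. The sampleKey helper is hypothetical and the strings are not real keys.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Pattern currently in anthropic.yaml.
	looseRe = regexp.MustCompile(`sk-ant-api03-[A-Za-z0-9_\-]{93,}`)
	// TruffleHog pattern with the AA suffix and the admin prefix.
	tightRe = regexp.MustCompile(`sk-ant-(?:admin01|api03)-[\w\-]{93}AA`)
)

// sampleKey builds prefix + 93 filler characters + suffix.
func sampleKey(prefix, suffix string) string {
	return prefix + strings.Repeat("x", 93) + suffix
}

func main() {
	cases := []string{
		sampleKey("sk-ant-api03-", "AA"),   // well-formed standard key shape
		sampleKey("sk-ant-api03-", "zz"),   // right length, wrong suffix
		sampleKey("sk-ant-admin01-", "AA"), // admin key shape
	}
	for _, c := range cases {
		fmt.Printf("%s...%s loose=%v tight=%v\n",
			c[:15], c[len(c)-2:], looseRe.MatchString(c), tightRe.MatchString(c))
	}
}
```

The loose pattern accepts the wrong-suffix string and misses the admin key; the tight pattern does the opposite on both.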

3. Google AI (Gemini)

Confidence: HIGH -- TruffleHog verified

  • Prefix: AIzaSy
  • TruffleHog regex: AIzaSy[A-Za-z0-9_-]{33}
  • Total length: 39 characters
  • Note: Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
  • Verify: GET https://generativelanguage.googleapis.com/v1/models?key={KEY} -- 200=valid, 400/403=invalid
  • Keywords: AIzaSy, google_api, gemini

4. Google Vertex AI

Confidence: MEDIUM

  • Format: Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
  • Approach: For API key mode, reuse AIzaSy pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with "type": "service_account" and "private_key" field).
  • Recommendation: Create a separate vertex-ai provider YAML that focuses on the API key path with the AIzaSy pattern.
  • Service account keys: The private key in a GCP service account JSON starts with -----BEGIN PRIVATE KEY----- -- but this is a general GCP credential, not Vertex-specific, so it stays out of scope for this provider definition.
  • Verify: GET https://generativelanguage.googleapis.com/v1/models?key={KEY} (same endpoint works for Vertex API keys)
  • Keywords: vertex, google_cloud, AIzaSy, vertex_ai

5. AWS Bedrock

Confidence: HIGH -- gitleaks verified

  • Long-lived prefix: ABSK (base64 encodes to BedrockAPIKey-)
  • gitleaks regex (long-lived): ABSK[A-Za-z0-9+/]{109,269}={0,2}
  • Short-lived prefix: bedrock-api-key-YmVkcm9ja (base64 prefix)
  • Also detect: AWS IAM access keys (AKIA[0-9A-Z]{16} + secret) used with Bedrock
  • Recommendation: Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
  • Verify: Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
  • Keywords: ABSK, bedrock, aws_bedrock, AKIA

6. Azure OpenAI

Confidence: MEDIUM -- TruffleHog verified but pattern is generic

  • Format: 32-character lowercase hexadecimal string
  • TruffleHog regex: [a-f0-9]{32} (with keyword context requirement)
  • Problem: 32-char hex is extremely generic. MUST rely on keywords for context.
  • Keywords: azure, openai.azure.com, azure_openai, api-key, cognitive
  • Verify: GET https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01 with api-key: {KEY} -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
  • Entropy min: 3.5 (hex has theoretical max ~4.0)
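
The "theoretical max ~4.0" figure follows from Shannon entropy: a string drawn uniformly from 16 hex symbols carries at most log2(16) = 4 bits per character. A minimal Go sketch follows, assuming entropy_min is compared against per-character Shannon entropy over byte frequencies; how KeyHunter actually computes entropy is not stated in this document.

```go
package main

import (
	"fmt"
	"math"
)

// ShannonEntropy returns the entropy of s in bits per character,
// computed from byte frequencies.
func ShannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// Every hex symbol exactly twice: the best case for a 32-char hex string.
	uniformHex := "0123456789abcdef0123456789abcdef"
	// A degenerate hex string that a 3.5 threshold should reject.
	repeated := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
	fmt.Printf("uniform hex: %.2f\n", ShannonEntropy(uniformHex))
	fmt.Printf("repeated:    %.2f\n", ShannonEntropy(repeated))
}
```

Real 32-character keys rarely use all 16 symbols evenly, so their measured entropy sits somewhat below 4.0, which is why the threshold is set at 3.5 rather than at the maximum.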

7. Meta AI (Llama API)

Confidence: LOW

  • Format: Not publicly documented as of April 2026. Meta Llama API launched April 2025.
  • Env var: META_LLAMA_API_KEY
  • Approach: Generic long token pattern with strong keyword context.
  • Keywords: meta, llama, meta_llama, llama_api
  • Regex: Generic high-entropy alphanumeric pattern, low confidence (no distinctive prefix)
  • Verify: GET https://api.llama.com/v1/models with Authorization: Bearer {KEY} (inferred from OpenAI-compatible API)

8. xAI (Grok)

Confidence: HIGH -- TruffleHog verified

  • Prefix: xai-
  • TruffleHog regex: xai-[0-9a-zA-Z_]{80}
  • Total length: 84 characters
  • Verify: GET https://api.x.ai/v1/api-key with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
  • Keywords: xai-, xai, grok

9. Cohere

Confidence: MEDIUM -- gitleaks verified but pattern requires context

  • Format: 40 alphanumeric characters, no distinctive prefix
  • gitleaks regex: Context-dependent match on cohere or CO_API_KEY keyword + [a-zA-Z0-9]{40}
  • Problem: 40-char alphanumeric overlaps with many other tokens.
  • Keywords: cohere, CO_API_KEY, cohere_api
  • Verify: GET https://api.cohere.ai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
  • Entropy min: 4.0 (higher threshold to reduce false positives)

10. Mistral AI

Confidence: MEDIUM

  • Format: Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
  • Approach: Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
  • Keywords: mistral, MISTRAL_API_KEY, mistral.ai, la_plateforme
  • Verify: GET https://api.mistral.ai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid

11. Inflection AI

Confidence: LOW

  • Format: Not publicly documented. API launched late 2025 via inflection-sdk.
  • Env var: PI_API_KEY, INFLECTION_API_KEY
  • Keywords: inflection, pi_api, PI_API_KEY
  • Verify: Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.

12. AI21 Labs

Confidence: LOW

  • Format: Not publicly documented with distinctive prefix.
  • Env var: AI21_API_KEY
  • Keywords: ai21, AI21_API_KEY, jamba, jurassic
  • Verify: GET https://api.ai21.com/studio/v1/models with Authorization: Bearer {KEY} -- inferred

Tier 2: Inference Platforms (14)

1. Together AI

Confidence: LOW-MEDIUM

  • Format: Appears to use generic opaque tokens. Some documentation shows sk- prefix but this may be example placeholder.
  • Keywords: together, TOGETHER_API_KEY, api.together.xyz, togetherai
  • Verify: GET https://api.together.xyz/v1/models with Authorization: Bearer {KEY}

2. Fireworks AI

Confidence: LOW-MEDIUM

  • Format: GitGuardian confirms "Prefixed: False". Opaque token format.
  • Prefix: fw_ prefix has been reported in some sources but not confirmed by GitGuardian.
  • Keywords: fireworks, FIREWORKS_API_KEY, fireworks.ai, fw_
  • Verify: GET https://api.fireworks.ai/inference/v1/models with Authorization: Bearer {KEY}

3. Groq

Confidence: HIGH -- TruffleHog verified

  • Prefix: gsk_
  • TruffleHog regex: gsk_[a-zA-Z0-9]{52}
  • Total length: 56 characters
  • Verify: GET https://api.groq.com/openai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
  • Keywords: gsk_, groq, GROQ_API_KEY

4. Replicate

Confidence: HIGH -- TruffleHog verified

  • Prefix: r8_
  • TruffleHog regex: r8_[0-9A-Za-z-_]{37}
  • Total length: 40 characters
  • Verify: GET https://api.replicate.com/v1/predictions with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
  • Keywords: r8_, replicate, REPLICATE_API_TOKEN
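
The prefix-plus-fixed-length entries documented so far (xAI, Groq, Replicate) can be cross-checked mechanically: the prefix length plus the quantifier should equal the stated total length, and the pattern should match a synthetic string of exactly that shape. The sketch below is a hypothetical Go helper for that check, using only values quoted in this document.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// prefixed records one documented prefix pattern and its stated lengths.
type prefixed struct {
	name     string
	prefix   string
	pattern  string
	bodyLen  int
	totalLen int
}

var documented = []prefixed{
	{"xai", "xai-", `xai-[0-9a-zA-Z_]{80}`, 80, 84},
	{"groq", "gsk_", `gsk_[a-zA-Z0-9]{52}`, 52, 56},
	{"replicate", "r8_", `r8_[0-9A-Za-z-_]{37}`, 37, 40},
}

// Check compiles the pattern, builds a synthetic key of the documented shape,
// and confirms both the total length and a full match.
func Check(p prefixed) error {
	re, err := regexp.Compile(p.pattern)
	if err != nil {
		return fmt.Errorf("%s: %w", p.name, err)
	}
	sample := p.prefix + strings.Repeat("a", p.bodyLen)
	if len(sample) != p.totalLen {
		return fmt.Errorf("%s: length %d, documented %d", p.name, len(sample), p.totalLen)
	}
	if got := re.FindString(sample); got != sample {
		return fmt.Errorf("%s: pattern did not match the full sample", p.name)
	}
	return nil
}

func main() {
	for _, p := range documented {
		if err := Check(p); err != nil {
			fmt.Println("FAIL", err)
			continue
		}
		fmt.Println("ok  ", p.name)
	}
}
```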

5. Anyscale

Confidence: MEDIUM

  • Prefix: esecret_
  • Regex: esecret_[A-Za-z0-9_-]{20,}
  • Keywords: esecret_, anyscale, ANYSCALE_API_KEY
  • Verify: GET https://api.endpoints.anyscale.com/v1/models with Authorization: Bearer {KEY}
  • Note: Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.

6. DeepInfra

Confidence: LOW-MEDIUM

  • Format: Opaque token. JWT-based scoped tokens use jwt: prefix.
  • Keywords: deepinfra, DEEPINFRA_API_KEY, deepinfra.com
  • Verify: GET https://api.deepinfra.com/v1/openai/models with Authorization: Bearer {KEY}

7. Lepton AI

Confidence: LOW

  • Format: Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
  • Keywords: lepton, LEPTON_API_TOKEN, lepton.ai
  • Verify: Endpoint uncertain. Mark as placeholder.

8. Modal

Confidence: LOW

  • Format: Not publicly documented with distinctive format.
  • Keywords: modal, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, modal.com
  • Note: Modal uses token ID + token secret pair, not a single API key.
  • Verify: Mark as placeholder.

9. Baseten

Confidence: LOW-MEDIUM

  • Format: Uses API keys passed in header.
  • Keywords: baseten, BASETEN_API_KEY, api.baseten.co
  • Verify: GET https://api.baseten.co/v1/models with Authorization: Api-Key {KEY} (non-standard header)

10. Cerebrium

Confidence: LOW

  • Format: Not publicly documented with distinctive format.
  • Keywords: cerebrium, CEREBRIUM_API_KEY, cerebrium.ai
  • Verify: Mark as placeholder.

11. NovitaAI

Confidence: LOW

  • Format: Not publicly documented with distinctive format.
  • Keywords: novita, NOVITA_API_KEY, novita.ai
  • Verify: GET https://api.novita.ai/v3/openai/models with Authorization: Bearer {KEY} (inferred)

12. SambaNova

Confidence: LOW

  • Format: Not publicly documented with distinctive prefix.
  • Keywords: sambanova, SAMBANOVA_API_KEY, sambastudio, snapi
  • Verify: GET https://api.sambanova.ai/v1/models with Authorization: Bearer {KEY} (inferred OpenAI-compatible)

13. OctoAI

Confidence: LOW

  • Format: Not publicly documented. OctoAI was shut down / merged in 2025.
  • Keywords: octoai, OCTOAI_TOKEN, octo.ai
  • Verify: Mark as placeholder (service may be defunct).

14. Friendli

Confidence: LOW

  • Format: Uses "Friendli Token" -- not publicly documented format.
  • Keywords: friendli, FRIENDLI_TOKEN, friendli.ai
  • Verify: Mark as placeholder.

IMPORTANT: Perplexity Classification Issue

The phase description lists Perplexity as Tier 2, but REQUIREMENTS.md lists it under Tier 3 (PROV-03). The gitleaks config has a pattern: pplx-[a-zA-Z0-9]{48}. Follow the phase description which includes it in the Tier 2 list. If Perplexity should not be Tier 2, one of the other 14 Tier 2 providers from PROV-02 should replace it. Looking at the PROV-02 requirement list, Perplexity is listed there as one of the 14. This is correct for this phase.

Perplexity (listed in PROV-02)

Confidence: HIGH -- gitleaks verified

  • Prefix: pplx-
  • gitleaks regex: pplx-[a-zA-Z0-9]{48}
  • Total length: 53 characters
  • Keywords: pplx-, perplexity, PERPLEXITY_API_KEY
  • Verify: POST https://api.perplexity.ai/chat/completions with Authorization: Bearer {KEY} (this endpoint accepts POST only, not GET). Prefer a model-listing endpoint if one is available.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
| --- | --- | --- | --- |
| Regex validation | Custom regex validator | Go regexp.Compile + existing schema validation in UnmarshalYAML | Schema already validates confidence levels; add regex compilation check |
| Key format research | Guessing patterns | TruffleHog/gitleaks source code + GitGuardian detector docs | These tools have validated patterns against real-world data |
| AC automaton rebuild | Manual keyword management | Existing NewRegistry() auto-builds AC from all provider keywords | Just add YAML files; registry handles everything |
| Dual file sync | Manual copy script | Plan tasks should explicitly copy to both locations | Simple but must not be forgotten |

Common Pitfalls

Pitfall 1: Catastrophic Backtracking in Regex

  • What goes wrong: Complex regex with nested quantifiers causes exponential backtracking.
  • Why it happens: Using patterns like (a+)+ or ([A-Za-z0-9]+[-_]?)+.
  • How to avoid: Go RE2 engine prevents this by design -- but still write simple, linear patterns. Avoid unnecessary alternation groups.
  • Warning signs: Regex compilation errors in tests.

Pitfall 2: Overly Broad Patterns Without Keywords

  • What goes wrong: Pattern like [A-Za-z0-9]{32} matches random strings everywhere.
  • Why it happens: Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
  • How to avoid: Set confidence to low, require strong keyword list for AC pre-filtering. The scanner will only test regex AFTER AC finds a keyword match.
  • Warning signs: Hundreds of false positives when scanning any codebase.

Pitfall 3: Forgetting Dual-Location File Sync

  • What goes wrong: Provider added to providers/ but not pkg/providers/definitions/, or vice versa.
  • Why it happens: Phase 1 decision to maintain both locations.
  • How to avoid: Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
  • Warning signs: go test passes but providers/ directory has different count than pkg/providers/definitions/.
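
One possible shape for the directory comparison step is sketched below in Go. ReadYAMLDir and DiffDefinitions are hypothetical helpers, not existing project code; the sketch assumes the two directories named above are flat and that files must be byte-identical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// ReadYAMLDir loads every *.yaml file directly inside dir, keyed by file name.
func ReadYAMLDir(dir string) (map[string][]byte, error) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.yaml"))
	if err != nil {
		return nil, err
	}
	files := make(map[string][]byte, len(paths))
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return nil, err
		}
		files[filepath.Base(p)] = data
	}
	return files, nil
}

// DiffDefinitions returns the sorted names of files that are missing from
// either side or whose contents differ.
func DiffDefinitions(a, b map[string][]byte) []string {
	var out []string
	for name, data := range a {
		other, ok := b[name]
		if !ok || !bytes.Equal(data, other) {
			out = append(out, name)
		}
	}
	for name := range b {
		if _, ok := a[name]; !ok {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	userVisible, err := ReadYAMLDir("providers")
	if err != nil {
		panic(err)
	}
	embedded, err := ReadYAMLDir("pkg/providers/definitions")
	if err != nil {
		panic(err)
	}
	if drift := DiffDefinitions(userVisible, embedded); len(drift) > 0 {
		fmt.Println("out of sync:", drift)
		os.Exit(1)
	}
	fmt.Println("in sync")
}
```

Run from the repository root, a non-zero exit would flag drift before it reaches a commit.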

Pitfall 4: Invalid YAML Schema

  • What goes wrong: Missing format_version, empty last_verified, or invalid confidence value.
  • Why it happens: Copy-paste errors or typos.
  • How to avoid: The UnmarshalYAML validation catches these. Run go test ./pkg/providers/... after each batch.
  • Warning signs: Test failures with schema validation errors.
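
For orientation, the three failure modes named here correspond to checks of roughly the following shape. This is a hypothetical Go sketch, not the contents of schema.go: the Provider and Pattern types are simplified stand-ins, and the real validation runs inside UnmarshalYAML.

```go
package main

import (
	"errors"
	"fmt"
)

// Pattern and Provider are simplified, hypothetical mirrors of the YAML schema.
type Pattern struct {
	Regex      string
	EntropyMin float64
	Confidence string
}

type Provider struct {
	FormatVersion int
	Name          string
	LastVerified  string
	Patterns      []Pattern
}

// Validate rejects the three mistakes listed in this pitfall.
func Validate(p Provider) error {
	if p.FormatVersion == 0 {
		return errors.New("format_version is missing")
	}
	if p.LastVerified == "" {
		return errors.New("last_verified is empty")
	}
	for i, pat := range p.Patterns {
		switch pat.Confidence {
		case "high", "medium", "low":
			// accepted values
		default:
			return fmt.Errorf("pattern %d: invalid confidence %q", i, pat.Confidence)
		}
	}
	return nil
}

func main() {
	p := Provider{
		FormatVersion: 1,
		Name:          "groq",
		LastVerified:  "2026-04-05",
		Patterns:      []Pattern{{Regex: `gsk_[a-zA-Z0-9]{52}`, EntropyMin: 3.5, Confidence: "hihg"}},
	}
	fmt.Println(Validate(p)) // the typo in confidence is reported
}
```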

Pitfall 5: Regex Not RE2-Compatible

  • What goes wrong: Using PCRE features like (?<=prefix) or (?!suffix).
  • Why it happens: Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
  • How to avoid: Test every regex with regexp.MustCompile() in Go. Strip boundary assertions that gitleaks adds (like (?:[\x60'"\s;]|\\[nr]|$) -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
  • Warning signs: regexp.Compile returns error.
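
A short Go demonstration of the compile check: Go's regexp package rejects lookbehind, lookahead, and backreferences at compile time, so a failed regexp.Compile is a reliable signal. The RE2Error helper name is hypothetical; the sample patterns are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// RE2Error returns nil when the pattern compiles under Go's regexp package,
// or the compile error otherwise.
func RE2Error(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	patterns := []string{
		`gsk_[a-zA-Z0-9]{52}`,     // plain character class and counted repeat
		`(?<=key=)[a-f0-9]{32}`,   // lookbehind
		`sk-(?!test)[A-Za-z0-9]+`, // negative lookahead
		`([a-z])\1`,               // backreference
	}
	for _, p := range patterns {
		if err := RE2Error(p); err != nil {
			fmt.Printf("REJECT %-28s %v\n", p, err)
			continue
		}
		fmt.Printf("OK     %s\n", p)
	}
}
```

Using regexp.Compile rather than regexp.MustCompile in the check lets a test report every bad pattern instead of panicking on the first one.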

Pitfall 6: Schema Missing Category Field

  • What goes wrong: CONTEXT.md mentions a category field in the YAML format, but pkg/providers/schema.go Provider struct has no Category field.
  • Why it happens: CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
  • How to avoid: Either add Category field to schema.go in this phase, or omit it from YAML files. Recommend adding it since it is mentioned in CLI-04 (keyhunter providers list might want category filtering).
  • Warning signs: YAML has category field that gets silently ignored by Go YAML parser.

Code Examples

Provider YAML File (High-Confidence Prefixed Provider)

# Source: TruffleHog xAI detector (verified)
format_version: 1
name: xai
display_name: xAI
tier: 1
last_verified: "2026-04-05"
keywords:
  - "xai-"
  - "xai"
  - "grok"
patterns:
  - regex: 'xai-[0-9a-zA-Z_]{80}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.x.ai/v1/api-key
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]

Provider YAML File (Low-Confidence Generic Provider)

# Low-confidence: no distinctive prefix, requires keyword context
format_version: 1
name: together
display_name: Together AI
tier: 2
last_verified: "2026-04-05"
keywords:
  - "together"
  - "together_api"
  - "api.together.xyz"
  - "togetherai"
  - "TOGETHER_API_KEY"
patterns:
  - regex: '[A-Za-z0-9]{64}'
    entropy_min: 4.0
    confidence: low
verify:
  method: GET
  url: https://api.together.xyz/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]

Test Pattern: Verify Regex Compiles

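// Imports needed by the test patterns in this section: regexp, testing,
// github.com/stretchr/testify/assert, github.com/stretchr/testify/require,
// and the project's pkg/providers package.

// TestProviderRegexCompiles fails if any provider pattern is not valid RE2.
func TestProviderRegexCompiles(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    for _, p := range reg.List() {
        for i, pat := range p.Patterns {
            _, err := regexp.Compile(pat.Regex)
            assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
        }
    }
}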

Test Pattern: Verify Provider Count

func TestTier1And2ProviderCount(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    stats := reg.Stats()
    assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
    assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
    assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
}

Test Pattern: AC Pre-Filter Matches Known Key Prefix

func TestACMatchesKnownPrefixes(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    ac := reg.AC()
    
    prefixes := []string{
        "sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123",
        "xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123",
    }
    for _, prefix := range prefixes {
        matches := ac.FindAll(prefix)
        assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
    }
}

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
| --- | --- | --- | --- |
| OpenAI sk- prefix only | sk-proj-, sk-svcacct-, sk-None- with T3BlbkFJ marker | 2024 | Must detect all modern prefixes |
| AWS Bedrock via IAM only | Bedrock-specific ABSK API keys | Late 2025 | New key type to detect |
| Regex-only detection | Regex + keyword pre-filter + entropy | Current best practice | KeyHunter architecture already supports this |
| Simple prefix match | TruffleHog suffix markers (e.g., Anthropic AA suffix) | Current | Tighter patterns reduce false positives |

Open Questions

  1. Category field in schema

    • What we know: CONTEXT.md mentions category in YAML format, but schema.go has no Category field.
    • What's unclear: Whether to add it now or defer.
    • Recommendation: Add Category string `yaml:"category"` to the Provider struct as part of this phase. Values like "frontier", "inference-platform", "specialized" would support CLI-04 filtering.
  2. Perplexity tier assignment

    • What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3.
    • What's unclear: Which is canonical.
    • Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3.
  3. OpenAI existing pattern needs update

    • What we know: Current openai.yaml only has sk-proj- pattern. Modern keys also use sk-svcacct-, sk-None-, and legacy sk- with T3BlbkFJ marker.
    • Recommendation: Update openai.yaml with additional patterns.
  4. Anthropic existing pattern can be tightened

    • What we know: Current pattern sk-ant-api03-[A-Za-z0-9_\-]{93,} does not require the AA suffix.
    • Recommendation: Tighten to sk-ant-api03-[A-Za-z0-9_\-]{93}AA per TruffleHog and add admin key pattern.
  5. Defunct/transitioning providers

    • What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
    • Recommendation: Still create YAML definitions for all -- leaked keys from defunct services can still have value if the service was operational.

Validation Architecture

Test Framework

| Property | Value |
| --- | --- |
| Framework | Go standard testing + testify v1.x |
| Config file | go.mod (no separate test config) |
| Quick run command | go test ./pkg/providers/... -v -count=1 |
| Full suite command | go test ./... -v -count=1 |

Phase Requirements -> Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
| --- | --- | --- | --- | --- |
| PROV-01 | 12 Tier 1 providers loaded with correct names/tiers | unit | go test ./pkg/providers/... -run TestTier1Providers -v | Wave 0 |
| PROV-01 | Each Tier 1 provider regex compiles and matches sample key | unit | go test ./pkg/providers/... -run TestTier1Patterns -v | Wave 0 |
| PROV-02 | 14 Tier 2 providers loaded with correct names/tiers | unit | go test ./pkg/providers/... -run TestTier2Providers -v | Wave 0 |
| PROV-02 | Each Tier 2 provider regex compiles and matches sample key | unit | go test ./pkg/providers/... -run TestTier2Patterns -v | Wave 0 |
| PROV-01+02 | AC automaton matches all provider keywords | unit | go test ./pkg/providers/... -run TestACMatchesAllKeywords -v | Wave 0 |
| PROV-01+02 | Registry stats show 26+ providers | unit | go test ./pkg/providers/... -run TestTier1And2ProviderCount -v | Wave 0 |

Sampling Rate

  • Per task commit: go test ./pkg/providers/... -v -count=1
  • Per wave merge: go test ./... -v -count=1
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • pkg/providers/tier1_test.go -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
  • pkg/providers/tier2_test.go -- covers PROV-02 (same as above for Tier 2)
  • Update pkg/providers/registry_test.go -- update TestRegistryLoad assertion from >= 3 to >= 26

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence -- needs validation at implementation)

  • Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
  • Inflection AI key format -- not documented, SDK-based inference only
  • AI21 Labs key format -- no prefix confirmed
  • Together AI key format -- no distinctive prefix confirmed
  • DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
  • Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented

Metadata

Confidence breakdown:

  • Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
  • Architecture: HIGH -- dual-location YAML pattern well-established
  • Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
  • Provider key formats (Tier 1 less-known): LOW -- Meta, Inflection, AI21 have limited documentation
  • Provider key formats (Tier 2 prefixed): HIGH -- Groq, Replicate, Perplexity have clear, tool-verified prefixes; Anyscale has a clear prefix but is MEDIUM
  • Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
  • Pitfalls: HIGH -- well-understood RE2 and false-positive challenges

Research date: 2026-04-05
Valid until: 2026-05-05 (30 days -- API key formats change infrequently)