
salvacybersec b8c69cba7e docs(02): research phase domain - provider key formats and regex patterns

2026-04-05 13:06:03 +03:00

Phase 2: Tier 1-2 Providers - Research

Researched: 2026-04-05
Domain: LLM/AI Provider API Key Formats, Regex Patterns, Verification Endpoints
Confidence: MEDIUM (well-documented providers are HIGH; lesser-known Tier 2 providers are LOW-MEDIUM)

Summary

This phase requires creating 26 provider YAML definitions (12 Tier 1 + 14 Tier 2) following the established schema pattern from Phase 1. The core challenge is not Go code -- the infrastructure (schema, loader, registry, AC automaton) already exists and works. The challenge is accuracy of regex patterns and key format data across 26 providers with varying documentation quality.

For the well-documented providers (OpenAI, Anthropic, Google AI, xAI, and Cohere in Tier 1; Groq in Tier 2), key formats are available as authoritative regex patterns from TruffleHog and gitleaks. Many Tier 2 inference platforms (Together, Fireworks, Lepton, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli) lack distinctive key prefixes and use generic opaque token formats, which makes regex-only detection harder and requires keyword-context-based matching.

Primary recommendation: Create all 26 YAML files using the established schema. For well-prefixed providers, use HIGH confidence patterns. For generic-format providers, use MEDIUM/LOW confidence patterns with strong keyword lists for AC pre-filtering. File placement follows the dual-location pattern established in Phase 1 (providers/ and pkg/providers/definitions/).

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None -- all implementation choices at Claude's discretion (infrastructure/data phase).

Claude's Discretion

All implementation choices are at Claude's discretion -- pure infrastructure/data phase. Use ROADMAP phase goal, success criteria, and existing provider YAML schema (providers/openai.yaml, providers/anthropic.yaml, providers/huggingface.yaml) as templates.

Deferred Ideas (OUT OF SCOPE)

None -- discuss phase skipped.

</user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
PROV-01	12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21)	Regex patterns documented below for all 12; verify endpoints identified
PROV-02	14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, Sambanova, OctoAI, Friendli)	Regex patterns documented for prefixed providers (Groq, Replicate, Anyscale, Perplexity); keyword-context approach recommended for generic-format providers
</phase_requirements>

Project Constraints (from CLAUDE.md)

Go regexp only (RE2) -- no PCRE/regexp2. All patterns must be RE2-safe (no lookahead/lookbehind, no backreferences).
Providers in dual locations: providers/ (user-visible) and pkg/providers/definitions/ (Go embed source). Files must be kept in sync.
Schema: format_version: 1, name, display_name, tier, last_verified, keywords[], patterns[] (with regex, entropy_min, confidence), verify (with method, url, headers, valid_status, invalid_status).
Confidence values: high, medium, low (validated in UnmarshalYAML).
Keywords: lowercase, used for Aho-Corasick pre-filtering via DFA automaton.

Standard Stack

No new libraries required. This phase is pure YAML data file creation using the existing infrastructure from Phase 1.

Existing Infrastructure Used

Component	Location	Purpose
Provider struct	`pkg/providers/schema.go`	YAML schema with validation
Loader	`pkg/providers/loader.go`	embed.FS walker for `definitions/*.yaml`
Registry	`pkg/providers/registry.go`	Provider index + AC automaton build
Tests	`pkg/providers/registry_test.go`	Registry load, get, stats, AC tests

Architecture Patterns

File Placement (Dual Location)

Every new provider YAML must be placed in BOTH:

providers/{name}.yaml                    # User-visible reference
pkg/providers/definitions/{name}.yaml    # Go embed source (compiled into binary)

YAML Template Pattern

format_version: 1
name: {provider_slug}
display_name: {Display Name}
tier: {1 or 2}
last_verified: "2026-04-05"
keywords:
  - "{prefix_literal}"          # Exact key prefix for AC match
  - "{provider_name_lowercase}" # Provider name for context match
  - "{env_var_hint}"            # Common env var name fragments
patterns:
  - regex: '{RE2_compatible_regex}'
    entropy_min: {3.0-4.0}
    confidence: {high|medium|low}
verify:
  method: {GET|POST}
  url: {lightweight_api_endpoint}
  headers:
    {auth_header}: "{KEY_placeholder}"
  valid_status: [200]
  invalid_status: [401, 403]
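
As a minimal sketch of how a verify block like the one in this template might be consumed, the Go code below substitutes the candidate key for the {KEY} placeholder and maps a response status onto valid/invalid/unknown. The VerifySpec type, BuildRequest, and Classify are hypothetical names used for illustration only; they are not taken from pkg/providers/schema.go, and the example URL is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// VerifySpec mirrors the verify block of the YAML template.
// Hypothetical type for illustration; not the project's actual schema.
type VerifySpec struct {
	Method        string
	URL           string
	Headers       map[string]string
	ValidStatus   []int
	InvalidStatus []int
}

// BuildRequest substitutes the candidate key for the {KEY} placeholder
// in the URL and in every header value. It does not send the request.
func BuildRequest(spec VerifySpec, key string) (*http.Request, error) {
	url := strings.ReplaceAll(spec.URL, "{KEY}", key)
	req, err := http.NewRequest(spec.Method, url, nil)
	if err != nil {
		return nil, err
	}
	for name, value := range spec.Headers {
		req.Header.Set(name, strings.ReplaceAll(value, "{KEY}", key))
	}
	return req, nil
}

// Classify maps an HTTP status code onto "valid", "invalid", or "unknown".
func Classify(spec VerifySpec, status int) string {
	for _, s := range spec.ValidStatus {
		if s == status {
			return "valid"
		}
	}
	for _, s := range spec.InvalidStatus {
		if s == status {
			return "invalid"
		}
	}
	return "unknown"
}

func main() {
	spec := VerifySpec{
		Method:        "GET",
		URL:           "https://example.invalid/v1/models", // placeholder endpoint
		Headers:       map[string]string{"Authorization": "Bearer {KEY}"},
		ValidStatus:   []int{200},
		InvalidStatus: []int{401, 403},
	}
	req, err := BuildRequest(spec, "synthetic-key")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Authorization"))
	fmt.Println(Classify(spec, 401))
}
```

Treating any status outside both lists as "unknown" keeps rate limits and server errors from being reported as invalid keys.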

Confidence Level Assignment Strategy

Key Format	Confidence	Rationale
Unique prefix + fixed length (e.g., `sk-ant-api03-`, `AIzaSy`, `gsk_`, `r8_`, `xai-`)	high	Prefix alone is near-unique; false positive rate extremely low
Unique prefix, variable length (e.g., `sk-proj-`)	high	Prefix is distinctive enough
Short generic prefix + context needed (e.g., a bare `sk-`)	medium	Prefix collides with OpenAI legacy keys; needs keyword context
No prefix, opaque token (e.g., UUID, hex string, base64)	low	Requires strong keyword context; high false positive risk without AC pre-filter
32-char hex string (e.g., Azure OpenAI)	low	Extremely generic format; keyword context mandatory

Keyword Strategy for Low-Confidence Providers

Providers with generic key formats (no distinctive prefix) MUST have rich keyword lists for the Aho-Corasick pre-filter to work effectively. Include:

Provider name (lowercase)
Common env var fragments (e.g., together_api, baseten_api, modal_token)
API base URL fragments (e.g., api.together.xyz, api.baseten.co)
SDK/config identifiers (e.g., togetherai, deepinfra)

Anti-Patterns to Avoid

Overly broad regex without keyword anchor: A pattern like [A-Za-z0-9]{40} without keywords would match every 40-char alphanumeric string -- useless.
PCRE features in regex: Go RE2 does not support lookahead ((?=)), lookbehind ((?<=)), or backreferences. All patterns must be RE2-safe.
Hardcoding T3BlbkFJ for non-OpenAI providers: The base64 "OpenAI" magic string is OpenAI-specific; do not use for other providers.
Missing dual-location sync: Forgetting to copy YAML to both providers/ and pkg/providers/definitions/.
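
A quick way to catch the PCRE anti-pattern is to let Go's regexp package reject the pattern at compile time. The sketch below uses only the standard library; re2Safe is a hypothetical helper name, and the candidate patterns are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// re2Safe reports whether Go's regexp package (RE2 syntax) accepts the pattern.
func re2Safe(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	candidates := []string{
		`gsk_[a-zA-Z0-9]{52}`,   // plain character class: accepted
		`(?=sk-)[a-z0-9-]+`,     // lookahead: rejected
		`(?<=key=)[a-f0-9]{32}`, // lookbehind: rejected
		`([a-z])\1`,             // backreference: rejected
	}
	for _, c := range candidates {
		fmt.Printf("%-26s re2Safe=%v\n", c, re2Safe(c))
	}
}
```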

Provider Key Format Research

Tier 1: Frontier Providers (12)

1. OpenAI

Confidence: HIGH -- TruffleHog verified

Prefixes: sk-proj-, sk-svcacct-, sk-None-, legacy sk- (all contain T3BlbkFJ base64 marker)
TruffleHog regex: sk-(?:(?:proj|svcacct|service)-[A-Za-z0-9_-]+|[a-zA-Z0-9]+)T3BlbkFJ[A-Za-z0-9_-]+
KeyHunter regex (simplified): sk-proj-[A-Za-z0-9_\-]{48,} (existing, covers primary format)
Note: Existing openai.yaml only covers sk-proj-. Should add patterns for sk-svcacct- and legacy sk- with T3BlbkFJ marker.
Verify: GET https://api.openai.com/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
Keywords: sk-proj-, sk-svcacct-, openai, T3BlbkFJ

2. Anthropic

Confidence: HIGH -- TruffleHog + gitleaks verified

Prefixes: sk-ant-api03- (standard), sk-ant-admin01- (admin)
TruffleHog regex: sk-ant-(?:admin01|api03)-[\w\-]{93}AA
gitleaks regex: sk-ant-api03-[a-zA-Z0-9_\-]{93}AA
Note: Existing anthropic.yaml pattern sk-ant-api03-[A-Za-z0-9_\-]{93,} should be tightened to end with AA suffix.
Verify: GET https://api.anthropic.com/v1/models with x-api-key: {KEY} + anthropic-version: 2023-06-01 -- 200=valid, 401=invalid
Keywords: sk-ant-api03-, sk-ant-admin01-, anthropic

3. Google AI (Gemini)

Confidence: HIGH -- TruffleHog verified

Prefix: AIzaSy
TruffleHog regex: AIzaSy[A-Za-z0-9_-]{33}
Total length: 39 characters
Note: Same key format as all Google API keys. No trailing word boundary needed (keys can end with hyphen).
Verify: GET https://generativelanguage.googleapis.com/v1/models?key={KEY} -- 200=valid, 400/403=invalid
Keywords: AIzaSy, google_api, gemini

4. Google Vertex AI

Confidence: MEDIUM

Format: Uses Google Cloud service account JSON key files (not a simple API key string) OR standard Google API keys (AIzaSy format, same as #3).
Approach: For API key mode, reuse AIzaSy pattern (same as Google AI). Service account JSON key detection is a separate concern (JSON file with "type": "service_account" and "private_key" field).
Recommendation: Create a separate vertex-ai provider YAML that focuses on the API key path with the AIzaSy pattern. Do not add a service account private key regex to this provider.
Service account key regex: The private key in a GCP service account JSON starts with -----BEGIN PRIVATE KEY----- -- but this is a general GCP credential, not Vertex-specific, so it is out of scope for this provider.
Verify: GET https://generativelanguage.googleapis.com/v1/models?key={KEY} (same endpoint works for Vertex API keys)
Keywords: vertex, google_cloud, AIzaSy, vertex_ai

5. AWS Bedrock

Confidence: HIGH -- gitleaks verified

Long-lived prefix: ABSK (the base64 payload that follows decodes to a string beginning with BedrockAPIKey-)
gitleaks regex (long-lived): ABSK[A-Za-z0-9+/]{109,269}={0,2}
Short-lived prefix: bedrock-api-key-YmVkcm9ja (base64 prefix)
Also detect: AWS IAM access keys (AKIA[0-9A-Z]{16} + secret) used with Bedrock
Recommendation: Two patterns: (1) Bedrock-specific ABSK prefix, (2) AWS AKIA (general, shared with other AWS services)
Verify: Cannot verify with a simple HTTP call -- AWS Bedrock requires SigV4 signing. Mark verify as empty/placeholder.
Keywords: ABSK, bedrock, aws_bedrock, AKIA
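
Why a base64-encoded payload produces a stable, greppable prefix can be shown with a few lines of Go. This is a sketch that assumes the fragments in the bullets above are base64 encodings of the literal strings "bedrock" and "BedrockAPIKey-"; stablePrefix is a hypothetical helper.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// stablePrefix returns the leading base64 characters of s that stay the same
// no matter which bytes follow s in the encoded payload. Each base64
// character carries 6 bits, so len(s)*8/6 characters are fully determined.
func stablePrefix(s string) string {
	enc := base64.StdEncoding.EncodeToString([]byte(s))
	return enc[:len(s)*8/6]
}

func main() {
	// The short-lived key fragment quoted above.
	fmt.Println(stablePrefix("bedrock"))
	// The string the long-lived payload is said to decode to.
	fmt.Println(stablePrefix("BedrockAPIKey-"))
}
```

The first line prints YmVkcm9ja, which is the fragment listed for short-lived keys.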

6. Azure OpenAI

Confidence: MEDIUM -- TruffleHog verified but pattern is generic

Format: 32-character lowercase hexadecimal string
TruffleHog regex: [a-f0-9]{32} (with keyword context requirement)
Problem: 32-char hex is extremely generic. MUST rely on keywords for context.
Keywords: azure, openai.azure.com, azure_openai, api-key, cognitive
Verify: GET https://{resource}.openai.azure.com/openai/deployments?api-version=2024-02-01 with api-key: {KEY} -- but requires resource name. Cannot generically verify. Mark verify as placeholder.
Entropy min: 3.5 (hex has theoretical max ~4.0)

7. Meta AI (Llama API)

Confidence: LOW

Format: Not publicly documented as of April 2026. Meta Llama API launched April 2025.
Env var: META_LLAMA_API_KEY
Approach: Generic long token pattern with strong keyword context.
Keywords: meta, llama, meta_llama, llama_api
Regex: Generic high-entropy alphanumeric pattern, low confidence (no distinctive prefix, so the confidence assignment strategy above applies)
Verify: GET https://api.llama.com/v1/models with Authorization: Bearer {KEY} (inferred from OpenAI-compatible API)

8. xAI (Grok)

Confidence: HIGH -- TruffleHog verified

Prefix: xai-
TruffleHog regex: xai-[0-9a-zA-Z_]{80}
Total length: 84 characters
Verify: GET https://api.x.ai/v1/api-key with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
Keywords: xai-, xai, grok

9. Cohere

Confidence: MEDIUM -- gitleaks verified but pattern requires context

Format: 40 alphanumeric characters, no distinctive prefix
gitleaks regex: Context-dependent match on cohere or CO_API_KEY keyword + [a-zA-Z0-9]{40}
Problem: 40-char alphanumeric overlaps with many other tokens.
Keywords: cohere, CO_API_KEY, cohere_api
Verify: GET https://api.cohere.ai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
Entropy min: 4.0 (higher threshold to reduce false positives)

10. Mistral AI

Confidence: MEDIUM

Format: Not prefixed (GitGuardian confirms "Prefixed: False"). Opaque token.
Approach: Keyword-context match. Mistral keys appear to be 32-char alphanumeric or UUID-format.
Keywords: mistral, MISTRAL_API_KEY, mistral.ai, la_plateforme
Verify: GET https://api.mistral.ai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid

11. Inflection AI

Confidence: LOW

Format: Not publicly documented. API launched late 2025 via inflection-sdk.
Env var: PI_API_KEY, INFLECTION_API_KEY
Keywords: inflection, pi_api, PI_API_KEY
Verify: Endpoint unclear -- inferred OpenAI-compatible pattern. Mark as placeholder.

12. AI21 Labs

Confidence: LOW

Format: Not publicly documented with distinctive prefix.
Env var: AI21_API_KEY
Keywords: ai21, AI21_API_KEY, jamba, jurassic
Verify: GET https://api.ai21.com/studio/v1/models with Authorization: Bearer {KEY} -- inferred

Tier 2: Inference Platforms (14)

1. Together AI

Confidence: LOW-MEDIUM

Format: Appears to use generic opaque tokens. Some documentation shows sk- prefix but this may be example placeholder.
Keywords: together, TOGETHER_API_KEY, api.together.xyz, togetherai
Verify: GET https://api.together.xyz/v1/models with Authorization: Bearer {KEY}

2. Fireworks AI

Confidence: LOW-MEDIUM

Format: GitGuardian confirms "Prefixed: False". Opaque token format.
Prefix: fw_ prefix has been reported in some sources but not confirmed by GitGuardian.
Keywords: fireworks, FIREWORKS_API_KEY, fireworks.ai, fw_
Verify: GET https://api.fireworks.ai/inference/v1/models with Authorization: Bearer {KEY}

3. Groq

Confidence: HIGH -- TruffleHog verified

Prefix: gsk_
TruffleHog regex: gsk_[a-zA-Z0-9]{52}
Total length: 56 characters
Verify: GET https://api.groq.com/openai/v1/models with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
Keywords: gsk_, groq, GROQ_API_KEY

4. Replicate

Confidence: HIGH -- TruffleHog verified

Prefix: r8_
TruffleHog regex: r8_[0-9A-Za-z-_]{37}
Total length: 40 characters
Verify: GET https://api.replicate.com/v1/predictions with Authorization: Bearer {KEY} -- 200=valid, 401=invalid
Keywords: r8_, replicate, REPLICATE_API_TOKEN

5. Anyscale

Confidence: MEDIUM

Prefix: esecret_
Regex: esecret_[A-Za-z0-9_-]{20,}
Keywords: esecret_, anyscale, ANYSCALE_API_KEY
Verify: GET https://api.endpoints.anyscale.com/v1/models with Authorization: Bearer {KEY}
Note: Anyscale Endpoints are being deprecated in favor of their newer platform. Still worth detecting.

6. DeepInfra

Confidence: LOW-MEDIUM

Format: Opaque token. JWT-based scoped tokens use jwt: prefix.
Keywords: deepinfra, DEEPINFRA_API_KEY, deepinfra.com
Verify: GET https://api.deepinfra.com/v1/openai/models with Authorization: Bearer {KEY}

7. Lepton AI

Confidence: LOW

Format: Not publicly documented. NVIDIA acquired Lepton AI April 2025 (rebranded to DGX Cloud Lepton).
Keywords: lepton, LEPTON_API_TOKEN, lepton.ai
Verify: Endpoint uncertain. Mark as placeholder.

8. Modal

Confidence: LOW

Format: Not publicly documented with distinctive format.
Keywords: modal, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, modal.com
Note: Modal uses token ID + token secret pair, not a single API key.
Verify: Mark as placeholder.

9. Baseten

Confidence: LOW-MEDIUM

Format: Uses API keys passed in header.
Keywords: baseten, BASETEN_API_KEY, api.baseten.co
Verify: GET https://api.baseten.co/v1/models with Authorization: Api-Key {KEY} (non-standard header)

10. Cerebrium

Confidence: LOW

Format: Not publicly documented with distinctive format.
Keywords: cerebrium, CEREBRIUM_API_KEY, cerebrium.ai
Verify: Mark as placeholder.

11. NovitaAI

Confidence: LOW

Format: Not publicly documented with distinctive format.
Keywords: novita, NOVITA_API_KEY, novita.ai
Verify: GET https://api.novita.ai/v3/openai/models with Authorization: Bearer {KEY} (inferred)

12. SambaNova

Confidence: LOW

Format: Not publicly documented with distinctive prefix.
Keywords: sambanova, SAMBANOVA_API_KEY, sambastudio, snapi
Verify: GET https://api.sambanova.ai/v1/models with Authorization: Bearer {KEY} (inferred OpenAI-compatible)

13. OctoAI

Confidence: LOW

Format: Not publicly documented. OctoAI was shut down / merged in 2025.
Keywords: octoai, OCTOAI_TOKEN, octo.ai
Verify: Mark as placeholder (service may be defunct).

14. Friendli

Confidence: LOW

Format: Uses "Friendli Token" -- not publicly documented format.
Keywords: friendli, FRIENDLI_TOKEN, friendli.ai
Verify: Mark as placeholder.

IMPORTANT: Perplexity Classification Issue

The phase description lists Perplexity as Tier 2, but REQUIREMENTS.md also lists it under Tier 3 (PROV-03). The gitleaks config has a pattern for it: pplx-[a-zA-Z0-9]{48}. Follow the phase description and treat Perplexity as Tier 2 for this phase. Note that the PROV-02 row in the Phase Requirements table above names 14 providers without Perplexity, so confirm against REQUIREMENTS.md whether Perplexity is an addition to those 14 or replaces one of them.

Perplexity (listed in PROV-02)

Confidence: HIGH -- gitleaks verified

Prefix: pplx-
gitleaks regex: pplx-[a-zA-Z0-9]{48}
Total length: 53 characters
Keywords: pplx-, perplexity, PERPLEXITY_API_KEY
Verify: https://api.perplexity.ai/chat/completions with Authorization: Bearer {KEY} -- but this is a POST endpoint, so a GET probe will not work. Use a model listing endpoint if one is available.
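
The "Total length" figures quoted for the fixed-length prefixed providers are prefix length plus the body quantifier. The sketch below re-checks that arithmetic against the regexes quoted in this document, using synthetic samples (prefix plus filler, not real keys). The keyFormat type and check function are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keyFormat pairs a regex quoted in this document with its stated total length.
type keyFormat struct {
	name     string
	regex    string
	prefix   string
	bodyLen  int
	totalLen int
}

var formats = []keyFormat{
	{"google-ai", `AIzaSy[A-Za-z0-9_-]{33}`, "AIzaSy", 33, 39},
	{"xai", `xai-[0-9a-zA-Z_]{80}`, "xai-", 80, 84},
	{"groq", `gsk_[a-zA-Z0-9]{52}`, "gsk_", 52, 56},
	{"replicate", `r8_[0-9A-Za-z-_]{37}`, "r8_", 37, 40},
	{"perplexity", `pplx-[a-zA-Z0-9]{48}`, "pplx-", 48, 53},
}

// check builds a synthetic key and reports whether the regex matches it
// in full at the stated total length.
func check(f keyFormat) bool {
	sample := f.prefix + strings.Repeat("a", f.bodyLen)
	re := regexp.MustCompile(f.regex)
	return len(sample) == f.totalLen && re.FindString(sample) == sample
}

func main() {
	for _, f := range formats {
		fmt.Printf("%-10s total=%d ok=%v\n", f.name, f.totalLen, check(f))
	}
}
```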

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Regex validation	Custom regex validator	Go `regexp.Compile` + existing schema validation in `UnmarshalYAML`	Schema already validates confidence levels; add regex compilation check
Key format research	Guessing patterns	TruffleHog/gitleaks source code + GitGuardian detector docs	These tools have validated patterns against real-world data
AC automaton rebuild	Manual keyword management	Existing `NewRegistry()` auto-builds AC from all provider keywords	Just add YAML files; registry handles everything
Dual file sync	Manual copy script	Plan tasks should explicitly copy to both locations	Simple but must not be forgotten

Common Pitfalls

Pitfall 1: Catastrophic Backtracking in Regex

What goes wrong: Complex regex with nested quantifiers causes exponential backtracking.
Why it happens: Using patterns like (a+)+ or ([A-Za-z0-9]+[-_]?)+.
How to avoid: Go's RE2-based regexp engine matches in time linear in the input size, so this cannot happen by design -- but still write simple, linear patterns and avoid unnecessary alternation groups.
Warning signs: Regex compilation errors in tests.

Pitfall 2: Overly Broad Patterns Without Keywords

What goes wrong: Pattern like [A-Za-z0-9]{32} matches random strings everywhere.
Why it happens: Provider has no distinctive prefix (Azure OpenAI, Mistral, AI21, etc.).
How to avoid: Set confidence to low and require a strong keyword list for AC pre-filtering. The scanner will only test the regex AFTER AC finds a keyword match.
Warning signs: Hundreds of false positives when scanning any codebase.

Pitfall 3: Forgetting Dual-Location File Sync

What goes wrong: Provider added to providers/ but not pkg/providers/definitions/, or vice versa.
Why it happens: Phase 1 decision to maintain both locations.
How to avoid: Every task that creates a YAML file must explicitly create it in both locations. Consider a verification step that compares both directories.
Warning signs: go test passes but the providers/ directory has a different file count than pkg/providers/definitions/.

Pitfall 4: Invalid YAML Schema

What goes wrong: Missing format_version, empty last_verified, or invalid confidence value.
Why it happens: Copy-paste errors or typos.
How to avoid: The UnmarshalYAML validation catches these. Run go test ./pkg/providers/... after each batch.
Warning signs: Test failures with schema validation errors.

Pitfall 5: Regex Not RE2-Compatible

What goes wrong: Using PCRE features like (?<=prefix) or (?!suffix).
Why it happens: Copying regex from TruffleHog (which uses Go RE2 but sometimes with custom wrappers) or from gitleaks (which uses Go RE2 but has context-match wrappers).
How to avoid: Test every regex with regexp.MustCompile() in Go. Strip boundary assertions that gitleaks adds (like (?:[\x60'"\s;]|\\[nr]|$) -- these are gitleaks-specific context matchers, not needed in our YAML patterns).
Warning signs: regexp.Compile returns an error.

Pitfall 6: Schema Missing Category Field

What goes wrong: CONTEXT.md mentions a category field in the YAML format, but the Provider struct in pkg/providers/schema.go has no Category field.
Why it happens: CONTEXT.md describes an intended schema that was not fully implemented in Phase 1.
How to avoid: Either add a Category field to schema.go in this phase, or omit it from the YAML files. Adding it is recommended since it is mentioned in CLI-04 (keyhunter providers list might want category filtering).
Warning signs: YAML has a category field that gets silently ignored by the Go YAML parser.

Code Examples

Provider YAML File (High-Confidence Prefixed Provider)

# Source: TruffleHog xAI detector (verified)
format_version: 1
name: xai
display_name: xAI
tier: 1
last_verified: "2026-04-05"
keywords:
  - "xai-"
  - "xai"
  - "grok"
patterns:
  - regex: 'xai-[0-9a-zA-Z_]{80}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.x.ai/v1/api-key
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]

Provider YAML File (Low-Confidence Generic Provider)

# Low-confidence: no distinctive prefix, requires keyword context
format_version: 1
name: together
display_name: Together AI
tier: 2
last_verified: "2026-04-05"
keywords:
  - "together"
  - "together_api"
  - "api.together.xyz"
  - "togetherai"
  - "together_api_key"
patterns:
  - regex: '[A-Za-z0-9]{64}'
    entropy_min: 4.0
    confidence: low
verify:
  method: GET
  url: https://api.together.xyz/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]

Test Pattern: Verify Regex Compiles

func TestProviderRegexCompiles(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    for _, p := range reg.List() {
        for i, pat := range p.Patterns {
            _, err := regexp.Compile(pat.Regex)
            assert.NoError(t, err, "provider %s pattern %d has invalid regex: %s", p.Name, i, pat.Regex)
        }
    }
}

Test Pattern: Verify Provider Count

func TestTier1And2ProviderCount(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    stats := reg.Stats()
    assert.GreaterOrEqual(t, stats.ByTier[1], 12, "expected at least 12 Tier 1 providers")
    assert.GreaterOrEqual(t, stats.ByTier[2], 14, "expected at least 14 Tier 2 providers")
    assert.GreaterOrEqual(t, stats.Total, 26, "expected at least 26 total providers")
}

Test Pattern: AC Pre-Filter Matches Known Key Prefix

func TestACMatchesKnownPrefixes(t *testing.T) {
    reg, err := providers.NewRegistry()
    require.NoError(t, err)
    ac := reg.AC()
    
    prefixes := []string{
        "sk-proj-abc123", "sk-ant-api03-abc", "AIzaSyABC123",
        "xai-abc123", "gsk_abc123", "r8_abc123", "pplx-abc123",
    }
    for _, prefix := range prefixes {
        matches := ac.FindAll(prefix)
        assert.NotEmpty(t, matches, "AC should match prefix: %s", prefix)
    }
}

State of the Art

Old Approach	Current Approach	When Changed	Impact
OpenAI `sk-` prefix only	`sk-proj-`, `sk-svcacct-`, `sk-None-` with `T3BlbkFJ` marker	2024	Must detect all modern prefixes
AWS Bedrock via IAM only	Bedrock-specific `ABSK` API keys	Late 2025	New key type to detect
Regex-only detection	Regex + keyword pre-filter + entropy	Current best practice	KeyHunter architecture already supports this
Simple prefix match	TruffleHog suffix markers (e.g., Anthropic `AA` suffix)	Current	Tighter patterns reduce false positives

Open Questions

Category field in schema
- What we know: CONTEXT.md mentions category in YAML format, but schema.go has no Category field.
- What's unclear: Whether to add it now or defer.
- Recommendation: Add a Category string `yaml:"category"` field to the Provider struct as part of this phase. Values like "frontier", "inference-platform", "specialized" would support CLI-04 filtering.
Perplexity tier assignment
- What we know: PROV-02 lists "Perplexity" as one of 14 Tier 2 providers. PROV-03 also lists it under Tier 3.
- What's unclear: Which is canonical.
- Recommendation: Follow the phase description (PROV-02 lists it as Tier 2). If needed, adjust in Phase 3.
OpenAI existing pattern needs update
- What we know: Current openai.yaml only has sk-proj- pattern. Modern keys also use sk-svcacct-, sk-None-, and legacy sk- with T3BlbkFJ marker.
- Recommendation: Update openai.yaml with additional patterns.
Anthropic existing pattern can be tightened
- What we know: Current pattern sk-ant-api03-[A-Za-z0-9_\-]{93,} does not require the AA suffix.
- Recommendation: Tighten to sk-ant-api03-[A-Za-z0-9_\-]{93}AA per TruffleHog and add admin key pattern.
Defunct/transitioning providers
- What we know: OctoAI appears defunct. Anyscale Endpoints being deprecated. Lepton acquired by NVIDIA.
- Recommendation: Still create YAML definitions for all -- leaked keys from defunct services can still have value if the service was operational.

Validation Architecture

Test Framework

Property	Value
Framework	Go standard `testing` + `testify` v1.x
Config file	`go.mod` (no separate test config)
Quick run command	`go test ./pkg/providers/... -v -count=1`
Full suite command	`go test ./... -v -count=1`

Phase Requirements -> Test Map

Req ID	Behavior	Test Type	Automated Command	File Exists?
PROV-01	12 Tier 1 providers loaded with correct names/tiers	unit	`go test ./pkg/providers/... -run TestTier1Providers -v`	Wave 0
PROV-01	Each Tier 1 provider regex compiles and matches sample key	unit	`go test ./pkg/providers/... -run TestTier1Patterns -v`	Wave 0
PROV-02	14 Tier 2 providers loaded with correct names/tiers	unit	`go test ./pkg/providers/... -run TestTier2Providers -v`	Wave 0
PROV-02	Each Tier 2 provider regex compiles and matches sample key	unit	`go test ./pkg/providers/... -run TestTier2Patterns -v`	Wave 0
PROV-01+02	AC automaton matches all provider keywords	unit	`go test ./pkg/providers/... -run TestACMatchesAllKeywords -v`	Wave 0
PROV-01+02	Registry stats show 26+ providers	unit	`go test ./pkg/providers/... -run TestTier1And2ProviderCount -v`	Wave 0

Sampling Rate

Per task commit: go test ./pkg/providers/... -v -count=1
Per wave merge: go test ./... -v -count=1
Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

pkg/providers/tier1_test.go -- covers PROV-01 (provider names, tier values, pattern compilation, sample key matching)
pkg/providers/tier2_test.go -- covers PROV-02 (same as above for Tier 2)
Update pkg/providers/registry_test.go -- update TestRegistryLoad assertion from >= 3 to >= 26

Sources

Primary (HIGH confidence)

TruffleHog OpenAI detector -- regex pattern for OpenAI keys
TruffleHog Anthropic detector -- regex pattern for Anthropic keys
TruffleHog Google Gemini detector -- regex pattern for Google AI keys
TruffleHog Groq detector -- regex pattern gsk_[a-zA-Z0-9]{52}
TruffleHog xAI detector -- regex pattern xai-[0-9a-zA-Z_]{80}
TruffleHog Replicate detector -- regex pattern r8_[0-9A-Za-z-_]{37}
TruffleHog Azure OpenAI detector -- 32-char hex with keyword context
gitleaks config -- Anthropic, Cohere, AWS Bedrock patterns

Secondary (MEDIUM confidence)

AWS Bedrock API keys (Wiz blog) -- ABSK prefix documentation
GitGuardian Groq detector -- confirms prefixed format
GitGuardian Mistral detector -- confirms "Prefixed: False"
GitGuardian Fireworks detector -- confirms "Prefixed: False"
OpenAI community forum -- sk-proj-, sk-svcacct-, sk-None- prefixes
Replicate docs -- r8_ prefix, 40-char format
Anyscale docs -- esecret_ prefix

Tertiary (LOW confidence -- needs validation at implementation)

Meta AI key format -- not documented, inferred from OpenAI-compatible API pattern
Inflection AI key format -- not documented, SDK-based inference only
AI21 Labs key format -- no prefix confirmed
Together AI key format -- no distinctive prefix confirmed
DeepInfra key format -- JWT-based tokens referenced but standard key format unclear
Lepton AI, Modal, Baseten, Cerebrium, NovitaAI, SambaNova, OctoAI, Friendli -- key formats not publicly documented

Metadata

Confidence breakdown:

Standard stack: HIGH -- no new dependencies, using existing Phase 1 infrastructure
Architecture: HIGH -- dual-location YAML pattern well-established
Provider key formats (Tier 1 well-known): HIGH -- TruffleHog/gitleaks verified patterns
Provider key formats (Tier 1 less-known): MEDIUM -- Meta, Inflection, AI21 have limited documentation
Provider key formats (Tier 2 prefixed): HIGH -- Groq, Replicate, Perplexity, Anyscale have clear prefixes
Provider key formats (Tier 2 generic): LOW -- many Tier 2 providers lack documented key formats
Pitfalls: HIGH -- well-understood RE2 and false-positive challenges

Research date: 2026-04-05
Valid until: 2026-05-05 (30 days -- API key formats change infrequently)
