API Key Scanner Market Research Report

Date: April 4, 2026


Table of Contents

  1. Existing Open-Source API Key Scanners
  2. LLM-Specific API Key Tools
  3. Top LLM API Providers (100+)
  4. API Key Patterns by Provider
  5. Key Validation Approaches
  6. Market Gaps & Opportunities

1. Existing Open-Source API Key Scanners

1.1 TruffleHog

  • GitHub: https://github.com/trufflesecurity/trufflehog
  • Stars: ~25,500
  • Language: Go
  • Detectors: 800+ secret types
  • Approach: Detector-based (each detector is a small Go program for a specific credential type)
  • Detection methods:
    • Pattern matching via dedicated detectors
    • Active verification against live APIs
    • Permission/scope analysis (~20 credential types)
  • AI/LLM detectors confirmed: OpenAI, OpenAI Admin Key, Anthropic
  • Scanning sources: Git repos, GitHub orgs, S3 buckets, GCS, Docker images, Jenkins, Elasticsearch, Postman, Slack, local filesystems
  • Key differentiator: Verification — not just "this looks like a key" but "this is an active key with these permissions"
  • Limitations:
    • Heavy/slow compared to regex-only scanners
    • Not all 800+ detectors have verification
    • LLM provider coverage still incomplete (no confirmed Cohere, Mistral, Groq detectors)

1.2 Gitleaks

  • GitHub: https://github.com/gitleaks/gitleaks
  • Stars: ~25,800
  • Language: Go
  • Rules: 150+ regex patterns in gitleaks.toml
  • Approach: Regex pattern matching with optional entropy checks
  • Detection methods:
    • Regex patterns defined in TOML config
    • Keyword matching
    • Entropy thresholds
    • Allowlists for false positive reduction
  • AI/LLM rules confirmed:
    • anthropic-admin-api-key: sk-ant-admin01-[a-zA-Z0-9_\-]{93}AA
    • anthropic-api-key: sk-ant-api03-[a-zA-Z0-9_\-]{93}AA
    • openai-api-key: Updated to include sk-proj- and sk-svcacct- formats
    • cohere-api-token: Keyword-based detection
    • huggingface-access-token: hf_[a-z]{34}
    • huggingface-organization-api-token: api_org_[a-z]{34}
  • Key differentiator: Fast, simple, excellent as pre-commit hook
  • Limitations:
    • No active verification of detected keys
    • Regex-only means higher false positive rate for generic patterns
    • Limited LLM provider coverage beyond the four providers above (Anthropic, OpenAI, Cohere, Hugging Face)
  • Note: Gitleaks creator launched "Betterleaks" in 2026 as a successor built for the agentic era

1.3 detect-secrets (Yelp)

  • GitHub: https://github.com/Yelp/detect-secrets
  • Stars: ~4,300
  • Language: Python
  • Plugins: 27 built-in detectors
  • Approach: Baseline methodology — tracks known secrets and flags new ones
  • Detection methods:
    • Regex-based plugins (structured secrets)
    • High entropy string detection (Base64, Hex)
    • Keyword detection (variable name matching)
    • Optional ML-based gibberish detector (v1.1+)
  • AI/LLM plugins confirmed:
    • OpenAIDetector plugin exists
    • No dedicated Anthropic, Cohere, Mistral, or Groq plugins
  • Key differentiator: Baseline approach — only flags NEW secrets, not historical ones; enterprise-friendly
  • Limitations:
    • Minimal LLM provider coverage
    • Active verification is limited (see the comparison in section 5.3)
    • Fewer patterns than TruffleHog or Gitleaks
    • Python-only (slower than Go/Rust alternatives)
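
The baseline idea described above can be sketched in a few lines. This is a minimal illustration, assuming a baseline stored as a set of hashes; it is not detect-secrets' actual baseline file format, and the names `fingerprint` and `new_findings` are hypothetical.

```python
import hashlib


def fingerprint(path: str, secret: str) -> str:
    """Hash the path together with the secret so the baseline never stores plaintext."""
    return hashlib.sha256(f"{path}\0{secret}".encode("utf-8")).hexdigest()


def new_findings(baseline: set[str], findings: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return only the (path, secret) pairs whose fingerprint is not in the baseline."""
    return [(p, s) for p, s in findings if fingerprint(p, s) not in baseline]


# Hypothetical data: one secret was accepted into the baseline on an earlier scan.
baseline = {fingerprint("config/old.env", "hf_" + "a" * 34)}
scan_results = [
    ("config/old.env", "hf_" + "a" * 34),    # already known, suppressed
    ("app/settings.py", "gsk_" + "B" * 48),  # not in the baseline, reported
]
print(new_findings(baseline, scan_results))
```

Keying the fingerprint on both path and secret means the same secret copied into a new file is reported again, which is one possible design choice rather than the only one.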

1.4 Nosey Parker (Praetorian)

  • GitHub: https://github.com/praetorian-inc/noseyparker
  • Stars: ~2,300
  • Language: Rust
  • Rules: 188 high-precision regex rules
  • Approach: Hybrid regex + ML denoising
  • Detection methods:
    • 188 tested regex rules tuned for low false positives
    • ML model for false positive reduction (10-1000x improvement)
    • Deduplication/grouping of findings
  • Performance: GB/s scanning speeds, tested on 20TB+ datasets
  • Key differentiator: ML-enhanced denoising, extreme performance
  • Status: RETIRED — replaced by Titus (https://github.com/praetorian-inc/titus)
  • Limitations:
    • No specific LLM provider rules documented
    • No active verification
    • Project discontinued

1.5 GitGuardian

  • Website: https://www.gitguardian.com
  • Type: Commercial + free tier for public repos
  • Detectors: 450+ secret types
  • Approach: Regex + AI-powered false positive reduction
  • Detection methods:
    • Specific prefix-based detectors
    • Fine-tuned code-LLM for false positive filtering
    • Validity checking for supported detectors
  • AI/LLM coverage:
    • Groq API Key (prefixed, with validity check)
    • OpenAI, Anthropic, HuggingFace (confirmed)
    • AI-related leaked secrets up 81% YoY in 2025
    • 1,275,105 leaked AI service secrets detected in 2025
  • Key differentiator: AI-powered false positive reduction, massive scale (scans all public GitHub)
  • Limitations:
    • Commercial/proprietary for private repos
    • Regex patterns not publicly disclosed

1.6 GitHub Secret Scanning (Native)

  • Type: Built into GitHub
  • Approach: Provider-partnered pattern matching + Copilot AI
  • AI/LLM patterns supported (with push protection and validity status):
| Provider | Pattern | Push Protection | Validity Check |
|---|---|---|---|
| Anthropic | anthropic_admin_api_key | Yes | Yes |
| Anthropic | anthropic_api_key | Yes | Yes |
| Anthropic | anthropic_session_id | Yes | No |
| Cohere | cohere_api_key | Yes | No |
| DeepSeek | deepseek_api_key | No | Yes |
| Google | google_gemini_api_key | No | No |
| Groq | groq_api_key | Yes | Yes |
| Hugging Face | hf_org_api_key | Yes | No |
| Hugging Face | hf_user_access_token | Yes | Yes |
| Mistral AI | mistral_ai_api_key | No | No |
| OpenAI | openai_api_key | Yes | Yes |
| Replicate | replicate_api_token | Yes | Yes |
| xAI | xai_api_key | Yes | Yes |
| Azure | azure_openai_key | Yes | No |
  • Recent developments (March 2026):
    • Added 37 new secret detectors including Langchain
    • Extended scanning to AI coding agents via MCP
    • Copilot uses GPT-3.5-Turbo + GPT-4 for unstructured secret detection (94% FP reduction)
    • Base64-encoded secret detection with push protection
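
One way to approach Base64-encoded secrets is to decode candidate strings and rescan the result. The sketch below is an assumption about how such a check could work, not GitHub's implementation; it reuses the Anthropic pattern quoted in section 1.2, and the 24-character minimum for candidates is an arbitrary illustrative choice.

```python
import base64
import binascii
import re

# Runs of Base64 alphabet characters long enough to hide a key (length is an assumption).
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")
# Anthropic API key pattern as listed for Gitleaks in this report.
KEY_PATTERN = re.compile(r"sk-ant-api03-[a-zA-Z0-9_\-]{93}AA")


def find_encoded_keys(text: str) -> list[str]:
    """Decode each Base64-looking run and search the decoded text for a key."""
    hits = []
    for candidate in B64_CANDIDATE.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text after decoding
        hits.extend(KEY_PATTERN.findall(decoded))
    return hits


# Synthetic key built to satisfy the pattern; it is not a real credential.
fake_key = "sk-ant-api03-" + "x" * 93 + "AA"
encoded = base64.b64encode(fake_key.encode("ascii")).decode("ascii")
print(find_encoded_keys(f'token: "{encoded}"') == [fake_key])
```

A candidate that runs directly into neighbouring Base64-alphabet characters will fail to decode and be skipped, so a production scanner would need a more careful candidate splitter.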

1.7 Other Notable Tools

| Tool | Stars | Language | Patterns | Key Feature |
|---|---|---|---|---|
| KeyHacks (streaak) | 6,100 | Markdown/Shell | 100+ services | Validation curl commands for bug bounty |
| keyhacks.sh (gwen001) | ~500 | Bash | 50+ | Automated version of KeyHacks |
| Secrets Patterns DB (mazen160) | 1,400 | YAML/Regex | 1,600+ | Largest open-source regex DB, exports to TruffleHog/Gitleaks format |
| secret-regex-list (h33tlit) | ~1,000 | Regex | 100+ | Regex patterns for scraping secrets |
| regextokens (odomojuli) | ~300 | Regex | 50+ | OAuth/API token regex patterns |
| Betterleaks | New (2026) | Go | | Gitleaks successor for agentic era |

2. LLM-Specific API Key Tools

2.1 Dedicated LLM Key Validators

| Tool | URL | Providers | Approach |
|---|---|---|---|
| TestMyAPIKey.com | testmyapikey.com | OpenAI, Anthropic Claude, + 13 others | Client-side regex + live API validation |
| SecurityWall Checker | securitywall.co/tools/api-key-checker | 455+ patterns, 350+ services (incl. OpenAI, Anthropic) | Client-side regex, generates curl commands |
| VibeFactory Scanner | vibefactory.ai/api-key-security-scanner | 150+ types (incl. OpenAI) | Scans deployed websites for exposed keys |
| KeyLeak Detector | github.com/Amal-David/keyleak-detector | Multiple | Headless browser + network interception |
| OpenAI Key Tester | trevorfox.com/api-key-tester/openai | OpenAI, Anthropic | Direct API validation |
| Chatbot API Tester | apikeytester.netlify.app | OpenAI, DeepSeek, OpenRouter | Endpoint validation |
| SecurityToolkits | securitytoolkits.com/tools/apikey-validator | Multiple | API key/token checker |

2.2 LLM Gateways with Key Validation

These tools validate keys as part of their proxy/gateway functionality:

| Tool | Stars | Providers | Validation Approach |
|---|---|---|---|
| LiteLLM | ~18k | 107 providers | AuthenticationError mapping from all providers |
| OpenRouter | | 60+ providers, 500+ models | Unified API key, provider-level validation |
| Portkey AI | ~5k | 30+ providers | AI gateway with key validation |
| LLM-API-Key-Proxy | ~200 | OpenAI, Anthropic compatible | Self-hosted proxy with key validation |

2.3 Key Gap: No Comprehensive LLM-Focused Scanner

Critical finding: There is NO dedicated open-source tool that:

  1. Detects API keys from all major LLM providers (50+)
  2. Validates them against live APIs
  3. Reports provider, model access, rate limits, and spend
  4. Covers both legacy and new key formats

The closest tools are:

  • TruffleHog (broadest verification, but only ~3 confirmed LLM detectors)
  • GitHub Secret Scanning (14 AI-related patterns, but GitHub-only)
  • GitGuardian (broad AI coverage, but commercial)

3. Top LLM API Providers

Tier 1: Major Cloud & Frontier Model Providers

| # | Provider | Key Product | Notes |
|---|---|---|---|
| 1 | OpenAI | GPT-5, GPT-4o, o-series | Market leader |
| 2 | Anthropic | Claude Opus 4, Sonnet, Haiku | Enterprise focus |
| 3 | Google (Gemini/Vertex AI) | Gemini 2.5 Pro/Flash | 2M token context |
| 4 | AWS Bedrock | Multi-model (Claude, Llama, etc.) | AWS ecosystem |
| 5 | Azure OpenAI | GPT-4o, o-series | Enterprise SLA 99.9% |
| 6 | Google AI Studio | Gemini API | Developer-friendly |
| 7 | xAI | Grok 4.1 | 2M context, low cost |

Tier 2: Specialized & Competitive Providers

| # | Provider | Key Product | Notes |
|---|---|---|---|
| 8 | Mistral AI | Mistral Large, Codestral | European, open-weight |
| 9 | Cohere | Command R+ | Enterprise RAG focus |
| 10 | DeepSeek | DeepSeek R1, V3 | Ultra-low cost reasoning |
| 11 | Perplexity | Sonar Pro | Search-augmented LLM |
| 12 | Together AI | 200+ open-source models | Low latency inference |
| 13 | Groq | LPU inference | Fastest inference speeds |
| 14 | Fireworks AI | Open-source model hosting | Sub-100ms latency |
| 15 | Replicate | Model hosting platform | Pay-per-use |
| 16 | Cerebras | Wafer-scale inference | Ultra-fast inference |
| 17 | SambaNova | Enterprise inference | Custom silicon |
| 18 | AI21 | Jamba models | Long context |
| 19 | Stability AI | Stable Diffusion, text models | Image + text |
| 20 | NVIDIA NIM | Optimized model serving | GPU-optimized |

Tier 3: Infrastructure, Platform & Gateway Providers

# Provider Key Product Notes
21 Cloudflare Workers AI Edge inference Edge computing
22 Vercel AI AI SDK, v0 Frontend-focused
23 OpenRouter Multi-model gateway 500+ models
24 HuggingFace Inference API, 300+ models Open-source hub
25 DeepInfra Inference platform Cost-effective
26 Novita AI 200+ production APIs Multi-modal
27 Baseten Model serving Custom deployments
28 Anyscale Ray-based inference Scalable
29 Lambda AI GPU cloud + inference
30 OctoAI Optimized inference
31 Databricks DBRX, model serving Data + AI
32 Snowflake Cortex AI Data warehouse + AI
33 Oracle OCI OCI AI Enterprise
34 SAP Generative AI Hub Enterprise AI SAP ecosystem
35 IBM WatsonX Granite models Enterprise

Tier 4: Chinese & Regional Providers

| # | Provider | Key Product | Notes |
|---|---|---|---|
| 36 | Alibaba (Qwen/Dashscope) | Qwen 2.5/3 series | Top Chinese open-source |
| 37 | Baidu (Wenxin/ERNIE) | ERNIE 4.0 | Chinese market leader |
| 38 | ByteDance (Doubao) | Doubao | TikTok parent |
| 39 | Zhipu AI | GLM-4.5 | ChatGLM lineage |
| 40 | Baichuan | Baichuan 4 | Domain-specific (law, finance) |
| 41 | Moonshot AI (Kimi) | Kimi K1.5/K2 | 128K context |
| 42 | 01.AI (Yi) | Yi-Large, Yi-34B | Founded by Kai-Fu Lee |
| 43 | MiniMax | MiniMax models | Chinese AI tiger |
| 44 | StepFun | Step models | Chinese AI tiger |
| 45 | Tencent (Hunyuan) | Hunyuan models | WeChat ecosystem |
| 46 | iFlyTek (Spark) | Spark models | Voice/NLP specialist |
| 47 | SenseNova (SenseTime) | SenseNova models | Vision + language |
| 48 | Volcano Engine (ByteDance) | Cloud AI services | ByteDance cloud |
| 49 | Nebius AI | Inference platform | Yandex spinoff |

Tier 5: Emerging, Niche & Specialized Providers

# Provider Key Product Notes
50 Aleph Alpha Luminous models EU-focused, compliance
51 Comet API ML experiment tracking
52 Writer Palmyra models Enterprise content
53 Reka AI Reka Core/Flash Multimodal
54 Upstage Solar models Korean provider
55 FriendliAI Inference optimization
56 Forefront AI Model hosting
57 GooseAI GPT-NeoX hosting Low cost
58 NLP Cloud Model hosting
59 Predibase Fine-tuning platform LoRA specialist
60 Clarifai Vision + LLM
61 AiLAYER AI platform
62 AIMLAPI Multi-model API
63 Corcel Decentralized inference Bittensor-based
64 HyperBee AI AI platform
65 Lamini Fine-tuning + inference
66 Monster API GPU inference
67 Neets.ai TTS + LLM
68 Featherless AI Inference
69 Hyperbolic Inference platform
70 Inference.net Open-source inference
71 Galadriel Decentralized AI
72 PublicAI Community inference
73 Bytez Model hosting
74 Chutes Inference
75 GMI Cloud GPU cloud + inference
76 Nscale Inference platform
77 Scaleway European cloud AI
78 OVHCloud AI European cloud AI
79 Heroku AI PaaS AI add-on
80 Sarvam.ai Indian AI models

Tier 6: Self-Hosted & Local Inference

| # | Provider | Key Product | Notes |
|---|---|---|---|
| 81 | Ollama | Local LLM runner | No API key needed |
| 82 | LM Studio | Desktop LLM | No API key needed |
| 83 | vLLM | Inference engine | Self-hosted |
| 84 | Llamafile | Single-file LLM | Self-hosted |
| 85 | Xinference | Inference platform | Self-hosted |
| 86 | Triton Inference Server | NVIDIA serving | Self-hosted |
| 87 | LlamaGate | Gateway | Self-hosted |
| 88 | Docker Model Runner | Container inference | Self-hosted |

Tier 7: Aggregators, Gateways & Middleware

# Provider Key Product Notes
89 LiteLLM AI gateway (107 providers) Open-source
90 Portkey AI gateway Observability
91 Helicone LLM observability Proxy-based
92 Bifrost AI gateway (Go) Fastest gateway
93 Kong AI Gateway API management Enterprise
94 Vercel AI Gateway Edge AI
95 Cloudflare AI Gateway Edge AI
96 Agenta LLM ops platform
97 Straico Multi-model
98 AI302 Gateway
99 AIHubMix Gateway
100 Zenmux Gateway
101 Poe Multi-model chat Quora
102 Gitee AI Chinese GitHub AI
103 GitHub Models GitHub-hosted inference
104 GitHub Copilot Code completion
105 ModelScope Chinese model hub Alibaba
106 Voyage AI Embeddings
107 Jina AI Embeddings + search
108 Deepgram Speech-to-text
109 ElevenLabs Text-to-speech
110 Black Forest Labs Image generation (FLUX)
111 Fal AI Image/video generation
112 RunwayML Video generation
113 Recraft Image generation
114 DataRobot ML platform
115 Weights & Biases ML ops + inference
116 CompactifAI Model compression
117 GradientAI Fine-tuning
118 Topaz AI platform
119 Synthetic Data generation
120 Infiniai Inference
121 Higress AI gateway Alibaba
122 PPIO Inference
123 Qiniu Chinese cloud AI
124 NanoGPT Lightweight inference
125 Morph AI platform
126 Milvus Vector DB + AI
127 XiaoMi MiMo Xiaomi AI
128 Petals Distributed inference
129 ZeroOne AI platform
130 Lemonade AI platform
131 Taichu Chinese AI
132 Amazon Nova AWS native models

4. API Key Patterns by Provider

4.1 Confirmed Key Prefixes & Formats

| Provider | Prefix | Regex Pattern | Confidence |
|---|---|---|---|
| OpenAI (legacy) | sk- | sk-[a-zA-Z0-9]{48} | High |
| OpenAI (project) | sk-proj- | sk-proj-[a-zA-Z0-9_-]{80,} | High |
| OpenAI (service account) | sk-svcacct- | sk-svcacct-[a-zA-Z0-9_-]{80,} | High |
| OpenAI (legacy user) | sk-None- | sk-None-[a-zA-Z0-9_-]{80,} | High |
| Anthropic (API) | sk-ant-api03- | sk-ant-api03-[a-zA-Z0-9_\-]{93}AA | High |
| Anthropic (Admin) | sk-ant-admin01- | sk-ant-admin01-[a-zA-Z0-9_\-]{93}AA | High |
| Google AI / Gemini | AIza | AIza[0-9A-Za-z\-_]{35} | High |
| HuggingFace (user) | hf_ | hf_[a-zA-Z]{34} | High |
| HuggingFace (org) | api_org_ | api_org_[a-zA-Z]{34} | High |
| Groq | gsk_ | gsk_[a-zA-Z0-9]{48,} | High |
| Replicate | r8_ | r8_[a-zA-Z0-9]{40} | High |
| Fireworks AI | fw_ | fw_[a-zA-Z0-9_-]{40,} | Medium |
| Perplexity | pplx- | pplx-[a-zA-Z0-9]{48} | High |
| AWS (general) | AKIA | AKIA[0-9A-Z]{16} | High |
| GitHub PAT | ghp_ | ghp_[a-zA-Z0-9]{36} | High |
| Stripe (secret) | sk_live_ | sk_live_[0-9a-zA-Z]{24} | High |
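
To show how the prefixed patterns above might be applied, here is a minimal scanner sketch. The regexes are copied from the table; the leading `\b` word boundary, the `scan` function, and the sample input are illustrative assumptions, and the sample keys are synthetic.

```python
import re

# A subset of the patterns from the table above.
PATTERNS = {
    "OpenAI (project)": r"sk-proj-[a-zA-Z0-9_-]{80,}",
    "OpenAI (service account)": r"sk-svcacct-[a-zA-Z0-9_-]{80,}",
    "Anthropic (API)": r"sk-ant-api03-[a-zA-Z0-9_\-]{93}AA",
    "HuggingFace (user)": r"hf_[a-zA-Z]{34}",
    "Groq": r"gsk_[a-zA-Z0-9]{48,}",
    "Replicate": r"r8_[a-zA-Z0-9]{40}",
    "Perplexity": r"pplx-[a-zA-Z0-9]{48}",
}
# Assumption: require a word boundary before the prefix to avoid mid-token matches.
COMPILED = {name: re.compile(r"\b" + rx) for name, rx in PATTERNS.items()}


def scan(text: str) -> list[tuple[str, int, str]]:
    """Return (provider, line number, matched text) for every pattern hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for provider, rx in COMPILED.items():
            for match in rx.finditer(line):
                findings.append((provider, lineno, match.group(0)))
    return findings


# Synthetic sample input; these are not real credentials.
sample = 'GROQ_API_KEY="gsk_' + "a1" * 24 + '"\nHF_TOKEN=hf_' + "Q" * 34
for provider, lineno, secret in scan(sample):
    print(provider, lineno, secret[:8] + "...")
```

Fixed-length patterns such as `r8_[a-zA-Z0-9]{40}` will also match the first 40 characters of a longer token, so a real scanner would likely add a trailing boundary check.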

4.2 Providers with No Known Distinct Prefix

These providers use generic-looking API keys without distinguishing prefixes, making detection harder:

| Provider | Key Format | Detection Approach |
|---|---|---|
| Mistral AI | Generic alphanumeric | Keyword-based (MISTRAL_API_KEY) |
| Cohere | Generic alphanumeric | Keyword-based (COHERE_API_KEY, CO_API_KEY) |
| Together AI | Generic alphanumeric | Keyword-based |
| DeepSeek | sk- prefix (same as OpenAI legacy) | Keyword context needed |
| Azure OpenAI | 32-char hex | Keyword-based |
| Stability AI | sk- prefix | Keyword context needed |
| AI21 | Generic alphanumeric | Keyword-based |
| Cerebras | Generic alphanumeric | Keyword-based |
| SambaNova | Generic alphanumeric | Keyword-based |
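
Keyword-based detection can be sketched as "variable name followed by an assignment and a key-like value". The variable names below come from the table; the assignment syntax and the 32-character minimum are assumptions made for illustration, since the report gives no exact key lengths for these providers.

```python
import re

# Environment variable names taken from the table above.
KEYWORDS = {
    "Mistral AI": ["MISTRAL_API_KEY"],
    "Cohere": ["COHERE_API_KEY", "CO_API_KEY"],
}
# Assumption: the value is 32+ alphanumeric characters, optionally quoted.
ASSIGNMENT = r"""\s*[:=]\s*["']?([A-Za-z0-9]{32,})["']?"""


def scan_keyword_context(text: str) -> list[tuple[str, str]]:
    """Return (provider, value) pairs where a known variable name is assigned a key-like value."""
    findings = []
    for provider, names in KEYWORDS.items():
        for name in names:
            rx = re.compile(r"\b" + re.escape(name) + ASSIGNMENT)
            findings.extend((provider, m.group(1)) for m in rx.finditer(text))
    return findings


# Synthetic sample; the values are placeholders, not real keys.
sample_env = (
    'MISTRAL_API_KEY="' + "a" * 32 + '"\n'
    "export CO_API_KEY=" + "B7" * 20 + "\n"
    "UNRELATED=" + "c" * 40
)
print(scan_keyword_context(sample_env))
```

The unrelated 40-character value is ignored because no provider keyword precedes it, which is the point of the keyword requirement for prefixless keys.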

4.3 Detection Difficulty Tiers

Easy (unique prefix): OpenAI (sk-proj-, sk-svcacct-), Anthropic (sk-ant-), HuggingFace (hf_), Groq (gsk_), Replicate (r8_), Perplexity (pplx-), AWS (AKIA)

Medium (shared or short prefix): OpenAI legacy (sk-), DeepSeek (sk-), Stability (sk-), Fireworks (fw_), Google (AIza)

Hard (no prefix, keyword-only): Mistral, Cohere, Together AI, Azure OpenAI, AI21, Cerebras, most Chinese providers


5. Key Validation Approaches

5.1 Common Validation Endpoints

| Provider | Validation Method | Endpoint | Cost |
|---|---|---|---|
| OpenAI | List models | GET /v1/models | Free (no tokens consumed) |
| Anthropic | Send minimal message | POST /v1/messages (tiny prompt) | Minimal cost (~1 token) |
| Google Gemini | List models | GET /v1/models | Free |
| Cohere | Token check | POST /v1/tokenize or /v1/generate | Minimal |
| HuggingFace | Whoami | GET /api/whoami | Free |
| Groq | List models | GET /v1/models | Free |
| Replicate | Get account | GET /v1/account | Free |
| Mistral | List models | GET /v1/models | Free |
| AWS | STS GetCallerIdentity | POST sts.amazonaws.com | Free |
| Azure OpenAI | List deployments | GET /openai/deployments | Free |
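
A "list models" check of the kind shown in the table could look like the sketch below, using only the Python standard library. The Bearer authorization header and the mapping from HTTP status to verdict are assumptions for illustration; individual providers may use different headers or status codes, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET /v1/models request with the key as a Bearer token (assumed auth scheme)."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def classify_status(status: int) -> str:
    """Map an HTTP status to a verdict. The mapping is an assumption, not provider documentation."""
    if status == 200:
        return "live"
    if status == 401:
        return "rejected"
    return "inconclusive"  # rate limits, permission errors, outages, and so on


def check_key(base_url: str, api_key: str, timeout: float = 10.0) -> str:
    """Send the request and classify the outcome; network failures are inconclusive."""
    request = build_models_request(base_url, api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        return classify_status(err.code)
    except OSError:
        return "inconclusive"


# Building the request does not touch the network; check_key() would.
req = build_models_request("https://api.openai.com", "sk-placeholder")
print(req.get_method(), req.full_url)
```

Only run such checks against keys you are authorized to test, since every call is a real authentication attempt against the provider.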

5.2 Validation Strategy Patterns

  1. Passive detection (regex only): Fastest, highest false positive rate. Used by Gitleaks, detect-secrets baseline mode.

  2. Passive + entropy: Combines regex with entropy scoring. Reduces false positives for generic patterns. Used by detect-secrets with entropy plugins.

  3. Active verification (API call): Makes a lightweight API call to confirm the key is live. Used by TruffleHog and GitHub secret scanning. Removes false positives among verified findings, but requires network access and only works for providers with a usable validation endpoint.

  4. Deep analysis (permission enumeration): Beyond verification, enumerates what the key can access. Used by TruffleHog for ~20 credential types. Most actionable but slowest.
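
The entropy scoring mentioned in strategy 2 is usually Shannon entropy over the characters of a candidate string. The sketch below shows the calculation; the 3.5 bits-per-character threshold is an assumed value for illustration, not one taken from any of the tools discussed.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_random(candidate: str, threshold: float = 3.5) -> bool:
    """Flag a candidate whose entropy meets the (assumed) threshold."""
    return shannon_entropy(candidate) >= threshold


print(looks_random("password" * 4))         # repetitive, low entropy
print(looks_random("9fQ2xLr7TzB0mKd5WcY8")) # 20 distinct characters, high entropy
```

Entropy alone cannot distinguish a key from any other random-looking string such as a hash or UUID, which is why it is combined with regex or keyword matching.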

5.3 How Existing Tools Validate

| Tool | Passive | Entropy | Active Verification | Permission Analysis |
|---|---|---|---|---|
| TruffleHog | Yes | No | Yes (many, but not all, of 800+ detectors) | Yes (~20 types) |
| Gitleaks | Yes | Optional | No | No |
| detect-secrets | Yes | Yes | Limited | No |
| Nosey Parker | Yes | ML-based | No | No |
| GitGuardian | Yes | Yes | Yes (selected) | Limited |
| GitHub Scanning | Yes | AI-based | Yes (selected) | No |
| SecurityWall | Yes | No | Generates curl cmds | No |
| KeyHacks | No | No | Manual curl cmds | Limited |

6. Market Gaps & Opportunities

6.1 Underserved Areas

  1. LLM-specific comprehensive scanner: No tool covers all 50+ LLM API providers with both detection and validation.

  2. New key format coverage: OpenAI's sk-proj- and sk-svcacct- formats are recent; many scanners only detect legacy sk- format. Gitleaks only added these in late 2025 via PR #1780.

  3. Chinese/regional provider detection: Almost zero coverage for Qwen, Baichuan, Zhipu, Moonshot, Yi, ERNIE, Doubao API keys in any scanner.

  4. Key metadata extraction: No tool extracts org, project, rate limits, or spend from detected LLM keys.

  5. Agentic AI context: With AI agents increasingly using API keys, there's a growing need for scanners that understand multi-key configurations (e.g., an agent with OpenAI + Anthropic + Serp API keys).

  6. Vibe coding exposure: VibeFactory's scanner addresses the problem of API keys exposed in frontend JavaScript by vibe-coded apps, but this is still nascent.
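
As a rough illustration of the multi-key point in item 5, a scanner could group findings by file and flag files that hold keys for more than one provider. The function name and the input shape are hypothetical.

```python
from collections import defaultdict


def multi_provider_files(findings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Given (path, provider) findings, return files containing keys for 2+ providers."""
    providers_by_file = defaultdict(set)
    for path, provider in findings:
        providers_by_file[path].add(provider)
    return {
        path: sorted(providers)
        for path, providers in providers_by_file.items()
        if len(providers) >= 2
    }


# Hypothetical findings from an earlier detection pass.
agent_findings = [
    ("agent/.env", "OpenAI"),
    ("agent/.env", "Anthropic"),
    ("agent/.env", "OpenAI"),
    ("docs/example.md", "Groq"),
]
print(multi_provider_files(agent_findings))
```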

6.2 Scale of the Problem

  • 28 million credentials leaked on GitHub in 2025 (Snyk)
  • 1,275,105 leaked AI service secrets in 2025 (GitGuardian), up 81% YoY
  • 8 of 10 fastest-growing leaked secret categories are AI-related (GitGuardian)
  • Fastest growing: Brave Search API (+1,255%), Supabase (+992%), Firecrawl (+796%)
  • Groq keys alone are found at a rate of 42.28 per million commits (GitGuardian)

6.3 Competitive Landscape Summary

                    Verification Depth
                    |
        TruffleHog  |  ████████████████  (800+ detectors, deep analysis)
        GitGuardian |  ████████████      (450+ detectors, commercial)
        GitHub      |  ██████████        (AI-powered, platform-locked)
        Gitleaks    |  ████              (150+ regex, no verification)
        detect-sec  |  ███               (27 plugins, baseline approach)
        NoseyParker |  ██                (188 rules, ML denoising, retired)
                    |
                    +------ LLM Provider Coverage ------>
                    
        None of these tools provide >15 LLM provider detectors.
        The market opportunity is a scanner focused on 50-100+ LLM providers
        with active verification, permission analysis, and cost estimation.
