salvacybersec/keyhunter

salvacybersec 095b90ec07 merge: phase 14-03 frontend leaks

2026-04-06 13:21:39 +03:00

API Key Scanner Market Research Report

Date: April 4, 2026

1. Existing Open-Source API Key Scanners
2. LLM-Specific API Key Tools
3. Top LLM API Providers (100+)
4. API Key Patterns by Provider
5. Key Validation Approaches
6. Market Gaps & Opportunities

1. Existing Open-Source API Key Scanners

1.1 TruffleHog

GitHub: https://github.com/trufflesecurity/trufflehog
Stars: ~25,500
Language: Go
Detectors: 800+ secret types
Approach: Detector-based (each detector is a small Go program for a specific credential type)
Detection methods:
- Pattern matching via dedicated detectors
- Active verification against live APIs
- Permission/scope analysis (~20 credential types)
AI/LLM detectors confirmed: OpenAI, OpenAI Admin Key, Anthropic
Scanning sources: Git repos, GitHub orgs, S3 buckets, GCS, Docker images, Jenkins, Elasticsearch, Postman, Slack, local filesystems
Key differentiator: Verification — not just "this looks like a key" but "this is an active key with these permissions"
Limitations:
- Heavy/slow compared to regex-only scanners
- Not all 800+ detectors have verification
- LLM provider coverage still incomplete (no confirmed Cohere, Mistral, Groq detectors)
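
The detector-based approach described above (one small unit per credential type, with optional live verification) can be sketched roughly as follows. This is an illustration in Python, not TruffleHog's actual Go interface: the `Detector` and `Finding` classes, the `scan` function, and the `stub_verify` placeholder are all hypothetical names, and the two regexes are taken from section 4.1 of this report.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detector:
    """One credential type: a pattern plus an optional verifier (hypothetical shape)."""
    name: str
    pattern: re.Pattern
    verify: Optional[Callable[[str], bool]] = None  # None = no verification available


@dataclass
class Finding:
    detector: str
    secret: str
    verified: Optional[bool]  # None means "not checked"


def stub_verify(secret: str) -> bool:
    # Placeholder only: a real verifier would call the provider's API.
    return False


DETECTORS = [
    Detector("huggingface-user", re.compile(r"hf_[a-zA-Z]{34}"), stub_verify),
    Detector("replicate", re.compile(r"r8_[a-zA-Z0-9]{40}")),  # no verifier attached
]


def scan(text: str, detectors: list[Detector]) -> list[Finding]:
    """Run every detector over the text and verify matches where possible."""
    findings = []
    for det in detectors:
        for match in det.pattern.finditer(text):
            secret = match.group(0)
            verified = det.verify(secret) if det.verify else None
            findings.append(Finding(det.name, secret, verified))
    return findings
```

Keeping `verified` as a three-state value mirrors the limitation noted above: a detector without a verifier reports "not checked" rather than "invalid".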

1.2 Gitleaks

GitHub: https://github.com/gitleaks/gitleaks
Stars: ~25,800
Language: Go
Rules: 150+ regex patterns in gitleaks.toml
Approach: Regex pattern matching with optional entropy checks
Detection methods:
- Regex patterns defined in TOML config
- Keyword matching
- Entropy thresholds
- Allowlists for false positive reduction
AI/LLM rules confirmed:
- anthropic-admin-api-key: sk-ant-admin01-[a-zA-Z0-9_\-]{93}AA
- anthropic-api-key: sk-ant-api03-[a-zA-Z0-9_\-]{93}AA
- openai-api-key: Updated to include sk-proj- and sk-svcacct- formats
- cohere-api-token: Keyword-based detection
- huggingface-access-token: hf_[a-z]{34}
- huggingface-organization-api-token: api_org_[a-z]{34}
Key differentiator: Fast, simple, excellent as pre-commit hook
Limitations:
- No active verification of detected keys
- Regex-only means higher false positive rate for generic patterns
- Limited LLM provider coverage beyond the six rules above, which span four providers (Anthropic, OpenAI, Cohere, HuggingFace)
Note: The Gitleaks creator launched "Betterleaks" in 2026 as a successor built for the agentic era
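
The combination of keyword matching, regex patterns, and allowlists listed above can be sketched as below. This is a minimal illustration, not the gitleaks.toml schema or Gitleaks' implementation: the `Rule` dataclass, the `apply_rule` function, and the `EXAMPLE` allowlist entry are hypothetical. Only the Anthropic regex is taken from the rule list above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: str
    regex: re.Pattern
    keywords: list[str] = field(default_factory=list)          # cheap pre-filter
    allowlist: list[re.Pattern] = field(default_factory=list)  # known false positives


ANTHROPIC = Rule(
    rule_id="anthropic-api-key",
    regex=re.compile(r"sk-ant-api03-[a-zA-Z0-9_\-]{93}AA"),
    keywords=["sk-ant-api03"],
    allowlist=[re.compile(r"EXAMPLE", re.IGNORECASE)],  # illustrative entry
)


def apply_rule(rule: Rule, line: str) -> list[str]:
    """Return regex hits on a line, skipping allowlisted matches."""
    # Skip the regex entirely when none of the rule's keywords is present.
    if rule.keywords and not any(k in line for k in rule.keywords):
        return []
    hits = [m.group(0) for m in rule.regex.finditer(line)]
    return [h for h in hits if not any(a.search(h) for a in rule.allowlist)]
```

The keyword check is a plain substring test, so most lines never reach the regex engine, which is one reason this style of scanner suits pre-commit hooks.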

1.3 detect-secrets (Yelp)

GitHub: https://github.com/Yelp/detect-secrets
Stars: ~4,300
Language: Python
Plugins: 27 built-in detectors
Approach: Baseline methodology — tracks known secrets and flags new ones
Detection methods:
- Regex-based plugins (structured secrets)
- High entropy string detection (Base64, Hex)
- Keyword detection (variable name matching)
- Optional ML-based gibberish detector (v1.1+)
AI/LLM plugins confirmed:
- OpenAIDetector plugin exists
- No dedicated Anthropic, Cohere, Mistral, or Groq plugins
Key differentiator: Baseline approach — only flags NEW secrets, not historical ones; enterprise-friendly
Limitations:
- Minimal LLM provider coverage
- No active verification
- Fewer patterns than TruffleHog or Gitleaks
- Python-only (slower than Go/Rust alternatives)
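
The baseline idea (record what is already known, then flag only what is new) can be sketched as follows. This is a simplified illustration and does not reproduce detect-secrets' actual baseline file format; `fingerprint` and `new_findings` are hypothetical helper names.

```python
import hashlib


def fingerprint(path: str, secret: str) -> str:
    """Hash the location and secret so the baseline never stores the raw value."""
    return hashlib.sha256(f"{path}:{secret}".encode()).hexdigest()


def new_findings(findings, baseline):
    """Return (path, secret) pairs whose fingerprint is not already in the baseline set."""
    return [(p, s) for p, s in findings if fingerprint(p, s) not in baseline]
```

A scan would build `baseline` once from the existing codebase and then call `new_findings` on each later run, so historical secrets stay quiet while fresh ones are reported.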

1.4 Nosey Parker (Praetorian)

GitHub: https://github.com/praetorian-inc/noseyparker
Stars: ~2,300
Language: Rust
Rules: 188 high-precision regex rules
Approach: Hybrid regex + ML denoising
Detection methods:
- 188 tested regex rules tuned for low false positives
- ML model for false positive reduction (10-1000x improvement)
- Deduplication/grouping of findings
Performance: GB/s scanning speeds, tested on 20TB+ datasets
Key differentiator: ML-enhanced denoising, extreme performance
Status: RETIRED — replaced by Titus (https://github.com/praetorian-inc/titus)
Limitations:
- No specific LLM provider rules documented
- No active verification
- Project discontinued

1.5 GitGuardian

Website: https://www.gitguardian.com
Type: Commercial + free tier for public repos
Detectors: 450+ secret types
Approach: Regex + AI-powered false positive reduction
Detection methods:
- Specific prefix-based detectors
- Fine-tuned code-LLM for false positive filtering
- Validity checking for supported detectors
AI/LLM coverage:
- Groq API Key (prefixed, with validity check)
- OpenAI, Anthropic, HuggingFace (confirmed)
- AI-related leaked secrets up 81% YoY in 2025
- 1,275,105 leaked AI service secrets detected in 2025
Key differentiator: AI-powered false positive reduction, massive scale (scans all public GitHub)
Limitations:
- Commercial/proprietary for private repos
- Regex patterns not publicly disclosed

1.6 GitHub Secret Scanning (Native)

Type: Built into GitHub
Approach: Provider-partnered pattern matching + Copilot AI
AI/LLM patterns supported (with push protection and validity status):

Provider	Pattern	Push Protection	Validity Check
Anthropic	`anthropic_admin_api_key`	Yes	Yes
Anthropic	`anthropic_api_key`	Yes	Yes
Anthropic	`anthropic_session_id`	Yes	No
Cohere	`cohere_api_key`	Yes	No
DeepSeek	`deepseek_api_key`	No	Yes
Google	`google_gemini_api_key`	No	No
Groq	`groq_api_key`	Yes	Yes
Hugging Face	`hf_org_api_key`	Yes	No
Hugging Face	`hf_user_access_token`	Yes	Yes
Mistral AI	`mistral_ai_api_key`	No	No
OpenAI	`openai_api_key`	Yes	Yes
Replicate	`replicate_api_token`	Yes	Yes
xAI	`xai_api_key`	Yes	Yes
Azure	`azure_openai_key`	Yes	No

Recent developments (March 2026):
- Added 37 new secret detectors including Langchain
- Extended scanning to AI coding agents via MCP
- Copilot uses GPT-3.5-Turbo + GPT-4 for unstructured secret detection (94% FP reduction)
- Base64-encoded secret detection with push protection
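
Base64-encoded secret detection, mentioned in the last item above, can be approximated by decoding Base64-looking runs and re-scanning the decoded text. The sketch below is an assumption about how such a check might work, not GitHub's implementation: `B64_CANDIDATE`, `find_base64_wrapped`, and the 24-character minimum are hypothetical choices, and the Groq pattern is the one listed in section 4.1.

```python
import base64
import binascii
import re

# Runs of Base64 alphabet characters, optionally padded (minimum length is an assumption).
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")
SECRET = re.compile(r"gsk_[a-zA-Z0-9]{48,}")  # Groq pattern from section 4.1


def find_base64_wrapped(text: str) -> list[str]:
    """Decode Base64-looking substrings and return any secrets found inside them."""
    found = []
    for m in B64_CANDIDATE.finditer(text):
        try:
            decoded = base64.b64decode(m.group(0), validate=True).decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not ASCII text once decoded
        found.extend(SECRET.findall(decoded))
    return found
```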

1.7 Other Notable Tools

Tool	Stars	Language	Patterns	Key Feature
KeyHacks (streaak)	6,100	Markdown/Shell	100+ services	Validation curl commands for bug bounty
keyhacks.sh (gwen001)	~500	Bash	50+	Automated version of KeyHacks
Secrets Patterns DB (mazen160)	1,400	YAML/Regex	1,600+	Largest open-source regex DB, exports to TruffleHog/Gitleaks format
secret-regex-list (h33tlit)	~1,000	Regex	100+	Regex patterns for scraping secrets
regextokens (odomojuli)	~300	Regex	50+	OAuth/API token regex patterns
Betterleaks	New (2026)	Go	—	Gitleaks successor for agentic era

2. LLM-Specific API Key Tools

2.1 Dedicated LLM Key Validators

Tool	URL	Providers	Approach
TestMyAPIKey.com	testmyapikey.com	OpenAI, Anthropic Claude, + 13 others	Client-side regex + live API validation
SecurityWall Checker	securitywall.co/tools/api-key-checker	455+ patterns, 350+ services (incl. OpenAI, Anthropic)	Client-side regex, generates curl commands
VibeFactory Scanner	vibefactory.ai/api-key-security-scanner	150+ types (incl. OpenAI)	Scans deployed websites for exposed keys
KeyLeak Detector	github.com/Amal-David/keyleak-detector	Multiple	Headless browser + network interception
OpenAI Key Tester	trevorfox.com/api-key-tester/openai	OpenAI, Anthropic	Direct API validation
Chatbot API Tester	apikeytester.netlify.app	OpenAI, DeepSeek, OpenRouter	Endpoint validation
SecurityToolkits	securitytoolkits.com/tools/apikey-validator	Multiple	API key/token checker

2.2 LLM Gateways with Key Validation

These tools validate keys as part of their proxy/gateway functionality:

Tool	Stars	Providers	Validation Approach
LiteLLM	~18k	107 providers	AuthenticationError mapping from all providers
OpenRouter	—	60+ providers, 500+ models	Unified API key, provider-level validation
Portkey AI	~5k	30+ providers	AI gateway with key validation
LLM-API-Key-Proxy	~200	OpenAI, Anthropic compatible	Self-hosted proxy with key validation

2.3 Key Gap: No Comprehensive LLM-Focused Scanner

Critical finding: There is NO dedicated open-source tool that:

- Detects API keys from all major LLM providers (50+)
- Validates them against live APIs
- Reports provider, model access, rate limits, and spend
- Covers both legacy and new key formats

The closest tools are:

- TruffleHog (broadest verification, but only ~3 confirmed LLM detectors)
- GitHub Secret Scanning (14 AI-related patterns, but GitHub-only)
- GitGuardian (broad AI coverage, but commercial)

3. Top LLM API Providers

Tier 1: Major Cloud & Frontier Model Providers

#	Provider	Key Product	Notes
1	OpenAI	GPT-5, GPT-4o, o-series	Market leader
2	Anthropic	Claude Opus 4, Sonnet, Haiku	Enterprise focus
3	Google (Gemini/Vertex AI)	Gemini 2.5 Pro/Flash	2M token context
4	AWS Bedrock	Multi-model (Claude, Llama, etc.)	AWS ecosystem
5	Azure OpenAI	GPT-4o, o-series	Enterprise SLA 99.9%
6	Google AI Studio	Gemini API	Developer-friendly
7	xAI	Grok 4.1	2M context, low cost

Tier 2: Specialized & Competitive Providers

#	Provider	Key Product	Notes
8	Mistral AI	Mistral Large, Codestral	European, open-weight
9	Cohere	Command R+	Enterprise RAG focus
10	DeepSeek	DeepSeek R1, V3	Ultra-low cost reasoning
11	Perplexity	Sonar Pro	Search-augmented LLM
12	Together AI	200+ open-source models	Low latency inference
13	Groq	LPU inference	Fastest inference speeds
14	Fireworks AI	Open-source model hosting	Sub-100ms latency
15	Replicate	Model hosting platform	Pay-per-use
16	Cerebras	Wafer-scale inference	Ultra-fast inference
17	SambaNova	Enterprise inference	Custom silicon
18	AI21	Jamba models	Long context
19	Stability AI	Stable Diffusion, text models	Image + text
20	NVIDIA NIM	Optimized model serving	GPU-optimized

Tier 3: Infrastructure, Platform & Gateway Providers

#	Provider	Key Product	Notes
21	Cloudflare Workers AI	Edge inference	Edge computing
22	Vercel AI	AI SDK, v0	Frontend-focused
23	OpenRouter	Multi-model gateway	500+ models
24	HuggingFace	Inference API, 300+ models	Open-source hub
25	DeepInfra	Inference platform	Cost-effective
26	Novita AI	200+ production APIs	Multi-modal
27	Baseten	Model serving	Custom deployments
28	Anyscale	Ray-based inference	Scalable
29	Lambda AI	GPU cloud + inference
30	OctoAI	Optimized inference
31	Databricks	DBRX, model serving	Data + AI
32	Snowflake	Cortex AI	Data warehouse + AI
33	Oracle OCI	OCI AI	Enterprise
34	SAP Generative AI Hub	Enterprise AI	SAP ecosystem
35	IBM WatsonX	Granite models	Enterprise

Tier 4: Chinese & Regional Providers

#	Provider	Key Product	Notes
36	Alibaba (Qwen/Dashscope)	Qwen 2.5/3 series	Top Chinese open-source
37	Baidu (Wenxin/ERNIE)	ERNIE 4.0	Chinese market leader
38	ByteDance (Doubao)	Doubao	TikTok parent
39	Zhipu AI	GLM-4.5	ChatGLM lineage
40	Baichuan	Baichuan 4	Domain-specific (law, finance)
41	Moonshot AI (Kimi)	Kimi K1.5/K2	128K context
42	01.AI (Yi)	Yi-Large, Yi-34B	Founded by Kai-Fu Lee
43	MiniMax	MiniMax models	Chinese AI tiger
44	StepFun	Step models	Chinese AI tiger
45	Tencent (Hunyuan)	Hunyuan models	WeChat ecosystem
46	iFlyTek (Spark)	Spark models	Voice/NLP specialist
47	SenseNova (SenseTime)	SenseNova models	Vision + language
48	Volcano Engine (ByteDance)	Cloud AI services	ByteDance cloud
49	Nebius AI	Inference platform	Yandex spinoff

Tier 5: Emerging, Niche & Specialized Providers

#	Provider	Key Product	Notes
50	Aleph Alpha	Luminous models	EU-focused, compliance
51	Comet API	ML experiment tracking
52	Writer	Palmyra models	Enterprise content
53	Reka AI	Reka Core/Flash	Multimodal
54	Upstage	Solar models	Korean provider
55	FriendliAI	Inference optimization
56	Forefront AI	Model hosting
57	GooseAI	GPT-NeoX hosting	Low cost
58	NLP Cloud	Model hosting
59	Predibase	Fine-tuning platform	LoRA specialist
60	Clarifai	Vision + LLM
61	AiLAYER	AI platform
62	AIMLAPI	Multi-model API
63	Corcel	Decentralized inference	Bittensor-based
64	HyperBee AI	AI platform
65	Lamini	Fine-tuning + inference
66	Monster API	GPU inference
67	Neets.ai	TTS + LLM
68	Featherless AI	Inference
69	Hyperbolic	Inference platform
70	Inference.net	Open-source inference
71	Galadriel	Decentralized AI
72	PublicAI	Community inference
73	Bytez	Model hosting
74	Chutes	Inference
75	GMI Cloud	GPU cloud + inference
76	Nscale	Inference platform
77	Scaleway	European cloud AI
78	OVHCloud AI	European cloud AI
79	Heroku AI	PaaS AI add-on
80	Sarvam.ai	Indian AI models

Tier 6: Self-Hosted & Local Inference

#	Provider	Key Product	Notes
81	Ollama	Local LLM runner	No API key needed
82	LM Studio	Desktop LLM	No API key needed
83	vLLM	Inference engine	Self-hosted
84	Llamafile	Single-file LLM	Self-hosted
85	Xinference	Inference platform	Self-hosted
86	Triton Inference Server	NVIDIA serving	Self-hosted
87	LlamaGate	Gateway	Self-hosted
88	Docker Model Runner	Container inference	Self-hosted

Tier 7: Aggregators, Gateways & Middleware

#	Provider	Key Product	Notes
89	LiteLLM	AI gateway (107 providers)	Open-source
90	Portkey	AI gateway	Observability
91	Helicone	LLM observability	Proxy-based
92	Bifrost	AI gateway (Go)	Fastest gateway
93	Kong AI Gateway	API management	Enterprise
94	Vercel AI Gateway	Edge AI
95	Cloudflare AI Gateway	Edge AI
96	Agenta	LLM ops platform
97	Straico	Multi-model
98	AI302	Gateway
99	AIHubMix	Gateway
100	Zenmux	Gateway
101	Poe	Multi-model chat	Quora
102	Gitee AI	Chinese GitHub AI
103	GitHub Models	GitHub-hosted inference
104	GitHub Copilot	Code completion
105	ModelScope	Chinese model hub	Alibaba
106	Voyage AI	Embeddings
107	Jina AI	Embeddings + search
108	Deepgram	Speech-to-text
109	ElevenLabs	Text-to-speech
110	Black Forest Labs	Image generation (FLUX)
111	Fal AI	Image/video generation
112	RunwayML	Video generation
113	Recraft	Image generation
114	DataRobot	ML platform
115	Weights & Biases	ML ops + inference
116	CompactifAI	Model compression
117	GradientAI	Fine-tuning
118	Topaz	AI platform
119	Synthetic	Data generation
120	Infiniai	Inference
121	Higress	AI gateway	Alibaba
122	PPIO	Inference
123	Qiniu	Chinese cloud AI
124	NanoGPT	Lightweight inference
125	Morph	AI platform
126	Milvus	Vector DB + AI
127	XiaoMi MiMo	Xiaomi AI
128	Petals	Distributed inference
129	ZeroOne	AI platform
130	Lemonade	AI platform
131	Taichu	Chinese AI
132	Amazon Nova	AWS native models

4. API Key Patterns by Provider

4.1 Confirmed Key Prefixes & Formats

Provider	Prefix	Regex Pattern	Confidence
OpenAI (legacy)	`sk-`	`sk-[a-zA-Z0-9]{48}`	High
OpenAI (project)	`sk-proj-`	`sk-proj-[a-zA-Z0-9_-]{80,}`	High
OpenAI (service account)	`sk-svcacct-`	`sk-svcacct-[a-zA-Z0-9_-]{80,}`	High
OpenAI (legacy user)	`sk-None-`	`sk-None-[a-zA-Z0-9_-]{80,}`	High
Anthropic (API)	`sk-ant-api03-`	`sk-ant-api03-[a-zA-Z0-9_\-]{93}AA`	High
Anthropic (Admin)	`sk-ant-admin01-`	`sk-ant-admin01-[a-zA-Z0-9_\-]{93}AA`	High
Google AI / Gemini	`AIza`	`AIza[0-9A-Za-z\-_]{35}`	High
HuggingFace (user)	`hf_`	`hf_[a-zA-Z]{34}`	High
HuggingFace (org)	`api_org_`	`api_org_[a-zA-Z]{34}`	High
Groq	`gsk_`	`gsk_[a-zA-Z0-9]{48,}`	High
Replicate	`r8_`	`r8_[a-zA-Z0-9]{40}`	High
Fireworks AI	`fw_`	`fw_[a-zA-Z0-9_-]{40,}`	Medium
Perplexity	`pplx-`	`pplx-[a-zA-Z0-9]{48}`	High
AWS (general)	`AKIA`	`AKIA[0-9A-Z]{16}`	High
GitHub PAT	`ghp_`	`ghp_[a-zA-Z0-9]{36}`	High
Stripe (secret)	`sk_live_`	`sk_live_[0-9a-zA-Z]{24}`	High

4.2 Providers with No Known Distinct Prefix

These providers use generic-looking API keys without distinguishing prefixes, making detection harder:

Provider	Key Format	Detection Approach
Mistral AI	Generic alphanumeric	Keyword-based (`MISTRAL_API_KEY`)
Cohere	Generic alphanumeric	Keyword-based (`COHERE_API_KEY`, `CO_API_KEY`)
Together AI	Generic alphanumeric	Keyword-based
DeepSeek	`sk-` prefix (same as OpenAI legacy)	Keyword context needed
Azure OpenAI	32-char hex	Keyword-based
Stability AI	`sk-` prefix	Keyword context needed
AI21	Generic alphanumeric	Keyword-based
Cerebras	Generic alphanumeric	Keyword-based
SambaNova	Generic alphanumeric	Keyword-based

4.3 Detection Difficulty Tiers

Easy (unique prefix): OpenAI (sk-proj-, sk-svcacct-), Anthropic (sk-ant-), HuggingFace (hf_), Groq (gsk_), Replicate (r8_), Perplexity (pplx-), AWS (AKIA)

Medium (shared or short prefix): OpenAI legacy (sk-), DeepSeek (sk-), Stability (sk-), Fireworks (fw_), Google (AIza)

Hard (no prefix, keyword-only): Mistral, Cohere, Together AI, Azure OpenAI, AI21, Cerebras, most Chinese providers

5. Key Validation Approaches

5.1 Common Validation Endpoints

Provider	Validation Method	Endpoint	Cost
OpenAI	List models	`GET /v1/models`	Free (no tokens consumed)
Anthropic	Send minimal message	`POST /v1/messages` (tiny prompt)	Minimal cost (~1 token)
Google Gemini	List models	`GET /v1/models`	Free
Cohere	Token check	`POST /v1/tokenize` or `/v1/generate`	Minimal
HuggingFace	Whoami	`GET /api/whoami`	Free
Groq	List models	`GET /v1/models`	Free
Replicate	Get account	`GET /v1/account`	Free
Mistral	List models	`GET /v1/models`	Free
AWS	STS GetCallerIdentity	`POST sts.amazonaws.com`	Free
Azure OpenAI	List deployments	`GET /openai/deployments`	Free
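
A "list models" check like the OpenAI row above can be sketched with the standard library alone. The host `https://api.openai.com` and the `Authorization: Bearer` header follow OpenAI's documented convention; the mapping from status codes to `live`, `rejected`, and `unknown` is an assumption of this sketch, and `http_status` and `classify_key` are hypothetical names. The `fetch` parameter exists so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request

OPENAI_MODELS_URL = "https://api.openai.com/v1/models"


def http_status(url: str, key: str, timeout: float = 10.0) -> int:
    """Issue a GET with a Bearer token and return the HTTP status code.

    Network-level failures (urllib.error.URLError) are left to propagate.
    """
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {key}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def classify_key(key: str, fetch=http_status) -> str:
    """Map the status of a list-models call to a coarse verdict (assumed mapping)."""
    status = fetch(OPENAI_MODELS_URL, key)
    if status == 200:
        return "live"
    if status in (401, 403):
        return "rejected"
    return "unknown"  # e.g. 429 or 5xx: retry later rather than guess
```

Treating rate limits and server errors as `unknown` avoids reporting a working key as dead simply because the provider was busy.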

5.2 Validation Strategy Patterns

- Passive detection (regex only): Fastest, with the highest false positive rate. Used by Gitleaks and by detect-secrets in baseline mode.
- Passive + entropy: Combines regex with entropy scoring, which reduces false positives for generic patterns. Used by detect-secrets with entropy plugins.
- Active verification (API call): Makes a lightweight API call to confirm the key is live. Used by TruffleHog and GitHub secret scanning. Removes false positives for keys that verify, but requires network access.
- Deep analysis (permission enumeration): Goes beyond verification to enumerate what the key can access. Used by TruffleHog for ~20 credential types. Most actionable but slowest.
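
The entropy scoring used in the "passive + entropy" strategy is typically Shannon entropy over the characters of a candidate string. A minimal sketch follows; the 3.5 bits-per-character threshold is an illustrative assumption rather than a value taken from any of the tools above, and `looks_random` is a hypothetical helper.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_random(candidate: str, threshold: float = 3.5) -> bool:
    # Threshold is an assumption for illustration; real tools tune it per charset.
    return shannon_entropy(candidate) >= threshold
```

A repeated-character string scores 0 bits, and a string of four distinct characters scores exactly 2 bits, so dictionary words and placeholders tend to fall below the threshold while machine-generated keys sit above it.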

5.3 How Existing Tools Validate

Tool	Passive	Entropy	Active Verification	Permission Analysis
TruffleHog	Yes	No	Yes (subset of 800+ detectors)	Yes (~20 types)
Gitleaks	Yes	Optional	No	No
detect-secrets	Yes	Yes	Limited	No
Nosey Parker	Yes	ML-based	No	No
GitGuardian	Yes	Yes	Yes (selected)	Limited
GitHub Scanning	Yes	AI-based	Yes (selected)	No
SecurityWall	Yes	No	Generates curl cmds	No
KeyHacks	No	No	Manual curl cmds	Limited

6. Market Gaps & Opportunities

6.1 Underserved Areas

- LLM-specific comprehensive scanner: No tool covers all 50+ LLM API providers with both detection and validation.
- New key format coverage: OpenAI's sk-proj- and sk-svcacct- formats are recent; many scanners only detect legacy sk- format. Gitleaks only added these in late 2025 via PR #1780.
- Chinese/regional provider detection: Almost zero coverage for Qwen, Baichuan, Zhipu, Moonshot, Yi, ERNIE, Doubao API keys in any scanner.
- Key metadata extraction: No tool extracts org, project, rate limits, or spend from detected LLM keys.
- Agentic AI context: With AI agents increasingly using API keys, there's a growing need for scanners that understand multi-key configurations (e.g., an agent with OpenAI + Anthropic + Serp API keys).
- Vibe coding exposure: VibeFactory's scanner addresses the problem of API keys exposed in frontend JavaScript by vibe-coded apps, but this is still nascent.

6.2 Scale of the Problem

- 28 million credentials leaked on GitHub in 2025 (Snyk)
- 1,275,105 leaked AI service secrets in 2025 (GitGuardian), up 81% YoY
- 8 of the 10 fastest-growing leaked secret categories are AI-related (GitGuardian)
- Fastest growing: Brave Search API (+1,255%), Supabase (+992%), Firecrawl (+796%)
- Groq keys alone appear at a rate of 42.28 per million commits (GitGuardian)

6.3 Competitive Landscape Summary

                    Verification Depth
                    |
        TruffleHog  |  ████████████████  (800+ detectors, deep analysis)
        GitGuardian |  ████████████      (450+ detectors, commercial)
        GitHub      |  ██████████        (AI-powered, platform-locked)
        Gitleaks    |  ████              (150+ regex, no verification)
        detect-sec  |  ███               (27 plugins, baseline approach)
        NoseyParker |  ██                (188 rules, ML denoising, retired)
                    |
                    +------ LLM Provider Coverage ------>
                    
        None of these tools provides more than 15 LLM provider detectors.
        The market opportunity is a scanner focused on 50-100+ LLM providers
        with active verification, permission analysis, and cost estimation.
