keyhunter/.planning/research/FEATURES.md
2026-04-04 19:03:12 +03:00

Feature Landscape: API Key Scanner Domain

Domain: API key / secret scanner — LLM/AI provider focus, OSINT recon, active verification
Researched: 2026-04-04
Competitive reference: TruffleHog, Gitleaks, Betterleaks, detect-secrets, GitGuardian, Nosey Parker/Titus, GitHub Secret Scanning


Competitive Landscape Summary

| Tool | LLM Providers | Verification | OSINT/Recon | Sources | Output |
|---|---|---|---|---|---|
| TruffleHog | ~15 (OpenAI, Anthropic, HF partial) | Yes, 700+ | No | Git, S3, Docker, Postman, Jenkins | JSON, text |
| Gitleaks | ~5-10 (OpenAI, HF partial) | No | No | Git, dir, stdin | JSON, CSV, SARIF, JUnit |
| Betterleaks | ~10-15 (est.) | Planned | No | Git, dir, files | Unknown (Gitleaks-compatible) |
| detect-secrets | ~5 (keyword-based) | No | No | Files, git staged | JSON baseline |
| Titus | 450+ rules (broad SaaS) | Yes (validate flag) | No | Files, git, binary | JSON |
| GitGuardian | 550+ detectors | Yes (validity checks) | No | Git, CI/CD, Slack, Jira, Docker | Dashboard, alerts |
| GitHub Secret Scanning | 700+ patterns (cloud-first) | Yes (validity checks) | No | GitHub repos only | Dashboard, SARIF |
| KeyHunter (target) | 108 LLM providers | Yes (opt-in) | Yes (80+ sources) | Git + OSINT + IoT + paste | Table, JSON, SARIF, CSV |

Key market gap confirmed: No existing open-source tool covers 100+ LLM providers with detection + verification + OSINT recon combined. The 81% YoY surge in AI-service credential leaks (GitGuardian 2026 report, 1.27M leaked secrets) validates the demand.


Table Stakes

Features users expect from any credible secret scanner. Missing one = users choose a competitor immediately.

| Feature | Why Expected | Complexity | Notes |
|---|---|---|---|
| Regex-based pattern detection | Every tool has it; users assume it exists | Low | Foundation of all scanners; must be fast |
| Entropy analysis | Standard complement to regex since TruffleHog popularized it | Low | Shannon entropy; high FP rate alone, so pair with keywords |
| Keyword pre-filtering | TruffleHog's performance trick; users of large repos demand it | Low-Med | Filter to candidate files before applying regex; ~10x speedup |
| Git history scanning | TruffleHog/Gitleaks primary use case; users expect full history | Med | Must traverse all commits, branches, tags |
| Directory/file scanning | Needed for non-git use cases (CI artifacts, file shares) | Low | Walk directory tree, apply detectors |
| JSON output | Machine-readable output for pipeline integration | Low | Standard across all tools |
| False-positive reduction / deduplication | Alert fatigue is a known pain point across all scanners | Med | Deduplicate the same secret seen in N commits |
| Pre-commit hook support | Shift-left; developers expect git hook integration | Low | Blocks commits with detected secrets |
| CI/CD integration | GitHub Actions, GitLab CI, Jenkins; any serious scanner has this | Low | Binary runs in pipeline; exit code drives pass/fail |
| SARIF output | Required for GitHub Code Scanning tab, GitLab Security dashboard | Low | Standard format; Gitleaks, Titus, Zimara all support it |
| Masked output by default | Security hygiene; users expect keys not printed to the terminal in the clear | Low | Mask middle chars; --unmask flag shows the full key |
| Provider-based detection rules | Users expect named detectors ("OpenAI key detected"), not raw regex | Med | Named detectors with confidence; YAML definitions in KeyHunter's case |
| Active key verification (opt-in) | TruffleHog proved this: confirmed keys are worth 10x more to users | Med | MUST be opt-in; legal/ethical requirement; network call to provider API |
| --verify flag (off by default) | Legal safety norm in the ecosystem; users expect passive-by-default | Low | Standard pattern established by TruffleHog |
| CSV export | Needed for spreadsheet/reporting workflows | Low | Standard; Gitleaks, Titus support it |
| Multi-platform binary | Single-binary install is the expectation for Go tools | Low | Linux, macOS; Docker for Windows |
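The masked-output row above can be sketched in a few lines. This is an illustrative helper, not KeyHunter's actual API; `maskKey` and the 4-character window are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// maskKey keeps a short prefix and suffix visible and masks the middle,
// so a finding still identifies the provider (e.g. the "sk-proj-" prefix)
// without printing the full credential. A --unmask flag would skip this.
func maskKey(key string) string {
	const keep = 4 // visible characters at each end
	if len(key) <= 2*keep {
		return strings.Repeat("*", len(key))
	}
	return key[:keep] + strings.Repeat("*", len(key)-2*keep) + key[len(key)-keep:]
}

func main() {
	fmt.Println(maskKey("sk-proj-ABCDEFGH")) // sk-p********EFGH
	fmt.Println(maskKey("short"))            // *****
}
```

Keeping the prefix visible matters because the prefix is what names the provider in the finding.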

Differentiators

Features that set KeyHunter apart from every existing tool. These are the competitive moat.

Tier 1: Core Differentiators (Primary Competitive Advantage)

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| 108 LLM/AI provider coverage | No tool covers more than ~15-20 LLM providers; this is a 5-7x gap | High | YAML-driven provider definitions; must include prefix-based (OpenAI, Anthropic, HF, Groq, Replicate) AND keyword-based (Mistral, Cohere, Together AI, Chinese providers) |
| OSINT/recon engine (80+ sources) | No scanner combines detection + OSINT in one tool | Very High | 18 source categories: code hosting, paste sites, IoT scanners, search dorks, package registries, CI/CD logs, web archives, forums, etc. |
| Active verification for 108 LLM providers | TruffleHog verifies ~700 types but covers far fewer LLM providers | High | Each YAML provider definition includes a verify endpoint; --verify opt-in |
| Built-in dork engine (150+ dorks) | Search engine dorking is manual today; no tool has YAML-managed dorks | Med | GitHub, Google, Shodan, Censys, ZoomEye, FOFA dorks in YAML; extensible the same way as providers |
| IoT scanner integration | Shodan/Censys/ZoomEye/FOFA for exposed LLM endpoints | High | Scans for vLLM, Ollama, LiteLLM proxy leaks, a growing attack surface (1scan showed thousands of exposed LLM endpoints) |
| YAML provider plugin system | Community can add providers without recompiling | Med | Compile-time embed via Go //go:embed; provider = pattern + keywords + verify endpoint + metadata |

Tier 2: Strong Differentiators (Meaningfully Better Than Alternatives)

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Paste site aggregation (20+ sites) | Paste sites are a top leak vector; no scanner covers them systematically | High | Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io, etc. |
| Package registry scanning (8+ registries) | npm, PyPI, RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy — LLM keys embedded in packages are a real vector | High | Scan source tarballs and metadata |
| Container/IaC scanning | Docker Hub layers, K8s configs, Terraform state, Helm, Ansible | High | Complements git scanning with the infra layer |
| Web dashboard (htmx + Tailwind) | No open-source scanner has an embedded web UI | High | SQLite backend, embedded in binary via go:embed; scans/keys/recon/providers/dorks/settings |
| Telegram bot integration | Immediate mobile notification of findings; no scanner has this | Med | /scan, /verify, /recon, /status commands |
| Scheduled scanning with auto-notify | Recurring scans with cron; no scanner has this natively | Med | Cron-based; webhook or Telegram on new findings |
| SQLite storage with AES-256 encryption | Persistent scan state; other tools are stateless | Med | Store findings, recon results, key status history |
| TruffleHog + Gitleaks import adapter | Lets users pipe existing tool output into KeyHunter's verification/storage | Low-Med | JSON import from both tools; normalizes results |
| APK decompile scanning | Mobile app binaries as a source; no common scanner does this | High | Depends on external apktool/jadx; wrap as optional integration |
| Web archive scanning | Wayback Machine + CommonCrawl for historical leaks | Med | Useful for finding keys that were removed from code but are still indexed |
| Source map / webpack bundle scanning | Frontend JS bundles frequently contain embedded API keys | Med | Fetch and parse JS source maps from deployed sites |
| Permission analysis (future) | TruffleHog Analyze covers 30 types; KeyHunter could expand to LLM scope | Very High | Know what a leaked key can do — model access, billing, rate limits |

Tier 3: Nice-to-Have Differentiators

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Colored table output | Better UX than plain text | Low | Use lipgloss or tablewriter; standard in modern Go CLIs |
| Rate limiting per OSINT source | Responsible scanning without bans | Med | Per-host rate limiter; configurable |
| Stealth mode / robots.txt respect | Ethical scanning; avoids legal issues for researchers | Med | Opt-in stealth; obey robots.txt when configured |
| Delta-based git scanning | Only scan new commits since last run; performance for CI | Med | Store last scanned commit hash in SQLite |
| mmap-based file reading | Memory-efficient scanning of large files | Med | Use for large log files and archives |
| Worker pool parallelism | TruffleHog does this; expected for performance | Med | Configurable goroutine pool per source type |
| Cloud storage scanning (S3, GCS, Azure Blob) | Buckets frequently contain leaked config files | High | Requires cloud credentials to scan; scope carefully |
| Forum/community scanning (Reddit, HN, StackOverflow) | Real leak vector; developers share code with keys | High | Rate-limited scraping; search API where available |
| Collaboration tool scanning (Notion, Confluence) | Enterprise leak vector; increasingly relevant | Very High | Auth flows complex; may need per-org API tokens |
| Threat intel integration (VirusTotal, IntelX) | Cross-reference found keys against known breach databases | High | Add-on verification layer |

Anti-Features

Features to deliberately NOT build. Building these would waste resources, create scope creep, or undermine the tool's identity.

| Anti-Feature | Why Avoid | What to Do Instead |
|---|---|---|
| Key rotation / remediation | KeyHunter is a finder, not a fixer; building rotation = competing with HashiCorp Vault, AWS Secrets Manager, Doppler | Document links to provider-specific rotation guides; link from findings output |
| SaaS / cloud-hosted version | Shifts the tool from open-source security tool to commercial product; legal/privacy complexity explodes | Keep open-source; let users self-host the web dashboard |
| GUI desktop app | High dev cost for low security-tool audience benefit; security tools are CLI-first | CLI + embedded web dashboard covers both audiences |
| Real-time streaming API | Batch scanning is the primary mode; streaming adds websocket/SSE complexity for marginal gain | Use scheduled scans + webhooks/Telegram for near-real-time alerting |
| Windows native build | Small portion of target audience (red teams, DevSecOps); WSL/Docker serves them | State WSL/Docker support clearly in README |
| AI-generated code scanning (static analysis) | Different domain entirely from secret detection; scope creep | Stay focused on credential/secret detection |
| Automatic key invalidation | Calling a provider API to revoke a key without explicit user consent is dangerous and potentially illegal | Gate ALL provider API calls behind --verify; never call provider APIs passively |
| Scanning without user consent | Legal and ethical requirement; all scanning must be intentional | Require explicit targets; no auto-discovery of new repos to scan |
| Built-in proxy/VPN | Scope creep; the tool should not manage network routing | Document use with external proxies; support HTTP_PROXY env var |
| Key marketplace / sharing | Fundamentally changes the ethical posture of the tool from defender to attacker | Hard no; never log or transmit found keys anywhere outside local SQLite |
| Excessive telemetry | Security tools must not phone home; community trust requires zero telemetry | No analytics, no crash reporting, no network calls except explicit --verify |

Feature Dependencies

Regex patterns + keyword lists
    -> Provider YAML definitions (pattern + keywords + verify endpoint)
        -> Core scanning engine (file, dir, git)
            -> Active verification (--verify flag)
            -> SQLite storage (findings persistence)
                -> Web dashboard (htmx, reads from SQLite)
                -> JSON/CSV/SARIF export
                -> Telegram bot (reads from SQLite, sends alerts)
                -> Scheduled scanning (cron -> scan -> SQLite -> notify)

Provider YAML definitions
    -> Dork YAML definitions (same extensibility pattern)
        -> Built-in dork engine
            -> OSINT/Recon engine (uses dorks per source)
                -> IoT scanners (Shodan, Censys, ZoomEye, FOFA)
                -> Code hosting (GitHub, GitLab, HuggingFace, etc.)
                -> Paste sites
                -> Package registries
                -> Search engine dorking
                -> Web archives
                -> CI/CD logs
                -> Forums
                -> Collaboration tools
                -> Cloud storage
                -> Container/IaC

TruffleHog/Gitleaks JSON import
    -> Active verification (can verify imported keys)
    -> SQLite storage (can store imported findings)

Delta-based git scanning
    -> SQLite storage (requires stored last-scanned commit)

Keyword pre-filtering
    -> Core scanning engine (filter before regex application)

Worker pool parallelism
    -> All scanning operations (applies globally)
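The keyword pre-filtering dependency above (filter before regex application) can be sketched as a cheap substring pass gating the expensive regex stage. Function names and keyword lists are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// candidate reports whether content contains any detector keyword,
// using a cheap case-insensitive substring check. Only candidates
// proceed to the regex stage, which is the expensive step — this is
// the TruffleHog-style speedup the table stakes section describes.
func candidate(content string, keywords []string) bool {
	lower := strings.ToLower(content)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			return true
		}
	}
	return false
}

func main() {
	keywords := []string{"sk-proj-", "sk-ant-", "hf_"}
	pattern := regexp.MustCompile(`sk-proj-[A-Za-z0-9_-]{20,}`)

	files := map[string]string{
		"config.env": "OPENAI_API_KEY=sk-proj-abcdefghijklmnopqrstuv",
		"readme.txt": "nothing secret in here",
	}
	for name, content := range files {
		if !candidate(content, keywords) {
			continue // skipped without ever running the regex
		}
		if m := pattern.FindString(content); m != "" {
			fmt.Printf("%s: %s\n", name, m)
		}
	}
}
```

On a large repo most files fail the keyword gate, so the regex set runs against a small candidate subset.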

MVP Recommendation

Build in strict dependency order. Each phase must be complete before the next delivers value.

Phase 1 — Foundation (table stakes, no differentiators yet):

  1. Provider YAML definitions for 108 LLM providers (patterns, keywords, verify endpoints)
  2. Core scanning engine: regex + entropy + keyword pre-filtering
  3. Input sources: file, dir, git history, stdin
  4. Active verification via --verify flag (off by default)
  5. Output: colored table, JSON, SARIF, CSV
  6. SQLite storage with AES-256

Phase 2 — First differentiators (competitive moat begins here):

  7. Full key access: --unmask, keys show, web dashboard
  8. TruffleHog + Gitleaks import adapters
  9. Built-in dork engine (YAML dorks, 150+)
  10. Pre-commit hook + CI/CD integration (SARIF, exit codes)

Phase 3 — OSINT engine (the primary differentiator):

  11. Recon engine core: code hosting (GitHub, GitLab, HuggingFace, Replit, etc.)
  12. Paste site aggregator (20+ sites)
  13. Search engine dorking (Google, Bing, DuckDuckGo, etc.)
  14. Package registries (npm, PyPI, RubyGems, etc.)
  15. IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge)

Phase 4 — Automation and reach:

  16. Telegram bot
  17. Scheduled scanning (cron-based)
  18. Remaining OSINT sources: CI/CD logs, web archives, forums, cloud storage, container/IaC, APK, source maps, threat intel

Defer beyond v1:

  • Collaboration tool scanning (Notion, Confluence, Google Docs): auth complexity is very high; add in v2 if demand exists
  • Permission analysis: very high complexity; requires provider-specific API exploration per provider; good v2 feature
  • Web archive scanning: CommonCrawl data is huge; requires careful scoping to avoid running for days

Detection Method Tradeoffs

Based on research across competitive tools, relevant for architectural decisions:

| Method | Recall | Precision | Speed | Best For |
|---|---|---|---|---|
| Regex (named patterns) | High (for known formats) | High | Fast | Provider keys with known prefixes (OpenAI sk-proj-, Anthropic sk-ant-api03-, HuggingFace hf_, Groq gsk_) |
| Entropy (Shannon) | Medium (70.4% per Betterleaks data) | Low (high FP) | Fast | Generic high-entropy strings; use as a secondary signal only |
| BPE tokenization (Betterleaks) | Very High (98.6%) | High | Medium | Next-gen; consider for v2 |
| Keyword pre-filtering | N/A (filter only) | N/A | Very Fast | Reduce candidate set before regex; TruffleHog pattern |
| ML/LLM-based (Nosey Parker AI, GPT-4) | High | Very High | Slow/expensive | FuzzingLabs: GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%; v2 consideration |
| Contextual validation | High | Very High | Medium | GitGuardian's third layer; reduces FP significantly |

KeyHunter approach: Regex (primary) + keyword pre-filtering (performance) + entropy (secondary signal). ML-based detection is a v2 feature once the provider coverage gap is closed.


Ecosystem Context (2026)

  • AI-service credential leaks: 1.27M in 2025, up 81% YoY (GitGuardian State of Secrets Sprawl 2026)
  • 29M total secrets leaked on GitHub in 2025 (34% YoY increase, largest single-year jump ever)
  • LLM infrastructure leaks grow 5x faster than core model provider leaks
  • Claude Code-assisted commits show 3.2% leak rate vs 1.5% baseline — AI coding tools making it worse
  • 24,008 unique secrets in MCP configuration files found in 2025
  • Betterleaks (March 2026): BPE tokenization achieves 98.6% recall vs 70.4% for entropy — new detection paradigm worth tracking
  • FuzzingLabs (April 2026): GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%, TruffleHog 0% on split/obfuscated secrets — LLM-based detection becoming viable
  • TruffleHog + HuggingFace partnership: native HF scanner for models, datasets, Spaces
  • GitHub Secret Scanning: added DeepSeek validity checks in March 2026 — LLM provider awareness growing

Sources