# Feature Landscape: API Key Scanner Domain

**Domain:** API key / secret scanner: LLM/AI provider focus, OSINT recon, active verification

**Researched:** 2026-04-04

**Competitive Reference:** TruffleHog, Gitleaks, Betterleaks, detect-secrets, GitGuardian, Nosey Parker/Titus, GitHub Secret Scanning

---

## Competitive Landscape Summary

| Tool | LLM Providers | Verification | OSINT/Recon | Sources | Output |
|------|--------------|-------------|------------|---------|--------|
| TruffleHog | ~15 (OpenAI, Anthropic, HF partial) | Yes, 700+ | No | Git, S3, Docker, Postman, Jenkins | JSON, text |
| Gitleaks | ~5-10 (OpenAI, HF partial) | No | No | Git, dir, stdin | JSON, CSV, SARIF, JUnit |
| Betterleaks | ~10-15 (est.) | Planned | No | Git, dir, files | Unknown (Gitleaks-compatible) |
| detect-secrets | ~5 (keyword-based) | No | No | Files, git staged | JSON baseline |
| Titus | 450+ rules (broad SaaS) | Yes (validate flag) | No | Files, git, binary | JSON |
| GitGuardian | 550+ detectors | Yes (validity checks) | No | Git, CI/CD, Slack, Jira, Docker | Dashboard, alerts |
| GitHub Secret Scanning | 700+ patterns (cloud-first) | Yes (validity checks) | No | GitHub repos only | Dashboard, SARIF |
| KeyHunter (target) | 108 LLM providers | Yes (opt-in) | Yes (80+ sources) | Git+OSINT+IoT+Paste | Table, JSON, SARIF, CSV |

**Key market gap confirmed:** No existing open-source tool covers 100+ LLM providers with detection + verification + OSINT recon combined. The 81% YoY surge in AI-service credential leaks (GitGuardian 2026 report, 1.27M leaked secrets) validates the demand.

---

## Table Stakes

Features users expect from any credible secret scanner. Missing one = users choose a competitor immediately.

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Regex-based pattern detection | Every tool has it; users assume it exists | Low | Foundation of all scanners; must be fast |
| Entropy analysis | Standard complement to regex since TruffleHog popularized it | Low | Shannon entropy; high FP rate on its own, so it needs keywords too |
| Keyword pre-filtering | TruffleHog's performance trick; users of large repos demand it | Low-Med | Filter to candidate files before applying regex; 10x speedup |
| Git history scanning | TruffleHog/Gitleaks primary use case; users expect full history | Med | Must traverse all commits, branches, tags |
| Directory/file scanning | Needed for non-git use cases (CI artifacts, file shares) | Low | Walk directory tree, apply detectors |
| JSON output | Machine-readable output for pipeline integration | Low | Standard across all tools |
| False positive reduction / deduplication | Alert fatigue is a known pain point across all scanners | Med | Deduplicate same secret seen in N commits |
| Pre-commit hook support | Shift-left; developers expect git hook integration | Low | Blocks commits with detected secrets |
| CI/CD integration | GitHub Actions, GitLab CI, Jenkins; any serious scanner has this | Low | Binary runs in pipeline; exit code drives pass/fail |
| SARIF output | Required for GitHub Code Scanning tab, GitLab Security dashboard | Low | Standard format; Gitleaks, Titus, Zimara all support it |
| Masked output by default | Security hygiene; users expect keys not printed in clear to terminal | Low | Mask middle chars; `--unmask` flag to show full |
| Provider-based detection rules | Users expect named detectors ("OpenAI key detected"), not raw regex | Med | Named detectors with confidence; YAML definitions in KeyHunter's case |
| Active key verification (opt-in) | TruffleHog verified this: confirmed keys are worth 10x more to users | Med | MUST be opt-in; legal/ethical requirement; network call to provider API |
| `--verify` flag (off by default) | Legal safety norm in the ecosystem; users expect passive-by-default | Low | Standard pattern established by TruffleHog |
| CSV export | Needed for spreadsheet/reporting workflows | Low | Standard; Gitleaks, Titus support it |
| Multi-platform binary | Single binary install is the expectation for Go tools | Low | Linux, macOS; Docker for Windows |

---

## Differentiators

Features that set KeyHunter apart from every existing tool. These are the competitive moat.

### Tier 1: Core Differentiators (Primary Competitive Advantage)

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| 108 LLM/AI provider coverage | No tool covers more than ~15-20 LLM providers; this is a 5-7x gap | High | YAML-driven provider definitions; must include prefix-based (OpenAI, Anthropic, HF, Groq, Replicate) AND keyword-based (Mistral, Cohere, Together AI, Chinese providers) |
| OSINT/Recon engine (80+ sources) | No scanner combines detection + OSINT in one tool | Very High | 18 source categories: code hosting, paste sites, IoT scanners, search dorks, package registries, CI/CD logs, web archives, forums, etc. |
| Active verification for 108 LLM providers | TruffleHog verifies ~700 types but covers far fewer LLM providers | High | Each YAML provider definition includes verify endpoint; `--verify` opt-in |
| Built-in dork engine (150+ dorks) | Search engine dorking is manual today; no tool has YAML-managed dorks | Med | GitHub, Google, Shodan, Censys, ZoomEye, FOFA dorks in YAML; extensible same way as providers |
| IoT scanner integration | Shodan/Censys/ZoomEye/FOFA for exposed LLM endpoints | High | Scans for vLLM, Ollama, LiteLLM proxy leaks, a growing attack surface (1scan showed thousands of exposed LLM endpoints) |
| YAML provider plugin system | Community can add providers in YAML without writing Go code | Med | Definitions are embedded at compile time via Go `//go:embed`; provider = pattern + keywords + verify endpoint + metadata |

### Tier 2: Strong Differentiators (Meaningfully Better Than Alternatives)

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Paste site aggregation (20+ sites) | Paste sites are a top leak vector; no scanner covers them systematically | High | Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io etc. |
| Package registry scanning (8+ registries) | npm, PyPI, RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy; LLM keys embedded in packages are a real vector | High | Scan source tarballs and metadata |
| Container/IaC scanning | Docker Hub layers, K8s configs, Terraform state, Helm, Ansible | High | Complements git scanning with infra layer |
| Web dashboard (htmx + Tailwind) | No open-source scanner has an embedded web UI | High | SQLite backend, embedded in binary via `go:embed`; scans/keys/recon/providers/dorks/settings |
| Telegram bot integration | Immediate mobile notification of findings; no scanner has this | Med | /scan, /verify, /recon, /status commands |
| Scheduled scanning with auto-notify | Recurring scans with cron; no scanner has this natively | Med | Cron-based; webhook or Telegram on new findings |
| SQLite storage with AES-256 encryption | Persistent scan state; other tools are stateless | Med | Store findings, recon results, key status history |
| TruffleHog + Gitleaks import adapter | Lets users pipe existing tool output into KeyHunter's verification/storage | Low-Med | JSON import from both tools; normalizes results |
| APK decompile scanning | Mobile app binaries as a source; no common scanner does this | High | Depends on external apktool/jadx; wrap as optional integration |
| Web archive scanning | Wayback Machine + CommonCrawl for historical leaks | Med | Useful for finding keys that were removed from code but still indexed |
| Source map / webpack bundle scanning | Frontend JS bundles frequently contain embedded API keys | Med | Fetch and parse JS source maps from deployed sites |
| Permission analysis (future) | TruffleHog Analyze covers 30 types; KeyHunter could expand to LLM scope | Very High | Know what a leaked key can do: model access, billing, rate limits |

### Tier 3: Nice-to-Have Differentiators

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Colored table output | Better UX than plain text | Low | Use lipgloss or tablewriter; standard in modern Go CLIs |
| Rate limiting per OSINT source | Responsible scanning without bans | Med | Per-host rate limiter; configurable |
| Stealth mode / robots.txt respect | Ethical scanning; avoids legal issues for researchers | Med | Opt-in stealth; obey robots.txt when configured |
| Delta-based git scanning | Only scan new commits since last run; performance for CI | Med | Store last scanned commit hash in SQLite |
| mmap-based file reading | Memory-efficient scanning of large files | Med | Use for large log files and archives |
| Worker pool parallelism | TruffleHog does this; expected for performance | Med | Configurable goroutine pool per source type |
| Cloud storage scanning (S3, GCS, Azure Blob) | Buckets frequently contain leaked config files | High | Requires cloud credentials to scan; scope carefully |
| Forum/community scanning (Reddit, HN, StackOverflow) | Real leak vector; developers share code with keys | High | Rate-limited scraping; search API where available |
| Collaboration tool scanning (Notion, Confluence) | Enterprise leak vector; increasingly relevant | Very High | Auth flows complex; may need per-org API tokens |
| Threat intel integration (VirusTotal, IntelX) | Cross-reference found keys against known breach databases | High | Add-on verification layer |

---

## Anti-Features

Features to deliberately NOT build. Building these would waste resources, create scope creep, or undermine the tool's identity.

| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Key rotation / remediation | KeyHunter is a finder, not a fixer; building rotation = competing with HashiCorp Vault, AWS Secrets Manager, Doppler | Document links to provider-specific rotation guides; link from findings output |
| SaaS / cloud-hosted version | Shifts tool from open-source security tool to commercial product; legal/privacy complexity explodes | Keep open-source; let users self-host the web dashboard |
| GUI desktop app | High dev cost for low security-tool audience benefit; security tools are CLI-first | CLI + embedded web dashboard covers both audiences |
| Real-time streaming API | Batch scanning is the primary mode; streaming adds websocket/SSE complexity for marginal gain | Use scheduled scans + webhooks/Telegram for near-real-time alerting |
| Windows native build | Small portion of target audience (red teams, DevSecOps); WSL/Docker serves them | State WSL/Docker support clearly in README |
| AI-generated code scanning (static analysis) | Different domain entirely from secret detection; scope creep | Stay focused on credential/secret detection |
| Automatic key invalidation | Calling provider API to revoke a key without explicit user consent is dangerous and potentially illegal | Gate ALL provider API calls behind `--verify`; never call provider APIs passively |
| Scanning without user consent | Legal and ethical requirement; all scanning must be intentional | Require explicit targets; no auto-discovery of new repos to scan |
| Built-in proxy/VPN | Scope creep; tool should not manage network routing | Document use with external proxies; support HTTP_PROXY env var |
| Key marketplace / sharing | Fundamentally changes the ethical posture of the tool from defender to attacker | Hard no; never log or transmit found keys anywhere outside local SQLite |
| Excessive telemetry | Security tools must not phone home; community trust requires zero telemetry | No analytics, no crash reporting; no network calls beyond user-requested scans and explicit `--verify` |

---

## Feature Dependencies

```
Regex patterns + keyword lists
  -> Provider YAML definitions (pattern + keywords + verify endpoint)
    -> Core scanning engine (file, dir, git)
      -> Active verification (--verify flag)
      -> SQLite storage (findings persistence)
        -> Web dashboard (htmx, reads from SQLite)
        -> JSON/CSV/SARIF export
        -> Telegram bot (reads from SQLite, sends alerts)
        -> Scheduled scanning (cron -> scan -> SQLite -> notify)

Provider YAML definitions
  -> Dork YAML definitions (same extensibility pattern)
    -> Built-in dork engine
      -> OSINT/Recon engine (uses dorks per source)
        -> IoT scanners (Shodan, Censys, ZoomEye, FOFA)
        -> Code hosting (GitHub, GitLab, HuggingFace, etc.)
        -> Paste sites
        -> Package registries
        -> Search engine dorking
        -> Web archives
        -> CI/CD logs
        -> Forums
        -> Collaboration tools
        -> Cloud storage
        -> Container/IaC

TruffleHog/Gitleaks JSON import
  -> Active verification (can verify imported keys)
  -> SQLite storage (can store imported findings)

Delta-based git scanning
  -> SQLite storage (requires stored last-scanned commit)

Keyword pre-filtering
  -> Core scanning engine (filter before regex application)

Worker pool parallelism
  -> All scanning operations (applies globally)
```

---

## MVP Recommendation

Build in strict dependency order. Each phase must be complete before the next delivers value.

**Phase 1 - Foundation (table stakes, no differentiators yet):**

1. Provider YAML definitions for 108 LLM providers (patterns, keywords, verify endpoints)
2. Core scanning engine: regex + entropy + keyword pre-filtering
3. Input sources: file, dir, git history, stdin
4. Active verification via `--verify` flag (off by default)
5. Output: colored table, JSON, SARIF, CSV
6. SQLite storage with AES-256

**Phase 2 - First differentiators (competitive moat begins here):**

7. Full key access: `--unmask`, `keys show`, web dashboard
8. TruffleHog + Gitleaks import adapters
9. Built-in dork engine (YAML dorks, 150+)
10. Pre-commit hook + CI/CD integration (SARIF, exit codes)

**Phase 3 - OSINT engine (the primary differentiator):**

11. Recon engine core: code hosting (GitHub, GitLab, HuggingFace, Replit, etc.)
12. Paste site aggregator (20+ sites)
13. Search engine dorking (Google, Bing, DuckDuckGo, etc.)
14. Package registries (npm, PyPI, RubyGems, etc.)
15. IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge)

**Phase 4 - Automation and reach:**

16. Telegram bot
17. Scheduled scanning (cron-based)
18. Remaining OSINT sources: CI/CD logs, web archives, forums, cloud storage, container/IaC, APK, source maps, threat intel

**Defer (post-MVP):**

- Collaboration tool scanning (Notion, Confluence, Google Docs): auth complexity is very high; add in v2 if demand exists
- Permission analysis: very high complexity; requires provider-specific API exploration per provider; good v2 feature
- Web archive scanning: CommonCrawl data is huge; requires careful scoping to avoid running for days

---

## Detection Method Tradeoffs

Based on research across competitive tools, relevant for architectural decisions:

| Method | Recall | Precision | Speed | Best For |
|--------|--------|-----------|-------|----------|
| Regex (named patterns) | High (for known formats) | High | Fast | Provider keys with known prefixes (OpenAI `sk-proj-`, Anthropic `sk-ant-api03-`, HuggingFace `hf_`, Groq `gsk_`) |
| Entropy (Shannon) | Medium (70.4% per Betterleaks data) | Low (high FP) | Fast | Generic high-entropy strings; use as secondary signal only |
| BPE Tokenization (Betterleaks) | Very High (98.6%) | High | Medium | Next-gen; consider for v2 |
| Keyword pre-filtering | N/A (filter only) | N/A | Very Fast | Reduce candidate set before regex; TruffleHog pattern |
| ML/LLM-based (Nosey Parker AI, GPT-4) | High | Very High | Slow/expensive | FuzzingLabs: GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%; v2 consideration |
| Contextual validation | High | Very High | Medium | GitGuardian's third layer; reduces FP significantly |

**KeyHunter approach:** Regex (primary) + keyword pre-filtering (performance) + entropy (secondary signal). ML-based detection is a v2 feature once the provider coverage gap is closed.

---

## Ecosystem Context (2026)

- AI-service credential leaks: 1.27M in 2025, up 81% YoY (GitGuardian State of Secrets Sprawl 2026)
- 29M total secrets leaked on GitHub in 2025 (34% YoY increase, largest single-year jump ever)
- LLM infrastructure leaks grow 5x faster than core model provider leaks
- Claude Code-assisted commits show a 3.2% leak rate vs a 1.5% baseline; AI coding tools are making the problem worse
- 24,008 unique secrets in MCP configuration files found in 2025
- Betterleaks (March 2026): BPE tokenization achieves 98.6% recall vs 70.4% for entropy; a new detection paradigm worth tracking
- FuzzingLabs (April 2026): GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%, TruffleHog 0% on split/obfuscated secrets; LLM-based detection is becoming viable
- TruffleHog + HuggingFace partnership: native HF scanner for models, datasets, Spaces
- GitHub Secret Scanning: added DeepSeek validity checks in March 2026; LLM provider awareness is growing

---

## Sources

- [TruffleHog GitHub](https://github.com/trufflesecurity/trufflehog): feature set, detector count, scanning sources
- [TruffleHog Analyze](https://trufflesecurity.com/blog/trufflehog-now-analyzes-permissions-of-api-keys-and-passwords): permission analysis feature
- [Gitleaks GitHub](https://github.com/gitleaks/gitleaks): output formats, detection methods
- [Betterleaks - BleepingComputer](https://www.bleepingcomputer.com/news/security/betterleaks-a-new-open-source-secrets-scanner-to-replace-gitleaks/): BPE tokenization, recall metrics
- [Betterleaks - Aikido](https://www.aikido.dev/blog/betterleaks-gitleaks-successor): comparison with Gitleaks
- [Titus - Praetorian](https://www.praetorian.com/blog/titus-open-source-secret-scanner/): 450+ rules, validation, Burp extension
- [Titus GitHub](https://github.com/praetorian-inc/titus): feature details
- [GitGuardian Secrets Detection](https://www.gitguardian.com/solutions/secrets-detection): 550+ detectors, enterprise features
- [GitGuardian State of Secrets Sprawl 2026](https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/): market statistics
- [GitHub Secret Scanning March 2026](https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/): validity checks for DeepSeek
- [GitHub Secret Scanning Coverage Update](https://github.blog/changelog/2026-03-31-github-secret-scanning-nine-new-types-and-more/): 28 new detectors
- [GitGuardian Secret Scanning Tools 2026](https://blog.gitguardian.com/secret-scanning-tools/): market landscape
- [keyhacks - streaak](https://github.com/streaak/keyhacks): API key validation endpoints reference
- [detect-secrets - Yelp](https://github.com/Yelp/detect-secrets): baseline approach, 27 detectors
- [FuzzingLabs - LLM vs regex benchmark](https://x.com/FuzzingLabs/status/1980668916851483010): GPT-5-mini 84.4% vs Gitleaks 37.5%
- [AI/LLM API key scanning on GitHub at scale](https://dev.to/zaim_abbasi/claude-openai-google-api-keys-all-public-this-is-what-i-found-after-scanning-github-at-scale-fj5): real-world leak discovery
- [Comparative study of secret scanning tools](https://arxiv.org/pdf/2307.00714): precision/recall benchmarks
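
## Appendix: Detection Pipeline Sketch

The KeyHunter approach named under Detection Method Tradeoffs (regex as the primary signal, keyword pre-filtering for performance, entropy as a secondary signal), combined with the masked-by-default output from Table Stakes, could look roughly like the Go sketch below. Everything named here is an assumption made for illustration: the `Detector` and `Finding` types, the `scan`, `mask`, and `exampleDetectors` functions, the entropy floor of 3.0 bits, and the `hf_`-style pattern, which is a stand-in and not an official key format or KeyHunter's actual API.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// Detector is a hypothetical named rule: cheap keywords plus a regex.
type Detector struct {
	Name     string
	Keywords []string       // at least one must appear before the regex runs
	Pattern  *regexp.Regexp // illustrative pattern, not an official key format
}

// Finding is one match, stored with the secret already masked.
type Finding struct {
	Detector string
	Masked   string
	Entropy  float64
}

// shannonEntropy returns the Shannon entropy of s in bits per byte.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// mask keeps the first and last four bytes and hides the middle.
func mask(s string) string {
	if len(s) <= 8 {
		return strings.Repeat("*", len(s))
	}
	return s[:4] + strings.Repeat("*", len(s)-8) + s[len(s)-4:]
}

// scan applies keyword pre-filtering, then the regex, then an entropy floor.
func scan(text string, detectors []Detector, minEntropy float64) []Finding {
	var out []Finding
	for _, d := range detectors {
		hit := false
		for _, kw := range d.Keywords {
			if strings.Contains(text, kw) {
				hit = true
				break
			}
		}
		if !hit {
			continue // no keyword present: skip the regex entirely
		}
		for _, m := range d.Pattern.FindAllString(text, -1) {
			e := shannonEntropy(m)
			if e < minEntropy {
				continue // low-entropy match, likely a placeholder
			}
			out = append(out, Finding{Detector: d.Name, Masked: mask(m), Entropy: e})
		}
	}
	return out
}

// exampleDetectors returns a single made-up detector for demonstration.
func exampleDetectors() []Detector {
	return []Detector{{
		Name:     "example-hf-style",
		Keywords: []string{"hf_"},
		Pattern:  regexp.MustCompile(`hf_[A-Za-z0-9]{20,}`),
	}}
}

func main() {
	text := "token=hf_aB3dE5gH7jK9mN1pQ3sT5vX7 demo=hf_" + strings.Repeat("x", 24)
	for _, f := range scan(text, exampleDetectors(), 3.0) {
		fmt.Printf("%s %s entropy=%.2f\n", f.Detector, f.Masked, f.Entropy)
	}
}
```

In this sketch the entropy check only filters regex matches rather than generating candidates on its own, which mirrors the "secondary signal only" guidance in the tradeoffs table. The repeated-`x` placeholder in `main` is dropped by that check while the mixed-character token is reported in masked form.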