keyhunter/.planning/research/FEATURES.md
2026-04-04 19:03:12 +03:00

Feature Landscape: API Key Scanner Domain

Domain: API key / secret scanner — LLM/AI provider focus, OSINT recon, active verification
Researched: 2026-04-04
Competitive reference: TruffleHog, Gitleaks, Betterleaks, detect-secrets, GitGuardian, Nosey Parker/Titus, GitHub Secret Scanning


Competitive Landscape Summary

| Tool | LLM Providers | Verification | OSINT/Recon | Sources | Output |
|---|---|---|---|---|---|
| TruffleHog | ~15 (OpenAI, Anthropic, HF partial) | Yes, 700+ | No | Git, S3, Docker, Postman, Jenkins | JSON, text |
| Gitleaks | ~5-10 (OpenAI, HF partial) | No | No | Git, dir, stdin | JSON, CSV, SARIF, JUnit |
| Betterleaks | ~10-15 (est.) | Planned | No | Git, dir, files | Unknown (Gitleaks-compatible) |
| detect-secrets | ~5 (keyword-based) | No | No | Files, git staged | JSON baseline |
| Titus | 450+ rules (broad SaaS) | Yes (validate flag) | No | Files, git, binary | JSON |
| GitGuardian | 550+ detectors | Yes (validity checks) | No | Git, CI/CD, Slack, Jira, Docker | Dashboard, alerts |
| GitHub Secret Scanning | 700+ patterns (cloud-first) | Yes (validity checks) | No | GitHub repos only | Dashboard, SARIF |
| KeyHunter (target) | 108 LLM providers | Yes (opt-in) | Yes (80+ sources) | Git + OSINT + IoT + paste | Table, JSON, SARIF, CSV |

Key market gap confirmed: No existing open-source tool covers 100+ LLM providers with detection + verification + OSINT recon combined. The 81% YoY surge in AI-service credential leaks (GitGuardian 2026 report, 1.27M leaked secrets) validates the demand.


Table Stakes

Features users expect from any credible secret scanner. Missing one = users choose a competitor immediately.

| Feature | Why Expected | Complexity | Notes |
|---|---|---|---|
| Regex-based pattern detection | Every tool has it; users assume it exists | Low | Foundation of all scanners; must be fast |
| Entropy analysis | Standard complement to regex since TruffleHog popularized it | Low | Shannon entropy; high FP rate alone, so pair with keywords |
| Keyword pre-filtering | TruffleHog's performance trick; users of large repos demand it | Low-Med | Filter to candidate files before applying regex; ~10x speedup |
| Git history scanning | TruffleHog/Gitleaks primary use case; users expect full history | Med | Must traverse all commits, branches, tags |
| Directory/file scanning | Needed for non-git use cases (CI artifacts, file shares) | Low | Walk directory tree, apply detectors |
| JSON output | Machine-readable output for pipeline integration | Low | Standard across all tools |
| False-positive reduction / deduplication | Alert fatigue is a known pain point across all scanners | Med | Deduplicate the same secret seen in N commits |
| Pre-commit hook support | Shift-left; developers expect git hook integration | Low | Blocks commits with detected secrets |
| CI/CD integration | GitHub Actions, GitLab CI, Jenkins; any serious scanner has this | Low | Binary runs in pipeline; exit code drives pass/fail |
| SARIF output | Required for GitHub Code Scanning tab, GitLab Security dashboard | Low | Standard format; Gitleaks, Titus, Zimara all support it |
| Masked output by default | Security hygiene; users expect keys not printed to the terminal in the clear | Low | Mask middle chars; --unmask flag shows the full key |
| Provider-based detection rules | Users expect named detectors ("OpenAI key detected"), not raw regex | Med | Named detectors with confidence; YAML definitions in KeyHunter's case |
| Active key verification (opt-in) | TruffleHog proved this: confirmed keys are worth 10x more to users | Med | MUST be opt-in; legal/ethical requirement; network call to provider API |
| --verify flag (off by default) | Legal safety norm in the ecosystem; users expect passive-by-default | Low | Standard pattern established by TruffleHog |
| CSV export | Needed for spreadsheet/reporting workflows | Low | Standard; Gitleaks, Titus support it |
| Multi-platform binary | Single-binary install is the expectation for Go tools | Low | Linux, macOS; Docker for Windows |
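The masked-output row above can be sketched in a few lines. This is an illustrative helper, not KeyHunter's actual API; `maskKey` and the 4-character window are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// maskKey keeps a short prefix and suffix visible and masks the middle,
// so a finding still identifies the provider (e.g. the "sk-proj-" prefix)
// without printing the full credential. A --unmask flag would skip this.
func maskKey(key string) string {
	const keep = 4 // visible characters at each end
	if len(key) <= 2*keep {
		return strings.Repeat("*", len(key))
	}
	return key[:keep] + strings.Repeat("*", len(key)-2*keep) + key[len(key)-keep:]
}

func main() {
	fmt.Println(maskKey("sk-proj-ABCDEFGH")) // sk-p********EFGH
	fmt.Println(maskKey("short"))            // *****
}
```

Keeping the prefix visible matters because the prefix is what names the provider in the finding.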

Differentiators

Features that set KeyHunter apart from every existing tool. These are the competitive moat.

Tier 1: Core Differentiators (Primary Competitive Advantage)

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| 108 LLM/AI provider coverage | No tool covers more than ~15-20 LLM providers; this is a 5-7x gap | High | YAML-driven provider definitions; must include prefix-based (OpenAI, Anthropic, HF, Groq, Replicate) AND keyword-based (Mistral, Cohere, Together AI, Chinese providers) |
| OSINT/recon engine (80+ sources) | No scanner combines detection + OSINT in one tool | Very High | 18 source categories: code hosting, paste sites, IoT scanners, search dorks, package registries, CI/CD logs, web archives, forums, etc. |
| Active verification for 108 LLM providers | TruffleHog verifies ~700 types but covers far fewer LLM providers | High | Each YAML provider definition includes a verify endpoint; --verify opt-in |
| Built-in dork engine (150+ dorks) | Search engine dorking is manual today; no tool has YAML-managed dorks | Med | GitHub, Google, Shodan, Censys, ZoomEye, FOFA dorks in YAML; extensible the same way as providers |
| IoT scanner integration | Shodan/Censys/ZoomEye/FOFA for exposed LLM endpoints | High | Scans for vLLM, Ollama, LiteLLM proxy leaks, a growing attack surface (1scan showed thousands of exposed LLM endpoints) |
| YAML provider plugin system | Community can add providers without recompiling | Med | Compile-time embed via Go //go:embed; provider = pattern + keywords + verify endpoint + metadata |

Tier 2: Strong Differentiators (Meaningfully Better Than Alternatives)

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Paste site aggregation (20+ sites) | Paste sites are a top leak vector; no scanner covers them systematically | High | Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io, etc. |
| Package registry scanning (8+ registries) | npm, PyPI, RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy — LLM keys embedded in packages are a real vector | High | Scan source tarballs and metadata |
| Container/IaC scanning | Docker Hub layers, K8s configs, Terraform state, Helm, Ansible | High | Complements git scanning with the infra layer |
| Web dashboard (htmx + Tailwind) | No open-source scanner has an embedded web UI | High | SQLite backend, embedded in binary via go:embed; scans/keys/recon/providers/dorks/settings |
| Telegram bot integration | Immediate mobile notification of findings; no scanner has this | Med | /scan, /verify, /recon, /status commands |
| Scheduled scanning with auto-notify | Recurring scans with cron; no scanner has this natively | Med | Cron-based; webhook or Telegram on new findings |
| SQLite storage with AES-256 encryption | Persistent scan state; other tools are stateless | Med | Store findings, recon results, key status history |
| TruffleHog + Gitleaks import adapter | Lets users pipe existing tool output into KeyHunter's verification/storage | Low-Med | JSON import from both tools; normalizes results |
| APK decompile scanning | Mobile app binaries as a source; no common scanner does this | High | Depends on external apktool/jadx; wrap as optional integration |
| Web archive scanning | Wayback Machine + CommonCrawl for historical leaks | Med | Useful for finding keys that were removed from code but are still indexed |
| Source map / webpack bundle scanning | Frontend JS bundles frequently contain embedded API keys | Med | Fetch and parse JS source maps from deployed sites |
| Permission analysis (future) | TruffleHog Analyze covers 30 types; KeyHunter could expand to LLM scope | Very High | Know what a leaked key can do — model access, billing, rate limits |

Tier 3: Nice-to-Have Differentiators

| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Colored table output | Better UX than plain text | Low | Use lipgloss or tablewriter; standard in modern Go CLIs |
| Rate limiting per OSINT source | Responsible scanning without bans | Med | Per-host rate limiter; configurable |
| Stealth mode / robots.txt respect | Ethical scanning; avoids legal issues for researchers | Med | Opt-in stealth; obey robots.txt when configured |
| Delta-based git scanning | Only scan new commits since last run; performance for CI | Med | Store last scanned commit hash in SQLite |
| mmap-based file reading | Memory-efficient scanning of large files | Med | Use for large log files and archives |
| Worker pool parallelism | TruffleHog does this; expected for performance | Med | Configurable goroutine pool per source type |
| Cloud storage scanning (S3, GCS, Azure Blob) | Buckets frequently contain leaked config files | High | Requires cloud credentials to scan; scope carefully |
| Forum/community scanning (Reddit, HN, StackOverflow) | Real leak vector; developers share code with keys | High | Rate-limited scraping; search API where available |
| Collaboration tool scanning (Notion, Confluence) | Enterprise leak vector; increasingly relevant | Very High | Auth flows complex; may need per-org API tokens |
| Threat intel integration (VirusTotal, IntelX) | Cross-reference found keys against known breach databases | High | Add-on verification layer |

Anti-Features

Features to deliberately NOT build. Building these would waste resources, create scope creep, or undermine the tool's identity.

| Anti-Feature | Why Avoid | What to Do Instead |
|---|---|---|
| Key rotation / remediation | KeyHunter is a finder, not a fixer; building rotation = competing with HashiCorp Vault, AWS Secrets Manager, Doppler | Document links to provider-specific rotation guides; link from findings output |
| SaaS / cloud-hosted version | Shifts the tool from open-source security tool to commercial product; legal/privacy complexity explodes | Keep open-source; let users self-host the web dashboard |
| GUI desktop app | High dev cost for low security-tool audience benefit; security tools are CLI-first | CLI + embedded web dashboard covers both audiences |
| Real-time streaming API | Batch scanning is the primary mode; streaming adds websocket/SSE complexity for marginal gain | Use scheduled scans + webhooks/Telegram for near-real-time alerting |
| Windows native build | Small portion of target audience (red teams, DevSecOps); WSL/Docker serves them | State WSL/Docker support clearly in README |
| AI-generated code scanning (static analysis) | Different domain entirely from secret detection; scope creep | Stay focused on credential/secret detection |
| Automatic key invalidation | Calling a provider API to revoke a key without explicit user consent is dangerous and potentially illegal | Gate ALL provider API calls behind --verify; never call provider APIs passively |
| Scanning without user consent | Legal and ethical requirement; all scanning must be intentional | Require explicit targets; no auto-discovery of new repos to scan |
| Built-in proxy/VPN | Scope creep; the tool should not manage network routing | Document use with external proxies; support HTTP_PROXY env var |
| Key marketplace / sharing | Fundamentally changes the ethical posture of the tool from defender to attacker | Hard no; never log or transmit found keys anywhere outside local SQLite |
| Excessive telemetry | Security tools must not phone home; community trust requires zero telemetry | No analytics, no crash reporting, no network calls except explicit --verify |

Feature Dependencies

Regex patterns + keyword lists
    -> Provider YAML definitions (pattern + keywords + verify endpoint)
        -> Core scanning engine (file, dir, git)
            -> Active verification (--verify flag)
            -> SQLite storage (findings persistence)
                -> Web dashboard (htmx, reads from SQLite)
                -> JSON/CSV/SARIF export
                -> Telegram bot (reads from SQLite, sends alerts)
                -> Scheduled scanning (cron -> scan -> SQLite -> notify)

Provider YAML definitions
    -> Dork YAML definitions (same extensibility pattern)
        -> Built-in dork engine
            -> OSINT/Recon engine (uses dorks per source)
                -> IoT scanners (Shodan, Censys, ZoomEye, FOFA)
                -> Code hosting (GitHub, GitLab, HuggingFace, etc.)
                -> Paste sites
                -> Package registries
                -> Search engine dorking
                -> Web archives
                -> CI/CD logs
                -> Forums
                -> Collaboration tools
                -> Cloud storage
                -> Container/IaC

TruffleHog/Gitleaks JSON import
    -> Active verification (can verify imported keys)
    -> SQLite storage (can store imported findings)

Delta-based git scanning
    -> SQLite storage (requires stored last-scanned commit)

Keyword pre-filtering
    -> Core scanning engine (filter before regex application)

Worker pool parallelism
    -> All scanning operations (applies globally)
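The keyword pre-filtering dependency above (filter before regex application) can be sketched as a cheap substring pass gating the expensive regex stage. Function names and keyword lists are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// candidate reports whether content contains any detector keyword,
// using a cheap case-insensitive substring check. Only candidates
// proceed to the regex stage, which is the expensive step — this is
// the TruffleHog-style speedup the table stakes section describes.
func candidate(content string, keywords []string) bool {
	lower := strings.ToLower(content)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			return true
		}
	}
	return false
}

func main() {
	keywords := []string{"sk-proj-", "sk-ant-", "hf_"}
	pattern := regexp.MustCompile(`sk-proj-[A-Za-z0-9_-]{20,}`)

	files := map[string]string{
		"config.env": "OPENAI_API_KEY=sk-proj-abcdefghijklmnopqrstuv",
		"readme.txt": "nothing secret in here",
	}
	for name, content := range files {
		if !candidate(content, keywords) {
			continue // skipped without ever running the regex
		}
		if m := pattern.FindString(content); m != "" {
			fmt.Printf("%s: %s\n", name, m)
		}
	}
}
```

On a large repo most files fail the keyword gate, so the regex set runs against a small candidate subset.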

MVP Recommendation

Build in strict dependency order. Each phase must be complete before the next delivers value.

Phase 1 — Foundation (table stakes, no differentiators yet):

  1. Provider YAML definitions for 108 LLM providers (patterns, keywords, verify endpoints)
  2. Core scanning engine: regex + entropy + keyword pre-filtering
  3. Input sources: file, dir, git history, stdin
  4. Active verification via --verify flag (off by default)
  5. Output: colored table, JSON, SARIF, CSV
  6. SQLite storage with AES-256

Phase 2 — First differentiators (competitive moat begins here):

  7. Full key access: --unmask, keys show, web dashboard
  8. TruffleHog + Gitleaks import adapters
  9. Built-in dork engine (YAML dorks, 150+)
  10. Pre-commit hook + CI/CD integration (SARIF, exit codes)

Phase 3 — OSINT engine (the primary differentiator):

  11. Recon engine core: code hosting (GitHub, GitLab, HuggingFace, Replit, etc.)
  12. Paste site aggregator (20+ sites)
  13. Search engine dorking (Google, Bing, DuckDuckGo, etc.)
  14. Package registries (npm, PyPI, RubyGems, etc.)
  15. IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge)

Phase 4 — Automation and reach:

  16. Telegram bot
  17. Scheduled scanning (cron-based)
  18. Remaining OSINT sources: CI/CD logs, web archives, forums, cloud storage, container/IaC, APK, source maps, threat intel

Defer beyond v1:

  • Collaboration tool scanning (Notion, Confluence, Google Docs): auth complexity is very high; add in v2 if demand exists
  • Permission analysis: very high complexity; requires provider-specific API exploration per provider; good v2 feature
  • Web archive scanning: CommonCrawl data is huge; requires careful scoping to avoid running for days

Detection Method Tradeoffs

Based on research across competitive tools, relevant for architectural decisions:

| Method | Recall | Precision | Speed | Best For |
|---|---|---|---|---|
| Regex (named patterns) | High (for known formats) | High | Fast | Provider keys with known prefixes (OpenAI sk-proj-, Anthropic sk-ant-api03-, HuggingFace hf_, Groq gsk_) |
| Entropy (Shannon) | Medium (70.4% per Betterleaks data) | Low (high FP) | Fast | Generic high-entropy strings; use as a secondary signal only |
| BPE tokenization (Betterleaks) | Very High (98.6%) | High | Medium | Next-gen; consider for v2 |
| Keyword pre-filtering | N/A (filter only) | N/A | Very Fast | Reduce candidate set before regex; TruffleHog pattern |
| ML/LLM-based (Nosey Parker AI, GPT-4) | High | Very High | Slow/expensive | FuzzingLabs: GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%; v2 consideration |
| Contextual validation | High | Very High | Medium | GitGuardian's third layer; reduces FP significantly |

KeyHunter approach: Regex (primary) + keyword pre-filtering (performance) + entropy (secondary signal). ML-based detection is a v2 feature once the provider coverage gap is closed.


Ecosystem Context (2026)

  • AI-service credential leaks: 1.27M in 2025, up 81% YoY (GitGuardian State of Secrets Sprawl 2026)
  • 29M total secrets leaked on GitHub in 2025 (34% YoY increase, largest single-year jump ever)
  • LLM infrastructure leaks grow 5x faster than core model provider leaks
  • Claude Code-assisted commits show 3.2% leak rate vs 1.5% baseline — AI coding tools making it worse
  • 24,008 unique secrets in MCP configuration files found in 2025
  • Betterleaks (March 2026): BPE tokenization achieves 98.6% recall vs 70.4% for entropy — new detection paradigm worth tracking
  • FuzzingLabs (April 2026): GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%, TruffleHog 0% on split/obfuscated secrets — LLM-based detection becoming viable
  • TruffleHog + HuggingFace partnership: native HF scanner for models, datasets, Spaces
  • GitHub Secret Scanning: added DeepSeek validity checks in March 2026 — LLM provider awareness growing

Sources