# Feature Landscape: API Key Scanner Domain
**Domain:** API key / secret scanner — LLM/AI provider focus, OSINT recon, active verification
**Researched:** 2026-04-04
**Competitive Reference:** TruffleHog, Gitleaks, Betterleaks, detect-secrets, GitGuardian, Nosey Parker/Titus, GitHub Secret Scanning

---
## Competitive Landscape Summary
| Tool | LLM Providers | Verification | OSINT/Recon | Sources | Output |
|------|--------------|-------------|------------|---------|--------|
| TruffleHog | ~15 (OpenAI, Anthropic, HF partial) | Yes, 700+ | No | Git, S3, Docker, Postman, Jenkins | JSON, text |
| Gitleaks | ~5-10 (OpenAI, HF partial) | No | No | Git, dir, stdin | JSON, CSV, SARIF, JUnit |
| Betterleaks | ~10-15 (est.) | Planned | No | Git, dir, files | Unknown (Gitleaks-compatible) |
| detect-secrets | ~5 (keyword-based) | No | No | Files, git staged | JSON baseline |
| Titus | 450+ rules (broad SaaS) | Yes (validate flag) | No | Files, git, binary | JSON |
| GitGuardian | 550+ detectors | Yes (validity checks) | No | Git, CI/CD, Slack, Jira, Docker | Dashboard, alerts |
| GitHub Secret Scanning | 700+ patterns (cloud-first) | Yes (validity checks) | No | GitHub repos only | Dashboard, SARIF |
| KeyHunter (target) | 108 LLM providers | Yes (opt-in) | Yes (80+ sources) | Git+OSINT+IoT+Paste | Table, JSON, SARIF, CSV |
**Key market gap confirmed:** No existing open-source tool covers 100+ LLM providers with detection + verification + OSINT recon combined. The 81% YoY surge in AI-service credential leaks (GitGuardian 2026 report, 1.27M leaked secrets) validates the demand.

---
## Table Stakes
Features users expect from any credible secret scanner. If any one of them is missing, users are likely to pick a competitor instead.
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Regex-based pattern detection | Every tool has it; users assume it exists | Low | Foundation of all scanners; must be fast |
| Entropy analysis | Standard complement to regex since TruffleHog popularized it | Low | Shannon entropy; high FP rate alone — needs keywords too |
| Keyword pre-filtering | TruffleHog's performance trick; users of large repos demand it | Low-Med | Filter to candidate files before applying regex; 10x speedup |
| Git history scanning | TruffleHog/Gitleaks primary use case; users expect full history | Med | Must traverse all commits, branches, tags |
| Directory/file scanning | Needed for non-git use cases (CI artifacts, file shares) | Low | Walk directory tree, apply detectors |
| JSON output | Machine-readable output for pipeline integration | Low | Standard across all tools |
| False positive reduction / deduplication | Alert fatigue is a known pain point across all scanners | Med | Deduplicate same secret seen in N commits |
| Pre-commit hook support | Shift-left; developers expect git hook integration | Low | Blocks commits with detected secrets |
| CI/CD integration | GitHub Actions, GitLab CI, Jenkins — any serious scanner has this | Low | Binary runs in pipeline; exit code drives pass/fail |
| SARIF output | Required for GitHub Code Scanning tab, GitLab Security dashboard | Low | Standard format; Gitleaks, Titus, Zimara all support it |
| Masked output by default | Security hygiene; users expect keys not printed in clear to terminal | Low | Mask middle chars; --unmask flag to show full |
| Provider-based detection rules | Users expect named detectors ("OpenAI key detected"), not raw regex | Med | Named detectors with confidence; YAML definitions in KeyHunter's case |
| Active key verification (opt-in) | TruffleHog verified this: confirmed keys are worth 10x more to users | Med | MUST be opt-in; legal/ethical requirement; network call to provider API |
| --verify flag (off by default) | Legal safety norm in the ecosystem; users expect passive-by-default | Low | Standard pattern established by TruffleHog |
| CSV export | Needed for spreadsheet/reporting workflows | Low | Standard; Gitleaks, Titus support it |
| Multi-platform binary | Single binary install is the expectation for Go tools | Low | Linux, macOS; Docker for Windows |
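
Two of the rows above, masked output and deduplication, can be sketched in a few lines of Go. This is a minimal illustration, not KeyHunter's actual implementation: the `Finding` and `Deduped` types, the `Mask` and `Dedupe` names, and the sample keys are all hypothetical. `Mask` works on bytes, which assumes ASCII secrets, and `Dedupe` treats two findings as the same secret when provider and secret value match, regardless of commit.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// Finding is a hypothetical record for one detected secret.
type Finding struct {
	Provider string
	Secret   string
	Commit   string
}

// Deduped is the first occurrence of a secret plus how often it was seen.
type Deduped struct {
	Finding
	Count int
}

// Mask keeps the first and last `keep` bytes and replaces the middle with
// asterisks. Secrets too short to mask meaningfully are fully replaced.
func Mask(secret string, keep int) string {
	if keep < 0 || len(secret) <= 2*keep {
		return strings.Repeat("*", len(secret))
	}
	middle := strings.Repeat("*", len(secret)-2*keep)
	return secret[:keep] + middle + secret[len(secret)-keep:]
}

// Dedupe collapses findings that share provider and secret value, keeping
// the first occurrence and counting repeats (for example across N commits).
func Dedupe(findings []Finding) []Deduped {
	index := make(map[[32]byte]int) // hash of provider+secret -> position in out
	var out []Deduped
	for _, f := range findings {
		key := sha256.Sum256([]byte(f.Provider + "\x00" + f.Secret))
		if i, ok := index[key]; ok {
			out[i].Count++
			continue
		}
		index[key] = len(out)
		out = append(out, Deduped{Finding: f, Count: 1})
	}
	return out
}

func main() {
	// Obviously fake sample values.
	findings := []Finding{
		{"openai", "sk-proj-EXAMPLEEXAMPLEEXAMPLE0001", "a1b2c3"},
		{"openai", "sk-proj-EXAMPLEEXAMPLEEXAMPLE0001", "d4e5f6"},
		{"huggingface", "hf_EXAMPLEEXAMPLEEXAMPLE", "a1b2c3"},
	}
	for _, d := range Dedupe(findings) {
		fmt.Printf("%-12s %s seen=%d first=%s\n", d.Provider, Mask(d.Secret, 4), d.Count, d.Commit)
	}
}
```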
---
## Differentiators
Features that set KeyHunter apart from every existing tool. These are the competitive moat.
### Tier 1: Core Differentiators (Primary Competitive Advantage)
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| 108 LLM/AI provider coverage | No tool covers more than ~15-20 LLM providers; this is a 5-7x gap | High | YAML-driven provider definitions; must include prefix-based (OpenAI, Anthropic, HF, Groq, Replicate) AND keyword-based (Mistral, Cohere, Together AI, Chinese providers) |
| OSINT/Recon engine (80+ sources) | No scanner combines detection + OSINT in one tool | Very High | 18 source categories: code hosting, paste sites, IoT scanners, search dorks, package registries, CI/CD logs, web archives, forums, etc. |
| Active verification for 108 LLM providers | TruffleHog verifies ~700 types but covers far fewer LLM providers | High | Each YAML provider definition includes verify endpoint; --verify opt-in |
| Built-in dork engine (150+ dorks) | Search engine dorking is manual today; no tool has YAML-managed dorks | Med | GitHub, Google, Shodan, Censys, ZoomEye, FOFA dorks in YAML; extensible same way as providers |
| IoT scanner integration | Shodan/Censys/ZoomEye/FOFA for exposed LLM endpoints | High | Scans for vLLM, Ollama, LiteLLM proxy leaks — a growing attack surface (1scan showed thousands of exposed LLM endpoints) |
| YAML provider plugin system | Community can add providers by writing YAML, without touching Go code | Med | Definitions are embedded at compile time via Go `//go:embed`, so a new provider ships with the next build; provider = pattern + keywords + verify endpoint + metadata |
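
The combination of a provider definition and opt-in verification described above might look roughly like the following Go sketch. Everything here is an assumption for illustration: the `Provider` fields, the `Verify` signature, bearer-token authentication, and the status-code mapping (200 means live, 401/403 means invalid) are not taken from any real provider's API, and the demo talks only to a local `httptest` server. The one property it is meant to show is that no network call happens unless verification was explicitly enabled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Provider mirrors the fields the notes list for a definition:
// pattern + keywords + verify endpoint. Field names are hypothetical.
type Provider struct {
	Name      string
	Keywords  []string
	Pattern   string
	VerifyURL string
}

// ErrVerifyDisabled is returned when the caller did not opt in.
var ErrVerifyDisabled = errors.New("verification disabled: pass --verify to enable")

// Verify sends the key to the provider's verify endpoint only when enabled.
// Assumed convention: 200 = live, 401/403 = invalid, anything else = inconclusive.
func Verify(ctx context.Context, client *http.Client, p Provider, key string, enabled bool) (bool, error) {
	if !enabled {
		return false, ErrVerifyDisabled // passive by default: no request is built
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, p.VerifyURL, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "Bearer "+key)
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		return true, nil
	case http.StatusUnauthorized, http.StatusForbidden:
		return false, nil
	default:
		return false, fmt.Errorf("inconclusive status %d", resp.StatusCode)
	}
}

func main() {
	// Local stand-in for a provider API so the sketch never touches a real service.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "Bearer live-key" {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusUnauthorized)
	}))
	defer srv.Close()

	p := Provider{Name: "example-llm", Keywords: []string{"example"}, Pattern: `ex_[A-Za-z0-9]{32}`, VerifyURL: srv.URL}
	client := &http.Client{Timeout: 5 * time.Second}
	ctx := context.Background()

	_, err := Verify(ctx, client, p, "live-key", false)
	fmt.Println("without --verify:", err)
	for _, key := range []string{"live-key", "dead-key"} {
		live, err := Verify(ctx, client, p, key, true)
		fmt.Printf("%s live=%v err=%v\n", key, live, err)
	}
}
```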
### Tier 2: Strong Differentiators (Meaningfully Better Than Alternatives)
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Paste site aggregation (20+ sites) | Paste sites are a top leak vector; no scanner covers them systematically | High | Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io etc. |
| Package registry scanning (8+ registries) | npm, PyPI, RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy — LLM keys embedded in packages are a real vector | High | Scan source tarballs and metadata |
| Container/IaC scanning | Docker Hub layers, K8s configs, Terraform state, Helm, Ansible | High | Complements git scanning with infra layer |
| Web dashboard (htmx + Tailwind) | No open-source scanner has an embedded web UI | High | SQLite backend, embedded in binary via go:embed; scans/keys/recon/providers/dorks/settings |
| Telegram bot integration | Immediate mobile notification of findings; no scanner has this | Med | /scan, /verify, /recon, /status commands |
| Scheduled scanning with auto-notify | Recurring scans with cron; no scanner has this natively | Med | Cron-based; webhook or Telegram on new findings |
| SQLite storage with AES-256 encryption | Persistent scan state; other tools are stateless | Med | Store findings, recon results, key status history |
| TruffleHog + Gitleaks import adapter | Lets users pipe existing tool output into KeyHunter's verification/storage | Low-Med | JSON import from both tools; normalizes results |
| APK decompile scanning | Mobile app binaries as a source; no common scanner does this | High | Depends on external apktool/jadx; wrap as optional integration |
| Web archive scanning | Wayback Machine + CommonCrawl for historical leaks | Med | Useful for finding keys that were removed from code but still indexed |
| Source map / webpack bundle scanning | Frontend JS bundles frequently contain embedded API keys | Med | Fetch and parse JS source maps from deployed sites |
| Permission analysis (future) | TruffleHog Analyze covers 30 types; KeyHunter could expand to LLM scope | Very High | Know what a leaked key can do — model access, billing, rate limits |
### Tier 3: Nice-to-Have Differentiators
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Colored table output | Better UX than plain text | Low | Use lipgloss or tablewriter; standard in modern Go CLIs |
| Rate limiting per OSINT source | Responsible scanning without bans | Med | Per-host rate limiter; configurable |
| Stealth mode / robots.txt respect | Ethical scanning; avoids legal issues for researchers | Med | Opt-in stealth; obey robots.txt when configured |
| Delta-based git scanning | Only scan new commits since last run; performance for CI | Med | Store last scanned commit hash in SQLite |
| mmap-based file reading | Memory-efficient scanning of large files | Med | Use for large log files and archives |
| Worker pool parallelism | TruffleHog does this; expected for performance | Med | Configurable goroutine pool per source type |
| Cloud storage scanning (S3, GCS, Azure Blob) | Buckets frequently contain leaked config files | High | Requires cloud credentials to scan; scope carefully |
| Forum/community scanning (Reddit, HN, StackOverflow) | Real leak vector; developers share code with keys | High | Rate-limited scraping; search API where available |
| Collaboration tool scanning (Notion, Confluence) | Enterprise leak vector; increasingly relevant | Very High | Auth flows complex; may need per-org API tokens |
| Threat intel integration (VirusTotal, IntelX) | Cross-reference found keys against known breach databases | High | Add-on verification layer |
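
As an illustration of the worker pool row above, here is a small generic pool in Go (requires Go 1.18+ for generics). `RunPool` and its parameters are hypothetical names, not part of any existing KeyHunter code. Each worker writes only to the result slot of the job index it received, so results stay in input order without extra locking.

```go
package main

import (
	"fmt"
	"sync"
)

// RunPool applies fn to every item using a fixed number of goroutines and
// returns the results in the same order as the input.
func RunPool[T any, R any](workers int, items []T, fn func(T) R) []R {
	if workers < 1 {
		workers = 1
	}
	results := make([]R, len(items))
	jobs := make(chan int) // carries indexes into items

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = fn(items[i]) // each index is written by exactly one worker
			}
		}()
	}
	for i := range items {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Stand-in "scan": the real work would open the file and run detectors.
	paths := []string{"a.env", "b.yaml", "c.json"}
	sizes := RunPool(2, paths, func(p string) int { return len(p) })
	fmt.Println(sizes)
}
```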
---
## Anti-Features
Features to deliberately NOT build. Building these would waste resources, create scope creep, or undermine the tool's identity.
| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Key rotation / remediation | KeyHunter is a finder, not a fixer; building rotation = competing with HashiCorp Vault, AWS Secrets Manager, Doppler | Document links to provider-specific rotation guides; link from findings output |
| SaaS / cloud-hosted version | Shifts tool from open-source security tool to commercial product; legal/privacy complexity explodes | Keep open-source; let users self-host the web dashboard |
| GUI desktop app | High dev cost for low security-tool audience benefit; security tools are CLI-first | CLI + embedded web dashboard covers both audiences |
| Real-time streaming API | Batch scanning is the primary mode; streaming adds websocket/SSE complexity for marginal gain | Use scheduled scans + webhooks/Telegram for near-real-time alerting |
| Windows native build | Small portion of target audience (red teams, DevSecOps); WSL/Docker serves them | State WSL/Docker support clearly in README |
| AI-generated code scanning (static analysis) | Different domain entirely from secret detection; scope creep | Stay focused on credential/secret detection |
| Automatic key invalidation | Calling provider API to revoke a key without explicit user consent is dangerous and potentially illegal | Gate ALL provider API calls behind --verify; never call provider APIs passively |
| Scanning without user consent | Legal and ethical requirement; all scanning must be intentional | Require explicit targets; no auto-discovery of new repos to scan |
| Built-in proxy/VPN | Scope creep; tool should not manage network routing | Document use with external proxies; support HTTP_PROXY env var |
| Key marketplace / sharing | Fundamentally changes the ethical posture of the tool from defender to attacker | Hard no; never log or transmit found keys anywhere outside local SQLite |
| Excessive telemetry | Security tools must not phone home; community trust requires zero telemetry | No analytics, no crash reporting, no network calls except explicit --verify |
---
## Feature Dependencies
```
Regex patterns + keyword lists
  -> Provider YAML definitions (pattern + keywords + verify endpoint)
  -> Core scanning engine (file, dir, git)
  -> Active verification (--verify flag)
  -> SQLite storage (findings persistence)
  -> Web dashboard (htmx, reads from SQLite)
  -> JSON/CSV/SARIF export
  -> Telegram bot (reads from SQLite, sends alerts)
  -> Scheduled scanning (cron -> scan -> SQLite -> notify)

Provider YAML definitions
  -> Dork YAML definitions (same extensibility pattern)
  -> Built-in dork engine
  -> OSINT/Recon engine (uses dorks per source)
  -> IoT scanners (Shodan, Censys, ZoomEye, FOFA)
  -> Code hosting (GitHub, GitLab, HuggingFace, etc.)
  -> Paste sites
  -> Package registries
  -> Search engine dorking
  -> Web archives
  -> CI/CD logs
  -> Forums
  -> Collaboration tools
  -> Cloud storage
  -> Container/IaC

TruffleHog/Gitleaks JSON import
  -> Active verification (can verify imported keys)
  -> SQLite storage (can store imported findings)

Delta-based git scanning
  -> SQLite storage (requires stored last-scanned commit)

Keyword pre-filtering
  -> Core scanning engine (filter before regex application)

Worker pool parallelism
  -> All scanning operations (applies globally)
```
---
## MVP Recommendation
Build in strict dependency order. Each phase must be complete before the next delivers value.
**Phase 1 — Foundation (table stakes, no differentiators yet):**
1. Provider YAML definitions for 108 LLM providers (patterns, keywords, verify endpoints)
2. Core scanning engine: regex + entropy + keyword pre-filtering
3. Input sources: file, dir, git history, stdin
4. Active verification via `--verify` flag (off by default)
5. Output: colored table, JSON, SARIF, CSV
6. SQLite storage with AES-256
**Phase 2 — First differentiators (competitive moat begins here):**
7. Full key access: `--unmask`, `keys show`, web dashboard
8. TruffleHog + Gitleaks import adapters
9. Built-in dork engine (YAML dorks, 150+)
10. Pre-commit hook + CI/CD integration (SARIF, exit codes)
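
For item 10, a CI integration typically needs two things: a SARIF document and an exit code that fails the pipeline when findings exist. The Go sketch below builds a minimal SARIF 2.1.0 log with only the commonly required fields. The `Finding` type, the rule ID scheme, and the driver name are assumptions for illustration; a real export would also carry rule metadata and masked snippets.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal subset of the SARIF 2.1.0 object model.
type sarifLog struct {
	Schema  string     `json:"$schema"`
	Version string     `json:"version"`
	Runs    []sarifRun `json:"runs"`
}
type sarifRun struct {
	Tool    sarifTool     `json:"tool"`
	Results []sarifResult `json:"results"`
}
type sarifTool struct {
	Driver sarifDriver `json:"driver"`
}
type sarifDriver struct {
	Name string `json:"name"`
}
type sarifResult struct {
	RuleID    string          `json:"ruleId"`
	Level     string          `json:"level"`
	Message   sarifMessage    `json:"message"`
	Locations []sarifLocation `json:"locations"`
}
type sarifMessage struct {
	Text string `json:"text"`
}
type sarifLocation struct {
	PhysicalLocation sarifPhysical `json:"physicalLocation"`
}
type sarifPhysical struct {
	ArtifactLocation sarifArtifact `json:"artifactLocation"`
	Region           sarifRegion   `json:"region"`
}
type sarifArtifact struct {
	URI string `json:"uri"`
}
type sarifRegion struct {
	StartLine int `json:"startLine"`
}

// Finding is a hypothetical scanner result; the secret itself is left out.
type Finding struct {
	Provider string
	File     string
	Line     int
}

// BuildSARIF converts findings into a single-run SARIF log.
func BuildSARIF(findings []Finding) sarifLog {
	results := make([]sarifResult, 0, len(findings)) // non-nil so it encodes as []
	for _, f := range findings {
		results = append(results, sarifResult{
			RuleID:  "keyhunter/" + f.Provider, // hypothetical rule ID scheme
			Level:   "error",
			Message: sarifMessage{Text: f.Provider + " API key detected"},
			Locations: []sarifLocation{{PhysicalLocation: sarifPhysical{
				ArtifactLocation: sarifArtifact{URI: f.File},
				Region:           sarifRegion{StartLine: f.Line},
			}}},
		})
	}
	return sarifLog{
		Schema:  "https://json.schemastore.org/sarif-2.1.0.json",
		Version: "2.1.0",
		Runs:    []sarifRun{{Tool: sarifTool{Driver: sarifDriver{Name: "KeyHunter"}}, Results: results}},
	}
}

// ExitCode follows the common convention: 0 = clean, 1 = findings present.
func ExitCode(findingCount int) int {
	if findingCount > 0 {
		return 1
	}
	return 0
}

func main() {
	findings := []Finding{{Provider: "openai", File: "config/.env", Line: 3}}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(BuildSARIF(findings)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A real CLI would call os.Exit here; the sketch only reports the code.
	fmt.Fprintf(os.Stderr, "would exit with code %d\n", ExitCode(len(findings)))
}
```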
**Phase 3 — OSINT engine (the primary differentiator):**
11. Recon engine core: code hosting (GitHub, GitLab, HuggingFace, Replit, etc.)
12. Paste site aggregator (20+ sites)
13. Search engine dorking (Google, Bing, DuckDuckGo, etc.)
14. Package registries (npm, PyPI, RubyGems, etc.)
15. IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge)
**Phase 4 — Automation and reach:**
16. Telegram bot
17. Scheduled scanning (cron-based)
18. Remaining OSINT sources: CI/CD logs, web archives, forums, cloud storage, container/IaC, APK, source maps, threat intel
**Defer to v2:**
- Collaboration tool scanning (Notion, Confluence, Google Docs): auth complexity is very high; add in v2 if demand exists
- Permission analysis: very high complexity; requires API exploration for each provider; good v2 feature
- Web archive scanning at CommonCrawl scale: the dataset is huge and needs careful scoping to avoid scans that run for days
---
## Detection Method Tradeoffs
Based on research across competitive tools, relevant for architectural decisions:
| Method | Recall | Precision | Speed | Best For |
|--------|--------|-----------|-------|----------|
| Regex (named patterns) | High (for known formats) | High | Fast | Provider keys with known prefixes (OpenAI sk-proj-, Anthropic sk-ant-api03-, HuggingFace hf_, Groq gsk_) |
| Entropy (Shannon) | Medium (70.4% per Betterleaks data) | Low (high FP) | Fast | Generic high-entropy strings; use as secondary signal only |
| BPE Tokenization (Betterleaks) | Very High (98.6%) | High | Medium | Next-gen; consider for v2 |
| Keyword pre-filtering | N/A (filter only) | N/A | Very Fast | Reduce candidate set before regex; TruffleHog pattern |
| ML/LLM-based (Nosey Parker AI, GPT-4) | High | Very High | Slow/expensive | FuzzingLabs: GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%; v2 consideration |
| Contextual validation | High | Very High | Medium | GitGuardian's third layer; reduces FP significantly |
**KeyHunter approach:** Regex (primary) + keyword pre-filtering (performance) + entropy (secondary signal). ML-based detection is a v2 feature once the provider coverage gap is closed.
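
The three-stage approach above (keyword pre-filter, then regex, then entropy as a secondary signal) can be sketched in Go as follows. The `Detector`, `Match`, `NewDetector`, and `Scan` names are hypothetical, the `hf_` pattern is illustrative rather than an authoritative HuggingFace token format, and the entropy threshold is an arbitrary example value. Low entropy only flags a match here; it never drops one, since regex stays the primary signal.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// Detector is a hypothetical named rule: cheap keywords plus a regex.
type Detector struct {
	Name     string
	Keywords []string // must be lower-case; compared against lower-cased input
	Pattern  *regexp.Regexp
}

// Match is one regex hit annotated with its entropy.
type Match struct {
	Detector   string
	Secret     string
	Entropy    float64
	LowEntropy bool // secondary signal only: likely placeholder or test value
}

// NewDetector compiles the pattern and panics on an invalid regex.
func NewDetector(name, pattern string, keywords ...string) Detector {
	return Detector{Name: name, Keywords: keywords, Pattern: regexp.MustCompile(pattern)}
}

// ShannonEntropy returns the entropy of s in bits per character.
func ShannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	n := 0
	for _, r := range s {
		counts[r]++
		n++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(n)
		h -= p * math.Log2(p)
	}
	return h
}

// Scan runs each detector's regex only if one of its keywords appears.
func Scan(text string, detectors []Detector, minEntropy float64) []Match {
	lower := strings.ToLower(text)
	var out []Match
	for _, d := range detectors {
		hit := false
		for _, kw := range d.Keywords {
			if strings.Contains(lower, kw) {
				hit = true
				break
			}
		}
		if !hit {
			continue // pre-filter: skip the regex entirely
		}
		for _, s := range d.Pattern.FindAllString(text, -1) {
			e := ShannonEntropy(s)
			out = append(out, Match{Detector: d.Name, Secret: s, Entropy: e, LowEntropy: e < minEntropy})
		}
	}
	return out
}

func main() {
	detectors := []Detector{NewDetector("huggingface", `hf_[A-Za-z0-9]{10,}`, "hf_")}
	text := "HF_TOKEN=hf_aB3dE5gH7jK9mN1 # placeholder: hf_xxxxxxxxxxxx"
	for _, m := range Scan(text, detectors, 3.0) {
		fmt.Printf("%s %q entropy=%.2f low=%v\n", m.Detector, m.Secret, m.Entropy, m.LowEntropy)
	}
}
```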

---
## Ecosystem Context (2026)
- AI-service credential leaks: 1.27M in 2025, up 81% YoY (GitGuardian State of Secrets Sprawl 2026)
- 29M total secrets leaked on GitHub in 2025 (34% YoY increase, largest single-year jump ever)
- LLM infrastructure leaks grow 5x faster than core model provider leaks
- Claude Code-assisted commits show a 3.2% leak rate vs a 1.5% baseline, suggesting AI coding tools are making the problem worse
- 24,008 unique secrets in MCP configuration files found in 2025
- Betterleaks (March 2026): BPE tokenization achieves 98.6% recall vs 70.4% for entropy — new detection paradigm worth tracking
- FuzzingLabs (April 2026): GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%, TruffleHog 0% on split/obfuscated secrets — LLM-based detection becoming viable
- TruffleHog + HuggingFace partnership: native HF scanner for models, datasets, Spaces
- GitHub Secret Scanning: added DeepSeek validity checks in March 2026 — LLM provider awareness growing
---
## Sources
- [TruffleHog GitHub](https://github.com/trufflesecurity/trufflehog) — feature set, detector count, scanning sources
- [TruffleHog Analyze](https://trufflesecurity.com/blog/trufflehog-now-analyzes-permissions-of-api-keys-and-passwords) — permission analysis feature
- [Gitleaks GitHub](https://github.com/gitleaks/gitleaks) — output formats, detection methods
- [Betterleaks — BleepingComputer](https://www.bleepingcomputer.com/news/security/betterleaks-a-new-open-source-secrets-scanner-to-replace-gitleaks/) — BPE tokenization, recall metrics
- [Betterleaks — Aikido](https://www.aikido.dev/blog/betterleaks-gitleaks-successor) — comparison with Gitleaks
- [Titus — Praetorian](https://www.praetorian.com/blog/titus-open-source-secret-scanner/) — 450+ rules, validation, Burp extension
- [Titus GitHub](https://github.com/praetorian-inc/titus) — feature details
- [GitGuardian Secrets Detection](https://www.gitguardian.com/solutions/secrets-detection) — 550+ detectors, enterprise features
- [GitGuardian State of Secrets Sprawl 2026](https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/) — market statistics
- [GitHub Secret Scanning March 2026](https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/) — validity checks for DeepSeek
- [GitHub Secret Scanning Coverage Update](https://github.blog/changelog/2026-03-31-github-secret-scanning-nine-new-types-and-more/) — 28 new detectors
- [GitGuardian Secret Scanning Tools 2026](https://blog.gitguardian.com/secret-scanning-tools/) — market landscape
- [keyhacks — streaak](https://github.com/streaak/keyhacks) — API key validation endpoints reference
- [detect-secrets — Yelp](https://github.com/Yelp/detect-secrets) — baseline approach, 27 detectors
- [FuzzingLabs — LLM vs regex benchmark](https://x.com/FuzzingLabs/status/1980668916851483010) — GPT-5-mini 84.4% vs Gitleaks 37.5%
- [AI/LLM API key scanning on GitHub at scale](https://dev.to/zaim_abbasi/claude-openai-google-api-keys-all-public-this-is-what-i-found-after-scanning-github-at-scale-fj5) — real-world leak discovery
- [Comparative study of secret scanning tools](https://arxiv.org/pdf/2307.00714) — precision/recall benchmarks