# KeyHunter

## What This Is

KeyHunter is a modular API key scanner written in Go that detects and validates API keys from 108+ LLM/AI providers. It combines native scanning with external tool integration (TruffleHog, Gitleaks), OSINT/recon across 80+ internet sources, a web dashboard, and Telegram bot notifications. It is designed for red teams, DevSecOps engineers, bug bounty hunters, and security researchers.

## Core Value

Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.

## Requirements

### Validated

(None yet — ship to validate)

### Active

- [ ] Core scanning engine (regex + entropy + keyword pre-filtering)
- [ ] 108 provider YAML definitions with patterns, keywords, verify endpoints
- [ ] Plugin-based architecture — providers as YAML, compile-time embedded
- [ ] Multiple input sources: file, dir, git history, stdin, URL, clipboard
- [ ] Active key verification via `--verify` flag (off by default)
- [ ] Full key access: `--unmask`, JSON export, `keys show`, web dashboard, Telegram
- [ ] CLI with Cobra: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule
- [ ] TruffleHog & Gitleaks JSON import adapters
- [ ] OSINT/Recon engine: 80+ sources across 18 categories
- [ ] IoT scanners: Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge
- [ ] Code hosting: GitHub, GitLab, Bitbucket, Codeberg, Gitea, Replit, CodeSandbox, HuggingFace, Kaggle, etc.
- [ ] Search engine dorking: Google, Bing, DuckDuckGo, Yandex, Brave
- [ ] Paste site aggregator: Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io, etc.
- [ ] Package registries: npm, PyPI, RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy
- [ ] Container/infra: Docker Hub layers, K8s configs, Terraform state, Helm, Ansible
- [ ] Cloud storage: S3, GCS, Azure Blob, DO Spaces, MinIO, GrayHatWarfare
- [ ] CI/CD logs: Travis, CircleCI, GitHub Actions, Jenkins, GitLab CI
- [ ] Web archives: Wayback Machine, CommonCrawl
- [ ] Forums: StackOverflow, Reddit, HackerNews, dev.to, Medium, Telegram groups, Discord
- [ ] Collaboration: Notion, Confluence, Trello, Google Docs
- [ ] Frontend/JS: Source maps, webpack bundles, exposed .env, Swagger, deploy previews
- [ ] Log aggregators: Elasticsearch/Kibana, Grafana, Sentry
- [ ] Threat intel: VirusTotal, IntelX, URLhaus
- [ ] Mobile: APK decompile scanning
- [ ] DNS/Subdomain: crt.sh, config endpoint probing
- [ ] API marketplaces: Postman, SwaggerHub
- [ ] Built-in dork engine: 150+ dorks in YAML (GitHub, Google, Shodan, Censys, ZoomEye, FOFA, etc.)
- [ ] Web dashboard: htmx + Tailwind + SQLite, scans/keys/recon/providers/dorks/settings pages
- [ ] Telegram bot: /scan, /verify, /recon, /status, /stats, /subscribe, /key
- [ ] Scheduled scanning: cron-based recurring scans with auto-notify
- [ ] Pre-commit hook & CI/CD integration (SARIF output)
- [ ] Output formats: table (colored), JSON, SARIF, CSV
- [ ] SQLite storage with AES-256 encryption
- [ ] Worker pool parallelism, keyword pre-filtering, mmap, delta-based git scanning
- [ ] Rate limiting per source, stealth mode, robots.txt respect
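
The first item in the list above (regex + entropy + keyword pre-filtering) can be sketched as a three-stage check per line. This is a minimal illustration, not the planned engine: the `provider` struct, `demoProviders`, `scanLine`, the `hf_` pattern, and the 3.5 bits-per-byte threshold are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// provider is a hypothetical, simplified stand-in for one provider definition.
type provider struct {
	name     string
	keywords []string       // lowercase substrings checked before any regex runs
	pattern  *regexp.Regexp // candidate key pattern
}

// demoProviders returns one illustrative definition. The pattern is an
// assumption for this sketch, not a vetted production rule.
func demoProviders() []provider {
	return []provider{{
		name:     "HuggingFace",
		keywords: []string{"hf_"},
		pattern:  regexp.MustCompile(`hf_[A-Za-z0-9]{20,}`),
	}}
}

// shannonEntropy returns the Shannon entropy of s in bits per byte.
func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// scanLine applies the stages in order: keyword pre-filter, regex, entropy threshold.
func scanLine(line string, providers []provider, minEntropy float64) []string {
	var findings []string
	lower := strings.ToLower(line)
	for _, p := range providers {
		hit := false
		for _, kw := range p.keywords {
			if strings.Contains(lower, kw) {
				hit = true
				break
			}
		}
		if !hit {
			continue // no keyword present: skip the regex for this provider
		}
		for _, m := range p.pattern.FindAllString(line, -1) {
			if shannonEntropy(m) >= minEntropy {
				findings = append(findings, p.name+": "+m)
			}
		}
	}
	return findings
}

func main() {
	lines := []string{
		`token = "hf_aB3dE5gH7jK9mN1pQ3sT5vW7xY9zA1cD"`, // passes all three stages
		`token = "hf_aaaaaaaaaaaaaaaaaaaaaaaa"`,          // regex matches, entropy too low
		`nothing to see here`,                            // rejected by the keyword pre-filter
	}
	for _, l := range lines {
		fmt.Println(scanLine(l, demoProviders(), 3.5))
	}
}
```

The ordering is the point of the design: the substring check is far cheaper than a regex match, so most input never reaches the later stages, and the entropy threshold then drops placeholder values that happen to match the pattern.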

### Out of Scope

- GUI desktop app — CLI + web dashboard is sufficient
- Real-time streaming API — batch scanning is the primary mode
- Key rotation/remediation — KeyHunter finds keys, doesn't manage them
- Paid SaaS version — open-source tool only
- Windows native — Linux/macOS primary, Windows via WSL/Docker

## Context

- AI-related credential leaks grew 81% YoY in 2025 (GitGuardian report)
- 28M credentials leaked on GitHub in 2025
- Best existing tools (TruffleHog, Gitleaks) cover roughly 15 LLM providers at most
- No dedicated tool covers 100+ LLM providers with detection + verification + OSINT
- LiteLLM supports 107 providers, so our 108-provider list covers the market comprehensively
- High-confidence key prefixes exist for: OpenAI (sk-proj-), Anthropic (sk-ant-api03-), HuggingFace (hf_), Groq (gsk_), Replicate (r8_), Perplexity (pplx-), Fireworks (fw_), Google AI (AIza)
- Many providers (Mistral, Cohere, Together AI, Chinese providers) use generic keys with no distinctive prefix, so keyword-based detection is needed
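
The last two points can be illustrated with a small classifier: a token carrying a known prefix is attributed directly, while a prefix-less token is only reported when a provider keyword appears on the same line. The prefixes are the ones listed above; the `finding` type, the `classify` function, the keyword lists, and the "32+ alphanumerics" shape for generic keys are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// finding is a hypothetical result record for this sketch.
type finding struct {
	Provider string
	Basis    string // "prefix" or "keyword-context"
	Token    string
}

// knownPrefixes mirrors the high-confidence prefixes listed in Context.
var knownPrefixes = []struct{ prefix, provider string }{
	{"sk-proj-", "OpenAI"},
	{"sk-ant-api03-", "Anthropic"},
	{"hf_", "HuggingFace"},
	{"gsk_", "Groq"},
	{"r8_", "Replicate"},
	{"pplx-", "Perplexity"},
	{"fw_", "Fireworks"},
	{"AIza", "Google AI"},
}

// contextKeywords are assumed keyword lists for providers without a prefix.
var contextKeywords = []struct{ keyword, provider string }{
	{"mistral", "Mistral"},
	{"cohere", "Cohere"},
	{"together", "Together AI"},
}

var (
	// candidate extracts long token-like runs from a line.
	candidate = regexp.MustCompile(`[A-Za-z0-9_-]{20,}`)
	// generic is an assumed shape for prefix-less keys.
	generic = regexp.MustCompile(`^[A-Za-z0-9]{32,}$`)
)

func matchPrefix(tok string) (string, bool) {
	for _, p := range knownPrefixes {
		if strings.HasPrefix(tok, p.prefix) {
			return p.provider, true
		}
	}
	return "", false
}

func matchContext(lowerLine string) (string, bool) {
	for _, c := range contextKeywords {
		if strings.Contains(lowerLine, c.keyword) {
			return c.provider, true
		}
	}
	return "", false
}

// classify reports prefixed tokens directly and generic tokens only when a
// provider keyword is present on the same line.
func classify(line string) []finding {
	var out []finding
	lower := strings.ToLower(line)
	for _, tok := range candidate.FindAllString(line, -1) {
		if prov, ok := matchPrefix(tok); ok {
			out = append(out, finding{prov, "prefix", tok})
			continue
		}
		if !generic.MatchString(tok) {
			continue
		}
		if prov, ok := matchContext(lower); ok {
			out = append(out, finding{prov, "keyword-context", tok})
		}
	}
	return out
}

func main() {
	fmt.Println(classify(`OPENAI_API_KEY=sk-proj-abc123DEF456ghi789JKL012`))
	fmt.Println(classify(`MISTRAL_API_KEY=Zx81kQ0pLm47Rt2vYw93Bn56Hs10Df8G`))
	fmt.Println(classify(`checksum=Zx81kQ0pLm47Rt2vYw93Bn56Hs10Df8G`)) // same token, no keyword
}
```

The third call shows why keyword context matters: without it, the same 32-character string is indistinguishable from a hash or checksum and is not reported.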

## Constraints

- **Language**: Go 1.22+ — single binary distribution, performance, TruffleHog/Gitleaks ecosystem alignment
- **Architecture**: Plugin-based — providers as YAML files, compile-time embedded via Go embed
- **Storage**: SQLite — zero-dependency embedded database, AES-256 encrypted
- **Web stack**: htmx + Tailwind CSS — no JS framework dependency, embedded in binary
- **CLI framework**: Cobra — industry standard for Go CLIs
- **Verification**: Must be opt-in (`--verify` flag) — passive scanning by default for legal safety
- **Key masking**: Default masked output, `--unmask` for full keys — shoulder surfing protection
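
As a rough illustration of the masking constraint, the sketch below keeps a short head and tail of the key and hides the rest unless unmasking is explicitly requested. The function names and the 8/4 visible-character split are assumptions for the example; the document does not specify a masking format.

```go
package main

import (
	"fmt"
	"strings"
)

// maskKey keeps the first head and last tail bytes of key and replaces the
// rest with '*'. Keys too short to mask safely are hidden entirely.
// API keys are assumed to be ASCII, so byte slicing is safe here.
func maskKey(key string, head, tail int) string {
	if head < 0 || tail < 0 || len(key) <= head+tail {
		return strings.Repeat("*", len(key))
	}
	hidden := len(key) - head - tail
	return key[:head] + strings.Repeat("*", hidden) + key[len(key)-tail:]
}

// render returns the key as it would be displayed. unmask mirrors the
// --unmask flag and is false by default.
func render(key string, unmask bool) string {
	if unmask {
		return key
	}
	return maskKey(key, 8, 4) // assumed split: 8 visible at the start, 4 at the end
}

func main() {
	key := "sk-proj-abcdefghijklmnop"
	fmt.Println(render(key, false)) // sk-proj-************mnop
	fmt.Println(render(key, true))  // sk-proj-abcdefghijklmnop
	fmt.Println(render("abc", false)) // ***
}
```

Keeping the prefix visible lets a reader identify the provider at a glance, while the short tail is enough to tell two findings apart without exposing the secret.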

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Go over Python/Rust | Single binary, performance, ecosystem alignment with TruffleHog/Gitleaks | — Pending |
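| Plugin architecture (YAML providers) | Community extensibility; providers can be added by contributing a YAML file with no Go code changes (a rebuild is still required because definitions are embedded at compile time) | — Pending |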
| Compile-time embed over runtime plugins | Single binary advantage, no external dependency loading | — Pending |
| SQLite over PostgreSQL | Zero dependency, embedded, sufficient for local tool | — Pending |
| htmx over React/Vue | Minimal JS, embedded in binary, server-rendered simplicity | — Pending |
| Keyword pre-filtering before regex | Skips regex evaluation on content that contains no provider keyword; expected to give roughly a 10x speedup on large codebases (the approach TruffleHog uses) | — Pending |
| YAML dorks alongside YAML providers | Consistent extensibility pattern, community can add dorks same way | — Pending |
| Configurable verification (`--verify`) | Legal safety: passive scanning by default, active verification only when explicitly requested | — Pending |

## Evolution

This document evolves at phase transitions and milestone boundaries.

**After each phase transition:**
1. Requirements invalidated? -> Move to Out of Scope with reason
2. Requirements validated? -> Move to Validated with phase reference
3. New requirements emerged? -> Add to Active
4. Decisions to log? -> Add to Key Decisions
5. "What This Is" still accurate? -> Update if drifted

**After each milestone:**
1. Full review of all sections
2. Core Value check — still the right priority?
3. Audit Out of Scope — reasons still valid?
4. Update Context with current state

---
*Last updated: 2026-04-04 after initialization*