# KeyHunter

> The most comprehensive API key scanner for LLM/AI providers. Detect, validate, and monitor leaked API keys across 108+ providers.

[![Go](https://img.shields.io/badge/Go-1.22+-00ADD8?style=flat-square&logo=go)](https://golang.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)](LICENSE)
[![Providers](https://img.shields.io/badge/Providers-108+-red?style=flat-square)](providers/)

---

## Why KeyHunter?

Existing tools like TruffleHog (~3 LLM detectors) and Gitleaks (~5 LLM rules) were built for general secret scanning. AI-related credential leaks grew **81% year-over-year** in 2025, yet no tool covers more than ~15 LLM providers.

**KeyHunter fills that gap** with 108+ provider-specific detectors, active key validation, OSINT/recon capabilities, and a growing set of internet sources for leak discovery.

### How It Compares

| Feature | KeyHunter | TruffleHog | Gitleaks | detect-secrets |
|---------|-----------|------------|----------|----------------|
| LLM Providers | **108+** | ~3 | ~5 | ~1 |
| Active Verification | **108+ endpoints** | ~20 types | No | No |
| OSINT/Recon Sources | **18 live** (80+ planned) | No | No | No |
| External Tool Import | **TruffleHog + Gitleaks** | - | - | - |
| Dork Engine | **150 built-in YAML dorks** | No | No | No |
| Pre-commit Hook | **Built-in** | Yes | Yes | Yes |
| SARIF Output | **Yes** | Yes | Yes | No |
| Provider YAML Plugin | **Community-extensible** | Go code only | TOML rules | Python plugins |
| Web Dashboard | Coming soon | No | No | No |
| Telegram Bot | Coming soon | No | No | No |
| Scheduled Scanning | Coming soon | No | No | No |

---

## Features

### Implemented

#### Core Scanning Engine
- **3-stage pipeline** -- Aho-Corasick keyword pre-filter, regex match, entropy scoring
- **ants worker pool** for parallel scanning with configurable worker count
- **108 provider YAML definitions** (Tier 1-9), dual-located with `go:embed`
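
The three stages can be sketched in a few lines of Go. This is a minimal illustration, not KeyHunter's actual code: the `detector` type, `newDetector`, `scanLine`, and the entropy threshold of 3.0 bits per character are all assumptions, and `strings.Contains` stands in for a real Aho-Corasick automaton.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// detector is a hypothetical pairing of cheap keywords with a costlier regex.
type detector struct {
	keywords []string
	pattern  *regexp.Regexp
}

// newDetector compiles expr and panics on an invalid pattern.
func newDetector(keywords []string, expr string) detector {
	return detector{keywords: keywords, pattern: regexp.MustCompile(expr)}
}

// shannonEntropy returns the entropy of s in bits per character.
func shannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

// scanLine runs the three stages on one line and returns candidate keys.
func scanLine(line string, d detector, minEntropy float64) []string {
	// Stage 1: keyword pre-filter (strings.Contains stands in for Aho-Corasick).
	hit := false
	for _, kw := range d.keywords {
		if strings.Contains(line, kw) {
			hit = true
			break
		}
	}
	if !hit {
		return nil
	}
	var out []string
	// Stage 2: regex match, run only on lines that passed the pre-filter.
	for _, m := range d.pattern.FindAllString(line, -1) {
		// Stage 3: entropy scoring drops low-randomness placeholders.
		if shannonEntropy(m) >= minEntropy {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	d := newDetector([]string{"yp_"}, `\byp_[A-Za-z0-9]{32}\b`)
	fmt.Println(scanLine(`key = "yp_aB3dE5gH7jK9mN1pQ3sT5vW7yZ9bD1fG"`, d, 3.0))
	// A placeholder made of one repeated character matches the regex but fails stage 3.
	fmt.Println(scanLine(`key = "yp_`+strings.Repeat("a", 32)+`"`, d, 3.0))
}
```

The pre-filter keeps the regex off lines that cannot contain a key, which is where the speedup on large trees would come from.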

#### Input Sources
- **File scanning** -- single file analysis
- **Directory scanning** -- recursive traversal with glob exclusions and mmap
- **Git history scanning** -- full commit history analysis
- **stdin/pipe** support -- `echo "sk-proj-..." | keyhunter scan stdin`
- **URL fetching** -- scan any remote URL content
- **Clipboard scanning** -- instant clipboard content analysis

#### Active Verification
- YAML-driven `HTTPVerifier` -- lightweight API calls to check whether detected keys are active
- Permission and scope extraction (org, rate limits, model access)
- Consent prompt and `LEGAL.md` for legal safety
- Configurable via `--verify` flag (off by default)
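
A YAML-driven verifier could interpret a provider's `verify:` block roughly as follows. The field names (`method`, `url`, `headers`, `success_codes`, `failure_codes`) and the `{{key}}` placeholder come from the provider example in the Contributing section; the `verifySpec` struct, `buildRequest`, `classify`, and the "unknown" verdict are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// verifySpec mirrors the fields of a provider's `verify:` YAML block.
// The struct itself is hypothetical.
type verifySpec struct {
	Method       string
	URL          string
	Headers      map[string]string
	SuccessCodes []int
	FailureCodes []int
}

// buildRequest substitutes the candidate key for the {{key}} placeholder.
func buildRequest(spec verifySpec, key string) (*http.Request, error) {
	req, err := http.NewRequest(spec.Method, spec.URL, nil)
	if err != nil {
		return nil, err
	}
	for name, tmpl := range spec.Headers {
		req.Header.Set(name, strings.ReplaceAll(tmpl, "{{key}}", key))
	}
	return req, nil
}

// classify maps an HTTP status code to a verification verdict.
func classify(spec verifySpec, status int) string {
	for _, c := range spec.SuccessCodes {
		if c == status {
			return "active"
		}
	}
	for _, c := range spec.FailureCodes {
		if c == status {
			return "inactive"
		}
	}
	// Codes such as 429 or 5xx neither confirm nor refute the key.
	return "unknown"
}

// exampleSpec reproduces the verify block from the sample provider YAML.
func exampleSpec() verifySpec {
	return verifySpec{
		Method:       "GET",
		URL:          "https://api.yourprovider.com/v1/models",
		Headers:      map[string]string{"Authorization": "Bearer {{key}}"},
		SuccessCodes: []int{200},
		FailureCodes: []int{401, 403},
	}
}

func main() {
	spec := exampleSpec()
	req, err := buildRequest(spec, "yp_example")
	if err != nil {
		panic(err)
	}
	// No request is sent here; the sketch only shows request construction.
	fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
	fmt.Println(classify(spec, 200), classify(spec, 401), classify(spec, 429))
}
```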

#### Output Formats
- **Table** -- colored terminal output with key masking (default)
- **JSON** -- full key values for programmatic consumption
- **CSV** -- spreadsheet-compatible export
- **SARIF 2.1.0** -- CI/CD integration (GitHub Code Scanning, etc.)
- Exit codes: `0` (clean), `1` (findings), `2` (error)
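
For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 log with one result. The JSON property names follow the SARIF specification; the Go type names, the `finding` struct, and `buildSARIF` are hypothetical and say nothing about how KeyHunter structures its own output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type sarifLog struct {
	Schema  string     `json:"$schema"`
	Version string     `json:"version"`
	Runs    []sarifRun `json:"runs"`
}

type sarifRun struct {
	Tool    sarifTool     `json:"tool"`
	Results []sarifResult `json:"results"`
}

type sarifTool struct {
	Driver sarifDriver `json:"driver"`
}

type sarifDriver struct {
	Name string `json:"name"`
}

type sarifResult struct {
	RuleID    string          `json:"ruleId"`
	Level     string          `json:"level"`
	Message   sarifMessage    `json:"message"`
	Locations []sarifLocation `json:"locations"`
}

type sarifMessage struct {
	Text string `json:"text"`
}

type sarifLocation struct {
	PhysicalLocation sarifPhysical `json:"physicalLocation"`
}

type sarifPhysical struct {
	ArtifactLocation sarifArtifact `json:"artifactLocation"`
	Region           sarifRegion   `json:"region"`
}

type sarifArtifact struct {
	URI string `json:"uri"`
}

type sarifRegion struct {
	StartLine int `json:"startLine"`
}

// finding is a hypothetical scanner result; it carries no key material.
type finding struct {
	RuleID string
	File   string
	Line   int
}

// buildSARIF wraps findings in a single-run SARIF log.
func buildSARIF(tool string, findings []finding) sarifLog {
	results := make([]sarifResult, 0, len(findings))
	for _, f := range findings {
		results = append(results, sarifResult{
			RuleID:  f.RuleID,
			Level:   "error",
			Message: sarifMessage{Text: "Possible API key for rule " + f.RuleID},
			Locations: []sarifLocation{{PhysicalLocation: sarifPhysical{
				ArtifactLocation: sarifArtifact{URI: f.File},
				Region:           sarifRegion{StartLine: f.Line},
			}}},
		})
	}
	return sarifLog{
		Schema:  "https://json.schemastore.org/sarif-2.1.0.json",
		Version: "2.1.0",
		Runs:    []sarifRun{{Tool: sarifTool{Driver: sarifDriver{Name: tool}}, Results: results}},
	}
}

func main() {
	log := buildSARIF("keyhunter", []finding{{RuleID: "openai-project-key", File: "src/config.py", Line: 42}})
	out, err := json.MarshalIndent(log, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```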

#### Key Management
- `keyhunter keys list` -- list all discovered keys (masked by default)
- `keyhunter keys show <id>` -- full key details
- `keyhunter keys export` -- export in JSON/CSV format
- `keyhunter keys copy <id>` -- copy key to clipboard
- `keyhunter keys delete <id>` -- remove a key from the database
- `keyhunter keys verify <id>` -- verify a specific key

#### External Tool Import
- **TruffleHog v3** JSON import with LLM-specific enrichment
- **Gitleaks** JSON and CSV import
- Deduplication across imports via `(provider, masked_key, source)` hashing
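
Deduplication on a `(provider, masked_key, source)` tuple might look like the sketch below. The choice of SHA-256, the NUL separator between fields, and the `finding`, `dedupKey`, and `dedupe` names are assumptions; the README only states which fields are hashed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// finding holds the three fields the README names for deduplication.
type finding struct {
	Provider  string
	MaskedKey string
	Source    string
}

// dedupKey hashes the tuple into a stable identifier.
func dedupKey(f finding) string {
	h := sha256.New()
	for _, part := range []string{f.Provider, f.MaskedKey, f.Source} {
		h.Write([]byte(part))
		// The separator keeps ("ab", "c") and ("a", "bc") from colliding.
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// dedupe keeps the first occurrence of each tuple and preserves input order.
func dedupe(in []finding) []finding {
	seen := make(map[string]struct{}, len(in))
	out := make([]finding, 0, len(in))
	for _, f := range in {
		k := dedupKey(f)
		if _, ok := seen[k]; ok {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	imported := []finding{
		{"openai", "sk-proj-...x234", "src/config.py:42"}, // e.g. from TruffleHog
		{"openai", "sk-proj-...x234", "src/config.py:42"}, // same finding from Gitleaks
		{"groq", "gsk_abcd...9f0e", ".env:3"},
	}
	fmt.Println(len(dedupe(imported)))
}
```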

#### Git Pre-commit Hook
- `keyhunter hook install` -- embedded shell script, blocks leaks before commit
- `keyhunter hook uninstall` -- clean removal
- Backup of existing hooks with `--force`

#### Dork Engine
- **150 built-in YAML dorks** across 8 source types (GitHub, GitLab, Google, Shodan, Censys, ZoomEye, FOFA, Bing)
- GitHub live executor with authenticated API
- CLI management: `keyhunter dorks list`, `keyhunter dorks list --source=github`, `keyhunter dorks add`, `keyhunter dorks run`, `keyhunter dorks export`

#### OSINT / Recon Engine (18 Sources Live)

The recon framework provides a `ReconSource` interface with per-source rate limiting, stealth mode, robots.txt compliance, parallel sweep, and result deduplication.
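
As a rough picture of the parallel sweep and result deduplication, the sketch below queries several sources concurrently and merges their hits. The method set of `ReconSource` shown here is a guess, `staticSource` returns canned data instead of calling any API, and rate limiting, stealth mode, and robots.txt handling are left out.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// ReconSource is an assumed shape for the interface; the methods are illustrative.
type ReconSource interface {
	Name() string
	Search(ctx context.Context, query string) ([]string, error)
}

// staticSource returns canned hits instead of calling a real API.
type staticSource struct {
	name string
	hits []string
}

func (s staticSource) Name() string { return s.name }

func (s staticSource) Search(ctx context.Context, query string) ([]string, error) {
	return s.hits, nil
}

// sweep queries every source in parallel and merges hits without duplicates.
func sweep(ctx context.Context, sources []ReconSource, query string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen = make(map[string]struct{})
	)
	for _, src := range sources {
		wg.Add(1)
		go func(src ReconSource) {
			defer wg.Done()
			hits, err := src.Search(ctx, query)
			if err != nil {
				return // one failing source should not abort the sweep
			}
			mu.Lock()
			defer mu.Unlock()
			for _, h := range hits {
				seen[h] = struct{}{}
			}
		}(src)
	}
	wg.Wait()

	out := make([]string, 0, len(seen))
	for h := range seen {
		out = append(out, h)
	}
	sort.Strings(out) // deterministic order for reporting
	return out
}

// runDemo sweeps two canned sources that share one hit.
func runDemo() []string {
	sources := []ReconSource{
		staticSource{name: "github", hits: []string{"repo-a/.env", "repo-b/config.py"}},
		staticSource{name: "gist", hits: []string{"repo-a/.env", "gist-1"}},
	}
	return sweep(context.Background(), sources, `"sk-proj-"`)
}

func main() {
	fmt.Println(runDemo())
}
```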

**Code Hosting & Snippets** (live)
- **GitHub** -- code search with automated dorks
- **GitLab** -- code search
- **Bitbucket** -- code search
- **GitHub Gist** -- public gist search
- **Codeberg** -- alternative Git platform search
- **HuggingFace** -- Spaces, repos, model configs (high-yield for LLM keys)
- **Replit** -- public repl search
- **CodeSandbox** -- sandbox search
- **StackBlitz Sandboxes** -- sandbox search
- **Kaggle** -- notebooks and datasets with API keys

**Search Engine Dorking** (live)
- **Google** -- Custom Search API / SerpAPI
- **Bing** -- Azure Cognitive Services search
- **DuckDuckGo** -- HTML scraping fallback
- **Yandex** -- XML API search
- **Brave** -- Brave Search API

**Paste Sites** (live)
- **Pastebin** -- scraping API
- **GistPaste** -- paste search
- **PasteSites** -- multi-paste aggregator

**`recon full`** -- parallel sweep across all 18 live sources with deduplication and unified reporting.

#### CLI Commands
| Command | Status |
|---------|--------|
| `keyhunter scan` | Implemented |
| `keyhunter providers list/info/stats` | Implemented |
| `keyhunter config init/set/get` | Implemented |
| `keyhunter keys list/show/export/copy/delete/verify` | Implemented |
| `keyhunter import` | Implemented |
| `keyhunter hook install/uninstall` | Implemented |
| `keyhunter dorks list/add/run/export` | Implemented |
| `keyhunter recon full/list` | Implemented |
| `keyhunter legal` | Implemented |
| `keyhunter verify` | Stub |
| `keyhunter serve` | Stub |
| `keyhunter schedule` | Stub |

### Coming Soon

The following features are on the roadmap but not yet implemented:

#### Phase 12 -- IoT Scanners & Cloud Storage
- **Shodan** -- exposed LLM proxies, dashboards, API endpoints
- **Censys** -- HTTP body search for leaked credentials
- **ZoomEye** -- IoT scanner
- **FOFA** -- Asian infrastructure scanning
- **Netlas** -- HTTP response body search
- **BinaryEdge** -- internet-wide scan data
- **AWS S3 / GCS / Azure Blob / DigitalOcean Spaces** -- bucket enumeration and scanning

#### Phase 13 -- Package Registries, Containers & IaC
- **npm / PyPI / RubyGems / crates.io / Maven / NuGet** -- package source scanning
- **Docker Hub** -- image layer scanning
- **Terraform / Helm Charts / Ansible** -- IaC scanning

#### Phase 14 -- CI/CD Logs, Web Archives & Frontend Leaks
- **GitHub Actions / Travis CI / CircleCI / Jenkins / GitLab CI** -- public build log scanning
- **Wayback Machine / CommonCrawl** -- historical web archive scanning
- **JS Source Maps / Webpack bundles / exposed .env** -- frontend leak detection

#### Phase 15 -- Forums & Collaboration
- **Stack Overflow / Reddit / Hacker News / dev.to / Medium** -- forum scanning
- **Notion / Confluence / Trello** -- collaboration tool scanning
- **Elasticsearch / Grafana / Sentry** -- exposed log aggregators
- **Telegram groups / Discord** -- public channel scanning

#### Phase 16 -- Threat Intel, Mobile, DNS & API Marketplaces
- **VirusTotal / Intelligence X / URLhaus** -- threat intelligence
- **APK analysis** -- mobile app decompilation
- **crt.sh / subdomain probing** -- DNS/subdomain discovery
- **Postman / SwaggerHub** -- API marketplace scanning

#### Phase 17 -- Telegram Bot & Scheduler
- **Telegram Bot** -- scan triggers, key alerts, recon results
- **Scheduled scanning** -- cron-based recurring scans with auto-notify

#### Phase 18 -- Web Dashboard
- **Web Dashboard** -- htmx + Tailwind, SQLite-backed, real-time scan viewer

---

## Quick Start

### Install

```bash
# From source
go install github.com/salvacybersec/keyhunter@latest

# Binary release (when available)
curl -sSL https://github.com/salvacybersec/keyhunter/releases/latest/download/keyhunter_linux_amd64.tar.gz | tar -xz
sudo mv keyhunter /usr/local/bin/
```

### Basic Usage

```bash
# Scan a directory
keyhunter scan ./my-project/

# Scan with active verification
keyhunter scan ./my-project/ --verify

# Scan git history
keyhunter scan --git .

# Scan from pipe
cat secrets.txt | keyhunter scan stdin

# Scan only specific providers
keyhunter scan . --providers=openai,anthropic,deepseek

# JSON output
keyhunter scan . --output=json > results.json

# SARIF output for CI/CD
keyhunter scan . --output=sarif > keyhunter.sarif

# CSV output
keyhunter scan . --output=csv > results.csv
```

### OSINT / Recon

```bash
# Full sweep across all 18 live sources
keyhunter recon full

# Sweep specific sources only
keyhunter recon full --sources=github,gitlab,gist

# List available recon sources
keyhunter recon list

# Code hosting sources
keyhunter recon full --sources=github
keyhunter recon full --sources=gitlab
keyhunter recon full --sources=bitbucket
keyhunter recon full --sources=gist
keyhunter recon full --sources=codeberg
keyhunter recon full --sources=huggingface
keyhunter recon full --sources=replit
keyhunter recon full --sources=codesandbox
keyhunter recon full --sources=sandboxes
keyhunter recon full --sources=kaggle

# Search engine dorking
keyhunter recon full --sources=google
keyhunter recon full --sources=bing
keyhunter recon full --sources=duckduckgo
keyhunter recon full --sources=yandex
keyhunter recon full --sources=brave

# Paste sites
keyhunter recon full --sources=pastebin
keyhunter recon full --sources=gistpaste
keyhunter recon full --sources=pastesites
```

### Dork Management

```bash
keyhunter dorks list                          # All dorks across all sources
keyhunter dorks list --source=github          # GitHub dorks only
keyhunter dorks list --source=google          # Google dorks only
keyhunter dorks add github 'filename:.env "GROQ_API_KEY"'
keyhunter dorks run google --category=frontier
keyhunter dorks export
```

### Key Management

Keys are masked by default in terminal output to guard against shoulder surfing. To access full key values:

```bash
# Show full keys in scan output
keyhunter scan . --unmask

# JSON export always includes full keys
keyhunter scan . --output=json > results.json

# Key management commands
keyhunter keys list                   # Masked list
keyhunter keys list --unmask          # Full key list
keyhunter keys show <id>              # Single key full details (always unmasked)
keyhunter keys copy <id>              # Copy key to clipboard
keyhunter keys export --format=json   # Export all keys with full values
keyhunter keys verify <id>            # Verify key + show full details
keyhunter keys delete <id>            # Remove key from database
```

**Example `keyhunter keys show` output:**
```
 ID:          a3f7b2c1
 Provider:    OpenAI
 Pattern:     OpenAI Project Key
 Key:         sk-proj-abc123def456ghi789jkl012mno345pqr678stu901vwx234
 Confidence:  HIGH
 Source:      src/config.py:42
 Found:       2026-04-04 14:32:01
 Scan ID:     scan_001
 Status:      ACTIVE (verified 2026-04-04 14:32:05)
 Org:         my-org
 Rate Limit:  500 req/min
 Revoke URL:  https://platform.openai.com/api-keys
```

### Import External Tools

```bash
# Run TruffleHog, then enrich with KeyHunter
trufflehog git . --json > trufflehog.json
keyhunter import --format=trufflehog trufflehog.json

# Run Gitleaks, then enrich
gitleaks detect -f json -r gitleaks.json
keyhunter import --format=gitleaks gitleaks.json

# Gitleaks CSV
gitleaks detect -f csv -r gitleaks.csv
keyhunter import --format=gitleaks-csv gitleaks.csv
```

### CI/CD Integration

KeyHunter ships with a git **pre-commit hook** that blocks leaks before they land in
history, a **GitHub Actions** integration that uploads SARIF findings directly into
the repository's Code Scanning tab, and an `import` command that consolidates
TruffleHog and Gitleaks output into one normalized database.

```bash
# Install pre-commit hook (scans staged files only)
keyhunter hook install

# GitHub Actions (SARIF output for Code Scanning upload)
keyhunter scan . --output=sarif > keyhunter.sarif

# Import findings from other scanners
keyhunter import --format=trufflehog trufflehog.json
keyhunter import --format=gitleaks   gitleaks.json

# Exit codes: 0 = clean, 1 = keys found, 2 = error
keyhunter scan .
case $? in
  0) echo "Clean" ;;
  1) echo "Keys found!" ;;
  *) echo "Scan error" ;;
esac
```

See [docs/CI-CD.md](docs/CI-CD.md) for the full guide, including a copy-paste
GitHub Actions workflow and the pre-commit hook install/uninstall lifecycle.

---

## Configuration

```bash
# Initialize config
keyhunter config init
# Creates ~/.keyhunter.yaml

# Set API tokens for recon sources (currently supported)
keyhunter config set recon.github.token "YOUR_GITHUB_TOKEN"
keyhunter config set recon.gitlab.token "YOUR_GITLAB_TOKEN"
keyhunter config set recon.bitbucket.token "YOUR_BITBUCKET_TOKEN"
keyhunter config set recon.huggingface.token "YOUR_HF_TOKEN"
keyhunter config set recon.kaggle.token "YOUR_KAGGLE_TOKEN"
keyhunter config set recon.google.apikey "YOUR_GOOGLE_API_KEY"
keyhunter config set recon.google.cx "YOUR_GOOGLE_CX_ID"
keyhunter config set recon.bing.apikey "YOUR_BING_API_KEY"
keyhunter config set recon.brave.apikey "YOUR_BRAVE_API_KEY"
keyhunter config set recon.yandex.apikey "YOUR_YANDEX_API_KEY"
keyhunter config set recon.yandex.user "YOUR_YANDEX_USER"

# Read back a single config value
keyhunter config get recon.github.token
```

### Config File (`~/.keyhunter.yaml`)

```yaml
scan:
  workers: 8
  verify_timeout: 10s
  default_output: table

recon:
  stealth: false
  respect_robots: true
  github:
    token: ""
  gitlab:
    token: ""
  bitbucket:
    token: ""
  huggingface:
    token: ""
  kaggle:
    token: ""
  google:
    apikey: ""
    cx: ""
  bing:
    apikey: ""
  brave:
    apikey: ""
  yandex:
    apikey: ""
    user: ""
```

### Stealth & Ethics Flags
```bash
--stealth           # User-agent rotation, increased request spacing
--respect-robots    # Respect robots.txt (default: on)
```

---

## Supported Providers (108)

### Tier 1 -- Frontier

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| OpenAI | `sk-proj-*`, `sk-svcacct-*` | High | `GET /v1/models` |
| Anthropic | `sk-ant-api03-*` | High | `GET /v1/models` |
| Google AI (Gemini) | `AIza*` | High | `GET /v1/models` |
| Google Vertex AI | OAuth token | Medium | `GET /v1/models` |
| AWS Bedrock | `AKIA*` | High | `GetFoundationModel` |
| Azure OpenAI | 32-char hex | Medium | `GET /openai/deployments` |
| Meta AI | `meta-llama-*` | Medium | `GET /v1/models` |
| xAI (Grok) | `xai-*` | High | `GET /v1/models` |
| Cohere | `co-*` | High | `GET /v1/models` |
| Mistral AI | 32-char generic | Low | `GET /v1/models` |
| Inflection AI | Generic UUID | Low | `GET /api/models` |
| AI21 Labs | Generic key | Low | `GET /v1/models` |

### Tier 2 -- Inference Platforms

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| Together AI | Generic key | Low | `GET /v1/models` |
| Fireworks AI | `fw_*` | High | `GET /v1/models` |
| Groq | `gsk_*` | High | `GET /openai/v1/models` |
| Replicate | `r8_*` | High | `GET /v1/predictions` |
| Anyscale | Generic key | Low | `GET /v1/models` |
| DeepInfra | Generic key | Low | `GET /v1/models` |
| Lepton AI | `lpt_*` | High | `GET /v1/models` |
| Modal | Generic token | Low | `GET /api/apps` |
| Baseten | Generic key | Low | `GET /v1/models` |
| Cerebrium | Generic key | Low | `GET /v1/models` |
| NovitaAI | Generic key | Low | `GET /v1/models` |
| Sambanova | Generic key | Low | `GET /v1/models` |
| OctoAI | Generic key | Low | `GET /v1/models` |
| Friendli AI | Generic key | Low | `GET /v1/models` |

### Tier 3 -- Specialized/Vertical

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| Perplexity | `pplx-*` | High | `POST /chat/completions` |
| You.com | Generic key | Low | `GET /v1/search` |
| Voyage AI | `voy-*` | High | `GET /v1/models` |
| Jina AI | `jina_*` | High | `GET /v1/models` |
| Unstructured | Generic key | Low | `GET /general/v0/general` |
| AssemblyAI | Generic key | Low | `GET /v2/transcript` |
| Deepgram | Generic key | Low | `GET /v1/projects` |
| ElevenLabs | `el_*` | High | `GET /v1/user` |
| Stability AI | `sk-*` | Medium | `GET /v1/engines/list` |
| Runway ML | Generic key | Low | `GET /v1/models` |
| Midjourney | Generic key | Low | N/A |
| HuggingFace | `hf_*` | High | `GET /api/whoami` |

### Tier 4 -- Chinese/Regional

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| DeepSeek | `sk-*` | Medium | `GET /v1/models` |
| Baichuan | Generic key | Low | `GET /v1/models` |
| Zhipu AI (GLM) | Generic key | Low | `POST /api/paas/v4/chat` |
| Moonshot AI (Kimi) | `sk-*` | Medium | `GET /v1/models` |
| Yi (01.AI) | Generic key | Low | `GET /v1/models` |
| Qwen (Alibaba) | `sk-*` | Medium | `GET /v1/models` |
| Baidu (ERNIE) | API Key + Secret | Medium | Token endpoint |
| ByteDance (Doubao) | Generic key | Low | `GET /v1/models` |
| SenseTime | Generic key | Low | `GET /v1/models` |
| iFlytek (Spark) | API Key + Secret | Medium | WebSocket handshake |
| MiniMax | Generic key | Low | `GET /v1/models` |
| Stepfun | Generic key | Low | `GET /v1/models` |
| 360 AI | Generic key | Low | `GET /v1/models` |
| Kuaishou (Kling) | Generic key | Low | `GET /v1/models` |
| Tencent Hunyuan | SecretId + SecretKey | Medium | `DescribeModels` |
| SiliconFlow | `sf_*` | High | `GET /v1/models` |

### Tier 5 -- Infrastructure/Gateway

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| Cloudflare AI | Cloudflare API token | Medium | `GET /ai/models` |
| Vercel AI | `vercel_*` | High | `GET /v1/models` |
| LiteLLM | Generic key | Low | `GET /v1/models` |
| Portkey | Generic key | Low | `GET /v1/models` |
| Helicone | `sk-helicone-*` | High | `GET /v1/models` |
| OpenRouter | `sk-or-*` | High | `GET /api/v1/models` |
| Martian | Generic key | Low | `GET /v1/models` |
| AI Gateway (Kong) | Generic key | Low | Health endpoint |
| BricksAI | Generic key | Low | `GET /v1/models` |
| Aether | Generic key | Low | `GET /v1/models` |
| Not Diamond | Generic key | Low | `GET /v1/models` |

### Tier 6 -- Emerging/Niche

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| Reka AI | Generic key | Low | `GET /v1/models` |
| Aleph Alpha | Generic key | Low | `GET /models` |
| Writer | Generic key | Low | `GET /v1/models` |
| Jasper AI | Generic key | Low | N/A |
| Typeface | Generic key | Low | N/A |
| Comet ML | Generic key | Low | `GET /api/rest/v2` |
| Weights & Biases | Generic key | Low | `GET /api/v1/viewer` |
| LangSmith | `ls__*` | High | `GET /api/v1/info` |
| Pinecone | Generic key | Low | `GET /databases` |
| Weaviate | Generic key | Low | `GET /v1/meta` |
| Qdrant | Generic key | Low | `GET /collections` |
| Chroma | Generic key | Low | `GET /api/v1/heartbeat` |
| Milvus | Generic key | Low | `GET /v1/vector/collections` |
| Neon AI | Generic key | Low | N/A |
| Lamini | Generic key | Low | `GET /v1/models` |

### Tier 7 -- Code & Dev Tools

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| GitHub Copilot | `ghu_*`, `ghp_*` | High | `GET /user` |
| Cursor | Generic key | Low | N/A |
| Tabnine | Generic key | Low | N/A |
| Codeium/Windsurf | Generic key | Low | N/A |
| Sourcegraph Cody | `sgp_*` | High | `GET /.api/current-user` |
| Amazon CodeWhisperer | `AKIA*` | High | STS GetCallerIdentity |
| Replit AI | Generic key | Low | N/A |
| Codestral (Mistral) | Generic key | Low | `GET /v1/models` |
| IBM watsonx.ai | `ibm_*` | Medium | IAM token endpoint |
| Oracle AI | Generic key | Low | N/A |

### Tier 8 -- Self-Hosted/Open Infra

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| Ollama | N/A (local) | N/A | `GET /api/tags` |
| vLLM | Generic key | Low | `GET /v1/models` |
| LocalAI | Generic key | Low | `GET /v1/models` |
| LM Studio | N/A (local) | N/A | `GET /v1/models` |
| llama.cpp | N/A (local) | N/A | `GET /health` |
| GPT4All | N/A (local) | N/A | N/A |
| text-generation-webui | Generic key | Low | `GET /v1/models` |
| TensorRT-LLM | N/A | N/A | Health endpoint |
| Triton Inference Server | N/A | N/A | `GET /v2/health/ready` |
| Jan AI | N/A (local) | N/A | `GET /v1/models` |

### Tier 9 -- Enterprise/Legacy

| Provider | Key Pattern | Confidence | Verify |
|----------|-------------|------------|--------|
| Salesforce Einstein | Generic token | Low | REST API |
| ServiceNow AI | Generic token | Low | REST API |
| SAP AI Core | OAuth token | Low | Token endpoint |
| Palantir AIP | Generic token | Low | REST API |
| Databricks (DBRX) | `dapi*` | High | `GET /api/2.0/clusters` |
| Snowflake Cortex | JWT token | Medium | SQL endpoint |
| Oracle Generative AI | Generic key | Low | REST API |
| HPE GreenLake AI | Generic token | Low | REST API |

---

## Architecture

```
                    +------------------+
                    |   CLI (Cobra)    |
                    +--------+---------+
                             |
              +--------------+--------------+
              |              |              |
     +--------v---+   +------v-----+  +-----v-------+
     | Input      |   | Recon      |  | Import      |
     | Adapters   |   | Engine     |  | Adapters    |
     | - file     |   | (18 live)  |  | - trufflehog|
     | - dir      |   | - Code(10) |  | - gitleaks  |
     | - git      |   | - Search(5)|  +-----+-------+
     | - stdin    |   | - Paste(3) |        |
     | - url      |   +------+-----+        |
     | - clipboard|          |              |
     +--------+---+          |              |
              |              |              |
              +-------+------+--------------+
                      |
              +-------v--------+
              | Scanner Engine |
              | - matcher.go   |
              | - verifier.go  |
              +-------+--------+
                      |
         +------------+-------------+
         |            |             |
   +-----v----+ +----v-----+ +----v-------+
   | Output   | | Dork     | | Key        |
   | - table  | | Engine   | | Management |
   | - json   | | - 150    | | - list     |
   | - sarif  | |   dorks  | | - show     |
   | - csv    | | - 8 src  | | - export   |
   +----------+ +----------+ +------------+

   +------------------------------------------+
   | Provider Registry (108+ YAML providers)  |
   | Dork Registry (150 YAML dorks)           |
   +------------------------------------------+
```

### Key Design Decisions

- **YAML Providers** -- Adding a new provider means adding a YAML file. Pattern-only changes need no recompile when providers are loaded from an external provider directory; built-in providers are embedded at compile time.
- **Keyword Pre-filtering** -- Before running regex, files are scanned for keywords via Aho-Corasick. This provides ~10x speedup on large codebases.
- **Worker Pool** -- Parallel scanning with configurable worker count via ants. Default: CPU count.
- **Delta-based Git Scanning** -- Only scans changes between commits, not entire trees.
- **SQLite Storage** -- All scan results persisted with AES-256 encryption.
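
The worker-pool decision can be illustrated with plain goroutines and channels. This sketch deliberately swaps the ants library for the standard library so it runs without dependencies; `scanAll` and `hasEnvSuffix` are hypothetical names, and the per-file scan is reduced to a stub.

```go
package main

import (
	"fmt"
	"runtime"
	"sort"
	"strings"
	"sync"
)

// hasEnvSuffix is a stub standing in for a real per-file scan.
func hasEnvSuffix(path string) bool {
	return strings.HasSuffix(path, ".env")
}

// scanAll fans paths out to a fixed number of workers and returns the
// paths that scan flagged, sorted for stable output.
func scanAll(paths []string, workers int, scan func(string) bool) []string {
	if workers < 1 {
		workers = runtime.NumCPU() // mirrors the documented default
	}
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if scan(p) {
					results <- p
				}
			}
		}()
	}

	// Feed jobs, then close results once every worker has finished.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	var flagged []string
	for p := range results {
		flagged = append(flagged, p)
	}
	sort.Strings(flagged)
	return flagged
}

func main() {
	paths := []string{"app/.env", "main.go", "deploy/prod.env", "README.md"}
	fmt.Println(scanAll(paths, 0, hasEnvSuffix))
}
```

A dedicated pool library mainly adds goroutine reuse and tuning knobs on top of this pattern; the fan-out and collection logic stays the same.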

---

## Dork Examples (150 Built-in)

### GitHub
```
filename:.env "OPENAI_API_KEY"
filename:.env "ANTHROPIC_API_KEY"
filename:config.yaml "api_key" "sk-"
"sk-proj-" language:python
"sk-ant-api03" language:javascript
filename:docker-compose "API_KEY"
"api_key" extension:ipynb
filename:.toml "api_key" "sk-"
filename:terraform.tfvars "api_key"
```

### Google Dorking
```
"sk-proj-" -github.com -stackoverflow.com
"sk-ant-api03-" filetype:env
"OPENAI_API_KEY" filetype:yml
"ANTHROPIC_API_KEY" filetype:json
inurl:.env "API_KEY"
intitle:"index of" .env
site:pastebin.com "sk-proj-"
site:replit.com "OPENAI_API_KEY"
```

### Shodan (for future IoT recon sources)
```
http.html:"openai" "api_key" port:8080
http.title:"LiteLLM" port:4000
http.html:"ollama" port:11434
http.title:"Kubernetes Dashboard"
```

---

## Use Cases

### Red Team / Pentest
```bash
# Multi-source recon against a target org
keyhunter recon full --sources=github,gitlab,gist,pastebin

# Scan a cloned repository
keyhunter scan ./target-repo/ --verify

# Scan git history for rotated keys
keyhunter scan --git ./target-repo/
```

### DevSecOps / CI Pipeline
```bash
# Pre-commit hook
keyhunter hook install

# GitHub Actions: command for a workflow step's `run:` field
keyhunter scan . --output=sarif > keyhunter.sarif
```

### Bug Bounty
```bash
# Search code hosting platforms for leaked keys
keyhunter recon full --sources=github,gitlab,bitbucket,gist,codeberg
keyhunter recon full --sources=huggingface,kaggle,replit,codesandbox

# Search engine dorking
keyhunter recon full --sources=google,bing,duckduckgo,brave

# Paste site monitoring
keyhunter recon full --sources=pastebin,pastesites,gistpaste
```

---

## Security & Ethics

### Built-in Protections
- Key values **masked by default** in terminal (first 8 + last 4 chars) -- use `--unmask` for full keys
- **Full keys always available** via: `--unmask`, `--output=json`, `keyhunter keys show`
- Database is **AES-256 encrypted** (full keys stored encrypted)
- API tokens stored **encrypted** in config
- No key values written to logs during `--verify`
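
The "first 8 + last 4" masking rule is simple enough to show directly. In this sketch the `...` filler and the handling of very short keys are assumptions; only the 8 and 4 character counts come from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// maskKey keeps the first 8 and last 4 characters and hides the rest.
// Keys are assumed to be ASCII, so byte slicing is safe here.
func maskKey(key string) string {
	const head, tail = 8, 4
	if len(key) <= head+tail {
		// Too short to reveal any part without exposing most of the key.
		return strings.Repeat("*", len(key))
	}
	return key[:head] + "..." + key[len(key)-tail:]
}

func main() {
	fmt.Println(maskKey("sk-proj-abc123def456ghi789jkl012mno345pqr678stu901vwx234"))
	fmt.Println(maskKey("short"))
}
```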

### Rate Limiting (Recon Sources)
| Source | Rate Limit |
|--------|-----------|
| GitHub API (auth) | 30 req/min |
| GitHub API (unauth) | 10 req/min |
| Google Custom Search | 100/day free, 10K/day paid |
| Bing Search | 1,000/month (free) |
| Brave Search | Per API plan |
| Paste sites | 1 req/2sec |

---

## Contributing

### Adding a New Provider

1. Create `providers/your-provider.yaml`:

```yaml
id: your-provider
name: Your Provider
category: emerging
website: https://api.yourprovider.com
confidence: medium

patterns:
  - id: your-provider-key
    name: "Your Provider API Key"
    regex: '\byp_[A-Za-z0-9]{32}\b'
    confidence: high
    description: "Your Provider API key with yp_ prefix"

keywords:
  - "yp_"
  - "YOUR_PROVIDER_API_KEY"

verify:
  enabled: true
  method: GET
  url: "https://api.yourprovider.com/v1/models"
  headers:
    Authorization: "Bearer {{key}}"
  success_codes: [200]
  failure_codes: [401, 403]

metadata:
  docs: "https://docs.yourprovider.com"
  key_url: "https://dashboard.yourprovider.com/keys"
  env_vars: ["YOUR_PROVIDER_API_KEY"]
```

2. Run tests: `go test ./pkg/provider/...`
3. Submit a PR
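
Before opening the PR, it can help to confirm that the pattern compiles under Go's RE2-based `regexp` package (which has no lookarounds or backreferences) and matches a sample key. The helper below is a hypothetical pre-flight check, not part of the repository's test suite.

```go
package main

import (
	"fmt"
	"regexp"
)

// checkPattern compiles expr and verifies it against positive and negative samples.
func checkPattern(expr string, shouldMatch, shouldNotMatch []string) error {
	re, err := regexp.Compile(expr)
	if err != nil {
		return fmt.Errorf("pattern does not compile: %w", err)
	}
	for _, s := range shouldMatch {
		if !re.MatchString(s) {
			return fmt.Errorf("expected a match for %q", s)
		}
	}
	for _, s := range shouldNotMatch {
		if re.MatchString(s) {
			return fmt.Errorf("unexpected match for %q", s)
		}
	}
	return nil
}

func main() {
	// Pattern taken from the sample provider YAML; the sample keys are made up.
	err := checkPattern(
		`\byp_[A-Za-z0-9]{32}\b`,
		[]string{"yp_aB3dE5gH7jK9mN1pQ3sT5vW7yZ9bD1fG"},
		[]string{"yp_tooShort", "xyp_aB3dE5gH7jK9mN1pQ3sT5vW7yZ9bD1fG"},
	)
	fmt.Println("pattern ok:", err == nil)

	// Lookaheads are valid in PCRE but rejected by Go's regexp package.
	fmt.Println(checkPattern(`(?=yp_)\w+`, nil, nil) != nil)
}
```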

### Adding a New Dork

1. Edit `dorks/<source>.yaml` and add your dork entry
2. Submit a PR

---

## Roadmap

- [x] Core scanning engine (file, dir, git, stdin, url, clipboard)
- [x] 108 provider YAML definitions (Tier 1-9)
- [x] Active verification (YAML-driven HTTPVerifier)
- [x] Output formats: table, JSON, CSV, SARIF 2.1.0
- [x] CLI with Cobra (scan, providers, config, keys, import, hook, dorks, recon, legal)
- [x] TruffleHog & Gitleaks import adapters
- [x] Key management (list, show, export, copy, delete, verify)
- [x] Git pre-commit hook (install/uninstall)
- [x] Dork engine with 150 built-in dorks across 8 sources
- [x] OSINT recon framework with 18 live sources
- [ ] IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge)
- [ ] Cloud storage scanning (S3, GCS, Azure, DigitalOcean)
- [ ] Package registries (npm, PyPI, RubyGems, crates.io, Maven, NuGet)
- [ ] Container & IaC scanning (Docker Hub, Terraform, Helm, Ansible)
- [ ] CI/CD log scanning (GitHub Actions, Travis, CircleCI, Jenkins, GitLab CI)
- [ ] Web archives (Wayback Machine, CommonCrawl)
- [ ] Frontend leak detection (source maps, webpack, .env exposure)
- [ ] Forums & collaboration tools (Stack Overflow, Reddit, Notion, Trello)
- [ ] Threat intel (VirusTotal, Intelligence X, URLhaus)
- [ ] Telegram bot with auto-notifications
- [ ] Scheduled scanning (cron-based)
- [ ] Web dashboard (htmx + Tailwind + SQLite)
- [ ] Docker image
- [ ] Homebrew formula

---

## Disclaimer

KeyHunter is designed for **authorized security testing**, **defensive security**, **bug bounty programs**, and **educational purposes** only. Always ensure you have proper authorization before scanning any target. Unauthorized access to computer systems is illegal.

---

## License

MIT License - see [LICENSE](LICENSE) for details.