diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
new file mode 100644
index 0000000..9f4d692
--- /dev/null
+++ b/.planning/REQUIREMENTS.md
@@ -0,0 +1,311 @@
+# Requirements: KeyHunter
+
+**Defined:** 2026-04-04
+**Core Value:** Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.
+
+## v1 Requirements
+
+Requirements for initial release. Each maps to roadmap phases.
+
+### Core Engine
+
+- [ ] **CORE-01**: Scanner engine detects API keys using keyword pre-filtering + regex matching pipeline
+- [ ] **CORE-02**: Provider definitions loaded from YAML files embedded at compile time via Go embed
+- [ ] **CORE-03**: Provider registry manages 108+ provider definitions with pattern, keyword, confidence, and verify metadata
+- [ ] **CORE-04**: Entropy analysis as secondary signal for low-confidence providers (generic key formats)
+- [ ] **CORE-05**: Worker pool parallelism with configurable worker count (default: CPU count)
+- [ ] **CORE-06**: Aho-Corasick keyword pre-filter runs before regex for 10x performance on large files
+- [ ] **CORE-07**: mmap-based large file reading for memory efficiency
+
+### Providers
+
+- [ ] **PROV-01**: 12 Tier 1 Frontier provider YAML definitions (OpenAI, Anthropic, Google AI, Vertex, AWS Bedrock, Azure OpenAI, Meta AI, xAI, Cohere, Mistral, Inflection, AI21)
+- [ ] **PROV-02**: 14 Tier 2 Inference Platform provider definitions (Together, Fireworks, Groq, Replicate, Anyscale, DeepInfra, Lepton, Modal, Baseten, Cerebrium, NovitaAI, Sambanova, OctoAI, Friendli)
+- [ ] **PROV-03**: 12 Tier 3 Specialized provider definitions (Perplexity, You.com, Voyage, Jina, Unstructured, AssemblyAI, Deepgram, ElevenLabs, Stability, Runway, Midjourney, HuggingFace)
+- [ ] **PROV-04**: 16 Tier 4 Chinese/Regional provider definitions (DeepSeek, Baichuan, Zhipu, Moonshot, Yi, Qwen, Baidu, ByteDance, SenseTime, iFlytek, MiniMax, Stepfun, 360 AI, Kuaishou, Tencent, SiliconFlow)
+- [ ] **PROV-05**: 11 Tier 5 Infrastructure/Gateway provider definitions (Cloudflare AI, Vercel AI, LiteLLM, Portkey, Helicone, OpenRouter, Martian, Kong, BricksAI, Aether, Not Diamond)
+- [ ] **PROV-06**: 15 Tier 6 Emerging/Niche provider definitions (Reka, Aleph Alpha, Writer, Jasper, Typeface, Comet, W&B, LangSmith, Pinecone, Weaviate, Qdrant, Chroma, Milvus, Neon, Lamini)
+- [ ] **PROV-07**: 10 Tier 7 Code/Dev Tools provider definitions (GitHub Copilot, Cursor, Tabnine, Codeium, Sourcegraph, CodeWhisperer, Replit AI, Codestral, watsonx, Oracle AI)
+- [ ] **PROV-08**: 10 Tier 8 Self-Hosted provider definitions (Ollama, vLLM, LocalAI, LM Studio, llama.cpp, GPT4All, text-gen-webui, TensorRT-LLM, Triton, Jan AI)
+- [ ] **PROV-09**: 8 Tier 9 Enterprise provider definitions (Salesforce Einstein, ServiceNow, SAP AI Core, Palantir, Databricks, Snowflake, Oracle GenAI, HPE GreenLake)
+- [ ] **PROV-10**: Provider YAML schema includes format_version and last_verified date for pattern health tracking
+
+### Input Sources
+
+- [ ] **INPUT-01**: File and directory scanning with recursive traversal and glob exclusion patterns
+- [ ] **INPUT-02**: Git-aware scanning — full history, branches, stash, delta-based diffs
+- [ ] **INPUT-03**: Git scanning supports --since flag for time-scoped history scan
+- [ ] **INPUT-04**: stdin/pipe input support (cat file | keyhunter scan stdin)
+- [ ] **INPUT-05**: URL fetching — scan content from any remote URL
+- [ ] **INPUT-06**: Clipboard content scanning
+
+### Verification
+
+- [ ] **VRFY-01**: Active key verification via lightweight API calls when --verify flag is set
+- [ ] **VRFY-02**: Verification is opt-in only (off by default) with consent prompt on first use
+- [ ] **VRFY-03**: Each provider YAML defines verify endpoint, method, headers, success/failure codes
+- [ ] **VRFY-04**: Verification extracts additional metadata (org, rate limit, permissions) when available
+- [ ] **VRFY-05**: Configurable verification timeout (default 10s, --verify-timeout flag)
+- [ ] **VRFY-06**: Legal disclaimer and documentation ships with verification feature
+
+### Output & Reporting
+
+- [ ] **OUT-01**: Colored terminal table output (default)
+- [ ] **OUT-02**: JSON output format
+- [ ] **OUT-03**: SARIF output format (CI/CD compatible)
+- [ ] **OUT-04**: CSV output format
+- [ ] **OUT-05**: Key masking by default (first 8 + last 4 chars) with --unmask flag for full keys
+- [ ] **OUT-06**: Exit codes: 0=clean, 1=keys found, 2=error
+
+### Key Management
+
+- [ ] **KEYS-01**: keyhunter keys list — show all found keys (masked by default, --unmask for full)
+- [ ] **KEYS-02**: keyhunter keys show — single key full detail (always unmasked)
+- [ ] **KEYS-03**: keyhunter keys export --format=json|csv — export all keys with full values
+- [ ] **KEYS-04**: keyhunter keys copy — copy full key to clipboard
+- [ ] **KEYS-05**: keyhunter keys verify — verify specific key and show full detail
+- [ ] **KEYS-06**: keyhunter keys delete — remove key from database
+
+### External Tool Import
+
+- [ ] **IMP-01**: TruffleHog JSON output parser and importer
+- [ ] **IMP-02**: Gitleaks JSON output parser and importer
+- [ ] **IMP-03**: Generic CSV import for custom tool output
+
+### Storage
+
+- [ ] **STOR-01**: SQLite database for persisting scan results, keys, recon history
+- [ ] **STOR-02**: Application-level AES-256 encryption for stored keys and sensitive config
+- [ ] **STOR-03**: Encryption key derived from user passphrase via Argon2
+
+### CLI
+
+- [ ] **CLI-01**: Cobra-based CLI with commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule
+- [ ] **CLI-02**: keyhunter config init creates ~/.keyhunter.yaml
+- [ ] **CLI-03**: keyhunter config set for all configuration
+- [ ] **CLI-04**: keyhunter providers list/info/stats for provider management
+- [ ] **CLI-05**: Scan flags: --providers, --category, --confidence, --exclude, --verify, --workers, --output, --unmask, --notify
+
+### CI/CD Integration
+
+- [ ] **CICD-01**: keyhunter hook install/uninstall for git pre-commit hooks
+- [ ] **CICD-02**: SARIF output uploadable to GitHub Security tab
+
+### OSINT/Recon — IoT & Internet Scanners
+
+- [ ] **RECON-IOT-01**: Shodan API search and dorking
+- [ ] **RECON-IOT-02**: Censys API search
+- [ ] **RECON-IOT-03**: ZoomEye API search
+- [ ] **RECON-IOT-04**: FOFA API search
+- [ ] **RECON-IOT-05**: Netlas API search
+- [ ] **RECON-IOT-06**: BinaryEdge API search
+
+### OSINT/Recon — Code Hosting & Snippets
+
+- [ ] **RECON-CODE-01**: GitHub code search with automated dork execution
+- [ ] **RECON-CODE-02**: GitLab code search with dork execution
+- [ ] **RECON-CODE-03**: GitHub Gist search
+- [ ] **RECON-CODE-04**: Bitbucket code search
+- [ ] **RECON-CODE-05**: Codeberg/Gitea search (Gitea auto-discovered via Shodan)
+- [ ] **RECON-CODE-06**: Replit public repl scanning
+- [ ] **RECON-CODE-07**: CodeSandbox project scanning
+- [ ] **RECON-CODE-08**: HuggingFace Spaces and repos scanning
+- [ ] **RECON-CODE-09**: Kaggle notebook scanning
+- [ ] **RECON-CODE-10**: CodePen, JSFiddle, StackBlitz, Glitch, Observable, Gitpod scanning
+
+### OSINT/Recon — Search Engine Dorking
+
+- [ ] **RECON-DORK-01**: Google dorking via Custom Search API / SerpAPI with 100+ built-in dorks
+- [ ] **RECON-DORK-02**: Bing dorking via Azure Cognitive Services
+- [ ] **RECON-DORK-03**: DuckDuckGo, Yandex, Brave search integration
+
+### OSINT/Recon — Paste Sites
+
+- [ ] **RECON-PASTE-01**: Multi-paste aggregator (Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io, etc.)
+
+### OSINT/Recon — Package Registries
+
+- [ ] **RECON-PKG-01**: npm registry package scanning (download + extract + grep)
+- [ ] **RECON-PKG-02**: PyPI package scanning
+- [ ] **RECON-PKG-03**: RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy scanning
+
+### OSINT/Recon — Container & Infrastructure
+
+- [ ] **RECON-INFRA-01**: Docker Hub image layer scanning and build arg extraction
+- [ ] **RECON-INFRA-02**: Kubernetes exposed dashboards and public Secret/ConfigMap discovery
+- [ ] **RECON-INFRA-03**: Terraform state file and registry module scanning
+- [ ] **RECON-INFRA-04**: Helm chart and Ansible Galaxy scanning
+
+### OSINT/Recon — Cloud Storage
+
+- [ ] **RECON-CLOUD-01**: AWS S3 bucket enumeration and content scanning
+- [ ] **RECON-CLOUD-02**: GCS, Azure Blob, DigitalOcean Spaces, Backblaze B2 scanning
+- [ ] **RECON-CLOUD-03**: Self-hosted MinIO instance discovery via Shodan
+- [ ] **RECON-CLOUD-04**: GrayHatWarfare bucket search engine integration
+
+### OSINT/Recon — CI/CD Logs
+
+- [ ] **RECON-CI-01**: GitHub Actions workflow log scanning
+- [ ] **RECON-CI-02**: Travis CI and CircleCI public build log scanning
+- [ ] **RECON-CI-03**: Exposed Jenkins instance discovery and console output scanning
+- [ ] **RECON-CI-04**: GitLab CI/CD pipeline trace scanning
+
+### OSINT/Recon — Web Archives
+
+- [ ] **RECON-ARCH-01**: Wayback Machine CDX API historical snapshot scanning
+- [ ] **RECON-ARCH-02**: CommonCrawl index and WARC record scanning
+
+### OSINT/Recon — Forums & Documentation
+
+- [ ] **RECON-FORUM-01**: Stack Overflow / Stack Exchange API search
+- [ ] **RECON-FORUM-02**: Reddit subreddit search
+- [ ] **RECON-FORUM-03**: Hacker News Algolia API search
+- [ ] **RECON-FORUM-04**: dev.to and Medium article scanning
+- [ ] **RECON-FORUM-05**: Telegram public channel scanning
+- [ ] **RECON-FORUM-06**: Discord indexed content search
+
+### OSINT/Recon — Collaboration Tools
+
+- [ ] **RECON-COLLAB-01**: Notion public page scanning (via Google dorking)
+- [ ] **RECON-COLLAB-02**: Confluence exposed instance scanning
+- [ ] **RECON-COLLAB-03**: Trello public board scanning
+- [ ] **RECON-COLLAB-04**: Google Docs/Sheets public document scanning
+
+### OSINT/Recon — Frontend & JS Leaks
+
+- [ ] **RECON-JS-01**: JavaScript source map extraction and scanning
+- [ ] **RECON-JS-02**: Webpack/Vite bundle scanning for inlined env vars
+- [ ] **RECON-JS-03**: Exposed .env file scanning on web servers
+- [ ] **RECON-JS-04**: Exposed Swagger/OpenAPI documentation scanning
+- [ ] **RECON-JS-05**: Vercel/Netlify deploy preview JS bundle scanning
+
+### OSINT/Recon — Log Aggregators
+
+- [ ] **RECON-LOG-01**: Exposed Elasticsearch/Kibana instance scanning
+- [ ] **RECON-LOG-02**: Exposed Grafana dashboard scanning
+- [ ] **RECON-LOG-03**: Exposed Sentry instance scanning
+
+### OSINT/Recon — Threat Intelligence
+
+- [ ] **RECON-INTEL-01**: VirusTotal file and URL search
+- [ ] **RECON-INTEL-02**: Intelligence X aggregated search
+- [ ] **RECON-INTEL-03**: URLhaus search
+
+### OSINT/Recon — Mobile & DNS
+
+- [ ] **RECON-MOBILE-01**: APK download, decompile, and scanning
+- [ ] **RECON-DNS-01**: crt.sh Certificate Transparency log subdomain discovery
+- [ ] **RECON-DNS-02**: Subdomain config endpoint probing (.env, /api/config, /actuator/env)
+
+### OSINT/Recon — API Marketplaces
+
+- [ ] **RECON-API-01**: Postman public collections and workspaces scanning
+- [ ] **RECON-API-02**: SwaggerHub published API scanning
+
+### OSINT/Recon — Infrastructure
+
+- [ ] **RECON-INFRA-05**: Per-source rate limiter with configurable limits
+- [ ] **RECON-INFRA-06**: Stealth mode (--stealth) with UA rotation and increased delays
+- [ ] **RECON-INFRA-07**: robots.txt respect (--respect-robots, default on)
+- [ ] **RECON-INFRA-08**: Recon full command — parallel sweep across all sources with deduplication
+
+### Dork Engine
+
+- [ ] **DORK-01**: YAML-based dork definitions (GitHub, Google, Shodan, Censys, ZoomEye, FOFA, GitLab, Bing)
+- [ ] **DORK-02**: 150+ built-in dorks across all sources
+- [ ] **DORK-03**: keyhunter dorks list/add/run/export commands
+- [ ] **DORK-04**: Category-filtered dork execution (--category=frontier)
+
+### Web Dashboard
+
+- [ ] **WEB-01**: Embedded HTTP server (chi + htmx + Tailwind CSS)
+- [ ] **WEB-02**: Dashboard overview page with summary statistics
+- [ ] **WEB-03**: Scan history and scan detail pages
+- [ ] **WEB-04**: Key listing page with filtering and "Reveal Key" toggle
+- [ ] **WEB-05**: OSINT/Recon launcher and results page
+- [ ] **WEB-06**: Provider listing and statistics page
+- [ ] **WEB-07**: Dork management page
+- [ ] **WEB-08**: Settings configuration page
+- [ ] **WEB-09**: REST API (/api/v1/*) for programmatic access
+- [ ] **WEB-10**: Optional basic auth / token auth
+- [ ] **WEB-11**: Server-Sent Events for live scan progress
+
+### Telegram Bot
+
+- [ ] **TELE-01**: /scan command — remote scan trigger
+- [ ] **TELE-02**: /verify command — key verification
+- [ ] **TELE-03**: /recon command — dork execution
+- [ ] **TELE-04**: /status, /stats, /providers, /help commands
+- [ ] **TELE-05**: /subscribe and /unsubscribe for auto-notifications
+- [ ] **TELE-06**: /key command — full key detail in private chat
+- [ ] **TELE-07**: Auto-notification on new key findings
+
+### Scheduled Scanning
+
+- [ ] **SCHED-01**: Cron-based recurring scan scheduling
+- [ ] **SCHED-02**: keyhunter schedule add/list/remove commands
+- [ ] **SCHED-03**: Auto-notify on scheduled scan completion
+
+## v2 Requirements
+
+### Advanced Detection
+
+- **ADV-01**: BPE tokenization-based detection (Betterleaks approach, 98.6% recall)
+- **ADV-02**: ML/LLM-based key detection for zero-pattern providers
+- **ADV-03**: Custom provider YAML hot-reload without recompile (external dir)
+
+### Additional Integrations
+
+- **INT-01**: Slack notification module
+- **INT-02**: Webhook notification module
+- **INT-03**: JIRA ticket creation on key findings
+- **INT-04**: PagerDuty alert integration
+
+### Advanced OSINT
+
+- **OSINT-01**: Dark web / breach database search (Dehashed, HIBP correlation)
+- **OSINT-02**: IPA (iOS) app decompile and scanning
+- **OSINT-03**: Backblaze B2 deep scanning
+- **OSINT-04**: Rapid7 Open Data integration
+
+## Out of Scope
+
+| Feature | Reason |
+|---------|--------|
+| GUI desktop app | CLI + web dashboard covers all use cases |
+| Key rotation/remediation | KeyHunter detects, doesn't manage — separate concern |
+| Automatic key invalidation | Legal exposure, not our responsibility |
+| SaaS hosted version | Open-source tool only, no infrastructure to maintain |
+| Telemetry/analytics | Privacy-first tool, no phone-home |
+| Windows native binary | Linux/macOS primary, Windows via WSL/Docker |
+| Real-time streaming API | Batch scanning is primary mode |
+| regexp2/PCRE patterns | Catastrophic backtracking risk — Go stdlib regexp (RE2) only |
+
+## Traceability
+
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| CORE-01 to CORE-07 | Phase 1 | Pending |
+| PROV-01 to PROV-10 | Phase 2 | Pending |
+| INPUT-01 to INPUT-06 | Phase 3 | Pending |
+| VRFY-01 to VRFY-06 | Phase 4 | Pending |
+| OUT-01 to OUT-06 | Phase 5 | Pending |
+| KEYS-01 to KEYS-06 | Phase 5 | Pending |
+| IMP-01 to IMP-03 | Phase 6 | Pending |
+| STOR-01 to STOR-03 | Phase 1 | Pending |
+| CLI-01 to CLI-05 | Phase 1 | Pending |
+| CICD-01 to CICD-02 | Phase 7 | Pending |
+| RECON-* | Phase 8-15 | Pending |
+| DORK-01 to DORK-04 | Phase 8 | Pending |
+| WEB-01 to WEB-11 | Phase 16 | Pending |
+| TELE-01 to TELE-07 | Phase 17 | Pending |
+| SCHED-01 to SCHED-03 | Phase 18 | Pending |
+
+**Coverage:**
+- v1 requirements: 146 total
+- Mapped to phases: 146
+- Unmapped: 0
+
+---
+*Requirements defined: 2026-04-04*
+*Last updated: 2026-04-04 after initial definition*
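Review note: the Coverage totals are maintained by hand, so they can drift from the checklist as items are added. Below is a minimal sketch of recounting them, assuming every v1 requirement sits on its own line in the form `- [ ] **ID-NN**:` (optionally prefixed with the diff's `+`). The function name `countRequirements` and the regular expression are hypothetical and not part of KeyHunter.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// reqLine matches a v1 checklist entry such as "- [ ] **CORE-01**: ...".
// The optional leading "+" tolerates lines copied straight from a diff.
// Capture 1 is the ID prefix (for example "CORE" or "RECON-IOT").
var reqLine = regexp.MustCompile(`^\+?\s*- \[[ xX]\] \*\*([A-Z]+(?:-[A-Z]+)*)-(\d+)\*\*:`)

// countRequirements returns the number of checklist entries per ID prefix
// and the overall total. Lines that are not checklist entries are ignored,
// so v2 items written as "- **ADV-01**: ..." are not counted.
func countRequirements(markdown string) (map[string]int, int) {
	perPrefix := make(map[string]int)
	total := 0
	sc := bufio.NewScanner(strings.NewReader(markdown))
	for sc.Scan() {
		m := reqLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		perPrefix[m[1]]++
		total++
	}
	return perPrefix, total
}

func main() {
	// Illustrative sample only; in practice the input would be the
	// contents of REQUIREMENTS.md.
	sample := strings.Join([]string{
		"### Core Engine",
		"- [ ] **CORE-01**: Scanner engine ...",
		"- [ ] **CORE-02**: Provider definitions ...",
		"- [x] **RECON-IOT-01**: Shodan API search and dorking",
		"- **ADV-01**: v2 item, not a checklist entry",
	}, "\n")
	perPrefix, total := countRequirements(sample)
	fmt.Println(total, perPrefix["CORE"], perPrefix["RECON-IOT"])
}
```

Running something like this against the file and comparing the total with the Coverage block would catch a stale count before merge. It uses only the stdlib `regexp` package, in line with the RE2-only constraint listed under Out of Scope.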