docs: create roadmap (18 phases)

This commit is contained in:
salvacybersec
2026-04-04 19:12:41 +03:00
parent 6803863833
commit ee92aad4cf
4 changed files with 531 additions and 18 deletions


@@ -285,27 +285,47 @@ Requirements for initial release. Each maps to roadmap phases.
| Requirement | Phase | Status |
|-------------|-------|--------|
-| CORE-01 to CORE-07 | Phase 1 | Pending |
-| PROV-01 to PROV-10 | Phase 2 | Pending |
-| INPUT-01 to INPUT-06 | Phase 3 | Pending |
-| VRFY-01 to VRFY-06 | Phase 4 | Pending |
-| OUT-01 to OUT-06 | Phase 5 | Pending |
-| KEYS-01 to KEYS-06 | Phase 5 | Pending |
-| IMP-01 to IMP-03 | Phase 6 | Pending |
-| STOR-01 to STOR-03 | Phase 1 | Pending |
-| CLI-01 to CLI-05 | Phase 1 | Pending |
-| CICD-01 to CICD-02 | Phase 7 | Pending |
-| RECON-* | Phase 8-15 | Pending |
-| DORK-01 to DORK-04 | Phase 8 | Pending |
-| WEB-01 to WEB-11 | Phase 16 | Pending |
-| TELE-01 to TELE-07 | Phase 17 | Pending |
-| SCHED-01 to SCHED-03 | Phase 18 | Pending |
+| CORE-01, CORE-02, CORE-03, CORE-04, CORE-05, CORE-06, CORE-07 | Phase 1 | Pending |
+| STOR-01, STOR-02, STOR-03 | Phase 1 | Pending |
+| CLI-01, CLI-02, CLI-03, CLI-04, CLI-05 | Phase 1 | Pending |
+| PROV-10 | Phase 1 | Pending |
+| PROV-01, PROV-02 | Phase 2 | Pending |
+| PROV-03, PROV-04, PROV-05, PROV-06, PROV-07, PROV-08, PROV-09 | Phase 3 | Pending |
+| INPUT-01, INPUT-02, INPUT-03, INPUT-04, INPUT-05, INPUT-06 | Phase 4 | Pending |
+| VRFY-01, VRFY-02, VRFY-03, VRFY-04, VRFY-05, VRFY-06 | Phase 5 | Pending |
+| OUT-01, OUT-02, OUT-03, OUT-04, OUT-05, OUT-06 | Phase 6 | Pending |
+| KEYS-01, KEYS-02, KEYS-03, KEYS-04, KEYS-05, KEYS-06 | Phase 6 | Pending |
+| IMP-01, IMP-02, IMP-03 | Phase 7 | Pending |
+| CICD-01, CICD-02 | Phase 7 | Pending |
+| DORK-01, DORK-02, DORK-03, DORK-04 | Phase 8 | Pending |
+| RECON-INFRA-05, RECON-INFRA-06, RECON-INFRA-07, RECON-INFRA-08 | Phase 9 | Pending |
+| RECON-CODE-01, RECON-CODE-02, RECON-CODE-03, RECON-CODE-04, RECON-CODE-05 | Phase 10 | Pending |
+| RECON-CODE-06, RECON-CODE-07, RECON-CODE-08, RECON-CODE-09, RECON-CODE-10 | Phase 10 | Pending |
+| RECON-DORK-01, RECON-DORK-02, RECON-DORK-03 | Phase 11 | Pending |
+| RECON-PASTE-01 | Phase 11 | Pending |
+| RECON-IOT-01, RECON-IOT-02, RECON-IOT-03, RECON-IOT-04, RECON-IOT-05, RECON-IOT-06 | Phase 12 | Pending |
+| RECON-CLOUD-01, RECON-CLOUD-02, RECON-CLOUD-03, RECON-CLOUD-04 | Phase 12 | Pending |
+| RECON-PKG-01, RECON-PKG-02, RECON-PKG-03 | Phase 13 | Pending |
+| RECON-INFRA-01, RECON-INFRA-02, RECON-INFRA-03, RECON-INFRA-04 | Phase 13 | Pending |
+| RECON-CI-01, RECON-CI-02, RECON-CI-03, RECON-CI-04 | Phase 14 | Pending |
+| RECON-ARCH-01, RECON-ARCH-02 | Phase 14 | Pending |
+| RECON-JS-01, RECON-JS-02, RECON-JS-03, RECON-JS-04, RECON-JS-05 | Phase 14 | Pending |
+| RECON-FORUM-01, RECON-FORUM-02, RECON-FORUM-03, RECON-FORUM-04, RECON-FORUM-05, RECON-FORUM-06 | Phase 15 | Pending |
+| RECON-COLLAB-01, RECON-COLLAB-02, RECON-COLLAB-03, RECON-COLLAB-04 | Phase 15 | Pending |
+| RECON-LOG-01, RECON-LOG-02, RECON-LOG-03 | Phase 15 | Pending |
+| RECON-INTEL-01, RECON-INTEL-02, RECON-INTEL-03 | Phase 16 | Pending |
+| RECON-MOBILE-01 | Phase 16 | Pending |
+| RECON-DNS-01, RECON-DNS-02 | Phase 16 | Pending |
+| RECON-API-01, RECON-API-02 | Phase 16 | Pending |
+| TELE-01, TELE-02, TELE-03, TELE-04, TELE-05, TELE-06, TELE-07 | Phase 17 | Pending |
+| SCHED-01, SCHED-02, SCHED-03 | Phase 17 | Pending |
+| WEB-01, WEB-02, WEB-03, WEB-04, WEB-05, WEB-06, WEB-07, WEB-08, WEB-09, WEB-10, WEB-11 | Phase 18 | Pending |
**Coverage:**
-- v1 requirements: 120 total
-- Mapped to phases: 120
+- v1 requirements: 146 total (file count; PROJECT.md summary of 120 was a pre-count estimate)
+- Mapped to phases: 146
- Unmapped: 0
---
*Requirements defined: 2026-04-04*
-*Last updated: 2026-04-04 after initial definition*
+*Last updated: 2026-04-04 after roadmap creation (18 phases)*

.planning/ROADMAP.md Normal file

@@ -0,0 +1,268 @@
# Roadmap: KeyHunter
## Overview
KeyHunter is built in dependency order. The provider registry and storage schema come first because every other subsystem depends on them, followed by the scanning engine pipeline and then the full provider library. Next come input sources and verification, then output and key management, then the first competitive differentiators (dork engine, import adapters, CI/CD integration). The OSINT/recon engine follows, with infrastructure architecture landing before individual sources, then automation and notification (Telegram bot, scheduler), and finally the web dashboard, which aggregates all subsystems. Each phase delivers a complete, verifiable capability before the next phase begins.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- [ ] **Phase 1: Foundation** - Provider registry schema, storage layer with AES-256, and CLI skeleton that everything else depends on
- [ ] **Phase 2: Tier 1-2 Providers** - Frontier and inference platform provider YAML definitions (26 providers, highest-value targets)
- [ ] **Phase 3: Tier 3-9 Providers** - Remaining 82 provider definitions completing 108+ provider coverage
- [ ] **Phase 4: Input Sources** - All input modes: file/dir, full git history, stdin, URL, clipboard
- [ ] **Phase 5: Verification Engine** - Opt-in active key verification with consent prompt and legal documentation
- [ ] **Phase 6: Output, Reporting & Key Management** - All output formats and complete key management CLI
- [ ] **Phase 7: Import Adapters & CI/CD Integration** - TruffleHog/Gitleaks import + pre-commit hooks + SARIF to GitHub Security
- [ ] **Phase 8: Dork Engine** - YAML-based dork definitions with 150+ built-in dorks and management commands
- [ ] **Phase 9: OSINT Infrastructure** - Per-source rate limiter architecture and recon engine framework before any sources
- [ ] **Phase 10: OSINT Code Hosting** - GitHub, GitLab, Bitbucket, HuggingFace and 6 more code hosting sources
- [ ] **Phase 11: OSINT Search & Paste** - Search engine dorking and paste site aggregation
- [ ] **Phase 12: OSINT IoT & Cloud Storage** - Shodan/Censys/ZoomEye/FOFA and S3/GCS/Azure cloud storage scanning
- [ ] **Phase 13: OSINT Package Registries & Container/IaC** - npm/PyPI/crates.io and Docker Hub/K8s/Terraform scanning
- [ ] **Phase 14: OSINT CI/CD Logs, Web Archives & Frontend Leaks** - Build logs, Wayback Machine, and JS bundle/env scanning
- [ ] **Phase 15: OSINT Forums, Collaboration & Log Aggregators** - StackOverflow/Reddit/HN, Notion/Trello, Elasticsearch/Grafana/Sentry
- [ ] **Phase 16: OSINT Threat Intel, Mobile, DNS & API Marketplaces** - VirusTotal/IntelX, APK scanning, crt.sh, Postman/SwaggerHub
- [ ] **Phase 17: Telegram Bot & Scheduled Scanning** - Remote control bot and cron-based recurring scans with auto-notify
- [ ] **Phase 18: Web Dashboard** - Embedded htmx + Tailwind dashboard aggregating all subsystems with SSE live updates
## Phase Details
### Phase 1: Foundation
**Goal**: The provider registry schema, encrypted storage layer, and CLI skeleton exist and function correctly — all downstream subsystems have stable interfaces to build against
**Depends on**: Nothing (first phase)
**Requirements**: CORE-01, CORE-02, CORE-03, CORE-04, CORE-05, CORE-06, CORE-07, STOR-01, STOR-02, STOR-03, CLI-01, CLI-02, CLI-03, CLI-04, CLI-05, PROV-10
**Success Criteria** (what must be TRUE):
1. `keyhunter scan ./somefile` runs the three-stage pipeline (Aho-Corasick pre-filter → regex → entropy) and returns findings with provider names
2. Findings are persisted to a SQLite database with the key value stored AES-256 encrypted — plaintext key never appears in the database file
3. `keyhunter config init` creates `~/.keyhunter.yaml` and `keyhunter config set <key> <value>` persists values
4. `keyhunter providers list` and `keyhunter providers info <name>` return provider metadata from YAML definitions
5. Provider YAML schema includes `format_version` and `last_verified` fields validated at load time
**Plans**: TBD
### Phase 2: Tier 1-2 Providers
**Goal**: The 26 highest-value LLM provider YAML definitions exist with accurate regex patterns, keyword lists, confidence levels, and verify endpoints — covering OpenAI, Anthropic, Google AI, AWS Bedrock, Azure OpenAI and all major inference platforms
**Depends on**: Phase 1
**Requirements**: PROV-01, PROV-02
**Success Criteria** (what must be TRUE):
1. `keyhunter scan` correctly identifies keys from all 12 Tier 1 providers (OpenAI sk-proj-, Anthropic sk-ant-api03-, Google AIza, etc.) with correct provider names
2. `keyhunter scan` correctly identifies keys from all 14 Tier 2 inference platform providers (Groq gsk_, Replicate r8_, Fireworks fw_, Perplexity pplx-, etc.)
3. Each provider YAML includes a `keywords` list that enables Aho-Corasick pre-filtering to skip files with no matching context
4. `keyhunter providers stats` shows 26 providers loaded with pattern and keyword counts
**Plans**: TBD
### Phase 3: Tier 3-9 Providers
**Goal**: All 108+ LLM provider definitions exist — specialized models, Chinese/regional providers, infrastructure gateways, emerging tools, code assistants, self-hosted runtimes, and enterprise platforms
**Depends on**: Phase 2
**Requirements**: PROV-03, PROV-04, PROV-05, PROV-06, PROV-07, PROV-08, PROV-09
**Success Criteria** (what must be TRUE):
1. `keyhunter providers stats` shows 108+ total providers across all tiers
2. Chinese/regional provider keys (DeepSeek, Zhipu, Moonshot, Qwen, Baidu, ByteDance, etc.) are detected using keyword-based matching since they use generic key formats
3. Self-hosted provider definitions (Ollama, vLLM, LocalAI, etc.) include patterns for API key authentication when applicable
4. `keyhunter providers list --tier=enterprise` returns Salesforce, ServiceNow, SAP, Palantir, Databricks, Snowflake, Oracle, HPE providers
**Plans**: TBD
### Phase 4: Input Sources
**Goal**: Users can point KeyHunter at any content source — local files, git history across all branches, piped content, remote URLs, and the clipboard — and all are scanned through the same detection pipeline
**Depends on**: Phase 2
**Requirements**: INPUT-01, INPUT-02, INPUT-03, INPUT-04, INPUT-05, INPUT-06
**Success Criteria** (what must be TRUE):
1. `keyhunter scan ./myrepo` recursively scans all files with glob exclusion patterns (e.g., `--exclude="*.min.js"`) and mmap is used for files above a configurable size threshold
2. `keyhunter scan --git ./myrepo` scans full git history including all branches, tags, and stash entries; `--since=2024-01-01` limits to commits after that date
3. `cat secrets.txt | keyhunter scan stdin` detects keys from piped input
4. `keyhunter scan --url https://example.com/config.js` fetches and scans the remote URL content
5. `keyhunter scan --clipboard` scans the current clipboard content
**Plans**: TBD
### Phase 5: Verification Engine
**Goal**: Users can opt into active key verification with a consent prompt, legal documentation, and per-provider API calls that confirm whether a found key is live and return metadata about the key's access level
**Depends on**: Phase 2
**Requirements**: VRFY-01, VRFY-02, VRFY-03, VRFY-04, VRFY-05, VRFY-06
**Success Criteria** (what must be TRUE):
1. `keyhunter scan --verify` triggers a one-time consent prompt on first use with clear legal language; user must type "yes" to proceed
2. Each provider YAML's verify endpoint, method, headers, and success/failure codes are used for verification — no hardcoded verification logic
3. `keyhunter scan --verify` extracts and displays org name, rate limit tier, and available permissions when the provider API returns them
4. `--verify-timeout=30s` changes the per-key verification timeout from the default 10s
5. A `LEGAL.md` file shipping with the binary documents the legal implications of using `--verify`
**Plans**: TBD
### Phase 6: Output, Reporting & Key Management
**Goal**: Users can consume scan results in any format they need and perform full lifecycle management of stored keys — listing, inspecting, exporting, copying, and deleting
**Depends on**: Phase 5
**Requirements**: OUT-01, OUT-02, OUT-03, OUT-04, OUT-05, OUT-06, KEYS-01, KEYS-02, KEYS-03, KEYS-04, KEYS-05, KEYS-06
**Success Criteria** (what must be TRUE):
1. Default table output shows colored, masked keys (first 8 + last 4 chars); `--unmask` reveals full key values; `--output=json|sarif|csv` switches format
2. Exit code is 0 when no keys found, 1 when keys are found, 2 on scan error — confirming CI/CD compatibility
3. `keyhunter keys list` shows all stored keys masked; `keyhunter keys show <id>` shows full unmasked detail
4. `keyhunter keys export --format=json` produces a JSON file with full key values; `--format=csv` produces a CSV
5. `keyhunter keys copy <id>` copies the full key to clipboard; `keyhunter keys delete <id>` removes the key from the database
**Plans**: TBD
### Phase 7: Import Adapters & CI/CD Integration
**Goal**: Users can import findings from TruffleHog and Gitleaks into KeyHunter's database, and use KeyHunter in pre-commit hooks and CI/CD pipelines with SARIF output uploadable to GitHub Security
**Depends on**: Phase 6
**Requirements**: IMP-01, IMP-02, IMP-03, CICD-01, CICD-02
**Success Criteria** (what must be TRUE):
1. `keyhunter import --format=trufflehog results.json` parses TruffleHog v3 JSON output and normalizes findings into the KeyHunter database
2. `keyhunter import --format=gitleaks results.json` and `--format=csv` both import and deduplicate against existing findings
3. `keyhunter hook install` installs a git pre-commit hook; running `git commit` on a file with a known API key blocks the commit and prints findings
4. `keyhunter scan --output=sarif` produces a valid SARIF 2.1.0 file that GitHub Code Scanning accepts without errors
**Plans**: TBD
### Phase 8: Dork Engine
**Goal**: Users can run, manage, and extend a library of 150+ built-in YAML dorks across GitHub, Google, Shodan, Censys, ZoomEye, FOFA, GitLab, and Bing — using the same extensibility pattern as provider definitions
**Depends on**: Phase 7
**Requirements**: DORK-01, DORK-02, DORK-03, DORK-04
**Success Criteria** (what must be TRUE):
1. `keyhunter dorks list` shows 150+ built-in dorks with source engine and category columns
2. `keyhunter dorks run --source=github --category=frontier` executes all Tier 1 frontier provider dorks against GitHub code search
3. `keyhunter dorks add --source=google --query='site:pastebin.com "sk-ant-api03-"'` persists a custom dork that appears in subsequent `dorks list` output
4. `keyhunter dorks export --format=json` exports all dorks including custom additions
**Plans**: TBD
### Phase 9: OSINT Infrastructure
**Goal**: The recon engine's `ReconSource` interface, per-source rate limiter architecture, stealth mode, and parallel sweep orchestrator exist and are validated — all individual source modules build on this foundation
**Depends on**: Phase 8
**Requirements**: RECON-INFRA-05, RECON-INFRA-06, RECON-INFRA-07, RECON-INFRA-08
**Success Criteria** (what must be TRUE):
1. Every recon source module holds its own `rate.Limiter` instance — no centralized rate limiter — and the `ReconSource` interface enforces a `RateLimit() rate.Limit` method
2. `keyhunter recon full --stealth` applies user-agent rotation and jitter delays to all sources; log output shows "source exhausted" events rather than silently returning empty results
3. `keyhunter recon full --respect-robots` (default on) respects robots.txt for web-scraping sources before making any requests
4. `keyhunter recon full` fans out to all enabled sources in parallel and deduplicates findings before persisting to the database
**Plans**: TBD
### Phase 10: OSINT Code Hosting
**Goal**: Users can scan 10 code hosting platforms — GitHub, GitLab, Bitbucket, GitHub Gist, Codeberg/Gitea, Replit, CodeSandbox, HuggingFace, Kaggle, and miscellaneous code sandbox sites — for leaked LLM API keys
**Depends on**: Phase 9
**Requirements**: RECON-CODE-01, RECON-CODE-02, RECON-CODE-03, RECON-CODE-04, RECON-CODE-05, RECON-CODE-06, RECON-CODE-07, RECON-CODE-08, RECON-CODE-09, RECON-CODE-10
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=github,gitlab` executes provider keyword dorks against GitHub and GitLab code search APIs and feeds results into the detection pipeline
2. `keyhunter recon --sources=huggingface` scans HuggingFace Spaces and model repos for exposed keys
3. `keyhunter recon --sources=gist,bitbucket,codeberg` scans public gists, Bitbucket repos, and Codeberg/Gitea instances
4. `keyhunter recon --sources=replit,codesandbox,kaggle` scans public repls, sandboxes, and notebooks
5. All code hosting source findings are stored in the database with source attribution and deduplication
**Plans**: TBD
### Phase 11: OSINT Search & Paste
**Goal**: Users can run automated search engine dorking against Google, Bing, DuckDuckGo, Yandex, and Brave, and scan 15+ paste site aggregations for leaked API keys
**Depends on**: Phase 9
**Requirements**: RECON-DORK-01, RECON-DORK-02, RECON-DORK-03, RECON-PASTE-01
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=google` runs built-in dorks via Google Custom Search API or SerpAPI and returns results with the dork query that triggered each finding
2. `keyhunter recon --sources=bing` executes dorks via Azure Cognitive Services and `--sources=duckduckgo,yandex,brave` via their respective integrations
3. `keyhunter recon --sources=paste` queries Pastebin API and scrapes 15+ additional paste sites, feeding raw content through the detection pipeline
**Plans**: TBD
### Phase 12: OSINT IoT & Cloud Storage
**Goal**: Users can discover exposed LLM endpoints via IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge) and scan publicly accessible cloud storage buckets (S3, GCS, Azure Blob, MinIO, GrayHatWarfare) for leaked keys
**Depends on**: Phase 9
**Requirements**: RECON-IOT-01, RECON-IOT-02, RECON-IOT-03, RECON-IOT-04, RECON-IOT-05, RECON-IOT-06, RECON-CLOUD-01, RECON-CLOUD-02, RECON-CLOUD-03, RECON-CLOUD-04
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=shodan` searches Shodan for exposed vLLM, Ollama, and LiteLLM proxy endpoints using the user's API key
2. `keyhunter recon --sources=censys,zoomeye,fofa,netlas,binaryedge` each execute IoT searches with appropriate query formats per platform
3. `keyhunter recon --sources=s3` enumerates publicly accessible S3 buckets and scans readable objects for API key patterns
4. `keyhunter recon --sources=gcs,azureblob,spaces` scans GCS, Azure Blob, and DigitalOcean Spaces; `--sources=minio` discovers MinIO instances via Shodan integration
5. `keyhunter recon --sources=grayhatwarfare` queries the GrayHatWarfare bucket search engine for matching bucket names
**Plans**: TBD
### Phase 13: OSINT Package Registries & Container/IaC
**Goal**: Users can scan npm, PyPI, and 6 other package registries for packages containing leaked keys, and scan Docker Hub image layers, Kubernetes configs, Terraform state files, Helm charts, and Ansible Galaxy for secrets in infrastructure code
**Depends on**: Phase 9
**Requirements**: RECON-PKG-01, RECON-PKG-02, RECON-PKG-03, RECON-INFRA-01, RECON-INFRA-02, RECON-INFRA-03, RECON-INFRA-04
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=npm` downloads and extracts package tarballs for recently published packages matching LLM-related keywords and scans their contents
2. `keyhunter recon --sources=pypi,rubygems,crates,maven,nuget,packagist,goproxy` scans respective registries using the same download-extract-scan pattern
3. `keyhunter recon --sources=dockerhub` extracts and scans image layers and build args from public Docker Hub images
4. `keyhunter recon --sources=k8s` discovers publicly exposed Kubernetes dashboards and scans publicly readable Secret/ConfigMap objects
5. `keyhunter recon --sources=terraform,helm,ansible` scans Terraform registry modules, Helm chart repositories, and Ansible Galaxy roles
**Plans**: TBD
### Phase 14: OSINT CI/CD Logs, Web Archives & Frontend Leaks
**Goal**: Users can scan public CI/CD build logs, historical web snapshots from the Wayback Machine and CommonCrawl, and frontend JavaScript artifacts (source maps, webpack bundles, exposed .env files) for leaked API keys
**Depends on**: Phase 9
**Requirements**: RECON-CI-01, RECON-CI-02, RECON-CI-03, RECON-CI-04, RECON-ARCH-01, RECON-ARCH-02, RECON-JS-01, RECON-JS-02, RECON-JS-03, RECON-JS-04, RECON-JS-05
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=github-actions` scans public GitHub Actions workflow run logs for leaked keys in build output
2. `keyhunter recon --sources=travis,circleci,jenkins,gitlab-ci` scans public build logs from each CI platform
3. `keyhunter recon --sources=wayback` queries the CDX API for historical snapshots of target domains and scans retrieved content
4. `keyhunter recon --sources=commoncrawl` searches CommonCrawl indexes for pages matching LLM provider keywords and scans WARC records
5. `keyhunter recon --sources=sourcemaps,webpack,dotenv,swagger,deploypreview` each extract and scan the relevant JS artifacts and configuration files
**Plans**: TBD
### Phase 15: OSINT Forums, Collaboration & Log Aggregators
**Goal**: Users can search developer forums, public collaboration tool pages, and exposed monitoring dashboards for leaked API keys — covering Stack Overflow, Reddit, HackerNews, dev.to, Telegram channels, Discord, Notion, Confluence, Trello, Google Docs, Elasticsearch, Grafana, and Sentry
**Depends on**: Phase 9
**Requirements**: RECON-FORUM-01, RECON-FORUM-02, RECON-FORUM-03, RECON-FORUM-04, RECON-FORUM-05, RECON-FORUM-06, RECON-COLLAB-01, RECON-COLLAB-02, RECON-COLLAB-03, RECON-COLLAB-04, RECON-LOG-01, RECON-LOG-02, RECON-LOG-03
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=stackoverflow,reddit,hackernews` queries each platform's API/Algolia search for LLM provider keywords and scans result content
2. `keyhunter recon --sources=devto,medium,telegram,discord` scans publicly accessible posts, articles, and indexed channel content
3. `keyhunter recon --sources=notion,confluence,trello,googledocs` scans publicly accessible pages via dorking and direct API access where available
4. `keyhunter recon --sources=elasticsearch,grafana,sentry` discovers exposed instances and scans accessible log data and dashboards
**Plans**: TBD
### Phase 16: OSINT Threat Intel, Mobile, DNS & API Marketplaces
**Goal**: Users can search threat intelligence platforms, scan decompiled Android APKs, perform DNS/subdomain discovery for config endpoint probing, and scan Postman/SwaggerHub API collections for leaked LLM keys
**Depends on**: Phase 9
**Requirements**: RECON-INTEL-01, RECON-INTEL-02, RECON-INTEL-03, RECON-MOBILE-01, RECON-DNS-01, RECON-DNS-02, RECON-API-01, RECON-API-02
**Success Criteria** (what must be TRUE):
1. `keyhunter recon --sources=virustotal,intelx,urlhaus` queries each threat intelligence platform for files and URLs containing LLM provider keywords
2. `keyhunter recon --sources=apk --target=com.example.app` downloads, decompiles (via apktool/jadx), and scans APK content for API keys
3. `keyhunter recon --sources=crtsh --target=example.com` discovers subdomains via Certificate Transparency logs and probes each for `.env`, `/api/config`, and `/actuator/env` endpoints
4. `keyhunter recon --sources=postman,swaggerhub` scans public Postman collections and SwaggerHub API definitions for hardcoded keys in request examples
**Plans**: TBD
### Phase 17: Telegram Bot & Scheduled Scanning
**Goal**: Users can control KeyHunter remotely via a Telegram bot with scan, verify, recon, status, and subscription commands, and set up cron-based recurring scans that auto-notify on new findings
**Depends on**: Phase 16
**Requirements**: TELE-01, TELE-02, TELE-03, TELE-04, TELE-05, TELE-06, TELE-07, SCHED-01, SCHED-02, SCHED-03
**Success Criteria** (what must be TRUE):
1. `keyhunter serve --telegram` starts the bot; `/scan ./myrepo` in a private Telegram chat triggers a scan and returns findings (masked keys only, never unmasked)
2. `/verify <key-id>`, `/recon --sources=github`, `/status`, `/stats`, `/providers`, and `/help` all respond correctly in private chat
3. `/subscribe` enables auto-notifications; new key findings from any scan trigger an immediate Telegram message to all subscribed users
4. `/key <id>` sends full key detail to the requesting user's private chat only
5. `keyhunter schedule add --cron="0 */6 * * *" --scan=./myrepo` adds a recurring scan; `keyhunter schedule list` shows it; the job persists across restarts and sends Telegram notifications on new findings
**Plans**: TBD
### Phase 18: Web Dashboard
**Goal**: Users can manage and interact with all KeyHunter capabilities through an embedded web dashboard — viewing scans, managing keys, launching recon, browsing providers, managing dorks, and configuring settings — with live scan progress via SSE
**Depends on**: Phase 17
**Requirements**: WEB-01, WEB-02, WEB-03, WEB-04, WEB-05, WEB-06, WEB-07, WEB-08, WEB-09, WEB-10, WEB-11
**Success Criteria** (what must be TRUE):
1. `keyhunter serve` starts an embedded HTTP server with the full dashboard accessible in a browser; the binary contains all HTML, CSS, and assets via go:embed
2. The dashboard overview page shows total keys found, scan count, and active providers as summary statistics
3. The keys page lists all findings with masked values and a "Reveal Key" toggle that shows the full key on demand
4. The recon page allows launching a recon sweep with source selection and shows live progress via Server-Sent Events
5. The REST API at `/api/v1/*` accepts and returns JSON for all dashboard actions; optional basic auth or token auth is configurable via settings page
**Plans**: TBD
**UI hint**: yes
## Progress
**Execution Order:**
Phases execute in numeric order: 1 → 2 → 3 → ... → 18
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Foundation | 0/? | Not started | - |
| 2. Tier 1-2 Providers | 0/? | Not started | - |
| 3. Tier 3-9 Providers | 0/? | Not started | - |
| 4. Input Sources | 0/? | Not started | - |
| 5. Verification Engine | 0/? | Not started | - |
| 6. Output, Reporting & Key Management | 0/? | Not started | - |
| 7. Import Adapters & CI/CD Integration | 0/? | Not started | - |
| 8. Dork Engine | 0/? | Not started | - |
| 9. OSINT Infrastructure | 0/? | Not started | - |
| 10. OSINT Code Hosting | 0/? | Not started | - |
| 11. OSINT Search & Paste | 0/? | Not started | - |
| 12. OSINT IoT & Cloud Storage | 0/? | Not started | - |
| 13. OSINT Package Registries & Container/IaC | 0/? | Not started | - |
| 14. OSINT CI/CD Logs, Web Archives & Frontend Leaks | 0/? | Not started | - |
| 15. OSINT Forums, Collaboration & Log Aggregators | 0/? | Not started | - |
| 16. OSINT Threat Intel, Mobile, DNS & API Marketplaces | 0/? | Not started | - |
| 17. Telegram Bot & Scheduled Scanning | 0/? | Not started | - |
| 18. Web Dashboard | 0/? | Not started | - |

.planning/STATE.md Normal file

@@ -0,0 +1,65 @@
# Project State
## Project Reference
See: .planning/PROJECT.md (updated 2026-04-04)
**Core value:** Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.
**Current focus:** Phase 1 — Foundation
## Current Position
Phase: 1 of 18 (Foundation)
Plan: 0 of ? in current phase
Status: Ready to plan
Last activity: 2026-04-04 — Roadmap created, 18 phases defined covering 146 v1 requirements
Progress: [░░░░░░░░░░░░░░░░░░░░] 0%
## Performance Metrics
**Velocity:**
- Total plans completed: 0
- Average duration: —
- Total execution time: 0 hours
**By Phase:**
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| - | - | - | - |
**Recent Trend:**
- Last 5 plans: —
- Trend: —
*Updated after each plan completion*
## Accumulated Context
### Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- Roadmap: CGO_ENABLED=0 throughout — modernc.org/sqlite over mattn/go-sqlite3 (see PROJECT.md)
- Roadmap: Per-source rate limiter architecture (Phase 9) must precede all OSINT source modules (Phases 10-16)
- Roadmap: AES-256 encryption added in Phase 1, not post-hoc — avoids migration complexity
- Roadmap: Verification (Phase 5) requires consent prompt + LEGAL.md — not optional polish
### Pending Todos
None yet.
### Blockers/Concerns
- Phase 1: Argon2 vs PBKDF2 for database encryption key derivation — needs decision before Storage Layer implementation
- Phase 1: Aho-Corasick library choice (cloudflare/ahocorasick vs bobrik/ahocorasick) — verify which TruffleHog uses
- Phase 2+: Provider YAML patterns for 108 providers — lesser-known providers need targeted research (Chinese LLMs, niche APIs)
- Phase 11: Google Custom Search API quota (100 queries/day free tier) vs direct scraping ToS trade-off — product decision needed
## Session Continuity
Last session: 2026-04-04
Stopped at: Roadmap written to .planning/ROADMAP.md; ready to begin Phase 1 planning
Resume file: None

CLAUDE.md Normal file

@@ -0,0 +1,160 @@
<!-- GSD:project-start source:PROJECT.md -->
## Project
**KeyHunter**
KeyHunter is a comprehensive, modular API key scanner built in Go, focused on detecting and validating API keys from 108+ LLM/AI providers. It combines native scanning with external tool integration (TruffleHog, Gitleaks), OSINT/recon across 80+ internet sources, a web dashboard, and Telegram bot notifications. Designed for red teams, DevSecOps, bug bounty hunters, and security researchers.
**Core Value:** Detect leaked LLM API keys across more providers and more internet sources than any other tool, with active verification to confirm keys are real and alive.
### Constraints
- **Language**: Go 1.22+ — single binary distribution, performance, TruffleHog/Gitleaks ecosystem alignment
- **Architecture**: Plugin-based — providers as YAML files, compile-time embedded via Go embed
- **Storage**: SQLite — zero-dependency embedded database, AES-256 encrypted
- **Web stack**: htmx + Tailwind CSS — no JS framework dependency, embedded in binary
- **CLI framework**: Cobra — industry standard for Go CLIs
- **Verification**: Must be opt-in (`--verify` flag) — passive scanning by default for legal safety
- **Key masking**: Default masked output, `--unmask` for full keys — shoulder surfing protection
<!-- GSD:project-end -->
<!-- GSD:stack-start source:research/STACK.md -->
## Technology Stack
## Recommended Stack
### Core CLI Framework
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/spf13/cobra` | v1.10.2 | CLI command tree (scan, verify, recon, keys, serve, dorks, providers, config, hook, schedule) | Industry standard for Go CLIs. Used by Kubernetes, Docker, GitHub CLI. Sub-command hierarchy, persistent flags, shell completion, man-page generation are all built in. No viable alternative — it IS the Go CLI standard. |
| `github.com/spf13/viper` | v1.21.0 | Configuration management (YAML/JSON/env/flags binding) | Designed to pair with Cobra. Handles config file + env var + CLI flag precedence chain automatically. v1.21.0 switched to maintained yaml lib, cleaning supply-chain issues. |
### Web Dashboard
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/go-chi/chi/v5` | v5.2.5 | HTTP router for dashboard and API | 100% net/http compatible — no custom context or handler types. Zero external dependencies. Routes embed naturally into `go:embed` serving. Used by major Go projects. Requires Go 1.22+ (matches project constraint). |
| `github.com/a-h/templ` | v0.3.1001 | Type-safe HTML template compilation | `.templ` files compile to Go — template errors are caught at compile time, not runtime. Composes naturally with HTMX. Significantly safer than `html/template` for a project with a public-facing dashboard. |
| htmx | v2.x (CDN or vendored) | Frontend interactivity without JS framework | Server-rendered with AJAX behavior. No build step. Aligns with "embed in binary" architecture constraint. Use `go:embed` to bundle the htmx.min.js into the binary. |
| Tailwind CSS | v4.x (standalone CLI) | Utility-first styling | v4 ships a standalone binary — no Node.js required. Use `@tailwindcss/cli` to compile a single CSS file, then `go:embed` it. Air watches both `.templ` and CSS changes during development. |
### Database
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `modernc.org/sqlite` | v1.35.x (SQLite 3.51.2 inside) | Embedded database for scan results, keys, recon data | Pure Go — no CGo, no C compiler requirement. Cross-compiles cleanly for Linux/macOS/ARM. Actively maintained (updated 2026-03-17). Zero external process dependency for single-binary distribution. |
| `database/sql` (stdlib) | — | SQL interface layer | Use standard library interface over `modernc.org/sqlite` directly — driver is registered as `"sqlite"`. No ORM needed for a tool of this scope. Raw SQL gives full control and avoids ORM magic bugs. |
### Concurrency
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/panjf2000/ants/v2` | v2.12.0 | Worker pool for parallel scanning across files, sources, and verification requests | Mature, battle-tested goroutine pool. Dynamically resizable. Handles thousands of concurrent tasks without goroutine explosion. Used in high-throughput Go systems. v2.12.0 adds ReleaseContext for clean shutdown. |
| `golang.org/x/time/rate` | latest | Per-source rate limiting for OSINT/recon sources | Official Go extended library. Token bucket algorithm. Rate-limit each external source (Shodan, GitHub, etc.) independently. Used by TruffleHog for the same purpose. |
| `sync`, `context` (stdlib) | — | Cancellation, mutex, waitgroups | Standard library is sufficient for coordination between pool and caller. No additional abstraction needed. |
### YAML Provider/Dork Engine
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `gopkg.in/yaml.v3` | v3.x | Parse provider YAML definitions and dork YAML files embedded via `go:embed` | Direct, well-understood API. v3 handles inline/anchored structs correctly. Cobra v1.10.2 migrated to `go.yaml.in/yaml/v3`; for provider YAML parsing, gopkg.in/yaml.v3 remains stable and appropriate. |
| `embed` (stdlib) | — | Compile-time embedding of `/providers/*.yaml` and `/dorks/*.yaml` | Go 1.16+ native. No external dependency. Providers and dorks are baked into the binary at compile time — no runtime filesystem access needed. |
### Telegram Bot
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/mymmrac/telego` | v1.8.0 | Telegram bot for /scan, /verify, /recon, /status, /stats, /subscribe, /key commands | One-to-one Telegram Bot API v9.6 mapping. Supports long polling and webhooks. Type-safe API surface. v1.8.0 released 2026-04-03 with API v9.6 support. Actively maintained. |
### Scheduled Scanning
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/go-co-op/gocron/v2` | v2.19.1 | Cron-based recurring scans | Modern API, v2 has full context support and job lifecycle management. Race condition fix in v2.19.1 (important for scheduler reliability). Better API than robfig/cron v3. |
### Output and Formatting
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/charmbracelet/lipgloss` | latest | Colored terminal table output, status indicators | Declarative style definitions. Composes with any output stream. Used across the Go security tool ecosystem (TruffleHog uses it indirectly). |
| `github.com/charmbracelet/bubbles` | latest | Progress bars for long scans, spinners during verification | Pre-built terminal UI components. Pairs with lipgloss. Less overhead than full Bubble Tea TUI — use components only. |
| `encoding/json` (stdlib) | — | JSON output format | Standard library is sufficient. No external JSON library needed. |
| SARIF output | custom | CI/CD integration output | Implement SARIF 2.1.0 format directly — it is a straightforward JSON schema. Do not use a library; gosec's SARIF package is not designed for import. ~200 lines of struct definitions covers the needed schema. |
### HTTP Client (OSINT/Recon/Verification)
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `net/http` (stdlib) | — | All outbound HTTP requests for verification and OSINT | Standard library client is sufficient. Supports custom TLS, proxy settings, timeouts. Avoid adding httpclient wrappers that hide behavior. |
| `golang.org/x/time/rate` | latest | Per-source rate limiting on outbound requests | Already listed under concurrency — same package serves both purposes. |
### Development Tooling
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/air-verse/air` | latest | Hot reload during dashboard development | Watches Go + templ files, rebuilds on change. Industry standard for Go web dev loops. |
| `@tailwindcss/cli` | v4.x standalone | CSS compilation without Node.js | v4 standalone binary eliminates Node dependency entirely. Run as `tailwindcss -i input.css -o dist/style.css --watch` alongside air. |
| `golangci-lint` | latest | Static analysis and linting | Multi-linter runner. Include gosec, staticcheck, errcheck at minimum. |
| `go test` (stdlib) | — | Testing | Standard library testing is sufficient. Use `testify` for assertions only. |
| `github.com/stretchr/testify` | v1.x | Test assertions | Assert/require packages only. No mocking framework needed at this scope. |
### Build and Distribution
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `go build` with `-ldflags` | — | Single binary compilation | `CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags="-s -w"` produces a stripped static binary. modernc.org/sqlite makes CGO=0 possible. |
| `goreleaser` | v2.x | Multi-platform release builds (Linux amd64/arm64, macOS amd64/arm64) | Standard tool for Go binary releases. Produces checksums, archives, and optionally Homebrew taps. |
## Alternatives Considered
| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| Web framework | chi v5 | Fiber | Fiber uses fasthttp (not net/http) — breaks standard middleware and `go:embed` serving patterns |
| Web framework | chi v5 | Echo | Echo works with net/http but adds unnecessary abstraction for a dashboard-only use case |
| Templating | templ | html/template (stdlib) | stdlib templates have no compile-time type checking; errors surface at runtime, not build time |
| SQLite driver | modernc.org/sqlite | mattn/go-sqlite3 | mattn requires CGo — breaks cross-compilation and single-binary distribution goals |
| SQLite encryption | application-level AES-256 | SQLCipher (go-sqlcipher) | SQLCipher requires CGo, reintroducing the problem modernc.org/sqlite solves |
| Config | viper | koanf | koanf is cleaner but viper's Cobra integration is tight; viper v1.21 fixed the main key-casing issues |
| Concurrency | ants | pond | pond is simpler but less battle-tested at high concurrency (80+ simultaneous OSINT sources) |
| Scheduler | gocron v2 | robfig/cron v3 | robfig unmaintained since 2020 with known panic bugs in production |
| Telegram | telego | telebot v4 | telebot has better DX but less complete API coverage; telego's 1:1 mapping matters for a bot sending scan results |
| SARIF | custom structs | gosec/v2/report/sarif | gosec SARIF package is internal to gosec, not a published importable library |
| Terminal UI | lipgloss + bubbles | Full Bubble Tea | Full TUI event loop is overkill; components-only approach is simpler and sufficient |
## Canonical go.mod Dependencies
- Pin `cobra`, `chi`, `templ`, `telego`, `ants`, `gocron` to exact versions above (verified current).
- Keep `golang.org/x/time` updated with `go get -u`; the x/ repositories cut frequent v0.x releases rather than following stable semver.
- `modernc.org/sqlite` — pin to whatever `go get modernc.org/sqlite@latest` resolves at project init.
## Build Commands
- Development (dashboard with hot reload)
- Test
- Production binary (Linux amd64, CGO-free)
- Release (all platforms)
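As a sketch, assuming a hypothetical `keyscanner` output binary name, the four command groups above map to (the build flags come from the stack choices earlier in this section):

```shell
# Development (dashboard with hot reload): air watches .go and .templ
# files while tailwindcss rebuilds the CSS alongside it
air &
tailwindcss -i input.css -o dist/style.css --watch

# Test
go test ./...

# Production binary (Linux amd64, CGO-free)
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags="-s -w" -o keyscanner .

# Release (all platforms)
goreleaser release --clean
```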
## Sources
- Cobra releases: https://github.com/spf13/cobra/releases (v1.10.2 confirmed)
- Chi releases: https://github.com/go-chi/chi/releases (v5.2.5 confirmed)
- Templ releases: https://github.com/a-h/templ/releases (v0.3.1001 confirmed)
- Telego releases: https://github.com/mymmrac/telego/releases (v1.8.0, 2026-04-03)
- Ants releases: https://github.com/panjf2000/ants/releases (v2.12.0 confirmed)
- gocron releases: https://github.com/go-co-op/gocron/releases (v2.19.1 confirmed)
- Viper releases: https://github.com/spf13/viper/releases (v1.21.0 confirmed)
- modernc.org/sqlite: https://pkg.go.dev/modernc.org/sqlite (SQLite 3.51.2, updated 2026-03-17)
- Chi router comparison 2025: https://blog.logrocket.com/top-go-frameworks-2025/
- Go web stack 2025 (chi + templ + htmx): https://www.ersin.nz/articles/a-great-web-stack-for-go
- Tailwind v4 standalone (no Node): https://dev.to/getjv/tailwind-css-with-air-and-go-no-node-no-problem-3j92
- SQLite driver comparison: https://github.com/cvilsmeier/go-sqlite-bench
- robfig/cron maintenance status: https://github.com/netresearch/go-cron (unmaintained since 2020 note)
- Viper vs koanf: https://itnext.io/golang-configuration-management-library-viper-vs-koanf-eea60a652a22
- TruffleHog output formats: https://deepwiki.com/trufflesecurity/trufflehog/6-output-and-results
- Gitleaks output formats: https://appsecsanta.com/sast-tools/gitleaks-vs-trufflehog
<!-- GSD:stack-end -->
<!-- GSD:conventions-start source:CONVENTIONS.md -->
## Conventions
Conventions not yet established. Will populate as patterns emerge during development.
<!-- GSD:conventions-end -->
<!-- GSD:architecture-start source:ARCHITECTURE.md -->
## Architecture
Architecture not yet mapped. Follow existing patterns found in the codebase.
<!-- GSD:architecture-end -->
<!-- GSD:workflow-start source:GSD defaults -->
## GSD Workflow Enforcement
Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.
Use these entry points:
- `/gsd:quick` for small fixes, doc updates, and ad-hoc tasks
- `/gsd:debug` for investigation and bug fixing
- `/gsd:execute-phase` for planned phase work
Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.
<!-- GSD:workflow-end -->
<!-- GSD:profile-start -->
## Developer Profile
> Profile not yet configured. Run `/gsd:profile-user` to generate your developer profile.
> This section is managed by `generate-claude-profile` -- do not edit manually.
<!-- GSD:profile-end -->