keyhunter/.planning/research/SUMMARY.md

Project Research Summary

Project: KeyHunter — Go-based API Key Scanner for 108+ LLM Providers
Domain: Secret detection / OSINT recon tool targeting AI/LLM credential leaks
Researched: 2026-04-04
Confidence: HIGH

Executive Summary

KeyHunter operates in a validated and growing market: 1.27 million AI-service credentials were leaked in 2025 (up 81% YoY), yet no open-source tool combines 100+ LLM provider coverage with active verification and OSINT recon in a single binary. The competitive gap against TruffleHog (~15 LLM providers), Gitleaks (~5-10), and Titus (450+ rules but no OSINT) is real and exploitable. The correct build approach is a staged pipeline modeled after TruffleHog v3's internal architecture: keyword pre-filter (Aho-Corasick) → regex detection → optional verification, all connected via buffered channels and a goroutine worker pool. The entire tool ships as a single static binary with embedded providers, dorks, templates, and a web dashboard — no runtime dependencies.

The recommended stack is settled and high-confidence across every category. Go 1.22+ with Cobra/Viper for CLI, chi v5 + templ + htmx + Tailwind v4 for the dashboard, modernc.org/sqlite (pure Go, CGO-free) for storage, ants v2 for concurrency, and telego for Telegram. The critical architectural constraint is CGO_ENABLED=0 throughout — the choice of modernc.org/sqlite over mattn/go-sqlite3 exists specifically to preserve cross-compilation and single-binary distribution. Every library choice flows from this constraint and should not be revisited.

The primary risk is building in the wrong order. The provider YAML registry and storage schema must exist before anything else — the other eight subsystems all depend on them. The secondary risk is false positives: entropy-based detection alone produces up to 80% false positive rates, which kills adoption. The third risk is legal: active key verification (--verify) requires explicit consent UX and documentation because a single API call with a discovered credential may constitute unauthorized access under CFAA and analogous laws. These risks have clear mitigations documented in research and must be addressed in the phases where they first appear.


Key Findings

The stack is dominated by CGO-free, stdlib-compatible choices that enable a single cross-compiled binary. Every component was verified against official release pages as of 2026-04-04. The key non-obvious choices: modernc.org/sqlite instead of the more common mattn/go-sqlite3 (preserves CGO=0), chi instead of Fiber/Echo (100% net/http compatible, critical for go:embed serving), telego instead of telebot (1:1 Telegram API mapping for a tool where exact API behavior matters), and custom SARIF structs instead of importing gosec's internal package (gosec SARIF is not published as an importable library).

Core technologies:

  • cobra v1.10.2 + viper v1.21.0: CLI command tree and config — industry standard, used by TruffleHog and Gitleaks themselves
  • chi v5.2.5: HTTP router for dashboard — zero external deps, net/http native, enables go:embed static serving
  • templ v0.3.1001: Type-safe HTML templates — compile-time error checking, composes with htmx
  • modernc.org/sqlite v1.35.x: Pure Go SQLite — no CGO, cross-compiles cleanly to Linux/macOS/ARM
  • ants v2.12.0: Worker pool — battle-tested at high concurrency (thousands of goroutines across 80+ OSINT sources)
  • golang.org/x/time/rate: Per-source rate limiting — official Go extended library, token bucket algorithm
  • telego v1.8.0: Telegram bot (released 2026-04-03, Telegram Bot API v9.6) — 1:1 API mapping
  • gocron v2.19.1: Scheduler — modern API, full context support, race condition fix in this version
  • lipgloss + bubbles: Terminal output — components-only, not full Bubble Tea TUI event loop
  • Application-level AES-256 via crypto/aes stdlib: Key encryption without SQLCipher's CGO dependency
  • Custom SARIF 2.1.0 structs (~200 lines): CI/CD output format without importing gosec internals
  • goreleaser v2: Multi-platform release builds (Linux/macOS amd64/arm64)

Do not use: mattn/go-sqlite3 (CGO), Fiber/Echo (breaks net/http compatibility), robfig/cron (unmaintained since 2020, panic bugs), full Bubble Tea TUI (overkill for a scanning tool), regexp2/PCRE (loses RE2 linear-time guarantee).
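
Go's stdlib regexp is the reason the RE2 constraint costs nothing: any provider pattern runs in linear time, even on adversarial input. A minimal sketch; the OpenAI-style pattern and sample line are illustrative, not the shipped definition:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical OpenAI-style pattern: "sk-" or "sk-proj-" prefix followed by
// 20+ key characters. Go's regexp is RE2-backed, so even adversarial input
// cannot trigger catastrophic backtracking.
var openAIPattern = regexp.MustCompile(`\bsk-(proj-)?[A-Za-z0-9]{20,}\b`)

func main() {
	line := `client = OpenAI(api_key="sk-proj-AbCdEfGhIjKlMnOpQrStUvWx")`
	for _, m := range openAIPattern.FindAllString(line, -1) {
		fmt.Println(m) // prints the matched candidate key
	}
}
```

The same guarantee is why regexp2 and PCRE are banned: one backtracking-heavy community-contributed pattern could stall the whole detector pool.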

Expected Features

The market has established strong conventions. KeyHunter must meet every table-stakes requirement before any differentiators matter — users instantly benchmark against TruffleHog and Gitleaks on the fundamentals.

Must have (table stakes):

  • Regex-based pattern detection with keyword pre-filtering (10x speedup, TruffleHog-proven)
  • Entropy analysis as secondary signal (not primary — 80% FP rate when used alone)
  • Git history scanning (full commits, branches, tags)
  • Directory/file scanning and stdin
  • Active key verification via --verify (off by default — legal and ethical requirement)
  • JSON, SARIF, CSV, colored table output
  • Pre-commit hook and CI/CD integration (exit codes + SARIF)
  • Masked output by default (--unmask to show full keys)
  • Provider-named detectors (not raw regex output)
  • Multi-platform static binary

Should have (competitive moat):

  • Coverage of 108 LLM/AI providers defined in YAML (5-7x more than any competitor)
  • OSINT/Recon engine with 80+ sources (code hosts, paste sites, IoT scanners, package registries, search dorks)
  • Built-in dork engine (150+ dorks in YAML, same extensibility as providers)
  • IoT scanner integration (Shodan, Censys, ZoomEye, FOFA for exposed LLM endpoints)
  • Web dashboard (htmx + Tailwind, embedded in binary via go:embed)
  • SQLite storage with AES-256 (persistent scan state — all competitors are stateless)
  • Telegram bot integration (/scan, /verify, /recon, /status commands)
  • Scheduled scanning with auto-notify
  • TruffleHog + Gitleaks import adapters
  • Delta-based git scanning (only new commits since last run)

Defer (v2+):

  • BPE tokenization detection (Betterleaks approach, 98.6% recall vs 70.4% entropy — worth tracking)
  • LLM-based detection (GPT-5-mini achieves 84.4% recall on obfuscated secrets — viable but slow/expensive)
  • Permission analysis per discovered key (TruffleHog covers 30 types; LLM scope would require per-provider API exploration)
  • Collaboration tool scanning (Notion, Confluence — auth flow complexity very high)
  • Web archive scanning at scale (CommonCrawl is enormous; requires careful scoping)
  • Windows native build (WSL/Docker serves that audience adequately)
  • Key rotation/remediation (different product category; competes with Vault, Doppler)

Architecture Approach

KeyHunter is a single Go binary with 10 discrete subsystems communicating through well-defined interfaces. The architecture is explicitly modeled on TruffleHog v3's pipeline: source adapters produce chunks onto buffered channels, Aho-Corasick pre-filters reduce the candidate set before expensive regex evaluation, detector workers run at 8x CPU multiplier, and verification workers run as a separate pool (never blocking detectors). The Provider Registry and Storage Layer have zero mutual dependencies and must be built first — everything else depends on them.

Major components:

  1. Provider Registry (pkg/providers) — YAML definitions embedded via go:embed at compile time; serves patterns, keywords, and verify endpoints to all other subsystems
  2. Scanning Engine (pkg/engine) — three-stage pipeline (Aho-Corasick pre-filter → detector workers → verification workers); Source interface with concrete adapters for file/dir/git/stdin/URL
  3. Storage Layer (pkg/storage) — modernc.org/sqlite; findings, scans, recon jobs, scheduled jobs, settings; AES-256 on key_encrypted column from day one
  4. Verification Engine (pkg/verify) — opt-in HTTP calls to provider-defined endpoints; per-provider rate limiting; in-memory session cache to prevent duplicate calls
  5. OSINT / Recon Engine (pkg/recon) — fan-out orchestrator to 80+ sources across 17 categories; each source module holds its own rate.Limiter; results feed back into Scanning Engine
  6. Dork Engine (pkg/recon/dorks) — YAML dork definitions, multi-search-engine dispatch; sub-component of Recon Engine
  7. Import Adapters (pkg/importers) — TruffleHog v3 and Gitleaks v8 JSON normalization to internal Finding struct
  8. Web Dashboard (pkg/dashboard) — chi + templ + htmx + Tailwind, all embedded via go:embed; SSE for live scan progress (no WebSocket)
  9. Notification System (pkg/notify) — telego long-poll bot; private-chat-only restriction; never sends unmasked keys
  10. Scheduler (pkg/scheduler) — gocron v2; job definitions persisted to SQLite for restart survival

Patterns that must be followed:

  • Buffered channels between all pipeline stages (prevent goroutine starvation and parallelism collapse)
  • Source interface pattern (new sources by implementation, not engine modification)
  • Per-source rate limiters in Recon Engine (not centralized — sources have wildly different limits)
  • SSE not WebSockets for dashboard live updates (works through proxies, simpler, htmx native support)
  • Provider Registry injected via constructor, not global (enables testing without full initialization)
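
The pipeline and Source patterns above can be sketched in a few dozen lines. All type names (Chunk, Finding, sliceSource) are illustrative, and strings.Contains stands in for the Aho-Corasick pre-filter:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// Chunk and Finding are illustrative stand-ins for the engine's types.
type Chunk struct{ Path, Data string }
type Finding struct{ Path, Provider, Match string }

// Source is the adapter interface: new inputs (file, git, stdin, URL)
// implement Chunks; the engine itself never changes.
type Source interface{ Chunks(out chan<- Chunk) }

type sliceSource []Chunk

func (s sliceSource) Chunks(out chan<- Chunk) {
	for _, c := range s {
		out <- c
	}
}

var (
	keywords = []string{"sk-"} // Aho-Corasick in the real engine; Contains here
	pattern  = regexp.MustCompile(`sk-[A-Za-z0-9]{20,}`)
)

func main() {
	chunks := make(chan Chunk, 64) // buffered: stages never run in lock-step
	findings := make(chan Finding, 64)

	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // detector worker pool (8x CPUs in the real engine)
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				for _, kw := range keywords { // cheap pre-filter before regex
					if strings.Contains(c.Data, kw) {
						for _, m := range pattern.FindAllString(c.Data, -1) {
							findings <- Finding{c.Path, "openai", m}
						}
						break
					}
				}
			}
		}()
	}
	go func() { wg.Wait(); close(findings) }()

	src := sliceSource{{"a.py", `key="sk-AbCdEfGhIjKlMnOpQrSt"`}, {"b.py", "no secrets"}}
	go func() { src.Chunks(chunks); close(chunks) }()

	for f := range findings {
		fmt.Printf("%s: %s (%s)\n", f.Path, f.Match, f.Provider)
	}
}
```

Verification workers would consume from a second buffered channel after the detectors, so a slow HTTP check never stalls detection.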

Critical Pitfalls

  1. Catastrophic regex backtracking — Go's RE2-backed regexp package already guarantees linear-time execution; never use regexp2 or PCRE for any provider pattern. Add a regex complexity linter to the YAML PR review pipeline before community contributions open.

  2. False positives killing adoption — Entropy-only detection produces up to 80% FP rate (HashiCorp 2025 research). Use the layered pipeline: keyword pre-filter then regex then entropy as secondary signal. Add YAML-level allowlists for test/example/dummy contexts. Expose --min-confidence flag. Never ship entropy-only detection for any provider.

  3. Legal exposure from active verification — Calling a provider API with a discovered key may constitute "accessing a computer system" under CFAA-style statutes regardless of intent. Verification must be opt-in behind --verify, require a consent prompt on first use, be limited to read-only endpoints, and ship with clear legal documentation. This is not a nice-to-have.

  4. Provider pattern rot — LLM providers change key formats without changelog entries (OpenAI sk- to sk-proj- migration in 2024). Add format_version and last_verified fields to provider YAML. Build a weekly pattern health CI job against known-format example keys. Monitor TruffleHog/GitHub Secret Scanning changelog as external signals.

  5. OSINT rate limiting and silent IP bans — Aggressive scanning triggers IP bans from Google (after ~100 dork requests/hour), GitHub, Shodan, and Pastebin without error responses. Design per-source rate limiter architecture before implementing any individual source — retrofitting across 80 sources is a rewrite. Log "source exhausted" events explicitly rather than silently returning empty results.


Implications for Roadmap

Research yields a clear four-phase structure derived from the hard dependency ordering in the architecture. Each phase must be complete before the next can deliver value.

Phase 1: Foundation — Provider Registry, Core Engine, Storage

Rationale: Every other subsystem depends on the provider definitions and storage schema. The scanning pipeline is the critical path — nothing else makes sense without it. Build order from ARCHITECTURE.md is explicit: Provider Registry first (no dependencies), Storage Layer second (no dependencies), Scanning Engine third (depends on both), Verification Engine fourth (depends on engine and provider verify specs), Output Formatters fifth (validate scanner output before building anything on top).

Delivers:

  • 108 LLM provider YAML definitions with regex patterns, keywords, confidence levels, and verify endpoints
  • Core scanning pipeline: keyword pre-filter (Aho-Corasick) + regex + entropy
  • Input sources: file, dir, git history, stdin
  • Active verification via --verify with legal consent prompt
  • Output: colored table, JSON, SARIF (custom structs), CSV
  • SQLite storage with AES-256 key column encryption from day one
  • CLI with cobra/viper: scan, verify, keys, providers, config

Addresses pitfalls: Regex complexity (use RE2 always), false positives (layered pipeline, not entropy-only), legal exposure (opt-in verify with consent), key storage (AES-256 in Phase 1 not later), pattern collisions (confidence taxonomy before patterns are written).

Research flags: Standard patterns — TruffleHog v3 architecture is well-documented; buffered channel pipeline is established Go idiom. Provider YAML schema design is the one area that may need a brief spike to validate before writing 108 definitions.


Phase 2: First Differentiators — Dork Engine, Import Adapters, CI/CD Integration

Rationale: Once the core scanner is proven, add the features that create the first layer of competitive moat. Import adapters are low-complexity (Storage Layer only) and can be developed in parallel with the dork engine. Pre-commit hooks and CI/CD exit codes complete the table-stakes checklist and make the tool usable in production pipelines.

Delivers:

  • Built-in dork engine (150+ YAML dorks, extensible same pattern as providers)
  • TruffleHog v3 and Gitleaks v8 JSON import adapters
  • Pre-commit hook integration (hook install command)
  • CI/CD: SARIF to GitHub Code Scanning, exit code conventions, GitHub Actions workflow example
  • --unmask flag and keys show command
  • Delta-based git scanning (last-scanned commit hash in SQLite)
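
A minimal sketch of the masked-by-default rule behind --unmask; the exact mask format (four characters kept on each side) is an illustrative choice, not a spec:

```go
package main

import "fmt"

// maskKey renders a finding safely for output: first four and last four
// characters kept, middle replaced. Output is masked by default; the full
// key is shown only when the user passes --unmask.
func maskKey(key string, unmask bool) string {
	if unmask {
		return key
	}
	if len(key) <= 8 {
		return "********"
	}
	return key[:4] + "…" + key[len(key)-4:]
}

func main() {
	k := "sk-AbCdEfGhIjKlMnOpQrSt"
	fmt.Println(maskKey(k, false)) // sk-A…QrSt
	fmt.Println(maskKey(k, true))  // full key, only on explicit request
}
```

The same function is what the Telegram bot in Phase 4 must call unconditionally: it never honors unmask.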

Addresses pitfalls: SARIF severity defaults to warning (not error) to avoid CI over-blocking; promote to error only for verified-active keys. Dork engine uses Google Custom Search API and DuckDuckGo (not direct scraping) to avoid ToS violations.

Research flags: Standard patterns for import adapters and CI/CD. Dork engine needs light research on Google Custom Search API quota and DuckDuckGo scraping stance.


Phase 3: OSINT / Recon Engine — The Primary Differentiator

Rationale: The recon engine is the largest engineering effort and the deepest competitive moat. It must be built after the scanning pipeline is stable because the recon engine is a consumer of the scanning engine (recon text results are chunked and fed through the detection pipeline). Per-source rate limiter architecture must be designed before implementing individual source modules — retrofitting is a rewrite.

Delivers:

  • Recon engine core with ReconSource interface and per-source rate limiters
  • Code hosting sources: GitHub, GitLab, HuggingFace, Bitbucket, Kaggle, Replit (prioritize — largest attack surface)
  • Paste site aggregator: Pastebin API + 15+ additional paste sites
  • Search engine dorking: DuckDuckGo (default), Google Custom Search API (opt-in with user API key), Bing
  • Package registry scanning: npm, PyPI, RubyGems, crates.io, Maven, NuGet
  • IoT scanner integration: Shodan, Censys, ZoomEye, FOFA for exposed LLM endpoints (vLLM, Ollama, LiteLLM proxy)
  • recon CLI command with --categories filter, --stealth mode

Addresses pitfalls: Per-source rate limiters (architecture first, sources second), stealth mode with jitter delays, explicit "source exhausted" logging, robots.txt respect in stealth mode.

Research flags: Needs research per source category during planning. IoT scanner APIs (Shodan/Censys/ZoomEye/FOFA) have different query formats and rate tiers. Paste site APIs vary widely — Pastebin has official API; many others require scraping. Each source module benefits from a brief API investigation before implementation.


Phase 4: Automation, Reach, and Remaining OSINT Sources

Rationale: Automation features (Telegram bot, scheduler) require both the scanning engine and storage to be mature. The notification system is triggered by verification events, which requires Phase 1's verification engine. The web dashboard is the last component because it aggregates all subsystems into a UI and depends on all of them.

Delivers:

  • Telegram bot (/scan, /verify, /recon, /status, /stats, /subscribe, /key) — private chat only, no unmasked keys
  • Scheduled scanning with cron expressions (gocron v2), job persistence in SQLite, auto-notify on new findings
  • Web dashboard (chi + templ + htmx + Tailwind v4 standalone, all embedded via go:embed) — pages: scans, keys, recon, providers, dorks, settings
  • SSE for live scan progress in dashboard
  • Remaining OSINT sources: CI/CD logs (Travis, CircleCI, GitHub Actions), web archives (Wayback Machine), forums (StackOverflow, Reddit, HN), cloud storage (S3, GCS), container/IaC (Docker Hub, Terraform, Helm), frontend (source maps, webpack bundles, exposed .env), threat intel (VirusTotal, IntelX)
  • APK decompile scanning (wraps external apktool/jadx as optional integration)

Addresses pitfalls: Telegram bot restricted to private chats only; never sends unmasked keys regardless of --unmask setting; bot commands rate-limited to prevent unauthorized use. Dashboard has no auth by default (local tool) with optional basic auth config for shared deployments.

Research flags: Web dashboard (templ + htmx + SSE pattern) is well-documented in Go 2025 ecosystem. Remaining OSINT sources each need targeted API research — particularly cloud storage source auth flows (requires user-provided credentials), forum scraping rate limits, and APK toolchain detection.


Phase Ordering Rationale

  • Dependency chain is non-negotiable: Provider Registry has no dependencies; everything depends on it. Storage Layer has no dependencies; everything writes to it. These two must ship before any other subsystem can be meaningfully built or tested.
  • Scanning engine before OSINT engine: The recon engine is a consumer of the scanning engine (recon results feed into the detection pipeline). Building OSINT first would mean the output has nowhere to go.
  • Per-source rate limiter architecture before individual OSINT sources: The PITFALLS research is explicit — retrofitting rate limiting across 80 sources after they are built is a rewrite. The ReconSource interface design must include RateLimit() rate.Limit from the first source module.
  • Dashboard last: It aggregates all subsystems. Building it earlier means building against unstable interfaces.
  • AES-256 encryption in Phase 1: A real GHSA security advisory (GHSA-4h8c-qrcq-cv5c) documents the exact failure mode of adding encryption later. Adding it post-hoc requires database migration logic and is routinely skipped.
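
The ReconSource contract can be sketched as follows. Limit stands in for golang.org/x/time/rate.Limit so the example stays stdlib-only, and pastebinSource is a hypothetical module:

```go
package main

import "fmt"

// Limit stands in for golang.org/x/time/rate.Limit (events per second).
// The point: every source declares its own rate up front, so the
// orchestrator can build a per-source limiter before the first request.
// Retrofitting this later across 80 sources is a rewrite.
type Limit float64

type ReconSource interface {
	Name() string
	RateLimit() Limit // per-source, never a shared global
	Search(query string) ([]string, error)
}

// pastebinSource is a hypothetical module; real ones wrap each site's API.
type pastebinSource struct{}

func (pastebinSource) Name() string     { return "pastebin" }
func (pastebinSource) RateLimit() Limit { return 0.5 } // ~1 request every 2s
func (pastebinSource) Search(q string) ([]string, error) {
	return []string{"https://pastebin.com/raw/example"}, nil // stub result
}

func main() {
	var src ReconSource = pastebinSource{}
	fmt.Printf("%s: %.1f req/s\n", src.Name(), float64(src.RateLimit()))
}
```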

Research Flags

Phases needing deeper research during planning:

  • Phase 3 (OSINT sources): Each source category warrants targeted API research before implementation. IoT scanner query formats, paste site scraping posture, package registry tarball access patterns all vary. Recommend a brief research spike per source category at planning time.
  • Phase 4 (remaining OSINT + cloud storage): Cloud storage source auth flows (AWS credentials, GCS service accounts) need scoping to avoid credential management scope creep. Forum scraping (Reddit, HN, StackOverflow) has rate limit and ToS complexity.

Phases with standard patterns (can skip research-phase):

  • Phase 1 (core pipeline): TruffleHog v3 architecture is fully documented via DeepWiki and official repo. Buffered channel pipeline, Aho-Corasick pre-filter, and Source interface are established patterns.
  • Phase 2 (import adapters, CI/CD): Import adapters are JSON struct mapping. CI/CD integration is exit code + SARIF — Gitleaks and TruffleHog both document their approach.
  • Phase 4 (web dashboard): The chi + templ + htmx + go:embed + SSE stack is well-covered in 2025 Go web development ecosystem resources.

Confidence Assessment

  • Stack (HIGH): All library versions verified via official GitHub releases as of 2026-04-04. Key tradeoffs (modernc.org vs mattn, chi vs Fiber, telego vs telebot) are well-reasoned with documented rationale.
  • Features (HIGH): Competitive landscape verified against official TruffleHog, Gitleaks, Titus, GitGuardian, and GitHub Secret Scanning documentation. Market gap confirmed with GitGuardian 2026 report data.
  • Architecture (HIGH): Component design derived from TruffleHog v3 internals via DeepWiki (generated from official source) and official repo inspection. Patterns are production-proven at scale.
  • Pitfalls (HIGH technical / MEDIUM legal): ReDoS, false positives, and rate limiting pitfalls have strong sourcing (official Go docs, HashiCorp research, GitHub rate limit docs). CFAA analysis is MEDIUM — law is jurisdiction-dependent and evolving.

Overall confidence: HIGH

Gaps to Address

  • Application-level AES-256 implementation details: STACK.md recommends crypto/aes + crypto/cipher for key column encryption but does not specify key derivation (PBKDF2? Argon2? OS keychain integration?). This needs a decision before Storage Layer implementation. PITFALLS.md recommends OS keychain (macOS Keychain, Linux libsecret) for the database encryption key — validate zalando/go-keyring or platform-native options during Phase 1 planning.

  • Provider YAML for 108 providers: Research confirmed the schema and 3 reference providers (OpenAI, Anthropic, HuggingFace). The 108 provider list itself is not enumerated in research. Sourcing patterns for lesser-known providers (Chinese LLM providers, niche AI APIs) will require targeted research during Phase 1 provider definition work. TruffleHog detector source, Gitleaks rules, and GitHub Secret Scanning changelog are the best references.

  • Google Custom Search API vs direct dork scraping: The legal/ToS analysis recommends the Custom Search API (100 queries/day free tier), but this limits dork throughput significantly. The trade-off between coverage and ToS compliance needs a product decision before Phase 2.

  • Aho-Corasick library choice: ARCHITECTURE.md specifies Aho-Corasick pre-filtering without naming the Go library. cloudflare/ahocorasick and aho-corasick by bobrik are the common options. Verify which TruffleHog uses and match for proven behavior.

  • go-sqlcipher vs application-level AES-256 consistency: ARCHITECTURE.md references go-sqlcipher in one location but STACK.md recommends application-level AES-256 specifically to avoid CGO. Resolve this inconsistency explicitly before Storage Layer implementation — the correct answer is application-level AES-256 per the CGO constraint.
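
Whatever key-derivation decision lands, the column encryption itself is a short stdlib exercise. A sketch with AES-256-GCM via crypto/aes and crypto/cipher; the raw 32-byte master key is a placeholder for the undecided derivation step:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encryptKey seals a discovered credential with AES-256-GCM before it is
// written to the key_encrypted column. How the 32-byte master key is
// derived (Argon2? OS keychain?) is the open decision noted above; a raw
// key is used here only to keep the sketch self-contained.
func encryptKey(master []byte, plaintext string) ([]byte, error) {
	block, err := aes.NewCipher(master) // a 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce to the ciphertext; GCM also authenticates the data.
	return gcm.Seal(nonce, nonce, []byte(plaintext), nil), nil
}

func decryptKey(master, sealed []byte) (string, error) {
	block, err := aes.NewCipher(master)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	pt, err := gcm.Open(nil, nonce, ct, nil)
	return string(pt), err
}

func main() {
	master := make([]byte, 32)
	rand.Read(master)
	sealed, _ := encryptKey(master, "sk-AbCdEfGhIjKlMnOpQrSt")
	plain, _ := decryptKey(master, sealed)
	fmt.Println(plain)
}
```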


Sources

Primary (HIGH confidence)

  • TruffleHog v3 official repo — feature set, detector count, channel pipeline architecture
  • DeepWiki TruffleHog engine architecture — internal pipeline design, Aho-Corasick pre-filter
  • Gitleaks official repo — output formats, detection methods, SARIF support
  • GitHub releases (cobra, chi, templ, telego, ants, gocron, viper) — version verification
  • modernc.org/sqlite pkg.go.dev — SQLite 3.51.2, last updated 2026-03-17, pure Go confirmed
  • GitGuardian State of Secrets Sprawl 2026 — market statistics (1.27M AI credentials leaked)
  • Go RE2/regexp package documentation — linear-time guarantee confirmed
  • GHSA-4h8c-qrcq-cv5c — real advisory for unencrypted SQLite key storage
  • GitHub Secret Scanning changelog (March 2026) — 28 new detectors, DeepSeek validity checks

Secondary (MEDIUM confidence)

  • HashiCorp (2025) — documented 80% FP rate from entropy-only detection
  • Betterleaks / BleepingComputer (2026) — BPE tokenization 98.6% recall vs 70.4% entropy
  • FuzzingLabs (April 2026) — GPT-5-mini 84.4% recall on obfuscated secrets
  • Go web stack articles (chi + templ + htmx, Tailwind v4 standalone) — community blog sources, patterns verified against official repos
  • CFAA analysis sources (Arnold & Porter, DOJ Justice Manual 9-48.000) — legal interpretation, jurisdiction-dependent
  • robfig/cron maintenance status — noted via netresearch/go-cron fork comment

Tertiary (LOW confidence)

  • Google dork rate limit estimate (~100 req/hour) — community observation, not officially documented
  • Pastebin scraping posture — community consensus; official policy not published

Research completed: 2026-04-04
Ready for roadmap: yes