docs: complete project research

575  .planning/research/ARCHITECTURE.md  (new file)
@@ -0,0 +1,575 @@
# Architecture Patterns

**Domain:** API key / secret scanner with OSINT recon, web dashboard, and notification system
**Project:** KeyHunter
**Researched:** 2026-04-04
**Overall confidence:** HIGH (TruffleHog/Gitleaks internals verified via DeepWiki and official repos; Go patterns verified via official docs and production examples)

---
## Recommended Architecture

KeyHunter is a single Go binary in which a thin CLI fronts seven top-level subsystems. Each subsystem owns its own package boundary; communication between subsystems flows through well-defined interfaces, not direct struct coupling.

```
CLI (Cobra)
  |
  +---> Scanning Engine (regex + entropy + Aho-Corasick pre-filter)
  |       |
  |       +--> Provider Registry (YAML definitions, embed.FS at compile time)
  |       +--> Source Adapters (file, dir, git, URL, stdin, clipboard)
  |       +--> Worker Pool (goroutine pool + buffered channels)
  |       +--> Verification Engine (opt-in, per-provider HTTP endpoints)
  |
  +---> OSINT / Recon Engine (80+ sources, category-based orchestration)
  |       |
  |       +--> Source Modules (one module per category, rate-limited)
  |       +--> Dork Engine (YAML dorks, multi-search-engine dispatch)
  |       +--> Recon Worker Pool (per-source concurrency + throttle)
  |
  +---> Import Adapters (TruffleHog JSON, Gitleaks JSON -> internal Finding)
  |
  +---> Storage Layer (SQLite via go-sqlcipher, AES-256 at rest)
  |
  +---> Web Dashboard (htmx + Tailwind, Go templates, embed.FS, SSE)
  |
  +---> Notification System (Telegram bot, long polling, command router)
  |
  +---> Scheduler (gocron, cron expressions, persisted job state)
```
---

## Component Boundaries

### 1. CLI Layer (`pkg/cli`)

**Responsibility:** Command routing only. Zero business logic. Parses flags, wires subcommands, and starts the correct subsystem. Uses Cobra (the industry standard for Go CLIs, used by TruffleHog v3 and Gitleaks).

**Communicates with:** All subsystems, as the top-level entry point.

**Key commands:** `scan`, `verify`, `import`, `recon`, `keys`, `serve`, `dorks`, `providers`, `config`, `hook`, `schedule`.

**Build notes:** The Cobra subcommand tree should live in its own package; main.go should stay under 30 lines.

---
### 2. Provider Registry (`pkg/providers`)

**Responsibility:** Load and serve provider definitions. Providers are YAML files embedded at compile time via `//go:embed providers/*.yaml`. The registry parses them on startup into an in-memory slice of `Provider` structs.

**Provider YAML schema:**

```yaml
name: openai
version: 1
keywords: ["sk-proj-", "openai"]
patterns:
  - regex: 'sk-proj-[A-Za-z0-9]{48}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.openai.com/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
```

**Communicates with:** Scanning Engine (provides patterns and keywords), Verification Engine (provides verify endpoint specs), Web Dashboard (provider listing pages).

**Build rationale:** Must be implemented first; everything downstream depends on it. No external loading at runtime — compile-time embedding preserves the single-binary distribution that TruffleHog documented as a key design goal.
---

### 3. Scanning Engine (`pkg/engine`)

**Responsibility:** Core detection pipeline. Replicates TruffleHog v3's three-stage approach: keyword pre-filter → regex/entropy detection → optional verification. Manages the goroutine worker pool.

**Pipeline stages (mirrors TruffleHog's architecture):**

```
Source Adapter → chunker → [keyword pre-filter: Aho-Corasick]
                                  |
                       [detector workers] (8x CPU multiplier)
                                  |
                    [verification workers] (1x multiplier, opt-in)
                                  |
                           results channel
                                  |
                         [output formatter]
```

**Aho-Corasick pre-filter:** Before running expensive regexes, scan chunks for keyword presence. TruffleHog documented roughly a 10x performance improvement from this step on large codebases. Each provider supplies `keywords`; the Aho-Corasick automaton is built from all keywords at startup.

**Channel-based communication:**

- `chunksChan chan Chunk` — raw chunks from sources
- `detectableChan chan Chunk` — keyword-matched chunks only
- `resultsChan chan Finding` — confirmed detections
- All channels are buffered to prevent goroutine starvation.
**Source adapters implement a single interface:**

```go
type Source interface {
	Name() string
	Chunks(ctx context.Context, ch chan<- Chunk) error
}
```

**Concrete source adapters:** `FileSource`, `DirSource`, `GitSource`, `URLSource`, `StdinSource`, `ClipboardSource`.

**Communicates with:** Provider Registry (fetches detector specs), Verification Engine (forwards candidates), Storage Layer (persists findings), Output Formatter (writes CLI results).

---
### 4. Verification Engine (`pkg/verify`)

**Responsibility:** Active key validation. Off by default; activated with `--verify`. Makes HTTP calls to provider-defined endpoints with the discovered key. Classifies results as `verified` (valid key), `invalid` (key rejected), or `unknown` (endpoint unreachable or response ambiguous).

**Caching:** Results are cached in memory per session to avoid duplicate API calls for the same key. Cache key = `provider:key_hash`.

**Rate limiting:** A per-provider rate limiter (token bucket) prevents triggering account lockouts or abuse detection.

**Communicates with:** Scanning Engine (receives candidates), Storage Layer (updates finding status), Notification System (triggers alerts on verified finds).
---

### 5. OSINT / Recon Engine (`pkg/recon`)

**Responsibility:** Orchestrates searches across 80+ external sources in 18 categories. Acts as a dispatcher: receives a target query, fans out to all configured source modules, aggregates raw text results, and pipes them into the Scanning Engine.

**Category-module mapping:**

```
pkg/recon/
  sources/
    iot/       (shodan, censys, zoomeye, fofa, netlas, binaryedge)
    code/      (github, gitlab, bitbucket, huggingface, kaggle, ...)
    search/    (google, bing, duckduckgo, yandex, brave dorking)
    paste/     (pastebin, dpaste, hastebin, rentry, ix.io, ...)
    registry/  (npm, pypi, rubygems, crates.io, maven, nuget, ...)
    container/ (docker hub layers, k8s configs, terraform, helm)
    cloud/     (s3, gcs, azure blob, do spaces, minio)
    cicd/      (travis, circleci, github actions, jenkins)
    archive/   (wayback machine, commoncrawl)
    forum/     (stackoverflow, reddit, hackernews, dev.to, medium)
    collab/    (notion, confluence, trello)
    frontend/  (source maps, webpack, exposed .env, swagger)
    log/       (elasticsearch, grafana, sentry)
    intel/     (virustotal, intelx, urlhaus)
    mobile/    (apk decompile)
    dns/       (crt.sh, endpoint probing)
    api/       (postman, swaggerhub)
```

**Each source module implements:**

```go
type ReconSource interface {
	Name() string
	Category() string
	Search(ctx context.Context, query string, opts Options) ([]string, error)
	RateLimit() rate.Limit
}
```

**Orchestrator behavior:**

1. Fan out to all enabled source modules concurrently.
2. Each module uses its own `rate.Limiter` (respects per-source limits).
3. Stealth mode adds jitter delays and respects robots.txt.
4. Aggregated text results → chunked → fed to the Scanning Engine.
**Dork Engine (`pkg/recon/dorks`):** Separate sub-component. Reads YAML dork definitions, formats them per search-engine syntax, and dispatches them to the search source modules.

**Communicates with:** Scanning Engine (sends chunked recon text for detection), Storage Layer (persists recon job state and results), CLI Layer.

---
### 6. Import Adapters (`pkg/importers`)

**Responsibility:** Parse external-tool JSON output (TruffleHog, Gitleaks) and convert it to internal `Finding` structs for storage. Decouples third-party formats from the internal model.

**Adapters:**

- `TruffleHogAdapter` — parses TruffleHog v3 JSON output
- `GitleaksAdapter` — parses Gitleaks v8 JSON output

**Communicates with:** Storage Layer only (writes normalized findings).
---

### 7. Storage Layer (`pkg/storage`)

**Responsibility:** Persistence. All findings, provider data, recon jobs, scan metadata, dorks, and scheduler state live here. SQLite via go-sqlcipher (AES-256 encryption at rest).

**Schema boundaries:**

```
findings       (id, provider, key_masked, key_encrypted, status, source, path, timestamp, verified)
scans          (id, type, target, started_at, finished_at, finding_count)
recon_jobs     (id, query, categories, started_at, finished_at, source_count)
scheduled_jobs (id, cron_expr, scan_config_json, last_run, next_run, enabled)
settings       (key, value)
```

**Key masking:** Full keys are AES-256 encrypted in `key_encrypted`. The display value in `key_masked` keeps only the first 8 and last 4 characters. The `--unmask` flag decrypts on access.
**Communicates with:** All subsystems that need persistence (Scanning Engine, Recon Engine, Import Adapters, Dashboard, Scheduler, Notification System).

---
### 8. Web Dashboard (`pkg/dashboard`)

**Responsibility:** Embedded web UI. Go templates + htmx + Tailwind CSS, all embedded via `//go:embed` at compile time. No external JS framework. Server-sent events (SSE) deliver live scan progress without WebSocket complexity.

**Pages:** scans, keys, recon, providers, dorks, settings.

**HTTP server:** The standard library's `net/http` is sufficient; no framework overhead is needed at this scale.

**SSE pattern for live updates:**

```go
// Scan progress pushed to browser via SSE
// Browser uses hx-sse extension to update scan status table
```
**Communicates with:** Storage Layer (reads/writes), Scanning Engine (triggers scans, receives SSE events), Recon Engine (triggers recon jobs).

---
### 9. Notification System (`pkg/notify`)

**Responsibility:** Telegram bot integration. Sends alerts on verified findings and responds to bot commands. Uses long polling, preferred for single-instance local tools: no public URL is needed, and setup is simpler than webhooks.

**Bot commands map to CLI commands:** `/scan`, `/verify`, `/recon`, `/status`, `/stats`, `/subscribe`, `/key`.

**Subscribe pattern:** Users `/subscribe` to be notified when verified findings are discovered. Subscriber chat IDs are stored in SQLite settings.

**Communicates with:** Storage Layer (reads findings and the subscriber list), Scanning Engine (receives verified finding events).
---

### 10. Scheduler (`pkg/scheduler`)

**Responsibility:** Cron-based recurring scan scheduling. Uses `go-co-op/gocron` (the actively maintained fork of jasonlvhit/gocron). Scheduled job definitions are persisted in SQLite so they survive restarts.

**Communicates with:** Storage Layer (reads/writes job definitions), Scanning Engine (triggers scans), Notification System (notifies on scan completion).

---
## Data Flow

### Flow 1: CLI Scan

```
User: keyhunter scan --path ./repo --verify

CLI Layer
  -> parses flags, builds ScanConfig
  -> calls Engine.Scan(ctx, config)

Scanning Engine
  -> GitSource.Chunks() produces chunks onto chunksChan
  -> Aho-Corasick filter passes keyword-matched chunks to detectableChan
  -> Detector Workers apply provider patterns, produce candidates on resultsChan
  -> Verification Workers (if --verify) call provider verify endpoints
  -> Findings written to Storage Layer
  -> Output Formatter writes colored table / JSON / SARIF to stdout
```

### Flow 2: Recon Job

```
User: keyhunter recon --query "OPENAI_API_KEY" --categories code,paste,search

CLI Layer
  -> calls Recon Engine with query + categories

Recon Engine
  -> fans out to all enabled source modules for selected categories
  -> each module rate-limits itself, fetches content
  -> raw text results chunked and sent to Scanning Engine via internal channel

Scanning Engine
  -> same pipeline as Flow 1
  -> findings tagged with recon source metadata
  -> persisted to Storage Layer
```

### Flow 3: Web Dashboard Live Scan

```
Browser: POST /api/scan (hx-post from htmx)
  -> Dashboard handler creates scan record in Storage Layer
  -> Dashboard handler starts Scanning Engine in a goroutine
  -> Browser subscribes to SSE endpoint GET /api/scan/:id/events
  -> Engine emits progress events to SSE channel
  -> htmx SSE extension updates scan status table in real time
  -> On completion, full findings table rendered via hx-get
```

### Flow 4: Scheduled Scan + Telegram Notification

```
Scheduler (gocron)
  -> fires job at cron time
  -> reads ScanConfig from SQLite scheduled_jobs
  -> triggers Scanning Engine

Scanning Engine
  -> runs scan, persists findings

Notification System
  -> on verified finding: reads subscriber list from SQLite
  -> sends Telegram message to each subscriber via bot API (long-poll loop)
```

### Flow 5: Import from External Tool

```
User: keyhunter import --tool trufflehog --file th_output.json

CLI Layer -> Import Adapter (TruffleHogAdapter)
  -> reads JSON, maps to []Finding
  -> writes to Storage Layer
  -> prints import summary to stdout
```

---
## Build Order (Phase Dependencies)

This ordering reflects hard dependencies — a later component cannot be meaningfully built without the earlier ones.

| Order | Component | Depends On | Rationale |
|-------|-----------|------------|-----------|
| 1 | Provider Registry | nothing | All other subsystems depend on provider definitions. Must exist before any detection can be designed. |
| 2 | Storage Layer | nothing (schema only) | The findings model must be defined before anything writes to it. |
| 3 | Scanning Engine (core pipeline) | Provider Registry, Storage Layer | The engine is the critical path. Source adapters and the worker pool pattern are established here. |
| 4 | Verification Engine | Scanning Engine, Provider Registry | Layered on top of scanning; needs provider verify specs. |
| 5 | Output Formatters (table, JSON, SARIF, CSV) | Scanning Engine | Needed to validate scanner output before building anything on top. |
| 6 | Import Adapters | Storage Layer | Self-contained; only needs the storage model. Can run in parallel with 4/5. |
| 7 | OSINT / Recon Engine | Scanning Engine, Storage Layer | Builds on the established scanning pipeline as its consumer. |
| 8 | Dork Engine | Recon Engine (search sources) | Sub-component of Recon; needs the search source modules to exist. |
| 9 | Scheduler | Scanning Engine, Storage Layer | Requires engine and persistence. Adds recurring execution on top. |
| 10 | Web Dashboard | Storage Layer, Scanning Engine, Recon Engine | Aggregates all subsystems into a UI; built after the core is proven. |
| 11 | Notification System | Storage Layer, Verification Engine | Triggered by verification events; needs findings and subscriber storage. |

**MVP critical path:** Provider Registry → Storage Layer → Scanning Engine → Verification Engine → Output Formatters.

Everything else (OSINT, Dashboard, Notifications, Scheduler) layers on top of this proven core.

---
## Patterns to Follow

### Pattern 1: Buffered Channel Pipeline (TruffleHog-derived)

**What:** Goroutine stages connected by buffered channels. Each stage has a configurable concurrency multiplier.

**When:** Any multi-stage concurrent processing (scanning, recon aggregation).

**Example:**

```go
// Engine spin-up
chunksChan := make(chan Chunk, 1000)
detectableChan := make(chan Chunk, 500)
resultsChan := make(chan Finding, 100)

// Stage goroutines
for i := 0; i < runtime.NumCPU()*8; i++ {
	go detectorWorker(detectableChan, resultsChan, providers)
}
for i := 0; i < runtime.NumCPU(); i++ {
	go verifyWorker(resultsChan, storage, notify)
}
```

**Why:** Decouples stages, prevents fast producers from blocking slow consumers, and enables independent scaling of each stage.

---

### Pattern 2: Source Interface + Adapter

**What:** All scan inputs implement a single `Source` interface. New sources are added by implementing the interface, not by changing the engine.

**When:** Adding any new input type (new code host, new file format).

**Example:**

```go
type Source interface {
	Name() string
	Chunks(ctx context.Context, ch chan<- Chunk) error
}
```

---

### Pattern 3: YAML Provider with Compile-Time Embed

**What:** Provider definitions live in `providers/*.yaml`, embedded at compile time. No runtime file loading.

**When:** Adding new LLM provider detection support.

**Why:** Single-binary distribution. Zero external dependencies at runtime. The community can submit PRs with YAML files; no Go code is required to add a provider.

```go
//go:embed providers/*.yaml
var providersFS embed.FS
```

---

### Pattern 4: Rate Limiter per Recon Source

**What:** Each recon source module holds its own `golang.org/x/time/rate.Limiter`. The orchestrator does not centrally throttle.

**When:** All external HTTP calls in the recon engine.

**Why:** Different sources have wildly different rate limits (Shodan: 1 req/s free; GitHub: 30 req/min unauthenticated; Pastebin: no documented limit). Centralizing would pin every source to the slowest limit.
---

### Pattern 5: SSE for Dashboard Live Updates

**What:** Server-Sent Events pushed from a Go HTTP handler to the htmx SSE extension. One-way server→browser push; no WebSocket needed.

**When:** Live scan progress, recon job status.

**Why:** SSE uses standard HTTP, works through proxies, is simpler than WebSockets for one-way push, and is supported natively by the htmx SSE extension.

---
## Anti-Patterns to Avoid

### Anti-Pattern 1: Global State for Provider Registry

**What:** Storing providers as package-level globals loaded once at startup.

**Why bad:** Makes testing impossible without full initialization. Prevents future per-scan provider subsets.

**Instead:** Pass a `*ProviderRegistry` explicitly to the engine constructor.

---

### Anti-Pattern 2: Unbuffered Result Channels

**What:** Using `make(chan Finding)` (unbuffered) for the results pipeline.

**Why bad:** A slow output writer blocks detector workers, collapsing parallelism. TruffleHog's architecture explicitly uses buffered channels to manage thousands of concurrent operations.

**Instead:** Buffer proportional to expected throughput (`make(chan Finding, 1000)`).

---

### Anti-Pattern 3: Direct HTTP in Detector Workers

**What:** Detector goroutines making HTTP calls to verify endpoints inline.

**Why bad:** Verification is slow (network I/O). It would block detector workers, killing throughput.

**Instead:** A separate verification worker pool as a distinct pipeline stage (TruffleHog's design).

---

### Anti-Pattern 4: Runtime YAML Loading for Providers

**What:** Loading provider YAML from the filesystem at scan time.

**Why bad:** Breaks single-binary distribution. Users must manage provider files separately. Security risk (external file modification).

**Instead:** `//go:embed providers/*.yaml` at compile time.

---

### Anti-Pattern 5: Storing Plaintext Keys in SQLite

**What:** Storing full API keys as plaintext in the database.

**Why bad:** The database file becomes a credential dump. Any process with file access can read all found keys.

**Instead:** AES-256 encrypt the full key column. Store only the masked version for display. Decrypt on explicit `--unmask` or via an auth-gated dashboard endpoint.

---

### Anti-Pattern 6: Monolithic Recon Orchestrator

**What:** One giant function that loops through all 80+ sources sequentially.

**Why bad:** Recon over 80 sources sequentially would take hours. No per-source error isolation.

**Instead:** Fan-out pattern. Each source module runs concurrently in its own goroutine. Errors are per-source (one failing source doesn't abort the job).

---
## Package Structure

```
keyhunter/
  main.go            (< 30 lines, cobra root init)
  cmd/               (cobra command definitions)
    scan.go
    recon.go
    keys.go
    serve.go
    ...
  pkg/
    providers/       (Provider struct, YAML loader, embed.FS)
    engine/          (scanning pipeline, worker pool, Aho-Corasick)
      sources/       (Source interface + concrete adapters)
        file.go
        dir.go
        git.go
        url.go
        stdin.go
        clipboard.go
    verify/          (verification engine, HTTP client, cache)
    recon/           (recon orchestrator)
      sources/       (ReconSource interface + category modules)
        iot/
        code/
        search/
        paste/
        ...
      dorks/         (dork engine, YAML dork loader)
    importers/       (TruffleHog + Gitleaks JSON adapters)
    storage/         (SQLite layer, go-sqlcipher, schema, migrations)
    dashboard/       (HTTP handlers, Go templates, embed.FS)
      static/        (tailwind CSS, htmx JS — embedded)
      templates/     (HTML templates — embedded)
    notify/          (Telegram bot, long polling, command router)
    scheduler/       (gocron wrapper, SQLite persistence)
    output/          (table, JSON, SARIF, CSV formatters)
    config/          (Config struct, YAML config file, env vars)
  providers/         (YAML provider definitions — embedded at build)
    openai.yaml
    anthropic.yaml
    ...
  dorks/             (YAML dork definitions — embedded at build)
    github.yaml
    google.yaml
    ...
```
---

## Scalability Considerations

| Concern | Single user / local tool | Team / shared instance |
|---------|--------------------------|------------------------|
| Concurrency | Worker pool default: `8x NumCPU` detectors | Configurable via `--concurrency` flag |
| Storage | SQLite handles millions of findings at local scale | SQLite WAL mode for concurrent readers; migrate to PostgreSQL only if needed (out of scope per PROJECT.md) |
| Recon rate limits | Per-source rate limiters; stealth mode adds jitter | API keys / tokens configured per source for higher limits |
| Dashboard | Embedded single instance; no auth by default | Optionally add basic auth via config for shared deployments |
| Verification | Opt-in; per-provider rate limiting prevents API abuse | Same — no change needed at team scale |

---
## Sources

- DeepWiki TruffleHog engine architecture: https://deepwiki.com/trufflesecurity/trufflehog/2.1-engine-configuration (HIGH confidence — generated from official source)
- TruffleHog v3 official repo: https://github.com/trufflesecurity/trufflehog (HIGH confidence)
- TruffleHog v3 source packages: https://pkg.go.dev/github.com/trufflesecurity/trufflehog/v3/pkg/sources (HIGH confidence)
- Gitleaks official repo: https://github.com/gitleaks/gitleaks (HIGH confidence)
- Go embed package: https://pkg.go.dev/embed (HIGH confidence — official)
- go-co-op/gocron: https://github.com/go-co-op/gocron (HIGH confidence)
- go-sqlcipher (AES-256): https://github.com/mutecomm/go-sqlcipher (MEDIUM confidence — check active maintenance status)
- SQLCipher: https://github.com/sqlcipher/sqlcipher (HIGH confidence)
- SSE with Go + htmx: https://threedots.tech/post/live-website-updates-go-sse-htmx/ (MEDIUM confidence — community blog, well-verified pattern)
- Telego (Telegram bot Go): https://github.com/mymmrac/telego (MEDIUM confidence)
- TruffleHog v3 introduction blog: https://trufflesecurity.com/blog/introducing-trufflehog-v3 (HIGH confidence — official)
251  .planning/research/FEATURES.md  (new file)
@@ -0,0 +1,251 @@
# Feature Landscape: API Key Scanner Domain

**Domain:** API key / secret scanner — LLM/AI provider focus, OSINT recon, active verification
**Researched:** 2026-04-04
**Competitive reference:** TruffleHog, Gitleaks, Betterleaks, detect-secrets, GitGuardian, Nosey Parker/Titus, GitHub Secret Scanning

---

## Competitive Landscape Summary

| Tool | LLM Providers | Verification | OSINT/Recon | Sources | Output |
|------|---------------|--------------|-------------|---------|--------|
| TruffleHog | ~15 (OpenAI, Anthropic, HF partial) | Yes, 700+ | No | Git, S3, Docker, Postman, Jenkins | JSON, text |
| Gitleaks | ~5-10 (OpenAI, HF partial) | No | No | Git, dir, stdin | JSON, CSV, SARIF, JUnit |
| Betterleaks | ~10-15 (est.) | Planned | No | Git, dir, files | Unknown (Gitleaks-compatible) |
| detect-secrets | ~5 (keyword-based) | No | No | Files, git staged | JSON baseline |
| Titus | 450+ rules (broad SaaS) | Yes (validate flag) | No | Files, git, binary | JSON |
| GitGuardian | 550+ detectors | Yes (validity checks) | No | Git, CI/CD, Slack, Jira, Docker | Dashboard, alerts |
| GitHub Secret Scanning | 700+ patterns (cloud-first) | Yes (validity checks) | No | GitHub repos only | Dashboard, SARIF |
| KeyHunter (target) | 108 LLM providers | Yes (opt-in) | Yes (80+ sources) | Git+OSINT+IoT+Paste | Table, JSON, SARIF, CSV |

**Key market gap confirmed:** No existing open-source tool combines 100+ LLM provider coverage with detection, verification, and OSINT recon. The 81% YoY surge in AI-service credential leaks (GitGuardian 2026 report; 1.27M leaked secrets) validates the demand.

---
## Table Stakes

Features users expect from any credible secret scanner. Missing one means users choose a competitor immediately.

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Regex-based pattern detection | Every tool has it; users assume it exists | Low | Foundation of all scanners; must be fast |
| Entropy analysis | Standard complement to regex since TruffleHog popularized it | Low | Shannon entropy; high FP rate alone — needs keywords too |
| Keyword pre-filtering | TruffleHog's performance trick; users of large repos demand it | Low-Med | Filter to candidate files before applying regex; 10x speedup |
| Git history scanning | TruffleHog/Gitleaks primary use case; users expect full history | Med | Must traverse all commits, branches, tags |
| Directory/file scanning | Needed for non-git use cases (CI artifacts, file shares) | Low | Walk directory tree, apply detectors |
| JSON output | Machine-readable output for pipeline integration | Low | Standard across all tools |
| False positive reduction / deduplication | Alert fatigue is a known pain point across all scanners | Med | Deduplicate the same secret seen in N commits |
| Pre-commit hook support | Shift-left; developers expect git hook integration | Low | Blocks commits with detected secrets |
| CI/CD integration | GitHub Actions, GitLab CI, Jenkins — any serious scanner has this | Low | Binary runs in pipeline; exit code drives pass/fail |
| SARIF output | Required for GitHub Code Scanning tab, GitLab Security dashboard | Low | Standard format; Gitleaks, Titus, Zimara all support it |
| Masked output by default | Security hygiene; users expect keys not printed in clear to the terminal | Low | Mask middle chars; `--unmask` flag to show full |
| Provider-based detection rules | Users expect named detectors ("OpenAI key detected"), not raw regex | Med | Named detectors with confidence; YAML definitions in KeyHunter's case |
| Active key verification (opt-in) | TruffleHog verified this: confirmed keys are worth 10x more to users | Med | MUST be opt-in; legal/ethical requirement; network call to provider API |
| `--verify` flag (off by default) | Legal safety norm in the ecosystem; users expect passive-by-default | Low | Standard pattern established by TruffleHog |
| CSV export | Needed for spreadsheet/reporting workflows | Low | Standard; Gitleaks, Titus support it |
| Multi-platform binary | Single-binary install is the expectation for Go tools | Low | Linux, macOS; Docker for Windows |

---
|
||||
|
||||
## Differentiators
|
||||
|
||||
Features that set KeyHunter apart from every existing tool. These are the competitive moat.
|
||||
|
||||
### Tier 1: Core Differentiators (Primary Competitive Advantage)
|
||||
|
||||
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| 108 LLM/AI provider coverage | No tool covers more than ~15-20 LLM providers; this is a 5-7x gap | High | YAML-driven provider definitions; must include prefix-based (OpenAI, Anthropic, HF, Groq, Replicate) AND keyword-based (Mistral, Cohere, Together AI, Chinese providers) |
| OSINT/Recon engine (80+ sources) | No scanner combines detection + OSINT in one tool | Very High | 18 source categories: code hosting, paste sites, IoT scanners, search dorks, package registries, CI/CD logs, web archives, forums, etc. |
| Active verification for 108 LLM providers | TruffleHog verifies ~700 types but covers far fewer LLM providers | High | Each YAML provider definition includes a verify endpoint; `--verify` opt-in |
| Built-in dork engine (150+ dorks) | Search engine dorking is manual today; no tool has YAML-managed dorks | Med | GitHub, Google, Shodan, Censys, ZoomEye, FOFA dorks in YAML; extensible the same way as providers |
| IoT scanner integration | Shodan/Censys/ZoomEye/FOFA for exposed LLM endpoints | High | Scans for vLLM, Ollama, LiteLLM proxy leaks — a growing attack surface (1scan showed thousands of exposed LLM endpoints) |
| YAML provider plugin system | Community can add providers without recompiling | Med | Compile-time embed via Go `//go:embed`; provider = pattern + keywords + verify endpoint + metadata |
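
The plugin contract in the last row can be sketched as one YAML document per provider; every field name and value below is an illustrative assumption, not KeyHunter's final schema:

```yaml
# providers/openai.yaml (illustrative sketch, not the final schema)
id: openai
name: OpenAI
category: llm
patterns:
  - regex: 'sk-proj-[A-Za-z0-9_-]{20,}'
    confidence: high          # prefix-based match
keywords: [openai, OPENAI_API_KEY]
verify:
  method: GET
  url: https://api.openai.com/v1/models
  header: "Authorization: Bearer {key}"
  valid_status: 200
  invalid_status: 401
metadata:
  format_version: "2024-09"   # when the key format was last checked
```

A community contributor adds a provider by dropping one such file into the embedded directory and sending a PR; the CI validation step described under Pitfalls lints the regex before merge.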

### Tier 2: Strong Differentiators (Meaningfully Better Than Alternatives)

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Paste site aggregation (20+ sites) | Paste sites are a top leak vector; no scanner covers them systematically | High | Pastebin, dpaste, paste.ee, rentry, hastebin, ix.io, etc. |
| Package registry scanning (8+ registries) | npm, PyPI, RubyGems, crates.io, Maven, NuGet, Packagist, Go proxy — LLM keys embedded in packages are a real vector | High | Scan source tarballs and metadata |
| Container/IaC scanning | Docker Hub layers, K8s configs, Terraform state, Helm, Ansible | High | Complements git scanning with the infra layer |
| Web dashboard (htmx + Tailwind) | No open-source scanner has an embedded web UI | High | SQLite backend, embedded in the binary via `go:embed`; scans/keys/recon/providers/dorks/settings |
| Telegram bot integration | Immediate mobile notification of findings; no scanner has this | Med | `/scan`, `/verify`, `/recon`, `/status` commands |
| Scheduled scanning with auto-notify | Recurring scans with cron; no scanner has this natively | Med | Cron-based; webhook or Telegram on new findings |
| SQLite storage with AES-256 encryption | Persistent scan state; other tools are stateless | Med | Store findings, recon results, key status history |
| TruffleHog + Gitleaks import adapter | Lets users pipe existing tool output into KeyHunter's verification/storage | Low-Med | JSON import from both tools; normalizes results |
| APK decompile scanning | Mobile app binaries as a source; no common scanner does this | High | Depends on external apktool/jadx; wrap as an optional integration |
| Web archive scanning | Wayback Machine + CommonCrawl for historical leaks | Med | Useful for finding keys that were removed from code but are still indexed |
| Source map / webpack bundle scanning | Frontend JS bundles frequently contain embedded API keys | Med | Fetch and parse JS source maps from deployed sites |
| Permission analysis (future) | TruffleHog Analyze covers 30 types; KeyHunter could expand to LLM scope | Very High | Know what a leaked key can do — model access, billing, rate limits |

### Tier 3: Nice-to-Have Differentiators

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Colored table output | Better UX than plain text | Low | Use lipgloss or tablewriter; standard in modern Go CLIs |
| Rate limiting per OSINT source | Responsible scanning without bans | Med | Per-host rate limiter; configurable |
| Stealth mode / robots.txt respect | Ethical scanning; avoids legal issues for researchers | Med | Opt-in stealth; obey robots.txt when configured |
| Delta-based git scanning | Only scan new commits since the last run; performance for CI | Med | Store the last scanned commit hash in SQLite |
| mmap-based file reading | Memory-efficient scanning of large files | Med | Use for large log files and archives |
| Worker pool parallelism | TruffleHog does this; expected for performance | Med | Configurable goroutine pool per source type |
| Cloud storage scanning (S3, GCS, Azure Blob) | Buckets frequently contain leaked config files | High | Requires cloud credentials to scan; scope carefully |
| Forum/community scanning (Reddit, HN, StackOverflow) | Real leak vector; developers share code with keys | High | Rate-limited scraping; search API where available |
| Collaboration tool scanning (Notion, Confluence) | Enterprise leak vector; increasingly relevant | Very High | Auth flows are complex; may need per-org API tokens |
| Threat intel integration (VirusTotal, IntelX) | Cross-reference found keys against known breach databases | High | Add-on verification layer |
---

## Anti-Features

Features to deliberately NOT build. Building these would waste resources, create scope creep, or undermine the tool's identity.

| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Key rotation / remediation | KeyHunter is a finder, not a fixer; building rotation = competing with HashiCorp Vault, AWS Secrets Manager, Doppler | Document links to provider-specific rotation guides; link from findings output |
| SaaS / cloud-hosted version | Shifts the tool from open-source security tool to commercial product; legal/privacy complexity explodes | Keep open-source; let users self-host the web dashboard |
| GUI desktop app | High dev cost for low security-tool audience benefit; security tools are CLI-first | CLI + embedded web dashboard covers both audiences |
| Real-time streaming API | Batch scanning is the primary mode; streaming adds websocket/SSE complexity for marginal gain | Use scheduled scans + webhooks/Telegram for near-real-time alerting |
| Windows native build | Small portion of the target audience (red teams, DevSecOps); WSL/Docker serves them | State WSL/Docker support clearly in README |
| AI-generated code scanning (static analysis) | Different domain entirely from secret detection; scope creep | Stay focused on credential/secret detection |
| Automatic key invalidation | Calling a provider API to revoke a key without explicit user consent is dangerous and potentially illegal | Gate ALL provider API calls behind `--verify`; never call provider APIs passively |
| Scanning without user consent | Legal and ethical requirement; all scanning must be intentional | Require explicit targets; no auto-discovery of new repos to scan |
| Built-in proxy/VPN | Scope creep; the tool should not manage network routing | Document use with external proxies; support the HTTP_PROXY env var |
| Key marketplace / sharing | Fundamentally changes the ethical posture of the tool from defender to attacker | Hard no; never log or transmit found keys anywhere outside local SQLite |
| Excessive telemetry | Security tools must not phone home; community trust requires zero telemetry | No analytics, no crash reporting, no network calls except explicit `--verify` |
---

## Feature Dependencies

```
Regex patterns + keyword lists
  -> Provider YAML definitions (pattern + keywords + verify endpoint)
    -> Core scanning engine (file, dir, git)
      -> Active verification (--verify flag)
      -> SQLite storage (findings persistence)
        -> Web dashboard (htmx, reads from SQLite)
        -> JSON/CSV/SARIF export
        -> Telegram bot (reads from SQLite, sends alerts)
        -> Scheduled scanning (cron -> scan -> SQLite -> notify)

Provider YAML definitions
  -> Dork YAML definitions (same extensibility pattern)
    -> Built-in dork engine
      -> OSINT/Recon engine (uses dorks per source)
        -> IoT scanners (Shodan, Censys, ZoomEye, FOFA)
        -> Code hosting (GitHub, GitLab, HuggingFace, etc.)
        -> Paste sites
        -> Package registries
        -> Search engine dorking
        -> Web archives
        -> CI/CD logs
        -> Forums
        -> Collaboration tools
        -> Cloud storage
        -> Container/IaC

TruffleHog/Gitleaks JSON import
  -> Active verification (can verify imported keys)
  -> SQLite storage (can store imported findings)

Delta-based git scanning
  -> SQLite storage (requires stored last-scanned commit)

Keyword pre-filtering
  -> Core scanning engine (filter before regex application)

Worker pool parallelism
  -> All scanning operations (applies globally)
```

---

## MVP Recommendation

Build in strict dependency order. Each phase must be complete before the next delivers value.

**Phase 1 — Foundation (table stakes, no differentiators yet):**
1. Provider YAML definitions for 108 LLM providers (patterns, keywords, verify endpoints)
2. Core scanning engine: regex + entropy + keyword pre-filtering
3. Input sources: file, dir, git history, stdin
4. Active verification via `--verify` flag (off by default)
5. Output: colored table, JSON, SARIF, CSV
6. SQLite storage with AES-256 encryption

**Phase 2 — First differentiators (competitive moat begins here):**
7. Full key access: `--unmask`, `keys show`, web dashboard
8. TruffleHog + Gitleaks import adapters
9. Built-in dork engine (YAML dorks, 150+)
10. Pre-commit hook + CI/CD integration (SARIF, exit codes)

**Phase 3 — OSINT engine (the primary differentiator):**
11. Recon engine core: code hosting (GitHub, GitLab, HuggingFace, Replit, etc.)
12. Paste site aggregator (20+ sites)
13. Search engine dorking (Google, Bing, DuckDuckGo, etc.)
14. Package registries (npm, PyPI, RubyGems, etc.)
15. IoT scanners (Shodan, Censys, ZoomEye, FOFA, Netlas, BinaryEdge)

**Phase 4 — Automation and reach:**
16. Telegram bot
17. Scheduled scanning (cron-based)
18. Remaining OSINT sources: CI/CD logs, web archives, forums, cloud storage, container/IaC, APK, source maps, threat intel

**Defer beyond v1:**
- Collaboration tool scanning (Notion, Confluence, Google Docs): auth complexity is very high; add in v2 if demand exists
- Permission analysis: very high complexity; requires provider-specific API exploration per provider; a good v2 feature
- Web archive scanning at full CommonCrawl scale: the dataset is huge; requires careful scoping to avoid runs that last for days
---

## Detection Method Tradeoffs

Findings from the competitive research that bear on architectural decisions:

| Method | Recall | Precision | Speed | Best For |
|--------|--------|-----------|-------|----------|
| Regex (named patterns) | High (for known formats) | High | Fast | Provider keys with known prefixes (OpenAI sk-proj-, Anthropic sk-ant-api03-, HuggingFace hf_, Groq gsk_) |
| Entropy (Shannon) | Medium (70.4% per Betterleaks data) | Low (high FP) | Fast | Generic high-entropy strings; use as a secondary signal only |
| BPE tokenization (Betterleaks) | Very High (98.6%) | High | Medium | Next-gen; consider for v2 |
| Keyword pre-filtering | N/A (filter only) | N/A | Very Fast | Reduce the candidate set before regex; TruffleHog pattern |
| ML/LLM-based (Nosey Parker AI, GPT-4) | High | Very High | Slow/expensive | FuzzingLabs: GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%; v2 consideration |
| Contextual validation | High | Very High | Medium | GitGuardian's third layer; reduces FP significantly |
**KeyHunter approach:** Regex (primary) + keyword pre-filtering (performance) + entropy (secondary signal). ML-based detection is a v2 feature once the provider coverage gap is closed.
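
As a sketch of the secondary entropy signal, Shannon entropy over a candidate string is a few lines of Go (function name and usage are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns bits of entropy per character of s; uniform
// strings score 0, fully random key material approaches log2(alphabet).
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	freq := make(map[rune]int)
	runes := []rune(s)
	for _, r := range runes {
		freq[r]++
	}
	n := float64(len(runes))
	var h float64
	for _, count := range freq {
		p := float64(count) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy("aaaaaaaaaaaa"))       // 0.00
	fmt.Printf("%.2f\n", shannonEntropy("sk-proj-9fK2mQ7xLp")) // noticeably higher
}
```

Because UUIDs and hashes also score high, the result should only adjust confidence, never flag a finding on its own.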

---

## Ecosystem Context (2026)

- AI-service credential leaks: 1.27M in 2025, up 81% YoY (GitGuardian State of Secrets Sprawl 2026)
- 29M total secrets leaked on GitHub in 2025 (34% YoY increase, the largest single-year jump ever)
- LLM infrastructure leaks grow 5x faster than core model provider leaks
- Claude Code-assisted commits show a 3.2% leak rate vs a 1.5% baseline — AI coding tools are making it worse
- 24,008 unique secrets found in MCP configuration files in 2025
- Betterleaks (March 2026): BPE tokenization achieves 98.6% recall vs 70.4% for entropy — a new detection paradigm worth tracking
- FuzzingLabs (April 2026): GPT-5-mini hits 84.4% recall vs Gitleaks 37.5%, TruffleHog 0% on split/obfuscated secrets — LLM-based detection becoming viable
- TruffleHog + HuggingFace partnership: native HF scanner for models, datasets, Spaces
- GitHub Secret Scanning: added DeepSeek validity checks in March 2026 — LLM provider awareness growing
---

---

## Sources

- [TruffleHog GitHub](https://github.com/trufflesecurity/trufflehog) — feature set, detector count, scanning sources
- [TruffleHog Analyze](https://trufflesecurity.com/blog/trufflehog-now-analyzes-permissions-of-api-keys-and-passwords) — permission analysis feature
- [Gitleaks GitHub](https://github.com/gitleaks/gitleaks) — output formats, detection methods
- [Betterleaks — BleepingComputer](https://www.bleepingcomputer.com/news/security/betterleaks-a-new-open-source-secrets-scanner-to-replace-gitleaks/) — BPE tokenization, recall metrics
- [Betterleaks — Aikido](https://www.aikido.dev/blog/betterleaks-gitleaks-successor) — comparison with Gitleaks
- [Titus — Praetorian](https://www.praetorian.com/blog/titus-open-source-secret-scanner/) — 450+ rules, validation, Burp extension
- [Titus GitHub](https://github.com/praetorian-inc/titus) — feature details
- [GitGuardian Secrets Detection](https://www.gitguardian.com/solutions/secrets-detection) — 550+ detectors, enterprise features
- [GitGuardian State of Secrets Sprawl 2026](https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/) — market statistics
- [GitHub Secret Scanning March 2026](https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/) — validity checks for DeepSeek
- [GitHub Secret Scanning Coverage Update](https://github.blog/changelog/2026-03-31-github-secret-scanning-nine-new-types-and-more/) — 28 new detectors
- [GitGuardian Secret Scanning Tools 2026](https://blog.gitguardian.com/secret-scanning-tools/) — market landscape
- [keyhacks — streaak](https://github.com/streaak/keyhacks) — API key validation endpoints reference
- [detect-secrets — Yelp](https://github.com/Yelp/detect-secrets) — baseline approach, 27 detectors
- [FuzzingLabs — LLM vs regex benchmark](https://x.com/FuzzingLabs/status/1980668916851483010) — GPT-5-mini 84.4% vs Gitleaks 37.5%
- [AI/LLM API key scanning on GitHub at scale](https://dev.to/zaim_abbasi/claude-openai-google-api-keys-all-public-this-is-what-i-found-after-scanning-github-at-scale-fj5) — real-world leak discovery
- [Comparative study of secret scanning tools](https://arxiv.org/pdf/2307.00714) — precision/recall benchmarks
295
.planning/research/PITFALLS.md
Normal file
@@ -0,0 +1,295 @@
# Domain Pitfalls

**Domain:** API key scanner / secret detection tool (LLM/AI provider focus)
**Project:** KeyHunter
**Researched:** 2026-04-04
**Sources confidence:** HIGH for regex/performance (official Go docs, RE2 docs), MEDIUM for legal (CFAA analysis + DOJ policy), MEDIUM for OSINT reliability (community + vendor research)

---

## Critical Pitfalls

Mistakes that cause rewrites, legal exposure, or complete tool rejection.

---

### Pitfall 1: Catastrophic Regex Backtracking on Large Inputs

**What goes wrong:** Secret-detection patterns written with nested quantifiers (e.g., `(.+)+`, `(a|aa)+`) cause exponential CPU time on adversarial or malformed input. A single poorly written pattern can peg a CPU core at 100% indefinitely when scanning large files, binary blobs, or minified JavaScript. This is ReDoS — the same class of bug that has caused production outages at major platforms.

**Why it happens:** Pattern authors focus on correctness, not worst-case complexity. Patterns for generic secrets (Mistral, Cohere, Together AI) that lack high-confidence prefixes tend toward broad `[A-Za-z0-9]+` quantifiers, which become catastrophic when chained.

**Consequences:** The scanner hangs indefinitely on large repos or source maps. Workers in the pool block. If running as a CI hook or Telegram-triggered scan, the entire pipeline stalls with no feedback.

**Prevention:**
- Go's `regexp` package uses the RE2 engine, which guarantees linear-time execution — it never backtracks. This is a free safety net. Do NOT switch to `regexp2` (a backtracking, PCRE-compatible engine) for any pattern, as that forfeits the guarantee.
- Add a per-pattern timeout enforced with `context.WithTimeout` as defense in depth.
- Add a regex complexity linter to the YAML provider validation step (CI check on provider YAML PRs).
- Benchmark every new provider pattern against a 10MB synthetic worst-case string before merging.

**Detection (warning signs):**
- Provider patterns containing `(.+)+`, `(a+)+`, or alternation inside repetition.
- Scan times that scale super-linearly with file size.
- Worker pool goroutines that never complete.

**Phase mapping:** The phase that builds the core regex engine and provider YAML schema. Add pattern complexity validation before community provider contributions open up.

---

### Pitfall 2: False Positive Rate Killing Adoption (Alert Fatigue)

**What goes wrong:** Up to 80% of alerts from entropy-only or broad-regex scanners are false positives (HashiCorp, 2025). Developers stop reviewing alerts within weeks. The tool becomes security theater: real secrets are ignored because every alert feels like noise.

**Why it happens:** Three compounding errors:
1. Using entropy alone to flag secrets — entropy measures randomness, not whether a string is actually a secret. High-entropy strings like UUIDs, hashes, base64 content, and test fixtures flood the results.
2. Patterns written to maximize recall without a precision floor — matching anything that looks like a key prefix.
3. No post-filtering for known non-secret contexts (test files, `.example` files, mock data, documentation).

**Consequences:** Red teams and bug bounty hunters abandon the tool after the first scan of a medium-sized monorepo produces 5,000 results with 4,000 false positives. The core value proposition ("find real, live keys") collapses.

**Prevention:**
- Layer detection: keyword pre-filter first (fast string match), then regex (pattern confirmation), then entropy check (optional calibration), then active verification (ground truth when `--verify` is enabled). Never rely on entropy alone.
- Implement allowlist patterns for known false-positive contexts at the YAML level: `test_`, `example_`, `fake_`, `dummy_`, `placeholder`.
- Expose a `--min-confidence` flag so users can tune the recall/precision trade-off.
- Track and report per-provider false positive rates in internal benchmarks (the CredData dataset is a standard benchmark).
- GitGuardian's ML approach (FP Remover) reduced false positives by 50%. A heuristic version (file path context, variable name context) achieves similar results without ML overhead.
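
A heuristic version of the context allowlist can be sketched in a few lines of Go; the marker list is illustrative, not a tuned set:

```go
package main

import (
	"fmt"
	"strings"
)

// allowlistMarkers are substrings that mark a finding as a likely false
// positive when they appear in the file path or the surrounding variable
// name. This list is illustrative, not KeyHunter's shipped configuration.
var allowlistMarkers = []string{
	"test_", "example_", "fake_", "dummy_", "placeholder", ".example",
}

// isLikelyFalsePositive checks both the file path and the nearby code
// context against the allowlist, case-insensitively.
func isLikelyFalsePositive(path, context string) bool {
	haystack := strings.ToLower(path + " " + context)
	for _, marker := range allowlistMarkers {
		if strings.Contains(haystack, marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isLikelyFalsePositive("config/.env.example", "OPENAI_API_KEY")) // true
	fmt.Println(isLikelyFalsePositive("src/client.go", "apiKey"))               // false
}
```

In the layered pipeline this runs after the regex stage, downgrading (rather than deleting) matches so `--min-confidence` still controls what the user sees.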

**Detection (warning signs):**
- The first scan of any real project returns >1000 results.
- Providers with no high-confidence prefix (e.g., generic 32-char hex keys) show a >50% FP rate in test runs.
- Users filing GitHub issues asking "why did it flag my README example?"

**Phase mapping:** Core engine phase. Must be addressed before OSINT/recon sources amplify the problem at scale — OSINT sources will produce far more candidate strings than local file scanning.

---

### Pitfall 3: Active Key Verification Creates Legal Exposure

**What goes wrong:** Calling a provider's API with a discovered key — even a single `/v1/models` health-check request — to confirm it is valid constitutes "accessing a computer system." Under the CFAA (Computer Fraud and Abuse Act) and analogous laws in other jurisdictions, using a credential you did not receive authorization to use may constitute unauthorized access regardless of intent.

**Why it happens:** Tool authors conflate "I found this key publicly" with "I am authorized to use this key." Public availability does not grant authorization. State laws (e.g., Virginia post-2024) have expanded computer crime definitions beyond the narrowed federal CFAA post-Van Buren ruling.

**Consequences:**
- Criminal exposure for the tool's users (bug bounty hunters, red teamers without explicit scope authorization).
- Civil liability if verification consumes paid quota from a key owner's account — each verification call may incur real cost to the victim.
- If KeyHunter becomes associated with a high-profile incident, the project could be taken down or banned from GitHub.

**Prevention:**
- Verification is opt-in behind `--verify`, with clear documentation that the user bears legal responsibility for scope.
- Add a consent prompt on first `--verify` use: "Active verification sends HTTP requests to provider APIs using discovered keys. Ensure you have explicit authorization. Press Enter to continue or Ctrl+C to abort."
- Document the legal risk in the README and man page. Cite the good-faith security research exception under DOJ policy — it requires documented authorization.
- Limit verification to read-only, non-destructive endpoints (list models, check account status) — never write operations.
- Consider adding a `--dry-run-verify` flag that shows what endpoints would be called without actually calling them.
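
A `--dry-run-verify` implementation could build the provider request without sending it; the endpoint and header scheme below are assumptions standing in for values read from the provider YAML:

```go
package main

import (
	"fmt"
	"net/http"
)

// buildVerifyRequest constructs the read-only verification request for a
// provider without sending it. In dry-run mode the request is printed;
// in --verify mode it is handed to an http.Client. The OpenAI endpoint
// here is illustrative of a per-provider YAML value.
func buildVerifyRequest(endpoint, key string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+key)
	return req, nil
}

func main() {
	req, err := buildVerifyRequest("https://api.openai.com/v1/models", "sk-proj-example")
	if err != nil {
		panic(err)
	}
	// Dry-run output: show what would be called, never call it.
	fmt.Printf("would send: %s %s\n", req.Method, req.URL)
}
```

Keeping request construction separate from sending also makes the consent gate a single choke point: no code path reaches `client.Do` without the `--verify` check.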

**Detection (warning signs):**
- Any verification endpoint that modifies state, consumes significant quota, or accesses user data.
- Users running `--verify` in automated CI pipelines against repositories they do not own.

**Phase mapping:** Verification feature phase. Legal risk documentation must ship with the feature, not as a follow-up.

---

### Pitfall 4: Provider Pattern Rot — YAML Definitions Become Stale

**What goes wrong:** LLM providers change key formats, rotate prefixes, or add new key types. OpenAI migrated from `sk-` to `sk-proj-` for project-scoped keys in 2024. Anthropic keys use `sk-ant-api03-` (the `api03` version suffix implies prior versions existed). Patterns written for old formats silently miss the new format while still matching the old, now-retired one.

**Why it happens:** Provider API key formats are undocumented implementation details. Providers change them without changelog entries. No automated system alerts the scanner maintainer when a provider changes its key format.

**Consequences:**
- False negatives for the new format — active keys in the wild are missed entirely.
- False positives for the old format — patterns match expired key formats that no longer work.
- The tool appears broken to users who know the new format exists.

**Prevention:**
- Add a `format_version` field to each provider YAML definition. Document when the format was last verified against live keys.
- Add integration tests that construct a syntactically valid key for each provider and confirm the pattern matches it.
- Monitor provider changelogs and release notes as part of maintenance. Subscribe to the OpenAI, Anthropic, and other major provider changelogs.
- For providers with high-confidence prefixes, encode the full prefix including the version segment in the pattern (e.g., `sk-ant-api03-`, not just `sk-ant-`).
- GitHub Advanced Security adds/updates patterns monthly (28 new detectors added in March 2026 alone). Use their changelog as an external signal for provider format changes.
- Build a "pattern health check" CI job that runs weekly against a curated set of known-format example keys.
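
The pattern health check can be a table-driven CI job; the patterns and synthetic example keys below are illustrative, not verified live formats:

```go
package main

import (
	"fmt"
	"regexp"
)

// healthChecks pairs each provider pattern with a synthetic key in the
// current format; the weekly CI job fails when any pattern stops
// matching its example. Patterns and keys here are illustrative only.
var healthChecks = []struct {
	provider string
	pattern  *regexp.Regexp
	example  string
}{
	{"openai-project", regexp.MustCompile(`^sk-proj-[A-Za-z0-9_-]{20,}$`), "sk-proj-abcDEF123456789012345"},
	{"anthropic", regexp.MustCompile(`^sk-ant-api03-[A-Za-z0-9_-]{20,}$`), "sk-ant-api03-abcDEF123456789012345"},
	{"huggingface", regexp.MustCompile(`^hf_[A-Za-z0-9]{30,}$`), "hf_abcdefghijABCDEFGHIJ0123456789"},
}

// staleProviders returns providers whose pattern no longer matches the
// curated example key for the current format.
func staleProviders() []string {
	var stale []string
	for _, hc := range healthChecks {
		if !hc.pattern.MatchString(hc.example) {
			stale = append(stale, hc.provider)
		}
	}
	return stale
}

func main() {
	fmt.Println(staleProviders())
}
```

When a provider ships a new format, the fix is one YAML/table edit plus a refreshed example key, and the `format_version` bump documents it.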

**Detection (warning signs):**
- A user reports that a clearly valid Anthropic/OpenAI key is not detected.
- Provider documentation mentions a new key format in release notes.
- TruffleHog or Gitleaks updates a provider pattern — check their commits.

**Phase mapping:** Provider YAML definition phase and ongoing maintenance. Add the pattern health CI before the first public release.

---

### Pitfall 5: OSINT Source Rate Limiting and IP Banning

**What goes wrong:** The 80+ OSINT sources (GitHub, Shodan, Censys, Pastebin, Google dorks, etc.) all have rate limits, bot detection, and account suspension policies. Aggressive scanning results in IP bans, CAPTCHA walls, API key revocation, or account termination — often silently. The scanner fails to report results without telling the user why.

**Why it happens:** Tool authors test against their own accounts with light traffic. Production use by red teams involves parallel workers hitting the same source from the same IP, triggering anomaly detection within minutes.

**Consequences:**
- Google blocks an IP after ~100 automated dork requests per hour — all dork results disappear silently.
- GitHub bans integration tokens that exceed secondary rate limits (concurrent requests, not just per-hour totals).
- Shodan/Censys accounts get flagged for automated abuse patterns.
- Pastebin blocks scraping; their API is the only supported programmatic access.
- The tool appears to work (no error returned) but returns empty results from banned sources.

**Prevention:**
- Implement per-source rate limiting with configurable delays. Do not share a single rate limiter across sources.
- Respect `X-RateLimit-Remaining` and `Retry-After` headers. Back off exponentially on 429 responses.
- For Google dork scanning: use the Custom Search JSON API (100 queries/day free, 10,000 paid) rather than scraping `google.com` directly.
- For Pastebin: use the official Pastebin API, not HTML scraping.
- Add a `--stealth` mode flag that introduces human-like delays (1-5s jitter) between requests.
- Make rate limit configuration per-source in YAML so users can tune it for their plan/tier.
- Log "source exhausted" events clearly — never silently skip a source without telling the user.
- For GitHub: use multiple tokens with rotation to stay within per-token limits across parallel workers.

**Detection (warning signs):**
- A previously productive source suddenly returns 0 results.
- HTTP 429 responses are not surfaced to the user.
- Workers for a source finish suspiciously fast (blocked silently).

**Phase mapping:** OSINT/recon engine phase. The rate limiting architecture must be designed before implementing individual sources — retrofitting it after 80 sources are built is a rewrite.

---

## Moderate Pitfalls

Mistakes that degrade quality or require significant rework but do not cause critical failure.

---

### Pitfall 6: Git History Scanning Memory Exhaustion on Large Repos

**What goes wrong:** Scanning the full git history of a large monorepo (50k+ commits, large binary artifacts, LFS objects) exhausts available memory or runs for hours. Binary files (images, compiled artifacts, ML model weights) are loaded into memory and fed through regex patterns that can never match — wasting CPU and RAM.

**Prevention:**
- Skip binary files before regex evaluation. Use file extension allowlists and MIME type sniffing (read the first 512 bytes).
- Use delta-based scanning: only scan the changed lines in each commit diff, not the full file on every commit. TruffleHog v3 uses this approach.
- Implement a per-file size limit (default 10MB) above which scanning is skipped with a warning.
- For git history: use `git log --diff-filter=AM` so only added and modified content is scanned, not deletions.
- Stream commit data rather than loading the entire repo object store into memory.
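
The binary-skip check can be sketched with a 512-byte sniff; the text-like MIME allowlist here is deliberately small and illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"strings"
)

// looksBinary decides from a file's first 512 bytes whether to skip it
// before regex evaluation: any NUL byte, or a sniffed MIME type that is
// not text-like, marks the file as binary.
func looksBinary(head []byte) bool {
	if len(head) > 512 {
		head = head[:512]
	}
	if bytes.IndexByte(head, 0) != -1 {
		return true // NUL bytes never appear in text source files
	}
	ctype := http.DetectContentType(head)
	return !strings.HasPrefix(ctype, "text/") &&
		!strings.HasPrefix(ctype, "application/json")
}

func main() {
	fmt.Println(looksBinary([]byte("OPENAI_API_KEY=sk-proj-demo\n"))) // false
	fmt.Println(looksBinary([]byte{0x89, 'P', 'N', 'G', 0, 0, 0}))    // true
}
```

Reading only the head keeps the check O(1) per file, so it can run before the size limit and extension allowlist without a second pass.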

**Phase mapping:** Core scanning engine phase. Design the streaming architecture from the start.

---

### Pitfall 7: SQLite Database Stores Keys Unencrypted by Default

**What goes wrong:** SQLite databases store plaintext by default. A documented security advisory (GHSA-4h8c-qrcq-cv5c) shows a real project storing API keys in SQLite without encryption, allowing anyone with filesystem access to read all discovered secrets.

**Prevention:**
- Use SQLCipher or Go's `modernc.org/sqlite` with application-level AES-256 encryption for sensitive columns.
- The PROJECT.md already specifies AES-256 encryption — implement this in Phase 1, not as an afterthought.
- Encrypt the database key itself using the OS keychain (macOS Keychain, Linux libsecret) rather than storing it in the config file.
- Never log full key values to stdout or log files. Honor the `--unmask` boundary in all code paths.
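
Application-level column encryption can be sketched with AES-256-GCM from the standard library; in practice the 32-byte key would come from the OS keychain, not the zeroed buffer used in this demo:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// encryptSecret seals a discovered key with AES-256-GCM before it is
// written to SQLite; the random nonce is prepended to the ciphertext.
func encryptSecret(key []byte, plaintext string) ([]byte, error) {
	block, err := aes.NewCipher(key) // 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, []byte(plaintext), nil), nil
}

// decryptSecret reverses encryptSecret, authenticating the ciphertext.
func decryptSecret(key, blob []byte) (string, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	ns := gcm.NonceSize()
	if len(blob) < ns {
		return "", fmt.Errorf("ciphertext too short")
	}
	pt, err := gcm.Open(nil, blob[:ns], blob[ns:], nil)
	return string(pt), err
}

func main() {
	dbKey := make([]byte, 32) // demo only: load from the OS keychain in practice
	blob, _ := encryptSecret(dbKey, "sk-proj-demo")
	pt, _ := decryptSecret(dbKey, blob)
	fmt.Println(pt)
}
```

GCM also authenticates the blob, so a tampered database row fails to decrypt instead of silently yielding garbage.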

**Phase mapping:** Storage layer phase. Encryption must be in from the beginning — adding it post-hoc requires database migration logic.

---

### Pitfall 8: Verification Endpoint Maintenance Burden

**What goes wrong:** Provider verification endpoints change without notice. A provider may deprecate `/v1/models`, require different authentication headers, change rate limits for unauthenticated requests, or start requiring a minimum account tier for endpoint access. Verification silently returns false negatives — valid keys appear invalid.

**Prevention:**
- Store the verification endpoint and expected response codes in the provider YAML (already planned). Include a `last_verified` date.
- Add a `--test-verifiers` command that runs each provider's verifier against a known-invalid key and confirms it returns the expected "invalid" response — catching when providers break the expected behavior.
- Design the verifier interface so fallback logic is easy: if the primary endpoint returns an unexpected status, try a secondary endpoint before marking the key invalid.

**Phase mapping:** Verification feature phase.

---

### Pitfall 9: Generic Provider Patterns Produce Massive Cross-Provider Collisions

**What goes wrong:** Providers that use generic 32-character hex or alphanumeric keys (Mistral, Cohere, Together AI, many Chinese providers) have patterns that overlap heavily. A key detected as "Mistral" may actually be a Together AI key. The tool confidently mis-attributes keys, damaging credibility.

**Prevention:**
- For generic-pattern providers, rely heavily on keyword context (surrounding variable names, import statements, config file names) to disambiguate.
- Report a confidence level per detection: HIGH (prefix-confirmed), MEDIUM (keyword context), LOW (entropy only, no prefix, no keyword).
- When verification is enabled, attribution becomes deterministic — the key either authenticates to provider A or it does not.
- Consider a "multi-match" result type that lists all candidate providers when a key matches multiple generic patterns.

**Phase mapping:** Provider YAML definition phase. Define the confidence taxonomy before implementing patterns.

---

## Minor Pitfalls

---
### Pitfall 10: Telegram Bot Exposes Keys in Group Chats
|
||||
|
||||
**What goes wrong:** `/scan` or `/key` commands triggered in a Telegram group chat expose full key values to all group members. Even with masking by default, scan results in group contexts violate the principle that discovered keys are sensitive.

**Prevention:**

- Restrict Telegram bot commands to private chats or authorized user IDs only.
- Never send unmasked keys via Telegram regardless of the `--unmask` setting.
- Rate-limit bot commands to prevent abuse by unauthorized users who discover the bot.
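
The first two guards are small enough to show directly. The chat-type strings follow the Telegram Bot API ("private", "group", "supergroup", "channel"); the function names and masking format are assumptions for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// allowCommand gates every bot command: the chat must be private AND the
// sender must be on the configured allowlist.
func allowCommand(chatType string, userID int64, allowed map[int64]bool) bool {
	return chatType == "private" && allowed[userID]
}

// maskKey keeps only a short prefix and suffix; the bot calls this
// unconditionally, ignoring any --unmask setting.
func maskKey(key string) string {
	if len(key) <= 8 {
		return strings.Repeat("*", len(key))
	}
	return key[:4] + strings.Repeat("*", len(key)-8) + key[len(key)-4:]
}

func main() {
	allowed := map[int64]bool{42: true}
	fmt.Println(allowCommand("private", 42, allowed)) // permitted
	fmt.Println(allowCommand("group", 42, allowed))   // refused: group chat
	fmt.Println(maskKey("sk-proj-abcdef123456"))
}
```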

**Phase mapping:** Telegram bot implementation phase.

---

### Pitfall 11: SARIF Output Used Without Context Causing CI Over-Blocking

**What goes wrong:** SARIF output consumed by GitHub Advanced Security or similar CI tooling blocks all PRs when the scanner finds any result. In a repo with legacy committed keys (even expired ones), every PR is blocked indefinitely — causing teams to disable the scanner entirely.

**Prevention:**

- Default SARIF severity to `warning`, not `error`, for unverified findings.
- Promote to `error` only for verified-active keys (requires `--verify`).
- Document the recommended CI configuration with a severity filter.
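
The severity rule reduces to a one-line mapping that the SARIF writer applies per finding — a sketch, with a hypothetical function name:

```go
package main

import "fmt"

// sarifLevel maps a finding to a SARIF reporting level: only a key that was
// actively verified as live (--verify) is promoted to "error"; everything
// else stays a "warning" so CI gates can filter instead of hard-blocking.
func sarifLevel(verifiedActive bool) string {
	if verifiedActive {
		return "error"
	}
	return "warning"
}

func main() {
	fmt.Println(sarifLevel(false), sarifLevel(true))
}
```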

**Phase mapping:** CI/CD integration phase.

---

### Pitfall 12: Dork Queries Violate Search Engine ToS

**What goes wrong:** Automated Google/Bing dork queries violate their Terms of Service. This is enforced via CAPTCHA walls and IP bans, but it also creates legal exposure if the tool is used at scale by enterprise customers who sign ToS agreements.

**Prevention:**

- Document that automated dork scanning uses search engine APIs (Google Custom Search, Bing Web Search API) where possible, not direct HTML scraping.
- Offer DuckDuckGo as a default dork source (more permissive scraping stance), with Google/Bing as opt-in via API key.
- Include a ToS disclaimer in dork documentation.

**Phase mapping:** Dork engine phase.

---

## Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|-------------|----------------|------------|
| Core regex engine | Catastrophic backtracking (Pitfall 1) | Use Go's RE2-backed `regexp`, add pattern complexity linter |
| Provider YAML definitions | Generic pattern collisions (Pitfall 9), format rot (Pitfall 4) | Confidence taxonomy, `format_version` field, pattern health CI |
| False positive filtering | Alert fatigue (Pitfall 2) | Layered detection pipeline, allowlist context, confidence levels |
| SQLite storage | Unencrypted key storage (Pitfall 7) | AES-256 from day one, OS keychain for database key |
| Verification feature | Legal exposure (Pitfall 3), endpoint rot (Pitfall 8) | Opt-in only, consent prompt, `--test-verifiers` command |
| OSINT/recon engine | Rate limiting and IP bans (Pitfall 5) | Per-source rate limiter architecture before implementing sources |
| Git history scanning | Memory exhaustion (Pitfall 6) | Binary file skip, delta-based scanning, streaming |
| Telegram bot | Key exposure in group chats (Pitfall 10) | Private chat restriction, no unmasked keys via bot |
| CI/CD integration (SARIF) | Over-blocking PRs (Pitfall 11) | Warning severity by default, error only for verified keys |
| Dork engine | ToS violations (Pitfall 12) | Search engine APIs over direct scraping, ToS documentation |

---

## Sources

- HashiCorp: [False positives: A big problem for secret scanners](https://www.hashicorp.com/en/blog/false-positives-a-big-problem-for-secret-scanners)
- HashiCorp/InfoQ: [Why traditional secret scanning tools fail](https://www.infoq.com/news/2025/10/hashicorp-secrets/)
- GitGuardian: [Should we target zero false positives?](https://blog.gitguardian.com/should-we-target-zero-false-positives/)
- GitGuardian: [ML-powered FP Remover cuts 50% false positives](https://blog.gitguardian.com/fp-remover-cuts-false-positives-by-half/)
- GitGuardian: [Secrets Detection Engine optimization](https://blog.gitguardian.com/fast-scans-return-earlier/)
- Google RE2: [RE2 — Fast, safe, non-backtracking regex](https://github.com/google/re2)
- Checkmarx: [ReDoS in Go](https://checkmarx.com/blog/redos-go/)
- Nightfall AI: [Best Go Regex Library](https://www.nightfall.ai/blog/best-go-regex-library)
- DOJ CFAA Policy: [Justice Manual 9-48.000](https://www.justice.gov/jm/jm-9-48000-computer-fraud)
- Arnold & Porter: [DOJ Revised CFAA Policy on Exceeds Authorized Access](https://www.arnoldporter.com/en/perspectives/blogs/enforcement-edge/2022/05/dojs-revised-cfaa-policy)
- Center for Cybersecurity Policy: [Virginia Expands Computer Crime Law](https://www.centerforcybersecuritypolicy.org/insights-and-research/virginia-supreme-court-expands-computer-crime-law-raising-legal-issues-for-ethical-hackers)
- Security Advisory: [Unencrypted Storage of API Keys in SQLite](https://github.com/LearningCircuit/local-deep-research/security/advisories/GHSA-4h8c-qrcq-cv5c)
- GitHub Changelog: [Secret scanning pattern updates — March 2026](https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/)
- GitHub Docs: [Rate limits for the REST API](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api)
- GitHub Changelog: [Updated rate limits for unauthenticated requests](https://github.blog/changelog/2025-05-08-updated-rate-limits-for-unauthenticated-requests/)
- OpenAI Community: [sk-proj- key format discussion](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531)
- Geekmonkey: [Hyperscan vs RE2 performance comparison](https://geekmonkey.org/regular-expression-matching-at-scale-with-hyperscan/)
- Betterleaks/CyberInfos: [Fixing API Key Leak Detection Gaps](https://www.cyberinfos.in/betterleaks-secrets-scanner-api-key-detection/)
- arXiv: [Secret Breach Detection with LLMs](https://arxiv.org/html/2504.18784v1)
- SecurityBoulevard: [How to reduce false positives while scanning for secrets](https://securityboulevard.com/2021/02/how-to-reduce-false-positives-while-scanning-for-secrets/)
259
.planning/research/STACK.md
Normal file

# Technology Stack

**Project:** KeyHunter — Go-based API Key Scanner
**Researched:** 2026-04-04
**Research Mode:** Ecosystem

---

## Recommended Stack

### Core CLI Framework

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/spf13/cobra` | v1.10.2 | CLI command tree (scan, verify, recon, keys, serve, dorks, providers, config, hook, schedule) | Industry standard for Go CLIs. Used by Kubernetes, Docker, GitHub CLI. Sub-command hierarchy, persistent flags, shell completion, man-page generation are all built in. No viable alternative — it IS the Go CLI standard. |
| `github.com/spf13/viper` | v1.21.0 | Configuration management (YAML/JSON/env/flags binding) | Designed to pair with Cobra. Handles the config file + env var + CLI flag precedence chain automatically. v1.21.0 switched to a maintained YAML library, cleaning up supply-chain issues. |

**Confidence: HIGH** — Verified via GitHub releases. Used in TruffleHog and Gitleaks themselves.

---

### Web Dashboard

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/go-chi/chi/v5` | v5.2.5 | HTTP router for dashboard and API | 100% net/http compatible — no custom context or handler types. Zero external dependencies. Routes embed naturally into `go:embed` serving. Used by major Go projects. Requires Go 1.22+ (matches project constraint). |
| `github.com/a-h/templ` | v0.3.1001 | Type-safe HTML template compilation | `.templ` files compile to Go — template errors are caught at compile time, not runtime. Composes naturally with htmx. Significantly safer than `html/template` for a project with a public-facing dashboard. |
| htmx | v2.x (CDN or vendored) | Frontend interactivity without a JS framework | Server-rendered with AJAX behavior. No build step. Aligns with the "embed in binary" architecture constraint. Use `go:embed` to bundle htmx.min.js into the binary. |
| Tailwind CSS | v4.x (standalone CLI) | Utility-first styling | v4 ships a standalone binary — no Node.js required. Use `@tailwindcss/cli` to compile a single CSS file, then `go:embed` it. Air watches both `.templ` and CSS changes during development. |

**Confidence: HIGH** — Verified via GitHub releases and multiple 2025 tutorial ecosystem sources.

**Do NOT use:** Fiber or Echo for this project. Both introduce custom handler types that break `net/http` compatibility and complicate `go:embed` static file serving. Chi's zero-dependency, stdlib-native approach is correct for a single-binary security tool.
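
Because chi handlers are plain `net/http`, embedded static serving is the stdlib pattern unchanged. The sketch below uses an in-memory `fstest.MapFS` so it runs anywhere; in KeyHunter the FS would come from a `//go:embed web/dist` directive, and the same `Handle` line works verbatim on a chi router (`r.Handle("/static/*", ...)`):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"testing/fstest"
)

// assets stands in for the go:embed filesystem of compiled dashboard assets.
var assets = fstest.MapFS{
	"dist/style.css": {Data: []byte("body{margin:0}")},
}

// newMux wires the static file server; with chi this is the identical
// http.StripPrefix + http.FileServer pair on r.Handle("/static/*", ...).
func newMux() http.Handler {
	mux := http.NewServeMux()
	mux.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.FS(assets))))
	return mux
}

// fetchStatic spins up a test server and fetches one asset.
func fetchStatic(path string) (int, string) {
	srv := httptest.NewServer(newMux())
	defer srv.Close()
	resp, err := http.Get(srv.URL + path)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body)
}

func main() {
	code, body := fetchStatic("/static/dist/style.css")
	fmt.Println(code, body)
}
```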

---

### Database

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `modernc.org/sqlite` | v1.35.x (SQLite 3.51.2 inside) | Embedded database for scan results, keys, recon data | Pure Go — no CGo, no C compiler requirement. Cross-compiles cleanly for Linux/macOS/ARM. Actively maintained (updated 2026-03-17). Zero external process dependency for single-binary distribution. |
| `database/sql` (stdlib) | — | SQL interface layer | Use the standard library interface over `modernc.org/sqlite` directly — the driver is registered as `"sqlite"`. No ORM needed for a tool of this scope. Raw SQL gives full control and avoids ORM magic bugs. |

**Encryption approach:** Application-level AES-256 using Go's `crypto/aes` + `crypto/cipher` stdlib. Encrypt key material fields before writing to SQLite, decrypt on read. This avoids the CGo dependency of SQLCipher while achieving the same security goal for a local tool. The database file itself stays portable.
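
A minimal sketch of that field-level approach, assuming AES-256-GCM as the concrete mode (GCM gives authenticated encryption; the function names are illustrative, and the 32-byte key would come from the OS keychain, not be hardcoded):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encryptField seals one column value with AES-256-GCM; the random nonce is
// prepended so each stored row is self-describing. key must be 32 bytes.
func encryptField(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptField splits off the nonce and opens the ciphertext; tampering with
// either makes Open fail, which is the point of using an AEAD mode.
func decryptField(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // placeholder: real key comes from the OS keychain
	sealed, _ := encryptField(key, []byte("sk-proj-example"))
	plain, _ := decryptField(key, sealed)
	fmt.Println(string(plain))
}
```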

**Confidence: HIGH (modernc.org/sqlite choice) / MEDIUM (application-level encryption approach).**

**Do NOT use:** `mattn/go-sqlite3` — requires CGo, breaks cross-compilation, adds a C toolchain dependency. SQLCipher wrappers — all require CGo, reintroducing the problem modernc solves.

---

### Concurrency

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/panjf2000/ants/v2` | v2.12.0 | Worker pool for parallel scanning across files, sources, and verification requests | Mature, battle-tested goroutine pool. Dynamically resizable. Handles thousands of concurrent tasks without goroutine explosion. Used in high-throughput Go systems. v2.12.0 adds ReleaseContext for clean shutdown. |
| `golang.org/x/time/rate` | latest x/ | Per-source rate limiting for OSINT/recon sources | Official Go extended library. Token bucket algorithm. Rate-limit each external source (Shodan, GitHub, etc.) independently. Used by TruffleHog for the same purpose. |
| `sync`, `context` (stdlib) | — | Cancellation, mutex, waitgroups | Standard library is sufficient for coordination between pool and caller. No additional abstraction needed. |

**Confidence: HIGH (ants) / HIGH (x/time/rate).**

**Do NOT use:** `pond` or `conc` — viable alternatives, but ants has more production mileage at the scale this tool will operate (thousands of files, 80+ concurrent OSINT sources). `conc` is excellent for structured concurrency but adds abstraction that ants doesn't need.

---

### YAML Provider/Dork Engine

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `gopkg.in/yaml.v3` | v3.x | Parse provider YAML definitions and dork YAML files embedded via `go:embed` | Direct, well-understood API. v3 handles inline/anchored structs correctly. The Cobra v1.10.2 release migrated away from it to `go.yaml.in/yaml/v3` — however, for provider YAML parsing, gopkg.in/yaml.v3 remains stable and appropriate. |
| `embed` (stdlib) | — | Compile-time embedding of `/providers/*.yaml` and `/dorks/*.yaml` | Go 1.16+ native. No external dependency. Providers and dorks are baked into the binary at compile time — no runtime filesystem access needed. |

**Confidence: HIGH.**

**Do NOT use:** `sigs.k8s.io/yaml` — converts YAML to JSON internally, adding an unnecessary round-trip for this use case. `goccy/go-yaml` — its performance advantage is irrelevant for config parsing; adds a dependency.

---

### Telegram Bot

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/mymmrac/telego` | v1.8.0 | Telegram bot for /scan, /verify, /recon, /status, /stats, /subscribe, /key commands | One-to-one Telegram Bot API v9.6 mapping. Supports long polling and webhooks. Type-safe API surface. v1.8.0 released 2026-04-03 with API v9.6 support. Actively maintained. |

**Confidence: HIGH** — Latest release confirmed the day before the research date.

**Alternative considered:** `gopkg.in/telebot.v4` — higher-level abstraction and good DX, but less complete API coverage. For a security tool where controlling exact API behavior matters, telego's 1:1 mapping is preferable.

---

### Scheduled Scanning

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/go-co-op/gocron/v2` | v2.19.1 | Cron-based recurring scans | Modern API; v2 has full context support and job lifecycle management. Race condition fix in v2.19.1 (important for scheduler reliability). Cleaner API than robfig/cron v3. |

**Confidence: MEDIUM** — robfig/cron v3 is also viable, but gocron v2 has the cleaner modern API, and the netresearch/go-cron fork's note that robfig has been unmaintained since 2020 makes gocron the better default.

**Do NOT use:** `robfig/cron` directly — unmaintained since 2020, 50+ open PRs, known panic bugs in production.

---

### Output and Formatting

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/charmbracelet/lipgloss` | latest | Colored terminal table output, status indicators | Declarative style definitions. Composes with any output stream. Used across the Go security tool ecosystem (TruffleHog uses it indirectly). |
| `github.com/charmbracelet/bubbles` | latest | Progress bars for long scans, spinners during verification | Pre-built terminal UI components. Pairs with lipgloss. Less overhead than a full Bubble Tea TUI — use components only. |
| `encoding/json` (stdlib) | — | JSON output format | Standard library is sufficient. No external JSON library needed. |
| SARIF output | custom | CI/CD integration output | Implement the SARIF 2.1.0 format directly — it is a straightforward JSON schema. Do not use a library; gosec's SARIF package is not designed for import. ~200 lines of struct definitions covers the needed schema. |

**Confidence: HIGH (lipgloss/bubbles) / HIGH (stdlib JSON) / MEDIUM (custom SARIF).**

**Do NOT use:** Full Bubble Tea TUI — KeyHunter is a scanning tool, not an interactive terminal application. The TUI overhead (event loop, model/update/view) is inappropriate. Use lipgloss styling and bubbles components directly, spinning up a minimal tea.Program only for progress output.

---

### HTTP Client (OSINT/Recon/Verification)

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `net/http` (stdlib) | — | All outbound HTTP requests for verification and OSINT | Standard library client is sufficient. Supports custom TLS, proxy settings, timeouts. Avoid adding HTTP client wrappers that hide behavior. |
| `golang.org/x/time/rate` | latest | Per-source rate limiting on outbound requests | Already listed under concurrency — the same package serves both purposes. |

**Confidence: HIGH.**

**Do NOT use:** `resty`, `go-retryablehttp`, or other HTTP client wrappers — they add abstractions that make debugging OSINT source failures harder. Implement retry/backoff directly using stdlib + time/rate.

---

### Development Tooling

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `github.com/air-verse/air` | latest | Hot reload during dashboard development | Watches Go + templ files, rebuilds on change. Industry standard for Go web dev loops. |
| `@tailwindcss/cli` | v4.x standalone | CSS compilation without Node.js | The v4 standalone binary eliminates the Node dependency entirely. Run as `tailwindcss -i input.css -o dist/style.css --watch` alongside air. |
| `golangci-lint` | latest | Static analysis and linting | Multi-linter runner. Include gosec, staticcheck, errcheck at minimum. |
| `go test` (stdlib) | — | Testing | Standard library testing is sufficient. Use `testify` for assertions only. |
| `github.com/stretchr/testify` | v1.x | Test assertions | Assert/require packages only. No mocking framework needed at this scope. |

**Confidence: HIGH.**

---

### Build and Distribution

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| `go build` with `-ldflags` | — | Single binary compilation | `CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags="-s -w"` produces a stripped static binary. modernc.org/sqlite makes CGO=0 possible. |
| `goreleaser` | v2.x | Multi-platform release builds (Linux amd64/arm64, macOS amd64/arm64) | Standard tool for Go binary releases. Produces checksums, archives, and optionally Homebrew taps. |

**Confidence: HIGH.**

---

## Alternatives Considered

| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| Web framework | chi v5 | Fiber | Fiber uses fasthttp (not net/http) — breaks standard middleware and `go:embed` serving patterns |
| Web framework | chi v5 | Echo | Echo works with net/http but adds unnecessary abstraction for a dashboard-only use case |
| Templating | templ | html/template (stdlib) | stdlib templates have no compile-time type checking; errors surface at runtime, not build time |
| SQLite driver | modernc.org/sqlite | mattn/go-sqlite3 | mattn requires CGo — breaks cross-compilation and single-binary distribution goals |
| SQLite encryption | application-level AES-256 | SQLCipher (go-sqlcipher) | SQLCipher requires CGo, reintroducing the problem modernc.org/sqlite solves |
| Config | viper | koanf | koanf is cleaner but viper's Cobra integration is tight; viper v1.21 fixed the main key-casing issues |
| Concurrency | ants | pond | pond is simpler but less battle-tested at high concurrency (80+ simultaneous OSINT sources) |
| Scheduler | gocron v2 | robfig/cron v3 | robfig unmaintained since 2020 with known panic bugs in production |
| Telegram | telego | telebot v4 | telebot has better DX but less complete API coverage; telego's 1:1 mapping matters for a bot sending scan results |
| SARIF | custom structs | gosec/v2/report/sarif | gosec's SARIF package is internal to gosec, not a published importable library |
| Terminal UI | lipgloss + bubbles | Full Bubble Tea | Full TUI event loop is overkill; the components-only approach is simpler and sufficient |

---

## Canonical go.mod Dependencies

```go
module github.com/yourusername/keyhunter

go 1.22

require (
	// CLI
	github.com/spf13/cobra v1.10.2
	github.com/spf13/viper v1.21.0

	// Web dashboard
	github.com/go-chi/chi/v5 v5.2.5
	github.com/a-h/templ v0.3.1001

	// Database
	modernc.org/sqlite v1.35.x

	// YAML provider/dork engine
	gopkg.in/yaml.v3 v3.0.1

	// Concurrency + rate limiting
	github.com/panjf2000/ants/v2 v2.12.0
	golang.org/x/time latest

	// Telegram bot
	github.com/mymmrac/telego v1.8.0

	// Scheduler
	github.com/go-co-op/gocron/v2 v2.19.1

	// Terminal output
	github.com/charmbracelet/lipgloss latest
	github.com/charmbracelet/bubbles latest

	// Testing
	github.com/stretchr/testify v1.x
)
```


**Notes on pinning:**

- Pin `cobra`, `chi`, `templ`, `telego`, `ants`, and `gocron` to the exact versions above (verified current).
- Use `go get -u` on `golang.org/x/time` — x/ packages track Go versions, not semver.
- `modernc.org/sqlite` — pin to whatever `go get modernc.org/sqlite@latest` resolves at project init.

---

## Build Commands

```bash
# Development (dashboard with hot reload)
air &                                                        # hot-reload Go + templ
tailwindcss -i web/input.css -o web/dist/style.css --watch & # CSS watch

# Test
go test ./... -race -cover

# Production binary (Linux amd64, CGO-free)
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -ldflags="-s -w" -o keyhunter ./cmd/keyhunter

# Release (all platforms)
goreleaser release --clean
```

---

## Sources

- Cobra releases: https://github.com/spf13/cobra/releases (v1.10.2 confirmed)
- Chi releases: https://github.com/go-chi/chi/releases (v5.2.5 confirmed)
- Templ releases: https://github.com/a-h/templ/releases (v0.3.1001 confirmed)
- Telego releases: https://github.com/mymmrac/telego/releases (v1.8.0, 2026-04-03)
- Ants releases: https://github.com/panjf2000/ants/releases (v2.12.0 confirmed)
- gocron releases: https://github.com/go-co-op/gocron/releases (v2.19.1 confirmed)
- Viper releases: https://github.com/spf13/viper/releases (v1.21.0 confirmed)
- modernc.org/sqlite: https://pkg.go.dev/modernc.org/sqlite (SQLite 3.51.2, updated 2026-03-17)
- Chi router comparison 2025: https://blog.logrocket.com/top-go-frameworks-2025/
- Go web stack 2025 (chi + templ + htmx): https://www.ersin.nz/articles/a-great-web-stack-for-go
- Tailwind v4 standalone (no Node): https://dev.to/getjv/tailwind-css-with-air-and-go-no-node-no-problem-3j92
- SQLite driver comparison: https://github.com/cvilsmeier/go-sqlite-bench
- robfig/cron maintenance status: https://github.com/netresearch/go-cron (unmaintained-since-2020 note)
- Viper vs koanf: https://itnext.io/golang-configuration-management-library-viper-vs-koanf-eea60a652a22
- TruffleHog output formats: https://deepwiki.com/trufflesecurity/trufflehog/6-output-and-results
- Gitleaks output formats: https://appsecsanta.com/sast-tools/gitleaks-vs-trufflehog
265
.planning/research/SUMMARY.md
Normal file

# Project Research Summary

**Project:** KeyHunter — Go-based API Key Scanner for 108+ LLM Providers
**Domain:** Secret detection / OSINT recon tool targeting AI/LLM credential leaks
**Researched:** 2026-04-04
**Confidence:** HIGH
## Executive Summary

KeyHunter operates in a validated and growing market: 1.27 million AI-service credentials were leaked in 2025 (up 81% YoY), yet no open-source tool combines 100+ LLM provider coverage with active verification and OSINT recon in a single binary. The competitive gap against TruffleHog (~15 LLM providers), Gitleaks (~5-10), and Titus (450+ rules but no OSINT) is real and exploitable. The correct build approach is a staged pipeline modeled after TruffleHog v3's internal architecture: keyword pre-filter (Aho-Corasick), then regex detection, then optional verification, all connected via buffered channels and a goroutine worker pool. The entire tool ships as a single static binary with embedded providers, dorks, templates, and a web dashboard — no runtime dependencies.

The recommended stack is settled and high-confidence across every category. Go 1.22+ with Cobra/Viper for CLI, chi v5 + templ + htmx + Tailwind v4 for the dashboard, modernc.org/sqlite (pure Go, CGO-free) for storage, ants v2 for concurrency, and telego for Telegram. The critical architectural constraint is `CGO_ENABLED=0` throughout — the choice of modernc.org/sqlite over mattn/go-sqlite3 exists specifically to preserve cross-compilation and single-binary distribution. Every library choice flows from this constraint and should not be revisited.

The primary risk is building in the wrong order. The provider YAML registry and storage schema must exist before anything else — all 10 other subsystems depend on them. The secondary risk is false positives: entropy-based detection alone produces up to 80% false positive rates, which kills adoption. The third risk is legal: active key verification (`--verify`) requires explicit consent UX and documentation because a single API call with a discovered credential may constitute unauthorized access under CFAA and analogous laws. These risks have clear mitigations documented in research and must be addressed in the phases where they first appear.

---

## Key Findings

### Recommended Stack

The stack is dominated by CGO-free, stdlib-compatible choices that enable a single cross-compiled binary. Every component was verified against official release pages as of 2026-04-04. The key non-obvious choices: `modernc.org/sqlite` instead of the more common `mattn/go-sqlite3` (preserves CGO=0), `chi` instead of Fiber/Echo (100% net/http compatible, critical for go:embed serving), `telego` instead of telebot (1:1 Telegram API mapping for a tool where exact API behavior matters), and custom SARIF structs instead of importing gosec's internal package (gosec SARIF is not published as an importable library).

**Core technologies:**

- `cobra v1.10.2` + `viper v1.21.0`: CLI command tree and config — industry standard, used by TruffleHog and Gitleaks themselves
- `chi v5.2.5`: HTTP router for dashboard — zero external deps, net/http native, enables go:embed static serving
- `templ v0.3.1001`: Type-safe HTML templates — compile-time error checking, composes with htmx
- `modernc.org/sqlite v1.35.x`: Pure Go SQLite — no CGO, cross-compiles cleanly to Linux/macOS/ARM
- `ants v2.12.0`: Worker pool — battle-tested at high concurrency (thousands of goroutines across 80+ OSINT sources)
- `golang.org/x/time/rate`: Per-source rate limiting — official Go extended library, token bucket algorithm
- `telego v1.8.0`: Telegram bot (released 2026-04-03, Telegram Bot API v9.6) — 1:1 API mapping
- `gocron v2.19.1`: Scheduler — modern API, full context support, race condition fix in this version
- `lipgloss` + `bubbles`: Terminal output — components only, not a full Bubble Tea TUI event loop
- Application-level AES-256 via `crypto/aes` stdlib: Key encryption without SQLCipher's CGO dependency
- Custom SARIF 2.1.0 structs (~200 lines): CI/CD output format without importing gosec internals
- `goreleaser v2`: Multi-platform release builds (Linux/macOS amd64/arm64)

**Do not use:** mattn/go-sqlite3 (CGO), Fiber/Echo (breaks net/http compatibility), robfig/cron (unmaintained since 2020, panic bugs), full Bubble Tea TUI (overkill for a scanning tool), regexp2/PCRE (loses the RE2 linear-time guarantee).

### Expected Features

The market has established strong conventions. KeyHunter must meet every table-stakes requirement before any differentiators matter — users instantly benchmark against TruffleHog and Gitleaks on the fundamentals.

**Must have (table stakes):**

- Regex-based pattern detection with keyword pre-filtering (10x speedup, TruffleHog-proven)
- Entropy analysis as a secondary signal (not primary — 80% FP rate when used alone)
- Git history scanning (full commits, branches, tags)
- Directory/file scanning and stdin
- Active key verification via `--verify` (off by default — a legal and ethical requirement)
- JSON, SARIF, CSV, colored table output
- Pre-commit hook and CI/CD integration (exit codes + SARIF)
- Masked output by default (`--unmask` to show full keys)
- Provider-named detectors (not raw regex output)
- Multi-platform static binary

**Should have (competitive moat):**

- 108 LLM/AI provider coverage in YAML (5-7x more than any competitor)
- OSINT/Recon engine with 80+ sources (code hosts, paste sites, IoT scanners, package registries, search dorks)
- Built-in dork engine (150+ dorks in YAML, same extensibility as providers)
- IoT scanner integration (Shodan, Censys, ZoomEye, FOFA for exposed LLM endpoints)
- Web dashboard (htmx + Tailwind, embedded in binary via go:embed)
- SQLite storage with AES-256 (persistent scan state — all competitors are stateless)
- Telegram bot integration (`/scan`, `/verify`, `/recon`, `/status` commands)
- Scheduled scanning with auto-notify
- TruffleHog + Gitleaks import adapters
- Delta-based git scanning (only new commits since last run)

**Defer (v2+):**

- BPE tokenization detection (Betterleaks approach, 98.6% recall vs 70.4% entropy — worth tracking)
- LLM-based detection (GPT-5-mini achieves 84.4% recall on obfuscated secrets — viable but slow/expensive)
- Permission analysis per discovered key (TruffleHog covers 30 types; LLM scope would require per-provider API exploration)
- Collaboration tool scanning (Notion, Confluence — auth flow complexity very high)
- Web archive scanning at scale (CommonCrawl is enormous; requires careful scoping)
- Windows native build (WSL/Docker serves that audience adequately)
- Key rotation/remediation (different product category; competes with Vault, Doppler)

### Architecture Approach

KeyHunter is a single Go binary with 10 discrete subsystems communicating through well-defined interfaces. The architecture is explicitly modeled on TruffleHog v3's pipeline: source adapters produce chunks onto buffered channels, Aho-Corasick pre-filters reduce the candidate set before expensive regex evaluation, detector workers run at an 8x CPU multiplier, and verification workers run as a separate pool (never blocking detectors). The Provider Registry and Storage Layer have zero mutual dependencies and must be built first — everything else depends on them.

**Major components:**

1. **Provider Registry** (`pkg/providers`) — YAML definitions embedded via go:embed at compile time; serves patterns, keywords, and verify endpoints to all other subsystems
2. **Scanning Engine** (`pkg/engine`) — three-stage pipeline (Aho-Corasick pre-filter → detector workers → verification workers); Source interface with concrete adapters for file/dir/git/stdin/URL
3. **Storage Layer** (`pkg/storage`) — modernc.org/sqlite; findings, scans, recon jobs, scheduled jobs, settings; AES-256 on the key_encrypted column from day one
4. **Verification Engine** (`pkg/verify`) — opt-in HTTP calls to provider-defined endpoints; per-provider rate limiting; in-memory session cache to prevent duplicate calls
5. **OSINT / Recon Engine** (`pkg/recon`) — fan-out orchestrator to 80+ sources across 17 categories; each source module holds its own rate.Limiter; results feed back into the Scanning Engine
6. **Dork Engine** (`pkg/recon/dorks`) — YAML dork definitions, multi-search-engine dispatch; sub-component of the Recon Engine
7. **Import Adapters** (`pkg/importers`) — TruffleHog v3 and Gitleaks v8 JSON normalization to the internal Finding struct
8. **Web Dashboard** (`pkg/dashboard`) — chi + templ + htmx + Tailwind, all embedded via go:embed; SSE for live scan progress (no WebSocket)
9. **Notification System** (`pkg/notify`) — telego long-poll bot; private-chat-only restriction; never sends unmasked keys
10. **Scheduler** (`pkg/scheduler`) — gocron v2; job definitions persisted to SQLite for restart survival

**Patterns that must be followed:**
|
||||
- Buffered channels between all pipeline stages (prevent goroutine starvation and parallelism collapse)
|
||||
- Source interface pattern (new sources by implementation, not engine modification)
|
||||
- Per-source rate limiters in Recon Engine (not centralized — sources have wildly different limits)
|
||||
- SSE not WebSockets for dashboard live updates (works through proxies, simpler, htmx native support)
|
||||
- Provider Registry injected via constructor, not global (enables testing without full initialization)
### Critical Pitfalls

1. **Catastrophic regex backtracking** — Go's RE2-backed `regexp` package already guarantees linear-time execution; never use `regexp2` or PCRE for any provider pattern. Add a regex complexity linter to the YAML PR review pipeline before community contributions open.

2. **False positives killing adoption** — Entropy-only detection produces up to an 80% FP rate (HashiCorp 2025 research). Use the layered pipeline: keyword pre-filter, then regex, then entropy as a secondary signal. Add YAML-level allowlists for test/example/dummy contexts. Expose a `--min-confidence` flag. Never ship entropy-only detection for any provider.

3. **Legal exposure from active verification** — Calling a provider API with a discovered key can constitute "accessing a computer system" regardless of intent. Verification must be opt-in behind `--verify`, require a consent prompt on first use, be limited to read-only endpoints, and ship with clear legal documentation. This is not a nice-to-have.

4. **Provider pattern rot** — LLM providers change key formats without changelog entries (OpenAI's sk- to sk-proj- migration in 2024). Add `format_version` and `last_verified` fields to provider YAML. Build a weekly pattern-health CI job against known-format example keys. Monitor the TruffleHog and GitHub Secret Scanning changelogs as external signals.

5. **OSINT rate limiting and silent IP bans** — Aggressive scanning triggers IP bans from Google (after ~100 dork requests/hour), GitHub, Shodan, and Pastebin without error responses. Design the per-source rate limiter architecture before implementing any individual source — retrofitting across 80 sources is a rewrite. Log "source exhausted" events explicitly rather than silently returning empty results.
---

## Implications for Roadmap

Research produces a clear 4-phase structure derived from hard dependency ordering in the architecture. Each phase delivers standalone value before the next begins.

### Phase 1: Foundation — Provider Registry, Core Engine, Storage

**Rationale:** Every other subsystem depends on the provider definitions and storage schema. The scanning pipeline is the critical path — nothing else makes sense without it. Build order from ARCHITECTURE.md is explicit: Provider Registry first (no dependencies), Storage Layer second (no dependencies), Scanning Engine third (depends on both), Verification Engine fourth (depends on engine and provider verify specs), Output Formatters fifth (validate scanner output before building anything on top).

**Delivers:**

- 108 LLM provider YAML definitions with regex patterns, keywords, confidence levels, and verify endpoints
- Core scanning pipeline: keyword pre-filter (Aho-Corasick) + regex + entropy
- Input sources: file, dir, git history, stdin
- Active verification via `--verify` with legal consent prompt
- Output: colored table, JSON, SARIF (custom structs), CSV
- SQLite storage with AES-256 key column encryption from day one
- CLI with cobra/viper: `scan`, `verify`, `keys`, `providers`, `config`

**Addresses pitfalls:** Regex complexity (use RE2 always), false positives (layered pipeline, not entropy-only), legal exposure (opt-in verify with consent), key storage (AES-256 in Phase 1, not later), pattern collisions (confidence taxonomy before patterns are written).

**Research flags:** Standard patterns — TruffleHog v3 architecture is well-documented; the buffered channel pipeline is an established Go idiom. Provider YAML schema design is the one area that may need a brief spike to validate before writing 108 definitions.
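The schema spike could start from a sketch like the following. Every field name here is an assumption layered on details the research does confirm (keywords for the pre-filter, RE2-safe patterns, confidence levels, verify endpoints, allowlists, and the `format_version`/`last_verified` fields from the pattern-rot pitfall); the regex itself is illustrative, not a validated OpenAI pattern.

```yaml
# providers/openai.yaml — illustrative schema sketch only
id: openai
format_version: 2
last_verified: "2026-04-04"
keywords: ["sk-", "sk-proj-"]              # Aho-Corasick pre-filter terms
patterns:
  - regex: 'sk-(proj-)?[A-Za-z0-9_-]{20,}' # RE2-safe, linear-time
    confidence: high
entropy:
  min: 3.5                                 # secondary signal, never sole detector
verify:
  method: GET
  url: https://api.openai.com/v1/models    # read-only endpoint
  header: "Authorization: Bearer {key}"
allowlist:
  - "sk-test"
  - "sk-example"
```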
---

### Phase 2: First Differentiators — Dork Engine, Import Adapters, CI/CD Integration

**Rationale:** Once the core scanner is proven, add the features that create the first layer of competitive moat. Import adapters are low-complexity (Storage Layer only) and can be developed in parallel with the dork engine. Pre-commit hooks and CI/CD exit codes complete the table-stakes checklist and make the tool usable in production pipelines.

**Delivers:**

- Built-in dork engine (150+ YAML dorks, extensible via the same pattern as providers)
- TruffleHog v3 and Gitleaks v8 JSON import adapters
- Pre-commit hook integration (`hook install` command)
- CI/CD: SARIF to GitHub Code Scanning, exit code conventions, GitHub Actions workflow example
- `--unmask` flag and `keys show` command
- Delta-based git scanning (last-scanned commit hash in SQLite)

**Addresses pitfalls:** SARIF severity defaults to `warning` (not `error`) to avoid CI over-blocking; promote to `error` only for verified-active keys. Dork engine uses Google Custom Search API and DuckDuckGo (not direct scraping) to avoid ToS violations.

**Research flags:** Standard patterns for import adapters and CI/CD. Dork engine needs light research on Google Custom Search API quota and DuckDuckGo scraping stance.
---

### Phase 3: OSINT / Recon Engine — The Primary Differentiator

**Rationale:** The recon engine is the largest engineering effort and the deepest competitive moat. It must be built after the scanning pipeline is stable because the recon engine is a consumer of the scanning engine (recon text results are chunked and fed through the detection pipeline). Per-source rate limiter architecture must be designed before implementing individual source modules — retrofitting is a rewrite.

**Delivers:**

- Recon engine core with `ReconSource` interface and per-source rate limiters
- Code hosting sources: GitHub, GitLab, HuggingFace, Bitbucket, Kaggle, Replit (prioritize — largest attack surface)
- Paste site aggregator: Pastebin API + 15+ additional paste sites
- Search engine dorking: DuckDuckGo (default), Google Custom Search API (opt-in with user API key), Bing
- Package registry scanning: npm, PyPI, RubyGems, crates.io, Maven, NuGet
- IoT scanner integration: Shodan, Censys, ZoomEye, FOFA for exposed LLM endpoints (vLLM, Ollama, LiteLLM proxy)
- `recon` CLI command with `--categories` filter, `--stealth` mode

**Addresses pitfalls:** Per-source rate limiters (architecture first, sources second), stealth mode with jitter delays, explicit "source exhausted" logging, robots.txt respect in stealth mode.

**Research flags:** Needs research per source category during planning. IoT scanner APIs (Shodan/Censys/ZoomEye/FOFA) have different query formats and rate tiers. Paste site APIs vary widely — Pastebin has an official API; many others require scraping. Each source module benefits from a brief API investigation before implementation.
---

### Phase 4: Automation, Reach, and Remaining OSINT Sources

**Rationale:** Automation features (Telegram bot, scheduler) require both the scanning engine and storage to be mature. The notification system is triggered by verification events, which requires Phase 1's verification engine. The web dashboard is the last component because it aggregates all subsystems into a UI and depends on all of them.

**Delivers:**

- Telegram bot (`/scan`, `/verify`, `/recon`, `/status`, `/stats`, `/subscribe`, `/key`) — private chat only, no unmasked keys
- Scheduled scanning with cron expressions (gocron v2), job persistence in SQLite, auto-notify on new findings
- Web dashboard (chi + templ + htmx + Tailwind v4 standalone, all embedded via go:embed) — pages: scans, keys, recon, providers, dorks, settings
- SSE for live scan progress in dashboard
- Remaining OSINT sources: CI/CD logs (Travis, CircleCI, GitHub Actions), web archives (Wayback Machine), forums (StackOverflow, Reddit, HN), cloud storage (S3, GCS), container/IaC (Docker Hub, Terraform, Helm), frontend (source maps, webpack bundles, exposed .env), threat intel (VirusTotal, IntelX)
- APK decompile scanning (wraps external apktool/jadx as optional integration)

**Addresses pitfalls:** Telegram bot restricted to private chats only; never sends unmasked keys regardless of the `--unmask` setting; bot commands rate-limited to prevent unauthorized use. Dashboard has no auth by default (local tool), with optional basic-auth config for shared deployments.

**Research flags:** The web dashboard (templ + htmx + SSE pattern) is well-documented in the 2025 Go ecosystem. Remaining OSINT sources each need targeted API research — particularly cloud storage source auth flows (requires user-provided credentials), forum scraping rate limits, and APK toolchain detection.
---

### Phase Ordering Rationale

- **Dependency chain is non-negotiable:** Provider Registry has no dependencies; everything depends on it. Storage Layer has no dependencies; everything writes to it. These two must ship before any other subsystem can be meaningfully built or tested.
- **Scanning engine before OSINT engine:** The recon engine is a consumer of the scanning engine (recon results feed into the detection pipeline). Building OSINT first would mean the output has nowhere to go.
- **Per-source rate limiter architecture before individual OSINT sources:** The PITFALLS research is explicit — retrofitting rate limiting across 80 sources after they are built is a rewrite. The `ReconSource` interface design must include `RateLimit() rate.Limit` from the first source module.
- **Dashboard last:** It aggregates all subsystems. Building it earlier means building against unstable interfaces.
- **AES-256 encryption in Phase 1:** A real GHSA security advisory (GHSA-4h8c-qrcq-cv5c) documents the exact failure mode of adding encryption later. Adding it post-hoc requires database migration logic and is routinely skipped.

### Research Flags

Phases needing deeper research during planning:

- **Phase 3 (OSINT sources):** Each source category warrants targeted API research before implementation. IoT scanner query formats, paste site scraping posture, and package registry tarball access patterns all vary. Recommend a brief research spike per source category at planning time.
- **Phase 4 (remaining OSINT + cloud storage):** Cloud storage source auth flows (AWS credentials, GCS service accounts) need scoping to avoid credential management scope creep. Forum scraping (Reddit, HN, StackOverflow) has rate limit and ToS complexity.

Phases with standard patterns (research phase can be skipped):

- **Phase 1 (core pipeline):** TruffleHog v3 architecture is fully documented via DeepWiki and the official repo. Buffered channel pipeline, Aho-Corasick pre-filter, and the Source interface are established patterns.
- **Phase 2 (import adapters, CI/CD):** Import adapters are JSON struct mapping. CI/CD integration is exit codes + SARIF — Gitleaks and TruffleHog both document their approach.
- **Phase 4 (web dashboard):** The chi + templ + htmx + go:embed + SSE stack is well covered in 2025 Go web development ecosystem resources.

---

## Confidence Assessment

| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All library versions verified via official GitHub releases as of 2026-04-04. Key tradeoffs (modernc.org vs mattn, chi vs Fiber, telego vs telebot) are well-reasoned with documented rationale. |
| Features | HIGH | Competitive landscape verified against official TruffleHog, Gitleaks, Titus, GitGuardian, and GitHub Secret Scanning documentation. Market gap confirmed with GitGuardian 2026 report data. |
| Architecture | HIGH | Component design derived from TruffleHog v3 internals via DeepWiki (generated from official source) and official repo inspection. Patterns are production-proven at scale. |
| Pitfalls | HIGH (technical) / MEDIUM (legal) | ReDoS, false positives, and rate limiting pitfalls have strong sourcing (official Go docs, HashiCorp research, GitHub rate limit docs). CFAA analysis is MEDIUM — law is jurisdiction-dependent and evolving. |

**Overall confidence:** HIGH

### Gaps to Address

- **Application-level AES-256 implementation details:** STACK.md recommends crypto/aes + crypto/cipher for key column encryption but does not specify key derivation (PBKDF2? Argon2? OS keychain integration?). This needs a decision before Storage Layer implementation. PITFALLS.md recommends OS keychain (macOS Keychain, Linux libsecret) for the database encryption key — validate `zalando/go-keyring` or platform-native options during Phase 1 planning.

- **Provider YAML for 108 providers:** Research confirmed the schema and 3 reference providers (OpenAI, Anthropic, HuggingFace). The 108-provider list itself is not enumerated in the research. Sourcing patterns for lesser-known providers (Chinese LLM providers, niche AI APIs) will require targeted research during Phase 1 provider definition work. TruffleHog detector source, Gitleaks rules, and the GitHub Secret Scanning changelog are the best references.

- **Google Custom Search API vs direct dork scraping:** The legal/ToS analysis recommends the Custom Search API (100 queries/day free tier), but this limits dork throughput significantly. The trade-off between coverage and ToS compliance needs a product decision before Phase 2.

- **Aho-Corasick library choice:** ARCHITECTURE.md specifies Aho-Corasick pre-filtering without naming the Go library. `cloudflare/ahocorasick` and bobrik's `aho-corasick` are the common options. Verify which TruffleHog uses and match it for proven behavior.

- **go-sqlcipher vs application-level AES-256 consistency:** ARCHITECTURE.md references `go-sqlcipher` in one location but STACK.md recommends application-level AES-256 specifically to avoid CGO. Resolve this inconsistency explicitly before Storage Layer implementation — the correct answer is application-level AES-256 per the CGO constraint.

---

## Sources

### Primary (HIGH confidence)

- TruffleHog v3 official repo — feature set, detector count, channel pipeline architecture
- DeepWiki TruffleHog engine architecture — internal pipeline design, Aho-Corasick pre-filter
- Gitleaks official repo — output formats, detection methods, SARIF support
- GitHub releases (cobra, chi, templ, telego, ants, gocron, viper) — version verification
- modernc.org/sqlite pkg.go.dev — SQLite 3.51.2, last updated 2026-03-17, pure Go confirmed
- GitGuardian State of Secrets Sprawl 2026 — market statistics (1.27M AI credentials leaked)
- Go RE2/regexp package documentation — linear-time guarantee confirmed
- GHSA-4h8c-qrcq-cv5c — real advisory for unencrypted SQLite key storage
- GitHub Secret Scanning changelog (March 2026) — 28 new detectors, DeepSeek validity checks

### Secondary (MEDIUM confidence)

- HashiCorp (2025) — 80% FP rate from entropy-only detection, documented
- Betterleaks / BleepingComputer (2026) — BPE tokenization 98.6% recall vs 70.4% entropy
- FuzzingLabs (April 2026) — GPT-5-mini 84.4% recall on obfuscated secrets
- Go web stack articles (chi + templ + htmx, Tailwind v4 standalone) — community blog sources, patterns verified against official repos
- CFAA analysis sources (Arnold & Porter, DOJ Justice Manual 9-48.000) — legal interpretation, jurisdiction-dependent
- robfig/cron maintenance status — noted via netresearch/go-cron fork comment

### Tertiary (LOW confidence)

- Google dork rate limit estimate (~100 req/hour) — community observation, not officially documented
- Pastebin scraping posture — community consensus; official policy not published

---

*Research completed: 2026-04-04*
*Ready for roadmap: yes*