keyhunter/.planning/research/ARCHITECTURE.md
2026-04-04 19:03:12 +03:00


Architecture Patterns

Domain: API key / secret scanner with OSINT recon, web dashboard, and notification system
Project: KeyHunter
Researched: 2026-04-04
Overall confidence: HIGH (TruffleHog/Gitleaks internals verified via DeepWiki and official repos; Go patterns verified via official docs and production examples)


KeyHunter is a single Go binary composed of a CLI entry point and seven top-level subsystems. Each subsystem owns its own package boundary. Communication between subsystems flows through well-defined interfaces rather than direct struct coupling.

CLI (Cobra)
    |
    +---> Scanning Engine          (regex + entropy + Aho-Corasick pre-filter)
    |         |
    |         +--> Provider Registry    (YAML definitions, embed.FS at compile time)
    |         +--> Source Adapters      (file, dir, git, URL, stdin, clipboard)
    |         +--> Worker Pool          (goroutine pool + buffered channels)
    |         +--> Verification Engine  (opt-in, per-provider HTTP endpoints)
    |
    +---> OSINT / Recon Engine     (80+ sources, category-based orchestration)
    |         |
    |         +--> Source Modules       (one module per category, rate-limited)
    |         +--> Dork Engine          (YAML dorks, multi-search-engine dispatch)
    |         +--> Recon Worker Pool    (per-source concurrency + throttle)
    |
    +---> Import Adapters          (TruffleHog JSON, Gitleaks JSON -> internal Finding)
    |
    +---> Storage Layer            (SQLite via go-sqlcipher, AES-256 at rest)
    |
    +---> Web Dashboard            (htmx + Tailwind, Go templates, embed.FS, SSE)
    |
    +---> Notification System      (Telegram bot, long polling, command router)
    |
    +---> Scheduler                (gocron, cron expressions, persisted job state)

Component Boundaries

1. CLI Layer (pkg/cli)

Responsibility: Command routing only. Zero business logic. Parses flags, wires subcommands, and starts the correct subsystem. Uses Cobra (the industry standard for Go CLIs, used by Gitleaks, kubectl, and Hugo).

Communicates with: All subsystems as the top-level entry point.

Key commands: scan, verify, import, recon, keys, serve, dorks, providers, config, hook, schedule.

Build notes: the Cobra subcommand tree should live in its own package (cmd/); main.go should remain under 30 lines.


2. Provider Registry (pkg/providers)

Responsibility: Load and serve provider definitions. Providers are YAML files embedded at compile time via //go:embed providers/*.yaml. The registry parses them on startup into an in-memory slice of Provider structs.

Provider YAML schema:

name: openai
version: 1
keywords: ["sk-proj-", "openai"]
patterns:
  - regex: 'sk-proj-[A-Za-z0-9]{48}'
    entropy_min: 3.5
    confidence: high
verify:
  method: GET
  url: https://api.openai.com/v1/models
  headers:
    Authorization: "Bearer {KEY}"
  valid_status: [200]
  invalid_status: [401, 403]
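The entropy_min field in the schema above refers to Shannon entropy of the matched string, used to reject low-randomness regex matches. A minimal sketch of that check (the function name is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns bits of entropy per character, the quantity the
// entropy_min threshold is compared against. Random alphanumeric API keys
// score high (roughly 4-6); repetitive strings and English words score low.
func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	freq := make(map[rune]float64)
	for _, r := range s {
		freq[r]++
	}
	n := float64(len([]rune(s)))
	var h float64
	for _, count := range freq {
		p := count / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy("aaaaaaaa"))                 // 0.00 — rejected
	fmt.Printf("%.2f\n", shannonEntropy("sk-proj-x9Sk2LqP8v4yT0wZ")) // well above 3.5
}
```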

Communicates with: Scanning Engine (provides patterns and keywords), Verification Engine (provides verify endpoint specs), Web Dashboard (provider listing pages).

Build rationale: Must be implemented first; everything downstream depends on it. No external loading at runtime: compile-time embedding preserves the single-binary distribution that TruffleHog documented as a key design goal.


3. Scanning Engine (pkg/engine)

Responsibility: Core detection pipeline. Replicates TruffleHog v3's three-stage approach: keyword pre-filter → regex/entropy detection → optional verification. Manages the goroutine worker pool.

Pipeline stages (mirrors TruffleHog's architecture):

Source Adapter  →  chunker  →  [keyword pre-filter: Aho-Corasick]
                                        |
                               [detector workers] (8x CPU multiplier)
                                        |
                              [verification workers] (1x multiplier, opt-in)
                                        |
                               results channel
                                        |
                              [output formatter]

Aho-Corasick pre-filter: Before running expensive regexes, scan chunks for keyword presence. TruffleHog documented that this step delivers roughly a 10x performance improvement on large codebases. Each provider supplies keywords, and the Aho-Corasick automaton is built from all of them at startup.
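The shape of that pre-filter can be sketched as follows. This sketch uses a plain strings.Contains scan as a stand-in so it stays self-contained; the real engine would compile all provider keywords into a single Aho-Corasick automaton so the cost stays linear in chunk size regardless of keyword count:

```go
package main

import (
	"fmt"
	"strings"
)

// Prefilter decides whether a chunk is worth running regexes against.
// Only chunks containing at least one provider keyword proceed to the
// detector workers; everything else is dropped cheaply.
type Prefilter struct {
	keywords []string // lower-cased keywords from all providers
}

func NewPrefilter(keywords []string) *Prefilter {
	lowered := make([]string, len(keywords))
	for i, k := range keywords {
		lowered[i] = strings.ToLower(k)
	}
	return &Prefilter{keywords: lowered}
}

// Matches reports whether any provider keyword occurs in the chunk.
// A real implementation would replace this loop with one automaton pass.
func (p *Prefilter) Matches(chunk string) bool {
	chunk = strings.ToLower(chunk)
	for _, k := range p.keywords {
		if strings.Contains(chunk, k) {
			return true
		}
	}
	return false
}

func main() {
	pf := NewPrefilter([]string{"sk-proj-", "AKIA"})
	fmt.Println(pf.Matches(`OPENAI_KEY=sk-proj-abc123`)) // true
	fmt.Println(pf.Matches(`nothing to see here`))       // false
}
```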

Channel-based communication:

  • chunksChan chan Chunk — raw chunks from sources
  • detectableChan chan Chunk — keyword-matched chunks only
  • resultsChan chan Finding — confirmed detections
  • All channels are buffered to prevent goroutine starvation.

Source adapters implement a single interface:

type Source interface {
    Name() string
    Chunks(ctx context.Context, ch chan<- Chunk) error
}

Concrete source adapters: FileSource, DirSource, GitSource, URLSource, StdinSource, ClipboardSource.

Communicates with: Provider Registry (fetches detector specs), Verification Engine (forwards candidates), Storage Layer (persists findings), Output Formatter (writes CLI results).


4. Verification Engine (pkg/verify)

Responsibility: Active key validation. Off by default, activated with --verify. Makes HTTP calls to provider-defined endpoints with the discovered key. Classifies results as verified (valid key), invalid (key rejected), or unknown (endpoint unreachable/ambiguous).

Caching: Results are cached in-memory per session to avoid duplicate API calls for the same key. Cache key = provider:key_hash.

Rate limiting: Per-provider rate limiter (token bucket) prevents triggering account lockouts or abuse detection.

Communicates with: Scanning Engine (receives candidates), Storage Layer (updates finding status), Notification System (triggers alerts on verified finds).


5. OSINT / Recon Engine (pkg/recon)

Responsibility: Orchestrates searches across 80+ external sources in 18 categories. Acts as a dispatcher: receives a target query, fans out to all configured source modules, aggregates raw text results, and pipes them into the Scanning Engine.

Category-module mapping:

pkg/recon/
  sources/
    iot/         (shodan, censys, zoomeye, fofa, netlas, binaryedge)
    code/        (github, gitlab, bitbucket, huggingface, kaggle, ...)
    search/      (google, bing, duckduckgo, yandex, brave dorking)
    paste/       (pastebin, dpaste, hastebin, rentry, ix.io, ...)
    registry/    (npm, pypi, rubygems, crates.io, maven, nuget, ...)
    container/   (docker hub layers, k8s configs, terraform, helm)
    cloud/       (s3, gcs, azure blob, do spaces, minio)
    cicd/        (travis, circleci, github actions, jenkins)
    archive/     (wayback machine, commoncrawl)
    forum/       (stackoverflow, reddit, hackernews, dev.to, medium)
    collab/      (notion, confluence, trello)
    frontend/    (source maps, webpack, exposed .env, swagger)
    log/         (elasticsearch, grafana, sentry)
    intel/       (virustotal, intelx, urlhaus)
    mobile/      (apk decompile)
    dns/         (crt.sh, endpoint probing)
    api/         (postman, swaggerhub)

Each source module implements:

type ReconSource interface {
    Name() string
    Category() string
    Search(ctx context.Context, query string, opts Options) ([]string, error)
    RateLimit() rate.Limit
}

Orchestrator behavior:

  1. Fan out to all enabled source modules concurrently.
  2. Each module uses its own rate.Limiter (respects per-source limits).
  3. Stealth mode adds jitter delays and respects robots.txt.
  4. Aggregated text results → chunked → fed to Scanning Engine.

Dork Engine (pkg/recon/dorks): Separate sub-component. Reads YAML dork definitions, formats them per search engine syntax, dispatches to search source modules.

Communicates with: Scanning Engine (sends chunked recon text for detection), Storage Layer (persists recon job state and results), CLI Layer.


6. Import Adapters (pkg/importers)

Responsibility: Parse external tool JSON output (TruffleHog, Gitleaks) and convert to internal Finding structs for storage. Decouples third-party formats from internal model.

Adapters:

  • TruffleHogAdapter — parses TruffleHog v3 JSON output
  • GitleaksAdapter — parses Gitleaks v8 JSON output

Communicates with: Storage Layer only (writes normalized findings).


7. Storage Layer (pkg/storage)

Responsibility: Persistence. All findings, provider data, recon jobs, scan metadata, dorks, and scheduler state live here. SQLite via go-sqlcipher (AES-256 encryption at rest).

Schema boundaries:

findings        (id, provider, key_masked, key_encrypted, status, source, path, timestamp, verified)
scans           (id, type, target, started_at, finished_at, finding_count)
recon_jobs      (id, query, categories, started_at, finished_at, source_count)
scheduled_jobs  (id, cron_expr, scan_config_json, last_run, next_run, enabled)
settings        (key, value)

Key masking: Full keys are AES-256 encrypted in key_encrypted. The display value in key_masked is truncated to the first 8 and last 4 characters. The --unmask flag decrypts on access.
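The masking rule is mechanical enough to pin down in a few lines; the function name and the short-key fallback are assumptions:

```go
package main

import "fmt"

// Mask renders the display form stored in key_masked: the first 8 and
// last 4 characters survive, the middle is elided. Keys too short to
// mask meaningfully are fully hidden.
func Mask(key string) string {
	if len(key) <= 12 {
		return "****"
	}
	return key[:8] + "..." + key[len(key)-4:]
}

func main() {
	fmt.Println(Mask("sk-proj-ABCDEFGHIJKLMNOP1234")) // sk-proj-...1234
	fmt.Println(Mask("short"))                        // ****
}
```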

Communicates with: All subsystems that need persistence (Scanning Engine, Recon Engine, Import Adapters, Dashboard, Scheduler, Notification System).


8. Web Dashboard (pkg/dashboard)

Responsibility: Embedded web UI. Go templates + htmx + Tailwind CSS, all embedded via //go:embed at compile time. No external JS framework. Server-sent events (SSE) for live scan progress without WebSocket complexity.

Pages: scans, keys, recon, providers, dorks, settings.

HTTP server: Standard library net/http is sufficient. No framework overhead needed for this scale.

SSE pattern for live updates:

// Scan progress is pushed to the browser via SSE
// The browser loads htmx's sse extension (hx-ext="sse") to update the scan status table

Communicates with: Storage Layer (reads/writes), Scanning Engine (triggers scans, receives SSE events), Recon Engine (triggers recon jobs).


9. Notification System (pkg/notify)

Responsibility: Telegram bot integration. Sends alerts on verified findings, responds to bot commands. Uses long polling (preferred for single-instance local tools — no public URL needed, simpler setup than webhooks).

Bot commands map to CLI commands: /scan, /verify, /recon, /status, /stats, /subscribe, /key.
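The command-to-handler mapping can be sketched as a simple router. Handler bodies and return strings here are stubs; the real handlers would call into the same code paths as the CLI commands:

```go
package main

import (
	"fmt"
	"strings"
)

// handler turns a command's arguments into a reply message.
type handler func(args []string) string

// buildRouter wires bot commands to handlers. Only two of the commands
// listed above are stubbed here for illustration.
func buildRouter() map[string]handler {
	return map[string]handler{
		"/status": func([]string) string { return "idle" },
		"/scan":   func(a []string) string { return "scanning " + strings.Join(a, " ") },
	}
}

// route dispatches one incoming Telegram message text.
func route(r map[string]handler, text string) string {
	fields := strings.Fields(text)
	if len(fields) == 0 {
		return "empty message"
	}
	h, ok := r[fields[0]]
	if !ok {
		return "unknown command"
	}
	return h(fields[1:])
}

func main() {
	r := buildRouter()
	fmt.Println(route(r, "/scan ./repo")) // scanning ./repo
	fmt.Println(route(r, "/status"))      // idle
}
```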

Subscribe pattern: Users /subscribe to be notified when verified findings are discovered. Subscriber chat IDs stored in SQLite settings.

Communicates with: Storage Layer (reads findings, subscriber list), Scanning Engine (receives verified finding events).


10. Scheduler (pkg/scheduler)

Responsibility: Cron-based recurring scan scheduling. Uses go-co-op/gocron (actively maintained fork of jasonlvhit/gocron). Scheduled job definitions persisted in SQLite so they survive restarts.

Communicates with: Storage Layer (reads/writes job definitions), Scanning Engine (triggers scans), Notification System (notifies on scan completion).


Data Flow

Flow 1: CLI Scan

User: keyhunter scan --path ./repo --verify

CLI Layer
  -> parses flags, builds ScanConfig
  -> calls Engine.Scan(ctx, config)

Scanning Engine
  -> GitSource.Chunks() produces chunks onto chunksChan
  -> Aho-Corasick filter passes keyword-matched chunks to detectableChan
  -> Detector Workers apply provider patterns, produce candidates on resultsChan
  -> Verification Workers (if --verify) call provider verify endpoints
  -> Findings written to Storage Layer
  -> Output Formatter writes colored table / JSON / SARIF to stdout

Flow 2: Recon Job

User: keyhunter recon --query "OPENAI_API_KEY" --categories code,paste,search

CLI Layer
  -> calls Recon Engine with query + categories

Recon Engine
  -> fans out to all enabled source modules for selected categories
  -> each module rate-limits itself, fetches content
  -> raw text results chunked and sent to Scanning Engine via internal channel

Scanning Engine
  -> same pipeline as Flow 1
  -> findings tagged with recon source metadata
  -> persisted to Storage Layer

Flow 3: Web Dashboard Live Scan

Browser: POST /api/scan (hx-post from htmx)
  -> Dashboard handler creates scan record in Storage Layer
  -> Dashboard handler starts Scanning Engine in goroutine
  -> Browser subscribes to SSE endpoint GET /api/scan/:id/events
  -> Engine emits progress events to SSE channel
  -> htmx SSE extension updates scan status table in real time
  -> On completion, full findings table rendered via hx-get

Flow 4: Scheduled Scan + Telegram Notification

Scheduler (gocron)
  -> fires job at cron time
  -> reads ScanConfig from SQLite scheduled_jobs
  -> triggers Scanning Engine

Scanning Engine
  -> runs scan, persists findings

Notification System
  -> on verified finding: reads subscriber list from SQLite
  -> sends Telegram message to each subscriber via bot API (long poll loop)

Flow 5: Import from External Tool

User: keyhunter import --tool trufflehog --file th_output.json

CLI Layer -> Import Adapter (TruffleHogAdapter)
  -> reads JSON, maps to []Finding
  -> writes to Storage Layer
  -> prints import summary to stdout

Build Order (Phase Dependencies)

This ordering reflects hard dependencies — a later component cannot be meaningfully built without the earlier ones.

| Order | Component | Depends On | Why First |
|-------|-----------|------------|-----------|
| 1 | Provider Registry | nothing | All other subsystems depend on provider definitions. Must exist before any detection can be designed. |
| 2 | Storage Layer | nothing (schema only) | Findings model must be defined before anything writes to it. |
| 3 | Scanning Engine (core pipeline) | Provider Registry, Storage Layer | Engine is the critical path. Source adapters and worker pool pattern established here. |
| 4 | Verification Engine | Scanning Engine, Provider Registry | Layered on top of scanning, needs provider verify specs. |
| 5 | Output Formatters (table, JSON, SARIF, CSV) | Scanning Engine | Needed to validate scanner output before building anything on top. |
| 6 | Import Adapters | Storage Layer | Self-contained, only needs storage model. Can be parallel with 4/5. |
| 7 | OSINT / Recon Engine | Scanning Engine, Storage Layer | Builds on the established scanning pipeline as its consumer. |
| 8 | Dork Engine | Recon Engine (search sources) | Sub-component of Recon; needs search source modules to exist. |
| 9 | Scheduler | Scanning Engine, Storage Layer | Requires engine and persistence. Adds recurring execution on top. |
| 10 | Web Dashboard | Storage Layer, Scanning Engine, Recon Engine | Aggregates all subsystems into UI; must be last. |
| 11 | Notification System | Storage Layer, Verification Engine | Triggered by verification events; needs findings and subscriber storage. |

MVP critical path: Provider Registry → Storage Layer → Scanning Engine → Verification Engine → Output Formatters.

Everything else (OSINT, Dashboard, Notifications, Scheduler) layers on top of this proven core.


Patterns to Follow

Pattern 1: Buffered Channel Pipeline (TruffleHog-derived)

What: Goroutine stages connected by buffered channels. Each stage has a configurable concurrency multiplier.

When: Any multi-stage concurrent processing (scanning, recon aggregation).

Example:

// Engine spin-up: buffered channels decouple the pipeline stages.
chunksChan := make(chan Chunk, 1000)    // raw chunks from sources
detectableChan := make(chan Chunk, 500) // keyword-matched chunks
resultsChan := make(chan Finding, 100)  // detections awaiting verification/output

// Detector stage: CPU-bound, so over-provisioned relative to cores.
for i := 0; i < runtime.NumCPU()*8; i++ {
    go detectorWorker(detectableChan, resultsChan, providers)
}

// Verification stage: network-bound, kept at a 1x multiplier.
for i := 0; i < runtime.NumCPU(); i++ {
    go verifyWorker(resultsChan, storage, notify)
}

Why: Decouples stages, prevents fast producers from blocking slow consumers, enables independent scaling of each stage.


Pattern 2: Source Interface + Adapter

What: All scan inputs implement a single Source interface. New sources are added by implementing the interface, not changing the engine.

When: Adding any new input type (new code host, new file format).

Example:

type Source interface {
    Name() string
    Chunks(ctx context.Context, ch chan<- Chunk) error
}

Pattern 3: YAML Provider with compile-time embed

What: Provider definitions live in providers/*.yaml, embedded at compile time. No runtime file loading.

When: Adding new LLM provider detection support.

Why: Single binary distribution. Zero external dependencies at runtime. Community can submit PRs with YAML files — no Go code required to add a provider.

//go:embed providers/*.yaml
var providersFS embed.FS

Pattern 4: Rate Limiter per Recon Source

What: Each recon source module holds its own golang.org/x/time/rate.Limiter. The orchestrator does not centrally throttle.

When: All external HTTP calls in the recon engine.

Why: Different sources have wildly different rate limits (Shodan: 1 req/s on the free tier; GitHub REST API: 60 req/hour unauthenticated; Pastebin: no documented limit). Centralizing the throttle would force every source down to the slowest limit.


Pattern 5: SSE for Dashboard Live Updates

What: Server-Sent Events pushed from Go HTTP handler to htmx SSE extension. One-way server→browser push. No WebSocket needed.

When: Live scan progress, recon job status.

Why: SSE uses standard HTTP, works through proxies, simpler than WebSockets for one-way push, supported natively by htmx SSE extension.


Anti-Patterns to Avoid

Anti-Pattern 1: Global State for Provider Registry

What: Storing providers as package-level globals loaded once at startup.

Why bad: Makes testing impossible without full initialization. Prevents future per-scan provider subsets.

Instead: Pass a *ProviderRegistry explicitly to the engine constructor.


Anti-Pattern 2: Unbuffered Result Channels

What: Using make(chan Finding) (unbuffered) for the results pipeline.

Why bad: A slow output writer blocks detector workers, collapsing parallelism. TruffleHog's architecture explicitly uses buffered channels to manage thousands of concurrent operations.

Instead: Buffer proportional to expected throughput (make(chan Finding, 1000)).


Anti-Pattern 3: Direct HTTP in Detector Workers

What: Detector goroutines making HTTP calls to verify endpoints inline.

Why bad: Verification is slow (network I/O). It would block detector workers, killing throughput.

Instead: Separate verification worker pool as a distinct pipeline stage (TruffleHog's design).


Anti-Pattern 4: Runtime YAML Loading for Providers

What: Loading provider YAML from filesystem at scan time.

Why bad: Breaks single binary distribution. Users must manage provider files separately. Security risk (external file modification).

Instead: //go:embed providers/*.yaml at compile time.


Anti-Pattern 5: Storing Plaintext Keys in SQLite

What: Storing full API keys as plaintext in the database.

Why bad: Database file = credential dump. Any process with file access can read all found keys.

Instead: AES-256 encrypt the full key column. Store only masked version for display. Decrypt on explicit --unmask or via auth-gated dashboard endpoint.
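A sketch of that column-level encryption using AES-256-GCM from the standard library, complementary to the sqlcipher file-level encryption. Function names are illustrative, and the all-zero demo key stands in for one derived from a passphrase:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encryptKey seals a found API key with AES-256-GCM. The random nonce
// is prepended to the ciphertext so decryptKey can recover it.
func encryptKey(master [32]byte, plaintext string) ([]byte, error) {
	block, err := aes.NewCipher(master[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, []byte(plaintext), nil), nil
}

// decryptKey reverses encryptKey; called only on --unmask or via the
// auth-gated dashboard endpoint.
func decryptKey(master [32]byte, sealed []byte) (string, error) {
	block, err := aes.NewCipher(master[:])
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	ns := gcm.NonceSize()
	plain, err := gcm.Open(nil, sealed[:ns], sealed[ns:], nil)
	if err != nil {
		return "", err
	}
	return string(plain), nil
}

func main() {
	var master [32]byte // demo only: derive from a passphrase in practice
	sealed, _ := encryptKey(master, "sk-proj-secret")
	plain, _ := decryptKey(master, sealed)
	fmt.Println(plain)
}
```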


Anti-Pattern 6: Monolithic Recon Orchestrator

What: One giant function that loops through all 80+ sources sequentially.

Why bad: Recon over 80 sources sequentially would take hours. No per-source error isolation.

Instead: Fan-out pattern. Each source module runs concurrently in its own goroutine. Errors are per-source (one failing source doesn't abort the job).


Package Structure

keyhunter/
  main.go                    (< 30 lines, cobra root init)
  cmd/                       (cobra command definitions)
    scan.go
    recon.go
    keys.go
    serve.go
    ...
  pkg/
    providers/               (Provider struct, YAML loader, embed.FS)
    engine/                  (scanning pipeline, worker pool, Aho-Corasick)
      sources/               (Source interface + concrete adapters)
        file.go
        dir.go
        git.go
        url.go
        stdin.go
        clipboard.go
    verify/                  (Verification engine, HTTP client, cache)
    recon/                   (Recon orchestrator)
      sources/               (ReconSource interface + category modules)
        iot/
        code/
        search/
        paste/
        ...
      dorks/                 (Dork engine, YAML dork loader)
    importers/               (TruffleHog + Gitleaks JSON adapters)
    storage/                 (SQLite layer, go-sqlcipher, schema, migrations)
    dashboard/               (HTTP handlers, Go templates, embed.FS)
      static/                (tailwind CSS, htmx JS — embedded)
      templates/             (HTML templates — embedded)
    notify/                  (Telegram bot, long polling, command router)
    scheduler/               (gocron wrapper, SQLite persistence)
    output/                  (Table, JSON, SARIF, CSV formatters)
    config/                  (Config struct, YAML config file, env vars)
  providers/                 (YAML provider definitions — embedded at build)
    openai.yaml
    anthropic.yaml
    ...
  dorks/                     (YAML dork definitions — embedded at build)
    github.yaml
    google.yaml
    ...

Scalability Considerations

| Concern | Single user / local tool | Team / shared instance |
|---------|--------------------------|------------------------|
| Concurrency | Worker pool default: 8x NumCPU detectors | Configurable via --concurrency flag |
| Storage | SQLite handles millions of findings at local scale | SQLite WAL mode for concurrent readers; migrate to PostgreSQL only if needed (out of scope per PROJECT.md) |
| Recon rate limits | Per-source rate limiters; stealth mode adds jitter | API keys / tokens configured per source for higher limits |
| Dashboard | Embedded single-instance; no auth by default | Optionally add basic auth via config for shared deployments |
| Verification | Opt-in; per-provider rate limiting prevents API abuse | Same; no change needed at team scale |

Sources