docs(04): phase context with source adapter decisions

salvacybersec
2026-04-05 15:00:25 +03:00
parent 03e768782a
commit 1bc8f02370

@@ -0,0 +1,103 @@
# Phase 4: Input Sources - Context
**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated
<domain>
## Phase Boundary
Users can point KeyHunter at any content source — local files/directories, git history across all branches, piped content (stdin), remote URLs, and the clipboard — and all are scanned through the same three-stage pipeline established in Phase 1. This phase extends the `sources` package with new `Source` implementations and wires them through the `scan` CLI command.
</domain>
<decisions>
## Implementation Decisions
### Directory/File Scanning (INPUT-01, CORE-07)
- **Recursive traversal** via `filepath.WalkDir` (stdlib, fast, Go 1.16+)
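- **Glob exclusion** via `--exclude` flag accepting multiple patterns, each matched with `filepath.Match` against the relative path and its base name (`*` does not cross path separators, so `*.min.js` only matches nested files via the base name)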
- **Default exclusions**: `.git/`, `node_modules/`, `vendor/`, `*.min.js`, `*.map` (configurable)
- **mmap threshold**: files > 10MB are read through `golang.org/x/exp/mmap`, which avoids loading the whole file into memory up front; smaller files use `os.ReadFile`
- **Binary file detection**: skip files with null bytes in first 512 bytes (simple heuristic, avoids text encoding complexity)
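
The traversal, exclusion, and binary-detection decisions above could fit together roughly as follows. This is a minimal sketch, not the planned `DirSource`: the names `excluded`, `looksBinary`, `readHead`, and `walkFiles` are hypothetical, and treating a trailing `/` as "directory pattern" is an assumption about how the default exclusion list is interpreted.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// defaultExcludes mirrors the default exclusion list above.
var defaultExcludes = []string{".git/", "node_modules/", "vendor/", "*.min.js", "*.map"}

// excluded reports whether rel (a path relative to the scan root) matches a
// pattern. Patterns ending in "/" name directories (an assumption); other
// patterns are tried against the base name first, because "*" never crosses
// a path separator in filepath.Match.
func excluded(rel string, isDir bool, patterns []string) bool {
	base := filepath.Base(rel)
	for _, p := range patterns {
		if strings.HasSuffix(p, "/") {
			if isDir && base == strings.TrimSuffix(p, "/") {
				return true
			}
			continue
		}
		if ok, _ := filepath.Match(p, base); ok {
			return true
		}
		if ok, _ := filepath.Match(p, rel); ok {
			return true
		}
	}
	return false
}

// looksBinary applies the null-byte heuristic to the first 512 bytes.
func looksBinary(head []byte) bool {
	if len(head) > 512 {
		head = head[:512]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// readHead returns up to 512 bytes from the start of the file.
func readHead(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, 512)
	n, err := io.ReadFull(f, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, err
	}
	return buf[:n], nil
}

// walkFiles lists regular, non-excluded files under root. WalkDir visits
// entries in lexical order, so the result is deterministic.
func walkFiles(root string, patterns []string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		if rel != "." && excluded(rel, d.IsDir(), patterns) {
			if d.IsDir() {
				return filepath.SkipDir // do not descend into excluded directories
			}
			return nil
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	files, err := walkFiles(".", defaultExcludes)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk:", err)
		os.Exit(1)
	}
	for _, f := range files {
		head, err := readHead(f)
		if err != nil || looksBinary(head) {
			continue // unreadable or binary: skip
		}
		fmt.Println(f)
	}
}
```

Returning `filepath.SkipDir` for an excluded directory prunes the whole subtree, which is where most of the saving from excluding `node_modules/` or `.git/` would come from.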
### Git History (INPUT-02)
- **Library**: `github.com/go-git/go-git/v5` (pure Go, no CGO) — already ecosystem-standard for Go git ops
- **Scope**: all branches (refs/heads/*), all tags (refs/tags/*), all stash entries
- **Commit iteration**: walk each ref's commits, diff file contents, scan each blob
- **`--since=YYYY-MM-DD`**: filter by commit author date
- **Deduplication**: track seen blob OIDs in a set and skip already-scanned blobs (an OID is a content hash, so identical content shares one OID)
- **Performance**: parallel blob scanning via ants pool
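
The `--since` filter and blob deduplication can be illustrated without a git library. The sketch below is a simplification under stated assumptions: `commit` is a hypothetical stand-in for whatever the git library returns, and `seenSet`, `parseSince`, and `blobsToScan` are illustrative names, not part of the existing code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// commit is a hypothetical stand-in for the data a git library would provide.
type commit struct {
	SHA        string
	AuthorDate time.Time
	BlobOIDs   []string
}

// seenSet records blob OIDs that were already queued; the mutex makes it
// safe to share between parallel workers.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// FirstVisit returns true only the first time an OID is presented.
func (s *seenSet) FirstVisit(oid string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[oid]; ok {
		return false
	}
	s.seen[oid] = struct{}{}
	return true
}

// parseSince parses a --since value in YYYY-MM-DD form (interpreted as UTC).
func parseSince(v string) (time.Time, error) {
	return time.Parse("2006-01-02", v)
}

// blobsToScan returns each blob OID once, considering only commits whose
// author date is on or after since.
func blobsToScan(commits []commit, since time.Time) []string {
	seen := newSeenSet()
	var out []string
	for _, c := range commits {
		if c.AuthorDate.Before(since) {
			continue
		}
		for _, oid := range c.BlobOIDs {
			if seen.FirstVisit(oid) {
				out = append(out, oid)
			}
		}
	}
	return out
}

func main() {
	since, err := parseSince("2026-01-01")
	if err != nil {
		fmt.Println("bad --since value:", err)
		return
	}
	commits := []commit{
		{SHA: "c1", AuthorDate: since.AddDate(0, -6, 0), BlobOIDs: []string{"blob-x"}},
		{SHA: "c2", AuthorDate: since.AddDate(0, 1, 0), BlobOIDs: []string{"blob-y", "blob-z"}},
		{SHA: "c3", AuthorDate: since.AddDate(0, 2, 0), BlobOIDs: []string{"blob-y"}},
	}
	fmt.Println(blobsToScan(commits, since))
}
```

Because the same blob is usually reachable from many commits and refs, checking the set before scanning is what keeps an all-branches walk from rescanning unchanged files.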
### Stdin (INPUT-03)
- **Trigger**: `keyhunter scan stdin` subcommand OR `keyhunter scan -` as a positional argument
- **Implementation**: `StdinSource` reads from `os.Stdin` in chunks, feeds pipeline
- **Source name in findings**: "stdin"
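
Chunked reading from a stream might look like the sketch below. It assumes a `Chunk` shape of `Data`, `Source`, and `Offset` with guessed field types, and the helper names `readChunks` and `chunkString` are hypothetical. It takes an `io.Reader` so that a real `StdinSource` could pass `os.Stdin` while the demo stays deterministic.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Chunk is a local stand-in for types.Chunk; the field types are assumptions.
type Chunk struct {
	Data   []byte
	Source string
	Offset int64
}

// readChunks reads r in blocks of at most size bytes and passes each block to
// emit, tagging it with the source name "stdin" and its byte offset.
func readChunks(r io.Reader, size int, emit func(Chunk)) error {
	if size <= 0 {
		return fmt.Errorf("chunk size must be positive, got %d", size)
	}
	buf := make([]byte, size)
	var offset int64
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			data := make([]byte, n) // copy: buf is reused on the next read
			copy(data, buf[:n])
			emit(Chunk{Data: data, Source: "stdin", Offset: offset})
			offset += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // end of input
		}
		if err != nil {
			return err
		}
	}
}

// chunkString is a demo helper that chunks an in-memory string.
func chunkString(s string, size int) ([]Chunk, error) {
	var out []Chunk
	err := readChunks(strings.NewReader(s), size, func(c Chunk) {
		out = append(out, c)
	})
	return out, err
}

func main() {
	chunks, err := chunkString("AKIA-example-input-from-a-pipe", 8)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, c := range chunks {
		fmt.Printf("%s offset=%d data=%q\n", c.Source, c.Offset, c.Data)
	}
}
```

Fixed-size blocks can split a secret across a boundary; whether chunks should overlap is not decided here and is left out of the sketch.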
### URL (INPUT-04)
- **HTTP client**: stdlib `net/http` with 30s timeout, follows redirects (max 5)
- **User-Agent**: `keyhunter/<version>`
- **Content-length limit**: 50MB (reject larger)
- **Content-type**: accept any text/*, application/json, application/javascript, application/xml
- **TLS**: certificates are verified by default; the `--insecure` flag skips verification (off by default for safety)
- **URL scheme support**: http, https only (no file://, ftp://, etc.)
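
The URL decisions above map onto stdlib `net/http` fairly directly. The following is a sketch under assumptions, not the planned `URLSource`: the function names are hypothetical, only status 200 is accepted (an assumption), and the body is capped while reading because a server may omit `Content-Length`. The demo uses a local `httptest` server so it runs without network access.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"time"
)

const (
	maxBody      = 50 << 20 // 50MB body limit
	maxRedirects = 5
)

// validateURL accepts only absolute http and https URLs.
func validateURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported URL scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return errors.New("URL has no host")
	}
	return nil
}

// acceptedContentType allows text/* plus the listed application types.
func acceptedContentType(header string) bool {
	mt, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	switch mt {
	case "application/json", "application/javascript", "application/xml":
		return true
	}
	return strings.HasPrefix(mt, "text/")
}

// newClient builds a client with a 30s timeout that follows at most
// maxRedirects redirects. len(via) is the number of requests already made.
func newClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) > maxRedirects {
				return fmt.Errorf("stopped after %d redirects", maxRedirects)
			}
			return nil
		},
	}
}

// fetch downloads rawURL and enforces the content-type and size limits.
func fetch(ctx context.Context, c *http.Client, rawURL, userAgent string) ([]byte, error) {
	if err := validateURL(rawURL); err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	if ct := resp.Header.Get("Content-Type"); !acceptedContentType(ct) {
		return nil, fmt.Errorf("unsupported content type %q", ct)
	}
	if resp.ContentLength > maxBody {
		return nil, errors.New("response exceeds 50MB limit")
	}
	// Content-Length may be absent, so also cap the bytes actually read.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBody+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxBody {
		return nil, errors.New("response exceeds 50MB limit")
	}
	return body, nil
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, "token=example")
	}))
	defer srv.Close()
	body, err := fetch(context.Background(), newClient(), srv.URL, "keyhunter/dev")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```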
### Clipboard (INPUT-05)
- **Library**: `github.com/atotto/clipboard` (cross-platform: Linux via xclip/xsel/wl-clipboard, macOS via pbpaste, Windows via native API)
- **Graceful fallback**: if clipboard tool not installed, return clear error with install instructions
- **Linux detection**: check for xclip/xsel/wl-paste at startup, not during scan
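
A startup check for a Linux clipboard tool could be as small as the sketch below. It uses only `exec.LookPath` from the standard library and does not touch the clipboard library itself; the probe order, the function name `findClipboardTool`, and the error wording are assumptions. The lookup function is passed in so the check can be exercised without the tools installed.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// clipboardTools lists the binaries to probe, in an assumed preference order.
var clipboardTools = []string{"wl-paste", "xclip", "xsel"}

// errNoClipboardTool carries the install hint shown to the user.
var errNoClipboardTool = errors.New(
	"no clipboard tool found: install wl-clipboard, xclip, or xsel")

// findClipboardTool returns the first tool that lookPath can resolve.
func findClipboardTool(lookPath func(string) (string, error)) (string, error) {
	for _, name := range clipboardTools {
		if _, err := lookPath(name); err == nil {
			return name, nil
		}
	}
	return "", errNoClipboardTool
}

func main() {
	tool, err := findClipboardTool(exec.LookPath)
	if err != nil {
		fmt.Println("clipboard unavailable:", err)
		return
	}
	fmt.Println("clipboard tool:", tool)
}
```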
### Source Interface Extension (INPUT-06)
- Existing `Source` interface in `pkg/engine/sources/source.go`: `Chunks(ctx, out chan<- types.Chunk) error`
- All new sources implement this interface — pipeline stays unchanged
- `sources.NewXxxSource(args)` factory functions for each source type
- `cmd/scan.go` selects source based on flags: `--git`, `--url`, `--clipboard`, `stdin` subcommand, else default FileSource
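
Source selection in `cmd/scan.go` might reduce to a small decision function like the one sketched here. The `scanOptions` struct and `sourceKind` function are hypothetical, the flag types are guesses (for example, `--git` is modelled as a boolean), and rejecting conflicting flags rather than picking one by precedence is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// scanOptions is a hypothetical view of the parsed scan flags.
type scanOptions struct {
	Git       bool
	URL       string
	Clipboard bool
	Stdin     bool
	Paths     []string
}

// sourceKind names the source to construct. With no source flag it falls
// back to the file source; more than one source flag is treated as an error.
func sourceKind(o scanOptions) (string, error) {
	var picked []string
	if o.Git {
		picked = append(picked, "git")
	}
	if o.URL != "" {
		picked = append(picked, "url")
	}
	if o.Clipboard {
		picked = append(picked, "clipboard")
	}
	if o.Stdin {
		picked = append(picked, "stdin")
	}
	switch len(picked) {
	case 0:
		if len(o.Paths) == 0 {
			return "", errors.New("no input: pass a path or a source flag")
		}
		return "file", nil
	case 1:
		return picked[0], nil
	default:
		return "", fmt.Errorf("conflicting sources: %v", picked)
	}
}

func main() {
	kind, err := sourceKind(scanOptions{URL: "https://example.com/app.js"})
	fmt.Println(kind, err)
	kind, err = sourceKind(scanOptions{Paths: []string{"./src"}})
	fmt.Println(kind, err)
}
```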
</decisions>
<code_context>
## Existing Code Insights
### Reusable Assets
- `pkg/engine/sources/source.go` — Source interface (keep unchanged)
- `pkg/engine/sources/file.go` — single-file FileSource (extend to DirSource OR keep separate)
- `pkg/engine/engine.go` — three-stage pipeline (unchanged)
- `pkg/types/chunk.go` — Chunk struct (unchanged)
- `cmd/scan.go` — wire new flags + source selection
### Established Patterns
- Source emits `types.Chunk{Data, Source, Offset}` on the channel
- Context cancellation: all sources check `ctx.Done()` in their loops
- Error propagation: return error from `Chunks()`, engine drains channel
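
The three patterns above can be seen together in a toy source. This sketch redefines `Chunk` and `Source` locally with assumed field types, and `sliceSource` and `drain` are hypothetical helpers; in particular, having the caller's goroutine close the channel after `Chunks` returns is an assumption about how the engine coordinates shutdown.

```go
package main

import (
	"context"
	"fmt"
)

// Chunk is a local stand-in for types.Chunk; the field types are assumptions.
type Chunk struct {
	Data   []byte
	Source string
	Offset int64
}

// Source mirrors the interface signature quoted above.
type Source interface {
	Chunks(ctx context.Context, out chan<- Chunk) error
}

// sliceSource is a toy source that emits pre-split byte slices.
type sliceSource struct {
	name  string
	parts [][]byte
}

// Chunks emits one Chunk per part and stops early if ctx is cancelled.
func (s sliceSource) Chunks(ctx context.Context, out chan<- Chunk) error {
	var offset int64
	for _, p := range s.parts {
		// Check first so a cancelled context always wins over a ready receiver.
		if err := ctx.Err(); err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- Chunk{Data: p, Source: s.name, Offset: offset}:
			offset += int64(len(p))
		}
	}
	return nil
}

// drain runs src in a goroutine, collects everything it emits, and returns
// the error from Chunks. cancelFirst cancels the context before starting.
func drain(src Source, cancelFirst bool) ([]Chunk, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	out := make(chan Chunk)
	errc := make(chan error, 1)
	go func() {
		defer close(out) // closing ends the range loop below
		errc <- src.Chunks(ctx, out)
	}()
	var got []Chunk
	for c := range out {
		got = append(got, c)
	}
	return got, <-errc
}

func main() {
	src := sliceSource{name: "demo", parts: [][]byte{[]byte("abc"), []byte("de")}}
	chunks, err := drain(src, false)
	fmt.Println(len(chunks), err)
	chunks, err = drain(src, true)
	fmt.Println(len(chunks), err)
}
```

The `select` around the send matters as much as the loop check: without it, a source blocked on a full channel would never notice cancellation.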
### Integration Points
- `cmd/scan.go` — add flags: `--exclude`, `--git`, `--url`, `--clipboard`, `--since`, `--max-file-size`, `--insecure`, `--follow-redirects`
- New files: `pkg/engine/sources/{dir,git,stdin,url,clipboard}.go`
- Tests: `pkg/engine/sources/*_test.go`
### go.mod additions needed
- `github.com/go-git/go-git/v5`
- `github.com/atotto/clipboard`
- `golang.org/x/exp/mmap`
</code_context>
<specifics>
## Specific Ideas
- Default exclusions should match common "generated/vendored" directories to reduce noise
- Dir source should emit files in deterministic order for reproducible test output
- Git source should report commit SHA in finding source field: "git:abc1234:path/to/file.go"
- URL source should report URL in finding source field: "url:https://example.com/..."
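
The two source-field formats above could be produced by small helpers such as these. The function names are hypothetical, and shortening the commit SHA to seven characters is an assumption inferred from the `abc1234` example.

```go
package main

import "fmt"

// gitSourceName formats "git:<short-sha>:<path>", truncating the SHA to
// seven characters (an assumption based on the example above).
func gitSourceName(sha, path string) string {
	short := sha
	if len(short) > 7 {
		short = short[:7]
	}
	return "git:" + short + ":" + path
}

// urlSourceName formats "url:<url>".
func urlSourceName(rawURL string) string {
	return "url:" + rawURL
}

func main() {
	fmt.Println(gitSourceName("abc1234def5678900000000000000000000000aa", "path/to/file.go"))
	fmt.Println(urlSourceName("https://example.com/app.js"))
}
```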
</specifics>
<deferred>
## Deferred Ideas
- Clipboard watch mode (poll clipboard for changes) — out of scope
- HTTP authentication (basic/bearer for --url) — defer to Phase 5 or later
- SFTP/SCP URL schemes — out of scope
- Archive file extraction (.zip, .tar.gz scanning) — defer to later phase
</deferred>