docs(04): phase context with source adapter decisions
103
.planning/phases/04-input-sources/04-CONTEXT.md
Normal file
@@ -0,0 +1,103 @@
# Phase 4: Input Sources - Context

**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated

<domain>
## Phase Boundary

Users can point KeyHunter at any content source (local files/directories, git history across all branches, piped content on stdin, remote URLs, and the clipboard), and all are scanned through the same three-stage pipeline established in Phase 1. This phase extends the `sources` package with new `Source` implementations and wires them through the `scan` CLI command.

</domain>

<decisions>
## Implementation Decisions

### Directory/File Scanning (INPUT-01, CORE-07)

- **Recursive traversal** via `filepath.WalkDir` (stdlib, fast, Go 1.16+)
- **Glob exclusion** via a repeatable `--exclude` flag; each pattern is matched with `filepath.Match` per path (note that `filepath.Match` has no `**`, and `*` does not cross path separators)
- **Default exclusions**: `.git/`, `node_modules/`, `vendor/`, `*.min.js`, `*.map` (configurable)
- **mmap threshold**: files > 10 MB are read through `golang.org/x/exp/mmap` so they are not loaded into the heap in one piece; smaller files use `os.ReadFile`
- **Binary file detection**: skip files with a NUL byte in the first 512 bytes (simple heuristic, avoids text-encoding complexity)
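
The traversal, exclusion, and binary-detection decisions above can be sketched with the standard library alone. This is a minimal sketch, not the planned `DirSource`: the names `isExcluded`, `looksBinary`, and `listFiles` are hypothetical, and the rule that a pattern ending in `/` excludes any path segment with that name is an assumption about how `--exclude` should treat directory patterns.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// defaultExcludes mirrors the default exclusion list from the decisions above.
var defaultExcludes = []string{".git/", "node_modules/", "vendor/", "*.min.js", "*.map"}

// isExcluded reports whether rel (a slash-separated path relative to the scan
// root) matches any pattern. Assumption: a pattern ending in "/" excludes any
// path segment with that name; every other pattern is matched with
// filepath.Match against the base name only.
func isExcluded(rel string, patterns []string) bool {
	segments := strings.Split(rel, "/")
	base := segments[len(segments)-1]
	for _, p := range patterns {
		if strings.HasSuffix(p, "/") {
			dir := strings.TrimSuffix(p, "/")
			for _, s := range segments {
				if s == dir {
					return true
				}
			}
			continue
		}
		if ok, err := filepath.Match(p, base); err == nil && ok {
			return true
		}
	}
	return false
}

// looksBinary applies the NUL-byte heuristic to the first 512 bytes of head.
func looksBinary(head []byte) bool {
	if len(head) > 512 {
		head = head[:512]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// listFiles walks root and returns the relative paths of regular files that
// are not excluded. filepath.WalkDir visits entries in lexical order, so the
// result is deterministic for a given tree.
func listFiles(root string, patterns []string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		rel = filepath.ToSlash(rel)
		if rel != "." && isExcluded(rel, patterns) {
			if d.IsDir() {
				return fs.SkipDir // prune the whole excluded directory
			}
			return nil
		}
		if d.Type().IsRegular() {
			files = append(files, rel)
		}
		return nil
	})
	return files, err
}

func main() {
	for _, p := range []string{"cmd/scan.go", "web/node_modules/x/index.js", "static/app.min.js"} {
		fmt.Printf("%-30s excluded=%v\n", p, isExcluded(p, defaultExcludes))
	}
	fmt.Println("binary:", looksBinary([]byte{0x7f, 'E', 'L', 'F', 0}))
	files, err := listFiles(".", defaultExcludes)
	fmt.Println(len(files), "files to scan, err =", err)
}
```

Returning `fs.SkipDir` for an excluded directory keeps the walk from descending into large trees such as `node_modules/` at all, which is where most of the time saving comes from.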

### Git History (INPUT-02)

- **Library**: `github.com/go-git/go-git/v5` (pure Go, no CGO), the de facto standard library for Git operations in Go
- **Scope**: all branches (`refs/heads/*`), all tags (`refs/tags/*`), all stash entries
- **Commit iteration**: walk each ref's commits, diff file contents between commits, and scan each resulting blob
- **`--since=YYYY-MM-DD`**: filter commits by author date
- **Deduplication**: track scanned blob OIDs in a set and skip blobs already seen (git objects are deduplicated by content, so an identical OID means identical bytes)
- **Performance**: parallel blob scanning via the `ants` pool
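
The deduplication and `--since` decisions above can be illustrated without touching go-git. The sketch below is stdlib-only, and all names (`seenBlobs`, `firstVisit`, `parseSince`, `inRange`) are hypothetical. It assumes that OIDs are handled as hex strings, that `--since` is interpreted as UTC midnight, and that the bound is inclusive; none of these is fixed by the decisions above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// seenBlobs is a concurrency-safe set of blob object IDs, so that parallel
// workers can share it. OIDs are kept as hex strings for simplicity.
type seenBlobs struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenBlobs() *seenBlobs {
	return &seenBlobs{seen: make(map[string]struct{})}
}

// firstVisit records oid and reports true only the first time it is seen.
func (s *seenBlobs) firstVisit(oid string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[oid]; ok {
		return false
	}
	s.seen[oid] = struct{}{}
	return true
}

// parseSince parses a --since value in YYYY-MM-DD form. time.Parse returns
// UTC when the layout carries no zone, so the result is UTC midnight.
func parseSince(v string) (time.Time, error) {
	return time.Parse("2006-01-02", v)
}

// inRange reports whether authorDate is on or after since (inclusive bound,
// an assumption of this sketch).
func inRange(authorDate, since time.Time) bool {
	return !authorDate.Before(since)
}

func main() {
	since, err := parseSince("2024-01-15")
	if err != nil {
		panic(err)
	}
	// Two refs that share one blob: the shared blob is scanned once.
	refs := map[string][]string{
		"refs/heads/main": {"1111aaaa", "2222bbbb"},
		"refs/heads/dev":  {"2222bbbb", "3333cccc"},
	}
	seen := newSeenBlobs()
	scanned := 0
	for _, blobs := range refs {
		for _, oid := range blobs {
			if seen.firstVisit(oid) {
				scanned++
			}
		}
	}
	fmt.Println("blobs scanned:", scanned)
	fmt.Println("commit in range:", inRange(since.Add(24*time.Hour), since))
}
```

Because the check and the insert happen under one lock, two workers that reach the same blob at the same time cannot both scan it.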

### Stdin (INPUT-03)

- **Trigger**: the `keyhunter scan stdin` subcommand or the `keyhunter scan -` positional argument
- **Implementation**: `StdinSource` reads from `os.Stdin` in chunks and feeds them to the pipeline
- **Source name in findings**: `"stdin"`
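
A chunked reader of the kind described above might look like the following. It is a sketch under assumptions: `Chunk` is a local stand-in for `types.Chunk` (only the three field names come from this document), `ReaderSource` and `collect` are hypothetical names, and the source takes any `io.Reader` so it can be exercised without a real pipe. In the CLI the reader would be `os.Stdin`.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// Chunk is a stand-in for types.Chunk, using the field names from this doc.
type Chunk struct {
	Data   []byte
	Source string
	Offset int64
}

// ReaderSource reads r in fixed-size chunks and labels each chunk with name.
type ReaderSource struct {
	r         io.Reader
	name      string
	chunkSize int
}

// Chunks emits the reader's content until EOF or until ctx is cancelled.
func (s *ReaderSource) Chunks(ctx context.Context, out chan<- Chunk) error {
	size := s.chunkSize
	if size <= 0 {
		size = 4096 // arbitrary default for this sketch
	}
	buf := make([]byte, size)
	var offset int64
	for {
		n, err := io.ReadFull(s.r, buf)
		if n > 0 {
			// Copy the bytes: buf is reused on the next iteration.
			c := Chunk{Data: append([]byte(nil), buf[:n]...), Source: s.name, Offset: offset}
			select {
			case out <- c:
			case <-ctx.Done():
				return ctx.Err()
			}
			offset += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // end of input, possibly after a short final chunk
		}
		if err != nil {
			return err
		}
	}
}

// collect runs a ReaderSource over input and gathers every chunk it emits.
func collect(input string, chunkSize int) ([]Chunk, error) {
	src := &ReaderSource{r: strings.NewReader(input), name: "stdin", chunkSize: chunkSize}
	out := make(chan Chunk)
	errc := make(chan error, 1)
	go func() {
		errc <- src.Chunks(context.Background(), out)
		close(out)
	}()
	var got []Chunk
	for c := range out {
		got = append(got, c)
	}
	return got, <-errc
}

func main() {
	chunks, err := collect("0123456789", 4)
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("%s@%d %q\n", c.Source, c.Offset, c.Data)
	}
}
```

The sketch does not overlap adjacent chunks, so a secret that straddles a chunk boundary would be split; how the real pipeline handles boundaries is outside what this document specifies.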

### URL (INPUT-04)

- **HTTP client**: stdlib `net/http` with a 30s timeout; follows redirects (max 5)
- **User-Agent**: `keyhunter/<version>`
- **Content-length limit**: 50 MB (larger responses are rejected)
- **Content-type**: accept any `text/*`, plus `application/json`, `application/javascript`, `application/xml`
- **TLS**: certificates are verified by default; the `--insecure` flag skips verification (off by default for safety)
- **URL scheme support**: `http` and `https` only (no `file://`, `ftp://`, etc.)
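
The URL decisions above map fairly directly onto `net/http`. The following is a sketch under assumptions rather than the planned `URLSource`: the function names, the `userAgent` placeholder version, and the reading of "50MB" as 50 MiB are all hypothetical. It also caps the body with `io.LimitReader`, an addition not stated above, because a server may omit or misreport `Content-Length`.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/url"
	"strings"
	"time"
)

const (
	maxBodyBytes = 50 << 20              // assumed reading of the 50MB limit
	maxRedirects = 5                     // redirects followed before giving up
	userAgent    = "keyhunter/0.0.0-dev" // placeholder for keyhunter/<version>
)

// checkScheme rejects anything other than http and https.
func checkScheme(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported URL scheme %q", u.Scheme)
	}
	return nil
}

// acceptContentType implements the allow-list: text/* plus three application types.
func acceptContentType(header string) bool {
	mt, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	if strings.HasPrefix(mt, "text/") {
		return true
	}
	switch mt {
	case "application/json", "application/javascript", "application/xml":
		return true
	}
	return false
}

// newClient builds a client with the 30s timeout and redirect cap.
// TLS verification is disabled only when insecure is true.
func newClient(insecure bool) *http.Client {
	tr := http.DefaultTransport.(*http.Transport).Clone()
	if insecure {
		tr.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
	}
	return &http.Client{
		Transport: tr,
		Timeout:   30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// via holds the requests made so far, so this stops the 6th redirect.
			if len(via) > maxRedirects {
				return errors.New("too many redirects")
			}
			return nil
		},
	}
}

// fetch downloads raw and enforces the scheme, size, and content-type rules.
func fetch(c *http.Client, raw string) ([]byte, error) {
	if err := checkScheme(raw); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodGet, raw, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if ct := resp.Header.Get("Content-Type"); !acceptContentType(ct) {
		return nil, fmt.Errorf("unsupported content type %q", ct)
	}
	if resp.ContentLength > maxBodyBytes {
		return nil, errors.New("response exceeds size limit")
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxBodyBytes {
		return nil, errors.New("response exceeds size limit")
	}
	return body, nil
}

func main() {
	fmt.Println(checkScheme("https://example.com/app.js"))
	fmt.Println(checkScheme("file:///etc/passwd"))
	fmt.Println(acceptContentType("text/html; charset=utf-8"))
	_ = fetch // no network call is made in this demo
}
```

Cloning `http.DefaultTransport` rather than building an empty `http.Transport` keeps the default proxy and connection settings while still allowing the TLS override.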

### Clipboard (INPUT-05)

- **Library**: `github.com/atotto/clipboard` (cross-platform: Linux via xclip/xsel/wl-clipboard, macOS via pbpaste, Windows via native API)
- **Graceful fallback**: if no clipboard tool is installed, return a clear error with install instructions
- **Linux detection**: check for xclip/xsel/wl-paste at startup, not during the scan
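
One way to perform the startup check on Linux is to probe `PATH` with `exec.LookPath` before any scan work begins. This is a sketch under assumptions: `findClipboardTool`, `errNoClipboardTool`, and the probe order are hypothetical, and the lookup function is passed in as a parameter only so the logic can be exercised without the tools being installed.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// linuxClipboardTools lists the binaries probed, in an assumed order of preference.
var linuxClipboardTools = []string{"wl-paste", "xclip", "xsel"}

// errNoClipboardTool carries the install hint returned to the user.
var errNoClipboardTool = errors.New(
	"no clipboard tool found: install xclip, xsel, or wl-clipboard and retry")

// findClipboardTool returns the first tool that lookPath can resolve.
// In production, lookPath would be exec.LookPath.
func findClipboardTool(lookPath func(string) (string, error)) (string, error) {
	for _, tool := range linuxClipboardTools {
		if _, err := lookPath(tool); err == nil {
			return tool, nil
		}
	}
	return "", errNoClipboardTool
}

func main() {
	tool, err := findClipboardTool(exec.LookPath)
	if err != nil {
		fmt.Println("clipboard unavailable:", err)
		return
	}
	fmt.Println("clipboard tool:", tool)
}
```

Running the probe once at startup lets `keyhunter scan --clipboard` fail fast with the install hint instead of failing after the pipeline has been set up.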

### Source Interface Extension (INPUT-06)

- Existing `Source` interface in `pkg/engine/sources/source.go`: `Chunks(ctx, out chan<- types.Chunk) error`
- All new sources implement this interface, so the pipeline stays unchanged
- `sources.NewXxxSource(args)` factory functions for each source type
- `cmd/scan.go` selects the source based on flags: `--git`, `--url`, `--clipboard`, the `stdin` subcommand, otherwise the default `FileSource`
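
The selection rule in the last bullet can be expressed as a small pure function. This is a sketch, not the contents of `cmd/scan.go`: the `scanFlags` struct and `selectSource` are hypothetical, the function returns a kind name instead of constructing a source, and treating more than one selected input as an error is an assumption, since the decisions above do not say how conflicts are resolved.

```go
package main

import (
	"fmt"
	"strings"
)

// scanFlags is a hypothetical view of the parsed scan flags.
type scanFlags struct {
	Git       bool
	URL       string
	Clipboard bool
}

// selectSource returns which source kind to build: "git", "url",
// "clipboard", "stdin", or the default "file". args are the positional
// arguments that follow `keyhunter scan`.
func selectSource(f scanFlags, args []string) (string, error) {
	var picked []string
	if f.Git {
		picked = append(picked, "git")
	}
	if f.URL != "" {
		picked = append(picked, "url")
	}
	if f.Clipboard {
		picked = append(picked, "clipboard")
	}
	if len(args) == 1 && (args[0] == "stdin" || args[0] == "-") {
		picked = append(picked, "stdin")
	}
	switch len(picked) {
	case 0:
		return "file", nil // default: scan files and directories
	case 1:
		return picked[0], nil
	default:
		// Assumption: conflicting inputs are rejected rather than prioritized.
		return "", fmt.Errorf("conflicting input sources: %s", strings.Join(picked, ", "))
	}
}

func main() {
	kind, err := selectSource(scanFlags{Git: true}, []string{"."})
	fmt.Println(kind, err)
	kind, err = selectSource(scanFlags{}, []string{"-"})
	fmt.Println(kind, err)
	_, err = selectSource(scanFlags{URL: "https://example.com", Clipboard: true}, nil)
	fmt.Println(err)
}
```

Keeping the decision separate from source construction makes the flag precedence easy to unit-test without creating real sources.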

</decisions>

<code_context>
## Existing Code Insights

### Reusable Assets

- `pkg/engine/sources/source.go`: `Source` interface (keep unchanged)
- `pkg/engine/sources/file.go`: single-file `FileSource` (extend to `DirSource` or keep separate)
- `pkg/engine/engine.go`: three-stage pipeline (unchanged)
- `pkg/types/chunk.go`: `Chunk` struct (unchanged)
- `cmd/scan.go`: wire new flags + source selection

### Established Patterns

- Source emits `types.Chunk{Data, Source, Offset}` on the channel
- Context cancellation: all sources check `ctx.Done()` in their loops
- Error propagation: a source returns its error from `Chunks()`, and the engine drains the channel
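
The cancellation and error-propagation patterns above can be seen together in a small stand-alone model. Everything here is an assumption made for illustration: `Chunk` and `Source` are local stand-ins for the real types, `flakySource` is a made-up source that fails after emitting a few chunks, and `run` models only the "drain, then surface the error" behaviour, not the actual engine.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Chunk and Source are local stand-ins for types.Chunk and sources.Source.
type Chunk struct {
	Data   []byte
	Source string
	Offset int64
}

type Source interface {
	Chunks(ctx context.Context, out chan<- Chunk) error
}

var errSourceFailed = errors.New("source failed")

// flakySource emits n chunks and then returns errSourceFailed.
type flakySource struct{ n int }

func (s flakySource) Chunks(ctx context.Context, out chan<- Chunk) error {
	for i := 0; i < s.n; i++ {
		// Check for cancellation before doing any work for this chunk.
		if err := ctx.Err(); err != nil {
			return err
		}
		c := Chunk{Data: []byte("x"), Source: "flaky", Offset: int64(i)}
		// Also honour cancellation while blocked on a slow consumer.
		select {
		case out <- c:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return errSourceFailed
}

// run drains everything src emits and then returns the source's error.
// When cancelFirst is true the context is cancelled before the source starts.
func run(src Source, cancelFirst bool) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	out := make(chan Chunk)
	errc := make(chan error, 1)
	go func() {
		defer close(out) // closing ends the drain loop below
		errc <- src.Chunks(ctx, out)
	}()
	n := 0
	for range out {
		n++
	}
	return n, <-errc
}

func main() {
	n, err := run(flakySource{n: 3}, false)
	fmt.Println(n, err)
	n, err = run(flakySource{n: 3}, true)
	fmt.Println(n, err)
}
```

Draining before reading the error means chunks emitted ahead of a failure are still processed, and the producing goroutine is never left blocked on a send.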

### Integration Points

- `cmd/scan.go`: add flags `--exclude`, `--git`, `--url`, `--clipboard`, `--since`, `--max-file-size`, `--insecure`, `--follow-redirects`
- New files: `pkg/engine/sources/{dir,git,stdin,url,clipboard}.go`
- Tests: `pkg/engine/sources/*_test.go`

### go.mod additions needed

- `github.com/go-git/go-git/v5`
- `github.com/atotto/clipboard`
- `golang.org/x/exp/mmap`

</code_context>

<specifics>
## Specific Ideas

- Default exclusions should match common "generated/vendored" directories to reduce noise
- Dir source should emit files in a deterministic order for reproducible test output
- Git source should report the commit SHA in the finding source field: `"git:abc1234:path/to/file.go"`
- URL source should report the URL in the finding source field: `"url:https://example.com/..."`
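
The source-field formats and the ordering requirement above are simple enough to pin down in a few lines. This is a sketch with hypothetical helper names; abbreviating the SHA to 7 characters is an assumption inferred from the `abc1234` example, not a stated requirement.

```go
package main

import (
	"fmt"
	"sort"
)

// gitSourceLabel formats "git:<short-sha>:<path>". The 7-character
// abbreviation is assumed from the example in this document.
func gitSourceLabel(sha, path string) string {
	if len(sha) > 7 {
		sha = sha[:7]
	}
	return "git:" + sha + ":" + path
}

// urlSourceLabel formats "url:<url>".
func urlSourceLabel(rawURL string) string {
	return "url:" + rawURL
}

// sortedPaths returns a sorted copy of paths, for sources that collect
// paths from an unordered origin and need reproducible output.
func sortedPaths(paths []string) []string {
	out := append([]string(nil), paths...)
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(gitSourceLabel("abc1234def5678900000000000000000000000ff", "path/to/file.go"))
	fmt.Println(urlSourceLabel("https://example.com/app.js"))
	fmt.Println(sortedPaths([]string{"b.go", "a.go", "cmd/scan.go"}))
}
```

A directory walk built on `filepath.WalkDir` is already lexically ordered, so explicit sorting matters mainly where paths come from maps or parallel workers.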

</specifics>

<deferred>
## Deferred Ideas

- Clipboard watch mode (poll clipboard for changes): out of scope
- HTTP authentication (basic/bearer for `--url`): defer to Phase 5 or later
- SFTP/SCP URL schemes: out of scope
- Archive file extraction (`.zip`, `.tar.gz` scanning): defer to a later phase

</deferred>