---
phase: 04-input-sources
plan: 03
subsystem: input-sources
tags: [git, go-git, source-adapter, history-scan]
requires:
  - 04-01 # go.mod dependencies (go-git/v5)
  - pkg/engine/sources/source.go
  - pkg/types/chunk.go
provides:
  - sources.GitSource
  - sources.NewGitSource
affects:
  - pkg/engine/sources/
tech-stack:
  added:
    - github.com/go-git/go-git/v5 v5.17.2 (promoted from indirect to direct)
  patterns:
    - blob-level OID deduplication for git-history scanning
    - seed-ref enumeration (heads + tags + remotes + stash) feeding per-ref git log walks
    - commit-SHA-prefixed source attribution (git:<short-sha>:<path>)
key-files:
  created:
    - pkg/engine/sources/git.go
    - pkg/engine/sources/git_test.go
  modified:
    - go.mod
    - go.sum
decisions:
  - Used local helper `emitGitChunks` instead of shared `emitChunks` because plan 04-02 (dir.go) has not landed yet; slated for consolidation once 04-02 ships.
  - Also walk `refs/remotes/*` in seed collection so cloned repos without local branches still get full coverage.
  - Skip symbolic references (HEAD) via `ref.Type() == HashReference` filter to avoid double-walking what branch refs already cover.
metrics:
  duration: ~6m
  completed: 2026-04-05
  tasks: 1
  tests: 8
requirements:
  - INPUT-02
---

# Phase 4 Plan 3: GitSource Summary

GitSource walks every commit on every branch, tag, remote-tracking ref, and the stash using go-git/v5, deduplicates blob scans by OID, and emits chunks tagged `git:<short-sha>:<path>`, letting KeyHunter surface leaked keys that exist only in history.

## What Shipped

### `pkg/engine/sources/git.go` (~195 lines)

- **`GitSource` struct** with `RepoPath`, `Since time.Time`, `ChunkSize`.
- **`NewGitSource(path)`** factory that sets the default chunk size from the sources-package constant.
- **`Chunks(ctx, out)`** implements the `Source` interface:
  1. `git.PlainOpen` the repo (empty path / missing dir → error).
  2. `collectSeedCommits` enumerates all `refs/heads`, `refs/tags`, `refs/remotes`, and `refs/stash` hash references; annotated tags are resolved to their underlying commit.
  3. For each seed, `repo.Log(&git.LogOptions{From: seed})` walks ancestry; a `seenCommits` map prevents re-walking shared history across refs.
  4. Per commit: `Since` cutoff short-circuit (via `c.Author.When.Before`), then `emitCommitBlobs`.
  5. `emitCommitBlobs` streams `tree.Files()`, skipping already-seen OIDs (`seenBlobs` map), go-git `IsBinary` hits, and first-512-byte null-byte positives; text blobs are piped through `emitGitChunks` with the source string `git:<short-sha>:<path>`.
- **`emitGitChunks`** mirrors `file.go` overlap-chunking semantics (default 4096 with 256 overlap) so historic blobs are scanned with the same boundary guarantees as on-disk files.

### `pkg/engine/sources/git_test.go` (~180 lines, 8 subtests)

| Test | Guarantee |
|---|---|
| `TestGitSource_HistoryWalk` | 3-commit repo yields chunks whose sources all match `^git:[0-9a-f]{7}:.+$` |
| `TestGitSource_BlobDeduplication` | Two files with identical content → one blob scanned, not two |
| `TestGitSource_ModifiedFileKeepsBothVersions` | Editing `a.txt` preserves both old and new blobs in output |
| `TestGitSource_MultiBranch` | Checks out a `feature` branch, adds a file, returns to base, adds another file → both branches' unique blobs appear |
| `TestGitSource_TagReachesOldCommit` | Lightweight tag on an early commit keeps that commit reachable after HEAD moves on |
| `TestGitSource_SinceFilterExcludesAll` | `Since = now + 1h` emits zero chunks |
| `TestGitSource_SourceFormat` | Nested path `path/to/file.txt` round-trips in the source field |
| `TestGitSource_MissingRepo` | Non-existent path returns an error rather than panicking |

## Verification

```
go vet ./pkg/engine/sources/...                                        # clean
go test ./pkg/engine/sources/... -run TestGitSource -race -count=1 -v  # 8/8 PASS
go build ./pkg/engine/sources/...                                      # clean
```

Grep acceptance checks from the plan, all hit:

- `git.PlainOpen` → git.go:47
- `seenBlobs` → git.go:62, 143, 146
- `fmt.Sprintf("git:%s:%s"` → git.go:172
- `g.Since` → git.go:81

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 3 - Blocking] `emitChunks` helper not yet available**

- **Found during:** Task 1
- **Issue:** Plan referenced `emitChunks` from `pkg/engine/sources/dir.go`, which is produced by plan 04-02 (not yet executed in this wave). Compilation would have failed.
- **Fix:** Added a local `emitGitChunks` mirroring `FileSource`'s overlap-chunk logic, plus a local `gitBinarySniffSize` constant. Documented as a temporary shim slated for consolidation when 04-02 lands.
- **Files modified:** pkg/engine/sources/git.go
- **Commit:** e48a7a4

**2. [Rule 2 - Critical functionality] Walk remote-tracking refs as seeds**

- **Found during:** Task 1 (review of `collectSeedCommits`)
- **Issue:** A freshly cloned repo often has zero `refs/heads/*` entries locally (only `refs/remotes/origin/*`). Restricting seeds to branches+tags+stash would produce an empty scan in that common case.
- **Fix:** Also include `name.IsRemote()` refs in the seed set. Filter out symbolic refs (HEAD) via `ref.Type() == plumbing.HashReference` to avoid duplicate walks.
- **Files modified:** pkg/engine/sources/git.go
- **Commit:** e48a7a4

### Out-of-Scope / Deferred

- Shared `emitChunks` consolidation: to be handled in 04-02 + follow-up cleanup.
- Parallel blob scanning via ants pool (noted in 04-CONTEXT.md as a performance idea) is deferred; the current single-goroutine walk is already correct and respects context cancellation.

## go-git Version

Resolved by plan 04-01: **`github.com/go-git/go-git/v5 v5.17.2`** (promoted from indirect to direct in `go.mod` by this plan).
## Commits

- `e48a7a4` - feat(04-03): implement GitSource with full-history traversal

## Self-Check: PASSED

- FOUND: pkg/engine/sources/git.go
- FOUND: pkg/engine/sources/git_test.go
- FOUND: commit e48a7a4
- Tests: 8/8 pass under `-race`
- `go vet` clean