2026-04-05 15:19:15 +03:00

phase: 04-input-sources
plan: 03
subsystem: input-sources
tags: git, go-git, source-adapter, history-scan
requires: 04-01 (pkg/engine/sources/source.go, pkg/types/chunk.go)
provides: sources.GitSource, sources.NewGitSource
affects: pkg/engine/sources/
tech-stack:
  added:
    • github.com/go-git/go-git/v5 v5.17.2 (promoted from indirect to direct)
  patterns:
    • blob-level OID deduplication for git-history scanning
    • seed-ref enumeration (heads + tags + remotes + stash) feeding per-ref git log walks
    • commit-SHA-prefixed source attribution (git:<short-sha>:<path>)
key-files:
  created:
    • pkg/engine/sources/git.go
    • pkg/engine/sources/git_test.go
  modified:
    • go.mod
    • go.sum
decisions:
  • Used local helper `emitGitChunks` instead of shared `emitChunks` because plan 04-02 (dir.go) has not landed yet; slated for consolidation once 04-02 ships.
  • Also walk `refs/remotes/*` in seed collection so cloned repos without local branches still get full coverage.
  • Skip symbolic references (HEAD) via `ref.Type() == HashReference` filter to avoid double-walking what branch refs already cover.
metrics:
  duration: ~6m
  completed: 2026-04-05
  tasks: 1
  tests: 8
requirements: INPUT-02

Phase 4 Plan 3: GitSource Summary

GitSource uses go-git/v5 to walk every commit reachable from every branch, tag, remote-tracking ref, and the stash. It deduplicates blob scans by OID and emits chunks tagged git:<short-sha>:<path>, which lets KeyHunter surface leaked keys that exist only in history.

What Shipped

pkg/engine/sources/git.go (~195 lines)

  • GitSource struct with RepoPath, Since time.Time, ChunkSize.
  • NewGitSource(path) factory that sets the default chunk size from the sources-package constant.
  • Chunks(ctx, out) implements the Source interface:
    1. git.PlainOpen the repo (empty path / missing dir → error).
    2. collectSeedCommits enumerates all refs/heads, refs/tags, refs/remotes, and refs/stash hash references; annotated tags are resolved to their underlying commit.
    3. For each seed, repo.Log(&git.LogOptions{From: seed}) walks ancestry; a seenCommits map prevents re-walking shared history across refs.
    4. Per commit: Since cutoff short-circuit (via c.Author.When.Before), then emitCommitBlobs.
    5. emitCommitBlobs streams tree.Files(), skipping already-seen OIDs (seenBlobs map), go-git IsBinary hits, and first-512-byte null-byte positives; text blobs are piped through emitGitChunks with the source string git:<short-sha>:<path>.
  • emitGitChunks mirrors file.go overlap-chunking semantics (default 4096 with 256 overlap) so historic blobs are scanned with the same boundary guarantees as on-disk files.
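
The overlap-chunking behaviour described for emitGitChunks can be sketched roughly as follows. This is a minimal illustration, not the code in git.go: the function name chunkOverlap and its signature are hypothetical, and the real helper sends chunks to a channel rather than returning a slice. The sketch only shows how each window re-reads the tail of the previous one so a key straddling a boundary is still seen whole.

```go
package main

import "fmt"

// chunkOverlap splits data into windows of at most size bytes. Every window
// after the first starts overlap bytes before the previous window ended.
// Hypothetical helper for illustration; not the actual emitGitChunks.
func chunkOverlap(data []byte, size, overlap int) [][]byte {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid parameters: the window would never advance
	}
	var chunks [][]byte
	step := size - overlap
	for start := 0; start < len(data); start += step {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		chunks = append(chunks, data[start:end])
		if end == len(data) {
			break // the last window reached the end of the blob
		}
	}
	return chunks
}

func main() {
	// Small numbers keep the output readable; the summary above quotes
	// defaults of 4096 with 256 overlap.
	for i, c := range chunkOverlap([]byte("0123456789abcdef"), 8, 2) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With a window of 8 and an overlap of 2, the second chunk begins at offset 6, so the last two bytes of the first chunk appear again at the start of the second.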

pkg/engine/sources/git_test.go (~180 lines, 8 subtests)

Test                                         Guarantee
TestGitSource_HistoryWalk                    3-commit repo yields chunks whose sources all match ^git:[0-9a-f]{7}:.+$
TestGitSource_BlobDeduplication              Two files with identical content → one blob scanned, not two
TestGitSource_ModifiedFileKeepsBothVersions  Editing a.txt preserves both old and new blobs in output
TestGitSource_MultiBranch                    Checks out a feature branch, adds a file, returns to base, adds another file → both branches' unique blobs appear
TestGitSource_TagReachesOldCommit            Lightweight tag on an early commit keeps that commit reachable after HEAD moves on
TestGitSource_SinceFilterExcludesAll         Since = now + 1h emits zero chunks
TestGitSource_SourceFormat                   Nested path path/to/file.txt round-trips in the source field
TestGitSource_MissingRepo                    Non-existent path returns an error rather than panicking
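
The blob-deduplication guarantee follows from Git's content addressing: two files with identical bytes share one blob OID. The sketch below illustrates that idea with the standard library only. It assumes the SHA-1 object format, and the names blobOID, file, and scanUnique are hypothetical; the real implementation reads OIDs from go-git rather than recomputing them.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// blobOID computes the SHA-1 object ID Git assigns to a blob: the hash of
// the header "blob <size>\x00" followed by the raw content.
func blobOID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

// file is a hypothetical stand-in for a tree entry.
type file struct {
	Path    string
	Content []byte
}

// scanUnique returns the paths that would actually be scanned when blobs
// already seen (by OID) are skipped, mirroring the seenBlobs idea.
func scanUnique(files []file) []string {
	seenBlobs := make(map[string]struct{})
	var scanned []string
	for _, f := range files {
		oid := blobOID(f.Content)
		if _, dup := seenBlobs[oid]; dup {
			continue // identical content was already scanned under another path
		}
		seenBlobs[oid] = struct{}{}
		scanned = append(scanned, f.Path)
	}
	return scanned
}

func main() {
	files := []file{
		{"a.txt", []byte("key=1\n")},
		{"b.txt", []byte("key=1\n")}, // same bytes as a.txt, so same OID
		{"c.txt", []byte("key=2\n")},
	}
	fmt.Println(scanUnique(files))
}
```

Because the OID depends only on content, the same skip also covers a file that is unchanged across many commits, which is where most of the savings in a history scan would come from.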

Verification

go vet ./pkg/engine/sources/...                                        # clean
go test ./pkg/engine/sources/... -run TestGitSource -race -count=1 -v  # 8/8 PASS
go build ./pkg/engine/sources/...                                      # clean

Grep acceptance checks from the plan — all hit:

  • git.PlainOpen → git.go:47
  • seenBlobs → git.go:62, 143, 146
  • fmt.Sprintf("git:%s:%s" → git.go:172
  • g.Since → git.go:81

Deviations from Plan

Auto-fixed Issues

1. [Rule 3 - Blocking] emitChunks helper not yet available

  • Found during: Task 1
  • Issue: Plan referenced emitChunks from pkg/engine/sources/dir.go, which is produced by plan 04-02 (not yet executed in this wave). Compilation would have failed.
  • Fix: Added a local emitGitChunks mirroring FileSource's overlap-chunk logic, plus a local gitBinarySniffSize constant. Documented as a temporary shim slated for consolidation when 04-02 lands.
  • Files modified: pkg/engine/sources/git.go
  • Commit: e48a7a4
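
A null-byte sniff of the kind gitBinarySniffSize supports might look like the sketch below. The constant name and the 512-byte window are taken from this summary; the function hasNullByte is a hypothetical name, and the real check in git.go also consults go-git's own binary detection, which is not reproduced here.

```go
package main

import (
	"bytes"
	"fmt"
)

// gitBinarySniffSize is the number of leading bytes inspected for a NUL.
// The value 512 follows the "first-512-byte" wording in this summary.
const gitBinarySniffSize = 512

// hasNullByte reports whether a NUL byte occurs within the sniff window.
// Text files rarely contain NUL, so a hit is treated as "probably binary".
func hasNullByte(data []byte) bool {
	head := data
	if len(head) > gitBinarySniffSize {
		head = head[:gitBinarySniffSize]
	}
	return bytes.IndexByte(head, 0) >= 0
}

func main() {
	fmt.Println(hasNullByte([]byte("API_KEY=abc123\n"))) // plain text
	fmt.Println(hasNullByte([]byte{'P', 'K', 0, 1}))     // NUL in the header
}
```

Limiting the check to a fixed prefix keeps the cost constant per blob, at the price of missing a NUL that first appears later in the file.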

2. [Rule 2 - Critical functionality] Walk remote-tracking refs as seeds

  • Found during: Task 1 (review of collectSeedCommits)
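  • Issue: A freshly cloned repo typically has at most one local refs/heads/* entry (none after a detached-HEAD checkout); every other branch exists only under refs/remotes/origin/*. Restricting seeds to branches+tags+stash would miss that history, or produce an empty scan when there is no local branch at all.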
  • Fix: Also include name.IsRemote() refs in the seed set. Filter out symbolic refs (HEAD) via ref.Type() == plumbing.HashReference to avoid duplicate walks.
  • Files modified: pkg/engine/sources/git.go
  • Commit: e48a7a4
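
The seed-selection rule from this fix can be modelled without go-git as a filter over reference names. Everything in the sketch is a simplified assumption: the ref struct, isSeedRef, and collectSeeds are hypothetical, the hashes are placeholders, and deduplicating seeds by hash is added for illustration (the summary only states that a seenCommits map prevents re-walking). The real collectSeedCommits works on go-git reference objects and also resolves annotated tags.

```go
package main

import (
	"fmt"
	"strings"
)

// ref is a hypothetical, simplified view of a Git reference.
type ref struct {
	Name     string
	Symbolic bool   // true for refs such as HEAD that point at another ref
	Hash     string // target object hash for non-symbolic refs
}

// isSeedRef keeps hash references under heads, tags, remotes, and the stash.
func isSeedRef(r ref) bool {
	if r.Symbolic {
		return false // the ref it points at is already covered
	}
	return strings.HasPrefix(r.Name, "refs/heads/") ||
		strings.HasPrefix(r.Name, "refs/tags/") ||
		strings.HasPrefix(r.Name, "refs/remotes/") ||
		r.Name == "refs/stash"
}

// collectSeeds returns the distinct target hashes of all seed refs, in order.
func collectSeeds(refs []ref) []string {
	seen := make(map[string]struct{})
	var seeds []string
	for _, r := range refs {
		if !isSeedRef(r) {
			continue
		}
		if _, dup := seen[r.Hash]; dup {
			continue
		}
		seen[r.Hash] = struct{}{}
		seeds = append(seeds, r.Hash)
	}
	return seeds
}

func main() {
	refs := []ref{
		{Name: "HEAD", Symbolic: true},
		{Name: "refs/remotes/origin/HEAD", Symbolic: true},
		{Name: "refs/remotes/origin/main", Hash: "aaa"},
		{Name: "refs/tags/v0.1", Hash: "bbb"},
		{Name: "refs/stash", Hash: "ccc"},
		{Name: "refs/notes/commits", Hash: "ddd"},
	}
	fmt.Println(collectSeeds(refs))
}
```

In this sample there is no refs/heads/* entry at all, yet the remote-tracking ref still contributes a seed, which is the case the fix is meant to cover.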

Out-of-Scope / Deferred

  • Shared emitChunks consolidation: to be handled in 04-02 + follow-up cleanup.
  • Parallel blob scanning via an ants pool (noted in 04-CONTEXT.md as a performance idea): deferred. The current single-goroutine walk is already correct and respects context cancellation.

go-git Version

Version resolved by plan 04-01: github.com/go-git/go-git/v5 v5.17.2. This plan promoted it from an indirect to a direct dependency in go.mod.

Commits

  • e48a7a4 — feat(04-03): implement GitSource with full-history traversal

Self-Check: PASSED

  • FOUND: pkg/engine/sources/git.go
  • FOUND: pkg/engine/sources/git_test.go
  • FOUND: commit e48a7a4
  • Tests: 8/8 pass under -race
  • go vet clean