From e8ba651a51d737bc80e94fc41a3d8eee6bddd911 Mon Sep 17 00:00:00 2001
From: salvacybersec
Date: Sun, 5 Apr 2026 15:19:15 +0300
Subject: [PATCH] docs(04-03): complete GitSource plan

---
 .planning/REQUIREMENTS.md                          |   2 +-
 .planning/ROADMAP.md                               |   2 +-
 .planning/STATE.md                                 |  14 +-
 .../phases/04-input-sources/04-03-SUMMARY.md       | 124 ++++++++++++++++++
 4 files changed, 134 insertions(+), 8 deletions(-)
 create mode 100644 .planning/phases/04-input-sources/04-03-SUMMARY.md

diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index a80793b..6b4054a 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -33,7 +33,7 @@ Requirements for initial release. Each maps to roadmap phases.
 ### Input Sources
 
 - [ ] **INPUT-01**: File and directory scanning with recursive traversal and glob exclusion patterns
-- [ ] **INPUT-02**: Git-aware scanning — full history, branches, stash, delta-based diffs
+- [x] **INPUT-02**: Git-aware scanning — full history, branches, stash, delta-based diffs
 - [ ] **INPUT-03**: Git scanning supports --since flag for time-scoped history scan
 - [ ] **INPUT-04**: stdin/pipe input support (cat file | keyhunter scan stdin)
 - [ ] **INPUT-05**: URL fetching — scan content from any remote URL
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 3454bef..ea7982f 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -106,7 +106,7 @@ Plans:
 Plans:
 - [x] 04-01-PLAN.md — Wave 0: add go-git/v5, atotto/clipboard, golang.org/x/exp/mmap dependencies
 - [ ] 04-02-PLAN.md — DirSource: recursive walk, glob exclusion, binary skip, mmap for large files (INPUT-01, CORE-07)
-- [ ] 04-03-PLAN.md — GitSource: full-history scan across branches/tags with blob dedup and --since (INPUT-02)
+- [x] 04-03-PLAN.md — GitSource: full-history scan across branches/tags with blob dedup and --since (INPUT-02)
 - [ ] 04-04-PLAN.md — StdinSource, URLSource, ClipboardSource (INPUT-03, INPUT-04, INPUT-05)
 - [ ] 04-05-PLAN.md — cmd/scan.go source-selection dispatch wiring all new sources (INPUT-06)
 
diff --git a/.planning/STATE.md b/.planning/STATE.md
index cc70764..2efccbf 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: executing
-stopped_at: Completed 04-01-PLAN.md
-last_updated: "2026-04-05T12:15:22.450Z"
+stopped_at: Completed 04-03-PLAN.md
+last_updated: "2026-04-05T12:19:11.974Z"
 last_activity: 2026-04-05
 progress:
   total_phases: 18
   completed_phases: 3
   total_plans: 23
-  completed_plans: 19
+  completed_plans: 20
   percent: 20
 ---
 
@@ -26,7 +26,7 @@ See: .planning/PROJECT.md (updated 2026-04-04)
 ## Current Position
 
 Phase: 04 (input-sources) — EXECUTING
-Plan: 2 of 5
+Plan: 3 of 5
 Status: Ready to execute
 Last activity: 2026-04-05
 
@@ -66,6 +66,7 @@ Progress: [██░░░░░░░░] 20%
 | Phase 03-tier-3-9-providers P01 | 3m | 2 tasks | 32 files |
 | Phase 03 P08 | 2min | 1 tasks | 1 files |
 | Phase 04 P01 | 1m | 1 tasks | 2 files |
+| Phase 04-input-sources P03 | 6m | 1 tasks | 2 files |
 
 ## Accumulated Context
 
@@ -86,6 +87,7 @@ Recent decisions affecting current work:
 - [Phase 02-tier-1-2-providers]: AWS Bedrock verify URL left empty — SigV4 signing deferred to Phase 5 verification engine
 - [Phase 03-tier-3-9-providers]: Keyword-only detection for providers without documented key prefixes (You.com, Unstructured, Runway, Midjourney) to avoid false positives.
 - [Phase 04]: Use 'go mod download' instead of 'go mod tidy' when bootstrapping dependencies ahead of their consumers
+- [Phase 04-input-sources]: GitSource walks heads+tags+remotes+stash with per-OID blob dedup
 
 ### Pending Todos
 
@@ -100,6 +102,6 @@ None yet.
 
 ## Session Continuity
 
-Last session: 2026-04-05T12:15:22.447Z
-Stopped at: Completed 04-01-PLAN.md
+Last session: 2026-04-05T12:19:11.971Z
+Stopped at: Completed 04-03-PLAN.md
 Resume file: None
diff --git a/.planning/phases/04-input-sources/04-03-SUMMARY.md b/.planning/phases/04-input-sources/04-03-SUMMARY.md
new file mode 100644
index 0000000..ccd6e74
--- /dev/null
+++ b/.planning/phases/04-input-sources/04-03-SUMMARY.md
@@ -0,0 +1,124 @@
+---
+phase: 04-input-sources
+plan: 03
+subsystem: input-sources
+tags: [git, go-git, source-adapter, history-scan]
+requires:
+  - 04-01 # go.mod dependencies (go-git/v5)
+  - pkg/engine/sources/source.go
+  - pkg/types/chunk.go
+provides:
+  - sources.GitSource
+  - sources.NewGitSource
+affects:
+  - pkg/engine/sources/
+tech-stack:
+  added:
+    - github.com/go-git/go-git/v5 v5.17.2 (promoted from indirect to direct)
+  patterns:
+    - blob-level OID deduplication for git-history scanning
+    - seed-ref enumeration (heads + tags + remotes + stash) feeding per-ref git log walks
+    - commit-SHA-prefixed source attribution (git:<short-sha>:<path>)
+key-files:
+  created:
+    - pkg/engine/sources/git.go
+    - pkg/engine/sources/git_test.go
+  modified:
+    - go.mod
+    - go.sum
+decisions:
+  - Used local helper `emitGitChunks` instead of shared `emitChunks` because plan 04-02 (dir.go) has not landed yet; slated for consolidation once 04-02 ships.
+  - Also walk `refs/remotes/*` in seed collection so cloned repos without local branches still get full coverage.
+  - Skip symbolic references (HEAD) via `ref.Type() == HashReference` filter to avoid double-walking what branch refs already cover.
+metrics:
+  duration: ~6m
+  completed: 2026-04-05
+  tasks: 1
+  tests: 8
+requirements:
+  - INPUT-02
+---
+
+# Phase 4 Plan 3: GitSource Summary
+
+GitSource walks every commit on every branch, tag, remote-tracking ref, and the stash using go-git/v5, deduplicates blob scans by OID, and emits chunks tagged `git:<short-sha>:<path>` — letting KeyHunter surface leaked keys that exist only in history.
+
+## What Shipped
+
+### `pkg/engine/sources/git.go` (~195 lines)
+
+- **`GitSource` struct** with `RepoPath`, `Since time.Time`, `ChunkSize`.
+- **`NewGitSource(path)`** factory that sets the default chunk size from the sources-package constant.
+- **`Chunks(ctx, out)`** implements the `Source` interface:
+  1. `git.PlainOpen` the repo (empty path / missing dir → error).
+  2. `collectSeedCommits` enumerates all `refs/heads`, `refs/tags`, `refs/remotes`, and `refs/stash` hash references; annotated tags are resolved to their underlying commit.
+  3. For each seed, `repo.Log(&git.LogOptions{From: seed})` walks ancestry; a `seenCommits` map prevents re-walking shared history across refs.
+  4. Per commit: `Since` cutoff short-circuit (via `c.Author.When.Before`), then `emitCommitBlobs`.
+  5. `emitCommitBlobs` streams `tree.Files()`, skipping already-seen OIDs (`seenBlobs` map), go-git `IsBinary` hits, and first-512-byte null-byte positives; text blobs are piped through `emitGitChunks` with the source string `git:<short-sha>:<path>`.
+- **`emitGitChunks`** mirrors `file.go` overlap-chunking semantics (default 4096 with 256 overlap) so historic blobs are scanned with the same boundary guarantees as on-disk files.
+
+### `pkg/engine/sources/git_test.go` (~180 lines, 8 subtests)
+
+| Test | Guarantee |
+|---|---|
+| `TestGitSource_HistoryWalk` | 3-commit repo yields chunks whose sources all match `^git:[0-9a-f]{7}:.+$` |
+| `TestGitSource_BlobDeduplication` | Two files with identical content → one blob scanned, not two |
+| `TestGitSource_ModifiedFileKeepsBothVersions` | Editing `a.txt` preserves both old and new blobs in output |
+| `TestGitSource_MultiBranch` | Checks out a `feature` branch, adds a file, returns to base, adds another file → both branches' unique blobs appear |
+| `TestGitSource_TagReachesOldCommit` | Lightweight tag on an early commit keeps that commit reachable after HEAD moves on |
+| `TestGitSource_SinceFilterExcludesAll` | `Since = now + 1h` emits zero chunks |
+| `TestGitSource_SourceFormat` | Nested path `path/to/file.txt` round-trips in the source field |
+| `TestGitSource_MissingRepo` | Non-existent path returns an error rather than panicking |
+
+## Verification
+
+```
+go vet ./pkg/engine/sources/... # clean
+go test ./pkg/engine/sources/... -run TestGitSource -race -count=1 -v # 8/8 PASS
+go build ./pkg/engine/sources/... # clean
+```
+
+Grep acceptance checks from the plan — all hit:
+- `git.PlainOpen` → git.go:47
+- `seenBlobs` → git.go:62, 143, 146
+- `fmt.Sprintf("git:%s:%s"` → git.go:172
+- `g.Since` → git.go:81
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+**1. [Rule 3 - Blocking] `emitChunks` helper not yet available**
+- **Found during:** Task 1
+- **Issue:** Plan referenced `emitChunks` from `pkg/engine/sources/dir.go`, which is produced by plan 04-02 (not yet executed in this wave). Compilation would have failed.
+- **Fix:** Added a local `emitGitChunks` mirroring `FileSource`'s overlap-chunk logic, plus a local `gitBinarySniffSize` constant. Documented as a temporary shim slated for consolidation when 04-02 lands.
+- **Files modified:** pkg/engine/sources/git.go
+- **Commit:** e48a7a4
+
+**2. [Rule 2 - Critical functionality] Walk remote-tracking refs as seeds**
+- **Found during:** Task 1 (review of `collectSeedCommits`)
+- **Issue:** A freshly cloned repo often has zero `refs/heads/*` entries locally (only `refs/remotes/origin/*`). Restricting seeds to branches+tags+stash would produce an empty scan in that common case.
+- **Fix:** Also include `name.IsRemote()` refs in the seed set. Filter out symbolic refs (HEAD) via `ref.Type() == plumbing.HashReference` to avoid duplicate walks.
+- **Files modified:** pkg/engine/sources/git.go
+- **Commit:** e48a7a4
+
+### Out-of-Scope / Deferred
+
+- Shared `emitChunks` consolidation: to be handled in 04-02 + follow-up cleanup.
+- Parallel blob scanning via ants pool (noted in 04-CONTEXT.md as a performance idea) — deferred; current single-goroutine walk is already correct and respects context cancellation.
+
+## go-git Version
+
+Resolved by plan 04-01: **`github.com/go-git/go-git/v5 v5.17.2`** (promoted from indirect to direct in `go.mod` by this plan).
+
+## Commits
+
+- `e48a7a4` — feat(04-03): implement GitSource with full-history traversal
+
+## Self-Check: PASSED
+
+- FOUND: pkg/engine/sources/git.go
+- FOUND: pkg/engine/sources/git_test.go
+- FOUND: commit e48a7a4
+- Tests: 8/8 pass under `-race`
+- `go vet` clean
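
Reviewer sketch (not part of the patch): the per-OID blob dedup and commit-prefixed source string that the summary describes can be illustrated with the standard library alone. Everything below is an assumption made for illustration: the `blob` and `commit` types, `blobOID`, `shortSHA`, `scan`, and `demoHistory` are hypothetical names, not code from `git.go`, and the 7-character SHA prefix is inferred from the test regex `^git:[0-9a-f]{7}:.+$`. The real implementation takes OIDs from go-git rather than hashing content itself.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// blob is a hypothetical stand-in for one file in a commit's tree.
type blob struct {
	path    string
	content []byte
}

// commit is a hypothetical stand-in for a commit and the files it contains.
type commit struct {
	sha   string
	blobs []blob
}

// blobOID computes git's object ID for blob content:
// SHA-1 over the header "blob <len>\x00" followed by the content.
func blobOID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

// shortSHA returns the first 7 characters of a commit SHA (assumed prefix length).
func shortSHA(sha string) string {
	if len(sha) < 7 {
		return sha
	}
	return sha[:7]
}

// scan visits commits in slice order and returns one source string per
// not-yet-seen blob OID, formatted like fmt.Sprintf("git:%s:%s", ...).
func scan(commits []commit) []string {
	seenBlobs := map[string]bool{}
	var sources []string
	for _, c := range commits {
		for _, b := range c.blobs {
			oid := blobOID(b.content)
			if seenBlobs[oid] {
				continue // identical content was already scanned elsewhere
			}
			seenBlobs[oid] = true
			sources = append(sources, fmt.Sprintf("git:%s:%s", shortSHA(c.sha), b.path))
		}
	}
	return sources
}

// demoHistory builds a two-commit history: b.txt duplicates a.txt in the
// first commit, and a.txt is edited in the second.
func demoHistory() []commit {
	return []commit{
		{sha: "aaaaaaa000000000000000000000000000000000", blobs: []blob{
			{path: "a.txt", content: []byte("KEY=1\n")},
			{path: "b.txt", content: []byte("KEY=1\n")},
		}},
		{sha: "bbbbbbb000000000000000000000000000000000", blobs: []blob{
			{path: "a.txt", content: []byte("KEY=2\n")},
			{path: "b.txt", content: []byte("KEY=1\n")},
		}},
	}
}

func main() {
	for _, s := range scan(demoHistory()) {
		fmt.Println(s)
	}
}
```

Running it prints `git:aaaaaaa:a.txt` and `git:bbbbbbb:a.txt`: the duplicate `b.txt` is skipped, while both versions of `a.txt` are kept, which is the behaviour `TestGitSource_BlobDeduplication` and `TestGitSource_ModifiedFileKeepsBothVersions` are described as guarding. The sketch attributes each blob to the first commit visited in slice order; the commit a real scan reports depends on the order in which `repo.Log` yields history.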