---
phase: 07-import-cicd
plan: 03
subsystem: importer+output
tags: [importer, dedup, sarif, cicd, github-code-scanning]
requires:
  - pkg/engine/finding.go
  - pkg/output/sarif.go
provides:
  - pkg/importer.FindingKey
  - pkg/importer.Dedup
  - testdata/sarif/sarif-2.1.0-minimal-schema.json
  - pkg/output.TestSARIFGitHubValidation
affects:
  - pkg/importer
  - pkg/output
tech-stack:
  added: []
  patterns:
    - "SHA-256 stable hash for dedup identity (provider\\x00masked\\x00source\\x00line)"
    - "Fixture-driven schema validation via JSON field list, stdlib only"
key-files:
  created:
    - pkg/importer/dedup.go
    - pkg/importer/dedup_test.go
    - pkg/output/sarif_github_test.go
    - testdata/sarif/sarif-2.1.0-minimal-schema.json
  modified: []
decisions:
  - "FindingKey uses provider+masked+source+line as identity tuple; DetectedAt and Confidence intentionally excluded so re-imports collapse"
  - "SARIF validation uses a hand-rolled required-fields fixture rather than the full 500KB SARIF 2.1.0 schema; zero external deps, targets GitHub's enforced surface"
  - "Test file walks ../.. to locate repo root rather than hardcoding paths, keeping the fixture relocatable"
metrics:
  duration: "~4 min"
  completed: 2026-04-05
requirements: [IMP-03, CICD-02]
---

# Phase 7 Plan 03: Dedup Helper & SARIF GitHub Validation Summary

Adds a stable dedup identity hash for imported findings (IMP-03) and a standalone test proving Phase 6's `SARIFFormatter` output satisfies GitHub Code Scanning's required-field surface (CICD-02).

## What Was Built

### Task 1: Dedup helper (`pkg/importer/dedup.go`)

- `FindingKey(engine.Finding) string`: hex-encoded SHA-256 over `provider\x00masked\x00source\x00line`. Fields outside that tuple (DetectedAt, Confidence, VerifyStatus, ...) do not participate; re-running the same import later must collapse onto the original finding.
- `Dedup([]engine.Finding) ([]engine.Finding, int)`: preserves first-seen order, returns the deduplicated slice plus count of duplicates dropped. Callers use the drop count to surface `"Imported N findings (M new, K duplicates)"`.

Tests (`pkg/importer/dedup_test.go`, 8 total, all passing):

- `TestFindingKey_Stable`
- `TestFindingKey_DiffersByProvider / ByMasked / BySource / ByLine`
- `TestDedup_PreservesOrder`: `[A,B,A,C,B] -> [A,B,C]`, dropped=2
- `TestDedup_Empty`: nil input yields empty slice, 0 dropped
- `TestDedup_IgnoresUnrelatedFields`: identical identity but different `DetectedAt`/`Confidence` collapses to one, first-seen wins

### Task 2: SARIF GitHub validation test

- `testdata/sarif/sarif-2.1.0-minimal-schema.json`: required-fields subset for GitHub Code Scanning (top-level, run, tool.driver, result, location.physicalLocation, region, allowed result levels). Hand-curated from https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning rather than shipping the 500KB full schema.
- `pkg/output/sarif_github_test.go`:
  - `TestSARIFGitHubValidation`: renders 3 findings (high/medium/low confidence) through `SARIFFormatter.Format`, unmarshals to `map[string]any`, then walks the document and asserts:
    - every required field in the fixture exists
    - `version == "2.1.0"`
    - `$schema` is a non-empty `https://` URL
    - exactly one run
    - `tool.driver.name == "keyhunter"`
    - each result has `ruleId`, `level ∈ allowed_levels`, non-empty `message.text`, and non-empty `locations`
    - every `physicalLocation.region.startLine >= 1`

    Includes one finding with `LineNumber: 0` to prove the Phase 6 `startLine` floor converts to 1.
  - `TestSARIFGitHubValidation_EmptyFindings`: empty input still produces a valid document with `runs[0].results == []` (not `null`) and tool.driver populated.
  - Uses stdlib only (`encoding/json`, `os`, `path/filepath`, `strings`, `testing`). Test walks `../../testdata/sarif/...` from the package directory so the fixture stays relocatable.
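The required-field walk described for `TestSARIFGitHubValidation` can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `sarif_github_test.go`: the helper name `checkSARIF` is hypothetical, the allowed levels are taken from the SARIF 2.1.0 `level` enumeration rather than from the fixture (whose contents are not shown here), and the sample document in `main` is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// allowedLevels mirrors the SARIF 2.1.0 result.level enumeration.
// Assumption: the real fixture's allowed-level list may be narrower.
var allowedLevels = map[string]bool{"none": true, "note": true, "warning": true, "error": true}

// checkSARIF (hypothetical helper) returns one message per violated rule;
// an empty result means the document passed every check in this sketch.
func checkSARIF(raw []byte) []string {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return []string{"invalid JSON: " + err.Error()}
	}
	var problems []string
	fail := func(format string, args ...any) {
		problems = append(problems, fmt.Sprintf(format, args...))
	}

	if v, _ := doc["version"].(string); v != "2.1.0" {
		fail("version must be 2.1.0, got %q", v)
	}
	if s, _ := doc["$schema"].(string); !strings.HasPrefix(s, "https://") {
		fail("$schema must be an https:// URL, got %q", s)
	}
	runs, _ := doc["runs"].([]any)
	if len(runs) != 1 {
		fail("expected exactly one run, got %d", len(runs))
		return problems
	}
	// Failed type assertions yield nil maps, and reading a nil map is safe in Go.
	run, _ := runs[0].(map[string]any)
	tool, _ := run["tool"].(map[string]any)
	driver, _ := tool["driver"].(map[string]any)
	if name, _ := driver["name"].(string); name != "keyhunter" {
		fail("tool.driver.name must be keyhunter, got %q", name)
	}
	// JSON null decodes to a nil interface, so the assertion fails for null.
	results, ok := run["results"].([]any)
	if !ok {
		fail("runs[0].results must be an array, not null or missing")
	}
	for i, r := range results {
		res, _ := r.(map[string]any)
		if id, _ := res["ruleId"].(string); id == "" {
			fail("results[%d]: missing ruleId", i)
		}
		if lvl, _ := res["level"].(string); !allowedLevels[lvl] {
			fail("results[%d]: level %q not allowed", i, lvl)
		}
		msg, _ := res["message"].(map[string]any)
		if text, _ := msg["text"].(string); text == "" {
			fail("results[%d]: empty message.text", i)
		}
		locs, _ := res["locations"].([]any)
		if len(locs) == 0 {
			fail("results[%d]: empty locations", i)
		}
		for j, l := range locs {
			loc, _ := l.(map[string]any)
			phys, _ := loc["physicalLocation"].(map[string]any)
			region, _ := phys["region"].(map[string]any)
			// encoding/json decodes JSON numbers into float64 for map[string]any.
			if line, _ := region["startLine"].(float64); line < 1 {
				fail("results[%d].locations[%d]: startLine must be >= 1", i, j)
			}
		}
	}
	return problems
}

func main() {
	// Made-up document with startLine 0, the case the Phase 6 floor is meant to prevent.
	doc := []byte(`{
	  "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
	  "version": "2.1.0",
	  "runs": [{
	    "tool": {"driver": {"name": "keyhunter"}},
	    "results": [{
	      "ruleId": "example-rule",
	      "level": "warning",
	      "message": {"text": "example finding"},
	      "locations": [{"physicalLocation": {"region": {"startLine": 0}}}]
	    }]
	  }]
	}`)
	for _, p := range checkSARIF(doc) {
		fmt.Println("FAIL:", p)
	}
}
```

Collecting every violation instead of stopping at the first keeps a single test run informative when several fields regress together; the actual test may well use `t.Errorf` for the same effect.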
## Commits

| Task | Commit  | Description                                        |
| ---- | ------- | -------------------------------------------------- |
| 1    | 6a3d5b0 | feat(07-03): dedup helper for imported findings    |
| 2    | bd8eb9b | test(07-03): SARIF GitHub code scanning validation |

## Verification

```
$ go test ./pkg/importer/... ./pkg/output/...
ok      github.com/salvacybersec/keyhunter/pkg/importer (cached)
ok      github.com/salvacybersec/keyhunter/pkg/output   0.008s
```

All 8 importer dedup tests + 2 SARIF validation tests pass. Both targeted `go test` invocations from the plan's verify blocks pass first-try with no deviations.

## Deviations from Plan

None - plan executed exactly as written.

## Downstream Consumers

- **Plan 07-04** (import CLI command): will call `importer.Dedup` after parsing TruffleHog/Gitleaks output and report `(unique, dropped)` to the user before inserting into storage.
- **CICD-02 closure**: `TestSARIFGitHubValidation` functions as a regression guard: any future change to `SARIFFormatter` that breaks GitHub Code Scanning upload compatibility will fail this test before reaching users.

## Key Decisions

1. **Identity tuple excludes verification/timing fields.** A verified finding re-imported later should still collapse onto the original. Only provider + masked key + source + line number define identity.
2. **Hand-rolled required-fields fixture over full SARIF schema.** GitHub enforces a small, well-documented subset. Shipping the full 500KB JSON Schema and a validator library would bloat the test binary without catching more bugs that matter to GitHub uploads.
3. **Line number floor asserted in test.** Makes the Phase 6 `startLine = max(line, 1)` behavior a contract rather than an incidental implementation detail, so future refactors can't silently reintroduce `startLine: 0`, which GitHub rejects.
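The identity rule from Key Decision 1 and the first-seen dedup can be illustrated with a minimal Go sketch. It is written under assumptions rather than copied from `pkg/importer/dedup.go`: the local `Finding` struct is a hypothetical stand-in for `engine.Finding`, its field names are guesses (only `LineNumber`, `Confidence`, and `DetectedAt` are named in this summary), and the sample values in `main` are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// Finding is a hypothetical stand-in for engine.Finding.
// Field names are assumptions made for this sketch.
type Finding struct {
	Provider   string
	Masked     string
	Source     string
	LineNumber int
	Confidence string // deliberately not part of the identity tuple
}

// FindingKey hashes provider, masked key, source, and line number, joined
// with NUL separators so adjacent fields cannot run into each other.
func FindingKey(f Finding) string {
	tuple := strings.Join([]string{
		f.Provider, f.Masked, f.Source, strconv.Itoa(f.LineNumber),
	}, "\x00")
	sum := sha256.Sum256([]byte(tuple))
	return hex.EncodeToString(sum[:])
}

// Dedup keeps the first finding seen for each key, preserves input order,
// and returns the number of duplicates dropped. A nil input yields an
// empty, non-nil slice.
func Dedup(in []Finding) ([]Finding, int) {
	seen := make(map[string]struct{}, len(in))
	out := make([]Finding, 0, len(in))
	dropped := 0
	for _, f := range in {
		k := FindingKey(f)
		if _, dup := seen[k]; dup {
			dropped++
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out, dropped
}

func main() {
	// Made-up sample findings.
	a := Finding{Provider: "openai", Masked: "sk-a...1234", Source: "a.env", LineNumber: 3, Confidence: "high"}
	b := Finding{Provider: "aws", Masked: "AKIA...WXYZ", Source: "b.env", LineNumber: 7, Confidence: "low"}
	aAgain := a
	aAgain.Confidence = "low" // same identity tuple, different non-identity field

	in := []Finding{a, b, aAgain}
	unique, dropped := Dedup(in)
	fmt.Printf("Imported %d findings (%d new, %d duplicates)\n", len(in), len(unique), dropped)
}
```

The NUL separator matters because plain concatenation would let different tuples produce the same byte string (for example, a provider ending in `a` and a masked key starting with `b` versus the reverse split).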
## Self-Check: PASSED

- FOUND: pkg/importer/dedup.go
- FOUND: pkg/importer/dedup_test.go
- FOUND: pkg/output/sarif_github_test.go
- FOUND: testdata/sarif/sarif-2.1.0-minimal-schema.json
- FOUND commit: 6a3d5b0
- FOUND commit: bd8eb9b
- Tests verified green: `go test ./pkg/importer/... ./pkg/output/...`
- No stubs introduced.