keyhunter/.planning/phases/07-import-cicd/07-03-SUMMARY.md

phase: 07-import-cicd
plan: 03
subsystem: importer+output
tags: importer, dedup, sarif, cicd, github-code-scanning
requires:
  • pkg/engine/finding.go
  • pkg/output/sarif.go
provides:
  • pkg/importer.FindingKey
  • pkg/importer.Dedup
  • testdata/sarif/sarif-2.1.0-minimal-schema.json
  • pkg/output.TestSARIFGitHubValidation
affects: pkg/importer, pkg/output
tech-stack:
  added: []
  patterns:
    • SHA-256 stable hash for dedup identity (provider\x00masked\x00source\x00line)
    • Fixture-driven schema validation via JSON field list, stdlib only
key-files:
  created:
    • pkg/importer/dedup.go
    • pkg/importer/dedup_test.go
    • pkg/output/sarif_github_test.go
    • testdata/sarif/sarif-2.1.0-minimal-schema.json
  modified: []
decisions:
  • FindingKey uses provider+masked+source+line as identity tuple; DetectedAt and Confidence intentionally excluded so re-imports collapse
  • SARIF validation uses a hand-rolled required-fields fixture rather than the full 500KB SARIF 2.1.0 schema: zero external deps, targets GitHub's enforced surface
  • Test file walks ../.. to locate repo root rather than hardcoding paths, keeping the fixture relocatable
metrics:
  duration: ~4 min
  completed: 2026-04-05
requirements: IMP-03, CICD-02

Phase 7 Plan 03: Dedup Helper & SARIF GitHub Validation Summary

Adds a stable dedup identity hash for imported findings (IMP-03) and a standalone test proving Phase 6's SARIFFormatter output satisfies GitHub Code Scanning's required-field surface (CICD-02).

What Was Built

Task 1: Dedup helper (pkg/importer/dedup.go)

  • FindingKey(engine.Finding) string: hex-encoded SHA-256 over provider\x00masked\x00source\x00line. Fields outside that tuple (DetectedAt, Confidence, VerifyStatus, ...) do not participate — re-running the same import later must collapse onto the original finding.
  • Dedup([]engine.Finding) ([]engine.Finding, int): preserves first-seen order, returns the deduplicated slice plus count of duplicates dropped. Callers use the drop count to surface "Imported N findings (M new, K duplicates)".
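
The two helpers above can be sketched roughly as follows. This is a minimal illustration, not the contents of pkg/importer/dedup.go: the Finding struct is a hypothetical stand-in for engine.Finding, and its field names (Provider, Masked, Source) are assumptions. Only the tuple layout, the hex-encoded SHA-256, and the first-seen ordering come from the description above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// Finding is a hypothetical stand-in for engine.Finding; the real
// struct's field names may differ.
type Finding struct {
	Provider   string
	Masked     string
	Source     string
	LineNumber int
	Confidence string // deliberately not part of the identity tuple
}

// FindingKey hashes provider\x00masked\x00source\x00line with SHA-256
// and returns the digest as a hex string.
func FindingKey(f Finding) string {
	tuple := strings.Join([]string{
		f.Provider, f.Masked, f.Source, strconv.Itoa(f.LineNumber),
	}, "\x00")
	sum := sha256.Sum256([]byte(tuple))
	return hex.EncodeToString(sum[:])
}

// Dedup keeps the first occurrence of each key, preserves input order,
// and reports how many later duplicates were dropped.
func Dedup(in []Finding) ([]Finding, int) {
	seen := make(map[string]struct{}, len(in))
	out := make([]Finding, 0, len(in))
	dropped := 0
	for _, f := range in {
		k := FindingKey(f)
		if _, dup := seen[k]; dup {
			dropped++
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out, dropped
}

func main() {
	a := Finding{Provider: "openai", Masked: "sk-a...1234", Source: "config.yaml", LineNumber: 10, Confidence: "high"}
	b := Finding{Provider: "aws", Masked: "AKIA...WXYZ", Source: "deploy.sh", LineNumber: 3, Confidence: "medium"}
	again := a
	again.Confidence = "low" // same identity, different non-identity field

	unique, dropped := Dedup([]Finding{a, b, again})
	fmt.Printf("Imported %d findings (%d new, %d duplicates)\n",
		len(unique)+dropped, len(unique), dropped)
}
```

The NUL separator keeps adjacent fields from running together, so ("ab", "c") and ("a", "bc") hash to different keys.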

Tests (pkg/importer/dedup_test.go, 8 total, all passing):

  • TestFindingKey_Stable
  • TestFindingKey_DiffersByProvider / ByMasked / BySource / ByLine
  • TestDedup_PreservesOrder: [A,B,A,C,B] -> [A,B,C], dropped=2
  • TestDedup_Empty — nil input yields empty slice, 0 dropped
  • TestDedup_IgnoresUnrelatedFields — identical identity but different DetectedAt/Confidence collapses to one, first-seen wins

Task 2: SARIF GitHub validation test

  • testdata/sarif/sarif-2.1.0-minimal-schema.json: required-fields subset for GitHub Code Scanning (top-level, run, tool.driver, result, location.physicalLocation, region, allowed result levels). Hand-curated from https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning rather than shipping the 500KB full schema.
  • pkg/output/sarif_github_test.go:
    • TestSARIFGitHubValidation: renders 3 findings (high/medium/low confidence) through SARIFFormatter.Format, unmarshals the output to map[string]any, then walks the document and asserts that every required field in the fixture exists; version == "2.1.0"; $schema is a non-empty https:// URL; there is exactly one run; tool.driver.name == "keyhunter"; each result has a ruleId, a level ∈ allowed_levels, a non-empty message.text, and non-empty locations; and every physicalLocation.region.startLine >= 1. One of the three findings has LineNumber: 0 to prove that the Phase 6 startLine floor converts it to 1.
    • TestSARIFGitHubValidation_EmptyFindings: empty input still produces a valid document with runs[0].results == [] (not null) and tool.driver populated.
  • Uses stdlib only (encoding/json, os, path/filepath, strings, testing). Test walks ../../testdata/sarif/... from the package directory so the fixture stays relocatable.
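
The assertions listed above could look roughly like the sketch below. It is a simplified illustration rather than the actual test: the real test is fixture-driven, whereas this version hardcodes the checks. The function name validateSARIF is hypothetical, and the allowed levels are taken from the SARIF 2.1.0 specification because the fixture's allowed_levels list is not shown in this summary.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// allowedLevels is an assumption based on the SARIF 2.1.0 spec; the
// fixture's own list is not reproduced in this summary.
var allowedLevels = map[string]bool{"none": true, "note": true, "warning": true, "error": true}

// validateSARIF (hypothetical name) checks the required-field surface
// described above against a document decoded into map[string]any.
func validateSARIF(raw []byte) error {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return fmt.Errorf("unmarshal: %w", err)
	}
	if v, _ := doc["version"].(string); v != "2.1.0" {
		return fmt.Errorf("version = %q, want 2.1.0", v)
	}
	if s, _ := doc["$schema"].(string); !strings.HasPrefix(s, "https://") {
		return fmt.Errorf("$schema %q is not an https:// URL", s)
	}
	runs, _ := doc["runs"].([]any)
	if len(runs) != 1 {
		return fmt.Errorf("got %d runs, want exactly 1", len(runs))
	}
	// Failed type assertions yield nil maps; reading from them is safe.
	run, _ := runs[0].(map[string]any)
	tool, _ := run["tool"].(map[string]any)
	driver, _ := tool["driver"].(map[string]any)
	if name, _ := driver["name"].(string); name != "keyhunter" {
		return fmt.Errorf("tool.driver.name = %q, want keyhunter", name)
	}
	// A JSON null decodes to nil, so the assertion fails for null results.
	results, ok := run["results"].([]any)
	if !ok {
		return fmt.Errorf("runs[0].results must be an array, not null")
	}
	for i, r := range results {
		res, _ := r.(map[string]any)
		if id, _ := res["ruleId"].(string); id == "" {
			return fmt.Errorf("result %d: missing ruleId", i)
		}
		if lvl, _ := res["level"].(string); !allowedLevels[lvl] {
			return fmt.Errorf("result %d: level %q not allowed", i, lvl)
		}
		msg, _ := res["message"].(map[string]any)
		if text, _ := msg["text"].(string); text == "" {
			return fmt.Errorf("result %d: empty message.text", i)
		}
		locs, _ := res["locations"].([]any)
		if len(locs) == 0 {
			return fmt.Errorf("result %d: no locations", i)
		}
		for _, l := range locs {
			loc, _ := l.(map[string]any)
			phys, _ := loc["physicalLocation"].(map[string]any)
			region, _ := phys["region"].(map[string]any)
			// encoding/json decodes JSON numbers into float64.
			if line, _ := region["startLine"].(float64); line < 1 {
				return fmt.Errorf("result %d: startLine %v < 1", i, line)
			}
		}
	}
	return nil
}

func main() {
	// Illustrative document only; the $schema URL is a placeholder.
	doc := `{"version":"2.1.0","$schema":"https://example.invalid/sarif.json",
	 "runs":[{"tool":{"driver":{"name":"keyhunter"}},"results":[]}]}`
	fmt.Println("validation error:", validateSARIF([]byte(doc)))
}
```

Decoding into map[string]any instead of typed structs means a missing field shows up as a failed assertion rather than being silently zero-filled, which suits a required-fields check.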

Commits

Task  Commit   Description
1     6a3d5b0  feat(07-03): dedup helper for imported findings
2     bd8eb9b  test(07-03): SARIF GitHub code scanning validation

Verification

$ go test ./pkg/importer/... ./pkg/output/...
ok  	github.com/salvacybersec/keyhunter/pkg/importer	(cached)
ok  	github.com/salvacybersec/keyhunter/pkg/output	0.008s

All 8 importer dedup tests and both SARIF validation tests pass. Both targeted go test invocations from the plan's verify blocks passed on the first try, with no deviations.

Deviations from Plan

None - plan executed exactly as written.

Downstream Consumers

  • Plan 07-04 (import CLI command): will call importer.Dedup after parsing TruffleHog/Gitleaks output and report (unique, dropped) to the user before inserting into storage.
  • CICD-02 closure: TestSARIFGitHubValidation functions as a regression guard — any future change to SARIFFormatter that breaks GitHub Code Scanning upload compatibility will fail this test before reaching users.

Key Decisions

  1. Identity tuple excludes verification/timing fields. A verified finding re-imported later should still collapse onto the original. Only provider + masked key + source + line number define identity.
  2. Hand-rolled required-fields fixture over full SARIF schema. GitHub enforces a small, well-documented subset. Shipping the full 500KB JSON Schema and a validator library would bloat the test binary without catching more bugs that matter to GitHub uploads.
  3. Line number floor asserted in test. Makes the Phase 6 startLine = max(line, 1) behavior a contract rather than an incidental implementation detail — future refactors can't silently reintroduce startLine: 0, which GitHub rejects.
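
As a small illustration of decision 3, the floor could be written as the helper below. The name startLine is hypothetical and this is not the Phase 6 code; it only mirrors the startLine = max(line, 1) behavior described above.

```go
package main

import "fmt"

// startLine (hypothetical helper) clamps a finding's line number to the
// minimum of 1 that SARIF regions need for GitHub uploads.
func startLine(line int) int {
	if line < 1 {
		return 1
	}
	return line
}

func main() {
	for _, n := range []int{0, 1, 42} {
		fmt.Printf("LineNumber %d -> startLine %d\n", n, startLine(n))
	}
}
```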

Self-Check: PASSED

  • FOUND: pkg/importer/dedup.go
  • FOUND: pkg/importer/dedup_test.go
  • FOUND: pkg/output/sarif_github_test.go
  • FOUND: testdata/sarif/sarif-2.1.0-minimal-schema.json
  • FOUND commit: 6a3d5b0
  • FOUND commit: bd8eb9b
  • Tests verified green: go test ./pkg/importer/... ./pkg/output/...
  • No stubs introduced.