| phase | plan | subsystem | tags | requires | provides | affects | tech-stack | key-files | decisions | metrics | requirements |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 07-import-cicd | 03 | importer+output | | | | | | | | | |
Phase 7 Plan 03: Dedup Helper & SARIF GitHub Validation Summary
Adds a stable dedup identity hash for imported findings (IMP-03) and a standalone test proving Phase 6's SARIFFormatter output satisfies GitHub Code Scanning's required-field surface (CICD-02).
What Was Built
Task 1: Dedup helper (pkg/importer/dedup.go)
- `FindingKey(engine.Finding) string`: hex-encoded SHA-256 over `provider\x00masked\x00source\x00line`. Fields outside that tuple (`DetectedAt`, `Confidence`, `VerifyStatus`, ...) do not participate, because re-running the same import later must collapse onto the original finding.
- `Dedup([]engine.Finding) ([]engine.Finding, int)`: preserves first-seen order and returns the deduplicated slice plus the count of duplicates dropped. Callers use the drop count to surface `"Imported N findings (M new, K duplicates)"`.
Tests (pkg/importer/dedup_test.go, 8 total, all passing):
- `TestFindingKey_Stable`
- `TestFindingKey_DiffersByProvider` / `ByMasked` / `BySource` / `ByLine`
- `TestDedup_PreservesOrder`: `[A,B,A,C,B] -> [A,B,C]`, dropped=2
- `TestDedup_Empty`: nil input yields empty slice, 0 dropped
- `TestDedup_IgnoresUnrelatedFields`: identical identity but different `DetectedAt`/`Confidence` collapses to one, first-seen wins
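The identity hash and order-preserving dedup described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pkg/importer/dedup.go`: the `Finding` struct is a hypothetical stand-in for `engine.Finding`, and its field names and the sample values are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// Finding is a hypothetical stand-in for engine.Finding; the real struct
// and its field names may differ.
type Finding struct {
	Provider   string
	Masked     string
	Source     string
	Line       int
	Confidence string // deliberately not part of the identity
}

// FindingKey returns the hex-encoded SHA-256 of
// provider \x00 masked \x00 source \x00 line.
func FindingKey(f Finding) string {
	identity := strings.Join([]string{
		f.Provider, f.Masked, f.Source, strconv.Itoa(f.Line),
	}, "\x00")
	sum := sha256.Sum256([]byte(identity))
	return hex.EncodeToString(sum[:])
}

// Dedup keeps the first occurrence of each identity, preserves input
// order, and reports how many duplicates were dropped.
func Dedup(in []Finding) ([]Finding, int) {
	seen := make(map[string]struct{}, len(in))
	out := make([]Finding, 0, len(in))
	dropped := 0
	for _, f := range in {
		key := FindingKey(f)
		if _, dup := seen[key]; dup {
			dropped++
			continue
		}
		seen[key] = struct{}{}
		out = append(out, f)
	}
	return out, dropped
}

func main() {
	// Placeholder findings, invented for illustration.
	a := Finding{Provider: "provider-a", Masked: "aaaa...1111", Source: "a.env", Line: 3}
	b := Finding{Provider: "provider-b", Masked: "bbbb...2222", Source: "b.env", Line: 7}
	c := Finding{Provider: "provider-c", Masked: "cccc...3333", Source: "c.env", Line: 1}

	in := []Finding{a, b, a, c, b}
	unique, dropped := Dedup(in)
	fmt.Printf("Imported %d findings (%d new, %d duplicates)\n", len(in), len(unique), dropped)
	// Prints: Imported 5 findings (3 new, 2 duplicates)
}
```

Joining the fields with a NUL byte keeps adjacent fields from bleeding into each other (for example, `"ab" + "c"` versus `"a" + "bc"`), which is presumably why the separator was chosen.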
Task 2: SARIF GitHub validation test
- `testdata/sarif/sarif-2.1.0-minimal-schema.json`: required-fields subset for GitHub Code Scanning (top-level, run, tool.driver, result, location.physicalLocation, region, allowed result levels). Hand-curated from https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning rather than shipping the 500KB full schema.
- `pkg/output/sarif_github_test.go`:
  - `TestSARIFGitHubValidation`: renders 3 findings (high/medium/low confidence) through `SARIFFormatter.Format`, unmarshals to `map[string]any`, then walks the document and asserts that every required field in the fixture exists, `version == "2.1.0"`, `$schema` is a non-empty `https://` URL, there is exactly one run, `tool.driver.name == "keyhunter"`, each result has `ruleId`, `level ∈ allowed_levels`, non-empty `message.text`, and non-empty `locations`, and every `physicalLocation.region.startLine >= 1`. Includes one finding with `LineNumber: 0` to prove the Phase 6 `startLine` floor converts it to 1.
  - `TestSARIFGitHubValidation_EmptyFindings`: empty input still produces a valid document with `runs[0].results == []` (not `null`) and `tool.driver` populated.
- Uses stdlib only (`encoding/json`, `os`, `path/filepath`, `strings`, `testing`). The test walks `../../testdata/sarif/...` from the package directory so the fixture stays relocatable.
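The document walk can be approximated with stdlib-only code along these lines. This is a simplified sketch under stated assumptions: the checks are hard-coded rather than driven by the fixture file, the sample document is hand-written instead of rendered by `SARIFFormatter`, and the helper names `dig` and `validateSARIF`, the `$schema` URL, and the sample values are all illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Result levels defined by SARIF 2.1.0.
var allowedLevels = map[string]bool{"none": true, "note": true, "warning": true, "error": true}

// dig follows nested object keys through decoded JSON; it returns nil
// when any step is missing or is not an object.
func dig(v any, keys ...string) any {
	for _, k := range keys {
		m, ok := v.(map[string]any)
		if !ok {
			return nil
		}
		v = m[k]
	}
	return v
}

// validateSARIF returns one message per violated requirement.
func validateSARIF(raw []byte) []string {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return []string{"invalid JSON: " + err.Error()}
	}
	var problems []string
	fail := func(format string, args ...any) {
		problems = append(problems, fmt.Sprintf(format, args...))
	}
	if dig(doc, "version") != "2.1.0" {
		fail("version must be 2.1.0")
	}
	if schema, _ := dig(doc, "$schema").(string); !strings.HasPrefix(schema, "https://") {
		fail("$schema must be an https:// URL")
	}
	runs, _ := dig(doc, "runs").([]any)
	if len(runs) != 1 {
		fail("expected exactly one run, got %d", len(runs))
		return problems
	}
	if dig(runs[0], "tool", "driver", "name") != "keyhunter" {
		fail("tool.driver.name must be keyhunter")
	}
	results, ok := dig(runs[0], "results").([]any)
	if !ok {
		fail("runs[0].results must be an array, not null or missing")
		return problems
	}
	for i, r := range results {
		if id, _ := dig(r, "ruleId").(string); id == "" {
			fail("results[%d]: missing ruleId", i)
		}
		if lvl, _ := dig(r, "level").(string); !allowedLevels[lvl] {
			fail("results[%d]: level %q not allowed", i, lvl)
		}
		if text, _ := dig(r, "message", "text").(string); text == "" {
			fail("results[%d]: empty message.text", i)
		}
		locs, _ := dig(r, "locations").([]any)
		if len(locs) == 0 {
			fail("results[%d]: empty locations", i)
		}
		for j, loc := range locs {
			// encoding/json decodes JSON numbers into float64.
			line, ok := dig(loc, "physicalLocation", "region", "startLine").(float64)
			if !ok || line < 1 {
				fail("results[%d].locations[%d]: startLine must be >= 1", i, j)
			}
		}
	}
	return problems
}

// sample is a hand-written placeholder document, not real formatter output.
const sample = `{
  "version": "2.1.0",
  "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
  "runs": [{
    "tool": {"driver": {"name": "keyhunter"}},
    "results": [{
      "ruleId": "example-rule",
      "level": "warning",
      "message": {"text": "example finding"},
      "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "config/app.env"},
        "region": {"startLine": 1}
      }}]
    }]
  }]
}`

func main() {
	problems := validateSARIF([]byte(sample))
	fmt.Printf("%d problem(s) found\n", len(problems))
	for _, p := range problems {
		fmt.Println(" -", p)
	}
}
```

Asserting `results` to `[]any` is what distinguishes an empty array from `null`: a JSON `null` decodes to a nil interface and fails the assertion, while `[]` decodes to an empty, non-nil slice.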
Commits
| Task | Commit | Description |
|---|---|---|
| 1 | 6a3d5b0 | feat(07-03): dedup helper for imported findings |
| 2 | bd8eb9b | test(07-03): SARIF GitHub code scanning validation |
Verification
$ go test ./pkg/importer/... ./pkg/output/...
ok github.com/salvacybersec/keyhunter/pkg/importer (cached)
ok github.com/salvacybersec/keyhunter/pkg/output 0.008s
All 8 importer dedup tests + 2 SARIF validation tests pass. Both targeted go test invocations from the plan's verify blocks pass first-try with no deviations.
Deviations from Plan
None - plan executed exactly as written.
Downstream Consumers
- Plan 07-04 (import CLI command): will call `importer.Dedup` after parsing TruffleHog/Gitleaks output and report `(unique, dropped)` to the user before inserting into storage.
- CICD-02 closure: `TestSARIFGitHubValidation` functions as a regression guard. Any future change to `SARIFFormatter` that breaks GitHub Code Scanning upload compatibility will fail this test before reaching users.
Key Decisions
- Identity tuple excludes verification/timing fields. A verified finding re-imported later should still collapse onto the original. Only provider + masked key + source + line number define identity.
- Hand-rolled required-fields fixture over full SARIF schema. GitHub enforces a small, well-documented subset. Shipping the full 500KB JSON Schema and a validator library would bloat the test binary without catching more bugs that matter to GitHub uploads.
- Line number floor asserted in test. Makes the Phase 6 `startLine = max(line, 1)` behavior a contract rather than an incidental implementation detail: future refactors can't silently reintroduce `startLine: 0`, which GitHub rejects.
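
For reference, the `startLine = max(line, 1)` floor mentioned in the last decision amounts to something like the following. The helper name `sarifStartLine` is hypothetical; the actual Phase 6 formatter may inline the clamp or name it differently.

```go
package main

import "fmt"

// sarifStartLine clamps a finding's line number to SARIF's 1-based
// minimum. Hypothetical helper, shown only to illustrate the floor.
func sarifStartLine(line int) int {
	if line < 1 {
		return 1
	}
	return line
}

func main() {
	for _, line := range []int{-3, 0, 1, 42} {
		fmt.Printf("LineNumber %d -> startLine %d\n", line, sarifStartLine(line))
	}
}
```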