keyhunter/.planning/phases/07-import-cicd/07-03-PLAN.md
2026-04-05 23:53:14 +03:00

---
phase: 07-import-cicd
plan: 03
type: execute
wave: 1
depends_on: []
files_modified:
  - pkg/importer/dedup.go
  - pkg/importer/dedup_test.go
  - pkg/output/sarif_github_test.go
  - testdata/sarif/sarif-2.1.0-minimal-schema.json
autonomous: true
requirements: [IMP-03, CICD-02]
must_haves:
  truths:
    - "Duplicate findings (same provider + masked key + source + line number) are detected via stable hash"
    - "SARIF output from Phase 6 contains all GitHub-required fields for code scanning uploads"
  artifacts:
    - path: pkg/importer/dedup.go
      provides: "FindingKey hash + Dedup function"
      contains: "func FindingKey"
    - path: pkg/output/sarif_github_test.go
      provides: "GitHub code scanning SARIF validation test"
      contains: "TestSARIFGitHubValidation"
  key_links:
    - from: pkg/importer/dedup.go
      to: pkg/engine/finding.go
      via: "hashes engine.Finding fields"
      pattern: "engine\\.Finding"
    - from: pkg/output/sarif_github_test.go
      to: pkg/output/sarif.go
      via: "renders SARIFFormatter output and validates required fields"
      pattern: "SARIFFormatter"
---
<objective>
Build two independent assets needed by Plan 07-04 and the GitHub integration story: (1) a deduplication helper for imported findings (IMP-03), and (2) a SARIF GitHub validation test asserting that Phase 6's SARIF output satisfies GitHub Code Scanning requirements (CICD-02).
Purpose: Imports will be re-run repeatedly; without dedup the database fills with copies. GitHub upload validation closes the loop on CICD-02 by proving SARIF output is acceptable without manual upload.
Output: Dedup package function, dedup unit tests, SARIF validation test, minimal schema fixture.
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/07-import-cicd/07-CONTEXT.md
@pkg/engine/finding.go
@pkg/output/sarif.go
<interfaces>
From pkg/output/sarif.go:
```go
type SARIFFormatter struct{}
func (SARIFFormatter) Format(findings []engine.Finding, w io.Writer, opts Options) error
```
From pkg/engine/finding.go: engine.Finding with ProviderName, KeyMasked, Source, LineNumber.
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Dedup helper for imported findings</name>
<files>pkg/importer/dedup.go, pkg/importer/dedup_test.go</files>
<behavior>
- FindingKey(f engine.Finding) string returns hex-encoded SHA-256 over "provider\x00masked\x00source\x00line".
- Dedup(in []engine.Finding) (unique []engine.Finding, duplicates int): preserves first-seen order, drops subsequent matches of the same FindingKey, returns count of dropped.
- Two findings with same provider+masked+source+line are duplicates regardless of other fields (DetectedAt, Confidence).
- Different source paths or different line numbers are NOT duplicates.
</behavior>
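The NUL separators in the hash payload matter: without them, adjacent fields can run together and two different findings can produce the same payload. The sketch below illustrates this under stated assumptions. `naiveKey` and `delimitedKey` are hypothetical names used only for the comparison; neither is part of the plan's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashHex returns the hex-encoded SHA-256 digest of s.
func hashHex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// naiveKey concatenates the fields with no separator. It is a hypothetical
// counter-example: ("ab", "c") and ("a", "bc") yield the same payload.
func naiveKey(provider, masked, source string, line int) string {
	return hashHex(fmt.Sprintf("%s%s%s%d", provider, masked, source, line))
}

// delimitedKey mirrors the payload layout from the behavior list: fields
// joined by NUL bytes, so field boundaries stay unambiguous as long as no
// field itself contains a NUL byte.
func delimitedKey(provider, masked, source string, line int) string {
	return hashHex(fmt.Sprintf("%s\x00%s\x00%s\x00%d", provider, masked, source, line))
}
```

With these helpers, `naiveKey("ab", "c", "f", 1)` equals `naiveKey("a", "bc", "f", 1)`, while the two `delimitedKey` calls with the same arguments differ.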
<action>
Create pkg/importer/dedup.go:
```go
package importer

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"github.com/salvacybersec/keyhunter/pkg/engine"
)

// FindingKey returns a stable identity hash for a finding based on the
// provider name, masked key, source path, and line number. This is the
// dedup identity used by import pipelines so the same underlying secret
// is not inserted twice when re-importing the same scanner output.
func FindingKey(f engine.Finding) string {
	payload := fmt.Sprintf("%s\x00%s\x00%s\x00%d", f.ProviderName, f.KeyMasked, f.Source, f.LineNumber)
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// Dedup removes duplicate findings from in-memory slices before insert.
// Order of first-seen findings is preserved. Returns the deduplicated
// slice and the number of duplicates dropped.
func Dedup(in []engine.Finding) ([]engine.Finding, int) {
	seen := make(map[string]struct{}, len(in))
	out := make([]engine.Finding, 0, len(in))
	dropped := 0
	for _, f := range in {
		k := FindingKey(f)
		if _, ok := seen[k]; ok {
			dropped++
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out, dropped
}
```
Create pkg/importer/dedup_test.go with tests:
- TestFindingKey_Stable: same finding twice -> identical key.
- TestFindingKey_DiffersByProvider / ByMasked / BySource / ByLine.
- TestDedup_PreservesOrder: input [A, B, A, C, B] -> output [A, B, C], dropped=2.
- TestDedup_Empty: nil slice -> empty slice, 0 dropped.
- TestDedup_IgnoresUnrelatedFields: two findings identical except DetectedAt and Confidence -> one kept.
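
Three of the cases above can be sketched as follows. This is a self-contained illustration, not the real test file: `Finding` is a hypothetical stand-in for `engine.Finding` (the type of `Confidence` is an assumption), the provider names and masked keys are made-up values, and `checkDedupCases` is a hypothetical helper. In the real `dedup_test.go` each case would be its own `Test...` function against the `importer` package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Finding is a hypothetical stand-in for engine.Finding. The four identity
// fields come from the plan; the Confidence type is an assumption.
type Finding struct {
	ProviderName string
	KeyMasked    string
	Source       string
	LineNumber   int
	Confidence   string
}

// FindingKey repeats the dedup.go hashing logic against the stand-in type.
func FindingKey(f Finding) string {
	payload := fmt.Sprintf("%s\x00%s\x00%s\x00%d", f.ProviderName, f.KeyMasked, f.Source, f.LineNumber)
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// Dedup repeats the dedup.go first-seen filtering against the stand-in type.
func Dedup(in []Finding) ([]Finding, int) {
	seen := make(map[string]struct{}, len(in))
	out := make([]Finding, 0, len(in))
	dropped := 0
	for _, f := range in {
		k := FindingKey(f)
		if _, ok := seen[k]; ok {
			dropped++
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out, dropped
}

// checkDedupCases runs three of the listed cases and returns the first failure.
func checkDedupCases() error {
	a := Finding{ProviderName: "openai", KeyMasked: "sk-a...1111", Source: "a.env", LineNumber: 1}
	b := Finding{ProviderName: "openai", KeyMasked: "sk-b...2222", Source: "b.env", LineNumber: 2}
	c := Finding{ProviderName: "anthropic", KeyMasked: "sk-c...3333", Source: "c.env", LineNumber: 3}

	// PreservesOrder: [A, B, A, C, B] -> [A, B, C], dropped=2.
	out, dropped := Dedup([]Finding{a, b, a, c, b})
	if dropped != 2 || len(out) != 3 || out[0] != a || out[1] != b || out[2] != c {
		return fmt.Errorf("order case: got %d findings, %d dropped", len(out), dropped)
	}

	// Empty: nil input -> non-nil empty slice, 0 dropped.
	out, dropped = Dedup(nil)
	if out == nil || len(out) != 0 || dropped != 0 {
		return fmt.Errorf("empty case: got %v, %d dropped", out, dropped)
	}

	// IgnoresUnrelatedFields: only Confidence differs -> one kept.
	a2 := a
	a2.Confidence = "low"
	out, dropped = Dedup([]Finding{a, a2})
	if len(out) != 1 || dropped != 1 {
		return fmt.Errorf("unrelated-fields case: got %d findings, %d dropped", len(out), dropped)
	}
	return nil
}
```

Comparing whole structs with `!=` works here because every field of the stand-in type is comparable; if the real `engine.Finding` contains slices or maps, the tests would need to compare individual fields instead.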
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/importer/... -run 'FindingKey|Dedup' -v</automated>
</verify>
<done>
- FindingKey + Dedup implemented
- All tests listed above pass
</done>
</task>
<task type="auto">
<name>Task 2: SARIF GitHub code scanning validation test</name>
<files>pkg/output/sarif_github_test.go, testdata/sarif/sarif-2.1.0-minimal-schema.json</files>
<action>
Create testdata/sarif/sarif-2.1.0-minimal-schema.json: a minimal JSON document listing GitHub's required SARIF fields for code scanning upload. This is not the full SARIF schema (which would be 500KB) but the required-fields subset documented at https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning. Content:
```json
{
"required_top_level": ["$schema", "version", "runs"],
"required_run": ["tool", "results"],
"required_tool_driver": ["name", "version"],
"required_result": ["ruleId", "level", "message", "locations"],
"required_location_physical": ["artifactLocation", "region"],
"required_region": ["startLine"],
"allowed_levels": ["error", "warning", "note", "none"]
}
```
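One way the test might load this fixture is to decode it into a small struct, so each required-field list is available by name. This is a sketch only: the `requiredFields` type and `parseRequiredFields` function are hypothetical names, and decoding into `map[string][]string` would work equally well.

```go
package main

import "encoding/json"

// requiredFields mirrors the keys of the fixture above. The type name and
// field names are hypothetical; only the JSON keys come from the plan.
type requiredFields struct {
	TopLevel         []string `json:"required_top_level"`
	Run              []string `json:"required_run"`
	ToolDriver       []string `json:"required_tool_driver"`
	Result           []string `json:"required_result"`
	LocationPhysical []string `json:"required_location_physical"`
	Region           []string `json:"required_region"`
	AllowedLevels    []string `json:"allowed_levels"`
}

// parseRequiredFields decodes fixture bytes, for example the result of
// os.ReadFile on the fixture path.
func parseRequiredFields(data []byte) (requiredFields, error) {
	var rf requiredFields
	err := json.Unmarshal(data, &rf)
	return rf, err
}
```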
Create pkg/output/sarif_github_test.go (package `output`):
- TestSARIFGitHubValidation:
  1. Build a []engine.Finding of 3 findings spanning high/medium/low confidence with realistic values (ProviderName, KeyValue, KeyMasked, Source, LineNumber).
  2. Render via SARIFFormatter.Format into a bytes.Buffer with Options{ToolName: "keyhunter", ToolVersion: "test"}.
  3. json.Unmarshal into map[string]any.
  4. Load testdata/sarif/sarif-2.1.0-minimal-schema.json via os.ReadFile. Go runs tests with the package directory as the working directory, so with the fixture at the repository root (as listed in files_modified) the path from pkg/output is ../../testdata/sarif/sarif-2.1.0-minimal-schema.json.
  5. Assert every key in required_top_level exists at root.
  6. Assert doc["version"] == "2.1.0".
  7. Assert doc["$schema"] is a non-empty string starting with "https://".
  8. runs := doc["runs"].([]any); require len(runs) == 1.
  9. For the single run, assert tool.driver.name == "keyhunter", version non-empty, results is a slice.
  10. For each result: assert ruleId non-empty string, level in allowed_levels, message.text non-empty, locations is non-empty slice.
  11. For each location: assert physicalLocation.artifactLocation.uri non-empty and physicalLocation.region.startLine >= 1.
  12. Assert startLine is always >= 1 even when input LineNumber is 0: test one finding with LineNumber: 0 and confirm startLine in output == 1 (matches Phase 6 floor behavior).
- TestSARIFGitHubValidation_EmptyFindings: empty findings slice still produces a valid document with runs[0].results == [] (not null), tool.driver present.
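
The `[]` versus `null` distinction in the empty-findings case comes from how encoding/json treats slices: a nil slice marshals to `null`, while a non-nil empty slice marshals to `[]`. The sketch below shows the difference. The `run` type and both helper functions are hypothetical and pared down for illustration; they are not the types used by pkg/output/sarif.go.

```go
package main

import "encoding/json"

// run is a hypothetical, pared-down SARIF run object used only to show
// how slices are encoded.
type run struct {
	Results []string `json:"results"`
}

// encodeRunRaw marshals the run unchanged, so a nil Results slice is
// written as JSON null.
func encodeRunRaw(r run) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

// encodeRun replaces a nil Results slice with an empty one before
// marshaling, so the JSON contains [] rather than null.
func encodeRun(r run) (string, error) {
	if r.Results == nil {
		r.Results = []string{}
	}
	b, err := json.Marshal(r)
	return string(b), err
}
```

For a zero-value `run{}`, `encodeRunRaw` yields `{"results":null}` and `encodeRun` yields `{"results":[]}`. When the test decodes into `map[string]any`, the first form decodes to a nil interface value and fails a `.([]any)` type assertion, which is what the empty-findings test should catch.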
Use standard library only (bytes, encoding/json, os, path/filepath, strings, testing). No schema validation library.
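
The field checks in steps 6 through 11 might be sketched as a plain walk over the decoded document. This is an illustration under assumptions, not the planned test: `validateSARIF`, `dig`, and `contains` are hypothetical helpers, and the real test would report failures through `testing.T` rather than returning a list of strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// dig follows a chain of object keys and returns nil if any step is missing.
func dig(v any, path ...string) any {
	for _, k := range path {
		m, ok := v.(map[string]any)
		if !ok {
			return nil
		}
		v = m[k]
	}
	return v
}

// contains reports whether list includes s.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// validateSARIF returns a description of each problem found in data.
func validateSARIF(data []byte, allowedLevels []string) []string {
	var doc map[string]any
	if err := json.Unmarshal(data, &doc); err != nil {
		return []string{"invalid JSON: " + err.Error()}
	}
	var problems []string
	if v, _ := doc["version"].(string); v != "2.1.0" {
		problems = append(problems, "version must be 2.1.0")
	}
	if s, _ := doc["$schema"].(string); !strings.HasPrefix(s, "https://") {
		problems = append(problems, "$schema must start with https://")
	}
	runs, _ := doc["runs"].([]any)
	if len(runs) != 1 {
		return append(problems, "expected exactly one run")
	}
	// A JSON null decodes to a nil interface, so this assertion fails for null.
	results, ok := dig(runs[0], "results").([]any)
	if !ok {
		return append(problems, "results must be an array")
	}
	for i, r := range results {
		if id, _ := dig(r, "ruleId").(string); id == "" {
			problems = append(problems, fmt.Sprintf("result %d: empty ruleId", i))
		}
		if level, _ := dig(r, "level").(string); !contains(allowedLevels, level) {
			problems = append(problems, fmt.Sprintf("result %d: level %q not allowed", i, level))
		}
		if text, _ := dig(r, "message", "text").(string); text == "" {
			problems = append(problems, fmt.Sprintf("result %d: empty message.text", i))
		}
		locs, _ := dig(r, "locations").([]any)
		if len(locs) == 0 {
			problems = append(problems, fmt.Sprintf("result %d: no locations", i))
		}
		for _, loc := range locs {
			if uri, _ := dig(loc, "physicalLocation", "artifactLocation", "uri").(string); uri == "" {
				problems = append(problems, fmt.Sprintf("result %d: empty uri", i))
			}
			// JSON numbers decode to float64 when the target is map[string]any.
			if line, _ := dig(loc, "physicalLocation", "region", "startLine").(float64); line < 1 {
				problems = append(problems, fmt.Sprintf("result %d: startLine < 1", i))
			}
		}
	}
	return problems
}
```

The comma-ok form of every type assertion keeps the walk from panicking on a missing or mistyped field, so a malformed document produces readable problem messages instead of a crash.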
</action>
<verify>
<automated>cd /home/salva/Documents/apikey && go test ./pkg/output/... -run SARIFGitHub -v</automated>
</verify>
<done>
- testdata/sarif/sarif-2.1.0-minimal-schema.json committed
- pkg/output/sarif_github_test.go passes
- SARIFFormatter output provably satisfies GitHub Code Scanning required fields
</done>
</task>
</tasks>
<verification>
go test ./pkg/importer/... ./pkg/output/... passes.
</verification>
<success_criteria>
Dedup helper usable by the import command (07-04). SARIF output validated against GitHub's required-field surface with no external dependencies, proving CICD-02 end-to-end.
</success_criteria>
<output>
After completion, create `.planning/phases/07-import-cicd/07-03-SUMMARY.md`.
</output>