diff --git a/.planning/phases/07-import-cicd/07-CONTEXT.md b/.planning/phases/07-import-cicd/07-CONTEXT.md
new file mode 100644
index 0000000..a7574aa
--- /dev/null
+++ b/.planning/phases/07-import-cicd/07-CONTEXT.md
@@ -0,0 +1,111 @@
+# Phase 7: Import Adapters & CI/CD Integration - Context
+
+**Gathered:** 2026-04-05
+**Status:** Ready for planning
+**Mode:** Auto-generated
+
+
+## Phase Boundary
+
+Three capabilities:
+1. Import findings from TruffleHog v3 JSON and Gitleaks JSON/CSV into the SQLite database (normalize to KeyHunter Finding schema, deduplicate)
+2. Git pre-commit hook install/uninstall that runs `keyhunter scan` on staged files and blocks commits with findings
+3. SARIF 2.1.0 output already built in Phase 6 — this phase verifies it passes GitHub Code Scanning validation
+
+
+
+
+## Implementation Decisions
+
+### Import Package (IMP-01, IMP-02, IMP-03)
+- **New package**: `pkg/importer/`
+- **One file per format**: `pkg/importer/trufflehog.go`, `pkg/importer/gitleaks.go`
+- **Common interface**: `Importer` with `Import(r io.Reader) ([]engine.Finding, error)`
+- **TruffleHog v3 JSON schema**: array of objects with `{SourceID, SourceMetadata, SourceName, DetectorName, DetectorType, Verified, Raw, Redacted, ExtraData}`. Map:
+  - DetectorName → Finding.ProviderName (lowercase)
+  - Raw → Finding.KeyValue
+  - Verified → Finding.VerifyStatus (live/unverified)
+  - SourceMetadata → source path (JSON-nested, need path-based extraction)
+- **Gitleaks JSON schema**: array of `{Description, StartLine, EndLine, StartColumn, EndColumn, Match, Secret, File, SymlinkFile, Commit, Entropy, Author, Email, Date, Message, Tags, RuleID, Fingerprint}`. Map:
+  - RuleID → Finding.ProviderName (normalize against our provider names)
+  - Secret → Finding.KeyValue
+  - File → Finding.Source
+  - StartLine → Finding.LineNumber
+- **Gitleaks CSV**: same columns as JSON, use encoding/csv
+- **Deduplication**: hash(provider + masked_key + source) before insert, skip if exists
+- **CLI command**: `keyhunter import --format= ` reads file, inserts findings
+
+### Git Hook (CICD-01)
+- **Install**: `keyhunter hook install` writes `.git/hooks/pre-commit` shell script that calls `keyhunter scan $(git diff --cached --name-only --diff-filter=ACMR)` with `--exit-code`
+- **Uninstall**: `keyhunter hook uninstall` removes the script (preserves any non-keyhunter portions in a backup)
+- **Content**: simple bash script with shebang, runs keyhunter, exits with scan's exit code
+- **Detection**: before install, check if pre-commit already exists; if yes, ask to append or replace (--force flag to skip prompt)
+
+### SARIF GitHub Validation (CICD-02)
+- Already produced in Phase 6 by SARIFFormatter
+- This phase: add a validation test using official SARIF schema JSON (download once, embed in repo as testdata, validate output against it via gjson or simple schema check)
+- Add documentation in README about uploading to GitHub: `.github/workflows/keyhunter.yml` example workflow
+- Validate the SARIF against a real GitHub-accepted format — use a minimal validator or manual schema check
+
+### New Files
+```
+pkg/importer/
+  importer.go — Importer interface
+  trufflehog.go — TruffleHog v3 JSON parser
+  gitleaks.go — Gitleaks JSON + CSV parser
+  dedup.go — dedup hash logic
+  importer_test.go
+  testdata/
+    trufflehog-sample.json
+    gitleaks-sample.json
+    gitleaks-sample.csv
+
+cmd/
+  import.go — keyhunter import command (replace stub)
+  hook.go — keyhunter hook install/uninstall (replace stub)
+  hook_script.sh — embedded pre-commit script template (via go:embed)
+
+docs/
+  CI-CD.md — GitHub Actions example, pre-commit setup
+
+testdata/sarif/
+  sarif-2.1.0-schema.json — official schema for validation tests
+```
+
+
+
+
+## Existing Code Insights
+
+### Reusable Assets
+- pkg/engine/finding.go — Finding struct (target for imports)
+- pkg/storage/findings.go — SaveFinding (target for inserts)
+- pkg/output/sarif.go — SARIFFormatter from Phase 6
+- cmd/stubs.go — import and hook are stubs to replace
+
+### Provider Name Normalization
+- TruffleHog uses names like "OpenAI", "GitHubV2", "AWS" (mixed case)
+- Gitleaks uses names like "openai-api-key", "aws-access-token" (kebab)
+- Normalize to KeyHunter's lowercase names: openai, aws-bedrock, etc.
+- Unknown provider names → keep as-is, tag confidence "imported"
+
+
+
+
+## Specific Ideas
+
+- Import command should show a summary: "Imported N findings (M new, K duplicates)"
+- Hook install should verify `.git/` exists in current directory; error if not a repo
+- SARIF validation test should check: `$schema`, `version`, `runs[]`, `runs[].tool.driver.name == "keyhunter"`, each result has `ruleId`, `level`, `message`, `locations`
+
+
+
+
+## Deferred Ideas
+
+- Import from arbitrary JSON formats via jsonpath config — over-engineering
+- pre-push and post-merge hooks — pre-commit is enough for v1
+- GitHub App integration for automatic scanning on PRs — separate project
+- Semgrep/Snyk output format imports — defer to v2
+
+