# Phase 7: Import Adapters & CI/CD Integration - Context

**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated

## Phase Boundary

Three capabilities:

1. Import findings from TruffleHog v3 JSON and Gitleaks JSON/CSV into the SQLite database (normalize to KeyHunter Finding schema, deduplicate)
2. Git pre-commit hook install/uninstall that runs `keyhunter scan` on staged files and blocks commits with findings
3. SARIF 2.1.0 output already built in Phase 6; this phase verifies it passes GitHub Code Scanning validation

## Implementation Decisions

### Import Package (IMP-01, IMP-02, IMP-03)

- **New package**: `pkg/importer/`
- **One file per format**: `pkg/importer/trufflehog.go`, `pkg/importer/gitleaks.go`
- **Common interface**: `Importer` with `Import(r io.Reader) ([]engine.Finding, error)`
- **TruffleHog v3 JSON schema**: array of objects with `{SourceID, SourceMetadata, SourceName, DetectorName, DetectorType, Verified, Raw, Redacted, ExtraData}`.
  Map:
  - DetectorName → Finding.ProviderName (lowercase)
  - Raw → Finding.KeyValue
  - Verified → Finding.VerifyStatus (live/unverified)
  - SourceMetadata → source path (JSON-nested, need path-based extraction)
- **Gitleaks JSON schema**: array of `{Description, StartLine, EndLine, StartColumn, EndColumn, Match, Secret, File, SymlinkFile, Commit, Entropy, Author, Email, Date, Message, Tags, RuleID, Fingerprint}`.
  Map:
  - RuleID → Finding.ProviderName (normalize against our provider names)
  - Secret → Finding.KeyValue
  - File → Finding.Source
  - StartLine → Finding.LineNumber
- **Gitleaks CSV**: same columns as JSON, use encoding/csv
- **Deduplication**: hash(provider + masked_key + source) before insert, skip if exists
- **CLI command**: `keyhunter import --format= ` reads file, inserts findings

### Git Hook (CICD-01)

- **Install**: `keyhunter hook install` writes `.git/hooks/pre-commit` shell script that calls `keyhunter scan $(git diff --cached --name-only --diff-filter=ACMR)` with `--exit-code`
- **Uninstall**: `keyhunter hook uninstall` removes the script (preserves any non-keyhunter portions in a backup)
- **Content**: simple bash script with shebang, runs keyhunter, exits with scan's exit code
- **Detection**: before install, check if pre-commit already exists; if yes, ask to append or replace (--force flag to skip prompt)

### SARIF GitHub Validation (CICD-02)

- Already produced in Phase 6 by SARIFFormatter
- This phase: add a validation test using official SARIF schema JSON (download once, embed in repo as testdata, validate output against it via gjson or simple schema check)
- Add documentation in README about uploading to GitHub: `.github/workflows/keyhunter.yml` example workflow
- Validate the SARIF against a real GitHub-accepted format; use a minimal validator or manual schema check

### New Files

```
pkg/importer/
  importer.go       - Importer interface
  trufflehog.go     - TruffleHog v3 JSON parser
  gitleaks.go       - Gitleaks JSON + CSV parser
  dedup.go          - dedup hash logic
  importer_test.go
  testdata/
    trufflehog-sample.json
    gitleaks-sample.json
    gitleaks-sample.csv
cmd/
  import.go         - keyhunter import command (replace stub)
  hook.go           - keyhunter hook install/uninstall (replace stub)
  hook_script.sh    - embedded pre-commit script template (via go:embed)
docs/
  CI-CD.md          - GitHub Actions example, pre-commit setup
testdata/sarif/
  sarif-2.1.0-schema.json - official schema for validation tests
```

## Existing Code Insights

### Reusable Assets

- pkg/engine/finding.go: Finding struct (target for imports)
- pkg/storage/findings.go: SaveFinding (target for inserts)
- pkg/output/sarif.go: SARIFFormatter from Phase 6
- cmd/stubs.go: import and hook are stubs to replace

### Provider Name Normalization

- TruffleHog uses names like "OpenAI", "GitHubV2", "AWS" (mixed case)
- Gitleaks uses names like "openai-api-key", "aws-access-token" (kebab)
- Normalize to KeyHunter's lowercase names: openai, aws-bedrock, etc.
- Unknown provider names → keep as-is, tag confidence "imported"

## Specific Ideas

- Import command should show a summary: "Imported N findings (M new, K duplicates)"
- Hook install should verify `.git/` exists in current directory; error if not a repo
- SARIF validation test should check: `$schema`, `version`, `runs[]`, `runs[].tool.driver.name == "keyhunter"`, each result has `ruleId`, `level`, `message`, `locations`

## Deferred Ideas

- Import from arbitrary JSON formats via jsonpath config (over-engineering)
- pre-push and post-merge hooks (pre-commit is enough for v1)
- GitHub App integration for automatic scanning on PRs (separate project)
- Semgrep/Snyk output format imports (defer to v2)
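
The deduplication rule, hash(provider + masked_key + source), could look roughly like the sketch below. The masking rule in `maskKey` (first and last four characters), the choice of SHA-256, and the in-memory `seen` set are all assumptions for illustration; the real check would query the SQLite database, and `dedupHash` and `splitNew` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// maskKey is a hypothetical masking rule: keep the first and last four
// characters and hide the rest. Short keys are masked entirely.
func maskKey(key string) string {
	if len(key) <= 8 {
		return "****"
	}
	return key[:4] + "****" + key[len(key)-4:]
}

// dedupHash hashes provider, masked key, and source. A zero byte after
// each part keeps ("ab", "c") distinct from ("a", "bc").
func dedupHash(provider, key, source string) string {
	h := sha256.New()
	for _, part := range []string{provider, maskKey(key), source} {
		h.Write([]byte(part))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// splitNew counts how many hashes are new and how many are already in
// seen, marking the new ones as seen. The two counts correspond to the
// "M new, K duplicates" figures in the import summary.
func splitNew(seen map[string]bool, hashes []string) (added, duplicates int) {
	for _, h := range hashes {
		if seen[h] {
			duplicates++
			continue
		}
		seen[h] = true
		added++
	}
	return added, duplicates
}
```

One consequence of hashing the masked key under this assumed rule: two different keys in the same file that share their first and last four characters would count as duplicates.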
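
The SARIF checks listed under Specific Ideas can be expressed as a small structural check, sketched below with only `encoding/json`. It is not a full JSON Schema validation against the official schema file, and `sarifLog` and `checkSARIF` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// sarifLog models only the SARIF fields the validation test inspects.
type sarifLog struct {
	Schema  string `json:"$schema"`
	Version string `json:"version"`
	Runs    []struct {
		Tool struct {
			Driver struct {
				Name string `json:"name"`
			} `json:"driver"`
		} `json:"tool"`
		Results []struct {
			RuleID    string            `json:"ruleId"`
			Level     string            `json:"level"`
			Message   json.RawMessage   `json:"message"`
			Locations []json.RawMessage `json:"locations"`
		} `json:"results"`
	} `json:"runs"`
}

// checkSARIF returns an error describing the first structural problem
// found, or nil when every listed field is present.
func checkSARIF(data []byte) error {
	var doc sarifLog
	if err := json.Unmarshal(data, &doc); err != nil {
		return fmt.Errorf("parse sarif: %w", err)
	}
	if doc.Schema == "" {
		return errors.New("missing $schema")
	}
	if doc.Version != "2.1.0" {
		return fmt.Errorf("version is %q, want 2.1.0", doc.Version)
	}
	if len(doc.Runs) == 0 {
		return errors.New("runs is empty")
	}
	for i, run := range doc.Runs {
		if run.Tool.Driver.Name != "keyhunter" {
			return fmt.Errorf("runs[%d].tool.driver.name is %q, want keyhunter", i, run.Tool.Driver.Name)
		}
		for j, res := range run.Results {
			if res.RuleID == "" || res.Level == "" || len(res.Message) == 0 || len(res.Locations) == 0 {
				return fmt.Errorf("runs[%d].results[%d] is missing ruleId, level, message, or locations", i, j)
			}
		}
	}
	return nil
}
```

A test could feed the SARIFFormatter output into `checkSARIF` as a fast first pass, with the embedded official schema covering anything this structural check does not.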