docs(07): import adapters and CI/CD context

This commit is contained in:
salvacybersec
2026-04-05 23:47:19 +03:00
parent f6f6730ddb
commit 5c74c35a26

# Phase 7: Import Adapters & CI/CD Integration - Context
**Gathered:** 2026-04-05
**Status:** Ready for planning
**Mode:** Auto-generated
<domain>
## Phase Boundary
Three capabilities:
1. Import findings from TruffleHog v3 JSON and Gitleaks JSON/CSV into the SQLite database (normalize to KeyHunter Finding schema, deduplicate)
2. Git pre-commit hook install/uninstall that runs `keyhunter scan` on staged files and blocks commits with findings
3. SARIF 2.1.0 output was already built in Phase 6; this phase verifies it passes GitHub Code Scanning validation
</domain>
<decisions>
## Implementation Decisions
### Import Package (IMP-01, IMP-02, IMP-03)
- **New package**: `pkg/importer/`
- **One file per format**: `pkg/importer/trufflehog.go`, `pkg/importer/gitleaks.go`
- **Common interface**: `Importer` with `Import(r io.Reader) ([]engine.Finding, error)`
- **TruffleHog v3 JSON schema**: array of objects with `{SourceID, SourceMetadata, SourceName, DetectorName, DetectorType, Verified, Raw, Redacted, ExtraData}`. Map:
- DetectorName → Finding.ProviderName (lowercase)
- Raw → Finding.KeyValue
- Verified → Finding.VerifyStatus (live/unverified)
- SourceMetadata → source path (JSON-nested, need path-based extraction)
- **Gitleaks JSON schema**: array of `{Description, StartLine, EndLine, StartColumn, EndColumn, Match, Secret, File, SymlinkFile, Commit, Entropy, Author, Email, Date, Message, Tags, RuleID, Fingerprint}`. Map:
- RuleID → Finding.ProviderName (normalize against our provider names)
- Secret → Finding.KeyValue
- File → Finding.Source
- StartLine → Finding.LineNumber
- **Gitleaks CSV**: same columns as JSON, use encoding/csv
- **Deduplication**: hash(provider + masked_key + source) before insert, skip if exists
- **CLI command**: `keyhunter import --format=<trufflehog|gitleaks|gitleaks-csv> <file>` reads file, inserts findings
### Git Hook (CICD-01)
- **Install**: `keyhunter hook install` writes `.git/hooks/pre-commit` shell script that calls `keyhunter scan $(git diff --cached --name-only --diff-filter=ACMR)` with `--exit-code`
- **Uninstall**: `keyhunter hook uninstall` removes the script (preserves any non-keyhunter portions in a backup)
- **Content**: simple bash script with shebang, runs keyhunter, exits with scan's exit code
- **Detection**: before install, check if pre-commit already exists; if yes, ask to append or replace (--force flag to skip prompt)
### SARIF GitHub Validation (CICD-02)
- Already produced in Phase 6 by SARIFFormatter
- This phase: add a validation test using official SARIF schema JSON (download once, embed in repo as testdata, validate output against it via gjson or simple schema check)
- Add documentation in README about uploading to GitHub: `.github/workflows/keyhunter.yml` example workflow
- Validate the SARIF against a format GitHub actually accepts, using a minimal validator or a manual schema check
### New Files
```
pkg/importer/
importer.go — Importer interface
trufflehog.go — TruffleHog v3 JSON parser
gitleaks.go — Gitleaks JSON + CSV parser
dedup.go — dedup hash logic
importer_test.go
testdata/
trufflehog-sample.json
gitleaks-sample.json
gitleaks-sample.csv
cmd/
import.go — keyhunter import command (replace stub)
hook.go — keyhunter hook install/uninstall (replace stub)
hook_script.sh — embedded pre-commit script template (via go:embed)
docs/
CI-CD.md — GitHub Actions example, pre-commit setup
testdata/sarif/
sarif-2.1.0-schema.json — official schema for validation tests
```
</decisions>
<code_context>
## Existing Code Insights
### Reusable Assets
- pkg/engine/finding.go — Finding struct (target for imports)
- pkg/storage/findings.go — SaveFinding (target for inserts)
- pkg/output/sarif.go — SARIFFormatter from Phase 6
- cmd/stubs.go — import and hook are stubs to replace
### Provider Name Normalization
- TruffleHog uses names like "OpenAI", "GitHubV2", "AWS" (mixed case)
- Gitleaks uses names like "openai-api-key", "aws-access-token" (kebab)
- Normalize to KeyHunter's lowercase names: openai, aws-bedrock, etc.
- Unknown provider names → keep as-is, tag confidence "imported"
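Sketching the normalization rule above (the map entries are illustrative, not the real provider table):

```go
package main

import (
	"fmt"
	"strings"
)

// knownProviders maps external detector/rule names to KeyHunter's lowercase
// provider names. Entries here are illustrative, not the real table.
var knownProviders = map[string]string{
	"openai":           "openai",
	"openai-api-key":   "openai",
	"aws":              "aws-bedrock",
	"aws-access-token": "aws-bedrock",
}

// normalizeProvider lowercases the incoming name and looks it up. Unknown
// names pass through unchanged; the caller tags them confidence "imported".
func normalizeProvider(name string) (provider string, known bool) {
	key := strings.ToLower(strings.TrimSpace(name))
	if mapped, ok := knownProviders[key]; ok {
		return mapped, true
	}
	return key, false
}

func main() {
	for _, n := range []string{"OpenAI", "aws-access-token", "SomeNewDetector"} {
		p, known := normalizeProvider(n)
		fmt.Println(n, "->", p, known)
	}
}
```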
</code_context>
<specifics>
## Specific Ideas
- Import command should show a summary: "Imported N findings (M new, K duplicates)"
- Hook install should verify `.git/` exists in current directory; error if not a repo
- SARIF validation test should check: `$schema`, `version`, `runs[]`, `runs[].tool.driver.name == "keyhunter"`, each result has `ruleId`, `level`, `message`, `locations`
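The field checks listed above could be implemented with a thin struct that models only the inspected fields, without a full schema validator. A sketch, assuming `encoding/json` suffices for the structural checks:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sarifLog models only the SARIF fields the validation test inspects.
type sarifLog struct {
	Schema  string `json:"$schema"`
	Version string `json:"version"`
	Runs    []struct {
		Tool struct {
			Driver struct {
				Name string `json:"name"`
			} `json:"driver"`
		} `json:"tool"`
		Results []struct {
			RuleID    string            `json:"ruleId"`
			Level     string            `json:"level"`
			Message   map[string]any    `json:"message"`
			Locations []json.RawMessage `json:"locations"`
		} `json:"results"`
	} `json:"runs"`
}

// checkSARIF enforces the structural checks listed above.
func checkSARIF(data []byte) error {
	var doc sarifLog
	if err := json.Unmarshal(data, &doc); err != nil {
		return err
	}
	if doc.Schema == "" || doc.Version != "2.1.0" {
		return fmt.Errorf("missing $schema or wrong version %q", doc.Version)
	}
	if len(doc.Runs) == 0 {
		return fmt.Errorf("no runs")
	}
	for _, run := range doc.Runs {
		if run.Tool.Driver.Name != "keyhunter" {
			return fmt.Errorf("unexpected driver name %q", run.Tool.Driver.Name)
		}
		for i, res := range run.Results {
			if res.RuleID == "" || res.Level == "" || res.Message == nil || len(res.Locations) == 0 {
				return fmt.Errorf("result %d missing required field", i)
			}
		}
	}
	return nil
}

func main() {
	sample := []byte(`{"$schema":"https://json.schemastore.org/sarif-2.1.0.json","version":"2.1.0","runs":[{"tool":{"driver":{"name":"keyhunter"}},"results":[{"ruleId":"openai","level":"error","message":{"text":"found"},"locations":[{}]}]}]}`)
	if err := checkSARIF(sample); err != nil {
		panic(err)
	}
	fmt.Println("sarif ok")
}
```

This complements, rather than replaces, validating against the embedded official schema in `testdata/sarif/`.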
</specifics>
<deferred>
## Deferred Ideas
- Import from arbitrary JSON formats via jsonpath config — over-engineering
- pre-push and post-merge hooks — pre-commit is enough for v1
- GitHub App integration for automatic scanning on PRs — separate project
- Semgrep/Snyk output format imports — defer to v2
</deferred>