keyhunter/.planning/research/PITFALLS.md
# Domain Pitfalls
**Domain:** API key scanner / secret detection tool (LLM/AI provider focus)
**Project:** KeyHunter
**Researched:** 2026-04-04
**Sources confidence:** HIGH for regex/performance (official Go docs, RE2 docs), MEDIUM for legal (CFAA analysis + DOJ policy), MEDIUM for OSINT reliability (community + vendor research)
---
## Critical Pitfalls
Mistakes that cause rewrites, legal exposure, or complete tool rejection.
---
### Pitfall 1: Catastrophic Regex Backtracking on Large Inputs
**What goes wrong:** Secret-detection patterns written with nested quantifiers (e.g., `(.+)+`, `(a|aa)+`) cause exponential CPU time on adversarial or malformed input. A single poorly written pattern can peg one CPU core at 100% indefinitely when scanning large files, binary blobs, or minified JavaScript. This is ReDoS — the same class of bug that has caused production outages in major platforms.
**Why it happens:** Pattern authors focus on correctness, not worst-case complexity. Patterns for generic secrets (Mistral, Cohere, Together AI) that lack high-confidence prefixes tend toward broad `[A-Za-z0-9]+` quantifiers, which become catastrophic when chained.
**Consequences:** Scanner hangs indefinitely on large repos or source maps. Workers in the pool block. If running as a CI hook or Telegram-triggered scan, the entire pipeline stalls with no feedback.
**Prevention:**
- Go's standard `regexp` package implements RE2 semantics, which guarantee linear-time matching with no exponential backtracking. This is a free safety net. Do NOT switch to `regexp2` (a backtracking, .NET-compatible engine) for any pattern, as that forfeits the guarantee.
- Add a per-pattern timeout as defense in depth; Go's `regexp` cannot be cancelled mid-match, so run each match in a goroutine and abandon it when a `context.WithTimeout` deadline expires.
- Add a regex complexity linter to the YAML provider validation step (CI check on provider YAML PRs).
- Benchmark every new provider pattern against a 10MB synthetic worst-case string before merging.
**Detection (warning signs):**
- Provider patterns containing `(.+)+`, `(a+)+`, or alternation inside repetition.
- Scan times that scale super-linearly with file size.
- Worker pool goroutines that never complete.
**Phase mapping:** Phase that builds the core regex engine and provider YAML schema. Add pattern complexity validation before any community provider contributions open up.
---
### Pitfall 2: False Positive Rate Killing Adoption (Alert Fatigue)
**What goes wrong:** Up to 80% of alerts from entropy-only or broad-regex scanners are false positives (HashiCorp, 2025). Developers stop reviewing alerts within weeks. The tool becomes security theater: real secrets are ignored because every alert feels like noise.
**Why it happens:** Three compounding errors:
1. Using entropy alone to flag secrets — entropy measures randomness, not whether a string is actually a secret. High-entropy strings like UUIDs, hashes, base64 content, and test fixtures flood results.
2. Patterns written to maximize recall without a precision floor — matching anything that looks like a key prefix.
3. No post-filtering for known non-secret contexts (test files, `.example` files, mock data, documentation).
**Consequences:** Red teams and bug bounty hunters abandon the tool after the first scan of a medium-sized monorepo produces 5,000 results with 4,000 false positives. The core value proposition ("find real, live keys") collapses.
**Prevention:**
- Layer detection: keyword pre-filter first (fast string match), then regex (pattern confirmation), then entropy check (optional calibration), then active verification (ground truth when `--verify` is enabled). Never rely on entropy alone.
- Implement allowlist patterns for known false-positive contexts at the YAML level: `test_`, `example_`, `fake_`, `dummy_`, `placeholder`.
- Expose a `--min-confidence` flag so users can tune recall/precision trade-off.
- Track and report per-provider false positive rate in internal benchmarks (CredData dataset is a standard benchmark).
- GitGuardian's ML approach (FP Remover) cut false positives by roughly 50%. A heuristic approximation (file-path context, variable-name context) can capture much of that benefit without the ML overhead.
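The layered pipeline can be sketched as below; the regex, the 3.0-bit entropy threshold, and the helper names are illustrative assumptions, not KeyHunter's actual implementation.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// shannonEntropy returns bits of entropy per character of s.
func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	freq := map[rune]float64{}
	for _, r := range s {
		freq[r]++
	}
	n := float64(len([]rune(s)))
	var h float64
	for _, c := range freq {
		p := c / n
		h -= p * math.Log2(p)
	}
	return h
}

var openAIKey = regexp.MustCompile(`sk-[A-Za-z0-9_-]{20,}`)
var allowlist = []string{"test_", "example_", "fake_", "dummy_", "placeholder"}

// classify applies the layers in order: cheap substring pre-filter, then
// regex confirmation, then an entropy floor, then allowlist suppression.
func classify(line string) (string, bool) {
	if !strings.Contains(line, "sk-") { // layer 1: fast keyword pre-filter
		return "", false
	}
	m := openAIKey.FindString(line) // layer 2: pattern confirmation
	if m == "" {
		return "", false
	}
	if shannonEntropy(m) < 3.0 { // layer 3: entropy calibration (threshold illustrative)
		return "", false
	}
	lower := strings.ToLower(line)
	for _, a := range allowlist { // layer 4: known non-secret contexts
		if strings.Contains(lower, a) {
			return "", false
		}
	}
	return m, true
}

func main() {
	if key, ok := classify(`OPENAI_API_KEY = "sk-Abc123Xyz789Qrs456Tuv"`); ok {
		fmt.Println("candidate:", key)
	}
}
```

Ordering matters: the substring check rejects most lines before the (comparatively expensive) regex ever runs.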
**Detection (warning signs):**
- First scan of any real project returns >1000 results.
- Providers with no high-confidence prefix (e.g., generic 32-char hex keys) have >50% FP rate in test runs.
- Users filing GitHub issues asking "why did it flag my README example?"
**Phase mapping:** Core engine phase. Must be addressed before OSINT/recon sources amplify the problem at scale — OSINT sources will produce far more candidate strings than local file scanning.
---
### Pitfall 3: Active Key Verification Creates Legal Exposure
**What goes wrong:** Calling a provider's API with a discovered key — even a single `/v1/models` health-check request — to confirm it is valid constitutes "accessing a computer system." Under the CFAA (Computer Fraud and Abuse Act) and analogous laws in other jurisdictions, using a credential you were never authorized to use may constitute unauthorized access regardless of intent.
**Why it happens:** Tool authors conflate "I found this key publicly" with "I am authorized to use this key." Public availability does not grant authorization. State laws (e.g., Virginia post-2024) have expanded computer crime definitions beyond the narrowed federal CFAA post-Van Buren ruling.
**Consequences:**
- Criminal exposure for the tool's users (bug bounty hunters, red teamers without explicit scope authorization).
- Civil liability if verification consumes paid quota from a key owner's account — each verification call may incur real cost to the victim.
- If KeyHunter becomes associated with a high-profile incident, the project could be taken down or banned from GitHub.
**Prevention:**
- Verification is opt-in behind `--verify` with clear documentation that the user bears legal responsibility for scope.
- Add a consent prompt on first `--verify` use: "Active verification sends HTTP requests to provider APIs using discovered keys. Ensure you have explicit authorization. Press Enter to continue or Ctrl+C to abort."
- Document the legal risk in README and man page. Cite the good-faith security research exception under DOJ policy — it requires documented authorization.
- Limit verification to read-only, non-destructive endpoints (list models, check account status) — never write operations.
- Consider adding a `--dry-run-verify` flag that shows what endpoints would be called without actually calling them.
**Detection (warning signs):**
- Any verification endpoint that modifies state, consumes significant quota, or accesses user data.
- Users running `--verify` in automated CI pipelines against repositories they do not own.
**Phase mapping:** Verification feature phase. Legal risk documentation must ship with the feature, not as a follow-up.
---
### Pitfall 4: Provider Pattern Rot — YAML Definitions Become Stale
**What goes wrong:** LLM providers change key formats, rotate prefixes, or add new key types. OpenAI migrated from `sk-` to `sk-proj-` for project-scoped keys in 2024. Anthropic keys use `sk-ant-api03-` (the `api03` version suffix implies prior versions existed). Patterns written for old formats silently miss the new format while still matching the old (now-retired) format.
**Why it happens:** Provider API key formats are undocumented implementation details. Providers change them without changelog entries. No automated system alerts the scanner maintainer when a provider changes key format.
**Consequences:**
- False negatives for the new format — active keys in the wild are missed entirely.
- False positives for the old format — patterns match expired key formats that no longer work.
- Tool appears broken to users who know the new format exists.
**Prevention:**
- Add a `format_version` field to each provider YAML definition. Document when the format was last verified against live keys.
- Add integration tests that attempt to construct a syntactically valid key for each provider and confirm the pattern matches it.
- Monitor provider changelogs and release notes as part of maintenance. Subscribe to OpenAI, Anthropic, and major provider changelogs.
- For providers with high-confidence prefixes, encode the full prefix including version segment in the pattern (e.g., `sk-ant-api03-` not just `sk-ant-`).
- GitHub Advanced Security adds/updates patterns monthly (28 new detectors added March 2026 alone). Use their changelog as an external signal for provider format changes.
- Build a "pattern health check" CI job that runs weekly against a curated set of known-format example keys.
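An illustrative provider definition carrying the versioning fields suggested above; all field names and the key-length bound are assumptions about the eventual schema, not a confirmed format.

```yaml
# providers/anthropic.yaml — illustrative schema sketch
id: anthropic
format_version: 3            # bump when the provider changes key format
last_verified: 2026-04-04    # when the pattern was last checked against live key formats
patterns:
  - regex: 'sk-ant-api03-[A-Za-z0-9_-]{20,}'   # full prefix incl. version segment
    confidence: high
keywords: [anthropic, claude]
```

Encoding `last_verified` in the file itself lets the weekly pattern-health CI job flag definitions that have gone stale.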
**Detection (warning signs):**
- User reports that a clearly valid Anthropic/OpenAI key is not detected.
- Provider documentation mentions a new key format in release notes.
- TruffleHog or Gitleaks updates a provider pattern — check their commits.
**Phase mapping:** Provider YAML definition phase and ongoing maintenance. Add pattern health CI before the first public release.
---
### Pitfall 5: OSINT Source Rate Limiting and IP Banning
**What goes wrong:** The 80+ OSINT sources (GitHub, Shodan, Censys, Pastebin, Google dorks, etc.) all have rate limits, bot detection, and account suspension policies. Aggressive scanning results in IP bans, CAPTCHA walls, API key revocation, or account termination — often silently. The scanner fails to report results without telling the user why.
**Why it happens:** Tool authors test against their own accounts with light traffic. Production use by red teams involves parallel workers hitting the same source from the same IP, triggering anomaly detection within minutes.
**Consequences:**
- Google blocks an IP after roughly 100 automated dork requests per hour — all dork results then disappear silently.
- GitHub bans integration tokens that exceed secondary rate limits (concurrent requests, not just per-hour).
- Shodan/Censys accounts get flagged for automated abuse patterns.
- Pastebin blocks scraping; their API is the only supported programmatic access.
- The tool appears to work (no error returned) but returns empty results from banned sources.
**Prevention:**
- Implement per-source rate limiting with configurable delays. Do not share a single rate limiter across sources.
- Respect `X-RateLimit-Remaining` and `Retry-After` headers. Back off exponentially on 429 responses.
- For Google dork scanning: use the Custom Search JSON API (100 queries/day free, up to 10,000/day paid) rather than scraping `google.com` directly.
- For Pastebin: use the official Pastebin API, not HTML scraping.
- Add `--stealth` mode flag that introduces human-like delays (1-5s jitter) between requests.
- Make rate limit configuration per-source in YAML so users can tune for their plan/tier.
- Log "source exhausted" events clearly — never silently skip a source without telling the user.
- For GitHub: use multiple tokens with rotation to stay within per-token limits across parallel workers.
**Detection (warning signs):**
- A previously productive source suddenly returns 0 results.
- HTTP 429 responses not surfaced to user.
- Workers for a source finish suspiciously fast (blocked silently).
**Phase mapping:** OSINT/recon engine phase. Rate limiting architecture must be designed before implementing individual sources — retrofitting it after 80 sources are built is a rewrite.
---
## Moderate Pitfalls
Mistakes that degrade quality or require significant rework but do not cause critical failure.
---
### Pitfall 6: Git History Scanning Memory Exhaustion on Large Repos
**What goes wrong:** Scanning the full git history of a large monorepo (50k+ commits, large binary artifacts, LFS objects) exhausts available memory or runs for hours. Binary files (images, compiled artifacts, ML model weights) are loaded into memory and fed through regex patterns that can never match — wasting CPU and RAM.
**Prevention:**
- Skip binary files before regex evaluation. Use file extension allowlists and MIME type sniffing (read first 512 bytes).
- Use delta-based scanning: only scan changed lines in each commit diff, not the full file on every commit. TruffleHog v3 uses this approach.
- Implement a per-file size limit (default 10MB) above which scanning is skipped with a warning.
- For git history: pass `--diff-filter=d` (or `--diff-filter=AM`) to `git log` so only added and modified content is scanned, not deletions; `--diff-filter=A` alone would restrict scanning to newly added files and miss modifications.
- Stream commit data rather than loading the entire repo object store into memory.
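The binary-skip heuristic via MIME sniffing might look like the sketch below; `looksBinary` and its allowed-type list are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// looksBinary sniffs a file's leading bytes (DetectContentType examines at
// most the first 512) and reports whether regex scanning should be skipped.
func looksBinary(head []byte) bool {
	ct := http.DetectContentType(head)
	return !strings.HasPrefix(ct, "text/") &&
		!strings.Contains(ct, "json") &&
		!strings.Contains(ct, "xml")
}

func main() {
	head := []byte("API_KEY=sk-test")
	fmt.Println(looksBinary(head)) // text/plain → false, so this file is scanned
}
```

Combining this sniff with an extension allowlist avoids feeding model weights and images through patterns that can never match.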
**Phase mapping:** Core scanning engine phase. Design streaming architecture from the start.
---
### Pitfall 7: SQLite Database Stores Keys Unencrypted by Default
**What goes wrong:** SQLite stores all data unencrypted by default. A documented security advisory (GHSA-4h8c-qrcq-cv5c) shows a real project storing API keys in SQLite without encryption, allowing anyone with filesystem access to read all discovered secrets.
**Prevention:**
- Use SQLCipher or Go's `modernc.org/sqlite` with application-level AES-256 encryption for sensitive columns.
- The PROJECT.md already specifies AES-256 encryption — implement this in Phase 1, not as an afterthought.
- Encrypt the database key itself using the OS keychain (macOS Keychain, Linux libsecret) rather than storing it in the config file.
- Never log full key values to stdout or log files. Honor `--unmask` boundary in all code paths.
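Column-level encryption with AES-256-GCM from Go's standard library could look like this sketch; sourcing the 32-byte key from the OS keychain is elided, and the helper names are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// encryptSecret seals plaintext with AES-256-GCM; key must be 32 bytes,
// ideally fetched from the OS keychain rather than a config file. The
// random nonce is prepended so decryptSecret can recover it.
func encryptSecret(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func decryptSecret(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // placeholder; production key comes from the keychain
	sealed, _ := encryptSecret(key, []byte("sk-example"))
	out, _ := decryptSecret(key, sealed)
	fmt.Println(string(out))
}
```

GCM also authenticates the ciphertext, so tampering with the stored column is detected at decryption time.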
**Phase mapping:** Storage layer phase. Encryption must be in from the beginning — adding it post-hoc requires database migration logic.
---
### Pitfall 8: Verification Endpoint Maintenance Burden
**What goes wrong:** Provider verification endpoints change without notice. A provider may deprecate `/v1/models`, require different authentication headers, change rate limits for unauthenticated requests, or start requiring a minimum account tier for endpoint access. Verification silently returns false negatives — valid keys appear invalid.
**Prevention:**
- Store the verification endpoint and expected response codes in the provider YAML (already planned). Include a `last_verified` date.
- Add a `--test-verifiers` command that runs each provider's verifier against a known-invalid key and confirms it returns the expected "invalid" response — catching when providers break the expected behavior.
- Design the verifier interface so fallback logic is easy: if primary endpoint returns unexpected status, try secondary endpoint before marking key invalid.
**Phase mapping:** Verification feature phase.
---
### Pitfall 9: Generic Provider Patterns Produce Massive Cross-Provider Collisions
**What goes wrong:** Providers that use generic 32-character hex or alphanumeric keys (Mistral, Cohere, Together AI, many Chinese providers) have patterns that overlap heavily. A key detected as "Mistral" may actually be a Together AI key. The tool confidently mis-attributes keys, damaging credibility.
**Prevention:**
- For generic-pattern providers, rely heavily on keyword context (surrounding variable names, import statements, config file names) to disambiguate.
- Report confidence level per detection: HIGH (prefix-confirmed), MEDIUM (keyword context), LOW (entropy only, no prefix, no keyword).
- When verification is enabled, attribution becomes deterministic — the key either authenticates to provider A or it doesn't.
- Consider a "multi-match" result type that lists all candidate providers when a key matches multiple generic patterns.
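The confidence taxonomy and multi-match result could be modeled as below; all names are illustrative assumptions.

```go
package main

import "fmt"

// Confidence follows the taxonomy suggested above.
type Confidence int

const (
	Low    Confidence = iota // entropy only: no prefix, no keyword context
	Medium                   // keyword context (variable name, file name)
	High                     // provider-specific prefix confirmed
)

// Finding carries every provider whose pattern matched, instead of
// guessing a single attribution for generic 32-char keys.
type Finding struct {
	Key        string
	Candidates []string // all providers whose patterns matched
	Confidence Confidence
}

func (f Finding) String() string {
	if len(f.Candidates) > 1 {
		return fmt.Sprintf("%s: multi-match %v (confidence=%d)", f.Key, f.Candidates, f.Confidence)
	}
	return fmt.Sprintf("%s: %s (confidence=%d)", f.Key, f.Candidates[0], f.Confidence)
}

func main() {
	f := Finding{Key: "3f9a (masked)", Candidates: []string{"mistral", "together"}, Confidence: Low}
	fmt.Println(f)
}
```

With `--verify` enabled, a multi-match collapses to a single candidate once one provider authenticates the key.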
**Phase mapping:** Provider YAML definition phase. Define a confidence taxonomy before implementing patterns.
---
## Minor Pitfalls
---
### Pitfall 10: Telegram Bot Exposes Keys in Group Chats
**What goes wrong:** `/scan` or `/key` commands triggered in a Telegram group chat expose full key values to all group members. Even with masking by default, scan results in group contexts violate the principle that discovered keys are sensitive.
**Prevention:**
- Restrict Telegram bot commands to private chats or authorized user IDs only.
- Never send unmasked keys via Telegram regardless of `--unmask` setting.
- Rate-limit bot commands to prevent abuse by unauthorized users who discover the bot.
**Phase mapping:** Telegram bot implementation phase.
---
### Pitfall 11: SARIF Output Used Without Context Causing CI Over-Blocking
**What goes wrong:** SARIF output consumed by GitHub Advanced Security or similar CI tooling blocks all PRs when the scanner finds any result. In a repo with legacy committed keys (even expired ones), every PR is blocked indefinitely — causing teams to disable the scanner entirely.
**Prevention:**
- Default SARIF severity to `warning`, not `error`, for unverified findings.
- Promote to `error` only for verified-active keys (requires `--verify`).
- Document the recommended CI configuration with a severity filter.
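A minimal SARIF `result` object for an unverified finding, per the defaults above; the `ruleId` naming scheme and `properties` keys are assumptions, and a full SARIF log wraps this in `runs[].results`.

```json
{
  "ruleId": "keyhunter/openai",
  "level": "warning",
  "message": { "text": "Possible OpenAI API key (unverified)" },
  "properties": { "verified": false, "confidence": "high" }
}
```

Only after `--verify` confirms the key would the emitter raise `level` to `"error"` and set `"verified": true`.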
**Phase mapping:** CI/CD integration phase.
---
### Pitfall 12: Dork Queries Violate Search Engine ToS
**What goes wrong:** Automated Google/Bing dork queries violate their Terms of Service. This is enforced via CAPTCHA walls and IP bans, but also creates legal exposure if the tool is used at scale by enterprise customers who sign ToS agreements.
**Prevention:**
- Document that automated dork scanning uses search engine APIs (Google Custom Search, Bing Web Search API) where possible, not direct HTML scraping.
- Offer DuckDuckGo as a default dork source (more permissive scraping stance), with Google/Bing as opt-in via API key.
- Include a ToS disclaimer in dork documentation.
**Phase mapping:** Dork engine phase.
---
## Phase-Specific Warnings
| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Core regex engine | Catastrophic backtracking (Pitfall 1) | Use Go's RE2-backed `regexp`, add pattern complexity linter |
| Provider YAML definitions | Generic pattern collisions (Pitfall 9), format rot (Pitfall 4) | Confidence taxonomy, format_version field, pattern health CI |
| False positive filtering | Alert fatigue (Pitfall 2) | Layered detection pipeline, allowlist context, confidence levels |
| SQLite storage | Unencrypted key storage (Pitfall 7) | AES-256 from day one, OS keychain for database key |
| Verification feature | Legal exposure (Pitfall 3), endpoint rot (Pitfall 8) | Opt-in only, consent prompt, test-verifiers command |
| OSINT/recon engine | Rate limiting and IP bans (Pitfall 5) | Per-source rate limiter architecture before implementing sources |
| Git history scanning | Memory exhaustion (Pitfall 6) | Binary file skip, delta-based scanning, streaming |
| Telegram bot | Key exposure in group chats (Pitfall 10) | Private chat restriction, no unmasked keys via bot |
| CI/CD integration (SARIF) | Over-blocking PRs (Pitfall 11) | Warning severity by default, error only for verified keys |
| Dork engine | ToS violations (Pitfall 12) | Search engine APIs over direct scraping, ToS documentation |
---
## Sources
- HashiCorp: [False positives: A big problem for secret scanners](https://www.hashicorp.com/en/blog/false-positives-a-big-problem-for-secret-scanners)
- HashiCorp/InfoQ: [Why traditional secret scanning tools fail](https://www.infoq.com/news/2025/10/hashicorp-secrets/)
- GitGuardian: [Should we target zero false positives?](https://blog.gitguardian.com/should-we-target-zero-false-positives/)
- GitGuardian: [ML-powered FP Remover cuts 50% false positives](https://blog.gitguardian.com/fp-remover-cuts-false-positives-by-half/)
- GitGuardian: [Secrets Detection Engine optimization](https://blog.gitguardian.com/fast-scans-return-earlier/)
- Google RE2: [RE2 — Fast, safe, non-backtracking regex](https://github.com/google/re2)
- Checkmarx: [ReDoS in Go](https://checkmarx.com/blog/redos-go/)
- Nightfall AI: [Best Go Regex Library](https://www.nightfall.ai/blog/best-go-regex-library)
- DOJ CFAA Policy: [Justice Manual 9-48.000](https://www.justice.gov/jm/jm-9-48000-computer-fraud)
- Arnold & Porter: [DOJ Revised CFAA Policy on Exceeds Authorized Access](https://www.arnoldporter.com/en/perspectives/blogs/enforcement-edge/2022/05/dojs-revised-cfaa-policy)
- Center for Cybersecurity Policy: [Virginia Expands Computer Crime Law](https://www.centerforcybersecuritypolicy.org/insights-and-research/virginia-supreme-court-expands-computer-crime-law-raising-legal-issues-for-ethical-hackers)
- Security Advisory: [Unencrypted Storage of API Keys in SQLite](https://github.com/LearningCircuit/local-deep-research/security/advisories/GHSA-4h8c-qrcq-cv5c)
- GitHub Changelog: [Secret scanning pattern updates — March 2026](https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/)
- GitHub Docs: [Rate limits for the REST API](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api)
- GitHub Changelog: [Updated rate limits for unauthenticated requests](https://github.blog/changelog/2025-05-08-updated-rate-limits-for-unauthenticated-requests/)
- OpenAI Community: [sk-proj- key format discussion](https://community.openai.com/t/how-to-create-an-api-secret-key-with-prefix-sk-only-always-creates-sk-proj-keys/1263531)
- Geekmonkey: [Hyperscan vs RE2 performance comparison](https://geekmonkey.org/regular-expression-matching-at-scale-with-hyperscan/)
- Betterleaks/CyberInfos: [Fixing API Key Leak Detection Gaps](https://www.cyberinfos.in/betterleaks-secrets-scanner-api-key-detection/)
- arxiv: [Secret Breach Detection with LLMs](https://arxiv.org/html/2504.18784v1)
- SecurityBoulevard: [How to reduce false positives while scanning for secrets](https://securityboulevard.com/2021/02/how-to-reduce-false-positives-while-scanning-for-secrets/)