Add quality_analyzer.py — PDF quality scoring for FOIA/CIA filtering
Scans PDF folders and scores each document (0-100) based on:
- Text content (word count, coherence, OCR garbage detection)
- Font presence (scanned vs text-based)
- File size, page count, filename quality
- Language detection (Arabic/Russian/Turkish/English)

Labels: high (70+), medium (40-69), low (20-39), noise (<20)

Outputs JSON + CSV. Can move noise to Arsiv/noise with --move.

Usage: --scan, --report, --export-csv, --move [--confirm]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
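
For illustration, the label thresholds listed in the message could be expressed roughly as below. This is a minimal sketch, not the code in quality_analyzer.py: the function name `label_for_score` is hypothetical, and treating fractional scores by lower bound (e.g. 69.5 counts as medium) is an assumption the message does not state.

```python
def label_for_score(score: float) -> str:
    """Map a 0-100 quality score to a label.

    Thresholds follow the commit message: high (70+), medium (40-69),
    low (20-39), noise (<20). The function name is hypothetical.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score >= 70:
        return "high"
    if score >= 40:
        return "medium"
    if score >= 20:
        return "low"
    return "noise"


if __name__ == "__main__":
    # Print the label at each boundary value.
    for sample in (0, 19, 20, 39, 40, 69, 70, 100):
        print(sample, label_for_score(sample))
```

Under this sketch, only documents labelled noise would be candidates for the --move step described above.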
4
.gitignore
vendored
@@ -9,3 +9,7 @@ ocr_output/
 __pycache__/
 *.pyc
 .venv/
+quality_analysis.json
+quality_analysis.csv
+quality_scan.log
+upload_run.log