Scans PDF folders and scores each document (0-100) based on:
- Text content (word count, coherence, OCR garbage detection)
- Font presence (scanned vs text-based)
- File size, page count, filename quality
- Language detection (Arabic/Russian/Turkish/English)

Labels: high (70+), medium (40-69), low (20-39), noise (<20)
Outputs JSON + CSV. Can move noise to Arsiv/noise with --move.

Usage: --scan, --report, --export-csv, --move [--confirm]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
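
A minimal sketch of how the label thresholds and flags above could fit together. The thresholds and flag names come from the commit message; the function names `label_for_score` and `build_parser`, the help strings, and the idea that `--confirm` gates the actual move are assumptions for illustration, not the script's real code.

```python
import argparse

# Thresholds from the commit message: high (70+), medium (40-69), low (20-39).
# Anything below 20 is "noise". The constant name is hypothetical.
LABEL_THRESHOLDS = [(70, "high"), (40, "medium"), (20, "low")]


def label_for_score(score: float) -> str:
    """Map a 0-100 quality score to one of the four labels."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0-100, got {score}")
    # Thresholds are sorted from highest to lowest, so the first match wins.
    for minimum, label in LABEL_THRESHOLDS:
        if score >= minimum:
            return label
    return "noise"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI exposing the flags listed in the usage line."""
    parser = argparse.ArgumentParser(description="Score PDFs by quality")
    parser.add_argument("--scan", action="store_true", help="scan and score PDFs")
    parser.add_argument("--report", action="store_true", help="print a summary")
    parser.add_argument("--export-csv", action="store_true", help="write CSV output")
    parser.add_argument("--move", action="store_true", help="move noise files")
    # Assumption: without --confirm, --move only previews what would be moved.
    parser.add_argument("--confirm", action="store_true", help="actually move files")
    return parser


if __name__ == "__main__":
    for sample in (85, 55, 25, 5):
        print(sample, label_for_score(sample))
```

Keeping the thresholds in one ordered list means a change to a boundary touches a single line, and the boundary values (70, 40, 20) each fall into the higher label, matching the ranges stated above.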
# State files (machine-specific, regenerated by script)
upload_progress.json
setup.log

# OCR output (large binary files)
ocr_output/

# Python
__pycache__/
*.pyc
.venv/
quality_analysis.json
quality_analysis.csv
quality_scan.log
upload_run.log