Add quality_analyzer.py — PDF quality scoring for FOIA/CIA filtering · 803e8be284 - anything-llm-rag

Scans PDF folders and scores each document (0-100) based on:
- Text content (word count, coherence, OCR garbage detection)
- Font presence (scanned vs text-based)
- File size, page count, filename quality
- Language detection (Arabic/Russian/Turkish/English)
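One way the "OCR garbage detection" signal could be approximated is by measuring how many whitespace-separated tokens fail to look like words. The sketch below is illustrative only: `garbage_ratio` and `WORD_RE` are hypothetical names, and the heuristic is an assumption, not necessarily what quality_analyzer.py does.

```python
import re

# Hypothetical heuristic: a token counts as "word-like" if, after stripping
# common punctuation, it consists of two or more Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]{2,}")

def garbage_ratio(text: str) -> float:
    """Return the fraction of tokens that do not look like words (0.0 to 1.0)."""
    tokens = text.split()
    if not tokens:
        # Assumption: a page with no extractable text is treated as all garbage.
        return 1.0
    wordlike = sum(
        1 for tok in tokens
        if WORD_RE.fullmatch(tok.strip(".,;:!?()[]\"'"))
    )
    return 1.0 - wordlike / len(tokens)
```

Because the pattern matches any Unicode letter, the same check works for Cyrillic or Latin text without a per-language word list.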

Labels: high (70+), medium (40-69), low (20-39), noise (<20)
Outputs JSON + CSV. Can move noise to Arsiv/noise with --move.
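The label thresholds above can be sketched as a small mapping function. `label_for_score` is a hypothetical name used for illustration; only the cut-offs (70, 40, 20) come from the commit message.

```python
def label_for_score(score: float) -> str:
    """Map a 0-100 quality score to one of the four labels."""
    if score >= 70:
        return "high"
    if score >= 40:
        return "medium"
    if score >= 20:
        return "low"
    return "noise"
```

Checking the thresholds from highest to lowest keeps each boundary in exactly one bucket, so a score of 40 is "medium" and 39 is "low".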

Usage: --scan, --report, --export-csv, --move [--confirm]
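A minimal sketch of how these flags might be wired up with `argparse`. It assumes every flag is a plain boolean switch and that `--confirm` only modifies `--move` (the brackets in the usage line suggest this, but the commit does not spell it out); the help strings and the function names `build_parser` and `parse_args` are illustrative.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="quality_analyzer.py",
        description="Score PDFs for quality (sketch of the CLI surface).",
    )
    # Assumption: all flags are boolean switches with no arguments.
    parser.add_argument("--scan", action="store_true", help="scan PDF folders and score documents")
    parser.add_argument("--report", action="store_true", help="print a summary of the last scan")
    parser.add_argument("--export-csv", action="store_true", help="write results as CSV")
    parser.add_argument("--move", action="store_true", help="move noise files to Arsiv/noise")
    parser.add_argument("--confirm", action="store_true", help="modifier for --move")
    return parser

def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: --confirm has no meaning on its own.
    if args.confirm and not args.move:
        parser.error("--confirm only makes sense together with --move")
    return args
```

For example, `parse_args(["--move"])` yields `move=True, confirm=False`, which a caller could treat as a preview before any files are moved.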

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Author: salvacybersec
Date: 2026-04-07 23:00:34 +03:00
parent 6c5a828b13
commit 803e8be284

2 changed files with 421 additions and 0 deletions

.gitignore (vendored), 4 additions

@@ -9,3 +9,7 @@ ocr_output/
 __pycache__/
 *.pyc
 .venv/
+quality_analysis.json
+quality_analysis.csv
+quality_scan.log
+upload_run.log
