anything-llm-rag

Author	SHA1	Message	Date
salvacybersec	792e951e62	Add Russian + Arabic OCR support (tur+eng+rus+ara) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 01:58:45 +03:00
salvacybersec	50f1d08c62	Add --ocr-only mode for standalone batch OCR without upload/embed. Scans entire library for scanned PDFs (pdffonts detection), OCRs them in-place with ocrmypdf (tur+eng, 3 retries per file). No AnythingLLM API needed; works offline. Supports --persona, --cluster, --priority filters and --dry-run preview. Usage: python3 setup.py --ocr-only --dry-run (preview); python3 setup.py --ocr-only (OCR all scanned PDFs); python3 setup.py --ocr-only --cluster intel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 01:40:10 +03:00
salvacybersec	dfde494bd4	Robust embed verification: 5 retries per call + LanceDB physical checks Every API call: - 5 retries with progressive backoff (Olla routes to random instances) - Body error detection (API 200 but embed error in response) Per persona verification: - First batch: LanceDB must physically grow + query must return sources - Every 10th batch: LanceDB growth check - Final: triple check (LanceDB size + workspace doc count API + search query) - Abort on model-not-found errors, skip after 5 consecutive failures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 01:30:09 +03:00
salvacybersec	3d4654f454	Add robust embedding verification — no silent failures - Pre-flight: test embedding model with 3 retries (120s timeout for cold start) - First-batch verify: after batch 1, query workspace to confirm vectors searchable - Abort on model errors: "not found" or "failed to embed" stops immediately - Consecutive failure guard: 3 fails in a row → skip persona, continue others - Response error check: API 200 but embed error in body → caught and logged - Never record progress for failed embeds Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 01:26:41 +03:00
salvacybersec	3176ebf102	Fix batch size (20→5) and script detection in monitor - Reduce embed batch to 5 — AnythingLLM hangs on batches >10 - Fix check_script_running() to properly detect setup.py process (was returning false because pgrep matched monitor.py too) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 00:33:35 +03:00
salvacybersec	1028d11507	Add structured logging + log panel to monitor - setup.py: logging module with file (setup.log) + console output - Line-buffered output (fixes background execution buffering) - API calls with timeout (300s), retry (3x), debug logging - Per-batch progress: [1/29] persona batch 1/20 (20 docs) - --verbose flag for debug-level console - monitor.py: log tail in CLI + web dashboard - CLI: colorized last 15 log lines - Web: scrollable log panel with level-based colors - Smaller embed batches (20 instead of 50) for reliability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 00:30:29 +03:00
salvacybersec	c45efcb261	Add --reassign mode for fast vector recovery without disk scanning Skips the slow folder scan (50K+ files) and upload phases — directly re-embeds already-uploaded documents to workspaces using progress state. Use with --reset to clear assignment tracking first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 00:04:36 +03:00
salvacybersec	9e9b75e0b3	Initial commit: AnythingLLM persona RAG integration 28 persona workspace with document upload, OCR pipeline, and vector embedding assignment via AnythingLLM API. Supports 5 clusters (intel, cyber, military, humanities, engineering) with batch processing and resume capability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:07:44 +03:00
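
Commits 50f1d08c62 and 792e951e62 describe detecting scanned PDFs with pdffonts and OCRing them in place with ocrmypdf. The following is a minimal sketch of that idea, not the repository's actual code. The function names and the "no font rows means scanned" heuristic are assumptions for illustration; a PDF that already carries an OCR text layer will list fonts and would be skipped by this check.

```python
import subprocess
from typing import List

# Language set taken from commit 792e951e62 (tur+eng+rus+ara).
OCR_LANGUAGES = "tur+eng+rus+ara"


def lists_no_fonts(pdffonts_output: str) -> bool:
    """Return True if pdffonts printed only its two header lines.

    pdffonts prints a column-title line and a dashed separator line,
    followed by one row per font. No rows suggests an image-only PDF.
    """
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    return len(lines) <= 2


def is_probably_scanned(pdf_path: str) -> bool:
    """Hypothetical helper: run pdffonts and apply the heuristic above."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return lists_no_fonts(result.stdout)


def build_ocr_command(pdf_path: str, languages: str = OCR_LANGUAGES) -> List[str]:
    """Build an ocrmypdf command that writes the result over the input file."""
    return ["ocrmypdf", "-l", languages, pdf_path, pdf_path]
```

A caller would check `is_probably_scanned(path)` and, if it returns True, pass `build_ocr_command(path)` to `subprocess.run`. Both external tools (pdffonts from poppler-utils, ocrmypdf) must be installed for that step.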

8 Commits
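
Commits 3d4654f454 and dfde494bd4 mention retrying each API call with progressive backoff and treating a 200 response that carries an embed error in its body as a failure. A rough sketch of that pattern follows. The helper names, the linear delay schedule, and the response shape (a plain string body) are assumptions; only the retry count of 5 and the error phrases "not found" and "failed to embed" come from the commit messages.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Error phrases quoted in commit 3d4654f454.
EMBED_ERROR_MARKERS = ("not found", "failed to embed")


def body_has_embed_error(body_text: str) -> bool:
    """Hypothetical check: flag a response body that reports an embed error."""
    lowered = body_text.lower()
    return any(marker in lowered for marker in EMBED_ERROR_MARKERS)


def call_with_retries(
    fn: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, waiting longer after each failure.

    The delay grows linearly (base_delay * attempt); the real schedule
    used by setup.py is not stated in the log, so this is an assumption.
    """
    last_error: Exception = RuntimeError("no attempts made")
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception as exc:  # sketch only; narrow this in real code
            last_error = exc
            if attempt < retries:
                sleep(base_delay * attempt)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. In practice `fn` would wrap one HTTP call and raise when `body_has_embed_error` returns True for the response text.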