Commit Graph

19 Commits

salvacybersec
fd454c4d79 Add upload progress panel to web monitor
- Upload progress bar with percentage, file count, speed, ETA
- Detects active upload from upload_*.log files automatically
- Last 10 upload lines shown with ✓/✗ color coding
- Combined log panel shows both setup.log and upload logs
- Upload folder distribution in API response

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 00:48:22 +03:00
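
The active-upload detection described in this commit can be sketched in a few lines: pick the most recently modified `upload_*.log` and tail its last lines for the panel. This is a minimal sketch of the idea; the glob pattern location and helper names are assumptions, not the script's actual internals.

```python
import glob
import os

def find_active_upload_log(log_dir="."):
    """Treat the most recently modified upload_*.log as the active
    upload (a sketch of the auto-detection the commit describes;
    the directory argument is an assumption)."""
    logs = glob.glob(os.path.join(log_dir, "upload_*.log"))
    if not logs:
        return None
    return max(logs, key=os.path.getmtime)

def tail_lines(path, n=10):
    """Return the last n lines of a log file for the progress panel."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.readlines()[-n:]
```
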
salvacybersec
803e8be284 Add quality_analyzer.py — PDF quality scoring for FOIA/CIA filtering
Scans PDF folders and scores each document (0-100) based on:
- Text content (word count, coherence, OCR garbage detection)
- Font presence (scanned vs text-based)
- File size, page count, filename quality
- Language detection (Arabic/Russian/Turkish/English)

Labels: high (70+), medium (40-69), low (20-39), noise (<20)
Outputs JSON + CSV. Can move noise to Arsiv/noise with --move.

Usage: --scan, --report, --export-csv, --move [--confirm]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 23:00:34 +03:00
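
The label thresholds above map directly to a small bucketing function. A minimal sketch, using only the cut-offs stated in the commit message (the scoring itself, word counts and OCR-garbage detection, is not reproduced here):

```python
def quality_label(score: int) -> str:
    """Map a 0-100 PDF quality score to the commit's labels:
    high (70+), medium (40-69), low (20-39), noise (<20)."""
    if score >= 70:
        return "high"
    if score >= 40:
        return "medium"
    if score >= 20:
        return "low"
    return "noise"
```
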
salvacybersec
6c5a828b13 Tune speed profiles based on real-world Olla analysis
Log analysis: 30s too short (model load ~50s), 300s wastes time.
Real embed takes 1-20s when model is warm, 40-60s on cold load.

Tuned profiles:
  fast:   90s timeout, 5 retries, batch 5, 1s delay
  medium: 120s timeout, 3 retries, batch 5, 2s delay
  slow:   300s timeout, 3 retries, batch 5, 5s delay

Result: 1-2s/batch when model warm (was 200s/batch avg)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 11:08:28 +03:00
salvacybersec
0a07045e17 Add --speed fast/medium/slow profiles for embed operations
Speed profiles control timeout, retries, batch size, and delays:
  fast:   30s timeout, 7 retries, batch 10, 1s delay (~5x faster)
  medium: 60s timeout, 5 retries, batch 5, 2s delay (default)
  slow:   300s timeout, 3 retries, batch 5, 5s delay (safe)

Analysis showed 54% of batches hit 300s timeout on Olla bad routes,
wasting 7.7h on 155 batches. Fast mode reduces timeout waste from
300s to 30s per bad route — real embeds take ~18s on average.

Also reduced default batch delay from 5s to 2s in config.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 10:30:50 +03:00
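
The three profiles above amount to a small lookup table. A sketch with the values taken straight from this commit message (later tuned by the commit above it); the dict layout and function name are illustrative, not the script's real structure:

```python
SPEED_PROFILES = {
    # timeout (s), retry count, embed batch size, inter-batch delay (s)
    "fast":   {"timeout": 30,  "retries": 7, "batch_size": 10, "delay": 1},
    "medium": {"timeout": 60,  "retries": 5, "batch_size": 5,  "delay": 2},
    "slow":   {"timeout": 300, "retries": 3, "batch_size": 5,  "delay": 5},
}

def get_profile(name: str) -> dict:
    """Look up a --speed profile, defaulting to 'medium'."""
    return SPEED_PROFILES.get(name, SPEED_PROFILES["medium"])
```
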
salvacybersec
be0a333134 Add ETA calculation to web dashboard and CLI monitor
Parses batch timestamps from setup.log, averages last 20 batches,
calculates remaining time. Shows ETA, docs remaining, and avg
seconds per batch in both web summary cards and CLI header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 02:00:20 +03:00
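
The ETA logic above (average the last 20 batch gaps, project over remaining work) can be sketched as follows. Timestamp parsing from setup.log is omitted; the function assumes a list of `datetime` objects and is an illustration, not the monitor's actual code:

```python
from datetime import datetime

def estimate_eta(batch_timestamps, batches_remaining, window=20):
    """Average the gaps between the last `window` batch timestamps
    and project remaining seconds, as the commit describes.
    Returns None when fewer than two timestamps are available."""
    recent = batch_timestamps[-window:]
    if len(recent) < 2:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(recent, recent[1:])]
    avg_per_batch = sum(gaps) / len(gaps)
    return avg_per_batch * batches_remaining
```
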
salvacybersec
792e951e62 Add Russian + Arabic OCR support (tur+eng+rus+ara)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:58:45 +03:00
salvacybersec
50f1d08c62 Add --ocr-only mode for standalone batch OCR without upload/embed
Scans entire library for scanned PDFs (pdffonts detection), OCRs them
in-place with ocrmypdf (tur+eng, 3 retries per file). No AnythingLLM
API needed — works offline. Supports --persona, --cluster, --priority
filters and --dry-run preview.

Usage:
  python3 setup.py --ocr-only --dry-run    # preview
  python3 setup.py --ocr-only              # OCR all scanned PDFs
  python3 setup.py --ocr-only --cluster intel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:40:10 +03:00
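
The pdffonts-based scan detection above relies on a known trait of `pdffonts` (poppler-utils): it prints a two-line header followed by one row per embedded font, so a PDF whose output has no rows after the header is image-only. A sketch under that assumption; the function names are illustrative:

```python
import subprocess

def has_no_fonts(pdffonts_output: str) -> bool:
    """pdffonts prints a two-line header, then one row per embedded
    font; no rows after the header means an image-only (scanned) PDF."""
    rows = [line for line in pdffonts_output.splitlines() if line.strip()]
    return len(rows) <= 2

def is_scanned_pdf(path: str) -> bool:
    out = subprocess.run(["pdffonts", path],
                         capture_output=True, text=True).stdout
    return has_no_fonts(out)
```
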
salvacybersec
0eb84bab78 Update README: bge-m3 embedding, Olla proxy, verification system docs
- Correct embedding model: bge-m3:latest (1024d) via Olla proxy
- Document 3-layer verification system (per-call, first-batch, triple-check)
- Add monitor.py usage section
- Add full recovery procedures including lancedb/vector-cache cleanup
- Document Olla load balancer retry behavior
- Add technical notes on batch size, rate limiting, log buffering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:34:16 +03:00
salvacybersec
dfde494bd4 Robust embed verification: 5 retries per call + LanceDB physical checks
Every API call:
- 5 retries with progressive backoff (Olla routes to random instances)
- Body error detection (API 200 but embed error in response)

Per persona verification:
- First batch: LanceDB must physically grow + query must return sources
- Every 10th batch: LanceDB growth check
- Final: triple check (LanceDB size + workspace doc count API + search query)
- Abort on model-not-found errors, skip after 5 consecutive failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:30:09 +03:00
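
The per-call retry policy above (5 attempts, progressive backoff, error detection inside a 200 response body) can be sketched generically. `do_call` stands in for the actual embed API call and is an assumption, not the script's real helper:

```python
import time

def call_with_retries(do_call, max_retries=5, base_delay=2):
    """Up to max_retries attempts with progressive backoff, plus body
    error detection: the API can answer 200 yet report an embed error
    inside the JSON payload, which is treated as a failure too."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = do_call()
            if isinstance(resp, dict) and resp.get("error"):
                raise RuntimeError(resp["error"])
            return resp
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * attempt)  # progressive backoff
```
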
salvacybersec
3d4654f454 Add robust embedding verification — no silent failures
- Pre-flight: test embedding model with 3 retries (120s timeout for cold start)
- First-batch verify: after batch 1, query workspace to confirm vectors searchable
- Abort on model errors: "not found" or "failed to embed" stops immediately
- Consecutive failure guard: 3 fails in a row → skip persona, continue others
- Response error check: API 200 but embed error in body → caught and logged
- Never record progress for failed embeds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:26:41 +03:00
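
The consecutive-failure guard above (3 fails in a row skips the persona but lets the run continue) is a small state machine. A sketch where `embed_one` is a stand-in for the real per-batch embed call, not the script's actual function:

```python
def embed_batches(batches, embed_one, max_consecutive=3):
    """Embed batches in order; a success resets the failure counter,
    and max_consecutive failures in a row skips the rest of this
    persona so other personas can still run."""
    consecutive = 0
    done = []
    for batch in batches:
        try:
            embed_one(batch)
            done.append(batch)
            consecutive = 0
        except Exception:
            consecutive += 1
            if consecutive >= max_consecutive:
                return done, "skipped"
    return done, "ok"
```
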
salvacybersec
fd29c0efb6 Fix monitor: use API slugs, show KB for small sizes
- Fetch real workspace slugs from AnythingLLM API instead of guessing
- Show KB instead of 0MB for small LanceDB/vector sizes
- Fixes incorrect vector detection after embedding engine change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:40:13 +03:00
salvacybersec
3176ebf102 Fix batch size (20→5) and script detection in monitor
- Reduce embed batch to 5 — AnythingLLM hangs on batches >10
- Fix check_script_running() to properly detect setup.py process
  (was returning false because pgrep matched monitor.py too)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:33:35 +03:00
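
The fixed script detection above boils down to filtering process command lines so that monitor.py is never mistaken for setup.py (a loose pgrep pattern matched both). A sketch of the filtering step only; `cmdlines` would come from something like `ps -eo args`, and the helper name is illustrative:

```python
def is_setup_running(cmdlines):
    """Return True if any process command line names setup.py while
    explicitly excluding monitor.py, which the old loose match
    incorrectly picked up."""
    for cmd in cmdlines:
        if "setup.py" in cmd and "monitor.py" not in cmd:
            return True
    return False
```
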
salvacybersec
1028d11507 Add structured logging + log panel to monitor
- setup.py: logging module with file (setup.log) + console output
  - Line-buffered output (fixes background execution buffering)
  - API calls with timeout (300s), retry (3x), debug logging
  - Per-batch progress: [1/29] persona batch 1/20 (20 docs)
  - --verbose flag for debug-level console
- monitor.py: log tail in CLI + web dashboard
  - CLI: colorized last 15 log lines
  - Web: scrollable log panel with level-based colors
- Smaller embed batches (20 instead of 50) for reliability

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:30:29 +03:00
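
The file+console logging with line-buffered output described above can be sketched with the stdlib `logging` module; `sys.stdout.reconfigure(line_buffering=True)` (Python 3.7+) is one way to get the background-execution buffering fix the commit mentions. The formatter string and logger name are assumptions, not the script's actual configuration:

```python
import logging
import sys

def setup_logging(path="setup.log", verbose=False):
    """File handler writes setup.log; stream handler mirrors to the
    console (DEBUG level with --verbose, INFO otherwise). Line
    buffering on stdout keeps output flowing when run in the
    background."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(line_buffering=True)
    logger = logging.getLogger("setup")
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    fh = logging.FileHandler(path)
    fh.setFormatter(fmt)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.DEBUG if verbose else logging.INFO)
    ch.setFormatter(fmt)
    logger.addHandler(fh)
    logger.addHandler(ch)
    return logger
```
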
salvacybersec
9105c03b4b Add monitor.py: CLI + web dashboard for embedding progress
Three modes:
  python3 monitor.py          # one-shot CLI
  python3 monitor.py --watch  # auto-refresh 2s
  python3 monitor.py --web    # web dashboard on :8899

Shows per-persona progress bars, vector sizes, API/script status,
cluster grouping with color coding. Web mode auto-polls /api/status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:24:07 +03:00
salvacybersec
98ed69653d Update all docs: 29 personas, 88 paths, 39K files, --reassign mode
Sync README, skill, memory, and Obsidian note with current state:
- 29 persona workspaces across 5 clusters
- 88 mapped paths covering 39,754 files (67 GB)
- New --reassign --reset mode for fast vector recovery
- Expanded skip_extensions list
- Gitea repo reference added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:19:27 +03:00
salvacybersec
e54ed045fe Map remaining 4 files: _oneshots + KonferansSlaytlari to Neo
Full library coverage complete — 0 unmapped content folders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:15:34 +03:00
salvacybersec
24f22b5b6c Expand config coverage from 73 to 86 mapped paths across 39K files
- Add SiyasetVeTeori (1262 files) to Tribune + Sage personas
- Add Marketing (12 files) to Herald persona
- Add 7 regional FOIA-CIA folders to Frodo + Scribe
- Add NATO/FOIA-NATO to Scribe + Warden
- Add MobilGuvenlik to Neo + Specter, KonferansSunumlari to Neo
- Add HHSGuvenlikEgitimi to Bastion
- Expand skip_extensions: add .djvu .mobi .azw3 .json .log .py .js
  .jsx .htm .png .gif .mp3 .flac .gov .org .db

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:13:30 +03:00
salvacybersec
c45efcb261 Add --reassign mode for fast vector recovery without disk scanning
Skips the slow folder scan (50K+ files) and upload phases — directly
re-embeds already-uploaded documents to workspaces using progress state.
Use with --reset to clear assignment tracking first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:04:36 +03:00
salvacybersec
9e9b75e0b3 Initial commit: AnythingLLM persona RAG integration
28 persona workspaces with document upload, OCR pipeline, and vector embedding
assignment via AnythingLLM API. Supports 5 clusters (intel, cyber, military,
humanities, engineering) with batch processing and resume capability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:07:44 +03:00