10 Commits

Author SHA1 Message Date
salvacybersec
6c5a828b13 Tune speed profiles based on real-world Olla analysis
Log analysis: 30s too short (model load ~50s), 300s wastes time.
Real embed takes 1-20s when model is warm, 40-60s on cold load.

Tuned profiles:
  fast:   90s timeout, 5 retries, batch 5, 1s delay
  medium: 120s timeout, 3 retries, batch 5, 2s delay
  slow:   300s timeout, 3 retries, batch 5, 5s delay

Result: 1-2s/batch when model warm (was 200s/batch avg)
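
For illustration, the three tuned profiles above could be encoded as a small lookup table. The structure below is an assumption for the sketch, not necessarily how setup.py actually stores them; the numbers match the tuned values listed above.

```python
# Hypothetical encoding of the tuned --speed profiles; the field names
# (timeout, retries, batch_size, delay) are illustrative assumptions.
SPEED_PROFILES = {
    "fast":   {"timeout": 90,  "retries": 5, "batch_size": 5, "delay": 1},
    "medium": {"timeout": 120, "retries": 3, "batch_size": 5, "delay": 2},
    "slow":   {"timeout": 300, "retries": 3, "batch_size": 5, "delay": 5},
}

def get_profile(name: str) -> dict:
    """Look up a --speed profile, falling back to medium (the default)."""
    return SPEED_PROFILES.get(name, SPEED_PROFILES["medium"])
```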

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 11:08:28 +03:00
salvacybersec
0a07045e17 Add --speed fast/medium/slow profiles for embed operations
Speed profiles control timeout, retries, batch size, and delays:
  fast:   30s timeout, 7 retries, batch 10, 1s delay (~5x faster)
  medium: 60s timeout, 5 retries, batch 5, 2s delay (default)
  slow:   300s timeout, 3 retries, batch 5, 5s delay (safe)

Analysis showed 54% of batches hit 300s timeout on Olla bad routes,
wasting 7.7h on 155 batches. Fast mode reduces timeout waste from
300s to 30s per bad route — real embeds take ~18s on average.

Also reduced default batch delay from 5s to 2s in config.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 10:30:50 +03:00
salvacybersec
792e951e62 Add Russian + Arabic OCR support (tur+eng+rus+ara)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:58:45 +03:00
salvacybersec
50f1d08c62 Add --ocr-only mode for standalone batch OCR without upload/embed
Scans entire library for scanned PDFs (pdffonts detection), OCRs them
in-place with ocrmypdf (tur+eng, 3 retries per file). No AnythingLLM
API needed — works offline. Supports --persona, --cluster, --priority
filters and --dry-run preview.

Usage:
  python3 setup.py --ocr-only --dry-run    # preview
  python3 setup.py --ocr-only              # OCR all scanned PDFs
  python3 setup.py --ocr-only --cluster intel
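
The pdffonts-based detection can be sketched roughly as below. The function names are hypothetical, and the heuristic (a pdffonts listing with no font rows means no text layer, hence a scanned PDF) is the assumption behind the detection:

```python
import subprocess

def is_scanned(pdffonts_output: str) -> bool:
    """A pdffonts listing with no font rows (only the two header
    lines) means the PDF has no text layer, i.e. it is scanned."""
    lines = [ln for ln in pdffonts_output.strip().splitlines() if ln.strip()]
    return len(lines) <= 2  # column header + separator only

def pdf_is_scanned(path: str) -> bool:
    """Run pdffonts (poppler-utils) on PATH and check for font rows."""
    out = subprocess.run(["pdffonts", path], capture_output=True,
                         text=True, check=True).stdout
    return is_scanned(out)
```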

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:40:10 +03:00
salvacybersec
dfde494bd4 Robust embed verification: 5 retries per call + LanceDB physical checks
Every API call:
- 5 retries with progressive backoff (Olla routes to random instances)
- Body error detection (API 200 but embed error in response)

Per persona verification:
- First batch: LanceDB must physically grow + query must return sources
- Every 10th batch: LanceDB growth check
- Final: triple check (LanceDB size + workspace doc count API + search query)
- Abort on model-not-found errors, skip after 5 consecutive failures
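
A minimal sketch of the per-call retry scheme, assuming the caller wraps each API call in a function returning (status, body); the names and the backoff steps are illustrative, not the repo's actual code:

```python
import time

class EmbedError(RuntimeError):
    pass

def call_with_retries(call, retries=5, base_delay=2.0):
    """Retry CALL with progressive backoff. A response whose HTTP
    status is 200 but whose body still reports an embed error is
    treated as a failure, matching the body-error detection above."""
    for attempt in range(1, retries + 1):
        try:
            status, body = call()
            if status == 200 and "error" not in body:
                return body
        except OSError:
            pass  # bad Olla route this time; the retry may land elsewhere
        if attempt < retries:
            time.sleep(base_delay * attempt)  # progressive: 2s, 4s, 6s, ...
    raise EmbedError(f"failed after {retries} attempts")
```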

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:30:09 +03:00
salvacybersec
3d4654f454 Add robust embedding verification — no silent failures
- Pre-flight: test embedding model with 3 retries (120s timeout for cold start)
- First-batch verify: after batch 1, query workspace to confirm vectors searchable
- Abort on model errors: "not found" or "failed to embed" stops immediately
- Consecutive failure guard: 3 fails in a row → skip persona, continue others
- Response error check: API 200 but embed error in body → caught and logged
- Never record progress for failed embeds
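
The consecutive-failure guard might look like this sketch (names and signature are assumptions): a success resets the counter, and three failures in a row abandon the current persona without stopping the others.

```python
def embed_persona_batches(batches, embed_batch, max_consecutive=3):
    """Embed BATCHES one by one. A success resets the failure counter;
    MAX_CONSECUTIVE failures in a row abandon this persona so the
    remaining personas can still run. Returns (embedded, aborted)."""
    embedded, failures = [], 0
    for batch in batches:
        if embed_batch(batch):
            embedded.append(batch)
            failures = 0
        else:
            failures += 1
            if failures >= max_consecutive:
                return embedded, True  # guard tripped: skip this persona
    return embedded, False
```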

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:26:41 +03:00
salvacybersec
3176ebf102 Fix batch size (20→5) and script detection in monitor
- Reduce embed batch to 5 — AnythingLLM hangs on batches >10
- Fix check_script_running() to properly detect setup.py process
  (was returning false because pgrep matched monitor.py too)
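
One way to express the fixed detection, operating on (pid, command line) pairs such as `pgrep -af` reports; the function name and the two-token heuristic are assumptions for this sketch:

```python
def find_setup_pids(processes):
    """Keep PIDs whose command actually runs setup.py (interpreter plus
    script in the first two tokens), rejecting monitor.py processes
    that merely mention setup.py somewhere in their arguments."""
    pids = []
    for pid, cmdline in processes:
        tokens = cmdline.split()
        runs_setup = any(t.endswith("setup.py") for t in tokens[:2])
        is_monitor = any(t.endswith("monitor.py") for t in tokens)
        if runs_setup and not is_monitor:
            pids.append(pid)
    return pids
```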

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:33:35 +03:00
salvacybersec
1028d11507 Add structured logging + log panel to monitor
- setup.py: logging module with file (setup.log) + console output
  - Line-buffered output (fixes background execution buffering)
  - API calls with timeout (300s), retry (3x), debug logging
  - Per-batch progress: [1/29] persona batch 1/20 (20 docs)
  - --verbose flag for debug-level console
- monitor.py: log tail in CLI + web dashboard
  - CLI: colorized last 15 log lines
  - Web: scrollable log panel with level-based colors
- Smaller embed batches (20 instead of 50) for reliability
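
A sketch of the dual file + console logging setup described above, under the assumption that the file always records debug detail while the console level depends on --verbose; the function name and format string are illustrative:

```python
import logging
import sys

def make_logger(verbose=False, logfile="setup.log"):
    """Log everything to LOGFILE; console shows INFO, or DEBUG
    with --verbose. Line-buffered stdout avoids the buffering
    problem when the script runs in the background."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(line_buffering=True)
    log = logging.getLogger("setup")
    log.setLevel(logging.DEBUG)
    log.handlers.clear()  # idempotent across repeated calls
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    fh = logging.FileHandler(logfile)
    fh.setLevel(logging.DEBUG)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.DEBUG if verbose else logging.INFO)
    for handler in (fh, ch):
        handler.setFormatter(fmt)
        log.addHandler(handler)
    return log
```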

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:30:29 +03:00
salvacybersec
c45efcb261 Add --reassign mode for fast vector recovery without disk scanning
Skips the slow folder scan (50K+ files) and upload phases — directly
re-embeds already-uploaded documents to workspaces using progress state.
Use with --reset to clear assignment tracking first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:04:36 +03:00
salvacybersec
9e9b75e0b3 Initial commit: AnythingLLM persona RAG integration
28 persona workspaces with document upload, OCR pipeline, and vector embedding
assignment via the AnythingLLM API. Supports 5 clusters (intel, cyber, military,
humanities, engineering) with batch processing and resume capability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:07:44 +03:00