Commit Graph

19 Commits

salvacybersec
fd454c4d79 Add upload progress panel to web monitor
- Upload progress bar with percentage, file count, speed, ETA
- Detects active upload from upload_*.log files automatically
- Last 10 upload lines shown with ✓/✗ color coding
- Combined log panel shows both setup.log and upload logs
- Upload folder distribution in API response

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 00:48:22 +03:00
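
The active-upload detection described in this commit can be sketched in a few lines: pick the most recently modified `upload_*.log` and tail its last lines for the panel. This is a minimal sketch of the idea; the glob pattern location and helper names are assumptions, not the script's actual internals.

```python
import glob
import os

def find_active_upload_log(log_dir="."):
    """Treat the most recently modified upload_*.log as the active
    upload (a sketch of the auto-detection the commit describes;
    the directory argument is an assumption)."""
    logs = glob.glob(os.path.join(log_dir, "upload_*.log"))
    if not logs:
        return None
    return max(logs, key=os.path.getmtime)

def tail_lines(path, n=10):
    """Return the last n lines of a log file for the progress panel."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.readlines()[-n:]
```
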
salvacybersec
803e8be284 Add quality_analyzer.py — PDF quality scoring for FOIA/CIA filtering
Scans PDF folders and scores each document (0-100) based on:
- Text content (word count, coherence, OCR garbage detection)
- Font presence (scanned vs text-based)
- File size, page count, filename quality
- Language detection (Arabic/Russian/Turkish/English)

Labels: high (70+), medium (40-69), low (20-39), noise (<20)
Outputs JSON + CSV. Can move noise to Arsiv/noise with --move.

Usage: --scan, --report, --export-csv, --move [--confirm]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 23:00:34 +03:00
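
The label thresholds above map directly to a small bucketing function. A minimal sketch, using only the cut-offs stated in the commit message (the scoring itself, word counts and OCR-garbage detection, is not reproduced here):

```python
def quality_label(score: int) -> str:
    """Map a 0-100 PDF quality score to the commit's labels:
    high (70+), medium (40-69), low (20-39), noise (<20)."""
    if score >= 70:
        return "high"
    if score >= 40:
        return "medium"
    if score >= 20:
        return "low"
    return "noise"
```
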
salvacybersec
6c5a828b13 Tune speed profiles based on real-world Olla analysis
Log analysis: 30s too short (model load ~50s), 300s wastes time.
Real embed takes 1-20s when model is warm, 40-60s on cold load.

Tuned profiles:
  fast:   90s timeout, 5 retries, batch 5, 1s delay
  medium: 120s timeout, 3 retries, batch 5, 2s delay
  slow:   300s timeout, 3 retries, batch 5, 5s delay

Result: 1-2s/batch when model warm (was 200s/batch avg)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 11:08:28 +03:00
salvacybersec
0a07045e17 Add --speed fast/medium/slow profiles for embed operations
Speed profiles control timeout, retries, batch size, and delays:
  fast:   30s timeout, 7 retries, batch 10, 1s delay (~5x faster)
  medium: 60s timeout, 5 retries, batch 5, 2s delay (default)
  slow:   300s timeout, 3 retries, batch 5, 5s delay (safe)

Analysis showed 54% of batches hit 300s timeout on Olla bad routes,
wasting 7.7h on 155 batches. Fast mode reduces timeout waste from
300s to 30s per bad route — real embeds take ~18s on average.

Also reduced default batch delay from 5s to 2s in config.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 10:30:50 +03:00
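
The three profiles above amount to a small lookup table. A sketch with the values taken straight from this commit message (later tuned by the commit above it); the dict layout and function name are illustrative, not the script's real structure:

```python
SPEED_PROFILES = {
    # timeout (s), retry count, embed batch size, inter-batch delay (s)
    "fast":   {"timeout": 30,  "retries": 7, "batch_size": 10, "delay": 1},
    "medium": {"timeout": 60,  "retries": 5, "batch_size": 5,  "delay": 2},
    "slow":   {"timeout": 300, "retries": 3, "batch_size": 5,  "delay": 5},
}

def get_profile(name: str) -> dict:
    """Look up a --speed profile, defaulting to 'medium'."""
    return SPEED_PROFILES.get(name, SPEED_PROFILES["medium"])
```
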
salvacybersec
be0a333134 Add ETA calculation to web dashboard and CLI monitor
Parses batch timestamps from setup.log, averages last 20 batches,
calculates remaining time. Shows ETA, docs remaining, and avg
seconds per batch in both web summary cards and CLI header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 02:00:20 +03:00
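
The ETA logic above (average the last 20 batch gaps, project over remaining work) can be sketched as follows. Timestamp parsing from setup.log is omitted; the function assumes a list of `datetime` objects and is an illustration, not the monitor's actual code:

```python
from datetime import datetime

def estimate_eta(batch_timestamps, batches_remaining, window=20):
    """Average the gaps between the last `window` batch timestamps
    and project remaining seconds, as the commit describes.
    Returns None when fewer than two timestamps are available."""
    recent = batch_timestamps[-window:]
    if len(recent) < 2:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(recent, recent[1:])]
    avg_per_batch = sum(gaps) / len(gaps)
    return avg_per_batch * batches_remaining
```
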
salvacybersec
792e951e62 Add Russian + Arabic OCR support (tur+eng+rus+ara)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:58:45 +03:00
salvacybersec
50f1d08c62 Add --ocr-only mode for standalone batch OCR without upload/embed
Scans entire library for scanned PDFs (pdffonts detection), OCRs them
in-place with ocrmypdf (tur+eng, 3 retries per file). No AnythingLLM
API needed — works offline. Supports --persona, --cluster, --priority
filters and --dry-run preview.

Usage:
  python3 setup.py --ocr-only --dry-run    # preview
  python3 setup.py --ocr-only              # OCR all scanned PDFs
  python3 setup.py --ocr-only --cluster intel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:40:10 +03:00
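
The pdffonts-based scan detection above relies on a known trait of `pdffonts` (poppler-utils): it prints a two-line header followed by one row per embedded font, so a PDF whose output has no rows after the header is image-only. A sketch under that assumption; the function names are illustrative:

```python
import subprocess

def has_no_fonts(pdffonts_output: str) -> bool:
    """pdffonts prints a two-line header, then one row per embedded
    font; no rows after the header means an image-only (scanned) PDF."""
    rows = [line for line in pdffonts_output.splitlines() if line.strip()]
    return len(rows) <= 2

def is_scanned_pdf(path: str) -> bool:
    out = subprocess.run(["pdffonts", path],
                         capture_output=True, text=True).stdout
    return has_no_fonts(out)
```
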
salvacybersec
0eb84bab78 Update README: bge-m3 embedding, Olla proxy, verification system docs
- Correct embedding model: bge-m3:latest (1024d) via Olla proxy
- Document 3-layer verification system (per-call, first-batch, triple-check)
- Add monitor.py usage section
- Add full recovery procedures including lancedb/vector-cache cleanup
- Document Olla load balancer retry behavior
- Add technical notes on batch size, rate limiting, log buffering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:34:16 +03:00
salvacybersec
dfde494bd4 Robust embed verification: 5 retries per call + LanceDB physical checks
Every API call:
- 5 retries with progressive backoff (Olla routes to random instances)
- Body error detection (API 200 but embed error in response)

Per persona verification:
- First batch: LanceDB must physically grow + query must return sources
- Every 10th batch: LanceDB growth check
- Final: triple check (LanceDB size + workspace doc count API + search query)
- Abort on model-not-found errors, skip after 5 consecutive failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:30:09 +03:00
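
The per-call retry policy above (5 attempts, progressive backoff, error detection inside a 200 response body) can be sketched generically. `do_call` stands in for the actual embed API call and is an assumption, not the script's real helper:

```python
import time

def call_with_retries(do_call, max_retries=5, base_delay=2):
    """Up to max_retries attempts with progressive backoff, plus body
    error detection: the API can answer 200 yet report an embed error
    inside the JSON payload, which is treated as a failure too."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = do_call()
            if isinstance(resp, dict) and resp.get("error"):
                raise RuntimeError(resp["error"])
            return resp
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * attempt)  # progressive backoff
```
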
salvacybersec
3d4654f454 Add robust embedding verification — no silent failures
- Pre-flight: test embedding model with 3 retries (120s timeout for cold start)
- First-batch verify: after batch 1, query workspace to confirm vectors searchable
- Abort on model errors: "not found" or "failed to embed" stops immediately
- Consecutive failure guard: 3 fails in a row → skip persona, continue others
- Response error check: API 200 but embed error in body → caught and logged
- Never record progress for failed embeds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:26:41 +03:00
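
The consecutive-failure guard above (3 fails in a row skips the persona but lets the run continue) is a small state machine. A sketch where `embed_one` is a stand-in for the real per-batch embed call, not the script's actual function:

```python
def embed_batches(batches, embed_one, max_consecutive=3):
    """Embed batches in order; a success resets the failure counter,
    and max_consecutive failures in a row skips the rest of this
    persona so other personas can still run."""
    consecutive = 0
    done = []
    for batch in batches:
        try:
            embed_one(batch)
            done.append(batch)
            consecutive = 0
        except Exception:
            consecutive += 1
            if consecutive >= max_consecutive:
                return done, "skipped"
    return done, "ok"
```
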
salvacybersec
fd29c0efb6 Fix monitor: use API slugs, show KB for small sizes
- Fetch real workspace slugs from AnythingLLM API instead of guessing
- Show KB instead of 0MB for small LanceDB/vector sizes
- Fixes incorrect vector detection after embedding engine change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:40:13 +03:00
salvacybersec
3176ebf102 Fix batch size (20→5) and script detection in monitor
- Reduce embed batch to 5 — AnythingLLM hangs on batches >10
- Fix check_script_running() to properly detect setup.py process
  (was returning false because pgrep matched monitor.py too)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:33:35 +03:00
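
The fixed script detection above boils down to filtering process command lines so that monitor.py is never mistaken for setup.py (a loose pgrep pattern matched both). A sketch of the filtering step only; `cmdlines` would come from something like `ps -eo args`, and the helper name is illustrative:

```python
def is_setup_running(cmdlines):
    """Return True if any process command line names setup.py while
    explicitly excluding monitor.py, which the old loose match
    incorrectly picked up."""
    for cmd in cmdlines:
        if "setup.py" in cmd and "monitor.py" not in cmd:
            return True
    return False
```
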
salvacybersec
1028d11507 Add structured logging + log panel to monitor
- setup.py: logging module with file (setup.log) + console output
  - Line-buffered output (fixes background execution buffering)
  - API calls with timeout (300s), retry (3x), debug logging
  - Per-batch progress: [1/29] persona batch 1/20 (20 docs)
  - --verbose flag for debug-level console
- monitor.py: log tail in CLI + web dashboard
  - CLI: colorized last 15 log lines
  - Web: scrollable log panel with level-based colors
- Smaller embed batches (20 instead of 50) for reliability

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:30:29 +03:00
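
The file+console logging with line-buffered output described above can be sketched with the stdlib `logging` module; `sys.stdout.reconfigure(line_buffering=True)` (Python 3.7+) is one way to get the background-execution buffering fix the commit mentions. The formatter string and logger name are assumptions, not the script's actual configuration:

```python
import logging
import sys

def setup_logging(path="setup.log", verbose=False):
    """File handler writes setup.log; stream handler mirrors to the
    console (DEBUG level with --verbose, INFO otherwise). Line
    buffering on stdout keeps output flowing when run in the
    background."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(line_buffering=True)
    logger = logging.getLogger("setup")
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    fh = logging.FileHandler(path)
    fh.setFormatter(fmt)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.DEBUG if verbose else logging.INFO)
    ch.setFormatter(fmt)
    logger.addHandler(fh)
    logger.addHandler(ch)
    return logger
```
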
salvacybersec
9105c03b4b Add monitor.py: CLI + web dashboard for embedding progress
Three modes:
  python3 monitor.py          # one-shot CLI
  python3 monitor.py --watch  # auto-refresh 2s
  python3 monitor.py --web    # web dashboard on :8899

Shows per-persona progress bars, vector sizes, API/script status,
cluster grouping with color coding. Web mode auto-polls /api/status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:24:07 +03:00
salvacybersec
98ed69653d Update all docs: 29 personas, 88 paths, 39K files, --reassign mode
Sync README, skill, memory, and Obsidian note with current state:
- 29 persona workspaces across 5 clusters
- 88 mapped paths covering 39,754 files (67 GB)
- New --reassign --reset mode for fast vector recovery
- Expanded skip_extensions list
- Gitea repo reference added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:19:27 +03:00
salvacybersec
e54ed045fe Map remaining 4 files: _oneshots + KonferansSlaytlari to Neo
Full library coverage complete — 0 unmapped content folders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:15:34 +03:00
salvacybersec
24f22b5b6c Expand config coverage from 73 to 86 mapped paths across 39K files
- Add SiyasetVeTeori (1262 files) to Tribune + Sage personas
- Add Marketing (12 files) to Herald persona
- Add 7 regional FOIA-CIA folders to Frodo + Scribe
- Add NATO/FOIA-NATO to Scribe + Warden
- Add MobilGuvenlik to Neo + Specter, KonferansSunumlari to Neo
- Add HHSGuvenlikEgitimi to Bastion
- Expand skip_extensions: add .djvu .mobi .azw3 .json .log .py .js
  .jsx .htm .png .gif .mp3 .flac .gov .org .db

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:13:30 +03:00
salvacybersec
c45efcb261 Add --reassign mode for fast vector recovery without disk scanning
Skips the slow folder scan (50K+ files) and upload phases — directly
re-embeds already-uploaded documents to workspaces using progress state.
Use with --reset to clear assignment tracking first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 00:04:36 +03:00
salvacybersec
9e9b75e0b3 Initial commit: AnythingLLM persona RAG integration
28 persona workspaces with document upload, OCR pipeline, and vector embedding
assignment via AnythingLLM API. Supports 5 clusters (intel, cyber, military,
humanities, engineering) with batch processing and resume capability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:07:44 +03:00