Initial commit: AnythingLLM persona RAG integration

28 persona workspaces with document upload, OCR pipeline, and vector embedding
assignment via AnythingLLM API. Supports 5 clusters (intel, cyber, military,
humanities, engineering) with batch processing and resume capability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
salvacybersec
2026-04-06 23:07:44 +03:00
commit 9e9b75e0b3
4 changed files with 1178 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,10 @@
# State files (machine-specific, regenerated by script)
upload_progress.json
# OCR output (large binary files)
ocr_output/
# Python
__pycache__/
*.pyc
.venv/

README.md Normal file

@@ -0,0 +1,63 @@
# AnythingLLM × Persona RAG Integration
A RAG system with 28 persona workspaces, fed from a book library. Each persona is vector-embedded with documents from its own area of expertise.
## Architecture
- **AnythingLLM Desktop** — `http://localhost:3001`
- **LLM:** Ollama local (qwen3:14b)
- **Embedding:** Google Gemini (gemini-embedding-001)
- **Vector DB:** LanceDB
- **OCR:** ocrmypdf (tur+eng)
- **Book Source:** `/mnt/storage/Common/Books/`
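Every integration call goes through the AnythingLLM v1 REST API at the base URL above, authenticated with a bearer token (the key lives in `config.yaml`). A minimal sketch of that access pattern, assuming only the `/workspaces` endpoint that `setup.py` uses:

```python
import requests

BASE_URL = "http://localhost:3001/api/v1"  # AnythingLLM Desktop API root

def auth_header(api_key: str) -> dict:
    # Every request authenticates with the API key as a bearer token.
    return {"Authorization": f"Bearer {api_key}"}

def list_workspaces(api_key: str) -> dict:
    # GET /workspaces returns {"workspaces": [...]}; setup.py uses this
    # to decide whether a persona workspace already exists.
    resp = requests.get(f"{BASE_URL}/workspaces", headers=auth_header(api_key))
    resp.raise_for_status()
    return resp.json()
```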
## Personas (5 Clusters)
| Cluster | Personas |
|---------|----------|
| Intel | Frodo, Echo, Ghost, Oracle, Wraith, Scribe, Polyglot |
| Cyber | Neo, Bastion, Sentinel, Specter, Phantom, Cipher, Vortex |
| Military | Marshal, Centurion, Corsair, Warden, Medic |
| Humanities | Chronos, Tribune, Arbiter, Ledger, Sage, Herald, Scholar, Gambit |
| Engineering | Forge, Architect |
## Usage
```bash
# Check status
python3 setup.py --status
# Create / update workspaces
python3 setup.py --create-workspaces
# Full pipeline (upload + OCR + embed)
python3 setup.py --upload-documents --resume
# Single cluster or persona
python3 setup.py --upload-documents --cluster cyber --resume
python3 setup.py --upload-documents --persona neo --priority 1 --resume
# Preview
python3 setup.py --upload-documents --dry-run
```
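The `--cluster` flag expands to a fixed list of persona codenames. A sketch of the lookup `setup.py` performs (the table is copied from its `CLUSTERS` constant):

```python
# Cluster → persona codenames, as defined in setup.py.
CLUSTERS = {
    "intel": ["frodo", "echo", "ghost", "oracle", "wraith", "scribe", "polyglot"],
    "cyber": ["neo", "bastion", "sentinel", "specter", "phantom", "cipher", "vortex"],
    "military": ["marshal", "centurion", "corsair", "warden", "medic"],
    "humanities": ["chronos", "tribune", "arbiter", "ledger", "sage", "herald", "scholar", "gambit"],
    "engineering": ["forge", "architect"],
}

def resolve_cluster(name: str) -> list:
    """Return the personas in a cluster, or raise on an unknown name."""
    if name not in CLUSTERS:
        raise ValueError(f"Unknown cluster: {name} (available: {', '.join(CLUSTERS)})")
    return CLUSTERS[name]
```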
## Pipeline
```
Phase A: Upload text-based files
Phase B: OCR scanned PDFs (ocrmypdf)
Phase C: Upload OCR'd files
Final: Assign/embed into workspaces
```
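Whether a PDF lands in Phase A or Phase B is decided by a fast font check: a PDF with no embedded fonts is treated as scanned. A self-contained sketch of the heuristic `setup.py` uses:

```python
import subprocess
from pathlib import Path

def is_scanned_pdf(path: Path) -> bool:
    """Heuristic from setup.py: no embedded fonts means image-only (scanned)."""
    if path.suffix.lower() != ".pdf":
        return False
    try:
        out = subprocess.run(
            ["pdffonts", "-l", "3", str(path)],  # ~0.04s vs ~2s for pdftotext
            capture_output=True, timeout=3,
        ).stdout.decode(errors="ignore")
        lines = [l for l in out.strip().split("\n") if l.strip()]
        return len(lines) <= 2  # pdffonts always prints a 2-line header
    except Exception:
        return False  # assume text if pdffonts is missing or slow
```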
## Recovery
If the vector DB is deleted:
1. In `upload_progress.json`, reset `workspace_docs` to `{}`
2. `python3 setup.py --upload-documents --resume` (only re-embeds)
## Dosyalar
- `setup.py` — Main integration script (upload, OCR, workspace assignment)
- `config.yaml` — Persona-to-folder mappings, API config, batch settings
- `upload_progress.json` — Upload/assignment state tracker (gitignore'd)

config.yaml Normal file

@@ -0,0 +1,425 @@
# AnythingLLM × Persona Library Integration Config
# Maps personas to book folders for workspace-based RAG
#
# Usage: python3 setup.py [--dry-run] [--persona <name>] [--upload-documents]
anythingllm:
base_url: "http://localhost:3001/api/v1"
api_key: "SXQGXH3-AQ64B8E-KQNMDWC-WZBQAFW"
storage:
book_library: "/mnt/storage/Common/Books"
personas_dir: "/home/salva/Documents/personas/personas"
# AnythingLLM copies uploaded originals to direct-uploads/
# This symlink sends them to HDD so SSD stays clean
hdd_storage: "/mnt/storage/anythingllm"
embedding:
primary:
engine: "gemini"
model: "gemini-embedding-001"
fallback:
engine: "ollama"
base_path: "http://127.0.0.1:40114/olla/ollama"
model: "nomic-embed-text"
# Batch processing — avoid API rate limits
processing:
batch_size: 50 # files per batch
delay_between_batches: 5 # seconds
max_concurrent: 3 # parallel uploads
skip_extensions: # don't process these
- ".bin"
- ".gz"
- ".zip"
- ".html"
- ".php"
- ".jpg"
- ".pptx"
- ".ppt"
- ".doc"
# ─────────────────────────────────────────────────────────────
# PERSONA → BOOK FOLDER MAPPINGS
# ─────────────────────────────────────────────────────────────
# priority: 1=core (always load), 2=extended (load if capacity allows)
# max_files: cap per folder to keep workspace focused
workspaces:
# ══════════════════════════════════════════════════════════
# INTELLIGENCE CLUSTER
# ══════════════════════════════════════════════════════════
frodo:
name: "Frodo — Stratejik İstihbarat"
persona_file: "frodo/general.md"
folders:
- path: "Istihbarat/TeoriVeAnaliz"
priority: 1
- path: "Istihbarat/Arastirmalar"
priority: 1
- path: "UluslararasiIliskiler"
priority: 1
- path: "GuvenlikStratejileri"
priority: 1
- path: "SETA"
priority: 2
- path: "ORSAM"
priority: 2
- path: "Istihbarat/TurkIstihbarati"
priority: 2
- path: "Istihbarat/RusIstihbarati"
priority: 2
echo:
name: "Echo — SIGINT/COMINT"
persona_file: "echo/general.md"
folders:
- path: "SiberGuvenlik/FOIA-IA-NSA-SIGINT"
priority: 1
- path: "SiberGuvenlik/ElektronikGuvenlik"
priority: 1
- path: "Istihbarat/FOIA-CIA-SogukSavas"
priority: 2
- path: "SiberGuvenlik/FOIA-SiberSavas"
priority: 2
ghost:
name: "Ghost — PSYOP & Bilgi Savaşı"
persona_file: "ghost/general.md"
folders:
- path: "SiberGuvenlik/BilgiSavasi"
priority: 1
- path: "Istihbarat/SorguTeknikleri"
priority: 1
- path: "GuvenlikStratejileri"
priority: 2
oracle:
name: "Oracle — OSINT & Dijital İstihbarat"
persona_file: "oracle/general.md"
folders:
- path: "SiberGuvenlik/OSINT"
priority: 1
- path: "Istihbarat/Arastirmalar"
priority: 2
wraith:
name: "Wraith — HUMINT & Karşı İstihbarat"
persona_file: "wraith/general.md"
folders:
- path: "Istihbarat/TeoriVeAnaliz"
priority: 1
- path: "Istihbarat/BiyografiVeAnilar"
priority: 1
- path: "Istihbarat/TurkIstihbarati"
priority: 1
- path: "Istihbarat/RusIstihbarati"
priority: 1
- path: "Istihbarat/SorguTeknikleri"
priority: 2
- path: "Istihbarat/IstihbaratTarihi"
priority: 2
scribe:
name: "Scribe — FOIA Arşivci"
persona_file: "scribe/general.md"
folders:
- path: "FOIA"
priority: 1
- path: "Istihbarat/CIA"
priority: 1
- path: "Istihbarat/FOIA-CIA-OrtaDogu"
priority: 1
- path: "Istihbarat/FOIA-CIA-SogukSavas"
priority: 1
- path: "Istihbarat/FOIA-CIA-Turkey"
priority: 1
- path: "Istihbarat/FOIA-FBI-COINTELPRO"
priority: 2
- path: "Istihbarat/FOIA-FBI-Vault"
priority: 2
- path: "Istihbarat/FOIA-IA-CIA-SogukSavas"
priority: 2
- path: "Istihbarat/FOIA-IA-CIA-Kuba-OrtaDogu"
priority: 2
- path: "Istihbarat/FOIA-IA-FBI"
priority: 2
- path: "Istihbarat/FOIA-IA-WWII"
priority: 2
- path: "SiberGuvenlik/FOIA-CyberWarfare"
priority: 2
- path: "SiberGuvenlik/FOIA-IA-NSA-SIGINT"
priority: 2
polyglot:
name: "Polyglot — Dilbilim & LINGINT"
persona_file: "polyglot/general.md"
folders:
- path: "Egitim"
priority: 1
# ══════════════════════════════════════════════════════════
# CYBERSECURITY CLUSTER
# ══════════════════════════════════════════════════════════
neo:
name: "Neo — Red Team & Exploit Dev"
persona_file: "neo/general.md"
folders:
- path: "SiberGuvenlik/PenetrasyonTesti"
priority: 1
- path: "SiberGuvenlik/SaldiriTeknikleri"
priority: 1
- path: "SiberGuvenlik/ZafiyetArastirmasi"
priority: 1
- path: "SiberGuvenlik/WebGuvenligi"
priority: 2
bastion:
name: "Bastion — Blue Team & DFIR"
persona_file: "bastion/general.md"
folders:
- path: "SiberGuvenlik/AdliBilisim"
priority: 1
- path: "SiberGuvenlik/GenelGuvenlik"
priority: 1
- path: "SiberGuvenlik/AgGuvenligi"
priority: 2
- path: "SiberGuvenlik/WindowsGuvenligi"
priority: 2
sentinel:
name: "Sentinel — Siber Tehdit İstihbaratı"
persona_file: "sentinel/general.md"
folders:
- path: "SiberGuvenlik/TehditIstihbarati"
priority: 1
- path: "SiberGuvenlik/SiberSavas"
priority: 1
- path: "SiberGuvenlik/SiberGuvenlikStratejisi"
priority: 1
- path: "SiberGuvenlik/FOIA-CyberWarfare"
priority: 2
specter:
name: "Specter — Zararlı Yazılım & Tersine Mühendislik"
persona_file: "specter/general.md"
folders:
- path: "SiberGuvenlik/ZararliYazilimAnalizi"
priority: 1
- path: "SiberGuvenlik/TersineMuhendislik"
priority: 1
- path: "SiberGuvenlik/KernelGuvenligi"
priority: 2
phantom:
name: "Phantom — Web Uygulama Güvenliği"
persona_file: "phantom/general.md"
folders:
- path: "SiberGuvenlik/WebGuvenligi"
priority: 1
- path: "SiberGuvenlik/PenetrasyonTesti"
priority: 2
- path: "SiberGuvenlik/BulutGuvenligi"
priority: 2
cipher:
name: "Cipher — Kriptografi"
persona_file: "cipher/general.md"
folders:
- path: "SiberGuvenlik/Kriptografi"
priority: 1
- path: "SiberGuvenlik/BilgiGuvenligi"
priority: 2
vortex:
name: "Vortex — Ağ Operasyonları"
persona_file: "vortex/general.md"
folders:
- path: "SiberGuvenlik/AgGuvenligi"
priority: 1
- path: "SiberGuvenlik/DonaninGuvenligi"
priority: 2
- path: "SiberGuvenlik/IoT"
priority: 2
# ══════════════════════════════════════════════════════════
# MILITARY CLUSTER
# ══════════════════════════════════════════════════════════
marshal:
name: "Marshal — Askeri Doktrin & Strateji"
persona_file: "marshal/general.md"
folders:
- path: "AskeriDoktrin"
priority: 1
- path: "NATO/Doktrin"
priority: 1
- path: "GuvenlikStratejileri"
priority: 1
- path: "NATO/Tatbikat"
priority: 2
centurion:
name: "Centurion — Askeri Tarih"
persona_file: "centurion/general.md"
folders:
- path: "AskeriTarih"
priority: 1
- path: "AskeriDoktrin"
priority: 2
- path: "DunyaTarihi"
priority: 2
corsair:
name: "Corsair — Özel Harekat & Düzensiz Savaş"
persona_file: "corsair/general.md"
folders:
- path: "AskeriDoktrin"
priority: 1
- path: "Istihbarat/TerorMucadele"
priority: 1
- path: "GuvenlikStratejileri"
priority: 2
warden:
name: "Warden — Savunma Analizi & Silah Sistemleri"
persona_file: "warden/general.md"
folders:
- path: "AskeriDoktrin"
priority: 1
- path: "NATO/Teknik"
priority: 1
- path: "GuvenlikStratejileri"
priority: 2
- path: "Istihbarat/SavunmaBakanligiRaporlari"
priority: 2
medic:
name: "Medic — Biyomedikal & KBRN"
persona_file: "medic/general.md"
folders:
- path: "Biyomedikal"
priority: 1
- path: "Istihbarat/KBRN"
priority: 1
- path: "BilimVeArastirma"
priority: 2
# ══════════════════════════════════════════════════════════
# HUMANITIES & ANALYSIS CLUSTER
# ══════════════════════════════════════════════════════════
chronos:
name: "Chronos — Dünya Tarihi & Medeniyet"
persona_file: "chronos/general.md"
folders:
- path: "DunyaTarihi"
priority: 1
- path: "OsmanliTarihi"
priority: 1
- path: "CumhuriyetTarihi"
priority: 1
- path: "RusyaTarihi"
priority: 1
- path: "YahudiTarihi"
priority: 2
- path: "AskeriTarih"
priority: 2
tribune:
name: "Tribune — Siyaset Bilimi & Rejim Analizi"
persona_file: "tribune/general.md"
folders:
- path: "UluslararasiIliskiler"
priority: 1
- path: "SETA"
priority: 1
- path: "ORSAM"
priority: 1
- path: "CumhuriyetTarihi"
priority: 2
arbiter:
name: "Arbiter — Uluslararası Hukuk"
persona_file: "arbiter/general.md"
folders:
- path: "Hukuk"
priority: 1
- path: "UluslararasiIliskiler"
priority: 2
- path: "NATO/Idari"
priority: 2
ledger:
name: "Ledger — Ekonomik İstihbarat & FININT"
persona_file: "ledger/general.md"
folders:
- path: "EkonomiVeFinans"
priority: 1
sage:
name: "Sage — Felsefe & İktidar Teorisi"
persona_file: "sage/general.md"
folders:
- path: "FelsefeVeEdebiyat"
priority: 1
herald:
name: "Herald — Medya Analizi & Stratejik İletişim"
persona_file: "herald/general.md"
folders:
- path: "SETA"
priority: 1
- path: "ORSAM"
priority: 2
- path: "UluslararasiIliskiler"
priority: 2
scholar:
name: "Scholar — Akademik Araştırma"
persona_file: "scholar/general.md"
folders:
- path: "BilimVeArastirma"
priority: 1
- path: "Egitim"
priority: 1
- path: "UluslararasiIliskiler"
priority: 2
gambit:
name: "Gambit — Satranç & Stratejik Düşünce"
persona_file: "gambit/general.md"
folders:
- path: "Satranc"
priority: 1
# ══════════════════════════════════════════════════════════
# ENGINEERING CLUSTER
# ══════════════════════════════════════════════════════════
forge:
name: "Forge — Yazılım & AI/ML"
persona_file: "forge/general.md"
folders:
- path: "AI"
priority: 1
- path: "Teknoloji"
priority: 1
- path: "SiberGuvenlik/Programlama"
priority: 2
- path: "SiberGuvenlik/YapayZekaGuvenligi"
priority: 2
architect:
name: "Architect — DevOps & Altyapı"
persona_file: "architect/general.md"
folders:
- path: "Teknoloji"
priority: 1
- path: "SiberGuvenlik/Linux"
priority: 2
- path: "SiberGuvenlik/BulutGuvenligi"
priority: 2

setup.py Normal file

@@ -0,0 +1,680 @@
#!/usr/bin/env python3
"""
AnythingLLM × Persona Library Integration Setup
Three-phase pipeline:
Phase A: Upload all text-based files
Phase B: OCR all scanned PDFs in-place
Phase C: Upload newly OCR'd files + assign to workspaces
Usage:
python3 setup.py --storage-setup # Symlink direct-uploads/ to HDD
python3 setup.py --create-workspaces # Create workspaces + load persona prompts
python3 setup.py --upload-documents # Full pipeline: upload → OCR → upload → assign
python3 setup.py --persona frodo # Single persona
python3 setup.py --cluster intel # Intel cluster (frodo,echo,ghost,oracle,wraith,scribe,polyglot)
python3 setup.py --cluster cyber # Cyber cluster
python3 setup.py --cluster military # Military cluster
python3 setup.py --cluster humanities # Humanities cluster
python3 setup.py --cluster engineering # Engineering cluster
python3 setup.py --priority 1 # Only priority 1 (core) folders
python3 setup.py --dry-run # Preview
python3 setup.py --status # Show state
"""
import argparse
import json
import os
import shutil
import subprocess
import sys
import time
from pathlib import Path
try:
    import requests
    import yaml
except ImportError:
    print("Missing dependencies; run: pip install requests pyyaml")
    sys.exit(1)
CONFIG_PATH = Path(__file__).parent / "config.yaml"
PROGRESS_PATH = Path(__file__).parent / "upload_progress.json"
ANYTHINGLLM_STORAGE = Path.home() / ".config/anythingllm-desktop/storage"
SKIP_EXT = set()
CLUSTERS = {
"intel": ["frodo", "echo", "ghost", "oracle", "wraith", "scribe", "polyglot"],
"cyber": ["neo", "bastion", "sentinel", "specter", "phantom", "cipher", "vortex"],
"military": ["marshal", "centurion", "corsair", "warden", "medic"],
"humanities": ["chronos", "tribune", "arbiter", "ledger", "sage", "herald", "scholar", "gambit"],
"engineering": ["forge", "architect"],
}
def load_config():
with open(CONFIG_PATH) as f:
cfg = yaml.safe_load(f)
global SKIP_EXT
SKIP_EXT = set(cfg["processing"]["skip_extensions"])
return cfg
def load_progress():
if PROGRESS_PATH.exists():
with open(PROGRESS_PATH) as f:
return json.load(f)
return {"uploaded_files": {}, "workspace_docs": {}, "ocr_done": [], "ocr_failed": []}
def save_progress(progress):
with open(PROGRESS_PATH, "w") as f:
json.dump(progress, f, indent=2, ensure_ascii=False)
# ──────────────────────────────────────────────────────────
# API
# ──────────────────────────────────────────────────────────
def api_request(config, method, endpoint, **kwargs):
url = f"{config['anythingllm']['base_url']}{endpoint}"
headers = {"Authorization": f"Bearer {config['anythingllm']['api_key']}"}
if "json" in kwargs:
headers["Content-Type"] = "application/json"
resp = getattr(requests, method)(url, headers=headers, **kwargs)
if resp.status_code not in (200, 201):
print(f" API error {resp.status_code}: {resp.text[:300]}")
return None
return resp.json()
def api_upload(config, file_path, folder_name=None):
endpoint = f"/document/upload/{folder_name}" if folder_name else "/document/upload"
url = f"{config['anythingllm']['base_url']}{endpoint}"
headers = {"Authorization": f"Bearer {config['anythingllm']['api_key']}"}
size_mb = file_path.stat().st_size / (1024 * 1024)
timeout = max(120, int(120 + (size_mb / 10) * 30))
try:
with open(file_path, "rb") as f:
files = {"file": (file_path.name, f, "application/octet-stream")}
resp = requests.post(url, headers=headers, files=files, timeout=timeout)
if resp.status_code not in (200, 201):
return None, resp.text[:200]
data = resp.json()
if data.get("success") and data.get("documents"):
return data["documents"], None
return None, data.get("error", "Unknown error")
except requests.exceptions.Timeout:
return None, f"timeout ({timeout}s)"
except Exception as e:
return None, str(e)
def check_api(config):
try:
return api_request(config, "get", "/auth") is not None
except Exception:
return False
def check_collector_alive():
try:
return requests.get("http://127.0.0.1:8888", timeout=3).status_code == 200
except Exception:
return False
def wait_for_collector(max_wait=90):
print(" ⏳ waiting for collector...", end="", flush=True)
for i in range(max_wait):
if check_collector_alive():
print("")
return True
time.sleep(1)
if i % 10 == 9:
print(".", end="", flush=True)
print(" ✗ timeout")
return False
def get_existing_workspaces(config):
result = api_request(config, "get", "/workspaces")
if result and "workspaces" in result:
return {ws["name"]: ws for ws in result["workspaces"]}
return {}
# ──────────────────────────────────────────────────────────
# PDF DETECTION & OCR
# ──────────────────────────────────────────────────────────
def is_scanned_pdf(file_path):
"""Fast scan detection via pdffonts (~0.04s vs pdftotext ~2s).
No fonts embedded = scanned/image-only PDF."""
if file_path.suffix.lower() != ".pdf":
return False
try:
proc = subprocess.Popen(
["pdffonts", "-l", "3", str(file_path)],
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
try:
stdout, _ = proc.communicate(timeout=3)
lines = [l for l in stdout.decode(errors="ignore").strip().split("\n") if l.strip()]
return len(lines) <= 2
except subprocess.TimeoutExpired:
proc.kill()
proc.wait()
return False # assume text if slow
except Exception:
return False
def ocr_pdf(file_path, language="tur+eng", dpi=200):
"""OCR a scanned PDF in-place. Returns True on success."""
import tempfile
tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=file_path.parent)
os.close(tmp_fd)
tmp_path = Path(tmp_path)
try:
result = subprocess.run(
["ocrmypdf", "--skip-text", "--rotate-pages", "--deskew",
"--jobs", "4", "--image-dpi", str(dpi), "-l", language,
"--output-type", "pdf", "--quiet",
str(file_path), str(tmp_path)],
capture_output=True, text=True, timeout=600,
)
if result.returncode in (0, 6) and tmp_path.exists() and tmp_path.stat().st_size > 0:
tmp_path.replace(file_path)
return True
tmp_path.unlink(missing_ok=True)
return False
except Exception:
tmp_path.unlink(missing_ok=True)
return False
# ──────────────────────────────────────────────────────────
# STEP 1: Storage offload
# ──────────────────────────────────────────────────────────
def storage_setup(config, dry_run=False):
hdd_base = Path(config["storage"]["hdd_storage"])
src = ANYTHINGLLM_STORAGE / "direct-uploads"
dst = hdd_base / "direct-uploads"
print("═══ Storage: direct-uploads/ → HDD ═══\n")
if src.is_symlink():
print(f" ✓ already symlinked → {src.resolve()}")
return
    if not dry_run:
        dst.mkdir(parents=True, exist_ok=True)
        if src.exists():
            shutil.copytree(str(src), str(dst), dirs_exist_ok=True)
            shutil.rmtree(src)
        src.symlink_to(dst)
        print(" ✓ done\n")
    else:
        print(f" (dry-run) would symlink {src} -> {dst}\n")
# ──────────────────────────────────────────────────────────
# STEP 2: Create workspaces + load prompts
# ──────────────────────────────────────────────────────────
def extract_system_prompt(config, persona_file):
personas_dir = Path(config["storage"]["personas_dir"])
fp = personas_dir / persona_file
if not fp.exists():
alt = personas_dir / persona_file.replace("/general.md", "") / "_meta.yaml"
fp = alt if alt.exists() else None
if not fp:
return None
content = fp.read_text(encoding="utf-8")
if fp.suffix == ".yaml":
meta = yaml.safe_load(content)
return meta.get("system_prompt", meta.get("description", ""))
parts = content.split("---")
if len(parts) >= 3:
try:
fm = yaml.safe_load(parts[1])
except yaml.YAMLError:
fm = {}
tone = fm.get("tone", "")
body = "---".join(parts[2:]).strip()
return f"Tone: {tone}\n\n{body}" if tone else body
return content
def create_workspaces(config, persona_list=None, dry_run=False):
print("═══ Creating Workspaces ═══\n")
if not config["anythingllm"]["api_key"]:
print(" ✗ API key not set!")
return
existing = get_existing_workspaces(config)
created = skipped = 0
for codename, ws_config in config["workspaces"].items():
if persona_list and codename not in persona_list:
continue
name = ws_config["name"]
persona_file = ws_config.get("persona_file", "")
system_prompt = extract_system_prompt(config, persona_file) if persona_file else ""
if name in existing:
# Update prompt if workspace exists
slug = existing[name].get("slug", "?")
if system_prompt and not dry_run:
api_request(config, "post", f"/workspace/{slug}/update",
json={"openAiPrompt": system_prompt})
print(f"{codename} (prompt: {len(system_prompt or '')} chars)")
skipped += 1
continue
print(f"{codename}: creating '{name}'")
if not dry_run:
result = api_request(config, "post", "/workspace/new", json={"name": name})
if result:
slug = result.get("workspace", {}).get("slug", "?")
if system_prompt:
api_request(config, "post", f"/workspace/{slug}/update",
json={"openAiPrompt": system_prompt})
print(f" ✓ created + prompt ({len(system_prompt)} chars)")
created += 1
else:
print(f" ✗ failed")
print(f"\n Created: {created}, Updated: {skipped}\n")
# ──────────────────────────────────────────────────────────
# STEP 3: Three-phase upload pipeline
# ──────────────────────────────────────────────────────────
def collect_all_files(config, book_library, persona_list=None, priority_filter=None, max_size_mb=100):
"""Scan all folders and classify files as text-based or scanned."""
text_files = {} # folder_name → [paths]
scanned_files = {} # folder_name → [paths]
persona_folders = {} # codename → [folder_names]
folder_to_path = {} # folder_name → source path
for codename, ws_config in config["workspaces"].items():
if persona_list and codename not in persona_list:
continue
persona_folders[codename] = []
for entry in ws_config.get("folders", []):
folder_path = entry["path"]
priority = entry.get("priority", 2)
if priority_filter and priority > priority_filter:
continue
folder_name = folder_path.replace("/", "_")
persona_folders[codename].append(folder_name)
if folder_name in text_files:
continue # already scanned
src = book_library / folder_path
folder_to_path[folder_name] = src
text_files[folder_name] = []
scanned_files[folder_name] = []
if not src.exists():
                print(f" ⚠ {folder_path} not found")
continue
# Use find with -printf for speed on HDD (one syscall, no per-file stat)
try:
find_result = subprocess.run(
["find", str(src), "-type", "f", "-not", "-empty",
"-printf", "%s %p\n"],
capture_output=True, text=True, timeout=120,
)
all_files = []
max_bytes = (max_size_mb or 9999) * 1024 * 1024
for line in find_result.stdout.strip().split("\n"):
if not line:
continue
parts = line.split(" ", 1)
if len(parts) != 2:
continue
size, path = int(parts[0]), Path(parts[1])
if path.suffix.lower() not in SKIP_EXT and size <= max_bytes:
all_files.append(path)
all_files.sort()
except Exception:
all_files = sorted(
f for f in src.rglob("*")
if f.is_file() and f.suffix.lower() not in SKIP_EXT
and f.stat().st_size > 0
and (not max_size_mb or f.stat().st_size / (1024*1024) <= max_size_mb)
)
print(f" {folder_name}: {len(all_files)} files found", flush=True)
# Classify every file (pdffonts is fast: ~0.04s per file)
for i, f in enumerate(all_files):
if is_scanned_pdf(f):
scanned_files[folder_name].append(f)
else:
text_files[folder_name].append(f)
if (i + 1) % 500 == 0:
print(f" {folder_name}: {i+1}/{len(all_files)} classified...", flush=True)
t = len(text_files[folder_name])
s = len(scanned_files[folder_name])
print(f" {folder_name}: {t} text, {s} scanned")
return text_files, scanned_files, persona_folders, folder_to_path
def upload_file_batch(config, folder_name, files, progress, batch_size, delay):
"""Upload a list of files to a folder. Returns (uploaded, failed) counts."""
uploaded = failed = 0
new_files = [f for f in files if str(f) not in progress["uploaded_files"]]
if not new_files:
return 0, 0
print(f"{folder_name}: {len(new_files)} files")
for i, fp in enumerate(new_files):
if uploaded > 0 and uploaded % batch_size == 0:
print(f" ⏸ batch pause ({delay}s)...")
time.sleep(delay)
size_mb = fp.stat().st_size / (1024 * 1024)
print(f" [{i+1}/{len(new_files)}] {fp.name} ({size_mb:.1f}MB)", end="", flush=True)
if not check_collector_alive():
print(f" ⚠ collector down")
if not wait_for_collector():
print(" ✗ stopping — restart AnythingLLM and --resume")
save_progress(progress)
return uploaded, failed
docs, error = None, None
for attempt in range(3):
docs, error = api_upload(config, fp, folder_name=folder_name)
if docs:
break
            if error and "not online" in str(error):
                time.sleep(5)
                if not wait_for_collector():
                    save_progress(progress)
                    return uploaded, failed
            elif attempt < 2:
                time.sleep(2)
                print(".", end="", flush=True)  # retry marker
if docs:
progress["uploaded_files"][str(fp)] = {
"location": docs[0].get("location", ""),
"folder": folder_name,
"name": docs[0].get("name", ""),
}
            uploaded += 1
            print(" ✓")
        else:
            failed += 1
            progress.setdefault("failed_files", {})[str(fp)] = {
                "folder": folder_name, "error": str(error)[:200],
            }
            print(f" ✗ {error}")
if uploaded % 10 == 0:
save_progress(progress)
save_progress(progress)
return uploaded, failed
def assign_to_workspaces(config, persona_folders, progress, batch_size, delay):
"""Phase C2: assign uploaded docs to persona workspaces."""
print("── Assigning to workspaces ──\n")
existing_ws = get_existing_workspaces(config)
for codename, folders in sorted(persona_folders.items()):
ws_name = config["workspaces"][codename]["name"]
ws_info = existing_ws.get(ws_name)
if not ws_info:
continue
slug = ws_info["slug"]
doc_locs = []
for fn in folders:
for fpath, info in progress["uploaded_files"].items():
if info.get("folder") == fn and info.get("location"):
doc_locs.append(info["location"])
already = set(progress.get("workspace_docs", {}).get(codename, []))
new_docs = [loc for loc in doc_locs if loc not in already]
if not new_docs:
if doc_locs:
                print(f" ✓ {codename}: {len(doc_locs)} docs already assigned")
continue
print(f"{codename} ({slug}): {len(new_docs)} docs")
for bs in range(0, len(new_docs), batch_size):
batch = new_docs[bs:bs + batch_size]
result = api_request(config, "post", f"/workspace/{slug}/update-embeddings",
json={"adds": batch, "deletes": []})
if result:
progress.setdefault("workspace_docs", {}).setdefault(codename, []).extend(batch)
                print(f" ✓ {len(batch)} docs embedded")
else:
print(f" ✗ batch failed")
if bs + batch_size < len(new_docs):
time.sleep(delay)
save_progress(progress)
print()
def upload_documents(config, persona_list=None, priority_filter=None,
dry_run=False, resume=False, max_size_mb=100):
"""Three-phase pipeline: text upload → OCR → OCR upload → assign."""
print("═══ Upload Pipeline ═══\n")
if not check_api(config):
print(" ✗ AnythingLLM API not reachable.")
return
book_library = Path(config["storage"]["book_library"])
batch_size = config["processing"]["batch_size"]
delay = config["processing"]["delay_between_batches"]
progress = load_progress() if resume else {
"uploaded_files": {}, "workspace_docs": {}, "ocr_done": [], "ocr_failed": [],
}
# Scan & classify
print(" Scanning folders...\n")
text_files, scanned_files, persona_folders, _ = collect_all_files(
config, book_library, persona_list, priority_filter, max_size_mb,
)
total_text = sum(len(v) for v in text_files.values())
total_scanned = sum(len(v) for v in scanned_files.values())
already = len(progress["uploaded_files"])
print(f"\n Text-based files: {total_text}")
print(f" Scanned PDFs: {total_scanned}")
print(f" Already uploaded: {already}")
print(f" Personas: {len(persona_folders)}\n")
if dry_run:
for fn in sorted(set(list(text_files.keys()) + list(scanned_files.keys()))):
t = len([f for f in text_files.get(fn, []) if str(f) not in progress["uploaded_files"]])
s = len([f for f in scanned_files.get(fn, []) if str(f) not in progress["ocr_done"]])
if t or s:
print(f" {fn}: {t} text, {s} scanned")
print(f"\n Personas:")
for c, flds in sorted(persona_folders.items()):
print(f" {c}: {', '.join(flds)}")
return
# ── Phase A: Upload text-based files ──
print("══ Phase A: Upload text-based files ══\n")
total_up = total_fail = 0
for fn, files in sorted(text_files.items()):
up, fail = upload_file_batch(config, fn, files, progress, batch_size, delay)
total_up += up
total_fail += fail
print(f"\n Phase A done: {total_up} uploaded, {total_fail} failed\n")
# ── Phase B: OCR scanned PDFs ──
total_scanned_remaining = sum(
1 for files in scanned_files.values()
for f in files if str(f) not in progress.get("ocr_done", [])
)
if total_scanned_remaining > 0:
print(f"══ Phase B: OCR {total_scanned_remaining} scanned PDFs ══\n")
ocr_ok = ocr_fail = 0
for fn, files in sorted(scanned_files.items()):
pending = [f for f in files if str(f) not in progress.get("ocr_done", [])]
if not pending:
continue
print(f"{fn}: {len(pending)} PDFs")
for i, pdf in enumerate(pending):
size_mb = pdf.stat().st_size / (1024 * 1024)
print(f" [{i+1}/{len(pending)}] {pdf.name} ({size_mb:.1f}MB)", end="", flush=True)
                if ocr_pdf(pdf):
                    progress.setdefault("ocr_done", []).append(str(pdf))
                    ocr_ok += 1
                    print(" ✓")
                else:
                    progress.setdefault("ocr_failed", []).append(str(pdf))
                    ocr_fail += 1
                    print(" ✗")
if (ocr_ok + ocr_fail) % 5 == 0:
save_progress(progress)
save_progress(progress)
print(f"\n Phase B done: {ocr_ok} OCR'd, {ocr_fail} failed\n")
# ── Phase C: Upload OCR'd files ──
ocr_to_upload = {fn: [f for f in files if str(f) in progress.get("ocr_done", [])]
for fn, files in scanned_files.items()}
total_ocr_upload = sum(
1 for files in ocr_to_upload.values()
for f in files if str(f) not in progress["uploaded_files"]
)
if total_ocr_upload > 0:
print(f"══ Phase C: Upload {total_ocr_upload} OCR'd files ══\n")
total_up2 = total_fail2 = 0
for fn, files in sorted(ocr_to_upload.items()):
if not files:
continue
up, fail = upload_file_batch(config, fn, files, progress, batch_size, delay)
total_up2 += up
total_fail2 += fail
print(f"\n Phase C done: {total_up2} uploaded, {total_fail2} failed\n")
# ── Assign to workspaces ──
assign_to_workspaces(config, persona_folders, progress, batch_size, delay)
# ──────────────────────────────────────────────────────────
# STATUS
# ──────────────────────────────────────────────────────────
def show_status(config):
print("═══ Integration Status ═══\n")
# Storage
du = ANYTHINGLLM_STORAGE / "direct-uploads"
if du.is_symlink():
print(f" ✓ direct-uploads/ → {du.resolve()} (HDD)")
elif du.exists():
print(f" ⚠ direct-uploads/ on SSD — run --storage-setup")
for d in ["documents", "lancedb", "vector-cache"]:
p = ANYTHINGLLM_STORAGE / d
if p.exists():
try:
sz = sum(f.stat().st_size for f in p.rglob("*") if f.is_file()) / (1024**2)
print(f"{d}/ ({sz:.0f} MB)")
except Exception:
print(f"{d}/")
api_ok = check_api(config) if config["anythingllm"]["api_key"] else False
collector_ok = check_collector_alive()
    print(f"\n API: {'✓' if api_ok else '✗'}  Collector: {'✓' if collector_ok else '✗'}")
progress = load_progress()
uploaded = len(progress.get("uploaded_files", {}))
ocr_done = len(progress.get("ocr_done", []))
assigned = len(progress.get("workspace_docs", {}))
if uploaded or ocr_done:
print(f"\n Uploaded: {uploaded} OCR'd: {ocr_done} Personas assigned: {assigned}")
folders = {}
for info in progress.get("uploaded_files", {}).values():
f = info.get("folder", "?")
folders[f] = folders.get(f, 0) + 1
for f, c in sorted(folders.items(), key=lambda x: -x[1])[:20]:
print(f" {c:4d} {f}")
print()
# ──────────────────────────────────────────────────────────
# MAIN
# ──────────────────────────────────────────────────────────
def resolve_persona_list(args, config):
"""Resolve --persona / --cluster to a list of codenames."""
if args.persona:
return [args.persona]
if args.cluster:
cl = CLUSTERS.get(args.cluster)
if not cl:
print(f"Unknown cluster: {args.cluster}")
print(f"Available: {', '.join(CLUSTERS.keys())}")
sys.exit(1)
return cl
return None # all
def main():
parser = argparse.ArgumentParser(description="AnythingLLM × Persona Integration")
parser.add_argument("--storage-setup", action="store_true")
parser.add_argument("--create-workspaces", action="store_true")
parser.add_argument("--upload-documents", action="store_true")
parser.add_argument("--all", action="store_true", help="Run all steps")
parser.add_argument("--status", action="store_true")
parser.add_argument("--persona", type=str, help="Single persona filter")
parser.add_argument("--cluster", type=str, help="Cluster filter: intel, cyber, military, humanities, engineering")
parser.add_argument("--priority", type=int, help="Max priority (1=core)")
parser.add_argument("--max-size", type=int, default=100, help="Max file MB (default: 100)")
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--resume", action="store_true")
args = parser.parse_args()
config = load_config()
if not any([args.storage_setup, args.create_workspaces,
args.upload_documents, args.all, args.status]):
parser.print_help()
return
persona_list = resolve_persona_list(args, config)
if args.status:
show_status(config)
return
if args.storage_setup or args.all:
storage_setup(config, dry_run=args.dry_run)
if args.create_workspaces or args.all:
create_workspaces(config, persona_list=persona_list, dry_run=args.dry_run)
if args.upload_documents or args.all:
upload_documents(config, persona_list=persona_list,
priority_filter=args.priority,
dry_run=args.dry_run,
resume=args.resume or args.all,
max_size_mb=args.max_size)
if __name__ == "__main__":
main()