Initial commit: AnythingLLM persona RAG integration

28 persona workspaces with document upload, OCR pipeline, and vector embedding
assignment via AnythingLLM API. Supports 5 clusters (intel, cyber, military,
humanities, engineering) with batch processing and resume capability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
salvacybersec
2026-04-06 23:07:44 +03:00
commit 9e9b75e0b3
4 changed files with 1178 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,10 @@
# State files (machine-specific, regenerated by script)
upload_progress.json
# OCR output (large binary files)
ocr_output/
# Python
__pycache__/
*.pyc
.venv/

README.md Normal file

@@ -0,0 +1,63 @@
# AnythingLLM × Persona RAG Integration
A RAG system with 28 persona workspaces, fed from a book library. Each persona is vector-embedded with documents from its own area of expertise.
## Architecture
- **AnythingLLM Desktop** — `http://localhost:3001`
- **LLM:** Ollama local (qwen3:14b)
- **Embedding:** Google Gemini (gemini-embedding-001)
- **Vector DB:** LanceDB
- **OCR:** ocrmypdf (tur+eng)
- **Book Source:** `/mnt/storage/Common/Books/`
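Every integration call goes through the AnythingLLM v1 REST API at the base URL above, authenticated with a bearer token (the key lives in `config.yaml`). A minimal sketch of that access pattern, assuming only the `/workspaces` endpoint that `setup.py` uses:

```python
import requests

BASE_URL = "http://localhost:3001/api/v1"  # AnythingLLM Desktop API root

def auth_header(api_key: str) -> dict:
    # Every request authenticates with the API key as a bearer token.
    return {"Authorization": f"Bearer {api_key}"}

def list_workspaces(api_key: str) -> dict:
    # GET /workspaces returns {"workspaces": [...]}; setup.py uses this
    # to decide whether a persona workspace already exists.
    resp = requests.get(f"{BASE_URL}/workspaces", headers=auth_header(api_key))
    resp.raise_for_status()
    return resp.json()
```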
## Personas (5 Clusters)
| Cluster | Personas |
|---------|----------|
| Intel | Frodo, Echo, Ghost, Oracle, Wraith, Scribe, Polyglot |
| Cyber | Neo, Bastion, Sentinel, Specter, Phantom, Cipher, Vortex |
| Military | Marshal, Centurion, Corsair, Warden, Medic |
| Humanities | Chronos, Tribune, Arbiter, Ledger, Sage, Herald, Scholar, Gambit |
| Engineering | Forge, Architect |
## Usage
```bash
# Check status
python3 setup.py --status
# Create / update workspaces
python3 setup.py --create-workspaces
# Full pipeline (upload + OCR + embed)
python3 setup.py --upload-documents --resume
# Single cluster or persona
python3 setup.py --upload-documents --cluster cyber --resume
python3 setup.py --upload-documents --persona neo --priority 1 --resume
# Preview
python3 setup.py --upload-documents --dry-run
```
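The `--cluster` flag expands to a fixed list of persona codenames. A sketch of the lookup `setup.py` performs (the table is copied from its `CLUSTERS` constant):

```python
# Cluster → persona codenames, as defined in setup.py.
CLUSTERS = {
    "intel": ["frodo", "echo", "ghost", "oracle", "wraith", "scribe", "polyglot"],
    "cyber": ["neo", "bastion", "sentinel", "specter", "phantom", "cipher", "vortex"],
    "military": ["marshal", "centurion", "corsair", "warden", "medic"],
    "humanities": ["chronos", "tribune", "arbiter", "ledger", "sage", "herald", "scholar", "gambit"],
    "engineering": ["forge", "architect"],
}

def resolve_cluster(name: str) -> list:
    """Return the personas in a cluster, or raise on an unknown name."""
    if name not in CLUSTERS:
        raise ValueError(f"Unknown cluster: {name} (available: {', '.join(CLUSTERS)})")
    return CLUSTERS[name]
```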
## Pipeline
```
Phase A: Upload text-based files
Phase B: OCR scanned PDFs (ocrmypdf)
Phase C: Upload OCR'd files
Final: Assign/embed into workspaces
```
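Whether a PDF lands in Phase A or Phase B is decided by a fast font check: a PDF with no embedded fonts is treated as scanned. A self-contained sketch of the heuristic `setup.py` uses:

```python
import subprocess
from pathlib import Path

def is_scanned_pdf(path: Path) -> bool:
    """Heuristic from setup.py: no embedded fonts means image-only (scanned)."""
    if path.suffix.lower() != ".pdf":
        return False
    try:
        out = subprocess.run(
            ["pdffonts", "-l", "3", str(path)],  # ~0.04s vs ~2s for pdftotext
            capture_output=True, timeout=3,
        ).stdout.decode(errors="ignore")
        lines = [l for l in out.strip().split("\n") if l.strip()]
        return len(lines) <= 2  # pdffonts always prints a 2-line header
    except Exception:
        return False  # assume text if pdffonts is missing or slow
```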
## Recovery
If the vector DB is deleted:
1. In `upload_progress.json`, reset `workspace_docs` to `{}`
2. `python3 setup.py --upload-documents --resume` (only re-embeds)
## Dosyalar
- `setup.py` — Main integration script (upload, OCR, workspace assignment)
- `config.yaml` — Persona-to-folder mappings, API config, batch settings
- `upload_progress.json` — Upload/assignment state tracker (gitignore'd)

config.yaml Normal file

@@ -0,0 +1,425 @@
# AnythingLLM × Persona Library Integration Config
# Maps personas to book folders for workspace-based RAG
#
# Usage: python3 setup.py [--dry-run] [--persona <name>] [--upload-documents]
anythingllm:
base_url: "http://localhost:3001/api/v1"
api_key: "SXQGXH3-AQ64B8E-KQNMDWC-WZBQAFW"
storage:
book_library: "/mnt/storage/Common/Books"
personas_dir: "/home/salva/Documents/personas/personas"
# AnythingLLM copies uploaded originals to direct-uploads/
# This symlink sends them to HDD so SSD stays clean
hdd_storage: "/mnt/storage/anythingllm"
embedding:
primary:
engine: "gemini"
model: "gemini-embedding-001"
fallback:
engine: "ollama"
base_path: "http://127.0.0.1:40114/olla/ollama"
model: "nomic-embed-text"
# Batch processing — avoid API rate limits
processing:
batch_size: 50 # files per batch
delay_between_batches: 5 # seconds
max_concurrent: 3 # parallel uploads
skip_extensions: # don't process these
- ".bin"
- ".gz"
- ".zip"
- ".html"
- ".php"
- ".jpg"
- ".pptx"
- ".ppt"
- ".doc"
# ─────────────────────────────────────────────────────────────
# PERSONA → BOOK FOLDER MAPPINGS
# ─────────────────────────────────────────────────────────────
# priority: 1=core (always load), 2=extended (load if capacity allows)
# max_files: cap per folder to keep workspace focused
workspaces:
# ══════════════════════════════════════════════════════════
# INTELLIGENCE CLUSTER
# ══════════════════════════════════════════════════════════
frodo:
name: "Frodo — Stratejik İstihbarat"
persona_file: "frodo/general.md"
folders:
- path: "Istihbarat/TeoriVeAnaliz"
priority: 1
- path: "Istihbarat/Arastirmalar"
priority: 1
- path: "UluslararasiIliskiler"
priority: 1
- path: "GuvenlikStratejileri"
priority: 1
- path: "SETA"
priority: 2
- path: "ORSAM"
priority: 2
- path: "Istihbarat/TurkIstihbarati"
priority: 2
- path: "Istihbarat/RusIstihbarati"
priority: 2
echo:
name: "Echo — SIGINT/COMINT"
persona_file: "echo/general.md"
folders:
- path: "SiberGuvenlik/FOIA-IA-NSA-SIGINT"
priority: 1
- path: "SiberGuvenlik/ElektronikGuvenlik"
priority: 1
- path: "Istihbarat/FOIA-CIA-SogukSavas"
priority: 2
- path: "SiberGuvenlik/FOIA-SiberSavas"
priority: 2
ghost:
name: "Ghost — PSYOP & Bilgi Savaşı"
persona_file: "ghost/general.md"
folders:
- path: "SiberGuvenlik/BilgiSavasi"
priority: 1
- path: "Istihbarat/SorguTeknikleri"
priority: 1
- path: "GuvenlikStratejileri"
priority: 2
oracle:
name: "Oracle — OSINT & Dijital İstihbarat"
persona_file: "oracle/general.md"
folders:
- path: "SiberGuvenlik/OSINT"
priority: 1
- path: "Istihbarat/Arastirmalar"
priority: 2
wraith:
name: "Wraith — HUMINT & Karşı İstihbarat"
persona_file: "wraith/general.md"
folders:
- path: "Istihbarat/TeoriVeAnaliz"
priority: 1
- path: "Istihbarat/BiyografiVeAnilar"
priority: 1
- path: "Istihbarat/TurkIstihbarati"
priority: 1
- path: "Istihbarat/RusIstihbarati"
priority: 1
- path: "Istihbarat/SorguTeknikleri"
priority: 2
- path: "Istihbarat/IstihbaratTarihi"
priority: 2
scribe:
name: "Scribe — FOIA Arşivci"
persona_file: "scribe/general.md"
folders:
- path: "FOIA"
priority: 1
- path: "Istihbarat/CIA"
priority: 1
- path: "Istihbarat/FOIA-CIA-OrtaDogu"
priority: 1
- path: "Istihbarat/FOIA-CIA-SogukSavas"
priority: 1
- path: "Istihbarat/FOIA-CIA-Turkey"
priority: 1
- path: "Istihbarat/FOIA-FBI-COINTELPRO"
priority: 2
- path: "Istihbarat/FOIA-FBI-Vault"
priority: 2
- path: "Istihbarat/FOIA-IA-CIA-SogukSavas"
priority: 2
- path: "Istihbarat/FOIA-IA-CIA-Kuba-OrtaDogu"
priority: 2
- path: "Istihbarat/FOIA-IA-FBI"
priority: 2
- path: "Istihbarat/FOIA-IA-WWII"
priority: 2
- path: "SiberGuvenlik/FOIA-CyberWarfare"
priority: 2
- path: "SiberGuvenlik/FOIA-IA-NSA-SIGINT"
priority: 2
polyglot:
name: "Polyglot — Dilbilim & LINGINT"
persona_file: "polyglot/general.md"
folders:
- path: "Egitim"
priority: 1
# ══════════════════════════════════════════════════════════
# CYBERSECURITY CLUSTER
# ══════════════════════════════════════════════════════════
neo:
name: "Neo — Red Team & Exploit Dev"
persona_file: "neo/general.md"
folders:
- path: "SiberGuvenlik/PenetrasyonTesti"
priority: 1
- path: "SiberGuvenlik/SaldiriTeknikleri"
priority: 1
- path: "SiberGuvenlik/ZafiyetArastirmasi"
priority: 1
- path: "SiberGuvenlik/WebGuvenligi"
priority: 2
bastion:
name: "Bastion — Blue Team & DFIR"
persona_file: "bastion/general.md"
folders:
- path: "SiberGuvenlik/AdliBilisim"
priority: 1
- path: "SiberGuvenlik/GenelGuvenlik"
priority: 1
- path: "SiberGuvenlik/AgGuvenligi"
priority: 2
- path: "SiberGuvenlik/WindowsGuvenligi"
priority: 2
sentinel:
name: "Sentinel — Siber Tehdit İstihbaratı"
persona_file: "sentinel/general.md"
folders:
- path: "SiberGuvenlik/TehditIstihbarati"
priority: 1
- path: "SiberGuvenlik/SiberSavas"
priority: 1
- path: "SiberGuvenlik/SiberGuvenlikStratejisi"
priority: 1
- path: "SiberGuvenlik/FOIA-CyberWarfare"
priority: 2
specter:
name: "Specter — Zararlı Yazılım & Tersine Mühendislik"
persona_file: "specter/general.md"
folders:
- path: "SiberGuvenlik/ZararliYazilimAnalizi"
priority: 1
- path: "SiberGuvenlik/TersineMuhendislik"
priority: 1
- path: "SiberGuvenlik/KernelGuvenligi"
priority: 2
phantom:
name: "Phantom — Web Uygulama Güvenliği"
persona_file: "phantom/general.md"
folders:
- path: "SiberGuvenlik/WebGuvenligi"
priority: 1
- path: "SiberGuvenlik/PenetrasyonTesti"
priority: 2
- path: "SiberGuvenlik/BulutGuvenligi"
priority: 2
cipher:
name: "Cipher — Kriptografi"
persona_file: "cipher/general.md"
folders:
- path: "SiberGuvenlik/Kriptografi"
priority: 1
- path: "SiberGuvenlik/BilgiGuvenligi"
priority: 2
vortex:
name: "Vortex — Ağ Operasyonları"
persona_file: "vortex/general.md"
folders:
- path: "SiberGuvenlik/AgGuvenligi"
priority: 1
- path: "SiberGuvenlik/DonaninGuvenligi"
priority: 2
- path: "SiberGuvenlik/IoT"
priority: 2
# ══════════════════════════════════════════════════════════
# MILITARY CLUSTER
# ══════════════════════════════════════════════════════════
marshal:
name: "Marshal — Askeri Doktrin & Strateji"
persona_file: "marshal/general.md"
folders:
- path: "AskeriDoktrin"
priority: 1
- path: "NATO/Doktrin"
priority: 1
- path: "GuvenlikStratejileri"
priority: 1
- path: "NATO/Tatbikat"
priority: 2
centurion:
name: "Centurion — Askeri Tarih"
persona_file: "centurion/general.md"
folders:
- path: "AskeriTarih"
priority: 1
- path: "AskeriDoktrin"
priority: 2
- path: "DunyaTarihi"
priority: 2
corsair:
name: "Corsair — Özel Harekat & Düzensiz Savaş"
persona_file: "corsair/general.md"
folders:
- path: "AskeriDoktrin"
priority: 1
- path: "Istihbarat/TerorMucadele"
priority: 1
- path: "GuvenlikStratejileri"
priority: 2
warden:
name: "Warden — Savunma Analizi & Silah Sistemleri"
persona_file: "warden/general.md"
folders:
- path: "AskeriDoktrin"
priority: 1
- path: "NATO/Teknik"
priority: 1
- path: "GuvenlikStratejileri"
priority: 2
- path: "Istihbarat/SavunmaBakanligiRaporlari"
priority: 2
medic:
name: "Medic — Biyomedikal & KBRN"
persona_file: "medic/general.md"
folders:
- path: "Biyomedikal"
priority: 1
- path: "Istihbarat/KBRN"
priority: 1
- path: "BilimVeArastirma"
priority: 2
# ══════════════════════════════════════════════════════════
# HUMANITIES & ANALYSIS CLUSTER
# ══════════════════════════════════════════════════════════
chronos:
name: "Chronos — Dünya Tarihi & Medeniyet"
persona_file: "chronos/general.md"
folders:
- path: "DunyaTarihi"
priority: 1
- path: "OsmanliTarihi"
priority: 1
- path: "CumhuriyetTarihi"
priority: 1
- path: "RusyaTarihi"
priority: 1
- path: "YahudiTarihi"
priority: 2
- path: "AskeriTarih"
priority: 2
tribune:
name: "Tribune — Siyaset Bilimi & Rejim Analizi"
persona_file: "tribune/general.md"
folders:
- path: "UluslararasiIliskiler"
priority: 1
- path: "SETA"
priority: 1
- path: "ORSAM"
priority: 1
- path: "CumhuriyetTarihi"
priority: 2
arbiter:
name: "Arbiter — Uluslararası Hukuk"
persona_file: "arbiter/general.md"
folders:
- path: "Hukuk"
priority: 1
- path: "UluslararasiIliskiler"
priority: 2
- path: "NATO/Idari"
priority: 2
ledger:
name: "Ledger — Ekonomik İstihbarat & FININT"
persona_file: "ledger/general.md"
folders:
- path: "EkonomiVeFinans"
priority: 1
sage:
name: "Sage — Felsefe & İktidar Teorisi"
persona_file: "sage/general.md"
folders:
- path: "FelsefeVeEdebiyat"
priority: 1
herald:
name: "Herald — Medya Analizi & Stratejik İletişim"
persona_file: "herald/general.md"
folders:
- path: "SETA"
priority: 1
- path: "ORSAM"
priority: 2
- path: "UluslararasiIliskiler"
priority: 2
scholar:
name: "Scholar — Akademik Araştırma"
persona_file: "scholar/general.md"
folders:
- path: "BilimVeArastirma"
priority: 1
- path: "Egitim"
priority: 1
- path: "UluslararasiIliskiler"
priority: 2
gambit:
name: "Gambit — Satranç & Stratejik Düşünce"
persona_file: "gambit/general.md"
folders:
- path: "Satranc"
priority: 1
# ══════════════════════════════════════════════════════════
# ENGINEERING CLUSTER
# ══════════════════════════════════════════════════════════
forge:
name: "Forge — Yazılım & AI/ML"
persona_file: "forge/general.md"
folders:
- path: "AI"
priority: 1
- path: "Teknoloji"
priority: 1
- path: "SiberGuvenlik/Programlama"
priority: 2
- path: "SiberGuvenlik/YapayZekaGuvenligi"
priority: 2
architect:
name: "Architect — DevOps & Altyapı"
persona_file: "architect/general.md"
folders:
- path: "Teknoloji"
priority: 1
- path: "SiberGuvenlik/Linux"
priority: 2
- path: "SiberGuvenlik/BulutGuvenligi"
priority: 2

setup.py Normal file

@@ -0,0 +1,680 @@
#!/usr/bin/env python3
"""
AnythingLLM × Persona Library Integration Setup
Three-phase pipeline:
Phase A: Upload all text-based files
Phase B: OCR all scanned PDFs in-place
Phase C: Upload newly OCR'd files + assign to workspaces
Usage:
python3 setup.py --storage-setup # Symlink direct-uploads/ to HDD
python3 setup.py --create-workspaces # Create workspaces + load persona prompts
python3 setup.py --upload-documents # Full pipeline: upload → OCR → upload → assign
python3 setup.py --persona frodo # Single persona
python3 setup.py --cluster intel # Intel cluster (frodo,echo,ghost,oracle,wraith,scribe,polyglot)
python3 setup.py --cluster cyber # Cyber cluster
python3 setup.py --cluster military # Military cluster
python3 setup.py --cluster humanities # Humanities cluster
python3 setup.py --cluster engineering # Engineering cluster
python3 setup.py --priority 1 # Only priority 1 (core) folders
python3 setup.py --dry-run # Preview
python3 setup.py --status # Show state
"""
import argparse
import json
import os
import shutil
import subprocess
import sys
import time
from pathlib import Path
try:
    import requests
    import yaml
except ImportError:
    print("Missing dependencies; run: pip install requests pyyaml")
    sys.exit(1)
CONFIG_PATH = Path(__file__).parent / "config.yaml"
PROGRESS_PATH = Path(__file__).parent / "upload_progress.json"
ANYTHINGLLM_STORAGE = Path.home() / ".config/anythingllm-desktop/storage"
SKIP_EXT = set()
CLUSTERS = {
"intel": ["frodo", "echo", "ghost", "oracle", "wraith", "scribe", "polyglot"],
"cyber": ["neo", "bastion", "sentinel", "specter", "phantom", "cipher", "vortex"],
"military": ["marshal", "centurion", "corsair", "warden", "medic"],
"humanities": ["chronos", "tribune", "arbiter", "ledger", "sage", "herald", "scholar", "gambit"],
"engineering": ["forge", "architect"],
}
def load_config():
with open(CONFIG_PATH) as f:
cfg = yaml.safe_load(f)
global SKIP_EXT
SKIP_EXT = set(cfg["processing"]["skip_extensions"])
return cfg
def load_progress():
if PROGRESS_PATH.exists():
with open(PROGRESS_PATH) as f:
return json.load(f)
return {"uploaded_files": {}, "workspace_docs": {}, "ocr_done": [], "ocr_failed": []}
def save_progress(progress):
with open(PROGRESS_PATH, "w") as f:
json.dump(progress, f, indent=2, ensure_ascii=False)
# ──────────────────────────────────────────────────────────
# API
# ──────────────────────────────────────────────────────────
def api_request(config, method, endpoint, **kwargs):
url = f"{config['anythingllm']['base_url']}{endpoint}"
headers = {"Authorization": f"Bearer {config['anythingllm']['api_key']}"}
if "json" in kwargs:
headers["Content-Type"] = "application/json"
resp = getattr(requests, method)(url, headers=headers, **kwargs)
if resp.status_code not in (200, 201):
print(f" API error {resp.status_code}: {resp.text[:300]}")
return None
return resp.json()
def api_upload(config, file_path, folder_name=None):
endpoint = f"/document/upload/{folder_name}" if folder_name else "/document/upload"
url = f"{config['anythingllm']['base_url']}{endpoint}"
headers = {"Authorization": f"Bearer {config['anythingllm']['api_key']}"}
size_mb = file_path.stat().st_size / (1024 * 1024)
timeout = max(120, int(120 + (size_mb / 10) * 30))
try:
with open(file_path, "rb") as f:
files = {"file": (file_path.name, f, "application/octet-stream")}
resp = requests.post(url, headers=headers, files=files, timeout=timeout)
if resp.status_code not in (200, 201):
return None, resp.text[:200]
data = resp.json()
if data.get("success") and data.get("documents"):
return data["documents"], None
return None, data.get("error", "Unknown error")
except requests.exceptions.Timeout:
return None, f"timeout ({timeout}s)"
except Exception as e:
return None, str(e)
def check_api(config):
try:
return api_request(config, "get", "/auth") is not None
except Exception:
return False
def check_collector_alive():
try:
return requests.get("http://127.0.0.1:8888", timeout=3).status_code == 200
except Exception:
return False
def wait_for_collector(max_wait=90):
print(" ⏳ waiting for collector...", end="", flush=True)
for i in range(max_wait):
if check_collector_alive():
print("")
return True
time.sleep(1)
if i % 10 == 9:
print(".", end="", flush=True)
print(" ✗ timeout")
return False
def get_existing_workspaces(config):
result = api_request(config, "get", "/workspaces")
if result and "workspaces" in result:
return {ws["name"]: ws for ws in result["workspaces"]}
return {}
# ──────────────────────────────────────────────────────────
# PDF DETECTION & OCR
# ──────────────────────────────────────────────────────────
def is_scanned_pdf(file_path):
"""Fast scan detection via pdffonts (~0.04s vs pdftotext ~2s).
No fonts embedded = scanned/image-only PDF."""
if file_path.suffix.lower() != ".pdf":
return False
try:
proc = subprocess.Popen(
["pdffonts", "-l", "3", str(file_path)],
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
try:
stdout, _ = proc.communicate(timeout=3)
lines = [l for l in stdout.decode(errors="ignore").strip().split("\n") if l.strip()]
return len(lines) <= 2
except subprocess.TimeoutExpired:
proc.kill()
proc.wait()
return False # assume text if slow
except Exception:
return False
def ocr_pdf(file_path, language="tur+eng", dpi=200):
"""OCR a scanned PDF in-place. Returns True on success."""
import tempfile
tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=file_path.parent)
os.close(tmp_fd)
tmp_path = Path(tmp_path)
try:
result = subprocess.run(
["ocrmypdf", "--skip-text", "--rotate-pages", "--deskew",
"--jobs", "4", "--image-dpi", str(dpi), "-l", language,
"--output-type", "pdf", "--quiet",
str(file_path), str(tmp_path)],
capture_output=True, text=True, timeout=600,
)
if result.returncode in (0, 6) and tmp_path.exists() and tmp_path.stat().st_size > 0:
tmp_path.replace(file_path)
return True
tmp_path.unlink(missing_ok=True)
return False
except Exception:
tmp_path.unlink(missing_ok=True)
return False
# ──────────────────────────────────────────────────────────
# STEP 1: Storage offload
# ──────────────────────────────────────────────────────────
def storage_setup(config, dry_run=False):
hdd_base = Path(config["storage"]["hdd_storage"])
src = ANYTHINGLLM_STORAGE / "direct-uploads"
dst = hdd_base / "direct-uploads"
print("═══ Storage: direct-uploads/ → HDD ═══\n")
if src.is_symlink():
print(f" ✓ already symlinked → {src.resolve()}")
return
    if not dry_run:
        dst.mkdir(parents=True, exist_ok=True)
        if src.exists():
            shutil.copytree(str(src), str(dst), dirs_exist_ok=True)
            shutil.rmtree(src)
        src.symlink_to(dst)
        print(" ✓ done\n")
    else:
        print(f" (dry-run) would symlink {src} -> {dst}\n")
# ──────────────────────────────────────────────────────────
# STEP 2: Create workspaces + load prompts
# ──────────────────────────────────────────────────────────
def extract_system_prompt(config, persona_file):
personas_dir = Path(config["storage"]["personas_dir"])
fp = personas_dir / persona_file
if not fp.exists():
alt = personas_dir / persona_file.replace("/general.md", "") / "_meta.yaml"
fp = alt if alt.exists() else None
if not fp:
return None
content = fp.read_text(encoding="utf-8")
if fp.suffix == ".yaml":
meta = yaml.safe_load(content)
return meta.get("system_prompt", meta.get("description", ""))
parts = content.split("---")
if len(parts) >= 3:
try:
fm = yaml.safe_load(parts[1])
except yaml.YAMLError:
fm = {}
tone = fm.get("tone", "")
body = "---".join(parts[2:]).strip()
return f"Tone: {tone}\n\n{body}" if tone else body
return content
def create_workspaces(config, persona_list=None, dry_run=False):
print("═══ Creating Workspaces ═══\n")
if not config["anythingllm"]["api_key"]:
print(" ✗ API key not set!")
return
existing = get_existing_workspaces(config)
created = skipped = 0
for codename, ws_config in config["workspaces"].items():
if persona_list and codename not in persona_list:
continue
name = ws_config["name"]
persona_file = ws_config.get("persona_file", "")
system_prompt = extract_system_prompt(config, persona_file) if persona_file else ""
if name in existing:
# Update prompt if workspace exists
slug = existing[name].get("slug", "?")
if system_prompt and not dry_run:
api_request(config, "post", f"/workspace/{slug}/update",
json={"openAiPrompt": system_prompt})
print(f"{codename} (prompt: {len(system_prompt or '')} chars)")
skipped += 1
continue
print(f"{codename}: creating '{name}'")
if not dry_run:
result = api_request(config, "post", "/workspace/new", json={"name": name})
if result:
slug = result.get("workspace", {}).get("slug", "?")
if system_prompt:
api_request(config, "post", f"/workspace/{slug}/update",
json={"openAiPrompt": system_prompt})
print(f" ✓ created + prompt ({len(system_prompt)} chars)")
created += 1
else:
print(f" ✗ failed")
print(f"\n Created: {created}, Updated: {skipped}\n")
# ──────────────────────────────────────────────────────────
# STEP 3: Three-phase upload pipeline
# ──────────────────────────────────────────────────────────
def collect_all_files(config, book_library, persona_list=None, priority_filter=None, max_size_mb=100):
"""Scan all folders and classify files as text-based or scanned."""
text_files = {} # folder_name → [paths]
scanned_files = {} # folder_name → [paths]
persona_folders = {} # codename → [folder_names]
folder_to_path = {} # folder_name → source path
for codename, ws_config in config["workspaces"].items():
if persona_list and codename not in persona_list:
continue
persona_folders[codename] = []
for entry in ws_config.get("folders", []):
folder_path = entry["path"]
priority = entry.get("priority", 2)
if priority_filter and priority > priority_filter:
continue
folder_name = folder_path.replace("/", "_")
persona_folders[codename].append(folder_name)
if folder_name in text_files:
continue # already scanned
src = book_library / folder_path
folder_to_path[folder_name] = src
text_files[folder_name] = []
scanned_files[folder_name] = []
if not src.exists():
                print(f" ⚠ {folder_path} not found")
continue
# Use find with -printf for speed on HDD (one syscall, no per-file stat)
try:
find_result = subprocess.run(
["find", str(src), "-type", "f", "-not", "-empty",
"-printf", "%s %p\n"],
capture_output=True, text=True, timeout=120,
)
all_files = []
max_bytes = (max_size_mb or 9999) * 1024 * 1024
for line in find_result.stdout.strip().split("\n"):
if not line:
continue
parts = line.split(" ", 1)
if len(parts) != 2:
continue
size, path = int(parts[0]), Path(parts[1])
if path.suffix.lower() not in SKIP_EXT and size <= max_bytes:
all_files.append(path)
all_files.sort()
except Exception:
all_files = sorted(
f for f in src.rglob("*")
if f.is_file() and f.suffix.lower() not in SKIP_EXT
and f.stat().st_size > 0
and (not max_size_mb or f.stat().st_size / (1024*1024) <= max_size_mb)
)
print(f" {folder_name}: {len(all_files)} files found", flush=True)
# Classify every file (pdffonts is fast: ~0.04s per file)
for i, f in enumerate(all_files):
if is_scanned_pdf(f):
scanned_files[folder_name].append(f)
else:
text_files[folder_name].append(f)
if (i + 1) % 500 == 0:
print(f" {folder_name}: {i+1}/{len(all_files)} classified...", flush=True)
t = len(text_files[folder_name])
s = len(scanned_files[folder_name])
print(f" {folder_name}: {t} text, {s} scanned")
return text_files, scanned_files, persona_folders, folder_to_path
def upload_file_batch(config, folder_name, files, progress, batch_size, delay):
"""Upload a list of files to a folder. Returns (uploaded, failed) counts."""
uploaded = failed = 0
new_files = [f for f in files if str(f) not in progress["uploaded_files"]]
if not new_files:
return 0, 0
print(f"{folder_name}: {len(new_files)} files")
for i, fp in enumerate(new_files):
if uploaded > 0 and uploaded % batch_size == 0:
print(f" ⏸ batch pause ({delay}s)...")
time.sleep(delay)
size_mb = fp.stat().st_size / (1024 * 1024)
print(f" [{i+1}/{len(new_files)}] {fp.name} ({size_mb:.1f}MB)", end="", flush=True)
if not check_collector_alive():
print(f" ⚠ collector down")
if not wait_for_collector():
print(" ✗ stopping — restart AnythingLLM and --resume")
save_progress(progress)
return uploaded, failed
docs, error = None, None
for attempt in range(3):
docs, error = api_upload(config, fp, folder_name=folder_name)
if docs:
break
            if error and "not online" in str(error):
                time.sleep(5)
                if not wait_for_collector():
                    save_progress(progress)
                    return uploaded, failed
            elif attempt < 2:
                time.sleep(2)
                print(".", end="", flush=True)  # retry marker
if docs:
progress["uploaded_files"][str(fp)] = {
"location": docs[0].get("location", ""),
"folder": folder_name,
"name": docs[0].get("name", ""),
}
            uploaded += 1
            print(" ✓")
        else:
            failed += 1
            progress.setdefault("failed_files", {})[str(fp)] = {
                "folder": folder_name, "error": str(error)[:200],
            }
            print(f" ✗ {error}")
if uploaded % 10 == 0:
save_progress(progress)
save_progress(progress)
return uploaded, failed
def assign_to_workspaces(config, persona_folders, progress, batch_size, delay):
"""Phase C2: assign uploaded docs to persona workspaces."""
print("── Assigning to workspaces ──\n")
existing_ws = get_existing_workspaces(config)
for codename, folders in sorted(persona_folders.items()):
ws_name = config["workspaces"][codename]["name"]
ws_info = existing_ws.get(ws_name)
if not ws_info:
continue
slug = ws_info["slug"]
doc_locs = []
for fn in folders:
for fpath, info in progress["uploaded_files"].items():
if info.get("folder") == fn and info.get("location"):
doc_locs.append(info["location"])
already = set(progress.get("workspace_docs", {}).get(codename, []))
new_docs = [loc for loc in doc_locs if loc not in already]
if not new_docs:
if doc_locs:
                print(f" ✓ {codename}: {len(doc_locs)} docs already assigned")
continue
print(f"{codename} ({slug}): {len(new_docs)} docs")
for bs in range(0, len(new_docs), batch_size):
batch = new_docs[bs:bs + batch_size]
result = api_request(config, "post", f"/workspace/{slug}/update-embeddings",
json={"adds": batch, "deletes": []})
if result:
progress.setdefault("workspace_docs", {}).setdefault(codename, []).extend(batch)
                print(f" ✓ {len(batch)} docs embedded")
else:
print(f" ✗ batch failed")
if bs + batch_size < len(new_docs):
time.sleep(delay)
save_progress(progress)
print()
def upload_documents(config, persona_list=None, priority_filter=None,
dry_run=False, resume=False, max_size_mb=100):
"""Three-phase pipeline: text upload → OCR → OCR upload → assign."""
print("═══ Upload Pipeline ═══\n")
if not check_api(config):
print(" ✗ AnythingLLM API not reachable.")
return
book_library = Path(config["storage"]["book_library"])
batch_size = config["processing"]["batch_size"]
delay = config["processing"]["delay_between_batches"]
progress = load_progress() if resume else {
"uploaded_files": {}, "workspace_docs": {}, "ocr_done": [], "ocr_failed": [],
}
# Scan & classify
print(" Scanning folders...\n")
text_files, scanned_files, persona_folders, _ = collect_all_files(
config, book_library, persona_list, priority_filter, max_size_mb,
)
total_text = sum(len(v) for v in text_files.values())
total_scanned = sum(len(v) for v in scanned_files.values())
already = len(progress["uploaded_files"])
print(f"\n Text-based files: {total_text}")
print(f" Scanned PDFs: {total_scanned}")
print(f" Already uploaded: {already}")
print(f" Personas: {len(persona_folders)}\n")
if dry_run:
for fn in sorted(set(list(text_files.keys()) + list(scanned_files.keys()))):
t = len([f for f in text_files.get(fn, []) if str(f) not in progress["uploaded_files"]])
s = len([f for f in scanned_files.get(fn, []) if str(f) not in progress["ocr_done"]])
if t or s:
print(f" {fn}: {t} text, {s} scanned")
print(f"\n Personas:")
for c, flds in sorted(persona_folders.items()):
print(f" {c}: {', '.join(flds)}")
return
# ── Phase A: Upload text-based files ──
print("══ Phase A: Upload text-based files ══\n")
total_up = total_fail = 0
for fn, files in sorted(text_files.items()):
up, fail = upload_file_batch(config, fn, files, progress, batch_size, delay)
total_up += up
total_fail += fail
print(f"\n Phase A done: {total_up} uploaded, {total_fail} failed\n")
# ── Phase B: OCR scanned PDFs ──
total_scanned_remaining = sum(
1 for files in scanned_files.values()
for f in files if str(f) not in progress.get("ocr_done", [])
)
if total_scanned_remaining > 0:
print(f"══ Phase B: OCR {total_scanned_remaining} scanned PDFs ══\n")
ocr_ok = ocr_fail = 0
for fn, files in sorted(scanned_files.items()):
pending = [f for f in files if str(f) not in progress.get("ocr_done", [])]
if not pending:
continue
print(f"{fn}: {len(pending)} PDFs")
for i, pdf in enumerate(pending):
size_mb = pdf.stat().st_size / (1024 * 1024)
print(f" [{i+1}/{len(pending)}] {pdf.name} ({size_mb:.1f}MB)", end="", flush=True)
                if ocr_pdf(pdf):
                    progress.setdefault("ocr_done", []).append(str(pdf))
                    ocr_ok += 1
                    print(" ✓")
                else:
                    progress.setdefault("ocr_failed", []).append(str(pdf))
                    ocr_fail += 1
                    print(" ✗")
if (ocr_ok + ocr_fail) % 5 == 0:
save_progress(progress)
save_progress(progress)
print(f"\n Phase B done: {ocr_ok} OCR'd, {ocr_fail} failed\n")
# ── Phase C: Upload OCR'd files ──
ocr_to_upload = {fn: [f for f in files if str(f) in progress.get("ocr_done", [])]
for fn, files in scanned_files.items()}
total_ocr_upload = sum(
1 for files in ocr_to_upload.values()
for f in files if str(f) not in progress["uploaded_files"]
)
if total_ocr_upload > 0:
print(f"══ Phase C: Upload {total_ocr_upload} OCR'd files ══\n")
total_up2 = total_fail2 = 0
for fn, files in sorted(ocr_to_upload.items()):
if not files:
continue
up, fail = upload_file_batch(config, fn, files, progress, batch_size, delay)
total_up2 += up
total_fail2 += fail
print(f"\n Phase C done: {total_up2} uploaded, {total_fail2} failed\n")
# ── Assign to workspaces ──
assign_to_workspaces(config, persona_folders, progress, batch_size, delay)
# ──────────────────────────────────────────────────────────
# STATUS
# ──────────────────────────────────────────────────────────
def show_status(config):
print("═══ Integration Status ═══\n")
# Storage
du = ANYTHINGLLM_STORAGE / "direct-uploads"
if du.is_symlink():
print(f" ✓ direct-uploads/ → {du.resolve()} (HDD)")
elif du.exists():
print(f" ⚠ direct-uploads/ on SSD — run --storage-setup")
for d in ["documents", "lancedb", "vector-cache"]:
p = ANYTHINGLLM_STORAGE / d
if p.exists():
try:
sz = sum(f.stat().st_size for f in p.rglob("*") if f.is_file()) / (1024**2)
print(f"{d}/ ({sz:.0f} MB)")
except Exception:
print(f"{d}/")
api_ok = check_api(config) if config["anythingllm"]["api_key"] else False
collector_ok = check_collector_alive()
    print(f"\n API: {'✓' if api_ok else '✗'}  Collector: {'✓' if collector_ok else '✗'}")
progress = load_progress()
uploaded = len(progress.get("uploaded_files", {}))
ocr_done = len(progress.get("ocr_done", []))
assigned = len(progress.get("workspace_docs", {}))
if uploaded or ocr_done:
print(f"\n Uploaded: {uploaded} OCR'd: {ocr_done} Personas assigned: {assigned}")
folders = {}
for info in progress.get("uploaded_files", {}).values():
f = info.get("folder", "?")
folders[f] = folders.get(f, 0) + 1
for f, c in sorted(folders.items(), key=lambda x: -x[1])[:20]:
print(f" {c:4d} {f}")
print()
# ──────────────────────────────────────────────────────────
# MAIN
# ──────────────────────────────────────────────────────────
def resolve_persona_list(args, config):
"""Resolve --persona / --cluster to a list of codenames."""
if args.persona:
return [args.persona]
if args.cluster:
cl = CLUSTERS.get(args.cluster)
if not cl:
print(f"Unknown cluster: {args.cluster}")
print(f"Available: {', '.join(CLUSTERS.keys())}")
sys.exit(1)
return cl
return None # all
def main():
parser = argparse.ArgumentParser(description="AnythingLLM × Persona Integration")
parser.add_argument("--storage-setup", action="store_true")
parser.add_argument("--create-workspaces", action="store_true")
parser.add_argument("--upload-documents", action="store_true")
parser.add_argument("--all", action="store_true", help="Run all steps")
parser.add_argument("--status", action="store_true")
parser.add_argument("--persona", type=str, help="Single persona filter")
parser.add_argument("--cluster", type=str, help="Cluster filter: intel, cyber, military, humanities, engineering")
parser.add_argument("--priority", type=int, help="Max priority (1=core)")
parser.add_argument("--max-size", type=int, default=100, help="Max file MB (default: 100)")
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--resume", action="store_true")
args = parser.parse_args()
config = load_config()
if not any([args.storage_setup, args.create_workspaces,
args.upload_documents, args.all, args.status]):
parser.print_help()
return
persona_list = resolve_persona_list(args, config)
if args.status:
show_status(config)
return
if args.storage_setup or args.all:
storage_setup(config, dry_run=args.dry_run)
if args.create_workspaces or args.all:
create_workspaces(config, persona_list=persona_list, dry_run=args.dry_run)
if args.upload_documents or args.all:
upload_documents(config, persona_list=persona_list,
priority_filter=args.priority,
dry_run=args.dry_run,
resume=args.resume or args.all,
max_size_mb=args.max_size)
if __name__ == "__main__":
main()