Initial commit: AnythingLLM persona RAG integration
28 persona workspaces with document upload, an OCR pipeline, and vector-embedding assignment via the AnythingLLM API. Supports 5 clusters (intel, cyber, military, humanities, engineering) with batch processing and resume capability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10
.gitignore
vendored
Normal file
@@ -0,0 +1,10 @@
# State files (machine-specific, regenerated by script)
upload_progress.json

# OCR output (large binary files)
ocr_output/

# Python
__pycache__/
*.pyc
.venv/
63
README.md
Normal file
@@ -0,0 +1,63 @@
# AnythingLLM × Persona RAG Integration

A RAG system with 28 persona workspaces, fed from a book library. Each persona is vector-embedded with the documents of its own area of expertise.

## Architecture

- **AnythingLLM Desktop** — `http://localhost:3001`
- **LLM:** Ollama local (qwen3:14b)
- **Embedding:** Google Gemini (gemini-embedding-001)
- **Vector DB:** LanceDB
- **OCR:** ocrmypdf (tur+eng)
- **Book source:** `/mnt/storage/Common/Books/`
## Personas (5 Clusters)

| Cluster | Personas |
|---------|----------|
| Intel | Frodo, Echo, Ghost, Oracle, Wraith, Scribe, Polyglot |
| Cyber | Neo, Bastion, Sentinel, Specter, Phantom, Cipher, Vortex |
| Military | Marshal, Centurion, Corsair, Warden, Medic |
| Humanities | Chronos, Tribune, Arbiter, Ledger, Sage, Herald, Scholar, Gambit |
| Engineering | Forge, Architect |
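The `--cluster` flag resolves to these groups via a plain dict in `setup.py`. A minimal sketch of how a cluster or persona argument expands to workspace codenames (the `resolve_personas` helper is illustrative, not part of the script, which does this inline):

```python
# Cluster → persona mapping, as defined in setup.py
CLUSTERS = {
    "intel": ["frodo", "echo", "ghost", "oracle", "wraith", "scribe", "polyglot"],
    "cyber": ["neo", "bastion", "sentinel", "specter", "phantom", "cipher", "vortex"],
    "military": ["marshal", "centurion", "corsair", "warden", "medic"],
    "humanities": ["chronos", "tribune", "arbiter", "ledger", "sage", "herald", "scholar", "gambit"],
    "engineering": ["forge", "architect"],
}


def resolve_personas(cluster=None, persona=None):
    """Expand --cluster/--persona CLI args to a list of workspace codenames."""
    if persona:
        return [persona]          # --persona neo → just that workspace
    if cluster:
        return CLUSTERS[cluster]  # --cluster cyber → all members
    # no filter → every persona in every cluster
    return [p for members in CLUSTERS.values() for p in members]
```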
## Usage

```bash
# Check state
python3 setup.py --status

# Create / update workspaces
python3 setup.py --create-workspaces

# Full pipeline (upload + OCR + embed)
python3 setup.py --upload-documents --resume

# Single cluster or persona
python3 setup.py --upload-documents --cluster cyber --resume
python3 setup.py --upload-documents --persona neo --priority 1 --resume

# Preview
python3 setup.py --upload-documents --dry-run
```
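Large files need no extra flag: `setup.py` scales the per-file HTTP timeout with file size in `api_upload` (base 120 s plus 30 s per 10 MB). The formula, extracted as a standalone sketch:

```python
def upload_timeout(size_mb: float) -> int:
    """Upload timeout used by api_upload in setup.py:
    120 s base + 30 s for every 10 MB of file size, never below 120 s."""
    return max(120, int(120 + (size_mb / 10) * 30))
```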
## Pipeline

```
Phase A: Upload text-based files
Phase B: OCR scanned PDFs (ocrmypdf)
Phase C: Upload the OCR'd files
Final:   Assign/embed into workspaces
```
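Phase B decides what to OCR with a cheap heuristic from `setup.py`: run `pdffonts` on the first pages and treat the PDF as scanned when no embedded fonts are reported, i.e. the output is only the two header lines. A pure sketch of that decision, taking the `pdffonts` output as a string (the sample strings below are illustrative):

```python
def looks_scanned(pdffonts_output: str) -> bool:
    """True when pdffonts reports no embedded fonts — only its two
    header lines remain. Same <= 2 non-empty-line rule as setup.py."""
    lines = [l for l in pdffonts_output.strip().split("\n") if l.strip()]
    return len(lines) <= 2


# Typical pdffonts header with no font rows → treated as scanned
no_fonts = "name type encoding emb sub uni\n---- ---- -------- --- --- ---"
# A third row means at least one embedded font → treated as a text PDF
with_font = no_fonts + "\nTimes-Roman Type1 WinAnsi no no no"
```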
## Recovery

If the vector DB is deleted:
1. In `upload_progress.json`, reset `workspace_docs` → `{}`
2. `python3 setup.py --upload-documents --resume` (only re-embeds)
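Step 1 can be done by hand or with a few lines of Python. A sketch, assuming the default `upload_progress.json` layout that `setup.py` writes (`reset_embeddings` is an illustrative helper, not part of the script):

```python
import json
from pathlib import Path


def reset_embeddings(progress_path: Path) -> dict:
    """Clear workspace_docs so --resume re-embeds everything, while
    keeping uploaded_files so documents are not uploaded again."""
    progress = json.loads(progress_path.read_text(encoding="utf-8"))
    progress["workspace_docs"] = {}
    progress_path.write_text(json.dumps(progress, indent=2, ensure_ascii=False),
                             encoding="utf-8")
    return progress
```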
## Files

- `setup.py` — Main integration script (upload, OCR, workspace assignment)
- `config.yaml` — Persona-to-folder mappings, API config, batch settings
- `upload_progress.json` — Upload/assignment state tracker (gitignored)
425
config.yaml
Normal file
@@ -0,0 +1,425 @@
# AnythingLLM × Persona Library Integration Config
# Maps personas to book folders for workspace-based RAG
#
# Usage: python3 setup.py [--dry-run] [--persona <name>] [--upload-documents]

anythingllm:
  base_url: "http://localhost:3001/api/v1"
  api_key: "SXQGXH3-AQ64B8E-KQNMDWC-WZBQAFW"

storage:
  book_library: "/mnt/storage/Common/Books"
  personas_dir: "/home/salva/Documents/personas/personas"
  # AnythingLLM copies uploaded originals to direct-uploads/
  # This symlink sends them to HDD so SSD stays clean
  hdd_storage: "/mnt/storage/anythingllm"

embedding:
  primary:
    engine: "gemini"
    model: "gemini-embedding-001"
  fallback:
    engine: "ollama"
    base_path: "http://127.0.0.1:40114/olla/ollama"
    model: "nomic-embed-text"

# Batch processing — avoid API rate limits
processing:
  batch_size: 50              # files per batch
  delay_between_batches: 5    # seconds
  max_concurrent: 3           # parallel uploads
  skip_extensions:            # don't process these
    - ".bin"
    - ".gz"
    - ".zip"
    - ".html"
    - ".php"
    - ".jpg"
    - ".pptx"
    - ".ppt"
    - ".doc"

# ─────────────────────────────────────────────────────────────
# PERSONA → BOOK FOLDER MAPPINGS
# ─────────────────────────────────────────────────────────────
# priority: 1=core (always load), 2=extended (load if capacity allows)
# max_files: cap per folder to keep workspace focused

workspaces:

  # ══════════════════════════════════════════════════════════
  # INTELLIGENCE CLUSTER
  # ══════════════════════════════════════════════════════════

  frodo:
    name: "Frodo — Stratejik İstihbarat"
    persona_file: "frodo/general.md"
    folders:
      - path: "Istihbarat/TeoriVeAnaliz"
        priority: 1
      - path: "Istihbarat/Arastirmalar"
        priority: 1
      - path: "UluslararasiIliskiler"
        priority: 1
      - path: "GuvenlikStratejileri"
        priority: 1
      - path: "SETA"
        priority: 2
      - path: "ORSAM"
        priority: 2
      - path: "Istihbarat/TurkIstihbarati"
        priority: 2
      - path: "Istihbarat/RusIstihbarati"
        priority: 2

  echo:
    name: "Echo — SIGINT/COMINT"
    persona_file: "echo/general.md"
    folders:
      - path: "SiberGuvenlik/FOIA-IA-NSA-SIGINT"
        priority: 1
      - path: "SiberGuvenlik/ElektronikGuvenlik"
        priority: 1
      - path: "Istihbarat/FOIA-CIA-SogukSavas"
        priority: 2
      - path: "SiberGuvenlik/FOIA-SiberSavas"
        priority: 2

  ghost:
    name: "Ghost — PSYOP & Bilgi Savaşı"
    persona_file: "ghost/general.md"
    folders:
      - path: "SiberGuvenlik/BilgiSavasi"
        priority: 1
      - path: "Istihbarat/SorguTeknikleri"
        priority: 1
      - path: "GuvenlikStratejileri"
        priority: 2

  oracle:
    name: "Oracle — OSINT & Dijital İstihbarat"
    persona_file: "oracle/general.md"
    folders:
      - path: "SiberGuvenlik/OSINT"
        priority: 1
      - path: "Istihbarat/Arastirmalar"
        priority: 2

  wraith:
    name: "Wraith — HUMINT & Karşı İstihbarat"
    persona_file: "wraith/general.md"
    folders:
      - path: "Istihbarat/TeoriVeAnaliz"
        priority: 1
      - path: "Istihbarat/BiyografiVeAnilar"
        priority: 1
      - path: "Istihbarat/TurkIstihbarati"
        priority: 1
      - path: "Istihbarat/RusIstihbarati"
        priority: 1
      - path: "Istihbarat/SorguTeknikleri"
        priority: 2
      - path: "Istihbarat/IstihbaratTarihi"
        priority: 2

  scribe:
    name: "Scribe — FOIA Arşivci"
    persona_file: "scribe/general.md"
    folders:
      - path: "FOIA"
        priority: 1
      - path: "Istihbarat/CIA"
        priority: 1
      - path: "Istihbarat/FOIA-CIA-OrtaDogu"
        priority: 1
      - path: "Istihbarat/FOIA-CIA-SogukSavas"
        priority: 1
      - path: "Istihbarat/FOIA-CIA-Turkey"
        priority: 1
      - path: "Istihbarat/FOIA-FBI-COINTELPRO"
        priority: 2
      - path: "Istihbarat/FOIA-FBI-Vault"
        priority: 2
      - path: "Istihbarat/FOIA-IA-CIA-SogukSavas"
        priority: 2
      - path: "Istihbarat/FOIA-IA-CIA-Kuba-OrtaDogu"
        priority: 2
      - path: "Istihbarat/FOIA-IA-FBI"
        priority: 2
      - path: "Istihbarat/FOIA-IA-WWII"
        priority: 2
      - path: "SiberGuvenlik/FOIA-CyberWarfare"
        priority: 2
      - path: "SiberGuvenlik/FOIA-IA-NSA-SIGINT"
        priority: 2

  polyglot:
    name: "Polyglot — Dilbilim & LINGINT"
    persona_file: "polyglot/general.md"
    folders:
      - path: "Egitim"
        priority: 1

  # ══════════════════════════════════════════════════════════
  # CYBERSECURITY CLUSTER
  # ══════════════════════════════════════════════════════════

  neo:
    name: "Neo — Red Team & Exploit Dev"
    persona_file: "neo/general.md"
    folders:
      - path: "SiberGuvenlik/PenetrasyonTesti"
        priority: 1
      - path: "SiberGuvenlik/SaldiriTeknikleri"
        priority: 1
      - path: "SiberGuvenlik/ZafiyetArastirmasi"
        priority: 1
      - path: "SiberGuvenlik/WebGuvenligi"
        priority: 2

  bastion:
    name: "Bastion — Blue Team & DFIR"
    persona_file: "bastion/general.md"
    folders:
      - path: "SiberGuvenlik/AdliBilisim"
        priority: 1
      - path: "SiberGuvenlik/GenelGuvenlik"
        priority: 1
      - path: "SiberGuvenlik/AgGuvenligi"
        priority: 2
      - path: "SiberGuvenlik/WindowsGuvenligi"
        priority: 2

  sentinel:
    name: "Sentinel — Siber Tehdit İstihbaratı"
    persona_file: "sentinel/general.md"
    folders:
      - path: "SiberGuvenlik/TehditIstihbarati"
        priority: 1
      - path: "SiberGuvenlik/SiberSavas"
        priority: 1
      - path: "SiberGuvenlik/SiberGuvenlikStratejisi"
        priority: 1
      - path: "SiberGuvenlik/FOIA-CyberWarfare"
        priority: 2

  specter:
    name: "Specter — Zararlı Yazılım & Tersine Mühendislik"
    persona_file: "specter/general.md"
    folders:
      - path: "SiberGuvenlik/ZararliYazilimAnalizi"
        priority: 1
      - path: "SiberGuvenlik/TersineMuhendislik"
        priority: 1
      - path: "SiberGuvenlik/KernelGuvenligi"
        priority: 2

  phantom:
    name: "Phantom — Web Uygulama Güvenliği"
    persona_file: "phantom/general.md"
    folders:
      - path: "SiberGuvenlik/WebGuvenligi"
        priority: 1
      - path: "SiberGuvenlik/PenetrasyonTesti"
        priority: 2
      - path: "SiberGuvenlik/BulutGuvenligi"
        priority: 2

  cipher:
    name: "Cipher — Kriptografi"
    persona_file: "cipher/general.md"
    folders:
      - path: "SiberGuvenlik/Kriptografi"
        priority: 1
      - path: "SiberGuvenlik/BilgiGuvenligi"
        priority: 2

  vortex:
    name: "Vortex — Ağ Operasyonları"
    persona_file: "vortex/general.md"
    folders:
      - path: "SiberGuvenlik/AgGuvenligi"
        priority: 1
      - path: "SiberGuvenlik/DonaninGuvenligi"
        priority: 2
      - path: "SiberGuvenlik/IoT"
        priority: 2

  # ══════════════════════════════════════════════════════════
  # MILITARY CLUSTER
  # ══════════════════════════════════════════════════════════

  marshal:
    name: "Marshal — Askeri Doktrin & Strateji"
    persona_file: "marshal/general.md"
    folders:
      - path: "AskeriDoktrin"
        priority: 1
      - path: "NATO/Doktrin"
        priority: 1
      - path: "GuvenlikStratejileri"
        priority: 1
      - path: "NATO/Tatbikat"
        priority: 2

  centurion:
    name: "Centurion — Askeri Tarih"
    persona_file: "centurion/general.md"
    folders:
      - path: "AskeriTarih"
        priority: 1
      - path: "AskeriDoktrin"
        priority: 2
      - path: "DunyaTarihi"
        priority: 2

  corsair:
    name: "Corsair — Özel Harekat & Düzensiz Savaş"
    persona_file: "corsair/general.md"
    folders:
      - path: "AskeriDoktrin"
        priority: 1
      - path: "Istihbarat/TerorMucadele"
        priority: 1
      - path: "GuvenlikStratejileri"
        priority: 2

  warden:
    name: "Warden — Savunma Analizi & Silah Sistemleri"
    persona_file: "warden/general.md"
    folders:
      - path: "AskeriDoktrin"
        priority: 1
      - path: "NATO/Teknik"
        priority: 1
      - path: "GuvenlikStratejileri"
        priority: 2
      - path: "Istihbarat/SavunmaBakanligiRaporlari"
        priority: 2

  medic:
    name: "Medic — Biyomedikal & KBRN"
    persona_file: "medic/general.md"
    folders:
      - path: "Biyomedikal"
        priority: 1
      - path: "Istihbarat/KBRN"
        priority: 1
      - path: "BilimVeArastirma"
        priority: 2

  # ══════════════════════════════════════════════════════════
  # HUMANITIES & ANALYSIS CLUSTER
  # ══════════════════════════════════════════════════════════

  chronos:
    name: "Chronos — Dünya Tarihi & Medeniyet"
    persona_file: "chronos/general.md"
    folders:
      - path: "DunyaTarihi"
        priority: 1
      - path: "OsmanliTarihi"
        priority: 1
      - path: "CumhuriyetTarihi"
        priority: 1
      - path: "RusyaTarihi"
        priority: 1
      - path: "YahudiTarihi"
        priority: 2
      - path: "AskeriTarih"
        priority: 2

  tribune:
    name: "Tribune — Siyaset Bilimi & Rejim Analizi"
    persona_file: "tribune/general.md"
    folders:
      - path: "UluslararasiIliskiler"
        priority: 1
      - path: "SETA"
        priority: 1
      - path: "ORSAM"
        priority: 1
      - path: "CumhuriyetTarihi"
        priority: 2

  arbiter:
    name: "Arbiter — Uluslararası Hukuk"
    persona_file: "arbiter/general.md"
    folders:
      - path: "Hukuk"
        priority: 1
      - path: "UluslararasiIliskiler"
        priority: 2
      - path: "NATO/Idari"
        priority: 2

  ledger:
    name: "Ledger — Ekonomik İstihbarat & FININT"
    persona_file: "ledger/general.md"
    folders:
      - path: "EkonomiVeFinans"
        priority: 1

  sage:
    name: "Sage — Felsefe & İktidar Teorisi"
    persona_file: "sage/general.md"
    folders:
      - path: "FelsefeVeEdebiyat"
        priority: 1

  herald:
    name: "Herald — Medya Analizi & Stratejik İletişim"
    persona_file: "herald/general.md"
    folders:
      - path: "SETA"
        priority: 1
      - path: "ORSAM"
        priority: 2
      - path: "UluslararasiIliskiler"
        priority: 2

  scholar:
    name: "Scholar — Akademik Araştırma"
    persona_file: "scholar/general.md"
    folders:
      - path: "BilimVeArastirma"
        priority: 1
      - path: "Egitim"
        priority: 1
      - path: "UluslararasiIliskiler"
        priority: 2

  gambit:
    name: "Gambit — Satranç & Stratejik Düşünce"
    persona_file: "gambit/general.md"
    folders:
      - path: "Satranc"
        priority: 1

  # ══════════════════════════════════════════════════════════
  # ENGINEERING CLUSTER
  # ══════════════════════════════════════════════════════════

  forge:
    name: "Forge — Yazılım & AI/ML"
    persona_file: "forge/general.md"
    folders:
      - path: "AI"
        priority: 1
      - path: "Teknoloji"
        priority: 1
      - path: "SiberGuvenlik/Programlama"
        priority: 2
      - path: "SiberGuvenlik/YapayZekaGuvenligi"
        priority: 2

  architect:
    name: "Architect — DevOps & Altyapı"
    persona_file: "architect/general.md"
    folders:
      - path: "Teknoloji"
        priority: 1
      - path: "SiberGuvenlik/Linux"
        priority: 2
      - path: "SiberGuvenlik/BulutGuvenligi"
        priority: 2
680
setup.py
Normal file
@@ -0,0 +1,680 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
AnythingLLM × Persona Library Integration Setup
|
||||||
|
|
||||||
|
Three-phase pipeline:
|
||||||
|
Phase A: Upload all text-based files
|
||||||
|
Phase B: OCR all scanned PDFs in-place
|
||||||
|
Phase C: Upload newly OCR'd files + assign to workspaces
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python3 setup.py --storage-setup # Symlink direct-uploads/ to HDD
|
||||||
|
python3 setup.py --create-workspaces # Create workspaces + load persona prompts
|
||||||
|
python3 setup.py --upload-documents # Full pipeline: upload → OCR → upload → assign
|
||||||
|
python3 setup.py --persona frodo # Single persona
|
||||||
|
python3 setup.py --cluster intel # Intel cluster (frodo,echo,ghost,oracle,wraith,scribe,polyglot)
|
||||||
|
python3 setup.py --cluster cyber # Cyber cluster
|
||||||
|
python3 setup.py --cluster military # Military cluster
|
||||||
|
python3 setup.py --cluster humanities # Humanities cluster
|
||||||
|
python3 setup.py --cluster engineering # Engineering cluster
|
||||||
|
python3 setup.py --priority 1 # Only priority 1 (core) folders
|
||||||
|
python3 setup.py --dry-run # Preview
|
||||||
|
python3 setup.py --status # Show state
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
|
||||||
|
try:
|
||||||
|
import requests
|
||||||
|
except ImportError:
|
||||||
|
print("pip install requests pyyaml")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
CONFIG_PATH = Path(__file__).parent / "config.yaml"
|
||||||
|
PROGRESS_PATH = Path(__file__).parent / "upload_progress.json"
|
||||||
|
ANYTHINGLLM_STORAGE = Path.home() / ".config/anythingllm-desktop/storage"
|
||||||
|
SKIP_EXT = set()
|
||||||
|
|
||||||
|
CLUSTERS = {
|
||||||
|
"intel": ["frodo", "echo", "ghost", "oracle", "wraith", "scribe", "polyglot"],
|
||||||
|
"cyber": ["neo", "bastion", "sentinel", "specter", "phantom", "cipher", "vortex"],
|
||||||
|
"military": ["marshal", "centurion", "corsair", "warden", "medic"],
|
||||||
|
"humanities": ["chronos", "tribune", "arbiter", "ledger", "sage", "herald", "scholar", "gambit"],
|
||||||
|
"engineering": ["forge", "architect"],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def load_config():
|
||||||
|
with open(CONFIG_PATH) as f:
|
||||||
|
cfg = yaml.safe_load(f)
|
||||||
|
global SKIP_EXT
|
||||||
|
SKIP_EXT = set(cfg["processing"]["skip_extensions"])
|
||||||
|
return cfg
|
||||||
|
|
||||||
|
|
||||||
|
def load_progress():
|
||||||
|
if PROGRESS_PATH.exists():
|
||||||
|
with open(PROGRESS_PATH) as f:
|
||||||
|
return json.load(f)
|
||||||
|
return {"uploaded_files": {}, "workspace_docs": {}, "ocr_done": [], "ocr_failed": []}
|
||||||
|
|
||||||
|
|
||||||
|
def save_progress(progress):
|
||||||
|
with open(PROGRESS_PATH, "w") as f:
|
||||||
|
json.dump(progress, f, indent=2, ensure_ascii=False)
|
||||||
|
|
||||||
|
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
# API
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def api_request(config, method, endpoint, **kwargs):
|
||||||
|
url = f"{config['anythingllm']['base_url']}{endpoint}"
|
||||||
|
headers = {"Authorization": f"Bearer {config['anythingllm']['api_key']}"}
|
||||||
|
if "json" in kwargs:
|
||||||
|
headers["Content-Type"] = "application/json"
|
||||||
|
resp = getattr(requests, method)(url, headers=headers, **kwargs)
|
||||||
|
if resp.status_code not in (200, 201):
|
||||||
|
print(f" API error {resp.status_code}: {resp.text[:300]}")
|
||||||
|
return None
|
||||||
|
return resp.json()
|
||||||
|
|
||||||
|
|
||||||
|
def api_upload(config, file_path, folder_name=None):
|
||||||
|
endpoint = f"/document/upload/{folder_name}" if folder_name else "/document/upload"
|
||||||
|
url = f"{config['anythingllm']['base_url']}{endpoint}"
|
||||||
|
headers = {"Authorization": f"Bearer {config['anythingllm']['api_key']}"}
|
||||||
|
size_mb = file_path.stat().st_size / (1024 * 1024)
|
||||||
|
timeout = max(120, int(120 + (size_mb / 10) * 30))
|
||||||
|
try:
|
||||||
|
with open(file_path, "rb") as f:
|
||||||
|
files = {"file": (file_path.name, f, "application/octet-stream")}
|
||||||
|
resp = requests.post(url, headers=headers, files=files, timeout=timeout)
|
||||||
|
if resp.status_code not in (200, 201):
|
||||||
|
return None, resp.text[:200]
|
||||||
|
data = resp.json()
|
||||||
|
if data.get("success") and data.get("documents"):
|
||||||
|
return data["documents"], None
|
||||||
|
return None, data.get("error", "Unknown error")
|
||||||
|
except requests.exceptions.Timeout:
|
||||||
|
return None, f"timeout ({timeout}s)"
|
||||||
|
except Exception as e:
|
||||||
|
return None, str(e)
|
||||||
|
|
||||||
|
|
||||||
|
def check_api(config):
|
||||||
|
try:
|
||||||
|
return api_request(config, "get", "/auth") is not None
|
||||||
|
except Exception:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def check_collector_alive():
|
||||||
|
try:
|
||||||
|
return requests.get("http://127.0.0.1:8888", timeout=3).status_code == 200
|
||||||
|
except Exception:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def wait_for_collector(max_wait=90):
|
||||||
|
print(" ⏳ waiting for collector...", end="", flush=True)
|
||||||
|
for i in range(max_wait):
|
||||||
|
if check_collector_alive():
|
||||||
|
print(" ✓")
|
||||||
|
return True
|
||||||
|
time.sleep(1)
|
||||||
|
if i % 10 == 9:
|
||||||
|
print(".", end="", flush=True)
|
||||||
|
print(" ✗ timeout")
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def get_existing_workspaces(config):
|
||||||
|
result = api_request(config, "get", "/workspaces")
|
||||||
|
if result and "workspaces" in result:
|
||||||
|
return {ws["name"]: ws for ws in result["workspaces"]}
|
||||||
|
return {}
|
||||||
|
|
||||||
|
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
# PDF DETECTION & OCR
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def is_scanned_pdf(file_path):
|
||||||
|
"""Fast scan detection via pdffonts (~0.04s vs pdftotext ~2s).
|
||||||
|
No fonts embedded = scanned/image-only PDF."""
|
||||||
|
if file_path.suffix.lower() != ".pdf":
|
||||||
|
return False
|
||||||
|
try:
|
||||||
|
proc = subprocess.Popen(
|
||||||
|
["pdffonts", "-l", "3", str(file_path)],
|
||||||
|
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
stdout, _ = proc.communicate(timeout=3)
|
||||||
|
lines = [l for l in stdout.decode(errors="ignore").strip().split("\n") if l.strip()]
|
||||||
|
return len(lines) <= 2
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
proc.kill()
|
||||||
|
proc.wait()
|
||||||
|
return False # assume text if slow
|
||||||
|
except Exception:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def ocr_pdf(file_path, language="tur+eng", dpi=200):
|
||||||
|
"""OCR a scanned PDF in-place. Returns True on success."""
|
||||||
|
import tempfile
|
||||||
|
tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=file_path.parent)
|
||||||
|
os.close(tmp_fd)
|
||||||
|
tmp_path = Path(tmp_path)
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
["ocrmypdf", "--skip-text", "--rotate-pages", "--deskew",
|
||||||
|
"--jobs", "4", "--image-dpi", str(dpi), "-l", language,
|
||||||
|
"--output-type", "pdf", "--quiet",
|
||||||
|
str(file_path), str(tmp_path)],
|
||||||
|
capture_output=True, text=True, timeout=600,
|
||||||
|
)
|
||||||
|
if result.returncode in (0, 6) and tmp_path.exists() and tmp_path.stat().st_size > 0:
|
||||||
|
tmp_path.replace(file_path)
|
||||||
|
return True
|
||||||
|
tmp_path.unlink(missing_ok=True)
|
||||||
|
return False
|
||||||
|
except Exception:
|
||||||
|
tmp_path.unlink(missing_ok=True)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
# STEP 1: Storage offload
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def storage_setup(config, dry_run=False):
|
||||||
|
hdd_base = Path(config["storage"]["hdd_storage"])
|
||||||
|
src = ANYTHINGLLM_STORAGE / "direct-uploads"
|
||||||
|
dst = hdd_base / "direct-uploads"
|
||||||
|
|
||||||
|
print("═══ Storage: direct-uploads/ → HDD ═══\n")
|
||||||
|
if src.is_symlink():
|
||||||
|
print(f" ✓ already symlinked → {src.resolve()}")
|
||||||
|
return
|
||||||
|
if not dry_run:
|
||||||
|
dst.mkdir(parents=True, exist_ok=True)
|
||||||
|
if src.exists():
|
||||||
|
shutil.copytree(str(src), str(dst), dirs_exist_ok=True)
|
||||||
|
shutil.rmtree(src)
|
||||||
|
src.symlink_to(dst)
|
||||||
|
print(f" ✓ done\n")
|
||||||
|
|
||||||
|
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
# STEP 2: Create workspaces + load prompts
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def extract_system_prompt(config, persona_file):
|
||||||
|
personas_dir = Path(config["storage"]["personas_dir"])
|
||||||
|
fp = personas_dir / persona_file
|
||||||
|
if not fp.exists():
|
||||||
|
alt = personas_dir / persona_file.replace("/general.md", "") / "_meta.yaml"
|
||||||
|
fp = alt if alt.exists() else None
|
||||||
|
if not fp:
|
||||||
|
return None
|
||||||
|
|
||||||
|
content = fp.read_text(encoding="utf-8")
|
||||||
|
if fp.suffix == ".yaml":
|
||||||
|
meta = yaml.safe_load(content)
|
||||||
|
return meta.get("system_prompt", meta.get("description", ""))
|
||||||
|
|
||||||
|
parts = content.split("---")
|
||||||
|
if len(parts) >= 3:
|
||||||
|
try:
|
||||||
|
fm = yaml.safe_load(parts[1])
|
||||||
|
except yaml.YAMLError:
|
||||||
|
fm = {}
|
||||||
|
tone = fm.get("tone", "")
|
||||||
|
body = "---".join(parts[2:]).strip()
|
||||||
|
return f"Tone: {tone}\n\n{body}" if tone else body
|
||||||
|
return content
|
||||||
|
|
||||||
|
|
||||||
|
def create_workspaces(config, persona_list=None, dry_run=False):
|
||||||
|
print("═══ Creating Workspaces ═══\n")
|
||||||
|
if not config["anythingllm"]["api_key"]:
|
||||||
|
print(" ✗ API key not set!")
|
||||||
|
return
|
||||||
|
|
||||||
|
existing = get_existing_workspaces(config)
|
||||||
|
created = skipped = 0
|
||||||
|
|
||||||
|
for codename, ws_config in config["workspaces"].items():
|
||||||
|
if persona_list and codename not in persona_list:
|
||||||
|
continue
|
||||||
|
name = ws_config["name"]
|
||||||
|
persona_file = ws_config.get("persona_file", "")
|
||||||
|
system_prompt = extract_system_prompt(config, persona_file) if persona_file else ""
|
||||||
|
|
||||||
|
if name in existing:
|
||||||
|
# Update prompt if workspace exists
|
||||||
|
slug = existing[name].get("slug", "?")
|
||||||
|
if system_prompt and not dry_run:
|
||||||
|
api_request(config, "post", f"/workspace/{slug}/update",
|
||||||
|
json={"openAiPrompt": system_prompt})
|
||||||
|
print(f" ✓ {codename} (prompt: {len(system_prompt or '')} chars)")
|
||||||
|
skipped += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
print(f" → {codename}: creating '{name}'")
|
||||||
|
if not dry_run:
|
||||||
|
result = api_request(config, "post", "/workspace/new", json={"name": name})
|
||||||
|
if result:
|
||||||
|
slug = result.get("workspace", {}).get("slug", "?")
|
||||||
|
if system_prompt:
|
||||||
|
api_request(config, "post", f"/workspace/{slug}/update",
|
||||||
|
json={"openAiPrompt": system_prompt})
|
||||||
|
print(f" ✓ created + prompt ({len(system_prompt)} chars)")
|
||||||
|
created += 1
|
||||||
|
else:
|
||||||
|
print(f" ✗ failed")
|
||||||
|
|
||||||
|
print(f"\n Created: {created}, Updated: {skipped}\n")
|
||||||
|
|
||||||
|
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
# STEP 3: Three-phase upload pipeline
|
||||||
|
# ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def collect_all_files(config, book_library, persona_list=None, priority_filter=None, max_size_mb=100):
|
||||||
|
"""Scan all folders and classify files as text-based or scanned."""
|
||||||
|
text_files = {} # folder_name → [paths]
|
||||||
|
scanned_files = {} # folder_name → [paths]
|
||||||
|
persona_folders = {} # codename → [folder_names]
|
||||||
|
folder_to_path = {} # folder_name → source path
|
||||||
|
|
||||||
|
    for codename, ws_config in config["workspaces"].items():
        if persona_list and codename not in persona_list:
            continue
        persona_folders[codename] = []

        for entry in ws_config.get("folders", []):
            folder_path = entry["path"]
            priority = entry.get("priority", 2)
            if priority_filter and priority > priority_filter:
                continue

            folder_name = folder_path.replace("/", "_")
            persona_folders[codename].append(folder_name)

            if folder_name in text_files:
                continue  # already scanned

            src = book_library / folder_path
            folder_to_path[folder_name] = src
            text_files[folder_name] = []
            scanned_files[folder_name] = []

            if not src.exists():
                print(f" ⚠ {folder_path} not found")
                continue

            # Use find with -printf for speed on HDD (one syscall, no per-file stat)
            try:
                find_result = subprocess.run(
                    ["find", str(src), "-type", "f", "-not", "-empty",
                     "-printf", "%s %p\n"],
                    capture_output=True, text=True, timeout=120,
                )
                all_files = []
                max_bytes = (max_size_mb or 9999) * 1024 * 1024
                for line in find_result.stdout.strip().split("\n"):
                    if not line:
                        continue
                    parts = line.split(" ", 1)
                    if len(parts) != 2:
                        continue
                    size, path = int(parts[0]), Path(parts[1])
                    if path.suffix.lower() not in SKIP_EXT and size <= max_bytes:
                        all_files.append(path)
                all_files.sort()
            except Exception:
                all_files = sorted(
                    f for f in src.rglob("*")
                    if f.is_file() and f.suffix.lower() not in SKIP_EXT
                    and f.stat().st_size > 0
                    and (not max_size_mb or f.stat().st_size / (1024*1024) <= max_size_mb)
                )
            print(f" {folder_name}: {len(all_files)} files found", flush=True)

            # Classify every file (pdffonts is fast: ~0.04s per file)
            for i, f in enumerate(all_files):
                if is_scanned_pdf(f):
                    scanned_files[folder_name].append(f)
                else:
                    text_files[folder_name].append(f)
                if (i + 1) % 500 == 0:
                    print(f" {folder_name}: {i+1}/{len(all_files)} classified...", flush=True)

            t = len(text_files[folder_name])
            s = len(scanned_files[folder_name])
            print(f" {folder_name}: {t} text, {s} scanned")

    return text_files, scanned_files, persona_folders, folder_to_path


def upload_file_batch(config, folder_name, files, progress, batch_size, delay):
    """Upload a list of files to a folder. Returns (uploaded, failed) counts."""
    uploaded = failed = 0
    new_files = [f for f in files if str(f) not in progress["uploaded_files"]]
    if not new_files:
        return 0, 0

    print(f" → {folder_name}: {len(new_files)} files")

    for i, fp in enumerate(new_files):
        if uploaded > 0 and uploaded % batch_size == 0:
            print(f" ⏸ batch pause ({delay}s)...")
            time.sleep(delay)

        size_mb = fp.stat().st_size / (1024 * 1024)
        print(f" [{i+1}/{len(new_files)}] {fp.name} ({size_mb:.1f}MB)", end="", flush=True)

        if not check_collector_alive():
            print(" ⚠ collector down")
            if not wait_for_collector():
                print(" ✗ stopping — restart AnythingLLM and --resume")
                save_progress(progress)
                return uploaded, failed

        docs, error = None, None
        for attempt in range(3):
            docs, error = api_upload(config, fp, folder_name=folder_name)
            if docs:
                break
            if error and "not online" in str(error):
                print(" ↻", end="", flush=True)
                time.sleep(5)
                if not wait_for_collector():
                    save_progress(progress)
                    return uploaded, failed
            elif attempt < 2:
                time.sleep(2)
                print(" ↻", end="", flush=True)

        if docs:
            progress["uploaded_files"][str(fp)] = {
                "location": docs[0].get("location", ""),
                "folder": folder_name,
                "name": docs[0].get("name", ""),
            }
            uploaded += 1
            print(" ✓")
        else:
            failed += 1
            progress.setdefault("failed_files", {})[str(fp)] = {
                "folder": folder_name, "error": str(error)[:200],
            }
            print(f" ✗ {error}")

        if uploaded % 10 == 0:
            save_progress(progress)

    save_progress(progress)
    return uploaded, failed


def assign_to_workspaces(config, persona_folders, progress, batch_size, delay):
    """Phase C2: assign uploaded docs to persona workspaces."""
    print("── Assigning to workspaces ──\n")
    existing_ws = get_existing_workspaces(config)

    for codename, folders in sorted(persona_folders.items()):
        ws_name = config["workspaces"][codename]["name"]
        ws_info = existing_ws.get(ws_name)
        if not ws_info:
            continue

        slug = ws_info["slug"]
        doc_locs = []
        for fn in folders:
            for info in progress["uploaded_files"].values():
                if info.get("folder") == fn and info.get("location"):
                    doc_locs.append(info["location"])

        already = set(progress.get("workspace_docs", {}).get(codename, []))
        new_docs = [loc for loc in doc_locs if loc not in already]
        if not new_docs:
            if doc_locs:
                print(f" ✓ {codename}: {len(doc_locs)} docs assigned")
            continue

        print(f" → {codename} ({slug}): {len(new_docs)} docs")
        for bs in range(0, len(new_docs), batch_size):
            batch = new_docs[bs:bs + batch_size]
            result = api_request(config, "post", f"/workspace/{slug}/update-embeddings",
                                 json={"adds": batch, "deletes": []})
            if result:
                progress.setdefault("workspace_docs", {}).setdefault(codename, []).extend(batch)
                print(f" ✓ {len(batch)} docs embedded")
            else:
                print(" ✗ batch failed")
            if bs + batch_size < len(new_docs):
                time.sleep(delay)
        save_progress(progress)
    print()


def upload_documents(config, persona_list=None, priority_filter=None,
                     dry_run=False, resume=False, max_size_mb=100):
    """Three-phase pipeline: text upload → OCR → OCR upload → assign."""
    print("═══ Upload Pipeline ═══\n")

    if not check_api(config):
        print(" ✗ AnythingLLM API not reachable.")
        return

    book_library = Path(config["storage"]["book_library"])
    batch_size = config["processing"]["batch_size"]
    delay = config["processing"]["delay_between_batches"]
    progress = load_progress() if resume else {
        "uploaded_files": {}, "workspace_docs": {}, "ocr_done": [], "ocr_failed": [],
    }

    # Scan & classify
    print(" Scanning folders...\n")
    text_files, scanned_files, persona_folders, _ = collect_all_files(
        config, book_library, persona_list, priority_filter, max_size_mb,
    )

    total_text = sum(len(v) for v in text_files.values())
    total_scanned = sum(len(v) for v in scanned_files.values())
    already = len(progress["uploaded_files"])

    print(f"\n Text-based files: {total_text}")
    print(f" Scanned PDFs: {total_scanned}")
    print(f" Already uploaded: {already}")
    print(f" Personas: {len(persona_folders)}\n")

    if dry_run:
        for fn in sorted(set(list(text_files.keys()) + list(scanned_files.keys()))):
            t = len([f for f in text_files.get(fn, []) if str(f) not in progress["uploaded_files"]])
            s = len([f for f in scanned_files.get(fn, []) if str(f) not in progress["ocr_done"]])
            if t or s:
                print(f" {fn}: {t} text, {s} scanned")
        print("\n Personas:")
        for c, flds in sorted(persona_folders.items()):
            print(f" {c}: {', '.join(flds)}")
        return

    # ── Phase A: Upload text-based files ──
    print("══ Phase A: Upload text-based files ══\n")
    total_up = total_fail = 0
    for fn, files in sorted(text_files.items()):
        up, fail = upload_file_batch(config, fn, files, progress, batch_size, delay)
        total_up += up
        total_fail += fail
    print(f"\n Phase A done: {total_up} uploaded, {total_fail} failed\n")

    # ── Phase B: OCR scanned PDFs ──
    total_scanned_remaining = sum(
        1 for files in scanned_files.values()
        for f in files if str(f) not in progress.get("ocr_done", [])
    )
    if total_scanned_remaining > 0:
        print(f"══ Phase B: OCR {total_scanned_remaining} scanned PDFs ══\n")
        ocr_ok = ocr_fail = 0
        for fn, files in sorted(scanned_files.items()):
            pending = [f for f in files if str(f) not in progress.get("ocr_done", [])]
            if not pending:
                continue
            print(f" → {fn}: {len(pending)} PDFs")
            for i, pdf in enumerate(pending):
                size_mb = pdf.stat().st_size / (1024 * 1024)
                print(f" [{i+1}/{len(pending)}] {pdf.name} ({size_mb:.1f}MB)", end="", flush=True)
                if ocr_pdf(pdf):
                    progress.setdefault("ocr_done", []).append(str(pdf))
                    ocr_ok += 1
                    print(" ✓")
                else:
                    progress.setdefault("ocr_failed", []).append(str(pdf))
                    ocr_fail += 1
                    print(" ✗")
                if (ocr_ok + ocr_fail) % 5 == 0:
                    save_progress(progress)
        save_progress(progress)
        print(f"\n Phase B done: {ocr_ok} OCR'd, {ocr_fail} failed\n")

    # ── Phase C: Upload OCR'd files ──
    ocr_to_upload = {fn: [f for f in files if str(f) in progress.get("ocr_done", [])]
                     for fn, files in scanned_files.items()}
    total_ocr_upload = sum(
        1 for files in ocr_to_upload.values()
        for f in files if str(f) not in progress["uploaded_files"]
    )
    if total_ocr_upload > 0:
        print(f"══ Phase C: Upload {total_ocr_upload} OCR'd files ══\n")
        total_up2 = total_fail2 = 0
        for fn, files in sorted(ocr_to_upload.items()):
            if not files:
                continue
            up, fail = upload_file_batch(config, fn, files, progress, batch_size, delay)
            total_up2 += up
            total_fail2 += fail
        print(f"\n Phase C done: {total_up2} uploaded, {total_fail2} failed\n")

    # ── Assign to workspaces ──
    assign_to_workspaces(config, persona_folders, progress, batch_size, delay)


# ──────────────────────────────────────────────────────────
# STATUS
# ──────────────────────────────────────────────────────────

def show_status(config):
    print("═══ Integration Status ═══\n")

    # Storage
    du = ANYTHINGLLM_STORAGE / "direct-uploads"
    if du.is_symlink():
        print(f" ✓ direct-uploads/ → {du.resolve()} (HDD)")
    elif du.exists():
        print(" ⚠ direct-uploads/ on SSD — run --storage-setup")

    for d in ["documents", "lancedb", "vector-cache"]:
        p = ANYTHINGLLM_STORAGE / d
        if p.exists():
            try:
                sz = sum(f.stat().st_size for f in p.rglob("*") if f.is_file()) / (1024**2)
                print(f" ● {d}/ ({sz:.0f} MB)")
            except Exception:
                print(f" ● {d}/")

    api_ok = check_api(config) if config["anythingllm"]["api_key"] else False
    collector_ok = check_collector_alive()
    print(f"\n API: {'✓' if api_ok else '✗'} Collector: {'✓' if collector_ok else '✗'}")

    progress = load_progress()
    uploaded = len(progress.get("uploaded_files", {}))
    ocr_done = len(progress.get("ocr_done", []))
    assigned = len(progress.get("workspace_docs", {}))

    if uploaded or ocr_done:
        print(f"\n Uploaded: {uploaded} OCR'd: {ocr_done} Personas assigned: {assigned}")
        folders = {}
        for info in progress.get("uploaded_files", {}).values():
            f = info.get("folder", "?")
            folders[f] = folders.get(f, 0) + 1
        for f, c in sorted(folders.items(), key=lambda x: -x[1])[:20]:
            print(f" {c:4d} {f}")

    print()


# ──────────────────────────────────────────────────────────
# MAIN
# ──────────────────────────────────────────────────────────

def resolve_persona_list(args, config):
    """Resolve --persona / --cluster to a list of codenames."""
    if args.persona:
        return [args.persona]
    if args.cluster:
        cl = CLUSTERS.get(args.cluster)
        if not cl:
            print(f"Unknown cluster: {args.cluster}")
            print(f"Available: {', '.join(CLUSTERS.keys())}")
            sys.exit(1)
        return cl
    return None  # all


def main():
    parser = argparse.ArgumentParser(description="AnythingLLM × Persona Integration")
    parser.add_argument("--storage-setup", action="store_true")
    parser.add_argument("--create-workspaces", action="store_true")
    parser.add_argument("--upload-documents", action="store_true")
    parser.add_argument("--all", action="store_true", help="Run all steps")
    parser.add_argument("--status", action="store_true")
    parser.add_argument("--persona", type=str, help="Single persona filter")
    parser.add_argument("--cluster", type=str, help="Cluster filter: intel, cyber, military, humanities, engineering")
    parser.add_argument("--priority", type=int, help="Max priority (1=core)")
    parser.add_argument("--max-size", type=int, default=100, help="Max file MB (default: 100)")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--resume", action="store_true")

    args = parser.parse_args()
    config = load_config()

    if not any([args.storage_setup, args.create_workspaces,
                args.upload_documents, args.all, args.status]):
        parser.print_help()
        return

    persona_list = resolve_persona_list(args, config)

    if args.status:
        show_status(config)
        return
    if args.storage_setup or args.all:
        storage_setup(config, dry_run=args.dry_run)
    if args.create_workspaces or args.all:
        create_workspaces(config, persona_list=persona_list, dry_run=args.dry_run)
    if args.upload_documents or args.all:
        upload_documents(config, persona_list=persona_list,
                         priority_filter=args.priority,
                         dry_run=args.dry_run,
                         resume=args.resume or args.all,
                         max_size_mb=args.max_size)


if __name__ == "__main__":
    main()