Compare commits


No commits in common. "master" and "feature/add-model-profiles" have entirely different histories.

7 changed files with 236 additions and 891 deletions


@@ -1,94 +0,0 @@
# Plan: Add user model profiles to manifest.yaml
# Date: 2025-06-15
# Author: Hermes Agent
# Status: DRAFT
## Context
User has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM).
The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters.
Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.
## Hardware constraints
- GPU: RTX 3090, 24GB VRAM
- All profiles use `n_gpu_layers: 999` (offload all layers that fit)
- All profiles use `flash-attn: on`
- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
- `min_p` set to 0.0 across all profiles (community standard for these models)
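The claim that a quantized KV cache makes 64K+ context fit can be sanity-checked with a rough size estimate. The sketch below is illustrative only: the layer count, KV head count, and head dimension are hypothetical placeholders, not values from any model listed in this plan, and it assumes every layer keeps a full-context cache (models with sliding-window or hybrid attention would need less). The block sizes used for `q8_0` (34 bytes per 32 values) and `q4_0` (18 bytes per 32 values) follow the ggml quantization formats.

```python
# Bytes per cached value: f16 stores 2 bytes; q8_0 packs 32 values into
# 34 bytes and q4_0 packs 32 values into 18 bytes (ggml block formats).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type_k, cache_type_v):
    """Estimate KV cache size in GiB, assuming full attention in every layer."""
    elements_per_token = n_layers * n_kv_heads * head_dim  # for K, and again for V
    bytes_per_token = elements_per_token * (
        BYTES_PER_ELEMENT[cache_type_k] + BYTES_PER_ELEMENT[cache_type_v]
    )
    return n_ctx * bytes_per_token / 2**30


# Hypothetical architecture for illustration, NOT taken from any model card.
ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(65536, cache_type_k=cache_type, cache_type_v=cache_type, **ARCH)
    print(f"{cache_type}: {size:.3f} GiB at 64K context")
```

With these placeholder numbers, `q8_0` shrinks the cache to a little over half of the f16 size and `q4_0` to a little over a quarter, which is the headroom the 64K and 128K profiles rely on. Real figures depend on each model's actual architecture.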
## Models to add (excluding mmproj files)
### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20
| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|-------|-----------|------|-------|------------|
| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |
### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
Google official: temp 1.0 / top_p 0.95 / top_k 64
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
|---|-----------|------|------|-------|-----------|------|-------|
| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |
### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
MoE, 4B active. Same sampling as 12B family.
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|------|-------|-----------|------|-------|------------|
| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |
### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
**MTP note:** An Unsloth benchmark shows MTP is net-negative on a single 3090. An MTP profile is included anyway because the user has the file.
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
|---|-----------|------|------|-------|-----------|------|-------|-----|
| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |
### Uncensored models (apply the params of the matching base-model family; see "Based on")
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
|---|-----------|------|------|-------|-----------|------|-------|----------|
| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |
**Total: 18 profiles**
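Every profile ID above ends in a context-size suffix (`-64k`, `-128k`, `-256k`) that is meant to agree with `n_ctx`. A small check like the following could catch a mismatch before it reaches the manifest. It is a sketch under assumptions: the function name is hypothetical, profiles are assumed to be dicts with `id` and a nested `flags.n_ctx` as in `manifest.yaml`, and the third sample entry is deliberately wrong for demonstration.

```python
import re


def ctx_suffix_mismatches(profiles):
    """Return (id, n_ctx) pairs whose '-<N>k' ID suffix disagrees with n_ctx."""
    mismatches = []
    for profile in profiles:
        match = re.search(r"-(\d+)k$", profile["id"])
        if match is None:
            continue  # IDs without a size suffix are not checked
        expected = int(match.group(1)) * 1024
        n_ctx = profile["flags"]["n_ctx"]
        if n_ctx != expected:
            mismatches.append((profile["id"], n_ctx))
    return mismatches


sample = [
    {"id": "qwen36-27b-balanced-64k", "flags": {"n_ctx": 65536}},
    {"id": "gemma4-26b-ultra-long-iq4-256k", "flags": {"n_ctx": 262144}},
    # Deliberately inconsistent entry, for demonstration only.
    {"id": "gemma4-26b-extended-128k", "flags": {"n_ctx": 65536}},
]
print(ctx_suffix_mismatches(sample))
```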
## Flag mapping (manifest → llama-server CLI)
Manifest flags use snake_case keys, which the sidecar translates to llama-server CLI flags and passes as `--flag value`:
| Manifest key | CLI flag | Type | Notes |
|-------------|----------|------|-------|
| n_gpu_layers | --n-gpu-layers | int | 999 = all |
| n_ctx | --ctx-size | int | context window |
| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
| flash_attn | --flash-attn | bool | true/on |
| temp | --temp | float | sampling |
| top_p | --top-p | float | sampling |
| top_k | --top-k | int | sampling |
| repeat_penalty | --repeat-penalty | float | sampling |
| min_p | --min-p | float | 0.0 |
| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
| presence_penalty | --presence-penalty | float | 0.0 |
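The mapping in this table can be sketched as a lookup plus a small conversion step. This is not the sidecar's actual code: `FLAG_MAP` and `build_cli_args` are hypothetical names, and rendering booleans as `on`/`off` is an assumption based on the `flash-attn: on` convention used in this plan. One detail worth noting is that a YAML 1.1 parser such as PyYAML reads an unquoted `on` as the boolean `True`, so the conversion has to handle that case.

```python
# Manifest key -> llama-server CLI flag, copied from the table above.
FLAG_MAP = {
    "n_gpu_layers": "--n-gpu-layers",
    "n_ctx": "--ctx-size",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "flash_attn": "--flash-attn",
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
    "min_p": "--min-p",
    "spec_type": "--spec-type",
    "spec_draft_n_max": "--spec-draft-n-max",
    "presence_penalty": "--presence-penalty",
}


def build_cli_args(flags):
    """Translate a profile's flags dict into a flat list of CLI arguments."""
    args = []
    for key, value in flags.items():
        cli_flag = FLAG_MAP.get(key)
        if cli_flag is None:
            raise KeyError(f"unmapped manifest key: {key}")
        if isinstance(value, bool):
            # Assumption: booleans are rendered as on/off.
            value = "on" if value else "off"
        args.extend([cli_flag, str(value)])
    return args


print(build_cli_args({"n_ctx": 65536, "flash_attn": True, "temp": 0.6}))
```

Rejecting unmapped keys, rather than passing them through, is a design choice in this sketch: a typo in the manifest fails loudly instead of reaching llama-server as an unknown flag.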
## Actions
1. Create branch `feature/add-model-profiles` from master
2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
3. Update `deploy/manifest.yaml` with all 18 profiles
4. Update tests if flag structure requires it
5. Run tests, commit


@@ -12,7 +12,6 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
 Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
 Environment=SIDECAR_PORT=8080
 Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
-Environment=PYTHONUNBUFFERED=1
 # Use the sidecar's venv — install deps via deploy/README.md
 ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080


@@ -11,88 +11,141 @@
# All profiles use flash-attn: on, n-gpu-layers: 999 (offload all) # All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
# KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM # KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM
- id: qwen-3-8b
name: "Qwen 3 8B"
model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
flags:
n_ctx: 8192
n_gpu_layers: 35
- id: qwen-3-8b-long
name: "Qwen 3 8B (Long Context)"
model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
flags:
n_ctx: 32768
n_gpu_layers: 20
- id: llama-4-maverick
name: "Llama 4 Maverick"
model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
flags:
n_ctx: 8192
n_gpu_layers: 35
# --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) --- # --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
# Sampling: temp 0.6/1.0, top_p 0.95, top_k 20 # Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
- id: qwen36-27b-balanced-64k - id: qwen36-27b-balanced-64k
name: "Qwen3.6-27B Balanced 64K" name: "Qwen3.6-27B Balanced 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 0.6 temp: 0.6
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: qwen36-27b-thinking-64k - id: qwen36-27b-thinking-64k
name: "Qwen3.6-27B Thinking 64K" name: "Qwen3.6-27B Thinking 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: qwen36-27b-extended-128k - id: qwen36-27b-extended-128k
name: "Qwen3.6-27B Extended 128K" name: "Qwen3.6-27B Extended 128K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
flags: flags:
ctx-size: 131072 n_ctx: 131072
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q4_0 cache-type-k: q4_0
cache-type-v: q4_0 cache-type-v: q4_0
flash-attn: on flash-attn: on
temp: 0.6 temp: 0.6
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.05 repeat-penalty: 1.05
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
# --- Gemma 4 12B (Q6_K_XL ~8.5 GB) --- # --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
# Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official) # Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
- id: gemma4-12b-standard-q6-64k - id: gemma4-12b-standard-q6-64k
name: "Gemma4 12B Standard Q6 64K" name: "Gemma4 12B Standard Q6 64K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf" model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 64 top_k: 64
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: gemma4-12b-extended-q6-128k - id: gemma4-12b-extended-q6-128k
name: "Gemma4 12B Extended Q6 128K" name: "Gemma4 12B Extended Q6 128K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf" model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
flags: flags:
ctx-size: 131072 n_ctx: 131072
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q4_0 cache-type-k: q4_0
cache-type-v: q4_0 cache-type-v: q4_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 64 top_k: 64
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0
- id: gemma4-12b-compact-iq4-64k
name: "Gemma4 12B Compact IQ4 64K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
repeat-penalty: 1.0
min_p: 0.0
presence-penalty: 0.0
- id: gemma4-12b-compact-long-128k
name: "Gemma4 12B Compact IQ4 128K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
repeat-penalty: 1.0
min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
# --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) --- # --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
@@ -101,97 +154,48 @@
name: "Gemma4 26B Balanced 64K" name: "Gemma4 26B Balanced 64K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 64 top_k: 64
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: gemma4-26b-extended-128k - id: gemma4-26b-extended-128k
name: "Gemma4 26B Extended 128K" name: "Gemma4 26B Extended 128K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
flags: flags:
ctx-size: 131072 n_ctx: 131072
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q4_0 cache-type-k: q4_0
cache-type-v: q4_0 cache-type-v: q4_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 64 top_k: 64
repeat-penalty: 1.15 repeat-penalty: 1.15
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: gemma4-26b-ultra-long-iq4-128k - id: gemma4-26b-ultra-long-iq4-128k
name: "Gemma4 26B Ultra-Long IQ4 128K" name: "Gemma4 26B Ultra-Long IQ4 128K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf" model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
flags: flags:
ctx-size: 131072 n_ctx: 131072
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q4_0 cache-type-k: q4_0
cache-type-v: q4_0 cache-type-v: q4_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 64 top_k: 64
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-q5-64k
name: "Gemma4 26B Q5 64K"
model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
# --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
- id: gemma4-26b-compact-iq4-64k
name: "Gemma4 26B Compact IQ4 64K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-compact-long-128k
name: "Gemma4 26B Compact IQ4 128K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 1.0
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
# --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) --- # --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
@@ -201,144 +205,95 @@
name: "Qwen3.6-35B Fast 64K" name: "Qwen3.6-35B Fast 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 0.6 temp: 0.6
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: qwen36-35b-thinking-64k - id: qwen36-35b-thinking-64k
name: "Qwen3.6-35B Thinking 64K" name: "Qwen3.6-35B Thinking 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: qwen36-35b-extended-128k - id: qwen36-35b-extended-128k
name: "Qwen3.6-35B Extended 128K" name: "Qwen3.6-35B Extended 128K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
flags: flags:
ctx-size: 131072 n_ctx: 131072
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q4_0 cache-type-k: q4_0
cache-type-v: q4_0 cache-type-v: q4_0
flash-attn: on flash-attn: on
temp: 0.6 temp: 0.6
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
# --- Qwen3.6-35B-A3B MTP variant --- # --- Uncensored models (apply censored family params) ---
- id: qwen36-35b-mtp-fast-64k
name: "Qwen3.6-35B MTP Fast 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-mtp-extended-128k
name: "Qwen3.6-35B MTP Extended 128K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
flags:
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 0.6
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
# --- Uncensored models ---
- id: qwen36-35b-hauhau-aggressive-64k - id: qwen36-35b-hauhau-aggressive-64k
name: "Qwen3.6-35B HauhauCS Aggressive 64K" name: "Qwen3.6-35B HauhauCS Aggressive 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 0.6 temp: 0.6
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: qwen36-35b-genesis-apex-64k - id: qwen36-35b-genesis-apex-64k
name: "Qwen3.6-35B Genesis APEX 64K" name: "Qwen3.6-35B Genesis APEX 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf" model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 0.6 temp: 0.6
top-p: 0.95 top_p: 0.95
top-k: 20 top_k: 20
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-genesis-mtp-apex-64k
name: "Qwen3.6-35B Genesis MTP APEX 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0
- id: gemma4-26b-hauhau-balanced-64k - id: gemma4-26b-hauhau-balanced-64k
name: "Gemma4 26B HauhauCS Balanced 64K" name: "Gemma4 26B HauhauCS Balanced 64K"
model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf" model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
flags: flags:
ctx-size: 65536 n_ctx: 65536
n-gpu-layers: 999 n_gpu_layers: 999
cache-type-k: q8_0 cache-type-k: q8_0
cache-type-v: q8_0 cache-type-v: q8_0
flash-attn: on flash-attn: on
temp: 1.0 temp: 1.0
top-p: 0.95 top_p: 0.95
top-k: 64 top_k: 64
repeat-penalty: 1.0 repeat-penalty: 1.0
min-p: 0.0 min_p: 0.0
presence-penalty: 0.0 presence-penalty: 0.0


@@ -7,8 +7,8 @@ services:
     ports:
       - "9001:9000"
     environment:
-      - SIDECAR_URL=http://10.0.4.11:8080
+      - SIDECAR_URL=http://10.0.4.11:8081
-      - MAIN_PC_URL=http://10.0.4.11:8081/v1
+      - MAIN_PC_URL=http://10.0.4.11:8080/v1
       - FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
       - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
     restart: unless-stopped

437
main.py

@@ -141,49 +141,6 @@ def complete_switch():
_switching_event.set() _switching_event.set()
async def _background_switch(requested_model: str):
    """Run a model switch in the background.

    The sidecar POST is awaited but the caller gets an immediate SSE stream
    so Hermes Desktop doesn't timeout waiting for the first response.
    Called via asyncio.create_task() so it runs concurrently with the
    SSE stream being sent to the client.
    """
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            switch_resp = await client.post(
                f"{SIDECAR_URL}/models/switch",
                json={"profile_id": requested_model},
            )
            switch_result = switch_resp.json()
            if switch_result.get("status") == "ready":
                print(
                    f"SWITCH SUCCESS: profile={requested_model}",
                    flush=True,
                )
            else:
                circuit_record_failure()
                print(
                    f"SWITCH FAILED: profile={requested_model}, "
                    f"status={switch_result.get('status')}, "
                    f"message={switch_result.get('message', '(no message)')}",
                    flush=True,
                )
    except Exception as e:
        circuit_record_failure()
        print(
            f"SWITCH EXCEPTION: profile={requested_model}, "
            f"error={type(e).__name__}: {e}",
            flush=True,
        )
    finally:
        # Signal all queued requests so they can proceed (and fall
        # through to the fallback chain if the switch failed).
        complete_switch()
        drain_queue()
# ─── App ───────────────────────────────────────────────────────────────────── # ─── App ─────────────────────────────────────────────────────────────────────
@asynccontextmanager @asynccontextmanager
async def lifespan(app: FastAPI): async def lifespan(app: FastAPI):
@@ -196,12 +153,6 @@ app = FastAPI(lifespan=lifespan)
# ─── GET /v1/models — Issue #2 ────────────────────────────────────────────── # ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
@app.get("/v1")
async def v1_root():
    """OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
    return {"object": "list", "data": []}
@app.get("/v1/models") @app.get("/v1/models")
async def get_models(): async def get_models():
"""OpenAI-compatible /v1/models endpoint proxying to Sidecar.""" """OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
@@ -228,170 +179,6 @@ async def health():
return {"status": "router_online"} return {"status": "router_online"}
# ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
# These endpoints are probed by Hermes Desktop to validate/identify the
# provider before allowing model switching. Without them the desktop
# returns 503 and refuses to switch models.
@app.get("/v1/models/{model_id:path}")
async def get_single_model(model_id: str):
    """OpenAI-compatible single model query. Proxied via Sidecar model list."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
        except Exception:
            return JSONResponse(
                status_code=503,
                content={"error": "Sidecar unavailable", "data": []},
            )
    for p in profiles:
        if p.get("id") == model_id:
            return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
    return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
@app.get("/api/tags")
async def ollama_tags():
    """Ollama-compatible model list for Hermes Desktop discovery."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
        except Exception:
            return JSONResponse(content={"models": []})
    models = []
    for p in profiles:
        models.append({
            "name": p.get("id", ""),
            "model": p.get("id", ""),
            "modified_at": "2025-01-01T00:00:00Z",
            "size": 0,
            "digest": "",
            "details": {"format": "gguf", "family": p.get("name", "llm")},
        })
    return {"models": models}
@app.get("/api/show")
async def ollama_show_get(model: str = ""):
    """Ollama-compatible model info for Hermes Desktop discovery (GET variant).

    Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
    """
    return await _ollama_show_lookup(model)

@app.post("/api/show")
async def ollama_show_post(request: Request):
    """Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
    body = await request.body()
    body_data = json.loads(body) if body else {}
    model_name = body_data.get("model", "")
    return await _ollama_show_lookup(model_name)
async def _ollama_show_lookup(model_name: str):
    """Shared logic for Ollama /api/show model info lookup.

    When model_name is empty string (Hermes Desktop probe with no model field),
    returns the currently-active profile's info so the desktop can determine
    the correct context size. Previously returned 404, causing Hermes Desktop
    to default to 256k context.
    """
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = status_resp.json()
        except Exception:
            return JSONResponse(status_code=404, content={"error": "model not found"})
    # If no model specified, return the currently-active profile's info
    active_id = status.get("active_profile")
    if not model_name and active_id:
        for p in profiles:
            if p.get("id") == active_id:
                flags = p.get("flags", {})
                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
                return {
                    "modelfile": "",
                    "parameters": f"num_ctx {ctx_size}",
                    "template": "",
                    "details": {
                        "format": "gguf",
                        "family": p.get("name", "llm"),
                        "parameter_size": ctx_size,
                    },
                    "model_info": {"id": p.get("id", "")},
                }
    for p in profiles:
        if p.get("id") == model_name:
            # Extract actual context size from the profile's flags
            flags = p.get("flags", {})
            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
            return {
                "modelfile": "",
                "parameters": f"num_ctx {ctx_size}",
                "template": "",
                "details": {
                    "format": "gguf",
                    "family": p.get("name", "llm"),
                    "parameter_size": ctx_size,
                },
                "model_info": {"id": p.get("id", "")},
            }
    return JSONResponse(status_code=404, content={"error": "model not found"})
@app.get("/api/v1/models")
async def ollama_v1_models():
    """Ollama /api/v1/models redirect — return same list as /v1/models."""
    return await get_models()

@app.get("/v1/props")
async def llama_cpp_props():
    """llama.cpp discovery endpoint for Hermes Desktop."""
    async with httpx.AsyncClient(timeout=3.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = resp.json()
        except Exception:
            status = {"active_profile": None, "llama_server_running": False}
    # Report the currently-running server version / capabilities
    return {
        "props": {
            "version": 1,
            "total_slots": 1,
            "chat_endpoint": "/v1/chat/completions",
            "completion_endpoint": "/v1/completions",
            "embedding_endpoint": "/v1/embeddings",
            "rerank_endpoint": "",
            "health_endpoint": "/health",
        },
        "active_profile": status.get("active_profile"),
        "server_running": status.get("llama_server_running", False),
    }

@app.get("/props")
async def llm_props():
    """Legacy llama.cpp discovery endpoint (same as /v1/props)."""
    return await llama_cpp_props()

@app.get("/version")
async def llm_version():
    """llama.cpp version endpoint for Hermes Desktop."""
    return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
# ─── GET /models/status ────────────────────────────────────────────────────── # ─── GET /models/status ──────────────────────────────────────────────────────
@app.get("/models/status") @app.get("/models/status")
async def router_model_status(): async def router_model_status():
@@ -471,138 +258,96 @@ async def proxy(
# ── Determine target URL ────────────────────────────────────────────── # ── Determine target URL ──────────────────────────────────────────────
target_url: Optional[str] = None target_url: Optional[str] = None
error: Optional[str] = None error: Optional[str] = None
sidecar_status = None
# Always query the sidecar first (to detect recovery even when circuit is open) # Circuit breaker check
async with httpx.AsyncClient(timeout=3.0) as client: if not await circuit_breaker_check():
try:
resp = await client.get(f"{SIDECAR_URL}/models/status")
if resp.status_code == 200:
sidecar_status = resp.json()
circuit_reset()
except Exception:
pass # Handled below
if sidecar_status is None:
circuit_record_failure()
error = "sidecar_down"
elif not await circuit_breaker_check():
# Sidecar is up but circuit is open from prior switch failures
# Only block the switch — allow routing to already-active backend
error = "circuit_open" error = "circuit_open"
if sidecar_status.get("llama_server_running"):
target_url = f"{MAIN_PC_BASE}/{path}"
else: else:
# Both sidecar reachable and circuit closed — proceed normally # Query Sidecar for active model
body = await request.body() sidecar_status = None
body_data = json.loads(body) if body else {} async with httpx.AsyncClient(timeout=3.0) as client:
requested_model = body_data.get("model")
# Only trigger model switches for actual chat/completion POST requests.
# GET probes, /api/show lookups, and other non-chat endpoints should
# never trigger a switch — they just read current state.
is_chat_request = (
request.method == "POST"
and path in ("v1/chat/completions", "v1/completions")
)
if requested_model and sidecar_status.get("active_profile") == requested_model:
target_url = f"{MAIN_PC_BASE}/{path}"
elif requested_model and is_chat_request:
# All requests during a model switch get an immediate SSE streaming
# response so clients (Hermes Desktop) don't timeout while waiting
# for the model to load (10-30s). The switch runs in a background
# task; the SSE stream yields progress events, then pipes through
# the actual response once the backend model is ready.
current_switch = await wait_for_switch()
if current_switch is None:
# No switch in progress — start one in the background
await start_switch()
asyncio.create_task(_background_switch(requested_model))
# Queue this request — signals when switch completes
try: try:
wait_evt = await queue_request() resp = await client.get(f"{SIDECAR_URL}/models/status")
except HTTPException as he: if resp.status_code == 200:
raise sidecar_status = resp.json()
circuit_reset()
# Build request headers once except Exception:
req_headers = dict(request.headers) error = "sidecar_down"
req_headers.pop("host", None)
async def stream_with_sse():
sse_gen = sse_progress_stream(wait_evt)
try:
await wait_evt.wait()
async for sse_chunk in sse_gen:
yield sse_chunk
# Send actual request to Main PC
async with httpx.AsyncClient(timeout=60.0) as c:
async with c.stream(
request.method,
f"{MAIN_PC_BASE}/{path}",
content=body,
headers=req_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
except Exception:
# Main PC unreachable (switch failed or server died) —
# try fallback chain
yield _sse_format(
"error",
{"message": "Backend unreachable, trying fallback..."},
)
# Try OpenRouter
if OPENROUTER_API_KEY:
try:
fb_headers = dict(req_headers)
fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
async with httpx.AsyncClient(timeout=60.0) as c:
async with c.stream(
request.method,
f"{OPENROUTER_BASE}/{path}",
content=body,
headers=fb_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
return
except Exception:
pass
# Fallback to LXC SLM
try:
async with httpx.AsyncClient(timeout=60.0) as c:
async with c.stream(
request.method,
f"{FALLBACK_SLM_URL}/{path}",
content=body,
headers=req_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
except Exception:
yield _sse_format(
"error",
{"message": "All backends unavailable"},
)
finally:
try:
await sse_gen.aclose()
except Exception:
pass
return StreamingResponse(
stream_with_sse(),
media_type="text/event-stream",
)
if sidecar_status is None:
circuit_record_failure()
error = "sidecar_down"
else: else:
# No model in request body (probe/GET/non-chat request) — # Extract requested model from request body
# route to the currently active backend when available, body = await request.body()
# or fall through to the fallback chain. body_data = json.loads(body) if body else {}
if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"): requested_model = body_data.get("model")
if requested_model and sidecar_status.get("active_profile") == requested_model:
target_url = f"{MAIN_PC_BASE}/{path}" target_url = f"{MAIN_PC_BASE}/{path}"
else:
# Trigger switch
if requested_model:
# Check if a switch is already in progress
current_switch = await wait_for_switch()
if current_switch is not None and not current_switch.is_set():
# Another request started the switch — queue this one
try:
wait_evt = await queue_request()
except HTTPException as he:
raise
# SSE progress while waiting
async def stream_with_sse():
sse_gen = sse_progress_stream(wait_evt)
try:
await wait_evt.wait()
async for sse_chunk in sse_gen:
yield sse_chunk
complete_switch()
drain_queue()
async with httpx.AsyncClient(timeout=60.0) as c:
req_headers = dict(request.headers)
req_headers.pop("host", None)
async with c.stream(
request.method,
f"{MAIN_PC_BASE}/{path}",
content=body,
headers=req_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
finally:
# Clean up sse_gen
try:
await sse_gen.aclose()
except Exception:
pass
return StreamingResponse(
stream_with_sse(),
media_type="text/event-stream",
)
# First request triggers the switch
await start_switch() # Create event for tracking
try:
async with httpx.AsyncClient(timeout=120.0) as client:
switch_resp = await client.post(
f"{SIDECAR_URL}/models/switch",
json={"profile_id": requested_model},
)
switch_result = switch_resp.json()
if switch_result.get("status") == "ready":
complete_switch()
drain_queue()
target_url = f"{MAIN_PC_BASE}/{path}"
else:
error = "switch_failed"
except Exception as e:
circuit_record_failure()
error = f"switch_error: {str(e)}"
# ── Fallback chain ──────────────────────────────────────────────────── # ── Fallback chain ────────────────────────────────────────────────────
if target_url is None: if target_url is None:
@ -633,11 +378,8 @@ async def proxy(
request.method, target, request.method, target,
content=body, headers=headers, content=body, headers=headers,
) as resp: ) as resp:
if resp.status_code != 200:
print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
async for chunk in resp.aiter_bytes(): async for chunk in resp.aiter_bytes():
yield chunk yield chunk
return StreamingResponse(gen(), status_code=200) return StreamingResponse(gen(), status_code=200)
resp = await client.request( resp = await client.request(
@ -646,12 +388,6 @@ async def proxy(
content=body, content=body,
headers=headers, headers=headers,
) )
if resp.status_code != 200:
body_preview = resp.content[:500].decode("utf-8", errors="replace")
print(
f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
flush=True,
)
return Response( return Response(
content=resp.content, content=resp.content,
status_code=resp.status_code, status_code=resp.status_code,
@ -661,11 +397,8 @@ async def proxy(
primary_result = None primary_result = None
try: try:
primary_result = await execute(target_url) primary_result = await execute(target_url)
except Exception as e: except Exception:
print( pass # Falls through to fallback chain
f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
flush=True,
) # Falls through to fallback chain
if primary_result is not None: if primary_result is not None:
return primary_result return primary_result


@ -1,161 +0,0 @@
#!/usr/bin/env python3
"""
Sync intelligence-router model list into Hermes custom_providers.
Usage:
# One-shot: discover models from the router and update Hermes config
python3 scripts/sync_models.py
# Cron mode (auto): set up via:
# cp scripts/sync_models.py ~/.hermes/scripts/
# hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
Prints a one-line notice and exits when nothing changed. Prints a summary and
restarts the gateway when the model list differs.
"""
import json
import os
import subprocess
import sys
import urllib.error
import urllib.request
from pathlib import Path
# ── CONFIGURE THESE ──────────────────────────────────────────────────
ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
PROVIDER_NAME = "intelligence_router"
GATEWAY_SERVICE = "hermes-gateway"
# ─────────────────────────────────────────────────────────────────────
MODELS_URL = f"{ROUTER_BASE_URL}/models"
CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
def fetch_models() -> list[str] | None:
try:
req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
with urllib.request.urlopen(req, timeout=10) as resp:
data = json.loads(resp.read().decode())
models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
return models if models else None
except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
return None
def read_current_models() -> list[str]:
"""Parse current custom_providers entries for our provider name."""
if not CONFIG_PATH.exists():
return []
models = []
with open(CONFIG_PATH) as f:
content = f.read()
idx = content.find("custom_providers:")
if idx == -1:
return []
section = content[idx:]
lines = section.split("\n")
current_entry = {}
for line in lines:
s = line.strip()
if s.startswith("- base_url:"):
if current_entry.get("name") == PROVIDER_NAME:
m = current_entry.get("model", "")
if m:
models.append(m)
current_entry = {}
elif s.startswith("model:"):
current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
elif s.startswith("name:"):
current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
elif s and not s.startswith(("-", " ")):
break
# Don't forget the last entry
if current_entry.get("name") == PROVIDER_NAME:
m = current_entry.get("model", "")
if m:
models.append(m)
return sorted(models)
def generate_block(models: list[str]) -> str:
lines = ["custom_providers:"]
for m in models:
lines.append(f"- base_url: {ROUTER_BASE_URL}")
lines.append(f" model: {m}")
lines.append(f" name: {PROVIDER_NAME}")
return "\n".join(lines)
def replace_section(models: list[str]) -> bool:
"""Replace the custom_providers section in-place. Returns True if changed."""
if not CONFIG_PATH.exists():
return False
import yaml
content = CONFIG_PATH.read_text()
config = yaml.safe_load(content)
new_entries = [
{"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
for m in models
]
if config.get("custom_providers") == new_entries:
return False
config["custom_providers"] = new_entries
CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
return True
def restart_gateway() -> bool:
try:
r = subprocess.run(
["systemctl", "--user", "restart", GATEWAY_SERVICE],
capture_output=True, text=True, timeout=30,
)
return r.returncode == 0
except Exception:
return False
def main():
models = fetch_models()
if models is None:
sys.exit(1)
current = read_current_models()
if current == models:
print("Model list unchanged — nothing to do.")
return
added = set(models) - set(current)
removed = set(current) - set(models)
print(f"Model list changed! {len(current)} -> {len(models)} models")
if added:
print(f" Added: {sorted(added)}")
if removed:
print(f" Removed: {sorted(removed)}")
if not replace_section(models):
print("ERROR: Config update failed", file=sys.stderr)
sys.exit(1)
print("Config updated. Restarting gateway...")
if restart_gateway():
print("Gateway restarted successfully.")
else:
print("WARNING: Gateway restart failed — restart manually.")
if __name__ == "__main__":
main()


@ -5,6 +5,7 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
import os import os
import asyncio import asyncio
import signal as signal_module import signal as signal_module
import threading
from contextlib import asynccontextmanager from contextlib import asynccontextmanager
from typing import Optional from typing import Optional
@ -17,98 +18,41 @@ from sidecar.manifest import load_manifest
# Configuration from environment # Configuration from environment
MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml") MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080")) SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
LLAMA_SERVER_PORT = 8081 LLAMA_SERVER_PORT = 8080
LLAMA_STDERR_LOG = os.path.join(
os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
)
# Global state # Global state
_llama_server_process: Optional[asyncio.subprocess.Process] = None _llama_server_process: Optional[asyncio.subprocess.Process] = None
_active_profile: Optional[str] = None _active_profile: Optional[str] = None
_switch_lock = asyncio.Lock() # Use asyncio.Lock to avoid blocking the event loop _switch_lock = threading.Lock() # Use threading.Lock for compatibility with TestClient
@asynccontextmanager @asynccontextmanager
async def lifespan(app: FastAPI): async def lifespan(app: FastAPI):
"""Manage sidecar lifecycle — no default model loaded.""" """Manage sidecar lifecycle — no default model loaded."""
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True) print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
yield yield
# Cleanup: kill llama-server if running # Cleanup: kill llama-server if running
global _llama_server_process global _llama_server_process
if _llama_server_process: if _llama_server_process:
await _kill_llama_server() _kill_llama_server()
app = FastAPI(lifespan=lifespan) app = FastAPI(lifespan=lifespan)
def _close_stderr_log(): def _kill_llama_server():
"""Close the stderr log file handle if it's still attached to the process.""" """Kill the llama-server subprocess."""
global _llama_server_process global _llama_server_process
if _llama_server_process is not None: if _llama_server_process and _llama_server_process.returncode is None:
fh = getattr(_llama_server_process, "_stderr_fh", None)
if fh is not None and not fh.closed:
try:
fh.close()
except Exception:
pass
async def _kill_llama_server():
"""Kill the llama-server subprocess and wait for it to fully terminate.
This MUST be async because process.wait() is a coroutine. The synchronous
version called .wait() without await, which only created an unawaited
coroutine object. The old process was never actually waited on, so it could
still hold GPU VRAM when the new server started.
"""
global _llama_server_process
if _llama_server_process is None or _llama_server_process.returncode is not None:
_close_stderr_log()
return
try:
_llama_server_process.send_signal(signal_module.SIGTERM)
try: try:
await asyncio.wait_for(_llama_server_process.wait(), timeout=10) _llama_server_process.send_signal(signal_module.SIGTERM)
except asyncio.TimeoutError:
_llama_server_process.kill()
try: try:
await asyncio.wait_for(_llama_server_process.wait(), timeout=5) _llama_server_process.wait(timeout=5)
except asyncio.TimeoutError: except asyncio.TimeoutError:
pass _llama_server_process.kill()
except Exception: except Exception:
pass pass
finally:
_llama_server_process = None _llama_server_process = None
_close_stderr_log()
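Review note: the SIGTERM-then-kill escalation described in the docstring above can be exercised in isolation. The sketch below is an assumption-laden stand-in, not the sidecar's code. `terminate` is a hypothetical helper, and the child process is a sleeping Python interpreter instead of llama-server.

```python
import asyncio
import signal
import sys


async def terminate(proc: asyncio.subprocess.Process, grace: float = 10.0) -> int:
    """Send SIGTERM, await exit for up to `grace` seconds, then escalate to kill()."""
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do
    proc.send_signal(signal.SIGTERM)
    try:
        # proc.wait() is a coroutine: without `await` nothing is actually waited on.
        await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
    return proc.returncode


async def _demo() -> int:
    # Hypothetical stand-in child: a Python process that sleeps for a minute.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await terminate(proc, grace=5.0)


def demo() -> int:
    return asyncio.run(_demo())
```

Awaiting the exit before starting the next server is the point of the pattern: the new process should only be spawned once the old one has released its resources.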
def _flag_value(value) -> str:
"""Convert a manifest flag value to a llama-server CLI argument string.
YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
safe_load. llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
"""
if isinstance(value, bool):
return "on" if value else "off"
return str(value)
def _flag_key(key: str) -> str:
"""Convert a manifest flag key to the correct llama-server CLI flag name.
llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
but YAML keys often use underscores. Some flags were also renamed
across llama.cpp versions (e.g. --n-ctx -> --ctx-size).
This function normalises underscores to hyphens and applies known renames.
"""
normalized = key.replace("_", "-")
FLAG_RENAMES = {
"n-ctx": "ctx-size",
}
return FLAG_RENAMES.get(normalized, normalized)
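Review note: a standalone sketch of how the two helpers above turn manifest flags into CLI arguments. The logic is copied from `_flag_key` and `_flag_value`. `build_args` and the sample flag dictionary are hypothetical and only illustrate the mapping.

```python
from typing import Dict, List


def flag_key(key: str) -> str:
    """Underscores become hyphens, then known renames are applied."""
    normalized = key.replace("_", "-")
    renames = {"n-ctx": "ctx-size"}
    return renames.get(normalized, normalized)


def flag_value(value) -> str:
    """Booleans become 'on'/'off'; everything else is stringified."""
    if isinstance(value, bool):
        return "on" if value else "off"
    return str(value)


def build_args(flags: Dict[str, object]) -> List[str]:
    """Flatten a flags mapping into ['--key', 'value', ...] in insertion order."""
    args: List[str] = []
    for key, value in flags.items():
        args += ["--" + flag_key(key), flag_value(value)]
    return args


# Sample manifest flags (illustrative values only).
example = build_args({"n_ctx": 65536, "flash_attn": True, "n_gpu_layers": 999})
```

With the plain `"--" + key, str(value)` form on the right-hand side of this diff, the same input would yield `--n_ctx 65536 --flash_attn True --n_gpu_layers 999`, which is the mismatch the helpers were written to avoid.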
async def _start_llama_server(profile: dict): async def _start_llama_server(profile: dict):
@ -116,39 +60,29 @@ async def _start_llama_server(profile: dict):
global _llama_server_process global _llama_server_process
# Kill any existing process # Kill any existing process
await _kill_llama_server() _kill_llama_server()
# Build command from profile flags # Build command from profile flags
cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"] cmd = ["llama-server"]
cmd += ["--model", profile["model_path"]] cmd += ["--model", profile["model_path"]]
cmd += ["--port", str(LLAMA_SERVER_PORT)] cmd += ["--port", str(LLAMA_SERVER_PORT)]
cmd += ["--host", "0.0.0.0"]
for key, value in profile.get("flags", {}).items(): for key, value in profile.get("flags", {}).items():
cmd += ["--" + _flag_key(key), _flag_value(value)] cmd += ["--" + key, str(value)]
print(f"Starting llama-server: {' '.join(cmd)}", flush=True) print(f"Starting llama-server: {' '.join(cmd)}")
# Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
stderr_fh = open(LLAMA_STDERR_LOG, "w")
_llama_server_process = await asyncio.create_subprocess_exec( _llama_server_process = await asyncio.create_subprocess_exec(
*cmd, *cmd,
stdout=asyncio.subprocess.DEVNULL, stdout=asyncio.subprocess.DEVNULL,
stderr=stderr_fh, stderr=asyncio.subprocess.DEVNULL,
) )
# Keep a reference so we can close the handle later
_llama_server_process._stderr_fh = stderr_fh # type: ignore[attr-defined]
return _llama_server_process return _llama_server_process
async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5): async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
"""Poll llama-server readiness via /v1/models endpoint. """Poll llama-server readiness via /v1/models endpoint."""
Returns True on success. On failure, dumps the captured stderr (if any)
so the user can see why llama-server crashed.
"""
import httpx import httpx
for attempt in range(max_retries): for _ in range(max_retries):
try: try:
async with httpx.AsyncClient(timeout=2.0) as client: async with httpx.AsyncClient(timeout=2.0) as client:
resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models") resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
@ -157,27 +91,6 @@ async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5)
except Exception: except Exception:
pass pass
await asyncio.sleep(interval) await asyncio.sleep(interval)
# Flush and close the stderr handle so all data is on disk before we read
_close_stderr_log()
# ── Dump stderr for diagnosis ──────────────────────────────────────
print("llama-server did NOT become ready — dumping stderr:", flush=True)
try:
with open(LLAMA_STDERR_LOG) as f:
for line in f:
print(f" {line.rstrip()}", flush=True)
except FileNotFoundError:
print(" (stderr log not found — process may not have started)", flush=True)
# Also log exit code if the process died
global _llama_server_process
if _llama_server_process and _llama_server_process.returncode is not None:
print(
f"llama-server exited with code {_llama_server_process.returncode}",
flush=True,
)
return False return False
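Review note: the readiness loop above reduces to "probe, sleep, repeat, give up after N tries". The sketch below shows that shape with an injectable probe so it can run without a server. `poll_until_ready` and `flaky` are hypothetical names, and the real function probes `/v1/models` over HTTP instead.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def poll_until_ready(
    probe: Callable[[], Awaitable[bool]],
    max_retries: int = 60,
    interval: float = 0.5,
) -> bool:
    """Call `probe` up to `max_retries` times; True as soon as it reports ready."""
    for _ in range(max_retries):
        try:
            if await probe():
                return True
        except Exception:
            pass  # server not listening yet; retry after the sleep
        await asyncio.sleep(interval)
    return False


def demo() -> Tuple[bool, int]:
    calls = {"n": 0}

    async def flaky() -> bool:
        # Fails twice, then reports ready (simulates a slow model load).
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("not listening yet")
        return True

    ok = asyncio.run(poll_until_ready(flaky, max_retries=5, interval=0.01))
    return ok, calls["n"]
```

The two sides of this hunk differ only in budget: 60 retries at 0.5 s sleeps for 30 s in total, while 240 retries sleeps for 120 s, in both cases plus the time spent in the probes themselves.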
@ -211,7 +124,7 @@ async def switch_model(payload: SwitchRequest):
"""Stop current llama-server, start new one with the given profile, wait for readiness.""" """Stop current llama-server, start new one with the given profile, wait for readiness."""
global _active_profile global _active_profile
async with _switch_lock: with _switch_lock:
# Validate profile_id # Validate profile_id
profiles = load_manifest(MANIFEST_PATH) profiles = load_manifest(MANIFEST_PATH)
if profiles is None: if profiles is None:
@ -240,7 +153,7 @@ async def switch_model(payload: SwitchRequest):
} }
# Start the new model # Start the new model
await _kill_llama_server() _kill_llama_server()
_active_profile = None _active_profile = None
await _start_llama_server(profile) await _start_llama_server(profile)