fix: first request no longer blocks on model switch — uses background task + SSE

The FIRST request that triggers a model switch was blocking the HTTP response for 10-30s while waiting for the sidecar to load the model. Hermes Desktop's client timed out during this wait, causing 'nothing happens' on new session.

Fix: refactored the proxy handler so ALL requests during a model switch use the same SSE streaming pattern (immediate 200, progress events, then actual response piped through after switch completes). The switch now runs as a background asyncio task via create_task().

- Added _background_switch(): runs POST /models/switch in background task with complete_switch() + drain_queue() in finally block
- All switch-triggering requests go through queue_request() + StreamingResponse
- SSE generator now falls through to OpenRouter/LXC if Main PC unreachable (switch failure case) instead of hanging indefinitely

Sidecar fixes from previous commit:

- _kill_llama_server() is now async with proper await on process termination
- _switch_lock changed from threading.Lock to asyncio.Lock()
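The pattern described in this message (start the switch as a background task, answer immediately, stream progress until the switch signals completion) can be sketched with the standard library alone. This is a minimal illustration, not the code in main.py: `fake_switch`, `stream_during_switch`, `sse_format`, and the event names are hypothetical stand-ins, and the sidecar POST is replaced by a sleep.

```python
import asyncio
import json


def sse_format(event: str, data: dict) -> str:
    """Encode one Server-Sent Events frame (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_switch(done: asyncio.Event, delay: float) -> None:
    """Stand-in for the sidecar POST; waiters are always released in finally."""
    try:
        await asyncio.sleep(delay)  # pretend the model is loading
    finally:
        done.set()


async def stream_during_switch(done: asyncio.Event, poll: float = 0.01):
    """Yield progress frames until the switch finishes, then a final frame."""
    yield sse_format("progress", {"status": "switching"})
    while not done.is_set():
        try:
            await asyncio.wait_for(done.wait(), timeout=poll)
        except asyncio.TimeoutError:
            # Switch still running: emit a frame so the client sees activity.
            yield sse_format("progress", {"status": "loading"})
    yield sse_format("ready", {"status": "ready"})


async def demo() -> list[str]:
    done = asyncio.Event()
    # Keep a reference to the task so it cannot be garbage-collected mid-run.
    task = asyncio.create_task(fake_switch(done, delay=0.05))
    frames = [frame async for frame in stream_during_switch(done)]
    await task
    return frames
```

Running `asyncio.run(demo())` returns the frames a client would receive: an initial progress frame, periodic loading frames, and a final ready frame. In the real handler the generator would be wrapped in a StreamingResponse and followed by the proxied backend response.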
fix: sidecar process kill was not awaiting wait() — old server held GPU VRAM
2026-06-18 00:10:48 +00:00 · 2026-06-17 23:49:57 +00:00 · 2026-06-16 22:09:16 +00:00 · 2026-06-16 21:46:07 +00:00 · 2026-06-16 21:32:36 +00:00 · 2026-06-16 21:25:42 +00:00
7 changed files with 894 additions and 239 deletions
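The manifest.yaml part of this change renames flag keys from the snake_case form (n_ctx, top_p, min_p) to the hyphenated form (ctx-size, top-p, min-p), and the plan file states that the sidecar passes flags as `--key value`. A minimal sketch of why the hyphenated spelling matters, assuming the sidecar does a direct pass-through; `flags_to_argv` is a hypothetical name, and the boolean handling assumes a YAML 1.1 loader that reads a bare `on` as `True`:

```python
def flags_to_argv(flags: dict) -> list[str]:
    """Turn a manifest `flags` mapping into `--key value` arguments.

    Hypothetical sketch: keys are passed through unchanged, so they must
    already match the CLI spelling (ctx-size, not n_ctx).
    """
    argv: list[str] = []
    for key, value in flags.items():
        if isinstance(value, bool):
            # Assumption: the loader parsed `flash-attn: on` as True;
            # map it back to the on/off string written in the manifest.
            value = "on" if value else "off"
        argv += [f"--{key}", str(value)]
    return argv


example = flags_to_argv({"ctx-size": 65536, "flash-attn": True, "temp": 0.6})
# example == ["--ctx-size", "65536", "--flash-attn", "on", "--temp", "0.6"]
```

With a pass-through like this, a key such as `n_ctx` would be emitted as `--n_ctx`, which is the kind of mismatch the key rename avoids.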
--- a/.hermes/plans/add-model-profiles.md
+++ b/.hermes/plans/add-model-profiles.md
@@ -0,0 +1,94 @@
+# Plan: Add user model profiles to manifest.yaml
+# Date: 2025-06-15
+# Author: Hermes Agent
+# Status: DRAFT
+
+## Context
+User has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM).
+The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters.
+Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.
+
+## Hardware constraints
+- GPU: RTX 3090, 24GB VRAM
+- All profiles use `n_gpu_layers: 999` (offload all layers that fit)
+- All profiles use `flash-attn: on`
+- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
+- `min_p` set to 0.0 across all profiles (community standard for these models)
+
+## Models to add (excluding mmproj files)
+
+### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
+Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20
+
+| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
+|---|-----------|------|-------|-----------|------|-------|------------|
+| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
+| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
+| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |
+
+### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
+Google official: temp 1.0 / top_p 0.95 / top_k 64
+
+| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
+|---|-----------|------|------|-------|-----------|------|-------|
+| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
+| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
+| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
+| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |
+
+### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
+MoE, 4B active. Same sampling as 12B family.
+
+| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
+|---|-----------|------|------|-------|-----------|------|-------|------------|
+| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
+| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
+| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |
+
+### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
+**MTP note:** Unsloth benchmark shows MTP is net-negative on single 3090. Including MTP profile anyway since user has the file.
+
+| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
+|---|-----------|------|------|-------|-----------|------|-------|-----|
+| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
+| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
+| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
+| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |
+
+### Uncensored models (apply censored family params)
+
+| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
+|---|-----------|------|------|-------|-----------|------|-------|----------|
+| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
+| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
+| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
+| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |
+
+**Total: 18 profiles**
+
+## Flag mapping (manifest → llama-server CLI)
+
+Manifest flags use snake_case keys that the sidecar passes as `--key value` to llama-server:
+
+| Manifest key | CLI flag | Type | Notes |
+|-------------|----------|------|-------|
+| n_gpu_layers | --n-gpu-layers | int | 999 = all |
+| n_ctx | --ctx-size | int | context window |
+| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
+| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
+| flash_attn | --flash-attn | bool | true/on |
+| temp | --temp | float | sampling |
+| top_p | --top-p | float | sampling |
+| top_k | --top-k | int | sampling |
+| repeat_penalty | --repeat-penalty | float | sampling |
+| min_p | --min-p | float | 0.0 |
+| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
+| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
+| presence_penalty | --presence-penalty | float | 0.0 |
+
+## Actions
+1. Create branch `feature/add-model-profiles` from master
+2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
+3. Update `deploy/manifest.yaml` with all 18 profiles
+4. Update tests if flag structure requires it
+5. Run tests, commit
--- a/deploy/llm-sidecar.service
+++ b/deploy/llm-sidecar.service
@@ -12,6 +12,7 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
 Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
 Environment=SIDECAR_PORT=8080
 Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
+Environment=PYTHONUNBUFFERED=1

 # Use the sidecar's venv — install deps via deploy/README.md
 ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080
--- a/deploy/manifest.yaml
+++ b/deploy/manifest.yaml
@@ -11,141 +11,88 @@
 # All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
 # KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM

- id: qwen-3-8b
-  name: "Qwen 3 8B"
-  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
-  flags:
-    n_ctx: 8192
-    n_gpu_layers: 35
-
- id: qwen-3-8b-long
-  name: "Qwen 3 8B (Long Context)"
-  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
-  flags:
-    n_ctx: 32768
-    n_gpu_layers: 20
-
- id: llama-4-maverick
-  name: "Llama 4 Maverick"
-  model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
-  flags:
-    n_ctx: 8192
-    n_gpu_layers: 35
-
 # --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
 # Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
 - id: qwen36-27b-balanced-64k
  name: "Qwen3.6-27B Balanced 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: qwen36-27b-thinking-64k
  name: "Qwen3.6-27B Thinking 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: qwen36-27b-extended-128k
  name: "Qwen3.6-27B Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    n_ctx: 131072
-    n_gpu_layers: 999
+    ctx-size: 131072
+    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.05
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

-# --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
+# --- Gemma 4 12B (Q6_K_XL ~8.5 GB) ---
 # Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
 - id: gemma4-12b-standard-q6-64k
  name: "Gemma4 12B Standard Q6 64K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 64
+    top-p: 0.95
+    top-k: 64
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: gemma4-12b-extended-q6-128k
  name: "Gemma4 12B Extended Q6 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
  flags:
-    n_ctx: 131072
-    n_gpu_layers: 999
+    ctx-size: 131072
+    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 64
+    top-p: 0.95
+    top-k: 64
    repeat-penalty: 1.0
-    min_p: 0.0
-    presence-penalty: 0.0
-
- id: gemma4-12b-compact-iq4-64k
-  name: "Gemma4 12B Compact IQ4 64K"
-  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
-  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
-    cache-type-k: q8_0
-    cache-type-v: q8_0
-    flash-attn: on
-    temp: 1.0
-    top_p: 0.95
-    top_k: 64
-    repeat-penalty: 1.0
-    min_p: 0.0
-    presence-penalty: 0.0
-
- id: gemma4-12b-compact-long-128k
-  name: "Gemma4 12B Compact IQ4 128K"
-  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
-  flags:
-    n_ctx: 131072
-    n_gpu_layers: 999
-    cache-type-k: q8_0
-    cache-type-v: q8_0
-    flash-attn: on
-    temp: 1.0
-    top_p: 0.95
-    top_k: 64
-    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 # --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
@@ -154,48 +101,97 @@
  name: "Gemma4 26B Balanced 64K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 64
+    top-p: 0.95
+    top-k: 64
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: gemma4-26b-extended-128k
  name: "Gemma4 26B Extended 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
  flags:
-    n_ctx: 131072
-    n_gpu_layers: 999
+    ctx-size: 131072
+    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 64
+    top-p: 0.95
+    top-k: 64
    repeat-penalty: 1.15
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: gemma4-26b-ultra-long-iq4-128k
  name: "Gemma4 26B Ultra-Long IQ4 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
  flags:
-    n_ctx: 131072
-    n_gpu_layers: 999
+    ctx-size: 131072
+    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 64
+    top-p: 0.95
+    top-k: 64
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
+    presence-penalty: 0.0
+
+- id: gemma4-26b-q5-64k
+  name: "Gemma4 26B Q5 64K"
+  model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
+  flags:
+    ctx-size: 65536
+    n-gpu-layers: 999
+    cache-type-k: q8_0
+    cache-type-v: q8_0
+    flash-attn: on
+    temp: 1.0
+    top-p: 0.95
+    top-k: 64
+    repeat-penalty: 1.0
+    min-p: 0.0
+    presence-penalty: 0.0
+
+# --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
+- id: gemma4-26b-compact-iq4-64k
+  name: "Gemma4 26B Compact IQ4 64K"
+  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
+  flags:
+    ctx-size: 65536
+    n-gpu-layers: 999
+    cache-type-k: q8_0
+    cache-type-v: q8_0
+    flash-attn: on
+    temp: 1.0
+    top-p: 0.95
+    top-k: 64
+    repeat-penalty: 1.0
+    min-p: 0.0
+    presence-penalty: 0.0
+
+- id: gemma4-26b-compact-long-128k
+  name: "Gemma4 26B Compact IQ4 128K"
+  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
+  flags:
+    ctx-size: 131072
+    n-gpu-layers: 999
+    cache-type-k: q4_0
+    cache-type-v: q4_0
+    flash-attn: on
+    temp: 1.0
+    top-p: 0.95
+    top-k: 64
+    repeat-penalty: 1.0
+    min-p: 0.0
    presence-penalty: 0.0

 # --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
@@ -205,95 +201,144 @@
  name: "Qwen3.6-35B Fast 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: qwen36-35b-thinking-64k
  name: "Qwen3.6-35B Thinking 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: qwen36-35b-extended-128k
  name: "Qwen3.6-35B Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    n_ctx: 131072
-    n_gpu_layers: 999
+    ctx-size: 131072
+    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
    presence-penalty: 0.0

-# --- Uncensored models (apply censored family params) ---
- id: qwen36-35b-hauhau-aggressive-64k
-  name: "Qwen3.6-35B HauhauCS Aggressive 64K"
-  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
+# --- Qwen3.6-35B-A3B MTP variant ---
+- id: qwen36-35b-mtp-fast-64k
+  name: "Qwen3.6-35B MTP Fast 64K"
+  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
+    presence-penalty: 0.0
+
+- id: qwen36-35b-mtp-extended-128k
+  name: "Qwen3.6-35B MTP Extended 128K"
+  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
+  flags:
+    ctx-size: 131072
+    n-gpu-layers: 999
+    cache-type-k: q4_0
+    cache-type-v: q4_0
+    flash-attn: on
+    temp: 0.6
+    top-p: 0.95
+    top-k: 20
+    repeat-penalty: 1.0
+    min-p: 0.0
+    presence-penalty: 0.0
+
+# --- Uncensored models ---
+- id: qwen36-35b-hauhau-aggressive-64k
+  name: "Qwen3.6-35B HauhauCS Aggressive 64K"
+  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
+  flags:
+    ctx-size: 65536
+    n-gpu-layers: 999
+    cache-type-k: q8_0
+    cache-type-v: q8_0
+    flash-attn: on
+    temp: 0.6
+    top-p: 0.95
+    top-k: 20
+    repeat-penalty: 1.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: qwen36-35b-genesis-apex-64k
  name: "Qwen3.6-35B Genesis APEX 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top_p: 0.95
-    top_k: 20
+    top-p: 0.95
+    top-k: 20
    repeat-penalty: 1.0
-    min_p: 0.0
+    min-p: 0.0
+    presence-penalty: 0.0
+
+- id: qwen36-35b-genesis-mtp-apex-64k
+  name: "Qwen3.6-35B Genesis MTP APEX 64K"
+  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
+  flags:
+    ctx-size: 65536
+    n-gpu-layers: 999
+    cache-type-k: q8_0
+    cache-type-v: q8_0
+    flash-attn: on
+    temp: 0.6
+    top-p: 0.95
+    top-k: 20
+    repeat-penalty: 1.0
+    min-p: 0.0
    presence-penalty: 0.0

 - id: gemma4-26b-hauhau-balanced-64k
  name: "Gemma4 26B HauhauCS Balanced 64K"
  model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
  flags:
-    n_ctx: 65536
-    n_gpu_layers: 999
+    ctx-size: 65536
+    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top_p: 0.95
-    top_k: 64
+    top-p: 0.95
+    top-k: 64
    repeat-penalty: 1.0
-    min_p: 0.0
-    presence-penalty: 0.0
+    min-p: 0.0
+    presence-penalty: 0.0
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -7,8 +7,8 @@ services:
    ports:
      - "9001:9000"
    environment:
-      - SIDECAR_URL=http://10.0.4.11:8081
-      - MAIN_PC_URL=http://10.0.4.11:8080/v1
+      - SIDECAR_URL=http://10.0.4.11:8080
+      - MAIN_PC_URL=http://10.0.4.11:8081/v1
      - FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
    restart: unless-stopped
--- a/main.py
+++ b/main.py
@@ -141,6 +141,49 @@ def complete_switch():
            _switching_event.set()


+async def _background_switch(requested_model: str):
+    """Run a model switch in the background.
+
+    The sidecar POST is awaited but the caller gets an immediate SSE stream
+    so Hermes Desktop doesn't timeout waiting for the first response.
+
+    Called via asyncio.create_task() so it runs concurrently with the
+    SSE stream being sent to the client.
+    """
+    try:
+        async with httpx.AsyncClient(timeout=120.0) as client:
+            switch_resp = await client.post(
+                f"{SIDECAR_URL}/models/switch",
+                json={"profile_id": requested_model},
+            )
+            switch_result = switch_resp.json()
+            if switch_result.get("status") == "ready":
+                print(
+                    f"SWITCH SUCCESS: profile={requested_model}",
+                    flush=True,
+                )
+            else:
+                circuit_record_failure()
+                print(
+                    f"SWITCH FAILED: profile={requested_model}, "
+                    f"status={switch_result.get('status')}, "
+                    f"message={switch_result.get('message', '(no message)')}",
+                    flush=True,
+                )
+    except Exception as e:
+        circuit_record_failure()
+        print(
+            f"SWITCH EXCEPTION: profile={requested_model}, "
+            f"error={type(e).__name__}: {e}",
+            flush=True,
+        )
+    finally:
+        # Signal all queued requests so they can proceed (and fall
+        # through to the fallback chain if the switch failed).
+        complete_switch()
+        drain_queue()
+
+
 # ─── App ─────────────────────────────────────────────────────────────────────
 @asynccontextmanager
 async def lifespan(app: FastAPI):
@@ -153,6 +196,12 @@ app = FastAPI(lifespan=lifespan)


 # ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
+@app.get("/v1")
+async def v1_root():
+    """OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
+    return {"object": "list", "data": []}
+
+
 @app.get("/v1/models")
 async def get_models():
    """OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
@@ -179,6 +228,170 @@ async def health():
    return {"status": "router_online"}


+# ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
+# These endpoints are probed by Hermes Desktop to validate/identify the
+# provider before allowing model switching.  Without them the desktop
+# returns 503 and refuses to switch models.
+
+@app.get("/v1/models/{model_id:path}")
+async def get_single_model(model_id: str):
+    """OpenAI-compatible single model query.  Proxied via Sidecar model list."""
+    async with httpx.AsyncClient(timeout=5.0) as client:
+        try:
+            resp = await client.get(f"{SIDECAR_URL}/models/available")
+            profiles = resp.json()
+        except Exception:
+            return JSONResponse(
+                status_code=503,
+                content={"error": "Sidecar unavailable", "data": []},
+            )
+
+    for p in profiles:
+        if p.get("id") == model_id:
+            return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
+    return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
+
+
+@app.get("/api/tags")
+async def ollama_tags():
+    """Ollama-compatible model list for Hermes Desktop discovery."""
+    async with httpx.AsyncClient(timeout=5.0) as client:
+        try:
+            resp = await client.get(f"{SIDECAR_URL}/models/available")
+            profiles = resp.json()
+        except Exception:
+            return JSONResponse(content={"models": []})
+
+    models = []
+    for p in profiles:
+        models.append({
+            "name": p.get("id", ""),
+            "model": p.get("id", ""),
+            "modified_at": "2025-01-01T00:00:00Z",
+            "size": 0,
+            "digest": "",
+            "details": {"format": "gguf", "family": p.get("name", "llm")},
+        })
+    return {"models": models}
+
+
+@app.get("/api/show")
+async def ollama_show_get(model: str = ""):
+    """Ollama-compatible model info for Hermes Desktop discovery (GET variant).
+
+    Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
+    """
+    return await _ollama_show_lookup(model)
+
+
+@app.post("/api/show")
+async def ollama_show_post(request: Request):
+    """Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
+    body = await request.body()
+    body_data = json.loads(body) if body else {}
+    model_name = body_data.get("model", "")
+    return await _ollama_show_lookup(model_name)
+
+
+async def _ollama_show_lookup(model_name: str):
+    """Shared logic for Ollama /api/show model info lookup.
+
+    When model_name is empty string (Hermes Desktop probe with no model field),
+    returns the currently-active profile's info so the desktop can determine
+    the correct context size. Previously returned 404, causing Hermes Desktop
+    to default to 256k context.
+    """
+    async with httpx.AsyncClient(timeout=5.0) as client:
+        try:
+            resp = await client.get(f"{SIDECAR_URL}/models/available")
+            profiles = resp.json()
+            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
+            status = status_resp.json()
+        except Exception:
+            return JSONResponse(status_code=404, content={"error": "model not found"})
+
+    # If no model specified, return the currently-active profile's info
+    active_id = status.get("active_profile")
+    if not model_name and active_id:
+        for p in profiles:
+            if p.get("id") == active_id:
+                flags = p.get("flags", {})
+                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
+                return {
+                    "modelfile": "",
+                    "parameters": f"num_ctx {ctx_size}",
+                    "template": "",
+                    "details": {
+                        "format": "gguf",
+                        "family": p.get("name", "llm"),
+                        "parameter_size": ctx_size,
+                    },
+                    "model_info": {"id": p.get("id", "")},
+                }
+
+    for p in profiles:
+        if p.get("id") == model_name:
+            # Extract actual context size from the profile's flags
+            flags = p.get("flags", {})
+            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
+            return {
+                "modelfile": "",
+                "parameters": f"num_ctx {ctx_size}",
+                "template": "",
+                "details": {
+                    "format": "gguf",
+                    "family": p.get("name", "llm"),
+                    "parameter_size": ctx_size,
+                },
+                "model_info": {"id": p.get("id", "")},
+            }
+    return JSONResponse(status_code=404, content={"error": "model not found"})
+
+
+@app.get("/api/v1/models")
+async def ollama_v1_models():
+    """Ollama /api/v1/models redirect — return same list as /v1/models."""
+    return await get_models()
+
+
+@app.get("/v1/props")
+async def llama_cpp_props():
+    """llama.cpp discovery endpoint for Hermes Desktop."""
+    async with httpx.AsyncClient(timeout=3.0) as client:
+        try:
+            resp = await client.get(f"{SIDECAR_URL}/models/status")
+            status = resp.json()
+        except Exception:
+            status = {"active_profile": None, "llama_server_running": False}
+
+    # Report the currently-running server version / capabilities
+    return {
+        "props": {
+            "version": 1,
+            "total_slots": 1,
+            "chat_endpoint": "/v1/chat/completions",
+            "completion_endpoint": "/v1/completions",
+            "embedding_endpoint": "/v1/embeddings",
+            "rerank_endpoint": "",
+            "health_endpoint": "/health",
+        },
+        "active_profile": status.get("active_profile"),
+        "server_running": status.get("llama_server_running", False),
+    }
+
+
+@app.get("/props")
+async def llm_props():
+    """Legacy llama.cpp discovery endpoint (same as /v1/props)."""
+    return await llama_cpp_props()
+
+
+@app.get("/version")
+async def llm_version():
+    """llama.cpp version endpoint for Hermes Desktop."""
+    return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
+
+
 # ─── GET /models/status ──────────────────────────────────────────────────────
 @app.get("/models/status")
 async def router_model_status():
@@ -258,96 +471,138 @@ async def proxy(
    # ── Determine target URL ──────────────────────────────────────────────
    target_url: Optional[str] = None
    error: Optional[str] = None
+    sidecar_status = None

-    # Circuit breaker check
-    if not await circuit_breaker_check():
+    # Always query the sidecar first (to detect recovery even when circuit is open)
+    async with httpx.AsyncClient(timeout=3.0) as client:
+        try:
+            resp = await client.get(f"{SIDECAR_URL}/models/status")
+            if resp.status_code == 200:
+                sidecar_status = resp.json()
+                circuit_reset()
+        except Exception:
+            pass  # Handled below
+
+    if sidecar_status is None:
+        circuit_record_failure()
+        error = "sidecar_down"
+    elif not await circuit_breaker_check():
+        # Sidecar is up but circuit is open from prior switch failures
+        # Only block the switch — allow routing to already-active backend
        error = "circuit_open"
+        if sidecar_status.get("llama_server_running"):
+            target_url = f"{MAIN_PC_BASE}/{path}"
    else:
-        # Query Sidecar for active model
-        sidecar_status = None
-        async with httpx.AsyncClient(timeout=3.0) as client:
+        # Both sidecar reachable and circuit closed — proceed normally
+        body = await request.body()
+        body_data = json.loads(body) if body else {}
+        requested_model = body_data.get("model")
+
+        # Only trigger model switches for actual chat/completion POST requests.
+        # GET probes, /api/show lookups, and other non-chat endpoints should
+        # never trigger a switch — they just read current state.
+        is_chat_request = (
+            request.method == "POST"
+            and path in ("v1/chat/completions", "v1/completions")
+        )
+
+        if requested_model and sidecar_status.get("active_profile") == requested_model:
+            target_url = f"{MAIN_PC_BASE}/{path}"
+        elif requested_model and is_chat_request:
+            # All requests during a model switch get an immediate SSE streaming
+            # response so clients (Hermes Desktop) don't timeout while waiting
+            # for the model to load (10-30s).  The switch runs in a background
+            # task; the SSE stream yields progress events, then pipes through
+            # the actual response once the backend model is ready.
+            current_switch = await wait_for_switch()
+            if current_switch is None:
+                # No switch in progress — start one in the background
+                await start_switch()
+                asyncio.create_task(_background_switch(requested_model))
+
+            # Queue this request — signals when switch completes
            try:
-                resp = await client.get(f"{SIDECAR_URL}/models/status")
-                if resp.status_code == 200:
-                    sidecar_status = resp.json()
-                    circuit_reset()
-            except Exception:
-                error = "sidecar_down"
+                wait_evt = await queue_request()
+            except HTTPException as he:
+                raise

-        if sidecar_status is None:
-            circuit_record_failure()
-            error = "sidecar_down"
-        else:
-            # Extract requested model from request body
-            body = await request.body()
-            body_data = json.loads(body) if body else {}
-            requested_model = body_data.get("model")
+            # Build request headers once
+            req_headers = dict(request.headers)
+            req_headers.pop("host", None)

-            if requested_model and sidecar_status.get("active_profile") == requested_model:
-                target_url = f"{MAIN_PC_BASE}/{path}"
-            else:
-                # Trigger switch
-                if requested_model:
-                    # Check if a switch is already in progress
-                    current_switch = await wait_for_switch()
-
-                    if current_switch is not None and not current_switch.is_set():
-                        # Another request started the switch — queue this one
+            async def stream_with_sse():
+                sse_gen = sse_progress_stream(wait_evt)
+                try:
+                    await wait_evt.wait()
+                    async for sse_chunk in sse_gen:
+                        yield sse_chunk
+                    # Send actual request to Main PC
+                    async with httpx.AsyncClient(timeout=60.0) as c:
+                        async with c.stream(
+                            request.method,
+                            f"{MAIN_PC_BASE}/{path}",
+                            content=body,
+                            headers=req_headers,
+                        ) as resp:
+                            async for chunk in resp.aiter_bytes():
+                                yield chunk
+                except Exception:
+                    # Main PC unreachable (switch failed or server died) —
+                    # try fallback chain
+                    yield _sse_format(
+                        "error",
+                        {"message": "Backend unreachable, trying fallback..."},
+                    )
+                    # Try OpenRouter
+                    if OPENROUTER_API_KEY:
                        try:
-                            wait_evt = await queue_request()
-                        except HTTPException as he:
-                            raise
-
-                        # SSE progress while waiting
-                        async def stream_with_sse():
-                            sse_gen = sse_progress_stream(wait_evt)
-                            try:
-                                await wait_evt.wait()
-                                async for sse_chunk in sse_gen:
-                                    yield sse_chunk
-                                complete_switch()
-                                drain_queue()
-                                async with httpx.AsyncClient(timeout=60.0) as c:
-                                    req_headers = dict(request.headers)
-                                    req_headers.pop("host", None)
-                                    async with c.stream(
-                                        request.method,
-                                        f"{MAIN_PC_BASE}/{path}",
-                                        content=body,
-                                        headers=req_headers,
-                                    ) as resp:
-                                        async for chunk in resp.aiter_bytes():
-                                            yield chunk
-                            finally:
-                                # Clean up sse_gen
-                                try:
-                                    await sse_gen.aclose()
-                                except Exception:
-                                    pass
-
-                        return StreamingResponse(
-                            stream_with_sse(),
-                            media_type="text/event-stream",
-                        )
-
-                    # First request triggers the switch
-                    await start_switch()  # Create event for tracking
+                            fb_headers = dict(req_headers)
+                            fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
+                            async with httpx.AsyncClient(timeout=60.0) as c:
+                                async with c.stream(
+                                    request.method,
+                                    f"{OPENROUTER_BASE}/{path}",
+                                    content=body,
+                                    headers=fb_headers,
+                                ) as resp:
+                                    async for chunk in resp.aiter_bytes():
+                                        yield chunk
+                                    return
+                        except Exception:
+                            pass
+                    # Fallback to LXC SLM
                    try:
-                        async with httpx.AsyncClient(timeout=120.0) as client:
-                            switch_resp = await client.post(
-                                f"{SIDECAR_URL}/models/switch",
-                                json={"profile_id": requested_model},
-                            )
-                        switch_result = switch_resp.json()
-                        if switch_result.get("status") == "ready":
-                            complete_switch()
-                            drain_queue()
-                            target_url = f"{MAIN_PC_BASE}/{path}"
-                        else:
-                            error = "switch_failed"
-                    except Exception as e:
-                        circuit_record_failure()
-                        error = f"switch_error: {str(e)}"
+                        async with httpx.AsyncClient(timeout=60.0) as c:
+                            async with c.stream(
+                                request.method,
+                                f"{FALLBACK_SLM_URL}/{path}",
+                                content=body,
+                                headers=req_headers,
+                            ) as resp:
+                                async for chunk in resp.aiter_bytes():
+                                    yield chunk
+                    except Exception:
+                        yield _sse_format(
+                            "error",
+                            {"message": "All backends unavailable"},
+                        )
+                finally:
+                    try:
+                        await sse_gen.aclose()
+                    except Exception:
+                        pass
+
+            return StreamingResponse(
+                stream_with_sse(),
+                media_type="text/event-stream",
+            )
+
+        else:
+            # No model in request body (probe/GET/non-chat request) —
+            # route to the currently active backend when available,
+            # or fall through to the fallback chain.
+            if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"):
+                target_url = f"{MAIN_PC_BASE}/{path}"

    # ── Fallback chain ────────────────────────────────────────────────────
    if target_url is None:
@@ -378,8 +633,11 @@ async def proxy(
                        request.method, target,
                        content=body, headers=headers,
                    ) as resp:
+                        if resp.status_code != 200:
+                            print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
                        async for chunk in resp.aiter_bytes():
                            yield chunk
+
                return StreamingResponse(gen(), status_code=200)

            resp = await client.request(
@@ -388,6 +646,12 @@ async def proxy(
                content=body,
                headers=headers,
            )
+            if resp.status_code != 200:
+                body_preview = resp.content[:500].decode("utf-8", errors="replace")
+                print(
+                    f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
+                    flush=True,
+                )
            return Response(
                content=resp.content,
                status_code=resp.status_code,
@@ -397,8 +661,11 @@ async def proxy(
    primary_result = None
    try:
        primary_result = await execute(target_url)
-    except Exception:
-        pass  # Falls through to fallback chain
+    except Exception as e:
+        print(
+            f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
+            flush=True,
+        )  # Falls through to fallback chain
    if primary_result is not None:
        return primary_result
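
The background-task-plus-SSE pattern used by the proxy hunk above can be sketched in isolation. This is a minimal sketch, not the router's code: `fake_switch`, `stream_with_progress`, and `sse_format` are hypothetical stand-ins for `_background_switch()`, `sse_progress_stream()`, and `_sse_format()`, whose real implementations are not shown in this diff, and the sleep stands in for the sidecar's model load.

```python
import asyncio
import json
from typing import AsyncIterator, List


def sse_format(event: str, data: dict) -> str:
    """Serialise one Server-Sent Events frame (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_switch(done: asyncio.Event, load_seconds: float) -> None:
    """Stand-in for the sidecar model load running in the background."""
    try:
        await asyncio.sleep(load_seconds)
    finally:
        # Signal completion even on failure so waiters never hang.
        done.set()


async def stream_with_progress(
    done: asyncio.Event, tick: float
) -> AsyncIterator[str]:
    """Yield progress frames until the switch finishes, then a ready frame."""
    while not done.is_set():
        yield sse_format("progress", {"status": "switching"})
        try:
            await asyncio.wait_for(done.wait(), timeout=tick)
        except asyncio.TimeoutError:
            pass  # not ready yet: loop and emit another progress frame
    yield sse_format("ready", {"status": "ready"})


async def demo() -> List[str]:
    done = asyncio.Event()
    # Keep a reference to the task so it cannot be garbage-collected mid-run.
    task = asyncio.create_task(fake_switch(done, load_seconds=0.05))
    frames = [f async for f in stream_with_progress(done, tick=0.01)]
    await task
    return frames
```

Running `asyncio.run(demo())` returns one or more `progress` frames followed by a single `ready` frame. The sketch emits the first frame before waiting on the event, which is what lets a client see output immediately. It also holds on to the task returned by `create_task()`, since the event loop keeps only a weak reference to tasks.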

--- a/scripts/sync_models.py
+++ b/scripts/sync_models.py
@@ -0,0 +1,161 @@
+#!/usr/bin/env python3
+"""
+Sync intelligence-router model list into Hermes custom_providers.
+
+Usage:
+    # One-shot: discover models from the router and update Hermes config
+    python3 scripts/sync_models.py
+
+    # Cron mode (auto): set up via:
+    #   cp scripts/sync_models.py ~/.hermes/scripts/
+    #   hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
+
+Prints a short notice and exits when nothing changed. Prints a summary and
+restarts the gateway when the model list differs.
+"""
+
+import json
+import os
+import subprocess
+import sys
+import urllib.error
+import urllib.request
+from pathlib import Path
+
+# ── CONFIGURE THESE ──────────────────────────────────────────────────
+ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
+PROVIDER_NAME = "intelligence_router"
+GATEWAY_SERVICE = "hermes-gateway"
+# ─────────────────────────────────────────────────────────────────────
+
+MODELS_URL = f"{ROUTER_BASE_URL}/models"
+CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
+
+
+def fetch_models() -> list[str] | None:
+    try:
+        req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
+        with urllib.request.urlopen(req, timeout=10) as resp:
+            data = json.loads(resp.read().decode())
+        models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
+        return models if models else None
+    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
+        print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
+        return None
+
+
+def read_current_models() -> list[str]:
+    """Parse current custom_providers entries for our provider name."""
+    if not CONFIG_PATH.exists():
+        return []
+
+    models = []
+    with open(CONFIG_PATH) as f:
+        content = f.read()
+
+    idx = content.find("custom_providers:")
+    if idx == -1:
+        return []
+
+    section = content[idx:]
+    lines = section.split("\n")[1:]  # skip the "custom_providers:" header, which would otherwise hit the break below
+
+    current_entry = {}
+    for line in lines:
+        s = line.strip()
+        if s.startswith("- base_url:"):
+            if current_entry.get("name") == PROVIDER_NAME:
+                m = current_entry.get("model", "")
+                if m:
+                    models.append(m)
+            current_entry = {}
+        elif s.startswith("model:"):
+            current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
+        elif s.startswith("name:"):
+            current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
+        elif line and not line.startswith(("-", " ")):  # unindented line = next top-level key
+            break
+
+    # Don't forget the last entry
+    if current_entry.get("name") == PROVIDER_NAME:
+        m = current_entry.get("model", "")
+        if m:
+            models.append(m)
+
+    return sorted(models)
+
+
+def generate_block(models: list[str]) -> str:
+    lines = ["custom_providers:"]
+    for m in models:
+        lines.append(f"- base_url: {ROUTER_BASE_URL}")
+        lines.append(f"  model: {m}")
+        lines.append(f"  name: {PROVIDER_NAME}")
+    return "\n".join(lines)
+
+
+def replace_section(models: list[str]) -> bool:
+    """Replace the custom_providers section in-place. Returns True if changed."""
+    if not CONFIG_PATH.exists():
+        return False
+
+    import yaml
+
+    content = CONFIG_PATH.read_text()
+    config = yaml.safe_load(content) or {}  # safe_load returns None for an empty file
+
+    new_entries = [
+        {"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
+        for m in models
+    ]
+
+    if config.get("custom_providers") == new_entries:
+        return False
+
+    config["custom_providers"] = new_entries
+    CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
+    return True
+
+
+def restart_gateway() -> bool:
+    try:
+        r = subprocess.run(
+            ["systemctl", "--user", "restart", GATEWAY_SERVICE],
+            capture_output=True, text=True, timeout=30,
+        )
+        return r.returncode == 0
+    except Exception:
+        return False
+
+
+def main():
+    models = fetch_models()
+    if models is None:
+        sys.exit(1)
+
+    current = read_current_models()
+    if current == models:
+        print("Model list unchanged — nothing to do.")
+        return
+
+    added = set(models) - set(current)
+    removed = set(current) - set(models)
+    print(f"Model list changed! {len(current)} → {len(models)} models")
+    if added:
+        print(f"  Added:   {sorted(added)}")
+    if removed:
+        print(f"  Removed: {sorted(removed)}")
+
+    if not replace_section(models):
+        print("ERROR: Config update failed")
+        return
+
+    print("Config updated. Restarting gateway...")
+    if restart_gateway():
+        print("Gateway restarted successfully.")
+    else:
+        print("WARNING: Gateway restart failed — restart manually.")
+
+
+if __name__ == "__main__":
+    main()
--- a/sidecar/app.py
+++ b/sidecar/app.py
@@ -5,7 +5,6 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
 import os
 import asyncio
 import signal as signal_module
-import threading
 from contextlib import asynccontextmanager
 from typing import Optional

@@ -18,41 +17,98 @@ from sidecar.manifest import load_manifest
 # Configuration from environment
 MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
 SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
-LLAMA_SERVER_PORT = 8080
+LLAMA_SERVER_PORT = 8081
+LLAMA_STDERR_LOG = os.path.join(
+    os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
+)

 # Global state
 _llama_server_process: Optional[asyncio.subprocess.Process] = None
 _active_profile: Optional[str] = None
-_switch_lock = threading.Lock()  # Use threading.Lock for compatibility with TestClient
+_switch_lock = asyncio.Lock()  # Use asyncio.Lock to avoid blocking the event loop


@asynccontextmanager
 async def lifespan(app: FastAPI):
    """Manage sidecar lifecycle — no default model loaded."""
-    print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
+    print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True)
    yield
    # Cleanup: kill llama-server if running
    global _llama_server_process
    if _llama_server_process:
-        _kill_llama_server()
+        await _kill_llama_server()


 app = FastAPI(lifespan=lifespan)


-def _kill_llama_server():
-    """Kill the llama-server subprocess."""
+def _close_stderr_log():
+    """Close the stderr log file handle if it's still attached to the process."""
    global _llama_server_process
-    if _llama_server_process and _llama_server_process.returncode is None:
-        try:
-            _llama_server_process.send_signal(signal_module.SIGTERM)
+    if _llama_server_process is not None:
+        fh = getattr(_llama_server_process, "_stderr_fh", None)
+        if fh is not None and not fh.closed:
            try:
-                _llama_server_process.wait(timeout=5)
+                fh.close()
+            except Exception:
+                pass
+
+
+async def _kill_llama_server():
+    """Kill the llama-server subprocess and wait for it to fully terminate.
+
+    This MUST be async because process.wait() is a coroutine. The synchronous
+    version was calling .wait() without await, creating an unawaited coroutine
+    object — the old process was never actually waited on, so it could still
+    hold GPU VRAM when the new server started.
+    """
+    global _llama_server_process
+    if _llama_server_process is None or _llama_server_process.returncode is not None:
+        _close_stderr_log()
+        return
+
+    try:
+        _llama_server_process.send_signal(signal_module.SIGTERM)
+        try:
+            await asyncio.wait_for(_llama_server_process.wait(), timeout=10)
+        except asyncio.TimeoutError:
+            _llama_server_process.kill()
+            try:
+                await asyncio.wait_for(_llama_server_process.wait(), timeout=5)
            except asyncio.TimeoutError:
-                _llama_server_process.kill()
-        except Exception:
-            pass
+                pass
+    except Exception:
+        pass
+    finally:
        _llama_server_process = None
+        _close_stderr_log()
+
+
+def _flag_value(value) -> str:
+    """Convert a manifest flag value to a llama-server CLI argument string.
+
+    YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
+    safe_load.  llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
+    """
+    if isinstance(value, bool):
+        return "on" if value else "off"
+    return str(value)
+
+
+def _flag_key(key: str) -> str:
+    """Convert a manifest flag key to the correct llama-server CLI flag name.
+
+    llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
+    but YAML keys often use underscores.  Some flags were also renamed
+    across llama.cpp versions (e.g. --n-ctx → --ctx-size).
+
+    This function normalises underscores to hyphens and applies known renames.
+    """
+    normalized = key.replace("_", "-")
+    FLAG_RENAMES = {
+        "n-ctx": "ctx-size",
+    }
+    return FLAG_RENAMES.get(normalized, normalized)


 async def _start_llama_server(profile: dict):
@@ -60,29 +116,39 @@ async def _start_llama_server(profile: dict):
    global _llama_server_process

    # Kill any existing process
-    _kill_llama_server()
+    await _kill_llama_server()

    # Build command from profile flags
-    cmd = ["llama-server"]
+    cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"]
    cmd += ["--model", profile["model_path"]]
    cmd += ["--port", str(LLAMA_SERVER_PORT)]
+    cmd += ["--host", "0.0.0.0"]
    for key, value in profile.get("flags", {}).items():
-        cmd += ["--" + key, str(value)]
+        cmd += ["--" + _flag_key(key), _flag_value(value)]

-    print(f"Starting llama-server: {' '.join(cmd)}")
+    print(f"Starting llama-server: {' '.join(cmd)}", flush=True)
+
+    # Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
+    stderr_fh = open(LLAMA_STDERR_LOG, "w")
    _llama_server_process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
-        stderr=asyncio.subprocess.DEVNULL,
+        stderr=stderr_fh,
    )
+    # Keep a reference so we can close the handle later
+    _llama_server_process._stderr_fh = stderr_fh  # type: ignore[attr-defined]
    return _llama_server_process


-async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
-    """Poll llama-server readiness via /v1/models endpoint."""
+async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5):
+    """Poll llama-server readiness via /v1/models endpoint.
+
+    Returns True on success.  On failure, dumps the captured stderr (if any)
+    so the user can see why llama-server crashed.
+    """
    import httpx

-    for _ in range(max_retries):
+    for attempt in range(max_retries):
        try:
            async with httpx.AsyncClient(timeout=2.0) as client:
                resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
@@ -91,6 +157,27 @@ async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5
        except Exception:
            pass
        await asyncio.sleep(interval)
+
+    # Flush and close the stderr handle so all data is on disk before we read
+    _close_stderr_log()
+
+    # ── Dump stderr for diagnosis ──────────────────────────────────────
+    print("llama-server did NOT become ready — dumping stderr:", flush=True)
+    try:
+        with open(LLAMA_STDERR_LOG) as f:
+            for line in f:
+                print(f"  {line.rstrip()}", flush=True)
+    except FileNotFoundError:
+        print("  (stderr log not found — process may not have started)", flush=True)
+
+    # Also log exit code if the process died
+    global _llama_server_process
+    if _llama_server_process and _llama_server_process.returncode is not None:
+        print(
+            f"llama-server exited with code {_llama_server_process.returncode}",
+            flush=True,
+        )
+
    return False


@@ -124,7 +211,7 @@ async def switch_model(payload: SwitchRequest):
    """Stop current llama-server, start new one with the given profile, wait for readiness."""
    global _active_profile

-    with _switch_lock:
+    async with _switch_lock:
        # Validate profile_id
        profiles = load_manifest(MANIFEST_PATH)
        if profiles is None:
@@ -153,7 +240,7 @@ async def switch_model(payload: SwitchRequest):
            }

        # Start the new model
-        _kill_llama_server()
+        await _kill_llama_server()
        _active_profile = None
        await _start_llama_server(profile)
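
The termination sequence in `_kill_llama_server()` (SIGTERM, awaited wait with a timeout, SIGKILL fallback) can be exercised on its own. This is a minimal sketch under assumptions: `terminate` is a hypothetical helper name, a sleeping Python child stands in for llama-server, and the timeouts are illustrative.

```python
import asyncio
import signal
import sys
from typing import Optional


async def terminate(
    proc: asyncio.subprocess.Process,
    term_timeout: float = 10.0,
    kill_timeout: float = 5.0,
) -> Optional[int]:
    """Send SIGTERM, await exit, and escalate to SIGKILL on timeout."""
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do
    proc.send_signal(signal.SIGTERM)
    try:
        # proc.wait() is a coroutine: without 'await' it never runs.
        await asyncio.wait_for(proc.wait(), timeout=term_timeout)
    except asyncio.TimeoutError:
        proc.kill()
        try:
            await asyncio.wait_for(proc.wait(), timeout=kill_timeout)
        except asyncio.TimeoutError:
            pass  # give up waiting; the caller sees returncode None
    return proc.returncode


async def demo() -> Optional[int]:
    # A sleeping child process stands in for llama-server (assumption).
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await terminate(proc, term_timeout=5.0)
```

`asyncio.run(demo())` returns only after the child has exited, so the return code is set by the time a replacement process would be started. That ordering is the point of the fix: the new server is launched only once the old one has released its resources.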
Author	SHA1	Message	Date
root	b3ac21b2c0	fix: first request no longer blocks on model switch — uses background task + SSE The FIRST request that triggers a model switch was blocking the HTTP response for 10-30s while waiting for the sidecar to load the model. Hermes Desktop's client timed out during this wait, causing 'nothing happens' on new session. Fix: refactored the proxy handler so ALL requests during a model switch use the same SSE streaming pattern (immediate 200, progress events, then actual response piped through after switch completes). The switch now runs as a background asyncio task via create_task(). - Added _background_switch() — runs POST /models/switch in background task with complete_switch() + drain_queue() in finally block - All switch-triggering requests go through queue_request() + StreamingResponse - SSE generator now falls through to OpenRouter/LXC if Main PC unreachable (switch failure case) instead of hanging indefinitely Sidecar fixes from previous commit: - _kill_llama_server() is now async with proper await on process termination - _switch_lock changed from threading.Lock to asyncio.Lock()	2026-06-18 00:10:48 +00:00
root	45dd793b69	fix: sidecar process kill was not awaiting wait() — old server held GPU VRAM - _kill_llama_server() was sync calling an unawaited coroutine. process.wait() created a discarded coroutine object — the old llama-server was never waited on to release GPU memory before starting a new one, causing OOM on rapid model switches. Fixed with async await + 10s SIGTERM timeout + SIGKILL fallback. - Changed _switch_lock from threading.Lock to asyncio.Lock() to prevent event loop deadlock during long switch operations. - Router proxy: only trigger model switches for POST /v1/chat/completions and /v1/completions. Non-chat endpoints (GET probes, /api/show) no longer trigger unwanted model reloads. - _ollama_show_lookup: return active profile context size when model_name is empty. Previously returned 404, causing Hermes Desktop to default to 256k context. - Always drain_queue() + complete_switch() after switch failure so queued requests don't hang forever waiting on a never-set switching event.	2026-06-17 23:49:57 +00:00
root	7e9b3f43e1	fix: circuit breaker deadlock — always query sidecar for status The circuit breaker opened after MAX_RECOVERY_ATTEMPTS failures but was never reset because the sidecar status query (which calls circuit_reset()) was skipped when the circuit was open. This caused a permanent deadlock: all subsequent requests went to the LXC fallback with no recovery possible. Fix: always query the sidecar for /models/status, even when the circuit is open. If the sidecar responds successfully, reset the circuit. The circuit breaker now only prevents the SWITCH operation, not the status health check. If a model is already running when the circuit is open, route to it directly.	2026-06-16 22:09:16 +00:00
root	bcf45129f1	fix: add --host 0.0.0.0 to llama-server command llama-server defaults to binding on 127.0.0.1 (localhost only). When the router runs on a separate Docker host (10.0.4.100), all chat completion requests fail with: PROXY EXCEPTION on primary http://10.0.4.11:8081/v1/chat/completions: ConnectError: All connection attempts failed Added --host 0.0.0.0 after --port so llama-server listens on all network interfaces, reachable from the Docker host.	2026-06-16 21:46:07 +00:00
root	75248741e7	fix: log exceptions on primary proxy target When the primary request to llama-server (10.0.4.11:8081) raises an exception (connection refused, timeout), it was silently swallowed by the catch-all except block, making it look like a sidecar/switch failure when it was actually a network-level error. Now prints: 'PROXY EXCEPTION on primary <url>: <ExceptionType>: <msg>'	2026-06-16 21:32:36 +00:00
root	5c1753dfef	fix: log sidecar switch failures + fix scoping bug in proxy handler Two changes to debug the fallback-to-LXC issue: 1. Added debug logging on switch failure: prints the profile name, sidecar response status, and error message. Also calls circuit_record_failure() so subsequent requests don't wait the full 120-second timeout before falling back. 2. Fixed scoping bug: sidecar_status was only defined inside the else branch of the circuit breaker check. Initialized to None at function scope alongside target_url and error to prevent NameError when circuit is open.	2026-06-16 21:25:42 +00:00
root	f2e62f60e6	fix: /api/show GET support, /v1 root handler, and proxy debug logging Three changes to debug and fix Hermes Desktop integration: 1. /api/show: Added GET handler alongside existing POST handler. Hermes Desktop probes with GET ?model=xxx, not POST body. Refactored shared lookup logic into _ollama_show_lookup(). 2. /v1 root: Added handler returning basic info. Hermes Desktop probes this URL and ERR_CONNECTION_REFUSED was blocking full provider validation. 3. Proxy execute(): Added debug logging for non-200 responses. Prints the backend URL, status code, and first 500 bytes of body to help diagnose why llama-server returns 400 on /v1/chat/completions.	2026-06-16 21:16:45 +00:00
root	d935339280	fix: report actual profile context size in /api/show probe endpoint Hermes Desktop reads the context size from /api/show's 'parameters' field. This was hardcoded to 'num_ctx 4096' for every model, causing 'context too small' errors when the user's system prompt + conversation exceeded 4K tokens. Now extracts the actual ctx-size from the profile's flags and returns the correct value (e.g. 'num_ctx 131072' for the 128K profiles).	2026-06-16 21:04:40 +00:00
root	4ee85972ec	fix: convert underscores to hyphens in llama-server flag names, fix n_ctx→ctx-size rename Two changes to fix 'error: invalid argument: --n-ctx' during model switch: 1. sidecar/app.py: Added _flag_key() converter that normalises underscores to hyphens in flag names and handles the n_ctx→ctx-size rename. The code now converts e.g. n_gpu_layers → n-gpu-layers, top_p → top-p, top_k → top-k, min_p → min-p before passing to llama-server CLI. 2. deploy/manifest.yaml: Updated all 20 profiles to use correct llama-server flag names: n_ctx→ctx-size, n_gpu_layers→n-gpu-layers, top_p→top-p, top_k→top-k, min_p→min-p. All flags now use hyphens, matching what llama-server actually accepts.	2026-06-16 20:54:32 +00:00
root	1551c281c2	fix: move llama-server stderr log from /tmp to working dir (ReadWritePaths compat) The sidecar systemd service has ProtectSystem=strict and ReadWritePaths=/home/bigt/AI/llm, making /tmp read-only. Writing /tmp/llama-server-stderr.log failed with EROFS. Changed LLAMA_STDERR_LOG to os.path.join(dirname(MANIFEST_PATH), ...), resolving to /home/bigt/AI/llm/llama-server-stderr.log, which is within the allowed ReadWritePaths.	2026-06-16 20:36:10 +00:00
root	37fee5341e	fix: capture llama-server stderr, fix YAML boolean flag conversion, reduce polling timeout Three fixes for the model-not-loading bug: 1. YAML boolean → CLI flag bug: YAML parses 'on'/'off'/'yes'/'no' as Python bools. str(True)='True' which is INVALID for llama.cpp's --flash-attn flag (expects 'on'/'off'/'auto'). Added _flag_value() converter that maps bools to 'on'/'off' strings. 2. llama-server stderr was DEVNULL: All error messages (bad model path, OOM, invalid flag) were invisible. Now captured to /tmp/llama-server-stderr.log and dumped to the sidecar log on failure. 3. Reduce polling timeout: 240 retries × 0.5s = 120s hang. Reduced to 60 retries × 0.5s = 30s. Still dumps stderr + exit code on failure. 4. Manifest VRAM fix: gemma4-26b-compact-long-128k used q8_0 KV cache at 128K context (~24GB on 24GB RTX 3090 — borderline OOM). Changed to q4_0 (~18GB, comfortable).	2026-06-16 00:06:45 +00:00
root	903f06c634	feat: add sync_models.py script to auto-update Hermes custom_providers from router model list	2026-06-15 21:10:36 +00:00
root	95c87a764b	fix: remove non-existent models from manifest (qwen-3-8b, llama-4-maverick), add 3 newly discovered models	2026-06-15 16:38:17 +00:00
root	36abbf573e	fix: unbuffer sidecar stdout so logs appear in journalctl	2026-06-15 16:25:58 +00:00
Tudorel Oprisan	1e9305395e	Fixed llama-server path	2026-06-15 17:01:53 +01:00
root	7e86a30bd8	fix: resolve port conflict between sidecar and llama-server Sidecar and llama-server were both configured on port 8080, causing llama-server to fail on startup (port already in use). - sidecar/app.py: LLAMA_SERVER_PORT → 8081 (sidecar stays on 8080) - docker-compose.yml: MAIN_PC_URL → port 8081 (router sends chat requests to llama-server, not the sidecar)	2026-06-15 15:31:31 +00:00
root	2c23faa4a1	fix: add probe endpoints and no-model fallback for Hermes Desktop compatibility Hermes Desktop sends probe requests to validate providers before allowing model switching. The router was returning 503 for all of these because the catch-all proxy requires a 'model' field in the request body. Added explicit handlers for: - GET /v1/models/{model_id} — OpenAI single-model lookup - GET /api/tags — Ollama model list discovery - POST /api/show — Ollama model info - GET /api/v1/models — Ollama-compatible model list - GET /v1/props, GET /props — llama.cpp server properties - GET /version — llama.cpp version Also fixed the catch-all proxy to route requests with no model body to the currently active backend instead of returning 503.	2026-06-15 15:22:15 +00:00
Tudorel Oprisan	af12370632	changed llama-server location	2026-06-15 16:10:49 +01:00
root	1ef8a497f6	fix: update docker-compose.yml SIDECAR_URL to port 8080	2026-06-15 13:23:09 +00:00
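
The flag conversions described in the commit messages above (underscores to hyphens, the `n_ctx` to `ctx-size` rename, YAML booleans to `on`/`off`) can be checked with a small standalone example. The two converters mirror `_flag_key()` and `_flag_value()` from the sidecar diff. `build_args` and the sample `flags` dict are hypothetical, added here only for illustration.

```python
from typing import Any, Dict, List


def flag_key(key: str) -> str:
    """Normalise underscores to hyphens and apply known renames."""
    normalized = key.replace("_", "-")
    renames = {"n-ctx": "ctx-size"}
    return renames.get(normalized, normalized)


def flag_value(value: Any) -> str:
    """Map Python bools to 'on'/'off'; stringify everything else."""
    # bool must be checked explicitly: str(True) would give 'True'.
    if isinstance(value, bool):
        return "on" if value else "off"
    return str(value)


def build_args(flags: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: flatten a profile's flags into CLI arguments."""
    args: List[str] = []
    for key, value in flags.items():
        args += ["--" + flag_key(key), flag_value(value)]
    return args


# Illustrative values, not taken from any profile in the manifest.
flags = {"n_ctx": 131072, "n_gpu_layers": 999, "flash_attn": True, "top_p": 0.95}
print(build_args(flags))
# ['--ctx-size', '131072', '--n-gpu-layers', '999', '--flash-attn', 'on', '--top-p', '0.95']
```

Because `yaml.safe_load` turns `on`, `off`, `yes`, and `no` into Python bools, the boolean branch is what keeps a manifest entry such as `flash_attn: on` from reaching the command line as `True`.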