7 changed files with 236 additions and 891 deletions
--- a/.hermes/plans/add-model-profiles.md
+++ b/.hermes/plans/add-model-profiles.md
@ -1,94 +0,0 @@
 # Plan: Add user model profiles to manifest.yaml
 # Date: 2025-06-15
 # Author: Hermes Agent
 # Status: DRAFT
 ## Context
 User has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM).
 The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters.
 Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.
 ## Hardware constraints
 - GPU: RTX 3090, 24GB VRAM
 - All profiles use `n_gpu_layers: 999` (offload all layers that fit)
 - All profiles use `flash-attn: on`
 - KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
 - `min_p` set to 0.0 across all profiles (community standard for these models)
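The KV cache constraint above can be sanity-checked with a rough estimate. This is a minimal sketch, not a measurement: the layer count, KV head count, and head dimension used in the example call are hypothetical placeholders (read the real values from each model's GGUF metadata), and the formula assumes full attention in every layer, so models with sliding-window or hybrid attention will need less. The per-element sizes follow the llama.cpp block layouts (q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values).

```python
# Approximate bytes per cached element for each KV cache type.
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, cache_type: str) -> float:
    """Estimate KV cache size in GiB, assuming full attention in every layer."""
    # Factor 2 covers the separate K and V tensors.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


if __name__ == "__main__":
    # Hypothetical architecture: 48 layers, 8 KV heads, head_dim 128.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gib(65536, 48, 8, 128, cache_type)
        print(f"{cache_type}: {size:.3f} GiB at n_ctx=65536")
```

Adding the estimate to the GGUF file size gives a first-order check on whether a profile fits in 24GB; compute buffers and other runtime overhead are not included.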
 ## Models to add (excluding mmproj files)
 ### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
 Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20
 | # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
 |---|-----------|------|-------|-----------|------|-------|------------|
 | 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
 | 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
 | 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |
 ### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
 Google official: temp 1.0 / top_p 0.95 / top_k 64
 | # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
 |---|-----------|------|------|-------|-----------|------|-------|
 | 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
 | 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
 | 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
 | 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |
 ### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
 MoE, 4B active. Same sampling as 12B family.
 | # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
 |---|-----------|------|------|-------|-----------|------|-------|------------|
 | 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
 | 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
 | 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |
 ### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
 **MTP note:** An Unsloth benchmark shows MTP is net-negative on a single 3090. An MTP profile is included anyway because the user has the file.
 | # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
 |---|-----------|------|------|-------|-----------|------|-------|-----|
 | 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
 | 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
 | 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
 | 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |
 ### Uncensored models (apply censored family params)
 | # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
 |---|-----------|------|------|-------|-----------|------|-------|----------|
 | 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
 | 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
 | 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
 | 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |
 **Total: 18 profiles**
 ## Flag mapping (manifest → llama-server CLI)
 Manifest flags use snake_case keys that the sidecar passes to llama-server as the CLI flags listed below (`--flag value`):
 | Manifest key | CLI flag | Type | Notes |
 |-------------|----------|------|-------|
 | n_gpu_layers | --n-gpu-layers | int | 999 = all |
 | n_ctx | --ctx-size | int | context window |
 | cache_type_k | --cache-type-k | str | q8_0, q4_0 |
 | cache_type_v | --cache-type-v | str | q8_0, q4_0 |
 | flash_attn | --flash-attn | bool | true/on |
 | temp | --temp | float | sampling |
 | top_p | --top-p | float | sampling |
 | top_k | --top-k | int | sampling |
 | repeat_penalty | --repeat-penalty | float | sampling |
 | min_p | --min-p | float | 0.0 |
 | spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
 | spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
 | presence_penalty | --presence-penalty | float | 0.0 |
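The mapping table above could be applied with a small lookup like the following. This is a sketch under stated assumptions, not the sidecar's actual code: `FLAG_MAP` and `build_args` are hypothetical names, and rendering booleans as `on`/`off` is an assumption based on the `true/on` note for `flash_attn`.

```python
# Hypothetical lookup copied from the flag-mapping table in this plan.
FLAG_MAP = {
    "n_gpu_layers": "--n-gpu-layers",
    "n_ctx": "--ctx-size",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "flash_attn": "--flash-attn",
    "temp": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "repeat_penalty": "--repeat-penalty",
    "min_p": "--min-p",
    "spec_type": "--spec-type",
    "spec_draft_n_max": "--spec-draft-n-max",
    "presence_penalty": "--presence-penalty",
}


def build_args(flags: dict) -> list[str]:
    """Turn a profile's `flags` mapping into llama-server CLI arguments."""
    args: list[str] = []
    for key, value in flags.items():
        cli_flag = FLAG_MAP.get(key)
        if cli_flag is None:
            # Fail loudly rather than silently dropping an unknown key.
            raise KeyError(f"unmapped manifest key: {key}")
        if isinstance(value, bool):
            # Assumption: booleans (YAML `on`/`true`) are passed as on/off.
            value = "on" if value else "off"
        args += [cli_flag, str(value)]
    return args


if __name__ == "__main__":
    print(build_args({"n_ctx": 65536, "cache_type_k": "q8_0", "flash_attn": True}))
```

Raising on unmapped keys is a deliberate choice in this sketch: a manifest that mixes snake_case and hyphenated keys would then fail at load time instead of starting llama-server with missing flags.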
 ## Actions
 1. Create branch `feature/add-model-profiles` from master
 2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
 3. Update `deploy/manifest.yaml` with all 18 profiles
 4. Update tests if flag structure requires it
 5. Run tests, commit
--- a/deploy/llm-sidecar.service
+++ b/deploy/llm-sidecar.service
@ -12,7 +12,6 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
 Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
 Environment=SIDECAR_PORT=8080
 Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
 Environment=PYTHONUNBUFFERED=1
 # Use the sidecar's venv — install deps via deploy/README.md
 ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080
--- a/deploy/manifest.yaml
+++ b/deploy/manifest.yaml
@ -11,88 +11,141 @@
 # All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
 # KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM
 - id: qwen-3-8b
  name: "Qwen 3 8B"
  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
  flags:
    n_ctx: 8192
    n_gpu_layers: 35
 - id: qwen-3-8b-long
  name: "Qwen 3 8B (Long Context)"
  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
  flags:
    n_ctx: 32768
    n_gpu_layers: 20
 - id: llama-4-maverick
  name: "Llama 4 Maverick"
  model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
  flags:
    n_ctx: 8192
    n_gpu_layers: 35
 # --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
 # Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
 - id: qwen36-27b-balanced-64k
  name: "Qwen3.6-27B Balanced 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: qwen36-27b-thinking-64k
  name: "Qwen3.6-27B Thinking 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: qwen36-27b-extended-128k
  name: "Qwen3.6-27B Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    ctx-size: 131072
+    n_ctx: 131072
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.05
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
-# --- Gemma 4 12B (Q6_K_XL ~8.5 GB) ---
+# --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
 # Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
 - id: gemma4-12b-standard-q6-64k
  name: "Gemma4 12B Standard Q6 64K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 64
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: gemma4-12b-extended-q6-128k
  name: "Gemma4 12B Extended Q6 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
  flags:
-    ctx-size: 131072
+    n_ctx: 131072
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 64
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: gemma4-12b-compact-iq4-64k
  name: "Gemma4 12B Compact IQ4 64K"
  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
  flags:
    n_ctx: 65536
    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
    top_p: 0.95
    top_k: 64
    repeat-penalty: 1.0
    min_p: 0.0
    presence-penalty: 0.0
 - id: gemma4-12b-compact-long-128k
  name: "Gemma4 12B Compact IQ4 128K"
  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
  flags:
    n_ctx: 131072
    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
    top_p: 0.95
    top_k: 64
    repeat-penalty: 1.0
    min_p: 0.0
    presence-penalty: 0.0
 # --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
@ -101,97 +154,48 @@
  name: "Gemma4 26B Balanced 64K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 64
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: gemma4-26b-extended-128k
  name: "Gemma4 26B Extended 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 131072
+    n_ctx: 131072
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 64
+    top_k: 64
    repeat-penalty: 1.15
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: gemma4-26b-ultra-long-iq4-128k
  name: "Gemma4 26B Ultra-Long IQ4 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
  flags:
-    ctx-size: 131072
+    n_ctx: 131072
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 64
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: gemma4-26b-q5-64k
  name: "Gemma4 26B Q5 64K"
  model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
  flags:
    ctx-size: 65536
    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
    top-p: 0.95
    top-k: 64
    repeat-penalty: 1.0
    min-p: 0.0
    presence-penalty: 0.0
 # --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
 - id: gemma4-26b-compact-iq4-64k
  name: "Gemma4 26B Compact IQ4 64K"
  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
  flags:
    ctx-size: 65536
    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
    top-p: 0.95
    top-k: 64
    repeat-penalty: 1.0
    min-p: 0.0
    presence-penalty: 0.0
 - id: gemma4-26b-compact-long-128k
  name: "Gemma4 26B Compact IQ4 128K"
  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
  flags:
    ctx-size: 131072
    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
    top-p: 0.95
    top-k: 64
    repeat-penalty: 1.0
    min-p: 0.0
    presence-penalty: 0.0
 # --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
@ -201,144 +205,95 @@
  name: "Qwen3.6-35B Fast 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: qwen36-35b-thinking-64k
  name: "Qwen3.6-35B Thinking 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: qwen36-35b-extended-128k
  name: "Qwen3.6-35B Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 131072
+    n_ctx: 131072
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
-# --- Qwen3.6-35B-A3B MTP variant ---
+# --- Uncensored models (apply censored family params) ---
 - id: qwen36-35b-mtp-fast-64k
  name: "Qwen3.6-35B MTP Fast 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
  flags:
    ctx-size: 65536
    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
    top-p: 0.95
    top-k: 20
    repeat-penalty: 1.0
    min-p: 0.0
    presence-penalty: 0.0
 - id: qwen36-35b-mtp-extended-128k
  name: "Qwen3.6-35B MTP Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
  flags:
    ctx-size: 131072
    n-gpu-layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
    top-p: 0.95
    top-k: 20
    repeat-penalty: 1.0
    min-p: 0.0
    presence-penalty: 0.0
 # --- Uncensored models ---
 - id: qwen36-35b-hauhau-aggressive-64k
  name: "Qwen3.6-35B HauhauCS Aggressive 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: qwen36-35b-genesis-apex-64k
  name: "Qwen3.6-35B Genesis APEX 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 20
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
 - id: qwen36-35b-genesis-mtp-apex-64k
  name: "Qwen3.6-35B Genesis MTP APEX 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
  flags:
    ctx-size: 65536
    n-gpu-layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
    top-p: 0.95
    top-k: 20
    repeat-penalty: 1.0
    min-p: 0.0
    presence-penalty: 0.0
 - id: gemma4-26b-hauhau-balanced-64k
  name: "Gemma4 26B HauhauCS Balanced 64K"
  model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
  flags:
-    ctx-size: 65536
+    n_ctx: 65536
-    n-gpu-layers: 999
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
+    top_p: 0.95
-    top-k: 64
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -7,8 +7,8 @@ services:
    ports:
      - "9001:9000"
    environment:
-      - SIDECAR_URL=http://10.0.4.11:8080
+      - SIDECAR_URL=http://10.0.4.11:8081
-      - MAIN_PC_URL=http://10.0.4.11:8081/v1
+      - MAIN_PC_URL=http://10.0.4.11:8080/v1
      - FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
    restart: unless-stopped
--- a/main.py
+++ b/main.py
@ -141,49 +141,6 @@ def complete_switch():
            _switching_event.set()
 async def _background_switch(requested_model: str):
    """Run a model switch in the background.
    The sidecar POST is awaited but the caller gets an immediate SSE stream
    so Hermes Desktop doesn't timeout waiting for the first response.
    Called via asyncio.create_task() so it runs concurrently with the
    SSE stream being sent to the client.
    """
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            switch_resp = await client.post(
                f"{SIDECAR_URL}/models/switch",
                json={"profile_id": requested_model},
            )
            switch_result = switch_resp.json()
            if switch_result.get("status") == "ready":
                print(
                    f"SWITCH SUCCESS: profile={requested_model}",
                    flush=True,
                )
            else:
                circuit_record_failure()
                print(
                    f"SWITCH FAILED: profile={requested_model}, "
                    f"status={switch_result.get('status')}, "
                    f"message={switch_result.get('message', '(no message)')}",
                    flush=True,
                )
    except Exception as e:
        circuit_record_failure()
        print(
            f"SWITCH EXCEPTION: profile={requested_model}, "
            f"error={type(e).__name__}: {e}",
            flush=True,
        )
    finally:
        # Signal all queued requests so they can proceed (and fall
        # through to the fallback chain if the switch failed).
        complete_switch()
        drain_queue()
 # ─── App ─────────────────────────────────────────────────────────────────────
@asynccontextmanager
 async def lifespan(app: FastAPI):
@ -196,12 +153,6 @@ app = FastAPI(lifespan=lifespan)
 # ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
@app.get("/v1")
 async def v1_root():
    """OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
    return {"object": "list", "data": []}
@app.get("/v1/models")
 async def get_models():
    """OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
@ -228,170 +179,6 @@ async def health():
    return {"status": "router_online"}
 # ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
 # These endpoints are probed by Hermes Desktop to validate/identify the
 # provider before allowing model switching.  Without them the desktop
 # returns 503 and refuses to switch models.
@app.get("/v1/models/{model_id:path}")
 async def get_single_model(model_id: str):
    """OpenAI-compatible single model query.  Proxied via Sidecar model list."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
        except Exception:
            return JSONResponse(
                status_code=503,
                content={"error": "Sidecar unavailable", "data": []},
            )
    for p in profiles:
        if p.get("id") == model_id:
            return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
    return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
@app.get("/api/tags")
 async def ollama_tags():
    """Ollama-compatible model list for Hermes Desktop discovery."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
        except Exception:
            return JSONResponse(content={"models": []})
    models = []
    for p in profiles:
        models.append({
            "name": p.get("id", ""),
            "model": p.get("id", ""),
            "modified_at": "2025-01-01T00:00:00Z",
            "size": 0,
            "digest": "",
            "details": {"format": "gguf", "family": p.get("name", "llm")},
        })
    return {"models": models}
@app.get("/api/show")
 async def ollama_show_get(model: str = ""):
    """Ollama-compatible model info for Hermes Desktop discovery (GET variant).
    Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
    """
    return await _ollama_show_lookup(model)
@app.post("/api/show")
 async def ollama_show_post(request: Request):
    """Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
    body = await request.body()
    body_data = json.loads(body) if body else {}
    model_name = body_data.get("model", "")
    return await _ollama_show_lookup(model_name)
 async def _ollama_show_lookup(model_name: str):
    """Shared logic for Ollama /api/show model info lookup.
    When model_name is empty string (Hermes Desktop probe with no model field),
    returns the currently-active profile's info so the desktop can determine
    the correct context size. Previously returned 404, causing Hermes Desktop
    to default to 256k context.
    """
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = status_resp.json()
        except Exception:
            return JSONResponse(status_code=404, content={"error": "model not found"})
    # If no model specified, return the currently-active profile's info
    active_id = status.get("active_profile")
    if not model_name and active_id:
        for p in profiles:
            if p.get("id") == active_id:
                flags = p.get("flags", {})
                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
                return {
                    "modelfile": "",
                    "parameters": f"num_ctx {ctx_size}",
                    "template": "",
                    "details": {
                        "format": "gguf",
                        "family": p.get("name", "llm"),
                        "parameter_size": ctx_size,
                    },
                    "model_info": {"id": p.get("id", "")},
                }
    for p in profiles:
        if p.get("id") == model_name:
            # Extract actual context size from the profile's flags
            flags = p.get("flags", {})
            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
            return {
                "modelfile": "",
                "parameters": f"num_ctx {ctx_size}",
                "template": "",
                "details": {
                    "format": "gguf",
                    "family": p.get("name", "llm"),
                    "parameter_size": ctx_size,
                },
                "model_info": {"id": p.get("id", "")},
            }
    return JSONResponse(status_code=404, content={"error": "model not found"})
@app.get("/api/v1/models")
 async def ollama_v1_models():
    """Ollama /api/v1/models redirect — return same list as /v1/models."""
    return await get_models()
@app.get("/v1/props")
 async def llama_cpp_props():
    """llama.cpp discovery endpoint for Hermes Desktop."""
    async with httpx.AsyncClient(timeout=3.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = resp.json()
        except Exception:
            status = {"active_profile": None, "llama_server_running": False}
    # Report the currently-running server version / capabilities
    return {
        "props": {
            "version": 1,
            "total_slots": 1,
            "chat_endpoint": "/v1/chat/completions",
            "completion_endpoint": "/v1/completions",
            "embedding_endpoint": "/v1/embeddings",
            "rerank_endpoint": "",
            "health_endpoint": "/health",
        },
        "active_profile": status.get("active_profile"),
        "server_running": status.get("llama_server_running", False),
    }
@app.get("/props")
 async def llm_props():
    """Legacy llama.cpp discovery endpoint (same as /v1/props)."""
    return await llama_cpp_props()
@app.get("/version")
 async def llm_version():
    """llama.cpp version endpoint for Hermes Desktop."""
    return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
 # ─── GET /models/status ──────────────────────────────────────────────────────
@app.get("/models/status")
 async def router_model_status():
@ -471,138 +258,96 @@ async def proxy(
    # ── Determine target URL ──────────────────────────────────────────────
    target_url: Optional[str] = None
    error: Optional[str] = None
    sidecar_status = None
-    # Always query the sidecar first (to detect recovery even when circuit is open)
+    # Circuit breaker check
-    async with httpx.AsyncClient(timeout=3.0) as client:
+    if not await circuit_breaker_check():
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/status")
            if resp.status_code == 200:
                sidecar_status = resp.json()
                circuit_reset()
        except Exception:
            pass  # Handled below
    if sidecar_status is None:
        circuit_record_failure()
        error = "sidecar_down"
    elif not await circuit_breaker_check():
        # Sidecar is up but circuit is open from prior switch failures
        # Only block the switch — allow routing to already-active backend
        error = "circuit_open"
        if sidecar_status.get("llama_server_running"):
            target_url = f"{MAIN_PC_BASE}/{path}"
    else:
-        # Both sidecar reachable and circuit closed — proceed normally
+        # Query Sidecar for active model
-        body = await request.body()
+        sidecar_status = None
-        body_data = json.loads(body) if body else {}
+        async with httpx.AsyncClient(timeout=3.0) as client:
        requested_model = body_data.get("model")
        # Only trigger model switches for actual chat/completion POST requests.
        # GET probes, /api/show lookups, and other non-chat endpoints should
        # never trigger a switch — they just read current state.
        is_chat_request = (
            request.method == "POST"
            and path in ("v1/chat/completions", "v1/completions")
        )
        if requested_model and sidecar_status.get("active_profile") == requested_model:
            target_url = f"{MAIN_PC_BASE}/{path}"
        elif requested_model and is_chat_request:
            # All requests during a model switch get an immediate SSE streaming
            # response so clients (Hermes Desktop) don't timeout while waiting
            # for the model to load (10-30s).  The switch runs in a background
            # task; the SSE stream yields progress events, then pipes through
            # the actual response once the backend model is ready.
            current_switch = await wait_for_switch()
            if current_switch is None:
                # No switch in progress — start one in the background
                await start_switch()
                asyncio.create_task(_background_switch(requested_model))
            # Queue this request — signals when switch completes
            try:
-                wait_evt = await queue_request()
+                resp = await client.get(f"{SIDECAR_URL}/models/status")
-            except HTTPException as he:
+                if resp.status_code == 200:
-                raise
+                    sidecar_status = resp.json()
-
+                    circuit_reset()
-            # Build request headers once
+            except Exception:
-            req_headers = dict(request.headers)
+                error = "sidecar_down"
            req_headers.pop("host", None)
            async def stream_with_sse():
                sse_gen = sse_progress_stream(wait_evt)
                try:
                    await wait_evt.wait()
                    async for sse_chunk in sse_gen:
                        yield sse_chunk
                    # Send actual request to Main PC
                    async with httpx.AsyncClient(timeout=60.0) as c:
                        async with c.stream(
                            request.method,
                            f"{MAIN_PC_BASE}/{path}",
                            content=body,
                            headers=req_headers,
                        ) as resp:
                            async for chunk in resp.aiter_bytes():
                                yield chunk
                except Exception:
                    # Main PC unreachable (switch failed or server died) —
                    # try fallback chain
                    yield _sse_format(
                        "error",
                        {"message": "Backend unreachable, trying fallback..."},
                    )
                    # Try OpenRouter
                    if OPENROUTER_API_KEY:
                        try:
                            fb_headers = dict(req_headers)
                            fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
                            async with httpx.AsyncClient(timeout=60.0) as c:
                                async with c.stream(
                                    request.method,
                                    f"{OPENROUTER_BASE}/{path}",
                                    content=body,
                                    headers=fb_headers,
                                ) as resp:
                                    async for chunk in resp.aiter_bytes():
                                        yield chunk
                                    return
                        except Exception:
                            pass
                    # Fallback to LXC SLM
                    try:
                        async with httpx.AsyncClient(timeout=60.0) as c:
                            async with c.stream(
                                request.method,
                                f"{FALLBACK_SLM_URL}/{path}",
                                content=body,
                                headers=req_headers,
                            ) as resp:
                                async for chunk in resp.aiter_bytes():
                                    yield chunk
                    except Exception:
                        yield _sse_format(
                            "error",
                            {"message": "All backends unavailable"},
                        )
                finally:
                    try:
                        await sse_gen.aclose()
                    except Exception:
                        pass
            return StreamingResponse(
                stream_with_sse(),
                media_type="text/event-stream",
            )
        if sidecar_status is None:
            circuit_record_failure()
            error = "sidecar_down"
        else:
-            # No model in request body (probe/GET/non-chat request) —
-            # route to the currently active backend when available,
-            # or fall through to the fallback chain.
-            if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"):
+            # Extract requested model from request body
+            body = await request.body()
+            body_data = json.loads(body) if body else {}
+            requested_model = body_data.get("model")
            if requested_model and sidecar_status.get("active_profile") == requested_model:
                target_url = f"{MAIN_PC_BASE}/{path}"
            else:
                # Trigger switch
                if requested_model:
                    # Check if a switch is already in progress
                    current_switch = await wait_for_switch()
                    if current_switch is not None and not current_switch.is_set():
                        # Another request started the switch — queue this one
                        try:
                            wait_evt = await queue_request()
                        except HTTPException as he:
                            raise
                        # SSE progress while waiting
                        async def stream_with_sse():
                            sse_gen = sse_progress_stream(wait_evt)
                            try:
                                await wait_evt.wait()
                                async for sse_chunk in sse_gen:
                                    yield sse_chunk
                                complete_switch()
                                drain_queue()
                                async with httpx.AsyncClient(timeout=60.0) as c:
                                    req_headers = dict(request.headers)
                                    req_headers.pop("host", None)
                                    async with c.stream(
                                        request.method,
                                        f"{MAIN_PC_BASE}/{path}",
                                        content=body,
                                        headers=req_headers,
                                    ) as resp:
                                        async for chunk in resp.aiter_bytes():
                                            yield chunk
                            finally:
                                # Clean up sse_gen
                                try:
                                    await sse_gen.aclose()
                                except Exception:
                                    pass
                        return StreamingResponse(
                            stream_with_sse(),
                            media_type="text/event-stream",
                        )
                    # First request triggers the switch
                    await start_switch()  # Create event for tracking
                    try:
                        async with httpx.AsyncClient(timeout=120.0) as client:
                            switch_resp = await client.post(
                                f"{SIDECAR_URL}/models/switch",
                                json={"profile_id": requested_model},
                            )
                        switch_result = switch_resp.json()
                        if switch_result.get("status") == "ready":
                            complete_switch()
                            drain_queue()
                            target_url = f"{MAIN_PC_BASE}/{path}"
                        else:
                            error = "switch_failed"
                    except Exception as e:
                        circuit_record_failure()
                        error = f"switch_error: {str(e)}"
    # ── Fallback chain ────────────────────────────────────────────────────
    if target_url is None:
@@ -633,11 +378,8 @@ async def proxy(
                        request.method, target,
                        content=body, headers=headers,
                    ) as resp:
-                        if resp.status_code != 200:
-                            print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
                        async for chunk in resp.aiter_bytes():
                            yield chunk
                return StreamingResponse(gen(), status_code=200)
            resp = await client.request(
@@ -646,12 +388,6 @@ async def proxy(
                content=body,
                headers=headers,
            )
-            if resp.status_code != 200:
-                body_preview = resp.content[:500].decode("utf-8", errors="replace")
-                print(
-                    f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
-                    flush=True,
-                )
            return Response(
                content=resp.content,
                status_code=resp.status_code,
@@ -661,11 +397,8 @@ async def proxy(
    primary_result = None
    try:
        primary_result = await execute(target_url)
-    except Exception as e:
-        print(
-            f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
-            flush=True,
-        )  # Falls through to fallback chain
+    except Exception:
+        pass  # Falls through to fallback chain
    if primary_result is not None:
        return primary_result
--- a/scripts/sync_models.py
+++ b/scripts/sync_models.py
@@ -1,161 +0,0 @@
 #!/usr/bin/env python3
 """
 Sync intelligence-router model list into Hermes custom_providers.
 Usage:
    # One-shot: discover models from the router and update Hermes config
    python3 scripts/sync_models.py
    # Cron mode (auto): set up via:
    #   cp scripts/sync_models.py ~/.hermes/scripts/
    #   hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
 Silent exit when nothing changed. Prints a summary + restarts the gateway when
 the model list differs.
 """
 import json
 import os
 import subprocess
 import sys
 import urllib.error
 import urllib.request
 from pathlib import Path
 # ── CONFIGURE THESE ──────────────────────────────────────────────────
 ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
 PROVIDER_NAME = "intelligence_router"
 GATEWAY_SERVICE = "hermes-gateway"
 # ─────────────────────────────────────────────────────────────────────
 MODELS_URL = f"{ROUTER_BASE_URL}/models"
 CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
 def fetch_models() -> list[str] | None:
    try:
        req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read().decode())
        models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
        return models if models else None
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
        print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
        return None
 def read_current_models() -> list[str]:
    """Parse current custom_providers entries for our provider name."""
    if not CONFIG_PATH.exists():
        return []
    models = []
    with open(CONFIG_PATH) as f:
        content = f.read()
    idx = content.find("custom_providers:")
    if idx == -1:
        return []
    section = content[idx:]
    lines = section.split("\n")
    current_entry = {}
    for line in lines:
        s = line.strip()
        if s.startswith("- base_url:"):
            if current_entry.get("name") == PROVIDER_NAME:
                m = current_entry.get("model", "")
                if m:
                    models.append(m)
            current_entry = {}
        elif s.startswith("model:"):
            current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
        elif s.startswith("name:"):
            current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
        elif s and not s.startswith(("-", " ")):
            break
    # Don't forget the last entry
    if current_entry.get("name") == PROVIDER_NAME:
        m = current_entry.get("model", "")
        if m:
            models.append(m)
    return sorted(models)
 def generate_block(models: list[str]) -> str:
    lines = ["custom_providers:"]
    for m in models:
        lines.append(f"- base_url: {ROUTER_BASE_URL}")
        lines.append(f"  model: {m}")
        lines.append(f"  name: {PROVIDER_NAME}")
    return "\n".join(lines)
 def replace_section(models: list[str]) -> bool:
    """Replace the custom_providers section in-place. Returns True if changed."""
    if not CONFIG_PATH.exists():
        return False
    import yaml
    content = CONFIG_PATH.read_text()
    config = yaml.safe_load(content)
    new_entries = [
        {"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
        for m in models
    ]
    if config.get("custom_providers") == new_entries:
        return False
    config["custom_providers"] = new_entries
    CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
    return True
 def restart_gateway() -> bool:
    try:
        r = subprocess.run(
            ["systemctl", "--user", "restart", GATEWAY_SERVICE],
            capture_output=True, text=True, timeout=30,
        )
        return r.returncode == 0
    except Exception:
        return False
 def main():
    models = fetch_models()
    if models is None:
        sys.exit(1)
    current = read_current_models()
    if current == models:
        print("Model list unchanged — nothing to do.")
        return
    added = set(models) - set(current)
    removed = set(current) - set(models)
    print(f"Model list changed! {len(current)} → {len(models)} models")
    if added:
        print(f"  Added:   {sorted(added)}")
    if removed:
        print(f"  Removed: {sorted(removed)}")
    if not replace_section(models):
        print("ERROR: Config update failed")
        return
    print("Config updated. Restarting gateway...")
    if restart_gateway():
        print("Gateway restarted successfully.")
    else:
        print("WARNING: Gateway restart failed — restart manually.")
 if __name__ == "__main__":
    main()
--- a/sidecar/app.py
+++ b/sidecar/app.py
@@ -5,6 +5,7 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
 import os
 import asyncio
 import signal as signal_module
 import threading
 from contextlib import asynccontextmanager
 from typing import Optional
@@ -17,98 +18,41 @@ from sidecar.manifest import load_manifest
 # Configuration from environment
 MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
 SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
-LLAMA_SERVER_PORT = 8081
+LLAMA_SERVER_PORT = 8080
 LLAMA_STDERR_LOG = os.path.join(
    os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
 )
 # Global state
 _llama_server_process: Optional[asyncio.subprocess.Process] = None
 _active_profile: Optional[str] = None
-_switch_lock = asyncio.Lock()  # Use asyncio.Lock to avoid blocking the event loop
+_switch_lock = threading.Lock()  # Use threading.Lock for compatibility with TestClient
 @asynccontextmanager
 async def lifespan(app: FastAPI):
    """Manage sidecar lifecycle — no default model loaded."""
-    print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True)
+    print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
    yield
    # Cleanup: kill llama-server if running
    global _llama_server_process
    if _llama_server_process:
-        await _kill_llama_server()
+        _kill_llama_server()
 app = FastAPI(lifespan=lifespan)
-def _close_stderr_log():
+def _kill_llama_server():
-    """Close the stderr log file handle if it's still attached to the process."""
+    """Kill the llama-server subprocess."""
    global _llama_server_process
-    if _llama_server_process is not None:
+    if _llama_server_process and _llama_server_process.returncode is None:
        fh = getattr(_llama_server_process, "_stderr_fh", None)
        if fh is not None and not fh.closed:
            try:
                fh.close()
            except Exception:
                pass
 async def _kill_llama_server():
    """Kill the llama-server subprocess and wait for it to fully terminate.
    This MUST be async because process.wait() is a coroutine. The synchronous
    version was calling .wait() without await, creating an unawaited coroutine
    object — the old process was never actually waited on, so it could still
    hold GPU VRAM when the new server started.
    """
    global _llama_server_process
    if _llama_server_process is None or _llama_server_process.returncode is not None:
        _close_stderr_log()
        return
    try:
        _llama_server_process.send_signal(signal_module.SIGTERM)
        try:
-            await asyncio.wait_for(_llama_server_process.wait(), timeout=10)
+            _llama_server_process.send_signal(signal_module.SIGTERM)
        except asyncio.TimeoutError:
            _llama_server_process.kill()
            try:
-                await asyncio.wait_for(_llama_server_process.wait(), timeout=5)
+                _llama_server_process.wait(timeout=5)
            except asyncio.TimeoutError:
-                pass
+                _llama_server_process.kill()
-    except Exception:
+        except Exception:
-        pass
+            pass
    finally:
        _llama_server_process = None
        _close_stderr_log()
 def _flag_value(value) -> str:
    """Convert a manifest flag value to a llama-server CLI argument string.
    YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
    safe_load.  llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
    """
    if isinstance(value, bool):
        return "on" if value else "off"
    return str(value)
 def _flag_key(key: str) -> str:
    """Convert a manifest flag key to the correct llama-server CLI flag name.
    llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
    but YAML keys often use underscores.  Some flags were also renamed
    across llama.cpp versions (e.g. --n-ctx → --ctx-size).
    This function normalises underscores to hyphens and applies known renames.
    """
    normalized = key.replace("_", "-")
    FLAG_RENAMES = {
        "n-ctx": "ctx-size",
    }
    return FLAG_RENAMES.get(normalized, normalized)
 async def _start_llama_server(profile: dict):
@@ -116,39 +60,29 @@ async def _start_llama_server(profile: dict):
    global _llama_server_process
    # Kill any existing process
-    await _kill_llama_server()
+    _kill_llama_server()
    # Build command from profile flags
-    cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"]
+    cmd = ["llama-server"]
    cmd += ["--model", profile["model_path"]]
    cmd += ["--port", str(LLAMA_SERVER_PORT)]
    cmd += ["--host", "0.0.0.0"]
    for key, value in profile.get("flags", {}).items():
-        cmd += ["--" + _flag_key(key), _flag_value(value)]
+        cmd += ["--" + key, str(value)]
-    print(f"Starting llama-server: {' '.join(cmd)}", flush=True)
+    print(f"Starting llama-server: {' '.join(cmd)}")
    # Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
    stderr_fh = open(LLAMA_STDERR_LOG, "w")
    _llama_server_process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
-        stderr=stderr_fh,
+        stderr=asyncio.subprocess.DEVNULL,
    )
    # Keep a reference so we can close the handle later
    _llama_server_process._stderr_fh = stderr_fh  # type: ignore[attr-defined]
    return _llama_server_process
-async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5):
-    """Poll llama-server readiness via /v1/models endpoint.
-    Returns True on success.  On failure, dumps the captured stderr (if any)
-    so the user can see why llama-server crashed.
-    """
+async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
+    """Poll llama-server readiness via /v1/models endpoint."""
    import httpx
-    for attempt in range(max_retries):
+    for _ in range(max_retries):
        try:
            async with httpx.AsyncClient(timeout=2.0) as client:
                resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
@@ -157,27 +91,6 @@ async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5)
        except Exception:
            pass
        await asyncio.sleep(interval)
    # Flush and close the stderr handle so all data is on disk before we read
    _close_stderr_log()
    # ── Dump stderr for diagnosis ──────────────────────────────────────
    print("llama-server did NOT become ready — dumping stderr:", flush=True)
    try:
        with open(LLAMA_STDERR_LOG) as f:
            for line in f:
                print(f"  {line.rstrip()}", flush=True)
    except FileNotFoundError:
        print("  (stderr log not found — process may not have started)", flush=True)
    # Also log exit code if the process died
    global _llama_server_process
    if _llama_server_process and _llama_server_process.returncode is not None:
        print(
            f"llama-server exited with code {_llama_server_process.returncode}",
            flush=True,
        )
    return False
@@ -211,7 +124,7 @@ async def switch_model(payload: SwitchRequest):
    """Stop current llama-server, start new one with the given profile, wait for readiness."""
    global _active_profile
-    async with _switch_lock:
+    with _switch_lock:
        # Validate profile_id
        profiles = load_manifest(MANIFEST_PATH)
        if profiles is None:
@@ -240,7 +153,7 @@ async def switch_model(payload: SwitchRequest):
            }
        # Start the new model
-        await _kill_llama_server()
+        _kill_llama_server()
        _active_profile = None
        await _start_llama_server(profile)