7 changed files with 236 additions and 891 deletions
--- a/.hermes/plans/add-model-profiles.md
+++ b/.hermes/plans/add-model-profiles.md
@@ -1,94 +0,0 @@
-# Plan: Add user model profiles to manifest.yaml
-# Date: 2025-06-15
-# Author: Hermes Agent
-# Status: DRAFT
-
-## Context
-User has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM).
-The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters.
-Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.
-
-## Hardware constraints
-- GPU: RTX 3090, 24GB VRAM
-- All profiles use `n_gpu_layers: 999` (offload all layers that fit)
-- All profiles use `flash-attn: on`
-- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
-- `min_p` set to 0.0 across all profiles (community standard for these models)
-
-## Models to add (excluding mmproj files)
-
-### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
-Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20
-
-| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
-|---|-----------|------|-------|-----------|------|-------|------------|
-| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
-| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
-| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |
-
-### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
-Google official: temp 1.0 / top_p 0.95 / top_k 64
-
-| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
-|---|-----------|------|------|-------|-----------|------|-------|
-| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
-| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
-| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
-| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |
-
-### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
-MoE, 4B active. Same sampling as 12B family.
-
-| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
-|---|-----------|------|------|-------|-----------|------|-------|------------|
-| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
-| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
-| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |
-
-### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
-**MTP note:** Unsloth benchmark shows MTP is net-negative on single 3090. Including MTP profile anyway since user has the file.
-
-| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
-|---|-----------|------|------|-------|-----------|------|-------|-----|
-| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
-| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
-| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
-| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |
-
-### Uncensored models (apply censored family params)
-
-| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
-|---|-----------|------|------|-------|-----------|------|-------|----------|
-| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
-| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
-| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
-| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |
-
-**Total: 18 profiles**
-
-## Flag mapping (manifest → llama-server CLI)
-
-Manifest flags use camelCase keys that the sidecar passes as `--key value` to llama-server:
-
-| Manifest key | CLI flag | Type | Notes |
-|-------------|----------|------|-------|
-| n_gpu_layers | --n-gpu-layers | int | 999 = all |
-| n_ctx | --ctx-size | int | context window |
-| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
-| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
-| flash_attn | --flash-attn | bool | true/on |
-| temp | --temp | float | sampling |
-| top_p | --top-p | float | sampling |
-| top_k | --top-k | int | sampling |
-| repeat_penalty | --repeat-penalty | float | sampling |
-| min_p | --min-p | float | 0.0 |
-| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
-| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
-| presence_penalty | --presence-penalty | float | 0.0 |
-
-## Actions
-1. Create branch `feature/add-model-profiles` from master
-2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
-3. Update `deploy/manifest.yaml` with all 18 profiles
-4. Update tests if flag structure requires it
-5. Run tests, commit
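Note on the flag mapping table in the removed plan: below is a minimal Python sketch of how a sidecar could turn a profile's `flags` mapping into llama-server arguments. The sidecar's real translation code does not appear in this diff, so `cli_flag`, `build_args`, and `SPECIAL_FLAGS` are hypothetical names, and rendering booleans as `on`/`off` is an assumption. Only the key-to-flag pairs come from the table (`n_ctx` maps to `--ctx-size`; the other keys swap underscores for hyphens).

```python
# Hypothetical helper names; only the key-to-flag pairs come from the plan's table.
SPECIAL_FLAGS = {"n_ctx": "--ctx-size"}  # the one table entry that is not a plain rename


def cli_flag(key: str) -> str:
    """Map a manifest key to a llama-server flag name."""
    if key in SPECIAL_FLAGS:
        return SPECIAL_FLAGS[key]
    # Keys that are already hyphenated pass through unchanged.
    return "--" + key.replace("_", "-")


def build_args(flags: dict) -> list[str]:
    """Flatten a profile's flags mapping into an argv-style list."""
    args: list[str] = []
    for key, value in flags.items():
        # bool is a subclass of int, so test for it explicitly.
        # Rendering True/False as on/off is an assumption of this sketch.
        if isinstance(value, bool):
            value = "on" if value else "off"
        args += [cli_flag(key), str(value)]
    return args


example = {
    "n_ctx": 65536,
    "n_gpu_layers": 999,
    "cache-type-k": "q8_0",
    "flash-attn": True,
    "top_p": 0.95,
}
print(build_args(example))
```

Because hyphenated keys pass through unchanged, the sketch accepts both spellings that appear in `deploy/manifest.yaml` after this change (`n_ctx`, `top_p` alongside `cache-type-k`, `flash-attn`). PyYAML's default YAML 1.1 resolver loads an unquoted `on` as boolean `True`, which is why the boolean branch exists.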
--- a/deploy/llm-sidecar.service
+++ b/deploy/llm-sidecar.service
@@ -12,7 +12,6 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
 Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
 Environment=SIDECAR_PORT=8080
 Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
-Environment=PYTHONUNBUFFERED=1

 # Use the sidecar's venv — install deps via deploy/README.md
 ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080
--- a/deploy/manifest.yaml
+++ b/deploy/manifest.yaml
@@ -11,88 +11,141 @@
 # All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
 # KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM

+- id: qwen-3-8b
+  name: "Qwen 3 8B"
+  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
+  flags:
+    n_ctx: 8192
+    n_gpu_layers: 35
+
+- id: qwen-3-8b-long
+  name: "Qwen 3 8B (Long Context)"
+  model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
+  flags:
+    n_ctx: 32768
+    n_gpu_layers: 20
+
+- id: llama-4-maverick
+  name: "Llama 4 Maverick"
+  model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
+  flags:
+    n_ctx: 8192
+    n_gpu_layers: 35
+
 # --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
 # Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
 - id: qwen36-27b-balanced-64k
  name: "Qwen3.6-27B Balanced 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: qwen36-27b-thinking-64k
  name: "Qwen3.6-27B Thinking 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: qwen36-27b-extended-128k
  name: "Qwen3.6-27B Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
+    n_ctx: 131072
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.05
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

-# --- Gemma 4 12B (Q6_K_XL ~8.5 GB) ---
+# --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
 # Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
 - id: gemma4-12b-standard-q6-64k
  name: "Gemma4 12B Standard Q6 64K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 64
+    top_p: 0.95
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: gemma4-12b-extended-q6-128k
  name: "Gemma4 12B Extended Q6 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
+    n_ctx: 131072
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 64
+    top_p: 0.95
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
+    presence-penalty: 0.0
+
+- id: gemma4-12b-compact-iq4-64k
+  name: "Gemma4 12B Compact IQ4 64K"
+  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
+  flags:
+    n_ctx: 65536
+    n_gpu_layers: 999
+    cache-type-k: q8_0
+    cache-type-v: q8_0
+    flash-attn: on
+    temp: 1.0
+    top_p: 0.95
+    top_k: 64
+    repeat-penalty: 1.0
+    min_p: 0.0
+    presence-penalty: 0.0
+
+- id: gemma4-12b-compact-long-128k
+  name: "Gemma4 12B Compact IQ4 128K"
+  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
+  flags:
+    n_ctx: 131072
+    n_gpu_layers: 999
+    cache-type-k: q8_0
+    cache-type-v: q8_0
+    flash-attn: on
+    temp: 1.0
+    top_p: 0.95
+    top_k: 64
+    repeat-penalty: 1.0
+    min_p: 0.0
    presence-penalty: 0.0

 # --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
@@ -101,97 +154,48 @@
  name: "Gemma4 26B Balanced 64K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 64
+    top_p: 0.95
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: gemma4-26b-extended-128k
  name: "Gemma4 26B Extended 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
+    n_ctx: 131072
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 64
+    top_p: 0.95
+    top_k: 64
    repeat-penalty: 1.15
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: gemma4-26b-ultra-long-iq4-128k
  name: "Gemma4 26B Ultra-Long IQ4 128K"
  model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
+    n_ctx: 131072
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 64
+    top_p: 0.95
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
-    presence-penalty: 0.0
-
-- id: gemma4-26b-q5-64k
-  name: "Gemma4 26B Q5 64K"
-  model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
-  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
-    cache-type-k: q8_0
-    cache-type-v: q8_0
-    flash-attn: on
-    temp: 1.0
-    top-p: 0.95
-    top-k: 64
-    repeat-penalty: 1.0
-    min-p: 0.0
-    presence-penalty: 0.0
-
-# --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
-- id: gemma4-26b-compact-iq4-64k
-  name: "Gemma4 26B Compact IQ4 64K"
-  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
-  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
-    cache-type-k: q8_0
-    cache-type-v: q8_0
-    flash-attn: on
-    temp: 1.0
-    top-p: 0.95
-    top-k: 64
-    repeat-penalty: 1.0
-    min-p: 0.0
-    presence-penalty: 0.0
-
-- id: gemma4-26b-compact-long-128k
-  name: "Gemma4 26B Compact IQ4 128K"
-  model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
-  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
-    cache-type-k: q4_0
-    cache-type-v: q4_0
-    flash-attn: on
-    temp: 1.0
-    top-p: 0.95
-    top-k: 64
-    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 # --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
@@ -201,144 +205,95 @@
  name: "Qwen3.6-35B Fast 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: qwen36-35b-thinking-64k
  name: "Qwen3.6-35B Thinking 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: qwen36-35b-extended-128k
  name: "Qwen3.6-35B Extended 128K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
+    n_ctx: 131072
+    n_gpu_layers: 999
    cache-type-k: q4_0
    cache-type-v: q4_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

-# --- Qwen3.6-35B-A3B MTP variant ---
-- id: qwen36-35b-mtp-fast-64k
-  name: "Qwen3.6-35B MTP Fast 64K"
-  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
-  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
-    cache-type-k: q8_0
-    cache-type-v: q8_0
-    flash-attn: on
-    temp: 0.6
-    top-p: 0.95
-    top-k: 20
-    repeat-penalty: 1.0
-    min-p: 0.0
-    presence-penalty: 0.0
-
-- id: qwen36-35b-mtp-extended-128k
-  name: "Qwen3.6-35B MTP Extended 128K"
-  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
-  flags:
-    ctx-size: 131072
-    n-gpu-layers: 999
-    cache-type-k: q4_0
-    cache-type-v: q4_0
-    flash-attn: on
-    temp: 0.6
-    top-p: 0.95
-    top-k: 20
-    repeat-penalty: 1.0
-    min-p: 0.0
-    presence-penalty: 0.0
-
-# --- Uncensored models ---
+# --- Uncensored models (apply censored family params) ---
 - id: qwen36-35b-hauhau-aggressive-64k
  name: "Qwen3.6-35B HauhauCS Aggressive 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: qwen36-35b-genesis-apex-64k
  name: "Qwen3.6-35B Genesis APEX 64K"
  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 0.6
-    top-p: 0.95
-    top-k: 20
+    top_p: 0.95
+    top_k: 20
    repeat-penalty: 1.0
-    min-p: 0.0
-    presence-penalty: 0.0
-
-- id: qwen36-35b-genesis-mtp-apex-64k
-  name: "Qwen3.6-35B Genesis MTP APEX 64K"
-  model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
-  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
-    cache-type-k: q8_0
-    cache-type-v: q8_0
-    flash-attn: on
-    temp: 0.6
-    top-p: 0.95
-    top-k: 20
-    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0

 - id: gemma4-26b-hauhau-balanced-64k
  name: "Gemma4 26B HauhauCS Balanced 64K"
  model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
  flags:
-    ctx-size: 65536
-    n-gpu-layers: 999
+    n_ctx: 65536
+    n_gpu_layers: 999
    cache-type-k: q8_0
    cache-type-v: q8_0
    flash-attn: on
    temp: 1.0
-    top-p: 0.95
-    top-k: 64
+    top_p: 0.95
+    top_k: 64
    repeat-penalty: 1.0
-    min-p: 0.0
+    min_p: 0.0
    presence-penalty: 0.0
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -7,8 +7,8 @@ services:
    ports:
      - "9001:9000"
    environment:
-      - SIDECAR_URL=http://10.0.4.11:8080
-      - MAIN_PC_URL=http://10.0.4.11:8081/v1
+      - SIDECAR_URL=http://10.0.4.11:8081
+      - MAIN_PC_URL=http://10.0.4.11:8080/v1
      - FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
    restart: unless-stopped
--- a/main.py
+++ b/main.py
@@ -141,49 +141,6 @@ def complete_switch():
            _switching_event.set()


-async def _background_switch(requested_model: str):
-    """Run a model switch in the background.
-
-    The sidecar POST is awaited but the caller gets an immediate SSE stream
-    so Hermes Desktop doesn't timeout waiting for the first response.
-
-    Called via asyncio.create_task() so it runs concurrently with the
-    SSE stream being sent to the client.
-    """
-    try:
-        async with httpx.AsyncClient(timeout=120.0) as client:
-            switch_resp = await client.post(
-                f"{SIDECAR_URL}/models/switch",
-                json={"profile_id": requested_model},
-            )
-            switch_result = switch_resp.json()
-            if switch_result.get("status") == "ready":
-                print(
-                    f"SWITCH SUCCESS: profile={requested_model}",
-                    flush=True,
-                )
-            else:
-                circuit_record_failure()
-                print(
-                    f"SWITCH FAILED: profile={requested_model}, "
-                    f"status={switch_result.get('status')}, "
-                    f"message={switch_result.get('message', '(no message)')}",
-                    flush=True,
-                )
-    except Exception as e:
-        circuit_record_failure()
-        print(
-            f"SWITCH EXCEPTION: profile={requested_model}, "
-            f"error={type(e).__name__}: {e}",
-            flush=True,
-        )
-    finally:
-        # Signal all queued requests so they can proceed (and fall
-        # through to the fallback chain if the switch failed).
-        complete_switch()
-        drain_queue()
-
-
 # ─── App ─────────────────────────────────────────────────────────────────────
 @asynccontextmanager
 async def lifespan(app: FastAPI):
@@ -196,12 +153,6 @@ app = FastAPI(lifespan=lifespan)


 # ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
-@app.get("/v1")
-async def v1_root():
-    """OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
-    return {"object": "list", "data": []}
-
-
 @app.get("/v1/models")
 async def get_models():
    """OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
@@ -228,170 +179,6 @@ async def health():
    return {"status": "router_online"}


-# ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
-# These endpoints are probed by Hermes Desktop to validate/identify the
-# provider before allowing model switching.  Without them the desktop
-# returns 503 and refuses to switch models.
-
-@app.get("/v1/models/{model_id:path}")
-async def get_single_model(model_id: str):
-    """OpenAI-compatible single model query.  Proxied via Sidecar model list."""
-    async with httpx.AsyncClient(timeout=5.0) as client:
-        try:
-            resp = await client.get(f"{SIDECAR_URL}/models/available")
-            profiles = resp.json()
-        except Exception:
-            return JSONResponse(
-                status_code=503,
-                content={"error": "Sidecar unavailable", "data": []},
-            )
-
-    for p in profiles:
-        if p.get("id") == model_id:
-            return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
-    return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
-
-
-@app.get("/api/tags")
-async def ollama_tags():
-    """Ollama-compatible model list for Hermes Desktop discovery."""
-    async with httpx.AsyncClient(timeout=5.0) as client:
-        try:
-            resp = await client.get(f"{SIDECAR_URL}/models/available")
-            profiles = resp.json()
-        except Exception:
-            return JSONResponse(content={"models": []})
-
-    models = []
-    for p in profiles:
-        models.append({
-            "name": p.get("id", ""),
-            "model": p.get("id", ""),
-            "modified_at": "2025-01-01T00:00:00Z",
-            "size": 0,
-            "digest": "",
-            "details": {"format": "gguf", "family": p.get("name", "llm")},
-        })
-    return {"models": models}
-
-
-@app.get("/api/show")
-async def ollama_show_get(model: str = ""):
-    """Ollama-compatible model info for Hermes Desktop discovery (GET variant).
-
-    Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
-    """
-    return await _ollama_show_lookup(model)
-
-
-@app.post("/api/show")
-async def ollama_show_post(request: Request):
-    """Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
-    body = await request.body()
-    body_data = json.loads(body) if body else {}
-    model_name = body_data.get("model", "")
-    return await _ollama_show_lookup(model_name)
-
-
-async def _ollama_show_lookup(model_name: str):
-    """Shared logic for Ollama /api/show model info lookup.
-
-    When model_name is empty string (Hermes Desktop probe with no model field),
-    returns the currently-active profile's info so the desktop can determine
-    the correct context size. Previously returned 404, causing Hermes Desktop
-    to default to 256k context.
-    """
-    async with httpx.AsyncClient(timeout=5.0) as client:
-        try:
-            resp = await client.get(f"{SIDECAR_URL}/models/available")
-            profiles = resp.json()
-            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
-            status = status_resp.json()
-        except Exception:
-            return JSONResponse(status_code=404, content={"error": "model not found"})
-
-    # If no model specified, return the currently-active profile's info
-    active_id = status.get("active_profile")
-    if not model_name and active_id:
-        for p in profiles:
-            if p.get("id") == active_id:
-                flags = p.get("flags", {})
-                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
-                return {
-                    "modelfile": "",
-                    "parameters": f"num_ctx {ctx_size}",
-                    "template": "",
-                    "details": {
-                        "format": "gguf",
-                        "family": p.get("name", "llm"),
-                        "parameter_size": ctx_size,
-                    },
-                    "model_info": {"id": p.get("id", "")},
-                }
-
-    for p in profiles:
-        if p.get("id") == model_name:
-            # Extract actual context size from the profile's flags
-            flags = p.get("flags", {})
-            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
-            return {
-                "modelfile": "",
-                "parameters": f"num_ctx {ctx_size}",
-                "template": "",
-                "details": {
-                    "format": "gguf",
-                    "family": p.get("name", "llm"),
-                    "parameter_size": ctx_size,
-                },
-                "model_info": {"id": p.get("id", "")},
-            }
-    return JSONResponse(status_code=404, content={"error": "model not found"})
-
-
-@app.get("/api/v1/models")
-async def ollama_v1_models():
-    """Ollama /api/v1/models redirect — return same list as /v1/models."""
-    return await get_models()
-
-
-@app.get("/v1/props")
-async def llama_cpp_props():
-    """llama.cpp discovery endpoint for Hermes Desktop."""
-    async with httpx.AsyncClient(timeout=3.0) as client:
-        try:
-            resp = await client.get(f"{SIDECAR_URL}/models/status")
-            status = resp.json()
-        except Exception:
-            status = {"active_profile": None, "llama_server_running": False}
-
-    # Report the currently-running server version / capabilities
-    return {
-        "props": {
-            "version": 1,
-            "total_slots": 1,
-            "chat_endpoint": "/v1/chat/completions",
-            "completion_endpoint": "/v1/completions",
-            "embedding_endpoint": "/v1/embeddings",
-            "rerank_endpoint": "",
-            "health_endpoint": "/health",
-        },
-        "active_profile": status.get("active_profile"),
-        "server_running": status.get("llama_server_running", False),
-    }
-
-
-@app.get("/props")
-async def llm_props():
-    """Legacy llama.cpp discovery endpoint (same as /v1/props)."""
-    return await llama_cpp_props()
-
-
-@app.get("/version")
-async def llm_version():
-    """llama.cpp version endpoint for Hermes Desktop."""
-    return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
-
-
 # ─── GET /models/status ──────────────────────────────────────────────────────
 @app.get("/models/status")
 async def router_model_status():
@@ -471,9 +258,13 @@ async def proxy(
    # ── Determine target URL ──────────────────────────────────────────────
    target_url: Optional[str] = None
    error: Optional[str] = None
-    sidecar_status = None

-    # Always query the sidecar first (to detect recovery even when circuit is open)
+    # Circuit breaker check
+    if not await circuit_breaker_check():
+        error = "circuit_open"
+    else:
+        # Query Sidecar for active model
+        sidecar_status = None
        async with httpx.AsyncClient(timeout=3.0) as client:
            try:
                resp = await client.get(f"{SIDECAR_URL}/models/status")
@@ -481,63 +272,44 @@ async def proxy(
                    sidecar_status = resp.json()
                    circuit_reset()
            except Exception:
-            pass  # Handled below
+                error = "sidecar_down"

        if sidecar_status is None:
            circuit_record_failure()
            error = "sidecar_down"
-    elif not await circuit_breaker_check():
-        # Sidecar is up but circuit is open from prior switch failures
-        # Only block the switch — allow routing to already-active backend
-        error = "circuit_open"
-        if sidecar_status.get("llama_server_running"):
-            target_url = f"{MAIN_PC_BASE}/{path}"
        else:
-        # Both sidecar reachable and circuit closed — proceed normally
+            # Extract requested model from request body
            body = await request.body()
            body_data = json.loads(body) if body else {}
            requested_model = body_data.get("model")

-        # Only trigger model switches for actual chat/completion POST requests.
-        # GET probes, /api/show lookups, and other non-chat endpoints should
-        # never trigger a switch — they just read current state.
-        is_chat_request = (
-            request.method == "POST"
-            and path in ("v1/chat/completions", "v1/completions")
-        )
-
            if requested_model and sidecar_status.get("active_profile") == requested_model:
                target_url = f"{MAIN_PC_BASE}/{path}"
-        elif requested_model and is_chat_request:
-            # All requests during a model switch get an immediate SSE streaming
-            # response so clients (Hermes Desktop) don't timeout while waiting
-            # for the model to load (10-30s).  The switch runs in a background
-            # task; the SSE stream yields progress events, then pipes through
-            # the actual response once the backend model is ready.
+            else:
+                # Trigger switch
+                if requested_model:
+                    # Check if a switch is already in progress
                    current_switch = await wait_for_switch()
-            if current_switch is None:
-                # No switch in progress — start one in the background
-                await start_switch()
-                asyncio.create_task(_background_switch(requested_model))

-            # Queue this request — signals when switch completes
+                    if current_switch is not None and not current_switch.is_set():
+                        # Another request started the switch — queue this one
                        try:
                            wait_evt = await queue_request()
                        except HTTPException as he:
                            raise

-            # Build request headers once
-            req_headers = dict(request.headers)
-            req_headers.pop("host", None)
-
+                        # SSE progress while waiting
                        async def stream_with_sse():
                            sse_gen = sse_progress_stream(wait_evt)
                            try:
                                await wait_evt.wait()
                                async for sse_chunk in sse_gen:
                                    yield sse_chunk
-                    # Send actual request to Main PC
+                                complete_switch()
+                                drain_queue()
                                async with httpx.AsyncClient(timeout=60.0) as c:
+                                    req_headers = dict(request.headers)
+                                    req_headers.pop("host", None)
                                    async with c.stream(
                                        request.method,
                                        f"{MAIN_PC_BASE}/{path}",
@@ -546,47 +318,8 @@ async def proxy(
                                    ) as resp:
                                        async for chunk in resp.aiter_bytes():
                                            yield chunk
-                except Exception:
-                    # Main PC unreachable (switch failed or server died) —
-                    # try fallback chain
-                    yield _sse_format(
-                        "error",
-                        {"message": "Backend unreachable, trying fallback..."},
-                    )
-                    # Try OpenRouter
-                    if OPENROUTER_API_KEY:
-                        try:
-                            fb_headers = dict(req_headers)
-                            fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
-                            async with httpx.AsyncClient(timeout=60.0) as c:
-                                async with c.stream(
-                                    request.method,
-                                    f"{OPENROUTER_BASE}/{path}",
-                                    content=body,
-                                    headers=fb_headers,
-                                ) as resp:
-                                    async for chunk in resp.aiter_bytes():
-                                        yield chunk
-                                    return
-                        except Exception:
-                            pass
-                    # Fallback to LXC SLM
-                    try:
-                        async with httpx.AsyncClient(timeout=60.0) as c:
-                            async with c.stream(
-                                request.method,
-                                f"{FALLBACK_SLM_URL}/{path}",
-                                content=body,
-                                headers=req_headers,
-                            ) as resp:
-                                async for chunk in resp.aiter_bytes():
-                                    yield chunk
-                    except Exception:
-                        yield _sse_format(
-                            "error",
-                            {"message": "All backends unavailable"},
-                        )
                            finally:
+                                # Clean up sse_gen
                                try:
                                    await sse_gen.aclose()
                                except Exception:
@ -597,12 +330,24 @@ async def proxy(
                            media_type="text/event-stream",
                        )

-        else:
-            # No model in request body (probe/GET/non-chat request) —
-            # route to the currently active backend when available,
-            # or fall through to the fallback chain.
-            if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"):
+                    # First request triggers the switch
+                    await start_switch()  # Create event for tracking
+                    try:
+                        async with httpx.AsyncClient(timeout=120.0) as client:
+                            switch_resp = await client.post(
+                                f"{SIDECAR_URL}/models/switch",
+                                json={"profile_id": requested_model},
+                            )
+                        switch_result = switch_resp.json()
+                        if switch_result.get("status") == "ready":
+                            complete_switch()
+                            drain_queue()
                            target_url = f"{MAIN_PC_BASE}/{path}"
+                        else:
+                            error = "switch_failed"
+                    except Exception as e:
+                        circuit_record_failure()
+                        error = f"switch_error: {str(e)}"

    # ── Fallback chain ────────────────────────────────────────────────────
    if target_url is None:
@ -633,11 +378,8 @@ async def proxy(
                        request.method, target,
                        content=body, headers=headers,
                    ) as resp:
-                        if resp.status_code != 200:
-                            print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
                        async for chunk in resp.aiter_bytes():
                            yield chunk
-
                return StreamingResponse(gen(), status_code=200)

            resp = await client.request(
@ -646,12 +388,6 @@ async def proxy(
                content=body,
                headers=headers,
            )
-            if resp.status_code != 200:
-                body_preview = resp.content[:500].decode("utf-8", errors="replace")
-                print(
-                    f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
-                    flush=True,
-                )
            return Response(
                content=resp.content,
                status_code=resp.status_code,
@ -661,11 +397,8 @@ async def proxy(
    primary_result = None
    try:
        primary_result = await execute(target_url)
-    except Exception as e:
-        print(
-            f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
-            flush=True,
-        )  # Falls through to fallback chain
+    except Exception:
+        pass  # Falls through to fallback chain
    if primary_result is not None:
        return primary_result

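Review note on the hunk above: the primary-backend handler now discards the exception with a bare `pass` before falling through to the fallback chain, so the reason for a fallback is no longer visible in the logs. What follows is a minimal sketch, not the proxy's actual code, of a chain that tries backends in order and records each failure. The names `first_success`, `primary`, and `fallback` are hypothetical stand-ins for the real `execute(target_url)` calls.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("proxy")

# Hypothetical type: a zero-argument coroutine function that returns a response body.
Backend = Callable[[], Awaitable[str]]


async def first_success(backends: list[tuple[str, Backend]]) -> Optional[str]:
    """Try each backend in order and return the first successful result.

    Failures are logged rather than silently dropped, so a fallback
    always leaves a trace of why the earlier backend was skipped.
    """
    for name, call in backends:
        try:
            return await call()
        except Exception as exc:
            log.warning("backend %s failed: %s: %s", name, type(exc).__name__, exc)
    return None  # every backend failed


async def primary() -> str:
    # Stand-in for the Main PC request; always fails in this sketch.
    raise ConnectionError("main PC unreachable")


async def fallback() -> str:
    # Stand-in for a fallback backend that answers.
    return "ok from fallback"


result = asyncio.run(first_success([("primary", primary), ("fallback", fallback)]))
```

Here `result` holds the fallback's response, and the primary's `ConnectionError` is reported through the `proxy` logger instead of being lost.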
--- a/scripts/sync_models.py
+++ b/scripts/sync_models.py
@ -1,161 +0,0 @@
-#!/usr/bin/env python3
-"""
-Sync intelligence-router model list into Hermes custom_providers.
-
-Usage:
-    # One-shot: discover models from the router and update Hermes config
-    python3 scripts/sync_models.py
-
-    # Cron mode (auto): set up via:
-    #   cp scripts/sync_models.py ~/.hermes/scripts/
-    #   hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
-
-Silent exit when nothing changed. Prints a summary + restarts the gateway when
-the model list differs.
-"""
-
-import json
-import os
-import subprocess
-import sys
-import urllib.error
-import urllib.request
-from pathlib import Path
-
-# ── CONFIGURE THESE ──────────────────────────────────────────────────
-ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
-PROVIDER_NAME = "intelligence_router"
-GATEWAY_SERVICE = "hermes-gateway"
-# ─────────────────────────────────────────────────────────────────────
-
-MODELS_URL = f"{ROUTER_BASE_URL}/models"
-CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
-
-
-def fetch_models() -> list[str] | None:
-    try:
-        req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
-        with urllib.request.urlopen(req, timeout=10) as resp:
-            data = json.loads(resp.read().decode())
-        models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
-        return models if models else None
-    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
-        print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
-        return None
-
-
-def read_current_models() -> list[str]:
-    """Parse current custom_providers entries for our provider name."""
-    if not CONFIG_PATH.exists():
-        return []
-
-    models = []
-    with open(CONFIG_PATH) as f:
-        content = f.read()
-
-    idx = content.find("custom_providers:")
-    if idx == -1:
-        return []
-
-    section = content[idx:]
-    lines = section.split("\n")
-
-    current_entry = {}
-    for line in lines:
-        s = line.strip()
-        if s.startswith("- base_url:"):
-            if current_entry.get("name") == PROVIDER_NAME:
-                m = current_entry.get("model", "")
-                if m:
-                    models.append(m)
-            current_entry = {}
-        elif s.startswith("model:"):
-            current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
-        elif s.startswith("name:"):
-            current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
-        elif s and not s.startswith(("-", " ")):
-            break
-
-    # Don't forget the last entry
-    if current_entry.get("name") == PROVIDER_NAME:
-        m = current_entry.get("model", "")
-        if m:
-            models.append(m)
-
-    return sorted(models)
-
-
-def generate_block(models: list[str]) -> str:
-    lines = ["custom_providers:"]
-    for m in models:
-        lines.append(f"- base_url: {ROUTER_BASE_URL}")
-        lines.append(f"  model: {m}")
-        lines.append(f"  name: {PROVIDER_NAME}")
-    return "\n".join(lines)
-
-
-def replace_section(models: list[str]) -> bool:
-    """Replace the custom_providers section in-place. Returns True if changed."""
-    if not CONFIG_PATH.exists():
-        return False
-
-    import yaml
-
-    content = CONFIG_PATH.read_text()
-    config = yaml.safe_load(content)
-
-    new_entries = [
-        {"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
-        for m in models
-    ]
-
-    if config.get("custom_providers") == new_entries:
-        return False
-
-    config["custom_providers"] = new_entries
-    CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
-    return True
-
-
-def restart_gateway() -> bool:
-    try:
-        r = subprocess.run(
-            ["systemctl", "--user", "restart", GATEWAY_SERVICE],
-            capture_output=True, text=True, timeout=30,
-        )
-        return r.returncode == 0
-    except Exception:
-        return False
-
-
-def main():
-    models = fetch_models()
-    if models is None:
-        sys.exit(1)
-
-    current = read_current_models()
-    if current == models:
-        print("Model list unchanged — nothing to do.")
-        return
-
-    added = set(models) - set(current)
-    removed = set(current) - set(models)
-    print(f"Model list changed! {len(current)} → {len(models)} models")
-    if added:
-        print(f"  Added:   {sorted(added)}")
-    if removed:
-        print(f"  Removed: {sorted(removed)}")
-
-    if not replace_section(models):
-        print("ERROR: Config update failed")
-        return
-
-    print("Config updated. Restarting gateway...")
-    if restart_gateway():
-        print("Gateway restarted successfully.")
-    else:
-        print("WARNING: Gateway restart failed — restart manually.")
-
-
-if __name__ == "__main__":
-    main()
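Review note on the sidecar/app.py changes that follow: the new synchronous `_kill_llama_server()` calls `_llama_server_process.wait(timeout=5)`. However, `asyncio.subprocess.Process.wait()` is a coroutine and takes no `timeout` argument, which is the problem the removed docstring describes. What follows is a minimal sketch of an awaited shutdown, assuming a POSIX host. The helper name `terminate` and the stand-in child process are hypothetical and not part of this change.

```python
import asyncio
import signal
import sys


async def terminate(proc: asyncio.subprocess.Process, grace: float = 10.0) -> int:
    """Send SIGTERM, await the exit, and escalate to SIGKILL after `grace` seconds."""
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do
    proc.send_signal(signal.SIGTERM)
    try:
        # wait() is a coroutine, so the timeout is applied with asyncio.wait_for.
        return await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        proc.kill()
        return await proc.wait()


async def demo() -> int:
    # Hypothetical stand-in child; the sidecar would launch llama-server here.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await terminate(proc, grace=5.0)


exit_code = asyncio.run(demo())
```

Because the exit is awaited, the caller only proceeds once the old process has terminated, which matters when the next process needs the VRAM the old one held.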
--- a/sidecar/app.py
+++ b/sidecar/app.py
@ -5,6 +5,7 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
 import os
 import asyncio
 import signal as signal_module
+import threading
 from contextlib import asynccontextmanager
 from typing import Optional

@ -17,98 +18,41 @@ from sidecar.manifest import load_manifest
 # Configuration from environment
 MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
 SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
-LLAMA_SERVER_PORT = 8081
-LLAMA_STDERR_LOG = os.path.join(
-    os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
-)
+LLAMA_SERVER_PORT = 8080

 # Global state
 _llama_server_process: Optional[asyncio.subprocess.Process] = None
 _active_profile: Optional[str] = None
-_switch_lock = asyncio.Lock()  # Use asyncio.Lock to avoid blocking the event loop
+_switch_lock = threading.Lock()  # Use threading.Lock for compatibility with TestClient


@asynccontextmanager
 async def lifespan(app: FastAPI):
    """Manage sidecar lifecycle — no default model loaded."""
-    print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True)
+    print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
    yield
    # Cleanup: kill llama-server if running
    global _llama_server_process
    if _llama_server_process:
-        await _kill_llama_server()
+        _kill_llama_server()


 app = FastAPI(lifespan=lifespan)


-def _close_stderr_log():
-    """Close the stderr log file handle if it's still attached to the process."""
+def _kill_llama_server():
+    """Kill the llama-server subprocess."""
    global _llama_server_process
-    if _llama_server_process is not None:
-        fh = getattr(_llama_server_process, "_stderr_fh", None)
-        if fh is not None and not fh.closed:
-            try:
-                fh.close()
-            except Exception:
-                pass
-
-
-async def _kill_llama_server():
-    """Kill the llama-server subprocess and wait for it to fully terminate.
-
-    This MUST be async because process.wait() is a coroutine. The synchronous
-    version was calling .wait() without await, creating an unawaited coroutine
-    object — the old process was never actually waited on, so it could still
-    hold GPU VRAM when the new server started.
-    """
-    global _llama_server_process
-    if _llama_server_process is None or _llama_server_process.returncode is not None:
-        _close_stderr_log()
-        return
-
+    if _llama_server_process and _llama_server_process.returncode is None:
        try:
            _llama_server_process.send_signal(signal_module.SIGTERM)
            try:
-            await asyncio.wait_for(_llama_server_process.wait(), timeout=10)
+                _llama_server_process.wait(timeout=5)
            except asyncio.TimeoutError:
                _llama_server_process.kill()
-            try:
-                await asyncio.wait_for(_llama_server_process.wait(), timeout=5)
-            except asyncio.TimeoutError:
-                pass
        except Exception:
            pass
-    finally:
        _llama_server_process = None
-        _close_stderr_log()
-
-
-def _flag_value(value) -> str:
-    """Convert a manifest flag value to a llama-server CLI argument string.
-
-    YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
-    safe_load.  llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
-    """
-    if isinstance(value, bool):
-        return "on" if value else "off"
-    return str(value)
-
-
-def _flag_key(key: str) -> str:
-    """Convert a manifest flag key to the correct llama-server CLI flag name.
-
-    llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
-    but YAML keys often use underscores.  Some flags were also renamed
-    across llama.cpp versions (e.g. --n-ctx → --ctx-size).
-
-    This function normalises underscores to hyphens and applies known renames.
-    """
-    normalized = key.replace("_", "-")
-    FLAG_RENAMES = {
-        "n-ctx": "ctx-size",
-    }
-    return FLAG_RENAMES.get(normalized, normalized)


 async def _start_llama_server(profile: dict):
@ -116,39 +60,29 @@ async def _start_llama_server(profile: dict):
    global _llama_server_process

    # Kill any existing process
-    await _kill_llama_server()
+    _kill_llama_server()

    # Build command from profile flags
-    cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"]
+    cmd = ["llama-server"]
    cmd += ["--model", profile["model_path"]]
    cmd += ["--port", str(LLAMA_SERVER_PORT)]
-    cmd += ["--host", "0.0.0.0"]
    for key, value in profile.get("flags", {}).items():
-        cmd += ["--" + _flag_key(key), _flag_value(value)]
+        cmd += ["--" + key, str(value)]

-    print(f"Starting llama-server: {' '.join(cmd)}", flush=True)
-
-    # Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
-    stderr_fh = open(LLAMA_STDERR_LOG, "w")
+    print(f"Starting llama-server: {' '.join(cmd)}")
    _llama_server_process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
-        stderr=stderr_fh,
+        stderr=asyncio.subprocess.DEVNULL,
    )
-    # Keep a reference so we can close the handle later
-    _llama_server_process._stderr_fh = stderr_fh  # type: ignore[attr-defined]
    return _llama_server_process


-async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5):
-    """Poll llama-server readiness via /v1/models endpoint.
-
-    Returns True on success.  On failure, dumps the captured stderr (if any)
-    so the user can see why llama-server crashed.
-    """
+async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
+    """Poll llama-server readiness via /v1/models endpoint."""
    import httpx

-    for attempt in range(max_retries):
+    for _ in range(max_retries):
        try:
            async with httpx.AsyncClient(timeout=2.0) as client:
                resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
@ -157,27 +91,6 @@ async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5)
        except Exception:
            pass
        await asyncio.sleep(interval)
-
-    # Flush and close the stderr handle so all data is on disk before we read
-    _close_stderr_log()
-
-    # ── Dump stderr for diagnosis ──────────────────────────────────────
-    print("llama-server did NOT become ready — dumping stderr:", flush=True)
-    try:
-        with open(LLAMA_STDERR_LOG) as f:
-            for line in f:
-                print(f"  {line.rstrip()}", flush=True)
-    except FileNotFoundError:
-        print("  (stderr log not found — process may not have started)", flush=True)
-
-    # Also log exit code if the process died
-    global _llama_server_process
-    if _llama_server_process and _llama_server_process.returncode is not None:
-        print(
-            f"llama-server exited with code {_llama_server_process.returncode}",
-            flush=True,
-        )
-
    return False


@ -211,7 +124,7 @@ async def switch_model(payload: SwitchRequest):
    """Stop current llama-server, start new one with the given profile, wait for readiness."""
    global _active_profile

-    async with _switch_lock:
+    with _switch_lock:
        # Validate profile_id
        profiles = load_manifest(MANIFEST_PATH)
        if profiles is None:
@ -240,7 +153,7 @@ async def switch_model(payload: SwitchRequest):
            }

        # Start the new model
-        await _kill_llama_server()
+        _kill_llama_server()
        _active_profile = None
        await _start_llama_server(profile)