Compare commits

19 Commits

Author SHA1 Message Date
root
b3ac21b2c0 fix: first request no longer blocks on model switch — uses background task + SSE
The FIRST request that triggers a model switch was blocking the HTTP response
for 10-30s while waiting for the sidecar to load the model. Hermes Desktop's
client timed out during this wait, causing 'nothing happens' on new session.

Fix: refactored the proxy handler so ALL requests during a model switch use
the same SSE streaming pattern (immediate 200, progress events, then actual
response piped through after switch completes). The switch now runs as a
background asyncio task via create_task().

- Added _background_switch() — runs POST /models/switch in background task
  with complete_switch() + drain_queue() in finally block
- All switch-triggering requests go through queue_request() + StreamingResponse
- SSE generator now falls through to OpenRouter/LXC if Main PC unreachable
  (switch failure case) instead of hanging indefinitely

Sidecar fixes from previous commit:
- _kill_llama_server() is now async with proper await on process termination
- _switch_lock changed from threading.Lock to asyncio.Lock()
2026-06-18 00:10:48 +00:00
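The pattern this commit describes (respond immediately, run the switch as a background task, stream progress, then deliver the result) can be sketched with plain asyncio. This is a minimal sketch and not the router's code: `fake_switch`, `sse_stream`, `collect`, and the event payloads are hypothetical names, and the sleep stands in for the sidecar's `POST /models/switch` call. In the real handler a generator like this would be wrapped in a `StreamingResponse`.

```python
import asyncio
import json


async def fake_switch(delay: float) -> str:
    """Hypothetical stand-in for the sidecar POST /models/switch call."""
    await asyncio.sleep(delay)
    return "ready"


async def sse_stream(delay: float, poll_interval: float = 0.01):
    """Yield SSE events while the switch runs in a background task."""
    task = asyncio.create_task(fake_switch(delay))
    # The first event is produced before the switch has finished.
    yield f"data: {json.dumps({'status': 'switching'})}\n\n"
    while not task.done():
        await asyncio.sleep(poll_interval)
        yield ": keep-alive\n\n"  # SSE comment line, ignored by clients
    yield f"data: {json.dumps({'status': task.result()})}\n\n"


async def collect(delay: float) -> list[str]:
    """Drain the generator into a list so the event order can be inspected."""
    return [event async for event in sse_stream(delay)]


events = asyncio.run(collect(0.05))
```

The client sees the first event right away, so its timeout clock is no longer tied to how long the model takes to load.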
root
45dd793b69 fix: sidecar process kill was not awaiting wait() — old server held GPU VRAM
- _kill_llama_server() was a sync function that called process.wait() without awaiting it,
  so the call only created a coroutine object that was then discarded. The old llama-server
  was never waited on to release GPU memory before the new one started, causing OOM on
  rapid model switches. Fixed by making the function async: await wait() with a 10s
  SIGTERM timeout, then fall back to SIGKILL.

- Changed _switch_lock from threading.Lock to asyncio.Lock() to prevent event loop
  deadlock during long switch operations.

- Router proxy: only trigger model switches for POST /v1/chat/completions and
  /v1/completions. Non-chat endpoints (GET probes, /api/show) no longer trigger
  unwanted model reloads.

- _ollama_show_lookup: return active profile context size when model_name is empty.
  Previously returned 404, causing Hermes Desktop to default to 256k context.

- Always drain_queue() + complete_switch() after switch failure so queued requests
  don't hang forever waiting on a never-set switching event.
2026-06-17 23:49:57 +00:00
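The SIGTERM-then-SIGKILL shutdown described in this commit might look roughly like the following. It is a sketch under assumptions: `stop_process` and `demo` are hypothetical names, a sleeping Python child stands in for llama-server, and the sidecar's actual `_kill_llama_server()` is not shown in this diff.

```python
import asyncio
import sys


async def stop_process(process: asyncio.subprocess.Process, timeout: float = 10.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    if process.returncode is not None:
        return process.returncode  # already exited
    process.terminate()
    try:
        # Awaiting wait() is the point: a bare process.wait() only builds a coroutine.
        return await asyncio.wait_for(process.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        process.kill()
        return await process.wait()


async def demo() -> int:
    # A sleeping Python child stands in for llama-server here.
    child = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_process(child, timeout=5.0)


exit_code = asyncio.run(demo())
```

Only once `wait()` has returned is the old process known to be gone, which is when it is safe to start a new server that needs the same VRAM.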
root
7e9b3f43e1 fix: circuit breaker deadlock — always query sidecar for status
The circuit breaker opened after MAX_RECOVERY_ATTEMPTS failures but
was never reset because the sidecar status query (which calls
circuit_reset()) was skipped when the circuit was open.  This caused
a permanent deadlock: all subsequent requests went to the LXC fallback
with no recovery possible.

Fix: always query the sidecar for /models/status, even when the
circuit is open.  If the sidecar responds successfully, reset the
circuit.  The circuit breaker now only prevents the SWITCH operation,
not the status health check.  If a model is already running when the
circuit is open, route to it directly.
2026-06-16 22:09:16 +00:00
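The deadlock is easy to reproduce in a toy model. The sketch below is illustrative only: `CircuitBreaker`, `route_before_fix`, and `route_after_fix` are hypothetical names, and the threshold of 3 is an assumed value for MAX_RECOVERY_ATTEMPTS.

```python
class CircuitBreaker:
    """Opens after `max_failures` failures (the value 3 is an assumption)."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def record_failure(self) -> None:
        self.failures += 1

    def reset(self) -> None:
        self.failures = 0


def route_before_fix(breaker: CircuitBreaker, sidecar_up: bool) -> str:
    # Old behaviour: an open circuit skips the probe, so reset() is unreachable.
    if breaker.is_open:
        return "fallback"
    if sidecar_up:
        breaker.reset()
        return "primary"
    breaker.record_failure()
    return "fallback"


def route_after_fix(breaker: CircuitBreaker, sidecar_up: bool) -> str:
    # New behaviour: the probe result is consulted on every request.
    if sidecar_up:
        breaker.reset()
        return "primary"
    breaker.record_failure()
    return "fallback"
```

After three failed probes both breakers are open, but only the second version returns to the primary backend once the sidecar responds again.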
root
bcf45129f1 fix: add --host 0.0.0.0 to llama-server command
llama-server defaults to binding on 127.0.0.1 (localhost only).
When the router runs on a separate Docker host (10.0.4.100), all
chat completion requests fail with:

  PROXY EXCEPTION on primary http://10.0.4.11:8081/v1/chat/completions:
    ConnectError: All connection attempts failed

Added --host 0.0.0.0 after --port so llama-server listens on all
network interfaces, reachable from the Docker host.
2026-06-16 21:46:07 +00:00
root
75248741e7 fix: log exceptions on primary proxy target
When the primary request to llama-server (10.0.4.11:8081) raises an
exception (connection refused, timeout), it was silently swallowed by
the catch-all except block, making it look like a sidecar/switch
failure when it was actually a network-level error.

Now prints: 'PROXY EXCEPTION on primary <url>: <ExceptionType>: <msg>'
2026-06-16 21:32:36 +00:00
root
5c1753dfef fix: log sidecar switch failures + fix scoping bug in proxy handler
Two changes to debug the fallback-to-LXC issue:

1. Added debug logging on switch failure: prints the profile name,
   sidecar response status, and error message. Also calls
   circuit_record_failure() so subsequent requests don't wait the
   full 120-second timeout before falling back.

2. Fixed scoping bug: sidecar_status was only defined inside the
   else branch of the circuit breaker check. Initialized to None
   at function scope alongside target_url and error to prevent
   NameError when circuit is open.
2026-06-16 21:25:42 +00:00
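The scoping bug in item 2 comes from a name that is only bound on one branch. A reduced illustration, with hypothetical function names and a made-up status dict:

```python
def lookup_buggy(circuit_open: bool):
    if circuit_open:
        error = "circuit_open"
    else:
        sidecar_status = {"active_profile": "demo"}
    # Raises UnboundLocalError (a NameError subclass) when circuit_open is True.
    return sidecar_status


def lookup_fixed(circuit_open: bool):
    sidecar_status = None  # bound at function scope, as in the fix
    if circuit_open:
        error = "circuit_open"
    else:
        sidecar_status = {"active_profile": "demo"}
    return sidecar_status
```

Initializing the name up front lets later code test `sidecar_status is None` instead of crashing.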
root
f2e62f60e6 fix: /api/show GET support, /v1 root handler, and proxy debug logging
Three changes to debug and fix Hermes Desktop integration:

1. /api/show: Added GET handler alongside existing POST handler.
   Hermes Desktop probes with GET ?model=xxx, not POST body.
   Refactored shared lookup logic into _ollama_show_lookup().

2. /v1 root: Added handler returning basic info. Hermes Desktop
   probes this URL and ERR_CONNECTION_REFUSED was blocking
   full provider validation.

3. Proxy execute(): Added debug logging for non-200 responses.
   Prints the backend URL, status code, and first 500 bytes of body
   to help diagnose why llama-server returns 400 on
   /v1/chat/completions.
2026-06-16 21:16:45 +00:00
root
d935339280 fix: report actual profile context size in /api/show probe endpoint
Hermes Desktop reads the context size from /api/show's 'parameters'
field.  This was hardcoded to 'num_ctx 4096' for every model, causing
'context too small' errors when the user's system prompt + conversation
exceeded 4K tokens.

Now extracts the actual ctx-size from the profile's flags and returns
the correct value (e.g. 'num_ctx 131072' for the 128K profiles).
2026-06-16 21:04:40 +00:00
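The lookup this commit describes reduces to reading the context size from the profile's flags and formatting it as an Ollama-style parameter line. A small sketch, assuming a profile is a dict with a `flags` mapping; `num_ctx_parameter` is a hypothetical helper name, and the fallback to the older `n_ctx` key and to 4096 mirrors what the router diff does:

```python
def num_ctx_parameter(profile: dict, default: int = 4096) -> str:
    """Build the 'parameters' string that the /api/show probe reports."""
    flags = profile.get("flags", {})
    # Prefer the llama-server flag name, then the legacy manifest key.
    ctx = flags.get("ctx-size", flags.get("n_ctx", default))
    return f"num_ctx {ctx}"
```

For a 128K profile this yields `num_ctx 131072`, where the hardcoded version reported `num_ctx 4096` for every model.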
root
4ee85972ec fix: convert underscores to hyphens in llama-server flag names, fix n_ctx→ctx-size rename
Two changes to fix 'error: invalid argument: --n-ctx' during model switch:

1. sidecar/app.py: Added _flag_key() converter that normalises
   underscores to hyphens in flag names and handles the n_ctx→ctx-size
   rename. The code now converts e.g. n_gpu_layers → n-gpu-layers,
   top_p → top-p, top_k → top-k, min_p → min-p before passing to
   llama-server CLI.

2. deploy/manifest.yaml: Updated all 20 profiles to use correct
   llama-server flag names: n_ctx→ctx-size, n_gpu_layers→n-gpu-layers,
   top_p→top-p, top_k→top-k, min_p→min-p. All flags now use hyphens,
   matching what llama-server actually accepts.
2026-06-16 20:54:32 +00:00
root
1551c281c2 fix: move llama-server stderr log from /tmp to working dir (ReadWritePaths compat)
The sidecar systemd service has ProtectSystem=strict and
ReadWritePaths=/home/bigt/AI/llm, making /tmp read-only. Writing
/tmp/llama-server-stderr.log failed with EROFS.

Changed LLAMA_STDERR_LOG to os.path.join(dirname(MANIFEST_PATH), ...),
resolving to /home/bigt/AI/llm/llama-server-stderr.log, which is
within the allowed ReadWritePaths.
2026-06-16 20:36:10 +00:00
root
37fee5341e fix: capture llama-server stderr, fix YAML boolean flag conversion, reduce polling timeout
Four fixes for the model-not-loading bug:

1. **YAML boolean → CLI flag bug**: YAML parses 'on'/'off'/'yes'/'no' as Python
   bools. str(True)='True' which is INVALID for llama.cpp's --flash-attn flag
   (expects 'on'/'off'/'auto'). Added _flag_value() converter that maps bools
   to 'on'/'off' strings.

2. **llama-server stderr was DEVNULL**: All error messages (bad model path,
   OOM, invalid flag) were invisible. Now captured to /tmp/llama-server-stderr.log
   and dumped to the sidecar log on failure.

3. **Reduce polling timeout**: 240 retries × 0.5s = 120s hang. Reduced to
   60 retries × 0.5s = 30s. Still dumps stderr + exit code on failure.

4. **Manifest VRAM fix**: gemma4-26b-compact-long-128k used q8_0 KV cache at
   128K context (~24GB on 24GB RTX 3090 — borderline OOM). Changed to q4_0
   (~18GB, comfortable).
2026-06-16 00:06:45 +00:00
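The timeout arithmetic in item 3 follows directly from a bounded polling loop. The sketch below is an assumption about the loop's shape, not the sidecar's code; `wait_until_ready` and the injectable `sleep` parameter are hypothetical and exist so the worst case can be measured without actually waiting.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    retries: int = 60,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() up to `retries` times, sleeping `interval` seconds between tries."""
    for _ in range(retries):
        if is_ready():
            return True
        sleep(interval)
    return False


# Worst case with the new defaults: record the sleeps instead of really waiting.
slept: list[float] = []
ready = wait_until_ready(lambda: False, sleep=slept.append)
```

With 60 retries at 0.5 s the recorded sleeps add up to 30 s; the previous 240 retries would add up to 120 s.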
root
903f06c634 feat: add sync_models.py script to auto-update Hermes custom_providers from router model list 2026-06-15 21:10:36 +00:00
root
95c87a764b fix: remove non-existent models from manifest (qwen-3-8b, llama-4-maverick), add 3 newly discovered models 2026-06-15 16:38:17 +00:00
root
36abbf573e fix: unbuffer sidecar stdout so logs appear in journalctl 2026-06-15 16:25:58 +00:00
1e9305395e Fixed llama-server path 2026-06-15 17:01:53 +01:00
root
7e86a30bd8 fix: resolve port conflict between sidecar and llama-server
Sidecar and llama-server were both configured on port 8080, causing
llama-server to fail on startup (port already in use).

- sidecar/app.py: LLAMA_SERVER_PORT → 8081 (sidecar stays on 8080)
- docker-compose.yml: MAIN_PC_URL → port 8081 (router sends chat
  requests to llama-server, not the sidecar)
2026-06-15 15:31:31 +00:00
root
2c23faa4a1 fix: add probe endpoints and no-model fallback for Hermes Desktop compatibility
Hermes Desktop sends probe requests to validate providers before allowing
model switching. The router was returning 503 for all of these because
the catch-all proxy requires a 'model' field in the request body.

Added explicit handlers for:
- GET /v1/models/{model_id} — OpenAI single-model lookup
- GET /api/tags — Ollama model list discovery
- POST /api/show — Ollama model info
- GET /api/v1/models — Ollama-compatible model list
- GET /v1/props, GET /props — llama.cpp server properties
- GET /version — llama.cpp version

Also fixed the catch-all proxy to route requests with no model body to
the currently active backend instead of returning 503.
2026-06-15 15:22:15 +00:00
af12370632 changed llama-server location 2026-06-15 16:10:49 +01:00
root
1ef8a497f6 fix: update docker-compose.yml SIDECAR_URL to port 8080 2026-06-15 13:23:09 +00:00
7 changed files with 894 additions and 239 deletions

@ -0,0 +1,94 @@
# Plan: Add user model profiles to manifest.yaml
# Date: 2025-06-15
# Author: Hermes Agent
# Status: DRAFT
## Context
User has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090 24GB VRAM).
The intelligence-router manifest needs profiles for all models with researched llama.cpp parameters.
Research sourced from r/LocalLLaMA, HuggingFace model cards, and community blog posts.
## Hardware constraints
- GPU: RTX 3090, 24GB VRAM
- All profiles use `n_gpu_layers: 999` (offload all layers that fit)
- All profiles use `flash-attn: on`
- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24GB VRAM
- `min_p` set to 0.0 across all profiles (community standard for these models)
## Models to add (excluding mmproj files)
### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20
| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|-------|-----------|------|-------|------------|
| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |
### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
Google official: temp 1.0 / top_p 0.95 / top_k 64
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
|---|-----------|------|------|-------|-----------|------|-------|
| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |
### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
MoE, 4B active. Same sampling as 12B family.
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|------|-------|-----------|------|-------|------------|
| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |
### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
**MTP note:** Unsloth benchmark shows MTP is net-negative on single 3090. Including MTP profile anyway since user has the file.
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
|---|-----------|------|------|-------|-----------|------|-------|-----|
| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |
### Uncensored models (apply censored family params)
| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
|---|-----------|------|------|-------|-----------|------|-------|----------|
| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |
**Total: 18 profiles**
## Flag mapping (manifest → llama-server CLI)
Manifest flags use snake_case keys; the sidecar converts each one to a hyphenated `--key value` argument for llama-server (`n_ctx` is renamed to `--ctx-size`):
| Manifest key | CLI flag | Type | Notes |
|-------------|----------|------|-------|
| n_gpu_layers | --n-gpu-layers | int | 999 = all |
| n_ctx | --ctx-size | int | context window |
| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
| flash_attn | --flash-attn | bool | true/on |
| temp | --temp | float | sampling |
| top_p | --top-p | float | sampling |
| top_k | --top-k | int | sampling |
| repeat_penalty | --repeat-penalty | float | sampling |
| min_p | --min-p | float | 0.0 |
| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
| presence_penalty | --presence-penalty | float | 0.0 |
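The mapping in the table can be expressed as a short conversion routine. This is a sketch under assumptions rather than the sidecar's implementation: `build_args` is a hypothetical name, the only rename handled is `n_ctx` to `ctx-size`, and booleans are mapped to `on`/`off` because YAML loaders may parse a bare `on` as `True`.

```python
def build_args(flags: dict) -> list[str]:
    """Turn a manifest `flags` mapping into llama-server style CLI arguments."""
    args: list[str] = []
    for key, value in flags.items():
        # n_ctx is the one key that is renamed; the rest only swap "_" for "-".
        cli_key = "ctx-size" if key == "n_ctx" else key.replace("_", "-")
        # Check bool before anything else: str(True) would give "True".
        if isinstance(value, bool):
            cli_value = "on" if value else "off"
        else:
            cli_value = str(value)
        args += [f"--{cli_key}", cli_value]
    return args


example = build_args(
    {"n_ctx": 65536, "n_gpu_layers": 999, "flash_attn": True, "top_p": 0.95}
)
```

Keys that are already hyphenated pass through unchanged, so the same routine works for manifests written in either style.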
## Actions
1. Create branch `feature/add-model-profiles` from master
2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
3. Update `deploy/manifest.yaml` with all 18 profiles
4. Update tests if flag structure requires it
5. Run tests, commit

@ -12,6 +12,7 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
Environment=SIDECAR_PORT=8080
Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
Environment=PYTHONUNBUFFERED=1
# Use the sidecar's venv — install deps via deploy/README.md
ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080

@ -11,141 +11,88 @@
# All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
# KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM
- id: qwen-3-8b
name: "Qwen 3 8B"
model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
flags:
n_ctx: 8192
n_gpu_layers: 35
- id: qwen-3-8b-long
name: "Qwen 3 8B (Long Context)"
model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
flags:
n_ctx: 32768
n_gpu_layers: 20
- id: llama-4-maverick
name: "Llama 4 Maverick"
model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
flags:
n_ctx: 8192
n_gpu_layers: 35
# --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
# Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
- id: qwen36-27b-balanced-64k
name: "Qwen3.6-27B Balanced 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-27b-thinking-64k
name: "Qwen3.6-27B Thinking 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-27b-extended-128k
name: "Qwen3.6-27B Extended 128K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 0.6
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.05
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
# --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
# --- Gemma 4 12B (Q6_K_XL ~8.5 GB) ---
# Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
- id: gemma4-12b-standard-q6-64k
name: "Gemma4 12B Standard Q6 64K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-12b-extended-q6-128k
name: "Gemma4 12B Extended Q6 128K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min_p: 0.0
presence-penalty: 0.0
- id: gemma4-12b-compact-iq4-64k
name: "Gemma4 12B Compact IQ4 64K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
repeat-penalty: 1.0
min_p: 0.0
presence-penalty: 0.0
- id: gemma4-12b-compact-long-128k
name: "Gemma4 12B Compact IQ4 128K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
# --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
@ -154,48 +101,97 @@
name: "Gemma4 26B Balanced 64K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-extended-128k
name: "Gemma4 26B Extended 128K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
top-p: 0.95
top-k: 64
repeat-penalty: 1.15
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-ultra-long-iq4-128k
name: "Gemma4 26B Ultra-Long IQ4 128K"
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-q5-64k
name: "Gemma4 26B Q5 64K"
model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
# --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
- id: gemma4-26b-compact-iq4-64k
name: "Gemma4 26B Compact IQ4 64K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-compact-long-128k
name: "Gemma4 26B Compact IQ4 128K"
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
flags:
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 1.0
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
# --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
@ -205,95 +201,144 @@
name: "Qwen3.6-35B Fast 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-thinking-64k
name: "Qwen3.6-35B Thinking 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-extended-128k
name: "Qwen3.6-35B Extended 128K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
flags:
n_ctx: 131072
n_gpu_layers: 999
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 0.6
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
# --- Uncensored models (apply censored family params) ---
- id: qwen36-35b-hauhau-aggressive-64k
name: "Qwen3.6-35B HauhauCS Aggressive 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
# --- Qwen3.6-35B-A3B MTP variant ---
- id: qwen36-35b-mtp-fast-64k
name: "Qwen3.6-35B MTP Fast 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-mtp-extended-128k
name: "Qwen3.6-35B MTP Extended 128K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
flags:
ctx-size: 131072
n-gpu-layers: 999
cache-type-k: q4_0
cache-type-v: q4_0
flash-attn: on
temp: 0.6
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
# --- Uncensored models ---
- id: qwen36-35b-hauhau-aggressive-64k
name: "Qwen3.6-35B HauhauCS Aggressive 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-genesis-apex-64k
name: "Qwen3.6-35B Genesis APEX 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top_p: 0.95
top_k: 20
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min_p: 0.0
min-p: 0.0
presence-penalty: 0.0
- id: qwen36-35b-genesis-mtp-apex-64k
name: "Qwen3.6-35B Genesis MTP APEX 64K"
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
flags:
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 0.6
top-p: 0.95
top-k: 20
repeat-penalty: 1.0
min-p: 0.0
presence-penalty: 0.0
- id: gemma4-26b-hauhau-balanced-64k
name: "Gemma4 26B HauhauCS Balanced 64K"
model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
flags:
n_ctx: 65536
n_gpu_layers: 999
ctx-size: 65536
n-gpu-layers: 999
cache-type-k: q8_0
cache-type-v: q8_0
flash-attn: on
temp: 1.0
top_p: 0.95
top_k: 64
top-p: 0.95
top-k: 64
repeat-penalty: 1.0
min_p: 0.0
presence-penalty: 0.0
min-p: 0.0
presence-penalty: 0.0

@ -7,8 +7,8 @@ services:
ports:
- "9001:9000"
environment:
- SIDECAR_URL=http://10.0.4.11:8081
- MAIN_PC_URL=http://10.0.4.11:8080/v1
- SIDECAR_URL=http://10.0.4.11:8080
- MAIN_PC_URL=http://10.0.4.11:8081/v1
- FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
restart: unless-stopped

main.py

@ -141,6 +141,49 @@ def complete_switch():
    _switching_event.set()
async def _background_switch(requested_model: str):
    """Run a model switch in the background.
    The sidecar POST is awaited but the caller gets an immediate SSE stream
    so Hermes Desktop doesn't timeout waiting for the first response.
    Called via asyncio.create_task() so it runs concurrently with the
    SSE stream being sent to the client.
    """
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            switch_resp = await client.post(
                f"{SIDECAR_URL}/models/switch",
                json={"profile_id": requested_model},
            )
            switch_result = switch_resp.json()
            if switch_result.get("status") == "ready":
                print(
                    f"SWITCH SUCCESS: profile={requested_model}",
                    flush=True,
                )
            else:
                circuit_record_failure()
                print(
                    f"SWITCH FAILED: profile={requested_model}, "
                    f"status={switch_result.get('status')}, "
                    f"message={switch_result.get('message', '(no message)')}",
                    flush=True,
                )
    except Exception as e:
        circuit_record_failure()
        print(
            f"SWITCH EXCEPTION: profile={requested_model}, "
            f"error={type(e).__name__}: {e}",
            flush=True,
        )
    finally:
        # Signal all queued requests so they can proceed (and fall
        # through to the fallback chain if the switch failed).
        complete_switch()
        drain_queue()
# ─── App ─────────────────────────────────────────────────────────────────────
@asynccontextmanager
async def lifespan(app: FastAPI):
@ -153,6 +196,12 @@ app = FastAPI(lifespan=lifespan)
# ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
@app.get("/v1")
async def v1_root():
    """OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
    return {"object": "list", "data": []}
@app.get("/v1/models")
async def get_models():
    """OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
@ -179,6 +228,170 @@ async def health():
    return {"status": "router_online"}
# ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
# These endpoints are probed by Hermes Desktop to validate/identify the
# provider before allowing model switching. Without them the desktop
# returns 503 and refuses to switch models.
@app.get("/v1/models/{model_id:path}")
async def get_single_model(model_id: str):
    """OpenAI-compatible single model query. Proxied via Sidecar model list."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
        except Exception:
            return JSONResponse(
                status_code=503,
                content={"error": "Sidecar unavailable", "data": []},
            )
    for p in profiles:
        if p.get("id") == model_id:
            return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
    return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
@app.get("/api/tags")
async def ollama_tags():
    """Ollama-compatible model list for Hermes Desktop discovery."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
        except Exception:
            return JSONResponse(content={"models": []})
    models = []
    for p in profiles:
        models.append({
            "name": p.get("id", ""),
            "model": p.get("id", ""),
            "modified_at": "2025-01-01T00:00:00Z",
            "size": 0,
            "digest": "",
            "details": {"format": "gguf", "family": p.get("name", "llm")},
        })
    return {"models": models}
@app.get("/api/show")
async def ollama_show_get(model: str = ""):
    """Ollama-compatible model info for Hermes Desktop discovery (GET variant).
    Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
    """
    return await _ollama_show_lookup(model)
@app.post("/api/show")
async def ollama_show_post(request: Request):
    """Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
    body = await request.body()
    body_data = json.loads(body) if body else {}
    model_name = body_data.get("model", "")
    return await _ollama_show_lookup(model_name)
async def _ollama_show_lookup(model_name: str):
    """Shared logic for Ollama /api/show model info lookup.
    When model_name is empty string (Hermes Desktop probe with no model field),
    returns the currently-active profile's info so the desktop can determine
    the correct context size. Previously returned 404, causing Hermes Desktop
    to default to 256k context.
    """
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = status_resp.json()
        except Exception:
            return JSONResponse(status_code=404, content={"error": "model not found"})
    # If no model specified, return the currently-active profile's info
    active_id = status.get("active_profile")
    if not model_name and active_id:
        for p in profiles:
            if p.get("id") == active_id:
                flags = p.get("flags", {})
                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
                return {
                    "modelfile": "",
                    "parameters": f"num_ctx {ctx_size}",
                    "template": "",
                    "details": {
                        "format": "gguf",
                        "family": p.get("name", "llm"),
                        "parameter_size": ctx_size,
                    },
                    "model_info": {"id": p.get("id", "")},
                }
    for p in profiles:
        if p.get("id") == model_name:
            # Extract actual context size from the profile's flags
            flags = p.get("flags", {})
            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
            return {
                "modelfile": "",
                "parameters": f"num_ctx {ctx_size}",
                "template": "",
                "details": {
                    "format": "gguf",
                    "family": p.get("name", "llm"),
                    "parameter_size": ctx_size,
                },
                "model_info": {"id": p.get("id", "")},
            }
    return JSONResponse(status_code=404, content={"error": "model not found"})
@app.get("/api/v1/models")
async def ollama_v1_models():
    """Ollama /api/v1/models redirect — return same list as /v1/models."""
    return await get_models()
@app.get("/v1/props")
async def llama_cpp_props():
    """llama.cpp discovery endpoint for Hermes Desktop."""
    async with httpx.AsyncClient(timeout=3.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = resp.json()
        except Exception:
            status = {"active_profile": None, "llama_server_running": False}
    # Report the currently-running server version / capabilities
    return {
        "props": {
            "version": 1,
            "total_slots": 1,
            "chat_endpoint": "/v1/chat/completions",
            "completion_endpoint": "/v1/completions",
            "embedding_endpoint": "/v1/embeddings",
            "rerank_endpoint": "",
            "health_endpoint": "/health",
        },
        "active_profile": status.get("active_profile"),
        "server_running": status.get("llama_server_running", False),
    }
@app.get("/props")
async def llm_props():
    """Legacy llama.cpp discovery endpoint (same as /v1/props)."""
    return await llama_cpp_props()
@app.get("/version")
async def llm_version():
    """llama.cpp version endpoint for Hermes Desktop."""
    return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
# ─── GET /models/status ──────────────────────────────────────────────────────
@app.get("/models/status")
async def router_model_status():
@ -258,96 +471,138 @@ async def proxy(
# ── Determine target URL ──────────────────────────────────────────────
target_url: Optional[str] = None
error: Optional[str] = None
sidecar_status = None
# Circuit breaker check
if not await circuit_breaker_check():
# Always query the sidecar first (to detect recovery even when circuit is open)
async with httpx.AsyncClient(timeout=3.0) as client:
try:
resp = await client.get(f"{SIDECAR_URL}/models/status")
if resp.status_code == 200:
sidecar_status = resp.json()
circuit_reset()
except Exception:
pass # Handled below
if sidecar_status is None:
circuit_record_failure()
error = "sidecar_down"
elif not await circuit_breaker_check():
# Sidecar is up but circuit is open from prior switch failures
# Only block the switch — allow routing to already-active backend
error = "circuit_open"
if sidecar_status.get("llama_server_running"):
target_url = f"{MAIN_PC_BASE}/{path}"
else:
# Query Sidecar for active model
sidecar_status = None
async with httpx.AsyncClient(timeout=3.0) as client:
# Both sidecar reachable and circuit closed — proceed normally
body = await request.body()
body_data = json.loads(body) if body else {}
requested_model = body_data.get("model")
# Only trigger model switches for actual chat/completion POST requests.
# GET probes, /api/show lookups, and other non-chat endpoints should
# never trigger a switch — they just read current state.
is_chat_request = (
request.method == "POST"
and path in ("v1/chat/completions", "v1/completions")
)
if requested_model and sidecar_status.get("active_profile") == requested_model:
target_url = f"{MAIN_PC_BASE}/{path}"
elif requested_model and is_chat_request:
# All requests during a model switch get an immediate SSE streaming
# response so clients (Hermes Desktop) don't time out while waiting
# for the model to load (10-30s). The switch runs in a background
# task; the SSE stream yields progress events, then pipes through
# the actual response once the backend model is ready.
current_switch = await wait_for_switch()
if current_switch is None:
# No switch in progress — start one in the background
await start_switch()
asyncio.create_task(_background_switch(requested_model))
# Queue this request — signals when switch completes
try:
resp = await client.get(f"{SIDECAR_URL}/models/status")
if resp.status_code == 200:
sidecar_status = resp.json()
circuit_reset()
except Exception:
error = "sidecar_down"
wait_evt = await queue_request()
except HTTPException as he:
raise
if sidecar_status is None:
circuit_record_failure()
error = "sidecar_down"
else:
# Extract requested model from request body
body = await request.body()
body_data = json.loads(body) if body else {}
requested_model = body_data.get("model")
# Build request headers once
req_headers = dict(request.headers)
req_headers.pop("host", None)
if requested_model and sidecar_status.get("active_profile") == requested_model:
target_url = f"{MAIN_PC_BASE}/{path}"
else:
# Trigger switch
if requested_model:
# Check if a switch is already in progress
current_switch = await wait_for_switch()
if current_switch is not None and not current_switch.is_set():
# Another request started the switch — queue this one
async def stream_with_sse():
sse_gen = sse_progress_stream(wait_evt)
try:
await wait_evt.wait()
async for sse_chunk in sse_gen:
yield sse_chunk
# Send actual request to Main PC
async with httpx.AsyncClient(timeout=60.0) as c:
async with c.stream(
request.method,
f"{MAIN_PC_BASE}/{path}",
content=body,
headers=req_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
except Exception:
# Main PC unreachable (switch failed or server died) —
# try fallback chain
yield _sse_format(
"error",
{"message": "Backend unreachable, trying fallback..."},
)
# Try OpenRouter
if OPENROUTER_API_KEY:
try:
wait_evt = await queue_request()
except HTTPException as he:
raise
# SSE progress while waiting
async def stream_with_sse():
sse_gen = sse_progress_stream(wait_evt)
try:
await wait_evt.wait()
async for sse_chunk in sse_gen:
yield sse_chunk
complete_switch()
drain_queue()
async with httpx.AsyncClient(timeout=60.0) as c:
req_headers = dict(request.headers)
req_headers.pop("host", None)
async with c.stream(
request.method,
f"{MAIN_PC_BASE}/{path}",
content=body,
headers=req_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
finally:
# Clean up sse_gen
try:
await sse_gen.aclose()
except Exception:
pass
return StreamingResponse(
stream_with_sse(),
media_type="text/event-stream",
)
# First request triggers the switch
await start_switch() # Create event for tracking
fb_headers = dict(req_headers)
fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
async with httpx.AsyncClient(timeout=60.0) as c:
async with c.stream(
request.method,
f"{OPENROUTER_BASE}/{path}",
content=body,
headers=fb_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
return
except Exception:
pass
# Fallback to LXC SLM
try:
async with httpx.AsyncClient(timeout=120.0) as client:
switch_resp = await client.post(
f"{SIDECAR_URL}/models/switch",
json={"profile_id": requested_model},
)
switch_result = switch_resp.json()
if switch_result.get("status") == "ready":
complete_switch()
drain_queue()
target_url = f"{MAIN_PC_BASE}/{path}"
else:
error = "switch_failed"
except Exception as e:
circuit_record_failure()
error = f"switch_error: {str(e)}"
async with httpx.AsyncClient(timeout=60.0) as c:
async with c.stream(
request.method,
f"{FALLBACK_SLM_URL}/{path}",
content=body,
headers=req_headers,
) as resp:
async for chunk in resp.aiter_bytes():
yield chunk
except Exception:
yield _sse_format(
"error",
{"message": "All backends unavailable"},
)
finally:
try:
await sse_gen.aclose()
except Exception:
pass
return StreamingResponse(
stream_with_sse(),
media_type="text/event-stream",
)
else:
# No model in request body (probe/GET/non-chat request) —
# route to the currently active backend when available,
# or fall through to the fallback chain.
if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"):
target_url = f"{MAIN_PC_BASE}/{path}"
# ── Fallback chain ────────────────────────────────────────────────────
if target_url is None:
@ -378,8 +633,11 @@ async def proxy(
request.method, target,
content=body, headers=headers,
) as resp:
if resp.status_code != 200:
print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
async for chunk in resp.aiter_bytes():
yield chunk
return StreamingResponse(gen(), status_code=200)
resp = await client.request(
@ -388,6 +646,12 @@ async def proxy(
content=body,
headers=headers,
)
if resp.status_code != 200:
body_preview = resp.content[:500].decode("utf-8", errors="replace")
print(
f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
flush=True,
)
return Response(
content=resp.content,
status_code=resp.status_code,
@ -397,8 +661,11 @@ async def proxy(
primary_result = None
try:
primary_result = await execute(target_url)
except Exception:
pass # Falls through to fallback chain
except Exception as e:
print(
f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
flush=True,
) # Falls through to fallback chain
if primary_result is not None:
return primary_result
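
Reviewer sketch: the proxy comments above describe returning a streaming response immediately while the model switch runs in a background task. The snippet below is a minimal, standalone illustration of that pattern using only asyncio. `fake_switch`, `stream_during_switch`, and `sse_format` are hypothetical names, not the router's actual helpers, and the sleep stands in for the real POST to the sidecar.

```python
import asyncio
import json


def sse_format(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_switch(done: asyncio.Event, delay: float) -> None:
    """Hypothetical stand-in for the background model switch."""
    try:
        await asyncio.sleep(delay)  # pretend the model is loading
    finally:
        done.set()  # always release waiters, even if the switch fails


async def stream_during_switch(delay: float = 0.05, tick: float = 0.02):
    """Yield progress frames until the switch finishes, then a final frame."""
    done = asyncio.Event()
    # Keep a reference to the task so it cannot be garbage-collected mid-run.
    task = asyncio.create_task(fake_switch(done, delay))
    while not done.is_set():
        yield sse_format("progress", {"status": "switching"})
        try:
            await asyncio.wait_for(done.wait(), timeout=tick)
        except asyncio.TimeoutError:
            pass  # not ready yet; emit another progress frame
    await task  # surface any exception raised by the switch
    yield sse_format("ready", {"status": "ready"})


async def collect() -> list[str]:
    """Drain the generator into a list, as a client would read the stream."""
    return [frame async for frame in stream_during_switch()]
```

One design point worth checking against the real handler: the asyncio documentation recommends holding a reference to any task created with `create_task()`, since the event loop only keeps a weak reference to it.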

scripts/sync_models.py (new file, 161 lines)

@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Sync intelligence-router model list into Hermes custom_providers.
Usage:
# One-shot: discover models from the router and update Hermes config
python3 scripts/sync_models.py
# Cron mode (auto): set up via:
# cp scripts/sync_models.py ~/.hermes/scripts/
# hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
Silent exit when nothing changed. Prints a summary + restarts the gateway when
the model list differs.
"""
import json
import os
import subprocess
import sys
import urllib.error
import urllib.request
from pathlib import Path
# ── CONFIGURE THESE ──────────────────────────────────────────────────
ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
PROVIDER_NAME = "intelligence_router"
GATEWAY_SERVICE = "hermes-gateway"
# ─────────────────────────────────────────────────────────────────────
MODELS_URL = f"{ROUTER_BASE_URL}/models"
CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
def fetch_models() -> list[str] | None:
try:
req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
with urllib.request.urlopen(req, timeout=10) as resp:
data = json.loads(resp.read().decode())
models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
return models if models else None
except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
return None
def read_current_models() -> list[str]:
"""Parse current custom_providers entries for our provider name."""
if not CONFIG_PATH.exists():
return []
models = []
with open(CONFIG_PATH) as f:
content = f.read()
idx = content.find("custom_providers:")
if idx == -1:
return []
section = content[idx:]
lines = section.split("\n")[1:]  # skip the "custom_providers:" header line itself
current_entry = {}
for line in lines:
s = line.strip()
if s.startswith("- base_url:"):
if current_entry.get("name") == PROVIDER_NAME:
m = current_entry.get("model", "")
if m:
models.append(m)
current_entry = {}
elif s.startswith("model:"):
current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
elif s.startswith("name:"):
current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
elif line and not line.startswith(("-", " ")):
# An unindented, non-list line is the next top-level key: end of section.
# Test the raw line, not the stripped one, so indented entry keys don't end the scan.
break
# Don't forget the last entry
if current_entry.get("name") == PROVIDER_NAME:
m = current_entry.get("model", "")
if m:
models.append(m)
return sorted(models)
def generate_block(models: list[str]) -> str:
lines = ["custom_providers:"]
for m in models:
lines.append(f"- base_url: {ROUTER_BASE_URL}")
lines.append(f" model: {m}")
lines.append(f" name: {PROVIDER_NAME}")
return "\n".join(lines)
def replace_section(models: list[str]) -> bool:
"""Replace the custom_providers section in-place. Returns True if changed."""
if not CONFIG_PATH.exists():
return False
import yaml
content = CONFIG_PATH.read_text()
config = yaml.safe_load(content)
new_entries = [
{"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
for m in models
]
if config.get("custom_providers") == new_entries:
return False
config["custom_providers"] = new_entries
CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
return True
def restart_gateway() -> bool:
try:
r = subprocess.run(
["systemctl", "--user", "restart", GATEWAY_SERVICE],
capture_output=True, text=True, timeout=30,
)
return r.returncode == 0
except Exception:
return False
def main():
models = fetch_models()
if models is None:
sys.exit(1)
current = read_current_models()
if current == models:
print("Model list unchanged — nothing to do.")
return
added = set(models) - set(current)
removed = set(current) - set(models)
print(f"Model list changed! {len(current)} -> {len(models)} models")
if added:
print(f" Added: {sorted(added)}")
if removed:
print(f" Removed: {sorted(removed)}")
if not replace_section(models):
print("ERROR: Config update failed")
return
print("Config updated. Restarting gateway...")
if restart_gateway():
print("Gateway restarted successfully.")
else:
print("WARNING: Gateway restart failed — restart manually.")
if __name__ == "__main__":
main()
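
Reviewer sketch: `read_current_models()` above scans the `custom_providers:` section line by line. A standalone version of that scan is shown below so the section-boundary logic can be tested without touching `~/.hermes/config.yaml`. It assumes the block-style layout that `generate_block()` writes (list items at column 0, entry keys indented); `provider_models` is a hypothetical name, and this is not a general YAML parser.

```python
def provider_models(config_text: str, provider: str) -> list[str]:
    """Return sorted model names listed under custom_providers for `provider`."""
    lines = config_text.split("\n")
    try:
        start = lines.index("custom_providers:")
    except ValueError:
        return []  # no such section

    entries: list[dict[str, str]] = []
    for line in lines[start + 1:]:
        # An unindented, non-list line is the next top-level key.
        if line and not line.startswith(("-", " ")):
            break
        s = line.strip()
        if s.startswith("- "):
            entries.append({})  # a new list item begins
            s = s[2:].strip()
        if ":" in s and entries:
            key, value = s.split(":", 1)
            entries[-1][key.strip()] = value.strip().strip("'\"")

    return sorted(
        e["model"]
        for e in entries
        if e.get("name") == provider and e.get("model")
    )


SAMPLE = (
    "model: default\n"
    "custom_providers:\n"
    "- base_url: http://router.example/v1\n"
    "  model: b-model\n"
    "  name: intelligence_router\n"
    "- base_url: http://router.example/v1\n"
    "  model: a-model\n"
    "  name: intelligence_router\n"
    "- base_url: http://other.example/v1\n"
    "  model: z-model\n"
    "  name: other\n"
    "gateway:\n"
    "  port: 1\n"
)
```

Because `replace_section()` already loads the file with `yaml.safe_load`, reading the current list from the parsed config would be an alternative that avoids maintaining two parsers.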


@ -5,7 +5,6 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
import os
import asyncio
import signal as signal_module
import threading
from contextlib import asynccontextmanager
from typing import Optional
@ -18,41 +17,98 @@ from sidecar.manifest import load_manifest
# Configuration from environment
MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
LLAMA_SERVER_PORT = 8080
LLAMA_SERVER_PORT = 8081
LLAMA_STDERR_LOG = os.path.join(
os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
)
# Global state
_llama_server_process: Optional[asyncio.subprocess.Process] = None
_active_profile: Optional[str] = None
_switch_lock = threading.Lock() # Use threading.Lock for compatibility with TestClient
_switch_lock = asyncio.Lock() # Use asyncio.Lock to avoid blocking the event loop
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Manage sidecar lifecycle — no default model loaded."""
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True)
yield
# Cleanup: kill llama-server if running
global _llama_server_process
if _llama_server_process:
_kill_llama_server()
await _kill_llama_server()
app = FastAPI(lifespan=lifespan)
def _kill_llama_server():
"""Kill the llama-server subprocess."""
def _close_stderr_log():
"""Close the stderr log file handle if it's still attached to the process."""
global _llama_server_process
if _llama_server_process and _llama_server_process.returncode is None:
try:
_llama_server_process.send_signal(signal_module.SIGTERM)
if _llama_server_process is not None:
fh = getattr(_llama_server_process, "_stderr_fh", None)
if fh is not None and not fh.closed:
try:
_llama_server_process.wait(timeout=5)
fh.close()
except Exception:
pass
async def _kill_llama_server():
"""Kill the llama-server subprocess and wait for it to fully terminate.
This MUST be async because process.wait() is a coroutine. The synchronous
version was calling .wait() without await, creating an unawaited coroutine
object; the old process was never actually waited on, so it could still
hold GPU VRAM when the new server started.
"""
global _llama_server_process
if _llama_server_process is None or _llama_server_process.returncode is not None:
_close_stderr_log()
return
try:
_llama_server_process.send_signal(signal_module.SIGTERM)
try:
await asyncio.wait_for(_llama_server_process.wait(), timeout=10)
except asyncio.TimeoutError:
_llama_server_process.kill()
try:
await asyncio.wait_for(_llama_server_process.wait(), timeout=5)
except asyncio.TimeoutError:
_llama_server_process.kill()
except Exception:
pass
pass
except Exception:
pass
finally:
_llama_server_process = None
_close_stderr_log()
def _flag_value(value) -> str:
"""Convert a manifest flag value to a llama-server CLI argument string.
YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
safe_load. llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
"""
if isinstance(value, bool):
return "on" if value else "off"
return str(value)
def _flag_key(key: str) -> str:
"""Convert a manifest flag key to the correct llama-server CLI flag name.
llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
but YAML keys often use underscores. Some flags were also renamed
across llama.cpp versions (e.g. --n-ctx -> --ctx-size).
This function normalises underscores to hyphens and applies known renames.
"""
normalized = key.replace("_", "-")
FLAG_RENAMES = {
"n-ctx": "ctx-size",
}
return FLAG_RENAMES.get(normalized, normalized)
async def _start_llama_server(profile: dict):
@ -60,29 +116,39 @@ async def _start_llama_server(profile: dict):
global _llama_server_process
# Kill any existing process
_kill_llama_server()
await _kill_llama_server()
# Build command from profile flags
cmd = ["llama-server"]
cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"]
cmd += ["--model", profile["model_path"]]
cmd += ["--port", str(LLAMA_SERVER_PORT)]
cmd += ["--host", "0.0.0.0"]
for key, value in profile.get("flags", {}).items():
cmd += ["--" + key, str(value)]
cmd += ["--" + _flag_key(key), _flag_value(value)]
print(f"Starting llama-server: {' '.join(cmd)}")
print(f"Starting llama-server: {' '.join(cmd)}", flush=True)
# Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
stderr_fh = open(LLAMA_STDERR_LOG, "w")
_llama_server_process = await asyncio.create_subprocess_exec(
*cmd,
stdout=asyncio.subprocess.DEVNULL,
stderr=asyncio.subprocess.DEVNULL,
stderr=stderr_fh,
)
# Keep a reference so we can close the handle later
_llama_server_process._stderr_fh = stderr_fh # type: ignore[attr-defined]
return _llama_server_process
async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
"""Poll llama-server readiness via /v1/models endpoint."""
async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5):
"""Poll llama-server readiness via /v1/models endpoint.
Returns True on success. On failure, dumps the captured stderr (if any)
so the user can see why llama-server crashed.
"""
import httpx
for _ in range(max_retries):
for attempt in range(max_retries):
try:
async with httpx.AsyncClient(timeout=2.0) as client:
resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
@ -91,6 +157,27 @@ async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5
except Exception:
pass
await asyncio.sleep(interval)
# Flush and close the stderr handle so all data is on disk before we read
_close_stderr_log()
# ── Dump stderr for diagnosis ──────────────────────────────────────
print("llama-server did NOT become ready — dumping stderr:", flush=True)
try:
with open(LLAMA_STDERR_LOG) as f:
for line in f:
print(f" {line.rstrip()}", flush=True)
except FileNotFoundError:
print(" (stderr log not found — process may not have started)", flush=True)
# Also log exit code if the process died
global _llama_server_process
if _llama_server_process and _llama_server_process.returncode is not None:
print(
f"llama-server exited with code {_llama_server_process.returncode}",
flush=True,
)
return False
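
Reviewer sketch: the terminate, wait, then kill sequence used by `_kill_llama_server()` earlier in this file can be exercised on its own. The snippet below is a minimal illustration, not the sidecar's code: `stop_process` and `demo` are hypothetical names, a sleeping Python child stands in for llama-server, and `terminate()` is used in place of `send_signal(SIGTERM)` (on POSIX they send the same signal).

```python
import asyncio
import sys


async def stop_process(proc: asyncio.subprocess.Process,
                       term_timeout: float = 10.0) -> int:
    """Terminate `proc`, escalate to kill on timeout, return its exit code."""
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do
    proc.terminate()  # SIGTERM on POSIX
    try:
        # wait() is a coroutine: without `await` the process is never reaped.
        await asyncio.wait_for(proc.wait(), timeout=term_timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL on POSIX
        await proc.wait()
    return proc.returncode


async def demo() -> int:
    # Hypothetical stand-in for llama-server: a child that sleeps for a minute.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_process(proc, term_timeout=5.0)
```

Returning only after `wait()` completes is what guarantees the old process has exited before a replacement is started.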
@ -124,7 +211,7 @@ async def switch_model(payload: SwitchRequest):
"""Stop current llama-server, start new one with the given profile, wait for readiness."""
global _active_profile
with _switch_lock:
async with _switch_lock:
# Validate profile_id
profiles = load_manifest(MANIFEST_PATH)
if profiles is None:
@ -153,7 +240,7 @@ async def switch_model(payload: SwitchRequest):
}
# Start the new model
_kill_llama_server()
await _kill_llama_server()
_active_profile = None
await _start_llama_server(profile)