Compare commits
No commits in common. "master" and "feature/add-model-profiles" have entirely different histories.
@ -1,94 +0,0 @@
# Plan: Add user model profiles to manifest.yaml
# Date: 2025-06-15
# Author: Hermes Agent
# Status: DRAFT

## Context
The user has a collection of GGUF models on their Main PC (10.0.4.11, RTX 3090, 24 GB VRAM).
The intelligence-router manifest needs a profile for each model, with researched llama.cpp parameters.
Research is sourced from r/LocalLLaMA, Hugging Face model cards, and community blog posts.

## Hardware constraints
- GPU: RTX 3090, 24 GB VRAM
- All profiles use `n_gpu_layers: 999` (offload all layers)
- All profiles use `flash-attn: on`
- KV cache quantization (q8_0 or q4_0) to enable 64K+ context within 24 GB VRAM
- `min_p` set to 0.0 across all profiles, which disables min-p filtering (community standard for these models)

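To make the KV-cache constraint concrete, here is a rough sizing sketch. The bytes-per-element figures follow from ggml's block layouts (q8_0 packs 32 values into 34 bytes, q4_0 into 18 bytes). The layer and head counts are placeholder assumptions, not the real architecture of any model in this plan, so substitute the values from the GGUF metadata before relying on the numbers.

```python
# Rough KV-cache size estimate for a transformer with uniform attention layers.
# ASSUMPTION: n_layers, n_kv_heads and head_dim used in the demo are
# illustrative placeholders, not the architecture of any model in this plan.

BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale per block
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Bytes needed for the K and V caches over the full context window."""
    elems_per_token = n_layers * n_kv_heads * head_dim
    k_bytes = n_ctx * elems_per_token * BYTES_PER_ELEMENT[type_k]
    v_bytes = n_ctx * elems_per_token * BYTES_PER_ELEMENT[type_v]
    return int(k_bytes + v_bytes)


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    arch = dict(n_layers=48, n_kv_heads=8, head_dim=128)  # hypothetical values
    for n_ctx, cache in [(65536, "f16"), (65536, "q8_0"), (131072, "q4_0")]:
        size = kv_cache_bytes(n_ctx, type_k=cache, type_v=cache, **arch)
        print(f"n_ctx={n_ctx:>6} cache={cache:<4} -> {gib(size):.2f} GiB")
```

With these placeholder dimensions, a 64K f16 cache alone would take 12 GiB, which is why the profiles pair longer contexts with more aggressive cache quantization. Models that mix in sliding-window or other non-uniform attention layers will differ from this estimate.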
## Models to add (excluding mmproj files)

### Qwen3.6-27B (1 file: Qwen3.6-27B-Q4_K_M.gguf, ~10.5 GB)
Recommended sampling per HF model card and Unsloth: temp 0.6 / top_p 0.95 / top_k 20

| # | Profile ID | Name | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|-------|-----------|------|-------|------------|
| 1 | qwen36-27b-balanced-64k | Qwen3.6-27B Balanced 64K | 65536 | q8_0/q8_0 | 0.6 | 20 | 1.0 |
| 2 | qwen36-27b-thinking-64k | Qwen3.6-27B Thinking 64K | 65536 | q8_0/q8_0 | 1.0 | 20 | 1.0 |
| 3 | qwen36-27b-extended-128k | Qwen3.6-27B Extended 128K | 131072 | q4_0/q4_0 | 0.6 | 20 | 1.05 |

### Gemma 4 12B (2 files: Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB)
Google official: temp 1.0 / top_p 0.95 / top_k 64

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k |
|---|-----------|------|------|-------|-----------|------|-------|
| 4 | gemma4-12b-standard-q6-64k | Gemma4 12B Standard Q6 64K | Q6_K_XL | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 5 | gemma4-12b-extended-q6-128k | Gemma4 12B Extended Q6 128K | Q6_K_XL | 131072 | q4_0/q4_0 | 1.0 | 64 |
| 6 | gemma4-12b-compact-iq4-64k | Gemma4 12B Compact IQ4 64K | IQ4_XS | 65536 | q8_0/q8_0 | 1.0 | 64 |
| 7 | gemma4-12b-compact-long-128k | Gemma4 12B Compact IQ4 128K | IQ4_XS | 131072 | q8_0/q8_0 | 1.0 | 64 |

### Gemma 4 26B-A4B (2 files: Q4_K_M ~10.5 GB, IQ4_XS ~6 GB)
MoE, 4B active. Same sampling as 12B family.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | repeat_pen |
|---|-----------|------|------|-------|-----------|------|-------|------------|
| 8 | gemma4-26b-balanced-64k | Gemma4 26B Balanced 64K | Q4_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | 1.0 |
| 9 | gemma4-26b-extended-128k | Gemma4 26B Extended 128K | Q4_K_M | 131072 | q4_0/q4_0 | 1.0 | 64 | 1.15 |
| 10 | gemma4-26b-ultra-long-iq4-256k | Gemma4 26B Ultra-Long IQ4 256K | IQ4_XS | 262144 | q4_0/q4_0 | 1.0 | 64 | 1.0 |

### Qwen3.6-35B-A3B (2 files: UD-Q4_K_M ~14 GB, MTP-UD-Q4_K_M ~16 GB)
**MTP note:** An Unsloth benchmark shows MTP is net-negative on a single 3090. An MTP profile is included anyway since the user has the file.

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | MTP |
|---|-----------|------|------|-------|-----------|------|-------|-----|
| 11 | qwen36-35b-fast-64k | Qwen3.6-35B Fast 64K | UD-Q4 | 65536 | q8_0/q8_0 | 0.6 | 20 | no |
| 12 | qwen36-35b-thinking-64k | Qwen3.6-35B Thinking 64K | UD-Q4 | 65536 | q8_0/q8_0 | 1.0 | 20 | no |
| 13 | qwen36-35b-extended-128k | Qwen3.6-35B Extended 128K | UD-Q4 | 131072 | q4_0/q4_0 | 0.6 | 20 | no |
| 14 | qwen36-35b-mtp-128k | Qwen3.6-35B MTP 128K | MTP-UD-Q4 | 131072 | q8_0/q8_0 | 0.6 | 20 | yes (n=3) |

### Uncensored models (apply censored family params)

| # | Profile ID | Name | File | n_ctx | cache_k/v | temp | top_k | Based on |
|---|-----------|------|------|-------|-----------|------|-------|----------|
| 15 | qwen36-35b-hauhau-aggressive-64k | Qwen3.6-35B HauhauCS Aggressive 64K | Uncensored-HauhauCS-Q4_K_P | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 16 | qwen36-35b-genesis-apex-64k | Qwen3.6-35B Genesis APEX 64K | Uncensored-Genesis-APEX | 65536 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B fast |
| 17 | qwen36-35b-genesis-mtp-apex-128k | Qwen3.6-35B Genesis MTP APEX 128K | Uncensored-Genesis-MTP-APEX | 131072 | q8_0/q8_0 | 0.6 | 20 | Qwen3.6-35B MTP |
| 18 | gemma4-26b-hauhau-balanced-64k | Gemma4 26B HauhauCS Balanced 64K | Uncensored-HauhauCS-Q5_K_M | 65536 | q8_0/q8_0 | 1.0 | 64 | Gemma4 26B balanced |

**Total: 18 profiles**

## Flag mapping (manifest → llama-server CLI)

Manifest flags use snake_case keys that the sidecar translates to llama-server flags and passes as `--flag value`:

| Manifest key | CLI flag | Type | Notes |
|-------------|----------|------|-------|
| n_gpu_layers | --n-gpu-layers | int | 999 = all |
| n_ctx | --ctx-size | int | context window |
| cache_type_k | --cache-type-k | str | q8_0, q4_0 |
| cache_type_v | --cache-type-v | str | q8_0, q4_0 |
| flash_attn | --flash-attn | bool | true/on |
| temp | --temp | float | sampling |
| top_p | --top-p | float | sampling |
| top_k | --top-k | int | sampling |
| repeat_penalty | --repeat-penalty | float | sampling |
| min_p | --min-p | float | 0.0 |
| spec_type | --spec-type | str | draft-mtp (only MTP profiles) |
| spec_draft_n_max | --spec-draft-n-max | int | 3 (only MTP profiles) |
| presence_penalty | --presence-penalty | float | 0.0 |

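The mapping table can be sketched as a small translation function. This is only an illustration of the idea, not the sidecar's actual code: `build_args` and `KEY_TO_FLAG` are hypothetical names, and the handling of hyphenated keys and booleans is an assumption based on how the manifest entries in this diff are written.

```python
# Sketch: turn a profile's `flags` mapping into llama-server arguments.
# ASSUMPTION: build_args and KEY_TO_FLAG are illustrative names; the real
# sidecar may translate flags differently.

# Per the table, n_ctx is the only key whose flag is not a plain rename.
KEY_TO_FLAG = {"n_ctx": "--ctx-size"}


def build_args(flags):
    """Return a flat argument list such as ['--ctx-size', '65536', ...]."""
    args = []
    for key, value in flags.items():
        # Accept both spellings seen in the manifest (cache-type-k, cache_type_k).
        normalized = key.replace("-", "_")
        flag = KEY_TO_FLAG.get(normalized, "--" + normalized.replace("_", "-"))
        if isinstance(value, bool):
            # YAML 1.1 loaders such as PyYAML read a bare `on` as True,
            # so map booleans back to the on/off spelling used in the plan.
            value = "on" if value else "off"
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    example = {
        "n_ctx": 65536,
        "n_gpu_layers": 999,
        "cache-type-k": "q8_0",
        "flash-attn": True,
        "temp": 0.6,
    }
    print(build_args(example))
```

Normalizing the key before the lookup means a profile can use either `n_ctx` or `ctx-size` and still end up with `--ctx-size`.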
## Actions
1. Create branch `feature/add-model-profiles` from master
2. Create issues on Gitea for each model family (4 issues: qwen27, gemma12b, gemma26b, qwen35b)
3. Update `deploy/manifest.yaml` with all 18 profiles
4. Update tests if flag structure requires it
5. Run tests, commit
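Step 4 leaves the test changes open. One check that might be worth adding is a consistency test between a profile's ID suffix and its context size, plus a guard against a profile carrying both spellings of the context key. The helper names below are hypothetical, and the profile shape (`id` plus a `flags` mapping) is assumed from the manifest entries in this diff.

```python
import re

# ASSUMPTION: ctx_from_id and check_profile are illustrative helpers, not
# part of the existing test suite.


def ctx_from_id(profile_id):
    """Parse a trailing '-64k' style suffix into a token count, else None."""
    match = re.search(r"-(\d+)k$", profile_id)
    return int(match.group(1)) * 1024 if match else None


def check_profile(profile):
    """Return a list of human-readable problems found in one profile."""
    problems = []
    flags = profile.get("flags", {})
    if "n_ctx" in flags and "ctx-size" in flags:
        problems.append("both n_ctx and ctx-size are set")
    ctx = flags.get("n_ctx", flags.get("ctx-size"))
    expected = ctx_from_id(profile["id"])
    if expected is not None and ctx != expected:
        problems.append(f"id implies {expected} tokens but flags give {ctx}")
    return problems


if __name__ == "__main__":
    profile = {"id": "qwen36-27b-balanced-64k", "flags": {"n_ctx": 65536}}
    print(check_profile(profile))
```

IDs without a numeric `k` suffix, such as `qwen-3-8b`, are skipped by the suffix check rather than reported.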
@ -12,7 +12,6 @@ EnvironmentFile=-/home/bigt/AI/llm/.env
Environment=MANIFEST_PATH=/home/bigt/AI/llm/manifest.yaml
Environment=SIDECAR_PORT=8080
Environment=PATH=/home/bigt/AI/llm/venv/bin:/usr/local/bin:/usr/bin:/bin
Environment=PYTHONUNBUFFERED=1

# Use the sidecar's venv — install deps via deploy/README.md
ExecStart=/home/bigt/AI/llm/venv/bin/uvicorn sidecar.app:app --host 0.0.0.0 --port 8080

@ -11,88 +11,141 @@
|
||||
# All profiles use flash-attn: on, n-gpu-layers: 999 (offload all)
|
||||
# KV cache quantization (q8_0/q4_0) enables 64K+ context within 24GB VRAM
|
||||
|
||||
- id: qwen-3-8b
|
||||
name: "Qwen 3 8B"
|
||||
model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
|
||||
flags:
|
||||
n_ctx: 8192
|
||||
n_gpu_layers: 35
|
||||
|
||||
- id: qwen-3-8b-long
|
||||
name: "Qwen 3 8B (Long Context)"
|
||||
model_path: "/home/bigt/AI/llm/qwen/qwen3-8b-q4.gguf"
|
||||
flags:
|
||||
n_ctx: 32768
|
||||
n_gpu_layers: 20
|
||||
|
||||
- id: llama-4-maverick
|
||||
name: "Llama 4 Maverick"
|
||||
model_path: "/home/bigt/AI/llm/llama4/llama4-maverick-q4.gguf"
|
||||
flags:
|
||||
n_ctx: 8192
|
||||
n_gpu_layers: 35
|
||||
|
||||
# --- Qwen3.6-27B (Q4_K_M, ~10.5 GB) ---
|
||||
# Sampling: temp 0.6/1.0, top_p 0.95, top_k 20
|
||||
- id: qwen36-27b-balanced-64k
|
||||
name: "Qwen3.6-27B Balanced 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-27b-thinking-64k
|
||||
name: "Qwen3.6-27B Thinking 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-27b-extended-128k
|
||||
name: "Qwen3.6-27B Extended 128K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-27B-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 131072
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.05
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
# --- Gemma 4 12B (Q6_K_XL ~8.5 GB) ---
|
||||
# --- Gemma 4 12B (Q6_K_XL ~8.5 GB, IQ4_XS ~5 GB) ---
|
||||
# Sampling: temp 1.0, top_p 0.95, top_k 64 (Google official)
|
||||
- id: gemma4-12b-standard-q6-64k
|
||||
name: "Gemma4 12B Standard Q6 64K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-12b-extended-q6-128k
|
||||
name: "Gemma4 12B Extended Q6 128K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-12b-it-UD-Q6_K_XL.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 131072
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-12b-compact-iq4-64k
|
||||
name: "Gemma4 12B Compact IQ4 64K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
|
||||
flags:
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-12b-compact-long-128k
|
||||
name: "Gemma4 12B Compact IQ4 128K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
|
||||
flags:
|
||||
n_ctx: 131072
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
# --- Gemma 4 26B-A4B (Q4_K_M ~10.5 GB, IQ4_XS ~6 GB) ---
|
||||
@ -101,97 +154,48 @@
|
||||
name: "Gemma4 26B Balanced 64K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-26b-extended-128k
|
||||
name: "Gemma4 26B Extended 128K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 131072
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.15
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-26b-ultra-long-iq4-128k
|
||||
name: "Gemma4 26B Ultra-Long IQ4 128K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 131072
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-26b-q5-64k
|
||||
name: "Gemma4 26B Q5 64K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/google_gemma-4-26B-A4B-it-Q5_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
# --- Gemma 4 26B Compact (IQ4_XS ~6 GB) ---
|
||||
- id: gemma4-26b-compact-iq4-64k
|
||||
name: "Gemma4 26B Compact IQ4 64K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-26b-compact-long-128k
|
||||
name: "Gemma4 26B Compact IQ4 128K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/bartowski-google_gemma-4-26B-A4B-it-IQ4_XS.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
# --- Qwen3.6-35B-A3B (UD-Q4_K_M ~14 GB) ---
|
||||
@ -201,144 +205,95 @@
|
||||
name: "Qwen3.6-35B Fast 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-35b-thinking-64k
|
||||
name: "Qwen3.6-35B Thinking 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-35b-extended-128k
|
||||
name: "Qwen3.6-35B Extended 128K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 131072
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
# --- Qwen3.6-35B-A3B MTP variant ---
|
||||
- id: qwen36-35b-mtp-fast-64k
|
||||
name: "Qwen3.6-35B MTP Fast 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-35b-mtp-extended-128k
|
||||
name: "Qwen3.6-35B MTP Extended 128K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 131072
|
||||
n-gpu-layers: 999
|
||||
cache-type-k: q4_0
|
||||
cache-type-v: q4_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
# --- Uncensored models ---
|
||||
# --- Uncensored models (apply censored family params) ---
|
||||
- id: qwen36-35b-hauhau-aggressive-64k
|
||||
name: "Qwen3.6-35B HauhauCS Aggressive 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-35b-genesis-apex-64k
|
||||
name: "Qwen3.6-35B Genesis APEX 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-APEX.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
top_p: 0.95
|
||||
top_k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: qwen36-35b-genesis-mtp-apex-64k
|
||||
name: "Qwen3.6-35B Genesis MTP APEX 64K"
|
||||
model_path: "/home/bigt/AI/llm/qwen3.6/Qwen3.6-35B-A3B-Uncensored-Genesis-MTP-APEX.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 0.6
|
||||
top-p: 0.95
|
||||
top-k: 20
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
|
||||
- id: gemma4-26b-hauhau-balanced-64k
|
||||
name: "Gemma4 26B HauhauCS Balanced 64K"
|
||||
model_path: "/home/bigt/AI/llm/gemma4/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf"
|
||||
flags:
|
||||
ctx-size: 65536
|
||||
n-gpu-layers: 999
|
||||
n_ctx: 65536
|
||||
n_gpu_layers: 999
|
||||
cache-type-k: q8_0
|
||||
cache-type-v: q8_0
|
||||
flash-attn: on
|
||||
temp: 1.0
|
||||
top-p: 0.95
|
||||
top-k: 64
|
||||
top_p: 0.95
|
||||
top_k: 64
|
||||
repeat-penalty: 1.0
|
||||
min-p: 0.0
|
||||
min_p: 0.0
|
||||
presence-penalty: 0.0
|
||||
@ -7,8 +7,8 @@ services:
|
||||
ports:
|
||||
- "9001:9000"
|
||||
environment:
|
||||
- SIDECAR_URL=http://10.0.4.11:8080
|
||||
- MAIN_PC_URL=http://10.0.4.11:8081/v1
|
||||
- SIDECAR_URL=http://10.0.4.11:8081
|
||||
- MAIN_PC_URL=http://10.0.4.11:8080/v1
|
||||
- FALLBACK_SLM_URL=http://10.0.4.200:8080/v1
|
||||
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
|
||||
restart: unless-stopped
|
||||
|
||||
345
main.py
345
main.py
@ -141,49 +141,6 @@ def complete_switch():
    _switching_event.set()


async def _background_switch(requested_model: str):
    """Run a model switch in the background.

    The sidecar POST is awaited but the caller gets an immediate SSE stream
    so Hermes Desktop doesn't timeout waiting for the first response.

    Called via asyncio.create_task() so it runs concurrently with the
    SSE stream being sent to the client.
    """
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            switch_resp = await client.post(
                f"{SIDECAR_URL}/models/switch",
                json={"profile_id": requested_model},
            )
            switch_result = switch_resp.json()
            if switch_result.get("status") == "ready":
                print(
                    f"SWITCH SUCCESS: profile={requested_model}",
                    flush=True,
                )
            else:
                circuit_record_failure()
                print(
                    f"SWITCH FAILED: profile={requested_model}, "
                    f"status={switch_result.get('status')}, "
                    f"message={switch_result.get('message', '(no message)')}",
                    flush=True,
                )
    except Exception as e:
        circuit_record_failure()
        print(
            f"SWITCH EXCEPTION: profile={requested_model}, "
            f"error={type(e).__name__}: {e}",
            flush=True,
        )
    finally:
        # Signal all queued requests so they can proceed (and fall
        # through to the fallback chain if the switch failed).
        complete_switch()
        drain_queue()


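The pattern in `_background_switch`, where one request starts the switch as a background task and every waiter is released from a `finally` block, can be shown in isolation. This is a minimal sketch, not the router's implementation: `SwitchGate` and `fake_switch` are hypothetical names, and `fake_switch` stands in for the sidecar POST. It also keeps a reference to the task, since the asyncio documentation warns that the event loop holds only weak references to tasks.

```python
import asyncio

# ASSUMPTION: SwitchGate and fake_switch are illustrative; fake_switch
# stands in for the HTTP call to the sidecar.


class SwitchGate:
    """Start at most one background switch and let other callers wait on it."""

    def __init__(self):
        self._event = None
        self._task = None  # strong reference so the task is not collected
        self.result = None

    def start(self, coro_factory):
        """Start a switch if none is in flight; return the event to wait on."""
        if self._event is None or self._event.is_set():
            self._event = asyncio.Event()
            self._task = asyncio.create_task(self._run(coro_factory))
        return self._event

    async def _run(self, coro_factory):
        try:
            self.result = await coro_factory()
        except Exception as exc:
            self.result = f"error: {exc}"
        finally:
            # Always release waiters, even when the switch failed.
            self._event.set()


async def fake_switch():
    await asyncio.sleep(0.01)
    return "ready"


async def demo():
    gate = SwitchGate()
    first = gate.start(fake_switch)
    second = gate.start(fake_switch)  # joins the switch already in flight
    await first.wait()
    return first is second, gate.result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Setting the event in `finally` mirrors the original's intent: queued requests must not hang if the switch raises.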
# ─── App ─────────────────────────────────────────────────────────────────────
|
||||
@asynccontextmanager
|
||||
async def lifespan(app: FastAPI):
|
||||
@ -196,12 +153,6 @@ app = FastAPI(lifespan=lifespan)
|
||||
|
||||
|
||||
# ─── GET /v1/models — Issue #2 ──────────────────────────────────────────────
|
||||
@app.get("/v1")
|
||||
async def v1_root():
|
||||
"""OpenAI API root — return basic info for Hermes Desktop WebUI probe."""
|
||||
return {"object": "list", "data": []}
|
||||
|
||||
|
||||
@app.get("/v1/models")
|
||||
async def get_models():
|
||||
"""OpenAI-compatible /v1/models endpoint proxying to Sidecar."""
|
||||
@ -228,170 +179,6 @@ async def health():
|
||||
return {"status": "router_online"}
|
||||
|
||||
|
||||
# ─── Hermes Desktop Probe Endpoints ──────────────────────────────────────────
|
||||
# These endpoints are probed by Hermes Desktop to validate/identify the
|
||||
# provider before allowing model switching. Without them the desktop
|
||||
# returns 503 and refuses to switch models.
|
||||
|
||||
@app.get("/v1/models/{model_id:path}")
|
||||
async def get_single_model(model_id: str):
|
||||
"""OpenAI-compatible single model query. Proxied via Sidecar model list."""
|
||||
async with httpx.AsyncClient(timeout=5.0) as client:
|
||||
try:
|
||||
resp = await client.get(f"{SIDECAR_URL}/models/available")
|
||||
profiles = resp.json()
|
||||
except Exception:
|
||||
return JSONResponse(
|
||||
status_code=503,
|
||||
content={"error": "Sidecar unavailable", "data": []},
|
||||
)
|
||||
|
||||
for p in profiles:
|
||||
if p.get("id") == model_id:
|
||||
return {"id": p["id"], "object": "model", "owned_by": "sidecar"}
|
||||
return JSONResponse(status_code=404, content={"error": "model not found", "id": model_id})
|
||||
|
||||
|
||||
@app.get("/api/tags")
|
||||
async def ollama_tags():
|
||||
"""Ollama-compatible model list for Hermes Desktop discovery."""
|
||||
async with httpx.AsyncClient(timeout=5.0) as client:
|
||||
try:
|
||||
resp = await client.get(f"{SIDECAR_URL}/models/available")
|
||||
profiles = resp.json()
|
||||
except Exception:
|
||||
return JSONResponse(content={"models": []})
|
||||
|
||||
models = []
|
||||
for p in profiles:
|
||||
models.append({
|
||||
"name": p.get("id", ""),
|
||||
"model": p.get("id", ""),
|
||||
"modified_at": "2025-01-01T00:00:00Z",
|
||||
"size": 0,
|
||||
"digest": "",
|
||||
"details": {"format": "gguf", "family": p.get("name", "llm")},
|
||||
})
|
||||
return {"models": models}
|
||||
|
||||
|
||||
@app.get("/api/show")
|
||||
async def ollama_show_get(model: str = ""):
|
||||
"""Ollama-compatible model info for Hermes Desktop discovery (GET variant).
|
||||
|
||||
Some Hermes Desktop versions probe /api/show via GET with a ?model= parameter.
|
||||
"""
|
||||
return await _ollama_show_lookup(model)
|
||||
|
||||
|
||||
@app.post("/api/show")
|
||||
async def ollama_show_post(request: Request):
|
||||
"""Ollama-compatible model info for Hermes Desktop discovery (POST variant)."""
|
||||
body = await request.body()
|
||||
body_data = json.loads(body) if body else {}
|
||||
model_name = body_data.get("model", "")
|
||||
return await _ollama_show_lookup(model_name)
|
||||
|
||||
|
||||
async def _ollama_show_lookup(model_name: str):
    """Shared logic for Ollama /api/show model info lookup.

    When model_name is empty string (Hermes Desktop probe with no model field),
    returns the currently-active profile's info so the desktop can determine
    the correct context size. Previously returned 404, causing Hermes Desktop
    to default to 256k context.
    """
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            resp = await client.get(f"{SIDECAR_URL}/models/available")
            profiles = resp.json()
            status_resp = await client.get(f"{SIDECAR_URL}/models/status")
            status = status_resp.json()
        except Exception:
            return JSONResponse(status_code=404, content={"error": "model not found"})

    # If no model specified, return the currently-active profile's info
    active_id = status.get("active_profile")
    if not model_name and active_id:
        for p in profiles:
            if p.get("id") == active_id:
                flags = p.get("flags", {})
                ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
                return {
                    "modelfile": "",
                    "parameters": f"num_ctx {ctx_size}",
                    "template": "",
                    "details": {
                        "format": "gguf",
                        "family": p.get("name", "llm"),
                        "parameter_size": ctx_size,
                    },
                    "model_info": {"id": p.get("id", "")},
                }

    for p in profiles:
        if p.get("id") == model_name:
            # Extract actual context size from the profile's flags
            flags = p.get("flags", {})
            ctx_size = str(flags.get("ctx-size", flags.get("n_ctx", "4096")))
            return {
                "modelfile": "",
                "parameters": f"num_ctx {ctx_size}",
                "template": "",
                "details": {
                    "format": "gguf",
                    "family": p.get("name", "llm"),
                    "parameter_size": ctx_size,
                },
                "model_info": {"id": p.get("id", "")},
            }
    return JSONResponse(status_code=404, content={"error": "model not found"})
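`_ollama_show_lookup` builds the same response dictionary in two places. One way to remove the duplication, sketched here under the assumption that profiles are plain dicts with `id`, `name` and `flags` keys as in the code above, is to split the lookup and the payload into pure helpers. The helper names are hypothetical.

```python
# ASSUMPTION: context_size, show_payload and find_profile are illustrative
# helper names; the payload fields are copied from _ollama_show_lookup.


def context_size(profile, default=4096):
    """Read the context size from either flag spelling, as a string."""
    flags = profile.get("flags", {})
    return str(flags.get("ctx-size", flags.get("n_ctx", default)))


def show_payload(profile):
    """Build the Ollama-style /api/show body for one profile."""
    ctx = context_size(profile)
    return {
        "modelfile": "",
        "parameters": f"num_ctx {ctx}",
        "template": "",
        "details": {
            "format": "gguf",
            "family": profile.get("name", "llm"),
            "parameter_size": ctx,
        },
        "model_info": {"id": profile.get("id", "")},
    }


def find_profile(profiles, model_name, active_id):
    """Pick the named profile, or the active one when no name is given."""
    wanted = model_name or active_id
    if not wanted:
        return None
    for profile in profiles:
        if profile.get("id") == wanted:
            return profile
    return None
```

Because these helpers do no I/O, they can be unit-tested without a running sidecar; the endpoint would then only fetch the two JSON documents and return either `show_payload(...)` or a 404.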
|
||||
|
||||
@app.get("/api/v1/models")
|
||||
async def ollama_v1_models():
|
||||
"""Ollama /api/v1/models redirect — return same list as /v1/models."""
|
||||
return await get_models()
|
||||
|
||||
|
||||
@app.get("/v1/props")
|
||||
async def llama_cpp_props():
|
||||
"""llama.cpp discovery endpoint for Hermes Desktop."""
|
||||
async with httpx.AsyncClient(timeout=3.0) as client:
|
||||
try:
|
||||
resp = await client.get(f"{SIDECAR_URL}/models/status")
|
||||
status = resp.json()
|
||||
except Exception:
|
||||
status = {"active_profile": None, "llama_server_running": False}
|
||||
|
||||
# Report the currently-running server version / capabilities
|
||||
return {
|
||||
"props": {
|
||||
"version": 1,
|
||||
"total_slots": 1,
|
||||
"chat_endpoint": "/v1/chat/completions",
|
||||
"completion_endpoint": "/v1/completions",
|
||||
"embedding_endpoint": "/v1/embeddings",
|
||||
"rerank_endpoint": "",
|
||||
"health_endpoint": "/health",
|
||||
},
|
||||
"active_profile": status.get("active_profile"),
|
||||
"server_running": status.get("llama_server_running", False),
|
||||
}
|
||||
|
||||
|
||||
@app.get("/props")
|
||||
async def llm_props():
|
||||
"""Legacy llama.cpp discovery endpoint (same as /v1/props)."""
|
||||
return await llama_cpp_props()
|
||||
|
||||
|
||||
@app.get("/version")
|
||||
async def llm_version():
|
||||
"""llama.cpp version endpoint for Hermes Desktop."""
|
||||
return {"version": "0.2.0", "build": "router-proxy", "commit": "intelligence-router"}
|
||||
|
||||
|
||||
# ─── GET /models/status ──────────────────────────────────────────────────────
|
||||
@app.get("/models/status")
|
||||
async def router_model_status():
|
||||
@ -471,9 +258,13 @@ async def proxy(
|
||||
# ── Determine target URL ──────────────────────────────────────────────
|
||||
target_url: Optional[str] = None
|
||||
error: Optional[str] = None
|
||||
sidecar_status = None
|
||||
|
||||
# Always query the sidecar first (to detect recovery even when circuit is open)
|
||||
# Circuit breaker check
|
||||
if not await circuit_breaker_check():
|
||||
error = "circuit_open"
|
||||
else:
|
||||
# Query Sidecar for active model
|
||||
sidecar_status = None
|
||||
async with httpx.AsyncClient(timeout=3.0) as client:
|
||||
try:
|
||||
resp = await client.get(f"{SIDECAR_URL}/models/status")
|
||||
@ -481,63 +272,44 @@ async def proxy(
|
||||
sidecar_status = resp.json()
|
||||
circuit_reset()
|
||||
except Exception:
|
||||
pass # Handled below
|
||||
error = "sidecar_down"
|
||||
|
||||
if sidecar_status is None:
|
||||
circuit_record_failure()
|
||||
error = "sidecar_down"
|
||||
elif not await circuit_breaker_check():
|
||||
# Sidecar is up but circuit is open from prior switch failures
|
||||
# Only block the switch — allow routing to already-active backend
|
||||
error = "circuit_open"
|
||||
if sidecar_status.get("llama_server_running"):
|
||||
target_url = f"{MAIN_PC_BASE}/{path}"
|
||||
else:
|
||||
# Both sidecar reachable and circuit closed — proceed normally
|
||||
# Extract requested model from request body
|
||||
body = await request.body()
|
||||
body_data = json.loads(body) if body else {}
|
||||
requested_model = body_data.get("model")
|
||||
|
||||
# Only trigger model switches for actual chat/completion POST requests.
|
||||
# GET probes, /api/show lookups, and other non-chat endpoints should
|
||||
# never trigger a switch — they just read current state.
|
||||
is_chat_request = (
|
||||
request.method == "POST"
|
||||
and path in ("v1/chat/completions", "v1/completions")
|
||||
)
|
||||
|
||||
if requested_model and sidecar_status.get("active_profile") == requested_model:
|
||||
target_url = f"{MAIN_PC_BASE}/{path}"
|
||||
elif requested_model and is_chat_request:
|
||||
# All requests during a model switch get an immediate SSE streaming
|
||||
# response so clients (Hermes Desktop) don't timeout while waiting
|
||||
# for the model to load (10-30s). The switch runs in a background
|
||||
# task; the SSE stream yields progress events, then pipes through
|
||||
# the actual response once the backend model is ready.
|
||||
else:
|
||||
# Trigger switch
|
||||
if requested_model:
|
||||
# Check if a switch is already in progress
|
||||
current_switch = await wait_for_switch()
|
||||
if current_switch is None:
|
||||
# No switch in progress — start one in the background
|
||||
await start_switch()
|
||||
asyncio.create_task(_background_switch(requested_model))
|
||||
|
||||
# Queue this request — signals when switch completes
|
||||
if current_switch is not None and not current_switch.is_set():
|
||||
# Another request started the switch — queue this one
|
||||
try:
|
||||
wait_evt = await queue_request()
|
||||
except HTTPException as he:
|
||||
raise
|
||||
|
||||
# Build request headers once
|
||||
req_headers = dict(request.headers)
|
||||
req_headers.pop("host", None)
|
||||
|
||||
# SSE progress while waiting
|
||||
async def stream_with_sse():
|
||||
sse_gen = sse_progress_stream(wait_evt)
|
||||
try:
|
||||
await wait_evt.wait()
|
||||
async for sse_chunk in sse_gen:
|
||||
yield sse_chunk
|
||||
# Send actual request to Main PC
|
||||
complete_switch()
|
||||
drain_queue()
|
||||
async with httpx.AsyncClient(timeout=60.0) as c:
|
||||
req_headers = dict(request.headers)
|
||||
req_headers.pop("host", None)
|
||||
async with c.stream(
|
||||
request.method,
|
||||
f"{MAIN_PC_BASE}/{path}",
|
||||
@ -546,47 +318,8 @@ async def proxy(
|
||||
) as resp:
|
||||
async for chunk in resp.aiter_bytes():
|
||||
yield chunk
|
||||
except Exception:
|
||||
# Main PC unreachable (switch failed or server died) —
|
||||
# try fallback chain
|
||||
yield _sse_format(
|
||||
"error",
|
||||
{"message": "Backend unreachable, trying fallback..."},
|
||||
)
|
||||
# Try OpenRouter
|
||||
if OPENROUTER_API_KEY:
|
||||
try:
|
||||
fb_headers = dict(req_headers)
|
||||
fb_headers["Authorization"] = f"Bearer {OPENROUTER_API_KEY}"
|
||||
async with httpx.AsyncClient(timeout=60.0) as c:
|
||||
async with c.stream(
|
||||
request.method,
|
||||
f"{OPENROUTER_BASE}/{path}",
|
||||
content=body,
|
||||
headers=fb_headers,
|
||||
) as resp:
|
||||
async for chunk in resp.aiter_bytes():
|
||||
yield chunk
|
||||
return
|
||||
except Exception:
|
||||
pass
|
||||
# Fallback to LXC SLM
|
||||
try:
|
||||
async with httpx.AsyncClient(timeout=60.0) as c:
|
||||
async with c.stream(
|
||||
request.method,
|
||||
f"{FALLBACK_SLM_URL}/{path}",
|
||||
content=body,
|
||||
headers=req_headers,
|
||||
) as resp:
|
||||
async for chunk in resp.aiter_bytes():
|
||||
yield chunk
|
||||
except Exception:
|
||||
yield _sse_format(
|
||||
"error",
|
||||
{"message": "All backends unavailable"},
|
||||
)
|
||||
finally:
|
||||
# Clean up sse_gen
|
||||
try:
|
||||
await sse_gen.aclose()
|
||||
except Exception:
|
||||
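The waiting pattern described in the comments of this hunk (emit SSE progress events until the model switch finishes, then hand over to the real response) can be sketched roughly as follows. This is only an illustration under assumptions: `sse_format` and `progress_until` are hypothetical stand-ins for the `_sse_format` and `sse_progress_stream` helpers, whose bodies are not shown in this diff, and the event names and payload fields are invented for the example.

```python
import asyncio
import json
from typing import AsyncIterator, List


def sse_format(event: str, data: dict) -> str:
    """Encode one Server-Sent Events frame (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def progress_until(
    ready: asyncio.Event, interval: float = 1.0
) -> AsyncIterator[str]:
    """Yield a progress frame roughly every `interval` seconds until `ready` is set."""
    waited = 0.0
    while not ready.is_set():
        yield sse_format("progress", {"status": "loading", "waited_s": round(waited, 2)})
        try:
            # Wake up early if the switch completes before the interval ends.
            await asyncio.wait_for(ready.wait(), timeout=interval)
        except asyncio.TimeoutError:
            waited += interval
    # Final frame tells the client that the backend is ready.
    yield sse_format("ready", {"status": "ready"})


async def demo_progress() -> List[str]:
    ready = asyncio.Event()
    # Simulate a model load that finishes after 50 ms.
    asyncio.get_running_loop().call_later(0.05, ready.set)
    return [frame async for frame in progress_until(ready, interval=0.02)]


if __name__ == "__main__":
    for frame in asyncio.run(demo_progress()):
        print(frame, end="")
```

Because the generator yields before it waits, the client receives a first frame immediately, which is what keeps a client-side timeout from firing during a 10-30 s model load.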
@ -597,12 +330,24 @@ async def proxy(
|
||||
media_type="text/event-stream",
|
||||
)
|
||||
|
||||
else:
|
||||
# No model in request body (probe/GET/non-chat request) —
|
||||
# route to the currently active backend when available,
|
||||
# or fall through to the fallback chain.
|
||||
if sidecar_status.get("active_profile") and sidecar_status.get("llama_server_running"):
|
||||
# First request triggers the switch
|
||||
await start_switch() # Create event for tracking
|
||||
try:
|
||||
async with httpx.AsyncClient(timeout=120.0) as client:
|
||||
switch_resp = await client.post(
|
||||
f"{SIDECAR_URL}/models/switch",
|
||||
json={"profile_id": requested_model},
|
||||
)
|
||||
switch_result = switch_resp.json()
|
||||
if switch_result.get("status") == "ready":
|
||||
complete_switch()
|
||||
drain_queue()
|
||||
target_url = f"{MAIN_PC_BASE}/{path}"
|
||||
else:
|
||||
error = "switch_failed"
|
||||
except Exception as e:
|
||||
circuit_record_failure()
|
||||
error = f"switch_error: {str(e)}"
|
||||
|
||||
# ── Fallback chain ────────────────────────────────────────────────────
|
||||
if target_url is None:
|
||||
@ -633,11 +378,8 @@ async def proxy(
|
||||
request.method, target,
|
||||
content=body, headers=headers,
|
||||
) as resp:
|
||||
if resp.status_code != 200:
|
||||
print(f"PROXY: {target} returned {resp.status_code} during SSE stream", flush=True)
|
||||
async for chunk in resp.aiter_bytes():
|
||||
yield chunk
|
||||
|
||||
return StreamingResponse(gen(), status_code=200)
|
||||
|
||||
resp = await client.request(
|
||||
@ -646,12 +388,6 @@ async def proxy(
|
||||
content=body,
|
||||
headers=headers,
|
||||
)
|
||||
if resp.status_code != 200:
|
||||
body_preview = resp.content[:500].decode("utf-8", errors="replace")
|
||||
print(
|
||||
f"PROXY: {request.method} {target} returned {resp.status_code}: {body_preview}",
|
||||
flush=True,
|
||||
)
|
||||
return Response(
|
||||
content=resp.content,
|
||||
status_code=resp.status_code,
|
||||
@ -661,11 +397,8 @@ async def proxy(
|
||||
primary_result = None
|
||||
try:
|
||||
primary_result = await execute(target_url)
|
||||
except Exception as e:
|
||||
print(
|
||||
f"PROXY EXCEPTION on primary {target_url}: {type(e).__name__}: {e}",
|
||||
flush=True,
|
||||
) # Falls through to fallback chain
|
||||
except Exception:
|
||||
pass # Falls through to fallback chain
|
||||
if primary_result is not None:
|
||||
return primary_result
|
||||
|
||||
|
||||
@ -1,161 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Sync intelligence-router model list into Hermes custom_providers.
|
||||
|
||||
Usage:
|
||||
# One-shot: discover models from the router and update Hermes config
|
||||
python3 scripts/sync_models.py
|
||||
|
||||
# Cron mode (auto): set up via:
|
||||
# cp scripts/sync_models.py ~/.hermes/scripts/
|
||||
# hermes cron create --schedule "every 30m" --no-agent --script sync_models.py
|
||||
|
||||
Exits without modifying the config when nothing changed. Prints a summary and restarts the gateway when
|
||||
the model list differs.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
import urllib.error
|
||||
import urllib.request
|
||||
from pathlib import Path
|
||||
|
||||
# ── CONFIGURE THESE ──────────────────────────────────────────────────
|
||||
ROUTER_BASE_URL = "http://10.0.4.100:9001/v1"
|
||||
PROVIDER_NAME = "intelligence_router"
|
||||
GATEWAY_SERVICE = "hermes-gateway"
|
||||
# ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
MODELS_URL = f"{ROUTER_BASE_URL}/models"
|
||||
CONFIG_PATH = Path(os.path.expanduser("~/.hermes/config.yaml"))
|
||||
|
||||
|
||||
def fetch_models() -> list[str] | None:
|
||||
try:
|
||||
req = urllib.request.Request(MODELS_URL, headers={"Accept": "application/json"})
|
||||
with urllib.request.urlopen(req, timeout=10) as resp:
|
||||
data = json.loads(resp.read().decode())
|
||||
models = sorted(m["id"] for m in data.get("data", []) if isinstance(m, dict))
|
||||
return models if models else None
|
||||
except (urllib.error.URLError, TimeoutError, json.JSONDecodeError, OSError) as e:
|
||||
print(f"ERROR: Failed to fetch models from {MODELS_URL}: {e}", file=sys.stderr)
|
||||
return None
|
||||
|
||||
|
||||
def read_current_models() -> list[str]:
|
||||
"""Parse current custom_providers entries for our provider name."""
|
||||
if not CONFIG_PATH.exists():
|
||||
return []
|
||||
|
||||
models = []
|
||||
with open(CONFIG_PATH) as f:
|
||||
content = f.read()
|
||||
|
||||
idx = content.find("custom_providers:")
|
||||
if idx == -1:
|
||||
return []
|
||||
|
||||
section = content[idx:]
|
||||
lines = section.split("\n")
|
||||
|
||||
current_entry = {}
|
||||
for line in lines[1:]:  # skip the "custom_providers:" header line, which would otherwise trigger the break below
|
||||
s = line.strip()
|
||||
if s.startswith("- base_url:"):
|
||||
if current_entry.get("name") == PROVIDER_NAME:
|
||||
m = current_entry.get("model", "")
|
||||
if m:
|
||||
models.append(m)
|
||||
current_entry = {}
|
||||
elif s.startswith("model:"):
|
||||
current_entry["model"] = s.split("model:", 1)[1].strip().strip("'\"")
|
||||
elif s.startswith("name:"):
|
||||
current_entry["name"] = s.split("name:", 1)[1].strip().strip("'\"")
|
||||
elif s and not s.startswith(("-", " ")):
|
||||
break
|
||||
|
||||
# Don't forget the last entry
|
||||
if current_entry.get("name") == PROVIDER_NAME:
|
||||
m = current_entry.get("model", "")
|
||||
if m:
|
||||
models.append(m)
|
||||
|
||||
return sorted(models)
|
||||
|
||||
|
||||
def generate_block(models: list[str]) -> str:
|
||||
lines = ["custom_providers:"]
|
||||
for m in models:
|
||||
lines.append(f"- base_url: {ROUTER_BASE_URL}")
|
||||
lines.append(f" model: {m}")
|
||||
lines.append(f" name: {PROVIDER_NAME}")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def replace_section(models: list[str]) -> bool:
|
||||
"""Replace the custom_providers section in-place. Returns True if changed."""
|
||||
if not CONFIG_PATH.exists():
|
||||
return False
|
||||
|
||||
import yaml
|
||||
|
||||
content = CONFIG_PATH.read_text()
|
||||
config = yaml.safe_load(content)
|
||||
|
||||
new_entries = [
|
||||
{"base_url": ROUTER_BASE_URL, "model": m, "name": PROVIDER_NAME}
|
||||
for m in models
|
||||
]
|
||||
|
||||
if config.get("custom_providers") == new_entries:
|
||||
return False
|
||||
|
||||
config["custom_providers"] = new_entries
|
||||
CONFIG_PATH.write_text(yaml.dump(config, default_flow_style=False, sort_keys=False))
|
||||
return True
|
||||
|
||||
|
||||
def restart_gateway() -> bool:
|
||||
try:
|
||||
r = subprocess.run(
|
||||
["systemctl", "--user", "restart", GATEWAY_SERVICE],
|
||||
capture_output=True, text=True, timeout=30,
|
||||
)
|
||||
return r.returncode == 0
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def main():
|
||||
models = fetch_models()
|
||||
if models is None:
|
||||
sys.exit(1)
|
||||
|
||||
current = read_current_models()
|
||||
if current == models:
|
||||
print("Model list unchanged — nothing to do.")
|
||||
return
|
||||
|
||||
added = set(models) - set(current)
|
||||
removed = set(current) - set(models)
|
||||
print(f"Model list changed! {len(current)} → {len(models)} models")
|
||||
if added:
|
||||
print(f" Added: {sorted(added)}")
|
||||
if removed:
|
||||
print(f" Removed: {sorted(removed)}")
|
||||
|
||||
if not replace_section(models):
|
||||
print("ERROR: Config update failed")
|
||||
return
|
||||
|
||||
print("Config updated. Restarting gateway...")
|
||||
if restart_gateway():
|
||||
print("Gateway restarted successfully.")
|
||||
else:
|
||||
print("WARNING: Gateway restart failed — restart manually.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
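The change-detection step in `main()` above boils down to two small operations: pulling the ids out of the `/models` response and comparing them with the configured list. A minimal sketch, assuming the OpenAI-style `{"data": [{"id": ...}]}` shape that `fetch_models()` reads; the helper names `extract_model_ids` and `diff_models` are hypothetical and do not appear in the script.

```python
from typing import List, Tuple


def extract_model_ids(payload: dict) -> List[str]:
    """Return the sorted model ids from a /v1/models style response."""
    return sorted(
        m["id"]
        for m in payload.get("data", [])
        if isinstance(m, dict) and "id" in m
    )


def diff_models(current: List[str], discovered: List[str]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) between the configured and discovered lists."""
    added = sorted(set(discovered) - set(current))
    removed = sorted(set(current) - set(discovered))
    return added, removed


if __name__ == "__main__":
    # Sample payload using profile ids from the plan document.
    payload = {
        "data": [
            {"id": "qwen36-27b-thinking-64k"},
            {"id": "qwen36-27b-balanced-64k"},
        ]
    }
    discovered = extract_model_ids(payload)
    added, removed = diff_models(["qwen36-27b-balanced-64k"], discovered)
    print("added:", added, "removed:", removed)
```

Sorting both sides before comparison keeps the "unchanged" check independent of the order in which the router lists its models.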
125
sidecar/app.py
@ -5,6 +5,7 @@ Runs on the Main PC, manages llama-server subprocess, serves manifest/profile da
|
||||
import os
|
||||
import asyncio
|
||||
import signal as signal_module
|
||||
import threading
|
||||
from contextlib import asynccontextmanager
|
||||
from typing import Optional
|
||||
|
||||
@ -17,98 +18,41 @@ from sidecar.manifest import load_manifest
|
||||
# Configuration from environment
|
||||
MANIFEST_PATH = os.getenv("MANIFEST_PATH", "/home/bigt/AI/llm/manifest.yaml")
|
||||
SIDECAR_PORT = int(os.getenv("SIDECAR_PORT", "8080"))
|
||||
LLAMA_SERVER_PORT = 8081
|
||||
LLAMA_STDERR_LOG = os.path.join(
|
||||
os.path.dirname(MANIFEST_PATH), "llama-server-stderr.log"
|
||||
)
|
||||
LLAMA_SERVER_PORT = 8080
|
||||
|
||||
# Global state
|
||||
_llama_server_process: Optional[asyncio.subprocess.Process] = None
|
||||
_active_profile: Optional[str] = None
|
||||
_switch_lock = asyncio.Lock() # Use asyncio.Lock to avoid blocking the event loop
|
||||
_switch_lock = threading.Lock() # Use threading.Lock for compatibility with TestClient
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def lifespan(app: FastAPI):
|
||||
"""Manage sidecar lifecycle — no default model loaded."""
|
||||
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}", flush=True)
|
||||
print(f"Sidecar starting, manifest={MANIFEST_PATH}, port={SIDECAR_PORT}")
|
||||
yield
|
||||
# Cleanup: kill llama-server if running
|
||||
global _llama_server_process
|
||||
if _llama_server_process:
|
||||
await _kill_llama_server()
|
||||
_kill_llama_server()
|
||||
|
||||
|
||||
app = FastAPI(lifespan=lifespan)
|
||||
|
||||
|
||||
def _close_stderr_log():
|
||||
"""Close the stderr log file handle if it's still attached to the process."""
|
||||
def _kill_llama_server():
|
||||
"""Kill the llama-server subprocess."""
|
||||
global _llama_server_process
|
||||
if _llama_server_process is not None:
|
||||
fh = getattr(_llama_server_process, "_stderr_fh", None)
|
||||
if fh is not None and not fh.closed:
|
||||
try:
|
||||
fh.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
async def _kill_llama_server():
|
||||
"""Kill the llama-server subprocess and wait for it to fully terminate.
|
||||
|
||||
This MUST be async because process.wait() is a coroutine. The synchronous
|
||||
version was calling .wait() without await, creating an unawaited coroutine
|
||||
object — the old process was never actually waited on, so it could still
|
||||
hold GPU VRAM when the new server started.
|
||||
"""
|
||||
global _llama_server_process
|
||||
if _llama_server_process is None or _llama_server_process.returncode is not None:
|
||||
_close_stderr_log()
|
||||
return
|
||||
|
||||
if _llama_server_process and _llama_server_process.returncode is None:
|
||||
try:
|
||||
_llama_server_process.send_signal(signal_module.SIGTERM)
|
||||
try:
|
||||
await asyncio.wait_for(_llama_server_process.wait(), timeout=10)
|
||||
_llama_server_process.wait(timeout=5)
|
||||
except asyncio.TimeoutError:
|
||||
_llama_server_process.kill()
|
||||
try:
|
||||
await asyncio.wait_for(_llama_server_process.wait(), timeout=5)
|
||||
except asyncio.TimeoutError:
|
||||
pass
|
||||
except Exception:
|
||||
pass
|
||||
finally:
|
||||
_llama_server_process = None
|
||||
_close_stderr_log()
|
||||
|
||||
|
||||
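The docstring above explains why the kill routine has to be a coroutine: `asyncio.subprocess.Process.wait()` returns a coroutine, so calling it without `await` never waits for the child to exit. A minimal, self-contained sketch of the terminate-then-kill pattern follows. The function name `stop_process` and the sleeping Python child are assumptions for illustration; this is not the sidecar's actual code.

```python
import asyncio
import sys
from typing import Optional


async def stop_process(
    proc: asyncio.subprocess.Process,
    term_timeout: float = 10.0,
    kill_timeout: float = 5.0,
) -> Optional[int]:
    """Terminate `proc`, await its exit, and escalate to kill() on timeout."""
    if proc.returncode is not None:
        return proc.returncode  # already exited
    try:
        proc.terminate()  # SIGTERM on POSIX
    except ProcessLookupError:
        pass  # the process vanished between the check and the signal
    try:
        # wait() is a coroutine: it must be awaited to actually block.
        await asyncio.wait_for(proc.wait(), timeout=term_timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL on POSIX
        try:
            await asyncio.wait_for(proc.wait(), timeout=kill_timeout)
        except asyncio.TimeoutError:
            pass  # give up; returncode stays None
    return proc.returncode


async def demo_stop() -> Optional[int]:
    # Stand-in child: a Python interpreter that sleeps for 30 seconds.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)"
    )
    return await stop_process(proc, term_timeout=5.0)


if __name__ == "__main__":
    print("exit code:", asyncio.run(demo_stop()))
```

Only once the awaited `wait()` has returned is the child known to be gone, which is the point at which a replacement server can be started without competing for the same resources.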
def _flag_value(value) -> str:
|
||||
"""Convert a manifest flag value to a llama-server CLI argument string.
|
||||
|
||||
YAML booleans (True/False/on/off/yes/no) are parsed as Python bools by
|
||||
safe_load. llama-server expects 'on'/'off' for boolean flags, not 'True'/'False'.
|
||||
"""
|
||||
if isinstance(value, bool):
|
||||
return "on" if value else "off"
|
||||
return str(value)
|
||||
|
||||
|
||||
def _flag_key(key: str) -> str:
|
||||
"""Convert a manifest flag key to the correct llama-server CLI flag name.
|
||||
|
||||
llama-server uses hyphenated flag names (--ctx-size, --n-gpu-layers),
|
||||
but YAML keys often use underscores. Some flags were also renamed
|
||||
across llama.cpp versions (e.g. --n-ctx → --ctx-size).
|
||||
|
||||
This function normalises underscores to hyphens and applies known renames.
|
||||
"""
|
||||
normalized = key.replace("_", "-")
|
||||
FLAG_RENAMES = {
|
||||
"n-ctx": "ctx-size",
|
||||
}
|
||||
return FLAG_RENAMES.get(normalized, normalized)
|
||||
|
||||
|
||||
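Taken together, the two helpers above turn a manifest `flags` mapping into CLI arguments. The sketch below re-implements them in standalone form so the conversion can be tried in isolation. The function names without the leading underscore and the `build_args` wrapper are hypothetical, and whether a particular llama-server build accepts each resulting flag is not verified here.

```python
from typing import Dict, List

# Known rename taken from the diff above; extend as needed.
FLAG_RENAMES = {"n-ctx": "ctx-size"}


def flag_key(key: str) -> str:
    """Normalise underscores to hyphens, then apply known renames."""
    normalized = key.replace("_", "-")
    return FLAG_RENAMES.get(normalized, normalized)


def flag_value(value: object) -> str:
    """Render booleans as 'on'/'off'; everything else via str()."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "on" if value else "off"
    return str(value)


def build_args(flags: Dict[str, object]) -> List[str]:
    """Flatten a flags mapping into ['--name', 'value', ...]."""
    args: List[str] = []
    for key, value in flags.items():
        args += ["--" + flag_key(key), flag_value(value)]
    return args


if __name__ == "__main__":
    print(build_args({"n_ctx": 65536, "flash-attn": True, "n_gpu_layers": 999}))
    # → ['--ctx-size', '65536', '--flash-attn', 'on', '--n-gpu-layers', '999']
```

Without the boolean branch, `str(True)` would produce the literal `True` on the command line, which is the failure mode the `_flag_value` docstring describes.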
async def _start_llama_server(profile: dict):
|
||||
@ -116,39 +60,29 @@ async def _start_llama_server(profile: dict):
|
||||
global _llama_server_process
|
||||
|
||||
# Kill any existing process
|
||||
await _kill_llama_server()
|
||||
_kill_llama_server()
|
||||
|
||||
# Build command from profile flags
|
||||
cmd = ["/home/bigt/AI/llama.cpp/build/bin/llama-server"]
|
||||
cmd = ["llama-server"]
|
||||
cmd += ["--model", profile["model_path"]]
|
||||
cmd += ["--port", str(LLAMA_SERVER_PORT)]
|
||||
cmd += ["--host", "0.0.0.0"]
|
||||
for key, value in profile.get("flags", {}).items():
|
||||
cmd += ["--" + _flag_key(key), _flag_value(value)]
|
||||
cmd += ["--" + key, str(value)]
|
||||
|
||||
print(f"Starting llama-server: {' '.join(cmd)}", flush=True)
|
||||
|
||||
# Capture stderr so we can diagnose crashes (model not found, OOM, bad flag)
|
||||
stderr_fh = open(LLAMA_STDERR_LOG, "w")
|
||||
print(f"Starting llama-server: {' '.join(cmd)}")
|
||||
_llama_server_process = await asyncio.create_subprocess_exec(
|
||||
*cmd,
|
||||
stdout=asyncio.subprocess.DEVNULL,
|
||||
stderr=stderr_fh,
|
||||
stderr=asyncio.subprocess.DEVNULL,
|
||||
)
|
||||
# Keep a reference so we can close the handle later
|
||||
_llama_server_process._stderr_fh = stderr_fh # type: ignore[attr-defined]
|
||||
return _llama_server_process
|
||||
|
||||
|
||||
async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5):
|
||||
"""Poll llama-server readiness via /v1/models endpoint.
|
||||
|
||||
Returns True on success. On failure, dumps the captured stderr (if any)
|
||||
so the user can see why llama-server crashed.
|
||||
"""
|
||||
async def _poll_llama_server_ready(max_retries: int = 240, interval: float = 0.5):
|
||||
"""Poll llama-server readiness via /v1/models endpoint."""
|
||||
import httpx
|
||||
|
||||
for attempt in range(max_retries):
|
||||
for _ in range(max_retries):
|
||||
try:
|
||||
async with httpx.AsyncClient(timeout=2.0) as client:
|
||||
resp = await client.get(f"http://localhost:{LLAMA_SERVER_PORT}/v1/models")
|
||||
@ -157,27 +91,6 @@ async def _poll_llama_server_ready(max_retries: int = 60, interval: float = 0.5)
|
||||
except Exception:
|
||||
pass
|
||||
await asyncio.sleep(interval)
|
||||
|
||||
# Flush and close the stderr handle so all data is on disk before we read
|
||||
_close_stderr_log()
|
||||
|
||||
# ── Dump stderr for diagnosis ──────────────────────────────────────
|
||||
print("llama-server did NOT become ready — dumping stderr:", flush=True)
|
||||
try:
|
||||
with open(LLAMA_STDERR_LOG) as f:
|
||||
for line in f:
|
||||
print(f" {line.rstrip()}", flush=True)
|
||||
except FileNotFoundError:
|
||||
print(" (stderr log not found — process may not have started)", flush=True)
|
||||
|
||||
# Also log exit code if the process died
|
||||
global _llama_server_process
|
||||
if _llama_server_process and _llama_server_process.returncode is not None:
|
||||
print(
|
||||
f"llama-server exited with code {_llama_server_process.returncode}",
|
||||
flush=True,
|
||||
)
|
||||
|
||||
return False
|
||||
|
||||
|
||||
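The readiness poll above is a bounded retry loop: its worst-case wait is roughly `max_retries × interval`, so 60 × 0.5 s is about 30 s and 240 × 0.5 s is about 120 s (plus the time spent in each probe). A minimal sketch of the same loop with the HTTP call factored out into a hypothetical `probe` callable, so it can be exercised without a running llama-server:

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_ready(
    probe: Callable[[], Awaitable[bool]],
    max_retries: int = 60,
    interval: float = 0.5,
) -> bool:
    """Call `probe` until it returns True or the retries are exhausted."""
    for _ in range(max_retries):
        try:
            if await probe():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        await asyncio.sleep(interval)
    return False


async def demo_poll() -> bool:
    attempts = {"n": 0}

    async def fake_probe() -> bool:
        # Stand-in for the GET on /v1/models: fails twice, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("server not listening yet")
        return True

    return await poll_until_ready(fake_probe, max_retries=5, interval=0.01)


if __name__ == "__main__":
    print("ready:", asyncio.run(demo_poll()))
```

Keeping the probe separate from the loop also makes it straightforward to add an early exit, for example when the child process has already terminated.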
@ -211,7 +124,7 @@ async def switch_model(payload: SwitchRequest):
|
||||
"""Stop current llama-server, start new one with the given profile, wait for readiness."""
|
||||
global _active_profile
|
||||
|
||||
async with _switch_lock:
|
||||
with _switch_lock:
|
||||
# Validate profile_id
|
||||
profiles = load_manifest(MANIFEST_PATH)
|
||||
if profiles is None:
|
||||
@ -240,7 +153,7 @@ async def switch_model(payload: SwitchRequest):
|
||||
}
|
||||
|
||||
# Start the new model
|
||||
await _kill_llama_server()
|
||||
_kill_llama_server()
|
||||
_active_profile = None
|
||||
await _start_llama_server(profile)
|
||||
|
||||
|
||||